Problem definition:
We have been seeing the following exception on our production machines when a SQL query comes from the AppServer (JBoss) and runs on one of the connections in the connection pool.
SQL Error: I/O Error: Connection reset
java.sql.SQLException: I/O Error: Connection reset
Caused by: java.net.SocketException: Connection reset
Setup under which this is happening:
- JBoss AppServer 4.2.2 running on RedHat Linux 5.5
- SQLServer JDBC Driver 1.2.2 (jtds-1.2.2.jar)
- Java 1.6
- SQLServer 2005 (9.00.4035.00 SP3 Standard Edition (64-bit)) running on Windows Server 2008 R2
Problems like this always excite me, and I itch to jump on them. They get me away from some of my routine work and give me an opportunity to learn more and improve my troubleshooting skills.
So, with my manager's permission, I jumped on this one. Here is the story in more detail.
I created a standalone Java program to make sure the problem could be recreated from anywhere without much setup.
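For readers who want to try something similar, here is a minimal sketch of what such a standalone reproduction might look like. It is not the original program: the class name, the command-line arguments and the default port 1433 are assumptions made for illustration. It uses the jTDS driver class and URL scheme, needs the jTDS jar on the classpath to actually connect, and prints how long the query ran before it succeeded or failed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical reproduction harness; host, database, credentials and query
// are supplied on the command line and are not taken from the original post.
final class ConnectionResetRepro {

    // Builds a jTDS URL of the form jdbc:jtds:sqlserver://<host>:<port>/<database>
    static String buildUrl(String host, int port, String database) {
        return "jdbc:jtds:sqlserver://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 5) {
            System.err.println("usage: ConnectionResetRepro <host> <database> <user> <password> <query>");
            return;
        }
        // Requires the jTDS jar on the classpath
        Class.forName("net.sourceforge.jtds.jdbc.Driver");
        String url = buildUrl(args[0], 1433, args[1]); // 1433 is an assumed port

        long start = System.currentTimeMillis();
        Connection con = null;
        Statement st = null;
        ResultSet rs = null;
        try {
            con = DriverManager.getConnection(url, args[2], args[3]);
            st = con.createStatement();
            rs = st.executeQuery(args[4]);
            int rows = 0;
            while (rs.next()) {
                rows++;
            }
            System.out.println("OK: " + rows + " rows in "
                    + (System.currentTimeMillis() - start) + " ms");
        } catch (SQLException e) {
            // Report how long the query ran before the failure surfaced
            System.out.println("FAILED after "
                    + (System.currentTimeMillis() - start) + " ms: " + e.getMessage());
        } finally {
            // Java 6 style cleanup: close result set, statement, connection in that order
            if (rs != null) try { rs.close(); } catch (SQLException ignored) { }
            if (st != null) try { st.close(); } catch (SQLException ignored) { }
            if (con != null) try { con.close(); } catch (SQLException ignored) { }
        }
    }
}
```

Running the same class from a Windows box and from a Linux box against the same server gives a like-for-like comparison without involving JBoss or the connection pool.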
To get more details of the error, I added JDBC logging to my standalone Java program with the following lines. The JDBC log file shows the exception below immediately after the query packet information.
// Load and instantiate the JDBC driver class by name
Class<?> driverClass = Class.forName(driverClassName);
java.sql.Driver driver = (java.sql.Driver) driverClass.newInstance();
// File that receives the JDBC log output
FileOutputStream fs = new FileOutputStream("DriverManagerConnectionLogging.log");
Exception shown in the JDBC Log File
TdsSocket: Max buffer memory used = 0KB
java.sql.SQLException: I/O Error: Connection reset
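The logging lines above stop at opening the log file. A fuller sketch of how JDBC logging is commonly wired up is shown here, assuming the driver writes its trace through the log writer of java.sql.DriverManager (the jTDS-style lines in the log suggest it does, but treat that as an assumption). The class name JdbcLogging and the method enable are made up for this example.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.DriverManager;

// Hypothetical helper, not the original code from the post.
final class JdbcLogging {

    // Routes java.sql.DriverManager log output to the given file and
    // returns the writer so the caller can close it when done.
    static PrintWriter enable(String logFile) throws IOException {
        PrintWriter writer = new PrintWriter(new FileOutputStream(logFile), true);
        DriverManager.setLogWriter(writer);
        return writer;
    }

    public static void main(String[] args) throws IOException {
        PrintWriter writer = enable("DriverManagerConnectionLogging.log");
        DriverManager.println("JDBC logging enabled");
        // ... open the connection and run the failing query here ...
        DriverManager.setLogWriter(null); // stop logging before closing the file
        writer.close();
    }
}
```

Enabling the log before the first connection is opened means the connect handshake ends up in the file as well as the query packets.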
The weird part was that the error never happened when the standalone Java program ran from a Windows box, still connecting to the same remote SQLServer machine and the same DB. However, EVERY time I ran it from the Linux box, the error happened without fail.
We were at a loss to explain why this was happening.
My technique for solving this kind of problem is elimination and isolation: I try to eliminate possible causes and to isolate the areas where the error could originate.
On further investigation I found that the SQLServer machine had 2 NIC Cards, but the Linux Machine had only 1.
Let's go back 6 months, when we had a similar problem:
About 6 months back I had solved a similar case of connection resets from the socket on Linux boxes. I had discovered that the Linux box had 2 NIC cards, and one of them was dropping packets (see image below, which shows the number of packets dropped on one of the interfaces on the Linux box).
Once the 2nd NIC card was removed from the Linux box and the box rebooted, we did not see the error again, until now.
Here are the steps we tried to resolve the problem this time. The outcome of the steps is given at the end of this post.
- Upgraded to the latest jTDS JDBC driver for SQLServer, jtds 1.2.5. No change: the same exception kept happening.
- Pinged the SQLServer box from the Linux box and vice versa. Both returned the ICMP packets in < 1 ms, so this was not the problem.
- I ran the following commands to make sure that Linux was not running out of file handles:
"netstat -a -l --tcp | wc -l"
"lsof | grep TCP | wc -l"
Both of them returned numbers which seemed to be fine.
- My first guess was to remove the 2nd NIC card from the SQLServer machine, cold start the machine and try my standalone program. That did not solve the problem.
- Tried a new JDBC driver for SQLServer from jtds, jtds-1.2.5.jar. That did not solve the problem either.
- Made sure that the Java code closes the statements and the connection once the query is done. The code was already doing this, so this was not the cause.
- Looked at the SQLServer log from Management Studio for errors / issues. Found none related to the socket exception.
- Looked at the Windows Event Viewer for error conditions. Found none related to sockets / network.
- Looked at the SQLServer Surface Area Configuration settings. The port numbers were fine and it was set to accept TCP/IP only, so this seemed fine as well.
- Changed the connection string in the Java program to use the IP address of the SQLServer box instead of the machine name. This had no effect either.
- We had recently upgraded our Linux boxes from RHEL 5.3 to RHEL 5.4. I got hold of an older Linux box and ran my standalone Java program to see if the problem happened there too, and to my disgust it still happened there.
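One caveat on the ping step in the list above: ICMP replies only show that the hosts can reach each other, not that a TCP connection to the database port can be opened. A small probe like the following sketch times a plain TCP connect instead. The class name TcpProbe, the 3 second timeout and the default port 1433 are assumptions for illustration, not values from the original investigation.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: measures how long a TCP connect to host:port takes.
final class TcpProbe {

    // Returns the connect time in milliseconds, or -1 if the connect
    // failed or did not complete within timeoutMs.
    static long connectMillis(String host, int port, int timeoutMs) {
        long start = System.nanoTime();
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return (System.nanoTime() - start) / 1000000L;
        } catch (IOException e) {
            return -1;
        } finally {
            try { socket.close(); } catch (IOException ignored) { }
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 1433; // assumed default
        long ms = connectMillis(host, port, 3000);
        System.out.println(ms < 0 ? "connect failed" : "connected in " + ms + " ms");
    }
}
```

A successful connect only rules out the handshake; in this case the failure showed up much later, while the query was running, so the probe narrows the search rather than ending it.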
So now I had to get imaginative and think up other ways to solve it. All of the points above came back negative, and I had come up with them in less than 10 min.
Now the real challenge began – I realized this was not an easy problem to solve.
- I turned off the firewall on both the SQLServer and Linux boxes and tried my same Java program. No change, same exception.
- Turned off all non-essential services running on the SQLServer box and tried again. Same exception.
- Moved the DB from one SQLServer machine to another. Same thing happened.
- Tried my Standalone Java program from a Different Linux box – no change – same exception
- Another engineer on my team wrote a Python script with the same query – to make sure – it is nothing related to the JDBC Driver – and the same error happened when running the Python script as well
I now went back to the original Linux box and the original SQLServer machine where this was happening.
- I ran the same standalone Java program from the Linux box and ran SQLServer Profiler on the SQLServer box to see what it was doing. It looked fine (see picture below): I could see the SQL query come in, execute (it took about 45 sec) and then the batch complete.
- I started running the protocol analyzer on the SQLServer machine and saw this (see picture below): SQLServer sending a Reset (RST) packet.
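To connect the protocol analyzer trace to the Java stack trace: when the peer aborts a TCP connection with a reset, a blocked read on the client side typically fails with a java.net.SocketException, which is the root cause shown in the exception at the top of this post. The following sketch provokes that locally on the loopback interface. It does not reproduce the SQLServer problem; the class name ResetDemo and the use of SO_LINGER with a zero timeout to force a reset are choices made for this demonstration.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical local demonstration of how a TCP reset surfaces in Java.
final class ResetDemo {

    // Returns the exception the client sees when the peer aborts the
    // connection, or null if the read ended without an exception.
    static IOException provokeReset() throws Exception {
        final ServerSocket server = new ServerSocket(0); // any free port
        Thread peer = new Thread(new Runnable() {
            public void run() {
                try {
                    Socket accepted = server.accept();
                    // SO_LINGER with timeout 0 makes close() abort the
                    // connection with a reset instead of a normal FIN
                    accepted.setSoLinger(true, 0);
                    accepted.close();
                } catch (IOException ignored) { }
            }
        });
        peer.start();
        Socket client = new Socket("127.0.0.1", server.getLocalPort());
        try {
            peer.join(); // wait until the peer has aborted the connection
            client.getInputStream().read(); // the reset surfaces here
            return null;
        } catch (IOException e) {
            return e;
        } finally {
            client.close();
            server.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("client saw: " + provokeReset());
    }
}
```

The point of the demo is only that the exception is raised on the side that receives the reset, so the Java client reporting it says nothing yet about which machine is at fault.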
FINALLY an idea dawned, and I decided to do a cold start of the Linux box.
I did that and once the machine was up and running – I ran the same standalone Java program and it ran perfectly fine. I could not believe my eyes. I ran it 10 times in a row and it worked each time.
I was rejoicing.
BUT here is the bad part – Next day I came to work and tried the same query again and it failed.
Uh! Something happened between the reboot and the next day, perhaps some corruption in the network stack or the network card itself, that caused it to fail again.
None of the above tests have helped me to isolate the error to either of the machines.
My next thought was to hook up Ethereal / Wireshark on the Linux and SQLServer boxes, monitor the traffic and see if that helped.
However, before that, I decided there was one last thing I needed to verify: the OS of the machine on which SQLServer was installed and running.
The OS of the machine running SQLServer 2005 was Windows Server 2008 R2. This was a pleasant surprise, and I was happy to have discovered another line of thought to chase.
So I looked at all the other production machines on which SQLServer was installed and found, to my joy, that all of them were running Windows Server 2003.
Someone had played cowboy, installed Windows Server 2008 R2 and deployed it to production without any testing.
I immediately got hold of another machine which was not running Windows Server 2008 R2, installed the same version of SQLServer 2005 we had on production, copied the tables from the production SQLServer DB to this machine, ran the same query against it and, lo, it all worked.
Yes, indeed after a lot of thought, excitement and some disappointments, I was successful in getting rid of this weird SQLException from the production DBs.