Different things I tried to solve – SQL Error: I/O Error: Connection reset

Problem definition :

We have been seeing these Exception on the production machines – when a SQL query comes from the AppServer ( JBoss ) and does the query using one of the connections in the Connection pool.

SQL Error: I/O Error: Connection reset

java.sql.SQLException: I/O Error: Connection reset

at net.sourceforge.jtds.jdbc.TdsCore.executeSQL(TdsCore.java:1053)

at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:465)

at net.sourceforge.jtds.jdbc.JtdsStatement.executeQuery(JtdsStatement.java:1304)

at TestSQLQuery.process(TestSQLQuery.java:46)

at TestSQLQuery.main(TestSQLQuery.java:27)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketInputStream.read(SocketInputStream.java:168)

at java.io.DataInputStream.readFully(DataInputStream.java:178)

at java.io.DataInputStream.readFully(DataInputStream.java:152)

at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(SharedSocket.java:846)

at net.sourceforge.jtds.jdbc.SharedSocket.getNetPacket(SharedSocket.java:727)

at net.sourceforge.jtds.jdbc.ResponseStream.getPacket(ResponseStream.java:466)

at net.sourceforge.jtds.jdbc.ResponseStream.read(ResponseStream.java:103)

at net.sourceforge.jtds.jdbc.ResponseStream.peek(ResponseStream.java:88)

at net.sourceforge.jtds.jdbc.TdsCore.wait(TdsCore.java:3932)

at net.sourceforge.jtds.jdbc.TdsCore.executeSQL(TdsCore.java:1046)

Setup under which this is happening

  • JBoss AppServer 4.2.2 running RedHat Linux 5.5
  • SQLServer JDBC Driver 1.2.2 – jtds-1.2.2.jar
  • Java 1.6
  • SQLServer 2005(9.00.4035.00      SP3         Standard Edition (64-bit))   running on – Windows Server 2008 R2

These sort of problems always excite me – and I itch to get jumping on these sort of problems. This helps me to get away from some of the routine work I do and also gives me an opportunity to learn more and improve my troubleshooting skills.

So with the permission of my manager – I jumped on this problem. Here is the story in more details.

I created a standalone Java Program to make sure that the problem is can be recreated from anywhere without much setup.

I added JDBC logging to my Standalone Java program by adding the following lines to my code – to get more details of the error – the JDBC Log file shows the following exception immediately after the Query packet information.

Class driverClass = Class.forName(driverClassName);

java.sql.Driver driver = (java.sql.Driver) driverClass.newInstance();

FileOutputStream fs = new FileOutputStream(“DriverManagerConnectionLogging.log”);

DriverManager.setLogWriter(new PrintWriter(“DriverManagerConnectionLogging.log”));

Exception shown in the JDBC Log File

TdsSocket: Max buffer memory used = 0KB

java.sql.SQLException: I/O Error: Connection reset

at net.sourceforge.jtds.jdbc.TdsCore.executeSQL(TdsCore.java:1053)

at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:465)

at net.sourceforge.jtds.jdbc.JtdsStatement.executeQuery(JtdsStatement.java:1304)

at TestSQLQuery.process(TestSQLQuery.java:46)

at TestSQLQuery.main(TestSQLQuery.java:27)

SQLException: SQLState(08S01)

The weird part of this was – the error never happened when the Standalone Java program was running from a windows box still connecting to the same SQLServer machine (remote) to the same DB. However EVERY time I ran this from the Linux box – the error happened without fail.

We were at a loss to explain why this was happening.

My technique to solve such kind of problems is elimination and isolation. I try to eliminate possibilities that could cause these errors and also try to isolate the areas where these errors could happen.

On further investigation I found that the SQLServer machine had 2 NIC Cards, but the Linux Machine had only 1.

Lets go back 6 months when we had a similar problem :

About 6 months back I had solved a similar case of Connection Resets from the Socket on Linux boxes – after I had discovered that the Linux box had 2 NIC cards and 1 of the NIC Cards was showing the following – see image below – which shows the amount of packets being dropped from one of the interfaces on the Linux box.

One the 2nd NIC Card was removed from the Linux box and rebooted – we did not see the same error again until now when the same error started happening again.

Here are the steps we tried to resolve the problem this time – The outcome of the steps is given at the end of this post.

  1. Upgraded to the latest JDBC Driver for SQLServer for Java jtds 1.2.5 – no change happened with that – same exception kept on happening
  2. Did a ping from both the Linux box to the SQLServer Box and vice versa and both returned the ICMP packets in < 1 ms – so this was not the problem.
  3. I did the following command to make sure – that Linux was not running out of FileHandles

“netstat -a -l –tcp|wc -l”
“lsof|grep TCP|wc -l”
Both of them returned numbers which seemed to be fine.

  1. So my first guess was to remove the 2nd NIC Card on the SQLServer machine and reboot – cold start the machine and try my standalone program.  That did not solve the problem
  2. Tried a new JDBC Driver for SQLServer – from jtds – jtds-1.2.5.jar – that did not solve the problem too.
  3. Made sure that the Java code closes the connection is closed Query is done and the statements are closed. The code was already doing this – so this was not the cause.
  4. Looked at SQLServer log from Management Studio for errors / issues – found none related to Socket Exception
  5. Looked at Windows Event Viewer for error conditions – found none related to Sockets/Network
  6. Looked at SQLServer Surface Configurations settings – and the port numbers were fine and it was set to accept TCP/IP Only – so this seemed fine as well.
  7. Changed the connection string from the Java program to use the IP Address of the SQLServer box and not the MachineName – this had no effect too.
  8. Recently we had upgraded our Linux Boxes from RHEL 5.3 to RHEL 5.4. I got hold of an older Linux box and ran my standalone java program to the see if the problem happened there too – and to my digust – it still happened there.

So now I was trying to be imaginative and trying to think up ways to solve it. All the above 10 points resulted in negative and I had come up with those points in less than 10 min.

Now the real challenge began – I realized this was not an easy problem to solve.

  1. I turned off the firewall on both the SQLServer and Linux boxes – tried my same java program – nothing happened – same exception
  2. Turned off all – non essential services running on SQLServer and tried again – same exception
  3. Moved the DB from 1 SQLServer machine to another – same thing happened
  4. Tried my Standalone Java program from a Different Linux box – no change – same exception
  5. Another engineer on my team wrote a Python script with the same query – to make sure – it is nothing related to the JDBC Driver – and the same error happened when running the Python script as well

I now reverted back to the original Linux box and original SQLServer machine where this was happening

  1. I ran the same standalone Java program from the Linux and ran SQLServer Profiler on the SQLServer box to see what it was doing. I saw it was doing fine – ( see picture below ). I could see the SQL query came in – it executed ( too about 45 sec ) and then the batch completed.

  1. I started running the protocol Analyzer on SQLServer machine and I see this ( see picture below ) – I see SQLServer sending a Reset request

 

FINALLY – an idea dawned – and I decided let me do a cold start of the Linux box.

I did that and once the machine was up and running – I ran the same standalone Java program and it ran perfectly fine. I could not believe my eyes. I ran it 10 times in a row and it worked each time.

I was rejoicing.

BUT here is the bad part – Next day I came to work and tried the same query again and it failed.

Uh! – Something happened between the reboot and the next day – which has resulted in some stack corruption in the Network Card / Network itself that has caused it to fail again.

None of the above tests have helped me to isolate the error to either of the machines.

My next thought was to hook up ethereal / wireshark on the Linux and SQLServer boxes and monitor the traffic and see if that helps.

However before that – I finally decided 1 last thing I needed to verify and that was the OS of the Machine on which SQLServer was installed and running.

– The OS of the machine for SQLServer 2005 was – Windows 2008 Server R2 : This was pleasantly surprising to me and I was happy I discovered another line of thought which I could chase.

So I looked at all other production machines on which SQLServer was installed and found to my joy that all of them were – Windows 2003 Server.

Someone had become a cowboy and installed Windows 2008 R2 and without any testing deployed it on to production.

I immediately got hold of another machine with was not Windows 2008 Server R2 machine, Installed the same version of SQLServer 2005 we had on production, copied the tables from the production SQLServer DB to this machine and ran the same query against this machine and lo it all worked.

Yes, indeed after a lot of thought, excitement and some disappointments, I was successful in getting rid of this weird SQLException from the production DBs.

Advertisements

11 responses to “Different things I tried to solve – SQL Error: I/O Error: Connection reset

  1. I’m confused. What did you do to fix the problem? Rollback Windows 2008?

  2. sorry can i ask some question, i’m still newbie..
    I’ve got same problem with you.
    Did u solve this problem by change the OS, from Windows 2008 Server R2 to Windows 2003 Server R2 ?

    But in my case, i’m already using Windows 2003 Server R2 OS, but the Connection reset still appear when i run java application in linux.

    Thx 4 ur response

    • Try to see if there are 2 NIC cards on the Linux box.
      That has caused problems for me before when interacting with the SQLServer through a Java program.
      What is the JDBC driver you are using.
      See if you get a different version of the JDBC driver to work with without getting the connection errors

  3. This was an excellent explanation. Thank you!

  4. Pingback: sql wildcard,sql wildcards,sql rollback,rollback sql,sql copy table,sql sum,sql mirroring,sum sql,sql cluster,sql server performance,truncate in sql,backup sql,backup sql database,backup sql server,sql performance,date functions in sql,sql over,truncate s

  5. We had this issue recently – it is related to the tcp/ip layer and the packet timestamps between linux and windows.

    see this link to solve the issue:
    http://mohammednv.wordpress.com/2009/05/27/disable-tcp-timestamping/

  6. In some cases this might be caused by ‘optimized mode’ of the network switch.

    Using MSSQL profiler you can see smth like this:

    2012-03-30 10:40:25.16 Logon Error: 17836, Severity: 20, State: 16.
    2012-03-30 10:40:25.16 Logon Length specified in network packet payload did not match number of bytes read; the connection has been closed. Please contact the vendor of the client library. [CLIENT: ]

    Switch off or ask your network administrators to switch off any switcher optimization mode and this will work.

  7. We had a similar problem when some of the queries fired from the web box to the db box started intermittently failing. In our case we tracked it down to header checksum errors causing a tcp reset so we overcame it by setting checksum offload on the NIC to disabled.

  8. We had a similar problem when some of the queries fired from the web box to the db box started intermittently failing. In our case we tracked it down to header checksum errors causing a tcp reset so we overcame it by setting checksum offload on the NIC to disabled.

  9. I am also facing the same issue.Can you please tell me the exact resolution?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s