Category Archives: Java and Java EE

Java-Based Thoughts, Findings and Notes

Different things I tried to solve – SQL Error: I/O Error: Connection reset

Problem definition :

We had been seeing this exception on the production machines – whenever a SQL query came from the AppServer (JBoss) and executed using one of the connections in the connection pool.

SQL Error: I/O Error: Connection reset

java.sql.SQLException: I/O Error: Connection reset

at net.sourceforge.jtds.jdbc.TdsCore.executeSQL(TdsCore.java:1053)

at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:465)

at net.sourceforge.jtds.jdbc.JtdsStatement.executeQuery(JtdsStatement.java:1304)

at TestSQLQuery.process(TestSQLQuery.java:46)

at TestSQLQuery.main(TestSQLQuery.java:27)

Caused by: java.net.SocketException: Connection reset

at java.net.SocketInputStream.read(SocketInputStream.java:168)

at java.io.DataInputStream.readFully(DataInputStream.java:178)

at java.io.DataInputStream.readFully(DataInputStream.java:152)

at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(SharedSocket.java:846)

at net.sourceforge.jtds.jdbc.SharedSocket.getNetPacket(SharedSocket.java:727)

at net.sourceforge.jtds.jdbc.ResponseStream.getPacket(ResponseStream.java:466)

at net.sourceforge.jtds.jdbc.ResponseStream.read(ResponseStream.java:103)

at net.sourceforge.jtds.jdbc.ResponseStream.peek(ResponseStream.java:88)

at net.sourceforge.jtds.jdbc.TdsCore.wait(TdsCore.java:3932)

at net.sourceforge.jtds.jdbc.TdsCore.executeSQL(TdsCore.java:1046)

Setup under which this is happening

  • JBoss AppServer 4.2.2 running on RedHat Linux 5.5
  • SQLServer JDBC driver 1.2.2 – jtds-1.2.2.jar
  • Java 1.6
  • SQLServer 2005 (9.00.4035.00, SP3, Standard Edition (64-bit)) running on Windows Server 2008 R2

These sorts of problems always excite me – I itch to jump on them. They help me get away from some of the routine work I do, and they give me an opportunity to learn more and improve my troubleshooting skills.

So, with the permission of my manager, I jumped on this problem. Here is the story in more detail.

I created a standalone Java program to make sure that the problem could be recreated from anywhere without much setup.
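The reproducer can be sketched as follows – a minimal standalone checker, not the original TestSQLQuery code; the host, port, database name, and credentials are placeholders, and it assumes the jtds jar is on the classpath:

```java
// Minimal standalone reproducer sketch (hypothetical names, not the
// original TestSQLQuery). Run it repeatedly from the affected box.
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ConnectionCheck {

    // Builds the jTDS connection URL; separated out so it can be
    // checked without a live database.
    static String jtdsUrl(String host, int port, String db) {
        return "jdbc:jtds:sqlserver://" + host + ":" + port + "/" + db;
    }

    public static void main(String[] args) throws Exception {
        Class.forName("net.sourceforge.jtds.jdbc.Driver");
        // Log driver activity to a file, as described above
        DriverManager.setLogWriter(new PrintWriter(
                new FileOutputStream("DriverManagerConnectionLogging.log"), true));
        String url = jtdsUrl("sqlserver-host", 1433, "mydb"); // placeholder values
        try {
            Connection con = DriverManager.getConnection(url, "user", "password");
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT 1");
            while (rs.next()) {
                System.out.println("Query OK: " + rs.getInt(1));
            }
            rs.close();
            st.close();
            con.close();
        } catch (SQLException e) {
            // A reset connection surfaces here with SQLState 08S01
            System.err.println("SQLState = " + e.getSQLState());
            e.printStackTrace();
        }
    }
}
```

A reset connection shows up in the catch block with SQLState 08S01, matching the log output below.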

To get more details of the error, I added JDBC logging to my standalone Java program with the following lines. The JDBC log file then showed the exception below immediately after the query packet information.

Class driverClass = Class.forName(driverClassName);
java.sql.Driver driver = (java.sql.Driver) driverClass.newInstance();
FileOutputStream fs = new FileOutputStream("DriverManagerConnectionLogging.log");
DriverManager.setLogWriter(new PrintWriter(fs, true)); // autoflush, and reuse the stream instead of opening the file twice

Exception shown in the JDBC Log File

TdsSocket: Max buffer memory used = 0KB

java.sql.SQLException: I/O Error: Connection reset

at net.sourceforge.jtds.jdbc.TdsCore.executeSQL(TdsCore.java:1053)

at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:465)

at net.sourceforge.jtds.jdbc.JtdsStatement.executeQuery(JtdsStatement.java:1304)

at TestSQLQuery.process(TestSQLQuery.java:46)

at TestSQLQuery.main(TestSQLQuery.java:27)

SQLException: SQLState(08S01)

The weird part was that the error never happened when the standalone Java program was run from a Windows box connecting to the same (remote) SQLServer machine and the same DB. However, EVERY time I ran it from the Linux box, the error happened without fail.

We were at a loss to explain why this was happening.

My technique to solve such kind of problems is elimination and isolation. I try to eliminate possibilities that could cause these errors and also try to isolate the areas where these errors could happen.

On further investigation I found that the SQLServer machine had 2 NIC Cards, but the Linux Machine had only 1.

Let's go back 6 months, to when we had a similar problem:

About 6 months back I had solved a similar case of connection resets from the socket on Linux boxes, after discovering that the Linux box had 2 NIC cards and one of them was dropping packets – see the image below, which shows the number of packets being dropped on one of the interfaces of the Linux box.

Once the 2nd NIC card was removed from the Linux box and the machine rebooted, we did not see the error again – until now, when the same error started happening again.

Here are the steps we tried to resolve the problem this time – The outcome of the steps is given at the end of this post.

  1. Upgraded to the latest jtds JDBC driver for SQLServer (jtds 1.2.5) – no change; the same exception kept happening.
  2. Pinged from the Linux box to the SQLServer box and vice versa – both returned the ICMP packets in < 1 ms, so this was not the problem.
  3. Ran the following commands to make sure that Linux was not running out of file handles:

"netstat -a -l --tcp | wc -l"
"lsof | grep TCP | wc -l"

Both returned numbers that seemed fine.

  4. So my first guess was to remove the 2nd NIC card on the SQLServer machine and reboot – cold start the machine – and try my standalone program. That did not solve the problem.
  5. Tried a newer JDBC driver for SQLServer from jtds – jtds-1.2.5.jar – that did not solve the problem either.
  6. Made sure the Java code closes the connection once the query is done and that the statements are closed. The code was already doing this, so this was not the cause.
  7. Looked at the SQLServer log from Management Studio for errors/issues – found none related to the socket exception.
  8. Looked at the Windows Event Viewer for error conditions – found none related to sockets/network.
  9. Looked at the SQLServer Surface Area Configuration settings – the port numbers were fine and it was set to accept TCP/IP only, so this seemed fine as well.
  10. Changed the connection string in the Java program to use the IP address of the SQLServer box instead of the machine name – this had no effect either.
  11. Recently we had upgraded our Linux boxes from RHEL 5.3 to RHEL 5.4. I got hold of an older Linux box and ran my standalone Java program to see if the problem happened there too – and to my disgust, it still happened there.

So now I was trying to be imaginative, thinking up new ways to solve it. All of the above points had come up negative, and I had come up with them in less than 10 minutes.

Now the real challenge began – I realized this was not an easy problem to solve.

  1. I turned off the firewall on both the SQLServer and Linux boxes and tried the same Java program – nothing changed – same exception.
  2. Turned off all non-essential services running on SQLServer and tried again – same exception.
  3. Moved the DB from one SQLServer machine to another – same thing happened.
  4. Tried my standalone Java program from a different Linux box – no change – same exception.
  5. Another engineer on my team wrote a Python script with the same query, to make sure it was nothing related to the JDBC driver – and the same error happened when running the Python script as well.

I now went back to the original Linux box and the original SQLServer machine where this was happening.

  1. I ran the same standalone Java program from the Linux box and ran SQLServer Profiler on the SQLServer box to see what it was doing. Everything looked fine (see picture below) – I could see the SQL query come in, execute (it took about 45 sec), and the batch complete.

  2. I then ran the protocol analyzer on the SQLServer machine and saw this (see picture below) – SQLServer sending a Reset request.

 

FINALLY – an idea dawned – and I decided to do a cold start of the Linux box.

I did that and once the machine was up and running – I ran the same standalone Java program and it ran perfectly fine. I could not believe my eyes. I ran it 10 times in a row and it worked each time.

I was rejoicing.

BUT here is the bad part – the next day I came to work, tried the same query again, and it failed.

Uh! Something had happened between the reboot and the next day that corrupted something in the network card or network stack and caused it to fail again.

None of the above tests have helped me to isolate the error to either of the machines.

My next thought was to hook up ethereal / wireshark on the Linux and SQLServer boxes and monitor the traffic and see if that helps.

However, before that, there was one last thing I decided to verify: the OS of the machine on which SQLServer was installed and running.

– The OS of the machine running SQLServer 2005 was Windows Server 2008 R2. This was pleasantly surprising, and I was happy to have discovered another line of thought to chase.

So I looked at all the other production machines on which SQLServer was installed and found, to my joy, that all of them were running Windows Server 2003.

Someone had turned cowboy, installed Windows 2008 R2, and deployed it to production without any testing.

I immediately got hold of another machine that was not running Windows Server 2008 R2, installed the same version of SQLServer 2005 we had on production, copied the tables from the production SQLServer DB to this machine, and ran the same query against it – and lo and behold, it all worked.

Yes, indeed after a lot of thought, excitement and some disappointments, I was successful in getting rid of this weird SQLException from the production DBs.

How we reduced ETL Build Time for our Data Warehouse in SQLServer by Using BCP and Java Based String Concatenation rather than SQLServer’s String Concat mechanism

We had an ETL build step that used to take 75-80% of the build time. This step heavily used SQLServer's string concatenation and often took about 30-40 hours. It has been widely documented that SQLServer's string concatenation is extremely slow, and we had also observed that while SQLServer is doing string concatenation, the CPU usage on the machine goes up and stays there for the entire duration.

The article below describes how I sped up the above step by getting rid of SQLServer string concatenation entirely and using an alternate technique. After the steps below were implemented, the ETL step took about 1.5-2.0 hours. What an improvement! Thanks to the raw speed of SQLServer's BCP (bulk copy program) and of Java.

Let me lay the background of what we were trying to do before I describe the steps.

Assume we had a fact table that looked like the one below – everything in Dimension1, Dimension2, and Dimension3 is shown as a foreign key ID into the dimension tables.

Dimension1 Dimension2 Dimension3 Product-Market Hierarchy Time Measure1 Measure2
345 435 823 A Jan-2009 145 675
345 435 823 B Jan-2009 329 719
345 435 823 C Jan-2009 146 782
345 435 823 A Mar-2009 435 671
350 432 829 B Mar-2009 1232 5540
350 432 829 A Apr-2009 435 671
350 432 829 C May-2009 567 987
…
355 435 823 A Jan-2009 145 675
355 435 823 B Jan-2009 329 719
355 435 823 A Feb-2009 146 782
365 442 823 A Mar-2009 435 671
365 442 829 C Mar-2009 1232 5540
370 442 829 C Apr-2009 435 671
370 442 829 B Feb-2009 567 987
370 442 829 A Apr-2009 45 25

Table : 1

We needed to create an Aggregate Table as shown below for doing some reporting in our BI Analytic Tool.

Also in our domain – Dimension 1 had a cardinality of about 2 million.

Dim1 Dim2 Dim3 Measure1-Str Measure2-Str
345 435 823 A,Jan-2009,145; B,Jan-2009,329; C,Jan-2009,146; A,Mar-2009,435; A,Jan-2009,675; B,Jan-2009,719; C,Jan-2009,782; A,Mar-2009,671;
350 432 829 B,Mar-2009,1232; A,Apr-2009,435; C,May-2009,567; B,Mar-2009,5540; A,Apr-2009,671; C,May-2009,987;
355 435 823 A,Jan-2009,145; B,Jan-2009,329; A,Feb-2009,146; A,Jan-2009,675; B,Jan-2009,719; A,Feb-2009,782;
365 442 823 A,Mar-2009,435; C,Mar-2009,1232; A,Mar-2009,671; C,Mar-2009,5540;
370 442 829 C,Apr-2009,435; B,Feb-2009,567; A,Apr-2009,45; C,Apr-2009,671; B,Feb-2009,987; A,Apr-2009,25;

Table : 2

What I am going to discuss is – how I dramatically improved the building times of the Agg Table as shown in the above Table.

The above aggregate table took about 10 hours to build on a 235 million row fact table when we used the SQLServer string concatenation approach of looping through the rows with cursors. I had verified that the slowdown was due to the string concatenation alone and nothing else.

The solution I discuss below reduced the build time to about 1.5 hours.

Please remember that the aggregate table build time was dominated by SQLServer's extremely slow string concatenation.

3-Step Process to Get a 10x Speed Improvement in the ETL Time to Generate the Aggregate Table

1st Step – BCP – the Bulk Copy Program in SQLServer

I used the BCP utility to export the data from the SQLServer tables to a tab-separated flat file.

I was amazed at the raw speed of BCP in SQLServer – it exported very large tables, some close to 250 million rows, to a flat file with each column value separated by a tab in less than 10-15 minutes.

For example, I used a command line like the one shown below:

bcp "SELECT i.Dim1, i.Dim2, i.Dim3, i.[Product-Market-Hier], i.Time, isNull(SUM(i.Measure1), 0), isNull(SUM(i.Measure2), 0) FROM FactTable i
GROUP BY i.Dim1, i.Dim2, i.Dim3, i.[Product-Market-Hier], i.Time
ORDER BY i.Dim1, i.Dim2, i.Dim3, i.[Product-Market-Hier], i.Time"
queryout tmp_FactTable_FROM_BCP_OUT.txt -c -Usa -PPassword

2nd Step – Use Java to Do the String Concatenation

I then wrote a Java program that reads the tab-separated text file generated by BCP, parses each tab-separated column of every line, and writes out another text file – tab-separated, with the columns shown in Table 2.

It was the Java Program that was doing the String Concatenation.
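That step can be sketched roughly as follows – an illustrative rewrite, not the original program; the AggBuilder class and file names are hypothetical, and it assumes the BCP export is already sorted by the three dimension keys, as the ORDER BY in the 1st step guarantees:

```java
// Sketch of the concatenation step: read the tab-separated BCP export
// (sorted by Dim1, Dim2, Dim3) and, for each dimension key, emit one
// row with the measures folded into two semicolon-separated strings,
// matching the Table 2 layout. Names are illustrative.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class AggBuilder {

    // Folds sorted input rows of the form
    // Dim1\tDim2\tDim3\tHier\tTime\tMeasure1\tMeasure2
    // into output rows Dim1\tDim2\tDim3\tMeasure1-Str\tMeasure2-Str.
    static List<String> aggregate(List<String> sortedRows) {
        List<String> out = new ArrayList<String>();
        String key = null;
        StringBuilder m1 = new StringBuilder(), m2 = new StringBuilder();
        for (String row : sortedRows) {
            String[] c = row.split("\t");
            String k = c[0] + "\t" + c[1] + "\t" + c[2];
            if (key != null && !k.equals(key)) {
                out.add(key + "\t" + m1 + "\t" + m2); // flush previous key
                m1.setLength(0);
                m2.setLength(0);
            }
            key = k;
            m1.append(c[3]).append(',').append(c[4]).append(',').append(c[5]).append("; ");
            m2.append(c[3]).append(',').append(c[4]).append(',').append(c[6]).append("; ");
        }
        if (key != null) out.add(key + "\t" + m1 + "\t" + m2);
        return out;
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader("tmp_FactTable_FROM_BCP_OUT.txt"));
        String line;
        while ((line = in.readLine()) != null) rows.add(line);
        in.close();
        PrintWriter out = new PrintWriter("FlatFileGeneratedByJava.txt");
        for (String r : aggregate(rows)) out.println(r);
        out.close();
    }
}
```

Because the input arrives pre-sorted, a single streaming pass with two StringBuilders per key is enough – no server-side cursors and no repeated string copying, which is where SQLServer was losing all the time.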

3rd Step – Take the output from the Java-based step and BCP the results back in

After that it was easy – just BCP the results from the 2nd step into the DB table shaped like Table 2:

bcp TableName_asTable2 in FlatFileGeneratedByJava.txt -c -Usa -Ppassword

How I solved the Slow Login Problem with Liferay

We had been using the Liferay Portal Server on Tomcat for our CMS and user management. When the new systems were first deployed there was no problem for the first 3 months or so, after which comments started trickling in from users that logins were getting slower and slower with each passing month.

The initial login time was < 5 sec (slow, but still quite acceptable). Then, over a period of months, login times deteriorated to 30 sec – and for some users to 90 sec – which was outrageous.

Liferay is known to be slow (see JIRA ticket LEP-2182, http://support.liferay.com/browse/LEP-2182 )

Especially when using Microsoft SQL Server (see  https://www.liferay.com/web/guest/community/forums/message_boards/message/71332)

And indeed we were using SQLServer as the repository for Liferay.

On further investigation I found that it was Liferay's permission-checking algorithm that was causing the slowdown in login response time.

As the number of resources grew over time (in our case, the analytic charts that users create), Liferay's permission-checking algorithm did not scale: checking whether the user trying to log in had permission to Read/Write/Clone/Migrate all the resources in the system took longer and longer.

So the crux of the slowdown was Liferay's inability to scale its permission checks across numerous resources. At one point we had more than 100,000 resources (charts, in our case), and Liferay became slower and slower at checking across all of them.

Solution  :

I found that internally Liferay was using Hibernate's auto-generated SQL (from HQL) for the queries that find the resources a user is able to Modify/Clone/Migrate.

I worked out the ER diagram of the Liferay DB tables and how they are interrelated, and formulated my own SQL query (given below) to get back the resources for a given user belonging to a given customer. Then I intercepted the Hibernate calls made from the Liferay APIs and injected our own optimized SQL to get the results back. The query is a simple union across multiple tables covering the ways a user can get permissions – individually, via group permissions, and via roles. The whole thing now runs with a < 2 sec response time.

Modified SQL Query to get the Permissions for a given user for a given company

 select distinct r.primKey r, r.companyid c
from user_ u , role_ ro, Users_Roles ur, resource_ r
where ro.name='Administrator' and
r.name like 'com.leapfrogrx.gpharma.service.viewerservice.common.ChartSet' and
r.scope = 'individual' and
r.typeid = 'class' and
ur.userId=u.userId and
ur.roleId = ro.roleId and
r.companyId = u.companyId
and u.userId = 'cbergh.agn'
and  r.companyid = 'ALLERGAN'
union
-- Get All Resources that the User can access directly : just by itself
select distinct  r.primKey r, r.companyid c
from resource_ r, permission_ p, users_permissions up, user_ u
where
r.name like 'com.leapfrogrx.gpharma.service.viewerservice.common.ChartSet' and
r.scope = 'individual' and
r.typeid = 'class' and
r.resourceId = p.resourceId and
r.CompanyId = p.CompanyId and
p.permissionID = up.permissionId and
up.userId = u.userId and u.userId = 'cbergh.agn'
and  r.companyid = 'ALLERGAN'
union
-- Get All Resources that the User can access from the group
select distinct  r.primKey r , r.companyid c
from resource_ r, permission_ p, groups_permissions gp, group_ g, users_groups ug
where
r.name like 'com.leapfrogrx.gpharma.service.viewerservice.common.ChartSet' and
r.scope = 'individual' and
r.typeid = 'class' and
r.resourceId = p.resourceId and
p.permissionID = gp.permissionId and
gp.groupId = g.groupId and
g.groupId = ug.groupId and ug.userId = 'cbergh.agn'
and  r.companyid = 'ALLERGAN'
union
-- Get all the resources the user can access through the Roles
select distinct r.primKey r, r.companyid c
from resource_ r, permission_ p, Roles_Permissions rp, Role_ ro, Users_Roles ur
where
r.name like 'com.leapfrogrx.gpharma.service.viewerservice.common.ChartSet' and
r.scope = 'individual' and
r.typeid = 'class' and
r.resourceId = p.resourceId and
p.permissionID = rp.permissionId and
rp.roleId = ro.roleId and
ro.roleId = ur.roleId and ur.userId = 'cbergh.agn'
and  r.companyid = 'ALLERGAN'
union
-- Get all the resources the user can access through the group Roles
select distinct r.primKey r, r.companyid c
from resource_ r, permission_ p, Roles_Permissions rp, Role_ ro, groups_roles gr, Users_groups ug, group_ g
where
r.name like 'com.leapfrogrx.gpharma.service.viewerservice.common.ChartSet' and
r.scope = 'individual' and
r.typeid = 'class' and
r.resourceId = p.resourceId and
p.permissionID = rp.permissionId and
rp.roleId = ro.roleId and
ro.roleId = gr.roleId and
gr.groupid = g.groupId and
g.groupId = ug.groupId and
ug.userId = 'cbergh.agn'
and  r.companyid = 'ALLERGAN'
union
-- Get all the resources the user can access through user groups
select distinct r.primKey r, r.companyid c from resource_ r where resourceID in
(
select resourceID from permission_ where permissionId in
(
select permissionID from groups_permissions where groupId in
(
select groupId from group_ where className like 'com.liferay.portal.model.UserGroup' and companyId = 'ALLERGAN'
and classPK in ( select u1.userGroupId from userGroup u1, Users_UserGroups u2  where companyId =  'ALLERGAN' and
u1.usergroupId = u2.userGroupId and u2.userId = 'cbergh.agn'
)
)
)
)
and
r.companyid = 'ALLERGAN'
and
r.name like 'com.leapfrogrx.gpharma.service.viewerservice.common.ChartSet'
and r.scope = 'individual'
and r.typeid = 'class'

Volatile and Synchronized in Java Explained

This is best understood by looking at the effects that volatile and synchronized have on a method. volatile is a field modifier, while synchronized modifies code blocks and methods. So we can specify three variations of a simple accessor using those two keywords:

int i1;                       int geti1() { return i1; }
volatile int i2;              int geti2() { return i2; }
int i3;          synchronized int geti3() { return i3; }

 

geti1() accesses the value currently stored in i1 in the current thread. Threads can have local copies of variables, and the data does not have to be the same as the data held in other threads. In particular, another thread may have updated i1 in its own memory, but the value in the current thread could be different from that updated value. In fact, Java has the idea of a “main” memory, and this is the memory that holds the current “correct” value for variables. Threads can have their own copy of data for variables, and the thread copy can be different from “main” memory. So it is in fact possible for “main” memory to have a value of 1 for i1, for thread1 to have a value of 2, and for thread2 to have a value of 3, if thread1 and thread2 have both updated i1 but those updates have not yet been propagated to “main” memory or to the other threads.

On the other hand, geti2() effectively accesses the value of i2 from “main” memory. A volatile variable is not allowed to have a local copy that differs from the value currently held in “main” memory. Effectively, a variable declared volatile must have its data synchronized across all threads, so that whenever you access or update the variable in any thread, all other threads immediately see the same value. Of course, volatile variables are likely to have a higher access and update overhead than “plain” variables, since the reason threads can keep their own copies of data is efficiency.

Well, if volatile already synchronizes data across threads, what is synchronized for? There are two differences. First, synchronized obtains and releases locks on monitors, which can force only one thread at a time to execute a code block if the threads use the same monitor (effectively the same object lock). That is the fairly well-known aspect of synchronized. But synchronized also synchronizes memory: it synchronizes the whole of thread memory with “main” memory. So executing geti3() does the following:

  1. The thread acquires the lock on the monitor for object this (assuming the monitor is unlocked, otherwise the thread waits until the monitor is unlocked).
  2. The thread memory flushes all its variables, i.e. it has all of its variables effectively read from “main” memory (JVMs can use dirty sets to optimize this so that only “dirty” variables are flushed, but conceptually this is the same. See section 17.9 of the Java language specification).
  3. The code block is executed (in this case setting the return value to the current value of i3, which may have just been reset from “main” memory).
  4. (Any changes to variables would normally now be written out to “main” memory, but for geti3() we have no changes.)
  5. The thread releases the lock on the monitor for object this.

So where volatile only synchronizes the value of one variable between thread memory and “main” memory, synchronized synchronizes the value of all variables between thread memory and “main” memory, and locks and releases a monitor to boot. Clearly synchronized is likely to have more overhead than volatile.

Using the volatile keyword ensures that the variable is never kept in a register. This guarantees that the variable is truly shared between threads.

 

Synchronization boundaries signal to the virtual machine that it must invalidate its registers.

When the virtual machine enters a synchronized method or block, it must reload data it has cached in its local registers. Before the virtual machine exits a synchronized method or block, it must store its local registers to main memory.
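The visibility guarantee described above can be seen in a small demo – a sketch for illustration only: a worker thread spins until it observes the main thread's write to a volatile flag. Remove the volatile modifier and the worker is permitted to keep reading a stale cached value and spin forever.

```java
// Demonstrates volatile's visibility guarantee: the worker thread's
// busy-wait loop terminates only when the main thread's write to the
// volatile field 'stop' becomes visible to it.
public class VolatileDemo {

    static volatile boolean stop = false;

    // Starts a spinning worker, performs the volatile write from the
    // calling thread, and reports whether the worker saw it in time.
    static boolean runDemo() {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                while (!stop) {
                    // busy-wait: exits once the write to 'stop' is visible
                }
            }
        });
        worker.start();
        try {
            Thread.sleep(100); // let the worker enter its loop
            stop = true;       // volatile write: visible to the worker
            worker.join(2000); // should return almost immediately
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker observed flag: " + runDemo());
    }
}
```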

How we solved – GC every 1 minute on Tomcat

Our system was doing repeated FULL GC cycles almost every 60 seconds, and that was affecting performance on the production machines. I spotted the GCs using JConsole, which comes bundled with Java 5 onwards.

This was quite puzzling, but the good thing was that it happened predictably every 60-odd seconds. Sometimes the GC took 2-3 sec, sometimes less.

My initial thought was that some code was calling System.gc() every minute or so. I searched through the whole code base but found no explicit GC calls in our code, so it seemed that some library we were using on the Tomcat side had a hard-coded System.gc() call.

Then I researched the JVM options and found this one:

-XX:+DisableExplicitGC

And decided to add it to the JVM Startup options and restart the production machines at night.

And then the FULL GC every minute vanished from the log.

Though I could never conclusively prove where the GC calls were coming from, my belief is that some 3rd-party library code was calling System.gc(), and once the flag was set at JVM startup, the GCs vanished.
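That theory is easy to sanity-check with a small program – a sketch, not our production code: under default JVM options an explicit System.gc() is reflected in the GC MXBean collection counts, while launching the same program with -XX:+DisableExplicitGC should leave the counts flat.

```java
// Shows what -XX:+DisableExplicitGC suppresses: by default an explicit
// System.gc() call bumps the collection counts reported by the GC
// MXBeans; with the flag set, the call becomes a no-op.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ExplicitGcDemo {

    // Sums collection counts across all registered collectors;
    // getCollectionCount() may return -1 when undefined, hence the guard.
    static long totalCollections() {
        long n = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            n += Math.max(0, gc.getCollectionCount());
        }
        return n;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        System.gc(); // a 3rd-party library call like this was our suspect
        long after = totalCollections();
        System.out.println("collections before=" + before + " after=" + after);
    }
}
```

Run it once with no flags and once with -XX:+DisableExplicitGC to see the difference in the reported counts.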