We had an ETL Build step – that used to take 75-80% of the build time. This step heavily used SQLServer’s String concatenation method and often this used to take about 30-40 hours. It has been widely documented that SQLServer’s String Concatenation is extremely slow and we had also observed that during the time SQLServer does String Concatenation the CPU usage on the machine goes up and stays there for the entire duration.
The article below describes how – I speeded up the above step by getting rid of String concatenation entirely and using an alternate technique to speed up the ETL build times. After the steps below were implemented – the above ETL Step took about 1.5-2.0 hours. What an improvement. Thanks to the crude raw speed of BCP ( bulk copy protocol ) from SQLServer and Java.
Let me lay the background of what we were trying to do before I describe the steps.
Assume we had a fact table that looked like below – Everything in Dimension1, Dimension2, Dimension3 is shown as a Foreign KeyID to the Dimensional tables.
Table : 1
We needed to create an Aggregate Table as shown below for doing some reporting in our BI Analytic Tool.
Also in our domain – Dimension 1 had a cardinality of about 2 million.
|345||435||823||A,Jan-2009,145; B,Jan-2009,329; C,Jan-2009,146; A,Mar-2009,435;||A,Jan-2009,675; B,Jan-2009,719; C,Jan-2009,782; A,Mar-2009,671;|
|350||432||829||B,Mar-2009,1232; A,Apr-2009,435; C,May-2009,567;||B,Mar-2009,5540; A,Apr-2009,671; C,May-2009,987;|
|355||435||823||A,Jan-2009,145; B,Jan-2009,329; A,Feb-2009,146;||A,Jan-2009,675; B,Jan-2009,719; A,Feb-2009,782;|
|365||442||823||A,Mar-2009,435; C,Mar-2009,1232;||A,Mar-2009,671; C,Mar-2009,5540;|
|370||442||829||C,Apr-2009,435; B,Feb-2009,567; A,Apr-2009,45;||C,Apr-2009,671; B,Feb-2009,987; A,Apr-2009,25;|
Table : 2
What I am going to discuss is – how I dramatically improved the building times of the Agg Table as shown in the above Table.
The above Aggregate table took about 10 hours to be build on a 235 million row fact table – when we used a SQLServer String concatenation approach of using cursors to loop through the rows. I had verified that the slowdown was due to the String Concatenation only and nothing else.
The solution I discuss below reduced the built time to about 1.5 hours.
Please remember that the Aggregate table build time was mainly because SQLServer was very very very very slow doing String Concatenation.
3 Step Process to get 10 time speed improvements in ETL Time to generate the Aggregate Table
1st Step – BCP – Bulk Copy Protocol in SQLServer
I used the BCP program to get the data out from the SQLServer Tables to a flat file separated by Tabs.
I was at times amazed at the raw speed of BCP in SQLServer to get the data out from a very very large table sometimes almost the size of 250 Million row tables in less than 10-15 min to a flat file and each column value separated by a Tab.
For example I used the command line as shown below :
bcp “SELECT i.Dim1,i.Dim2, i.Dim3, i.Product-Market-Hierarchy, i.Time, isNull(SUM(i.Measure1), 0) , isNull(SUM(i.Measure2), 0) FROM FactTable i
GROUP BY i.Dim1,i.Dim2, i.Dim3, i.Product-Market-Hier, i.Time
ORDER BY i.Dim1, i.Dim2, i.Dim3, i.Product-Market-Hier, i.Time”
QUERYOUT tmp_FactTable_FROM_BCP_OUT.txt -c -Usa -PPassword
2nd Step – Use Java to Do the String Concatenation
I then wrote a java program that reads the above generated Tab separated text file from BCP and uses FileIO to read each line – parse each column separated with Tabs and then generates an output which is a Text file – with each column separated by Tabs and with the columns as shown in Table 2.
It was the Java Program that was doing the String Concatenation.
3rd Step – Take the output from the Java Based step – and BCP in the results back
After that it was easy. Just BCP in the results obtained from 2nd step into the DB Table 2.
IN FLatFileGeneratedByJava.txt -c -Usa -Ppassword