How we reduced ETL Build Time for our Data Warehouse in SQLServer by Using BCP and Java Based String Concatenation rather than SQLServer’s String Concat mechanism

We had an ETL Build step – that used to take 75-80% of the build time. This step heavily used SQLServer’s String concatenation method and often this used to take about 30-40 hours. It has been widely documented that SQLServer’s String Concatenation is extremely slow and we had also observed that during the time SQLServer does String Concatenation the CPU usage on the machine goes up and stays there for the entire duration.

The article below describes how – I speeded up the above step by getting rid of String concatenation entirely and using an alternate technique to speed up the ETL build times. After the steps below were implemented – the above ETL Step took about 1.5-2.0 hours. What an improvement.  Thanks to the crude raw speed of BCP ( bulk copy protocol ) from SQLServer and Java.

Let me lay the background of what we were trying to do before I describe the steps.

Assume we had a fact table that looked like below – Everything in Dimension1, Dimension2, Dimension3 is shown as a Foreign KeyID to the Dimensional tables.

Dimension1 Dimension2 Dimension3 Product-Market Hierarchy Time Measure1 Measure2
345 435 823 A Jan-2009 145 675
345 435 823 B Jan-2009 329 719
345 435 823 C Jan-2009 146 782
345 435 823 A Mar-2009 435 671
350 432 829 B Mar-2009 1232 5540
350 432 829 A Apr-2009 435 671
350 432 829 C May-2009 567 987
.
355 435 823 A Jan-2009 145 675
355 435 823 B Jan-2009 329 719
355 435 823 A Feb-2009 146 782
365 442 823 A Mar-2009 435 671
365 442 829 C Mar-2009 1232 5540
370 442 829 C Apr-2009 435 671
370 442 829 B Feb-2009 567 987
370 442 829 A Apr-2009 45 25

Table : 1

We needed to create an Aggregate Table as shown below for doing some reporting in our BI Analytic Tool.

Also in our domain – Dimension 1 had a cardinality of about 2 million.

Dim1 Dim2 Dim3 Measure1-Str Measure2-Str
345 435 823 A,Jan-2009,145; B,Jan-2009,329; C,Jan-2009,146; A,Mar-2009,435; A,Jan-2009,675; B,Jan-2009,719; C,Jan-2009,782; A,Mar-2009,671;
350 432 829 B,Mar-2009,1232; A,Apr-2009,435; C,May-2009,567; B,Mar-2009,5540; A,Apr-2009,671; C,May-2009,987;
355 435 823 A,Jan-2009,145; B,Jan-2009,329; A,Feb-2009,146; A,Jan-2009,675; B,Jan-2009,719; A,Feb-2009,782;
365 442 823 A,Mar-2009,435; C,Mar-2009,1232; A,Mar-2009,671; C,Mar-2009,5540;
370 442 829 C,Apr-2009,435; B,Feb-2009,567; A,Apr-2009,45; C,Apr-2009,671; B,Feb-2009,987; A,Apr-2009,25;

Table : 2

What I am going to discuss is – how I dramatically improved the building times of the Agg Table as shown in the above Table.

The above Aggregate table took about 10 hours to be build on a 235 million row fact table – when we used a SQLServer String concatenation approach of using cursors to loop through the rows. I had verified that the slowdown was due to the String Concatenation only and nothing else.

The solution I discuss below reduced the built time to about 1.5 hours.

Please remember that the Aggregate table build time was mainly because SQLServer was very very very very slow doing String Concatenation.

3 Step Process to get 10 time speed improvements in ETL Time to generate the Aggregate Table

1st Step – BCP – Bulk Copy Protocol  in SQLServer

I used the BCP program to get the data out from the SQLServer Tables to a flat file separated by Tabs.

I was at times amazed at the raw speed of BCP in SQLServer to get the data out from a very very large table sometimes almost the size of 250 Million row tables in less than 10-15 min to a flat file and each column value separated by a Tab.

For example I used the command line as shown below :

bcp “SELECT i.Dim1,i.Dim2, i.Dim3, i.Product-Market-Hierarchy, i.Time,  isNull(SUM(i.Measure1), 0) ,  isNull(SUM(i.Measure2), 0)  FROM FactTable i

GROUP BY i.Dim1,i.Dim2, i.Dim3, i.Product-Market-Hier, i.Time

ORDER BY  i.Dim1, i.Dim2, i.Dim3, i.Product-Market-Hier, i.Time”

QUERYOUT    tmp_FactTable_FROM_BCP_OUT.txt -c -Usa -PPassword

2nd Step – Use Java to Do the String Concatenation

I then wrote a java program that reads the above generated Tab separated text file from BCP and uses FileIO to read each line – parse each column separated with Tabs and then generates an output which is a Text file – with each column separated by Tabs and with the columns as shown in Table 2.

It was the Java Program that was doing the String Concatenation.

3rd Step – Take the output from the Java Based step – and BCP in the results back

After that it was easy. Just BCP in the results obtained from 2nd step into the  DB Table 2.

bcp TableName_asTable2

IN  FLatFileGeneratedByJava.txt -c -Usa -Ppassword

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s