Data characters that failed to be inserted into HBase using Pig's HBaseStorage class

In this article I describe how to insert data into an HBase Column Family using Pig.

While doing this I also learnt which characters (some special ones) CANNOT be inserted using this process.

Figuring out the above was a pretty tedious task, especially since I ran into these problems while the loader was running on extremely large data sets, and I had to use a sort of binary search on these multi-million-row files to find where the code failed. (By this I mean I divided the data set into 2-3 almost equal chunks, ran the loading process for each chunk, and repeated this recursively for the chunks that failed to load, to zero in on the characters / rows of data that were causing the problem.)

Soon I discovered that the best way to debug was to generate a data set containing all the special characters, see which ones pass and which ones fail, and then adapt my data loading code to replace the characters that failed the data loading with a white space.

Characters in the data that did not work in Pig bulk loading into HBase

(
)
&
,
[
]
{
}
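The replacement step described above (swapping each failing character for a white space before the file is handed to Pig) can be sketched as follows; `sanitize` is a hypothetical helper name, not part of Pig or HBase.

```python
# Sketch of the sanitization step: replace each character that failed
# Pig/HBaseStorage bulk loading with a white space before writing the data file.

BAD_CHARS = "()&,[]{}"
CLEAN = str.maketrans({c: " " for c in BAD_CHARS})

def sanitize(value: str) -> str:
    """Replace every problem character with a single space."""
    return value.translate(CLEAN)

print(sanitize("price (USD): {100,50} & tax"))
```

Replacing rather than deleting keeps the field lengths roughly intact, which makes before/after rows easier to compare while debugging.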

General Format of the data file for insertion into an HBase Column Family using Pig

ROWID [ColumnQualifierName#ColumnValue,ColumnQualifierName#ColumnValue,…]

The 1st field is the ROWID for the HBase table.
A TAB separates it from the next field.
The next field is enclosed in [].

Since there can be any number of ColumnQualifiers, which can be defined dynamically in an HBase Column Family, this 2nd field contains a varying number of key-value pairs.

This field contains the ColumnQualifierName & ColumnQualifierValue

Each such ColumnQualifier is separated by a ','.

The ColumnQualifierName and ColumnQualifierValue are separated by the '#' character.

Since HBase allows the flexibility to have any number of ColumnQualifiers, with any name, for a given row in a Column Family, all the ColumnQualifiers for a given row can be specified in 1 line of the data file:
ROWID1 [ColumnQualifierName1#ColumnValue1,ColumnQualifierName34#ColumnValue7]
ROWID2 [ColumnQualifierName1#ColumnValue2,ColumnQualifierName12#ColumnValue7,ColumnQualifierName10#ColumnValue17]
ROWID3 [ColumnQualifierName2#ColumnValue22,ColumnQualifierName12#ColumnValue5,ColumnQualifierName15#ColumnValue15,ColumnQualifierName16#ColumnValue16]
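Generating a line in the format shown above is straightforward; the sketch below shows one way to do it in Python. `make_line` is a hypothetical helper name, not part of Pig or HBase.

```python
# Sketch of generating one line of the data file in the format described above:
# ROWID<TAB>[qualifier#value,qualifier#value,...]

def make_line(row_id, qualifiers):
    """Serialize a row id and its qualifier/value pairs for Pig's map loader."""
    pairs = ",".join(f"{name}#{value}" for name, value in qualifiers.items())
    return f"{row_id}\t[{pairs}]"

print(make_line("ROWID1", {"ColumnQualifierName1": "ColumnValue1",
                           "ColumnQualifierName34": "ColumnValue7"}))
```

Any sanitization of the values (replacing the failing characters with a space) would happen before the pairs are joined, since the '#', ',' and '[]' characters are structural in this format.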

Pig Script to load the data into HBase

Once the data set is generated and it is on HDFS, use the line below to specify the data set location in Pig and its format:

dataSet = load 'pigdata/DataSet' as (rowID:chararray, dataMap:map[]);

Then make a call to Pig's HBase loader class, HBaseStorage:

store dataSet into 'hbase://Table_Name' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:*');

In the parameter to the HBaseStorage constructor, 'A:*':
'A' is the name of the Column Family in HBase where the data is to be inserted, and '*' means that every key in dataMap becomes a ColumnQualifier in that family.

Once the 2nd command is executed within Pig, the data file starts getting loaded into HBase.
