In this article I describe how to insert data into an HBase Column Family using Pig.
While doing this I also learnt which kinds of characters ( some special ones )
CANNOT be inserted using this process.
Figuring out the above was a pretty tedious task – especially since I ran into these problems while the loader was running on extremely large data sets. I had to use a
sort of binary search on these multi-million-row files to find where the code failed
( by this I mean – I divided up the data set into 2-3 roughly equal chunks, ran the
loading process for each chunk, and repeated this recursively on the chunks that failed to load – to zero in on the characters / rows of data that were causing the problem ).
Soon I discovered that the best way to debug was to generate a data set containing all the special characters, see which ones pass and which ones fail, and then adapt my data loading code to replace the characters that failed the data loading with a white space.
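The clean-up step above can be sketched in Python. The exact character set is whatever your own pass/fail test identifies, so the list below (tab, newline, and the map delimiters used by the data file format) is an assumption, not the definitive list:

```python
# Sketch: replace characters that broke the Pig -> HBase load with a space.
# BAD_CHARS is an assumption - fill it with whatever characters your own
# pass/fail test data set identifies as failing.
BAD_CHARS = {"\t", "\n", "#", ",", "[", "]"}

def sanitize(value):
    """Replace each problematic character in a field value with a space."""
    return "".join(" " if ch in BAD_CHARS else ch for ch in value)
```

Running every field value through a function like this before writing the data file avoids the recursive chunk-and-retry debugging entirely.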
Characters in the data that did not work in Pig bulk loading into HBase
General Format of the data file for insertion into an HBase Column Family using Pig
The 1st field is the ROWID for the HBase Table
There is a TAB Separator with the next Field
The next field is enclosed in square brackets [ ] (the literal form Pig uses for a map)
Since there can be any number of ColumnQualifiers – which can be defined dynamically in an HBase Column Family – this 2nd field contains a varying number of Key, Value pairs
This field contains the ColumnQualifierName & ColumnQualifierValue
Each such ColumnQualifier is separated by a ‘,’
ColumnQualifierName and ColumnQualifierValue is separated by the ‘#’ character.
Since HBase allows the flexibility to have any number of ColumnQualifiers, with any names, for a given row in a Column Family, all the ColumnQualifiers for a given row can be specified in one line of the data file
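Putting the rules above together, one line of the data file can be generated as sketched below (the row ID, field names, and values are hypothetical):

```python
def make_line(row_id, qualifiers):
    """Build one data-file line: ROWID, a TAB, then [name#value,name#value,...]."""
    pairs = ",".join(f"{name}#{value}" for name, value in qualifiers.items())
    return f"{row_id}\t[{pairs}]"

# A row with two dynamically named ColumnQualifiers:
line = make_line("row0001", {"city": "Pune", "zip": "411001"})
# line == "row0001\t[city#Pune,zip#411001]"
```

Each row can carry a different set of qualifier names, which matches HBase's schemaless Column Family model.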
Pig Script to load the data into HBase
Once the dataSet is generated and it is on HDFS – use the line below to specify the data set's location and format in Pig
dataSet = load 'pigdata/DataSet' as (rowID:chararray, dataMap:map[]);
Then make a call to Pig's HBase loader class – HBaseStorage
store dataSet into 'hbase://Table_Name' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:*');
The parameter to the HBaseStorage constructor contains –
'A' – the name of the Column Family in HBase where the data is to be inserted; the '*' tells HBaseStorage to write every key in the map as a ColumnQualifier of that family
Once the 2nd command is executed within Pig – the data file starts getting loaded into HBase