
Map of Big Data Tools and Solutions Landscape

[Image: BigDataMindMap]


Binarization Using Map Reduce

Code Available at – https://github.com/palsumitpal/JavaCode/tree/master/Binarization

Description

Binarization is often needed as a preprocessing step on input data before machine learning algorithms are applied: each distinct value of a column becomes its own column holding 1 or 0. Here is my initial attempt at writing MapReduce code for Hadoop to achieve binarization.

The code on GitHub can be built with Maven using the pom.xml file.

Command Line

hadoop jar binarization.jar hivetableinput hivecolumnmetatest hivebinarizeouttest \N ? ColumnNames.txt 1 2 4 5 6 7 11 16 17 18 19 20 27

Command Line Parameters

1st Param: input directory containing the files to be binarized

2nd Param: directory where the ColumnMetaData is generated

3rd Param: output directory for the binarized data

4th Param: missing column value indicator in the input files (\N in the command above)

5th Param: missing column value indicator in the output files (? in the command above)

6th Param: column names file for the input data set (1 column name per line; the names are used to generate the header of the binarized output)

7th Param onwards: the column numbers (starting from 1) of the input file columns, in the order listed in the 6th Param file, which need to be binarized
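To make the parameter layout concrete, here is a minimal Python sketch of how the positional parameters above could be unpacked and checked. It is only an illustration, not the argument handling in the repository; the names `BinarizeArgs` and `parse_args` and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BinarizeArgs:
    """Hypothetical container for the positional parameters described above."""
    input_dir: str          # 1st Param
    column_meta_dir: str    # 2nd Param
    output_dir: str         # 3rd Param
    missing_in: str         # 4th Param
    missing_out: str        # 5th Param
    column_names_file: str  # 6th Param
    columns: List[int]      # 7th Param onwards, 1-based column numbers


def parse_args(argv: List[str]) -> BinarizeArgs:
    """Unpack the parameters that follow the jar name on the command line."""
    if len(argv) < 7:
        raise ValueError("expected 6 fixed parameters plus at least one column number")
    columns = [int(a) for a in argv[6:]]
    if any(c < 1 for c in columns):
        raise ValueError("column numbers start from 1")
    return BinarizeArgs(*argv[:6], columns)


args = parse_args(["hivetableinput", "hivecolumnmetatest", "hivebinarizeouttest",
                   "\\N", "?", "ColumnNames.txt", "1", "2", "4"])
print(args.columns)
```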

Assumptions

Columns that are part of the input file but are not listed from the 7th Param onwards are not binarized; they are appended unchanged as the last columns of the output file.

Example

If you have an input file with the following columns: Col1, Col2, Col3, Col4, Col5

and you want to binarize the following columns: Col1, Col4, Col5

then only the contents of Col1, Col4, Col5 will be binarized in the output file.

The values of Col2 and Col3 will still appear in the output file, but as the last columns.

If

Col1 has the distinct values C11, C12, C13

Col4 has the distinct values C41, C42

Col5 has the distinct values C51, C52, C53, C54, C55

then the output columns will be:

C11, C12, C13, C41, C42, C51, C52, C53, C54, C55, Col2, Col3

The columns C11, C12, C13, C41, C42, C51, C52, C53, C54, C55 will have the value 1 or 0, since they are the ones that are binarized.

Col2 and Col3 will have the original values from the input file.

Remember that the number of rows in the output file will be the same as the number of rows in the input file.
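The example above can be sketched for a single row in a few lines of Python. This is a plain in-memory illustration of the output layout, not the MapReduce code; the function name `binarize_row` and the sample values `x` and `y` for Col2 and Col3 are made up.

```python
from typing import Dict, List


def binarize_row(row: Dict[str, str],
                 distinct: Dict[str, List[str]],
                 passthrough: List[str]) -> List[str]:
    """One 0/1 cell per distinct value, then the untouched columns at the end."""
    out: List[str] = []
    for col, values in distinct.items():
        out.extend("1" if row[col] == v else "0" for v in values)
    out.extend(row[col] for col in passthrough)
    return out


# Distinct values per binarized column, as listed in the example above.
distinct = {
    "Col1": ["C11", "C12", "C13"],
    "Col4": ["C41", "C42"],
    "Col5": ["C51", "C52", "C53", "C54", "C55"],
}

# A made-up input row; Col2 and Col3 are not binarized.
row = {"Col1": "C12", "Col2": "x", "Col3": "y", "Col4": "C41", "Col5": "C55"}

encoded = binarize_row(row, distinct, passthrough=["Col2", "Col3"])
print(",".join(encoded))  # prints 0,1,0,1,0,0,0,0,0,1,x,y
```

Each input row produces exactly one output row, which is why the row counts of the input and output files match.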

Program Description

The MR program is composed of 2 jobs.

Assume the input data file has no headers, for example:

a1      b1      c1      d1

a2      b1      c1      d3

a1      b2      c2      d1

a3      b1      c4      d2

a1      b2      c3      d1

a2      b1      c1      d1

JOB1

Output of the Mapper (shown grouped by key, as it reaches the Reducer)

1      [a1, a2, a1, a3, a1, a2]

Key – Column#

Value – List of Values in Column#

Output from Reducer

1,a1, a2, a3

2,b1, b2

3,c1, c2, c4, c3

4,d1, d3, d2

Key – Column#

Value – List of Unique Values in Column#
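The map, group and reduce steps of Job 1 can be simulated locally in Python on the sample data. This is only a sketch of the idea, not the Hadoop code in the repository: the tab delimiter is an assumption, and the function names are hypothetical. In a real Hadoop job the order in which values reach a reducer is not guaranteed, so the order of the unique values may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# The sample input rows from above, assumed to be tab-separated.
ROWS = [
    "a1\tb1\tc1\td1",
    "a2\tb1\tc1\td3",
    "a1\tb2\tc2\td1",
    "a3\tb1\tc4\td2",
    "a1\tb2\tc3\td1",
    "a2\tb1\tc1\td1",
]


def map_row(line: str, columns: List[int]) -> Iterator[Tuple[int, str]]:
    """Mapper: emit (column number, value) for every column to be binarized."""
    fields = line.split("\t")
    for col in columns:
        yield col, fields[col - 1]


def group_by_key(pairs: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Stand-in for the shuffle phase: collect all values per column number."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_column(values: List[str]) -> List[str]:
    """Reducer: keep each value once, in first-seen order."""
    return list(dict.fromkeys(values))


columns = [1, 2, 3, 4]
pairs = [pair for line in ROWS for pair in map_row(line, columns)]
grouped = group_by_key(pairs)
column_meta = {col: reduce_column(vals) for col, vals in sorted(grouped.items())}

for col, values in column_meta.items():
    print(f"{col},{', '.join(values)}")
```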

 

The above is written to the ColumnMeta file, which is used in Job 2. It contains data only for the columns listed from the 7th Param onwards on the command line.

The output of the 1st Job is written to the 2nd Param directory.

This directory contains the ColumnMetaData, which is the output of the Reduce step.

It also contains the header string for the final output generated by the 2nd Job, in a file named columnHeader.txt.

The header is generated from the column names in the 6th Param file (the column names file) plus the unique values in each column.
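As a rough illustration of how such a header could be put together, the Python sketch below follows the layout of the earlier Col1 to Col5 example: the unique values of the binarized columns first, then the names of the remaining columns. The exact format written to columnHeader.txt may differ; the comma delimiter, the function name `build_header` and the column names A to D are assumptions.

```python
from typing import Dict, List


def build_header(column_names: List[str],
                 column_meta: Dict[int, List[str]],
                 binarize_columns: List[int],
                 delimiter: str = ",") -> str:
    """Unique values of the binarized columns, then names of the other columns."""
    header: List[str] = []
    for col in binarize_columns:
        header.extend(column_meta[col])
    chosen = set(binarize_columns)
    for number, name in enumerate(column_names, start=1):
        if number not in chosen:
            header.append(name)
    return delimiter.join(header)


# Hypothetical column names file contents (one name per line in the real file).
column_names = ["A", "B", "C", "D"]
# Unique values per column number, in the shape produced by Job 1.
column_meta = {1: ["a1", "a2", "a3"], 3: ["c1", "c2", "c4", "c3"]}

header = build_header(column_names, column_meta, binarize_columns=[1, 3])
print(header)  # prints a1,a2,a3,c1,c2,c4,c3,B,D
```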

2nd Job – Map Only Job

This generates the actual binarized output file.

To get a file that has both the header and the binarized output, concatenate the file columnHeader.txt (from the output directory of the 1st Job) with the contents of the file in the output directory of this job.
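The concatenation step can be sketched in Python as follows. This assumes the header file and the Job 2 output have already been copied from HDFS to the local file system (for example with `hadoop fs -get`), and that the output files follow Hadoop's usual `part-*` naming; the function name and the paths in the usage note are hypothetical.

```python
from pathlib import Path


def write_with_header(header_file: Path, job2_dir: Path, dest: Path) -> None:
    """Write the header line first, then every part file of the binarized output."""
    with dest.open("w", encoding="utf-8") as out:
        header = header_file.read_text(encoding="utf-8").rstrip("\n")
        out.write(header + "\n")
        # Sorting keeps the part files in a stable order (part-m-00000, part-m-00001, ...).
        for part in sorted(job2_dir.glob("part-*")):
            with part.open(encoding="utf-8") as src:
                for line in src:
                    out.write(line)
```

A call might look like `write_with_header(Path("meta/columnHeader.txt"), Path("binarized"), Path("binarized_with_header.csv"))`. Files such as `_SUCCESS` are skipped because they do not match `part-*`.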

Enhancements which can be made to this Program

  • Parameter checking, with error messages for invalid input
  • Error handling
  • Reading input data from Hive / HBase / Impala
  • Test cases

Testing

Here is a Python program that compares the output of the MapReduce binarization code above against results you may have generated with some other program.

https://github.com/palsumitpal/JavaCode/blob/master/Binarization/CompareBinarizeResults.py

The Python program needs the following inputs:

Param1 – name of the 1st output file

Param2 – name of the 2nd output file

Mapping File – maps each column of the file in Param1 to the matching column of the file in Param2. The format of this file is Col#FromFile1 TAB Col#FromFile2.

An example mapping file is also posted on GitHub:

https://github.com/palsumitpal/JavaCode/blob/master/Binarization/MappingHeader.txt
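For readers who want the idea without opening the repository, here is a minimal sketch of such a comparison. It is not the CompareBinarizeResults.py script itself. It assumes that both files are comma-delimited, that their rows are in the same order, and that the column numbers in the mapping file start from 1; the function names are hypothetical.

```python
from typing import List, Tuple


def read_mapping(path: str) -> List[Tuple[int, int]]:
    """Read lines of the form Col#FromFile1 TAB Col#FromFile2."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                col1, col2 = line.split("\t")
                pairs.append((int(col1), int(col2)))
    return pairs


def read_rows(path: str, delimiter: str) -> List[List[str]]:
    """Split every non-empty line of a data file into its fields."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split(delimiter) for line in f if line.strip()]


def compare(file1: str, file2: str, mapping_file: str,
            delimiter: str = ",") -> List[Tuple[int, int, int]]:
    """Return (row number, column in file1, column in file2) for each mismatch."""
    rows1 = read_rows(file1, delimiter)
    rows2 = read_rows(file2, delimiter)
    if len(rows1) != len(rows2):
        raise ValueError(f"row counts differ: {len(rows1)} vs {len(rows2)}")
    mapping = read_mapping(mapping_file)
    mismatches = []
    for row_number, (r1, r2) in enumerate(zip(rows1, rows2), start=1):
        for col1, col2 in mapping:
            if r1[col1 - 1].strip() != r2[col2 - 1].strip():
                mismatches.append((row_number, col1, col2))
    return mismatches
```

An empty result from `compare(...)` means that every mapped column agrees on every row.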