Friday, March 25, 2016

Handling $ sign in Pig

Today I came across a task that required calculating the minimum value in a dataset. Though the task was straightforward, the catch was that the amount fields had $ signs in them. Loading these fields using PigStorage was causing data loss. To handle this, I had to use a regular expression to remove the $ signs, perform the necessary aggregate functions, and then get the results.

Input:

A,$820.48,$11992.70,996,891,1629
A,$817.12,$2105.57,1087,845,1630
B,$974.48,$5479.10,965,827,1634
B,$943.70,$9162.57,939,895,1635

Pig script:

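-- Load each record as a single line of text so nothing is dropped at load time
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
-- Remove every character that is not a letter, digit, '.', ',' or whitespace (this strips the $ signs)
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
-- Split the cleaned line on commas and flatten the resulting tuple into fields
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
-- Cast the two amount fields to float and the remaining numeric fields to int
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
-- Group all rows together, take the minimum of each amount field, and put the $ sign back
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));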

DUMP D;
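
As a quick sanity check outside Pig, the same steps (strip the $ signs, split on commas, cast, take the minimum) can be sketched in plain Python. This is only an illustration under a few assumptions: the helper names clean and min_amounts are made up for this example, the four rows are copied from the input above, and the two-decimal formatting is a choice made here, so Pig's own chararray cast may print the numbers slightly differently.

```python
import re

# Sample rows copied from the input shown above
rows = [
    "A,$820.48,$11992.70,996,891,1629",
    "A,$817.12,$2105.57,1087,845,1630",
    "B,$974.48,$5479.10,965,827,1634",
    "B,$943.70,$9162.57,939,895,1635",
]


def clean(line):
    # Same character class as the REPLACE call in the Pig script:
    # drop anything that is not a letter, digit, '.', ',' or whitespace
    return re.sub(r"[^a-zA-Z0-9.,\s]+", "", line)


def min_amounts(lines):
    # Split each cleaned line on commas, like STRSPLIT + FLATTEN
    parsed = [clean(line).split(",") for line in lines]
    # Minimum of the second and third fields, like MIN(B1.$1) and MIN(B1.$2)
    min_first = min(float(fields[1]) for fields in parsed)
    min_second = min(float(fields[2]) for fields in parsed)
    # Put the $ sign back; two decimals is an assumption of this sketch
    return "$" + format(min_first, ".2f"), "$" + format(min_second, ".2f")


print(min_amounts(rows))  # prints ('$817.12', '$2105.57')
```

Comparing this against the DUMP of D is a cheap way to confirm that the regular expression removed only the $ signs and left the decimal points and commas intact.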

Output:


