Friday, July 17, 2015

pig - Extract data from large Bags after GROUP BY

I have multiple files with same columns and I wanted to aggregate the values in two columns using SUM.

The column structure is below

ID first_count second_count name desc
1  10          10           A    A_Desc
1  25          45           A    A_Desc
1  30          25           A    A_Desc
2  20          20           B    B_Desc
2  40          10           B    B_Desc

In wanted the below output

ID first_count second_count name desc
1  65          80           A    A_Desc
2  60          30           B    B_Desc

Below is the script I wrote .

A = LOAD '/output/*/part*' AS (id:chararray,first_count:int,second_count:int,name:chararray,desc:chararray);
B = GROUP A BY id;

C = FOREACH B GENERATE group as id,

              SUM(A.first_count) as first_count,
              SUM(A.second_count) as second_count,
              A.name as name,
              A.desc as desc;


This resulted in the the below output.The output has multiple tuples for each group i.e. the name and desc were repeated.

1  65          80           {(A)(A)(A)}{(A_Desc)(A_Desc)( A_Desc)}
2  60          30           {(B)(B)} {(B_Desc)( B_Desc)}

When I ran this script on the entire dataset that has 500 millions rows in it the reducer got stuck since the Bags after the group had large number of tuples.

In order to get the desired output.I had to distinct the tuple i.e. name and desc and then FLATTEN them to get the desired output.

D = FOREACH C 
    {
distinctnamebag = DISTINCT name;
distinctdescbag = DISTINCT desc;
GENERATE id,first_count,second_count,flatten(distinctnamebag) as name, flatten(distinctdescbag) as desc;
    }

Now the output looks like this

1  65          80          A     A_Desc
2  60          30          B     B_Desc

No comments: