I have multiple files with same columns and I wanted to aggregate the values in two columns using SUM.
The column structure is below
ID first_count second_count name desc
1 10 10 A A_Desc
1 25 45 A A_Desc
1 30 25 A A_Desc
2 20 20 B B_Desc
2 40 10 B B_Desc
In wanted the below output
ID first_count second_count name desc
1 65 80 A A_Desc
2 60 30 B B_Desc
Below is the script I wrote .
A = LOAD '/output/*/part*' AS (id:chararray,first_count:int,second_count:int,name:chararray,desc:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group as id,
SUM(A.first_count) as first_count,
SUM(A.second_count) as second_count,
A.name as name,
A.desc as desc;
This resulted in the the below output.The output has multiple tuples for each group i.e. the name and desc were repeated.
1 65 80 {(A)(A)(A)}{(A_Desc)(A_Desc)( A_Desc)}
2 60 30 {(B)(B)} {(B_Desc)( B_Desc)}
When I ran this script on the entire dataset that has 500 millions rows in it the reducer got stuck since the Bags after the group had large number of tuples.
In order to get the desired output.I had to distinct the tuple i.e. name and desc and then FLATTEN them to get the desired output.
D = FOREACH C
{
distinctnamebag = DISTINCT name;
distinctdescbag = DISTINCT desc;
GENERATE id,first_count,second_count,flatten(distinctnamebag) as name, flatten(distinctdescbag) as desc;
}
Now the output looks like this
1 65 80 A A_Desc
2 60 30 B B_Desc
The column structure is below
ID first_count second_count name desc
1 10 10 A A_Desc
1 25 45 A A_Desc
1 30 25 A A_Desc
2 20 20 B B_Desc
2 40 10 B B_Desc
In wanted the below output
ID first_count second_count name desc
1 65 80 A A_Desc
2 60 30 B B_Desc
Below is the script I wrote .
A = LOAD '/output/*/part*' AS (id:chararray,first_count:int,second_count:int,name:chararray,desc:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group as id,
SUM(A.first_count) as first_count,
SUM(A.second_count) as second_count,
A.name as name,
A.desc as desc;
This resulted in the the below output.The output has multiple tuples for each group i.e. the name and desc were repeated.
1 65 80 {(A)(A)(A)}{(A_Desc)(A_Desc)( A_Desc)}
2 60 30 {(B)(B)} {(B_Desc)( B_Desc)}
When I ran this script on the entire dataset that has 500 millions rows in it the reducer got stuck since the Bags after the group had large number of tuples.
In order to get the desired output.I had to distinct the tuple i.e. name and desc and then FLATTEN them to get the desired output.
D = FOREACH C
{
distinctnamebag = DISTINCT name;
distinctdescbag = DISTINCT desc;
GENERATE id,first_count,second_count,flatten(distinctnamebag) as name, flatten(distinctdescbag) as desc;
}
Now the output looks like this
1 65 80 A A_Desc
2 60 30 B B_Desc
No comments:
Post a Comment