Friday, December 12, 2014

PIG - GroupBy, Count total records,total distinct records,min and max date

Syntactically Group By in PIG is a little different than SQL Group By.I wanted to get some counts from 500+ logs files so I decided to get the counts in PIG, Hadoop.I copied all the log files to HDFS.The files had two columns and were tab delimited.id and timestamp with format(YYYYMMDD).

Below is the script to get the counts using group by in PIG Script.Once the data is grouped by id,getting the totalcounts,min and max is fairly straightforward but for getting distinct counts I had to use a nested FOREACH. In the nested FOREACH, I distinct out the timestamps since I have multiple records in a given day and then use the COUNT operator on the distinct data.

rawdata = LOAD '/myfolder/my_logs/2014*.txt' USING PigStorage('\t') AS (id:chararray,postdate:chararray);

groupdata = GROUP rawdata BY id;
idcountdata = FOREACH groupdata 
                     {
distinctrawdata = DISTINCT rawdata;
GENERATE group as id,COUNT(rawdata) as totalcount,COUNT(distinctrawdata ) as distinctcount,MIN(rawdata .postdate) as mindate,MAX(rawdata .postdate) as maxdate;
     };

store idcountdata INTO '/myoutputfolder/idcounts/' USING PigStorage('\t');

No comments: