Thursday, July 23, 2015

Hadoop - Decompress gz files from reducer output

Due to space constraints on our Hadoop clusters, we had to store our output files in compressed gz format.

STORE finaldata INTO '/output/${month}/${folder}.gz/' USING PigStorage('\t'); 
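
With PigStorage, giving the output location a .gz extension is what tells Pig to write gzip-compressed part files. As a rough alternative sketch, the compression can also be requested through the standard Hadoop job properties from within the script; the mapred.* property names below are the older Hadoop ones and are an assumption for our cluster version, so treat this as illustrative only.

-- enable gzip compression on the final job output (older mapred.* property names, assumed for this cluster)
SET mapred.output.compress 'true';
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.GzipCodec';
-- no .gz suffix needed on the path when compression is requested through properties
STORE finaldata INTO '/output/${month}/${folder}/' USING PigStorage('\t');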

Once we started storing our output files in the compressed format, we had to change the way we merged the output files and copied them to the local system. The getmerge command copied the files in their compressed form.

hadoop fs -getmerge /output/${month}/${folder}.gz/part-* /output/handoff.txt

In order to decompress the files and copy them over to the local system, we had to use the text command.

hadoop fs -text /output/${month}/${folder}.gz/part-* > /output/handoff.txt
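
The text command recognizes the gzip codec and writes the decompressed records to stdout, which is then redirected into a single local file. A rough equivalent, assuming gunzip is available on the local machine, is to cat the compressed part files and decompress them locally:

# concatenate the gzip part files and decompress the combined stream on the local machine
hadoop fs -cat /output/${month}/${folder}.gz/part-* | gunzip > /output/handoff.txt

gunzip handles a stream of concatenated gzip members, so the merged part files come out as one plain-text file.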

Friday, July 17, 2015

Pig - Extract data from large bags after GROUP BY

I have multiple files with the same columns, and I wanted to aggregate the values in two of the columns using SUM.

The column structure is below

ID first_count second_count name desc
1  10          10           A    A_Desc
1  25          45           A    A_Desc
1  30          25           A    A_Desc
2  20          20           B    B_Desc
2  40          10           B    B_Desc

I wanted the output below:

ID first_count second_count name desc
1  65          80           A    A_Desc
2  60          30           B    B_Desc

Below is the script I wrote.

A = LOAD '/output/*/part*' AS (id:chararray, first_count:int, second_count:int, name:chararray, desc:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group AS id,
                       SUM(A.first_count) AS first_count,
                       SUM(A.second_count) AS second_count,
                       A.name AS name,
                       A.desc AS desc;

This resulted in the output below. The output has multiple tuples for each group, i.e. the name and desc values were repeated.

1  65          80           {(A),(A),(A)}    {(A_Desc),(A_Desc),(A_Desc)}
2  60          30           {(B),(B)}        {(B_Desc),(B_Desc)}

When I ran this script on the entire dataset, which has 500 million rows, the reducer got stuck because the bags produced by the GROUP BY had a very large number of tuples.

In order to get the desired output, I had to apply DISTINCT to the name and desc bags and then FLATTEN them.

D = FOREACH C
    {
        distinctnamebag = DISTINCT name;
        distinctdescbag = DISTINCT desc;
        GENERATE id, first_count, second_count,
                 FLATTEN(distinctnamebag) AS name,
                 FLATTEN(distinctdescbag) AS desc;
    }

Now the output looks like this:

1  65          80          A     A_Desc
2  60          30          B     B_Desc
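
For reference, the aggregation and the DISTINCT can likely be combined into a single nested FOREACH over the grouped relation B, replacing C and D above so everything happens in one pass. This is only a sketch of that variant, not something I benchmarked against the 500 million row dataset.

-- sketch: aggregate and de-duplicate in one nested FOREACH over the grouped relation
C = FOREACH B {
        names = A.name;
        descs = A.desc;
        distinctnamebag = DISTINCT names;
        distinctdescbag = DISTINCT descs;
        GENERATE group AS id,
                 SUM(A.first_count) AS first_count,
                 SUM(A.second_count) AS second_count,
                 FLATTEN(distinctnamebag) AS name,
                 FLATTEN(distinctdescbag) AS desc;
}

Since each id has a single distinct name and desc, the two FLATTENs do not multiply the rows.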

Wednesday, July 8, 2015

Increase process priority in Windows Server 2008

If you have used the Task Scheduler in Windows Server 2008 to schedule a Windows task, you will see that by default the Task Scheduler runs the process with a Below Normal priority.
When I moved the tasks from Windows Server 2003 to Windows Server 2008, I noticed a sharp decline in performance. Most of the tasks that I have scheduled are multi-threaded and I/O intensive.

Even though the task is configured to run under an Admin account with highest privileges, the process still runs with a Below Normal priority.

Note: You can change the priority here, but I did not notice any change in the performance.

One way to bump up the process priority is in code:

System.Diagnostics.Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;

However, I did not see any improvement in process performance.

After playing with the settings, I was able to tweak the configuration file for the task in the Task Scheduler to set the process priority to High. The process now executes faster, and I did see a performance improvement over the previous process execution time.

Below are the steps to override the Windows Server Task Scheduler process priority.

Create the task in the Task Scheduler. Once the task is created, right-click on the task and choose Export. This exports the task properties/settings to an XML file. Save the XML file on the local machine. Delete the task that was created.
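
The export can likely be scripted with schtasks instead of the GUI; the task name below is only a placeholder for whatever the task is actually called.

rem export the task definition to an XML file (task name is a placeholder)
schtasks /Query /TN "MyScheduledTask" /XML > task.xml
rem remove the original task so it can be re-imported later with the edited settings
schtasks /Delete /TN "MyScheduledTask" /F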

Open the XML file which has the task settings and you will see the process priority set to 7.

Edit the priority and set it to the desired value. 4 is Normal and 1 is High. Note that setting a higher priority is generally not recommended. My process is highly I/O intensive, and hence I have used 1.
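
In the exported XML the value lives in the Priority element under Settings; a minimal excerpt, with the rest of the task definition omitted, looks roughly like this:

<Task xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
  <Settings>
    <!-- 7 = Below Normal (the default), 4 = Normal, 1 = High -->
    <Priority>1</Priority>
  </Settings>
</Task>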

Save the XML file. Go to the Task Scheduler, right-click in the task list window, and select Import. Select the XML file for the process, and the task should be created with a higher process priority.
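
Again, the import can likely be scripted with schtasks instead of the GUI; the task name is a placeholder.

rem re-create the task from the edited XML definition (task name is a placeholder)
schtasks /Create /TN "MyScheduledTask" /XML task.xml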

Now my process runs with a High priority, and the process performance has improved. Even though the recommended priority is Normal, after 3 months of execution I have not encountered any issues with the process or the server.