Monday, December 15, 2014

SQL - Iterate,loop through a string

Recently I had to update a table based on a file which had some counts in it.After loading that file into a table we found that it had 25 Million plus rows in it.While we tried to run an update using the entire table we ended up getting timeouts and the table was getting locked.So we decided to do the update using smaller sets of rows.
The subset was decided on the first character of the Id column.In our case it fell into one of the following characters.

abcdefghijklmnopqrstuvwxyz0123456789

So I created a stored procedure that would take the first character of the Id column and load it into a temp table.Then I ran the update command using this subset of rows in the temp table.I repeated this for all the characters in the above string.

Below is the SQL to achieve this.

declare @id_string varchar(200)
declare @index int
declare @string_length int
declare @char char(1)

set @id_string = 'abcdefghijklmnopqrstuvwxyz0123456789'

set @index = 1
set @string_length= LEN(@id_string )

while @index<= @string_length
begin

set @char = SUBSTRING(@id_string , @index, 1)
--print @char
exec mystoredprocedure @char
set @index= @index+ 1
end

Friday, December 12, 2014

PIG - GroupBy, Count total records,total distinct records,min and max date

Syntactically Group By in PIG is a little different than SQL Group By.I wanted to get some counts from 500+ logs files so I decided to get the counts in PIG, Hadoop.I copied all the log files to HDFS.The files had two columns and were tab delimited.id and timestamp with format(YYYYMMDD).

Below is the script to get the counts using group by in PIG Script.Once the data is grouped by id,getting the totalcounts,min and max is fairly straightforward but for getting distinct counts I had to use a nested FOREACH. In the nested FOREACH, I distinct out the timestamps since I have multiple records in a given day and then use the COUNT operator on the distinct data.

rawdata = LOAD '/myfolder/my_logs/2014*.txt' USING PigStorage('\t') AS (id:chararray,postdate:chararray);

groupdata = GROUP rawdata BY id;
idcountdata = FOREACH groupdata 
                     {
distinctrawdata = DISTINCT rawdata;
GENERATE group as id,COUNT(rawdata) as totalcount,COUNT(distinctrawdata ) as distinctcount,MIN(rawdata .postdate) as mindate,MAX(rawdata .postdate) as maxdate;
     };

store idcountdata INTO '/myoutputfolder/idcounts/' USING PigStorage('\t');

Tuesday, December 9, 2014

iis - Post size exceeded the allowed limits

After hosting our appliction on IIS we started noticing a lot of warnings about the post size in the event viewer.


Once we tweaked the settings for the request length most of the warnings vanished.Below are the screenshots of the warnings,settings that was tweaked on IIS to handle post size from the client and the difference in the files received after the change in settings.










The default post size an application hosted on IIS can handle is 4 MB.After changing this setting to accept post size up to 20 MB in the web.config file for the application, we started receiving posts that were rejected before.


<system.web>
        <httpRuntime maxRequestLength="20480" />
</system.web>




















Monday, December 8, 2014

c# - Create child processes in parallel and wait till all child processes terminate

In one of my projects I had to invoke an external exe in parallel for carrying out simple command to execute a script in Hadoop. The parent process had to wait till all the child processes would complete execution on HDFS and then would carry on other steps.

Below is a simple example to achieve the same.In this example I have created an exe to write to a file i.e. SimpleWrite.exe. I call this exe in parallel (5 instances in parallel) and pass in an argument.The parent process that calls this exe waits till all instances of SimpleWrite completes and then continues its steps.


Program to call external exes in Parallel

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ProcessWait

{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {

                Console.WriteLine("Parent process started executing...");

                Console.WriteLine("Parent process is starting child processes...");
                Parallel.For(0, 5, iIndex =>
                {
                    Console.WriteLine("Child process {0} started",iIndex);
                    NewProcess(iIndex);
                });
                Console.WriteLine("All child processes completed.Waiting for user to press the enter key !!!");
                Console.Read();
                Console.WriteLine("Parent process to continue execution.");
                for (int iIndex = 0; iIndex < 10000; iIndex++)
                {
                    Console.WriteLine(iIndex.ToString());
                }
                Console.WriteLine("Parent process completed execution.");
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
            Console.Read();
        }

        private static void NewProcess(int iIndex)

        {
            Process oRunIndex = new Process();
            oRunIndex.StartInfo.FileName = "SimpleWrite.exe";
            oRunIndex.StartInfo.WorkingDirectory = @"c:\Data\";
            oRunIndex.StartInfo.CreateNoWindow = true;
            oRunIndex.StartInfo.Arguments = string.Format("{0}",iIndex);
            
            oRunIndex.Start();
            oRunIndex.WaitForExit();
        }
    }

}



SimpleWrite.exe code

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace SimpleWrite

{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                string sIndex = args[0];
                using(StreamWriter oWriter = new StreamWriter(@"c:\Data\" + sIndex +".txt"))
                {
                    for (int iIndex = 0; iIndex < 100000; iIndex++)
                    {
                        Console.WriteLine(iIndex.ToString());
                        oWriter.WriteLine(iIndex.ToString());
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.Read();
            }
        }
    }
}


Output