Sunday, July 8, 2012

C# - Count TABs in text files, data scrubbing

On a daily basis I deal with huge .txt files, mainly tab-delimited files. These files contain data extracted from various sources and are usually gigabytes in size. Often the data files are not in the correct format because of junk strings in the extracted data, and the warehouse load fails as a result.

To check that each line in a data file is in the right format, I use the following code to count the tabs on each line. The code also reports the line number of every line that is not in the right format.

I then use Notepad++ to remove the offending lines so that the file can be loaded into the data warehouse.
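The code below reads its two inputs, TAB_COUNT and INPUT_FILE_NAME, from the appSettings section of the application's configuration file. A minimal App.config sketch could look like the following; the key names come from the code, while the values are placeholders made up for illustration. The tab count assumes a line with N columns and no trailing tab, which gives N - 1 tabs.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <appSettings>
    <!-- Expected number of tab characters per line; 11 is a placeholder (12 columns) -->
    <add key="TAB_COUNT" value="11" />
    <!-- Path to the tab-delimited file to check; placeholder path -->
    <add key="INPUT_FILE_NAME" value="C:\data\extract.txt" />
  </appSettings>
</configuration>
```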



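    using System;
    using System.Configuration; // needs a project reference to System.Configuration
    using System.IO;

    class TabCounter
    {
        static void Main()
        {
            int lineNumber = 1;
            int errorCount = 0;
            // Expected tabs per line and the input path come from <appSettings>.
            // ConfigurationManager replaces the obsolete ConfigurationSettings class.
            int tabCount = int.Parse(ConfigurationManager.AppSettings["TAB_COUNT"]);

            using (StreamReader reader = new StreamReader(ConfigurationManager.AppSettings["INPUT_FILE_NAME"]))
            {
                string sLine;
                // Read one line at a time so gigabyte-sized files never sit fully in memory.
                while ((sLine = reader.ReadLine()) != null)
                {
                    // Count the tabs on this line.
                    int count = 0;
                    foreach (char c in sLine)
                    {
                        if (c == '\t')
                            count++;
                    }
                    // Report any line whose tab count differs from the expected value.
                    if (count != tabCount)
                    {
                        errorCount++;
                        Console.WriteLine("Line Number: " + lineNumber + " Tab Count: " + count);
                        Console.WriteLine("***   ERROR ***: " + sLine);
                    }
                    lineNumber++;
                }
            }

            Console.WriteLine("Number of lines not in the right format: " + errorCount);
        }
    }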
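The manual cleanup step in Notepad++ could also be scripted with the same tab-counting idea. The following is only a sketch, not something I run in production: `TabFilter`, `Split`, and the three path parameters are hypothetical names introduced for illustration. It copies well-formed lines to one file and the offending lines to a second file for review.

```csharp
using System.IO;

static class TabFilter
{
    // Hypothetical helper: copies lines with exactly expectedTabs tabs to goodPath,
    // writes every other line to rejectPath, and returns the number of rejected lines.
    public static int Split(string inputPath, string goodPath, string rejectPath, int expectedTabs)
    {
        int rejected = 0;
        using (StreamReader reader = new StreamReader(inputPath))
        using (StreamWriter good = new StreamWriter(goodPath))
        using (StreamWriter reject = new StreamWriter(rejectPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Count the tabs on this line.
                int tabs = 0;
                foreach (char c in line)
                {
                    if (c == '\t')
                        tabs++;
                }

                if (tabs == expectedTabs)
                {
                    good.WriteLine(line);
                }
                else
                {
                    rejected++;
                    reject.WriteLine(line);
                }
            }
        }
        return rejected;
    }
}
```

A call such as `TabFilter.Split(inputFile, cleanFile, rejectFile, tabCount)` would leave the original file untouched. Two assumptions to be aware of: the writers use the default encoding (UTF-8), and `WriteLine` emits the platform's line ending, so the clean file may differ from the source in encoding or line endings.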
