Thursday, July 31, 2014

C# - Remove duplicate lines

A lot of the programs I have written parse files and convert them to tab-delimited files for the data warehouse. The files range from a few megabytes to gigabytes in size. In one of the processes I noticed that the output files had duplicate lines, and I wanted to remove them. The affected files ranged from 2 GB to 6 GB.

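Below is the C# program I wrote to remove the duplicates. It reads every line into a HashSet<string>, which keeps a single copy of each distinct line, so all the distinct lines have to fit in memory. Fortunately I have a Windows Server 2008 machine with 16 GB RAM, so that helps :)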


using System;
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Linq;

namespace DeDupeTextFiles
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                string sInputFilePath = ConfigurationManager.AppSettings["INPUT_FILEPATH"];
                HashSet<string> oDistinctLines = new HashSet<string>();
                using (StreamReader oReader = new StreamReader(sInputFilePath))
                {
                    string sLine = oReader.ReadLine();
                    while (sLine != null)
                    {
                        // Add does nothing when the line is already in the set,
                        // so no separate Contains check is needed.
                        oDistinctLines.Add(sLine);
                        sLine = oReader.ReadLine();
                    }
                }
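                // Write the result next to the input file, e.g. 0.txt -> 0_Distinct.txt.
                // Path.Combine is used instead of string.Replace, which would also
                // rewrite any directory name that happens to contain the file name.
                string sOutputFilePath = Path.Combine(Path.GetDirectoryName(sInputFilePath), Path.GetFileNameWithoutExtension(sInputFilePath) + "_Distinct.txt");
                // WriteAllLines accepts IEnumerable<string> (.NET 4 and later), so no array copy is needed.
                File.WriteAllLines(sOutputFilePath, oDistinctLines);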
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
            }
            Console.Read();
        }
    }
}

App.config

<?xml version="1.0" encoding="utf-8" ?>

<configuration>
  <appSettings>
    <add key="INPUT_FILEPATH" value="E:\test_data\myfolder\file\lookups\201401\0.txt"/>
  </appSettings>
</configuration>

There are other ways to achieve the same result, such as DOS batch commands or loading the files into a database and selecting the distinct rows. The program took less than a minute to remove duplicates from a 3 GB file. The final file was reduced to 180 MB!
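
A possible variation on the program above, as a rough sketch rather than what I actually ran: HashSet<string>.Add returns false when the value is already in the set, so each line can be written to the output the first time it is seen instead of being collected and written at the end. This keeps the lines in first-seen order by construction. HashSet<T> is documented as an unordered collection, so the version above should not be relied on for output order. The class name StreamingDeDupe, the method WriteDistinctLines, and taking the paths from the command line instead of App.config are all assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

namespace DeDupeTextFiles
{
    // Hypothetical helper, not part of the original program.
    static class StreamingDeDupe
    {
        // Copies each line from reader to writer the first time it is seen.
        // Returns the number of distinct lines written.
        public static int WriteDistinctLines(TextReader reader, TextWriter writer)
        {
            // Ordinal comparison: two lines are duplicates only if they match exactly.
            var seen = new HashSet<string>(StringComparer.Ordinal);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Add returns true only when the line was not already in the set.
                if (seen.Add(line))
                    writer.WriteLine(line);
            }
            return seen.Count;
        }

        static void Main(string[] args)
        {
            // Assumption: input and output paths are passed as arguments.
            if (args.Length != 2)
            {
                Console.Error.WriteLine("Usage: DeDupeTextFiles <input file> <output file>");
                return;
            }

            using (var reader = new StreamReader(args[0]))
            using (var writer = new StreamWriter(args[1]))
            {
                int count = WriteDistinctLines(reader, writer);
                Console.WriteLine("Wrote {0} distinct lines", count);
            }
        }
    }
}
```

The set still holds every distinct line, so memory use is about the same as before. The only saving is that the distinct lines are never copied into a second collection before being written.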

