Thursday, July 31, 2014

The process cannot access the file because it is being used by another process - Task Parallel Library

In one of the projects I am converting to .NET 4.0, I started getting the following error: 'The process cannot access the file because it is being used by another process'. The program processes a bunch of files in a set of folders, and I am using the .NET 4.0 Task Parallel Library to handle these folders across different threads on the server. The server has 16 GB of RAM and 24 cores.

The code that was causing the error is below:


Parallel.ForEach(rangePartitioner, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, (range, loopState) =>
{
    try
    {
        using (FileStream oFileStream = new FileStream(oFileInfo.FullName, FileMode.Open))
        using (GZipStream oGzipStream = new GZipStream(oFileStream, CompressionMode.Decompress))
        using (StreamReader oStreamReader = new StreamReader(oGzipStream, Encoding.ASCII))
        {
            String sRawDataLine = oStreamReader.ReadLine();
            while (sRawDataLine != null)
            {
                //Do Something();
                sRawDataLine = oStreamReader.ReadLine();
            }
        }
    }
    catch (Exception ex)
    {
        oLog.WriteLine("Error processing file name = {0} Exception {1}.", oFileInfo.FullName, ex.ToString());
    }
});

The "key" for the "locks" was to add the following options highlighted in the below code to the FileStream object so the file access is for read operation and that the file could be shared while reading.

Parallel.ForEach(rangePartitioner, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, (range, loopState) =>
{
    try
    {
        using (FileStream oFileStream = new FileStream(oFileInfo.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (GZipStream oGzipStream = new GZipStream(oFileStream, CompressionMode.Decompress))
        using (StreamReader oStreamReader = new StreamReader(oGzipStream, Encoding.ASCII))
        {
            String sRawDataLine = oStreamReader.ReadLine();
            while (sRawDataLine != null)
            {
                //Do Something();
                sRawDataLine = oStreamReader.ReadLine();
            }
        }
    }
    catch (Exception ex)
    {
        oLog.WriteLine("Error processing file name = {0} Exception {1}.", oFileInfo.FullName, ex.ToString());
    }
});

I have not seen the error since adding these two options. The process no longer fails and runs on all 24 cores of the server without any issues.
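
For reference, the snippets above rely on a rangePartitioner and an oFileInfo that are set up outside the loop and are not shown in this post. A rough sketch of what that setup could look like is below; the folder path and variable names here are placeholders, not the actual project code.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;

// Hypothetical setup for the Parallel.ForEach snippets above.
// Collect the compressed files from the input folders.
List<FileInfo> oFileList = new List<FileInfo>(
    new DirectoryInfo(@"E:\test_data\input").GetFiles("*.gz", SearchOption.AllDirectories));

// Partitioner.Create(0, count) produces Tuple<int, int> index ranges, so each
// thread works on a contiguous slice of the file list. Inside the loop body,
// each index in the range would be resolved to one FileInfo, e.g.:
//   for (int i = range.Item1; i < range.Item2; i++) { FileInfo oFileInfo = oFileList[i]; ... }
OrderablePartitioner<Tuple<int, int>> rangePartitioner = Partitioner.Create(0, oFileList.Count);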

C# - Remove duplicate lines

A lot of the programs I have written are related to parsing files and converting them to tab-delimited files for the data warehouse. The files range from a few megabytes to gigabytes in size. In one of the processes I noticed that the output files had duplicate lines, and I wanted to remove them. The file sizes ranged from 2 GB to 6 GB.

Below is the C# program I wrote to remove the duplicates. Fortunately I have a Windows 2008 server with 16 GB of RAM, so that helps :)


using System;
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Linq;

namespace DeDupeTextFiles
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                string sInputFilePath = ConfigurationManager.AppSettings["INPUT_FILEPATH"];

                // The HashSet keeps exactly one copy of each line seen in the file.
                HashSet<string> oDistinctLines = new HashSet<string>();
                using (StreamReader oReader = new StreamReader(sInputFilePath))
                {
                    string sLine = oReader.ReadLine();
                    while (sLine != null)
                    {
                        if (!oDistinctLines.Contains(sLine))
                            oDistinctLines.Add(sLine);

                        sLine = oReader.ReadLine();
                    }
                }

                // Write the distinct lines to <original name>_Distinct.txt in the same folder.
                File.WriteAllLines(sInputFilePath.Replace(Path.GetFileName(sInputFilePath), Path.GetFileNameWithoutExtension(sInputFilePath) + "_Distinct.txt"), oDistinctLines.ToArray<string>());
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                Console.WriteLine(ex.StackTrace);
            }
            Console.Read();
        }
    }
}

App.config

<?xml version="1.0" encoding="utf-8" ?>

<configuration>
  <appSettings>
    <add key="INPUT_FILEPATH" value="E:\test_data\myfolder\file\lookups\201401\0.txt"/>
  </appSettings>
</configuration>

There are other ways to achieve the same result, such as DOS batch commands or loading the files into a database and selecting distinct rows. The program took less than a minute to remove the duplicates from a 3 GB file, and the final file was reduced to 180 MB!
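
As a variation, the reading loop above can be made to write each line the first time it is seen, instead of collecting everything and calling File.WriteAllLines at the end. That avoids building the output array on top of the HashSet, which keeps peak memory a little lower on very large files. A rough sketch, reusing sInputFilePath and the usings from the program above:

// Streaming variant: write a line as soon as it is first seen.
HashSet<string> oSeenLines = new HashSet<string>();
using (StreamReader oReader = new StreamReader(sInputFilePath))
using (StreamWriter oWriter = new StreamWriter(sInputFilePath.Replace(
    Path.GetFileName(sInputFilePath),
    Path.GetFileNameWithoutExtension(sInputFilePath) + "_Distinct.txt")))
{
    string sLine;
    while ((sLine = oReader.ReadLine()) != null)
    {
        // HashSet<string>.Add returns false when the line was already seen,
        // so duplicates are skipped without a separate Contains check.
        if (oSeenLines.Add(sLine))
            oWriter.WriteLine(sLine);
    }
}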