1

I have the PAF raw data in several files (list of all addresses in the UK).

My goal is to create a PostCode lookup in our software.

I have created a new database but there is no need to understand it for the moment.

Let's take a file, his extension is ".c01" and can be open with a text editor. The data in this file are in the following format :

0000000123A

With (according to the developer guide), 8 char for the KEY, 50 char for the NAME.

This file contains 2,449,652 rows (it's a small one !)

I create a Parsing class for this

private class SerializedBuilding
{
   public int Key
   {
      get; set; 
   }

   public string Name
   {
       get; set;
   }

        public bool isValid = false;

        public Building ToBuilding()
        {
            Building b = new Building();
            b.BuildingKey = Key;
            b.BuildingName = Name;
            return b;
        }

        private readonly int KEYLENGTH = 8;
        private readonly int NAMELENGTH = 50;

        public SerializedBuilding(String line)
        {
            string KeyStr = null;
            string Name = null;
            try
            {
                KeyStr = line.Substring(0, KEYLENGTH);
            }
            catch (Exception e)
            {
                Console.WriteLine("erreur parsing key line " + line);
                return;
            }
            try
            {
                Name = line.Substring(KEYLENGTH - 1, NAMELENGTH);
            }
            catch (Exception e)
            {
                Console.WriteLine("erreur parsing name line " + line);
                return;
            }
            int value;
            if (!Int32.TryParse(KeyStr, out value))
                return;
            if (value == 0 || value == 99999999)
                return;
            this.Name = Name;
            this.Key = value;
            this.isValid = true;
        }
    }

I use this method to read the file

public void start()
{
        AddressDataContext d = new AddressDataContext();
        Count = 0;
        string line;

        // Read the file and display it line by line.
        System.IO.StreamReader file =
           new System.IO.StreamReader(filename);
        SerializedBuilding sb = null;
        Console.WriteLine("Number of line detected : " + File.ReadLines(filename).Count());
        while ((line = file.ReadLine()) != null)
        {
            sb = new SerializedBuilding(line);
            if (sb.isValid)
            {
                d.Buildings.InsertOnSubmit(sb.ToBuilding());
                if (Count % 100 == 0)
                    d.SubmitChanges();
            }
            Count++;
        }
        d.SubmitChanges();
        file.Close();
        Console.WriteLine("building added");
}

I use Linq to SQL classes to insert data to my database. The connection string is the default one.

This seems to work, I have added 67200 lines. It just crashed but my questions are not about that.

My estimations :

  • 33,647,015 rows to parse

Time needed for execution : 13 hours

It's a one-time job (just needs to be done on my sql and on the client server later) so I don't really care about performances but I think it can be interesting to know how it can be improved.

My questions are :

  • Is readline() and substring() the most powerful ways to read these huge files ?
  • Can the performance be improved by modifying the connection string ?
4

0 回答 0