
I have 14 files of prepared word-based 3-grams; the total size of the text files is 75 GB. Within a line, the n-grams are separated by ";", and the word that follows the 3-word sequence is separated from it by "|". Now I want to count how often a word follows a given 3-word sequence. Because of the amount of data, I need to do this as fast as possible.

My approach was:

  1. Split line by ngram, by separator ;
  2. Split ngram by separator |
  3. Store the ngram in two tables sequences and words and count how often the word appears to that sequence in the words table
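Steps 1 and 2 can be sketched as a small, self-contained parse. This is a minimal sketch, not the full program below; trimming the spaces around the separators is an assumption based on the sample n-gram given in Edit 1 (`much the same | like;`):

```csharp
using System;

class ParseDemo
{
    static void Main()
    {
        // One line of prepared 3-grams: entries separated by ';',
        // sequence and following word separated by '|'.
        string line = "much the same | like;";

        foreach (string ngram in line.Split(';'))
        {
            string[] gram = ngram.Split('|');
            if (gram.Length > 1)
            {
                string sequence = gram[0].Trim(); // "much the same"
                string word = gram[1].Trim();     // "like"
                Console.WriteLine(sequence + " -> " + word);
            }
        }
    }
}
```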

I have SQL Server 2014 Express and my tables have the following structure:

  • [dbo].[sequences]: Id | Sequence
  • [dbo].[words]: Id | sid | word | count

The sequences table should be self-explanatory; in the words table, sid is the id of the related sequence, word is the word string, and count is an int that counts how often that word appeared after the sequence.
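For reference, the tables could be created with T-SQL along these lines. The names match the structure listed above, but the column types, lengths, and the two indexes are assumptions, not taken from the question; without an index on Sequence, the per-line lookup in the code below has to scan the whole table.

```sql
-- Assumed DDL matching the structure above; types and indexes are guesses.
CREATE TABLE [dbo].[sequences] (
    Id       INT IDENTITY(1,1) PRIMARY KEY,
    Sequence NVARCHAR(200) NOT NULL
);
CREATE TABLE [dbo].[words] (
    Id    INT IDENTITY(1,1) PRIMARY KEY,
    sid   INT NOT NULL REFERENCES [dbo].[sequences](Id),
    word  NVARCHAR(100) NOT NULL,
    count INT NOT NULL
);
CREATE INDEX IX_sequences_Sequence ON [dbo].[sequences] (Sequence);
CREATE INDEX IX_words_sid_word ON [dbo].[words] (sid, word);
```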

My current solution needs about 1 second per line at the beginning, which is far too slow. I tried to use Parallel, but then I get a SQL error, I guess because the table is locked while another thread is inserting.

My program:

    static void Main(string[] args)
    {
        DateTime begin = DateTime.Now;
        SqlConnection myConnection = new SqlConnection(@"Data Source=(localdb)\Projects;Database=ngrams;Integrated Security=True;Connect Timeout=30;Encrypt=False;TrustServerCertificate=False");
        myConnection.Open();
        for (int i = 0; i < 14; i++)
        {
            using (FileStream fs = File.Open(@"F:\Documents\ngrams\prepared_" + i + ".txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (BufferedStream bs = new BufferedStream(fs))
            using (StreamReader sr = new StreamReader(bs))
            {
                string line;
                int a = 0;
                while ((line = sr.ReadLine()) != null)
                {
                    string[] ngrams = line.Split(new char[] { ';' });
                    foreach (string ngram in ngrams)
                    {
                        string[] gram = ngram.Split(new char[] { '|' });
                        if (gram.Length > 1)
                        {
                            string sequence = gram[0];
                            string word = gram[1];
                            storeNgrams(myConnection, sequence, word);
                        }
                    }
                    Console.WriteLine(DateTime.Now.Subtract(begin).TotalMinutes);
                    a++;
                }
            }
        }

        Console.WriteLine("Processed 75 Gigabyte in hours: " + DateTime.Now.Subtract(begin).TotalHours);
    }

    private static void storeNgrams(SqlConnection myConnection, string sequence, string word)
    {
        SqlCommand insSeq = new SqlCommand("INSERT INTO sequences (sequence) VALUES (@sequence); SELECT SCOPE_IDENTITY()", myConnection);
        SqlCommand insWord = new SqlCommand("INSERT INTO words (sid, word, count) VALUES (@sid, @word, @count)", myConnection);
        SqlCommand updateWordCount = new SqlCommand("UPDATE words SET count = @count WHERE sid = @sid AND word = @word", myConnection);
        SqlCommand searchSeq = new SqlCommand("SELECT Id from sequences WHERE sequence = @sequence", myConnection);
        SqlCommand getWordCount = new SqlCommand("Select count from words WHERE sid = @sid AND word = @word", myConnection);
        searchSeq.Parameters.AddWithValue("@sequence", sequence);
        object searchSeq_obj = searchSeq.ExecuteScalar();
        if (searchSeq_obj != null)
        {
            insNgram(insWord, updateWordCount, getWordCount, searchSeq_obj, word).ExecuteNonQuery();
        }
        else
        {
            insSeq.Parameters.AddWithValue("@sequence", sequence);
            object sid_obj = insSeq.ExecuteScalar();
            if (sid_obj != null)
            {
                insNgram(insWord, updateWordCount, getWordCount, sid_obj, word).ExecuteNonQuery();
            }
        }
    }

    private static SqlCommand insNgram(SqlCommand insWord, SqlCommand updateWordCount, SqlCommand getWordCount, object sid_obj, string word)
    {
        int sid = Convert.ToInt32(sid_obj);
        getWordCount.Parameters.AddWithValue("@sid", sid);
        getWordCount.Parameters.AddWithValue("@word", word);
        object wordCount_obj = getWordCount.ExecuteScalar();
        if (wordCount_obj != null)
        {
            int wordCount = Convert.ToInt32(wordCount_obj) + 1;
            return storeWord(updateWordCount, sid, word, wordCount);
        }
        else
        {
            int wordCount = 1;
            return storeWord(insWord, sid, word, wordCount);
        }
    }

    private static SqlCommand storeWord(SqlCommand updateWord, int sid, string word, int wordCount)
    {
        updateWord.Parameters.AddWithValue("@sid", sid);
        updateWord.Parameters.AddWithValue("@word", word);
        updateWord.Parameters.AddWithValue("@count", wordCount);
        return updateWord;
    }

How can I process the ngrams faster, so that I won't need an exorbitant amount of time?

P.S.: I'm totally new to C# and Natural Language Processing.

Edit 1: As requested, a sample n-gram, of which there are about 4 or 5 per line (with different word combinations, of course): much the same | like;

Edit 2: When I change the code to the following, I get the error System.AggregateException: At least one failure occurred ---> System.InvalidOperationException: There is already an open DataReader associated with this Command which must be closed first., just like here.

 Parallel.For(0, 14, i => sqlaction(myConnection, i, begin));

Edit 3: After adding MultipleActiveResultSets=true to the connection string, I no longer get any errors when using Parallel. I replaced all relevant loops with their Parallel equivalents and ran through all the files just counting lines (169,521,628 lines); the average time needed per line was 0.0515 seconds. Even at that rate, I would need 169,521,628 × 0.0515 s ≈ 8.7 million seconds, which is about 101 days!
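For reference, the adjusted connection string looks like this (it is the original one from Main with only the MultipleActiveResultSets flag appended):

```
Data Source=(localdb)\Projects;Database=ngrams;Integrated Security=True;Connect Timeout=30;Encrypt=False;TrustServerCertificate=False;MultipleActiveResultSets=True
```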
