1

我有一个功能对我的任务来说非常慢(它必须快 10-100 倍)

这是代码

public long Support(List<string[]> sequences, string[] words)
{

            var count = 0;
            foreach (var sequence in sequences)
            {
                for (int i = 0; i < sequence.Length - words.Length + 1; i++)
                {
                    bool foundSeq = true;
                    for (int j = 0; j < words.Length; j++)
                    {
                        foundSeq = foundSeq && sequence[i + j] == words[j];
                    }
                    if (foundSeq)
                    {
                        count++;
                        break;
                    }
                }
            }

            return count;
}

public void Support(List<string[]> sequences, List<SequenceInfo> sequenceInfoCollection)
{
    System.Threading.Tasks.Parallel.ForEach(sequenceInfoCollection.Where(x => x.Support==null),sequenceInfo =>
    {
        sequenceInfo.Support = Support(sequences, sequenceInfo.Sequence);
    });

}

whereList<string[]> sequences是一个单词数组。该数组通常包含 250k+ 行。每行大约4-7个单词。 string[] words是我们试图计算的单词数组(所有单词至少包含一次序列)。

问题是foundSeq = foundSeq && sequence[i + j] == words[j];。此代码占用大部分执行时间(Enumerable.MoveNext 排在第二位)。我想对数组中的所有单词进行哈希处理。数字比字符串更快,对吧?我认为它可以帮助我获得 30%-80% 的性能。但我需要10倍!我能做什么?如果你想知道它是先验算法的一部分。


Support function check if the words sequence is a part any sequence in the sequences list and count how much times.

4

3 回答 3

2

Knuth–Morris–Pratt algorithm

In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.


From Wikipedia: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

This is one of the improvements that you should make. With a small difference: a "word" in your code is a "characters" in the terminology of the algorithm; your "words" array is what is a word in KMP.

The idea is that when you search for "abc def ghi jkl", and have matched "abc def ghi" already, but the next word does not match, you can jump three positions.

Search:   abc def ghi jkl
Text:     abc def ghi klm abc def ghi jkl
i=0:      abc def ghi jkl?
skip 2:       XXX XXX  <--- you save two iterations here, i += 2
i=2:                  abc?
i=3:                      abc? ...
于 2012-05-08T16:05:24.920 回答
0

The first optimisation I would make is an early-fail. Your inner loop continues through the whole sequence even if you know it's failed, and you're doing an unnecessary bit of boolean logic. Your code is this:

    for (int j = 0; j < words.Length; j++)
    {
        foundSeq = foundSeq && sequence[i + j] == words[j];
    }

Instead, just do this (or equivalent):

    for (int j = 0; j < words.Length; j++)
    {
        if (sequence[i + j] != words[j])
        {
            foundSeq = false;
            break;
        }
    }

This will save you the majority of your comparisons (you will drop out at the first word if it doesn't match, instead of continuing to compare when you know the result is false). It could even make the tenfold difference you're looking for if you expect the occurrence of the individual words in each sequence to be low (if, say, you are finding sentences in a page of English text).

于 2012-05-08T11:41:05.433 回答
0

Theoretically, you could concatenate each sequence and using substring matching. I don't have a compiler at hand now, so I can't test whether it will actually improve performance, but this is the general idea:

List<string> sentences = sequences.Select(seq => String.Join(" ", seq));
string toMatch = String.Join(" ", words);

return sentences.Count(sentence => sentence.Contains(toMatch));
于 2012-05-08T11:43:48.770 回答