c# - Find which phrases have been used more than once in a string

Question

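It is easy to count word occurrences in a file by using a dictionary to identify the most frequently used words. But given a text file, how can I find commonly used phrases, where a "phrase" is a set of two or more consecutive words?

For example, here is some sample text: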

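Except oral wills, every will shall be in writing, but may be handwritten or typewritten. The will shall contain the testator's signature or by some other person in the testator's conscious presence and at the testator's express direction. The will shall be attested and subscribed in the conscious presence of the testator, by two or more competent witnesses, who saw the testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the testator's senses, excluding the sense of sight or sound that is sensed by telephonic, electronic, or other distant communication.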

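How can I determine that the phrases "conscious presence" (3 times) and "testator's signature" (2 times) appear more than once, other than by brute-force searching every set of two or three words?

I'll be writing this in c#, so c# code would be great, but I can't even identify a good algorithm, so I'll happily settle for any code, or even pseudocode, that solves this.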

score 5 · Accepted Answer

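Try this. It is by no means foolproof, but it should get the job done for now.

Yes, this only matches two-word combinations, does not strip punctuation, and is brute force. No, the ToList calls are not necessary.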

string text = "that big long text block";

var splitBySpace = text.Split(' ');

var doubleWords = splitBySpace
    .Select((x, i) => new { Value = x, Index = i })
    .Where(x => x.Index != splitBySpace.Length - 1)
    .Select(x => x.Value + " " + splitBySpace.ElementAt(x.Index + 1)).ToList();

var duplicates = doubleWords
    .GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => new { x.Key, Count = x.Count() }).ToList();

I get the following results:

[image: results]
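
The answer notes that the snippet above does not strip punctuation, so "presence" and "presence," would count as different words. As a rough sketch of one way to handle that (not the answerer's code), the words could be extracted with a regular expression and lower-cased before pairing. The names `BigramCounter`, `Tokenize` and `RepeatedBigrams` are made up for this illustration, and treating letters, digits and apostrophes as word characters is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class BigramCounter
{
    // Splits text into lower-cased words. Assumption: letters, digits and
    // apostrophes are word characters; everything else is a separator.
    public static string[] Tokenize(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{N}']+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Returns every pair of adjacent words that occurs more than once,
    // mapped to its number of occurrences.
    public static Dictionary<string, int> RepeatedBigrams(string text)
    {
        string[] words = Tokenize(text);

        // Zip pairs each word with its successor, so the last word never
        // starts a pair and no index check is needed.
        return words.Zip(words.Skip(1), (a, b) => a + " " + b)
                    .GroupBy(pair => pair)
                    .Where(g => g.Count() > 1)
                    .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Calling `BigramCounter.RepeatedBigrams(text)` on the sample text should then treat "conscious presence" the same way regardless of trailing commas or capitalisation.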

Here is my attempt at getting combinations of more than 2 words. Again, the same caveats as before apply.

List<string> multiWords = new List<string>();

//i is the number of words to combine
//in this case, 2-6 words
for (int i = 2; i <= 6; i++)
{
    multiWords.AddRange(splitBySpace
        .Select((x, index) => new { Value = x, Index = index })
        // keep only start positions that leave room for a full i-word phrase
        .Where(x => x.Index <= splitBySpace.Length - i)
        .Select(x => CombineItems(splitBySpace, x.Index, x.Index + i - 1)));
}

var duplicates = multiWords
    .GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => new { x.Key, Count = x.Count() }).ToList();

private string CombineItems(IEnumerable<string> source, int startIndex, int endIndex)
{
    return string.Join(" ", source.Where((x, i) => i >= startIndex && i <= endIndex).ToArray());
}

The results this time:

[image: results]

Now, I will just say there is a good chance my code has a bug in it. I have not fully tested it, so make sure you test it before using it.
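
The question mentions counting single words with a dictionary, and the same idea extends to phrases: slide a window of n words over the text and count each window in a `Dictionary<string, int>`. The following is a minimal sketch of that variant, not the answerer's code. `PhraseCounter` and `RepeatedPhrases` are hypothetical names, and the input is assumed to be an array of words that has already been split and cleaned.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class PhraseCounter
{
    // Counts every run of minWords..maxWords consecutive words and returns
    // only the phrases that were seen more than once.
    public static Dictionary<string, int> RepeatedPhrases(
        string[] words, int minWords, int maxWords)
    {
        var counts = new Dictionary<string, int>();

        for (int n = minWords; n <= maxWords; n++)
        {
            // The last valid start index for an n-word phrase is words.Length - n.
            for (int start = 0; start + n <= words.Length; start++)
            {
                // Join exactly n words beginning at 'start'.
                string phrase = string.Join(" ", words, start, n);

                int seen;
                counts.TryGetValue(phrase, out seen); // seen stays 0 if absent
                counts[phrase] = seen + 1;
            }
        }

        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

This still visits every phrase, so it is brute force in spirit, but each phrase is built and looked up once instead of being compared against the rest of the text. Note that a repeated long phrase also reports all of its shorter sub-phrases.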

score 5 · Accepted Answer

Thought I'd have a quick go at this. Not sure whether this is the brute-force approach you were trying to avoid, but:

static void Main(string[] args)
{
    string txt = @"Except oral wills, every will shall be in writing, 
but may be handwritten or typewritten. The will shall contain the testator's 
signature or by some other person in the testator's conscious presence and at the
testator's express direction . The will shall be attested and subscribed in the
conscious presence of the testator, by two or more competent witnesses, who saw the
testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the
testator's senses, excluding the sense of sight or sound that is sensed by telephonic,
electronic, or other distant communication.";

    //split string using common separators - could add more or use regex.
    string[] words = txt.Split(',', '.', ';', ' ', '\n', '\r');

    //trim each string and get rid of any empty ones
    words = words.Select(t=>t.Trim()).Where(t=>t!=string.Empty).ToArray();

    const int MaxPhraseLength = 20;

    Dictionary<string, int> Counts = new Dictionary<string,int>();

    for (int phraseLen = MaxPhraseLength; phraseLen >= 2; phraseLen--)
    {
        for (int i = 0; i < words.Length - 1; i++)
        {
            //get the phrase to match based on phraselen
            string[] phrase = GetPhrase(words, i, phraseLen);
            string sphrase = string.Join(" ", phrase);

            Console.WriteLine("Phrase : {0}", sphrase);

            int index = FindPhraseIndex(words, i+phrase.Length, phrase);

            if (index > -1)
            {
                Console.WriteLine("Phrase : {0} found at {1}", sphrase, index);

                if(!Counts.ContainsKey(sphrase))
                    Counts.Add(sphrase, 1);

                Counts[sphrase]++;
            }
        }
    }

    foreach (var foo in Counts)
    {
        Console.WriteLine("[{0}] - {1}", foo.Key, foo.Value);
    }

    Console.ReadKey();
}

static string[] GetPhrase(string[] words, int startpos, int len)
{
    return words.Skip(startpos).Take(len).ToArray();
}

static int  FindPhraseIndex(string[] words, int startIndex, string[] matchWords)
{
    for (int i = startIndex; i < words.Length; i++)
    {
        int j;

        for(j=0; j<matchWords.Length && (i+j)<words.Length; j++)
            if(matchWords[j]!=words[i+j])
                break;

        if (j == matchWords.Length)
            return i; // position where the matching phrase starts
    }

    return -1;
}

score 0 · Accepted Answer

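If I were doing this, I would probably start with the brute-force approach, but it sounds like you want to avoid that. A two-stage approach could count every word, take the top few results (starting with only the handful of words that occur most often), and then search for and count only the phrases that contain those top words. That way you don't spend time searching through all phrases.

I have a feeling the CS people will correct me and say this actually takes more time than straight brute force. Maybe some linguists will chime in with methods for detecting phrases or something.

Good luck!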
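
The two-stage idea could look roughly like the sketch below. This is only one reading of the description, under stated assumptions: `TwoStagePhraseFinder`, `Find`, `topWords` and `phraseLen` are hypothetical names, the input is an already-split word array, and a phrase qualifies if it contains at least one of the most frequent words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class TwoStagePhraseFinder
{
    public static Dictionary<string, int> Find(
        string[] words, int topWords, int phraseLen)
    {
        // Stage 1: count single words and keep the most frequent ones.
        var frequent = new HashSet<string>(
            words.GroupBy(w => w)
                 .OrderByDescending(g => g.Count())
                 .Take(topWords)
                 .Select(g => g.Key));

        // Stage 2: count only phrases containing at least one frequent word.
        var counts = new Dictionary<string, int>();
        for (int start = 0; start + phraseLen <= words.Length; start++)
        {
            bool hasFrequentWord = false;
            for (int k = start; k < start + phraseLen; k++)
            {
                if (frequent.Contains(words[k])) { hasFrequentWord = true; break; }
            }
            if (!hasFrequentWord) continue;

            string phrase = string.Join(" ", words, start, phraseLen);
            int seen;
            counts.TryGetValue(phrase, out seen); // seen stays 0 if absent
            counts[phrase] = seen + 1;
        }

        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

The trade-off is that a repeated phrase made up entirely of words outside the top list is missed, so `topWords` controls how much is skipped versus how much is found.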
