.net - What techniques/tools are there for discovering common phrases in chunks of text?

Question

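Let's say I have 100000 email bodies, and 2000 of them contain an arbitrary common string such as "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". What techniques could/should I use to "mine" these phrases? I'm not interested in mining single words or short phrases. I also need to filter out phrases that I already know occur in all mails.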

Example:

string mailbody1 = "Welcome to the world of tomorrow! This is the first mail body. Lorem ipsum dolor sit AMET. Have a nice day dude. Cya!";
string mailbody2 = "Welcome to the world of yesterday! Lorem ipsum dolor sit amet Please note this is the body of the second mail. Have a nice day.";
string mailbody3 = "A completely different body.";
string[] mailbodies = new[] {mailbody1, mailbody2, mailbody3};
string[] ignoredPhrases = new[] {"Welcome to the world of"};

string[] results = DiscoverPhrases(mailbodies, ignoredPhrases);

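In this example, I want the DiscoverPhrases function to return "lorem ipsum dolor sit amet" and "Have a nice day". It doesn't matter much if the function also returns shorter "noise" phrases, but if possible it would be nice to eliminate those along the way.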

Edit: I forgot to include mailbody3 in the example.

score 8 · Accepted Answer

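Take a look at N-grams. The most common phrases will necessarily contribute the most common N-grams. I'd start with word trigrams and see where that leads. (The space needed is N times the length of the text, so you can't let N get too big.) If you save the positions rather than just the counts, you can then check whether the trigrams can be extended to form common phrases.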
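A rough C# sketch of this idea, using the DiscoverPhrases signature from the question. The class name PhraseMiner, the minDocs and minWords parameters, and the way trigrams are extended are assumptions made for illustration: instead of storing positions, the sketch re-scans each body and merges runs of consecutive frequent trigrams into one candidate phrase. That merge is a heuristic, not a guarantee that the whole run is common.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseMiner
{
    // Lower-case the text and split it into words, dropping punctuation.
    static string[] Tokenize(string text) =>
        Regex.Matches(text.ToLowerInvariant(), @"\w+")
             .Cast<Match>().Select(m => m.Value).ToArray();

    // minDocs and minWords are hypothetical tuning knobs, not part of the question.
    public static string[] DiscoverPhrases(
        string[] mailbodies, string[] ignoredPhrases, int minDocs = 2, int minWords = 4)
    {
        const int N = 3; // word trigrams, as suggested in the answer
        string[][] docs = mailbodies.Select(Tokenize).ToArray();

        // Pass 1: for every trigram, record which bodies contain it.
        var docsByGram = new Dictionary<string, HashSet<int>>();
        for (int d = 0; d < docs.Length; d++)
            for (int i = 0; i + N <= docs[d].Length; i++)
                Add(docsByGram, string.Join(" ", docs[d], i, N), d);

        // Pass 2: merge runs of consecutive frequent trigrams into longer phrases.
        var docsByPhrase = new Dictionary<string, HashSet<int>>();
        for (int d = 0; d < docs.Length; d++)
        {
            string[] words = docs[d];
            bool Frequent(int pos) =>
                docsByGram[string.Join(" ", words, pos, N)].Count >= minDocs;

            for (int i = 0; i + N <= words.Length; )
            {
                if (!Frequent(i)) { i++; continue; }
                int start = i;
                while (i + N <= words.Length && Frequent(i)) i++;
                int length = i - start + N - 1; // number of words the run covers
                if (length >= minWords)
                    Add(docsByPhrase, string.Join(" ", words, start, length), d);
            }
        }

        // Normalise the ignored phrases the same way, padded so that only
        // whole-word matches count.
        var ignored = ignoredPhrases
            .Select(p => " " + string.Join(" ", Tokenize(p)) + " ").ToList();

        return docsByPhrase
            .Where(kv => kv.Value.Count >= minDocs)
            .Select(kv => kv.Key)
            .Where(p => !ignored.Any(ig => ig.Contains(" " + p + " ")))
            .OrderByDescending(p => p.Length)
            .ThenBy(p => p, StringComparer.Ordinal)
            .ToArray();
    }

    static void Add(Dictionary<string, HashSet<int>> index, string key, int doc)
    {
        if (!index.TryGetValue(key, out var set))
            index[key] = set = new HashSet<int>();
        set.Add(doc);
    }
}
```

Phrases come back lower-cased and without punctuation because of the tokenizer. Raising minWords trims short noise such as "this is the" at the cost of missing genuinely short phrases.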

score 1 · Accepted Answer

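I'm not sure if this is what you're after, but take a look at the longest common substring problem and the algorithms behind the diff utility.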
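For reference, a minimal sketch of the textbook dynamic-programming solution to the longest common substring problem for two strings. The class and method names are made up for this example, and the case-insensitive comparison is an assumption chosen to match the question's "AMET"/"amet" sample.

```csharp
using System;

public static class CommonSubstring
{
    // Returns the longest contiguous run of characters shared by a and b,
    // compared case-insensitively. The result uses the casing found in a.
    public static string Longest(string a, string b)
    {
        // prev[j] holds the length of the common suffix of a[0..i-1) and b[0..j).
        int[] prev = new int[b.Length + 1];
        int bestLen = 0, bestEnd = 0;

        for (int i = 1; i <= a.Length; i++)
        {
            int[] cur = new int[b.Length + 1];
            for (int j = 1; j <= b.Length; j++)
            {
                if (char.ToLowerInvariant(a[i - 1]) == char.ToLowerInvariant(b[j - 1]))
                {
                    cur[j] = prev[j - 1] + 1;
                    if (cur[j] > bestLen) { bestLen = cur[j]; bestEnd = i; }
                }
            }
            prev = cur;
        }
        return a.Substring(bestEnd - bestLen, bestLen);
    }
}
```

This compares only two strings, works on characters rather than words, and takes time proportional to a.Length * b.Length, so on its own it does not scale to 100000 mail bodies.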

score 1 · Accepted Answer

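Something like this might work, depending on whether you care about word boundaries. In pseudocode (where LCS is a function that computes the longest common substring, since the removal step needs a contiguous match):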

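someMinimumLengthParameter = 20;
foundPhrases = [];

while (true) {
    lcs = LCS(mailbodies);

    // Stop once the longest remaining common string is too short to matter.
    if (lcs.length <= someMinimumLengthParameter) break;

    // Record it unless it is a phrase we already know about.
    if (lcs not in ignoredPhrases) foundPhrases += lcs;

    // Remove it from every body, ignored or not, so the next pass
    // finds a different string instead of looping forever.
    for body in mailbodies {
        body.remove(lcs);
    }
}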
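A runnable C# sketch of the loop above, under several assumptions that are not in the answer: the LCS step is a brute-force longest common substring search, matching ignores case, each removed match is replaced by a per-body private-use sentinel character so the text on either side cannot fuse into a new false match, and surrounding punctuation is trimmed from the reported phrase. The names IterativeLcs and FindPhrases are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IterativeLcs
{
    // Brute force: the longest substring of bodies[0] found in every body.
    static string Lcs(IList<string> bodies)
    {
        string first = bodies[0];
        for (int len = first.Length; len > 0; len--)
            for (int start = 0; start + len <= first.Length; start++)
            {
                string candidate = first.Substring(start, len);
                if (bodies.All(b =>
                        b.IndexOf(candidate, StringComparison.OrdinalIgnoreCase) >= 0))
                    return candidate;
            }
        return "";
    }

    public static List<string> FindPhrases(
        IEnumerable<string> mailbodies, IEnumerable<string> ignoredPhrases, int minLength)
    {
        var bodies = mailbodies.ToList();
        var found = new List<string>();
        if (bodies.Count == 0) return found;

        while (true)
        {
            string lcs = Lcs(bodies);
            if (lcs.Length <= minLength) break;

            // Report the phrase without surrounding spaces or punctuation.
            string phrase = lcs.Trim(' ', '.', ',', '!', '?');
            if (!ignoredPhrases.Contains(phrase, StringComparer.OrdinalIgnoreCase))
                found.Add(phrase);

            // Cut the match out of every body, leaving a sentinel that is
            // unique to that body so it can never be part of a later match.
            for (int i = 0; i < bodies.Count; i++)
            {
                int at = bodies[i].IndexOf(lcs, StringComparison.OrdinalIgnoreCase);
                string sentinel = ((char)(0xE000 + i)).ToString();
                bodies[i] = bodies[i].Remove(at, lcs.Length).Insert(at, sentinel);
            }
        }
        return found;
    }
}
```

Because the search demands a match in every body, it only finds something when all inputs share the phrase. Including an unrelated mail such as mailbody3 from the question makes it return nothing, so the inputs would need to be grouped or pre-filtered first.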
