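Let's say I have 100,000 email bodies, and 2,000 of them contain an arbitrary common string such as "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". What techniques could/should I use to "mine" these phrases? I'm not interested in mining single words or short phrases. I also need to filter out phrases that I already know occur in all mails.
Example: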
string mailbody1 = "Welcome to the world of tomorrow! This is the first mail body. Lorem ipsum dolor sit AMET. Have a nice day dude. Cya!";
string mailbody2 = "Welcome to the world of yesterday! Lorem ipsum dolor sit amet Please note this is the body of the second mail. Have a nice day.";
string mailbody3 = "A completely different body.";
string[] mailbodies = new[] {mailbody1, mailbody2, mailbody3};
string[] ignoredPhrases = new[] {"Welcome to the world of"};
string[] results = DiscoverPhrases(mailbodies, ignoredPhrases);
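In this example I would like the DiscoverPhrases function to return "lorem ipsum dolor sit amet" and "have a nice day". It doesn't matter if the function also returns shorter "noise" phrases, but if possible it would be nice to eliminate them in the process.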
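To make the expected behaviour concrete, here is a minimal, naive baseline sketch, not a recommended solution. It assumes that "phrase" means a run of whole words, that matching is case-insensitive and ignores punctuation, and that a phrase counts as "common" once it appears in at least two mails. The class name `PhraseMiner`, the helpers `Tokenize` and `ContainsPhrase`, and the parameters `minWords`, `maxWords` and `minMails` are all hypothetical names introduced for illustration. It enumerates every word n-gram per mail, so it would likely be too slow and memory-hungry for 100,000 real mails without further work.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseMiner
{
    // Lowercase the text and split it into word tokens, dropping punctuation.
    static string[] Tokenize(string text) =>
        Regex.Matches(text.ToLowerInvariant(), @"[\p{L}\p{N}']+")
             .Cast<Match>().Select(m => m.Value).ToArray();

    // True if `inner` occurs inside `outer` as a run of whole words.
    static bool ContainsPhrase(string outer, string inner) =>
        (" " + outer + " ").Contains(" " + inner + " ");

    // Assumed defaults: phrases of 4 to 12 words, shared by at least 2 mails.
    public static string[] DiscoverPhrases(
        string[] mailbodies, string[] ignoredPhrases,
        int minWords = 4, int maxWords = 12, int minMails = 2)
    {
        // For each word n-gram, count the number of mails it appears in.
        var mailCount = new Dictionary<string, int>();
        foreach (var body in mailbodies)
        {
            var words = Tokenize(body);
            var seen = new HashSet<string>(); // count each phrase once per mail
            for (int start = 0; start < words.Length; start++)
                for (int len = minWords; len <= maxWords && start + len <= words.Length; len++)
                    seen.Add(string.Join(" ", words, start, len));
            foreach (var phrase in seen)
                mailCount[phrase] = mailCount.TryGetValue(phrase, out var n) ? n + 1 : 1;
        }

        // Normalize the ignored phrases the same way as the mail bodies.
        var ignored = ignoredPhrases.Select(p => string.Join(" ", Tokenize(p))).ToList();

        // Keep phrases shared by enough mails that are not part of an ignored phrase.
        var frequent = mailCount
            .Where(kv => kv.Value >= minMails)
            .Select(kv => kv.Key)
            .Where(p => !ignored.Any(i => ContainsPhrase(i, p)))
            .ToList();

        // Drop "noise": fragments of a longer phrase that also survived.
        return frequent
            .Where(p => !frequent.Any(q => q.Length > p.Length && ContainsPhrase(q, p)))
            .OrderBy(p => p, StringComparer.Ordinal)
            .ToArray();
    }

    public static void Main()
    {
        string[] mailbodies =
        {
            "Welcome to the world of tomorrow! This is the first mail body. Lorem ipsum dolor sit AMET. Have a nice day dude. Cya!",
            "Welcome to the world of yesterday! Lorem ipsum dolor sit amet Please note this is the body of the second mail. Have a nice day.",
            "A completely different body."
        };
        string[] ignoredPhrases = { "Welcome to the world of" };

        // Prints:
        //   have a nice day
        //   lorem ipsum dolor sit amet
        foreach (var phrase in DiscoverPhrases(mailbodies, ignoredPhrases))
            Console.WriteLine(phrase);
    }
}
```

With the assumed default of `minWords = 4`, the three-word fragment "this is the", which both mails also share, is treated as noise and skipped; lowering `minWords` to 3 would return it as well. The results come back lowercased because of the normalization step.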
Edit: I forgot to include mailbody3 in the example.