statistics - Normalizing text with incorrectly separated and joined words

Question

Suppose I have a bunch of similar strings with noise, mainly words that are incorrectly joined or broken apart. For example:

 "Once more unto the breach, dear friends. Once more!"
 "Once more unto the breach , dearfriends. Once more!"
 "Once more unto the breach, de ar friends. Once more!"
 "Once more unto the breach, dear friends. Once more!"

How can I normalize each of them to the same sequence of words? That is:

 ["once" "more" "unto" "the" "breach" "dear" "friends" "once" "more"]

Thanks!

score 3 · Accepted Answer

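Here are some suggestions. I think you will ultimately have to write a set of routines/functions to fix all the various kinds of irregularities you encounter.

The good news is that you can add to your set of "fixes" incrementally and keep improving the parser.

I had to do something similar, and I found this article by Peter Norvig very useful. (Note that it is in Python.)

This function in particular contains the ideas you need: split, delete, transpose, replace, and insert characters in the irregular words to "correct" them.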

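alphabet = 'abcdefghijklmnopqrstuvwxyz'  # module-level constant the snippet relies on

def edits1(word):
   # Return every string that is exactly one edit away from `word`
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]          # all (prefix, suffix) cuts
   deletes    = [a + b[1:] for a, b in splits if b]                           # drop one character
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]      # swap adjacent characters
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]     # change one character
   inserts    = [a + c + b     for a, b in splits for c in alphabet]          # add one character
   return set(deletes + transposes + replaces + inserts)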

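The above is a snippet from Norvig's spelling corrector.

Even if you can't use the code as is, the core idea applies to your case: you take a token (a "word"), which in your case is the irregular word, and try different tweaks to see whether the result belongs to a large dictionary of known and accepted words.

Hope that helps.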
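As a rough illustration of that idea applied to joined and broken words (rather than misspellings), here is a minimal sketch. It is not Norvig's code: `KNOWN_WORDS`, `tokenize`, `split_known`, and `normalize` are hypothetical names, and the tiny word set stands in for a real dictionary. It tries two tweaks on each unknown token: merge it with the next token, or split it into two known words.

```python
import re

# Hypothetical toy dictionary; a real one would be a large word list.
KNOWN_WORDS = {"once", "more", "unto", "the", "breach", "dear", "friends"}

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def split_known(token):
    """Return [left, right] if the token is two known words glued together, else None."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in KNOWN_WORDS and right in KNOWN_WORDS:
            return [left, right]
    return None

def normalize(text):
    tokens = tokenize(text)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in KNOWN_WORDS:
            # Already a known word: keep it.
            out.append(tok)
            i += 1
        elif i + 1 < len(tokens) and tok + tokens[i + 1] in KNOWN_WORDS:
            # Fix a broken word such as "de ar" -> "dear".
            out.append(tok + tokens[i + 1])
            i += 2
        else:
            # Fix a joined word such as "dearfriends" -> "dear", "friends";
            # if no split works, pass the token through unchanged.
            out.extend(split_known(tok) or [tok])
            i += 1
    return out

print(normalize("Once more unto the breach , dearfriends. Once more!"))
print(normalize("Once more unto the breach, de ar friends. Once more!"))
```

With a realistic dictionary, fragments such as "de" or "ar" may themselves be valid words, so this greedy version would need extra rules (or word frequencies) to decide when merging is the better choice.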

score 1 · Accepted Answer

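A slightly crazy idea, and I'm only suggesting it because I'm teaching the algorithm I'm about to propose to my students this week.

Remove all whitespace from the sentence, so that, for example, de ar friends becomes dearfriends. There is a quadratic-time, linear-space dynamic programming algorithm that splits a space-free string into the most probable sequence of words. The algorithm is discussed here and here; the second solution is the one I had in mind. The assumption here is that you have a good word model and that querying the model takes constant time.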
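A minimal sketch of that dynamic program follows. It is not taken from the linked discussions: it assumes a unigram word model, and `WORD_COUNTS`, `word_logprob`, `segment`, and `normalize_by_segmentation` are hypothetical names with made-up toy counts. `best[end]` holds the best log-probability of any segmentation of the first `end` characters, and `back[end]` records where the last word of that segmentation starts.

```python
import math
import re

# Hypothetical toy unigram counts; a real model would come from a large corpus.
WORD_COUNTS = {"once": 5, "more": 5, "unto": 1, "the": 20, "breach": 1,
               "dear": 2, "friends": 2, "friend": 2, "a": 10, "on": 8}
TOTAL = sum(WORD_COUNTS.values())

def word_logprob(word):
    """Log-probability of a word; unknown strings get a length-scaled penalty (an assumption)."""
    count = WORD_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / TOTAL) - 10.0 * len(word)
    return math.log(count / TOTAL)

def segment(text):
    """Split a space-free string into the most probable word sequence under the toy model."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[end]: best score for text[:end]
    back = [0] * (n + 1)             # back[end]: start index of the last word
    for end in range(1, n + 1):
        for start in range(end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]

def normalize_by_segmentation(text):
    """Drop everything except letters, then re-segment the resulting string."""
    return segment(re.sub(r"[^a-z]", "", text.lower()))

print(normalize_by_segmentation("Once more unto the breach, de ar friends. Once more!"))
```

The two nested loops give the quadratic number of model queries. Slicing `text[start:end]` is not itself constant time in Python, so in practice one would probably cap the maximum word length to keep each query cheap.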
