I guess the answer is no longer needed, but I liked this question: it made me think about how to combine the advantages of RegEx with the Levenshtein string metric while relying less on the distance itself.
So far I have come up with a parser that follows this premise and logic:
- It uses Python 3 and the `regex` module (the OP didn't mention any language/module requirements)
- Any `needle` being searched for is stripped of its punctuation characters
- Every `haystack` is stripped of punctuation as well - so `N.A.S.A` becomes `NASA`, just as a needle that was originally `N.A.S.A.` does - I am aware this can be problematic in quite a few cases, but given the premise I couldn't come up with a better solution
- Every word of the `needle` shorter than 3 characters is removed (no need for is, on, at, no, etc.)
- Matching is case-insensitive
- The `needle` is split into `wordgroup`s containing `n` items each: `n` is defined in a dict whose keys `k` satisfy `0 < k <= l`, `l` being the needle's word count
- Each word in a `wordgroup` must follow the previous one, with a maximum word distance between them
- Every word, depending on its length `n`, may have a different allowed error threshold: one can specify `e`rrors in total, `s`ubstitutions, `i`nserts and `d`eletions, again held in a dict with keys `k` where `0 < k <= n`
- Both of the aforementioned dicts hold key/lambda pairs, which comes in handy for calculations based on their last/first items
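The per-word error thresholds lean on the fuzzy-matching syntax of the third-party `regex` module, which the parser below relies on. A minimal standalone illustration (the pattern and test string are just examples, not part of the parser):

```python
import regex  # third-party module: pip install regex

# {e<=1,s<=1,i<=1,d<=1} lets the preceding group match with at most one
# error in total - one substitution, insertion or deletion.
pattern = regex.compile(r"(Watkins){e<=1,s<=1,i<=1,d<=1}", regex.IGNORECASE)

# "Wtkins" is one deletion away from "Watkins", so it still matches,
# while an unrelated word does not.
print(pattern.findall("Levi Wtkins Learning"))
```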
An online demo can be found here.
contextual_fuzzy_matcher.py:
from collections import OrderedDict

import regex


class ContextualFuzzyMatcher(object):
    # maximum number of filler words allowed between two words of a wordgroup
    maximum_word_distance = 2
    word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)
    punctuation = regex.compile(r"[\u2000-\u206F\u2E00-\u2E7F\\'!\"#$%&\(\)\*\+,\-\.\/:;<=>\?@\[\]\^_`\{\|\}~]")
    # wordgroup size by needle word count: the last key <= the word count wins
    groups = OrderedDict((
        (0, lambda l: l),
        (4, lambda l: 3),
        (8, lambda l: 6),
        (10, lambda l: int(l * 0.75)),  # three quarters of the word count
    ))
    # error tolerances by word length: e(rrors), s(ubstitutions), i(nserts), d(eletions)
    tolerances = OrderedDict((
        (0, {
            'e': lambda l: 0,
            's': lambda l: 0,
            'i': lambda l: 0,
            'd': lambda l: 0,
        }),
        (3, {
            'e': lambda l: 1,
            's': lambda l: 1,
            'i': lambda l: 1,
            'd': lambda l: 1,
        }),
        (6, {
            'e': lambda l: 2,
            's': lambda l: 1,
            'i': lambda l: 1,
            'd': lambda l: 1,
        }),
        (9, {
            'e': lambda l: 3,
            's': lambda l: 2,
            'i': lambda l: 2,
            'd': lambda l: 2,
        }),
        (12, {
            'e': lambda l: l // 4,
            's': lambda l: l // 6,
            'i': lambda l: l // 6,
            'd': lambda l: l // 6,
        }),
    ))

    def __init__(self, needle):
        self.sentence = needle
        self.words = self.sentence_to_words(self.sentence)
        self.words_len = len(self.words)
        self.group_size = self.get_group_size()
        self.word_groups = self.get_word_groups()
        self.regexp = self.get_regexp()

    def sentence_to_words(self, sentence):
        sentence = regex.sub(self.punctuation, "", sentence)
        sentence = regex.sub(" +", " ", sentence)
        return [word for word in sentence.split(' ') if len(word) > 2]

    def get_group_size(self):
        return list(value for key, value in self.groups.items() if self.words_len >= key)[-1](self.words_len)

    def get_word_groups(self):
        return [self.words[i:i + self.group_size] for i in range(self.words_len - self.group_size + 1)]

    def get_tolerance(self, word_len):
        return list(value for key, value in self.tolerances.items() if word_len >= key)[-1]

    def get_regexp(self):
        combinations = []
        for word_group in self.word_groups:
            distants = []
            for word in word_group:
                word_len = len(word)
                tolerance = self.get_tolerance(word_len)
                distants.append(r"({}){{e<={},s<={},i<={},d<={}}}".format(
                    word,
                    tolerance['e'](word_len),
                    tolerance['s'](word_len),
                    tolerance['i'](word_len),
                    tolerance['d'](word_len),
                ))
            combinations.append(
                self.word_distance.join(distants)
            )
        return regex.compile(
            r"|".join(combinations),
            regex.MULTILINE | regex.IGNORECASE
        )

    def findall(self, haystack):
        return self.regexp.findall(haystack)
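The two OrderedDicts act as staircase lookups: the last key that does not exceed the length in question wins, and its lambda receives that length. A stripped-down sketch of the same lookup, independent of the class (the names here are illustrative):

```python
from collections import OrderedDict

# Staircase lookup: for a length n, take the value of the largest key <= n.
thresholds = OrderedDict((
    (0, lambda l: 0),       # words up to 2 chars: no errors allowed
    (3, lambda l: 1),       # 3-5 chars: one error
    (6, lambda l: 2),       # 6-8 chars: two errors
    (9, lambda l: l // 4),  # 9+ chars: scale with length
))

def allowed_errors(word_len):
    # keep every entry whose key is <= word_len, then apply the last lambda
    return [v for k, v in thresholds.items() if word_len >= k][-1](word_len)

print(allowed_errors(4), allowed_errors(7), allowed_errors(12))  # 1 2 3
```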
main.py:
test_sentences = [
    'Levi Watkins Learning Center - Alabama State University',
    'ETH Library'
]

test_texts = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sapien eget mi proin sed libero enim sed. Nec tincidunt praesent semper feugiat nibh sed pulvinar. Habitasse platea dictumst quisque sagittis. Tortor condimentum lacinia quis vel eros donec ac odio. Platea dictumst vestibulum rhoncus est pellentesque elit ullamcorper dignissim. Ultricies tristique nulla aliquet enim tortor at. Mi proin sed libero enim sed faucibus. Fames ac turpis egestas integer eget aliquet nibh. Potenti nullam ac tortor vitae purus faucibus ornare suspendisse. Cras semper auctor neque vitae tempus quam pellentesque nec. Quam lacus suspendisse faucibus interdum posuere. Neque laoreet suspendisse interdum consectetur libero id faucibus nisl tincidunt. Viverra tellus in hac habitasse. Nibh nisl condimentum id venenatis a condimentum vitae. Tincidunt dui ut ornare lectus."
    "Mattis aliquam faucibus purus in massa tempor nec feugiat nisl. Amet consectetur adipiscing elit ut aliquam purus. Turpis massa tincidunt dui ut ornare. Suscipit tellus mauris a diam maecenas sed enim ut sem. Id consectetur purus ut faucibus pulvinar elementum. Est velit egestas dui id. Felis imperdiet proin fermentum leo. Faucibus nisl tincidunt eget nullam non nisi est sit. Elit pellentesque habitant morbi tristique. Nisi lacus sed viverra tellus. Morbi tristique senectus et netus et malesuada fames. Id diam vel quam elementum pulvinar. Id nibh tortor id aliquet lectus. Sem integer vitae justo eget magna. Quisque sagittis purus sit amet volutpat consequat. Auctor elit sed vulputate mi sit amet. Venenatis lectus magna fringilla urna porttitor rhoncus dolor purus. Adipiscing diam donec adipiscing tristique risus nec feugiat in fermentum. Bibendum est ultricies integer quis."
    "Interdum posuere lorem ipsum dolor sit. Convallis convallis tellus id interdum velit. Sollicitudin aliquam ultrices sagittis orci a scelerisque purus. Vel quam elementum pulvinar etiam. Adipiscing bibendum est ultricies integer quis. Tellus molestie nunc non blandit. Sit amet porttitor eget dolor morbi non arcu. Scelerisque purus semper eget duis at tellus. Diam maecenas sed enim ut sem viverra. Vulputate odio ut enim blandit volutpat maecenas. Faucibus purus in massa tempor nec. Bibendum ut tristique et egestas quis ipsum suspendisse. Ut aliquam purus sit amet luctus venenatis lectus magna. Ac placerat vestibulum lectus mauris ultrices eros in cursus turpis. Feugiat pretium nibh ipsum consequat nisl vel pretium. Elit pellentesque habitant morbi tristique senectus et.",
    "Found at ETH's own Library",  # ' will be a problem - it adds one extra deletion
    "State University of Alabama has a learning center called Levi Watkins",
    "The ETH library is not to be confused with Alabama State university's Levi Watkins Learning center",
    "ETH Library",
    "Alabma State Unversity",
    "Levi Wtkins Learning"
]

for test_sentence in test_sentences:
    parser = ContextualFuzzyMatcher(test_sentence)
    for test_text in test_texts:
        for match in parser.findall(test_text):
            print(match)
This returns:
('', '', '', '', '', '', '', '', '', '', '', '', ' Alabama', 'State', 'university')
(' Levi', 'Watkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
('', '', '', '', '', '', '', '', '', '', '', '', 'Alabma', 'State', 'Unversity')
('Levi', 'Wtkins', 'Learning', '', '', '', '', '', '', '', '', '', '', '', '')
(' ETH', 'library')
('ETH', 'Library')
I am fully aware that this is far from a perfect solution, and that my examples are few and not really representative - but maybe, by tweaking the configuration and doing a lot of real-world testing, it could cover a fair number of cases without producing too many false positives. Also, since it is class-based, it can be inherited and configured differently for different sources - maybe in scientific texts a maximum word distance of 1 is enough, while newspaper articles might need 3, and so on.
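A minimal sketch of that per-source configuration via subclassing (the class names are illustrative). One caveat worth noting: `word_distance` is built at class-definition time from `maximum_word_distance`, so a subclass that overrides the latter must rebuild the former as well:

```python
class BaseMatcher(object):
    maximum_word_distance = 2
    word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)

class ScientificMatcher(BaseMatcher):
    # terse scientific prose: allow at most one filler word between matches
    maximum_word_distance = 1
    # must be rebuilt, because the base class baked its own value
    # into word_distance when it was defined
    word_distance = r"\s(?:[\w]+\s){{0,{}}}".format(maximum_word_distance)

print(ScientificMatcher.word_distance)  # \s(?:[\w]+\s){0,1}
```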