python - Can I do "string contains X" with a percentage accuracy in Python?

Question

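I need to OCR a large chunk of text and check whether it contains a certain string, but because OCR is inaccurate, I need to check whether it contains something that is roughly an 85% match for that string.

For example, I may OCR a chunk of text to make sure it doesn't contain "no information available", but the OCR might see "n0 inf0rmation available" or misread some other characters.

Is there an easy way to do this in Python?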

score 35 · Accepted Answer

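As gauden posted, SequenceMatcher in difflib is an easy way to go. Calling ratio() returns a value between 0 and 1 corresponding to the similarity between the two strings. From the docs:

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

Example: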

>>> import difflib
>>> difflib.SequenceMatcher(None,'no information available','n0 inf0rmation available').ratio()
0.91666666666666663
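
To make the 2.0*M / T formula concrete, here is a minimal sketch that rebuilds the same number by hand from the matching blocks. The helper name similarity is made up for illustration; only SequenceMatcher and get_matching_blocks() come from difflib.

```python
import difflib

def similarity(a, b):
    """Return 2.0 * M / T, computed by hand from the matching blocks."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # M: total number of matched characters across all matching blocks
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0

a = 'no information available'
b = 'n0 inf0rmation available'
print(similarity(a, b))          # same value that .ratio() reports
print(similarity(a, b) >= 0.85)  # True: clears an 85% threshold
```

Here 22 of the 24 characters line up in each string, so M = 22 and T = 48, which gives 44/48.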

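There is also get_close_matches, which might be useful to you. You can specify a similarity cutoff (a ratio between 0 and 1), and it will return the matches from a list that score at least that cutoff, best first: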

>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny', 
                              'house'], cutoff=0.8)
['uncorn']
>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
                              'house'], cutoff=0.5)
['uncorn', 'corny', 'unicycle']
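
get_close_matches returns only the strings, not their scores. If you want to see how close each candidate is (for example, to pick a sensible cutoff), a rough sketch along these lines may help. The function name scored_matches is hypothetical; it just calls SequenceMatcher.ratio() on each candidate.

```python
import difflib

def scored_matches(word, candidates, cutoff=0.6):
    """Return (score, candidate) pairs at or above cutoff, best first."""
    scored = [(difflib.SequenceMatcher(None, word, c).ratio(), c)
              for c in candidates]
    return sorted((pair for pair in scored if pair[0] >= cutoff), reverse=True)

for score, candidate in scored_matches(
        'unicorn', ['unicycle', 'uncorn', 'corny', 'house'], cutoff=0.5):
    print('%.3f  %s' % (score, candidate))
```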

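Update: finding a partial sub-sequence match

To find close matches to a three-word sequence, I would split the text into words, group them into three-word sequences, and then apply difflib.get_close_matches, like this: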

import difflib
# adjacent string literals are joined, so the text can span two lines
text = ("Here is the text we are trying to match across to find the three word "
        "sequence n0 inf0rmation available I wonder if we will find it?")
words = text.split()
# every run of three consecutive words, joined back into one string
three = [' '.join([i,j,k]) for i,j,k in zip(words, words[1:], words[2:])]
print(difflib.get_close_matches('no information available', three, cutoff=0.9))
# Output:
# ['n0 inf0rmation available']
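
The same idea can be wrapped into a yes/no "contains" check for a phrase of any word count, which is closer to what the question asks for. This is only a sketch: fuzzy_contains and its threshold parameter are names invented here, and it assumes the OCR keeps word boundaries intact (it will miss phrases where spaces were dropped or added).

```python
import difflib

def fuzzy_contains(text, phrase, threshold=0.85):
    """Return True if some run of words in text resembles phrase closely enough."""
    words = text.split()
    n = len(phrase.split())
    # every window of n consecutive words, joined back into a string
    windows = [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return any(
        difflib.SequenceMatcher(None, phrase, window).ratio() >= threshold
        for window in windows
    )

ocr_text = "the scan says n0 inf0rmation available for this record"
print(fuzzy_contains(ocr_text, "no information available"))   # True
print(fuzzy_contains(ocr_text, "payment received in full"))   # False
```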

score 6 · Accepted Answer


The SequenceMatcher object in the standard library module difflib will give you a ratio directly:

answered 2012-06-01T11:19:49.267

score 4 · Accepted Answer

You can compute the Levenshtein distance. Here is a Python implementation: http://pypi.python.org/pypi/python-Levenshtein/
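
For reference, here is a small pure-Python sketch of the Levenshtein distance, so you can see what is being computed without installing anything. It does not use the package linked above, and the names levenshtein and levenshtein_similarity are chosen here for illustration. Dividing by the longer length to get a 0 to 1 score is one common convention, not the only one.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Turn the distance into a 0..1 score relative to the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / float(longest)

print(levenshtein('no information available', 'n0 inf0rmation available'))  # 2
print(levenshtein_similarity('no information available',
                             'n0 inf0rmation available') >= 0.85)           # True
```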

score 0 · Accepted Answer

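I don't know of any Python library that does this out of the box, but you might find one (or find a C or C++ library and write a Python wrapper for it).

You could also try rolling your own solution based on a "brute force" character-by-character comparison, with rules that define the "closeness" between two given characters and compute the "accuracy" from those rules (i.e. "o" => "0": 90% accuracy, "o" => "w": 1% accuracy, etc.), or play with something more involved using AI (if you're not familiar with AI, the book "Programming Collective Intelligence" can get you started, although its implementation examples are somewhat poor).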
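
A rough sketch of that character-by-character idea might look like the following. Everything here is an assumption made for illustration: the CONFUSION table and its weights are made up (only the "o"/"0" and "o"/"w" figures echo the examples above), and the function names are hypothetical. It only handles substituted characters; if the OCR drops or inserts a character, the alignment breaks and the score collapses.

```python
# Hypothetical confusion scores: how plausible it is that OCR read the
# first character as the second. These numbers are invented, not measured.
CONFUSION = {
    ('o', '0'): 0.9,
    ('l', '1'): 0.9,
    ('i', '1'): 0.8,
    ('s', '5'): 0.8,
    ('o', 'w'): 0.01,
}

def char_score(expected, seen):
    """Score one character pair: 1.0 if equal, else the table value or 0.0."""
    if expected == seen:
        return 1.0
    return CONFUSION.get((expected, seen), 0.0)

def weighted_accuracy(expected, seen):
    """Average per-character score for two strings of equal length."""
    if len(expected) != len(seen):
        return 0.0
    if not expected:
        return 1.0
    total = sum(char_score(e, s) for e, s in zip(expected, seen))
    return total / len(expected)

def best_window(text, phrase):
    """Slide phrase over text one character at a time; return the best score."""
    n = len(phrase)
    scores = [weighted_accuracy(phrase, text[i:i + n])
              for i in range(len(text) - n + 1)]
    return max(scores) if scores else 0.0

score = best_window("status: n0 inf0rmation available today",
                    "no information available")
print(score >= 0.85)  # True: two "o" -> "0" swaps cost very little
```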

score 0 · Accepted Answer

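Just to extend fraxel's answer, this allows finding a word sequence of any length. Sorry about the poor formatting, SO is hard. The accuracy is the cutoff value in findWords.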

import difflib

def joinAllInTupleList(toupe):
    # joinAllInTupleList([("hello", "world"), ("face", "book")]) == ['hello world', 'face book']
    result = []
    for i in toupe:
        # i is the tuple itself
        carry = " "
        for z in i:
            # z is an element of i
            carry += " " + z

        result.append(carry.strip())
    return result

def findWords(text, wordSequence):

    # setup
    words = text.split(" ")

    # get a list of subLists based on the length of wordSequence
    # i.e. get all wordSequence length sub-sequences in text!

    result = []
    numberOfWordsInSequence = len(wordSequence.strip().split(" "))
    for i in range(numberOfWordsInSequence):
        result.append(words[i:])

    # print 'result',result
    c = zip(*result)

    # print 'c',c
    # join each tuple to a string
    joined = joinAllInTupleList(c)

    return difflib.get_close_matches(wordSequence, joined, cutoff=0.72389)
