0

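I am trying to implement a search over a set of multi-word terms, where each query word may be only part of a word. For example, we have these medical terms.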

R Deep Transverse Metatarsal Ligament 4 GEODE
R Distal JointCapsule 1 GEODE
R Dorsal Calcaneocuboid Ligament GEODE
R Dorsal Carpometacarpal Ligament 2 GEODE
R Dorsal Cuboideavicular Ligament GEODE
R Dorsal Tarsometatarsal Ligament 5 GEODE
R Elbow Capsule GEODE
R F Distal JointCapsule 1 GEODE
R Fibular Collateral Bursa GEODE
R Fibular Collateral Ligament GEODE
R Fibular Ligament GEODE

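A user can type search terms like these:

E.g., "R De Me Li" should find "R Deep Transverse Metatarsal Ligament 4 GEODE"

E.g., "Fi Colla" ==> "R Fibular Collateral Bursa GEODE", "R Fibular Collateral Ligament GEODE"

E.g., "bow ODE" ==> "R Elbow Capsule GEODE"

That is, even if the user types only parts of the words, the search should find the answer. If there are multiple answers, it should show all of them. Thank you in advance for your help.

Added) Oh.. I forgot a few things.

E.g., "ral lar" ==> should NOT show "R Fibular Collateral Bursa GEODE" or "R Fibular Collateral Ligament GEODE", because the order of the query words must be respected.

Also, a space between query words means that they belong to different words of each row (of the database).

The query words must appear in the same order as the words of each row (of the database), but a query word can be shorter than the database word.

E.g., "R De Me 4" ==> "R Deep Transverse Metatarsal Ligament 4 GEODE". Here both 'Metatarsal' and 'Ligament' contain 'me', but taking the first match, 'Metatarsal', is fine, and then '4' is searched for.

Also, different combinations of query words can return the same result.

E.g.,

'Car' ==> 'R Dorsal Carpometacarpal Ligament 2 GEODE'

'Do Car' ==> 'R Dorsal Carpometacarpal Ligament 2 GEODE'

'R Do Carp' ==> 'R Dorsal Carpometacarpal Ligament 2 GEODE'

Note: the search is case-insensitive.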

4 Answers

4

You can do this with difflib, which ships with the standard library:

import difflib

s="""R Deep Transverse Metatarsal Ligament 4 GEODE
R Distal JointCapsule 1 GEODE
R Dorsal Calcaneocuboid Ligament GEODE
R Dorsal Carpometacarpal Ligament 2 GEODE
R Dorsal Cuboideavicular Ligament GEODE
R Dorsal Tarsometatarsal Ligament 5 GEODE
R Elbow Capsule GEODE
R F Distal JointCapsule 1 GEODE
R Fibular Collateral Bursa GEODE
R Fibular Collateral Ligament GEODE
R Fibular Ligament GEODE""".split('\n')

qs="""R De Me Li
Fi Colla
bow ODE""".split('\n')

for q in qs:
    print("results for '{}':".format(q))
    # return at most 3 candidates whose similarity ratio is at least 0.3
    matches = difflib.get_close_matches(q, s, 3, 0.3)
    for i, e in enumerate(matches, 1):
        print("\t{}. {}".format(i, e))

Prints:

results for 'R De Me Li':
    1. R Deep Transverse Metatarsal Ligament 4 GEODE
    2. R Dorsal Calcaneocuboid Ligament GEODE
    3. R Dorsal Cuboideavicular Ligament GEODE
results for 'Fi Colla':
    1. R Fibular Collateral Bursa GEODE
    2. R Fibular Collateral Ligament GEODE
results for 'bow ODE':
    1. R Elbow Capsule GEODE
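
To see why the cutoff argument matters, it can help to look at the raw similarity scores. The sketch below is an illustration rather than part of the original answer: it ranks every term against one query with difflib.SequenceMatcher, the class that get_close_matches is built on. The helper name `ranked` and the shortened `terms` list are assumptions made for the example.

```python
import difflib

terms = [
    "R Deep Transverse Metatarsal Ligament 4 GEODE",
    "R Elbow Capsule GEODE",
    "R Fibular Collateral Bursa GEODE",
    "R Fibular Collateral Ligament GEODE",
]

def ranked(query, candidates):
    """Return (ratio, candidate) pairs, best first; each ratio is in [0, 1]."""
    scored = [(difflib.SequenceMatcher(None, c, query).ratio(), c)
              for c in candidates]
    return sorted(scored, reverse=True)

# Show the score of every term, not only the ones above the cutoff.
for ratio, term in ranked("Fi Colla", terms):
    print("{:.2f}  {}".format(ratio, term))

# n and cutoff are the third and fourth positional arguments used above.
print(difflib.get_close_matches("Fi Colla", terms, n=3, cutoff=0.3))
```

Raising the cutoff trims weaker candidates; lowering it lets more of the list through, so the value is worth tuning against real queries.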

Combining cblab's regex answer with difflib, you can get:

import difflib
import re

s="""R Deep Transverse Metatarsal Ligament 4 GEODE
R Distal JointCapsule 1 GEODE
R Dorsal Calcaneocuboid Ligament GEODE
R Dorsal Carpometacarpal Ligament 2 GEODE
R Dorsal Cuboideavicular Ligament GEODE
R Dorsal Tarsometatarsal Ligament 5 GEODE
R Elbow Capsule GEODE
R F Distal JointCapsule 1 GEODE
R Fibular Collateral Bursa GEODE
R Fibular Collateral Ligament GEODE
R Fibular Ligament GEODE""".split('\n')
s=set(s)
qs="""R De Me Li
Fi Colla
bow ODE
Car
Do Car
ral lar
R De Me 4
R Do Carp""".split('\n')

for q in sorted(qs):
    print("results for '{}':".format(q))
    # regex pass: every query fragment must appear, in order
    pattern = r'.*' + re.sub(r'\W', '.*', q.strip()) + '.*'
    matches = [item for item in s if re.match(pattern, item, re.I)]
    # fuzzy pass: append close matches that the regex did not find
    for e in difflib.get_close_matches(q, s, 3, 0.33):
        if e not in matches:
            matches.append(e)

    for i, e in enumerate(matches, 1):
        print("\t{}. {}".format(i, e))
    if not matches:
        print("\tNo matches")

Prints:

results for 'Car':
    1. R Dorsal Carpometacarpal Ligament 2 GEODE
results for 'Do Car':
    1. R Dorsal Carpometacarpal Ligament 2 GEODE
results for 'Fi Colla':
    1. R Fibular Collateral Bursa GEODE
    2. R Fibular Collateral Ligament GEODE
results for 'R De Me 4':
    1. R Deep Transverse Metatarsal Ligament 4 GEODE
results for 'R De Me Li':
    1. R Deep Transverse Metatarsal Ligament 4 GEODE
    2. R Dorsal Calcaneocuboid Ligament GEODE
results for 'R Do Carp':
    1. R Dorsal Carpometacarpal Ligament 2 GEODE
    2. R Elbow Capsule GEODE
    3. R Distal JointCapsule 1 GEODE
results for 'bow ODE':
    1. R Elbow Capsule GEODE
results for 'ral lar':
    No matches
answered 2012-06-07T01:49:19.740
3

A simple, Pythonic solution that does the job as asked and is case-insensitive:

import re

def search(request, base):
    pattern = r'.*' + re.sub(r'\W', '.*', request.strip()) + '.*'
    return [item for item in base if re.match(pattern, item, re.I)]

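Basically, we build a simple regex that matches any string containing all of the request's substrings (the pieces separated by any non-word character) in their original order, with anything before, between, and after them.

For example, the request 'R De Me Li' becomes the pattern r'.*R.*De.*Me.*Li.*'

Then we return a list of all matching items. Matching is case-insensitive thanks to the re.I flag passed to re.match().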
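
To make the transformation concrete, here is a small sketch that isolates the pattern-building step and prints the result for a few requests. The helper name `build_pattern` is an assumption introduced for illustration; the substitution itself is the one used in search().

```python
import re

def build_pattern(request):
    # Same substitution as in search(): every non-word character becomes '.*'
    return r'.*' + re.sub(r'\W', '.*', request.strip()) + '.*'

for request in ('R De Me Li', 'Fi Colla', 'bow ODE'):
    print('{!r} -> {!r}'.format(request, build_pattern(request)))

# re.I makes the comparison case-insensitive.
pattern = build_pattern('bow ode')
print(bool(re.match(pattern, 'R Elbow Capsule GEODE', re.I)))  # True
```

Because every non-word character is replaced, punctuation in a request cannot end up as a regex metacharacter in the pattern.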

It then works as expected; you can try it with this base:

>>> base = ['R Deep Transverse Metatarsal Ligament 4 GEODE',
'R Distal JointCapsule 1 GEODE',
'R Dorsal Calcaneocuboid Ligament GEODE',
'R Dorsal Carpometacarpal Ligament 2 GEODE',
'R Dorsal Cuboideavicular Ligament GEODE',
'R Dorsal Tarsometatarsal Ligament 5 GEODE',
'R Elbow Capsule GEODE',
'R F Distal JointCapsule 1 GEODE',
'R Fibular Collateral Bursa GEODE',
'R Fibular Collateral Ligament GEODE',
'R Fibular Ligament GEODE']

Some sample requests:

>>> search('R De Me Li', base)
['R Deep Transverse Metatarsal Ligament 4 GEODE']
>>> search('Fi Colla', base)
['R Fibular Collateral Bursa GEODE', 'R Fibular Collateral Ligament GEODE']
>>> search('bow ODE', base)
['R Elbow Capsule GEODE']
>>> search('Car', base)
['R Dorsal Carpometacarpal Ligament 2 GEODE']
>>> search('F', base)
['R F Distal JointCapsule 1 GEODE', 'R Fibular Collateral Bursa GEODE', 'R Fibular Collateral Ligament GEODE', 'R Fibular Ligament GEODE']
>>> search('F Ca', base)
['R F Distal JointCapsule 1 GEODE']
>>> search('F Co', base)
['R Fibular Collateral Bursa GEODE', 'R Fibular Collateral Ligament GEODE']

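Note: an item matches only when the fragments appear in the same order in the request and in the item (i.e. the request 'ode bow' does not match 'R Elbow Capsule GEODE', whereas 'bow ode' does).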
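
The question also says that a space in the query separates different words of a row. The regex above does not enforce that: two fragments may match inside a single word, so 'De ep' matches 'Deep'. If that matters, one possible stricter variant compares fragment by fragment against whole words. This is only a sketch under that reading of the requirement; the name `search_by_words` and the shortened `base` list are assumptions.

```python
def search_by_words(request, base):
    """Each query fragment must occur in a distinct word, in the same order."""
    tokens = request.lower().split()
    results = []
    for item in base:
        words = iter(item.lower().split())
        # any() consumes the iterator up to the first matching word, so the
        # next fragment can only match a later word.
        if all(any(token in word for word in words) for token in tokens):
            results.append(item)
    return results

base = [
    'R Deep Transverse Metatarsal Ligament 4 GEODE',
    'R Dorsal Cuboideavicular Ligament GEODE',
    'R Fibular Collateral Bursa GEODE',
    'R Fibular Collateral Ligament GEODE',
    'R Fibular Ligament GEODE',
]

print(search_by_words('R De Me 4', base))  # ['R Deep Transverse Metatarsal Ligament 4 GEODE']
print(search_by_words('ral lar', base))    # []
print(search_by_words('De ep', base))      # [] (both fragments sit in the single word 'Deep')
```

Taking the first word that contains a fragment is enough here: if any in-order assignment of fragments to words exists, the earliest one also works.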

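Note: I don't think fuzzy search would help much here, at least at first, because it relies on a distance such as Levenshtein's (edit distance), which would be large between "Fi" and "Fibular" (a distance of 5 on a 7-letter word; at 35% I don't think matching is a good idea). You could use it if you are quite sure that requests contain only complete words with possibly a few typos.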
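
The distance quoted in this note can be checked with a few lines of plain Python. The function below is a textbook dynamic-programming edit distance written for illustration; it is not taken from any particular library.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("Fi", "Fibular"))  # 5: the letters "bular" must be inserted
```

A prefix query is therefore penalised for every letter the user did not type, which is why plain edit distance fits abbreviation-style input poorly.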

answered 2012-06-07T01:36:40.010
1

This is not really a "regex" problem; you should look into fuzzy string comparison, i.e. Levenshtein distance or diffs.

See https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

Edit: some sample code:

import Levenshtein

base_strings = [
    "R Deep Transverse Metatarsal Ligament 4 GEODE",
    "R Distal JointCapsule 1 GEODE",
    "R Dorsal Calcaneocuboid Ligament GEODE",
    "R Dorsal Carpometacarpal Ligament 2 GEODE",
    "R Dorsal Cuboideavicular Ligament GEODE",
    "R Dorsal Tarsometatarsal Ligament 5 GEODE",
    "R Elbow Capsule GEODE",
    "R F Distal JointCapsule 1 GEODE",
    "R Fibular Collateral Bursa GEODE",
    "R Fibular Collateral Ligament GEODE",
    "R Fibular Ligament GEODE"
]

def main():
    print("Medical term matcher:")
    while True:
        t = raw_input('Match what? ').strip()
        if len(t):
            print("Best match: {}".format(sorted(base_strings, key = lambda x: Levenshtein.ratio(x, t), reverse=True)[0]))
        else:
            break

if __name__=="__main__":
    main()

Actual output:

Medical term matcher:
Match what? R De Me Li
Best match: R Deep Transverse Metatarsal Ligament 4 GEODE
Match what? Fi Colla
Best match: R Fibular Collateral Bursa GEODE
Match what? bow ODE
Best match: R Elbow Capsule GEODE
Match what? 

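Edit 2: "If there are multiple answers, it should show all" - the base strings are all answers, to varying degrees. The question, then, is what similarity cutoff you want to use; maybe something like "all answers at least 90% as good as the best match"?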
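
One way such a relative cutoff might look is sketched below. To keep the example free of third-party dependencies it swaps in difflib.SequenceMatcher from the standard library for Levenshtein.ratio, so the scores will differ from the ones the code above produces. The function name `near_best`, the parameter `relative`, and the shortened candidate list are assumptions.

```python
import difflib

def near_best(query, candidates, relative=0.9):
    """Return the candidates scoring at least `relative` times the best score."""
    scored = [(difflib.SequenceMatcher(None, c.lower(), query.lower()).ratio(), c)
              for c in candidates]
    if not scored:
        return []
    best = max(score for score, _ in scored)
    return [c for score, c in scored if score >= relative * best]

candidates = [
    "R Elbow Capsule GEODE",
    "R Fibular Collateral Bursa GEODE",
    "R Fibular Collateral Ligament GEODE",
    "R Fibular Ligament GEODE",
]

for match in near_best("Fi Colla", candidates):
    print(match)
```

The best-scoring candidate is always returned, so the list is never empty for a non-empty input; how many others join it depends entirely on the chosen `relative` value.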

answered 2012-06-07T00:30:04.523
1

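The following code assumes that you want to count a "match" when all particles (the whitespace-separated fragments of the input) occur in the result. I used a loop in the example, but you should of course adapt it to use raw_input.

Although it uses a regex (to allow multiple spaces), the main feature it relies on is if particle in line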

import re

entry = """R Deep Transverse Metatarsal Ligament 4 GEODE
R Distal JointCapsule 1 GEODE
R Dorsal Calcaneocuboid Ligament GEODE
R Dorsal Carpometacarpal Ligament 2 GEODE
R Dorsal Cuboideavicular Ligament GEODE
R Dorsal Tarsometatarsal Ligament 5 GEODE
R Elbow Capsule GEODE
R F Distal JointCapsule 1 GEODE
R Fibular Collateral Bursa GEODE
R Fibular Collateral Ligament GEODE
R Fibular Ligament GEODE
"""

searches = """R De Me Li
Fi Colla
bow ODE"""

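for search in searches.split('\n'):
    print('{} :'.format(search))
    # split on runs of whitespace so repeated spaces are tolerated
    termlist = re.split(r'\s+', search.strip())
    for line in entry.split('\n'):
        # a line matches only if every term occurs somewhere in it
        match = True
        for term in termlist:
            if term not in line:
                match = False
                break
        if match:
            print('\t{}'.format(line))
    print('')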
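
Since the question asks for case-insensitive matching, one possible adjustment is to lowercase both sides before the substring test. This is a minimal sketch, and the function name `contains_all` is hypothetical; like the loop above, it does not check the order of the terms.

```python
def contains_all(search, lines):
    """Return the lines that contain every whitespace-separated term, ignoring case."""
    terms = search.lower().split()
    return [line for line in lines
            if all(term in line.lower() for term in terms)]

lines = [
    "R Elbow Capsule GEODE",
    "R Fibular Collateral Bursa GEODE",
    "R Fibular Collateral Ligament GEODE",
]

print(contains_all("bow ode", lines))  # ['R Elbow Capsule GEODE']
print(contains_all("ode bow", lines))  # same result: order is not checked
```
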
answered 2012-06-07T01:04:40.787