python - Python中的顺序模式匹配算法

Question

我发现自己处于这种情况，我需要在 Python 中实现一种用于顺序模式匹配的算法。搜索几个小时后，在互联网上找不到任何工作库/片段。

问题定义：

实现一个函数sequential_pattern_match

输入：令牌，（字符串的有序集合）

输出：一个元组列表，每个元组=（标记的任何子集合，标记）

领域专家将定义匹配规则，通常使用正则表达式

测试（令牌）-> 标记或无

例子：

输入：[“新加坡”、“Python”、“用户”、“组”、“是”、“这里”]

输出：[(["Singapore", "Python", "User", "Group"], "ORGANIZATION"), ("is", 'O'), ("here", 'O')]

'O' 表示不匹配。

冲突解决规则：

首先出现的匹配具有更高的优先级。例如“Singapore property sales”，如果可能有两个相互冲突的匹配，“Singapore property”作为资产，“property sales”作为事件，则使用第一个。
较长的匹配比较短的匹配具有更高的优先级。例如，作为组织的“Singapore Python User Group”比作为位置的“Singapore”+作为语言的“Python”的单个匹配具有更高的优先级。

凭借我在算法和数据结构方面的专业知识，这是我的实现：

from itertools import ifilter, imap


MAX_PATTERN_LENGTH = 3

def test(tokens):
    length = len(tokens)
    if (length == 1):
        if tokens[0] == "Nexium":
            return "MEDICINE"
        elif tokens[0] == "pain":
            return "SYMPTOM"
    elif (length == 2):
        string = ' '.join(tokens)
        if string == "Barium Swallow":
            return "INTERVENTION"
        elif string == "Swallow Test":
            return "INTERVENTION"
    else:
        if ' '.join(tokens) == "pain in stomach":
            return "SYMPTOM"

def _evaluate(tokens):
    tag = test(tokens)
    if tag:
        return (tokens, tag)
    elif len(tokens) == 1:
        return (tokens, 'O')

def _splits(tokens):
    return ((tokens[:i], tokens[i:]) for i in xrange(min(len(tokens), MAX_PATTERN_LENGTH), 0, -1))

def sequential_pattern_match(tokens):
    return ifilter(bool, imap(_halves_match, _splits(tokens))).next()

def _halves_match(halves):
    result = _evaluate(halves[0])
    if result:
        return [result] + (halves[1] and sequential_pattern_match(halves[1]))

if __name__ == "__main__":
    tokens = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split()
    output = sequential_pattern_match(tokens)
    slashTags = ' '.join(t + '/' + tag for tokens, tag in output for t in tokens)
    print(slashTags)
    assert slashTags == "I/O went/O to/O a/O clinic/O to/O do/O a/O Barium/INTERVENTION Swallow/INTERVENTION Test/O because/O I/O had/O pain/SYMPTOM in/SYMPTOM stomach/SYMPTOM after/O taking/O Nexium/MEDICINE"

    import timeit
    t = timeit.Timer(
        'sequential_pattern_match("I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split())',
        'from __main__ import sequential_pattern_match'
    )
    print(t.repeat(3, 10000))

我不认为它可以更快。不幸的是，它是用函数式编写的，这可能不适合 Python。您是否能够以 OO 或命令式风格更快地实现？

（注意：我相信如果用 C 实现会更快，但目前我没有使用 Python 以外的其他语言的计划）

score 1 · Accepted Answer

def sequential_pattern_match(tokens):
    for first, rest in _splits(tokens):
        x = _halves_match(first, rest)
        if x:
            return x

def _splits(tokens):
    for i in xrange(min(len(tokens), MAX_PATTERN_LENGTH), 0, -1):
        yield tokens[:i], tokens[i:]

def _halves_match(first, rest):
    tag = test(first)
    if tag:
        return [(first, tag)] + (rest and sequential_pattern_match(rest))

def test(tokens):
    length = len(tokens)
    if length == 1:
        if tokens[0] == "Nexium":
            return "MEDICINE"
        elif tokens[0] == "pain":
            return "SYMPTOM"
        else:
            return "O"
    elif length == 2:
        if tokens == ["Barium", "Swallow"]:
            return "INTERVENTION"
        elif tokens == ["Swallow", "Test"]:
            return "INTERVENTION"
    elif tokens == ["pain", "in", "stomach"]:
        return "SYMPTOM"

替换ifilter，imap用简单的for循环。for带有循环的生成器表达式yield。

在我的机器上减少了时间：

1.02694065435 -> 0.708227394544 (Python 2.7.5)
1.1575780184 -> 0.425939527209 (PyPy 2.1)

score 0 · Accepted Answer

您的解决方案并不优雅。考虑使用来自 htql.net 的 htql.RegEx。这是您的问题的部分解决方案：

tokens = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split()
symptoms = ['Nexium', 'pain', 'Barium Swallow', 'Swallow Test', 'pain in stomach']

import htql
a=htql.RegEx()
a.setNameSet('symptoms', symptoms)

a.reSearchList(tokens, '&[ws:symptoms]')
# [['Barium', 'Swallow'], ['pain', 'in', 'stomach'], ['Nexium']]

a.reSearchList(tokens, '&[ws:symptoms]', useindex=True)
# [(8L, 2L), (14L, 3L), (19L, 1L)]

您可以轻松地将其扩展到更复杂的场景。

python - Python中的顺序模式匹配算法

2 回答 2

Related

Reference