python - 用python比较带有大量正则表达式的文本的最快方法是什么？

Question

我有一个正则表达式列表，我想匹配它们到达的推文，以便我可以将它们与特定帐户相关联。上面的规则数量很少，它运行得非常快，但是一旦你增加规则的数量，它就会变得越来越慢。

import string, re2, datetime, time, array

rules = [
    [[1],["(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
    [[2],["(?!.*brasil).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
]

#cache compile
compilled_rules = []
for rule in rules:
    compilled_scopes.append([[rule[0][0]],[re2.compile(rule[1][0])]])

def get_rules(text):
    new_tweet = string.lower(tweet)
    for rule in compilled_rules:
        ok = 1
        if not re2.search(rule[1][0], new_tweet): ok=0
        print ok

def test():
    t0=datetime.datetime.now()
    i=0
    time.sleep(1)
    while i<1000000:
        get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil")
        i+=1
        t1=datetime.datetime.now()-t0
        print "test"
        print i
        print t1
        print i/t1.seconds

当我用 550 条规则进行测试时，我不能超过 50 个请求/秒。有没有更好的方法来做到这一点？我需要至少 200 个请求/秒

编辑：在乔纳森的提示之后，我可以将速度提高 5 倍，但要嵌套一些我的规则。请看下面的代码：

scope_rules = {
    "1": {
        "termo 1" : "^(?!.*brasil)(?=.*petrobras).*",
        "termo 2" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 3" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 4" : "^(?!.*petrobras)(?=.*ipiranga).*",
        },
    "2": {
        "termo 1" : "^(?!.*ipiranga)(?=.*petrobras).*",
        "termo 2" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 3" : "^(?!.*brasil)(?=.*ipiranga).*",
        "termo 4" : "^(?!.*petrobras)(?=.*ipiranga).*",
        }
    }
compilled_rules = {}
for scope,rules in scope_rules.iteritems():
    compilled_rules[scope]={}
    for term,rule in rules.iteritems():
        compilled_rules[scope][term] = re.compile(rule)


def get_rules(text):
    new_tweet = string.lower(text)
    for scope,rules in compilled_rules.iteritems():
        ok = 1
        for term,rule in rules.iteritems():
            if ok==1:
                if re.search(rule, new_tweet):
                    ok=0
                    print "found in scope" + scope + " term:"+ term


def test():
    t0=datetime.datetime.now()
    i=0
    time.sleep(1)
    while i<1000000:
        get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto ipiranga da lagoa")
        i+=1
        t1=datetime.datetime.now()-t0
        print "test"
        print i
        print t1
        print i/t1.seconds

cProfile.run('test()', 'testproof')

score 4 · Accepted Answer

您的规则似乎是这里的罪魁祸首：由于这两个.*由前瞻分隔，因此必须检查非常多的排列才能成功匹配（或排除匹配）。re.search()不使用锚点会进一步加剧这种情况。此外，包括该posto部分的交替是多余的 - 正则表达式匹配您的字符串中是否有任何posto内容，因此您不妨完全放弃它。

例如，您的第一条规则可以重写为

^(?!.*ipiranga)(?=.*petrobras)

结果没有任何变化。如果您正在寻找确切的单词，您可以使用单词边界进一步优化它：

^(?!.*\bipiranga\b)(?=.*\petrobras\b)

一些测量（使用RegexBuddy）：

应用于字符串的第一个正则表达式Acabei de ir no posto petrobras. Moro pertinho do posto brasil需要正则表达式引擎大约 4700 步才能找出匹配项。如果我取出sin petrobras，则需要超过 100.000 步才能确定不匹配。

我的匹配需要 230 步（在 260 步中失败），因此仅通过正确构建正则表达式即可获得 20-400 倍的加速。

score 1 · Accepted Answer

更进一步，我创建了一个 Cython 扩展来评估规则，现在它的速度非常快。我可以使用大约 3000 条正则表达式规则每秒处理大约 70 个请求

正则表达式.pyx

import re2
import string as pystring

cpdef list match_rules(char *pytext, dict compilled_rules):
    cdef int ok, scope, term
    cdef list response = []
    text = pystring.lower(pytext)
    for scope, rules in compilled_rules.iteritems():
        ok = 1
        for term,rule in rules.iteritems():
            if ok==1:
                if re2.search(rule, text):
                    ok=0
                    response.append([scope,term])
    return response

蟒蛇代码

import re2 as re
import datetime, time, cProfile

scope_rules = {1: {1 : "^(?!.*brasil)(?=.*petrobras).*", 2: "^(?!.*petrobras)(?=.*ipiranga).*",3 : "^(?!.*petrobras)(?=.*ipiranga).*",4 : "^(?!.*petrobras)(?=.*ipiranga).*",},2: {1 : "^(?!.*brasil)(?=.*petrobras).*", 2: "^(?!.*petrobras)(?=.*ipiranga).*",3 : "^(?!.*petrobras)(?=.*ipiranga).*",4 : "^(?!.*petrobras)(?=.*ipiranga).*",},}

compilled_rules = {}
for scope,rules in scope_rules.iteritems():
    compilled_rules[scope]={}
    for term,rule in rules.iteritems():
        compilled_rules[scope][term] = re.compile(rule)

def test():
    t0=datetime.datetime.now()
    i=0
    time.sleep(1)
    while i<1000000:
        mregex.match_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil",compilled_rules)
        i+=1
        t1=datetime.datetime.now()-t0
        print t1
        print i/t1.seconds

cProfile.run('test()', 'testproof')

score 0 · Accepted Answer

除了优化你的正则表达式模式本身（这将产生巨大的影响），你可以试试谷歌的 RE2——它应该比 Python 的标准正则表达式模块更快。

它是用 C++ 完成的，但是有PyRE2，Facebook 的 RE2 的 Python 包装器 :)

PS感谢您的问题，我发现了有关正则表达式匹配的精彩读物！

score 0 · Accepted Answer

除了@Tim Pietzcker 的推荐

如果您有很多规则，那么尝试基于按共同点分组的较小规则构建的分层方法可能是有意义的。

例如，您上面的两条规则都匹配“posto”和“petrobras”。如果将这些正则表达式组合到一个列表中，然后将调度限定为该列表，则可以避免运行许多永远不会应用的规则。

在伪代码中......

# a dict of qualifier rules with list of rules 
rules = {
    "+ petrobras" : [
        "- ipiranga"
        "- brasil"
        "? posto"
    ] ,
}

for rule_qualifier in rules:
   if regex( rule_qualifier , text ):
       for rule in rule_qualifier:
           if regex( rule , text ):
               yay

python - 用python比较带有大量正则表达式的文本的最快方法是什么？

4 回答 4

正则表达式.pyx

蟒蛇代码

Related

Reference