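I have a list of regular expressions that I want to match against tweets as they arrive, so that I can associate each tweet with a specific account. With only a few rules, as in the code below, it runs very fast, but as soon as you increase the number of rules it gets slower and slower.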
import string, re2, datetime, time

rules = [
    [[1],["(?!.*ipiranga).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
    [[2],["(?!.*brasil).*((?=.*posto)(?=.*petrobras).*|(?=.*petrobras)).*"]],
]

# cache the compiled patterns so they are built only once
compilled_rules = []
for rule in rules:
    compilled_rules.append([[rule[0][0]],[re2.compile(rule[1][0])]])

def get_rules(text):
    new_tweet = string.lower(text)
    for rule in compilled_rules:
        ok = 1
        # rule[1][0] is the compiled pattern for this rule
        if not rule[1][0].search(new_tweet): ok=0
        print ok

def test():
    t0=datetime.datetime.now()
    i=0
    time.sleep(1)
    while i<1000000:
        get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto brasil")
        i+=1
    t1=datetime.datetime.now()-t0
    print "test"
    print i
    print t1
    # total_seconds() keeps the fractional part, unlike .seconds
    print i/t1.total_seconds()
When I test with 550 rules, I can't get past 50 requests/sec. Is there a better way to do this? I need at least 200 requests/sec.
EDIT: After Jonathan's tip, I was able to improve the speed about 5x, but I had to nest my rules a bit. See the code below:
import re, string, datetime, time, cProfile

scope_rules = {
    "1": {
        "termo 1" : "^(?!.*brasil)(?=.*petrobras).*",
        "termo 2" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 3" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 4" : "^(?!.*petrobras)(?=.*ipiranga).*",
    },
    "2": {
        "termo 1" : "^(?!.*ipiranga)(?=.*petrobras).*",
        "termo 2" : "^(?!.*petrobras)(?=.*ipiranga).*",
        "termo 3" : "^(?!.*brasil)(?=.*ipiranga).*",
        "termo 4" : "^(?!.*petrobras)(?=.*ipiranga).*",
    }
}

# compile every pattern once, grouped by scope
compilled_rules = {}
for scope,rules in scope_rules.iteritems():
    compilled_rules[scope]={}
    for term,rule in rules.iteritems():
        compilled_rules[scope][term] = re.compile(rule)

def get_rules(text):
    new_tweet = string.lower(text)
    for scope,rules in compilled_rules.iteritems():
        for term,rule in rules.iteritems():
            if rule.search(new_tweet):
                print "found in scope" + scope + " term:"+ term
                # first hit is enough; skip the remaining terms of this scope
                break

def test():
    t0=datetime.datetime.now()
    i=0
    time.sleep(1)
    while i<1000000:
        get_rules("Acabei de ir no posto petrobras. Moro pertinho do posto ipiranga da lagoa")
        i+=1
    t1=datetime.datetime.now()-t0
    print "test"
    print i
    print t1
    # total_seconds() keeps the fractional part, unlike .seconds
    print i/t1.total_seconds()

cProfile.run('test()', 'testproof')
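
For comparison, every rule in the nested version has the shape `^(?!.*X)(?=.*Y).*`, which on a single-line tweet just means "contains Y and does not contain X". Below is a minimal sketch of that same check with plain substring tests instead of lookaheads. It is written for Python 3 (unlike the Python 2 code above), and `scope_terms` and `match_scopes` are hypothetical names used only for illustration. It assumes each rule has exactly one forbidden word and one required word and that tweets contain no newlines; whether it is actually faster with 550 rules would need to be measured.

```python
# Hypothetical rewrite of the lookahead rules as (forbidden, required) pairs,
# mirroring "^(?!.*forbidden)(?=.*required).*" from scope_rules.
scope_terms = {
    "1": {
        "termo 1": ("brasil", "petrobras"),
        "termo 2": ("petrobras", "ipiranga"),
    },
    "2": {
        "termo 1": ("ipiranga", "petrobras"),
        "termo 3": ("brasil", "ipiranga"),
    },
}


def match_scopes(text):
    """Return {scope: first matching term} for one tweet."""
    tweet = text.lower()
    found = {}
    for scope, terms in scope_terms.items():
        for term, (forbidden, required) in terms.items():
            # Same condition as the regex: required word present,
            # forbidden word absent.
            if required in tweet and forbidden not in tweet:
                found[scope] = term
                break  # first hit per scope is enough
    return found


if __name__ == "__main__":
    print(match_scopes(
        "Acabei de ir no posto petrobras. Moro pertinho do posto ipiranga da lagoa"
    ))
```

Returning a dict instead of printing inside the loop also keeps console output out of the timing when this is called a million times.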