python - Python finditer or sub for large Unicode text

Asked 2019-05-06T19:39:45.970

Viewed 160 times

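I have to replace multiple occurrences of tokens in a large Unicode text document. Currently I iterate over the words in my dictionary and replace each one by calling sub on a compiled regex: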

for token, replacement in dictionary.tokens().iteritems():
    r = re.compile(word_regex_unicode(token), flags=re.I | re.X | re.UNICODE)
    text = r.sub(replacement, text)

My word regex looks like this:

# UTF8 unicode word regex
def word_regex_unicode(word):
    return r"(?<!\S){}(?!\S)".format(re.escape(word))
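
To illustrate what this pattern accepts: the lookarounds `(?<!\S)` and `(?!\S)` require the token to be bounded by whitespace or by the edges of the string, so a token glued to punctuation or to other letters is not matched. A minimal sketch, assuming Python 3 and a made-up sample string (neither `sample` nor the token `café` comes from my real data):

```python
import re

def word_regex_unicode(word):
    return r"(?<!\S){}(?!\S)".format(re.escape(word))

# Same flags as in the loop above; the token and sample text are hypothetical
pattern = re.compile(word_regex_unicode("café"), flags=re.I | re.X | re.UNICODE)
sample = "Café au lait, not cafés; the café."

# Only the first word is surrounded by whitespace or a string edge;
# "cafés;" and "café." are followed by a non-whitespace character
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)
```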

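With this approach, a new regex has to be compiled and sub has to be called on text for every token, whether or not the token occurs in the document. As an alternative, one could use re.finditer to find occurrences of the token and call re.sub only when the token is found: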

for token, replacement in dictionary.tokens().iteritems():
    r = re.compile(word_regex_unicode(token), flags=re.I | re.X | re.UNICODE)
    # finditer on a compiled pattern takes only the string to search
    for m in r.finditer(text):
        # now call sub
        text = r.sub(replacement, text)

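thereby avoiding the call to re.sub when it is not actually needed. This last approach could be improved further by using the match objects that re.finditer yields: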

# Walk the matches from right to left so that earlier indices
# stay valid after each splice
for m in reversed(list(r.finditer(text))):
    # replace the matched span, from m.start() to m.end()
    text = text[:m.start()] + replacement + text[m.end():]

Which of these approaches is faster?
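
One way to settle this empirically is to time the candidates on representative data. The harness below is only a sketch, assuming Python 3: the `tokens` mapping and the sample `text` are hypothetical stand-ins for `dictionary.tokens()` and the real document, and `single_pass` is a commonly used variant that is not one of the approaches described above (it joins all tokens into one alternation and picks the replacement in a callback). That variant also assumes `str.lower()` agrees with the regex's case-insensitive matching for every token, which may not hold for all Unicode characters.

```python
import re
import timeit

def word_regex_unicode(word):
    return r"(?<!\S){}(?!\S)".format(re.escape(word))

# Hypothetical stand-ins for dictionary.tokens() and the real document
tokens = {"foo": "FOO", "bar": "BAR", "absent": "ABSENT"}
text = "foo bar baz foobar Foo " * 1000
FLAGS = re.I | re.UNICODE

def sub_per_token(text):
    # Approach 1: compile and call sub once per token
    for token, replacement in tokens.items():
        r = re.compile(word_regex_unicode(token), flags=FLAGS)
        text = r.sub(replacement, text)
    return text

def search_then_sub(text):
    # Approach 2: call sub only if the token occurs at least once
    for token, replacement in tokens.items():
        r = re.compile(word_regex_unicode(token), flags=FLAGS)
        if r.search(text):
            text = r.sub(replacement, text)
    return text

def single_pass(text):
    # Variant not in the question: one alternation over all tokens,
    # with the replacement looked up per match (longest tokens first)
    lowered = {t.lower(): rep for t, rep in tokens.items()}
    alternation = "|".join(
        re.escape(t) for t in sorted(tokens, key=len, reverse=True)
    )
    r = re.compile(r"(?<!\S)(?:{})(?!\S)".format(alternation), flags=FLAGS)
    return r.sub(lambda m: lowered[m.group(0).lower()], text)

candidates = (sub_per_token, search_then_sub, single_pass)
results = {f.__name__: f(text) for f in candidates}
timings = {
    f.__name__: timeit.timeit(lambda f=f: f(text), number=5)
    for f in candidates
}
for name, seconds in timings.items():
    print("{}: {:.4f}s".format(name, seconds))
```

The numbers will depend on how many tokens there are and how often they actually occur, so they are worth measuring on the real dictionary and document. Note that sub already has to scan the whole text to find out that a token is absent, so a separate presence check adds a scan rather than removing one.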

0 answers
