python - How can I further profile and optimize this string tokenization function?

Question

Feel free to skip my long-winded explanation if reading the source code is easier!

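I wrote a function to tokenize a string of text. In the simplest case, it takes a string like It's a beautiful morning and returns a list of tokens. For that example, the output would be ['It', "'", 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning'].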

This is accomplished by the first two lines of the function:

separators = dict.fromkeys(whitespace + punctuation, True)
tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]
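
To make the grouping step concrete, here is a minimal Python 3 sketch of those two lines in isolation. The helper name split_tokens is hypothetical. groupby starts a new group whenever the key changes, and separators.get returns True for whitespace and punctuation and None for everything else, so runs of separators and runs of other characters alternate.

```python
from itertools import groupby
from string import punctuation, whitespace


def split_tokens(phrase):
    # Hypothetical helper that wraps the two lines shown above.
    separators = dict.fromkeys(whitespace + punctuation, True)
    # Consecutive characters with the same key (True or None) form one token.
    return [''.join(g) for _, g in groupby(phrase, separators.get)]


print(split_tokens("It's a beautiful morning"))
print(split_tokens("Hi, there"))
```

One consequence worth knowing: adjacent separators land in the same token, so the second call returns ['Hi', ', ', 'there'] rather than splitting the comma from the space.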

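Note that It's gets split into ["It", "'", "s"]. In most cases this is not a problem, but sometimes it is. For this reason, I added the stop_words kwarg, which takes a set of strings that should be "un-tokenized". For example: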

>>> tokenize("It's a beautiful morning", stop_words=set(["It's"]))
["It's", ' ', 'a', ' ', 'beautiful', ' ', 'morning']

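This "un-tokenization" works by means of a sliding window that moves across the token list. Consider the schema below, where the window is depicted as []: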

Iteration 1:  ['It', "'",] 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Iteration 2:  'It', ["'", 's',] ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Iteration 3:  'It', "'", ['s', ' ',] 'a', ' ', 'beautiful', ' ', 'morning'

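On each iteration, the strings contained in the window are joined together and the result is checked against stop_words. If the window reaches the end of the token list without finding a match, the window size is increased by 1. Thus: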

Iteration 9:  ['It', "'", 's',] ' ', 'a', ' ', 'beautiful', ' ', 'morning'

Here we have a match, so the entire window is replaced with a single element: its contents, joined. At the end of iteration 9, we therefore obtain:

"It's", ' ', 'a', ' ', 'beautiful', ' ', 'morning'

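Now we have to start all over again, in case this new token forms a stop word when combined with its neighbors. The algorithm sets the window size back to 2 and continues. The whole process stops at the end of the pass in which the window size equals the length of the token list.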
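
The procedure described above can be sketched compactly in Python 3. This is only an illustration of the description, not the original code: the function name untokenize is hypothetical, and it uses list slicing rather than tee and izip.

```python
def untokenize(tokens, stop_words):
    """Merge runs of adjacent tokens whose concatenation is a stop word."""
    tokens = list(tokens)
    window = 2  # a one-token window can never merge anything
    while window <= len(tokens):
        for offset in range(len(tokens) - window + 1):
            candidate = ''.join(tokens[offset:offset + window])
            if candidate in stop_words:
                # Replace the whole window with the joined string ...
                tokens[offset:offset + window] = [candidate]
                # ... and restart; the increment below brings this back to 2.
                window = 1
                break
        window += 1
    return tokens


tokens = ['It', "'", 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning']
print(untokenize(tokens, {"It's"}))
```

With nine tokens, the window of size 2 is tried at eight positions before the size-3 window matches at its first position, which is the ninth iteration in the schema above.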

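This restarting is the source of my algorithm's inefficiency. For small strings with few un-tokenizations, it runs very quickly. However, the computation time seems to grow exponentially with the number of un-tokenizations and the overall length of the original string.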

Here is the full source code for the function:

# Python 2: izip and xrange do not exist in Python 3
from itertools import groupby, tee, izip
from string import punctuation, whitespace

def tokenize(phrase, stop_words=None):
    separators = dict.fromkeys(whitespace + punctuation, True)
    tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]

    if stop_words:
        assert isinstance(stop_words, set), 'stop_words must be a set'
        window = 2  # Iterating over single tokens is useless
        while window <= len(tokens):
            # "sliding window" over token list
            iters = tee(tokens, window)
            for i, offset in izip(iters, xrange(window)):
                for _ in xrange(offset):
                    next(i, None)

            # Join each window and check if it's in `stop_words`
            for offset, tkgrp in enumerate(izip(*iters)):
                tk = ''.join(tkgrp)
                if tk in stop_words:
                    pre = tokens[0: offset]
                    post = tokens[offset + window:]  # resume right after the window
                    tokens = pre + [tk] + post
                    window = 1  # will be incremented after breaking from loop
                    break

            window += 1

    return tokens

And here are some hard numbers to work with (the best I could do, anyway).

>>> import cProfile
>>> strn = "it's a beautiful morning."
>>> ignore = set(["they're", "we'll", "she'll", "it's", "we're", "i'm"])
>>> cProfile.run('tokenize(strn * 100, stop_words=ignore)')
         57534203 function calls in 15.737 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.405   10.405   15.737   15.737 <ipython-input-140-6ef74347708e>:1(tokenize)
        1    0.000    0.000   15.737   15.737 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method fromkeys}
      899    0.037    0.000    0.037    0.000 {itertools.tee}
      900    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   365450    1.459    0.000    1.459    0.000 {method 'join' of 'str' objects}
 57166950    3.836    0.000    3.836    0.000 {next}

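From this I gather that most of the execution time is spent within the scope of my function. As stated above, I suspect that the constant resetting of window is responsible for the inefficiency, but I'm not sure how to diagnose this any further.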
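
One way to probe that suspicion without a profiler is to count the work directly. The sketch below is a Python 3 re-implementation that assumes the algorithm as described, using slicing instead of tee and izip. The names count_work, joins, and restarts are hypothetical. It reports how many windows were joined and how many times the window size was reset as the input is repeated.

```python
from itertools import groupby
from string import punctuation, whitespace


def count_work(phrase, stop_words):
    """Return (joins, restarts) for the sliding-window merge on phrase."""
    separators = dict.fromkeys(whitespace + punctuation, True)
    tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]
    joins = restarts = 0
    window = 2
    while window <= len(tokens):
        for offset in range(len(tokens) - window + 1):
            joins += 1  # one window examined
            candidate = ''.join(tokens[offset:offset + window])
            if candidate in stop_words:
                tokens[offset:offset + window] = [candidate]
                restarts += 1  # the window size is reset here
                window = 1
                break
        window += 1
    return joins, restarts


strn = "it's a beautiful morning."
ignore = {"they're", "we'll", "she'll", "it's", "we're", "i'm"}
for n in (1, 2, 4, 8):
    joins, restarts = count_work(strn * n, ignore)
    print(n, joins, restarts)
```

If joins grows much faster than the number of tokens while restarts grows only linearly with n, that suggests the cost comes from rescanning the whole list after every reset rather than from the number of merges itself.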

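My questions are as follows:

How can I further profile this function to determine whether resetting window is indeed responsible for the long execution time?
What can I do to improve performance?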

Thanks very much in advance!

score 1 · Accepted Answer

I may have misunderstood the problem, but it seems that simply searching for the ignored words before splitting would solve it:

import re

def tokenize(phrase, stop_words=()):
    # Try the stop words first, then fall back to the generic token patterns
    stop_words = '|'.join(re.escape(x) + r'\b' for x in stop_words)
    other = r'\s+|\w+|[^\s\w]+'
    regex = stop_words + '|' + other if stop_words else other
    return re.findall(regex, phrase)

As Michael Anderson pointed out, you should add \b to avoid matching parts of words.

Edit: the new regex separates whitespace from punctuation.
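
For reference, here is a usage sketch of this approach in Python 3. The name tokenize_re is hypothetical, and sorting the stop words longest first is my own assumption (it prevents a shorter stop word from shadowing a longer one that starts the same way); it is not part of the answer above.

```python
import re


def tokenize_re(phrase, stop_words=()):
    # Assumption: try longer stop words first so a shorter one cannot shadow them.
    ordered = sorted(stop_words, key=len, reverse=True)
    protected = '|'.join(re.escape(w) + r'\b' for w in ordered)
    # Whitespace runs, word runs, and punctuation runs, in that order.
    other = r'\s+|\w+|[^\s\w]+'
    pattern = protected + '|' + other if protected else other
    return re.findall(pattern, phrase)


print(tokenize_re("It's a beautiful morning", {"It's"}))
print(tokenize_re("It's a beautiful morning"))
```

The whole phrase is scanned once by the regex engine, so there is no window to reset. Unlike the groupby version, whitespace and punctuation end up in separate tokens.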

score 0 · Accepted Answer

I'd vote for regular expressions!

If you don't care about excluding punctuation from your token list, you can do

import re
text = '''It's a beautiful morning'''
tokens = re.split(r' ', text)  # pattern first, then the string to split

which gives you

["It's", 'a', 'beautiful', 'morning']

If you want to remove all punctuation, you can use

tokens = re.split(r'\W+', text)

to get back

['It', 's', 'a', 'beautiful', 'morning']

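If you want to keep contractions such as It's together but still get punctuation as separate tokens, you can use

tokens = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", text)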
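
A quick way to compare the three patterns is to run them side by side. This is a minimal Python 3 sketch; the second sample string is my own and only serves to show how hyphens and punctuation are handled.

```python
import re

text = "It's a beautiful morning"

# Split on single spaces: contractions survive, punctuation stays attached.
print(re.split(r' ', text))

# Split on runs of non-word characters: the apostrophe is dropped.
print(re.split(r'\W+', text))

# Find tokens: words may contain internal hyphens or apostrophes,
# and any other non-space character starts a token of its own.
pattern = r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"
print(re.findall(pattern, text))
print(re.findall(pattern, "well-known, isn't it?"))
```

The last call returns ['well-known', ',', "isn't", 'it', '?']. Whitespace never appears in the findall result, because no alternative in the pattern can match it.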
