1

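I define a word as a sequence of characters (the letters a to Z) that may also contain apostrophes. I want to split a sentence into words and remove the apostrophes from those words.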

I am currently doing the following to get the words from a piece of text.

import re
text = "Don't ' thread \r\n on \nme ''\n "
words_iter = re.finditer(r'(\w|\')+', text)
words = (word.group(0).lower() for word in words_iter)
for i in words:
    print(i)

This gives me:

don't
'
thread
on
me
''

But what I want is:

dont
thread
on
me

How can I change my code to achieve this?

Note that there is no ' in my desired output.

I would also like the words variable to be a generator.

4 Answers

3

This looks like a job for regex.

import re

text = "Don't ' thread \r\n on \nme ''\n "

# Define a function so as to make a generator
def get_words(text):

    # Find each block, separated by spaces
    for section in re.finditer(r"[^\s]+", text):

        # Get the text from the selection, lowercase it
        # (`.lower()` for Python 2 or if you hate people who use Unicode)
        section = section.group().casefold()

        # Filter so only letters are kept and yield
        section = "".join(char for char in section if char.isalpha())
        if section:
            yield section

list(get_words(text))
#>>> ['dont', 'thread', 'on', 'me']

Explanation of the regex:

[^    # An "inverse set" of characters, matches anything that isn't in the set
\s    # Any whitespace character
]+    # One or more times

So this matches any block of non-whitespace characters.
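
As a quick check of what the pattern matches before any filtering, the sketch below runs it on the sample text from the question. The name blocks is only illustrative, and \S is the standard shorthand for [^\s].

```python
import re

text = "Don't ' thread \r\n on \nme ''\n "

# Each match is a maximal run of non-whitespace characters.
blocks = re.findall(r"[^\s]+", text)
print(blocks)

# \S is shorthand for [^\s], so both patterns find the same blocks.
assert blocks == re.findall(r"\S+", text)
```

The lone ' and '' blocks are still matched at this stage. They only disappear afterwards, because filtering out non-letters leaves an empty string, which get_words skips.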

answered 2013-09-23T14:18:10.227
1
words = (x.replace("'", '') for x in text.split())
result = tuple(x for x in words if x)

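...which iterates over the split data only once.

If the data set is large, use re.finditer instead of str.split() to avoid building the full list of tokens in memory: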

words = (m.group().replace("'", '') for m in re.finditer(r'[^\s]+', text))
result = tuple(x for x in words if x)

...although calling tuple() on the result pulls everything into memory anyway.
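
If laziness matters, one option is to skip the tuple() step and keep everything in a generator function, as the question asks. This is a minimal sketch rather than part of the answer above: the name iter_words is hypothetical, and the lowercasing is added here to match the output requested in the question.

```python
import re

def iter_words(text):
    """Lazily yield lowercase words with apostrophes removed."""
    for match in re.finditer(r"\S+", text):
        # Strip apostrophes, then lowercase (assumption: the question's desired output).
        word = match.group().replace("'", "").lower()
        if word:  # skip blocks that consisted only of apostrophes
            yield word

text = "Don't ' thread \r\n on \nme ''\n "
words = iter_words(text)
for word in words:
    print(word)
```

Nothing is matched until the loop starts pulling items, so only one word is held at a time. Like the answer above, this removes apostrophes only; other punctuation is left in place.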

answered 2013-09-23T14:22:57.743
0
import string
# Keep only ASCII letters and whitespace, then split on whitespace (Python 3)
allowed = string.ascii_letters + string.whitespace
tuple("".join(c for c in "strings don't have '" if c in allowed).split())
answered 2013-09-23T14:19:01.280
0

Using str.translate and re.finditer:

>>> text = "Don't ' thread \r\n on \nme ''\n "
>>> import re
>>> from string import punctuation
>>> tab = dict.fromkeys(map(ord, punctuation))
>>> def solve(text):
...     for m in re.finditer(r'\b(\S+)\b', text):
...         x = m.group(1).translate(tab).lower()
...         if x:
...             yield x
...
>>> list(solve(text))
['dont', 'thread', 'on', 'me']
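
To make the role of the translation table clearer, here is a small sketch of how a table like this behaves on its own. The sample string "Don't!" and the name cleaned are only illustrative.

```python
from string import punctuation

# Mapping a code point to None tells str.translate to delete that character.
tab = dict.fromkeys(map(ord, punctuation))

# str.maketrans with a third argument builds the same kind of deletion table.
assert tab == str.maketrans("", "", punctuation)

cleaned = "Don't!".translate(tab)
print(cleaned)
```

Note that string.punctuation covers ASCII punctuation only, so digits and non-ASCII punctuation pass through this table unchanged.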

Timing comparison:

>>> strs = text * 1000
>>> %timeit list(solve(strs))
10 loops, best of 3: 11.1 ms per loop
>>> %timeit list(get_words(strs))
10 loops, best of 3: 36.7 ms per loop
>>> strs = text * 10000
>>> %timeit list(solve(strs))
1 loops, best of 3: 146 ms per loop
>>> %timeit list(get_words(strs))
1 loops, best of 3: 411 ms per loop
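
The %timeit magic above only works inside IPython. A rough way to repeat the comparison in a plain script is sketched below with the standard timeit module. It assumes both functions are defined as in the answers on this page, and the numbers it prints will differ by machine and Python version.

```python
import re
import timeit
from string import punctuation

text = "Don't ' thread \r\n on \nme ''\n "
tab = dict.fromkeys(map(ord, punctuation))

def solve(text):
    # str.translate-based version from this answer
    for m in re.finditer(r'\b(\S+)\b', text):
        x = m.group(1).translate(tab).lower()
        if x:
            yield x

def get_words(text):
    # per-character isalpha() filter from the earlier answer
    for section in re.finditer(r"[^\s]+", text):
        section = "".join(c for c in section.group().casefold() if c.isalpha())
        if section:
            yield section

strs = text * 1000

# Sanity check: both approaches agree on this input before timing them.
assert list(solve(strs)) == list(get_words(strs))

for func in (solve, get_words):
    seconds = timeit.timeit(lambda: list(func(strs)), number=10)
    print(f"{func.__name__}: {seconds / 10 * 1000:.1f} ms per loop")
```

The two functions agree on this sample, but they are not equivalent in general: solve strips ASCII punctuation and keeps digits, while get_words keeps letters only.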
answered 2013-09-23T14:18:13.810