python - Tokenizing a string merges some words

Question

I am using the following code to tokenize a string read from standard input.

d=[]
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
            cur = ''
    else:
        cur = cur + i.lower()

This gives me a list of words with no repeats. However, in my output some words are not split.

My input is

Dan went to the north pole to lead an expedition during summer.

and the output array d is

['dan', 'went', 'to', 'the', 'north', 'pole', 'tolead', 'an', 'expedition', 'during', 'summer']

Why is tolead joined together?

score 3 · Accepted Answer

Try this:

d=[]
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
        cur = '' # note the different indentation
    else:
        cur = cur + i.lower()

score 1 · Accepted Answer

Try this:

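import sys

for line in sys.stdin:  # iterate over lines, not over the characters of one line
    # drop the trailing newline and the final period before splitting
    res = set(word.lower() for word in line.rstrip("\n").rstrip(".").split(" "))
    print(res)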

Example:

line = "Dan went to the north pole to lead an expedition during summer."
res = set(word.lower() for word in line[:-1].split(" "))
print res

set(['north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'summer', 'the'])

Edit, following the comments: this solution preserves the input order and filters out the separators.

import re
from collections import OrderedDict
line = "Dan went to the north pole to lead an expedition during summer."
list(OrderedDict.fromkeys(re.findall(r"[\w']+", line)))
# ['Dan', 'went', 'to', 'the', 'north', 'pole', 'lead', 'an', 'expedition', 'during', 'summer']
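Note that the snippet above keeps the original capitalization ('Dan'), unlike the loop in the question. A minimal sketch of a lowercasing variant follows, assuming Python 3.7 or later, where a plain dict preserves insertion order and can stand in for OrderedDict. The names words and unique_words are illustrative only.

```python
import re

line = "Dan went to the north pole to lead an expedition during summer."

# Lowercase first so that "Dan" and "dan" count as the same word.
words = re.findall(r"[\w']+", line.lower())

# dict.fromkeys keeps the first occurrence of each key, in input order.
unique_words = list(dict.fromkeys(words))
print(unique_words)
```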

score 1 · Accepted Answer

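"to" is already in d. So when your loop reaches the space between the second "to" and "lead", it appends nothing and, because the reset sits inside the same if block, it does not clear cur either; it just keeps concatenating. Once it gets to the next space, it finds that "tolead" is not in d, so it appends it.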
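The behaviour described here can be reproduced with a small sketch. The helper names tokenize_buggy and tokenize_fixed are hypothetical, introduced only for illustration. They wrap the loop from the question and the corrected loop, and take a string instead of reading standard input.

```python
def tokenize_buggy(line):
    """Loop from the question: cur is reset only when a word is appended."""
    d = []
    cur = ''
    for ch in line:
        if ch in ' .':
            if cur not in d and cur != '':
                d.append(cur)
                cur = ''  # never reached when cur is already in d
        else:
            cur = cur + ch.lower()
    return d


def tokenize_fixed(line):
    """Same loop, but cur is reset at every separator."""
    d = []
    cur = ''
    for ch in line:
        if ch in ' .':
            if cur not in d and cur != '':
                d.append(cur)
            cur = ''  # always start a new word after a separator
        else:
            cur = cur + ch.lower()
    return d


line = "Dan went to the north pole to lead an expedition during summer."
print(tokenize_buggy(line))  # contains 'tolead'
print(tokenize_fixed(line))  # contains 'lead' as a separate word
```

Both versions only emit a word when they meet a space or a period, so a final word with no trailing separator would be dropped. The input in the question ends with a period, so it is not affected.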

A simpler solution, which also strips out all forms of punctuation:

>>> import string
>>> set("Dan went to the north pole to lead an expedition during summer.".translate(None, string.punctuation).lower().split())
set(['summer', 'north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'the'])
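The two-argument call translate(None, string.punctuation) works only on Python 2 byte strings. A rough Python 3 equivalent, offered as a sketch rather than as part of the original answer, builds a deletion table with str.maketrans. The names table and words are illustrative.

```python
import string

line = "Dan went to the north pole to lead an expedition during summer."

# With three arguments, str.maketrans maps every character in the
# third argument to None, so translate() deletes it.
table = str.maketrans('', '', string.punctuation)

words = set(line.translate(table).lower().split())
print(sorted(words))  # sorted only to make the printed order stable
```

As in the original one-liner, the result is a set, so the input order of the words is not preserved.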
