python - Stripping punctuation from unique strings in an input file

Question

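This question (Best way to strip punctuation from a string in Python) deals with stripping punctuation from an individual string. However, I want to read text from an input file and print only one copy of each string, without trailing punctuation. I have started with something like this: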

f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print x

The problem is that if the input file contains, for example, this line:

This is not is, clearly is: weird

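it treats the three occurrences of "is" as three different strings, but I want to ignore any punctuation and have it print "is" only once rather than three times. How can I remove any kind of trailing punctuation and then put the resulting strings into the set?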

Thanks for any help. (I am really new to Python.)

score 1 · Accepted Answer

import re

# f is the file object opened in the question
for x in set(re.findall(r'\b\w+\b', f.read())):
    print x

should do a better job of telling the words apart.

This regular expression finds contiguous groups of alphanumeric characters (a-z, A-Z, 0-9, _).

If you want to find letters only (no digits and no underscore), replace \w with [a-zA-Z].
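As a rough illustration of how the two patterns differ, here is a minimal sketch in Python 3 syntax. The sample string and the names `sample`, `words`, and `letters` are made up for this example and are not part of the original answer. Note that in Python 3, \w on a str pattern also matches non-ASCII letters and digits unless re.ASCII is passed.

```python
import re

# Hypothetical sample containing digits and an underscore.
sample = "abc 123 is, is: x_y"

# \w+ accepts letters, digits, and the underscore.
words = re.findall(r'\b\w+\b', sample)

# [a-zA-Z]+ accepts ASCII letters only; the \b anchors are kept,
# so a run of letters that touches a digit or underscore is skipped.
letters = re.findall(r'\b[a-zA-Z]+\b', sample)

print(words)    # ['abc', '123', 'is', 'is', 'x_y']
print(letters)  # ['abc', 'is', 'is']
```

With the word boundaries left in place, "x_y" produces no match at all in the letters-only version, because the underscore still counts as a word character for \b.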

>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
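
Putting the pieces together for the file case, one possible sketch in Python 3 syntax follows. The function name `unique_words` and the temporary demo file are assumptions made for illustration; the read-only mode 'r' is also an assumption about what is wanted, since the file is only read here.

```python
import os
import re
import tempfile

def unique_words(path):
    """Return the set of distinct words found in the file at `path`."""
    # Read-only mode: assumed to be sufficient, since nothing is written.
    with open(path, 'r') as f:
        return set(re.findall(r'\b\w+\b', f.read()))

# Demo: write the line from the question to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("This is not is, clearly is: weird\n")

# Sets are unordered, so sort before printing for stable output.
for word in sorted(unique_words(tmp.name)):
    print(word)

os.remove(tmp.name)
```

Each of the five distinct words is printed once, so "is" appears a single time.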

score 0 · Accepted Answer

If you don't mind replacing the punctuation with whitespace, you can use a translation table, for example:

>>> from string import maketrans
>>> punctuation = ",;.:"
>>> replacement = "    "
>>> trans_table = maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words.
>>> set('This is not is  clearly is  weird'.split())
set(['This', 'not', 'is', 'clearly', 'weird'])
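
The session above is Python 2, where maketrans lives in the string module. In Python 3 the same idea can be sketched with str.maketrans. Using string.punctuation instead of the four characters ",;.:" is an assumption made here to cover more cases; the names `text` and `unique` are illustrative.

```python
import string

text = 'This is not is, clearly is: weird'

# Map every ASCII punctuation character to a space.
# Both arguments to str.maketrans must have the same length.
trans_table = str.maketrans(string.punctuation,
                            ' ' * len(string.punctuation))

# Translate, split on whitespace, and collect the distinct words.
unique = set(text.translate(trans_table).split())

print(sorted(unique))  # ['This', 'clearly', 'is', 'not', 'weird']
```

One trade-off of mapping all punctuation to spaces is that apostrophes and hyphens are replaced too, so a word like "don't" would be split into "don" and "t".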
