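I know there are plenty of examples of removing punctuation, but I'd like to know the most efficient way to do it. I have a list of words read from a txt file and split: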
wordlist = open('Tyger.txt', 'r').read().split()
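What is the fastest way to go through each word and strip any punctuation? I could do it with a pile of code, but I know that isn't the simplest way.
Thanks!!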
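I think the simplest way is to extract only the words in the first place, i.e. runs of word characters (\w matches letters, digits and underscore):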
import re
with open("Tyger.txt") as f:
    words = re.findall(r"\w+", f.read())
For example:
text = """
Tyger! Tyger! burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry? 
"""
import re
words = re.findall(r'\w+', text)
Or:
import string
ps = string.punctuation
words = text.translate(string.maketrans(ps, ' ' * len(ps))).split()
The second one is much faster.
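On Python 3, string.maketrans no longer exists; the same kind of table can be built with str.maketrans instead. Here is a minimal sketch of the second approach under that assumption, using two lines of the sample text (the helper name strip_punctuation_words is made up for illustration):

```python
import string

def strip_punctuation_words(text):
    """Replace every ASCII punctuation character with a space, then split."""
    ps = string.punctuation
    # Python 3: str.maketrans maps each punctuation character to a space.
    table = str.maketrans(ps, " " * len(ps))
    return text.translate(table).split()

text = """
Tyger! Tyger! burning bright
In the forests of the night,
"""
words = strip_punctuation_words(text)
print(words)
```

Because each punctuation mark becomes a space, a word like "don't" would come out as two words. If that is not wanted, str.maketrans('', '', ps) builds a table that deletes the characters instead of replacing them.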
I would use something like this:
import re
with open("Tyger.txt") as f:
    print(" ".join(re.split(r"[-,!?.]", f.read())))
It removes only what actually needs removing and avoids the overhead of over-matching.
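To see what this does without a file, here is a small sketch on an in-memory string (the sample text and the names cleaned and wordlist are only for illustration). re.split cuts the text at each listed character and " ".join glues the pieces back together, so every listed punctuation mark ends up replaced by a space; a final split() then gives the word list the question asks for.

```python
import re

sample = "Tyger! Tyger! burning bright\nIn the forests of the night,"

# Split at each listed punctuation character, then rejoin with spaces.
cleaned = " ".join(re.split(r"[-,!?.]", sample))

# Whitespace-split to get the individual words.
wordlist = cleaned.split()
print(wordlist)
```

Punctuation outside the listed set (quotes, semicolons, and so on) is left alone, which is the trade-off of naming the characters explicitly.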
>>> import re
>>> the_tyger
'\n    Tyger! Tyger! burning bright \n    In the forests of the night, \n    What immortal hand or eye \n    Could frame thy fearful symmetry? \n    \n    In what distant deeps or skies \n    Burnt the fire of thine eyes? \n    On what wings dare he aspire? \n    What the hand dare sieze the fire? \n    \n    And what shoulder, & what art. \n    Could twist the sinews of thy heart? \n    And when thy heart began to beat, \n    What dread hand? & what dread feet? \n    \n    What the hammer? what the chain? \n    In what furnace was thy brain? \n    What the anvil? what dread grasp \n    Dare its deadly terrors clasp? \n    \n    When the stars threw down their spears, \n    And watered heaven with their tears, \n    Did he smile his work to see? \n    Did he who made the Lamb make thee? \n    \n    Tyger! Tyger! burning bright \n    In the forests of the night, \n    What immortal hand or eye \n    Dare frame thy fearful symmetry? \n    '
>>> print(re.sub(r'["-,!?.]','',the_tyger))
Prints:
Tyger Tyger burning bright 
In the forests of the night 
What immortal hand or eye 
Could frame thy fearful symmetry 
In what distant deeps or skies 
Burnt the fire of thine eyes 
On what wings dare he aspire 
What the hand dare sieze the fire 
And what shoulder  what art 
Could twist the sinews of thy heart 
And when thy heart began to beat 
What dread hand  what dread feet 
What the hammer what the chain 
In what furnace was thy brain 
What the anvil what dread grasp 
Dare its deadly terrors clasp 
When the stars threw down their spears 
And watered heaven with their tears 
Did he smile his work to see 
Did he who made the Lamb make thee 
Tyger Tyger burning bright 
In the forests of the night 
What immortal hand or eye 
Dare frame thy fearful symmetry 
Or, using a file:
>>> with open('tyger.txt', 'r') as WmBlake:
...    print(re.sub(r'["-,!?.]','',WmBlake.read()))
If you want to build a list of lines:
>>> lines=[]
>>> with open('tyger.txt', 'r') as WmBlake:
...    for line in WmBlake:
...        lines.append(re.sub(r'["-,!?.]','',line))
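One detail of the pattern ["-,!?.] is worth knowing: inside a character class, "-, is read as a range from " to , and that range covers "#$%&'()*+, as well. That is why the & characters vanish in the printed poem, and it also means a literal - is not removed. A small sketch of the difference follows; the sample line and the second pattern are my own illustration, on the assumption that a literal hyphen was intended:

```python
import re

line = "And what shoulder, & what art - twist?"

# '"-,' is a range from '"' (code 34) to ',' (code 44): it matches & but not '-'.
as_written = re.sub(r'["-,!?.]', '', line)

# A hyphen placed first is literal; '"' and ',' are then single characters.
hyphen_literal = re.sub(r'[-",!?.]', '', line)

print(as_written)
print(hyphen_literal)
```

Which pattern is right depends on whether symbols such as & should be stripped along with the listed punctuation.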