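Python's csv module is a great library, but using it for simpler tasks can often be overkill. To me, this particular case is a classic example where using the csv module may overcomplicate things.
To me,
- just iterating through the file,
- splitting each line on its last comma and keeping the part before it,
- then splitting that part on whitespace,
- converting each word to lowercase,
- stripping punctuation and digits from both ends of each word,
- and collecting the results with a set comprehension
is a linear, straightforward approach.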
An example run with the following file content:
Lorem Ipsum is simply dummy "text" of the ,0
printing and typesetting; industry. Lorem,1
Ipsum has been the industry's standard ,2
dummy text ever since the 1500s, when an,3
unknown printer took a galley of type and,4
scrambled it to make a type specimen ,5
book. It has survived not only five ,6
centuries, but also the leap into electronic,7
typesetting, remaining essentially unch,8
anged. It was popularised in the 1960s with ,9
the release of Letraset sheets conta,10
ining Lorem Ipsum passages, and more rec,11
ently with desktop publishing software like,12
!!Aldus PageMaker!! including versions of,13
Lorem Ipsum.,14
>>> from string import digits, punctuation
>>> remove_set = digits + punctuation
>>> with open("test.csv") as fin:
...     words = {word.lower().strip(remove_set) for line in fin
...              for word in line.rsplit(",", 1)[0].split()}
...
>>> words
set(['and', 'pagemaker', 'passages', 'sheets', 'galley', 'text', 'is', 'in', 'it', 'anged', 'an', 'simply', 'type', 'electronic', 'was', 'publishing', 'also', 'unknown', 'make', 'since', 'when', 'scrambled', 'been', 'desktop', 'to', 'only', 'book', 'typesetting', 'rec', "industry's", 'has', 'ever', 'into', 'more', 'printer', 'centuries', 'dummy', 'with', 'specimen', 'took', 'but', 'standard', 'five', 'survived', 'leap', 'not', 'lorem', 'a', 'ipsum', 'essentially', 'unch', 'conta', 'like', 'ining', 'versions', 'of', 'industry', 'ently', 'remaining', 's', 'printing', 'letraset', 'popularised', 'release', 'including', 'the', 'aldus', 'software'])
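For anyone who wants the same steps as a reusable function, here is a minimal sketch for Python 3. The names `unique_words` and `REMOVE_CHARS` are made up for illustration, and the sample lines are fed from an in-memory `io.StringIO` object instead of `test.csv` so the snippet runs on its own. Unlike the one-liner above, it also skips tokens that turn out empty after stripping, which is an addition of this sketch.

```python
import io
from string import digits, punctuation

# Characters stripped from both ends of every word (hypothetical name).
REMOVE_CHARS = digits + punctuation


def unique_words(lines):
    """Collect the lowercase words found in the text column of each line."""
    words = set()
    for line in lines:
        # Split on the last comma only, so commas inside the text survive
        # and the trailing index column is dropped.
        text = line.rsplit(",", 1)[0]
        for word in text.split():
            cleaned = word.lower().strip(REMOVE_CHARS)
            if cleaned:  # skip tokens made only of digits/punctuation
                words.add(cleaned)
    return words


# A few lines from the sample file, held in memory for illustration.
sample = io.StringIO(
    'Lorem Ipsum is simply dummy "text" of the ,0\n'
    "dummy text ever since the 1500s, when an,3\n"
    "!!Aldus PageMaker!! including versions of,13\n"
)
result = unique_words(sample)
print(sorted(result))
```

Note that `str.strip` only removes characters from the ends of a word. That is why "industry's" keeps its apostrophe in the output above, and why "1500s," is reduced to "s". The `set([...])` display in the session above is the Python 2 repr; Python 3 prints the same set as `{...}`.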