python - Remove punctuation and create a .csv file with a word list, flagging whether punctuation was present

Question

This is what I have so far:

import re
import csv

outfile1 = open('test_output.csv', 'wt')
outfileWriter1 = csv.writer(outfile1, delimiter=',')

rawtext = open('rawtext.txt', 'r').read()
print(rawtext)

rawtext = rawtext.lower()
print(rawtext)

re.sub('[^A-Za-z0-9]+', '', rawtext)
print(rawtext)

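First, when I run it, the punctuation is not removed, so I'm wondering whether something is wrong with my expression?

Second, I'm trying to generate a .csv list of all the words, each flagged for whether it contained punctuation. For example, a text file containing "Hello! It's a nice day." would output: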

ID, PUNCTUATION, WORD
1,  Y,           hello
2,  Y,           its
3,  N,           a
4,  N,           nice
5,  Y,           day

I know I can use .split() to split the words, but beyond that I don't know what to do. Any help would be much appreciated.

score 0 · Accepted Answer

Try this version:

import string
import csv

header = ('ID', 'PUNCTUATION', 'WORD')
# newline='' lets the csv module control line endings
with open('test_output.csv', 'w', newline='') as outf, open('rawtext.txt') as inf:
    writer = csv.DictWriter(outf, header, delimiter=',')
    writer.writeheader()
    word_id = 0
    for line in inf:
        for word in line.split():
            # drop every punctuation character from the word
            stripped = ''.join(c for c in word if c not in string.punctuation)
            word_id += 1
            # write one row per word; 'Y' if stripping changed the word
            writer.writerow({
                'ID': word_id,
                'PUNCTUATION': 'Y' if stripped != word else 'N',
                'WORD': stripped.lower(),
            })
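
As for the first part of the question: `re.sub` returns a new string and leaves `rawtext` unchanged, so the result has to be assigned back. The original pattern would also delete the spaces between words. Below is a minimal sketch of a fix, assuming the goal is to keep letters, digits, and whitespace; the helper name `strip_punctuation` is hypothetical and only used for illustration.

```python
import re

def strip_punctuation(text):
    """Lowercase text and drop everything except letters, digits and whitespace."""
    # re.sub returns a new string, so the result must be returned or assigned.
    # \s stays in the character class so the words remain separated.
    return re.sub(r'[^a-z0-9\s]+', '', text.lower())

rawtext = "Hello! It's a nice day."
rawtext = strip_punctuation(rawtext)  # assign the result back
print(rawtext)  # hello its a nice day
```

After this step, `rawtext.split()` gives the cleaned words, although the information about which words had punctuation is lost, which is why the version above compares each word before and after stripping.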

score 0 · Accepted Answer

You can do it like this:

from string import punctuation
import csv

strs = "Hello! It's a nice day."

# newline='' lets the csv module control line endings
with open('abc.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(['ID', 'PUNCTUATION', 'WORD'])
    # map every punctuation character to None so str.translate deletes it
    table = dict.fromkeys(map(ord, punctuation))
    # use enumerate to get the word as well as its 1-based index
    for i, word in enumerate(strs.split(), 1):
        # str.translate is typically faster than a regex here
        new_strs = word.translate(table)
        # if the new word differs from the original, punctuation was removed: use 'Y'
        punc = 'Y' if new_strs != word else 'N'
        # lowercase the word to match the expected output in the question
        writer.writerow([i, punc, new_strs.lower()])
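
To check the result without touching the disk, the same idea can be wrapped in functions and written to an in-memory buffer. This is only a sketch: the names `tag_words` and `rows_to_csv` are hypothetical, and `str.maketrans` is used here as an alternative way to build the deletion table.

```python
import csv
import io
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_TABLE = str.maketrans('', '', punctuation)

def tag_words(text):
    """Return (id, flag, word) tuples; flag is 'Y' if the word had punctuation."""
    rows = []
    for i, word in enumerate(text.split(), 1):
        cleaned = word.translate(_TABLE)
        flag = 'Y' if cleaned != word else 'N'
        rows.append((i, flag, cleaned.lower()))
    return rows

def rows_to_csv(rows):
    """Serialize the rows, with a header line, into a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(['ID', 'PUNCTUATION', 'WORD'])
    writer.writerows(rows)
    return buf.getvalue()

rows = tag_words("Hello! It's a nice day.")
print(rows_to_csv(rows))
```

For the sample sentence this yields the five rows from the question: hello (Y), its (Y), a (N), nice (N), day (Y).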
