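Code 1 below lets me keep only the lines that contain more than 3 words. Some lines in my large text document contain non-alphabetic characters and 3 or fewer words, and I also want to exclude those from my list of cleaned lines. When I use .isalpha() in Code 2, the word count no longer seems to be computed line by line. I am new to Python and would be grateful for any help. The lines I want to keep are lines_clean = ["This is some text as an", "what I want to"]
Code 1: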
import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize

f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
for line in lines:
    # Count every token on this line, including digits and symbols
    words = word_tokenize(line)
    n_words = len(words)
    if n_words >= 3:
        lines_clean.append(line)
print(lines_clean)
Code 2 (not working as expected):
import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize

f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
alpha_only = []
for line in lines:
    words = word_tokenize(line)
    for word in words:
        if word.isalpha():
            alpha_only.append(word)
    n_words = len(alpha_only)
    if n_words >= 3:
        lines_clean.append(line)
print(lines_clean)
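
One likely cause, for what it is worth: in Code 2 the list alpha_only is created once, before the loop over the lines, so it keeps collecting words from every earlier line and len(alpha_only) becomes a running total instead of a per-line count. Also, the desired output only comes out if a line needs more than 3 alphabetic words, since " with my #$ code" has exactly 3 and would pass a >= 3 check. Below is a minimal sketch of a per-line version under those assumptions. To keep it runnable without NLTK data it swaps in str.splitlines() and str.split() for line_tokenize and word_tokenize, so tokenization of punctuation may differ from NLTK on other inputs. The function name keep_long_alpha_lines and the parameter min_words are made up for illustration.

```python
text = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"


def keep_long_alpha_lines(text, min_words=4):
    """Return stripped lines holding at least `min_words` purely alphabetic words.

    Hypothetical helper; min_words=4 encodes "more than 3 words".
    """
    lines_clean = []
    for line in text.splitlines():
        # Build the list fresh for each line so counts never carry over
        # from previous lines (the accumulation problem in Code 2).
        alpha_only = [word for word in line.split() if word.isalpha()]
        if len(alpha_only) >= min_words:
            # strip() drops the leading/trailing spaces left by the split
            lines_clean.append(line.strip())
    return lines_clean


lines_clean = keep_long_alpha_lines(text)
print(lines_clean)
```

The same idea should carry over to the NLTK version: move alpha_only = [] inside the for line in lines: loop and compare with > 3.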