
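Code 1 below lets me keep only the lines that have more than 3 words. Some lines in my large text document contain non-alphabetic characters and 3 or fewer words, and I also want to exclude those from my list of cleaned lines. When I use .isalpha() in Code 2, the word count no longer seems to be done line by line. I am new to Python and would be grateful if anyone could help. The lines I want to keep are lines_clean = ["This is some text as an", "what I want to"]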

Code 1:

import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize
f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
for line in lines:
    words = word_tokenize(line)
    n_words = len(words)
    if n_words >=3:
        lines_clean.append(line)
print(lines_clean)

Code 2 (not working as expected):

import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize
f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
alpha_only = []
for line in lines:
    words = word_tokenize(line)
    for word in words:
        if word.isalpha():
            alpha_only.append(word)     
    n_words = len(alpha_only)
    if n_words >=3:
        lines_clean.append(line)
print(lines_clean)
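
A likely cause, offered as a sketch rather than a definitive fix: in Code 2, alpha_only is created once before the loop, so it keeps growing across lines and len(alpha_only) becomes a running total instead of a per-line count. Rebuilding the list for every line restores the line-by-line behaviour. The version below makes two assumptions that are not in the original code: it uses plain str.split() and str.splitlines() instead of word_tokenize and line_tokenize so that it runs without NLTK data, and it uses a hypothetical helper keep_wordy_lines with a threshold of 4 alphabetic words ("more than 3") to match the desired output.

```python
text = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"


def keep_wordy_lines(text, min_words=4):
    """Keep lines that contain at least `min_words` purely alphabetic tokens."""
    lines_clean = []
    for line in text.splitlines():
        # Rebuilt for every line, so counts never carry over from earlier lines.
        alpha_only = [tok for tok in line.split() if tok.isalpha()]
        if len(alpha_only) >= min_words:
            # strip() drops the leading/trailing spaces left around each line.
            lines_clean.append(line.strip())
    return lines_clean


lines_clean = keep_wordy_lines(text)
print(lines_clean)
```

If you prefer to stay with NLTK, the same idea should apply: move alpha_only = [] inside the for line in lines: loop and compare against the threshold you actually want.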