python-3.x - Python: keep only lines with more than 3 words, counting only words made of letters

Question

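Code 1 below lets me keep only the lines that have more than 3 words. Some lines in my large text document contain non-alphabetic characters and 3 or fewer words, and I want to exclude those lines from my cleaned list as well. When I use .isalpha() in Code 2, the word count no longer seems to be done line by line. I am new to Python and would appreciate any help. The lines I want to keep are lines_clean = ["This is some text as an", "what I want to"]

Code 1: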

import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize
f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
for line in lines:
    words = word_tokenize(line)
    n_words = len(words)
    if n_words > 3:  # strictly more than 3 words, as described above
        lines_clean.append(line)
print(lines_clean)

Code 2 (not working as expected):

import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize
f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
alpha_only = []
for line in lines:
    words = word_tokenize(line)
    for word in words:
        if word.isalpha():
            alpha_only.append(word)     
    n_words = len(alpha_only)
    if n_words >=3:
        lines_clean.append(line)
print(lines_clean)
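
A likely cause of the behaviour in Code 2 is that alpha_only is created once, before the loop, so it keeps growing and len(alpha_only) becomes a running total over all lines rather than a per-line count. The `>=3` test also admits lines with exactly three alphabetic words, such as "with my #$ code", which is not in the desired output. The following is a minimal sketch of one possible fix, not a definitive implementation. It makes two assumptions: it swaps the NLTK tokenizers for the built-in str.splitlines() and str.split() so that it runs without extra downloads, and it strips surrounding whitespace from the kept lines. The function name keep_long_alpha_lines and the parameter min_words are hypothetical names introduced for illustration.

```python
def keep_long_alpha_lines(text, min_words=4):
    """Return stripped lines holding at least min_words purely alphabetic tokens.

    Assumption: whitespace splitting stands in for nltk's word_tokenize.
    """
    lines_clean = []
    for line in text.splitlines():
        # Build the list inside the loop so the count restarts for every line.
        alpha_only = [word for word in line.split() if word.isalpha()]
        # "More than 3 words" means at least 4 alphabetic tokens.
        if len(alpha_only) >= min_words:
            lines_clean.append(line.strip())
    return lines_clean


f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines_clean = keep_long_alpha_lines(f)
print(lines_clean)  # ['This is some text as an', 'what I want to']
```

The same change should carry over to the NLTK version: move `alpha_only = []` inside `for line in lines:` and compare with `> 3`. Note that whitespace splitting treats a token such as "code." as non-alphabetic, whereas word_tokenize would separate the trailing punctuation, so results can differ on punctuated text.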
