python-3.x - How to check whether any terms remain after TfidfVectorizer pruning

Question

I am using TfidfVectorizer to score terms from many different corpora.
Here is my code:

tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words = 'english', min_df = 0.5)
for corpus in all_corpus:
    tfidf.fit_transform(corpus)

The number of documents varies from corpus to corpus, so when the vocabulary is built, some corpora end up with no terms and raise this error:

after pruning, no terms remain. Try a lower min_df or higher max_df

I do not want to change min_df or max_df. What I need is to skip the transformation when no terms remain, so I added a conditional check like this:

for corpus in all_corpus:
    tfidf.fit_transform(corpus)
    if tfidf.shape[0] > 0:
        # execute some code here

However, the condition does not work. Is there a way to solve this?

All answers and comments are much appreciated. Thank you.

score 2 · Accepted Answer

First, I believe a minimal working example of your problem looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words = 'english', min_df = 0.5)
tfidf.fit_transform(['not I you'])

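I could not reproduce the exact error message you shared, but this gives me a ValueError because all the words in my document are English stop words. (If you remove stop_words = 'english' from the snippet above, the code runs.)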
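For what it is worth, the pruning failure from the question can probably be triggered as well, by using a corpus in which every term appears in fewer than half of the documents. The sketch below is only an illustration: the helper name fit_error and the sample corpora are made up for this example. As far as I can tell, both failure modes surface as a ValueError, which is what matters for handling them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_error(corpus):
    """Return the ValueError message raised while fitting, or None on success.

    Hypothetical helper, used here only to compare the two failure modes.
    """
    tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english', min_df=0.5)
    try:
        tfidf.fit_transform(corpus)
    except ValueError as err:
        return str(err)
    return None


# Every word is an English stop word, so the vocabulary is empty from the start.
only_stop_words = ['not I you']

# Each term occurs in 1 of 3 documents (about 0.33), which is below min_df = 0.5,
# so every term is pruned.
all_terms_pruned = ['apple banana', 'cherry grape', 'lemon mango']

# 'apple' occurs in both documents and the others in 1 of 2, so all pass min_df = 0.5.
fits_fine = ['apple banana', 'apple cherry']

for name, corpus in [('only_stop_words', only_stop_words),
                     ('all_terms_pruned', all_terms_pruned),
                     ('fits_fine', fits_fine)]:
    print(name, '->', fit_error(corpus))
```

Because both cases raise the same exception type, a single except ValueError clause covers them.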

One way to handle the error inside a for loop is to use a try/except block.

for corpus in all_corpus:
    try:
        tfidf.fit_transform(corpus)
    except ValueError:
        print('Transforming process skipped')
        # Here you can do more stuff
        continue  # go to the beginning of the for-loop to start the next iteration
    # Here goes the rest of the code for corpus for which the transform functioned
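As a side note on why the original condition fails: the vectorizer object itself has no shape attribute. The shape belongs to the sparse matrix that fit_transform returns, with one row per document and one column per term. Since fit_transform raises before returning when no terms remain, a shape check is never reached for an empty corpus. The following is a minimal sketch of the try/except approach that keeps the results; the function name score_corpora and the sample data are assumptions made for illustration, not part of the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def score_corpora(all_corpus, **vectorizer_kwargs):
    """Fit one vectorizer per corpus; store None for corpora left with no terms.

    Hypothetical helper: returns a list with one entry per corpus, either
    (terms, matrix) or None when fitting raised a ValueError.
    """
    results = []
    for corpus in all_corpus:
        tfidf = TfidfVectorizer(**vectorizer_kwargs)
        try:
            matrix = tfidf.fit_transform(corpus)
        except ValueError:
            # Empty vocabulary or everything pruned: skip this corpus.
            results.append(None)
            continue
        # vocabulary_ maps each term to its column index in the matrix.
        terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
        results.append((terms, matrix))
    return results


all_corpus = [
    ['apple banana', 'apple cherry'],               # fits normally
    ['not I you'],                                  # only stop words
    ['grape lemon', 'mango peach', 'plum melon'],   # every term below min_df
]

results = score_corpora(all_corpus, ngram_range=(1, 1),
                        stop_words='english', min_df=0.5)
for index, result in enumerate(results):
    if result is None:
        print(index, 'skipped')
    else:
        terms, matrix = result
        print(index, terms, matrix.shape)
```

Creating a fresh vectorizer per corpus is a choice made in this sketch; reusing one instance as in the loop above should behave the same way, because each fit_transform call rebuilds the vocabulary.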
