python - TextBlob NaiveBayesAnalyzer extremely slow (compared to Pattern)

Question

I'm using TextBlob for Python to do some sentiment analysis on tweets. The default analyzer in TextBlob is the PatternAnalyzer, which works reasonably well and is appreciably fast.

sent = TextBlob(tweet.decode('utf-8')).sentiment

I have now tried switching to the NaiveBayesAnalyzer and found the runtime impractical for my needs (approaching 5 seconds per tweet).

sent = TextBlob(tweet.decode('utf-8'), analyzer=NaiveBayesAnalyzer()).sentiment

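I have used the scikit-learn implementation of the Naive Bayes classifier before and didn't find it this slow, so I'm wondering whether I'm using it correctly here.

I'm assuming the analyzer is pretrained; at least the documentation states "Naive Bayes analyzer that is trained on a dataset of movie reviews." But it also has a function train(), which is described as "Train the Naive Bayes classifier on the movie review corpus." Does it internally train the analyzer before each run? I hope not.

Does anyone know of a way to speed this up?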

score 17 · Accepted Answer

Yes, TextBlob will train the analyzer before each run. You can use the following code to avoid training the analyzer every time.

from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer
tb = Blobber(analyzer=NaiveBayesAnalyzer())

print(tb("sentence you want to test").sentiment)
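
To illustrate why reusing one analyzer helps, here is a toy sketch. ToyAnalyzer is a hypothetical stand-in, not TextBlob code; it only assumes that the analyzer trains lazily on first use and that every newly constructed analyzer starts out untrained, which matches the behaviour described in the question.

```python
class ToyAnalyzer(object):
    """Hypothetical stand-in for an analyzer that trains lazily on first use."""

    train_calls = 0  # class-wide counter so we can see how often training runs

    def __init__(self):
        self._trained = False

    def train(self):
        # Placeholder for the expensive step (e.g. fitting on a corpus).
        ToyAnalyzer.train_calls += 1
        self._trained = True

    def analyze(self, text):
        if not self._trained:
            self.train()
        # Dummy result; a real analyzer would classify the text here.
        return len(text)


texts = ["first tweet", "second tweet", "third tweet"]

# Pattern from the question: a new analyzer per text, so training runs every time.
ToyAnalyzer.train_calls = 0
for t in texts:
    ToyAnalyzer().analyze(t)
calls_new_each_time = ToyAnalyzer.train_calls

# Pattern from this answer: one shared analyzer, so training runs only once.
ToyAnalyzer.train_calls = 0
shared = ToyAnalyzer()
for t in texts:
    shared.analyze(t)
calls_shared = ToyAnalyzer.train_calls

print(calls_new_each_time, calls_shared)
```

Blobber plays the role of the shared instance here: it holds on to the analyzer it was given and hands it to every blob it creates.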

score 0 · Answer

Adding to Alan's very useful answer: if you have tabular data in a dataframe and want to use TextBlob's NaiveBayesAnalyzer, then this works. Just change word_list to your relevant series of strings.

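import pandas as pd
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer

# Build the Blobber once so the analyzer is only trained once.
tb = Blobber(analyzer=NaiveBayesAnalyzer())

# df is your existing DataFrame; 'word_list' is the column of strings.
for index, row in df.iterrows():
    sent = tb(row['word_list']).sentiment
    df.loc[index, 'classification'] = sent[0]
    df.loc[index, 'p_pos'] = sent[1]
    df.loc[index, 'p_neg'] = sent[2]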

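The above splits the tuple that sentiment returns into three separate series.

This works if the series contains only strings. If it has mixed data types, which can be a problem in pandas with the object dtype, you might want to put a try/except block around it to catch exceptions.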
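
A minimal sketch of such a guard, assuming Python 3. safe_sentiment and fake_analyze are hypothetical names made up for illustration and are not part of TextBlob or pandas; fake_analyze only stands in for the real analyzer call so the snippet runs on its own.

```python
def safe_sentiment(analyze, value):
    """Return analyze(value) as a tuple, or (None, None, None) for unusable input.

    `analyze` is any callable that takes a string and returns a 3-item result.
    Hypothetical helper for illustration only.
    """
    if not isinstance(value, str):
        # Skip NaN, numbers, None and other non-text values.
        return (None, None, None)
    try:
        return tuple(analyze(value))
    except Exception:
        # Keep going on a bad row instead of aborting the whole loop.
        return (None, None, None)


def fake_analyze(text):
    # Stand-in for the real analyzer call so this sketch runs without TextBlob.
    return ("pos", 0.5, 0.5)


mixed = ["good movie", 3.14, None, "bad movie"]
results = [safe_sentiment(fake_analyze, v) for v in mixed]
print(results)
```

In the loop above you would pass something like lambda s: tb(s).sentiment as analyze and unpack the returned tuple into the three columns.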

I timed it, and in my tests it processes 1000 rows in about 4.7 seconds.

Hope this helps.
