python - nltk: text classification using a custom feature set

Question

I have a dataset that looks like this:

featureDict = {identifier1: [[first 3-gram], [second 3-gram], ... [last 3-gram]],
               ...
               identifierN: [[first 3-gram], [second 3-gram], ... [last 3-gram]]}

I also have a dictionary of labels for the same set of documents:

labelDict = {identifier1: label1,
             ...
             identifierN: labelN}

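I want to find the most appropriate nltk container in which I can store this information in one place and seamlessly apply the nltk classifiers.

In addition, before I use any classifier on this dataset, I would also like to apply a tf-idf filter to this feature space.

References and documentation would be helpful.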

score 1 · Accepted Answer

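You just need a simple dict. NLTK classifiers train on a list of (featureset, label) pairs, where each featureset is a plain dict mapping feature names to values. Take a look at the snippet in the NLTK classify interface that uses a trained classifier.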
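As a minimal sketch of that conversion, the two dictionaries from the question can be merged into one list of (featureset, label) pairs. The helper name `to_labeled_featuresets`, the sample data, and the choice to join each 3-gram into a single string key with value `True` are assumptions made for illustration, not part of the original answer.

```python
def to_labeled_featuresets(feature_dict, label_dict):
    """Pair each document's feature dict with its label (hypothetical helper)."""
    labeled = []
    for doc_id, trigrams in feature_dict.items():
        if doc_id not in label_dict:
            continue  # skip documents that have no label
        # Lists cannot be dict keys, so each 3-gram is joined into one string
        # (assumption); the value True simply marks the 3-gram as present.
        features = {" ".join(map(str, trigram)): True for trigram in trigrams}
        labeled.append((features, label_dict[doc_id]))
    return labeled


# Made-up sample data in the shape described in the question.
feature_dict = {
    "doc1": [["the", "cat", "sat"], ["cat", "sat", "down"]],
    "doc2": [["stocks", "fell", "today"], ["fell", "today", "sharply"]],
}
label_dict = {"doc1": "pets", "doc2": "finance"}

train_set = to_labeled_featuresets(feature_dict, label_dict)
```

The resulting `train_set` can then be passed to a trainer such as `nltk.NaiveBayesClassifier.train(train_set)`, and the feature dict of an unseen document to the returned classifier's `classify` method.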

The reference documentation for this is still the nltk book: http://nltk.org/book/ch06.html and the API specification: http://nltk.org/api/nltk.classify.html

Here are some pages that may help you: http://snipperize.todayclose.com/snippet/py/Use-NLTK-Toolkit-to-Classify-Documents--5671027/, http://streamhacker.com/tag/feature-extraction/, http://web2dot5.wordpress.com/2012/03/21/text-classification-in-python/.

Also, keep in mind that nltk is limited in the classifier algorithms it provides. For more advanced exploration, you are better off using scikit-learn.
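For the tf-idf filtering step mentioned in the question, one possible approach is to score each 3-gram per document and keep only the highest-scoring ones before building the feature dicts. The sketch below uses plain Python; the function name `tfidf_filter`, the `top_k` parameter, the sample data, and the particular tf-idf variant (relative term frequency times the natural log of N/df) are all assumptions. With scikit-learn, `TfidfVectorizer` would be the ready-made alternative.

```python
import math
from collections import Counter


def tfidf_filter(feature_dict, top_k=2):
    """Keep, per document, the top_k 3-grams ranked by tf-idf (hypothetical helper)."""
    n_docs = len(feature_dict)
    # Join each 3-gram into one string so it can be counted and used as a key.
    docs = {
        doc_id: [" ".join(map(str, trigram)) for trigram in trigrams]
        for doc_id, trigrams in feature_dict.items()
    }
    # Document frequency: in how many documents each 3-gram occurs.
    df = Counter()
    for grams in docs.values():
        df.update(set(grams))

    filtered = {}
    for doc_id, grams in docs.items():
        tf = Counter(grams)
        scores = {
            gram: (count / len(grams)) * math.log(n_docs / df[gram])
            for gram, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda gram: (-scores[gram], gram))
        filtered[doc_id] = {gram: True for gram in ranked[:top_k]}
    return filtered


# Made-up sample data: "x y z" occurs in both documents, so its idf is 0.
feature_dict = {
    "d1": [["a", "b", "c"], ["b", "c", "d"], ["x", "y", "z"]],
    "d2": [["x", "y", "z"], ["e", "f", "g"]],
}
filtered = tfidf_filter(feature_dict, top_k=1)
```

The scores are used here only to select features, and the kept 3-grams are stored with the value `True`. NLTK's Naive Bayes classifier treats feature values as discrete categories, so passing raw tf-idf floats as values would probably not behave like a weighting.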
