python - python library for identifying article topic

Question

I have a large collection of articles (80,000) and I want to extract those that are about one topic. Is there a Python library or script to which I can give a manually chosen sample of articles about, say, topic A, and which would then extract the articles about topic A from the archive by comparing the words used and their frequencies?

I have read about the Dunning method, but is there a ready-made script I can use, preferably in Python?

Thanks

score 3 · Accepted Answer

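Check out the Natural Language Toolkit ( http://nltk.org ), an excellent Python library for processing and extracting meaning from natural-language corpora such as your collection of articles. In addition, depending on what else you want to do, I recommend the scikit-learn library ( http://scikit-learn.org/ ) for other machine learning tasks on the extracted text.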
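As a rough illustration of how scikit-learn could be applied here, the sketch below ranks archive articles by TF-IDF cosine similarity to the averaged seed articles. The function name `rank_by_topic` and its parameters are hypothetical, and this is only one possible approach under the assumption that the articles are English plain text, not something the answer itself prescribes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_topic(seed_articles, archive, top_n=3):
    """Return (archive_index, similarity) pairs, most similar first.

    Hypothetical helper: seed_articles is the hand-picked sample for
    topic A, archive is the full list of article texts.
    """
    archive = list(archive)
    seed_articles = list(seed_articles)
    vectorizer = TfidfVectorizer(stop_words="english")
    # Learn vocabulary and IDF weights from archive and seeds together.
    matrix = vectorizer.fit_transform(archive + seed_articles)
    archive_vecs = matrix[: len(archive)]
    seed_vecs = matrix[len(archive):]
    # Average the seed vectors into a single "topic profile" row.
    profile = np.asarray(seed_vecs.mean(axis=0)).reshape(1, -1)
    scores = cosine_similarity(archive_vecs, profile).ravel()
    # Sort by descending similarity and keep the top_n articles.
    order = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]
```

Articles whose similarity exceeds some threshold could then be treated as candidates for topic A; a sensible threshold would have to be chosen by inspecting the results on your own data.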

score 0 · Accepted Answer

Let me formalize my suggestions, if only for posterity.

0.) As far as I know, nothing does everything you are asking for out of the box and for free. For a paid option, search for "Google Enterprise Search".

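1.) Use Elasticsearch to index your documents as JSON. It is very easy to set up. Elasticsearch has plenty of basic search features; they will not solve your problem directly, but they will let you run simple keyword searches while you work on building your own search engine.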

2.) To search by topic, you will have to write a learning program. A very simple one, which is actually a good starting point for your problem, is here.
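As one illustration of what a very simple program of this kind might look like (it is not the linked example), the word-frequency comparison described in the question can be sketched with the standard library alone. The code uses the two-term log-likelihood (G2) keyness formula commonly associated with Dunning's method; the function names, the tokenizer, and the 3.84 cutoff (the 5% chi-square critical value for one degree of freedom) are all assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in c sample tokens, b times in d reference tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the sample
    e2 = d * (a + b) / (c + d)  # expected count in the reference
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


def topic_keywords(sample_docs, reference_docs, min_score=3.84):
    """Words over-represented in the topic sample, mapped to their G2 score."""
    sample = Counter(t for doc in sample_docs for t in tokenize(doc))
    reference = Counter(t for doc in reference_docs for t in tokenize(doc))
    c, d = sum(sample.values()), sum(reference.values())
    if c == 0 or d == 0:
        raise ValueError("both corpora must contain at least one token")
    keywords = {}
    for word, a in sample.items():
        b = reference[word]
        # Keep only words relatively more frequent in the sample.
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= min_score:
                keywords[word] = score
    return keywords


def score_article(text, keywords):
    """Average keyword weight per token; higher suggests closer to the topic."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(keywords.get(t, 0.0) for t in tokens) / len(tokens)
```

In use, `topic_keywords` would be called once with the hand-picked topic A articles as the sample and the rest of the archive (or a random subset of it) as the reference, and `score_article` would then be applied to each of the 80,000 articles to rank them.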
