python - Automatic document classification with Python: gaming article classified as sports

Question

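I have a corpus of about 500 pre-classified articles. For each category, I picked the most frequently used nouns and adjectives and ranked them by relevance.

Each category (world, business, technology, entertainment, science, health, sports) has a few hundred words associated with it.

I ran into a problem with this article: http://www.techhive.com/article/2052311/hands-on-with-the-2ds-an-entry-level-investment.html

It is about gaming. Based on the articles I have seen, words like "game", "players", etc. are strongly associated with sports.

The article scored as follows: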

{u'business': 51, u'entertainment': 58, u'science': 48, u'sports': 62, u'health': 35, u'world': 48, u'technology': 59}

As you can see, technology is up there at 59, but it is beaten by sports at 62.
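To make the size of the gap concrete, here is a minimal sketch that ranks the scores above and reports the margin between the top two categories. The `scores` dict is copied from the output shown; treating a small margin as "ambiguous" is only an illustrative assumption, not something the scoring code already does.

```python
# Scores copied from the article's output above.
scores = {
    "business": 51, "entertainment": 58, "science": 48, "sports": 62,
    "health": 35, "world": 48, "technology": 59,
}

# Sort categories from highest to lowest score.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
(top_cat, top_score), (second_cat, second_score) = ranked[0], ranked[1]

# The margin between the winner and the runner-up.
margin = top_score - second_score
print(top_cat, second_cat, margin)
```

Sports wins by only 3 points over technology, so a handful of shared words such as "game" or "players" is enough to decide the outcome.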

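I would hope that this gets resolved if my corpus grows to a few thousand articles, but I don't know whether that will be possible.

Do you have any ideas for solving this?

I have considered keeping a list of giveaway words such as "Twitter, Facebook, technology, Nintendo, etc." which, if present, would automatically push the article toward technology. The only problem would be finding the words to go with that and avoiding conflicts with business/world etc.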
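A rough sketch of that giveaway-word idea might look like the following. The names `GIVEAWAYS`, `tokenize`, `classify`, and `min_hits` are hypothetical, the word list contains only the examples mentioned above, and requiring at least two distinct giveaway words with no tie between categories is just one assumed way to limit clashes with business/world; otherwise the function falls back to the highest keyword score.

```python
import re

# Hypothetical giveaway lists; only the example words from the question.
GIVEAWAYS = {
    "technology": {"twitter", "facebook", "technology", "nintendo"},
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def classify(text, scores, giveaways=GIVEAWAYS, min_hits=2):
    """Return a category, letting giveaway words override the scores.

    The override applies only when exactly one category has the most
    distinct giveaway words and that count is at least min_hits.
    """
    tokens = set(tokenize(text))
    hits = {cat: len(tokens & words) for cat, words in giveaways.items()}
    if hits:
        best_hits = max(hits.values())
        leaders = [cat for cat, n in hits.items() if n == best_hits]
        if best_hits >= min_hits and len(leaders) == 1:
            return leaders[0]
    # No clear giveaway signal: fall back to the highest keyword score.
    return max(scores, key=scores.get)

scores = {"business": 51, "entertainment": 58, "science": 48,
          "sports": 62, "health": 35, "world": 48, "technology": 59}
print(classify("Nintendo fans on Twitter discuss the new game", scores))
```

Adding giveaway lists for other categories would exercise the tie check: an article that hits two categories equally keeps its score-based label.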

Thanks.

score 0 · Accepted Answer

The gaming category will blur with hunting, war correspondence, pen-and-paper RPGs, and so on: anything that has a game version of it.

I think you are looking to differentiate fact from fiction. Building on the idea you proposed, you could take the fiction section and the fact section of a library and reduce them to a short list and a long list of keywords.

Edit: It's something that I have only just discovered, but the typical "hello world" example of a map-reduce framework such as Disco, which is word frequency analysis, should let you simply point it at a set of URLs which you know are either fact or fiction. You would end up with two lists of (word, count) tuples, which you can then filter down to the keywords that most clearly indicate fact or fiction.
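As a minimal sketch of that filtering step, the following uses plain Python with `collections.Counter` in place of a map-reduce framework (it does not use Disco's API). The function names, the tiny in-memory documents, and the `min_count` and `ratio` thresholds are all illustrative assumptions.

```python
import re
from collections import Counter

def word_counts(documents):
    """Count word occurrences across an iterable of text documents."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def distinctive_words(counts_a, counts_b, min_count=2, ratio=3.0):
    """Words that are relatively much more frequent in A than in B.

    Uses add-one smoothing so words missing from B do not divide by zero.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    keep = []
    for word, n in counts_a.items():
        if n < min_count:
            continue
        freq_a = (n + 1) / (total_a + 1)
        freq_b = (counts_b[word] + 1) / (total_b + 1)
        if freq_a / freq_b >= ratio:
            keep.append(word)
    return sorted(keep)

# Toy stand-ins for the fact and fiction collections.
fact_docs = ["the reactor output rose", "the reactor was measured"]
fiction_docs = ["the dragon flew", "the dragon slept near the castle"]

fact = word_counts(fact_docs)
fiction = word_counts(fiction_docs)
print(distinctive_words(fact, fiction, ratio=2.0))
print(distinctive_words(fiction, fact, ratio=2.0))
```

Common words such as "the" appear in both collections at similar rates and are filtered out, which is the property you want from the keyword lists.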
