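I have a corpus of about 500 pre-classified articles. For each category I picked out the most frequently used nouns and adjectives and ranked them by relevance.

Each category (world, business, technology, entertainment, science, health, sports) has a few hundred words associated with it.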
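
To make the setup concrete, here is a minimal sketch of scoring an article against per-category word lists. The word lists, the weights, and the `score_article` helper are all hypothetical: the real lists hold a few hundred words each, and the actual scoring formula is not shown in this question.

```python
import re
from collections import Counter

# Hypothetical, heavily shortened word lists: word -> relevance weight.
CATEGORY_WORDS = {
    "sports": {"game": 3, "player": 3, "team": 2},
    "technology": {"console": 3, "screen": 2, "game": 1},
}


def score_article(text, category_words=CATEGORY_WORDS):
    """Sum weight * occurrences for every listed word, per category."""
    counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    return {
        category: sum(weight * counts[word] for word, weight in words.items())
        for category, words in category_words.items()
    }


# A sentence about a console still leans towards sports with these weights.
scores = score_article("Each player can pick up a game on the new console.")
print(scores)
```

With weights like these, "game" and "player" pull a hardware article towards sports, which is the same effect described below.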

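I have run into a problem with this article: http://www.techhive.com/article/2052311/hands-on-with-the-2ds-an-entry-level-investment.html

It is about gaming. Judging by the articles I have seen, words such as "game", "player", etc. are strongly associated with sports.

The article scores as follows: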

{u'business': 51, u'entertainment': 58, u'science': 48, u'sports': 62, u'health': 35, u'world': 48, u'technology': 59}

As you can see, technology is up there with 59, but it is beaten by sports with 62.
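
For illustration, the scores above can be ranked and the gap between the top two categories measured. This is only a sketch: `MIN_MARGIN` is a hypothetical threshold introduced here to flag close calls, not something from my current code.

```python
# Scores copied from above (the u'' prefixes are dropped here).
scores = {
    "business": 51, "entertainment": 58, "science": 48, "sports": 62,
    "health": 35, "world": 48, "technology": 59,
}

# Rank categories from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
(best, best_score), (runner_up, runner_up_score) = ranked[0], ranked[1]
margin = best_score - runner_up_score

# MIN_MARGIN is a hypothetical threshold for "too close to call".
MIN_MARGIN = 5
is_ambiguous = margin < MIN_MARGIN

print(best, runner_up, margin, is_ambiguous)  # sports technology 3 True
```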

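I am hoping this will sort itself out once my corpus grows to a few thousand articles, but I don't know whether that is feasible.

Do you have any ideas for solving this?

I have considered keeping a list of giveaway words such as "Twitter", "Facebook", "technology", "Nintendo", etc. that would automatically put the article in technology whenever they appear. The only problem is finding the relevant words and avoiding collisions with business/world and so on.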
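
A minimal sketch of that giveaway-word idea, assuming the scores are already computed as a dict. The `GIVEAWAY_WORDS` table and the `classify` helper are hypothetical names, and the word list is just the handful of examples mentioned above.

```python
import re

# Hypothetical giveaway words; a real list would need curating so that it
# does not collide with business/world vocabulary.
GIVEAWAY_WORDS = {
    "technology": {"twitter", "facebook", "technology", "nintendo"},
}


def classify(text, scores, giveaways=GIVEAWAY_WORDS):
    """Return a giveaway category if one of its words appears, else the top score."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    for category, words in giveaways.items():
        if tokens & words:
            return category
    return max(scores, key=scores.get)


scores = {
    "business": 51, "entertainment": 58, "science": 48, "sports": 62,
    "health": 35, "world": 48, "technology": 59,
}
print(classify("Hands-on with the Nintendo 2DS", scores))  # technology
print(classify("The team won the match", scores))          # sports
```

If several categories had giveaway lists, the first match in dict order would win here, which is exactly where the collision problem shows up.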

Thanks.


1 Answer


The gaming category is bound to blur with hunting, war correspondence, pen-and-paper RPGs, and anything else that has a game version of it.

I think what you are really trying to do is differentiate fact from fiction. Building on the idea you proposed, you could take the fiction section and the fact section of a library and reduce them to a short-list and a long-list of keywords.

Edit: I have only just discovered this, but the typical 'hello world' example for a map-reduce framework such as Disco, which is word frequency analysis, should let you simply point it at a set of URLs that you know are either fact or fiction. You would end up with two lists of tuples, which you can then filter down to the keywords that most clearly indicate fact or fiction.
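
The filtering step can be sketched in plain Python. This uses `collections.Counter` rather than Disco, and the sample texts, the `word_counts` and `distinctive_words` helpers, and the `min_ratio` threshold are all assumptions made for illustration. It also compares raw counts, so it assumes the two corpora are of similar size.

```python
import re
from collections import Counter


def word_counts(documents):
    """Return a Counter of word frequencies across an iterable of texts."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def distinctive_words(target, other, min_ratio=3.0):
    """Words at least min_ratio times more frequent in target than in other.

    Add-one smoothing keeps words absent from `other` from dividing by zero.
    """
    return sorted(
        word for word, count in target.items()
        if (count + 1) / (other[word] + 1) >= min_ratio
    )


# Made-up stand-ins for the texts you would fetch from the two URL sets.
fact_docs = ["The report cites the census data.",
             "Census data shows the report was right."]
fiction_docs = ["The dragon guarded the castle.",
                "The knight fled the dragon and the castle."]

fact_counts = word_counts(fact_docs)
fiction_counts = word_counts(fiction_docs)

fact_keywords = distinctive_words(fact_counts, fiction_counts)
fiction_keywords = distinctive_words(fiction_counts, fact_counts)
print(fact_keywords, fiction_keywords)
```

`Counter.most_common()` gives the (word, count) tuples for each side if you want to inspect the raw lists before filtering. Common words such as "the" appear on both sides and drop out of both keyword lists.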

answered 2013-10-28T14:30:10.453