
I am learning about data mining. My dream is to develop a system that receives a short text (a few sentences) and returns a dictionary mapping phrases from the text to the most relevant tags from a database. For example (a minimal sketch of such a pipeline follows the sample output),

Input (from NYTimes website): "LOS ANGELES — The Walt Disney Company, in an effort to address concerns about entertainment’s role in childhood obesity, plans to announce on Tuesday that all products advertised on its child-focused television channels, radio stations and Web sites must comply with a strict new set of nutritional standards."

Output:

"LOS ANGELES" : [USA, California, Los_Angeles, city], 
"The Walt Disney Company": [Walt_Disney, Corporation, USA, movies, entertainment], 
"childhood obesity" : [childhood, illness, health],
"all products advertised": [product, advertisement,
"television channel": [TV, broadcast, advertisement],
"radio station": [Radio, broadcast, advertisement],
"web sites": [Web, broadcast, advertisement]

I have downloaded the English and Spanish Wikipedia dumps. So far I have managed to extract all the titles, and the words from the titles, with Python, lxml, and NLTK. Now I am developing a program to find the network of links between the articles in the dumps, the links to external sites, and so on. I am also thinking about extracting the infoboxes. I am going to publish the Python code on GitHub this week; right now I am commenting and testing it.
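
To illustrate what I mean by the link-network step, here is a minimal streaming-parse sketch; the export schema namespace version is an assumption, so check the root element of your own dump file.

```python
import re
from lxml import etree

# Namespace of the MediaWiki export schema; the version number
# varies between dumps (assumption: export-0.10).
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Matches [[Target]] and [[Target|label]], ignoring section anchors.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

def iter_links(dump_path):
    """Stream (title, [linked titles]) pairs from a Wikipedia XML dump."""
    for _, page in etree.iterparse(dump_path, tag=NS + "page"):
        title = page.findtext(NS + "title")
        text = page.findtext(NS + "revision/" + NS + "text") or ""
        yield title, LINK_RE.findall(text)
        page.clear()  # free the parsed element so memory stays flat
```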

What advice can you give me? Do you think this proposal is feasible?


1 Answer


Instead of processing the raw Wikipedia dumps by hand, I suggest you look at DBpedia. DBpedia harvests Wikipedia and structures it so that the relationships are easy to query.
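
For example, DBpedia exposes a public SPARQL endpoint that you can query from Python with the SPARQLWrapper package. This sketch pulls the category tags for one resource; the Disney URI is just an illustration.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")  # public DBpedia endpoint
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?category WHERE {
        <http://dbpedia.org/resource/The_Walt_Disney_Company>
            dcterms:subject ?category .
    }
""")
sparql.setReturnFormat(JSON)

# Print each category URI (of the form http://dbpedia.org/resource/Category:...).
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["category"]["value"])
```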

There are other projects that scrape Wikipedia as well, such as Semantic MediaWiki. Freebase and WordNet may also be useful sources of information; WordNet is a dictionary/thesaurus that shows many kinds of relationships between words.
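
Since you are already using NLTK, note that it bundles a WordNet interface, and hypernyms map directly onto the broader-category tags in your example output ("illness", "health"). A small illustration:

```python
from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

# Take the first sense of "obesity" and walk one level up the
# hypernym hierarchy to collect broader-category tag candidates.
synset = wn.synsets("obesity")[0]
print(synset.definition())
for hypernym in synset.hypernyms():
    print(hypernym.lemma_names())
```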

answered Jun 6, 2012 at 14:11