I am learning about data mining. My dream is to develop a system that receives a small text (a few sentences) and returns a dictionary mapping phrases from the text to the most relevant tags from a database. For example:
Input (from NYTimes website): "LOS ANGELES — The Walt Disney Company, in an effort to address concerns about entertainment’s role in childhood obesity, plans to announce on Tuesday that all products advertised on its child-focused television channels, radio stations and Web sites must comply with a strict new set of nutritional standards."
Output:
"LOS ANGELES": [USA, California, Los_Angeles, city],
"The Walt Disney Company": [Walt_Disney, Corporation, USA, movies, entertainment],
"childhood obesity": [childhood, illness, health],
"all products advertised": [product, advertisement],
"television channel": [TV, broadcast, advertisement],
"radio station": [Radio, broadcast, advertisement],
"web sites": [Web, broadcast, advertisement]
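For concreteness, this is the target output written as a plain Python dict (tags as strings; the exact tag vocabulary is just my working assumption):

```python
# Target output structure: each phrase found in the input text maps to a
# list of tag strings drawn from the tag database.
expected_output = {
    "LOS ANGELES": ["USA", "California", "Los_Angeles", "city"],
    "The Walt Disney Company": ["Walt_Disney", "Corporation", "USA",
                                "movies", "entertainment"],
    "childhood obesity": ["childhood", "illness", "health"],
    "all products advertised": ["product", "advertisement"],
    "television channel": ["TV", "broadcast", "advertisement"],
    "radio station": ["Radio", "broadcast", "advertisement"],
    "web sites": ["Web", "broadcast", "advertisement"],
}

# Tags can then be looked up per phrase:
print(expected_output["childhood obesity"])  # ['childhood', 'illness', 'health']
```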
I have downloaded the English and Spanish Wikipedia dumps. So far I have managed to extract all the titles, and the words within the titles, using Python, lxml, and NLTK. Now I am writing a program to map the network of links between the articles in the dumps, the links to external sites, and so on. I am also thinking about extracting the infoboxes. I plan to publish the Python code on GitHub this week; at the moment I am commenting and testing it.
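To show where I am, the title-extraction step looks roughly like this: a streaming parse of the dump XML so the whole file never sits in memory. I am showing the stdlib `xml.etree.ElementTree` version here (lxml's `etree.iterparse` has the same interface); the namespace URI is an assumption, since each dump declares its exact export schema version in the root element:

```python
import io
import xml.etree.ElementTree as etree  # lxml.etree offers the same iterparse API

# Assumed export namespace; check the root element of your actual dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Tiny stand-in for a real dump file, for illustration only.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Los Angeles</title></page>
  <page><title>The Walt Disney Company</title></page>
</mediawiki>"""

def iter_titles(source):
    """Yield page titles from a MediaWiki export stream, one at a time."""
    for _event, elem in etree.iterparse(source):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            if title:
                yield title
            elem.clear()  # free the processed <page> subtree

print(list(iter_titles(io.BytesIO(SAMPLE_DUMP))))
# ['Los Angeles', 'The Walt Disney Company']
```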
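For the link network, a first sketch of what I have in mind is pulling internal links out of the wikitext with a regex: links look like `[[Target]]` or `[[Target|display text]]`, so capturing everything up to the first `|`, `]`, or `#` yields the target article. This is only a sketch; it ignores templates, files, and interwiki prefixes:

```python
import re

# Capture the link target: text after "[[" up to a pipe, closing bracket,
# or section anchor ("#").
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

wikitext = ("'''Los Angeles''' is a city in [[California]], "
            "[[United States|USA]]. See also [[Hollywood]].")

links = [target.strip() for target in LINK_RE.findall(wikitext)]
print(links)  # ['California', 'United States', 'Hollywood']
```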
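For the infobox idea, the naive approach I am considering is matching `| field = value` lines inside the template. This only works for flat, one-field-per-line infoboxes; real infoboxes nest templates, so a dedicated wikitext parser (e.g. mwparserfromhell) is probably the safer route. The sample infobox below is made up for illustration:

```python
import re

# Hypothetical flat infobox, one "| field = value" per line.
wikitext = """{{Infobox company
| name     = The Walt Disney Company
| industry = Entertainment
| location = Burbank, California
}}"""

# Match "| field = value" lines; breaks on nested templates and multi-line values.
FIELD_RE = re.compile(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", re.MULTILINE)

infobox = dict(FIELD_RE.findall(wikitext))
print(infobox["industry"])  # Entertainment
```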
What advice can you give me? Do you think this proposal is feasible?