I am learning about data mining. My dream is to develop a system that receives a small text (a few sentences) and returns a dictionary mapping phrases from the text to the most relevant tags from a database. For example:
Input (from NYTimes website): "LOS ANGELES — The Walt Disney Company, in an effort to address concerns about entertainment’s role in childhood obesity, plans to announce on Tuesday that all products advertised on its child-focused television channels, radio stations and Web sites must comply with a strict new set of nutritional standards."
Output:
"LOS ANGELES": [USA, California, Los_Angeles, city],
"The Walt Disney Company": [Walt_Disney, Corporation, USA, movies, entertainment],
"childhood obesity": [childhood, illness, health],
"all products advertised": [product, advertisement],
"television channel": [TV, broadcast, advertisement],
"radio station": [Radio, broadcast, advertisement],
"web sites": [Web, broadcast, advertisement]
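For concreteness, this is the target output written as a plain Python dict (tags as strings; the exact tag vocabulary is just my working assumption):

```python
# Target output structure: each phrase found in the input text maps to a
# list of tag strings drawn from the tag database.
expected_output = {
    "LOS ANGELES": ["USA", "California", "Los_Angeles", "city"],
    "The Walt Disney Company": ["Walt_Disney", "Corporation", "USA",
                                "movies", "entertainment"],
    "childhood obesity": ["childhood", "illness", "health"],
    "all products advertised": ["product", "advertisement"],
    "television channel": ["TV", "broadcast", "advertisement"],
    "radio station": ["Radio", "broadcast", "advertisement"],
    "web sites": ["Web", "broadcast", "advertisement"],
}

# Tags can then be looked up per phrase:
print(expected_output["childhood obesity"])  # ['childhood', 'illness', 'health']
```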
I have downloaded the English and Spanish Wikipedia dumps. So far I have managed to extract all the titles, and the words within the titles, using Python, lxml, and NLTK. Now I am writing a program to map the network of links between the articles in the dumps, the links to external sites, and so on. I am also thinking about extracting the infoboxes. I plan to publish the Python code on GitHub this week; at the moment I am commenting and testing it.
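To show where I am, the title-extraction step looks roughly like this: a streaming parse of the dump XML so the whole file never sits in memory. I am showing the stdlib `xml.etree.ElementTree` version here (lxml's `etree.iterparse` has the same interface); the namespace URI is an assumption, since each dump declares its exact export schema version in the root element:

```python
import io
import xml.etree.ElementTree as etree  # lxml.etree offers the same iterparse API

# Assumed export namespace; check the root element of your actual dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Tiny stand-in for a real dump file, for illustration only.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Los Angeles</title></page>
  <page><title>The Walt Disney Company</title></page>
</mediawiki>"""

def iter_titles(source):
    """Yield page titles from a MediaWiki export stream, one at a time."""
    for _event, elem in etree.iterparse(source):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            if title:
                yield title
            elem.clear()  # free the processed <page> subtree

print(list(iter_titles(io.BytesIO(SAMPLE_DUMP))))
# ['Los Angeles', 'The Walt Disney Company']
```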
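For the link network, a first sketch of what I have in mind is pulling internal links out of the wikitext with a regex: links look like `[[Target]]` or `[[Target|display text]]`, so capturing everything up to the first `|`, `]`, or `#` yields the target article. This is only a sketch; it ignores templates, files, and interwiki prefixes:

```python
import re

# Capture the link target: text after "[[" up to a pipe, closing bracket,
# or section anchor ("#").
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

wikitext = ("'''Los Angeles''' is a city in [[California]], "
            "[[United States|USA]]. See also [[Hollywood]].")

links = [target.strip() for target in LINK_RE.findall(wikitext)]
print(links)  # ['California', 'United States', 'Hollywood']
```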
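For the infobox idea, the naive approach I am considering is matching `| field = value` lines inside the template. This only works for flat, one-field-per-line infoboxes; real infoboxes nest templates, so a dedicated wikitext parser (e.g. mwparserfromhell) is probably the safer route. The sample infobox below is made up for illustration:

```python
import re

# Hypothetical flat infobox, one "| field = value" per line.
wikitext = """{{Infobox company
| name     = The Walt Disney Company
| industry = Entertainment
| location = Burbank, California
}}"""

# Match "| field = value" lines; breaks on nested templates and multi-line values.
FIELD_RE = re.compile(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", re.MULTILINE)

infobox = dict(FIELD_RE.findall(wikitext))
print(infobox["industry"])  # Entertainment
```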
What advice can you give me? Do you think this proposal is feasible?