How do you go about parsing an HTML page with free text, lists, tables, headings, etc., into sentences?
Take this wikipedia page for example. There is/are:
- free text: http://en.wikipedia.org/wiki/Neurotransmitter#Discovery
- lists: http://en.wikipedia.org/wiki/Neurotransmitter#Actions
- tables: http://en.wikipedia.org/wiki/Neurotransmitter#Common_neurotransmitters
After messing around with the python NLTK, I want to test out all of these different corpus annotation methods (from http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html#deciding-which-layers-of-annotation-to-include):
- Word Tokenization: The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.
- Sentence Segmentation: As we saw in Chapter 3, sentence segmentation can be more difficult than it seems. Some corpora therefore use explicit annotations to mark sentence segmentation.
- Paragraph Segmentation: Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.
- Part of Speech: The syntactic category of each word in a document.
- Syntactic Structure: A tree structure showing the constituent structure of a sentence.
- Shallow Semantics: Named entity and coreference annotations, semantic role labels.
- Dialogue and Discourse: dialogue act tags, rhetorical structure
Once you break a document into sentences it seems pretty straightforward. But how do you go about breaking down something like the HTML from that Wikipedia page? I am very familiar with using HTML/XML parsers and traversing the tree, and I have tried just stripping the HTML tags to get the plain text, but because punctuation is missing after HTML is removed, NLTK doesn't parse things like table cells, or even lists, correctly.
Is there some best-practice or strategy for parsing that stuff with NLP? Or do you just have to manually write a parser specific to that individual page?
Just looking for some pointers in the right direction, really want to try this NLTK out!