
How do you go about parsing an HTML page with free text, lists, tables, headings, etc., into sentences?

Take this Wikipedia page for example. It contains free text, lists, tables, headings, and so on.

After messing around with the Python NLTK, I want to test out all of these different corpus annotation methods (from http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html#deciding-which-layers-of-annotation-to-include); a rough sketch of how I understand the first few layers follows the list:

  • Word Tokenization: The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.
  • Sentence Segmentation: As we saw in Chapter 3, sentence segmentation can be more difficult than it seems. Some corpora therefore use explicit annotations to mark sentence segmentation.
  • Paragraph Segmentation: Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.
  • Part of Speech: The syntactic category of each word in a document.
  • Syntactic Structure: A tree structure showing the constituent structure of a sentence.
  • Shallow Semantics: Named entity and coreference annotations, semantic role labels.
  • Dialogue and Discourse: Dialogue act tags, rhetorical structure.
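
For concreteness, here is roughly how I understand the first few layers map onto NLTK calls (the nltk.download model names are my assumption from the docs):

import nltk

# one-time model downloads; the names are an assumption from the NLTK docs
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "Dr. Smith went to Washington. He arrived at 3 p.m. and left at 5."
sentences = nltk.sent_tokenize(text)                 # sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]  # word tokenization
tagged = [nltk.pos_tag(t) for t in tokens]           # part of speech
print(tagged)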

Once you break a document into sentences, the rest seems pretty straightforward. But how do you go about breaking down something like the HTML from that Wikipedia page? I am very familiar with using HTML/XML parsers and traversing the tree, and I have tried just stripping the HTML tags to get the plain text, but because punctuation is missing after the HTML is removed, NLTK doesn't parse things like table cells, or even lists, correctly.

Is there some best practice or strategy for parsing that stuff with NLP? Or do you just have to manually write a parser specific to that individual page?

Just looking for some pointers in the right direction; I really want to try NLTK out!


4 Answers


It sounds like you're stripping out all the HTML and producing one flat document, which confuses the parser because the loose pieces run together. Since you are experienced with XML, I suggest mapping your input onto a simple XML structure that keeps the pieces separate. You can make it as simple as you like, but you may want to preserve some information; for example, it can be useful to mark titles, section headings, and so on as such. Once you have a workable XML tree that keeps the chunks apart, you can use XMLCorpusReader to import it into the NLTK world, as sketched below.
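
A minimal sketch of that last step, assuming the tree has been written to corpus/wiki_page.xml with <paragraph> elements (both names are placeholders):

from nltk.corpus.reader import XMLCorpusReader

# point the reader at a directory and file; both names are placeholders
reader = XMLCorpusReader('corpus', ['wiki_page.xml'])

# words() flattens the XML and tokenizes all of its character data
print(reader.words('wiki_page.xml')[:20])

# xml() returns an ElementTree element for manual traversal
tree = reader.xml('wiki_page.xml')
for para in tree.iter('paragraph'):
    print(para.text)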

Answered 2012-07-01T16:33:01.143

I had to write rules specific to the XML documents I was analyzing.

What I did was map HTML tags to segments. This mapping was based on studying several documents/pages and determining what each HTML tag represents. E.g. <h1> is a phrase; <li> is a paragraph; <td> is a token.

If you want to work with XML, you can represent the new mapping as tags. E.g. <h1> to <phrase>; <li> to <paragraph>; <td> to <token> (a rough sketch of this rewriting follows below).

If you want to work with plain text, you can represent the mapping as sets of characters (e.g. [PHRASESTART][PHRASEEND]), much like POS or EOS markings.
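
A rough sketch of that tag rewriting with lxml (the mapping and the file name page.html are illustrative; the real rules come from studying your own pages):

from lxml import html

# hypothetical mapping derived from inspecting the source pages
TAG_MAP = {'h1': 'phrase', 'li': 'paragraph', 'td': 'token'}

with open('page.html') as f:
    doc = html.fromstring(f.read())

# rename each mapped tag in place; lxml allows assigning to .tag
for old_tag, new_tag in TAG_MAP.items():
    for el in doc.iter(old_tag):
        el.tag = new_tag

print(html.tostring(doc, pretty_print=True).decode())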

Answered 2013-12-06T23:44:28.933

As alexis answered, python-goose could be a good option.

There is also HTML Sentence Tokenizer, a (new) library that aims to solve this exact problem. Its syntax is very simple: in one line, parsed_sentences = HTMLSentenceTokenizer().feed(example_html_one), you get the sentences of an HTML page stored in the array parsed_sentences.
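
Expanded into a runnable sketch; note that the import path below is an assumption, so check the library's documentation for the exact module name:

# the module name here is an assumption, not taken from the library's docs
from html_sentence_tokenizer import HTMLSentenceTokenizer

with open('page.html') as f:  # placeholder file name
    example_html_one = f.read()

# feed() returns the page's sentences as a list
parsed_sentences = HTMLSentenceTokenizer().feed(example_html_one)
print(parsed_sentences[:5])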

Answered 2018-02-01T17:37:05.447

You can use a tool like python-goose, which is designed to extract articles from HTML pages.

Otherwise, I made the following small program, which works kind of well:

from html5lib import parse


with open('page.html') as f:
    doc = parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]


def sanitize(element):
    """Retrieve all the text contained in an element as a single line of
    text. This must be executed only on blocks that have only inlines
    as children.
    """
    # join all the strings and remove \n
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # collapse runs of whitespace into a single space
    out = ' '.join(out.split())
    return out


def extract_text(element):
    # named extract_text so it does not shadow html5lib's parse above
    # these elements can contain other blocks inside them
    if element.tag in ['div', 'li', 'a', 'body', 'ul']:
        if element.text is None or element.text.isspace():
            for child in element:  # iterate over direct children
                yield from extract_text(child)
        else:
            yield sanitize(element)
    # these elements are "guaranteed" to contain only inlines
    elif element.tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        yield sanitize(element)
    else:
        try:
            print('> ignored', element.tag)
        except UnicodeEncodeError:
            # some consoles cannot print every tag name
            pass


# keep only blocks long enough to be real prose
for e in filter(lambda x: len(x) > 80, extract_text(body)):
    print(e)
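
A possible next step, not part of the original answer: feed each extracted block to NLTK's sentence tokenizer to get the sentences the question asks about.

import nltk

# sentence-segment each sufficiently long text block
for block in filter(lambda x: len(x) > 80, extract_text(body)):
    for sentence in nltk.sent_tokenize(block):
        print(sentence)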
Answered 2016-11-10T20:38:56.107