
I am trying to extract some data from an HTML page using scrapely.

The page I am trying to scrape contains HTML tags that hold both some text to extract and an inner tag whose content also needs to be extracted. As a result, when I try to train the scraper I get a FragmentAlreadyAnnotated exception, because the classifier ends up annotating the same outer HTML tag for both strings.

Does anyone know how to work around this?

I put together a minimal working example for you to experiment with:

import json

from scrapely import HtmlPage, Scraper

train_html = """<!doctype html>
<html>
<head>
    <title>Example</title>
</head>

<body>
    <p><span>Example 1</span> * 2018</p>
    <p><span>Example 2</span> * 2017</p>
    <p><span>Example 3</span> * 2016</p>
</body>
</html>"""

test_html = """<!doctype html>
<html>
<head>
    <title>Example</title>
</head>

<body>
    <p><span>Example A</span> * 2015</p>
    <p><span>Example B</span> * 2014</p>
    <p><span>Example C</span> * 2013</p>
</body>
</html>"""

if __name__ == '__main__':
    train_page = HtmlPage(url='http://example.com/', page_id=1, body=train_html)
    train_data = {
        'special': ['Example 1', 'Example 2', 'Example 3'],
        'year': ['2018', '2017', '2016']
    }
    test_page = HtmlPage(url='http://example.com/', page_id=2, body=test_html)

    s = Scraper()
    s.train_from_htmlpage(train_page, train_data)

    matches = s.scrape_page(test_page)
    print(json.dumps(matches, indent=4))

    print('Done.')

When I run this script, I get the following:

Traceback (most recent call last):
  File "/Users/stefano/Workspace/2018/re-searcher/src/main/python/researcher/mwe.py", line 40, in <module>
    s.train_from_htmlpage(train_page, train_data)
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/__init__.py", line 44, in train_from_htmlpage
    tm.annotate(field, best_match(value))
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/template.py", line 44, in annotate
    self.annotate_fragment(i, field)
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/template.py", line 83, in annotate_fragment
    raise FragmentAlreadyAnnotated("Fragment already annotated: %s" % fstr)
scrapely.template.FragmentAlreadyAnnotated: Fragment already annotated: <span data-scrapy-annotate="{&quot;annotations&quot;: {&quot;content&quot;: &quot;year&quot;}}">

whereas I expected something like:

[
    {
        "year": [
            "2015",
            "2014",
            "2013"
        ],
        "special": [
            "Example A",
            "Example B",
            "Example C"
        ]
    }
]
Done.

Thanks in advance!

Bonus question: do you know whether there is a way to associate each special with its nearest year? Note that in some cases the year may be missing:

<body>
    <p><span>Example D</span> * 2012</p>
    <p><span>Example E</span></p>
    <p><span>Example F</span> * 2011</p>
</body>
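To make the pairing I have in mind concrete, here is a plain-regex sketch of the association I am after (this is only an illustration of the desired output, not a scrapely-based solution; the regex assumes the exact markup shown above):

```python
import re

html = """
<body>
    <p><span>Example D</span> * 2012</p>
    <p><span>Example E</span></p>
    <p><span>Example F</span> * 2011</p>
</body>
"""

# Pair each <span> text with the year (if any) found in the same <p>;
# a missing year yields None instead of shifting the pairing.
pairs = [
    (m.group(1), m.group(2))
    for m in re.finditer(r'<p><span>([^<]+)</span>(?:\s*\*\s*(\d{4}))?</p>', html)
]
print(pairs)  # [('Example D', '2012'), ('Example E', None), ('Example F', '2011')]
```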

1 Answer


Not a real answer, but a hack.

I wrote a function that uses regular expressions to strip redundant whitespace and newlines, and then looks for patterns of the form <X><Y>some text</Y>more text</X>, replacing them with <X><Y>some text</Y><span>more text</span></X> (the regex may fail on some edge cases; if you find any, please suggest a fix below).
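For example, applied to a single paragraph from the training HTML, the substitution behaves like this:

```python
import re

# The core substitution: wrap the trailing mixed-content text in its own <span>.
pattern = r'<([^>]+)>([^<]*)<([^>]+)>([^<]+)</([^>]+)>([^<]+)</([^>]+)>'
repl = r'<\1>\2<\3>\4</\5><span>\6</span></\7>'

before = '<p><span>Example 1</span> * 2018</p>'
after = re.sub(pattern, repl, before)
print(after)  # <p><span>Example 1</span><span> * 2018</span></p>
```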

By preprocessing any HTML with this function, the error above never occurs, and the expected result is (almost; note the asterisks) produced, i.e.:

[
    {
        "year": [
            "* 2015",
            "* 2014",
            "* 2013"
        ],
        "special": [
            "Example A",
            "Example B",
            "Example C"
        ]
    }
]

The modified code is below:

import json
import re

from scrapely import HtmlPage, Scraper


def fix(html: str) -> str:
    # Collapse runs of whitespace, then remove the remaining space between tags.
    html = re.sub(r'\s+', ' ', html)
    html = re.sub(r'> <', '><', html)
    # Wrap the trailing mixed-content text in its own <span> so that scrapely
    # can annotate it independently of the inner tag.
    html = re.sub(
        r'<([^>]+)>([^<]*)<([^>]+)>([^<]+)</([^>]+)>([^<]+)</([^>]+)>',
        r'<\1>\2<\3>\4</\5><span>\6</span></\7>',
        html
    )
    return html


def clean(text: str) -> str:
    # Replace each tag with a space (non-greedy, so multiple tags on one line
    # are removed individually), then collapse whitespace.
    return re.sub(r'\s+', ' ', re.sub(r'<.*?>', ' ', text)).strip()


if __name__ == '__main__':
    train_page = HtmlPage(url='http://example.com/', page_id=1, body=fix(train_html))
    train_data = {
        'special': ['Example 1', 'Example 2', 'Example 3'],
        'year': ['2018', '2017', '2016']
    }
    test_page = HtmlPage(url='http://example.com/', page_id=2, body=fix(test_html))

    s = Scraper()
    s.train_from_htmlpage(train_page, train_data)

    matches = s.scrape_page(test_page)
    for match in matches:
        for key in match:
            match[key] = [clean(value) for value in match[key]]
    print(json.dumps(matches, indent=4))

    print('Done.')
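As an aside, clean() is only meant to strip tags and collapse whitespace; on a typical extracted value it behaves like this (sketched here with a non-greedy tag pattern so that multiple tags on one line are removed individually):

```python
import re

def clean(text: str) -> str:
    # Replace each tag with a space (non-greedy), then collapse whitespace.
    return re.sub(r'\s+', ' ', re.sub(r'<.*?>', ' ', text)).strip()

print(clean('<span> * 2015 </span>'))  # * 2015
```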
Answered 2018-06-20T12:53:36.823