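I am trying to use scrapely to extract some data from an HTML page.
The page I am trying to scrape contains HTML tags that hold both some text to be scraped and an inner tag whose content also needs to be scraped. As a result, when I try to train the scraper I get a FragmentAlreadyAnnotated exception, because the classifier ends up annotating the same HTML tag for both strings.
Does anyone know how to work around this?
I have created a minimal working example for you to experiment with: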
import json
from scrapely import HtmlPage, Scraper
train_html = """<!doctype html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p><span>Example 1</span> * 2018</p>
<p><span>Example 2</span> * 2017</p>
<p><span>Example 3</span> * 2016</p>
</body>
</html>"""
test_html = """<!doctype html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p><span>Example A</span> * 2015</p>
<p><span>Example B</span> * 2014</p>
<p><span>Example C</span> * 2013</p>
</body>
</html>"""
if __name__ == '__main__':
    train_page = HtmlPage(url='http://example.com/', page_id=1, body=train_html)
    train_data = {
        'special': ['Example 1', 'Example 2', 'Example 3'],
        'year': ['2018', '2017', '2016']
    }
    test_page = HtmlPage(url='http://example.com/', page_id=2, body=test_html)
    s = Scraper()
    s.train_from_htmlpage(train_page, train_data)
    matches = s.scrape_page(test_page)
    print(json.dumps(matches, indent=4))
    print('Done.')
When I try to run this script, I get the following:
Traceback (most recent call last):
  File "/Users/stefano/Workspace/2018/re-searcher/src/main/python/researcher/mwe.py", line 40, in <module>
    s.train_from_htmlpage(train_page, train_data)
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/__init__.py", line 44, in train_from_htmlpage
    tm.annotate(field, best_match(value))
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/template.py", line 44, in annotate
    self.annotate_fragment(i, field)
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/template.py", line 83, in annotate_fragment
    raise FragmentAlreadyAnnotated("Fragment already annotated: %s" % fstr)
scrapely.template.FragmentAlreadyAnnotated: Fragment already annotated: <span data-scrapy-annotate="{"annotations": {"content": "year"}}">
whereas I expected something like:
[
    {
        "year": [
            "2015",
            "2014",
            "2013"
        ],
        "special": [
            "Example A",
            "Example B",
            "Example C"
        ]
    }
]
Done.
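Thanks in advance!
Bonus question: do you know whether there is a way to have each special associated with its closest year? Note that in some cases the year might be missing: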
<body>
<p><span>Example D</span> * 2012</p>
<p><span>Example E</span></p>
<p><span>Example F</span> * 2011</p>
</body>
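To make the pairing I am after concrete, here is a rough sketch that uses only the standard library's html.parser instead of scrapely. It assumes that every special sits in a <span> inside a <p>, and that the year, when present, follows a literal " * " separator, as in the snippets above. PairParser and pair_specials_with_years are names made up for this illustration; none of this is scrapely's API, and it only shows the output shape I would like to get.

```python
from html.parser import HTMLParser


class PairParser(HTMLParser):
    """Collect one (special, year) pair per <p>; year is None when absent.

    Hypothetical helper for illustration only, not part of scrapely.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []       # finished (special, year) tuples
        self.in_span = False  # True while inside a <span>
        self.special = None   # span text of the current <p>
        self.tail = ''        # text of the current <p> outside the <span>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            # A new paragraph starts: forget whatever was collected before.
            self.special, self.tail = None, ''
        elif tag == 'span':
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False
        elif tag == 'p' and self.special is not None:
            # Assumption: the year is whatever remains after dropping '*'.
            year = self.tail.replace('*', '').strip() or None
            self.pairs.append((self.special, year))

    def handle_data(self, data):
        if self.in_span:
            self.special = data.strip()
        else:
            self.tail += data


def pair_specials_with_years(html):
    """Return a list of (special, year) tuples, one per paragraph."""
    parser = PairParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample = """<body>
<p><span>Example D</span> * 2012</p>
<p><span>Example E</span></p>
<p><span>Example F</span> * 2011</p>
</body>"""

print(pair_specials_with_years(sample))
# [('Example D', '2012'), ('Example E', None), ('Example F', '2011')]
```

Ideally the scraper would give me this kind of per-paragraph grouping, with a missing year reported as empty rather than shifting the later years up by one position.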