I'm using Scrapely to extract data from some HTML, but I'm having difficulty extracting a list of items.

The Scrapely GitHub project describes only a simple example:

from scrapely import Scraper
s = Scraper()

s.train(url, data)
s.scrape(another_url)

This is nice if, for example, you are trying to extract data as described:

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows that section is a quick example of the simplest possible usage, that you can run in a Python shell.

However, I'm not sure how to extract the data when the page contains something like:

Ingredientes

- 50 gr de hojas de albahaca
- 4 cucharadas (60 ml) de piñones
- 2 - 4 dientes de ajo
- 120 ml (1/2 vaso) de aceite de oliva virgen extra
- 115 gr de queso parmesano recién rallado
- 25 gr de queso pecorino recién rallado ( o queso de leche de oveja curado)

I know I can extract this using XPath or CSS selectors, but I'm more interested in using parsers that can extract data from similar pages.

2 Answers

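Scrapely can be trained to extract lists of items. The trick is to pass the first and last items of the list you want to extract as a Python list when training. Here is an example inspired by the question (training: a 10-item ingredient list from url1; test: a 7-item list from url2):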

from scrapely import Scraper

s = Scraper()

# Train on one page, passing the first and last items of the list
url1 = 'http://www.sabormediterraneo.com/recetas/postres/leche_frita.htm'
data = {'ingreds': ['medio litro de leche',   # first and last items
                    u'canela y az\xfacar para espolvorear']}
s.train(url1, data)

# Scrape a different page with the trained template
url2 = 'http://www.sabormediterraneo.com/recetas/cordero_horno.htm'
print(s.scrape(url2))

The output:

[{u'ingreds': [
  u' 2 piernas o dos paletillas de cordero lechal o recental ',
  u'3 dientes de ajo',
  u'una copita de vino tinto / o / blanco',
  u'una copita de agua',
  u'media copita de aceite de oliva',
  u'or\xe9gano, perejil',
  u'sal, pimienta negra y aceite de oliva']}]

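Training on the ingredient list from the question ( http://www.sabormediterraneo.com/cocina/salsas6.htm ) did not generalize directly to the "recetas" pages. One solution would be to train several scrapers and then check which one works on a given page. (In my quick tests, training a single scraper on several pages did not give a general solution.)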
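The "train several scrapers and check which one works" idea could be sketched roughly as below. This is only an illustration: `scrape_with_first_match` and `FakeScraper` are hypothetical names, the stand-in class replaces trained Scrapely scrapers so the sketch runs offline, and treating an exception, a `None` result, or an empty field as "did not work" is an assumption about how a failed scrape shows up.

```python
def scrape_with_first_match(scrapers, url, field):
    """Return the first non-empty list scraped for `field`, or None."""
    for scraper in scrapers:
        try:
            records = scraper.scrape(url)
        except Exception:
            continue  # this scraper failed on the page; try the next one
        for record in records or []:
            items = record.get(field)
            if items:
                return items
    return None


class FakeScraper(object):
    """Hypothetical stand-in for a trained scraper (offline demo only)."""

    def __init__(self, records):
        self.records = records

    def scrape(self, url):
        return self.records


scrapers = [
    FakeScraper(None),                               # nothing extracted
    FakeScraper([{'ingreds': []}]),                  # field present but empty
    FakeScraper([{'ingreds': ['3 dientes de ajo']}]),
]
found = scrape_with_first_match(scrapers, 'http://example.com/recipe', 'ingreds')
print(found)
```

With real scrapers, the list would hold one trained `Scraper` per page layout, tried in order.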

answered 2016-06-06T10:07:28.030
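Scrapely can extract lists of items from structured lists (e.g. <ul> or <ol>) - see the other answer. However, because it extracts content using HTML/document fragments, it cannot extract text-formatted data held inside a single tag with no delimiting tags ( <li></li> ), which seems to be what you are trying to do here.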

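However, if you are able to select the entire ingredients block, you can easily post-process the data you receive to get the required output. For instance, in your example, .split('\n')[3:-2] would give you the ingredients as the following list: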

['- 50 gr de hojas de albahaca',
 '- 4 cucharadas (60 ml) de piñones',
 '- 2 - 4 dientes de ajo',
 '- 120 ml (1/2 vaso) de aceite de oliva virgen extra',
 '- 115 gr de queso parmesano recién rallado',
 '- 25 gr de queso pecorino recién rallado ( o queso de leche de oveja curado)']
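The fixed slice `[3:-2]` depends on exactly how many lines precede and follow the list in the scraped block. As an alternative sketch, not something the answer above prescribes, the lines could be filtered by their leading `- ` marker. The function name `ingredient_lines` is hypothetical and the sample text is a shortened, made-up version of the block in the question.

```python
def ingredient_lines(block):
    """Return the lines of `block` that look like '- item' bullets."""
    lines = (line.strip() for line in block.split('\n'))
    return [line for line in lines if line.startswith('- ')]


# Shortened sample of a scraped text block (assumed shape, for illustration)
sample = "\n".join([
    "Ingredientes",
    "",
    "- 50 gr de hojas de albahaca",
    "- 2 - 4 dientes de ajo",
    "",
])
items = ingredient_lines(sample)
print(items)
```

This keeps working if the number of header or trailing lines changes, at the cost of assuming every item starts with `- `.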

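If you want to do this regularly (or need to add post-processing to several fields), you could subclass the Scraper class as follows to add a custom method: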

from scrapely import Scraper


class PostprocessScraper(Scraper):

    def scrape_page_postprocess(self, page, processors=None):
        # processors maps a field name to a function applied to each scraped item
        if processors is None:
            processors = {}

        result = self.scrape_page(page)
        for r in result:
            for field, items in r.items():
                if field in processors:
                    fn = processors[field]
                    r[field] = [fn(i) for i in items]

        return result

This new method, scrape_page_postprocess, accepts a dictionary of post-processors, keyed by field, to run on the returned data. For example:

processors = {'ingredients': lambda s: s.split('\n')[3:-2]}
# scraper is assumed to be a trained PostprocessScraper instance
scraper.scrape_page_postprocess(page, processors)
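To see what shape the processed data takes without running a scrape, the same per-field loop can be sketched as a standalone function applied to a hand-written result. The function name `apply_processors` and the sample data are made up for illustration; the only assumption carried over is that a scrape result is a list of dicts whose values are lists of strings, as in the output shown in the other answer.

```python
def apply_processors(result, processors):
    """Apply processors[field] to every item of that field, in every record."""
    processed = []
    for record in result:
        new_record = {}
        for field, items in record.items():
            fn = processors.get(field)
            new_record[field] = [fn(i) for i in items] if fn else items
        processed.append(new_record)
    return processed


# Hand-written stand-in for a scrape result: three header lines, two
# ingredient lines and two trailing lines inside a single text block.
result = [{
    'title': ['Pesto'],
    'ingredients': ['h1\nh2\nh3\n- albahaca\n- ajo\nt1\nt2'],
}]
processors = {'ingredients': lambda s: s.split('\n')[3:-2]}
processed = apply_processors(result, processors)
print(processed)
```

Because the processor runs once per scraped item and itself returns a list, the processed field is a list of lists, one inner list per scraped block.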
answered 2016-06-06T10:04:46.010