Questions tagged [scrapely]

7 questions

0 votes

2 answers

1386 views

python - How to extract a list of items using scrapely?

I'm using scrapely to extract data from some HTML, but I'm having difficulties extracting a list of items.

The Scrapely GitHub project describes only a simple example:

This is nice if, for example, you are trying to extract data as described:

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows that section is a quick example of the simplest possible usage, that you can run in a Python shell.

However, I'm not sure how to extract the data when the page contains something like:

I know I can extract this using XPath or CSS selectors, but I'm more interested in using parsers that can extract data from similar pages.

2016-05-31T20:58:27.157

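0 votes

0 answers

221 views

python - Extracting the href attribute when training with Scrapely

I'm using Scrapely to extract data fields from HTML. Using train and then scrape as described in the documentation, I'm unable to extract the href attribute from a link. Is there a way to extract the href attribute in a way similar to extracting the text of an element?

In the training example above, the given URL is the only href attribute of an a tag on that page, so I would expect the algorithm to be able to learn to find it.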
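For comparison, here is a minimal sketch of reading href values directly with Python's standard-library `html.parser`. It is not Scrapely's template mechanism and does not answer whether Scrapely can learn attributes; the `HrefCollector` class, the `extract_hrefs` function, and the sample HTML are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return all href values found on <a> tags, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Assumed sample markup, not taken from the question.
sample = '<div><p>Intro</p><a href="http://example.com/item/1">Item</a></div>'
print(extract_hrefs(sample))
```

Unlike a trained Scrapely template, this sketch has to be told which tag and attribute to read, so it does not generalize to similar pages on its own.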

python web-scraping scrapely

2016-06-09T16:50:23.150

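0 votes

2 answers

190 views

python - How to scrape every page under every topic

I need to scrape every page under every category. Currently I can go into a listed category and scrape every page by following the next-page link. What I want is to enter a category, scrape every page in it, and once that is done, move on to the next category and do the same. Sometimes categories have other categories nested inside them.

For example: https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_unv_b_1_173508_2 (<- this is the book listing). On the left there are categories (Arts & Photography, Audible Audiobooks, ...). Under each category, for example Arts & Photography, there are more categories (Architecture, Business of Art, ...); under Architecture there are more (Buildings, Criticism, ...); and under Buildings there are more again (Landmarks & Monuments, Religious Buildings, ...). Once you reach Landmarks & Monuments, that is the root node, and it has 100 pages of listings. So what I want is to go into Arts & Photography and keep descending into each subcategory until I hit a root node, scrape all the listings on every page, and then move to the sibling nodes: roll back, go into Religious Buildings, finish it, roll back, go into the next category under Buildings, finish every category under Buildings, roll back, go into Criticism, and so on. In short, scrape pretty much every book under every subcategory listed on Amazon.

Right now I have this logic to go through every page in the category given in start_urls. (Note: I can't really list every category in the start URL list, because there are too many.) The code below works and scrapes every page listed under the one category given in the start URL. What I need is an idea of how to make it automatically jump into the next subcategory and, after finishing it, come back and go into the next subcategory to do the same, and so on.

Can someone help? Thanks.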
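The traversal described above is a depth-first walk of the category tree. A minimal sketch of that idea follows, using in-memory dictionaries in place of real HTTP requests. The function `crawl_category`, the three callables it takes, and the `SITE` and `PAGES` data are all hypothetical; they are not the asker's spider and make no claim about Amazon's actual page structure.

```python
def crawl_category(url, get_subcategories, get_listing_pages, scrape_page, seen=None):
    """Depth-first walk: descend into subcategories first; when a category
    has none (the question's "root node"), scrape each of its listing pages."""
    if seen is None:
        seen = set()
    if url in seen:  # guard against categories that link back to each other
        return []
    seen.add(url)

    items = []
    children = get_subcategories(url)
    if children:
        for child in children:
            items.extend(
                crawl_category(child, get_subcategories, get_listing_pages, scrape_page, seen)
            )
    else:
        for page_url in get_listing_pages(url):
            items.extend(scrape_page(page_url))
    return items


# Assumed stand-in data: category -> subcategories, leaf category -> page URLs.
SITE = {
    "books": ["arts", "audio"],
    "arts": ["architecture"],
    "architecture": ["landmarks", "religious"],
}
PAGES = {
    "landmarks": ["landmarks?pg=1", "landmarks?pg=2"],
    "religious": ["religious?pg=1"],
    "audio": ["audio?pg=1"],
}


def get_subcategories(url):
    return SITE.get(url, [])


def get_listing_pages(url):
    return PAGES.get(url, [])


def scrape_page(page_url):
    return ["book from " + page_url]


result = crawl_category("books", get_subcategories, get_listing_pages, scrape_page)
print(result)
```

In a Scrapy spider the same shape would usually come from the parse callback yielding one request per subcategory link, and yielding pagination requests only when no subcategory links are found. With that approach Scrapy's scheduler decides the visiting order, so it may not be strictly depth-first.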

python xpath scrapy scrapely

2017-02-02T16:43:03.020

0 votes

1 answer

199 views