Questions tagged [scrapy-shell]



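160 questions

0 votes

1 answer

4752 views

web-scraping - Scrapy error: "NotSupported: Unsupported URL scheme '': no handler available for that scheme"

I am trying to scrape a website, but when I run the script I get the following error:

NotSupported: Unsupported URL scheme '': no handler available for that scheme

If the rules are not wrong, why does this happen, and what would you suggest? Please help me. Thank you very much.

The code is here: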

2017-04-03T20:38:35.697
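The code for the NotSupported question above was not preserved, so the actual cause cannot be confirmed. The error text itself says that a request carried an empty URL scheme. As a minimal sketch, assuming the URLs come from extracted links, one way to spot such values before requesting them is to parse each one with the standard library. The helper name `has_supported_scheme` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def has_supported_scheme(url, allowed=("http", "https")):
    """Return True when the URL parses to one of the allowed schemes."""
    return urlparse(url).scheme in allowed


# Hypothetical candidate URLs, e.g. values pulled from href attributes.
candidates = [
    "https://example.com/page",  # absolute URL with an explicit scheme
    "/relative/path",            # relative link, no scheme
    "://example.com/page",       # malformed link, no scheme
]

for url in candidates:
    print(repr(url), has_supported_scheme(url))
```

Values that fail the check would need to be joined against the page URL or discarded before a request is built from them.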

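0 votes

1 answer

41 views

python - How do I get the data for each ad on this page?

I am scraping this page to get the data for each ad: http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?

This is my code in the scrapy shell:

But it extracts only 48 on every iteration! The desired output should be:

48 months

48 months

48 months

36 months

48 months

48 months

48 months

48 months

48 months

36 months

matching the ads on the page. Any suggestions?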

python xpath scrapy scrapy-spider scrapy-shell

2017-04-29T18:42:07.113
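The shell code for the question above is missing, so the following is only a guess at the cause. A common reason for getting the same value on every iteration is running the inner query against the whole document instead of the current element. In Scrapy selectors this is the difference between `//...` and `.//...` inside the loop. The sketch below shows the same mistake with the standard library's `xml.etree.ElementTree` as a stand-in, not Scrapy's selector API. The markup and class names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the listing page.
DOC = """
<root>
  <div class="ad"><span class="term">48 months</span></div>
  <div class="ad"><span class="term">36 months</span></div>
  <div class="ad"><span class="term">48 months</span></div>
</root>
"""

root = ET.fromstring(DOC)
ads = root.findall(".//div[@class='ad']")

# Mistake: searching from the document root inside the loop always
# returns the first match in the whole document.
wrong = [root.find(".//span[@class='term']").text for _ in ads]

# Fix: search relative to the current ad element.
right = [ad.find(".//span[@class='term']").text for ad in ads]

print(wrong)
print(right)
```

If the original loop used an absolute XPath such as `//span/text()` on each ad selector, prefixing it with a dot would scope it to that ad.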

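0 votes

1 answer

2969 views

python - Scrapy shell returns no response

I have a small problem scraping a website. I followed the Scrapy tutorial to learn how to scrape a site, and I wanted to test it on 'https://www.leboncoin.fr', but the spider does not work. So I tried:

But I get no response from the site.

If I use:

it prints an AttributeError:

AttributeError: 'NoneType' object has no attribute 'body'

Edit 1:

To rrschmidt: the full log has been updated. When I run

I get this:

So, how can I fix this?

Thanks for your answers,

Chris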

python python-3.x attributeerror scrapy-shell

2017-05-15T07:41:55.787

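0 votes

1 answer

113 views

scrapy - Python Scrapy 302 (I want to go back to the original page)

I want to scrape this page: https://movie.douban.com/subject/1292052/

But the URL redirects to http://m.douban.com/movie/subject/1292052. How can I get back to the first page and continue with the parsing (XPath) written for it? Thanks!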

scrapy scrapy-shell

2017-05-23T13:09:34.583

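0 votes

1 answer

1437 views

python - Changing the value of an HTML element with Scrapy

I am trying to scrape data from this website: website link.

I want to download all the PDF files for a specific date.

I managed to get the files from the first page and download them correctly, but I cannot change the date, so I cannot go back to earlier dates and get the older PDFs.

I tried this line:

scrapy.FormRequest.from_response(response,formxpath='//table//td//input[@type="text"]', formdata={'value': "20.05.2017"}, clickdata={'type':'submit'}, method='POST')

in the scrapy shell, but view(response) always shows me the current date.

I am not sure whether this is right. I am new to Scrapy and I am trying to work it out. I think the method is correct, because the link does not change when I change the date, so it should be a POST rather than a GET.

Any ideas on how to get this to work?
I think FormRequest() would be the best option, but I have not seen any other examples online, and the documentation on the Scrapy site did not help me much, so I tried studying examples involving login credentials, and they all use FormRequest.from_response().

PS: I have included a screenshot of the HTML snippet related to the date change.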

python scrapy web-crawler scrapy-spider scrapy-shell

2017-05-28T19:00:08.467
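One thing worth checking in the FormRequest question above: the keys of `formdata` are form field names, so `{'value': ...}` submits a field literally called "value" rather than setting the value of the date input. The sketch below uses only the standard library to show what each choice of key puts in a URL-encoded POST body. The field name `date` is hypothetical, since the page's HTML is not shown, and the real name would come from the input's `name` attribute.

```python
from urllib.parse import urlencode

# Hypothetical name attribute of the date <input>; the real one must be
# read from the page's HTML.
DATE_FIELD_NAME = "date"

# Keyed by the literal string "value": the body carries a field named "value".
body_keyed_by_value = urlencode({"value": "20.05.2017"})

# Keyed by the input's name attribute: the body carries the date field.
body_keyed_by_name = urlencode({DATE_FIELD_NAME: "20.05.2017"})

print(body_keyed_by_value)
print(body_keyed_by_name)
```

Note also that calling `FormRequest.from_response(...)` in the shell only builds a request. The `response` shown by `view(response)` changes only after that request is fetched, for example with `fetch(request)`.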

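0 votes

2 answers

96 views

python - Cannot get list values from a website

I am getting all the details from the desired website, but I cannot get some specific information. Please guide me.

Target domain: https://shop.adidas.ae/en/messi-16-3-indoor-boots/BA9855.html

My code is response.xpath('//ul[@class="product-size"]//li/text()').extract()

I need to get the data!

Thanks!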

python scrapy scrapy-spider scrapy-shell

2017-06-06T09:19:41.503

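0 votes

1 answer

603 views

python - Robots.txt and Allow?

I am new to web crawlers, and I am having trouble understanding a particular robots.txt file. In this case, this is what the site has:

So I looked up / here and found that it applies to any path. Does that mean the site allows all of its pages to be crawled? However, when I try a basic crawl of the sitemap.xml link (or another URL on the site) with Scrapy, i.e.

I get a 403 HTTP response, which I assume, based on this link, means the site does not want you to scrape it... So what exactly does this site's robots.txt mean?

Edit: the file I am talking about is here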

python scrapy web-crawler robots.txt scrapy-shell

2017-06-08T23:58:39.470
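The robots.txt file from the question above is not shown, so the content below is hypothetical. As a minimal sketch, assuming the file contains an `Allow: /` rule for all user agents, the standard library's `urllib.robotparser` can report how such a file would be interpreted. The URL is also a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the file from the question is not shown.
ROBOTS_TXT = """\
User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URL on the site.
allowed = parser.can_fetch("*", "https://example.com/sitemap.xml")
print(allowed)
```

A permissive robots.txt only states what crawlers are asked to do. It does not control how the server answers, so a 403 can still come back for other reasons, for example the server rejecting the client's request headers.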

0 votes

4 answers

7483 views

cookies - What should I do to enable cookies and use Scrapy for this URL?

I am using Scrapy for a scraping project with this URL: https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11

I tried opening the URL in the shell but got a 430 error, so I added some settings to the headers, like this:

scrapy shell -s COOKIES_ENABLED=1 -s USER_AGENT='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0' "https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11"

It got a 200 page, but as soon as I use view(response), it takes me to a page that says: Sorry! Your web browser is not accepting cookies.

Here is a screenshot of the log:

cookies scrapy scrapy-spider scrapy-shell

2017-06-15T05:10:04.123

0 votes

0 answers

291 views

scrapy - Scrapy with DNSCACHE_ENABLED=False does not work

When I run scrapy shell with DNSCACHE_ENABLED=False, I get KeyError: 'dictionary is empty' and twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.mydomain.com.

Any ideas are welcome.

scrapy scrapy-shell

2017-07-03T03:16:10.200

0 votes

0 answers

225 views

shell - Scrapy Shell has the Correct Output, but the script does not

So I'm very confused here. When I use the scrapy shell and input the xpath the correct data is returned, but when I set that same xpath equal to a variable within the script, it outputs a blank. I'm really not sure what is going on.

class FestivalSpider(scrapy.Spider): name = 'basketball'

The item in question is fg_per_mp. When I use response.xpath('//tr[@class = "full_table"]/td[@data-stat = "fg_per_mp"]/text()').extract() in the shell it works, but the same line in the script returns an empty list.

What am I doing wrong?

shell xpath scrapy scrapy-spider scrapy-shell

2017-07-07T21:05:05.487
