Questions tagged [scrapy-spider]

1529 questions

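0 votes

1 answer

2959 views

python - Scrapy CrawlSpider output

I'm having trouble running the CrawlSpider example from the Scrapy documentation. It seems to crawl fine, but I can't get it to write its output to a CSV file (or any other file).

So, my question is: can I use this:

or do I have to create an item pipeline?

Update, now with code!: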

2014-10-23T13:04:54.043

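0 votes

1 answer

1897 views

python - Scrapy spiders - handling non-HTML links (PDF, PPT, etc.)

I'm learning Scrapy and Python and started from a blank project. I'm using Scrapy's LxmlLinkExtractor to parse links, but the spider always gets stuck when it encounters non-HTML links/pages (such as PDFs or other documents).

Question: how do we handle those links with Scrapy, in general, if I only want to store those URLs (I don't want the content of the documents for now...)?

Example page containing documents: http://afcorfmc.org/2009.html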
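One way to think about "store the URL, skip the content" is to sort extracted links by file extension before any of them is requested. The following is only a minimal sketch using the standard library, independent of Scrapy's own API; the extension list and the helper names `is_document_link` and `split_links` are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Assumed set of document extensions; extend it to suit the target site.
DOCUMENT_EXTENSIONS = {".pdf", ".ppt", ".pptx", ".doc", ".docx", ".xls", ".xlsx"}


def is_document_link(url):
    """Return True if the URL path ends in a known document extension."""
    path = urlparse(url).path  # the query string is not part of the path
    extension = posixpath.splitext(path)[1].lower()
    return extension in DOCUMENT_EXTENSIONS


def split_links(urls):
    """Separate URLs into (pages_to_crawl, documents_to_store)."""
    pages, documents = [], []
    for url in urls:
        (documents if is_document_link(url) else pages).append(url)
    return pages, documents
```

In a spider, the `documents` list could be stored as plain URLs while only the `pages` list is followed; how that is wired into the link extractor is left open here.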

Here is my spider code:

python scrapy scrapy-spider

2014-10-27T08:32:55.427

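0 votes

2 answers

144 views

scrapy - Scrapy "no such host" crawler

I'm using this spider as my base spider: https://github.com/alecxe/broken-links-checker/blob/master/broken_links_spider.py

It was created to catch domains that return 404 errors and save them. I want to modify it slightly so that it looks for "no such host" errors, i.e. error 12002.

However, with this code Scrapy receives no response (because there is no host to return one), and when Scrapy encounters such a domain it returns

Not found: [Errno 11001] getaddrinfo failed.

How can I catch this not-found error and save the domain?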
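Because a failed DNS lookup produces no HTTP response, it has to be handled as an exception rather than as a status code. The sketch below shows that idea with the standard library only; `find_unresolvable` and its `resolve` parameter are hypothetical names, and in Scrapy itself the comparable hook would presumably be a request errback rather than a direct `socket` call.

```python
import socket


def find_unresolvable(domains, resolve=socket.getaddrinfo):
    """Return (domain, errno) pairs for domains whose DNS lookup fails."""
    failed = []
    for domain in domains:
        try:
            resolve(domain, 80)
        except socket.gaierror as error:
            # No host answered, so there is no response to inspect;
            # the failure surfaces as an exception instead.
            failed.append((domain, error.errno))
    return failed
```

The `resolve` argument exists only so the lookup can be swapped out, for example in tests that should not touch the network.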

scrapy web-crawler host scrapy-spider

2014-10-28T09:31:46.690

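0 votes

1 answer

10052 views

python - Beautiful Soup: looping through HTML tags

I have the following code in HTML

Some sections have subsections and some don't. I want to get the content of the subsections and of the sections without subsections, and I'm trying to loop through those subsections so that I can build an index in Scrapy. I have the following Scrapy code:

Some of the results are formatted correctly, although some sections without subsections get broken up into separate elements.

I want each section separately. I mean, because one section contains other sections, I want to loop through those sections and get them individually so that I can keep track of the loop. Since some sections have no subsections, there is no need to loop through them.

Is there a better way to do this in BeautifulSoup? I want the following output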

python html django-views beautifulsoup scrapy-spider

2014-10-29T07:33:39.193

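0 votes

1 answer

391 views

python - Setting LOG_ENABLED=FALSE on scrapy

I was trying to disable Scrapy's debug prints, and after a quick search online I found this line of code that had helped others:

or

In both cases, it didn't solve my problem.

I'd appreciate it if someone could help me.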
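For reference, `LOG_ENABLED` and `LOG_LEVEL` are Scrapy settings, normally placed in the project's settings module. The fragment below is a sketch of such a settings file, not a reconstruction of the lines the asker tried (those are missing from this page). Whether it resolves the problem depends on where the settings are actually loaded from, which is an assumption here.

```python
# settings.py (the Scrapy project's settings module)

# Turn Scrapy's logging off entirely.
LOG_ENABLED = False

# Alternative: leave logging enabled but hide DEBUG and INFO messages.
# This line has no visible effect while LOG_ENABLED is False.
LOG_LEVEL = "WARNING"
```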

python scrapy scrapy-spider

2014-11-02T09:38:09.550

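0 votes

0 answers

1616 views

python - Scrapy on Tripadvisor, crawling reviews: how do I apply a double recursive rule?

Here's what my spider looks like:

The first rule runs successfully and crawls all the pages that list hotels.

The spider only crawls the first page of reviews for each hotel and unfortunately ignores the second rule, which should make it crawl recursively through all the review pages.

Because of the different callbacks and different XPaths, I don't think "How to fix Scrapy rules when only one rule is followed" applies here.

I'm asking for help!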

python web-scraping scrapy scrapy-spider

2014-11-04T12:52:47.567

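0 votes

1 answer

213 views

python - Getting "next page" data with Scrapy

I need to scrape the review data from a product website, but its user data is paginated. There are 10 reviews per page and about 100 pages. How can I crawl all of them?

Here is the HTML code for the "next page" link:

What exactly does href="#" mean?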
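A link whose href is only a fragment resolves to the page it appears on, so following it does not fetch a new document; in that situation the pagination is typically driven by JavaScript instead. The sketch below checks this with the standard library. The function name `resolves_to_same_page` and the example URLs are assumptions for illustration.

```python
from urllib.parse import urldefrag, urljoin


def resolves_to_same_page(page_url, href):
    """Return True if href, resolved against page_url, names the same document."""
    target = urljoin(page_url, href)
    # Compare with fragments stripped: a fragment only selects a position
    # inside a document, it never names a different one.
    return urldefrag(target).url == urldefrag(page_url).url


base = "http://example.com/reviews?item=1"
print(resolves_to_same_page(base, "#"))
print(resolves_to_same_page(base, "?item=1&page=2"))
```

If the check returns True for the "next page" link, the real page requests would have to be found elsewhere, for instance in the browser's network panel.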

python web-crawler scrapy-spider

2014-11-06T14:06:06.473

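0 votes

1 answer

160 views

python - Calculating the coverage of a Scrapy web spider

I'm writing web spiders to scrape some products from websites using the Scrapy framework in Python. I'd like to know what the best practices are for calculating the coverage and the missing items of the spiders I've written.

What I do now is log the cases that couldn't be parsed or that raised exceptions. For example: when I expect a specific format for a product's price or a place's address, and I find that the regular expression I wrote doesn't match the scraped string. Or when my XPath selector for a particular piece of data returns nothing.

Sometimes, when products are listed on one page or across several pages, I also use curl and grep to roughly count the number of products. But I'd like to know whether there are better practices for handling this.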
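One simple complement to logging individual failures is to measure, per field, what share of scraped items actually carries a value. The following is a rough sketch under stated assumptions: items are plain dicts, and the field names in `EXPECTED_FIELDS` are hypothetical.

```python
from collections import Counter

# Hypothetical fields every scraped product is expected to carry.
EXPECTED_FIELDS = ("name", "price", "address")


def field_coverage(items, fields=EXPECTED_FIELDS):
    """Return the fraction of items with a non-empty value for each field."""
    filled = Counter()
    total = 0
    for item in items:
        total += 1
        for field in fields:
            # Missing keys, None, empty strings and empty lists count as unfilled.
            if item.get(field) not in (None, "", []):
                filled[field] += 1
    if total == 0:
        return {field: 0.0 for field in fields}
    return {field: filled[field] / total for field in fields}
```

A field whose ratio drops sharply between runs would then point at a selector or regular expression that no longer matches.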

python web-scraping scrapy scrapy-spider

2014-11-14T03:18:00.350

0 votes

1 answer

1593 views

javascript - Scrapy choose from dropdown menu

I am trying to crawl this page https://www.stickyguide.com/dispensaries/leaf-lab/ using Scrapy. I have been having trouble crawling the reviews from this page for a long time. If anyone has experience dealing with Ajax or JavaScript, please share your thoughts.

1) I can easily get the XPath for the review:

However, I believe the review part of the page is loaded by JavaScript. Every time I crawl this page, I get the following value for that XPath:

Is there any method I can use to make sure that Scrapy crawls the page only after the JavaScript has loaded? When I looked this up online, using the selenium package seemed to be a possible solution, but it may not be efficient.

2) Another problem I ran into is that I only want to crawl the data from dispensaries. I need to choose the option "VIEW: Dispensary Only" from the dropdown menu next to the Review module. I took a look at the HTML code and it turns out to be an Ajax object.

Is there any method I can use to request the content of the option "VIEW: Dispensary Only"? I have tried a lot of methods from Stack Overflow but I still can't work this out.

Thank you in advance

javascript ajax xpath scrapy scrapy-spider

2014-11-17T18:21:21.260

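0 votes

1 answer

576 views

python - CrawlSpider-derived object has no attribute 'state'

I tried to use spider.state as described at http://doc.scrapy.org/en/0.22/topics/jobs.html, but I get an error

I tried using it in the __init__() function of my CrawlSpider-derived class. Could that be the problem?

My goal is for the crawl_start attribute to always hold the ISO-format datetime string of when my crawler was first started, regardless of when it was resumed.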
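The "set once, keep across resumes" behaviour described above can be sketched on any persisted dict. This is a minimal illustration, not the asker's code: `record_crawl_start` is a hypothetical helper, and it assumes `spider.state` behaves like a plain dict that only becomes available after the spider object has been constructed, so the call would belong somewhere later than `__init__`.

```python
from datetime import datetime


def record_crawl_start(state):
    """Store the first start time in `state`; later calls leave it unchanged."""
    if "crawl_start" not in state:
        state["crawl_start"] = datetime.now().isoformat()
    return state["crawl_start"]
```

On a resumed run the key is already present in the restored dict, so the original timestamp is returned unchanged.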

python web-scraping scrapy scrapy-spider

2014-11-20T22:04:23.337
