Questions tagged "scrapy-spider"

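0 votes

1 answer

1434 views

python - Scrapy: Rule SgmlLinkExtractor concept

Please guide me on how to write a Rule with SgmlLinkExtractor.
I am confused and cannot make sense of the English documentation.

I want to crawl a site that has many pages.
The rules are:

Here is my code: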

2014-07-30T03:16:51.373

0 votes

1 answer

1285 views

python - How to scrape contents from multiple tables in a webpage

I want to scrape the contents of multiple tables in a webpage. The HTML looks like this:

There are several matches under each date (9, 2 or 1, depending on how many were played that day), and the number of tables is 63 (equal to the number of days).

For each date, I want to extract the matches between teams, and also which team is home and which is away.

I was using the scrapy shell and tried the following commands:

One printed a list of the home teams and the other printed a list of all the away teams.

This gave me a list of all the dates:

What I want, for every date, is the matches played on that day (and which team is home and which is away).

Should my items.py look like this:

Please help me to write the parse function and the Item class.

Thanks in advance.
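A minimal sketch of one way to keep each date together with its own matches: iterate over the tables first, then run the home/away queries relative to each table instead of against the whole page. The markup in `SAMPLE_HTML`, the class names `date`, `home` and `away`, and the function name `matches_by_date` are all assumptions made for illustration, since the original HTML is not shown. The standard-library `xml.etree.ElementTree` stands in for Scrapy selectors here; in a spider the same idea would be a loop over `response.xpath('//table')` with relative queries that start with `.` inside the loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page: one table per date,
# the date in a header cell, one row per match. The real markup may differ.
SAMPLE_HTML = """
<div>
  <table>
    <tr><th class="date">Day 1</th></tr>
    <tr><td class="home">Team A</td><td class="away">Team B</td></tr>
  </table>
  <table>
    <tr><th class="date">Day 2</th></tr>
    <tr><td class="home">Team C</td><td class="away">Team D</td></tr>
    <tr><td class="home">Team E</td><td class="away">Team F</td></tr>
  </table>
</div>
"""


def matches_by_date(html):
    """Return one record per table: the date plus its home/away pairs."""
    root = ET.fromstring(html)
    records = []
    for table in root.iter("table"):
        # Queries below are relative to this table, so matches from other
        # dates cannot leak into the record.
        date = table.find(".//th[@class='date']").text.strip()
        matches = []
        for row in table.iter("tr"):
            home = row.find("td[@class='home']")
            away = row.find("td[@class='away']")
            if home is not None and away is not None:
                matches.append({"home": home.text.strip(),
                                "away": away.text.strip()})
        records.append({"date": date, "matches": matches})
    return records
```

Under these assumptions an Item would need only two fields, a date and a list of matches, with each match carrying its home and away team.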

python web-scraping scrapy scrapy-spider

2014-07-31T17:08:17.637

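0 votes

4 answers

16564 views

python - Exporting a CSV file from scrapy (not via the command line)

I have successfully exported my items to a CSV file from the command line, for example:

My question is: what is the simplest way to do the same thing from code? I need this because I take the file name from another file. The end scenario should be that I call

and the items are written to filename.csv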
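One possible approach, sketched under assumptions, is an item pipeline that writes the rows itself with the standard `csv` module, so the file name can come from code or settings. The class name `CsvExportPipeline` and the setting names `CSV_FILENAME` and `CSV_FIELDS` are hypothetical; `open_spider`, `process_item`, `close_spider` and `from_crawler` are the hooks Scrapy calls on a pipeline.

```python
import csv


class CsvExportPipeline:
    """Item pipeline sketch: write every scraped item as one CSV row."""

    def __init__(self, filename, fieldnames):
        self.filename = filename
        self.fieldnames = fieldnames

    @classmethod
    def from_crawler(cls, crawler):
        # CSV_FILENAME and CSV_FIELDS are hypothetical custom settings,
        # not built-in Scrapy settings.
        return cls(crawler.settings.get("CSV_FILENAME", "items.csv"),
                   crawler.settings.getlist("CSV_FIELDS"))

    def open_spider(self, spider):
        self.file = open(self.filename, "w", newline="", encoding="utf-8")
        # extrasaction="ignore" drops item keys that are not listed columns.
        self.writer = csv.DictWriter(self.file, fieldnames=self.fieldnames,
                                     extrasaction="ignore")
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.file.close()
```

The pipeline would still have to be enabled through the `ITEM_PIPELINES` setting, using whatever module path the project gives it.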

python csv scrapy export-to-csv scrapy-spider

2014-08-06T14:28:29.510

0 votes

1 answer

351 views

python - Sending an email only once when the spider runs

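I am trying to send an email through Gmail when the spider finishes scraping pages. When I define a function send_mail and call it as shown below, the log says send_mail("some message", "Scraper Report") NameError: name 'send_mail' is not defined. How can I send a Gmail message when the spider has finished scraping? When I call send_mail inside the def parse(self, response) method, it runs on every pass of the scraping loop and Gmail ends up trying to block me.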
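A sketch of one way to send a single message at the end of the crawl. Scrapy calls a spider's `closed(reason)` method once when the spider closes, so mail sent from there is not repeated for every response. The helper `build_report`, the mixin `MailOnCloseMixin`, and the address and password attributes are hypothetical names. Defining `send_mail` at module level means the bare name resolves from inside any method, whereas a function defined inside the class body would need `self.send_mail(...)`, which is one likely cause of the NameError. Gmail may also require an app-specific password for SMTP logins.

```python
import smtplib
from email.message import EmailMessage


def build_report(subject, body, sender, recipient):
    """Build the message without sending it, so it can be inspected."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


def send_mail(body, subject, sender, password, recipient):
    """Send one message through Gmail's SMTP-over-SSL endpoint."""
    msg = build_report(subject, body, sender, recipient)
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, password)
        server.send_message(msg)


class MailOnCloseMixin:
    """Mix into a spider class; the attribute values are placeholders."""
    mail_sender = "me@example.com"
    mail_password = "app-password"
    mail_recipient = "me@example.com"

    def closed(self, reason):
        # Called once per crawl, after the last response has been parsed.
        send_mail("Spider closed: %s" % reason, "Scraper Report",
                  self.mail_sender, self.mail_password, self.mail_recipient)
```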

python email scrapy scrapy-spider

2014-08-08T08:41:34.940

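0 votes

1 answer

4048 views

python - selenium-webdriver: how to use a for loop to find elements

I want to get all the links on a page, together with their start_time and end_time, and then pass them to a function (parse_detail) that scrapes the remaining information. But I do not know how to loop with selenium.

Here is my code, which raises an error:

Please show me how to use a for loop in selenium the way I would in scrapy. Thanks!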

python selenium selenium-webdriver web-scraping scrapy-spider

2014-08-09T00:15:12.057

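0 votes

1 answer

249 views

image - Why does scrapy return a 404 for an image that is available?

Here is an example of an image I add to the image_urls field: http://static.zara.net/photos//2014/I/0/2/p/5875/309/800/2/w/1920/5875309800_1_1_1.jpg But I get this warning and the image is not downloaded.

[zara_com] WARNING: File (code: 404): Error downloading image from http://static.zara.net/photos//2014/I/0/2/p/5875/309/800/2/w/1920/5875309800_1_1_1.jpg> referred in

Whereas an image like this one: http://static.zara.net/photos//2014/V/1/3/p/1280/303/105/2/w/1920/1280303105_2_1_1.jpg is downloaded normally.

What could be the problem? What should I check?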

image scrapy scrapy-spider

2014-08-14T10:42:33.353

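0 votes

1 answer

303 views

python - My first scrapy xpath selector

I am very new to this and have been trying to get my head around my first selector. Could someone help me? I am trying to extract data from this page:

http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc- -ghs-d1- -asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false

I want all the information under div class = Listing clearfix ShelfListing, but I cannot figure out how to format response.xpath().

I have managed to launch the scrapy console, but whatever I type into response.xpath(), I cannot seem to select the right node. I know it works, because when I type

I get a response. However, I do not know how to navigate to the Listing clearfix ShelfListing div. I hope that once I have that, I can carry on working through the spider.

PS: I wonder whether it is simply not possible to scrape this site. Can the owners block spiders?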
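A small sketch of selecting a div by its full class attribute and then querying inside it. The markup in `SAMPLE` is a hypothetical stand-in, because the real page is not reproduced here, and the standard-library `xml.etree.ElementTree` replaces Scrapy selectors so the example runs on its own. In the scrapy shell, the equivalent first step would be `response.xpath('//div[@class="Listing clearfix ShelfListing"]')`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class name comes from the question.
SAMPLE = """<html><body>
<div class="Header">ignored</div>
<div class="Listing clearfix ShelfListing">
  <div class="product"><span class="title">Item one</span></div>
  <div class="product"><span class="title">Item two</span></div>
</div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Match the class attribute exactly, including the spaces between names.
listing = root.find(".//div[@class='Listing clearfix ShelfListing']")

# Query relative to the listing so nothing outside it is picked up.
titles = [span.text for span in listing.findall(".//span[@class='title']")]
```

If the selector returns nothing on the live page, one thing worth checking is whether the listing is present in the downloaded HTML at all. The part of the URL after `#` is never sent to the server, so the shelf content may be filled in by JavaScript after the page loads. Running `view(response)` in the scrapy shell shows what Scrapy actually received.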

python xpath web-scraping scrapy scrapy-spider

2014-08-24T20:26:45.547

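0 votes

1 answer

726 views

python - scrapy: when the FormRequest has a jsessionid

I am practising with FormRequest and have run into a problem.
I crawl a link in def parse, and I get a JSON response in def parse1.
Then I read actId from the JSON so that I can yield a request to crawl another link, but I get an error like this:

I think it is because the URL carries a jsessionid: jsessionid=A69C5203A49A12DA450F32E6B2AB0E23

Because when I try hard-coding yield FormRequest(url='http://xxx.tw/ca/toView?mtd=do', callback=self.parse3, formdata={'actId': actId}), it works well.

Here is the code:

What can I do to solve this?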
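If the `;jsessionid=...` path parameter is indeed what breaks the request, which is an assumption since the error text is not shown, one option is to strip it from the URL before building the FormRequest. The function name `strip_jsessionid` and the example URL layout are hypothetical.

```python
import re

# Matches ";jsessionid=<value>" up to the next "?", "#" or ";".
_JSESSIONID = re.compile(r";jsessionid=[^?#;]*", re.IGNORECASE)


def strip_jsessionid(url):
    """Return the URL with any ;jsessionid=... path parameter removed."""
    return _JSESSIONID.sub("", url)
```

The cleaned URL can then be passed as the `url` argument of the FormRequest, in place of the hard-coded string.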

python json scrapy-spider

2014-08-25T10:50:19.680

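0 votes

1 answer

293 views

python - How to send the ScrapyFileLogObserver file to my email

I want to send myself an email when the spider closes.
I looked at this source and I can receive the mail. But I found that it only writes failure.getTraceback() when the spider hits an error.

Part of the source code:

But I would like it to send the console log to my email, like this:

I need this log because if I get WARNING: can't find the images!: http://www.example.com, I can run another spider to fetch the missing images.

My current approach is to use ScrapyFileLogObserver(open("spider.log", 'w'), level=log.INFO).start() to write to a file. After the spider has run, I open it to check whether anything went wrong. I would like to know whether I can send this file to my email, or just the text in the file.

Could someone show me how to do this? Thank you.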
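A sketch of turning the log file into an email, using only the standard library: the WARNING lines go into the body and the whole file is attached. The function name `build_log_mail` and the addresses are hypothetical, and the sketch assumes the log file has been flushed or closed before it is read.

```python
import os
from email.message import EmailMessage


def build_log_mail(log_path, sender, recipient, subject="Spider log"):
    """Build a message with WARNING lines as body and the log attached."""
    with open(log_path, encoding="utf-8") as f:
        log_text = f.read()

    # Keep only the lines worth reading at a glance.
    warnings = [line for line in log_text.splitlines() if "WARNING" in line]

    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(warnings) or "No warnings logged.")
    # Passing a str creates a text/plain attachment with this file name.
    msg.add_attachment(log_text, filename=os.path.basename(log_path))
    return msg
```

The returned message can be handed to `smtplib`'s `send_message`, for example from the spider's `closed(reason)` method so that it goes out once per crawl.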

python scrapy scrapy-spider

2014-08-28T01:55:17.793

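0 votes

2 answers

2272 views

python - Pyinstaller error:

After installing all of scrapy's dependencies on 32-bit Windows, I tried to build an executable from my scrapy spider. The spider script "runspider.py" works fine when run as "python runspider.py".

Building the executable with "pyinstaller --onefile runspider.py":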

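C:\Users\username\Documents\scrapyexe>pyinstaller --onefile runspider.py
19 INFO: wrote C:\Users\username\Documents\scrapyexe\runspider.spec
49 INFO: Testing for ability to set icons, version resources...
59 INFO: ... resource update available
59 INFO: UPX is not available.
89 INFO: Processing hook hook-os
279 INFO: Processing hook hook-time
279 INFO: Processing hook hook-cPickle
380 INFO: Processing hook hook-_sre
561 INFO: Processing hook hook-cStringIO
700 INFO: Processing hook hook-encodings
720 INFO: Processing hook hook-codecs
1351 INFO: Extending PYTHONPATH with C:\Users\username\Documents\scrapyexe
1351 INFO: checking Analysis
1351 INFO: building Analysis because out00-Analysis.toc non existent
1351 INFO: running Analysis out00-Analysis.toc
1351 INFO: Adding Microsoft.VC90.

py
3694 INFO: Analyzing runspider.py
3755 WARNING: No django root directory could be found!
3755 INFO: Processing hook hook-django
3785 INFO: Processing hook hook-lxml.etree
4135 INFO: Processing hook hook-xml
4196 INFO: Processing hook hook-xml.dom
4246 INFO: Processing hook hook-xml.sax
4296 INFO: Processing hook hook-pyexpat
4305 INFO: Processing hook hook-xml.dom.domreg
4736 INFO: Processing hook hook-pywintypes
5046 INFO: Processing hook hook-distutils
7750 INFO: Hidden import 'codecs' has been found otherwise
7750 INFO: Hidden import 'encodings' has been found otherwise
7750 INFO: Looking for run-time hooks
7750 INFO: Analyzing rthook C:\python27\lib\site-packages\PyInstaller\loader\rthooks\pyi_rth_twisted.py
8111 INFO: Analyzing rthook C: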

Running the built exe "runspider.exe":

C:\Users\username\Documents\scrapyexe\dist>runspider.exe

Traceback (most recent call last):
File "", line 2, in
File "C:\python27\Lib\site-packages\PyInstaller\loader\pyi_importers.py", line 270, in load_module
exec(bytecode, module.__dict__)
File "C:\Users\username\Documents\scrapyexe\build\runspider\out00-PYZ.pyz\scrapy", line 10, in
File "C:\Users\username\Documents\scrapyexe\build\runspider\out00-PYZ.pyz\pkgutil", line 591, in get_data
File "C:\python27\Lib\site-packages\PyInstaller\loader\pyi_importers.py", line 342, in get_data
fp = open(path, 'rb')
IOError: [Errno 2] No such file or directory: 'C:\Users\username\AppData\Local\\Temp\_MEI15522\scrapy\VERSION'

I would be very grateful for any help. I need to know how to build a standalone exe from a scrapy spider for Windows.

Thank you very much for your help.

python windows-7 scrapy pyinstaller scrapy-spider

2014-08-28T20:43:41.957
