python-2.7 - Generating a start_urls list with constraints in Scrapy

Question

I need to parse URLs like the following with Scrapy (ads from real estate agents):

http://ws.seloger.com/search.xml?idq=?&cp=72&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea

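Whatever min/max price you put in the URL (see pxmin / pxmax), the server's response is capped at 200 results.

So I would like to use a function that generates the URLs for start_urls with suitable price ranges, such that no single URL returns more than 200 search results and the URLs together cover the price range [0:1000000].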

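The function would do the following:

Take the first URL.
Check the number of results (the "nbTrouvees" tag in the XML response).
If there are more than 200 results, adjust the price range; if there are fewer than 200, add the URL to the start_urls list.
Keep increasing the price range until the price of 1,000,000 is reached.
Return the final start_urls list, which covers all the properties for the given area.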
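The steps above can be sketched as a bisection over the price interval. This is only a minimal sketch under stated assumptions, not a tested recipe for this server: `count_results` is a hypothetical callable that returns the "nbTrouvees" value for a given price range (in practice it would issue one request per call), pxmin and pxmax are assumed to be inclusive integer bounds, and the names `split_price_ranges`, `build_start_urls` and `URL_TEMPLATE` are made up for illustration.

```python
# URL pattern taken from the question; only pxmin and pxmax vary.
URL_TEMPLATE = ('http://ws.seloger.com/search.xml?idq=?&cp=72&idqfix=1'
                '&pxmin={pxmin}&pxmax={pxmax}&idtt=2&SEARCHpg=1'
                '&getDtCreationMax=1&tri=d_dt_crea')


def split_price_ranges(count_results, pxmin=0, pxmax=1000000, limit=200):
    """Split [pxmin, pxmax] into sub-ranges with at most `limit` results each.

    `count_results(lo, hi)` is a hypothetical callable returning the number
    of results the server reports for prices between lo and hi (assumed
    inclusive on both ends).
    """
    ranges = []
    pending = [(pxmin, pxmax)]
    while pending:
        lo, hi = pending.pop()
        found = count_results(lo, hi)
        if found == 0:
            continue  # nothing to scrape in this range
        if found <= limit or lo >= hi:
            # Small enough, or a single price that cannot be split further.
            ranges.append((lo, hi))
        else:
            mid = (lo + hi) // 2
            # Push the upper half first so the lower half is handled next,
            # which keeps the output in ascending price order.
            pending.append((mid + 1, hi))
            pending.append((lo, mid))
    return ranges


def build_start_urls(ranges):
    """Turn (pxmin, pxmax) pairs into search URLs."""
    return [URL_TEMPLATE.format(pxmin=lo, pxmax=hi) for lo, hi in ranges]
```

Halving a range that is too large costs roughly a logarithmic number of probe requests per accepted range, rather than stepping the upper bound by a fixed increment. A single price with more than 200 listings still cannot be split this way and would stay truncated.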

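This obviously means sending a lot of requests to the server just to find the right price ranges, on top of all the requests the spider generates for the final scrape.

1) So my first question is: do you think there is a better way to tackle this?

2) My second question: I tried to retrieve the content of one of these pages with Scrapy, just to see how I could parse the "nbTrouvees" tag without using a spider, but I am stuck.

I tried using TextResponse, but got nothing back. Then I tried the following, but it fails because the "Response" object has no "body_as_unicode" method.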

>>> link = 'http://ws.seloger.com/search.xml?idq=1244,1290,1247&ci=830137&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea'

>>> xxs = XmlXPathSelector(Response(link))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site-packages/scrapy/selector/lxmlsel.py", line 31, in __init__
    _root = LxmlDocument(response, self._parser)
  File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site-packages/scrapy/selector/lxmldocument.py", line 27, in __new__
    cache[parser] = _factory(response, parser)
  File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site-packages/scrapy/selector/lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Response' object has no attribute 'body_as_unicode'

Any ideas? (FYI, it works in my spider.)

Thanks,
Gilles
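On the traceback above: `Response(link)` only builds a response object around the URL. It does not download anything, so its body is empty, and the base `Response` class does not provide `body_as_unicode` (the text-based subclasses such as `TextResponse` do). Once the XML has been fetched by some other means, the count can be read without a spider. Below is a minimal sketch using only the standard library; the helper name `parse_result_count` and the sample document (including its `recherche` root element) are assumptions made for illustration, since the real response layout is not shown here.

```python
import xml.etree.ElementTree as ET


def parse_result_count(xml_body):
    """Return the integer inside the first <nbTrouvees> tag, or None.

    `xml_body` is the raw XML of a search response, already downloaded.
    The tag is looked up anywhere in the tree, because its exact position
    in the real document is not known here.
    """
    root = ET.fromstring(xml_body)
    for element in root.iter('nbTrouvees'):
        return int(element.text.strip())
    return None


# Hypothetical sample; the real response contains many more elements.
sample = b'<recherche><nbTrouvees>157</nbTrouvees></recherche>'
print(parse_result_count(sample))
```

A helper like this could serve as the counting step when probing price ranges, with the download done by whatever HTTP client is at hand.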
