I am scraping a series of URLs. The code runs, but Scrapy does not parse the URLs in order: for example, while I am trying to parse url1, url2, ..., url100, Scrapy parses url2, url10, url1, and so on.
It parses all of the URLs, but when a particular URL does not exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure there are no duplicates, I need the loop to parse the URLs in order rather than "at random".
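One possible guard against such stale pages (a minimal sketch; parse_add_tables and the bID meta key come from the code below, while the exact staleness check is an assumption, since it depends on how the site signals a missing b_id):

    def parse_add_tables(self, response):
        requested_bID = response.meta['bID']
        # response.url is the final URL after any redirect; if the server
        # bounced a missing b_id back to an earlier page, the ids differ
        if 'b_id=%d' % requested_bID not in response.url:
            self.log("Dropping stale response for b_id=%d" % requested_bID)
            return
        # ... parse the unit tables as usual ...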
I have tried both "for n in range(1,101)" and "while bID<100"; the result is the same. (See below.)
Thanks in advance!
# Inside the spider class; needs: from scrapy.http import Request
def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin...
        self.initialized()
        bID = 0
        #for n in range(1, 101):
        while bID < 100:
            bID = bID + 1
            startURL = 'https://www.example.com/units.aspx?b_id=%d' % bID
            # dont_filter=True stops the duplicate filter from dropping
            # these near-identical URLs
            request = Request(url=startURL, dont_filter=True,
                              callback=self.parse_add_tables,
                              meta={'bID': bID, 'metaItems': []})
            yield request
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.
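For reference, a minimal sketch of one way to force ascending b_id order, assuming an otherwise default Scrapy project: Request takes a priority argument (higher values are dequeued first), and dropping concurrency to one request at a time keeps later responses from overtaking earlier ones.

    # settings.py -- handle one request at a time so responses come
    # back in the order they were scheduled
    CONCURRENT_REQUESTS = 1

    # in check_login_response, after self.initialized():
    for bID in range(1, 101):
        yield Request(url='https://www.example.com/units.aspx?b_id=%d' % bID,
                      dont_filter=True,
                      callback=self.parse_add_tables,
                      meta={'bID': bID, 'metaItems': []},
                      # higher priority is scheduled first, so negate bID
                      # to keep b_id=1 ahead of b_id=2, and so on
                      priority=-bID)

An alternative that guarantees strict ordering without touching the settings is to chain the requests: yield only b_id=1 here, then yield the request for b_id=n+1 from the callback that handles b_id=n.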