I'm having trouble with the following simple spider, which scrapes a vBulletin forum site:
    class ForumSpider(CrawlSpider):
        ...
        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths="//div[@class='threadlink condensed']"),
                 callback='parse_threads'),
        )

        def parse_threads(self, response):
            thread = HtmlXPathSelector(response)

            # get the list of posts
            posts = thread.select("//div[@id='posts']//table[contains(@id,'post')]/*")

            # plist = []
            for p in posts:
                table = ThreadItem()

                table['thread_id'] = (p.select("//input[@name='searchthreadid']/@value").extract())[0].strip()

                string_id = p.select("../@id").extract()  # returns a list
                p_id = string_id[0].split("post")
                table['post_id'] = p_id[1]

                # plist.append(table)
                # return plist
                yield table
Some XPath hackiness aside, when I run this with yield I get strange results: the same thread_id and post_id pair is returned multiple times. Something like:
114763,1314728
114763,1314728
114763,1314728
114763,1314740
114763,1314740
114763,1314740
When I switch to the same logic with return (the commented-out lines), everything works fine. I think it might be some basic mistake with the generator, but I cannot figure it out. Why are the same posts being hit multiple times? Why does the code work with return but not with yield?
The full code snippet is in a gist here.
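To separate the generator question from the XPath, here is a minimal Scrapy-free sketch of the yield and return variants. It is only an illustration under stated assumptions: `NODES` is a hypothetical stand-in for the selector results, it assumes the XPath ending in `/*` matches three child nodes per post table, and the helper names are made up for this example. The spider's real indentation and page structure may differ.

```python
# Hypothetical stand-in for the selector results (not real Scrapy objects):
# each tuple is (id of the parent table, index of the matched child node).
# Assumption: the XPath ending in "/*" matches three children per post table.
NODES = [(f"post{pid}", child) for pid in (1314728, 1314740) for child in range(3)]


def post_id(parent_id):
    # Same split as in parse_threads: "post1314728" -> "1314728"
    return parent_id.split("post")[1]


def ids_with_yield(nodes):
    # Generator variant: produces one item per matched node.
    for parent_id, _child in nodes:
        yield post_id(parent_id)


def ids_with_return_after_loop(nodes):
    # List variant with the return placed after the loop.
    plist = []
    for parent_id, _child in nodes:
        plist.append(post_id(parent_id))
    return plist


def ids_with_return_inside_loop(nodes):
    # List variant with the return placed inside the loop body:
    # the function exits during the first iteration.
    plist = []
    for parent_id, _child in nodes:
        plist.append(post_id(parent_id))
        return plist
    return plist


print(list(ids_with_yield(NODES)))
print(ids_with_return_after_loop(NODES))
print(ids_with_return_inside_loop(NODES))
```

In this sketch the generator and the return-after-loop variants produce the same six items, and only the return-inside-loop variant stops after one. If the real spider behaves the same way, the difference between the two runs may come from where the return sits and how many nodes the XPath matches, not from yield itself.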