
I'm having trouble with the following simple code, which scrapes a vBulletin forum site:

class ForumSpider(CrawlSpider):
    ...

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths="//div[@class='threadlink condensed']"),
             callback='parse_threads'),
    )

    def parse_threads(self, response):

        thread = HtmlXPathSelector(response)

        # get the list of posts
        posts = thread.select("//div[@id='posts']//table[contains(@id,'post')]/*")

        # plist = []
        for p in posts:
            table = ThreadItem()

            table['thread_id'] = (p.select("//input[@name='searchthreadid']/@value").extract())[0].strip()

            string_id = p.select("../@id").extract() # returns a list
            p_id = string_id[0].split("post")
            table['post_id'] = p_id[1]

            # plist.append(table)
            # return plist
            yield table

Some XPath hackiness aside, when I run this with yield I get really strange results, with multiple hits for the same thread_id and post_id. Something like:

114763,1314728
114763,1314728
114763,1314728
114763,1314740
114763,1314740
114763,1314740

When I switch back to the same logic using return (commented out above), everything works fine. I think it might be some basic mistake with the generator, but I cannot figure it out. Why are the same posts being hit multiple times? Why does the code work using return but not yield?
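To separate the generator question from Scrapy entirely, here is a minimal plain-Python sketch (the function names are made up for illustration) of the difference between yielding inside a loop and returning from inside it. Note that in the commented-out version above, the return sits inside the loop body, so it exits after the first iteration:

```python
def with_yield(items):
    # A generator: yields one result per loop iteration.
    for i in items:
        yield i * 2

def with_return(items):
    # Mirrors the commented-out version: the return statement
    # is inside the loop, so only the first item is processed.
    out = []
    for i in items:
        out.append(i * 2)
        return out

print(list(with_yield([1, 2, 3])))  # -> [2, 4, 6]
print(with_return([1, 2, 3]))       # -> [2]
```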

The full code snippet is in a gist here.


1 Answer


Looks like it is an indentation problem. The following should work the same way as the version using a list and return:

def parse_threads(self, response):

    thread = HtmlXPathSelector(response)

    # get the list of posts
    posts = thread.select("//div[@id='posts']//table[contains(@id,'post')]/*")

    for p in posts:
        table = ThreadItem()

        table['thread_id'] = (p.select("//input[@name='searchthreadid']/@value").extract())[0].strip()

        string_id = p.select("../@id").extract() # returns a list
        p_id = string_id[0].split("post")
        table['post_id'] = p_id[1]

        yield table

UPD: I've fixed and improved your parse_threads method; it should work now:

def parse_threads(self, response):
    thread = HtmlXPathSelector(response)
    thread_id = thread.select("//input[@name='searchthreadid']/@value").extract()[0].strip()
    post_id = thread.select("//div[@id='posts']//table[contains(@id,'post')]/@id").extract()[0].split("post")[1]

    # get the list of posts
    posts = thread.select("//div[@id='posts']//table[contains(@id,'post')]/tr[2]")
    for p in posts:
        # getting user_name
        user_name = p.select(".//a[@class='bigusername']/text()").extract()[0].strip()

        # skip adverts
        if 'Advertisement' in user_name:
            continue

        table = ThreadItem()
        table['user_name'] = user_name
        table['thread_id'] = thread_id
        table['post_id'] = p.select("../@id").extract()[0].split("post")[1]

        yield table
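As an aside, the `split("post")[1]` trick used above for pulling the numeric id out of a DOM id like `post1314728` can be checked in isolation (a plain-Python sketch, independent of Scrapy; the helper name is made up):

```python
def extract_post_id(dom_id):
    # vBulletin post tables carry ids like "post1314728";
    # everything after the "post" prefix is the numeric id.
    return dom_id.split("post")[1]

print(extract_post_id("post1314728"))  # -> 1314728
```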

Hope that helps.

Answered 2013-08-07T05:20:49.373