python - BeautifulSoup returns the same result when called in a while loop

Question

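I am new to Python and am trying to write a crawler that collects all the links on a page that has multiple pagination pages. I call the following code inside a while loop.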

page = urllib2.urlopen(givenurl,"",10000)
soup = BeautifulSoup(page, "lxml")
linktags = soup.findAll('span',attrs={'class':'paginationLink pageNum'})
page.close()
BeautifulSoup.clear(soup)
return linktags

It always returns the results for the first url I passed in. Am I doing something wrong?

score 5 · Accepted Answer

@uncollected probably gave you the right answer in the comments, but I wanted to expand on it.

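If you are calling that exact code, just nested inside a while block, the return statement exits the function on the first iteration, so you only ever get the first result. There are two things you can do here.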

I'm not sure how you are using while in your own context, so I'm using a for loop here.
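To make the problem concrete, here is a minimal sketch that needs no network access. Everything in it is made up for illustration: `FAKE_PAGES`, `fetch_links`, `broken` and `fixed` are hypothetical names, and `fetch_links` merely stands in for the urlopen, BeautifulSoup and findAll steps. It uses a while loop, since that is what the question describes.

```python
# Hypothetical stand-in data: url -> links found on that page.
FAKE_PAGES = {
    "page1": ["a", "b"],
    "page2": ["c"],
    "page3": ["d", "e"],
}

def fetch_links(url):
    """Stand-in for the download + parse + findAll steps."""
    return FAKE_PAGES[url]

def broken(urls):
    i = 0
    while i < len(urls):
        linktags = fetch_links(urls[i])
        i += 1
        return linktags  # exits the function on the very first pass

def fixed(urls):
    links = []
    i = 0
    while i < len(urls):
        links.extend(fetch_links(urls[i]))
        i += 1
    return links  # runs once, after the loop has finished

urls = ["page1", "page2", "page3"]
print(broken(urls))  # ['a', 'b']
print(fixed(urls))   # ['a', 'b', 'c', 'd', 'e']
```

The only difference between the two functions is where the return sits relative to the loop body.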

Extend a list of results, and return the whole list:

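import urllib2  # Python 2; in Python 3 this functionality lives in urllib.request
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4, given the "lxml" argument

def getLinks(urls):
    """ processes all urls, and then returns all links """
    links = []
    for givenurl in urls:
        page = urllib2.urlopen(givenurl,"",10000)
        soup = BeautifulSoup(page, "lxml")
        linktags = soup.findAll('span',attrs={'class':'paginationLink pageNum'})
        page.close()
        BeautifulSoup.clear(soup)
        # collect this page's matches in the shared list
        links.extend(linktags)
        # don't return here, or the loop ends after the first url

    return links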

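Alternatively, you can make it a generator by using the yield keyword instead of return. A generator hands back each result and then pauses until the next iteration is requested: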

def getLinks(urls):
    """ generator yields links from one url at a time """
    for givenurl in urls:
        page = urllib2.urlopen(givenurl,"",10000)
        soup = BeautifulSoup(page, "lxml")
        linktags = soup.findAll('span',attrs={'class':'paginationLink pageNum'}) 
        page.close()
        BeautifulSoup.clear(soup)
        # this will hand back the current results
        # and pause the function's state until the
        # next iteration is requested
        yield linktags
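A generator like this is consumed by iterating over it. The sketch below shows one possible way to do that; `fake_get_links` is a hypothetical stand-in for getLinks that yields made-up strings, so it runs without any network access or parsing.

```python
def fake_get_links(urls):
    """Hypothetical stand-in for getLinks: yields one list of links per url."""
    for url in urls:
        # The real function would download and parse the page here.
        yield [url + "#1", url + "#2"]

gen = fake_get_links(["page1", "page2"])

# Nothing has run yet; next() runs the function body up to its first yield.
first = next(gen)

# A for loop (here inside a comprehension) pulls the remaining batches,
# one page at a time, and flattens them into a single list.
rest = [link for batch in gen for link in batch]

print(first)  # ['page1#1', 'page1#2']
print(rest)   # ['page2#1', 'page2#2']
```

Because each page is only processed when the caller asks for the next batch, the caller can stop early without fetching the remaining urls.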
