python - Recursion and Beautiful Soup in Python

Question

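So I may just be being dumb here and not understanding a basic mechanism of Python, but I'm trying to crawl a web page, then grab a new link from it and continue recursively. Here is a rough breakdown: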

def go_to_next_page(soup, data, curr_link):
    print "Curr Link: " + curr_link 
    # gather information and append to data
    new_link = ""  # unless I find link with Beautiful Soup

    if new_link is not "":
        print "Next Link: " + new_link
        new_soup = BeautifulSoup(mechanize.urlopen(new_link))
        data = go_to_next_page(new_soup, data, new_link)
    return data

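But when it goes in the second time, it doesn't create a new soup, and then there is no data to gather.

Is this a Beautiful Soup problem, or am I doing recursion wrong in Python?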

score 1 · Accepted Answer

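If by link you mean a URL, then you need to use urllib2 to fetch the page so that its content is readable and parseable by Beautiful Soup.

If you are just doing the same thing with the new content, then do this: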

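import urllib2
from bs4 import BeautifulSoup

def get_data(link):
    # fetch the page, then hand the response to BeautifulSoup for parsing
    page = urllib2.urlopen(link)
    soup = BeautifulSoup(page)
    return soup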

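Now you can use BeautifulSoup to parse the content at any given link; you don't need to do it the way you have it.

For more information on Beautiful Soup, another useful site is Bs4 Webscraping.

Edit

Since you said you have already done that part and you are trying to get the next link through recursion,

I wrote this example: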

import urllib2
from bs4 import BeautifulSoup

def go_to_next_page(soup, data, curr_link):
    print "Curr Link: " + curr_link
    # gather information from soup and append it to data here
    new_link = ""  # stays empty unless a matching anchor is found
    pop = soup.find_all('a', {'class': 'guide-item yt-uix-sessionlink yt-valign  guide-item-selected'})
    for i in pop:  # these lines build the new link from the anchor's href
        end = i.get('href')
        new_link = 'http://www.youtube.com' + end

    if new_link != "":
        # if new_link isn't empty, fetch that page and build the new soup
        print "Next Link: " + new_link
        new_soup = BeautifulSoup(urllib2.urlopen(new_link).read())
        data = go_to_next_page(new_soup, data, new_link)
    return data

def get_data(link):
    page = urllib2.urlopen(link)
    soup = BeautifulSoup(page)
    return soup

data = []  # results collected across pages
go_to_next_page(get_data('http://www.youtube.com'), data, 'http://www.youtube.com')

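This example gets the data from curr_link, then finds the new link (in this example, the YouTube Popular page), then fetches the HTML of the new_link page and recurses with the new data (I'm assuming you use the same BeautifulSoup parsing inside the function on every recursion).

There may be a better way to do this, but this works fine.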
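One thing the example above leaves open is when the recursion stops: if a page's next link leads back to a page that was already seen, the function keeps calling itself. Here is a minimal sketch of one way to guard against that. It assumes Python 3 and uses the standard library's html.parser in place of Beautiful Soup so that it runs without extra packages. The names crawl, find_next_link, fetch, max_depth and the "next" class are hypothetical, and fetch is any function that returns a page's HTML for a URL (a dictionary lookup stands in for the network call here):

```python
from html.parser import HTMLParser


class NextLinkParser(HTMLParser):
    """Records the href of the first <a> tag whose class list contains 'next'."""

    def __init__(self):
        super().__init__()
        self.next_link = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_link is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "next" in classes and attr_map.get("href"):
            self.next_link = attr_map["href"]


def find_next_link(html):
    """Return the href of the first 'next' anchor in html, or None."""
    parser = NextLinkParser()
    parser.feed(html)
    return parser.next_link


def crawl(fetch, link, data=None, visited=None, max_depth=10):
    """Recursively follow 'next' links, collecting each visited URL in data."""
    data = [] if data is None else data
    visited = set() if visited is None else visited
    if link in visited or max_depth <= 0:
        return data  # base case: page already seen, or depth budget used up
    visited.add(link)
    html = fetch(link)   # the caller decides how pages are retrieved
    data.append(link)    # stand-in for "gather information"
    new_link = find_next_link(html)
    if new_link:
        crawl(fetch, new_link, data, visited, max_depth - 1)
    return data


# Fake two-page site: "a" links to "b", and "b" links back to "a".
PAGES = {
    "a": '<a class="next" href="b">next</a>',
    "b": '<a class="next" href="a">next</a>',
}
result = crawl(PAGES.__getitem__, "a")
print(result)
```

Each page is fetched once inside the call that handles it, so every level of the recursion works on fresh markup, and the visited set plus the depth budget give the recursion two ways to end.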

score 1 · Accepted Answer

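You are not getting the content of the page. BeautifulSoup does not retrieve the HTML content for you; you have to retrieve it yourself. You should pass either the page content or a file object to BeautifulSoup: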

import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen(new_link)  # fetch the page yourself; new_link is the URL to follow
soup = BeautifulSoup(f) # or soup = BeautifulSoup(f.read())
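To see why the fetch step matters, keep in mind that an HTML parser only sees the text it is handed. The sketch below uses the standard library's html.parser as a stand-in for Beautiful Soup (an assumption, made so that it runs on Python 3 without extra packages) together with a hypothetical count_links helper. Handing it a URL string finds no links, while handing it the page's markup does:

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> start tags in whatever text it is fed."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_links(text):
    """Parse text as HTML and return the number of anchors found."""
    parser = LinkCounter()
    parser.feed(text)
    return parser.count


url = "http://www.example.com/page"  # hypothetical URL, never fetched here
markup = '<html><body><a href="/one">1</a> <a href="/two">2</a></body></html>'

links_in_url = count_links(url)        # the URL itself is plain text, no tags
links_in_markup = count_links(markup)  # the downloaded markup holds the anchors
print(links_in_url, links_in_markup)
```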
