python - Why does the tag give no output?

Question

I tried this code:

import urllib
from bs4 import BeautifulSoup
url = 'http://www.freesoft4down.com/Windows/System-Utilities/Clipboard-Tools/Page-1-0-0-0-0.html'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)
print soup.find('ul',{'class':'div_pages'})

I want to read the links inside the tag so that I can open the next link among them, because each category has more than one page.

score 2 · Accepted Answer

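First you need to get the URL of the next page; then you can open that page with urllib2, and so on.

To get the URL, you can construct it manually if the URLs follow a clear pattern.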
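As a rough illustration of building the URL by hand, here is a minimal sketch in Python 3. It assumes the first number after "Page-" in the URL is the page index, which is only a guess based on the example URL in the question; `next_page_url_by_pattern` is a hypothetical helper name, not part of any library.

```python
import re


def next_page_url_by_pattern(url):
    """Return url with its page number increased by one (hypothetical helper)."""
    # Assumption: the first number after "Page-" is the page index.
    def bump(match):
        return "Page-{}".format(int(match.group(1)) + 1)

    new_url, count = re.subn(r"Page-(\d+)", bump, url, count=1)
    if count == 0:
        raise ValueError("no 'Page-<n>' segment in URL: " + url)
    return new_url


url = "http://www.freesoft4down.com/Windows/System-Utilities/Clipboard-Tools/Page-1-0-0-0-0.html"
print(next_page_url_by_pattern(url))
```

Note that this approach cannot tell you when the last page has been reached; you would have to detect that separately, for example from an error response or an empty page.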

Alternatively, you can get it by reading the "Next" link on the page.

# Matching on the link text "Next" relies on what the page actually shows, which is more reliable than guessing the URL pattern.
import urllib
from bs4 import BeautifulSoup
import re
url = 'http://www.freesoft4down.com/Windows/System-Utilities/Clipboard-Tools/Page-1-0-0-0-0.html'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)
print soup.find('ul',{'class':'div_pages'}).find(text=re.compile("Next")).find_parent('a')['href']

The output looks like this:

http://www.freesoft4down.com/Windows/System-Utilities/Clipboard-Tools/Page-2-0-0-0-0.html

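Now that you have the link to the next page, getting the page after that, and the one after that, is just a matter of repeating this process.

Let me know if this answers your question.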
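One thing to watch for when repeating the process: the href shown above happens to be absolute, but a pager link can just as well be relative to the current page. A minimal sketch of handling both cases, assuming Python 3's `urllib.parse`; the relative href below is made up for illustration and is not taken from the site:

```python
from urllib.parse import urljoin

page_url = "http://www.freesoft4down.com/Windows/System-Utilities/Clipboard-Tools/Page-1-0-0-0-0.html"

# Hypothetical relative href, as a pager might emit it.
relative_href = "Page-2-0-0-0-0.html"
resolved = urljoin(page_url, relative_href)
print(resolved)

# An absolute href is returned as-is, so urljoin is safe to apply in both cases.
absolute_href = "http://www.freesoft4down.com/Windows/System-Utilities/Clipboard-Tools/Page-2-0-0-0-0.html"
same = urljoin(page_url, absolute_href)
```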

score 1 · Accepted Answer

Taking B.M.W.'s answer and extending it to follow the "Next" link page after page:

import re
import urllib
from bs4 import BeautifulSoup


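def get_next_page(url):
    """Return the URL of the 'Next' page linked from url, or None if there is none."""
    pageurl = urllib.urlopen(url)
    soup = BeautifulSoup(pageurl)
    # The pager is the <ul class="div_pages"> element; guard against it being absent.
    pager = soup.find('ul', {'class': 'div_pages'})
    if pager is None:
        return None
    next_text = pager.find(text=re.compile("Next"))
    if next_text:
        return next_text.find_parent('a')['href']
    return None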

next_url = 'http://www.freesoft4down.com/Windows/System-Utilities/Clipboard-Tools/Page-1-0-0-0-0.html'
while next_url:
    print 'Retrieving URL {}'.format(next_url)
    next_url = get_next_page(next_url)

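You will probably want to change the code so that it actually does something useful with the pages.

For example, you may want to move the urllib.urlopen call into the while loop so that you have direct access to each page's content. (To avoid retrieving every page twice, pass the page content to get_next_page rather than the URL.) But it all depends on why you are retrieving these pages in the first place.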
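A rough sketch of that restructuring, written for Python 3 with only the standard library so that it runs without extra packages. This swaps BeautifulSoup for `html.parser`, and it simplifies the search: it takes the first `<a>` anywhere on the page whose text contains "Next" instead of looking only inside the `div_pages` list. The names `NextLinkParser`, `find_next_url`, `fetch`, and `crawl`, the UTF-8 decoding, and the `max_pages` cap are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class NextLinkParser(HTMLParser):
    """Record the href of the first <a> whose text contains "Next"."""

    def __init__(self):
        super().__init__()
        self._href = None      # href of the <a> currently open, if any
        self.next_href = None  # href of the "Next" link once found

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

    def handle_data(self, data):
        if self._href and self.next_href is None and "Next" in data:
            self.next_href = self._href


def find_next_url(html, base_url):
    """Return the absolute URL of the "Next" link in html, or None."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(base_url, parser.next_href)


def fetch(url):
    """Download url and return its body as text (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Yield (url, html) for each page, downloading each page only once."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch_page(url)
        yield url, html  # the caller does the useful work with html here
        url = find_next_url(html, url)
```

You would then iterate with something like `for url, html in crawl(start_url): ...` and process `html` inside the loop. The `seen` set and the `max_pages` cap are only there to stop the loop if the "Next" links ever form a cycle.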
