python-2.7 - Unable to find all links with BeautifulSoup when extracting links from a website (link identification)

Question

I am using this code, found here (retrieve links from web page using python and BeautifulSoup), to extract all the links from a website.

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.bestwestern.com.au')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        print link['href']

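I am using the website http://www.bestwestern.com.au as a test. Unfortunately, I noticed that the code does not extract some links, for example http://www.bestwestern.com.au/about-us/careers/. I don't know why. This is what I found in the page source: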

<li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>

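I think the extractor should normally pick this up. In the BeautifulSoup documentation I read: "The most common type of unexpected behavior is that you can't find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the solution is to install lxml or html5lib." So I installed html5lib, but I still get the same behavior.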

Thank you for your help.

score 2 · Accepted Answer

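OK, this is an old question, but I stumbled on it while searching, and it looks like it should be relatively easy to do. I did switch from httplib2 to requests.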

import requests
from bs4 import BeautifulSoup, SoupStrainer

baseurl = 'http://www.bestwestern.com.au'

SEEN_URLS = []  # every href encountered so far

def get_links(url):
    response = requests.get(url)
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a', href=True)):
        href = link['href']
        print(href)
        # Check whether the URL is new before recording it, so that each
        # internal page is followed exactly once.
        is_new = href not in SEEN_URLS
        SEEN_URLS.append(href)
        if baseurl in href and is_new:
            get_links(href)

if __name__ == '__main__':
    get_links(baseurl)
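
The substring test against baseurl in the crawler above only matches absolute URLs, so a relative href such as /sitemap/ would not be followed. One possible way to handle that, sketched here under the assumption of Python 3 and its standard urllib.parse module, is to resolve each href against the page it came from and compare hosts. The helper name resolve_internal and the constant BASE_URL are made up for this illustration and are not part of the answer's code.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical constant for this sketch; same site as in the question.
BASE_URL = 'http://www.bestwestern.com.au'

def resolve_internal(href, page_url=BASE_URL):
    """Return href as an absolute URL if it stays on BASE_URL's host, else None."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the page they were found on.
    absolute = urljoin(page_url, href)
    if urlparse(absolute).netloc == urlparse(BASE_URL).netloc:
        return absolute
    return None

print(resolve_internal('/about-us/careers/'))
print(resolve_internal('http://example.com/'))
```

In a crawler like the one above, the returned absolute URL (when it is not None) could be used both as the entry in the seen list and as the argument to the next request.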

score 1 · Accepted Answer

One problem is that you are using BeautifulSoup version 3, which is no longer maintained. You need to upgrade to BeautifulSoup version 4:

pip install beautifulsoup4

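The other problem is that there is no "Careers" link on the main page, but there is one on the "Sitemap" page. Request that page and parse it with the default html.parser parser, and you will see the "Careers" link printed: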

import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('http://www.bestwestern.com.au/sitemap/')

for link in BeautifulSoup(response.content, "html.parser", parse_only=SoupStrainer('a', href=True)):
    print(link['href'])

Note how I have moved the "must have href" rule into the soup strainer.
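
To see what the href=True argument does without hitting the network, here is a minimal sketch that runs the same strainer over an inline HTML string. The markup in SAMPLE_HTML and the helper name extract_hrefs are invented for illustration; only the Careers URL comes from the question. It assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up markup for illustration; the second anchor has no href attribute.
SAMPLE_HTML = """
<ul>
  <li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>
  <li><a name="top">Back to top</a></li>
  <li><a href="/sitemap/">Sitemap</a></li>
</ul>
"""

def extract_hrefs(html):
    """Return the href of every <a> tag that has one, in document order."""
    # The strainer keeps only <a> tags carrying an href attribute, so no
    # has_attr('href') check is needed inside the loop.
    only_links = SoupStrainer('a', href=True)
    soup = BeautifulSoup(html, 'html.parser', parse_only=only_links)
    return [a['href'] for a in soup.find_all('a')]

hrefs = extract_hrefs(SAMPLE_HTML)
print(hrefs)
```

Using find_all('a') on the strained soup, rather than iterating over the soup directly, guarantees that every item is a tag, which keeps the a['href'] lookup safe.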
