python - Get all links from HTML, even those behind a "show more" link

Question

I am using Python and BeautifulSoup to parse HTML.

I am using the following code:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"

main_url = urllib2.urlopen(url)
content = main_url.read()
soup = BeautifulSoup(content)

for a in soup.findAll('a',href=True):
    print a[href]

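But I am not getting output links like this one: http://www.wikipathways.org/index.php/Pathway:WP26

Another point is that there are 107 pathways, but I won't get all of the links, because the rest depend on the "show links" control at the bottom of the page.

So, how can I get all the links (107 links) from that URL?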

score 2 · Accepted Answer

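Your problem is line 8, content = url.read(). You are not actually reading the web page there. url is just a string, so that call does nothing useful (if anything, you should be getting an error).

main_url is the object you want to read, so change line 8 to: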

content = main_url.read()

You also have another error: print a[href]. The key href must be a string (unquoted, it is an undefined variable name), so it should be:

print a['href']
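
To see both fixes working together without depending on the network or on BeautifulSoup, here is a minimal Python 3 sketch that collects href values with the standard library's html.parser instead. This is a swapped-in technique, not the code from the question. The LinkCollector class, the SAMPLE_HTML markup, and the WP35 link are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)        # attrs is a list of (name, value) pairs
        href = attributes.get("href")   # the key is the *string* "href"
        if href is not None:            # skip anchors without an href
            self.links.append(href)


# Hypothetical sample markup standing in for the downloaded page content.
SAMPLE_HTML = """
<ul>
  <li><a href="/index.php/Pathway:WP26">WP26</a></li>
  <li><a name="anchor-only">no href here</a></li>
  <li><a href="/index.php/Pathway:WP35">WP35</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links

for link in links:
    print(link)
```

As in the corrected print a['href'], the attribute is looked up by the quoted string "href". Only links present in the HTML that was actually parsed are found; anything the page loads later is not covered by this sketch.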

score 1 · Accepted Answer

I would suggest using lxml. It is faster and better at parsing HTML, and it is worth taking the time to learn.

from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
links = dom.cssselect('a')

That should get you going.
