python - Question: BeautifulSoup + an href pattern isn't scraping the way I want

Question

I have the following HTML pattern that I want to scrape using BeautifulSoup. The HTML pattern is:

<a href="link" target="_blank" onclick="blah blah blah">TITLE</a>

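I want to get the TITLE and the information shown behind the link. That is, if you click the link, there is a description of the TITLE. I want that description.

I started by trying to get just the title with the following code: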

import urllib
from bs4 import BeautifulSoup
import re

webpage = urllib.urlopen("http://urlofinterest")

title = re.compile('<a>(.*)</a>')
findTitle = re.findall(title,webpage)
print findTitle

My output is:

% python beta2.py
[]

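So this clearly isn't even finding the title. I even tried <a href>(.*)</a>, but that didn't work either. From my reading of the documentation, I thought BeautifulSoup would grab whatever text sits between the symbols I give it. What am I doing wrong here?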

score 1 · Accepted Answer

Why are you importing BeautifulSoup and then not using it at all?
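As a side note on the empty list: the pattern '<a>(.*)</a>' only matches an opening tag that is literally <a>, with no attributes, so it cannot match the anchor in the question even when it is handed a string. The sketch below illustrates this with the question's own HTML. It assumes Python 3, and the second pattern is only an illustration of what would match this one string, not a recommendation to parse HTML with regular expressions.

```python
import re

# The anchor from the question, as a plain string.
html = '<a href="link" target="_blank" onclick="blah blah blah">TITLE</a>'

# The original pattern: matches only a bare "<a>" with no attributes.
strict = re.compile(r'<a>(.*)</a>')
strict_matches = strict.findall(html)

# Illustrative pattern that allows attributes in the opening tag.
loose = re.compile(r'<a\s[^>]*>(.*?)</a>')
loose_matches = loose.findall(html)

print(strict_matches)  # []
print(loose_matches)   # ['TITLE']
```

Even the looser pattern breaks easily on nested tags or unusual attribute quoting, which is why using the parser is the more robust route.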

webpage = urllib.urlopen("http://urlofinterest")

You need to read the data from it, so:

webpage = urllib.urlopen("http://urlofinterest").read()
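The line above is Python 2. Under Python 3 the same call lives in urllib.request, and read() returns bytes rather than text. A minimal sketch of the equivalent follows, assuming Python 3; the helper name fetch_html and the UTF-8 fallback are assumptions made for illustration, not part of the original code.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Return the body of `url` as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str, in Python 3
        # Use the charset declared in the headers, falling back to UTF-8
        # (the fallback is an assumption, not something the server promises).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Usage (placeholder URL from the question, so not called here):
# webpage = fetch_html("http://urlofinterest")
```

The returned string can then be passed to BeautifulSoup in place of the hard-coded test string.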

Something like this should get you far enough to go further:

>>> blah = '<a href="link" target="_blank" onclick="blah blah blah">TITLE</a>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(blah) # change to webpage later
>>> for tag in soup('a', href=True):
...     print tag['href'], tag.string
...
link TITLE
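To go further and get the description behind each link, each href has to be fetched in turn. An href such as link is relative, so it first needs to be resolved against the URL of the page it was found on. A minimal sketch of that step, assuming Python 3; the page URL and the extra hrefs below are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the anchors were scraped from.
page_url = "http://urlofinterest/index.html"

# Hypothetical href values as they might come out of tag['href'].
hrefs = ["link", "/about", "http://example.com/x"]

# Resolve each href against the page URL; absolute URLs pass through unchanged.
absolute_urls = [urljoin(page_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

Each resolved URL can then be fetched and parsed the same way as the first page; which element holds the description depends on the target page's markup.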
