html - BeautifulSoup can't extract the src attribute from an img tag

Question

Here is my code:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
soup = BeautifulSoup(html)
imgs = soup.findAll('img')

print imgs[0].attrs

It prints [(u'onload', u'javascript:if(this.width>950) this.width=950')]

So where is the src attribute?

If I replace html with something like

html = '''<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />'''

I get the correct result: [(u'src', u'/image/fluffybunny.jpg'), (u'title', u'Harvey the bunny'), (u'alt', u'a cute little fluffy bunny')]

I'm new to HTML and BeautifulSoup. Is there something I'm missing? Thanks for any ideas.

score 8 · Accepted Answer

I tested this with both version 3 and version 4 of BeautifulSoup and noticed that bs4 (version 4) seems to repair your HTML better than version 3 does.

With BeautifulSoup 3:

>>> html = """<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">"""
>>> soup = BeautifulSoup(html) # Version 3 of BeautifulSoup
>>> print soup
<img onload="javascript:if(this.width&gt;950) this.width=950" />950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"&gt;

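Notice that the > inside the onload value is now &gt;, and that everything after it, including the src attribute, has ended up outside the tag as plain text.

Also, when you call BeautifulSoup(), it splits the markup apart. If you print soup.img, you get: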

<img onload="javascript:if(this.width&gt;950) this.width=950" />

So you lose the rest of the tag, including src.

With bs4 (BeautifulSoup 4, the current version):

>>> html = '''<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
>>> soup = BeautifulSoup(html) 
>>> print soup
<html><body><img onload="javascript:if(this.width&gt;950) this.width=950" src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"/></body></html>

Now for .attrs: in BeautifulSoup 3 it returns a list of tuples, as you have discovered. In BeautifulSoup 4 it returns a dictionary:

>>> print soup.findAll('img')[0].attrs # Version 3
[(u'onload', u'javascript:if(this.width>950) this.width=950')]

>>> print soup.findAll('img')[0].attrs # Version 4
{'onload': 'javascript:if(this.width>950) this.width=950', 'src': 'http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg'}

So what should you do? Get BeautifulSoup 4. It parses this HTML much better.
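For anyone trying this today, here is a minimal sketch of the same check on Python 3 with bs4. It assumes the beautifulsoup4 package is installed and uses Python's built-in "html.parser" backend, which is my choice and not something the answer specifies; other backends such as lxml may produce slightly different trees.

```python
from bs4 import BeautifulSoup

# The markup from the question, including the '>' inside the onload value.
html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''

# Assumption: the standard-library parser backend is good enough here.
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

# In bs4, .attrs behaves like a dictionary keyed by attribute name.
attrs = img.attrs
print(sorted(attrs))
print(attrs["src"])
```

Both onload and src should show up as keys, so the src value is no longer lost after the > character.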

By the way, if all you want is the src, you don't need to call .attrs:

>>> print soup.findAll('img')[0].get('src')
http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg
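A short sketch of how .get() compares with square-bracket access in bs4, since the two behave differently when an attribute is absent. The second img tag below, which has no src, is a made-up example for illustration, and "html.parser" is again an assumed parser choice.

```python
from bs4 import BeautifulSoup

# First tag is from the question; the second (no src) is hypothetical.
html = '''<img src="/image/fluffybunny.jpg" alt="a cute little fluffy bunny" />
<img alt="an image with no src">'''
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("img")

# Square brackets raise KeyError when the attribute is missing.
first_src = first["src"]

# .get() returns None (or a supplied default) instead of raising.
missing = second.get("src")
fallback = second.get("src", "")

# src=True keeps only the tags that actually carry a src attribute.
srcs = [img["src"] for img in soup.find_all("img", src=True)]
print(first_src, missing, repr(fallback), srcs)
```

Using .get() is the safer option when scraping pages where some img tags may have no src at all.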

score 0 · Answer

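This approach is useful:

# `container` is assumed to be a Tag or BeautifulSoup object parsed earlier.
picture = container.find("div", {"class": "ika-picture-flex-box"})
sources = picture.find_all("source")
sources[1].get("srcset")  # srcset of the second <source> element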
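The snippet above depends on a page that is not shown, so here is a self-contained sketch of the same idea. Only the class name ika-picture-flex-box comes from the answer; the surrounding markup, the image paths, and the "html.parser" backend are assumptions made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class name is taken from the answer above.
html = '''
<div class="ika-picture-flex-box">
  <picture>
    <source media="(min-width: 800px)" srcset="/img/large.jpg">
    <source media="(min-width: 400px)" srcset="/img/medium.jpg">
    <img src="/img/small.jpg" alt="example">
  </picture>
</div>
'''
container = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so guard before chaining.
box = container.find("div", class_="ika-picture-flex-box")
sources = box.find_all("source") if box is not None else []

# Collect every srcset rather than relying on a fixed index.
srcsets = [source.get("srcset") for source in sources]
print(srcsets)
```

Indexing with sources[1], as in the answer, works only when at least two source elements exist; collecting the whole list first makes that assumption visible.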
