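I want to remove the `a` tags (links) that wrap the images I find. So, for performance, I list all the images in the HTML, look for each one's wrapping tag, and simply remove the link.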

I'm using BeautifulSoup and I'm not sure what I'm doing wrong: instead of removing the `a` tag, it removes the inner content.

This is what I did:

from bs4 import BeautifulSoup

html = '''<div> <a href="http://somelink"><img src="http://imgsrc.jpg" /></a> <a href="http://somelink2"><img src="http://imgsrc2.jpg /></a>"  '''
soup = BeautifulSoup(html)
for img in soup.find_all('img'):
    print 'THIS IS THE BEGINING /////////////// '
    #print img.find_parent('a').unwrap()
    print img.parent.unwrap()

This gives me the following output:

>>> print img.parent()
<a href="http://somelink"><img src="http://imgsrc.jpg" /></a> 
<a href="http://somelink2"><img src="http://imgsrc2.jpg /></a>

>>> print img.parent.unwrap()
<a href="http://somelink"></a> 
<a href="http://somelink2"></a>

I get the same result when I use `replaceWith` or `replaceWithChildren`, and also when I use `object.parent` or `findParent`.

I'm not sure what I'm doing wrong. I only started using Python a few weeks ago.

3 Answers

The unwrap() function returns the tag that has been removed. The tree itself has been properly modified. Quoting from the unwrap() documentation:

Like replace_with(), unwrap() returns the tag that was replaced.

In other words: it works correctly! Print the new parent of img instead of the return value of unwrap() to see that the <a> tags have indeed been removed:

>>> from bs4 import BeautifulSoup
>>> html = '''<div> <a href="http://somelink"><img src="http://imgsrc.jpg" /></a> <a href="http://somelink2"><img src="http://imgsrc2.jpg /></a>"  '''
>>> soup = BeautifulSoup(html)
>>> for img in soup.find_all('img'):
...     img.parent.unwrap()
...     print img.parent
... 
<a href="http://somelink"></a>
<div> <img src="http://imgsrc.jpg"/> <a href="http://somelink2"><img src="http://imgsrc2.jpg /&gt;&lt;/a&gt;"/></a></div>
<a href="http://somelink2"></a>
<div> <img src="http://imgsrc.jpg"/> <img src="http://imgsrc2.jpg /&gt;&lt;/a&gt;"/></div>

Here Python echoes the return value of img.parent.unwrap(), followed by the output of the print statement, which shows that the parent of the <img> tag is now the <div> tag. The first print shows the other <img> tag still wrapped; the second shows both as direct children of the <div> tag.
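
To make the difference between the return value and the modified tree easier to see, here is a minimal sketch in Python 3. It assumes a corrected version of the HTML (closing quote and `</div>` added) and the built-in `html.parser`; neither comes from the question, and the variable name `removed` is only illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: cleaned-up markup, not the exact string from the question.
html = ('<div><a href="http://somelink"><img src="http://imgsrc.jpg" /></a> '
        '<a href="http://somelink2"><img src="http://imgsrc2.jpg" /></a></div>')
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")

# unwrap() detaches the <a> tag, moves its children up into the <a>'s
# former parent, and returns the detached (now empty) <a> tag.
removed = img.parent.unwrap()

print(removed)          # the emptied <a> tag that was taken out of the tree
print(img.parent.name)  # the tag that now directly contains the <img>
print(soup)             # the tree itself, with the first link gone
```

Printing `removed` is what produced the empty `<a>` tags in the question; the tree in `soup` is where the actual result lives.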

answered 2013-08-10T15:14:47.997

I'm not sure what output you're looking for. Is this it?

from bs4 import BeautifulSoup

html = '''<div> <a href="http://somelink"><img src="http://imgsrc.jpg" /></a> <a href="http://somelink2"><img src="http://imgsrc2.jpg" /></a>  '''
soup = BeautifulSoup(html)
for img in soup.find_all('img'):
    img.parent.unwrap()
print(soup)

yields

<html><body><div> <img src="http://imgsrc.jpg"/> <img src="http://imgsrc2.jpg"/></div></body></html>
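
One caveat with calling `img.parent.unwrap()` unconditionally: if an image is not wrapped in a link, its parent (for example the `<div>`) would be unwrapped instead. A possible guard, sketched here in Python 3 with `html.parser` and a made-up second image URL, checks the parent's tag name first:

```python
from bs4 import BeautifulSoup

# Assumption: the second image is deliberately not wrapped in a link,
# and its URL is invented for this illustration.
html = ('<div><a href="http://somelink"><img src="http://imgsrc.jpg" /></a> '
        '<img src="http://unlinked.jpg" /></div>')
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    parent = img.parent
    # Only unwrap when the direct parent really is a link.
    if parent is not None and parent.name == "a":
        parent.unwrap()

print(soup)  # both images remain inside the <div>; the <a> is gone
```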
answered 2013-08-10T15:06:19.953

I haven't worked much with Python, but it looks like unwrap returns the HTML that was removed, and not the img tag you're looking for. Try calling soup.prettify() and see if the link was removed after all.
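
A quick way to try that suggestion, as a rough Python 3 sketch (the single-link HTML and the `before`/`after` names are assumptions for illustration, not taken from the question):

```python
from bs4 import BeautifulSoup

html = '<div><a href="http://somelink"><img src="http://imgsrc.jpg" /></a></div>'
soup = BeautifulSoup(html, "html.parser")

before = soup.prettify()   # indented markup, link still present
soup.img.parent.unwrap()   # return value ignored on purpose
after = soup.prettify()    # indented markup after the link was removed

print(before)
print(after)
```

Comparing the two printed trees shows whether the `<a>` wrapper is still there, independently of what `unwrap()` returned.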

answered 2013-08-10T15:18:46.930