python - Extracting img from the description tag of an XML file using BeautifulSoup

Question

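I am working on a parser and want to get the image inside the description tag. I am using urllib and BeautifulSoup. I can get images that sit in their own tag, but not the image inside the description tag, where the markup is stored in encoded (escaped) form.

XML code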

<item>
         <title>Kidnapped NDC member and political activist tells his story</title>
         <link>http://www.yementimes.com/en/1724/news/3065</link>
         <description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
‘I kept telling them that they would never break me and that the change we demanded in 2011 would come whether they wanted it or not’
&lt;br clear="all"&gt;</description>

views.py

for q in b.findAll('item'):
            d={}
            d['desc']=strip_tags(q.description.string).strip('&nbsp')
            if q.guid:
                d['link']=q.guid.string
            else:   
                d['link']=strip_tags(q.comments)
            d['title']=q.title.string
            for r in q.findAll('enclosure'):
                d['image']=r['url']
            arr.append(d)

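Can anyone give me an idea of how to do this?
The code above is what I use to parse images that sit in their own tag. I tried to work out how to handle an image inside the description, but I couldn't.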

score 0 · Accepted Answer

You can try extracting all the content from <description>, creating a new BeautifulSoup object from it, and reading the src attribute of the first <img> element:

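from bs4 import BeautifulSoup
import sys
import html

# Name the parser explicitly so the result does not depend on what is installed.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for i in soup.find_all('item'):
    # html.unescape replaces HTMLParser.unescape, which was removed in Python 3.9.
    # Decode any remaining entities, then parse the description as HTML.
    d = BeautifulSoup(html.unescape(i.description.string), 'html.parser')
    print(d.img['src'])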

Run it like this:

python3 script.py xmlfile

This produces:

http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg
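
The same two-step idea (read the description text, then parse that text again as HTML) can be tried without a file on disk. Below is a minimal sketch that uses only the standard library instead of BeautifulSoup, so it is an alternative illustration and not the approach from the answer itself. The `<rss>`/`<channel>` wrapper and the closing `</item>` are assumed, since the snippet in the question is partial, and `FirstImgSrc` and `first_img_src` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Sample feed modelled on the question's <item>. The <rss>/<channel> wrapper
# and the closing tags are assumptions; the quoted body text is omitted.
FEED = """<rss><channel><item>
<title>Kidnapped NDC member and political activist tells his story</title>
<link>http://www.yementimes.com/en/1724/news/3065</link>
<description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
&lt;br clear="all"&gt;</description>
</item></channel></rss>"""


class FirstImgSrc(HTMLParser):
    """Hypothetical helper: remembers the src of the first <img> it sees."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also arrive here by default.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(fragment):
    """Return the first <img> src in an HTML fragment, or None."""
    parser = FirstImgSrc()
    parser.feed(fragment or "")
    parser.close()
    return parser.src


arr = []
for item in ET.fromstring(FEED).iter("item"):
    # ElementTree has already turned &lt; and &gt; back into < and >,
    # so the description text is a plain HTML fragment.
    description = item.findtext("description")
    arr.append({
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "image": first_img_src(description),
    })

print(arr[0]["image"])
# prints http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg
```

The dictionary keys mirror the ones in the question's views.py, so the `image` value could be filled from the description whenever an item has no `enclosure` tag.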
