
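I am trying to write an application in Python 2.7 to run on Google App Engine. I want it to parse RSS feeds and store the data in a database, but I get inconsistent behavior when reading the value of the 'url' attribute of the 'enclosure' tag. I am new to coding and would appreciate any help.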

I have two RSS feeds:

Feed A: http://youhadtobethere.libsyn.com/rss

<item>
<title>Episode 97: Yannis Pappas</title>
<pubDate>Thu, 07 Feb 2013 19:38:00 +0000</pubDate>
<guid isPermaLink="false"><![CDATA[1364808bb99fe6bdb71b16333530076f]]></guid>
<link><![CDATA[http://youhadtobethere.libsyn.com/episode-97-yannis-pappas]]></link>
<media:thumbnail url="http://assets.libsyn.com/item/2210463" />
<description><![CDATA[<div>This week, Nikki and Sara marvel at how, two weeks in, they've already gotten used to the process of making their television show.&nbsp; Sara recently saw <i>Django Unchained</i> in a now-rare moment of free time and when she says she liked it, Brooklyn-born comic and certified "grown man" (see <a href="http://splitsider.com/2013/02/you-had-to-be-there-96-jessimae-peluso/">ep. 96</a>) <a href="http://ditchfilms.com/?page_id=2">Yannis Pappas</a> (<a href="http://www.youtube.com/watch?v=PWSRHNvTSIU"><i>Modern Comedian</i></a>, <a href="https://twitter.com/yannispappas">Twitter</a>) offers his wholehearted agreement.&nbsp; After flinging fury at Brooklyn's bogus new neighborhoods, Philly's sports obsessions, and Beantown's general demeanor, Yannis tells the story of the shooting that shoved him into maturity early on in his career.&nbsp; The trio muse a bit on their futile little legacies but soon leap into a joyous edition of Talking Pee that includes a <a href="http://whatsongamidancingto.com/">previral guessing game</a> on YouTube, a <a href="http://www.fitbit.com/">playful pedometer</a> on everyone's waistband, Yannis's modest dog zoo, and Nikki's upcoming appearance on <a href="http://www.comedycentral.com/shows/the-burn-with-jeff-ross"><i>The Burn</i></a>.&nbsp; Check her out this Tuesday (2/12) on Comedy Central at 10.30pm/9.30c...<br /><br /></div>
<p>...conveniently right before you flip over to <b><i>Nikki &amp; Sara LIVE</i></b> on <b>MTV </b>at <b>11pm/10c</b>.&nbsp; <i>Nikki &amp; Sara LIVE</i>: like a podcast for your eyes!</p>]]></description>
<enclosure length="61227675" type="audio/mpeg" url="http://traffic.libsyn.com/youhadtobethere/YHTBT_97_YannisPappas.mp3" />
<itunes:duration>01:03:47</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:keywords>pappas,sara,nikki,schaefer,glaser,yannis</itunes:keywords>
<itunes:subtitle><![CDATA[This week, Nikki and Sara marvel at how, two weeks in, they've already gotten used to the process of making their television show.&nbsp; Sara recently saw Django Unchained in a now-rare moment of free time and when she says she liked it, Brooklyn-born...]]></itunes:subtitle>
</item>

Feed B: http://smodcast.com/channels/bagged-boarded-live/feed/

<item>
<title>
Bagged & Boarded Live 146: All Superheroes Must Pod
</title>
<link>
http://smodcast.com/episodes/all-superheroes-must-pod/
</link>
<comments>
http://smodcast.com/episodes/all-superheroes-must-pod/#comments
</comments>
<pubDate>Tue, 22 Jan 2013 00:08:35 +0000</pubDate>
<dc:creator>Editor</dc:creator>
<category>
<![CDATA[ Episodes ]]>
</category>
<guid isPermaLink="false">http://smodcast.com/?p=12999</guid>
<description>
<![CDATA[
In which Matt sits down with Jason Trost (The Fp) and Lucas Till (X-Men First Class) for a chat about their new film All Superheroes Must Die
]]>
</description>
<content:encoded>
<![CDATA[
In which Matt sits down with Jason Trost (The Fp) and Lucas Till (X-Men First Class) for a chat about their new film All Superheroes Must Die
]]>
</content:encoded>
<wfw:commentRss>
http://smodcast.com/episodes/all-superheroes-must-pod/feed/
</wfw:commentRss>
<slash:comments>0</slash:comments>
<enclosure url="http://api.soundcloud.com/tracks/75907996/stream.mp3?client_id=a427c512429c9c90e58de7955257879c" length="0" type="audio/mpeg"/>
</item>

Code snippet:

import urllib

from lxml import etree

rss = etree.parse(urllib.urlopen(feedUrl))
show = rss.getroot()

for episode in show.iter('item'):
    mediaUrl = episode.xpath('enclosure/@url')

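This returns a list with the value of the url attribute as its only item. When run on Feed A, I can instead use mediaUrl = episode.xpath('enclosure/@url')[0] or mediaUrl = mediaUrl[0] to save the URL as a string. On Feed B, however, both produce the error IndexError: list index out of range. If I call len(mediaUrl) on the list returned from Feed B, the result is 1, which I take to mean that it returned a list containing the URL, yet trying to get the URL out of that list raises an IndexError.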

I tried:

enclosure = episode.find('enclosure') 
mediaUrl = enclosure.get('url') 

This gets the URL from Feed A as a string just fine, but raises AttributeError: 'NoneType' object has no attribute 'get' on Feed B. I get the same behavior when using:

mediaUrl = episode.find('enclosure').attrib['url']

This correctly returns the URL from Feed A as a string, but raises AttributeError: 'NoneType' object has no attribute 'attrib' on Feed B.

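I don't see any obvious difference between the layouts of the two RSS feeds that would explain why the last two methods easily extract the URL from Feed A but cannot see it at all in Feed B. I also don't understand why the first method lets me extract the URL from the list returned for Feed A but not from the list returned for Feed B. Can anyone help?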


1 Answer


Your mistake is assuming that every item in the feed has an <enclosure /> tag.

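That is true for your first feed, but at the time of writing, the second feed URL has two items that contain no enclosure URL. For those items, xpath('enclosure/@url') returns an empty list and find('enclosure') returns None, which is where the IndexError and AttributeError come from.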
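
To check a feed for such items, something like the following sketch may help. It is not the asker's code: it uses the standard-library xml.etree.ElementTree instead of lxml so that it runs without extra packages, and SAMPLE_RSS and titles_without_enclosure are made-up names wrapping a tiny hypothetical feed rather than the real Feed B.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed (not the real Feed B):
# the second item deliberately has no <enclosure /> tag.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>With audio</title>
<enclosure url="http://example.com/a.mp3" length="0" type="audio/mpeg"/></item>
<item><title>Without audio</title></item>
</channel></rss>"""


def titles_without_enclosure(xml_text):
    """Return the titles of all items that have no <enclosure> child."""
    root = ET.fromstring(xml_text)
    return [
        item.findtext('title', default='').strip()
        for item in root.iter('item')
        # Compare against None explicitly: find() returns None when the
        # child is missing, and a childless element is itself falsy.
        if item.find('enclosure') is None
    ]


missing = titles_without_enclosure(SAMPLE_RSS)
print(missing)
```

Running the same check against the downloaded feed text should list the episodes that trigger the errors.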

Simply skip those items:

for episode in show.iter('item'):
    mediaUrl = episode.xpath('enclosure/@url')
    if not mediaUrl:  # no enclosure for this episode
        continue
    mediaUrl = mediaUrl[0]  # safe now: the list holds the URL string
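
The same guard works for the find() approach from the question: test the result for None before reading the attribute. The sketch below is only an illustration under assumptions. It swaps in the standard-library xml.etree.ElementTree for lxml so it runs on its own, and SAMPLE_RSS and enclosure_urls are hypothetical names, not part of either feed or of the asker's application.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed with one item that has an enclosure and one that lacks it.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>With audio</title>
<enclosure url="http://example.com/a.mp3" length="0" type="audio/mpeg"/></item>
<item><title>Without audio</title></item>
</channel></rss>"""


def enclosure_urls(xml_text):
    """Collect the enclosure URL of every item that actually has one."""
    urls = []
    root = ET.fromstring(xml_text)
    for episode in root.iter('item'):
        enclosure = episode.find('enclosure')
        if enclosure is None:  # no enclosure for this episode
            continue
        url = enclosure.get('url')  # None if the attribute is absent
        if url:
            urls.append(url)
    return urls


print(enclosure_urls(SAMPLE_RSS))
```

Using get('url') rather than attrib['url'] also avoids a KeyError if an enclosure ever appears without a url attribute.
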
answered 2013-02-12T21:52:40.587