
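I'm trying to write a Python program that fetches and displays any RSS updates published since the last time the program ran. I'm using feedparser and trying to use ETags and Last-Modified as described on SO, but my test script doesn't seem to work.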

import feedparser
rsslist=["http://skottieyoung.tumblr.com/rss","http://mrjakeparker.com/feed/"]
for feed in rsslist:
    print('--------'+feed+'-------')
    d=feedparser.parse(feed)
    print(len(d.entries))
    if (len(d.entries) > 0):
        etag=d.feed.get('etag','')
        modified=d.get('modified',d.get('updated',d.entries[0].get('published','no modified,update or published fields present in rss')))

        d2=feedparser.parse(feed,modified)
        if (len(d2.entries) > 0):
            etag2=d2.feed.get('etag','')
            modified2=d2.get('updated',d.entries[0].get('published',''))

        if (d2==d): #ideally we would never see this bc etags/last modified would prevent unnecessarily downloading what we already have.
            print("Arrg these are the same")

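Honestly, I'm not sure whether the RSS/XML technology differs from the references I've been using online, or whether there is something wrong with my code.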

In any case, I'm looking for the best way to consume RSS feeds efficiently. For now, I want to minimize wasted bandwidth, for example by using the Last-Modified and ETag fields.

Thanks in advance.


2 Answers


Your problem is that you are passing the last-modified date in place of the etag. etag is the second parameter of parse(), and modified is the third.

Instead of:

d2=feedparser.parse(feed,modified)

do:

d2=feedparser.parse(feed,modified=modified)
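
To make the difference concrete, here is a minimal sketch. conditional_headers is a hypothetical stand-in I wrote for illustration, not feedparser's code; it only assumes the same leading parameter order (url, etag, modified) and maps each value to the request header it is meant for.

```python
def conditional_headers(url, etag=None, modified=None):
    """Hypothetical stand-in: same leading parameter order as parse()."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag          # validator for ETags
    if modified:
        headers["If-Modified-Since"] = modified  # validator for dates
    return headers


last_modified = "Wed, 07 Nov 2012 19:27:48 GMT"

# Positional: the date silently lands in the etag slot.
wrong = conditional_headers("http://example.com/feed", last_modified)

# Keyword: the date ends up in the header it belongs to.
right = conditional_headers("http://example.com/feed", modified=last_modified)

print(wrong)
print(right)
```

With the positional call the server is asked about an ETag it never issued, so it has no reason to answer "not modified".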

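After looking at the source code, it seems that the only thing passing etag or modified to parse() does is send the appropriate headers to the server, so that the server can return an empty response if nothing has changed. If the server doesn't support this, it will simply return the full RSS feed. I would modify your code to check the date of each entry and ignore entries dated no later than the maximum date seen in the previous request: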

import feedparser
rsslist=["http://skottieyoung.tumblr.com/rss", "http://mrjakeparker.com/feed/"]

def feed_modified_date(feed):
    # this is the last-modified value in the response header
    # do not confuse this with the time that is in each feed as the server
    # may be using a different timezone for the Last-Modified header than it
    # uses for the publish date

    modified = feed.get('modified')
    if modified is not None:
        return modified

    return None

def max_entry_date(feed):
    entry_pub_dates = (e.get('published_parsed') for e in feed.entries)
    entry_pub_dates = tuple(e for e in entry_pub_dates if e is not None)

    if len(entry_pub_dates) > 0:
        return max(entry_pub_dates)    

    return None

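def entries_with_dates_after(feed, date):
    # with no reference date there is nothing to compare against,
    # so treat every entry as new
    if date is None:
        return list(feed.entries)

    response = []

    for entry in feed.entries:
        published = entry.get('published_parsed')
        # skip undated entries: comparing None with a date raises on Python 3
        if published is not None and published > date:
            response.append(entry)

    return response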

for feed_url in rsslist:
    print('--------%s-------' % feed_url)
    d = feedparser.parse(feed_url)
    print('feed length %i' % len(d.entries))

    if len(d.entries) > 0:
        etag = d.get('etag')  # the ETag lives on the result itself, not on d.feed
        modified = feed_modified_date(d)
        print('modified at %s' % modified)

        d2 = feedparser.parse(feed_url, etag=etag, modified=modified)
        print('second feed length %i' % len(d2.entries))
        if len(d2.entries) > 0:
            print("server does not support etags or there are new entries")
            # perhaps the server does not support etags or last-modified
            # filter entries ourself

            prev_max_date = max_entry_date(d)

            entries = entries_with_dates_after(d2, prev_max_date)

            print('%i new entries' % len(entries))
        else:
            print('there are no entries')

This produces:

--------http://skottieyoung.tumblr.com/rss-------
feed length 20
modified at None
second feed length 20
server does not support etags or there are new entries
0 new entries
--------http://mrjakeparker.com/feed/-------
feed length 10
modified at Wed, 07 Nov 2012 19:27:48 GMT
second feed length 0
there are no entries
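
Since the goal is to see what changed since the last run of the program, the etag and modified values have to survive between runs. Below is a minimal sketch of one way to do that with a JSON file. The function names and the file layout are my own assumptions, not part of feedparser; the functions only expect a dict-like parsed result with optional 'status', 'etag' and 'modified' keys.

```python
import json
import os


def load_validators(path):
    """Return {url: {"etag": ..., "modified": ...}} saved by an earlier run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def remember_validators(state, url, result):
    """Record the validators from a parsed result (any dict-like object)."""
    if result.get("status") == 304:
        # Not Modified: the response carries no feed, keep the stored values.
        return
    state[url] = {
        "etag": result.get("etag"),
        "modified": result.get("modified"),
    }


def save_validators(path, state):
    """Write the validators back to disk for the next run."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
```

On each run you would load the file, look up saved = state.get(feed_url, {}), call feedparser.parse(feed_url, etag=saved.get('etag'), modified=saved.get('modified')), then pass the result to remember_validators and save. A status of 304 with no entries, which is consistent with the second feed in the output above, means the server confirmed that nothing changed.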

answered 2012-11-08T21:28:56.497

I would suggest using the Date response header as a fallback if the feed provides no ETag or Last-Modified information.

Use feed['headers']['Date'], which can be passed along like this:

feedparser.parse(url, modified=feed['headers']['Date'])

Edit: It looks like some servers ignore the modified parameter, though.
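
A minimal sketch of that fallback, written against a plain dict-like result so it can be tried without a network call. The helper name modified_or_date is hypothetical, and I am not certain the header keys are capitalized the same way in every feedparser version, so the lookup ignores case.

```python
def modified_or_date(result):
    """Prefer the Last-Modified value; fall back to the Date header, else None."""
    modified = result.get("modified")
    if modified:
        return modified

    headers = result.get("headers") or {}
    for name, value in headers.items():
        # Header capitalization may vary, so compare case-insensitively.
        if name.lower() == "date":
            return value

    return None
```

The returned value can then be passed as feedparser.parse(url, modified=modified_or_date(feed)). When it is None, the request simply goes out without a date validator.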

answered 2018-03-21T21:48:06.873