Python etag / last-modified not working; how to get the latest RSS
Author: Internet
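I'm trying to write a Python program that fetches and displays any RSS updates since the last time the program ran. I'm using feedparser and trying to use etags and last-modified as described here on SO, but my test script doesn't seem to work.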
import feedparser
rsslist=["http://skottieyoung.tumblr.com/rss","http://mrjakeparker.com/feed/"]
for feed in rsslist:
    print('--------'+feed+'-------')
    d=feedparser.parse(feed)
    print(len(d.entries))
    if (len(d.entries) > 0):
        etag=d.feed.get('etag','')
        modified=d.get('modified',d.get('updated',d.entries[0].get('published','no modified,update or published fields present in rss')))
        d2=feedparser.parse(feed,modified)
        if (len(d2.entries) > 0):
            etag2=d2.feed.get('etag','')
            modified2=d2.get('updated',d.entries[0].get('published',''))
            if (d2==d): #ideally we would never see this bc etags/last modified would prevent unnecessarily downloading what we already have.
                print("Arrg these are the same")
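Honestly, I'm not sure whether RSS/XML technology has changed since the references I found online were written, or whether something is wrong with my code.
Either way, I'm looking for the best way to consume RSS feeds efficiently. I want to minimize wasted bandwidth, for example by using the last-modified and etag fields.
Thanks in advance.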
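Solution:
Your problem is that you are passing the last-modified date in place of the etag. etag is the second argument of the parse() method, and modified is the third.
Instead of: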
d2=feedparser.parse(feed,modified)
do:
d2=feedparser.parse(feed,modified=modified)
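To see why the positional call goes wrong, here is a minimal sketch. `parse_stub` is a hypothetical stand-in, not feedparser itself; it only copies the order of the first three parameters described above (url, then etag, then modified) and reports how its arguments were bound. The example URL is also made up.

```python
def parse_stub(url, etag=None, modified=None):
    """Hypothetical stand-in for parse(): report how the arguments were bound."""
    return {"url": url, "etag": etag, "modified": modified}


stamp = "Wed, 07 Nov 2012 19:27:48 GMT"

# Passed positionally, the date lands in the etag slot.
positional = parse_stub("http://example.com/feed", stamp)

# Passed by keyword, the date reaches the modified slot as intended.
keyword = parse_stub("http://example.com/feed", modified=stamp)

print(positional)
print(keyword)
```

Naming both arguments in every call avoids depending on parameter order at all.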
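After looking at the source code, it appears that the only thing passing etag or modified to the parse() function does is send the corresponding headers to the server, so that the server can return an empty response when nothing has changed. If the server does not support this, it simply returns the full RSS feed. I would modify your code to check the date of each entry and ignore any entry whose date is not later than the maximum date seen in the previous request: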
import feedparser

rsslist = ["http://skottieyoung.tumblr.com/rss", "http://mrjakeparker.com/feed/"]

def feed_modified_date(feed):
    # This is the Last-Modified value from the response header.
    # Do not confuse it with the time inside each feed, as the server
    # may use a different timezone for Last-Modified headers than it
    # uses for the publish date.
    return feed.get('modified')

def max_entry_date(feed):
    entry_pub_dates = (e.get('published_parsed') for e in feed.entries)
    entry_pub_dates = tuple(e for e in entry_pub_dates if e is not None)
    if len(entry_pub_dates) > 0:
        return max(entry_pub_dates)
    return None

def entries_with_dates_after(feed, date):
    # With no previous date to compare against, treat every entry as new.
    if date is None:
        return list(feed.entries)
    response = []
    for entry in feed.entries:
        published = entry.get('published_parsed')
        # Skip entries without a parsed date; None cannot be compared.
        if published is not None and published > date:
            response.append(entry)
    return response

for feed_url in rsslist:
    print('--------%s-------' % feed_url)
    d = feedparser.parse(feed_url)
    print('feed length %i' % len(d.entries))
    if len(d.entries) > 0:
        # feedparser stores the ETag on the result itself, not on d.feed
        etag = d.get('etag')
        modified = feed_modified_date(d)
        print('modified at %s' % modified)
        d2 = feedparser.parse(feed_url, etag=etag, modified=modified)
        print('second feed length %i' % len(d2.entries))
        if len(d2.entries) > 0:
            print("server does not support etags or there are new entries")
            # perhaps the server does not support etags or last-modified
            # filter entries ourselves
            prev_max_date = max_entry_date(d)
            entries = entries_with_dates_after(d2, prev_max_date)
            print('%i new entries' % len(entries))
        else:
            print('there are no entries')
This produces output like the following:
--------http://skottieyoung.tumblr.com/rss-------
feed length 20
modified at None
second feed length 20
server does not support etags or there are new entries
0 new entries
--------http://mrjakeparker.com/feed/-------
feed length 10
modified at Wed, 07 Nov 2012 19:27:48 GMT
second feed length 0
there are no entries
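The script above makes both requests in a single run, while the question asks about updates since the program last ran. One way to bridge that gap is to save the newest entry date (and the etag and modified values) to disk between runs. The sketch below is only an illustration under stated assumptions: the helper names, the JSON file layout and the sample entries are all made up, and entries are plain dicts with a `published_parsed` field shaped like the ones feedparser returns. It uses only the standard library.

```python
import json
import time


def load_state(path):
    """Return the saved per-feed state, or an empty dict on the first run."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_state(path, state):
    """Write the per-feed state back to disk as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)


def newest_date(entries):
    """Return the largest published_parsed value as a list, or None."""
    # Lists survive a JSON round trip unchanged, unlike struct_time or tuples.
    dates = [list(e["published_parsed"]) for e in entries
             if e.get("published_parsed") is not None]
    return max(dates) if dates else None


def new_entries(entries, last_seen):
    """Return entries published strictly after last_seen.

    With no last_seen (first run), every entry counts as new. Entries
    without a parsed date are skipped once a last_seen date exists.
    """
    if last_seen is None:
        return list(entries)
    return [e for e in entries
            if e.get("published_parsed") is not None
            and list(e["published_parsed"]) > list(last_seen)]


if __name__ == "__main__":
    # Made-up entries standing in for d.entries from a parsed feed.
    old = {"title": "old", "published_parsed": time.gmtime(1_000_000_000)}
    new = {"title": "new", "published_parsed": time.gmtime(1_350_000_000)}
    last_seen = newest_date([old])
    print([e["title"] for e in new_entries([old, new], last_seen)])
```

In a real script, the saved etag and modified values for each URL would be passed back into feedparser.parse(feed_url, etag=..., modified=...) on the next run, and new_entries would act as the fallback filter for servers that ignore those headers.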
Tags: python, feedparser Source: https://codeday.me/bug/20190529/1180900.html