feedparser fails when run from a script, but the failure can't be reproduced in the interactive Python console
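When I run the script from Eclipse or in IPython, it fails with:
'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)
I don't know why, but when I simply execute the same feedparser.parse(url) statement with the same URL in the interactive console, no error is raised. This has me stumped.
The code is simple: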
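import logging
import feedparser

try:
    d = feedparser.parse(url)
except Exception as e:  # the "as" form works on Python 2.6+ and Python 3
    logging.error('Error while retrieving feed.')
    logging.error(e)
    # formatExceptionInfo and formatExceptionInfo1 are the asker's own helpers (not shown)
    logging.error(formatExceptionInfo(None))
    logging.error(formatExceptionInfo1())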
Here is the stack trace:
    d = feedparser.parse(url)
  File "C:\Python26\lib\site-packages\feedparser.py", line 2623, in parse
    feedparser.feed(data)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "C:\Python26\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python26\lib\sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "C:\Python26\lib\sgmllib.py", line 360, in finish_endtag
    self.unknown_endtag(tag)
  File "C:\Python26\lib\site-packages\feedparser.py", line 476, in unknown_endtag
    method()
  File "C:\Python26\lib\site-packages\feedparser.py", line 1318, in _end_content
    value = self.popContent('content')
  File "C:\Python26\lib\site-packages\feedparser.py", line 700, in popContent
    value = self.pop(tag)
  File "C:\Python26\lib\site-packages\feedparser.py", line 641, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1594, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "C:\Python26\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python26\lib\sgmllib.py", line 138, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\sgmllib.py", line 296, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "C:\Python26\lib\sgmllib.py", line 338, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1588, in unknown_starttag
    attrs = [(key, ((tag, key) in self.relative_uris) and self.resolveURI(value) or value) for key, value in attrs]
  File "C:\Python26\lib\site-packages\feedparser.py", line 1584, in resolveURI
    return _urljoin(self.baseuri, uri)
  File "C:\Python26\lib\site-packages\feedparser.py", line 286, in _urljoin
    return urlparse.urljoin(base, uri)
  File "C:\Python26\lib\urlparse.py", line 215, in urljoin
    params, query, fragment))
  File "C:\Python26\lib\urlparse.py", line 184, in urlunparse
    return urlunsplit((scheme, netloc, url, query, fragment))
  File "C:\Python26\lib\urlparse.py", line 192, in urlunsplit
    url = scheme + ':' + url
  File "C:\Python26\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
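Partially solved:
The error is reproducible when the URL passed to feedparser.parse() is a unicode string. It does not reproduce when the URL is a plain ASCII byte string. For the record, you also need a feed that contains some high (non-ASCII) Unicode characters. I'm not sure why this happens.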
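A minimal sketch of a workaround that follows from this observation, assuming the Python 2 setup from the question. In Python 2, concatenating a unicode string with a byte string makes the interpreter decode the byte string implicitly, and that appears to be the step at which urlparse fails in the trace above. Passing the URL as an ASCII byte string avoids mixing the two types. The helper name to_ascii_url is hypothetical and not part of feedparser.

```python
def to_ascii_url(url):
    """Return url as an ASCII byte string (hypothetical helper)."""
    if isinstance(url, bytes):  # on Python 2.6+, bytes is an alias for str
        return url
    # Raises UnicodeEncodeError if the URL itself contains non-ASCII
    # characters, which surfaces the problem before any parsing starts.
    return url.encode("ascii")

# Intended use under Python 2, as in the question:
#     d = feedparser.parse(to_ascii_url(url))
```

This only sidesteps the type mismatch; a feed whose encoding is undeclared may still need the handling described in the answer.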
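Answer:
It looks like the URL that is giving you trouble returns text in some encoding (such as latin-1, where 0xe2 is "lowercase a with a circumflex", aka &acirc;) but without a proper content-type header (there should be a charset= parameter in Content-Type, but there isn't).
If that is the case, feedparser cannot guess the encoding, tries the default (ascii), and fails.
feedparser's documentation explains these issues in more detail.
Unfortunately, there is no "silver bullet" for this problem in general (it is caused by the mass of feeds that break the XML rules). You can try catching this exception and, in the handler, reading the URL's contents separately (with urllib2) and attempting to decode them with various plausible encodings. Once you finally get a usable unicode object this way, pass it to feedparser.parse (the first argument can be a URL, a file stream, or a unicode string containing the data).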
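The decoding step suggested above could be sketched as follows. This is only an illustration under stated assumptions: the helper name decode_with_fallbacks and the candidate list (utf-8, cp1252, latin-1) are hypothetical choices, not something feedparser prescribes. latin-1 is placed last because it accepts every byte value and therefore never fails.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the candidate encodings are an assumption and
    should be adjusted to the feeds you actually deal with.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")

# Possible use inside the except handler (Python 2, as in the question):
#     raw = urllib2.urlopen(url).read()
#     text, used = decode_with_fallbacks(raw)
#     d = feedparser.parse(text)
```

A successful decode does not prove the guess was right, since several single-byte encodings accept the same bytes, so the resulting text is worth a sanity check before relying on it.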