python - lxml and <noscript> in <head>

2019-11-19 19:55:37 Author: Internet

I ran into a strange bug in lxml:

>>> import lxml.html
>>> s = '<html><head><noscript></noscript><script></script><meta></head></html>'
>>> root = lxml.html.fromstring(s)
>>> root.xpath('/html/head/meta')
[]
>>> root.xpath('/html/body/meta')
[<Element meta at 0x2a92788>]

The meta tag should be inside the head element, not inside body. How can I get the correct element in this case?

Solution:

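Let me guess: are you using an older version of Ubuntu (12.04, for example)?
This is actually a bug in the old version of the pre-installed libxml2 library that the lxml package uses. The release notes for version 2.8.0 mention a fix for an HTML parser bug involving <noscript> inside <head>, so my guess is that libxml2 >= 2.8.0 should work. Ubuntu 12.04 ships version 2.7.8. You can check which libxml2 version lxml was compiled against and which one it is running with: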

>>> import lxml.etree
>>> lxml.etree.LIBXML_COMPILED_VERSION
(2, 7, 8)
>>> lxml.etree.LIBXML_VERSION
(2, 9, 1)

I think that if either of these versions is >= 2.8.0, the <noscript> problem should go away.
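
If upgrading libxml2 is not an option, one possible workaround is to stop depending on where the parser places the element. The sketch below is only an illustration, not something taken from the release notes: has_noscript_fix and find_meta are hypothetical helper names, and the (2, 8, 0) threshold is simply the guess made above. It compares the runtime libxml2 version against that threshold and looks up <meta> with a position-independent XPath.

```python
import lxml.etree
import lxml.html

# Assumption: 2.8.0 is the first libxml2 release containing the fix,
# based on the release notes mentioned above.
FIXED_IN = (2, 8, 0)


def has_noscript_fix(version=None):
    """Return True if the given libxml2 version is at least FIXED_IN."""
    if version is None:
        # libxml2 version that lxml is using at runtime
        version = lxml.etree.LIBXML_VERSION
    return tuple(version) >= FIXED_IN


def find_meta(html_text):
    """Return all <meta> elements, wherever the parser placed them."""
    root = lxml.html.fromstring(html_text)
    # '//meta' matches at any depth, so it finds the element whether it
    # ended up under <head> or, on affected versions, under <body>.
    return root.xpath('//meta')


s = '<html><head><noscript></noscript><script></script><meta></head></html>'
metas = find_meta(s)
print(has_noscript_fix(), [m.tag for m in metas])
```

Note that '//meta' would also match any <meta> that really does sit in the body, so this only helps when the document has no such elements.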

Tags: lxml, lxml-html, noscript, python
Source: https://codeday.me/bug/20191119/2038667.html