Python / ElementTree: parsing inline elements while preserving the surrounding text?
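I need to parse some XML that contains inline elements. For example, the XML looks like this:
<section>
Fubar, I'm so fubar, fubar and even more <fref bar="baz">fubare</fref>. And yet more fubar.
</section>
If I now iterate over this structure with for elem in list(parent): ..., I only get access to fref. If I then process fref, the surrounding text is of course lost, because text is not a real element.
Does anyone know a proper way to solve this?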
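The situation described in the question can be reproduced in a few lines. This is only a sketch, and it assumes the standard-library xml.etree.ElementTree is the parser in use (the question does not say which); the names XML and child_tags are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (assumed to be the complete input).
XML = """<section>
Fubar, I'm so fubar, fubar and even more <fref bar="baz">fubare</fref>. And yet more fubar.
</section>"""

parent = ET.fromstring(XML)

# Iterating over an element yields only its child elements;
# the bare text around them never shows up as an item of its own.
child_tags = [elem.tag for elem in list(parent)]
print(child_tags)  # ['fref']
```

The loop sees a single child, fref, which matches the behavior the question complains about.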
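Solution:
The following session shows how to do this with lxml. The text before the inline element is stored in the parent's text attribute, and the text after it is stored in the inline element's own tail attribute.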
>>> from lxml.etree import fromstring
>>> tree = fromstring('''<section> Fubar, I'm so fubar, fubar and even more <fref bar="baz">fubare</fref>. And yet more fubar. </section>''')
>>> elem = tree.xpath('/section/fref')[0]
>>> elem.text
'fubare'
>>> elem.tail
'. And yet more fubar. '
>>> elem.getparent().text
" Fubar, I'm so fubar, fubar and even more "
From the lxml.etree tutorial:
If you want to read only the text, i.e. without any intermediate tags,
you have to recursively concatenate all text and tail attributes in
the correct order. Again, the tostring() function comes to the rescue,
this time using the method keyword:
>>> from lxml.etree import tostring
>>> tostring(tree, method="text", encoding="unicode")
" Fubar, I'm so fubar, fubar and even more fubare. And yet more fubar. "
The same tutorial page also describes an XPath-based approach.
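As a rough sketch of that XPath approach, assuming lxml is installed and reusing the sample document from above: string() flattens an element to plain text, while //text() returns the individual text chunks, each of which remembers where it came from. The variable names XML, whole, and chunks are chosen only for this illustration.

```python
from lxml import etree

# Same sample document as in the session above.
XML = ("<section> Fubar, I'm so fubar, fubar and even more "
       '<fref bar="baz">fubare</fref>. And yet more fubar. </section>')

tree = etree.fromstring(XML)

# string() concatenates all text content of the context node.
whole = tree.xpath("string()")

# //text() returns every text chunk in document order. lxml hands these
# back as "smart" strings that know the element they belong to and
# whether they are that element's .text or its .tail.
chunks = tree.xpath("//text()")
for chunk in chunks:
    role = "tail" if chunk.is_tail else "text"
    print(chunk.getparent().tag, role, repr(str(chunk)))
```

For a tail chunk, getparent() returns the element the text trails (here fref), not the enclosing section, so the surrounding text can be processed together with the inline element it borders.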
Tags: text, elementtree, python Source: https://codeday.me/bug/20191101/1983665.html