Printing HTML entities with lxml in Python

Author: Internet

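I am trying to create a div element from the string below, which contains HTML entities. Because of that, the reserved character & in each entity is escaped as &amp; in the output, so the entities show up as plain text. How can I avoid this so that the HTML entities are rendered correctly?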

import lxml.html
from lxml import etree

s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'

div = etree.Element("div")
div.text = s

lxml.html.tostring(div)

output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>

Solution:

Parse the string with fromstring(), which decodes the entities, and specify the encoding when calling tostring():

>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print(tostring(div, encoding='unicode'))
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>
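
If the div has to be built by hand, as in the question, one possible alternative is to decode the entities before assigning the text. The sketch below is not part of the original answer; it assumes Python 3 and uses the standard library's html.unescape() to turn the character references into real characters, so that lxml has no literal & left to escape. The variable name result is only for illustration.

```python
import html

import lxml.html
from lxml import etree

s = ('Actress Adamari L&#243;pez And Amgen Launch Spanish-Language '
     'Chemotherapy: Myths Or Facts&#8482; Website And Resources')

div = etree.Element("div")
# Decode the character references first, so .text holds the real
# characters instead of the literal "&#243;" and "&#8482;" sequences.
div.text = html.unescape(s)

# encoding='unicode' returns a str and keeps non-ASCII characters as-is.
result = lxml.html.tostring(div, encoding='unicode')
print(result)
```

This keeps the div tag from the question, whereas fromstring() wraps a bare text string in a p element.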

As a side note, when dealing with HTML data you should definitely use lxml.html.tostring():

Note that you should use lxml.html.tostring and not lxml.tostring.
lxml.tostring(doc) will return the XML representation of the document,
which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.
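
The difference described in the quote can be checked with a small sketch. It assumes that the XML serializer the quote calls lxml.tostring corresponds to lxml.etree.tostring, and the file name app.js is made up for illustration.

```python
import lxml.html
from lxml import etree

# An empty script element; "app.js" is a placeholder file name.
script = etree.Element("script", src="app.js")

# XML serialization collapses the empty element into a self-closing tag.
as_xml = etree.tostring(script)

# HTML serialization keeps the explicit closing tag that browsers expect.
as_html = lxml.html.tostring(script)

print(as_xml)
print(as_html)
```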

See also:

> Serialising to Unicode strings

Tags: python, html-parsing, lxml, html, lxml-html
Source: https://codeday.me/bug/20190628/1319352.html