Outputting Unicode text to an RTF file in Python

2019-06-30 07:56:18 Author: Internet

I am trying to output Unicode text from a Python script to an RTF file. For background, Wikipedia says:

For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not have Unicode support should render it as a question mark instead.

See also this question on outputting RTF from Java and this one on doing so in C#.

However, I cannot figure out how to output a Unicode code point from Python as a "16-bit signed decimal integer giving the Unicode UTF-16 code unit number". I tried this:

for char in unicode_string:
    print '\\' + 'u' + ord(char) + '?',

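But when the file is opened in a word processor, the output renders as gibberish. The problem seems to be that these are not UTF-16 codes, and I do not know how to get those: I can encode the string as UTF-16, but how do I get the code numbers?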
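One way to answer the "how do I get the code numbers" part is to encode the text as UTF-16 and unpack the bytes as signed 16-bit integers. The sketch below is only an illustration: the helper name `utf16_signed_units` is made up for this example and does not appear in the thread.

```python
import struct

def utf16_signed_units(text):
    """Return the UTF-16 code units of `text` as signed 16-bit integers.

    Hypothetical helper for illustration; not part of any RTF library.
    """
    # 'utf-16-le' gives little-endian code units with no byte order mark.
    data = text.encode('utf-16-le')
    count = len(data) // 2
    # '<' = little-endian, 'h' = signed 16-bit integer, repeated `count` times.
    return list(struct.unpack('<%dh' % count, data))

# The Arabic letter from the Wikipedia quote (U+0628).
print(utf16_signed_units(u'\u0628'))  # [1576]
```

Code units above 32767 come out negative (for example U+FFFF gives -1), which is the "signed" part of the RTF rule, and a character outside the Basic Multilingual Plane yields two values, one per surrogate.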

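Incidentally, PyRTF does not support Unicode (it is listed as a "todo"). pyrtf-NG is supposed to, but that project does not appear to be maintained and has little documentation, so I am wary of using it in a production setting.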

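Edit: my mistake. There were two bugs in the code above. As Wobble pointed out, the string must be a unicode string rather than an already-encoded byte string, and the code above produces output with spaces between the characters. The correct code is: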

convertstring=""
for char in unicode(<my_encoded_string>,'utf-8'):
    convertstring = convertstring + '\\' + 'u' + str(ord(char)) + '?'

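This works well, at least with OpenOffice. I am leaving it here as a reference for others (one more error was corrected following the discussion below).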

Solution:

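Based on the information in your latest edit, I think this function will work correctly, although the improved version further down is preferable.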

def rtf_encode(unistr):
    return ''.join([c if ord(c) < 128 else u'\\u' + unicode(ord(c)) + u'?' for c in unistr])

>>> test_unicode = u'\xa92012'
>>> print test_unicode
©2012
>>> test_utf8 = test_unicode.encode('utf-8')
>>> print test_utf8
©2012
>>> print rtf_encode(test_utf8.decode('utf-8'))
\u169?2012

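Here is another version, broken up a bit so it is easier to understand. It is also consistent about returning an ASCII string, rather than keeping Unicode and letting the join sort it out. It incorporates a fix based on the comments as well: code points above 32767 are written as negative numbers, because RTF expects a signed 16-bit value.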

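def rtf_encode_char(unichar):
    code = ord(unichar)
    if code < 128:
        # Plain ASCII passes through unchanged.
        return str(unichar)
    # RTF expects a signed 16-bit decimal, so values above 32767 wrap to negative.
    return '\\u' + str(code if code <= 32767 else code-65536) + '?'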

def rtf_encode(unistr):
    return ''.join(rtf_encode_char(c) for c in unistr)
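The functions above are written for Python 2. As a rough sketch of the same idea in Python 3, the version below emits one escape per UTF-16 code unit, so characters beyond U+FFFF become two escapes (a surrogate pair), and it also escapes the backslash and braces that RTF treats as syntax. Both additions, the `build_rtf_document` helper and the minimal header it writes are assumptions made for illustration and are not part of the original answer.

```python
import struct

def rtf_encode(text):
    """Escape a Python 3 str for use in the body of an RTF document."""
    pieces = []
    for ch in text:
        if ch in '\\{}':
            # Backslash and braces are RTF syntax, so prefix them with a backslash.
            pieces.append('\\' + ch)
        elif ord(ch) < 128:
            # Other ASCII characters pass through unchanged.
            pieces.append(ch)
        else:
            # One \uN? escape per UTF-16 code unit, each read as a signed
            # 16-bit integer; non-BMP characters produce two escapes.
            data = ch.encode('utf-16-le')
            for (unit,) in struct.iter_unpack('<h', data):
                pieces.append('\\u%d?' % unit)
    return ''.join(pieces)

def build_rtf_document(text):
    """Wrap escaped text in a minimal RTF header (hypothetical helper)."""
    header = '{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0 '
    return header + rtf_encode(text) + '}'

print(rtf_encode('\u00a92012'))  # \u169?2012
```

The result is pure ASCII, so it can be written with `open(path, 'w', encoding='ascii')`. Line breaks and lone surrogates are not handled in this sketch.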

Tags: python, rtf
Source: https://codeday.me/bug/20190630/1334403.html