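Python UTF-16 CSV reader
Author: Internet
I have a UTF-16 CSV file that I have to read. The Python csv module does not seem to support UTF-16.
I am using Python 2.7.2. The CSV files I need to parse are huge, running into several GB of data.
Answers to John Machin's questions are below.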
print repr(open('test.csv', 'rb').read(100))
Output for a test.csv containing only abc:
'\xff\xfea\x00b\x00c\x00'
I think the CSV file was created on a Windows machine in the USA. I am using Mac OS X Lion.
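The leading '\xff\xfe' in that output is the UTF-16 little-endian byte order mark (BOM). A minimal sketch that checks for it, reusing the bytes from the output above (the variable names are only illustrative):

```python
import codecs

# Bytes copied from the repr() output above: BOM followed by "abc" in UTF-16-LE.
sample = b'\xff\xfea\x00b\x00c\x00'

# codecs.BOM_UTF16_LE is the two-byte little-endian BOM, b'\xff\xfe'.
has_le_bom = sample.startswith(codecs.BOM_UTF16_LE)

# The generic 'utf-16' codec consumes the BOM and uses it to pick the byte order.
text = sample.decode('utf-16')
```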
I tried the code provided by phihag with a test.csv containing a single record.
The content of the sample test.csv, shown as the output of print repr(open('test.csv', 'rb').read(1000)):
'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
Code by phihag:
import codecs
import csv

with open('test.csv','rb') as f:
    sr = codecs.StreamRecoder(f,
        codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
        codecs.getreader('utf-16'), codecs.getwriter('utf-16'))
    for row in csv.reader(sr):
        print row
Output of the above code:
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']
The expected output is:
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
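The break right after '\x85' is consistent with Python treating U+0085 (NEL) as a line boundary when it splits Unicode text into lines, which is what the codecs stream readers do internally. The following is a small sketch rather than a diagnosis of the exact code path. It uses the record from the question and Python 3 syntax, and the variable names are illustrative:

```python
import csv

# The single record from the question, as decoded text.
line = u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'

# str.splitlines() treats U+0085 as a line boundary, so the record is cut in two.
unicode_pieces = line.splitlines(True)
rows_split = list(csv.reader(unicode_pieces))

# Feeding the record as one line (terminated only by CRLF) keeps it intact.
rows_whole = list(csv.reader([line]))
```

Here rows_split reproduces the two-row output shown in the question, while rows_whole has the single seven-field row that was expected.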
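Solution:
At the moment, the csv module does not support UTF-16.
In Python 3.x, csv expects a text-mode file, and you can simply use the encoding
parameter of open to force another encoding: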
# Python 3.x only
import csv

# newline='' lets the csv module handle line endings itself
with open('utf16.csv', 'r', encoding='utf-16', newline='') as csvf:
    for line in csv.reader(csvf):
        print(line)  # do something with the line
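As a quick check of the Python 3 approach, here is a minimal sketch that writes the record from the question to a temporary UTF-16 file and reads it back. The temporary file and variable names are assumptions made for illustration only:

```python
import csv
import os
import tempfile

# The record from the question; encode('utf-16') prepends a BOM.
record = u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'

fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as out:
    out.write(record.encode('utf-16'))

# Text mode with an explicit encoding; newline='' leaves line endings to csv.
with open(path, 'r', encoding='utf-16', newline='') as csvf:
    rows = list(csv.reader(csvf))

os.remove(path)
```

Text-mode files only break lines on '\n', '\r' and '\r\n', so the '\x85' character stays inside its field and the record comes back as one row.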
In Python 2.x, you can recode the input:
# Python 2.x only
import codecs
import csv

class Recoder(object):
    """Re-encode a byte stream line by line, splitting only on eol."""
    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = u''  # decoded text not yet returned to the caller
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        # size=None means "read everything"; Python 2 file objects expect an integer
        r = self._stream.read(-1 if size is None else size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            # Split only on the configured eol, not on every Unicode line break
            line, eol, rest = self._buf.partition(self._eol)
            if eol == self._eol:
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                # Flush the decoder; this raises if the input ends mid-character
                self._buf += self._decoder.decode(b'', True)
                self._reachedEof = True
                if not self._buf:
                    raise StopIteration()  # avoid yielding a spurious empty line
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)

    next = __next__  # Python 2 iterator protocol

    def close(self):
        return self._stream.close()

with open('test.csv','rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')
    for row in csv.reader(sr):
        print (row)
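The same idea (decode incrementally and split only on '\r\n') can also be written as a generator for Python 3, where csv.reader accepts any iterable of text lines. This is a sketch under assumptions, not part of the original answer: iter_decoded_lines and its chunk_size parameter are hypothetical names, and the file is assumed to use CRLF record terminators.

```python
import codecs
import csv
import io

def iter_decoded_lines(stream, encoding='utf-16', eol=u'\r\n', chunk_size=64 * 1024):
    """Yield decoded lines from a binary stream, splitting only on eol.

    Hypothetical helper: reads fixed-size chunks so that large files
    never have to be held in memory all at once.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    buf = u''
    while True:
        raw = stream.read(chunk_size)
        # final=True on the last (empty) read flushes the decoder
        buf += decoder.decode(raw, final=not raw)
        while True:
            line, sep, rest = buf.partition(eol)
            if not sep:
                break  # no complete line in the buffer yet
            yield line + sep
            buf = rest
        if not raw:
            break
    if buf:
        yield buf  # trailing data without a final eol

# Usage with the record from the question, held in memory for the demo.
data = u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'.encode('utf-16')
rows = list(csv.reader(iter_decoded_lines(io.BytesIO(data), chunk_size=7)))
```

The deliberately small chunk_size in the usage lines only exercises the buffering across chunk boundaries; for real files the default is a more sensible starting point.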
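open and codecs.open require the file to start with a BOM. If it does not (or if you are on Python 2.x), you can still convert it in memory as shown here. Note that this loads the whole file into memory, so it does not suit the multi-GB files mentioned in the question: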
try:
    from io import BytesIO
except ImportError:  # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv

with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')
for line in csv.reader(BytesIO(c)):
    print(line)  # do something with the line
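For a file without a BOM, another option on Python 3 is to name the byte order explicitly in the encoding. The sketch below assumes the data is little-endian (typical for files produced on Windows); the temporary file and its contents are made up for illustration:

```python
import csv
import io
import os
import tempfile

# Write a small UTF-16 little-endian file WITHOUT a BOM (assumed byte order).
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as out:
    out.write(u'a,b\r\nc,d\r\n'.encode('utf-16-le'))

# 'utf-16-le' states the byte order directly, so no BOM is needed.
with io.open(path, 'r', encoding='utf-16-le', newline='') as csvf:
    rows = list(csv.reader(csvf))

os.remove(path)
```

If the file turned out to be big-endian, 'utf-16-be' would be the encoding to use instead.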
Tags: python, utf-16, csv Source: https://codeday.me/bug/20190917/1810260.html