How to open and process a CSV file stored in Google Cloud Storage using Python
I am using the Google Cloud Storage client library.
I am trying to open and process a CSV file (already uploaded to a bucket) with the following code:
    import csv
    import cloudstorage as gcs  # App Engine GCS client library

    filename = '/<my_bucket>/data.csv'
    with gcs.open(filename, 'r') as gcs_file:
        csv_reader = csv.reader(gcs_file, delimiter=',', quotechar='"')
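In response to the first argument of csv.reader (i.e. gcs_file), I get the error "argument 1 must be an iterator". Apparently gcs_file does not support the iterator's .next method.
Any ideas on how to proceed? Do I need to wrap gcs_file and create an iterator on top of it, or is there an easier way?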
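Solution:
I think it is better to write your own wrapper/iterator for csv.reader. Even if gcs_file supported the iterator protocol, it is not clear what next() should return so that it always suits its consumer.
According to the csv reader documentation,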
Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable. If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
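Note that csv.reader treats each string it receives as one line, so the wrapper should hand back whole lines even if it reads the underlying file in raw chunks. You could have a wrapper like this (untested):

    class CsvIterator(object):
        def __init__(self, gcs_file, chunk_size):
            self.gcs_file = gcs_file
            self.chunk_size = chunk_size
            self.buffer = ''  # data read but not yet returned

        def __iter__(self):
            return self

        def next(self):
            # Read chunks until the buffer holds a full line or the file ends.
            while '\n' not in self.buffer:
                chunk = self.gcs_file.read(size=self.chunk_size)
                if not chunk:
                    break
                self.buffer += chunk
            if not self.buffer:
                raise StopIteration()
            line, sep, self.buffer = self.buffer.partition('\n')
            return line + sep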
The key is to read one chunk at a time, so that with a large file you neither blow up memory nor run into urlfetch timeouts.
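As a rough illustration of reading one chunk at a time, here is a minimal sketch written as a generator. It assumes Python 3 and uses io.StringIO as a stand-in for the GCS file object. The names iter_lines and fake_gcs_file are illustrative only and not part of any library. It buffers the chunks and yields whole lines, because csv.reader treats each string it gets as one line.

```python
import csv
import io


def iter_lines(fileobj, chunk_size):
    """Yield complete lines from fileobj, reading chunk_size characters at a time."""
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Hand out every complete line currently in the buffer.
        while "\n" in buffer:
            line, _, buffer = buffer.partition("\n")
            yield line + "\n"
    if buffer:
        # Last line of a file that does not end with a newline.
        yield buffer


# io.StringIO is a hypothetical stand-in for the object returned by gcs.open().
fake_gcs_file = io.StringIO('name,comment\nalice,"hi, there"\nbob,ok')
rows = list(csv.reader(iter_lines(fake_gcs_file, chunk_size=8)))
print(rows)
```

Only one chunk plus any unfinished line is held in memory at a time, so the chunk size can probably be tuned to the file size and the request timeout.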
Or even simpler, use the built-in iter:

    csv.reader(iter(gcs_file.readline, ''))
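To see how the two-argument form of iter behaves, here is a small sketch, assuming Python 3 and using io.StringIO in place of gcs_file (fake_gcs_file is a made-up name for illustration). On a file object that returns bytes, the sentinel would presumably need to be b'' instead, and the lines would need decoding before csv.reader can parse them.

```python
import csv
import io

# io.StringIO is a hypothetical stand-in for the gcs_file object.
fake_gcs_file = io.StringIO("id,city\n1,Paris\n2,Lima\n")

# iter(callable, sentinel) calls readline() repeatedly and stops as soon
# as it returns the sentinel '' (what readline gives back at end of file).
line_iter = iter(fake_gcs_file.readline, "")
rows = list(csv.reader(line_iter))
print(rows)
```

This works because csv.reader only needs an iterator that yields one line per step. It does not need the file object itself to be iterable.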
Tags: google-cloud-storage, google-app-engine, python  Source: https://codeday.me/bug/20191030/1969281.html