Memory usage when reading a large file line by line in Python 2.7
I am working on a genomics project involving some large files (10-50 Gb) which I want to read into Python 2.7 for processing. I do not need to read the entire file into memory; rather, I simply want to read each file line by line, do a small task, and move on.
I found similar SO questions and tried to implement a couple of the solutions:
Efficient reading of 800 GB XML file in Python 2.7
How to read large file, line by line in python
When I run the following scripts on a 17 Gb file:
Script 1 (itertools):
#!/usr/bin/env python2
import sys
import string
import os
import itertools

if __name__ == "__main__":
    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in iter(f):
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
Script 2 (fileinput):
#!/usr/bin/env python2
import sys
import string
import os
import fileinput

if __name__ == "__main__":
    #Read in PosList
    posList=[]
    for line in fileinput.input(['BigFile']):
        posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
Script 3 (line by line):
#!/usr/bin/env python2
import sys
import string
import os

if __name__ == "__main__":
    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in f:
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
Script 4 (yield):
#!/usr/bin/env python2
import sys
import string
import os

def readInChunks(fileObj, chunkSize=30):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

if __name__ == "__main__":
    #Read in PosList
    posList=[]
    f = open('BigFile')
    for chunk in readInChunks(f):
        posList.append(chunk.strip())
    f.close()
    sys.stdout.write(str(sys.getsizeof(posList)))
For the 17 Gb file, the size of the final list in Python is ~5 Gb [from sys.getsizeof()], but according to 'top' each script uses more than 43 Gb of memory.
My question is: why does the memory usage climb so much higher than either the input file or the final list? If the final list is only 5 Gb, and the 17 Gb file is being read line by line, why does each script's memory usage reach ~43 Gb? Is there a better way to read large files without memory leaks (if that is what these are)?
Many thanks.
EDIT:
Output from '/usr/bin/time -v python script3.py':
Command being timed: "python script3.py"
User time (seconds): 159.65
System time (seconds): 21.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181246448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 10182731
Voluntary context switches: 315
Involuntary context switches: 16722
Swaps: 0
File system inputs: 33831512
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Output from top:
15816 user 20 0 727m 609m 2032 R 76.8 0.5 0:02.31 python
15816 user 20 0 1541m 1.4g 2032 R 99.6 1.1 0:05.31 python
15816 user 20 0 2362m 2.2g 2032 R 99.6 1.7 0:08.31 python
15816 user 20 0 3194m 3.0g 2032 R 99.6 2.4 0:11.31 python
15816 user 20 0 4014m 3.8g 2032 R 99.6 3 0:14.31 python
15816 user 20 0 4795m 4.6g 2032 R 99.6 3.6 0:17.31 python
15816 user 20 0 5653m 5.3g 2032 R 99.6 4.2 0:20.31 python
15816 user 20 0 6457m 6.1g 2032 R 99.3 4.9 0:23.30 python
15816 user 20 0 7260m 6.9g 2032 R 99.6 5.5 0:26.30 python
15816 user 20 0 8085m 7.7g 2032 R 99.9 6.1 0:29.31 python
15816 user 20 0 8809m 8.5g 2032 R 99.6 6.7 0:32.31 python
15816 user 20 0 9645m 9.3g 2032 R 99.3 7.4 0:35.30 python
15816 user 20 0 10.3g 10g 2032 R 99.6 8 0:38.30 python
15816 user 20 0 11.1g 10g 2032 R 100 8.6 0:41.31 python
15816 user 20 0 11.8g 11g 2032 R 99.9 9.2 0:44.32 python
15816 user 20 0 12.7g 12g 2032 R 99.3 9.9 0:47.31 python
15816 user 20 0 13.4g 13g 2032 R 99.6 10.5 0:50.31 python
15816 user 20 0 14.3g 14g 2032 R 99.9 11.1 0:53.32 python
15816 user 20 0 15.0g 14g 2032 R 99.3 11.7 0:56.31 python
15816 user 20 0 15.9g 15g 2032 R 99.9 12.4 0:59.32 python
15816 user 20 0 16.6g 16g 2032 R 99.6 13 1:02.32 python
15816 user 20 0 17.3g 17g 2032 R 99.6 13.6 1:05.32 python
15816 user 20 0 18.2g 17g 2032 R 99.9 14.2 1:08.33 python
15816 user 20 0 18.9g 18g 2032 R 99.6 14.9 1:11.33 python
15816 user 20 0 19.9g 19g 2032 R 100 15.5 1:14.34 python
15816 user 20 0 20.6g 20g 2032 R 99.3 16.1 1:17.33 python
15816 user 20 0 21.3g 21g 2032 R 99.6 16.7 1:20.33 python
15816 user 20 0 22.3g 21g 2032 R 99.9 17.4 1:23.34 python
15816 user 20 0 23.0g 22g 2032 R 99.6 18 1:26.34 python
15816 user 20 0 23.7g 23g 2032 R 99.6 18.6 1:29.34 python
15816 user 20 0 24.4g 24g 2032 R 99.6 19.2 1:32.34 python
15816 user 20 0 25.4g 25g 2032 R 99.3 19.9 1:35.33 python
15816 user 20 0 26.1g 25g 2032 R 99.9 20.5 1:38.34 python
15816 user 20 0 26.8g 26g 2032 R 99.9 21.1 1:41.35 python
15816 user 20 0 27.4g 27g 2032 R 99.6 21.7 1:44.35 python
15816 user 20 0 28.5g 28g 2032 R 99.6 22.3 1:47.35 python
15816 user 20 0 29.2g 28g 2032 R 99.9 22.9 1:50.36 python
15816 user 20 0 29.9g 29g 2032 R 99.6 23.5 1:53.36 python
15816 user 20 0 30.5g 30g 2032 R 99.6 24.1 1:56.36 python
15816 user 20 0 31.6g 31g 2032 R 99.6 24.7 1:59.36 python
15816 user 20 0 32.3g 31g 2032 R 100 25.3 2:02.37 python
15816 user 20 0 33.0g 32g 2032 R 99.6 25.9 2:05.37 python
15816 user 20 0 33.7g 33g 2032 R 99.6 26.5 2:08.37 python
15816 user 20 0 34.3g 34g 2032 R 99.6 27.1 2:11.37 python
15816 user 20 0 35.5g 34g 2032 R 99.6 27.7 2:14.37 python
15816 user 20 0 36.2g 35g 2032 R 99.6 28.4 2:17.37 python
15816 user 20 0 36.9g 36g 2032 R 100 29 2:20.38 python
15816 user 20 0 37.5g 37g 2032 R 99.6 29.6 2:23.38 python
15816 user 20 0 38.2g 38g 2032 R 99.6 30.2 2:26.38 python
15816 user 20 0 38.9g 38g 2032 R 99.6 30.8 2:29.38 python
15816 user 20 0 40.1g 39g 2032 R 100 31.4 2:32.39 python
15816 user 20 0 40.8g 40g 2032 R 99.6 32 2:35.39 python
15816 user 20 0 41.5g 41g 2032 R 99.6 32.6 2:38.39 python
15816 user 20 0 42.2g 41g 2032 R 99.9 33.2 2:41.40 python
15816 user 20 0 42.8g 42g 2032 R 99.6 33.8 2:44.40 python
15816 user 20 0 43.4g 43g 2032 R 99.6 34.3 2:47.40 python
15816 user 20 0 43.4g 43g 2032 R 100 34.3 2:50.41 python
15816 user 20 0 38.6g 38g 2032 R 100 30.5 2:53.43 python
15816 user 20 0 24.9g 24g 2032 R 99.7 19.6 2:56.43 python
15816 user 20 0 12.0g 11g 2032 R 100 9.4 2:59.44 python
EDIT 2:
To clarify further, here is an expansion of the problem. What I am doing is reading in a list of positions from a FASTA file (Contig1/1, Contig1/2, etc.). That list is converted into a dictionary filled with N's via:
keys = posList
values = ['N'] * len(posList)
speciesDict = dict(zip(keys, values))
Then I read a pileup file for each of several species, again line by line (where the same problem arises), and get the final base call via:
with open (path+'/'+os.path.basename(path)+'.pileups',"r") as filein:
    for line in iter(filein):
        splitline=line.split()
        if len(splitline)>4:
            node,pos,ref,num,bases,qual=line.split()
            loc=node+'/'+pos
            cleanBases=getCleanList(ref,bases)
            finalBase=getFinalBase_Pruned(cleanBases,minread,thresh)
            speciesDict[loc] = finalBase
Because the species-specific pileup files differ in length and order, I create the position list as a "common garden" way to store each individual species' data. If no data is available for a given site in a species, that site gets an 'N' call; otherwise the site is assigned a base in the dictionary.
The end result is one file per species that is ordered and complete, from which I can do downstream analyses.
Because line-by-line reading is using so much memory, reading two large files overloads my resources, even though the final data structures are much smaller than the memory I expected to need (the growing lists should only increase by one line's worth of data at a time).
Solution:
sys.getsizeof(posList) is not giving you what I think you think it is: it tells you the size of the list object that holds the lines; this does not include the size of the lines themselves. Here is some output from reading a roughly 3.5 Gb file into a list on my system:
In [2]: lines = []
In [3]: with open('bigfile') as inf:
   ...:     for line in inf:
   ...:         lines.append(line)
   ...:
In [4]: len(lines)
Out[4]: 68318734
In [5]: sys.getsizeof(lines)
Out[5]: 603811872
In [6]: sum(len(l) for l in lines)
Out[6]: 3473926127
In [7]: sum(sys.getsizeof(l) for l in lines)
Out[7]: 6001719285
That is more than six billion bytes there; in top, my interpreter was using about 7.5 Gb at that point.
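The same measurement, packaged as a small helper (a sketch, not part of the original answer; the name total_list_size is made up here):

import sys

def total_list_size(lines):
    # Size of the list object itself (header plus its array of pointers)...
    container = sys.getsizeof(lines)
    # ...plus the size of every string object the list references.
    contents = sum(sys.getsizeof(s) for s in lines)
    return container, contents

# Usage (hypothetical): container, contents = total_list_size(posList)

The first number is what the question's scripts were printing (~5 Gb); the second is what is actually filling RAM.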
Strings carry a fair amount of overhead: about 37 bytes each, it looks like:
In [2]: sys.getsizeof('0'*10)
Out[2]: 47
In [3]: sys.getsizeof('0'*100)
Out[3]: 137
In [4]: sys.getsizeof('0'*1000)
Out[4]: 1037
So if your lines are relatively short, a large part of the memory usage will be overhead.
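A rough back-of-the-envelope check, using only the numbers given in the question (the ~37-byte string overhead and 8-byte list slots are the usual values for a 64-bit CPython 2.7 build, assumed here), suggests that this overhead alone accounts for most of the gap between the 17 Gb file and the ~43 Gb reported by top:

GB = 1024 ** 3

list_size = 5 * GB            # sys.getsizeof(posList) reported in the question
file_size = 17 * GB           # size of the input file
ptr_size = 8                  # bytes per list slot on a 64-bit build
str_overhead = 37             # approximate per-string overhead in 64-bit CPython 2.7

num_lines = list_size // ptr_size               # ~670 million lines
avg_line = file_size // num_lines               # ~27 bytes per line
strings = num_lines * (avg_line + str_overhead) # ~40 Gb of string objects
print (strings + list_size) / float(GB)         # ~45 Gb, close to the ~43 Gb seen in top

In other words, keeping every stripped line alive as its own str object roughly triples the footprint of a file made of short lines; processing each line as it is read and discarding it (the "do a small task, and move on" approach described at the top of the question) avoids paying that cost at all.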
Tags: python, io, memory, python-2-7, bioinformatics