How to edit a text (.fastq) file in Python
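I have a file like the small example below. Every 4 lines belong to one ID. The second line of each ID starts with N. I want to remove the N at the start of those lines and leave everything else unchanged.
I'd like to do this in Python. Do you know how?
Example: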
@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
NGCGACCTCAGATCAGACGTGGCGACC
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
#<<ABGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
NGCCGACATCGAAGGATCAA
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
#<<ABFGGGGGGGGGGGGGG
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
NACAAACCCTTGTGTCGAGGGC
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
#=ABBGGGGGGGGGGGGGGGGG
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
NGGGACATGACAGCCTGGACCATCG
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
#=ABBGGGGGGGGGGGGGGGGGGGG
Expected output:
@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
GCGACCTCAGATCAGACGTGGCGACC
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
#<<ABGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
GCCGACATCGAAGGATCAA
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
#<<ABFGGGGGGGGGGGGGG
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
ACAAACCCTTGTGTCGAGGGC
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
#=ABBGGGGGGGGGGGGGGGGG
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
GGGACATGACAGCCTGGACCATCG
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
#=ABBGGGGGGGGGGGGGGGGGGGG
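Answer:
If I did exactly what you ask (remove the leading N from every sequence), the FASTQ file would end up in an inconsistent state.
The fourth line of every record in a FASTQ file holds the quality values for the sequence two lines above it, one character per base. So if you remove the first character from the sequence, you also need to remove the first character from the line with the quality values.
You can do something very simple in pure Python, for example: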
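To make the inconsistency concrete, here is a minimal sketch that compares sequence and quality lengths record by record. The function name `check_fastq_lengths` and the short made-up record are assumptions for illustration only; it also assumes strict four-line records with no wrapped sequences.

```python
def check_fastq_lengths(lines):
    """Return the IDs of records whose sequence and quality lengths differ.

    `lines` is a list of FASTQ lines without trailing newlines,
    four lines per record: header, sequence, '+', quality.
    """
    bad = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        if len(seq) != len(qual):
            bad.append(header.split()[0])
    return bad


# A short made-up record (hypothetical, not from the file above)
record = ["@read1 example", "NACGT", "+", "#IIII"]
print(check_fastq_lengths(record))

# Strip only the leading N from the sequence, as the question asks
trimmed = [record[0], record[1][1:], record[2], record[3]]
print(check_fastq_lengths(trimmed))
```

The second call reports the record, because its sequence is now one character shorter than its quality string.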
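with open("example.fastq") as f:
    for idx, line in enumerate(f.read().splitlines()):
        # Odd indices (1, 3, 5, ...) are the sequence and quality lines
        if idx % 2:
            print(line[1:])  # drop the first character
        else:
            print(line)  # header and '+' lines are printed unchanged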
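The snippet above drops the first character of every sequence and quality line unconditionally. If some reads might not start with N, a slightly more guarded variant could look like the sketch below. The function name `trim_leading_n` is hypothetical, and the sketch assumes strict four-line records and that reads without a leading N should be left untouched.

```python
def trim_leading_n(lines):
    """Yield FASTQ lines, dropping a leading N and its quality character.

    Assumes four lines per record: header, sequence, '+', quality.
    Records whose sequence does not start with N pass through unchanged.
    """
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if seq.startswith("N"):
            # Keep sequence and quality the same length
            seq, qual = seq[1:], qual[1:]
        yield from (header, seq, plus, qual)


# Made-up two-record input (hypothetical data)
example = [
    "@read1", "NACGT", "+", "#IIII",
    "@read2", "ACGT", "+", "IIII",
]
for line in trim_leading_n(example):
    print(line)
```

To apply it to a file, read the lines with `f.read().splitlines()` as above and write each yielded line back out followed by a newline.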
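However, if you work with biological data regularly, you really should start using a bioinformatics module such as BioPython. It will warn you if what you are trying to do would leave the file malformed or does not make sense.
The solution looks like this: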
from Bio import SeqIO
from Bio import Seq

new_records = []
for record in SeqIO.parse("example.fastq", "fastq"):
    sequence = str(record.seq)
    letter_annotations = record.letter_annotations
    # You first need to empty the existing letter annotations
    record.letter_annotations = {}
    new_sequence = sequence[1:]
    record.seq = Seq.Seq(new_sequence)
    new_letter_annotations = {'phred_quality': letter_annotations['phred_quality'][1:]}
    record.letter_annotations = new_letter_annotations
    new_records.append(record)

with open('without_starting_N.fastq', 'w') as output_handle:
    SeqIO.write(new_records, output_handle, "fastq")
which outputs
@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
GCGACCTCAGATCAGACGTGGCGACC
+
<<ABGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
GCCGACATCGAAGGATCAA
+
<<ABFGGGGGGGGGGGGGG
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
ACAAACCCTTGTGTCGAGGGC
+
=ABBGGGGGGGGGGGGGGGGG
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
GGGACATGACAGCCTGGACCATCG
+
=ABBGGGGGGGGGGGGGGGGGGGG
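(The '+' character on the third line of each record may optionally be followed by the same sequence identifier and description as the line two above it.)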
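If a downstream tool expects the identifier to be repeated after the '+', as in the input file from the question, it can be copied back from the header. The following is only a sketch: the function name `repeat_header_on_plus` is hypothetical, and it again assumes strict four-line records.

```python
def repeat_header_on_plus(lines):
    """Return FASTQ lines with the header text repeated after each '+'.

    Assumes four lines per record: header, sequence, '+', quality.
    """
    out = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # header starts with '@'; the '+' line carries the same text after '+'
        out.extend([header, seq, "+" + header[1:], qual])
    return out


# Made-up record (hypothetical data)
print(repeat_header_on_plus(["@read1 length=4", "ACGT", "+", "IIII"]))
```

Most tools accept a bare '+', so this step is usually unnecessary.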
Tags: bioinformatics, python. Source: https://codeday.me/bug/20191026/1933510.html