python – 定义计算氨基酸相对频率的函数
作者:互联网
我正在尝试计算给定DNA序列内的密码子频率.
例如:
sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']
密码子中的XX:
frequency = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))
请注意,XX2和XX3并不总是在序列中.一些密码子可能有也可能没有多个密码子.
实施例:赖氨酸具有2个密码子,AAA和AAG
所以的频率
AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))
我怎样才能为列表中的每个密码子做这个?我如何解释多个密码子?
解决方法:
使用defaultdict
from collections import defaultdict
mydict = defaultdict(int)
for aa in mysecuence:
mydict[aa] +=1
这适用于氨基酸(蛋白质).
对于密码子,您应该以3个位置步骤迭代序列以获取defaultdict的键.例如:
>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a
['GAU', 'CAC', 'TUG', 'CCA']
编辑:如果你想计算变性,你应该准备一个字典,将每个密码子(密钥)与其简并密码子(值,密码子列表)联系起来.要计算经常性,
从defaultdict中,您可以获得每个密码子的计数,然后对于每个密码子,您计算从上面指出的密码子字典中读取的简并密码子的计数总和.然后你可以计算出频率.
编辑2:这里有一个真实的例子:
from collections import defaultdict
#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
"cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
"ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
"ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
"gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
"aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")
seq = rna.upper().replace('T', 'U')
#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
('UAA', 'UGA', 'UAG'))
#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
for codon in codons:
degen_dict[codon] = codons
#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]
#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
counts[codon] +=1
#actual calculation of frecuencies
data = {}
for codon in query_codons:
if codon in degen_dict:
totals = sum(counts[deg] for deg in degen_dict[codon])
frecuency = float(counts[codon]) / totals
else:
frecuency = 1.00
data[codon] = frecuency
#print results
for codon, frecuency in data.iteritems():
print "%s -> %.2f" %(codon, frecuency)
#produces:
GUC -> 0.57
AUA -> 1.00
ACG -> 0.50
AAC -> 1.00
CCU -> 0.25
UAU -> 1.00
..........
GCU -> 0.19
GAU -> 1.00
UAG -> 0.33
CUC -> 0.38
UUA -> 0.13
UGA -> 0.33
标签:python,bioinformatics 来源: https://codeday.me/bug/20190721/1496176.html