python - How to expand an ambiguous DNA sequence
Suppose you have a DNA sequence like this:
AATCRVTAA
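Here R and V are ambiguity codes for DNA nucleotides: R stands for A or G, and V stands for A, C, or G.
Is there a Biopython method that generates all the distinct sequences the ambiguous sequence above can represent?
For this example, the output would be: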
AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA
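The six sequences follow from the number of choices at each position: R contributes 2, V contributes 3, and every unambiguous base contributes 1, so 2 × 3 = 6. A minimal sketch of that count is below. The `IUPAC_DNA` table is written by hand for this illustration and `count_expansions` is a hypothetical helper name, not a Biopython function.

```python
from math import prod  # Python 3.8+

# Hand-written IUPAC ambiguity table (an assumption for this sketch)
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def count_expansions(seq):
    """Return how many unambiguous sequences an ambiguous sequence stands for."""
    # Multiply the number of possible bases at each position
    return prod(len(IUPAC_DNA[base]) for base in seq)

print(count_expansions("AATCRVTAA"))  # 2 * 3 = 6
```

The count grows exponentially with the number of ambiguous positions, so it is worth checking before expanding a long sequence.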
Solution:
Here is perhaps a shorter and faster way, since this function will be used on very large data anyway:
from itertools import product
from Bio.Data import IUPACData

def extend_ambiguous_dna(seq):
    """Return a list of all possible sequences given an ambiguous DNA input."""
    # Maps each IUPAC letter to its possible bases, e.g. "R" -> "AG"
    d = IUPACData.ambiguous_dna_values
    return list(map("".join, product(*map(d.get, seq))))
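In CPython, using map lets the loop run in C rather than in Python bytecode. This should be faster than a plain loop or even a list comprehension.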
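For very large inputs, building the whole list may not be practical, because the number of results multiplies with every ambiguous position. One option is to return the lazy iterator and consume it one sequence at a time. The sketch below assumes a hand-written `IUPAC_DNA` table so that it runs without Biopython, and `iter_ambiguous_dna` is a hypothetical name chosen for this example.

```python
from itertools import islice, product

# Hand-written IUPAC ambiguity table (an assumption for this sketch)
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def iter_ambiguous_dna(seq):
    """Return an iterator over all expansions without building a list."""
    # product() and map() are both lazy, so results are produced on demand
    return map("".join, product(*(IUPAC_DNA[base] for base in seq)))

# Small input: print every expansion
for expanded in iter_ambiguous_dna("AATCRVTAA"):
    print(expanded)

# Large input: take only the first few of 4**30 possible results
first_two = list(islice(iter_ambiguous_dna("N" * 30), 2))
```

Unlike `d.get`, indexing with `IUPAC_DNA[base]` raises a `KeyError` that names the offending letter when the input contains a character outside the table.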
Field testing
This uses a simple dict as d instead of the one returned by ambiguous_dna_values:
from itertools import product
import time

# Benchmark-only table: each letter maps to four options (not the IUPAC meaning of R)
d = { "N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"] }
seq = "RNRN"

# using list comprehensions
lst_start = time.time()
[ "".join(i) for i in product(*[ d[j] for j in seq ]) ]
lst_end = time.time()

# using map
map_start = time.time()
list(map("".join, product(*map(d.get, seq))))
map_end = time.time()

# convert elapsed seconds to milliseconds
lst_delay = (lst_end - lst_start) * 1000
map_delay = (map_end - map_start) * 1000

print("List delay: {} ms".format(round(lst_delay, 2)))
print("Map delay: {} ms".format(round(map_delay, 2)))
Output:
# len(seq) = 2:
List delay: 0.02 ms
Map delay: 0.01 ms
# len(seq) = 3:
List delay: 0.04 ms
Map delay: 0.02 ms
# len(seq) = 4:
List delay: 0.08 ms
Map delay: 0.06 ms
# len(seq) = 5:
List delay: 0.43 ms
Map delay: 0.17 ms
# len(seq) = 10:
List delay: 126.68 ms
Map delay: 77.15 ms
# len(seq) = 12:
List delay: 1887.53 ms
Map delay: 1320.49 ms
Clearly map is faster, but only by a factor of about 1.3 to 2.5 in these runs. It can surely be optimized further.
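Single time.time() measurements are noisy at sub-millisecond scales, so the small differences above are hard to interpret. A hedged sketch of the same comparison using the standard-library timeit module is shown below. The function names `with_comprehension` and `with_map` are made up for this example, and the dict is the same benchmark-only table rather than real IUPAC values.

```python
import timeit
from itertools import product

# Benchmark-only table: each letter maps to four options
d = {"N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"]}
seq = "RNRN" * 2  # 8 positions, 4**8 = 65536 results

def with_comprehension():
    return ["".join(i) for i in product(*[d[j] for j in seq])]

def with_map():
    return list(map("".join, product(*map(d.get, seq))))

# Both variants must produce identical results before timing them
assert with_comprehension() == with_map()

for fn in (with_comprehension, with_map):
    # Take the best of several repeats to reduce scheduling noise
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print("{}: {:.2f} ms per call".format(fn.__name__, best * 1000))
```

Timings depend on the machine and Python version, so the ratio between the two variants may differ from the figures reported above.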
Tags: python, dna-sequence, biopython. Source: https://codeday.me/bug/20190628/1319168.html