202208祖先序列重构
作者:互联网
首先理解清楚,基因发生了正选择,并不表示该基因就是发生了趋同进化(适应趋同)。那么什么才是发生了适应趋同呢?一般定义为满足两个条件:1、处于正选择;2、现生物种氨基酸位点同祖先相比存在改变(而且最好是正选择位点同改变位点时同一位点)。
现生物种氨基酸位点同祖先相比存在变化可分为两种情况:
1、所研究的现生目标物种在某位点处,氨基酸残基都相同,这种情况就被称为convergent site。又可进一步分为parallel substitution和convergent substitution 【Storz J F. Causes of molecular convergence and parallelism in protein evolution[J]. Nature Reviews Genetics, 2016, 17(4): 239-250. Hu Y, Wu Q I, Ma S, et al. Comparative genomics reveals convergent evolution between the bamboo-eating giant and red pandas[J]. Proceedings of the National Academy of Sciences, 2017, 114(5): 1081-1086.】
2、所研究的现生目标物种在某位点处,氨基酸残基不完全相同,该情况称为common change。
以上两种根据自己的研究情况,只要基因又与正选择基因重叠,就可以称该基因发生了趋同适应进化 【Foote A D, Liu Y, Thomas G W C, et al. Convergent evolution of the genomes of marine mammals[J]. Nature genetics, 2015, 47(3): 272-275.】。
#文件1、蛋白序列文件(cds序列也可以,只需要修改配置文件参数seqtype = 3)
#文件2、树文件
(base) vip10t01@bio10-desktop 2022-07-17 22:17:59 ~/project/01-oxy/gene_cds/reconstrctAnc $cat paml_tree.nwk ((Lcha:0.235889,Pann:0.271384):0.058453,(Locu:0.1986,(Sfor:0.182893,((Eele:0.166864,Drer:0.159921):0.0522582,(Olat:0.148378,((Marm:0.072489,Malb:0.0946715):0.0165605,(Carg:0.103992,(Atest:0.0589527,Bsple:0.0974119):0.0139159):0.0136644):0.0327347):0.104933):0.0345937):0.0507974):0.058453);
#文件3、配置文件
seqfile = OG0013976.fa.bla.uniq.fa.maf.id.2.cds.fas.best.fas-gb.re.phy.stop * sequence data filename (核苷酸文件) treefile = paml_tree.nwk * tree structure file name (不用标记树枝) outfile = mlc.ancestral.sequence.reconstruction * main result file name noisy = 9 * 0,1,2,3,9: how much rubbish on the screen verbose = 2 * 0: concise; 1: detailed, 2: too much runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise seqtype = 3 * 1:codons; 2:AAs; 3:codons-->AAs (根据自己的序列文件是氨基酸还是核酸进行选择) clock = 0 * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a aaRatefile = /home/data/vip10t01/biosoftware/paml4.9j/dat/jones.dat * only used for aa seqs with model=empirical(_F) (从安装PAML的路径中找) * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own model = 2 * models for codons: * 0:one, 1:b, 2:2 or more dN/dS ratios for branches * models for AAs or codon-translated AAs: * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189) icode = 0 * 0:universal code; 1:mammalian mt; 2-10:see below Mgene = 0 * codon: 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff * AA: 0:rates, 1:separate fix_alpha = 0 * 0: estimate gamma shape parameter; 1: fix it at alpha alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate) Malpha = 1 * different alphas for genes ncatG = 4 * # of categories in dG of NSsites models getSE = 0 * 0: don't want them, 1: want S.E.s of estimates RateAncestor = 1 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2) Small_Diff = .5e-6 cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)? method = 1 * Optimization method 0: simultaneous; 1: one branch a time * Genetic codes: 0:universal, 1:mammalian mt., 2:yeast mt., 3:mold mt., * 4: invertebrate mt., 5: ciliate nuclear, 6: echinoderm mt., * 7: euplotid mt., 8: alternative yeast nu. 9: ascidian mt., * 10: blepharisma nu. * These codes correspond to transl_table 1 to 11 of GENEBANK.
(base) vip10t01@bio10-desktop 2022-07-17 22:28:21 ~/project/01-oxy/gene_cds/reconstrctAnc $codeml baseml_aa1.ctl #运行
由于需要的输出文件为rst,而这个文件对于每一个orthologous名字都是一样的,为了区分,只有将每个orthologous分别存于不同文件夹下运行【已证实同一文件夹下运行,最终rst文件只会覆盖】
(base) vip10t01@bio10-desktop 2022-07-19 11:34:07 ~/project/01-oxy/gene_cds/5001 $for i in `ls *.ctl_re |awk -F"." '{print $1}'`; do echo mkdir $i; done >>z_mkdir.sh #为每个文件创建独立文件夹 (base) vip10t01@bio10-desktop 2022-07-19 11:33:46 ~/project/01-oxy/gene_cds/reconstrctAnc $mv ../5001/z_mkdir.sh . $cat z_mkdir.sh |wc -l #看一下数量 9868 $bash z_mkdir.sh #创建文件夹 (base) vip10t01@bio10-desktop 2022-07-19 11:35:41 ~/project/01-oxy/gene_cds/5001 $for i in `ls *.stop |awk -F"." '{print $1}'`; do echo cp $i.fa.bla.uniq.fa.maf.id.2.cds.fas.best.fas-gb.re.phy.stop ../reconstrctAnc/$i; done >>z_Ancp.sh #复制序列文件,到各自的文件夹 $bash z_Ancp.sh (base) vip10t01@bio10-desktop 2022-07-19 12:18:03 ~/project/01-oxy/gene_cds/reconstrctAnc $for i in `ls -F |grep "/$"`; do echo cp paml_tree.nwk $i; done >>z.sh #复制树文件到各个文件夹
$bash z.sh
$for i in `ls -F |grep "/$"`; do cp ../reconAnc.ctl $i; done #复制配置文件到各个文件夹
(base) vip10t01@bio10-desktop 2022-07-20 11:53:57 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo "sed -i 's/@/${i}\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' ${i}reconAnc.ctl"; done >>z_re.sh #修改配置文件
(base) vip10t01@bio10-desktop 2022-07-20 11:54:22 ~/project/01-oxy/gene_cds/reconstrctAnc
$head z_re.sh #一定看一下,有问题酌情修改
sed -i 's/@/OG0000000/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000000/reconAnc.ctl
sed -i 's/@/OG0000003/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000003/reconAnc.ctl
sed -i 's/@/OG0000004/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000004/reconAnc.ctl
sed -i 's/@/OG0000008/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000008/reconAnc.ctl
sed -i 's/@/OG0000015/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000015/reconAnc.ctl
sed -i 's/@/OG0000017/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000017/reconAnc.ctl
sed -i 's/@/OG0000021/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000021/reconAnc.ctl
sed -i 's/@/OG0000022/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000022/reconAnc.ctl
sed -i 's/@/OG0000024/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000024/reconAnc.ctl
sed -i 's/@/OG0000026/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000026/reconAnc.ctl
$sed "s/\/\\\./\\\./g" z_re.sh |head #酌情修改
$bash z_re.sh
#创建运行代码 (base) vip10t01@bio10-desktop 2022-07-20 12:22:23 ~/project/01-oxy/gene_cds/reconstrctAnc $for i in `ls -F |grep "/$"`; do echo "cd $i; codeml reconAnc.ctl"; done >>z_runAnce.sh
$cd OG0000000/; codeml reconAnc.ctl #用一个文件测试了一下,没啥问题。
#开始运行 (base) vip10t01@bio10-desktop 2022-07-29 07:44:18 ~/project/01-oxy/gene_cds/reconstrctAnc $nohup parallel -j 25 <z_runAnce.sh & 【半个小时不到就跑完了】 [1] 3980356
结果处理:每个orthologous做祖先重构时,树文件都是相同的,从而我们可以分析出需要进行比较的节点
tree with node labels for Rod Page's TreeView
((1_Lcha, 2_Pann) 14 , (3_Locu, (12_Sfor, ((11_Eele, 10_Drer) 18 , (9_Olat, ((5_Marm, 4_Malb) 21 , (7_Carg, (6_Atest, 8_Bsple) 23 ) 22 ) 20 ) 19 ) 17 ) 16 ) 15 ) 13 ;
条件1:提取air-breathing和其祖先序列在某一位点处有差异【5个中有两个存在差异就行】的氨基酸位点。
自己细化为:5个中有1、2、3、4、5个存在差异的类型
node #14 ==》2_Pann
node #18 ==》11_Eele
node #21 ==》4_Malb
node #22 ==》7_Carg
node #23 ==》6_Atest
在条件1的基础上进行位点分类:
对比所有air-breathing,如果该位点全部相同,则为convergent
对比所有air-breathing,如果该位点存在不同,则为parallel
# _author: Administrator # _date: 2022/7/29 # Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile P=[] M=[] A=[] E=[] C=[] n14=[] n18=[] n21=[] n22=[] n23=[] import re import sys with open(sys.argv[1], "r+") as infile: # with open("rst", "r+") as infile: for line in infile.readlines(): if line.startswith("Pann"): line_re = re.sub("Pann| |\n", "", line) P.append(line_re) # print(P) elif line.startswith("Malb"): line_re = re.sub("Malb| |\n", "", line) M.append(line_re) elif line.startswith("Atest"): line_re = re.sub("Atest| |\n", "", line) A.append(line_re) elif line.startswith("Carg"): line_re = re.sub("Carg| |\n", "", line) C.append(line_re) elif line.startswith("Eele"): line_re = re.sub("Eele| |\n", "", line) E.append(line_re) elif line.startswith("node #14"): line_re = re.sub("node #14| |\n", "", line) n14.append(line_re) elif line.startswith("node #18"): line_re = re.sub("node #18| |\n", "", line) n18.append(line_re) elif line.startswith("node #21"): line_re = re.sub("node #21| |\n", "", line) n21.append(line_re) elif line.startswith("node #22"): line_re = re.sub("node #22| |\n", "", line) n22.append(line_re) elif line.startswith("node #23"): line_re = re.sub("node #23| |\n", "", line) n23.append(line_re) #node #14 ==》2_Pann pn=[] flag=0 for a, b in zip(P[0], n14[0]): #字符串对应比较差异 if a != b: locate = a, b, flag pn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(pn) #node18 ==》11_Eele flag=0 en=[] for a, b in zip(E[0], n18[0]): #字符串对应比较差异 if a != b: locate = a, b, flag en.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(en) #node #21 ==》4_Malb flag=0 mn=[] for a, b in zip(M[0], n21[0]): #字符串对应比较差异 if a != b: locate = a, b, flag mn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(mn) #node #22 ==》7_Carg flag=0 cn=[] for a, b in zip(C[0], n22[0]): #字符串对应比较差异 if a != b: locate = a, b, flag cn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(cn) #node #23 ==》6_Atest flag=0 an=[] for a, b in zip(C[0], n22[0]): #字符串对应比较差异 if a != b: locate = a, b, flag an.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(an) # times=[] tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348 tmp2=[x[2] for x in pn] tmp3=[x[2] for x in en] tmp4=[x[2] for x in cn] tmp5=[x[2] for x in mn] # print(tmp1) # print(tmp2) times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表 # print(times) sets=set(times) dict={} for item in sets: dict.update({item:times.count(item)}) # print(dict) times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化) print(times5) # print(A) # convergentSite=open(sys.argv[2],"a+") # convergentSite=open("11-out1.txt","a+") # parallelSite=open(sys.argv[3],"a+") # parallelSite=open("11-out2.txt","a+") goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名 # goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名 for i in times5: i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]] # print(i_list) if "-" not in i_list: #只考虑位点非空(即:如果序列在该位点有"-",则不考虑该位点)【如果考虑,可以发现很多位点都是"-"】 # print(i_list[1:-1]) goalOrthologous.write(infile.name+"\n") #输出orthologous名字【同一个文件可能输出多次,后面再处理】 if len(set(i_list[1:-1]))==1: #convergent sites print(i_list) # convergentSite=open("11-out1.txt","a+") convergentSite=open(sys.argv[2],"a+") #注意输出格式为:siteNumber E A P C M convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+"\n") else: #common changes print(i_list) # parallelSite=open("11-out2.txt","a+") parallelSite=open(sys.argv[3],"a+") parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+"\n")
(base) vip10t01@bio10-desktop 2022-07-30 22:31:55 ~/project/01-oxy/gene_cds/reconstrctAnc/OG0009213 $python3 ../ancestorSeq.py OG0009213/rst OG0009213/convergentFile.txt OG0009213/parallelFile.txt z_goalOrthologous.txt #验证可用性【结果完全符合预期】
#写循环 (base) vip10t01@bio10-desktop 2022-07-30 22:42:10 ~/project/01-oxy/gene_cds/reconstrctAnc $for i in `ls -F |grep "/$"`; do echo "python3 ../ancestorSeq.py ${i}rst ${i}convergentFile.txt ${i}parallelFile.txt z_goalOrthologous.txt"; done >>z_ancestor.sh
#看看文件如下,符合预期
(base) vip10t01@bio10-desktop 2022-07-30 22:49:49 ~/project/01-oxy/gene_cds/reconstrctAnc $nohup bash z_ancestor.sh & #运行程序 [1] 1221584
#可能遇到重跑的情况
(base) vip10t01@bio10-desktop 2022-07-30 23:06:06 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "parallelFile.txt" |xargs rm #删除指定文件名
$find -name "convergentFile.txt" |xargs rm
$rm nohup.out
$rm z_goalOrthologous.txt
尝试结果1
#代码出了上述问题时,结果如下
(base) vip10t01@bio10-desktop 2022-07-31 07:43:35 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "parallelFile.txt" |wc -l #common change 5249 (base) vip10t01@bio10-desktop 2022-07-31 07:51:19 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "convergentFile.txt" |wc -l #convergent site 33
#修改代码后:
(base) vip10t01@bio10-desktop 2022-07-31 07:58:47 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "convergentFile.txt" |xargs rm $find -name "parallelFile.txt" |xargs rm $rm nohup.out $rm z_goalOrthologous.txt $nohup bash z_ancestor.sh & 【很快就跑完了】 [1] 1682506
(base) vip10t01@bio10-desktop 2022-07-31 07:43:35 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "parallelFile.txt" |wc -l #common changes 5253 (base) vip10t01@bio10-desktop 2022-07-31 07:51:19 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "convergentFile.txt" |wc -l #convergent sites 18
(base) vip10t01@bio10-desktop 2022-07-31 09:18:22 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "convergentFile.txt" #convergent orthologous ./OG0011532/convergentFile.txt ./OG0003214/convergentFile.txt ./OG0000890/convergentFile.txt ./OG0001393/convergentFile.txt ./OG0006886/convergentFile.txt ./OG0001851/convergentFile.txt ./OG0006300/convergentFile.txt ./OG0004181/convergentFile.txt ./OG0001577/convergentFile.txt ./OG0009321/convergentFile.txt ./OG0009594/convergentFile.txt ./OG0007484/convergentFile.txt ./OG0002346/convergentFile.txt ./OG0000982/convergentFile.txt ./OG0011231/convergentFile.txt ./OG0008506/convergentFile.txt ./OG0011059/convergentFile.txt ./OG0012207/convergentFile.txt
上面python脚本修改一下,将由祖先的什么氨基酸位点变为现在的什么氨基酸位点都输出。再跑一遍
# _author: Administrator # _date: 2022/7/29 # Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile P=[] M=[] A=[] E=[] C=[] n14=[] n18=[] n21=[] n22=[] n23=[] import re import sys with open(sys.argv[1], "r+") as infile: # with open("rst", "r+") as infile: for line in infile.readlines(): if line.startswith("Pann"): line_re = re.sub("Pann| |\n", "", line) P.append(line_re) # print(P) elif line.startswith("Malb"): line_re = re.sub("Malb| |\n", "", line) M.append(line_re) elif line.startswith("Atest"): line_re = re.sub("Atest| |\n", "", line) A.append(line_re) elif line.startswith("Carg"): line_re = re.sub("Carg| |\n", "", line) C.append(line_re) elif line.startswith("Eele"): line_re = re.sub("Eele| |\n", "", line) E.append(line_re) elif line.startswith("node #14"): line_re = re.sub("node #14| |\n", "", line) n14.append(line_re) elif line.startswith("node #18"): line_re = re.sub("node #18| |\n", "", line) n18.append(line_re) elif line.startswith("node #21"): line_re = re.sub("node #21| |\n", "", line) n21.append(line_re) elif line.startswith("node #22"): line_re = re.sub("node #22| |\n", "", line) n22.append(line_re) elif line.startswith("node #23"): line_re = re.sub("node #23| |\n", "", line) n23.append(line_re) #node #14 ==》2_Pann pn=[] flag=0 for a, b in zip(P[0], n14[0]): #字符串对应比较差异 if a != b: locate = a, b, flag pn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(pn) #node18 ==》11_Eele flag=0 en=[] for a, b in zip(E[0], n18[0]): #字符串对应比较差异 if a != b: locate = a, b, flag en.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(en) #node #21 ==》4_Malb flag=0 mn=[] for a, b in zip(M[0], n21[0]): #字符串对应比较差异 if a != b: locate = a, b, flag mn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(mn) #node #22 ==》7_Carg flag=0 cn=[] for a, b in zip(C[0], n22[0]): #字符串对应比较差异 if a != b: locate = a, b, flag cn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(cn) #node #23 ==》6_Atest flag=0 an=[] for a, b in zip(C[0], n22[0]): #字符串对应比较差异 if a != b: locate = a, b, flag an.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(an) # times=[] tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348 tmp2=[x[2] for x in pn] tmp3=[x[2] for x in en] tmp4=[x[2] for x in cn] tmp5=[x[2] for x in mn] # print(tmp1) # print(tmp2) times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表 # print(times) sets=set(times) dict={} for item in sets: dict.update({item:times.count(item)}) # print(dict) times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化) print(times5) # print(A) # convergentSite=open(sys.argv[2],"a+") # convergentSite=open("11-out1.txt","a+") # parallelSite=open(sys.argv[3],"a+") # parallelSite=open("11-out2.txt","a+") goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名 # goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名 for i in times5: i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]] node_list=[str(n14[0][i]),str(n18[0][i]),str(n21[0][i]),str(n22[0][i]),str(n23[0][i])] #修改处 # print(i_list) if "-" not in i_list: #只考虑位点非空(即:如果序列在该位点有"-",则不考虑该位点) # print(i_list[1:-1]) goalOrthologous.write(infile.name+"\n") if len(set(i_list[1:]))==1: #convergent print(i_list) # convergentSite=open("11-out1.txt","a+") convergentSite=open(sys.argv[2],"a+") #注意输出格式为:siteNumber E A P C M node14 node18 node21 node22 node23 convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+"\n") #修改处 else: #parallel print(i_list) # parallelSite=open("11-out2.txt","a+") parallelSite=open(sys.argv[3],"a+") parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+"\n") #修改处
(base) vip10t01@bio10-desktop 2022-07-31 10:31:18 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "convergentFile.txt" |xargs rm $find -name "parallelFile.txt" |xargs rm $rm z_goalOrthologous.txt
(base) vip10t01@bio10-desktop 2022-07-31 11:09:13 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup bash z_ancestor.sh & 【正在跑】
[1] 1857903
#python代码再修改一下,将改变处氨基酸位点的每条序列的对应位置的氨基酸都输出
# _author: Administrator # _date: 2022/7/29 # Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile P=[] M=[] A=[] E=[] C=[] n14=[] n18=[] n21=[] n22=[] n23=[] LC=[] LO=[] BS=[] MA=[] OL=[] DR=[] SF=[] N13=[] N15=[] N16=[] N17=[] N19=[] N20=[] import re import sys with open(sys.argv[1], "r+") as infile: # with open("rst", "r+") as infile: for line in infile.readlines(): if line.startswith("Pann"): line_re = re.sub("Pann| |\n", "", line) P.append(line_re) # print(P) elif line.startswith("Malb"): line_re = re.sub("Malb| |\n", "", line) M.append(line_re) elif line.startswith("Atest"): line_re = re.sub("Atest| |\n", "", line) A.append(line_re) elif line.startswith("Carg"): line_re = re.sub("Carg| |\n", "", line) C.append(line_re) elif line.startswith("Eele"): line_re = re.sub("Eele| |\n", "", line) E.append(line_re) elif line.startswith("node #14"): line_re = re.sub("node #14| |\n", "", line) n14.append(line_re) elif line.startswith("node #18"): line_re = re.sub("node #18| |\n", "", line) n18.append(line_re) elif line.startswith("node #21"): line_re = re.sub("node #21| |\n", "", line) n21.append(line_re) elif line.startswith("node #22"): line_re = re.sub("node #22| |\n", "", line) n22.append(line_re) elif line.startswith("node #23"): line_re = re.sub("node #23| |\n", "", line) n23.append(line_re) elif line.startswith("Lcha"): line_re = re.sub("Lcha| |\n", "", line) LC.append(line_re) elif line.startswith("Locu"): line_re = re.sub("Locu| |\n", "", line) LO.append(line_re) elif line.startswith("Bsple"): line_re = re.sub("Bsple| |\n", "", line) BS.append(line_re) elif line.startswith("Marm"): line_re = re.sub("Marm| |\n", "", line) MA.append(line_re) elif line.startswith("Olat"): line_re = re.sub("Olat| |\n", "", line) OL.append(line_re) elif line.startswith("Drer"): line_re = re.sub("Drer| |\n", "", line) DR.append(line_re) elif line.startswith("Sfor"): line_re = re.sub("Sfor| |\n", "", line) SF.append(line_re) elif line.startswith("node #13"): line_re = re.sub("node #13| |\n", "", line) N13.append(line_re) elif line.startswith("node #15"): line_re = re.sub("node #15| |\n", "", line) N15.append(line_re) elif line.startswith("node #16"): line_re = re.sub("node #16| |\n", "", line) N16.append(line_re) elif line.startswith("node #17"): line_re = re.sub("node #17| |\n", "", line) N17.append(line_re) elif line.startswith("node #19"): line_re = re.sub("node #19| |\n", "", line) N19.append(line_re) elif line.startswith("node #20"): line_re = re.sub("node #20| |\n", "", line) N20.append(line_re) #node #14 ==》2_Pann pn=[] flag=0 for a, b in zip(P[0], n14[0]): #字符串对应比较差异 if a != b: locate = a, b, flag pn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(pn) #node18 ==》11_Eele flag=0 en=[] for a, b in zip(E[0], n18[0]): #字符串对应比较差异 if a != b: locate = a, b, flag en.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(en) #node #21 ==》4_Malb flag=0 mn=[] for a, b in zip(M[0], n21[0]): #字符串对应比较差异 if a != b: locate = a, b, flag mn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(mn) #node #22 ==》7_Carg flag=0 cn=[] for a, b in zip(C[0], n22[0]): #字符串对应比较差异 if a != b: locate = a, b, flag cn.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(cn) #node #23 ==》6_Atest flag=0 an=[] for a, b in zip(C[0], n22[0]): #字符串对应比较差异 if a != b: locate = a, b, flag an.append(locate) # print('a:', a, ' b:', b, ' site:',flag) flag+=1 else: flag+=1 # print(an) # times=[] tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348 tmp2=[x[2] for x in pn] tmp3=[x[2] for x in en] tmp4=[x[2] for x in cn] tmp5=[x[2] for x in mn] # print(tmp1) # print(tmp2) times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表 # print(times) sets=set(times) dict={} for item in sets: dict.update({item:times.count(item)}) # print(dict) times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化) print(times5) # print(A) # convergentSite=open(sys.argv[2],"a+") # convergentSite=open("11-out1.txt","a+") # parallelSite=open(sys.argv[3],"a+") # parallelSite=open("11-out2.txt","a+") goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名 # goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名 for i in times5: i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]] node_list=[str(n14[0][i]),str(n18[0][i]),str(n21[0][i]),str(n22[0][i]),str(n23[0][i])] I_list=[str(LC[0])[i],str(LO[0])[i],str(BS[0])[i],str(MA[0])[i],str(OL[0])[i],str(DR[0])[i],str(SF[0])[i]] Node_list=[str(N13[0][i]),str(N15[0][i]),str(N16[0][i]),str(N17[0][i]),str(N19[0][i]),str(N20[0][i])] # print(i_list) if "-" not in i_list: #只考虑位点非空(即:如果序列在该位点有"-",则不考虑该位点) # print(i_list[1:-1]) goalOrthologous.write(infile.name+"\n") if len(set(i_list[1:]))==1: #convergent print(i_list) # convergentSite=open("11-out1.txt","a+") convergentSite=open(sys.argv[2],"a+") #注意输出格式为:siteNumber Eele Atest Pann Carg Malb node14 node18 node21 node22 node23 @ Lcha Locu Bsple Marm Olat Drer Sfor @ node13 node15 node16 node17 node19 node20 convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+" @ "+" ".join(I_list[:])+" @ "+" ".join(Node_list[:])+"\n") else: #parallel print(i_list) # parallelSite=open("11-out2.txt","a+") parallelSite=open(sys.argv[3],"a+") parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+" @ "+" ".join(I_list[:])+" @ "+" ".join(Node_list[:])+"\n")
(base) vip10t01@bio10-desktop 2022-07-31 12:22:38 ~/project/01-oxy/gene_cds/reconstrctAnc $rm nohup.out $find -name "convergentFile.txt" |xargs rm $find -name "parallelFile.txt" |xargs rm $rm z_goalOrthologous.txt (base) vip10t01@bio10-desktop 2022-07-31 12:56:51 ~/project/01-oxy/gene_cds/reconstrctAnc $nohup bash z_ancestor.sh & [1] 1889760
(base) vip10t01@bio10-desktop 2022-07-31 19:31:50 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |wc -l #跟前面完全相同,结果可靠
5253
(base) vip10t01@bio10-desktop 2022-07-31 19:35:50 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l #跟前面完全相同,结果可靠
18
结果处理:直接处理convergent。手动统计,但是统计过程中发现需要统计位点由什么变为什么,这个是以后做的时候需要改进的,这里量少,手动统计就行了。
首先判断了基因是否是one-to-one gene;
然后判断基因是否是正选择基因,从而该基因是否为适应进化基因;
接下来判断是parallel site还是convergent site
#主要的结果统计 (base) vip10t01@bio10-desktop 2022-08-01 09:30:39 ~/project/01-oxy/gene_cds/reconstrctAnc $cat z_goalOrthologous.txt |uniq |wc -l #判断总的有多少个基因 5259 $find -name "parallelFile.txt" |wc -l #查考有多少个基因有common sites 5253 $find -name "parallelFile.txt" |xargs cat >>z_commonSite.txt #用来判断有多少个common sites $cat z_commonSite.txt |wc -l #common sites数量 12878 (base) vip10t01@bio10-desktop 2022-08-01 09:39:05 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "convergentFile.txt" |wc -l #查看多少个基因有convergent site 18 $find -name "convergentFile.txt" |xargs cat >>z_convergentSite.txt #用来判断有多少个convergent sites $cat z_convergentSite.txt |wc -l #convergent sites数量 18
(base) vip10t01@bio10-desktop 2022-08-01 16:06:23 ~/project/01-oxy/gene_cds/reconstrctAnc $find -name "parallelFile.txt" >z_commonSiteFile.txt
$sed "s/\.//g" z_commonSiteFile.txt |sed "s/\/parallelFiletxt/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop\.out/g" |sed "s/\///g" |head
标签:node,重构,print,202208,re,flag,序列,line,txt 来源: https://www.cnblogs.com/ly-zy/p/16560757.html