首页 > 其他分享> > 202208祖先序列重构

202208祖先序列重构

2022-08-08 10:03:38 作者：互联网

首先理解清楚，基因发生了正选择，并不表示该基因就是发生了趋同进化（适应趋同）。那么什么才是发生了适应趋同呢？一般定义为满足两个条件：1、处于正选择；2、现生物种氨基酸位点同祖先相比存在改变（而且最好是正选择位点同改变位点时同一位点）。

现生物种氨基酸位点同祖先相比存在变化可分为两种情况：

　　1、所研究的现生目标物种在某位点处，氨基酸残基都相同，这种情况就被称为convergent site。又可进一步分为parallel substitution和convergent substitution 【Storz J F. Causes of molecular convergence and parallelism in protein evolution[J]. Nature Reviews Genetics, 2016, 17(4): 239-250. Hu Y, Wu Q I, Ma S, et al. Comparative genomics reveals convergent evolution between the bamboo-eating giant and red pandas[J]. Proceedings of the National Academy of Sciences, 2017, 114(5): 1081-1086.】

　　2、所研究的现生目标物种在某位点处，氨基酸残基不完全相同，该情况称为common change。

以上两种根据自己的研究情况，只要基因又与正选择基因重叠，就可以称该基因发生了趋同适应进化【Foote A D, Liu Y, Thomas G W C, et al. Convergent evolution of the genomes of marine mammals[J]. Nature genetics, 2015, 47(3): 272-275.】。

#文件1、蛋白序列文件（cds序列也可以，只需要修改配置文件参数seqtype = 3）

#文件2、树文件

(base) vip10t01@bio10-desktop 2022-07-17 22:17:59 ~/project/01-oxy/gene_cds/reconstrctAnc
$cat paml_tree.nwk
((Lcha:0.235889,Pann:0.271384):0.058453,(Locu:0.1986,(Sfor:0.182893,((Eele:0.166864,Drer:0.159921):0.0522582,(Olat:0.148378,((Marm:0.072489,Malb:0.0946715):0.0165605,(Carg:0.103992,(Atest:0.0589527,Bsple:0.0974119):0.0139159):0.0136644):0.0327347):0.104933):0.0345937):0.0507974):0.058453);

#文件3、配置文件

      seqfile = OG0013976.fa.bla.uniq.fa.maf.id.2.cds.fas.best.fas-gb.re.phy.stop                 * sequence data filename (核苷酸文件)
     treefile = paml_tree.nwk     * tree structure file name (不用标记树枝)
      outfile = mlc.ancestral.sequence.reconstruction        * main result file name

        noisy = 9  * 0,1,2,3,9: how much rubbish on the screen
      verbose = 2  * 0: concise; 1: detailed, 2: too much
      runmode = 0  * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

      seqtype = 3  * 1:codons; 2:AAs; 3:codons-->AAs (根据自己的序列文件是氨基酸还是核酸进行选择)
        clock = 0  * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
       aaDist = 0  * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
   aaRatefile = /home/data/vip10t01/biosoftware/paml4.9j/dat/jones.dat  * only used for aa seqs with model=empirical(_F) (从安装PAML的路径中找)
                   * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own
        model = 2
                   * models for codons:
                       * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
                   * models for AAs or codon-translated AAs:
                       * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
                       * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)
        icode = 0  * 0:universal code; 1:mammalian mt; 2-10:see below
        Mgene = 0
                   * codon: 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff
                   * AA: 0:rates, 1:separate
    fix_alpha = 0 * 0: estimate gamma shape parameter; 1: fix it at alpha
        alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 1  * different alphas for genes
        ncatG = 4  * # of categories in dG of NSsites models

        getSE = 0  * 0: don't want them, 1: want S.E.s of estimates
 RateAncestor = 1  * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
   Small_Diff = .5e-6
    cleandata = 0  * remove sites with ambiguity data (1:yes, 0:no)?
       method = 1  * Optimization method 0: simultaneous; 1: one branch a time

* Genetic codes: 0:universal, 1:mammalian mt., 2:yeast mt., 3:mold mt.,
* 4: invertebrate mt., 5: ciliate nuclear, 6: echinoderm mt., 
* 7: euplotid mt., 8: alternative yeast nu. 9: ascidian mt., 
* 10: blepharisma nu.
* These codes correspond to transl_table 1 to 11 of GENEBANK.

(base) vip10t01@bio10-desktop 2022-07-17 22:28:21 ~/project/01-oxy/gene_cds/reconstrctAnc
$codeml baseml_aa1.ctl #运行

由于需要的输出文件为rst，而这个文件对于每一个orthologous名字都是一样的，为了区分，只有将每个orthologous分别存于不同文件夹下运行【已证实同一文件夹下运行，最终rst文件只会覆盖】

(base) vip10t01@bio10-desktop 2022-07-19 11:34:07 ~/project/01-oxy/gene_cds/5001
$for i in `ls *.ctl_re |awk -F"." '{print $1}'`; do echo mkdir $i; done >>z_mkdir.sh #为每个文件创建独立文件夹

(base) vip10t01@bio10-desktop 2022-07-19 11:33:46 ~/project/01-oxy/gene_cds/reconstrctAnc
$mv ../5001/z_mkdir.sh .
$cat z_mkdir.sh |wc -l  #看一下数量
9868
$bash z_mkdir.sh #创建文件夹
(base) vip10t01@bio10-desktop 2022-07-19 11:35:41 ~/project/01-oxy/gene_cds/5001
$for i in `ls *.stop |awk -F"." '{print $1}'`; do echo cp $i.fa.bla.uniq.fa.maf.id.2.cds.fas.best.fas-gb.re.phy.stop ../reconstrctAnc/$i; done >>z_Ancp.sh #复制序列文件，到各自的文件夹

$bash z_Ancp.sh

(base) vip10t01@bio10-desktop 2022-07-19 12:18:03 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo cp paml_tree.nwk $i; done >>z.sh #复制树文件到各个文件夹
$bash z.sh
$for i in `ls -F |grep "/$"`; do cp ../reconAnc.ctl $i; done #复制配置文件到各个文件夹

(base) vip10t01@bio10-desktop 2022-07-20 11:53:57 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo "sed -i 's/@/${i}\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' ${i}reconAnc.ctl"; done >>z_re.sh #修改配置文件

(base) vip10t01@bio10-desktop 2022-07-20 11:54:22 ~/project/01-oxy/gene_cds/reconstrctAnc
$head z_re.sh #一定看一下，有问题酌情修改
sed -i 's/@/OG0000000/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000000/reconAnc.ctl
sed -i 's/@/OG0000003/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000003/reconAnc.ctl
sed -i 's/@/OG0000004/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000004/reconAnc.ctl
sed -i 's/@/OG0000008/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000008/reconAnc.ctl
sed -i 's/@/OG0000015/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000015/reconAnc.ctl
sed -i 's/@/OG0000017/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000017/reconAnc.ctl
sed -i 's/@/OG0000021/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000021/reconAnc.ctl
sed -i 's/@/OG0000022/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000022/reconAnc.ctl
sed -i 's/@/OG0000024/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000024/reconAnc.ctl
sed -i 's/@/OG0000026/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000026/reconAnc.ctl

$sed "s/\/\\\./\\\./g" z_re.sh |head #酌情修改

$bash z_re.sh

#创建运行代码
(base) vip10t01@bio10-desktop 2022-07-20 12:22:23 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo "cd $i; codeml reconAnc.ctl"; done >>z_runAnce.sh
$cd OG0000000/; codeml reconAnc.ctl #用一个文件测试了一下，没啥问题。

#开始运行
(base) vip10t01@bio10-desktop 2022-07-29 07:44:18 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup parallel -j 25 <z_runAnce.sh & 【半个小时不到就跑完了】
[1] 3980356

结果处理：每个orthologous做祖先重构时，树文件都是相同的，从而我们可以分析出需要进行比较的节点

tree with node labels for Rod Page's TreeView
((1_Lcha, 2_Pann) 14 , (3_Locu, (12_Sfor, ((11_Eele, 10_Drer) 18 , (9_Olat, ((5_Marm, 4_Malb) 21 , (7_Carg, (6_Atest, 8_Bsple) 23 ) 22 ) 20 ) 19 ) 17 ) 16 ) 15 ) 13 ;

条件1：提取air-breathing和其祖先序列在某一位点处有差异【5个中有两个存在差异就行】的氨基酸位点。
　　自己细化为：5个中有1、2、3、4、5个存在差异的类型
node #14 ==》2_Pann
node #18 ==》11_Eele
node #21 ==》4_Malb
node #22 ==》7_Carg
node #23 ==》6_Atest
在条件1的基础上进行位点分类：
　　对比所有air-breathing，如果该位点全部相同，则为convergent
　　对比所有air-breathing，如果该位点存在不同，则为parallel

# _author: Administrator
# _date: 2022/7/29
# Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile
P=[]
M=[]
A=[]
E=[]
C=[]
n14=[]
n18=[]
n21=[]
n22=[]
n23=[]
import re
import sys
with open(sys.argv[1], "r+") as infile:
# with open("rst", "r+") as infile:
    for line in infile.readlines():
        if line.startswith("Pann"):
            line_re = re.sub("Pann| |\n", "", line)
            P.append(line_re)
            # print(P)
        elif line.startswith("Malb"):
            line_re = re.sub("Malb| |\n", "", line)
            M.append(line_re)
        elif line.startswith("Atest"):
            line_re = re.sub("Atest| |\n", "", line)
            A.append(line_re)
        elif line.startswith("Carg"):
            line_re = re.sub("Carg| |\n", "", line)
            C.append(line_re)
        elif line.startswith("Eele"):
            line_re = re.sub("Eele| |\n", "", line)
            E.append(line_re)
        elif line.startswith("node #14"):
            line_re = re.sub("node #14| |\n", "", line)
            n14.append(line_re)
        elif line.startswith("node #18"):
            line_re = re.sub("node #18| |\n", "", line)
            n18.append(line_re)
        elif line.startswith("node #21"):
            line_re = re.sub("node #21| |\n", "", line)
            n21.append(line_re)
        elif line.startswith("node #22"):
            line_re = re.sub("node #22| |\n", "", line)
            n22.append(line_re)
        elif line.startswith("node #23"):
            line_re = re.sub("node #23| |\n", "", line)
            n23.append(line_re)
#node #14 ==》2_Pann
pn=[]
flag=0
for a, b in zip(P[0], n14[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        pn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(pn)
#node18 ==》11_Eele
flag=0
en=[]
for a, b in zip(E[0], n18[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        en.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(en)
#node #21 ==》4_Malb
flag=0
mn=[]
for a, b in zip(M[0], n21[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        mn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(mn)
#node #22 ==》7_Carg
flag=0
cn=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        cn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(cn)
#node #23 ==》6_Atest
flag=0
an=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        an.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(an)
# times=[]
tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348
tmp2=[x[2] for x in pn]
tmp3=[x[2] for x in en]
tmp4=[x[2] for x in cn]
tmp5=[x[2] for x in mn]
# print(tmp1)
# print(tmp2)
times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表
# print(times)

sets=set(times)
dict={}
for item in sets:
    dict.update({item:times.count(item)})
# print(dict)
times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化)
print(times5)
# print(A)
# convergentSite=open(sys.argv[2],"a+")
# convergentSite=open("11-out1.txt","a+")
# parallelSite=open(sys.argv[3],"a+")
# parallelSite=open("11-out2.txt","a+")
goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名
# goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名
for i in times5:
    i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]]
    # print(i_list)
    if "-" not in i_list: #只考虑位点非空（即：如果序列在该位点有"-"，则不考虑该位点）【如果考虑，可以发现很多位点都是"-"】
        # print(i_list[1:-1])
        goalOrthologous.write(infile.name+"\n") #输出orthologous名字【同一个文件可能输出多次，后面再处理】
        if len(set(i_list[1:-1]))==1: #convergent sites
            print(i_list)
            # convergentSite=open("11-out1.txt","a+")
            convergentSite=open(sys.argv[2],"a+")
            #注意输出格式为：siteNumber E A P C M
            convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+"\n")
        else: #common changes
            print(i_list)
            # parallelSite=open("11-out2.txt","a+")
            parallelSite=open(sys.argv[3],"a+")
            parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+"\n")

(base) vip10t01@bio10-desktop 2022-07-30 22:31:55 ~/project/01-oxy/gene_cds/reconstrctAnc/OG0009213
$python3 ../ancestorSeq.py OG0009213/rst OG0009213/convergentFile.txt OG0009213/parallelFile.txt z_goalOrthologous.txt  #验证可用性【结果完全符合预期】

#写循环
(base) vip10t01@bio10-desktop 2022-07-30 22:42:10 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo "python3 ../ancestorSeq.py ${i}rst ${i}convergentFile.txt ${i}parallelFile.txt z_goalOrthologous.txt"; done >>z_ancestor.sh
#看看文件如下，符合预期

(base) vip10t01@bio10-desktop 2022-07-30 22:49:49 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup bash z_ancestor.sh &  #运行程序  
[1] 1221584

#可能遇到重跑的情况
(base) vip10t01@bio10-desktop 2022-07-30 23:06:06 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |xargs rm #删除指定文件名
$find -name "convergentFile.txt" |xargs rm
$rm nohup.out
$rm z_goalOrthologous.txt

尝试结果1

#代码出了上述问题时，结果如下
(base) vip10t01@bio10-desktop 2022-07-31 07:43:35 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |wc -l #common change
5249

(base) vip10t01@bio10-desktop 2022-07-31 07:51:19 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l #convergent site
33

#修改代码后：

(base) vip10t01@bio10-desktop 2022-07-31 07:58:47 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |xargs rm
$find -name "parallelFile.txt" |xargs rm
$rm nohup.out
$rm z_goalOrthologous.txt

$nohup bash z_ancestor.sh &  【很快就跑完了】
[1] 1682506

(base) vip10t01@bio10-desktop 2022-07-31 07:43:35 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |wc -l #common changes
5253

(base) vip10t01@bio10-desktop 2022-07-31 07:51:19 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l #convergent sites
18

(base) vip10t01@bio10-desktop 2022-07-31 09:18:22 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt"  #convergent orthologous
./OG0011532/convergentFile.txt
./OG0003214/convergentFile.txt
./OG0000890/convergentFile.txt
./OG0001393/convergentFile.txt
./OG0006886/convergentFile.txt
./OG0001851/convergentFile.txt
./OG0006300/convergentFile.txt
./OG0004181/convergentFile.txt
./OG0001577/convergentFile.txt
./OG0009321/convergentFile.txt
./OG0009594/convergentFile.txt
./OG0007484/convergentFile.txt
./OG0002346/convergentFile.txt
./OG0000982/convergentFile.txt
./OG0011231/convergentFile.txt
./OG0008506/convergentFile.txt
./OG0011059/convergentFile.txt
./OG0012207/convergentFile.txt

上面python脚本修改一下，将由祖先的什么氨基酸位点变为现在的什么氨基酸位点都输出。再跑一遍

# _author: Administrator
# _date: 2022/7/29
# Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile
P=[]
M=[]
A=[]
E=[]
C=[]
n14=[]
n18=[]
n21=[]
n22=[]
n23=[]
import re
import sys
with open(sys.argv[1], "r+") as infile:
# with open("rst", "r+") as infile:
    for line in infile.readlines():
        if line.startswith("Pann"):
            line_re = re.sub("Pann| |\n", "", line)
            P.append(line_re)
            # print(P)
        elif line.startswith("Malb"):
            line_re = re.sub("Malb| |\n", "", line)
            M.append(line_re)
        elif line.startswith("Atest"):
            line_re = re.sub("Atest| |\n", "", line)
            A.append(line_re)
        elif line.startswith("Carg"):
            line_re = re.sub("Carg| |\n", "", line)
            C.append(line_re)
        elif line.startswith("Eele"):
            line_re = re.sub("Eele| |\n", "", line)
            E.append(line_re)
        elif line.startswith("node #14"):
            line_re = re.sub("node #14| |\n", "", line)
            n14.append(line_re)
        elif line.startswith("node #18"):
            line_re = re.sub("node #18| |\n", "", line)
            n18.append(line_re)
        elif line.startswith("node #21"):
            line_re = re.sub("node #21| |\n", "", line)
            n21.append(line_re)
        elif line.startswith("node #22"):
            line_re = re.sub("node #22| |\n", "", line)
            n22.append(line_re)
        elif line.startswith("node #23"):
            line_re = re.sub("node #23| |\n", "", line)
            n23.append(line_re)
#node #14 ==》2_Pann
pn=[]
flag=0
for a, b in zip(P[0], n14[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        pn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(pn)
#node18 ==》11_Eele
flag=0
en=[]
for a, b in zip(E[0], n18[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        en.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(en)
#node #21 ==》4_Malb
flag=0
mn=[]
for a, b in zip(M[0], n21[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        mn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(mn)
#node #22 ==》7_Carg
flag=0
cn=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        cn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(cn)
#node #23 ==》6_Atest
flag=0
an=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        an.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(an)
# times=[]
tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348
tmp2=[x[2] for x in pn]
tmp3=[x[2] for x in en]
tmp4=[x[2] for x in cn]
tmp5=[x[2] for x in mn]
# print(tmp1)
# print(tmp2)
times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表
# print(times)

sets=set(times)
dict={}
for item in sets:
    dict.update({item:times.count(item)})
# print(dict)
times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化)
print(times5)
# print(A)
# convergentSite=open(sys.argv[2],"a+")
# convergentSite=open("11-out1.txt","a+")
# parallelSite=open(sys.argv[3],"a+")
# parallelSite=open("11-out2.txt","a+")
goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名
# goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名
for i in times5:
    i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]]
    node_list=[str(n14[0][i]),str(n18[0][i]),str(n21[0][i]),str(n22[0][i]),str(n23[0][i])] #修改处
    # print(i_list)
    if "-" not in i_list: #只考虑位点非空（即：如果序列在该位点有"-"，则不考虑该位点）
        # print(i_list[1:-1])
        goalOrthologous.write(infile.name+"\n")
        if len(set(i_list[1:]))==1: #convergent
            print(i_list)
            # convergentSite=open("11-out1.txt","a+")
            convergentSite=open(sys.argv[2],"a+")
            #注意输出格式为：siteNumber E A P C M node14 node18 node21 node22 node23
            convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+"\n") #修改处
        else: #parallel
            print(i_list)
            # parallelSite=open("11-out2.txt","a+")
            parallelSite=open(sys.argv[3],"a+")
            parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+"\n") #修改处

(base) vip10t01@bio10-desktop 2022-07-31 10:31:18 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |xargs rm
$find -name "parallelFile.txt" |xargs rm
$rm z_goalOrthologous.txt

(base) vip10t01@bio10-desktop 2022-07-31 11:09:13 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup bash z_ancestor.sh & 【正在跑】
[1] 1857903

#python代码再修改一下，将改变处氨基酸位点的每条序列的对应位置的氨基酸都输出

# _author: Administrator
# _date: 2022/7/29
# Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile
P=[]
M=[]
A=[]
E=[]
C=[]
n14=[]
n18=[]
n21=[]
n22=[]
n23=[]

LC=[]
LO=[]
BS=[]
MA=[]
OL=[]
DR=[]
SF=[]
N13=[]
N15=[]
N16=[]
N17=[]
N19=[]
N20=[]
import re
import sys
with open(sys.argv[1], "r+") as infile:
# with open("rst", "r+") as infile:
    for line in infile.readlines():
        if line.startswith("Pann"):
            line_re = re.sub("Pann| |\n", "", line)
            P.append(line_re)
            # print(P)
        elif line.startswith("Malb"):
            line_re = re.sub("Malb| |\n", "", line)
            M.append(line_re)
        elif line.startswith("Atest"):
            line_re = re.sub("Atest| |\n", "", line)
            A.append(line_re)
        elif line.startswith("Carg"):
            line_re = re.sub("Carg| |\n", "", line)
            C.append(line_re)
        elif line.startswith("Eele"):
            line_re = re.sub("Eele| |\n", "", line)
            E.append(line_re)
        elif line.startswith("node #14"):
            line_re = re.sub("node #14| |\n", "", line)
            n14.append(line_re)
        elif line.startswith("node #18"):
            line_re = re.sub("node #18| |\n", "", line)
            n18.append(line_re)
        elif line.startswith("node #21"):
            line_re = re.sub("node #21| |\n", "", line)
            n21.append(line_re)
        elif line.startswith("node #22"):
            line_re = re.sub("node #22| |\n", "", line)
            n22.append(line_re)
        elif line.startswith("node #23"):
            line_re = re.sub("node #23| |\n", "", line)
            n23.append(line_re)

        elif line.startswith("Lcha"):
            line_re = re.sub("Lcha| |\n", "", line)
            LC.append(line_re)
        elif line.startswith("Locu"):
            line_re = re.sub("Locu| |\n", "", line)
            LO.append(line_re)
        elif line.startswith("Bsple"):
            line_re = re.sub("Bsple| |\n", "", line)
            BS.append(line_re)
        elif line.startswith("Marm"):
            line_re = re.sub("Marm| |\n", "", line)
            MA.append(line_re)
        elif line.startswith("Olat"):
            line_re = re.sub("Olat| |\n", "", line)
            OL.append(line_re)
        elif line.startswith("Drer"):
            line_re = re.sub("Drer| |\n", "", line)
            DR.append(line_re)
        elif line.startswith("Sfor"):
            line_re = re.sub("Sfor| |\n", "", line)
            SF.append(line_re)
        elif line.startswith("node #13"):
            line_re = re.sub("node #13| |\n", "", line)
            N13.append(line_re)
        elif line.startswith("node #15"):
            line_re = re.sub("node #15| |\n", "", line)
            N15.append(line_re)
        elif line.startswith("node #16"):
            line_re = re.sub("node #16| |\n", "", line)
            N16.append(line_re)
        elif line.startswith("node #17"):
            line_re = re.sub("node #17| |\n", "", line)
            N17.append(line_re)
        elif line.startswith("node #19"):
            line_re = re.sub("node #19| |\n", "", line)
            N19.append(line_re)
        elif line.startswith("node #20"):
            line_re = re.sub("node #20| |\n", "", line)
            N20.append(line_re)
#node #14 ==》2_Pann
pn=[]
flag=0
for a, b in zip(P[0], n14[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        pn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(pn)
#node18 ==》11_Eele
flag=0
en=[]
for a, b in zip(E[0], n18[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        en.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(en)
#node #21 ==》4_Malb
flag=0
mn=[]
for a, b in zip(M[0], n21[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        mn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(mn)
#node #22 ==》7_Carg
flag=0
cn=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        cn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(cn)
#node #23 ==》6_Atest
flag=0
an=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        an.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(an)
# times=[]
tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348
tmp2=[x[2] for x in pn]
tmp3=[x[2] for x in en]
tmp4=[x[2] for x in cn]
tmp5=[x[2] for x in mn]
# print(tmp1)
# print(tmp2)
times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表
# print(times)

sets=set(times)
dict={}
for item in sets:
    dict.update({item:times.count(item)})
# print(dict)
times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化)
print(times5)
# print(A)
# convergentSite=open(sys.argv[2],"a+")
# convergentSite=open("11-out1.txt","a+")
# parallelSite=open(sys.argv[3],"a+")
# parallelSite=open("11-out2.txt","a+")
goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名
# goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名
for i in times5:
    i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]]
    node_list=[str(n14[0][i]),str(n18[0][i]),str(n21[0][i]),str(n22[0][i]),str(n23[0][i])]
    I_list=[str(LC[0])[i],str(LO[0])[i],str(BS[0])[i],str(MA[0])[i],str(OL[0])[i],str(DR[0])[i],str(SF[0])[i]]
    Node_list=[str(N13[0][i]),str(N15[0][i]),str(N16[0][i]),str(N17[0][i]),str(N19[0][i]),str(N20[0][i])]
    # print(i_list)
    if "-" not in i_list: #只考虑位点非空（即：如果序列在该位点有"-"，则不考虑该位点）
        # print(i_list[1:-1])
        goalOrthologous.write(infile.name+"\n")
        if len(set(i_list[1:]))==1: #convergent
            print(i_list)
            # convergentSite=open("11-out1.txt","a+")
            convergentSite=open(sys.argv[2],"a+")
            #注意输出格式为：siteNumber Eele Atest Pann Carg Malb node14 node18 node21 node22 node23 @ Lcha Locu Bsple Marm Olat Drer Sfor @ node13 node15 node16 node17 node19 node20
            convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+" @ "+" ".join(I_list[:])+" @ "+" ".join(Node_list[:])+"\n")
        else: #parallel
            print(i_list)
            # parallelSite=open("11-out2.txt","a+")
            parallelSite=open(sys.argv[3],"a+")
            parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+" @ "+" ".join(I_list[:])+" @ "+" ".join(Node_list[:])+"\n")

(base) vip10t01@bio10-desktop 2022-07-31 12:22:38 ~/project/01-oxy/gene_cds/reconstrctAnc
$rm nohup.out
$find -name "convergentFile.txt" |xargs rm
$find -name "parallelFile.txt" |xargs rm
$rm z_goalOrthologous.txt

(base) vip10t01@bio10-desktop 2022-07-31 12:56:51 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup bash z_ancestor.sh & 
[1] 1889760

(base) vip10t01@bio10-desktop 2022-07-31 19:31:50 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |wc -l #跟前面完全相同，结果可靠
5253

(base) vip10t01@bio10-desktop 2022-07-31 19:35:50 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l #跟前面完全相同，结果可靠
18

结果处理：直接处理convergent。手动统计，但是统计过程中发现需要统计位点由什么变为什么，这个是以后做的时候需要改进的，这里量少，手动统计就行了。

首先判断了基因是否是one-to-one gene；

然后判断基因是否是正选择基因，从而该基因是否为适应进化基因；

接下来判断是parallel site还是convergent site

#主要的结果统计
(base) vip10t01@bio10-desktop 2022-08-01 09:30:39 ~/project/01-oxy/gene_cds/reconstrctAnc
$cat z_goalOrthologous.txt |uniq |wc -l  #判断总的有多少个基因
5259

$find -name "parallelFile.txt" |wc -l  #查考有多少个基因有common sites
5253
$find -name "parallelFile.txt" |xargs cat >>z_commonSite.txt #用来判断有多少个common sites
$cat z_commonSite.txt |wc -l  #common sites数量
12878

(base) vip10t01@bio10-desktop 2022-08-01 09:39:05 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l  #查看多少个基因有convergent site
18
$find -name "convergentFile.txt" |xargs cat >>z_convergentSite.txt #用来判断有多少个convergent sites
$cat z_convergentSite.txt |wc -l #convergent sites数量
18

(base) vip10t01@bio10-desktop 2022-08-01 16:06:23 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" >z_commonSiteFile.txt
$sed "s/\.//g" z_commonSiteFile.txt |sed "s/\/parallelFiletxt/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop\.out/g" |sed "s/\///g" |head

标签：node,重构,print,202208,re,flag,序列,line,txt
来源： https://www.cnblogs.com/ly-zy/p/16560757.html