linux – 在fasta文件中选择序列超过300 aa,“C”至少出现4次
作者:互联网
我有一个包含蛋白质序列的fasta文件.我想选择超过300个氨基酸的序列,半胱氨酸(C)氨基酸出现超过4次.
我用这个命令来选择超过300 aa的序列:
cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'
一些序列示例:
>jgi|Triasp1|216614|CE216613_3477
MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*
解决方法:
我不知道bioawk,但我认为它与awk相同,有一些初始解析和常量定义.
我将按如下方式进行.假设你想要找到字母C in的4倍以上且长度超过300的字符串,那么你可以这样做:
bioawk -c fastx '
(length($seq) > 300) && (gsub("C","C",$seq)>4) {
print ">"$name; print $seq
}' 72hDOWN-fasta.fasta
但这假定seq是完整的字符序列.
其背后的想法如下. gsub命令在字符串中执行替换并返回它所做的总替换.因此,如果我们用“C”替换所有字符“C”,我们实际上并没有改变字符串,而是返回字符串中“C”的总量.
From the 07001:
gsub(ere, repl[, in])
: Behave likesub
(see below), except that it shall replace all occurrences of the regular expression (like
the 07002 utility global substitute) in$0
or in the in argument,
when specified.
sub(ere, repl[, in ])
: Substitute the stringrepl
in place of the first instance of the extended regular expressionere
in stringin
and return the number of substitutions. An <ampersand> (&
) appearing in the stringrepl
shall be replaced by the string fromin
that matches the ERE. An <ampersand> preceded with a
<backslash> shall be interpreted as the literal
<ampersand> character. An occurrence of two consecutive
<backslash> characters shall be interpreted as just a single
literal <backslash> character. Any other occurrence of a
<backslash> (for example, preceding any other character) shall
be treated as a literal <backslash> character. Note that ifrepl
is a string literal (the lexical token STRING; see 07003), the
handling of the <ampersand> character occurs after any lexical
processing, including any lexical <backslash>-escape sequence
processing. Ifin
is specified and it is not an lvalue (see
07004), the behavior is undefined. Ifin
is omitted, awk
shall use the current record ($0
) in its place.
注意:BioAwk基于Brian Kernighan’s awk,记录于“The AWK Programming Language”,
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
.我不确定此版本是否与POSIX兼容.
标签:sequences,linux,awk,bioinformatics,fasta,awk 来源: https://codeday.me/bug/20190828/1747484.html