首页 > 系统相关> > linux – 在fasta文件中选择序列超过300 aa,“C”至少出现4次

linux – 在fasta文件中选择序列超过300 aa,“C”至少出现4次

2019-08-28 02:50:21 作者：互联网

我有一个包含蛋白质序列的fasta文件.我想选择超过300个氨基酸的序列,半胱氨酸(C)氨基酸出现超过4次.

我用这个命令来选择超过300 aa的序列：

 cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'

一些序列示例：

  >jgi|Triasp1|216614|CE216613_3477
 MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
 NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
 YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
 AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
 YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
 MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
 QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*

解决方法:

我不知道bioawk,但我认为它与awk相同,有一些初始解析和常量定义.

我将按如下方式进行.假设你想要找到字母C in的4倍以上且长度超过300的字符串,那么你可以这样做：

bioawk -c fastx '
   (length($seq) > 300) && (gsub("C","C",$seq)>4) {
       print ">"$name; print $seq
   }' 72hDOWN-fasta.fasta

但这假定seq是完整的字符序列.

其背后的想法如下. gsub命令在字符串中执行替换并返回它所做的总替换.因此,如果我们用“C”替换所有字符“C”,我们实际上并没有改变字符串,而是返回字符串中“C”的总量.

From the 07001:

gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like
the 07002 utility global substitute) in $0 or in the in argument,
when specified.

sub(ere, repl[, in ]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in
and return the number of substitutions. An <ampersand> ( &
) appearing in the string repl shall be replaced by the string from in
that matches the ERE. An <ampersand> preceded with a
<backslash> shall be interpreted as the literal
<ampersand> character. An occurrence of two consecutive
<backslash> characters shall be interpreted as just a single
literal <backslash> character. Any other occurrence of a
<backslash> (for example, preceding any other character) shall
be treated as a literal <backslash> character. Note that if repl
is a string literal (the lexical token STRING; see 07003), the
handling of the <ampersand> character occurs after any lexical
processing, including any lexical <backslash>-escape sequence
processing. If in is specified and it is not an lvalue (see
07004), the behavior is undefined. If in is omitted, awk
shall use the current record ($0) in its place.

注意：BioAwk基于Brian Kernighan’s awk,记录于“The AWK Programming Language”,
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
.我不确定此版本是否与POSIX兼容.

标签：sequences,linux,awk,bioinformatics,fasta,awk
来源： https://codeday.me/bug/20190828/1747484.html