"Bioinformatics: Introduction and Methods" ---- Sequence Database Searching ---- Lecture Notes (5)

2019-09-10 17:06:43 Author: Internet

3.1 Sequence Databases

Sequence Database Searching
Rather than aligning sequences pair by pair, it is more common to search a sequence database in a high-throughput style.
That is, to identify similarities between:
a novel query sequence (whose structure and function are usually unknown and/or uncharacterized)
sequences in (public) databases (whose structures and functions have been elucidated and annotated)
The query sequence is compared/aligned with every sequence in the database.
Statistically significant hits are assumed to be related to the query sequence: 1. Similar function/structure; 2. Common evolutionary ancestor.
BLAST: to make the search efficient, a heuristic algorithm, BLAST (Basic Local Alignment Search Tool), was proposed by Altschul et al. in 1990.
BLAST finds the highest-scoring locally optimal alignments between a query sequence and the sequences in a database.
Very fast algorithm
Can be used to search extremely large databases
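Sufficiently sensitive and selective for most purposes
Robust - the default parameters just work for most cases
The BLAST papers have been cited more than ten thousand times, making BLAST one of the most widely recognized algorithms in bioinformatics.
Every record in the Swiss-Prot database has been manually annotated by a team of experts and contains comprehensive information on function, modifications, structure, and so on, as well as extensive links to other databases.
So when researchers need to annotate a new sequence as accurately and thoroughly as possible, they usually start with the Swiss-Prot database.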

3.2 A First Look at the BLAST Algorithm

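BLAST first finds short, highly similar segments between two sequences, the so-called seeds; it then extends them in both directions to build an alignment; finally, to guard against false positives, it computes the statistical significance of the alignment.
Seeding: For a given word length w (usually 3 for proteins and 11 for nucleotides), slice the query sequence into multiple contiguous "seed words".
Speedup: Index database: the database is pre-indexed so that all positions of a given seed in the database can be located quickly.
With a well-designed index structure, this lookup can be done in linear or even near-constant time, which greatly improves efficiency.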
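The seeding and index lookup described above can be sketched as follows. This is a minimal illustration, not BLAST's actual implementation: the function names, the dictionary-based index, and the toy protein sequences are all assumptions made for the example.

```python
from collections import defaultdict


def seed_words(query, w):
    """Return (offset, word) for every contiguous word of length w in query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def build_index(database, w):
    """Map each length-w word to every (sequence_id, offset) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index


def find_seed_hits(query, index, w):
    """Look up each query word in the index (an average O(1) dict access)."""
    hits = []
    for q_off, word in seed_words(query, w):
        for seq_id, db_off in index.get(word, []):
            hits.append((q_off, seq_id, db_off))
    return hits


# Toy protein database (made-up sequences, for illustration only).
database = {"seq1": "MKVLAAGIT", "seq2": "GGLAAGWW"}
index = build_index(database, w=3)
hits = find_seed_hits("TLAAGP", index, w=3)
print(hits)
```

The index is built once per database and reused for every query, which is where the speedup comes from: the cost of a search then depends mostly on the query length and the number of hits rather than on rescanning the whole database.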
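BLAST masks repetitive, low-complexity regions so that they do not produce too many false-positive hits.
Low complexity is judged by the information content of the sequence.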
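One simple way to make "information content" concrete is the Shannon entropy of the residue composition in a sliding window. The sketch below is a simplified stand-in and not the filter BLAST itself uses; the window size, the entropy threshold, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition of window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.0, mask_char="X"):
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            for j in range(i, i + window):
                masked[j] = mask_char
    return "".join(masked)


# Made-up protein sequence with a run of serines in the middle.
example = "MKTAYIAKQR" + "S" * 12 + "LEDVGHWTCN"
print(mask_low_complexity(example))
```

Because windows overlap, residues flanking the repeat can also be masked when their window is still dominated by the repeat; a production filter would handle such boundaries more carefully.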
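To improve sensitivity, during seeding BLAST considers not only the seed words obtained by slicing the query sequence but also neighborhood words that are similar to those seed words.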
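A brute-force sketch of neighborhood-word generation is shown below: enumerate every word of the same length and keep those whose score against the seed word reaches a threshold. The `toy_score` function (+5 for a match, -1 for a mismatch) is an assumption standing in for a real substitution matrix such as BLOSUM62, and the threshold values are illustrative only.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(a, b):
    """Toy substitution score (assumption): stands in for a real scoring matrix."""
    return 5 if a == b else -1


def word_score(u, v, score=toy_score):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(score(a, b) for a, b in zip(u, v))


def neighborhood(word, threshold, score=toy_score, alphabet=ALPHABET):
    """All words of the same length scoring at least threshold against word."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if word_score(word, cand, score) >= threshold
    ]


neighbors = neighborhood("LAA", threshold=9)
print(len(neighbors))
```

With this toy scoring, a threshold of 9 admits the word itself plus every word differing at exactly one position. Lowering the threshold enlarges the neighborhood, which raises sensitivity at the cost of more seed hits to process.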
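E-value: how likely a match is to arise by chance
The expected number of alignments with a score at least as high as the given score that would occur at random in the database that has been searched.
$E = Kmne^{-\lambda S}$, where m is the length of the query sequence, n is the size of the database, S is the score, and K and $\lambda$ are parameters that depend on the scoring matrix.
E is not a probability but an expectation, so it can be greater than 1.
$p = 1 - e^{-E}$, so p = 0.05 corresponds to E ≈ 0.0513. Some people use 0.05 as the E-value cutoff.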
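The two formulas above are easy to check numerically. In the sketch below the function names are made up for illustration, and the values passed for K and lambda are placeholders, not the parameters of any particular scoring matrix.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments: E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such chance alignment: p = 1 - exp(-E)."""
    return -math.expm1(-e)


def e_from_p(p):
    """Invert p = 1 - exp(-E) to get E = -ln(1 - p)."""
    return -math.log1p(-p)


# Placeholder parameters (assumptions, for illustration only).
e = e_value(score=50, m=100, n=1_000_000, K=0.1, lam=0.3)
print(e, p_value(e))
print(e_from_p(0.05))  # E-value corresponding to p = 0.05
```

For small values E and p are nearly equal, which is why an E-value cutoff of 0.05 behaves much like a p-value cutoff of 0.05. The formula also shows that E grows linearly with the database size n, so the same score is less significant in a larger database.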
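BLAST does not guarantee finding the optimal solution, but tries to find a good enough solution in less time.
The gain in speed comes at the cost of reduced sensitivity.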
Tradeoff: speed vs. sensitivity

3.3 Student Presentation

Why BLAST?

Homology is the central concept for all of biology.
BLAST is the tool most frequently used for calculating sequence similarity, by searching the databases.
If you work with one or a few proteins or genes, it can tell you about their conservation, active sites, structure and regulation in other organisms, etc.

What does BLAST do?

Identity: the occurrence of exactly the same nucleotide or amino acid in the same position in aligned sequences.
Similarity: a measure of how alike or different the sequences are.
Homology: is defined in terms of shared ancestors. Homologous sequences are often similar. Sequence regions that are homologous are also called conserved regions.

BLAST
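Step 0-----Filtering: -F flag: filter query sequence. Nucleotide sequences are masked with N; amino acid sequences are masked with X.
Step 1-----Seeding: -W flag: word size. Sets the seed length w, usually 11 for nucleotides and 3 for amino acids. A sequence of length n yields n-w+1 seeds.
Step 2-----Search word hits: Scoring matrix
Step 3-----Scanning: 1. Hash table: direct addressing method; 2. Deterministic finite automaton/finite state machine: much faster
Step 4-----Extending -> HSP: Cutoff score: S
Step 5-----Significance evaluation: E-value
The smaller the E-value, the better. If E is greater than 1, the aligned sequence we found could well be a random event, that is, an unreliable result. An E-value below 0.1 or 0.05 is regarded as statistically significant. An E-value below 1e-5 is taken to mean that the matched sequence and the query sequence are highly similar.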
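Step 4 can be illustrated with a minimal ungapped extension. The sketch extends a seed to the left and right and stops in each direction once the running score has dropped too far below the best score seen so far. The function name, the +1/-1 scoring, the `x_drop` value, and the toy sequences are all assumptions for the example, not BLAST's actual parameters.

```python
def extend_seed(query, subject, q_start, s_start, w, score, x_drop):
    """Extend a length-w seed in both directions without gaps.

    Returns (total_score, q_begin, q_end, s_begin), with q_end exclusive.
    Extension in one direction stops once the running score falls more
    than x_drop below the best score seen in that direction.
    """
    seed_score = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))

    def extend(q_pos, s_pos, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            q_pos += step
            s_pos += step
        return best, best_len

    right_score, right_len = extend(q_start + w, s_start + w, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    total = seed_score + left_score + right_score
    return total, q_start - left_len, q_start + w + right_len, s_start - left_len


def match_mismatch(a, b):
    """Toy nucleotide scoring (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


query = "GGACGTACCA"
subject = "TTACGTACTT"
# The seed "CGT" starts at offset 3 in both sequences.
result = extend_seed(query, subject, 3, 3, w=3, score=match_mismatch, x_drop=2)
print(result, query[result[1]:result[2]])
```

The extended segment would be reported as an HSP only if its total score reaches the cutoff score S; otherwise the seed hit is discarded.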
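Points to note when using BLAST:

Statistical significance in BLAST results does not fully represent biological significance, so when analyzing results one still has to consider structural and other aspects of the sequences.
If the sequence can be translated into protein, it is best to also align the protein sequences to check whether they are really highly similar.
Similarity is not the same concept as homology.
Expressions such as "50% homology" or "80% homology" are unscientific and should be avoided.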

Source: https://blog.csdn.net/wxw060709/article/details/100699912