The BLAST Algorithm for Sequence Retrieval

15. The BLAST Algorithm for Sequence Retrieval#

15.1. Introduction#

BLAST (Basic Local Alignment Search Tool) is a powerful algorithm designed for comparing biological sequences, such as protein amino acid sequences or DNA/RNA nucleotide sequences [AGM+90]. Developed in 1990, BLAST has become one of the most widely used tools in bioinformatics, cited more than hundred thousand times.

The primary purpose of BLAST is to compare a query sequence against a database of sequences, identifying those with a similarity above a certain threshold. This enables researchers to efficiently find sequences that resemble the query, facilitating further biological research.

15.2. Background and Development#

BLAST was designed to address a fundamental challenge in bioinformatics: how to efficiently search large genomic databases. Its heuristic algorithm offers a much faster alternative to traditional methods like Smith-Waterman and Needleman-Wunsch, which focus on optimal alignment but are computationally intensive.

BLAST’s success and widespread adoption stem from its balance of speed and sensitivity, making it an essential tool for biological research. Its development was also supported by a novel stochastic model [KA90], the so called Karlin-Altschul statistics, forming the statistical foundation for the tool.

15.3. K-mer indexing#

15.3.1. What are k-mers?#

Before we describe the details of BLAST we need to describe the concept of k-mers. A k-mer is a contiguous substring of length k derived from a larger sequence:

Generation: K-mers are generated by sliding a window of length k across a sequence, creating overlapping substrings.
Examples: For a DNA sequence “AGCTAGC”, 3-mers (k=3) would include “AGC”, “GCT”, “CTA”, etc.
Importance: K-mers are useful for representing sequences, identifying sequence patterns, and conducting sequence comparisons.

K-mers can be used for indexing sequences, making sequence retrieval more efficient:

Hash Table Creation: A hash table or other data structure can store k-mers as keys, mapping them to sequence locations.
Fast Lookups: By indexing k-mers from database sequences, BLAST can quickly look up similar k-mers from the query sequence, avoiding the need for a full pairwise alignment.
Word Hits: K-mer matches between the query and database sequences, termed “word hits”, serve as initial points of comparison for BLAST.

Dividing “APEPTIDE” into 3-mers

The process of dividing a sequence into k-mers involves taking every possible contiguous subsequence of length k from the given sequence. Here’s how it looks for the 3-mers of “APEPTIDE”:

APEPTIDE
APEPTIDE
APEPTIDE
APEPTIDE
APEPTIDE
APEPTIDE

This results in the following 3-mers:

APE
PEP
EPT
PTI
TID
IDE

15.3.2. Indexing#

Before BLAST can efficently search a database, it is critical that one index the sequences. Here’s how BLAST indexes sequences: BLAST creates a lookup table where each entry corresponds to a k-mer (typically 11 for nucleotides and 3-6 for proteins). For each k-mer in the database sequences, the algorithm stores its position within the sequence.

15.4. The BLAST Algorithm#

The Basic Local Alignment Search Tool (BLAST) is a heuristic algorithm designed to create local alignments, helping to find the closest homologs of sequences in a database. The algorithm leverages indices of k-tuples from the sequence database, with k typically ranging from 3 to 6 for proteins and 6 to 20 for DNA. BLAST follows these key steps:

Word Hits Generation: The query sequence is divided into k-tuples, which are compared to pre-computed k-tuples from the database sequences.
Seeding: The matching k-tuples, known as “seeds”, are used as anchor points for the alignment procedure. The list of k-tuples is expanded to include any k-tuples that match with an alignment score above a threshold T.
Extension: BLAST extends these seeds in both directions, creating larger alignments. This extension continues until the alignment score drops below a threshold, preventing poor alignments from extending. These extended matches are known as high-scoring segment pairs (HSPs).
Alignment Scoring: Each extended alignment is scored based on the similarity between the aligned regions, taking into account matches, mismatches, and gaps.
Statistical Evaluation: BLAST calculates the statistical significance of each alignment using E-values, indicating how many alignments of the same quality would be found by chance in a database of a given size.
Output: The algorithm then lists all alignments above a threshold score, providing details such as alignment positions, scores, and statistical significance.

This combined approach gives an overview of BLAST, highlighting its key features and processes.

15.5. E-Values and Their Meaning#

BLAST assesses the statistical significance of alignments using the Gumbel extreme value distribution (EVD) model. This model estimates the probability of an alignment score being due to chance. An E-value (expect value) is a statistical measure used in BLAST to assess the significance of an alignment:

Definition: The E-value represents the number of alignments expected to occur by chance in a database of a given size.
Interpretation: Lower E-values indicate more significant alignments. For example, an E-value of 0.01 suggests that this alignment quality would be obtained by chance 1 times out of hundred with a query sequence of the same length of sequence and database size.

The E-values in BLAST searches provide a direct indication of alignment quality:

Significance Threshold: Alignments with E-values below a certain threshold (e.g., 0.05) are considered significant, indicating a meaningful biological relationship.
Comparison: E-values allow for comparing the statistical significance of different alignments, making them a valuable metric for researchers.
Database Size Influence: The size of the database affects E-values, as larger databases increase the likelihood of chance alignments, resulting in higher E-values.

15.6. How is the BLAST E-value Calculated?#

The E-value (expect value) is a statistical measure used in BLAST to assess the significance of an alignment between a query sequence and database sequences. It reflects the expected number of alignments that would occur by chance in a given database search. The E-value is calculated based on the alignment score \(S\), the search space size \(m × n\), and parameters derived from the scoring system and database composition, such as the Karlin-Altschul parameters (\(K\) and \(λ\)). The formula for the E-value is:

\(E = Kmne^{-\lambda S} \)

Where:

\(m\) is the length of the query sequence.
\(n\) is the length of the database, which is the sum of the lengths of all the sequences in the database.
\(K\) and \(\lambda\) are the Karlin-Altschul parameters. They can be estimated from large sets of random sequence alignments:
- \(\lambda\) normalizes the alignment score.
- \(K\) scales the E-value based on the database and sequence lengths.
\(S\) is the alignment score, calculated from the selected scoring matrix and the alignment of residues. It accounts for the sum of substitution and gap scores for the aligned residues.

The E-value is directly proportional to the search space size (m × n) and inversely proportional to the exponential function of the alignment score (S). Consequently, larger databases provide more opportunities for chance alignments, resulting in higher E-values (indicating weaker statistical significance) for the same level of sequence similarity.

15.7. Types of BLAST#

BLAST comes in a couple of different versions, depending on its usage. Here are the main versions of BLAST:

BLASTn (Nucleotide BLAST): Compares nucleotide sequences against a database or another sequence to identify evolutionary relationships. Useful in phylogenetics studies.
tBLASTn: Searches for proteins in untranslated DNA sequences. Takes a protein sequence and compares it to all potential translations of a nucleotide sequence. Useful for finding protein-coding regions in ESTs and HTGs.
BLASTx: Compares a nucleotide query, translated into six possible protein sequences, against a protein database. This tool is essential when DNA sequence reading frames are uncertain or contain errors that could impact protein coding. It provides combined statistics across all frames for new DNA analyses.
BLASTp (Protein BLAST): Used for comparing protein sequences against a database (e.g., nr database). It helps in identifying proteins by finding similar sequences in known protein databases.

15.8. The FASTA Format#

BLAST is using FASTA format as input format for its databases and queries. The FASTA format is a widely adopted standard for representing nucleotide and protein sequences in bioinformatics. Developed in the 1980s for the FASTA sequence alignment software, it has since become a versatile and essential format for storing and sharing sequence data. The FASTA format consists of two key components:

Header Line: Each sequence begins with a header line, which starts with a greater-than symbol (>). The text following this symbol provides a description of the sequence. This description often includes information such as a unique identifier (e.g., accession number), the source organism, and other metadata. For example:

>sp|P12345|PROT_HUMAN Human Protein Name [Homo sapiens]`

In this example, “sp” indicates the Swiss-Prot database, “P12345” is the accession number, “PROT_HUMAN” is the unique identifier, and additional information follows.

Sequence Lines: Below the header line, the sequence data is written as plain text. For nucleotide sequences, this consists of a string of the bases A, C, G, and T (or U for RNA). For protein sequences, it consists of a string of the standard 20 amino acid single-letter codes. The sequence can be broken into multiple lines, making it easier to read and process:

ATGCGTACGTGACGT  
CGTGAGCTAGTCAGT  

These sequence lines represent the data to be analyzed and compared.

Combing the two you get a fasta record e.g.

>sp|C9JLW8|MCRI1_HUMAN Mapk-regulated corepressor-interacting protein 1 OS=Homo sapiens OX=9606 GN=MCRIP1 PE=1 SV=1  
MTSSPVSRVVYNGKRTSSPRSPPSSSEIFTPAHEENVRFIYEAWQGVERDLRGQVPGGER  
GLVEEYVEKVPNPSLKTFKPIDLSDLKRRSTQDAKKS  

15.8.1. Usage and Applications#

The FASTA format’s simplicity and clarity make it ideal for storing and sharing sequence data. It is supported by most bioinformatics tools, including alignment algorithms, database search tools, and genome browsers. Additionally, the format is highly adaptable, allowing for easy conversion to and from other sequence formats.

FASTA files can contain multiple sequences, each represented by its own header and sequence lines, making them an efficient way to store large datasets. They are used in bioinformatics pipelines, providing a way to manage and share sequence information. Note, however, that FASTA is not a well defined format, and there are multiple variants in how both headers and sequence lines should be formated.

You can for instance get all the cannonical amino acid sequences of the human genome as a single FASTA file

15.9. References#

[AGM+90]

Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990.

[KA90]

Samuel Karlin and Stephen F Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences, 87(6):2264–2268, 1990.