The BLAST Algorithm for Sequence Retrieval

Introduction¶

BLAST (Basic Local Alignment Search Tool) is a powerful algorithm designed for comparing biological sequences, such as protein amino acid sequences or DNA/RNA nucleotide sequences Altschul et al. (1990). Developed in 1990, BLAST has become one of the most widely used tools in bioinformatics, cited more than a hundred thousand times.

The primary purpose of BLAST is to compare a query sequence against a database of sequences, identifying those with a similarity above a certain threshold. This enables researchers to efficiently find sequences that resemble the query, facilitating further biological research.

Background and Development¶

BLAST was designed to address a fundamental challenge in bioinformatics: how to efficiently search large genomic databases. Its heuristic algorithm offers a much faster alternative to traditional methods like Smith-Waterman and Needleman-Wunsch, which focus on optimal alignment but are computationally intensive.

BLAST’s success and widespread adoption stem from its balance of speed and sensitivity, making it an essential tool for biological research. Its development was also supported by a novel stochastic model Karlin & Altschul (1990), the so-called Karlin-Altschul statistics, forming the statistical foundation for the tool.

K-mer indexing¶

What are k-mers?¶

Before we describe the details of BLAST, we need to describe the concept of k-mers. A k-mer is a contiguous substring of length k derived from a larger sequence:

Generation: K-mers are generated by sliding a window of length k across a sequence, creating overlapping substrings.
Examples: For a DNA sequence “AGCTAGC”, 3-mers (k=3) would include “AGC”, “GCT”, “CTA”, etc.
Importance: K-mers are useful for representing sequences, identifying sequence patterns, and conducting sequence comparisons.

K-mers can be used for indexing sequences, making sequence retrieval more efficient:

Hash Table Creation: A hash table or other data structure can store k-mers as keys, mapping them to sequence locations.
Fast Lookups: By indexing k-mers from database sequences, BLAST can quickly look up similar k-mers from the query sequence, avoiding the need for a full pairwise alignment.
Word Hits: K-mer matches between the query and database sequences, termed “word hits”, serve as initial points of comparison for BLAST.

Indexing¶

Before BLAST can efficiently search a database, it is critical that one index the sequences. Here’s how BLAST indexes sequences: BLAST creates a lookup table where each entry corresponds to a k-mer (the default word size is 11 for nucleotides and 3 for proteins). For each k-mer in the database sequences, the algorithm stores its position within the sequence.

The BLAST Algorithm¶

The Basic Local Alignment Search Tool (BLAST) is a heuristic algorithm designed to create local alignments, helping to find the closest homologs of sequences in a database. The algorithm leverages indices of k-tuples from the sequence database, with k typically ranging from 3 to 6 for proteins and 6 to 20 for DNA. BLAST follows these key steps:

Word Hits Generation: The query sequence is divided into k-tuples, which are compared to pre-computed k-tuples from the database sequences.
Seeding: The matching k-tuples, known as “seeds”, are used as anchor points for the alignment procedure. The list of k-tuples is expanded to include any k-tuples that match with an alignment score above a threshold T.
Extension: BLAST extends these seeds in both directions, creating larger alignments. This extension continues until the alignment score drops below a threshold, preventing poor alignments from extending. These extended matches are known as high-scoring segment pairs (HSPs).
Alignment Scoring: Each extended alignment is scored based on the similarity between the aligned regions, taking into account matches, mismatches, and gaps.
Statistical Evaluation: BLAST calculates the statistical significance of each alignment using E-values, indicating how many alignments of the same quality would be found by chance in a database of a given size.
Output: The algorithm then lists all alignments above a threshold score, providing details such as alignment positions, scores, and statistical significance.

This combined approach gives an overview of BLAST, highlighting its key features and processes.

E-Values and Their Meaning¶

BLAST assesses the statistical significance of alignments using the Gumbel extreme value distribution (EVD) model. This model estimates the probability of an alignment score being due to chance. An E-value (expect value) is a statistical measure used in BLAST to assess the significance of an alignment with a score $S$ .

Definition: The E-value represents the number of alignments with a score equal to or greater than $S$ that would be expected to occur by chance in the searched database.
Interpretation: Lower E-values indicate more significant alignments. For example, an E-value of 0.01 means that, on average, 0.01 alignments of this quality or better would be expected by chance for a query sequence of the same length searched against a database of the same size — i.e. roughly one such chance alignment per hundred searches. (Strictly, the E-value is an expected number of hits rather than a probability, but for small E-values it closely approximates the probability of seeing at least one such alignment by chance.)

The E-values in BLAST searches provide a direct indication of alignment quality:

Significance Threshold: Alignments with E-values below a certain threshold (e.g., 0.05) are considered significant, indicating a meaningful biological relationship.
Comparison: E-values allow for comparing the statistical significance of different alignments, making them a valuable metric for researchers.
Database Size Influence: The size of the database affects E-values, as larger databases increase the likelihood of chance alignments, resulting in higher E-values.

How is the BLAST E-value Calculated?¶

The E-value (expect value) is a statistical measure used in BLAST to assess the significance of an alignment between a query sequence and database sequences. It reflects the expected number of alignments that would occur by chance in a given database search. The E-value is calculated based on the alignment score $S$ , the search space size $m × n$ , and parameters derived from the scoring system and database composition, such as the Karlin-Altschul parameters ( $K$ and $λ$ ). The formula for the E-value is:

$E = Kmne^{-\lambda S}$

Where:

$m$ is the length of the query sequence.
$n$ is the length of the database, which is the sum of the lengths of all the sequences in the database.
$K$ and $\lambda$ are the Karlin-Altschul parameters. They can be estimated from large sets of random sequence alignments:
- $\lambda$ normalizes the alignment score.
- $K$ scales the E-value based on the database and sequence lengths.
$S$ is the alignment score, calculated from the selected scoring matrix and the alignment of residues. It accounts for the sum of substitution and gap scores for the aligned residues.

The E-value is directly proportional to the search space size (m × n) and inversely proportional to the exponential function of the alignment score (S). Consequently, larger databases provide more opportunities for chance alignments, resulting in higher E-values (indicating weaker statistical significance) for the same level of sequence similarity.

How Are $K$ and $\lambda$ Obtained?¶

When BLAST was first developed, the parameters $K$ and $\lambda$ were estimated by aligning large numbers of randomized sequences with the same background composition as the database. The distribution of alignment scores from these random comparisons follows an extreme value distribution (EVD), as predicted by Karlin–Altschul theory. By fitting the empirical score distribution to the EVD, appropriate values of $K$ and $\lambda$ were obtained for each scoring matrix and residue composition.

Formally, $\lambda$ is defined as the unique positive solution to the equation

\sum_{i,j} p_i q_j e^{\lambda s_{ij}} = 1,

(1)

where $s_{ij}$ is the substitution score for aligning residues $i$ and $j$ , and $p_i$ , $q_j$ are their background frequencies. Once $\lambda$ is determined, $K$ is fitted so that the theoretical distribution

P(S \geq x) \approx 1 - \exp\!\big(-K m n e^{-\lambda x}\big)

(2)

matches the observed distribution of random alignment scores.

Today, BLAST does not generate random sequences for each search. Instead, it uses lookup tables of pre-computed values for $K$ and $\lambda$ , derived from these earlier large-scale calibrations. For standard scoring matrices (e.g. BLOSUM62, PAM250) and typical residue frequencies, these values are stable and can be reused. In addition, BLAST may apply composition-based adjustments if the query or database composition deviates significantly from the assumptions used in the pre-computed tables.

Types of BLAST¶

BLAST comes in a couple of different versions, depending on its usage. Here are the main versions of BLAST:

BLASTn (Nucleotide BLAST): Compares nucleotide sequences against a database or another sequence to identify evolutionary relationships. Useful in phylogenetics studies.
tBLASTn: Searches for proteins in untranslated DNA sequences. Takes a protein sequence and compares it to all potential translations of a nucleotide sequence. Useful for finding protein-coding regions in ESTs and HTGs.
BLASTx: Compares a nucleotide query, translated into six possible protein sequences, against a protein database. This tool is essential when DNA sequence reading frames are uncertain or contain errors that could impact protein coding. It provides combined statistics across all frames for new DNA analyses.
BLASTp (Protein BLAST): Used for comparing protein sequences against a database (e.g., nr database). It helps in identifying proteins by finding similar sequences in known protein databases.

The FASTA Format¶

BLAST is using FASTA format as input format for its databases and queries. The FASTA format is a widely adopted standard for representing nucleotide and protein sequences in bioinformatics. Developed in the 1980s for the FASTA sequence alignment software, it has since become a versatile and essential format for storing and sharing sequence data. The FASTA format consists of two key components:

Header Line: Each sequence begins with a header line, which starts with a greater-than symbol (>). The text following this symbol provides a description of the sequence. This description often includes information such as a unique identifier (e.g., accession number), the source organism, and other metadata. For example:

>sp|P12345|PROT_HUMAN Human Protein Name [Homo sapiens]

In this example, “sp” indicates the Swiss-Prot database, “P12345” is the accession number, “PROT_HUMAN” is the unique identifier, and additional information follows.

Sequence Lines: Below the header line, the sequence data is written as plain text. For nucleotide sequences, this consists of a string of the bases A, C, G, and T (or U for RNA). For protein sequences, it consists of a string of the standard 20 amino acid single-letter codes. The sequence can be broken into multiple lines, making it easier to read and process:

ATGCGTACGTGACGT  
CGTGAGCTAGTCAGT

These sequence lines represent the data to be analyzed and compared.

Combining the two you get a fasta record e.g.

>sp|C9JLW8|MCRI1_HUMAN Mapk-regulated corepressor-interacting protein 1 OS=Homo sapiens OX=9606 GN=MCRIP1 PE=1 SV=1  
MTSSPVSRVVYNGKRTSSPRSPPSSSEIFTPAHEENVRFIYEAWQGVERDLRGQVPGGER  
GLVEEYVEKVPNPSLKTFKPIDLSDLKRRSTQDAKKS

Usage and Applications¶

The FASTA format’s simplicity and clarity make it ideal for storing and sharing sequence data. It is supported by most bioinformatics tools, including alignment algorithms, database search tools, and genome browsers. Additionally, the format is highly adaptable, allowing for easy conversion to and from other sequence formats.

FASTA files can contain multiple sequences, each represented by its own header and sequence lines, making them an efficient way to store large datasets. They are used in bioinformatics pipelines, providing a way to manage and share sequence information. Note, however, that FASTA is not a well-defined format, and there are multiple variants in how both headers and sequence lines should be formatted.

You can for instance get all the canonical amino acid sequences of the human genome as a single FASTA file

References¶

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
Karlin, S., & Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences, 87(6), 2264–2268.