Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

The BLAST Algorithm for Sequence Retrieval

Introduction

BLAST (Basic Local Alignment Search Tool) is a powerful algorithm designed for comparing biological sequences, such as protein amino acid sequences or DNA/RNA nucleotide sequences Altschul et al. (1990). Developed in 1990, BLAST has become one of the most widely used tools in bioinformatics, cited more than a hundred thousand times.

The primary purpose of BLAST is to compare a query sequence against a database of sequences, identifying those with a similarity above a certain threshold. This enables researchers to efficiently find sequences that resemble the query, facilitating further biological research.

Background and Development

BLAST was designed to address a fundamental challenge in bioinformatics: how to efficiently search large genomic databases. Its heuristic algorithm offers a much faster alternative to traditional methods like Smith-Waterman and Needleman-Wunsch, which focus on optimal alignment but are computationally intensive.

BLAST’s success and widespread adoption stem from its balance of speed and sensitivity, making it an essential tool for biological research. Its development was also supported by a novel stochastic model Karlin & Altschul (1990), the so-called Karlin-Altschul statistics, forming the statistical foundation for the tool.

K-mer indexing

What are k-mers?

Before we describe the details of BLAST, we need to describe the concept of k-mers. A k-mer is a contiguous substring of length k derived from a larger sequence:

K-mers can be used for indexing sequences, making sequence retrieval more efficient:

Indexing

Before BLAST can efficiently search a database, it is critical that one index the sequences. Here’s how BLAST indexes sequences: BLAST creates a lookup table where each entry corresponds to a k-mer (typically 11 for nucleotides and 3-6 for proteins). For each k-mer in the database sequences, the algorithm stores its position within the sequence.

The BLAST Algorithm

The Basic Local Alignment Search Tool (BLAST) is a heuristic algorithm designed to create local alignments, helping to find the closest homologs of sequences in a database. The algorithm leverages indices of k-tuples from the sequence database, with k typically ranging from 3 to 6 for proteins and 6 to 20 for DNA. BLAST follows these key steps:

  1. Word Hits Generation: The query sequence is divided into k-tuples, which are compared to pre-computed k-tuples from the database sequences.

  2. Seeding: The matching k-tuples, known as “seeds”, are used as anchor points for the alignment procedure. The list of k-tuples is expanded to include any k-tuples that match with an alignment score above a threshold T.

  3. Extension: BLAST extends these seeds in both directions, creating larger alignments. This extension continues until the alignment score drops below a threshold, preventing poor alignments from extending. These extended matches are known as high-scoring segment pairs (HSPs).

  4. Alignment Scoring: Each extended alignment is scored based on the similarity between the aligned regions, taking into account matches, mismatches, and gaps.

  5. Statistical Evaluation: BLAST calculates the statistical significance of each alignment using E-values, indicating how many alignments of the same quality would be found by chance in a database of a given size.

  6. Output: The algorithm then lists all alignments above a threshold score, providing details such as alignment positions, scores, and statistical significance.

This combined approach gives an overview of BLAST, highlighting its key features and processes.

E-Values and Their Meaning

BLAST assesses the statistical significance of alignments using the Gumbel extreme value distribution (EVD) model. This model estimates the probability of an alignment score being due to chance. An E-value (expect value) is a statistical measure used in BLAST to assess the significance of an alignment with a score SS.

The E-values in BLAST searches provide a direct indication of alignment quality:

How is the BLAST E-value Calculated?

The E-value (expect value) is a statistical measure used in BLAST to assess the significance of an alignment between a query sequence and database sequences. It reflects the expected number of alignments that would occur by chance in a given database search. The E-value is calculated based on the alignment score SS, the search space size m×nm × n, and parameters derived from the scoring system and database composition, such as the Karlin-Altschul parameters (KK and λλ). The formula for the E-value is:

E=KmneλSE = Kmne^{-\lambda S}

Where:

The E-value is directly proportional to the search space size (m × n) and inversely proportional to the exponential function of the alignment score (S). Consequently, larger databases provide more opportunities for chance alignments, resulting in higher E-values (indicating weaker statistical significance) for the same level of sequence similarity.

How Are KK and λ\lambda Obtained?

When BLAST was first developed, the parameters KK and λ\lambda were estimated by aligning large numbers of randomized sequences with the same background composition as the database. The distribution of alignment scores from these random comparisons follows an extreme value distribution (EVD), as predicted by Karlin–Altschul theory. By fitting the empirical score distribution to the EVD, appropriate values of KK and λ\lambda were obtained for each scoring matrix and residue composition.

Formally, λ\lambda is defined as the unique positive solution to the equation

i,jpiqjeλsij=1,\sum_{i,j} p_i q_j e^{\lambda s_{ij}} = 1,

where sijs_{ij} is the substitution score for aligning residues ii and jj, and pip_i, qjq_j are their background frequencies. Once λ\lambda is determined, KK is fitted so that the theoretical distribution

P(Sx)1exp ⁣(Kmneλx)P(S \geq x) \approx 1 - \exp\!\big(-K m n e^{-\lambda x}\big)

matches the observed distribution of random alignment scores.

Today, BLAST does not generate random sequences for each search. Instead, it uses lookup tables of pre-computed values for KK and λ\lambda, derived from these earlier large-scale calibrations. For standard scoring matrices (e.g. BLOSUM62, PAM250) and typical residue frequencies, these values are stable and can be reused. In addition, BLAST may apply composition-based adjustments if the query or database composition deviates significantly from the assumptions used in the pre-computed tables.

Types of BLAST

BLAST comes in a couple of different versions, depending on its usage. Here are the main versions of BLAST:

The FASTA Format

BLAST is using FASTA format as input format for its databases and queries. The FASTA format is a widely adopted standard for representing nucleotide and protein sequences in bioinformatics. Developed in the 1980s for the FASTA sequence alignment software, it has since become a versatile and essential format for storing and sharing sequence data. The FASTA format consists of two key components:

  1. Header Line: Each sequence begins with a header line, which starts with a greater-than symbol (>). The text following this symbol provides a description of the sequence. This description often includes information such as a unique identifier (e.g., accession number), the source organism, and other metadata. For example:

>sp|P12345|PROT_HUMAN Human Protein Name [Homo sapiens]  

In this example, “sp” indicates the Swiss-Prot database, “P12345” is the accession number, “PROT_HUMAN” is the unique identifier, and additional information follows.

  1. Sequence Lines: Below the header line, the sequence data is written as plain text. For nucleotide sequences, this consists of a string of the bases A, C, G, and T (or U for RNA). For protein sequences, it consists of a string of the standard 20 amino acid single-letter codes. The sequence can be broken into multiple lines, making it easier to read and process:

ATGCGTACGTGACGT  
CGTGAGCTAGTCAGT  

These sequence lines represent the data to be analyzed and compared.

Combining the two you get a fasta record e.g.

>sp|C9JLW8|MCRI1_HUMAN Mapk-regulated corepressor-interacting protein 1 OS=Homo sapiens OX=9606 GN=MCRIP1 PE=1 SV=1  
MTSSPVSRVVYNGKRTSSPRSPPSSSEIFTPAHEENVRFIYEAWQGVERDLRGQVPGGER  
GLVEEYVEKVPNPSLKTFKPIDLSDLKRRSTQDAKKS  

Usage and Applications

The FASTA format’s simplicity and clarity make it ideal for storing and sharing sequence data. It is supported by most bioinformatics tools, including alignment algorithms, database search tools, and genome browsers. Additionally, the format is highly adaptable, allowing for easy conversion to and from other sequence formats.

FASTA files can contain multiple sequences, each represented by its own header and sequence lines, making them an efficient way to store large datasets. They are used in bioinformatics pipelines, providing a way to manage and share sequence information. Note, however, that FASTA is not a well-defined format, and there are multiple variants in how both headers and sequence lines should be formatted.

You can for instance get all the canonical amino acid sequences of the human genome as a single FASTA file

References

References
  1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
  2. Karlin, S., & Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences, 87(6), 2264–2268.