Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Multiple Sequence Alignments

Introduction

While pairwise alignments enable comparisons between two sequences, many biological analyses require a more comprehensive approach: the alignment of multiple sequences simultaneously. Multiple sequence alignments (MSAs) offer a holistic view of sequence similarities and differences, illuminating functional and evolutionary relationships across diverse genomes and proteomes.

Exponential Time Complexity

In principle, one can use dynamic programming to form MSAs. This can be achieved by extending the ideas used for pairwise alignments, expanding the dynamic programming matrix into a tensor with as many dimensions as there are sequences. However, a major drawback is the exponential time complexity. Aligning TT sequences, each of length NN, results in a time complexity of O(NT)O(N^T), making it impractical for datasets with more than a few sequences.

Iterative Approaches as a Practical Alternative

Given the inefficiency of dynamic programming for MSAs, bioinformaticians have turned to heuristics that approximate the behavior of full dynamic programming. In particular, iterative approaches such as progressive alignment simplify the alignment process by breaking it down into manageable steps. Progressive alignment, for instance, first evaluates the pairwise similarities between sequences by computing their pairwise alignments, then iteratively aligns sequences according to which are most similar. This reduces the computational burden and allows for the processing of large datasets in a reasonable time frame.

Progressive Alignments

While dynamic programming offers an optimal solution for pairwise alignments, its inefficiency for MSAs necessitates a balance between accuracy and computational efficiency. Iterative and heuristic methods like progressive alignment and consistency-based algorithms strike this balance, providing near-optimal solutions in significantly less time.

A key method for constructing MSAs is progressive alignment, a stepwise approach that builds upon pairwise alignments to create a broader, more nuanced comparison. The progressive alignment method begins by generating a similarity matrix (or distance matrix if we instead quantify how different the sequences are), which quantifies the pairwise similarity between all sequences in the dataset.

Progressive alignment iteratively aligns sequences in order of their proximity on the tree. This process begins with the closest pair of sequences, which are aligned directly. Subsequent sequences are added progressively, with each new sequence being aligned to the growing consensus. This stepwise process creates an alignment that reflects both sequence similarities and evolutionary relationships.

The progressive alignment method offers several advantages. It efficiently handles large datasets, making it well-suited for genomics and proteomics research. By utilizing an evolutionary framework, it provides deeper insights into the conserved and divergent features of sequences, aiding in the identification of functionally important regions and evolutionary patterns.

In this chapter, we delve into the mechanics of progressive alignment, exploring its implementation and applications in bioinformatics. By understanding this approach, we gain valuable insights into the biological relationships and evolutionary trajectories that shape our world.

Progressive Alignment Steps

The process builds alignments in a stepwise manner, starting with the most similar sequences and gradually incorporating others. Here’s a detailed look at how to create a progressive alignment. Let’s assume you start with TT sequences, C={A1,,AT}C=\{A_1, \ldots, A_T\}.

  1. Compute Pairwise Alignments: Calculate all pairwise alignments between each pair of sequences in CC using a suitable algorithm such as Needleman-Wunsch or Smith-Waterman. This provides a matrix of scores or distances that indicate how similar or different each pair of sequences is.

  2. Use the pairwise distances to identify the two sequences in CC, AiA_i and AjA_j, that are closest (those with the highest similarity score or lowest distance), and form an alignment of those two sequences. Remove AiA_i and AjA_j from CC. We call the alignment of those two sequences BB.

  3. Add BB to CC. Set the score or distance from BB to each of the other sequences in CC to the average of their scores or distances from AiA_i and AjA_j.

  4. If there is more than one sequence left in CC, repeat from step 2.

Progressive Alignment Example

Input sequences

IAMAPEPTIDE  
IAMPEPTIDE
IAMPEPPED
IAMAPEPPERD

These are the sequences we want to align.

Seq1 and Seq2

IAMAPEPTIDE
IAM-PEPTIDE

We form a pairwise alignment of the two most similar sequences. We then pool the resulting sequences with the other two sequences.

Seq3 and Seq4

IAM-PEPPE-D
IAMAPEPPERD

These sequences are the most similar among the remaining ones.

(Seq1,Seq2) and (Seq3,Seq4)

IAMAPEPTID-E
IAM-PEPTID-E
IAM-PEPP-E-D
IAMAPEPP-ERD

There are two generalized sequences left. We align those two, and we are then done.

Applications of Progressive Alignments

Modern bioinformatics tools, such as MUSCLE Edgar (2004), MAFFT Katoh et al. (2002), T-Coffee Notredame et al. (2000), and Kalign Lassmann & Sonnhammer (2005), incorporate iterative refinement as a core feature, allowing researchers to achieve high-quality alignments with minimal manual intervention. These tools use sophisticated algorithms to iteratively refine alignments, often integrating additional information such as structural data or evolutionary constraints. This integration enhances the overall robustness of the alignment process, making these tools highly effective for a wide range of applications in genomics, proteomics, and evolutionary biology.

Iterative Refinement of MSAs

While progressive alignment provides a structured approach to generating multiple sequence alignments (MSAs), it is often just the starting point for more refined alignments. The initial alignment produced by progressive methods can be improved through iterative refinement, a crucial step for enhancing alignment accuracy and resolving inconsistencies that may arise during the initial process.

The Need for Iterative Refinement

Progressive alignment can introduce biases, particularly when the initial pairwise alignments are not optimal or when sequences with varying evolutionary distances are involved. These biases can propagate through the alignment process, leading to suboptimal alignments, especially in regions where the evolutionary signals are weak or ambiguous. Iterative refinement addresses these issues by revisiting and adjusting the alignment, ensuring that it more accurately reflects the true evolutionary relationships and functional similarities among the sequences.

Iterative Refinement Techniques

There are several approaches to iterative refinement, each with its own strengths and applications:

Applications and Advantages of Iterative Refinement

Iterative refinement is widely used in bioinformatics tools such as MUSCLE Edgar (2004) and MAFFT Katoh et al. (2002), which implement various refinement strategies to enhance MSA quality. The iterative process ensures that the final alignment is more accurate and reliable, making it particularly valuable for downstream analyses, such as phylogenetic tree construction, structural modeling, and functional annotation.

Moreover, iterative refinement is adaptable, allowing bioinformaticians to tailor the process based on the specific characteristics of the sequences being aligned. This flexibility makes it an indispensable tool in the alignment of sequences with complex evolutionary histories, such as those involving duplications, deletions, or recombination events.

References

References
  1. Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797.
  2. Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059–3066.
  3. Notredame, C., Higgins, D. G., & Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1), 205–217.
  4. Lassmann, T., & Sonnhammer, E. L. (2005). Kalign–an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 6, 1–9.