A sequence alignment arranges two or more DNA, RNA, or protein sequences so that corresponding positions appear in the same column, making regions of similarity and difference immediately visible. Alignments are the central operation of bioinformatics: they underpin gene annotation, protein function prediction, evolutionary analysis, and sequence database search.
Illustration of a Sequence Alignment¶
GATTA-
GCT-ACEach column represents one position in the alignment. Where both rows carry the same character the column is a match (G–G at position 1); different characters is a mismatch (A–C at position 2); a dash paired with a character is a gap, representing an insertion or deletion (indel) relative to the other sequence.
What is needed to systematically construct a pairwise alignment?¶
To systematically construct a pairwise alignment we need three things.
A scoring function, i.e. how is each position in the alignment scored.
An alignment type, i.e. what type of alignment is to be constructed.
An alignment algorithm, i.e. how do we optimise the score of an alignment to form the desired alignment type
Scoring function¶
A scoring function, , giving the score of a column of any characters and . The characters could be DNA or RNA bases of amino acids, for now, let’s think about them as DNA bases. A typical scoring function could be:
Here, , is called a match score, , a mismatch score, and a gap penalty.
We score an alignment by summing up the scoring function over each position in an alignment. This allows us to evaluate if one alignment is better than another.
Exercises¶
Alignment Type¶
The alignment type specifies which part of each sequence must be covered by the optimal alignment.
Global alignment spans the full length of both sequences and penalises any unaligned ends. Use it when comparing closely related sequences of similar length — for example, orthologous genes from two species. The classic algorithm is Needleman–Wunsch.
Local alignment finds the highest-scoring contiguous region of similarity anywhere within the two sequences, ignoring flanking regions entirely. Use it when only part of each sequence is expected to match — for example, detecting a conserved domain in otherwise divergent proteins. The classic algorithm is Smith–Waterman.
Semi-global alignment requires one sequence to be fully covered while allowing unpenalised overhangs at either end of the other. Use it when a shorter sequence (e.g. a gene) should align entirely within a longer one (e.g. a genome). See variants of Needleman–Wunsch.
Alignment Algorithm¶
Given the scoring function and alignment type, we have a definition of what we want to achieve, i.e., there is a definition of optimality. Now we come to the question of how we obtain such optimality.
Exhaustive searches¶
The simplest imaginable strategy is to enumerate every possible alignment of two sequences, score each one, and return the best. This is called an exhaustive search. For short sequences it works: you can write out all alignments by hand. At each position in the alignment, the algorithm must decide whether to match/mismatch the next character from each sequence, insert a gap in sequence 1, or insert a gap in sequence 2 — three choices that can be made independently at every step. The total number of alignments therefore grows exponentially with sequence length.
To make this concrete, consider two sequences of length . A rough lower bound on the number of distinct global alignments is — the number of ways to interleave characters from each sequence before any gaps are added. For this is already ; for it exceeds 1058. Biological sequences are routinely hundreds to thousands of residues long, placing an exhaustive search entirely beyond computational reach.
The exhaustive approach fails not because any individual alignment is hard to score, but because the number of candidates grows faster than any practical computer can handle. This is the fundamental problem that dynamic programming solves.
Dynamic programming¶
Dynamic programming avoids the combinatorial explosion by recognising that the optimal alignment of two full sequences must contain within it the optimal alignment of every pair of their prefixes — a property called optimal substructure. Each sub-problem therefore only needs to be solved once.
In practice this means:
Fill a matrix. Cell stores the best alignment score for the first characters of sequence 1 against the first characters of sequence 2. Each cell depends only on the three cells directly above, to the left, and diagonally above-left, so the whole table is filled in time.
Traceback. Follow the recorded choices back from the end cell (global) or the highest-scoring cell (local) to read off the alignment.
This reduces an exponentially large search to a calculation that scales with the product of the sequence lengths, and it is the principle behind every alignment algorithm in the following chapters.
In the next chapters we will describe a set of such dynamic programming algorithms in detail, here is table summarizing them.
Sequence alignment types
Alignment Type | Definition | Typical Use Case | Algorithm |
|---|---|---|---|
Global | Aligns full length of both sequences | Closely related sequences of similar length | |
Local | Aligns subsequences with highest similarity | Motif/domain detection, different-length sequences | |
Semi-global | Aligns one sequence entirely, allows overhangs in the other | Gene-to-genome, substring matching |