Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Introduction to Sequence Alignments

A sequence alignment arranges two or more DNA, RNA, or protein sequences so that corresponding positions appear in the same column, making regions of similarity and difference immediately visible. Alignments are the central operation of bioinformatics: they underpin gene annotation, protein function prediction, evolutionary analysis, and sequence database search.

Illustration of a Sequence Alignment

GATTA-
GCT-AC

Each column represents one position in the alignment. Where both rows carry the same character the column is a match (G–G at position 1); different characters is a mismatch (A–C at position 2); a dash paired with a character is a gap, representing an insertion or deletion (indel) relative to the other sequence.

What is needed to systematically construct a pairwise alignment?

To systematically construct a pairwise alignment we need three things.

Scoring function

A scoring function, d(x,y)d(x,y), giving the score of a column of any characters xx and yy. The characters could be DNA or RNA bases of amino acids, for now, let’s think about them as DNA bases. A typical scoring function could be:

d(x,y)={pif x=ygif x= or y=notherwise d(x,y)= \begin{cases} p & \text{if } x=y\\ g & \text{if } x=- \text{ or } y=-\\ n & \text{otherwise } \end{cases}

Here, pp, is called a match score, nn, a mismatch score, and gg a gap penalty.

We score an alignment by summing up the scoring function over each position in an alignment. This allows us to evaluate if one alignment is better than another.

Exercises

Alignment Type

The alignment type specifies which part of each sequence must be covered by the optimal alignment.

Global alignment spans the full length of both sequences and penalises any unaligned ends. Use it when comparing closely related sequences of similar length — for example, orthologous genes from two species. The classic algorithm is Needleman–Wunsch.

Local alignment finds the highest-scoring contiguous region of similarity anywhere within the two sequences, ignoring flanking regions entirely. Use it when only part of each sequence is expected to match — for example, detecting a conserved domain in otherwise divergent proteins. The classic algorithm is Smith–Waterman.

Semi-global alignment requires one sequence to be fully covered while allowing unpenalised overhangs at either end of the other. Use it when a shorter sequence (e.g. a gene) should align entirely within a longer one (e.g. a genome). See variants of Needleman–Wunsch.

Alignment Algorithm

Given the scoring function and alignment type, we have a definition of what we want to achieve, i.e., there is a definition of optimality. Now we come to the question of how we obtain such optimality.

Exhaustive searches

The simplest imaginable strategy is to enumerate every possible alignment of two sequences, score each one, and return the best. This is called an exhaustive search. For short sequences it works: you can write out all alignments by hand. At each position in the alignment, the algorithm must decide whether to match/mismatch the next character from each sequence, insert a gap in sequence 1, or insert a gap in sequence 2 — three choices that can be made independently at every step. The total number of alignments therefore grows exponentially with sequence length.

To make this concrete, consider two sequences of length nn. A rough lower bound on the number of distinct global alignments is (2nn)\binom{2n}{n} — the number of ways to interleave nn characters from each sequence before any gaps are added. For n=10n = 10 this is already (2010)=184,756\binom{20}{10} = 184{,}756; for n=100n = 100 it exceeds 1058. Biological sequences are routinely hundreds to thousands of residues long, placing an exhaustive search entirely beyond computational reach.

The exhaustive approach fails not because any individual alignment is hard to score, but because the number of candidates grows faster than any practical computer can handle. This is the fundamental problem that dynamic programming solves.

Dynamic programming

Dynamic programming avoids the combinatorial explosion by recognising that the optimal alignment of two full sequences must contain within it the optimal alignment of every pair of their prefixes — a property called optimal substructure. Each sub-problem therefore only needs to be solved once.

In practice this means:

This reduces an exponentially large search to a calculation that scales with the product of the sequence lengths, and it is the principle behind every alignment algorithm in the following chapters.

In the next chapters we will describe a set of such dynamic programming algorithms in detail, here is table summarizing them.

Sequence alignment types

Alignment Type

Definition

Typical Use Case

Algorithm

Global

Aligns full length of both sequences

Closely related sequences of similar length

Needleman–Wunsch

Local

Aligns subsequences with highest similarity

Motif/domain detection, different-length sequences

Smith–Waterman

Semi-global

Aligns one sequence entirely, allows overhangs in the other

Gene-to-genome, substring matching

Variants of Needleman–Wunsch