Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Code: Semi-global pairwise alignment

To facilitate the the reasoning in the subsequent cells, we first we define a couple of functions that we will need later, for formating and printing alignments. It is not important that you understand what these functions do, for now.

Source
import numpy as np

# Print 2 sequences on top of each other
def print_alignment(seqA,seqB):
    print(seqA)
    print(seqB)

# Print a dynamic programming score matrix
# together with its sequences
def print_dynamic(seqA,seqB,dpm):
    seqA,seqB = "-" + seqA, "-" + seqB
    m,n = len(seqA),len(seqB)
    print('{:^5}'.format(" "), end=""),
    for j in range(n):
        print('{:^5}'.format(seqB[j]), end="")
    print()
    for i in range(m):
        print ('{:^5}'.format(seqA[i]), end="")
        for j in range(n):
            print ('{:5.1f}'.format(dpm[i,j]), end="")
        print()
    print()

# Format an alignment by inserting gaps in sequences
def format_alignment(seqA,seqB,S,trace):
    print("Best score: " + str(S[-1,-1]))
    outA,outB = "",""
    i,j = len(seqA),len(seqB)
    while i>0 or j>0:
        di,dj = trace[i,j]
        i += int(di)
        j += int(dj)
        if di == 0:
            outA = "-" + outA
        else:
            outA = seqA[i] + outA
        if dj == 0:
            outB = "-" + outB
        else:
            outB = seqB[j] + outB
    return outA,outB

Scoring system for DNA sequences

We setup the scoring system we need for the alignment of DNA sequences. Here we use a score system where gaps score -2 and miss matches are scored -1 and matches get a score of 3.

def gap_penalty():
    return -2.0

def match_score(letterA,letterB):
    if letterA == '-' or letterB == '-':
        return gap_penalty()
    elif letterA == letterB:
        return 3.0
    else:
        return -1.0

Semi-global alignments

Yet another type of alignments are the semi-global alignments. Here we will able to reuse the code for the global score, but we will initiate the dynamic programming matrix as for the local alignment. We also need to redefine how to read the alignment.

First we initiate the dynamic programming matrix SS: Si0=0,i,S_{i0}=0, \forall i, S0j=0,jS_{0j}=0, \forall j

# Initiating dynamic programming matrices, S and trace
def initiate_semiglobal_dp(m,n):
    S = np.zeros((m,n))
    trace = np.zeros((m,n,2))
    S[0,0] = 0.
    trace[0,0,:] = (0.,0.)
    for i in range(1,m):
        S[i,0] = 0
        trace[i,0,:] =(-1,0)
    for j in range(1,n):
        S[0,j] = 0
        trace[0,j,:] =(0,-1)
    return S,trace
def format_semiglobal_alignment(seqA,seqB,S,trace):
    outA,outB = "",""
    m,n = S.shape[0]-1, S.shape[1]-1
    i,j = S[:,n].argmax(), S[m,:].argmax()
    if S[i,n] > S[m,j]:
        print("Best score: " + str(S[i,n]))
        j = n
        # point the alignmnént from (m,n) to here
        for ix in range(i+1,m+1):
            print(ix,n)
            trace[ix,n,:] = (-1,0)
    else:
        print("Best score: " + str(S[m,j]))
        i = m    
        # point the alignmnént from (m,n) to here
        for ix in range(j+1,n+1):
            print(m,ix)
            trace[m,ix,:] = (0,-1)
    i,j = m,n
    while min(trace[i,j])<0:
        di,dj = trace[i,j]
        i += int(di)
        j += int(dj)
        if di == 0:
            outA = "-" + outA
        else:
            outA = seqA[i] + outA
        if dj == 0:
            outB = "-" + outB
        else:
            outB = seqB[j] + outB
    return outA,outB

Subsequently, the rest of SS is filled as:

Sij=max{Si1,j1+d(ai,bj)Si1,j+d(ai,)Si,j1+d(,bj) S_{ij}=\max\left\{ \begin{array}{ll} S_{i-1,j-1} & +d(a_i,b_j)\\ S_{i-1,j} & +d(a_i,-)\\ S_{i,j-1} & +d(-,b_j) \end{array} \right.

def semiglobal_align(seqA,seqB,print_dynamic_matrix = False):
    # Initiating variables
    m, n = len(seqA)+1, len(seqB)+1
    S,trace = initiate_semiglobal_dp(m,n)
    # Fill in the rest of the dynamic programming matrix
    for i in range(1,m):
        for j in range(1,n):
            match = S[i-1][j-1] + match_score(seqA[i-1],seqB[j-1])
            delete = S[i-1,j] + match_score(seqA[i-1],'-') 
            insert = S[i,j-1] + match_score('-',seqB[j-1]) 
            S[i,j] = max(match, delete, insert)
            if match >= max(delete,insert):
                trace[i,j,:] = (-1,-1.)
            elif delete >= insert:
                trace[i,j,:] = (-1.,0.)
            else:
                trace[i,j,:] = (0.,-1.)
    if print_dynamic_matrix:
        print_dynamic(seqA,seqB,S)
    return format_semiglobal_alignment(seqA,seqB,S,trace)
seqA,seqB = semiglobal_align("CTATCTATCTCGCTATCCAA","CTACGCTATTTCA",True)
print_alignment(seqA,seqB)
       -    C    T    A    C    G    C    T    A    T    T    T    C    A  
  -    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
  C    0.0  3.0  1.0 -1.0  3.0  1.0  3.0  1.0 -1.0 -1.0 -1.0 -1.0  3.0  1.0
  T    0.0  1.0  6.0  4.0  2.0  2.0  1.0  6.0  4.0  2.0  2.0  2.0  1.0  2.0
  A    0.0 -1.0  4.0  9.0  7.0  5.0  3.0  4.0  9.0  7.0  5.0  3.0  1.0  4.0
  T    0.0 -1.0  2.0  7.0  8.0  6.0  4.0  6.0  7.0 12.0 10.0  8.0  6.0  4.0
  C    0.0  3.0  1.0  5.0 10.0  8.0  9.0  7.0  5.0 10.0 11.0  9.0 11.0  9.0
  T    0.0  1.0  6.0  4.0  8.0  9.0  7.0 12.0 10.0  8.0 13.0 14.0 12.0 10.0
  A    0.0 -1.0  4.0  9.0  7.0  7.0  8.0 10.0 15.0 13.0 11.0 12.0 13.0 15.0
  T    0.0 -1.0  2.0  7.0  8.0  6.0  6.0 11.0 13.0 18.0 16.0 14.0 12.0 13.0
  C    0.0  3.0  1.0  5.0 10.0  8.0  9.0  9.0 11.0 16.0 17.0 15.0 17.0 15.0
  T    0.0  1.0  6.0  4.0  8.0  9.0  7.0 12.0 10.0 14.0 19.0 20.0 18.0 16.0
  C    0.0  3.0  4.0  5.0  7.0  7.0 12.0 10.0 11.0 12.0 17.0 18.0 23.0 21.0
  G    0.0  1.0  2.0  3.0  5.0 10.0 10.0 11.0  9.0 10.0 15.0 16.0 21.0 22.0
  C    0.0  3.0  1.0  1.0  6.0  8.0 13.0 11.0 10.0  8.0 13.0 14.0 19.0 20.0
  T    0.0  1.0  6.0  4.0  4.0  6.0 11.0 16.0 14.0 13.0 11.0 16.0 17.0 18.0
  A    0.0 -1.0  4.0  9.0  7.0  5.0  9.0 14.0 19.0 17.0 15.0 14.0 15.0 20.0
  T    0.0 -1.0  2.0  7.0  8.0  6.0  7.0 12.0 17.0 22.0 20.0 18.0 16.0 18.0
  C    0.0  3.0  1.0  5.0 10.0  8.0  9.0 10.0 15.0 20.0 21.0 19.0 21.0 19.0
  C    0.0  3.0  2.0  3.0  8.0  9.0 11.0  9.0 13.0 18.0 19.0 20.0 22.0 20.0
  A    0.0  1.0  2.0  5.0  6.0  7.0  9.0 10.0 12.0 16.0 17.0 18.0 20.0 25.0
  A    0.0 -1.0  0.0  5.0  4.0  5.0  7.0  8.0 13.0 14.0 15.0 16.0 18.0 23.0

Best score: 25.0
20 13
CTATCTATCT-CGCTA-TCCAA
--------CTACGCTATTTCA-