6. Code: Local Pairwise alignment#
To facilitate the the reasoning in the subsequent cells, we first we define a couple of functions that we will need later, for formating and printing alignments. It is not important that you understand what these functions do, for now.
6.1. Scoring system for DNA sequences#
We setup the scoring system we need for the alignment of DNA sequences. Here we use a score system where gaps score -2 and miss matches are scored -1 and matches get a score of 3.
def gap_penalty():
return -2.0
def match_score(letterA,letterB):
if letterA == '-' or letterB == '-':
return gap_penalty()
elif letterA == letterB:
return 3.0
else:
return -1.0
6.2. Local alignments using Smith-Waterman#
Smith-Waterman alignments are similar to the ones of Needleman-Wunsch. The difference sits in the initiation of the dynamic programming matrix, and how we trace the most optimal alignment. We will implement these difference by redefining some functions.
First the initiation of the dynamic programming matrix \(S\): \(S_{i0}=0, \forall i,\) \(S_{0j}=0, \forall j\)
# Initiating dynamic programming matrices, S and trace
def initiate_local_dp(m,n):
S = np.zeros((m,n))
trace = np.zeros((m,n,2))
S[0,0] = 0.
trace[0,0,:] = (0.,0.)
for i in range(1,m):
S[i,0] = 0
trace[i,0,:] =(0,0)
for j in range(1,n):
S[0,j] = 0
trace[0,j,:] =(0,0)
return S,trace
Subsequently, the rest of \(S\) is filled as: \(S_{ij}=\max\left\{ \begin{array}{ll} S_{i-1,j-1} & +d(a_i,b_j)\\ S_{i-1,j} & +d(a_i,-)\\ S_{i,j-1} & +d(-,b_j)\\ 0 \end{array} \right.\)
def local_align(seqA,seqB,print_dynamic_matrix = False):
# Initiating variables
m, n = len(seqA)+1, len(seqB)+1
S,trace = initiate_local_dp(m,n)
# Fill in the rest of the dynamic programming matrix
for i in range(1,m):
for j in range(1,n):
match = S[i-1][j-1] + match_score(seqA[i-1],seqB[j-1])
delete = S[i-1,j] + match_score(seqA[i-1],'-')
insert = S[i,j-1] + match_score('-',seqB[j-1])
S[i,j] = max(match, delete, insert, 0.)
if match >= max(delete,insert,0.):
trace[i,j,:] = (-1,-1.)
elif delete >= max(insert,0.):
trace[i,j,:] = (-1.,0.)
elif insert >= 0.:
trace[i,j,:] = (0.,-1.)
else:
trace[i,j,:] = (0.,0.)
if print_dynamic_matrix:
print_dynamic(seqA,seqB,S)
return format_local_alignment(seqA,seqB,S,trace)
We also need a slightly different method to trace the the alignment. It is not essential you understand this code.
seqA,seqB = local_align("GATTA","GCTAC",True)
print_alignment(seqA,seqB)
- G C T A C
- 0.0 0.0 0.0 0.0 0.0 0.0
G 0.0 3.0 1.0 0.0 0.0 0.0
A 0.0 1.0 2.0 0.0 3.0 1.0
T 0.0 0.0 0.0 5.0 3.0 2.0
T 0.0 0.0 0.0 3.0 4.0 2.0
A 0.0 0.0 0.0 1.0 6.0 4.0
Best score: 6.0
GATTA
G-CTA
seqA,seqB = local_align("GCGATTA","GCTTAC",True)
print_alignment(seqA,seqB)
- G C T T A C
- 0.0 0.0 0.0 0.0 0.0 0.0 0.0
G 0.0 3.0 1.0 0.0 0.0 0.0 0.0
C 0.0 1.0 6.0 4.0 2.0 0.0 3.0
G 0.0 3.0 4.0 5.0 3.0 1.0 1.0
A 0.0 1.0 2.0 3.0 4.0 6.0 4.0
T 0.0 0.0 0.0 5.0 6.0 4.0 5.0
T 0.0 0.0 0.0 3.0 8.0 6.0 4.0
A 0.0 0.0 0.0 1.0 6.0 11.0 9.0
Best score: 11.0
GATTA
GCTTA
seqA,seqB = local_align("CTATCTCGCTATCCA","CTACGCTATTTCA",True)
print_alignment(seqA,seqB)
- C T A C G C T A T T T C A
- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 0.0 3.0 1.0 0.0 3.0 1.0 3.0 1.0 0.0 0.0 0.0 0.0 3.0 1.0
T 0.0 1.0 6.0 4.0 2.0 2.0 1.0 6.0 4.0 3.0 3.0 3.0 1.0 2.0
A 0.0 0.0 4.0 9.0 7.0 5.0 3.0 4.0 9.0 7.0 5.0 3.0 2.0 4.0
T 0.0 0.0 3.0 7.0 8.0 6.0 4.0 6.0 7.0 12.0 10.0 8.0 6.0 4.0
C 0.0 3.0 1.0 5.0 10.0 8.0 9.0 7.0 5.0 10.0 11.0 9.0 11.0 9.0
T 0.0 1.0 6.0 4.0 8.0 9.0 7.0 12.0 10.0 8.0 13.0 14.0 12.0 10.0
C 0.0 3.0 4.0 5.0 7.0 7.0 12.0 10.0 11.0 9.0 11.0 12.0 17.0 15.0
G 0.0 1.0 2.0 3.0 5.0 10.0 10.0 11.0 9.0 10.0 9.0 10.0 15.0 16.0
C 0.0 3.0 1.0 1.0 6.0 8.0 13.0 11.0 10.0 8.0 9.0 8.0 13.0 14.0
T 0.0 1.0 6.0 4.0 4.0 6.0 11.0 16.0 14.0 13.0 11.0 12.0 11.0 12.0
A 0.0 0.0 4.0 9.0 7.0 5.0 9.0 14.0 19.0 17.0 15.0 13.0 11.0 14.0
T 0.0 0.0 3.0 7.0 8.0 6.0 7.0 12.0 17.0 22.0 20.0 18.0 16.0 14.0
C 0.0 3.0 1.0 5.0 10.0 8.0 9.0 10.0 15.0 20.0 21.0 19.0 21.0 19.0
C 0.0 3.0 2.0 3.0 8.0 9.0 11.0 9.0 13.0 18.0 19.0 20.0 22.0 20.0
A 0.0 1.0 2.0 5.0 6.0 7.0 9.0 10.0 12.0 16.0 17.0 18.0 20.0 25.0
Best score: 25.0
CT-CGCTA-TCCA
CTACGCTATTTCA