Next: Sequence and structural alignments
Up: Bioinformatics Tutorial
Previous: SCOP fold classification
Subsections
Sequence Alignment Algorithms
In this section you will optimally align two short protein sequences using
pen and paper, then search for homologous proteins by using a
computer program to align several, much longer, sequences.
Dynamic programming algorithms are recursive algorithms modified to store
intermediate results, which improves efficiency for certain problems. The
Smith-Waterman (Needleman-Wunsch) algorithm uses a dynamic programming
algorithm to find the optimal local (global)
alignment of two sequences --
and
. The alignment algorithm is based on
finding the elements of a matrix
where the element
is the
optimal score for aligning the sequence (
,
,...,
) with
(
,
,.....,
). Two similar amino acids (e.g. arginine and lysine)
receive a high score, two dissimilar amino acids (e.g. arginine and glycine)
receive a low score. The higher the score of a path through the matrix, the better
the alignment. The matrix
is found by progressively finding the matrix
elements, starting at
and proceeding in the directions of increasing
and
. Each element is set according to:
where
is the similarity score of comparing amino acid
to amino
acid
(obtained here from the BLOSUM40 similarity table) and
is the
penalty for a single gap. The matrix is initialized with
. When
obtaining the local Smith-Waterman alignment,
is modified:
The gap penalty can be modified, for instance,
can be replaced by
, where
is the penalty for a single gap
and
is the number of consecutive gaps.
Once the optimal alignment score is found, the ``traceback'' through
along
the optimal path is found, which corresponds to the the optimal sequence
alignment for the score. In the next set of exercises you will manually
implement the Needleman-Wunsch alignment for a pair of short sequences, then
perform global sequence alignments with a computer program developed by Anurag
Sethi, which is based on the Needleman-Wunsch algorithm with an affine gap
penalty,
, where
is the extension gap penalty. The output
file will be in the GCG format, one of the two standard formats in
bioinformatics for storing sequence information (the other standard format is
FASTA).
Manually perform a Needleman-Wunsch alignment
In the first exercise you will test the Smith-Waterman algorithm on a short
sequence parts of hemoglobin (PDB code 1AOW) and myoglobin 1 (PDB code 1AZI).
- Here you will align the sequence HGSAQVKGHG to the sequence KTEAEMKASEDLKKHGT.
- The two sequences are arranged in a matrix in Table 3.
The sequences start at the upper right corner, the initial gap penalties
are listed at each offset starting position. With each move from the start
position, the initial penalty increase by our single gap penalty of 8.
Table 3:
The empty matrix with initial gap penalties.
|
|
H |
G |
S |
A |
Q |
V |
K |
G |
H |
G |
|
0 |
-8 |
-16 |
-24 |
-32 |
-40 |
-48 |
-56 |
-64 |
-72 |
-80 |
K |
-8 |
|
|
|
|
|
|
|
|
|
|
T |
-16 |
|
|
|
|
|
|
|
|
|
|
E |
-24 |
|
|
|
|
|
|
|
|
|
|
A |
-32 |
|
|
|
|
|
|
|
|
|
|
E |
-40 |
|
|
|
|
|
|
|
|
|
|
M |
-48 |
|
|
|
|
|
|
|
|
|
|
K |
-56 |
|
|
|
|
|
|
|
|
|
|
A |
-64 |
|
|
|
|
|
|
|
|
|
|
S |
-72 |
|
|
|
|
|
|
|
|
|
|
E |
-80 |
|
|
|
|
|
|
|
|
|
|
D |
-88 |
|
|
|
|
|
|
|
|
|
|
L |
-96 |
|
|
|
|
|
|
|
|
|
|
K |
-104 |
|
|
|
|
|
|
|
|
|
|
K |
-112 |
|
|
|
|
|
|
|
|
|
|
H |
-120 |
|
|
|
|
|
|
|
|
|
|
G |
-128 |
|
|
|
|
|
|
|
|
|
|
T |
-136 |
|
|
|
|
|
|
|
|
|
|
|
- The first step is to fill in the similarity scores S
from looking up
the matches in the BLOSUM40 table, shown here labeled with 1-letter amino acid
codes:
A 5
R -2 9
N -1 0 8
D -1 -1 2 9
C -2 -3 -2 -2 16
Q 0 2 1 -1 -4 8
E -1 -1 -1 2 -2 2 7
G 1 -3 0 -2 -3 -2 -3 8
H -2 0 1 0 -4 0 0 -2 13
I -1 -3 -2 -4 -4 -3 -4 -4 -3 6
L -2 -2 -3 -3 -2 -2 -2 -4 -2 2 6
K -1 3 0 0 -3 1 1 -2 -1 -3 -2 6
M -1 -1 -2 -3 -3 -1 -2 -2 1 1 3 -1 7
F -3 -2 -3 -4 -2 -4 -3 -3 -2 1 2 -3 0 9
P -2 -3 -2 -2 -5 -2 0 -1 -2 -2 -4 -1 -2 -4 11
S 1 -1 1 0 -1 1 0 0 -1 -2 -3 0 -2 -2 -1 5
T 0 -2 0 -1 -1 -1 -1 -2 -2 -1 -1 0 -1 -1 0 2 6
W -3 -2 -4 -5 -6 -1 -2 -2 -5 -3 -1 -2 -2 1 -4 -5 -4 19
Y -2 -1 -2 -3 -4 -1 -2 -3 2 0 0 -1 1 4 -3 -2 -1 3 9
V 0 -2 -3 -3 -2 -3 -3 -4 -4 4 2 -2 1 0 -3 -1 1 -3 -1 5
B -1 -1 4 6 -2 0 1 -1 0 -3 -3 0 -3 -3 -2 0 0 -4 -3 -3 5
Z -1 0 0 1 -3 4 5 -2 0 -4 -2 1 -2 -4 -1 0 -1 -2 -2 -3 2 5
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 0 -1 -2 0 0 -2 -1 -1 -1 -1 -1
A R N D C Q E G H I L K M F P S T W Y V B Z X
- We fill in the BLOSUM40 similarity scores for you in Table 4.
- To turn this
matrix intro the dynamic programming
matrix requires
calculation of the contents of all 170 boxes. We've calculated the first 4
here, and encourage you to calculate the contents of at least 4 more. The
practice will come in handy in the next steps. As described above, a matrix square
cannot be filled with its dynamic programming value until the squares above, to
the left, and to the above-left diagonal are computed. The value of a square is
, using the convention that
values appear in the top part of a square in
large print, and
values appear in the bottom part of a square in small
print. Our gap penalty
is 8.
- Example: In the upper left square in Table 4, square (1,1), the similarity score
is -1, the number in small type at the bottom of the box. The value to assign as
will be the greatest (``max'') of these three values:
. That is, the greatest of:
which just means the greatest of: -1, -16, and -16. This is -1, so we write -1 as the value of
(the larger number in the top part of the box).
The same reasoning in square (2,1) leads us to set
as -9, and so on.
Note: we consider
to be the ``predecessor'' of
,
since it helped decided
's value. Later, predecessors will qualify to be on the
traceback path.
- Again, just fill in 4 or 5 boxes in Table 4 until you
get a feel for gap penalties and similarity scores S vs. alignment scores H.
In the next step, we provide the matrix with all values filled in as
Table 5. Check that your 4 or 5 calculations match.
- Now we move to Table 5, with all 170
values are shown, to do the ``alignment traceback''. To find the alignment requires one to trace the
path through from the end of the sequence (the lower right box) to the start of the sequence (the upper left box). This job looks complicated, but should only take about 5 -7 minutes.
- We are tracing a path in Table 5, from the lower right box to the upper left box.
You can only move to a square if it could have been a ``predecessor'' of your
current square - that is, when the matrix was being filled with
values, the move from the predecessor square to your current
square would have followed the mathematical rules we used to find H
above. Circle each square you move to along your path.
- Example: we start at the lower right square (10,17), where
is
-21 and
is -2. We need to test for 3 possible directions of
movement: diagonal (up + left), up, and left. The condition for diagonal
movement given above is:
, so for the
diagonal box (9,16) to have contributed to (10,17),
would have to equal the H value of our box, -21. Since (-29 + -2) does not
equal -21, the diagonal box is not a ``predecessor'', so we can't move in that
direction. We try the rule for the box to the left:
Since -37 - 8 does not equal -21, we also can't move left. Our last chance is
moving up. We test
. Since -21 = (-13 - 8) we can
move up! Draw an arrow from the lower right box, (
) to
the box just above it, (
) .
- Continue moving squares, drawing arrows, and circling each new square you
land on, until you have reached the upper right corner of the matrix If the path
branches, follow both branches.
- Write down the alignment(s) that corresponds to your path(s) by writing
the the letter codes on the margins of each position along your circled path.
Aligned pairs are at the boxes at which the path exits via the upper-left corner.
When there are horizontal or vertical movements movements along your path,
there will be a gap (write as a dash, ``-'') in your sequence.
- Now to check your results against a computer program. We have prepared a
pairwise Needleman-Wunsch alignment program, pair, which you will apply to the same
sequences which you have just manually aligned.
- Change your directory by typing at the Unix prompt:
cd
~/tbss.work/Bioinformatics/pairData
then start the pair alignment executable by typing:
pair targlist
. All alignments will be carried out using the BLOSUM40
matrix, with a gap penalty of 8. The paths to the input files and the
BLOSUM40 matrix used are defined in the file targlist; the BLOSUM40 matrix is the
first 25 lines of the file blosum40. (Other substitution matrices
can be found at the NCBI/Blast website.)
Note: In some installations, the pair executable is
in ~/tbss.work/Bioinformatics/pairData and here you must
type ./pair targlist to run it.
If you cannot access the pair
executable at all, you can see the output from this step in ~/tbss.work/Bioinformatics/pairData/example_output/
- After executing the program you will generate three output files namely
align, scorematrix and stats. View the alignment in GCG format by typing less align. The file scorematrix is the 17x10
matrix. If there are multiple paths along the traceback matrix, the program pair will choose only one path, by following this precedence rule for existing potential traceback directions, listed in decreasing precedence: diagonal (left and up), up, left.
In the file stats you will find the optimal alignment score and the percent identity of the alignment.
Finding homologous pairs of ClassII tRNA synthetases
Homologous proteins are proteins derived from a common ancestral gene.
In this exercise with the Needleman-Wunsch algorithm you will study the
sequence identity of several class II tRNA synthetases, which are either from
Eucarya, Eubacteria or Archaea or differ in the kind of aminoacylation
reaction which they catalyze. Table 6 summarizes the reaction
type, the organism and the PDB accession code and chain name of the
employed Class II tRNA synthetase domains.
Table 6:
Domain types, origins, and accession codes
Specificity |
Organism |
PDB code:chain |
ASTRAL catalytic domain |
Aspartyl |
Eubacteria |
1EQR:B |
d1eqrb3 |
Aspartyl |
Archaea |
1B8A:A |
d1b8aa2 |
Aspartyl |
Eukarya |
1ASZ:A |
d1asza2 |
Glycl |
Archaea |
1ATI:A |
d1atia2 |
Histidyl |
Eubacteria |
1ADJ:C |
d1adjc2 |
Lysl |
Eubacteria |
1BBW:A |
d1bbwa2 |
Aspartyl |
Eubacteria |
1EFW:A |
d1efwa3 |
|
- We have have prepared a computer program multiple which will align multiple pairs of proteins.
- Change your directory by typing at the Unix prompt:
cd ~/tbss.work/Bioinformatics/multipleData
then start the alignment executable by typing:
multiple targlist
.
Note: In some installations, the multiple executable is
in ~/tbss.work/Bioinformatics/multipleData and here you must
type ./multiple targlist to run it.
If you cannot access the multiple executable at all, you can see the output from this step in ~/tbss.work/Bioinformatics/multipleData/example_output/
- In the align and stats files you will find all
combinatorial possible pairs of the provided sequences. On a piece of paper,
write the names of the the proteins, grouped by ther domain of life, as listed
in Table 6. Compare sequence identities of aligned proteins
from the same domain of a life, and of aligned proteins from different domains
of life, to help answer the questions below.
Next: Sequence and structural alignments
Up: Bioinformatics Tutorial
Previous: SCOP fold classification
zan@uiuc.edu