Mod 7_0
Multiple sequence alignments
Assumptions, limitations and pitfalls



In 1965 Zuckerkandl and Pauling wrote a paper entitled "Molecules as documents of evolutionary history". They suggested that linear amino acid or DNA sequences could be treated "as records of organismal evolutionary history". The full impact of this approach on evolutionary biology became apparent when Woese and coworkers, using 16S rRNA sequences, put forward a formal three-domain proposal for the classification of organisms. The proposal assigned each of the three groups (archaebacteria, eubacteria, and eukaryotes) Domain status (a new highest taxonomic level) and renamed them Archaea, Bacteria, and Eucarya. The name Archaea was specifically proposed to indicate that this group of prokaryotes bears no specific relationship to the other prokaryotes, the eubacteria. This rooted version of the universal tree, commonly referred to as the archaebacterial or three-domain tree, is now widely accepted as the current paradigm in the field.

|===== Eubacteria ("True bacteria", mitochondria, and chloroplasts)
|===== Archaea (Methanogens, Halophiles, Sulfolobus, and relatives)
|===== Eukaryotes (Protists, Plants, Fungi, Animals, etc.)
?=== Viruses

The monophyly of the Archaea is, however, controversial.

In recent years, much new information based on a large number of gene and protein sequences, including the complete genomes of several prokaryotic and eukaryotic organisms, has become available. Based on this information, it is now possible to evaluate the three-domain proposal and its various predictions critically, and to determine whether this view is supported by all the data or holds only for a subset of gene and protein sequences.

Concepts and terminology:
In order to understand this chapter you must be familiar with the following terms.
[Figure: glossary of phylogenetic terms (phylo_glossary.gif)]

Assumptions, Limitations, and Pitfalls

The use of molecular sequences for phylogenetic studies is based on the assumption that changes in gene sequences occur randomly and in a time-dependent manner, and that a certain proportion of these changes become fixed in the DNA and are passed on to the descendants. The accumulation of changes in gene sequences in a quasi clock-like manner has given rise to the concept of the "evolutionary clock". Just as the different features of a clock or calendar (e.g. the month, day, minute, and second) move at very different rates, the changes in different gene sequences (or sometimes within different parts of the same gene) also occur at different rates. Sequences that change very slowly (like the year, month, or day) are well suited for monitoring ancient events, while others, with a higher rate of change (like the hour, minute, or second), are better suited for relatively recent events. Since the "evolutionary history" of life on this planet spans a vast period (approximately 3.8 Ga; 1 Ga = 10^9 years), different sequences have different utilities in evolutionary studies.

Phylogenetic studies are also based on the assumption that all life descends from a single type of cell. This "ultimate ancestor" is also known as the last common ancestor (LCA).

Which are better input sequences for phylogenetic analysis: DNA or protein sequences?
Phylogenetic analysis can be carried out on either nucleic acid or protein sequences. For noncoding sequences such as rRNAs, tRNAs and introns, only the nucleotide sequence data can be used. For gene sequences that encode proteins, analyses can be performed on either the nucleic acid or the amino acid sequence data. At first sight these two kinds of analyses appear to be analogous.

An analysis based on nucleic acid sequences, with three times as many characters, would seem to be more informative. However, for phylogenetic analyses involving distantly related taxa, the increased information content of nucleic acid sequences compared with protein sequences is merely an illusion.
The reason for this is the degeneracy of the genetic code. All but two amino acids (Met and Trp) are encoded by multiple codons that mostly differ in the third nucleotide of the codon. Because of this degeneracy, most changes in the wobble position (the third codon position) are selectively neutral. In other words, they do not result in any change in the protein sequence. As a consequence, these positions change frequently, even between closely related species. In distantly related taxa, which diverged from each other a long time ago, the bases at the third codon positions may have changed so many times that the bases actually found at these positions are effectively random. Such positions no longer contain a phylogenetic signal and are called saturated positions. In other words, excluding introns, at least 33% of the bases of an open reading frame are not informative. Including such bases in the analyses would therefore significantly reduce the signal-to-noise ratio of the data set (the proportion of positions that are evolutionarily informative relative to positions or changes that provide no evolutionary information).
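The degeneracy argument can be checked directly against the standard genetic code. The following is a minimal sketch, not part of the original course material; the function name synonymous_fraction is an illustrative choice. It counts, for each codon position, how many single-base changes leave the encoded amino acid unchanged.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that leave the encoded amino acid unchanged (sense codons only)."""
    silent = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # start only from codons that encode an amino acid
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            silent += CODON_TABLE[mutant] == aa
    return silent / total


# Amino acids encoded by exactly one codon.
single_codon_amino_acids = sorted(
    aa for aa in set(AMINO_ACIDS) if aa != "*" and AMINO_ACIDS.count(aa) == 1
)

print("single-codon amino acids:", single_codon_amino_acids)
for position in range(3):
    print("codon position", position + 1, "silent fraction:",
          round(synonymous_fraction(position), 3))
```

Under this count the third position tolerates far more silent changes than the first two, which is why it saturates first in distantly related taxa.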

Another important factor affecting the usefulness of nucleic acid sequences compared to protein sequences is the difference in the genomic G+C content of species. The G+C content of different species is known to differ greatly (this is often true even for two species within the same genus), and it is generally homogenized over the entire genome.
In protein-coding sequences, codon preference at the wobble position accommodates these differences in G+C content. In other words, species that are rich in G+C show a strong preference for codons that have G or C in the third position (often >90%), whereas species with low G+C content predominantly use A or T in these positions. Thus, two unrelated species with similar G+C contents (e.g. either very high or very low) may have very similar bases in the third codon positions. If phylogenetic analysis is carried out based on nucleic acid sequences, these species may show a strong but wrong affinity for, or relationship to, each other. Thus, the wobble positions can introduce a major bias into these analyses. For the same reason, the bases in the first codon positions are also evolutionarily less informative and can cause a reduction in the signal-to-noise ratio. Consequently, in phylogenetic analyses of distantly related taxa with varying G+C contents, the larger number of characters in the nucleic acid sequences does not offer a real advantage. For protein-coding regions, the amino acid sequences, which are minimally affected by differences in the G+C contents of the species, are therefore the preferred choice for phylogenetic analyses.
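The effect of codon preference on the wobble position can be illustrated with a short calculation. This is a sketch under stated assumptions: the two reading frames below are made up for illustration (both encode Leu-Pro-Ala-Gly), and the function names gc_content and gc3_content are not taken from any particular package.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence)


def gc3_content(coding_sequence):
    """G+C fraction at the third (wobble) codon positions of an in-frame sequence."""
    third_positions = coding_sequence[2::3]
    return gc_content(third_positions)


# Hypothetical reading frames encoding the same peptide (Leu-Pro-Ala-Gly),
# written once with G/C-ending codons and once with A/T-ending codons.
gc_rich = "CTGCCGGCGGGC"
at_rich = "TTACCAGCAGGT"

for name, seq in (("G+C-rich", gc_rich), ("A+T-rich", at_rich)):
    print(name, "overall G+C:", round(gc_content(seq), 2),
          "third-position G+C:", round(gc3_content(seq), 2))
```

The protein is identical in both cases, yet the third codon positions differ completely. A nucleotide-based analysis would see this difference, and an amino-acid-based analysis would not.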

In contrast to the protein-coding regions, where codon degeneracy provides a natural mechanism for accommodating changes caused by G+C drift, the effect of varying G+C composition on structural nucleic acid sequences such as rRNA or tRNA remains largely undetermined. Thus, when comparing sequences from species with varying G+C compositions, it is difficult to distinguish changes that are due to G+C drift (evolutionarily not significant) from those that are evolutionarily important. In any analysis based on structural nucleic acid sequences, the signal-to-noise ratio is therefore inherently low. The effect that this has on phylogenetic reconstruction cannot easily be determined or corrected, and it is a major source of concern in phylogenetic studies based on structural nucleic acids such as the 16S rRNA.

Another major problem in phylogenetic analyses is the reliability of the sequence alignment. The alignment of homologous positions in a set of sequences is the starting point of a phylogenetic analysis, from which all conclusions are derived. Most sequence alignment programs work by first recognizing local similarities in different parts of the molecules and then creating an alignment of all positions that maximizes the number of matches between the sequences while keeping the number of gaps introduced to a minimum. An example of such a program is CLUSTAL(X), and a manual for this program is included in the BIT course.
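The trade-off between maximizing matches and minimizing gaps can be made concrete with a small pairwise example. The sketch below uses classic Needleman-Wunsch global alignment, which is not the progressive multiple-alignment strategy that CLUSTAL itself uses. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment of two sequences.
    Returns (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a aligned against leading gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # b aligned against leading gaps

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: best score for every pair of prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


aligned_a, aligned_b, total = global_align("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print("score:", total)
```

Making the gap penalty more or less severe changes which alignment is considered optimal, which is one reason why alignments of the same sequences can differ between programs and parameter settings.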

Although these alignment programs work similarly for both nucleic acid and protein sequences, there are important differences. In nucleic acid sequences there are only four characters, and hence the number of matches between any two unrelated sequences is expected to be a minimum of 25%; with the introduction of a small number of gaps, it is commonly in the range of 40 to 50%. Thus the probability of chance alignment of non-homologous regions in two nucleotide sequences is quite high, particularly if the sequences being compared are of different lengths and have either unusually high or unusually low G+C contents. In contrast, in proteins each character (= amino acid) has 20 possible states, which greatly reduces the probability of chance alignment between non-homologous regions.
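The 25% figure follows from a simple probability argument: if two unrelated sequences draw their characters independently from the same composition, the chance that an aligned pair of characters matches is the sum of the squared character frequencies. The sketch below works this out for three compositions. The G+C-rich frequencies are hypothetical, and gaps are not modelled.

```python
def chance_identity(frequencies):
    """Probability that two characters drawn independently from the
    same composition are identical: the sum of squared frequencies."""
    return sum(p * p for p in frequencies.values())


uniform_dna = {base: 1 / 4 for base in "ACGT"}
uniform_protein = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
# Hypothetical strongly G+C-rich genome: 80% G+C.
gc_rich_dna = {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1}

print("uniform DNA:    ", round(chance_identity(uniform_dna), 3))
print("G+C-rich DNA:   ", round(chance_identity(gc_rich_dna), 3))
print("uniform protein:", round(chance_identity(uniform_protein), 3))
```

Equal base frequencies give the minimum of 25% for DNA; any compositional bias raises the chance identity further, whereas 20 equally frequent amino acids give only 5%.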

Very often, differences in the sequence alignment, in the regions included in the phylogenetic analysis, or even in the order in which the sequences are added to an alignment can lead to important differences in the conclusions drawn. That is why some programs allow the input order to be changed or randomized.

rRNA Sequences and Reconstruction of the Tree of Life

The most extensive phylogenetic studies of living organisms have been based on the small subunit (SSU) or 16S rRNA sequences. The aim was to reconstruct the tree of life. However, the alignment of rRNA sequences from various prokaryotic and eukaryotic species presents problems. There are large differences in the lengths of prokaryotic (1,500 nt) and eukaryotic (2,000 nt) SSU rRNAs, and there are wide variations in the G+C contents of species. Thus a reliable alignment of rRNA sequences from distantly related taxa cannot easily be obtained from the primary sequence data alone. The approach taken to get around this problem is to rely on secondary structure models of rRNA, based on the assumption that the secondary structure of the rRNA is highly conserved and provides a reliable guide for the identification of important homologous positions (Fig. 2).


Based on this, portions of the folded molecules (i.e. particular loops or stems) that are postulated to be similar in different sequences are aligned and used for phylogenetic studies. Enzymatic digestion and chemical modification studies in some species support the proposed structures of rRNAs. However, their validity in distantly related prokaryotic and eukaryotic taxa is far from established. The effect that these far-reaching assumptions, on which all rRNA alignments are based, have on the deduced phylogenetic relationships is unclear. It is clear, however, that these assumptions can have a profound influence on the outcome of the analyses.

In contrast to rRNA sequence alignment, the alignment of amino acid sequences of a highly conserved protein such as the 70-kDa heat shock chaperone protein (Hsp70) requires minimal or no assumptions. Because of the similar size of this protein in various prokaryotic and eukaryotic species (even including organellar homologs) and its high degree of (global) sequence conservation, a good alignment of the sequences from various species is readily obtained. This can be done with a sequence alignment program such as CLUSTALW or CLUSTALX, or even manually by placing the sequences next to each other. Read more about CLUSTALX in this PDF-file.

Now start doing exercise10

Phylogenetic Trees

A tree is a 2-dimensional graph showing evolutionary relationships among organisms, or certain genes from separate organisms. We refer to these separate sources of sequences as taxa (singular taxon), defined as phylogenetically distinct units on the tree. The tree is composed of nodes representing the taxa and branches representing the relationships among the taxa. The lengths of the branches are often drawn proportional to the number of sequence changes in the branch.
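The description of a tree as nodes joined by branches of given length maps directly onto a simple data structure. The following sketch is illustrative only: the four taxa, the two internal nodes X and Y, and the branch lengths are invented. It stores an unrooted tree as a set of branches and sums the branch lengths on the path between two taxa.

```python
from collections import defaultdict


def build_tree(branches):
    """Store a tree as a mapping: node -> {neighbouring node: branch length}."""
    tree = defaultdict(dict)
    for node_a, node_b, length in branches:
        tree[node_a][node_b] = length
        tree[node_b][node_a] = length
    return tree


def path_length(tree, start, end):
    """Sum of branch lengths along the unique path between two nodes."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, distance = stack.pop()
        if node == end:
            return distance
        for neighbour, length in tree[node].items():
            if neighbour != parent:   # a tree has no cycles, so never walk back
                stack.append((neighbour, node, distance + length))
    return None                        # the two nodes are not connected


# Hypothetical unrooted tree: taxa A-D, internal nodes X and Y,
# branch lengths given as numbers of sequence changes.
branches = [("A", "X", 1), ("B", "X", 2), ("X", "Y", 3),
            ("C", "Y", 1), ("D", "Y", 4)]
tree = build_tree(branches)

print("A to B:", path_length(tree, "A", "B"))
print("A to C:", path_length(tree, "A", "C"))
```

Because no node is singled out as the origin, this representation is unrooted; rooting the tree amounts to choosing a point on one of the branches as the inferred ancestor.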

There are two kinds of trees: trees with an (inferred) origin (rooted trees) and trees which only present the relationships between the taxa (unrooted trees). An example of an unrooted tree is given below.

In this tree:

Once a (reliable!) sequence alignment has been obtained three main types of methods are used for phylogenetic (tree) reconstruction:

These methods interpret the sequence alignment in different ways and therefore the results obtained from them (tree topology) often differ! All these methods as well as the others can give rise to incorrect relationships under different conditions.

Five main factors affecting the outcome of these analyses are:

In most cases, it is difficult to ascertain the effects of different factors and to determine which phylogenetic method is more suitable or reliable. Therefore, the phylogenetic analyses described are generally carried out by different methods to see if all the methods give similar results.

Bootstrap Analysis: Statistical Relevance of a Constructed Tree
A bootstrap test can be used to test the reliability of phylogenetic relationships inferred by the above methods. In this test, the columns of the alignment are sampled randomly with replacement: a certain number of columns in the original alignment are replaced with columns from elsewhere in the alignment, giving 100 or more different alignments, each containing the same number of columns as the original. Thus, in a given bootstrap set, some columns will not be included at all, others will be included once, and still others will be repeated two or more times. Phylogenetic analysis is then performed on each of these so-called bootstrap replicates, and a consensus tree is drawn from the results.
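The resampling step can be sketched in a few lines. The toy alignment and the function name bootstrap_alignment below are assumptions for illustration; a real analysis would use the bootstrap option of a phylogeny package rather than this code.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the
    number of columns equal to that of the original alignment."""
    n_columns = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(sequence[i] for i in picks)
            for name, sequence in alignment.items()}


# Toy alignment (hypothetical sequences, all of equal length).
alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTTCGTAC",
    "taxon_C": "ACGAACGTTC",
    "taxon_D": "TCGAACGTTC",
}

rng = random.Random(42)  # fixed seed so that the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

for name, sequence in replicates[0].items():
    print(name, sequence)
```

The same column indices are used for every sequence, so each resampled column is still a true column of the original alignment. Only the frequency with which each column appears changes from replicate to replicate.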

The main purpose of bootstrap analysis is to provide a measure of the variability of the phylogenetic estimate, that is, of the confidence in the observed evolutionary relationships. If the data throughout the sequence length support a particular relationship, the resampling of columns in the alignment will have little or no effect on the outcome of the analysis. The grouping of the species will then be the same in all (or a vast majority) of the bootstrap replicates. The results of these analyses are presented by placing bootstrap scores (the percentage or the number of times that different species group together in the bootstrap trees) on the nodes of the tree. As a rule of thumb, bootstrap values above 80 to 85% are generally considered to provide good support for a specific phylogenetic relationship.
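Once a tree has been built for every bootstrap replicate, the score for a grouping is simply the percentage of replicate trees in which that grouping appears. The sketch below assumes that each replicate tree has already been reduced to the set of groups (clades) it contains. The tree-building step is left out, and the 90/10 split of replicate trees is invented for illustration.

```python
def bootstrap_support(clade, replicate_trees):
    """Percentage of replicate trees that contain the given group of taxa."""
    clade = frozenset(clade)
    hits = sum(clade in tree for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)


# Hypothetical outcome of 100 bootstrap replicates for taxa A, B, C and D:
# 90 trees group A with B, 10 trees group A with C.
tree_ab = {frozenset("AB"), frozenset("CD")}
tree_ac = {frozenset("AC"), frozenset("BD")}
replicate_trees = [tree_ab] * 90 + [tree_ac] * 10

print("support for (A,B):", bootstrap_support("AB", replicate_trees))
print("support for (A,C):", bootstrap_support("AC", replicate_trees))
```

With these made-up numbers the (A,B) grouping would pass the 80 to 85% rule of thumb, while the (A,C) grouping would not.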

Despite due care in the alignment and analyses of the sequence data, interpretation of the phylogenetic trees that are obtained is not straightforward. The most common problem in this regard is that phylogenetic trees based on different genes or proteins may differ from each other in terms of the evolutionary information that they provide. Some genes are just better suited to resolve certain relationships than are others. Thus, while a particular relationship between two or more species may be clearly resolved and strongly supported by one gene phylogeny, the same relationship may not be obvious from a different gene phylogeny.

Sequence Signatures and Their Importance in Evolutionary Studies

Signature sequences in proteins could be defined as regions in the alignments where a specific change is observed in the primary structure of a protein in all members of one or more taxa but not in the other taxa. The changes in the sequence could be either the presence of particular amino acid substitutions or specific deletions or insertions (indels) or introns. In all cases, regions that are conserved in all the sequences under consideration must flank the signatures. These conserved regions serve as anchors to ensure that the observed signature is not an artifact resulting from improper alignment or from sequencing errors.

The rationale for using conserved indels in evolutionary studies can briefly be described as follows. When a conserved indel of defined length and sequence, flanked by conserved regions, is found at precisely the same position in homologues from different species, the simplest and most "parsimonious" explanation for this observation is that the indel was introduced only once during the course of evolution and then passed on to all descendants. Finally, besides indels, the positions of introns in eukaryotic genes can sometimes also be treated as phylogenetic markers.
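The definition of a signature indel (present in all members of one group, absent from all others, and flanked by conserved positions) can be turned into a simple scan over an alignment. The sketch below is only an illustration: the sequences, the taxon names and the choice of two conserved flanking columns are assumptions, and the function signature_indels is not taken from any existing package.

```python
def signature_indels(alignment, group, flank=2):
    """Find gap runs shared by exactly one side (group or non-group) that are
    flanked on both sides by `flank` fully conserved columns.
    Returns a list of (start, end, pattern) with end exclusive."""
    names = list(alignment)
    inside = [n for n in names if n in group]
    outside = [n for n in names if n not in group]
    if not inside or not outside:
        raise ValueError("need at least one sequence inside and outside the group")
    length = len(alignment[names[0]])

    def column_pattern(i):
        gaps_in = [alignment[n][i] == "-" for n in inside]
        gaps_out = [alignment[n][i] == "-" for n in outside]
        if all(gaps_in) and not any(gaps_out):
            return "gap in group"
        if all(gaps_out) and not any(gaps_in):
            return "gap outside group"
        return None

    def is_conserved(i):
        column = {alignment[n][i] for n in names}
        return len(column) == 1 and "-" not in column

    hits = []
    i = 0
    while i < length:
        pattern = column_pattern(i)
        if pattern is None:
            i += 1
            continue
        start = i
        while i < length and column_pattern(i) == pattern:
            i += 1
        flanks = list(range(start - flank, start)) + list(range(i, i + flank))
        if start - flank >= 0 and i + flank <= length and all(map(is_conserved, flanks)):
            hits.append((start, i, pattern))
    return hits


# Hypothetical aligned protein fragments.
alignment = {
    "bact_1": "MKVLA--GDTWE",
    "bact_2": "MKVLA--GDTWE",
    "arch_1": "MKVLAPRGDTWE",
    "arch_2": "MKVLASRGDTWE",
}
print(signature_indels(alignment, group={"bact_1", "bact_2"}))
```

Requiring conserved flanks mirrors the anchoring argument in the text: if the columns next to the gap are not conserved, the gap could equally be an alignment artifact and is not reported.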

