Phylogenetics. Fst measures genetic distances during the initial stages of population divergence. Over time, in isolated populations, existing alleles drift to fixation and new alleles arising via mutation also drift to fixation (see Chapter 7 pages 249-255 on the concept of substitutions). Phylogenetics provides various methods for reconstructing the genealogical history of alleles, populations, and species. The methods rest on genealogical theory: all alleles, populations, and species are related to varying degrees, and descendants inherit the novel mutations that arose in their immediate ancestors. Furthermore, 1) because drift and mutation are each stochastic, it is statistically impossible for two isolated descendant lineages to remain genetically the same over time, and 2) genetic similarity between two descendant lineages decays as a function of the time since divergence from a common ancestor. Any similarity due to common ancestry (and most often not to functional constraint) is homology.
We will study two ways of reconstructing phylogenies. One involves genetic distances and is used when the timing of branching events is needed (e.g., in phylogenetic forensics and epidemiology). The other is cladistics, which is used when the evolution of homologies needs to be detailed. The software program ForensicEA by Jon C. Herron illustrates phylogenetic analysis via genetic distances. It was developed for understanding virus populations infecting hosts, but is applicable to phylogenetic evolution in general (diploid, haploid, or otherwise).
Distance methods. From aligned nucleotide or amino acid sequences of alleles (each sequence representing a lineage, population, species, etc.): 1) Derive a distance matrix showing the number or proportion of sites that differ between each pair of sequences. 2) Identify the most similar pair or pairs of sequences and join them in a branching diagram. Assign half the number or proportion of substitutions to each branch. 3) Find the sequence that is next closest to one of the established clusters; this involves taking the average distance between each of the remaining sequences and the established pair(s), one at a time. 4) Continue until all sequences are joined in a branching diagram. The result is a phylogenetic tree that reveals the relationships among the various populations or species, as well as the branch lengths that separate them. The genetic distances we will use are simple: either numbers or proportions of observed substitutions. The clustering approach is referred to as UPGMA (Unweighted Pair Group Method with Arithmetic mean). For ease of hand calculation, we will limit all phylogenetic analyses in class to 5 sequences (of populations, species, etc.) and 5-7 variable nucleotide sites.
Data set. Here is a sample of just some variable sites within 100 bps of the control region of the mitochondrial genome of Brown and Polar Bears from throughout the northern hemisphere:
W_Europe T … T … A … T … G … A … T
E_Alaska A … C … T … G … T … A … C
N_Alaska A … C … T … G … T … A … C
ABC_islands G … T … T … G … T … C … C
Polar_Bear G … T … T … A … T … C … C
From the above aligned 100 bps of nucleotide sequence data, calculate the distance (proportion of substitutions out of 100 bps) between each pair of sequences to derive this distance matrix:
             W_Europe  E_Alaska  N_Alaska  ABC_islands  Polar_Bear
W_Europe     0         -         -         -            -
E_Alaska     0.06      0         -         -            -
N_Alaska     0.06      0.00      0         -            -
ABC_islands  0.06      0.03      0.03      0            -
Polar_Bear   0.06      0.04      0.04      0.01         0
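The pairwise distances can be checked with a short Python sketch. It assumes that the seven sites shown in the data set are the only sites that differ among the five sequences and that the remaining sites in the 100 bps are identical. The names `variable_sites`, `SEQ_LENGTH`, and `p_distance` are illustrative only and do not come from ForensicEA.

```python
from itertools import combinations

# Aligned length in bp. Assumption: sites not listed are identical in all sequences.
SEQ_LENGTH = 100

# The seven variable sites from the data set, in the order shown.
variable_sites = {
    "W_Europe":    "TTATGAT",
    "E_Alaska":    "ACTGTAC",
    "N_Alaska":    "ACTGTAC",
    "ABC_islands": "GTTGTCC",
    "Polar_Bear":  "GTTATCC",
}

def p_distance(seq_a, seq_b, length=SEQ_LENGTH):
    """Proportion of sites (out of `length`) at which two aligned sequences differ."""
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / length

# One entry per pair of sequences (the lower triangle of the matrix).
matrix = {
    (name_a, name_b): p_distance(variable_sites[name_a], variable_sites[name_b])
    for name_a, name_b in combinations(variable_sites, 2)
}

for (name_a, name_b), d in matrix.items():
    print(f"{name_a:12s} {name_b:12s} {d:.2f}")
```

Counting differences and dividing by the aligned length gives the simple proportion of observed substitutions used in these notes; no correction for multiple substitutions at the same site is applied.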
From this distance matrix, a UPGMA tree is calculated, as will be discussed in class.
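As a preview of the hand calculation, here is a minimal Python sketch of UPGMA clustering applied to the distances given above. It is one possible way to code the procedure, not the ForensicEA implementation. The function name `upgma` and the tuple representation of clusters are assumptions made for illustration. The distance between two clusters is taken as the arithmetic mean of all pairwise distances between their members, and each join is placed at half that distance.

```python
from itertools import combinations

LEAVES = ["W_Europe", "E_Alaska", "N_Alaska", "ABC_islands", "Polar_Bear"]

# Pairwise distances copied from the distance matrix in the text.
PAIRWISE = {
    ("W_Europe", "E_Alaska"): 0.06,
    ("W_Europe", "N_Alaska"): 0.06,
    ("W_Europe", "ABC_islands"): 0.06,
    ("W_Europe", "Polar_Bear"): 0.06,
    ("E_Alaska", "N_Alaska"): 0.00,
    ("E_Alaska", "ABC_islands"): 0.03,
    ("E_Alaska", "Polar_Bear"): 0.04,
    ("N_Alaska", "ABC_islands"): 0.03,
    ("N_Alaska", "Polar_Bear"): 0.04,
    ("ABC_islands", "Polar_Bear"): 0.01,
}
# frozenset keys make the lookup independent of the order of the two names.
DIST = {frozenset(pair): d for pair, d in PAIRWISE.items()}

def upgma(leaves, dist):
    """Return the joins as (cluster_1, cluster_2, node_height), in order."""
    def average(c1, c2):
        # Arithmetic mean over every pair with one member in each cluster.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in leaves]  # each sequence starts alone
    joins = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        height = average(c1, c2) / 2  # half the distance goes to each branch
        joins.append((c1, c2, round(height, 4)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return joins

for c1, c2, height in upgma(LEAVES, DIST):
    print(" + ".join(c1), "|", " + ".join(c2), "| node height", height)
```

With these distances the sketch joins E_Alaska and N_Alaska first (distance 0.00), then ABC_islands and Polar_Bear (node height 0.005), then those two pairs (average distance 0.035, node height 0.0175), and finally W_Europe (node height 0.03). When two candidate joins tie, `min` takes the first one it encounters, so a hand calculation that breaks ties differently may produce a different but equally valid joining order.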