American Scientist, November-December 2004
This is a really interesting article that I found via William Dembski’s Uncommon Descent blog. It concerns some important properties of the genetic code. Here’s the deal (as far as I understand it with my limited knowledge of biology):
Each cell of an organism contains a number of DNA molecules. Each DNA molecule is a chemically linked chain of nucleotides, each of which consists of a sugar, a phosphate and one of several kinds of nucleobases (“bases”). There are five common bases: adenine (A), thymine (T), uracil (U), cytosine (C), and guanine (G). However, only four of them (A, T, C and G) are normally found in DNA; uracil takes the place of thymine in RNA.
The cell also contains a mechanism for building protein molecules, which are used for constructing the various structures of the cell. Proteins are long chains of amino acids, and the specific sequence of amino acids assembled to create the protein molecule is determined by the sequence of bases in the cell’s DNA molecule. Specifically, each set of three consecutive bases along the length of the DNA, termed a codon, ends up determining (by a complicated process) which one of twenty different amino acids is used to form one link in one of the proteins created by the cell. The specific sequence of amino acids determines the functional properties of the protein. There is a really nice description of this process here.
The correspondence between a particular codon (a set of three consecutive bases in the DNA) and the resulting amino acid that forms one link in the protein is called the genetic code. You can think of this as a table in which each row lists the bases in a single codon and alongside it the identity of the corresponding amino acid that gets used to construct a protein. Since there are four possible bases in each link (nucleotide) of the DNA and three consecutive bases in a codon, there are 4 x 4 x 4 = 64 possible codons. Yet there are only 20 possible amino acids that the cell may insert into a protein. The implication is that several codons can correspond to the same amino acid.
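To make the counting concrete, here is a small Python sketch that builds the table described above. The variable names are my own, and the 64-letter string is the usual compact way of writing the standard genetic code (bases in the order T, C, A, G, with `*` marking the three "stop" codons that end a protein rather than naming an amino acid).

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# All 4 x 4 x 4 codons: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(bases) for bases in product(BASES, repeat=3)]
GENETIC_CODE = dict(zip(CODONS, AMINO_ACIDS))

# How many codons map to each amino acid (or to "stop").
codons_per_amino_acid = Counter(GENETIC_CODE.values())

print("codons:", len(GENETIC_CODE))
print("amino acids:", len(set(AMINO_ACIDS) - {"*"}))
for amino_acid, n_codons in sorted(codons_per_amino_acid.items()):
    print(amino_acid, n_codons)
```

The per-amino-acid counts show the redundancy directly: some amino acids, such as leucine (L), are served by six codons, while methionine (M) and tryptophan (W) have only one each.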
The American Scientist article I mentioned above deals with the properties of the genetic code that translates between codons and amino acids. In particular, it discusses a theory that the code is optimal in terms of its robustness to point mutations, that is, substitutions of one nucleotide for another at a single site on the DNA. A mutation in the DNA can lead to the use of an incorrect amino acid in the corresponding protein. However, replacing one amino acid with another can have more or less serious effects on the functional properties of the protein, depending on the relative properties of the correct amino acid and the one that substitutes for it. It’s possible in principle to determine the robustness of the genetic code by considering all of the possible random point mutations in each codon in the DNA and determining the significance of the amino acid substitution that results from each, based on the genetic code. By doing this for all possible genetic codes, you can estimate the robustness of the actual genetic code relative to all other possibilities.
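As a small illustration of what a point mutation does, the sketch below lists the nine codons reachable from the codon CTT by changing a single base, together with the amino acid each one codes for. The helper name `point_mutants` is my own, and `CODE_FRAGMENT` contains only the handful of entries from the standard genetic code needed for this one example.

```python
BASES = "TCAG"

def point_mutants(codon):
    """Return the nine codons that differ from `codon` at exactly one position."""
    return [
        codon[:pos] + base + codon[pos + 1:]
        for pos in range(3)
        for base in BASES
        if base != codon[pos]
    ]

# A few entries of the standard genetic code, just enough for this example.
CODE_FRAGMENT = {
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "TTT": "Phe", "ATT": "Ile", "GTT": "Val",
    "CCT": "Pro", "CAT": "His", "CGT": "Arg",
}

for mutant in point_mutants("CTT"):
    print("CTT ->", mutant, ": Leu ->", CODE_FRAGMENT[mutant])
```

Three of the nine mutations (those at the third position) leave the amino acid unchanged, while the other six swap leucine for a different amino acid. How much those six swaps matter is exactly what a dissimilarity measure has to quantify.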
However, the success of this strategy depends entirely on having some sensible means of quantifying the significance of the amino acid substitutions associated with point mutations. Various studies have tried to determine the robustness of the genetic code by using somewhat arbitrary measures of the functional dissimilarity between amino acids. However, the article I mentioned cites a study reported in a paper by Freeland et al. (pdf) that uses what is considered to be a more appropriate measure of amino acid dissimilarity than previous studies. In particular, they consider how frequently actual amino acid substitutions are observed to occur naturally between the corresponding proteins of different individuals. As the American Scientist article explains:
“If two amino acids are often found occupying the same position in variant copies of the same protein, then it seems safe to conclude that the amino acids are physiologically compatible. Conversely, amino acids that are never found to occupy the same position would not be likely substitutions in a successful genetic code.”
What’s going on here is an assumption that if a particular substitution is relatively common, that substitution is probably not particularly harmful to the organism. Conversely, substitutions that are very uncommon might be suspected to be rather less benign. Based on this assumption, and taking account of known statistical biases in DNA mutations, they computed the total functional dissimilarity in amino acids generated by point mutations, first for the actual genetic code and then for a million randomly generated alternative codes. Their conclusion was that the actual genetic code leads to less impact on protein function from point mutations than any of the other codes.
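The comparison described above can be sketched in a few lines of Python. This is only a rough illustration of the general approach and makes several assumptions of my own. It does not use the substitution-frequency measure from the paper; as a stand-in it scores each substitution by the squared difference in Kyte-Doolittle hydropathy, which is just one of the older, more arbitrary kinds of measure. It ignores mutation biases, skips mutations to or from stop codons, and generates alternative codes by shuffling which amino acid is assigned to each block of synonymous codons, which may or may not match the paper's exact procedure. All function names are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(bases) for bases in product(BASES, repeat=3)]
STANDARD_CODE = dict(zip(CODONS, AMINO_ACIDS))

# Kyte-Doolittle hydropathy values: a stand-in measure, NOT the one used in the paper.
HYDROPATHY = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

def point_mutants(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos, base in product(range(3), BASES):
        if base != codon[pos]:
            yield codon[:pos] + base + codon[pos + 1:]

def code_cost(code):
    """Mean squared hydropathy change over all point mutations between sense codons."""
    total, count = 0.0, 0
    for codon, amino_acid in code.items():
        if amino_acid == "*":
            continue
        for mutant in point_mutants(codon):
            new_amino_acid = code[mutant]
            if new_amino_acid == "*":
                continue  # mutations to a stop codon are ignored in this sketch
            total += (HYDROPATHY[amino_acid] - HYDROPATHY[new_amino_acid]) ** 2
            count += 1
    return total / count

def random_code(rng):
    """Reassign amino acids to the existing synonymous codon blocks; stops stay put."""
    amino_acids = sorted(set(AMINO_ACIDS) - {"*"})
    shuffled = amino_acids[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(amino_acids, shuffled))
    relabel["*"] = "*"
    return {codon: relabel[aa] for codon, aa in STANDARD_CODE.items()}

def fraction_of_better_codes(n_codes, seed=0):
    """Fraction of random codes with a lower cost than the standard code."""
    rng = random.Random(seed)
    standard_cost = code_cost(STANDARD_CODE)
    better = sum(code_cost(random_code(rng)) < standard_cost for _ in range(n_codes))
    return better / n_codes

if __name__ == "__main__":
    print("standard code cost:", code_cost(STANDARD_CODE))
    print("fraction of random codes that do better:", fraction_of_better_codes(1000))
```

The result depends heavily on the choice of dissimilarity measure, which is precisely why the paper's choice of measure deserves scrutiny.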
This is quite a stunning conclusion (I think), since under the theory of common descent, the genetic code is assumed to have been constant for all of evolutionary history since the emergence of the universal common ancestor of all organisms now known. Under an evolutionary model of biological origins, the fact that this one common ancestor had (apparently) by chance the one optimal genetic code out of a huge number of possibilities suggests that there must have been a large amount of evolution prior to that point in order to converge on that one code. Yet evolutionary theory gives no indication of why that one ancestor should be the single organism of that epoch that has surviving descendants.
However, I think there may be a problem with the reasoning that leads to this conclusion. The functional similarity of substituted amino acids is actually only one of two possible influences on the observed substitution frequencies. The other is that the genetic code itself – since it determines the amino acid substitution that results from each possible DNA mutation – may tend to make certain amino acid substitutions more common than others, regardless of how beneficial or harmful those substitutions are. Considering this fact, it could be argued that the observed amino acid substitution frequencies don’t provide a good measure of the functional similarity of those amino acids.
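To see how strongly the code itself constrains which substitutions can arise from a single point mutation, the sketch below collects every pair of distinct amino acids whose codons differ at just one base under the standard genetic code. The function names are my own, and stop codons are left out.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(bases) for bases in product(BASES, repeat=3)]
STANDARD_CODE = dict(zip(CODONS, AMINO_ACIDS))

def point_mutants(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos, base in product(range(3), BASES):
        if base != codon[pos]:
            yield codon[:pos] + base + codon[pos + 1:]

def reachable_pairs(code):
    """Pairs of distinct amino acids connected by at least one point mutation."""
    pairs = set()
    for codon, amino_acid in code.items():
        for mutant in point_mutants(codon):
            new_amino_acid = code[mutant]
            if "*" not in (amino_acid, new_amino_acid) and new_amino_acid != amino_acid:
                pairs.add(frozenset((amino_acid, new_amino_acid)))
    return pairs

pairs = reachable_pairs(STANDARD_CODE)
total_pairs = 20 * 19 // 2  # all unordered pairs of distinct amino acids
print(len(pairs), "of", total_pairs, "amino acid pairs are one point mutation apart")
print("Met <-> Leu reachable:", frozenset("ML") in pairs)
print("Met <-> Trp reachable:", frozenset("MW") in pairs)
```

Methionine (ATG) and tryptophan (TGG), for example, differ at two positions, so that substitution can never result from a single point mutation, however harmless it might be. Pairs like this will look rare in short-term substitution data for reasons that have nothing to do with functional similarity.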
The paper cited anticipates this possible objection and argues that the effect of the genetic code on observed substitution frequencies can be neglected by only considering the substitution frequencies that are observed after many generations of evolution. They claim to have used the substitution frequencies corresponding to a distance in time of 74-100 generations.
However, although they don’t draw attention to the fact, my impression is that they actually don’t use the observed frequencies of amino acid substitutions at that evolutionary distance. They actually use estimated frequencies, which are computed based entirely on the observed frequencies after just a single generation. These computed frequencies would be reasonable in the absence of selection pressures over multiple generations. But the whole point of the authors’ analysis is that these frequencies can be considered to reflect the functional similarity between amino acids precisely because they are assumed to reflect selection pressure.
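Here is a toy sketch of the kind of extrapolation I have in mind, where multi-step substitution probabilities are obtained simply by multiplying a one-step matrix by itself. Whether the paper's matrix was built exactly this way is my impression rather than something I have confirmed, and the three-state matrix `ONE_STEP` below is entirely made up for illustration (real matrices have one row and column per amino acid).

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def mat_power(matrix, steps):
    """Multiply `matrix` by itself `steps` times, starting from the identity."""
    n = len(matrix)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, matrix)
    return result

# Hypothetical one-step substitution probabilities between three states.
# Row i, column j is the chance that state i is replaced by state j in one step.
ONE_STEP = [
    [0.98, 0.02, 0.00],
    [0.02, 0.97, 0.01],
    [0.00, 0.01, 0.99],
]

extrapolated = mat_power(ONE_STEP, 100)

print("one-step  0 -> 2:", ONE_STEP[0][2])
print("100-step  0 -> 2:", extrapolated[0][2])
```

Every entry of the extrapolated matrix is fixed by the one-step values alone. A substitution that never happens directly (state 0 to state 2 here) gets a nonzero long-range frequency purely through intermediate steps, with no information about selection over the intervening period entering the calculation.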
I have taken a look at the online PubMed database and see that there have been five citations of the paper by Freeland et al. I have yet to look at them, but the title of one (“Testing a biosynthetic theory of the genetic code: Fact or artifact?”) suggests at least some scepticism. I’m looking forward to investigating further.