The genetic code is said to be degenerate because in messenger RNA there are
64 triplets of the four nucleotide bases, the codons, but these translate to
only 20 common amino acids. The degree of degeneracy can be understood using
information theory. Selecting one amino acid for insertion into a protein takes
log_{2} 20 ≈ 4.3 bits of information, but the coding
potential in the mRNA is log_{2} 64 = 6 bits. Dividing these to form a
unitless measure of the code degeneracy gives the code efficiency, 4.3/6
≈ 72%.
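
The arithmetic above can be checked in a few lines. This is only a minimal sketch of the uniform case, where all 20 amino acids are treated as equally likely; the variable names are illustrative and do not come from the original work.

```python
import math

# 4 bases read three at a time give 4**3 = 64 codons.
N_CODONS = 4 ** 3
N_AMINO_ACIDS = 20

# Information needed to pick one of 20 equally likely amino acids.
bits_needed = math.log2(N_AMINO_ACIDS)

# Coding potential of one codon in the mRNA.
bits_available = math.log2(N_CODONS)

# Unitless ratio: the code efficiency.
efficiency = bits_needed / bits_available

print(f"{bits_needed:.2f} bits / {bits_available:.0f} bits = {efficiency:.1%}")
```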

Surprisingly, similar efficiencies are found for protein binding sites on DNA,
photosensitive proteins, and motility systems (TDS in preparation).
Efficiencies near 70% can be explained given the requirement that choices must
be made precisely, with minimal errors, to create distinct biological states
such as the selection of particular amino acids. As Shannon showed for
communications, this is possible by using a high dimensional coding space.
Refinement of the computation using over a billion amino acids from the UniRef
database predicts an efficiency of 0.6949, which is significantly higher than
the theoretical maximum, ln 2 = 0.6931. Basic information theory turns this
discrepancy into a prediction that the error rate of translation is
< 1⋅10^{-3} errors per amino acid, which is consistent with measured rates of
5⋅10^{-5} to 3⋅10^{-3}. Conversely, taking the average error
rate to be exactly 1⋅10^{-3}, the theory fits the data to about 4
decimal places. The theory not only correctly predicts the error rate of
translation from amino acid frequencies but it also explains why and to exactly
what degree the genetic code is degenerate.
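
One plausible reading of the refinement, sketched here as an assumption rather than as the author's actual procedure, is to replace log_{2} 20 with the Shannon entropy of the observed amino acid frequencies and divide by the 6 bits of a codon. The function name `code_efficiency` and the count lists are hypothetical; no UniRef data is used.

```python
import math


def code_efficiency(counts, codon_bits=6.0):
    """Shannon entropy of the amino acid distribution divided by codon bits.

    Assumption: this is one way the uniform-case ratio could be refined
    with observed frequencies; it is not taken from the original work.
    """
    total = sum(counts)
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in counts if c > 0
    )
    return entropy / codon_bits


# Uniform usage reproduces the simple log2(20) / 6 estimate.
uniform = code_efficiency([1] * 20)

# Hypothetical skewed counts: any non-uniform usage lowers the entropy,
# and therefore the efficiency, relative to the uniform case.
skewed = code_efficiency([10] * 10 + [5] * 10)

# The theoretical maximum quoted in the text, for comparison.
ln2 = math.log(2)

print(f"uniform: {uniform:.4f}  skewed: {skewed:.4f}  ln 2: {ln2:.4f}")
```

With real counts, the resulting value would be the number to compare against ln 2 = 0.6931.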

- Mathematical and Statistical Models for Genetic Coding, September 26th to 28th 2013, Mannheim, Germany
- Bits <-> Biology, 2014 May 1

Schneider Lab

origin: 2014 Feb 28

updated: 2014 Nov 03