Evolutionary change has been observed in the fossil record,
in the field,
in the laboratory, and at the molecular level in DNA and protein
sequences, but a general
method for quantifying the changes has not been agreed upon.
In this paper the well-established
mathematics of information theory
[1,2,3]
is used
to measure the information content of
nucleotide binding sites
[4,5,6,7,8,9,10,11]
and to follow changes in this measure to gauge the degree
of evolution of the binding sites.
For example, human splice acceptor sites contain about
9.4 bits of information on average
[6].
This number is called
Rsequence because it represents
a rate (bits per site) computed from the aligned sequences
[4].
(The equation for
Rsequence is given in the Results.)
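As a rough illustration of how such a number can be obtained from aligned sequences, the following minimal sketch sums, over the positions of the alignment, the decrease in uncertainty from the 2 bits of an equiprobable four-base alphabet. It assumes equiprobable background bases and omits the small-sample correction; the function name and the example sites are hypothetical and are not data from this paper.

```python
import math
from collections import Counter

def rsequence(aligned_sites):
    """Information content (bits per site) of a set of aligned binding sites.

    Sums (2 - H(l)) over positions l, where H(l) is the Shannon uncertainty
    of the base frequencies at position l.
    Assumption: no small-sample correction is applied.
    """
    n = len(aligned_sites)
    width = len(aligned_sites[0])
    total = 0.0
    for l in range(width):
        counts = Counter(site[l] for site in aligned_sites)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - h
    return total

# Made-up alignment for illustration only.
example_sites = ["ACGT", "ACGA", "ACCT", "ACGT"]
print(round(rsequence(example_sites), 2))
```

A fully conserved position contributes 2 bits and a position with all four bases equally frequent contributes 0 bits.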
The question arises as to why one gets 9.4 bits rather than, say, 52.
Is 9.4 a fundamental number?
The way to answer this is to compare it to something else.
Fortunately,
one can use the size of the genome and the number of sites to compute
how much information is needed to find the sites.
The average distance between acceptor sites is the average size of
introns plus exons, or about 812 bases,
so the information needed to find the acceptors is
Rfrequency = log2(812) = 9.7 bits [6].
By comparison,
Rsequence = 9.4 bits,
so in this and other genetic systems
Rsequence is close to
Rfrequency [4].
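The information needed to locate the sites can be sketched as the base-2 logarithm of the number of possible positions per site. The function and parameter names below are illustrative assumptions, not taken from the paper.

```python
import math

def rfrequency(genome_size, n_sites):
    """Bits needed to pick out n_sites among genome_size possible positions."""
    return math.log2(genome_size / n_sites)

# One acceptor site per 812 bases on average.
print(round(rfrequency(812, 1), 1))  # 9.7

# Doubling the number of positions per site adds exactly one bit.
print(round(rfrequency(2 * 812, 1) - rfrequency(812, 1), 6))  # 1.0
```

Because the measure is logarithmic, even large changes in genome size shift it by only a few bits.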
These measurements show that
there is a subtle connection between the pattern at binding sites and the
size of the genome and number of sites.
Relative to the potential for changes at binding sites,
the size of the entire genome is approximately
fixed over long periods of time. Even if the genome
were to double in length (while keeping the number of sites constant),
Rfrequency would only change
by 1 bit, so the measure is quite insensitive. Likewise, the number of sites
is approximately fixed by the physiological functions that have to be
controlled by the recognizer. So
Rfrequency is essentially fixed during
long periods of
evolution. On the other hand,
Rsequence can change rapidly and could have any value, as it depends on
the details of how the recognizer contacts the nucleic acid binding sites
and these numerous small contacts can mutate quickly. So how does
Rsequence come to equal
Rfrequency? It must be that
Rsequence can start from zero and evolve up to
Rfrequency.
That is, the necessary information should be able to evolve from scratch.
The purpose of this paper is to demonstrate
that
Rsequence can indeed evolve to
match
Rfrequency [12].
To simulate the biology, suppose we have a population of
organisms each with a given length of DNA. That fixes the genome size, as in
the biological situation. Then we need to specify a set of locations that a
recognizer protein has to bind to. That fixes the number of sites, again as
in nature. We need to code the recognizer into the genome so that it can
co-evolve with the binding sites. Then we need to apply random mutations and
selection for finding the sites and against finding non-sites. Given these
conditions, the simulation will match the biology at
every point.
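The mutation and selection steps just described can be sketched as follows. This is a minimal sketch under stated assumptions: one point mutation per organism per generation, an even population size, and a placeholder recognizer (`toy_binds`) that simply binds wherever the base is G. In the model described here the recognizer would instead be decoded from the genome itself, so all names below are hypothetical.

```python
import random

BASES = "ACGT"

def count_mistakes(genome, site_starts, binds):
    """Missed sites plus bound non-sites for one organism."""
    return sum(binds(genome, pos) != (pos in site_starts)
               for pos in range(len(genome)))

def mutate(genome, rng):
    """Return a copy of the genome with one randomly chosen base replaced."""
    pos = rng.randrange(len(genome))
    return genome[:pos] + rng.choice(BASES) + genome[pos + 1:]

def next_generation(population, site_starts, binds, rng):
    """Mutate everyone, rank by mistakes, replace the worse half
    with copies of the better half (assumes an even population size)."""
    mutated = [mutate(g, rng) for g in population]
    ranked = sorted(mutated,
                    key=lambda g: count_mistakes(g, site_starts, binds))
    half = len(ranked) // 2
    return ranked[:half] + ranked[:half]

def toy_binds(genome, pos):
    """Placeholder recognizer (assumption): binds wherever the base is G."""
    return genome[pos] == "G"

rng = random.Random(0)
sites = {5, 20}  # fixed site locations (made up)
population = ["".join(rng.choice(BASES) for _ in range(32)) for _ in range(16)]
for _ in range(50):
    population = next_generation(population, sites, toy_binds, rng)
print(min(count_mistakes(g, sites, toy_binds) for g in population))
```

Genome length and the set of site locations stay fixed throughout, so only the sequence content changes under selection.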
Because half of the population always survives each selection
round in the evolutionary simulation presented here,
the population cannot die out and there is no
lethal level of incompetence.
While this may not be representative of all biological systems,
since extinction and threshold effects do occur,
it is representative of the situation in which a functional
species can survive without a particular genetic control system
but which would do better to gain control ab initio.
Indeed, any new function must have this property until the species
comes to depend on it, at which point it can become essential
if the earlier means of survival is lost by atrophy or no longer available.
I call such a situation a 'Roman arch' because once such a structure
has been constructed on top of scaffolding, the scaffold may
be removed,
and will disappear from biological systems
when it is no longer needed.
Roman arches are common in biology,
and they are a natural consequence of evolutionary processes.
The constraint that the population cannot become extinct
could be relaxed,
for example by assigning a probability of death,
but it would be inconvenient to lose an entire population
after many generations.
A two's complement weight matrix was used to store the recognizer
in the genome.
At first it may seem that this is insufficient
to simulate the complex processes of
transcription, translation, protein folding
and DNA sequence recognition found in cells.
However, the success of the simulation, as shown below, demonstrates
that the form of the genetic apparatus
does not affect the computed information measures.
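To make the two's complement storage of the weight matrix concrete, here is a minimal decoding sketch. It assumes two bits per base (A=00, C=01, G=10, T=11) and a fixed number of bases per weight, with the weights for A, C, G and T stored consecutively for each position of the site. These layout details and all names are assumptions made for illustration.

```python
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def decode_weight(bases):
    """Read a run of bases as a two's complement integer."""
    n_bits = 2 * len(bases)
    value = 0
    for b in bases:
        value = value * 4 + BASE_BITS[b]
    if value >= 1 << (n_bits - 1):  # sign bit set: negative number
        value -= 1 << n_bits
    return value

def decode_matrix(gene, width, bases_per_weight):
    """Decode width rows of four weights (A, C, G, T) from a gene string."""
    rows = []
    i = 0
    for _ in range(width):
        row = {}
        for base in "ACGT":
            row[base] = decode_weight(gene[i:i + bases_per_weight])
            i += bases_per_weight
        rows.append(row)
    return rows

print(decode_weight("TT"))                 # -1
print(decode_matrix("ACTTGAAA", 1, 2))     # [{'A': 1, 'C': -1, 'G': -8, 'T': 0}]
```

With this encoding a single point mutation can change a weight by a small or a large amount, depending on which base of the integer it hits.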
For information theorists and physicists
this emergent mesoscopic property
[13]
will come as no surprise because information theory is
extremely general and does not depend on the physical mechanism.
It applies equally
well to telephone conversations, telegraph signals, music
and molecular biology [2].
Given that, when one runs the model, one finds that the information at the
binding sites (Rsequence) does indeed evolve to be the amount predicted to be
needed to find the sites (Rfrequency). This is the same result
as observed in natural binding sites and it
strongly supports the hypothesis that these numbers should be close
[4].