Next: MATERIALS AND METHODS Up: Evolution of Biological Information Previous: ABSTRACT

INTRODUCTION

Evolutionary change has been observed in the fossil record, in the field, in the laboratory, and at the molecular level in DNA and protein sequences, but a general method for quantifying the changes has not been agreed upon. In this paper the well-established mathematics of information theory [1,2,3] is used to measure the information content of nucleotide binding sites [4,5,6,7,8,9,10,11] and to follow changes in this measure to gauge the degree of evolution of the binding sites.

For example, human splice acceptor sites contain about 9.4 bits of information on average [6]. This number is called R_sequence because it represents a rate (bits per site) computed from the aligned sequences [4]. (The equation for R_sequence is given in the Results.) The question arises as to why one gets 9.4 bits rather than, say, 52. Is 9.4 a fundamental number? The way to answer this is to compare it to something else. Fortunately, one can use the size of the genome and the number of sites to compute how much information is needed to find the sites. The average distance between acceptor sites is the average size of introns plus exons, or about 812 bases, so the information needed to find the acceptors is R_frequency = log2(812) = 9.7 bits [6]. By comparison, R_sequence = 9.4 bits, so in this and other genetic systems R_sequence is close to R_frequency [4].
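The two measures compared above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, R_frequency is taken as the base-2 logarithm of genome size divided by number of sites (which reduces to log2 of the average spacing, 812 bases in the splice acceptor example), and R_sequence is taken as the sum over aligned positions of 2 bits minus the observed uncertainty, with no small-sample correction. The equation actually used in this paper is the one given in the Results.

```python
import math
from collections import Counter


def r_frequency(genome_size, num_sites):
    """Bits needed to pick out num_sites positions among genome_size (assumed form)."""
    return math.log2(genome_size / num_sites)


def r_sequence(aligned_sites):
    """Sum over positions of (2 - uncertainty) bits; no small-sample correction."""
    total = 0.0
    for column in zip(*aligned_sites):          # one column per aligned position
        n = len(column)
        counts = Counter(column)
        uncertainty = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - uncertainty              # 2 bits is the maximum for 4 bases
    return total


# One site per 812 bases on average, as in the splice acceptor example.
print(round(r_frequency(812, 1), 1))            # 9.7
# Toy alignment (not real data): four identical 4-base sites carry 2 bits per position.
print(r_sequence(["ACGT"] * 4))                 # 8.0
```

A position where all four bases are equally frequent contributes 0 bits, and a fully conserved position contributes 2 bits, so R_sequence grows with the number of conserved positions in the site.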

These measurements show that there is a subtle connection between the pattern at binding sites and the size of the genome and number of sites. Relative to the potential for changes at binding sites, the size of the entire genome is approximately fixed over long periods of time. Even if the genome were to double in length (while keeping the number of sites constant), R_frequency would only change by 1 bit, so the measure is quite insensitive. Likewise, the number of sites is approximately fixed by the physiological functions that have to be controlled by the recognizer. So R_frequency is essentially fixed during long periods of evolution. On the other hand, R_sequence can change rapidly and could have any value, as it depends on the details of how the recognizer contacts the nucleic acid binding sites and these numerous small contacts can mutate quickly. So how does R_sequence come to equal R_frequency? It must be that R_sequence can start from zero and evolve up to R_frequency. That is, the necessary information should be able to evolve from scratch.
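The 1 bit figure for a doubled genome follows directly from the logarithm. As a short worked step, writing G for the genome size and \gamma for the number of sites (symbols assumed here for illustration, not defined in this section):

```latex
\begin{align*}
R_{frequency}  &= \log_2 \frac{G}{\gamma} \\
R'_{frequency} &= \log_2 \frac{2G}{\gamma}
                = \log_2 2 + \log_2 \frac{G}{\gamma}
                = R_{frequency} + 1 \text{ bit}
\end{align*}
```

With the splice acceptor spacing of 812 bases, doubling the spacing to 1624 bases would raise R_frequency from about 9.7 to about 10.7 bits.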

The purpose of this paper is to demonstrate that R_sequence can indeed evolve to match R_frequency [12]. To simulate the biology, suppose we have a population of organisms each with a given length of DNA. That fixes the genome size, as in the biological situation. Then we need to specify a set of locations that a recognizer protein has to bind to. That fixes the number of sites, again as in nature. We need to code the recognizer into the genome so that it can co-evolve with the binding sites. Then we need to apply random mutations and selection for finding the sites and against finding non-sites. Given these conditions, the simulation will match the biology at every point.
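The ingredients listed above (fixed genome size, fixed site locations, a recognizer coded in the genome, random mutation, and selection against both missed sites and false positives) can be put together in a minimal sketch. It is not the program used in this paper: every constant and function name below is an assumption chosen for illustration, the recognizer is assumed to be a weight matrix plus threshold read from the start of the genome, and the survival rule is simply that the better half of the population replaces the worse half.

```python
import random

BASES = 4                 # A, C, G, T coded as 0..3
GENOME_LEN = 128          # assumed genome size
SITE_WIDTH = 4            # assumed width of a binding site
BASES_PER_WEIGHT = 3      # assumed: each number is 3 bases = 6 bits
GENE_LEN = (SITE_WIDTH * BASES + 1) * BASES_PER_WEIGHT   # weights + threshold
SITES = [60, 70, 80, 90, 100, 110]   # assumed site locations, outside the gene


def decode(bases):
    """Read bases as a two's complement integer, 2 bits per base."""
    value = 0
    for b in bases:
        value = value * 4 + b
    bits = 2 * len(bases)
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def recognizer(genome):
    """Decode the weight matrix and threshold stored at the genome start."""
    nums = [decode(genome[i:i + BASES_PER_WEIGHT])
            for i in range(0, GENE_LEN, BASES_PER_WEIGHT)]
    weights = [nums[p * BASES:(p + 1) * BASES] for p in range(SITE_WIDTH)]
    return weights, nums[-1]


def mistakes(genome):
    """Count missed sites plus non-sites that the recognizer accepts."""
    weights, threshold = recognizer(genome)
    errors = 0
    for pos in range(GENOME_LEN - SITE_WIDTH + 1):
        score = sum(weights[i][genome[pos + i]] for i in range(SITE_WIDTH))
        if (score >= threshold) != (pos in SITES):
            errors += 1
    return errors


def generation(population, rng):
    """Mutate each genome once, then let the better half replace the worse."""
    for genome in population:
        genome[rng.randrange(GENOME_LEN)] = rng.randrange(BASES)
    population.sort(key=mistakes)
    half = len(population) // 2
    population[half:] = [list(g) for g in population[:half]]


def run(generations=200, pop_size=16, seed=0):
    """Return the mistake count of the best organism after each generation."""
    rng = random.Random(seed)
    population = [[rng.randrange(BASES) for _ in range(GENOME_LEN)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        generation(population, rng)
        history.append(mistakes(population[0]))
    return history
```

Because the weights and threshold live in the same genome as the sites, a single mutation can alter either the recognizer or a site, which is what lets the two co-evolve in this sketch. Calling `run()` returns the per-generation mistake count of the best organism; how quickly it falls depends on the assumed parameters.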

Because half of the population always survives each selection round in the evolutionary simulation presented here, the population cannot die out and there is no lethal level of incompetence. While this may not be representative of all biological systems, since extinction and threshold effects do occur, it is representative of the situation in which a functional species can survive without a particular genetic control system but which would do better to gain control ab initio. Indeed, any new function must have this property until the species comes to depend on it, at which point it can become essential if the earlier means of survival is lost by atrophy or no longer available. I call such a situation a 'Roman arch' because once such a structure has been constructed on top of scaffolding, the scaffold may be removed, and will disappear from biological systems when it is no longer needed. Roman arches are common in biology, and they are a natural consequence of evolutionary processes.

The fact that the population cannot become extinct could be dispensed with, for example by assigning a probability of death, but it would be inconvenient to lose an entire population after many generations.

A two's complement weight matrix was used to store the recognizer in the genome. At first it may seem that this is insufficient to simulate the complex processes of transcription, translation, protein folding and DNA sequence recognition found in cells. However, the success of the simulation, as shown below, demonstrates that the form of the genetic apparatus does not affect the computed information measures. For information theorists and physicists this emergent mesoscopic property [13] will come as no surprise because information theory is extremely general and does not depend on the physical mechanism. It applies equally well to telephone conversations, telegraph signals, music and molecular biology [2].
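To make the storage scheme concrete, the following sketch reads signed weights out of a stretch of DNA and scores a candidate site with them. The details are assumptions for illustration only: the 2-bit coding A=0, C=1, G=2, T=3, the number of bases per weight, the row order A, C, G, T, and all function names are hypothetical and are not taken from this section.

```python
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit coding


def twos_complement(bases):
    """Interpret a string of bases as a signed integer, 2 bits per base."""
    bits = 2 * len(bases)
    value = 0
    for base in bases:
        value = (value << 2) | BASE_VALUE[base]
    if value >= 1 << (bits - 1):                # top bit set: negative number
        value -= 1 << bits
    return value


def read_matrix(gene, width, bases_per_weight):
    """Split a gene into `width` rows of four weights (for A, C, G, T)."""
    step = bases_per_weight
    nums = [twos_complement(gene[i:i + step])
            for i in range(0, 4 * width * step, step)]
    return [dict(zip("ACGT", nums[4 * p:4 * p + 4])) for p in range(width)]


def score(matrix, site):
    """Sum the weight selected by the base at each position of the site."""
    return sum(row[base] for row, base in zip(matrix, site))


# A 2-position matrix with 2 bases (4 bits, range -8..7) per weight.
matrix = read_matrix("CTAAAAGATTACAAAA", width=2, bases_per_weight=2)
print(matrix[0])             # {'A': 7, 'C': 0, 'G': 0, 'T': -8}
print(score(matrix, "AC"))   # 8
print(score(matrix, "TA"))   # -9
```

Storing the weights as two's complement numbers means a single base change can shift a weight by a small or a large amount, and can flip its sign, so point mutations alone can reach any matrix.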

Given that, when one runs the model, one finds that the information at the binding sites (R_sequence) does indeed evolve to be the amount predicted to be needed to find the sites (R_frequency). This is the same result as observed in natural binding sites and it strongly supports the hypothesis that these numbers should be close [4].

Tom Schneider
2001-11-07