(e) Why is R_sequence Approximately Equal to R_frequency?

R_frequency is a function of genome size and the number of sites. Both of these quantities are fixed by factors that have little to do with recognition: genome size is essentially invariant within a species, and the number of sites required by the organism is fixed by physiology and genetics. For example, a ribosome binding site must precede every gene and the number of genes is determined by physiology and evolutionary history. Unless the population of organisms is undergoing speciation or rapid change in a new environment (Gould, 1977), there is a reasonably fixed frequency of sites and thus R_frequency is approximately fixed. To account for our results, we focus attention on R_sequence. Sequence drift will keep R_sequence from being larger than is needed for the regulatory process to function properly. If an organism were to have a collection of sites that were more conserved in sequence than was required, mutations in some of the positions of the sites could be tolerated. This would mean an increase in the uncertainty H_s at those positions in the site and a decrease in R_sequence. Uncertainty is related to thermodynamic entropy (Shannon, 1948; Tribus and McIrvine, 1971). Just as the entropy of an isolated system tends to increase, excess binding site information content should tend to atrophy. The lower limit to the drift would be the point at which proper function of the regulatory circuit is diminished.
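The drift argument can be illustrated numerically. The following is a minimal sketch, assuming an equiprobable genome (2 bits of uncertainty per base before binding) and omitting any small-sample correction; the function names, the genome size, the number of sites and the base frequencies are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def r_frequency(genome_size, num_sites):
    """Bits needed to pick num_sites positions out of genome_size positions."""
    return math.log2(genome_size / num_sites)

def position_uncertainty(freqs):
    """Shannon uncertainty H_s (bits) at one position, from its base frequencies."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def r_sequence(site_freqs, h_genome=2.0):
    """Sum over positions of (genomic uncertainty - H_s at that position).

    Assumption: h_genome = 2 bits (equiprobable bases); no sampling correction.
    """
    return sum(h_genome - position_uncertainty(f) for f in site_freqs)

# Hypothetical frequencies for (A, C, G, T) at a single position.
conserved = [1.0, 0.0, 0.0, 0.0]   # fully conserved: H_s = 0 bits
drifted = [0.5, 0.5, 0.0, 0.0]     # two bases tolerated: H_s = 1 bit

# Hypothetical genome: 4096 positions containing 4 sites.
needed = r_frequency(4096, 4)

# Six fully conserved positions carry more information than is needed.
before_drift = r_sequence([conserved] * 6)

# Tolerated mutations at one position raise H_s there and lower R_sequence.
after_drift = r_sequence([conserved] * 5 + [drifted])

print(needed, before_drift, after_drift)
```

In this toy case the sites begin with more information than is needed to locate them, and relaxing a single position moves R_sequence one bit closer to R_frequency. Drift below R_frequency would, by the argument above, be opposed by loss of function.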

We are left with many puzzles. How does the information content of sites evolve to equal that needed to find the sites? How is binding energy related to information content? How are chemical contacts related to the base frequencies? What happens in skewed genomes? Lastly, are there situations in biology capable of sustaining large R_sequence to R_frequency ratios, similar to those observed for the T7 late promoters, but for which there is really only one macromolecular recognizer? That is, could a high information content be advantageous for reasons not encountered in the systems studied thus far?

We thank many friends and colleagues for their suggestions, criticisms and patience during the years that this work evolved. We also thank Phil Bloch for a current estimate of the coding capacity of E. coli; F.W. Studier for sending us the sequence of T7 before publication; Michael Perry for a general proof for formula (15); and Kathie Piekarski for typing the manuscript. Computer resources were generously provided by the University of Colorado Academic Computing Services. This work was supported by NIH grant GM28755.

Tom Schneider
2002-10-16