The Information Content of Binding Sites on Nucleotide Sequences

Thomas D. Schneider¹ ², Gary D. Stormo^*, Larry Gold^*,
and Andrzej Ehrenfeucht³

version = 1.12 of schneider1986.tex 2002 Oct 16

J. Mol. Biol. (1986) 188, 415-431⁴

I have not finished transforming this document into LaTeX. Still remaining to be done:

1.: check equations
2.: read the text for errors.
3.: put all references into bibtex so they hyperlink
4.: Do figs 3 - 10
5.: Fig 1 is missing the 0 lines!

If you have corrections, please email them to me at toms@alum.mit.edu. By releasing this paper now, Figures 1, 11, 12, 13 and the Appendix (which are finished) are available for people who would like to learn about the small sample correction.

NEWS:

version 1.11, 2001 July 5: Tables 1 and 2 are done!
version 1.08, 2001 June 22: Figs 11, 12, 13 now are in the html appendix.
version 1.08, 2001 June 22: Figs 1 and 2 sizes fixed in html. Figs 11, 12, 13 now function.
version 1.07, 2001 June 8: Fig 2 (ribosome Rs curve) done.

Repressors, polymerases, ribosomes and other macromolecules bind to specific nucleic acid sequences. They can find a binding site only if the sequence has a recognizable pattern. We define a measure of the information (R_sequence) in the sequence patterns at binding sites. It allows one to investigate how information is distributed across the sites and to compare one site to another. One can also calculate the amount of information (R_frequency) that would be required to locate the sites given that they occur with some frequency in the genome. Several Escherichia coli binding sites were analyzed using these two independent empirical measurements.

The two amounts of information are similar for most of the sites we analyzed. In contrast, bacteriophage T7 RNA polymerase binding sites contain about twice as much information as is necessary for recognition by the T7 polymerase, suggesting that a second protein may bind at T7 promoters. The extra information can be accounted for by a strong symmetry element found at the T7 promoters. This element may be an operator. If this model is correct, these promoters and operators do not share much information. The comparisons between R_sequence and R_frequency suggest that the information at binding sites is just sufficient for the sites to be distinguished from the rest of the genome.
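
As a rough illustration of the two measures described above, the following Python sketch computes both for a toy alignment. It assumes equiprobable bases (so the maximum uncertainty is 2 bits per position), takes R_sequence as the sum over positions of 2 minus the Shannon uncertainty, and takes R_frequency as log2 of the number of possible binding positions divided by the number of sites. The small sample correction treated in the Appendix is deliberately omitted. The function names, argument names and the toy sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def r_sequence(sites):
    """Information in aligned sites, in bits (uncorrected sketch).

    Assumes equal-length DNA strings and equiprobable background bases;
    the small sample correction is NOT applied here.
    """
    n = len(sites)
    total = 0.0
    for pos in range(len(sites[0])):
        # Count how often each base occurs at this aligned position.
        counts = Counter(site[pos] for site in sites)
        # Shannon uncertainty H at this position, in bits.
        uncertainty = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - uncertainty
    return total


def r_frequency(possible_positions, number_of_sites):
    """Bits needed to pick number_of_sites out of possible_positions."""
    return math.log2(possible_positions / number_of_sites)


# Toy data (hypothetical): three conserved positions, one fully variable.
toy_sites = ["ACGT", "ACGA", "ACGC", "ACGG"]
print(r_sequence(toy_sites))   # 6.0 bits
print(r_frequency(1024, 4))    # 8.0 bits
```

In this toy case each fully conserved position contributes 2 bits and the fully variable position contributes none. With real data the number of example sites is finite, so the uncorrected value overestimates the information, which is why the small sample correction matters.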

Tom Schneider
2002-10-16