
1. Introduction

When studying molecular binding sites in DNA or RNA, it is conventional practice to align the sequences of several sites recognized by the same macromolecular recognizer⁵ and then to choose the most common bases at each position to create a consensus sequence (e.g. Davidson et al., 1983). Consensus sequences are difficult to work with and are not reliable when searching for new sites (Sadler et al., 1983b; Hawley and McClure, 1983). In part, this is because information is lost when the relative frequency of specific bases at each position is ignored. For example, the first position of E. coli translational initiation codons has 94% A, 5% G, 1% U and 0% C, which is not represented precisely by the consensus "A". To avoid this problem, four histograms can be made that record the frequencies of each base at each position of the aligned sequences. Such histograms can be compressed into a single curve by the use of a $\gamma$ function (Gold et al., 1981; Stormo et al., 1982a). Although these curves show where information lies in the site, they have several disadvantages: the $\gamma$ scale is not easily understood in simple terms; it is difficult to compare the overall information content of two different kinds of sites, such as ribosome binding sites and restriction enzyme sites; and $\gamma$ histograms are not directly useful in searching for new sites (Stormo et al., 1982b).
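The contrast between a consensus sequence and per-position base frequencies can be sketched in a few lines of Python. This is only an illustration: the function names `position_frequencies` and `consensus` and the toy alignment are assumptions introduced here, not part of the original work.

```python
from collections import Counter

BASES = "ACGU"  # RNA alphabet, matching the initiation-codon example


def position_frequencies(aligned_sites):
    """Return one dict (base -> relative frequency) per aligned position."""
    if not aligned_sites:
        raise ValueError("need at least one aligned site")
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    table = []
    for column in zip(*aligned_sites):  # iterate over alignment columns
        counts = Counter(column)
        table.append({base: counts[base] / len(column) for base in BASES})
    return table


def consensus(table):
    """Pick the most common base at each position (ties go to BASES order)."""
    return "".join(max(BASES, key=lambda base: column[base]) for column in table)


# Hypothetical toy alignment, for illustration only (not real data).
toy_sites = ["AUG", "AUG", "GUG", "AUG"]
freqs = position_frequencies(toy_sites)
print(consensus(freqs))  # the consensus hides the G seen at position 1
print(freqs[0])          # the frequency table keeps it
```

The consensus collapses each column to a single letter, while the frequency table retains the minority bases that the text argues should not be ignored.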

We present here a method for evaluating the information content of sites recognized by one kind of macromolecule. The method begins with an alignment of known sites, just as with the evaluation of consensus sequences or $\gamma$ histograms. However, the calculation of the information content (called R_sequence) does not ignore variability of individual positions within a set of sites, as do consensus sequences. Furthermore, R_sequence is a measure that encourages direct comparisons between sites recognized by different macromolecules, which is an improvement over $\gamma$ histograms. R_sequence has units of bits per site. The values obtained precisely describe how different the sequences are from all possible sequences in the genome of the organism, in a manner that clearly delineates the important features of the site.
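The precise definition of R_sequence is given later in the paper. As a rough sketch of the idea only, the following Python computes a Shannon-style information value for one position, under two simplifying assumptions that are ours: the four bases are equally likely in the genome (so the maximum is 2 bits per position), and no correction is made for small sample size. The function names are hypothetical.

```python
import math


def position_information(freqs, background_bits=2.0):
    """Bits of information at one position: background minus observed uncertainty.

    Assumes equiprobable genomic bases (2 bits) unless background_bits is changed.
    """
    uncertainty = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return background_bits - uncertainty


def site_information(table):
    """Sum per-position information over all positions of a site, in bits per site."""
    return sum(position_information(column) for column in table)


# Frequencies quoted in the text for the first position of
# E. coli translational initiation codons.
first_codon_position = {"A": 0.94, "G": 0.05, "U": 0.01, "C": 0.0}

# What a bare consensus "A" would imply: A at every site.
consensus_only = {"A": 1.0, "G": 0.0, "U": 0.0, "C": 0.0}

print(position_information(first_codon_position))
print(position_information(consensus_only))
```

Under these assumptions the observed frequencies give somewhat less than the 2 bits implied by the consensus "A" alone, which is the kind of distinction a consensus sequence cannot express.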

An independent approach is to measure the information needed to find sites in the genome. This relies on the size of the genome and the number of sites in the genome rather than nucleotide sequence information. There is at least one lac operator in E. coli, while there are thousands of ribosome binding sites. We have defined another measure, R_frequency, that is a function of the frequency of sites in the genome. More information would be necessary to identify a single site than any one in a set of thousands. Thus R_frequency is greater for the lac operator than for ribosome binding sites. R_frequency, like R_sequence, is expressed in bits per site.
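The dependence of R_frequency on the number of sites can be illustrated with a small Python sketch. It assumes, as one natural reading of the paragraph above, that the information needed is the base-2 logarithm of the ratio of possible positions to actual sites; the formal definition appears later in the paper. The function name and the genome size used here are hypothetical round numbers, not measured values.

```python
import math


def r_frequency(num_positions, num_sites):
    """Bits needed to pick out num_sites sites among num_positions possible positions.

    Assumed form: log2(num_positions / num_sites).
    """
    if num_sites <= 0 or num_positions < num_sites:
        raise ValueError("need 0 < num_sites <= num_positions")
    return math.log2(num_positions / num_sites)


# Hypothetical round number of possible positions, for illustration only.
GENOME_POSITIONS = 4_000_000

single_site = r_frequency(GENOME_POSITIONS, 1)      # e.g. a unique operator
many_sites = r_frequency(GENOME_POSITIONS, 1000)    # e.g. a thousand sites

print(single_site, many_sites)
```

With this form, a site that occurs once requires about log2(1000), or roughly 10, more bits than a site that occurs a thousand times, regardless of the genome size chosen.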

R_sequence, which measures the information in binding site sequences, should be related to the specific binding interaction between the recognizer and the site. R_frequency, based only on the frequency of sites, is related to the amount of information required for the sites to be distinguished from all sites in the genome. The problem of how proteins can find their required binding sites among a huge excess of non-sites has been discussed (Lin and Riggs, 1975; von Hippel, 1979). R_sequence and R_frequency give us quantitative tools for addressing this problem. Thus we compare R_sequence and R_frequency and come to the pleasing conclusion that the values are similar for each site studied. This result was not necessarily expected.


Tom Schneider
2002-10-16