When studying molecular binding sites in DNA or RNA, it is conventional
practice to align the sequences of several sites recognized by the same
macromolecular recognizer
and then to choose the most common bases at each position to
create a consensus sequence (e.g. Davidson et al., 1983).
Consensus sequences
are difficult to work with and are not reliable when searching for new sites
(Sadler et al., 1983b; Hawley and McClure, 1983).
In part, this is because
information is lost when the relative frequency of specific bases at each
position is ignored.
For example, the first position of E. coli translational
initiation codons has 94% A, 5% G, 1% U and 0% C, which is not represented
precisely by the consensus "A". To avoid this problem, four histograms can be
made that record the frequencies of each base at each position of the aligned
sequences. Such histograms can be compressed into a single curve by the use
of a χ² function (Gold et al., 1981; Stormo et al., 1982a).
Although these curves show where information lies in the site, they have
several disadvantages: the χ² scale is not easily understood in simple
terms; it is difficult to compare the overall information content of two
different kinds of sites, such as ribosome binding sites and restriction enzyme
sites; and χ² histograms are not directly useful in searching for new
sites (Stormo et al., 1982b).
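
To make the contrast between a consensus sequence and per-position base frequencies concrete, here is a minimal Python sketch. The function names and the four-sequence toy alignment are hypothetical illustrations, not data or code from the cited studies.

```python
from collections import Counter

BASES = "ACGU"  # RNA alphabet, matching the initiation codon example

def position_frequencies(aligned_sites):
    """Return one {base: relative frequency} dict per aligned position."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    table = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        table.append({b: counts[b] / len(column) for b in BASES})
    return table

def consensus(table):
    """Keep only the most common base at each position (ties go to the
    first base in BASES); all other frequency information is discarded."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in table)

# Hypothetical toy alignment, for illustration only.
sites = ["AUG", "AUG", "GUG", "AUG"]
freqs = position_frequencies(sites)

print(consensus(freqs))  # AUG
print(freqs[0])          # {'A': 0.75, 'C': 0.0, 'G': 0.25, 'U': 0.0}
```

The consensus reports only "A" for the first position, while the frequency table retains the fact that a quarter of the toy sites begin with G.
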
We present here a method for evaluating the
information content of sites
recognized by one kind of macromolecule.
The method begins with an alignment
of known sites, just as with the evaluation of consensus sequences or
histograms. However, the calculation of the information content (called
Rsequence) does not ignore variability of individual positions within a
set of sites, as do consensus sequences. Furthermore,
Rsequence is a measure that encourages direct comparisons between sites
recognized by different macromolecules, which is an improvement over
histograms.
Rsequence has units of bits per site. The values obtained
precisely describe how different the sequences are from all possible sequences
in the genome of the organism, in a manner that clearly delineates the
important features of the site.
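
As a rough illustration of how an information measure in bits retains what a consensus discards, the sketch below computes the information at a single position. This section does not state the formula, so the form used here is an assumption: the Shannon uncertainty of the observed base frequencies is subtracted from a 2-bit maximum, which in turn assumes four equally likely bases in the genome and applies no correction for small sample size. The function names are hypothetical.

```python
import math

def uncertainty(freqs):
    """Shannon uncertainty in bits; zero-frequency bases contribute nothing."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def information_at_position(freqs, max_bits=2.0):
    """Assumed form: maximum uncertainty (2 bits for four equiprobable
    bases) minus the uncertainty observed at the position."""
    return max_bits - uncertainty(freqs)

# Frequencies quoted in the text for the first position of
# E. coli translational initiation codons.
first_base = {"A": 0.94, "G": 0.05, "U": 0.01, "C": 0.00}
r = information_at_position(first_base.values())
print(round(r, 2))  # 1.63

# A bare consensus "A" would correspond to frequencies (1, 0, 0, 0).
print(information_at_position([1.0, 0.0, 0.0, 0.0]))  # 2.0
```

Under these assumptions the position carries about 1.63 bits rather than the full 2 bits implied by the consensus "A". Summing such values over all positions of a site would give a total in bits per site.
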
An independent approach is to measure the information needed to find sites in the genome. This depends on the size of the genome and the number of sites in it rather than on nucleotide sequence information. For example, there is only one lac operator in E. coli, while there are thousands of ribosome binding sites. We have defined another measure, Rfrequency, that is a function of the frequency of sites in the genome. More information is needed to identify a single site than to identify any one of a set of thousands. Thus Rfrequency is greater for the lac operator than for ribosome binding sites. Rfrequency, like Rsequence, is expressed in bits per site.
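
The dependence of Rfrequency on site frequency can be sketched numerically. The exact definition is not given in this section, so the form below is an assumption: the base-2 logarithm of the number of possible site positions in the genome divided by the number of sites. The genome size and site counts are hypothetical round numbers chosen only to show the trend.

```python
import math

def r_frequency(num_sites, num_positions):
    """Assumed form: bits needed to pick num_sites locations out of
    num_positions possible locations in the genome."""
    return math.log2(num_positions / num_sites)

# Hypothetical round numbers, for illustration only.
GENOME_POSITIONS = 4_000_000

single_site = r_frequency(1, GENOME_POSITIONS)
many_sites = r_frequency(2000, GENOME_POSITIONS)

print(round(single_site, 1))  # 21.9
print(round(many_sites, 1))   # 11.0
```

With these illustrative numbers, locating one unique site takes roughly twice as many bits as locating any one of two thousand sites, which is the ordering described above for the lac operator and ribosome binding sites.
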
Rsequence, which measures the information in binding site sequences, should be related to the specific binding interaction between the recognizer and the site. Rfrequency, based only on the frequency of sites, is related to the amount of information required for the sites to be distinguished from all possible sites in the genome. The problem of how proteins can find their required binding sites among a huge excess of non-sites has been discussed (Lin and Riggs, 1975; von Hippel, 1979). Rsequence and Rfrequency give us quantitative tools for addressing this problem. Thus we compare Rsequence and Rfrequency and come to the pleasing conclusion that the values are similar for each site studied. This result was not necessarily expected.