We aligned the sequences of 149 E. coli and coliphage ribosome binding sites by their initiation codons because the process of initiation requires that the fMet-tRNA bind there. Since ribosomes search mRNA, we used the base composition of the transcript library (Stormo et al., 1982a) to calculate Hg. The counts are A=29526, C=25853, G=27800 and T=28951, for which Hg=1.99817 bits/base. The frequencies of bases at each position of the sites were used to find the information content, Rsequence(L), as a function of position (equations 2, 3 and A.8). Fig. 2 shows that the largest peak is for the initiation codon. The second largest peak represents the "Shine and Dalgarno" sequence (Shine and Dalgarno, 1974). There are at least five other distinct peaks.
Position 0 is the first base of the initiation codon.
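The value of Hg quoted above can be checked directly from the four base counts. The following is a minimal sketch that assumes Hg is the Shannon uncertainty of the base composition; the function name `entropy_bits` is illustrative and does not come from the original work.

```python
from math import log2

# Base counts of the transcript library, as quoted in the text.
counts = {"A": 29526, "C": 25853, "G": 27800, "T": 28951}

def entropy_bits(base_counts):
    """Shannon uncertainty, in bits per base, of a base composition."""
    total = sum(base_counts.values())
    # Zero counts contribute nothing to the sum, so they are skipped.
    return -sum((n / total) * log2(n / total)
                for n in base_counts.values() if n > 0)

hg = entropy_bits(counts)
print(f"Hg = {hg:.5f} bits/base")
```

Because the four bases are nearly equiprobable in this library, the result falls just below the 2 bits/base maximum.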
Rsequence, the total information content of the site, is found by adding together the individual information contents from each position (equation 6). Previous statistical analyses showed a range of -21 to +13 (zero is the first base of the initiation codon), which corresponds well to the regions of RNA protected by ribosomes from ribonucleases (Gold et al., 1981). We extended this range by 5 bases on each side, to -26 through +18. For this extended range, we calculate an Rsequence of 11.0 bits per site. Alignment by the Shine and Dalgarno sequence gives less than 8.3 bits (data not shown), which suggests that this is not a good alignment.
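The summation over positions can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the 149 sites. It takes the information at a position to be Hg minus the uncertainty of the bases observed there, and it deliberately omits the small-sample correction. The sequences in `toy_sites` are hypothetical, and the function names are illustrative.

```python
from collections import Counter
from math import log2

HG = 1.99817  # bits/base, the genomic uncertainty quoted in the text

def rsequence_at(column, hg=HG):
    """Information (bits) at one aligned position.

    Assumption: Hg minus the uncertainty of the column, with NO
    small-sample correction applied.
    """
    n = len(column)
    hs = -sum((c / n) * log2(c / n) for c in Counter(column).values())
    return hg - hs

def rsequence_total(aligned_sites, hg=HG):
    """Total information of the site: the sum over all aligned positions."""
    return sum(rsequence_at(col, hg) for col in zip(*aligned_sites))

# Hypothetical aligned sites of equal length, for illustration only.
toy_sites = ["GGAGGATG", "AGAGGATG", "GGTGAATG", "GGAGGGTG"]

per_position = [rsequence_at(col) for col in zip(*toy_sites)]
total = rsequence_total(toy_sites)
print([round(r, 2) for r in per_position], round(total, 2))
```

With only four toy sequences the uncorrected values are strongly biased upward, which is why a sampling correction matters for small sets of sites.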
A good estimate for the size of the E. coli genome is 4 × 10^6 basepairs (Bachmann and Low, 1980). In determining Rfrequency, we assume that almost all of the genome is transcribed into messages and that for the most part only one strand is transcribed. The number of potential ribosome binding sites is therefore 4 × 10^6. Based on the coding capacity versus DNA insert size of 24 plasmids selected at random from the Clark-Carbon bank (P. Bloch, personal communication; F.C. Neidhardt et al., 1983), and a genome size of 4 × 10^6 bp, we estimate the number of proteins encoded by E. coli, and therefore the number of ribosome binding sites, to be 2574. Equation (9) therefore gives an Rfrequency of 10.6 bits per site. The data for all analyses are gathered in Table 1.
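The Rfrequency arithmetic can be reproduced with a short sketch. It assumes that equation (9) has the form log2(G/γ), where G is the number of potential binding positions and γ is the number of sites. The value `G_ASSUMED = 4.0e6` is an assumption chosen to be consistent with the quoted 10.6 bits; only the site count of 2574 and the result are taken from the text.

```python
from math import log2

def rfrequency(potential_positions, n_sites):
    """Bits needed to pick n_sites out of potential_positions.

    Assumption: equation (9) is log2(G / gamma).
    """
    return log2(potential_positions / n_sites)

G_ASSUMED = 4.0e6  # assumed number of potential sites (one strand transcribed)
N_SITES = 2574     # estimated number of ribosome binding sites, from the text

r_freq = rfrequency(G_ASSUMED, N_SITES)
print(f"Rfrequency = {r_freq:.1f} bits per site")
```

Because the dependence is logarithmic, the result is insensitive to modest errors in either estimate: doubling the number of sites lowers Rfrequency by exactly one bit.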