One of the early bacteriophage T7 proteins, encoded by gene 1, is a new RNA polymerase (Chamberlin et al., 1970). This polymerase transcribes the middle and late genes of the phage genome. Concurrently, the T7 proteins encoded by genes 0.7 and 2 inactivate the host RNA polymerase so that transcription is directed to the T7 genome rather than that of the host (Hesselbach and Nakada, 1977a,b; see Studier, 1969, 1972; Krüger and Schroeder, 1981; Dunn and Studier, 1983 for reviews on T7).
All 17 T7 RNA polymerase promoters have been sequenced (Dunn and Studier, 1983). In vitro deletion experiments and homology among the promoters suggest that a functional promoter is at least 32 base pairs long. Five bases beyond the range -24 to +7 were used to calculate Rsequence (Fig. 8). (The zero base is thought to be the start of each transcript; see Fig. 9 for the alignment.) Rsequence = 35.4 bits per site.
To calculate Rfrequency, we must determine both G, the number of potential binding sites, and γ, the number of actual sites. There are two genomes that can contribute to the potential binding sites: the host and the phage. The host DNA is destroyed by gene products 3 (endonuclease, Center et al., 1970) and 6 (exonuclease, Sadowski and Kerr, 1970), which are synthesized from T7 RNA polymerase-dependent transcripts. They are therefore made only after the T7 RNA polymerase has been synthesized. This means that the gene 1 product may search both the E. coli and T7 genomes. The T7 genome is only one hundredth the size of the host genome, so it does not contribute much. The relevant genome is probably the host DNA. Because promoters are asymmetric, there are twice as many potential binding sites on the genome as there are base pairs, so G is twice the genomic size of E. coli (Table 1).
The transcriptional map of T7 is known in great detail (Carter et al., 1981); there are almost certainly no more than 17 T7 polymerase sites (Dunn and Studier, 1983). The activity of T7 RNA polymerase on E. coli DNA is 4% of its activity on T7 DNA (Chamberlin and Ring, 1973; see also Summers and Siegel, 1970). Therefore the total number of sites on E. coli DNA could be (17 sites/39936 bp T7) × (3.9 × 10^6 bp E. coli) × 0.04 = 66. On infection by T7, there could be as many as 83 sites in the cell. This gives a lower bound for Rfrequency of 16.5 bits per site. If there are no sites in the E. coli genome, and thus only 17 sites in the cell, Rfrequency would be 18.8 bits per site. This is the first case for which Rsequence is much bigger than Rfrequency, so we studied the sequences more closely.
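The arithmetic behind these bounds can be checked with a short sketch. It assumes the relation Rfrequency = log2(G/γ), which is not written out in this section, with G the number of potential binding sites and γ the number of actual sites. The function and variable names are illustrative only.

```python
import math

# Values quoted in the text.
T7_GENOME_BP = 39936        # size of the T7 genome in base pairs
ECOLI_GENOME_BP = 3.9e6     # size of the E. coli genome in base pairs
T7_SITES = 17               # known T7 RNA polymerase promoters
RELATIVE_ACTIVITY = 0.04    # activity on E. coli DNA relative to T7 DNA


def r_frequency(potential_sites, actual_sites):
    """Bits needed to pick actual_sites out of potential_sites.

    Assumed form: Rfrequency = log2(G / gamma).
    """
    return math.log2(potential_sites / actual_sites)


# Asymmetric sites: both orientations count, so G is twice the genome size.
G = 2 * ECOLI_GENOME_BP

# Estimated sites on E. coli DNA, scaled from the T7 site density.
ecoli_sites = round(T7_SITES / T7_GENOME_BP * ECOLI_GENOME_BP * RELATIVE_ACTIVITY)
total_sites = ecoli_sites + T7_SITES

lower_bound = r_frequency(G, total_sites)   # as many as 83 sites in the cell
upper_bound = r_frequency(G, T7_SITES)      # only the 17 phage sites

print(ecoli_sites, total_sites)
print(round(lower_bound, 1), round(upper_bound, 1))
```

Under these assumptions the sketch reproduces the 66 and 83 sites and the 16.5 and 18.8 bits per site quoted above.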
Oakley and Coleman (1977; Oakley et al., 1979) observed that several of the T7 promoters contain a symmetric element centered between bases -3 and -2. The 17 promoter sequences are presented in Fig. 9. The extent of the symmetry in all 17 promoters was found by counting the number of complementary matches between the two halves. For example, position -14 matches the corresponding position +9 in only 5 of the 17 sites. This number is likely to occur by chance if the bases are not correlated. The rest of the complementary matches are tabulated in Table 2. Twelve positions have a significantly high number of matches (p < 0.005), and these are taken to represent the symmetry. (The positions -6 and +1 are presumably not involved because they have exceptionally few complementary matches.) Several of the sites contain CTCnCTA:TAGnGAG, while in a few the GAG is shifted to the left by one position.
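The counting procedure can be sketched as follows. This is a minimal illustration, not the authors' program: the three short sequences are invented toy data rather than the promoters of Fig. 9, and the helper names are hypothetical. With the axis between -3 and -2, position x pairs with position -5 - x, which gives the -14/+9 and -6/+1 pairings mentioned above.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mirror(position, axis_left=-3):
    """Return the partner of `position` across an axis lying between
    axis_left and axis_left + 1 (here between -3 and -2)."""
    return 2 * axis_left + 1 - position


def count_complementary_matches(sequences, first_coord, position):
    """Count the aligned sequences in which the base at `position` is the
    complement of the base at the mirrored position.

    first_coord is the coordinate of the first base of every sequence.
    """
    i = position - first_coord
    j = mirror(position) - first_coord
    return sum(1 for seq in sequences if COMPLEMENT[seq[i]] == seq[j])


# Hypothetical toy alignment covering coordinates -5 .. 0 (NOT real T7 data).
toy_sites = ["CTATAG", "CTAAAG", "GTATAA"]

counts = {pos: count_complementary_matches(toy_sites, -5, pos)
          for pos in (-5, -4, -3)}
print(counts)

# If the four bases were equally frequent and uncorrelated (an assumption),
# a complementary match would occur with probability 1/4, so about
# 17 / 4 = 4.25 matches would be expected by chance among 17 sites.
expected_by_chance = 17 / 4
```

Under that equiprobable-base assumption, 5 matches out of 17 at position -14 is close to the chance expectation, in line with the statement that this position shows no evidence of symmetry.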
The information content of these palindromes was determined from the 17 sequences and their complements (34 sequences total) centered as described above (Fig. 10). The Rsequence value given in Table 1 is for the 12 positions of the symmetry. Rsequence is 16.4 bits per site. There are at least 17 sites in an infected cell, so Rfrequency is less than or equal to 17.8 bits per site.
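One reading consistent with the quoted 17.8 bits is that a symmetric site looks the same in both orientations, so each base pair offers one potential site rather than two and G equals the E. coli genome size. The sketch below checks that reading. It assumes the relation Rfrequency = log2(G/γ), and the variable names are illustrative.

```python
import math

ECOLI_GENOME_BP = 3.9e6   # E. coli genome size quoted in the text
MIN_SITES = 17            # at least 17 sites in an infected cell

# Asymmetric site: two orientations per base pair, so G = 2 * genome size.
r_asymmetric = math.log2(2 * ECOLI_GENOME_BP / MIN_SITES)

# Symmetric site (assumed here): one potential site per base pair.
r_symmetric = math.log2(ECOLI_GENOME_BP / MIN_SITES)

print(round(r_asymmetric, 1), round(r_symmetric, 1))
```

Halving G removes exactly one bit, which accounts for the difference between the 18.8 bits for the asymmetric promoter and the 17.8 bits for the symmetric element.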