Reading of DNA Sequence Logos: Prediction of Major Groove Binding by Information Theory

Methods in Enzymology: RNA Polymerase and Associated Factors, Part B, Volume 274, pages 445-455, 1996
Edited by Sankar Adhya

Information theory was introduced in the late 1940s by Claude Shannon for the study of communications systems[1,2]. With this mathematical tool, information passing through a telephone or computer data line can be measured and compared to the theoretical limit of the line, called the channel capacity. The information measure is given in units of bits per second, where one bit is the choice between two equally likely possibilities. Surprisingly, one of Shannon's theorems states that as long as the channel capacity is not exceeded, the communications may have as few errors as desired[3]. An example of the practical use of this result is the clear music of compact discs (CDs) which are specially coded to protect against noise. Cleaning instructions for CDs say that they should be wiped in a radial direction. If any scratches are introduced, they will have less effect on the coding, which runs in concentric circles and is capable of correcting up to 4000 data bit errors[4].

Evolutionary conservation indicates the functional importance of biological structures, so a robust and precise measure for conservation is necessary for empirical studies. The measurement of conservation in bits is uniquely suited to this task because, unlike any other measure, bits are additive and consistent from one system to the next. Bits provide a universal scale for measuring biological conservation not only in DNA and RNA but also for proteins and other macromolecules[5,6].

OxyR is a tetrameric protein that binds to the DNA of several promoters in Escherichia coli and activates transcription of genes encoding antioxidant enzymes[7,8]. The initial investigation of the DNA-binding sites by Tartaglia et al. showed that OxyR binds to a large region of DNA, but the consensus sequences obtained were weak and sparse, making the sites difficult to characterize.

In conjunction with the information analysis described here, recent experimental work on this protein[9] clarified the way in which OxyR binds to DNA sequences. This would not have been possible using a consensus sequence.

When possible, wild-type OxyR DNA binding site sequences[7,9] were obtained from GenBank 83. The following list includes the site name, accession number, and coordinate of the central (zero) base: oxyR, J04553, 163; katG, M21516, 68; ahpC, D13187, 116; dps, X69337, 202; gorA, U00039, 60491; Mu mom, V01463, 59; S. typhimurium orf, (not in GenBank) 29. Both wild-type and the random sequences that bound to OxyR are given elsewhere[9].

Information analysis and the sequence logo technique were performed as described previously[5,10,11]. A primer on information theory is available on the internet at https://alum.mit.edu/www/toms/ftp/primer.ps. Programs (written in Pascal[12]) and data are available by anonymous ftp from https://alum.mit.edu/www/toms/ftp/ or via the World Wide Web site http://www-lmmb.ncifcrf.gov/~toms/.

The programs for constructing sequence logos and performing other tasks were used in approximately this order:

The aligned listing and sequence logos were printed on a Tektronix Phaser 140 inkjet printer.

The sequence logos[10,11] in Fig. 2 summarize the data in a set of aligned sequences such as Fig. 1. The height of a stack of letters is the sequence conservation measured in bits of information according to equation (1). The height of each letter within a stack is proportional to its frequency at that position in the binding site. The letters are sorted, with the most frequent on top. The cosine wave represents the twist of B-form DNA. Wave peaks are all on one face of the DNA and represent the major groove facing the protein. Error bars indicate the variability of a comparable number of random sequences[5].
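The stack and letter heights described above can be sketched in Python. This is a minimal illustration, not the Pascal programs used for the paper. It assumes the conservation at one position has the form R = 2 - (H + e(n)) bits, with the commonly quoted approximate small-sample correction e(n) = 3 / (2 ln 2 · n); the function names and the toy column are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def small_sample_correction(n):
    """Approximate correction e(n), in bits, for n aligned sequences (assumed form)."""
    return (len(BASES) - 1) / (2.0 * math.log(2) * n)


def position_information(column):
    """Return (stack height in bits, {base: letter height}) for one aligned column."""
    n = len(column)
    counts = Counter(column)
    freqs = {b: counts.get(b, 0) / n for b in BASES}
    # Uncertainty H of the observed base frequencies at this position.
    uncertainty = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    # Conservation: maximum of 2 bits minus the corrected uncertainty.
    stack_height = 2.0 - (uncertainty + small_sample_correction(n))
    # Each letter's height is proportional to its frequency.
    heights = {b: f * stack_height for b, f in freqs.items()}
    return stack_height, heights


# Hypothetical column of eight aligned bases.
stack, letters = position_information("TTTTTTAT")
# Sort so the most frequent letter is listed first (it is drawn on top).
for base, height in sorted(letters.items(), key=lambda kv: -kv[1]):
    print(base, round(height, 3))
```

With few sequences the correction is large, so a weakly conserved column can come out slightly negative; the error bars on a logo reflect the same sampling noise.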

Wild-type OxyR binding site sequences were aligned (Fig. 1) and information analysis was used to generate the sequence logo in Fig. 2a. The logo shows a correlation between the strongest sequence conservation, as given by the heights of the stacks of letters, and the face of B-form DNA, as given by the cosine wave. The same correlation is seen in other proteins (λ cI/cro, λ O, 434 cI/cro, ArgR, CRP, TrpR, FNR and LexA; see figure 6 of Papp et al.[11]). The extent of DNase I footprinting[7] and this correlation suggest that OxyR binds to one face of B-form DNA in 4 successive major grooves.

A second line of evidence that can be read from the sequence logo also supports this model. When a protein is in contact with a major groove, the two base pairs and their two orientations can be distinguished, as recognized by Seeman et al.[17,11], so the protein is capable of ``choosing'' one of the four possibilities: A=T, T=A, C≡G, or G≡C. This can be explained with the help of Fig. 3, which depicts the two base pairs. The possible chemical contacts for T=A in the major groove are (reading from left to right): methyl group, hydrogen acceptor, hydrogen donor and hydrogen acceptor, or T:Me-a-d-a:A for short. This can easily be distinguished from the complementary pattern of A=T, which is A:a-d-a-Me:T. Likewise C:(blank)-d-a-a:G is distinguishable from G:a-a-d-(blank):C. Finally, GC/CG can be distinguished from AT/TA.

This choice of 1 possibility in 4 can be made with 2 bits of information. (This is calculated as $\log_2 4 = 2$ bits. For further explanation, see Pierce[2] or the primer on information theory whose location is given in Materials and Methods.) Completely conserved positions in the major groove are described by 2 bits and this is the highest point on the vertical scale of the sequence logos. It is easiest to think of a bit as a knife slice that dissects the bases in Fig. 3. A horizontal slice is the first bit and a vertical slice is the second one. The first bit determines whether the base is above or below the slice and the second bit determines whether it is to the left or the right. Because they are at right angles to one another, the slices provide independent choices and no more than 2 bits are needed to specify a single base. For example, ``top, left'' selects the T. The average number of bits needed to describe the observed frequency of bases is the information content or sequence conservation. Because it is an average it does not need to be an integer and so the heights of the letter stacks in the sequence logo are real numbers. Sequence conservation in the major groove can range anywhere between 0 and 2 bits depending on the strength of the contacts involved, as seen in figure 6 of Papp et al.[11]. Just because it is possible for sequence conservation to be as high as 2 bits from the major groove does not mean that the protein will evolve to that high a value. The important factors are the total sequence conservation[5] and the coding of the binding site that distinguishes it from other sequences.
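The knife-slice picture can be written as a two-bit lookup. The sketch below is only an illustration: the text fixes ``top, left'' as T, while the placement of the other three bases is an assumption taken from the left-to-right patterns T:...:A and C:...:G given earlier, and the names are hypothetical.

```python
import math

# (first bit: horizontal slice, second bit: vertical slice) -> base.
# Only ("top", "left") -> "T" is stated in the text; the rest is assumed.
SLICES = {
    ("top", "left"): "T",
    ("top", "right"): "A",
    ("bottom", "left"): "C",
    ("bottom", "right"): "G",
}


def select_base(vertical_side, horizontal_side):
    """Pick one base using two independent binary choices."""
    return SLICES[(vertical_side, horizontal_side)]


def bits_for_choices(equally_likely_choices):
    """Bits needed to choose among M equally likely possibilities."""
    return math.log2(equally_likely_choices)


print(select_base("top", "left"))   # the example from the text
print(bits_for_choices(4))          # four base-pair orientations
```

Two independent slices give 2 × 2 = 4 outcomes, which is why no more than 2 bits are needed per base.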

In contrast to the major groove, contacts in the minor groove of B-form DNA allow both orientations of each kind of base pair so that rotations about the dyad axis cannot easily be distinguished. This is because from the minor groove C≡G appears nearly identical to G≡C and A=T appears identical to T=A. Fig. 3 shows that C≡G in the minor groove has the chemical moiety pattern a-d-a, which is, to a good first approximation, identical to the pattern of the complementary orientation. The hydrogen donor N2 of G is almost exactly on the dyad axis (the dashed line) and so its position does not change much in the complement. Hydrogen atoms held in a hydrogen bond vibrate vigorously and probably make such a fine positional distinction difficult because they vibrate almost independently of the donor and acceptor[18]. The base pair A=T has a-(blank)-a, which is identical in the other orientation. So A=T can be distinguished from C≡G in the minor groove only by a donor contact to the N2 or by a physical probe which blocks the N2 (in the blank at the black dot). Because both of these are close to the dyad axis only the horizontal knife slice works for the minor groove.

Because only 2 of the 4 possibilities can be distinguished, when a B-form minor groove is probed by a protein no more than 1 bit of information ($\log_2 2 = 1$ bit) can be obtained. That is, positions with more than 1 bit of information are likely to represent major grooves facing the protein or, if they do represent minor groove contacts, then the DNA is probably not B-form[11]. In the OxyR sequence logo, positions […], […] and […] are conserved by more than 1 bit of information, so these positions probably represent major grooves facing the protein. This is consistent with the correlation between sequence conservation and the face of the DNA discussed earlier.
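The counting argument behind the 2-bit and 1-bit limits can be checked with a short sketch. It uses the moiety patterns quoted in the text (Me = methyl, a = acceptor, d = donor, blank = no group) and models the other orientation of a base pair as the reversed pattern, which is an idealization; the names are hypothetical.

```python
import math

# Patterns read left to right across the groove, as quoted in the text.
MAJOR_GROOVE = {"TA": ("Me", "a", "d", "a"), "CG": ("blank", "d", "a", "a")}
MINOR_GROOVE = {"AT": ("a", "blank", "a"), "CG": ("a", "d", "a")}


def distinguishable_patterns(groove):
    """Collect the distinct patterns seen for both orientations of each pair."""
    patterns = set()
    for pattern in groove.values():
        patterns.add(pattern)
        patterns.add(pattern[::-1])  # same base pair, rotated about the dyad
    return patterns


def max_bits(groove):
    """Upper limit on information obtainable by probing this groove."""
    return math.log2(len(distinguishable_patterns(groove)))


print(max_bits(MAJOR_GROOVE))  # four distinct patterns
print(max_bits(MINOR_GROOVE))  # reversal changes nothing, so only two
```

In the minor groove each pattern is its own mirror image, so the set collapses from four entries to two and the limit drops from 2 bits to 1 bit.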

After this prediction was made, it was confirmed by hydroxyl radical footprinting and missing base experiments[9].

Because position 0 in the wild-type sequence logo shows equally likely A and T (Fig. 2a), that position may represent a contact from OxyR which collides with the minor groove N2 of G (ref.[17]), allowing only A=T and T=A, and disallowing C≡G entirely. This prediction was supported by methylation interference data which indicate that methylation at N3 of A blocks binding (Fig. 2a).

The logo also shows that positions +4, +7 and +13 are predominantly either A or G (T or C at the negative coordinates), which suggests a contact by OxyR to the major groove N7 group because only the N7 acceptor is conserved in an A ↔ G transition (T:Me-a-d-a:A matches C:(blank)-d-a-a:G only in the last moiety, see Fig. 3). This is also confirmed by methylation interference data, but the disproportionate frequencies of these bases suggest that other contacts or effects are also involved.

Position -15 is mostly T or G (A or C at +15), suggesting a weak contact to the major groove A-N6 or C-N4 and/or T-O4 and G-O6 groups. These contacts could be conserved in T ↔ G transversions because they shift by only […] Å (T:Me-a-d-a:A matches G:a-a-d-(blank):C in the middle two moieties, see Fig. 3). This kind of contact is likely to be at position -6 (and +7) of CRP sites[11], where the crystal structure shows that Arg¹⁸⁰ donates hydrogen bonds to O6 and N7 of guanine[19]. The preferred binding order G > T > A […] C is directly visible in the CRP sequence logo[11], and is confirmed by mutations[20,21,19]. Apparently when G is replaced by T in a CRP site, the T-O4 contact is used instead of G-O6, but the G-N7 contact is lost. This accounts for the binding order except for the A. Substitution by A would break the G-O6 contact but would maintain the N7 contact. The binding order T > A suggests that the O4 contact is stronger than the N7 contact[20], even though when T […] G, the O4/O6 contact shifts by […] Å whereas the N7 contact does not move at all.
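The pattern comparisons used in the last two paragraphs can be made mechanical. The sketch below compares the four major groove patterns quoted earlier, position by position, and reports which moieties two base pairs share; the dictionary keys and function name are hypothetical, and matching by label ignores the small positional shifts discussed in the text.

```python
# Major groove patterns, left to right, as quoted in the text.
# Key = left base followed by right base.
MAJOR_GROOVE = {
    "TA": ("Me", "a", "d", "a"),
    "AT": ("a", "d", "a", "Me"),
    "CG": ("blank", "d", "a", "a"),
    "GC": ("a", "a", "d", "blank"),
}


def shared_moieties(pair1, pair2):
    """Return (index, moiety) for every position where the two patterns agree."""
    return [
        (i, m1)
        for i, (m1, m2) in enumerate(zip(MAJOR_GROOVE[pair1], MAJOR_GROOVE[pair2]))
        if m1 == m2
    ]


# A or G on the right-hand strand: only the last acceptor (N7) is shared.
print(shared_moieties("TA", "CG"))
# T or G on the left-hand strand: the middle acceptor and donor are shared.
print(shared_moieties("TA", "GC"))
```

A position that tolerates exactly two bases therefore points to the chemical groups those two base pairs have in common.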

Other positions in the OxyR binding sites show methylation interference that has no apparent correlation to the observed sequence conservation. This could be because the sequence conservation was derived from only a few sequences and so is noisy, as indicated by the large error bars. Alternatively, OxyR may pass close to the DNA at some points but not make actual contact unless the DNA is abnormally methylated. Other effects, such as DNA bending or twisting, also might account for these discrepancies.

A more subtle piece of evidence can be found in the overall shape of the sequence logo. Notice how the stack heights at positions -4, -5, and -7 follow along under the cosine wave (Fig. 2a). This effect can be observed in other sequence logos, in particular λ cI/cro, λ O, 434 cI/cro, CRP, FNR and LexA[11]. It can be explained by the geometry of a globular protein approaching the cylindrical DNA. During the process of finding the binding sites the protein moves toward and away from the DNA[22]. Contacts at the center of the DNA cylinder are closest to the protein and so should be the easiest to evolve. Contacts become progressively more difficult to make as the approach is made further off axis (Fig. 4). If one were to rotate a DNA molecule on its long axis, a point on its surface would cycle between being visible and not visible. This naturally results in a cosine function of accessibility along a linear DNA molecule. If we define accessibility as a cosine function that runs from 0 to 1 bit in the minor groove and another cosine function that runs from 0 to 2 bits in the major groove, then the sum of these two functions is the cosine wave drawn on the logo. The correlation between sequence conservation and accessibility is consistent with the proposal that positions […], […], and […] are read from the major groove.
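The sum of the two accessibility functions can be sketched as follows. Two things here are assumptions rather than statements from the text: the minor groove curve is taken to be exactly out of phase with the major groove curve (the minor groove faces the protein where the major groove faces away), and the helical repeat defaults to 10.6 base pairs per turn (the 36° step in Fig. 3 would correspond to 10). The function and parameter names are hypothetical.

```python
import math


def accessibility(position, peak=0.0, bp_per_turn=10.6):
    """Sum of major and minor groove accessibility, in bits, at a position.

    peak is a position where the major groove faces the protein.
    """
    theta = 2.0 * math.pi * (position - peak) / bp_per_turn
    major = 1.0 + math.cos(theta)        # runs from 0 to 2 bits
    minor = 0.5 - 0.5 * math.cos(theta)  # runs from 0 to 1 bit, opposite phase
    return major + minor


for pos in range(-6, 7):
    print(pos, round(accessibility(pos), 2))
```

Under these assumptions the sum is 1.5 + 0.5 cos θ, a wave that stays between 1 bit (minor groove facing the protein) and 2 bits (major groove facing the protein).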

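To summarize, there are at least four interrelated techniques that can be used to read a sequence logo:

1. Compare the positions of strongest sequence conservation with the face of B-form DNA shown by the cosine wave.
2. Look for positions conserved by more than 1 bit; these probably represent major groove contacts, because a B-form minor groove allows at most 1 bit.
3. Examine which bases share a position (for example A or T, A or G, T or G) to infer the chemical groups likely to be contacted.
4. Check whether the stack heights follow the cosine wave, as expected from the accessibility of the DNA surface to the protein.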

A ``randomization'' experiment[23,11,24] was performed in which OxyR protein was used to gel shift 30 base pair equi-probable random sequences[9]. Unfortunately this gave a dismal logo (Fig. 2b), possibly because the protein was prevented from binding properly by the flanking constant sequence of the vector. The high conservation at the ends ([…] and […]) comes more from one side of the cloning sites and so may represent an artifact. Still, some correlation with Fig. 2a is visible in the logo, in particular the A preference at […], the T preference at […], and the G preference at […]. But other positions are just as conserved and do not reflect the wild-type sequences.

To clarify this situation, the randomization experiment was repeated with 45 base pair equi-probable random sequences which were then aligned by an information theory technique using the malign.p program (Fig. 2c)[9]. Only some of the patterns evident in Fig. 2a were confirmed by this experiment, whereas others became more predominant. The T or C at position […] closely reflects the wild-type sequences there (7 Cs, 5 Ts, 1 G, 1 A). The conservation at […] and […] increased but positions 0 and […] barely increased. The wild-type cluster at […], […] and perhaps […] did materialize but not strongly. The almost insignificantly weak preferences for A at […], T at […], and T at […] of wild-type appear amplified. Additional conservation not seen previously appeared at […] (?), […], […], […] and […]. The reason for these quantitative discrepancies between the wild-type sequence logo and the logo from experimentally selected sites is unknown, but might be accounted for by the small sample sizes.

Another difficulty with this kind of experiment is that it always contains at least one unknown parameter, the stringency of selection. If the concentration of OxyR protein were large, then its non-specific binding should cause more DNA sequences to shift in the gel. This would lead to a sequence logo with a low information content relative to the natural sequences. However, a low concentration of OxyR protein should lead to a much higher measured information content, perhaps higher than is naturally found. This is also a danger in amplification protocols such as SELEX[25].

To counter this, the protein concentration could be adjusted so that the total information content from the randomization experiment matches that of the natural sequences. Presumably the two logos would then look the same. Were they the same for the experiment that was done? The area under the logo ($R_{sequence}$) was […] bits for the 7 wild-type sequences and their complements (Fig. 2a) but was […] bits for the 16 sequences and their complements in the 45mer experiment (Fig. 2c). These can be compared by a two-tailed Student's t test[16]. Because both the sequences and their complements were used, the two halves of the sequence logo are not independent and the test must be done with half-sites. This reduces each mean by a factor of 2 and reduces the variance by a factor of 2 so the standard deviations reduce by $\sqrt{2}$. The number of samples in the half-site set remains at 14 and 32 respectively. The t test on the half-sites shows that the natural sites are significantly different in information content from the experimental set (t=2.7 with 44 degrees of freedom, p < 0.02). Unfortunately, this means the criterion of matching the wild-type binding sites was not met.
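The half-site adjustment and the degrees of freedom can be sketched as below. This is only an illustration of a pooled two-sample Student's t statistic: it assumes the reported ± values are treated as sample standard deviations, which the text does not state, and the input numbers are hypothetical placeholders rather than the measured $R_{sequence}$ values.

```python
import math


def half_site(mean, sd):
    """Convert whole-site values to half-site values.

    The mean and the variance are both halved, so the standard
    deviation shrinks by sqrt(2).
    """
    return mean / 2.0, sd / math.sqrt(2.0)


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, degrees of freedom) for a pooled two-sample t test."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return t, df


# Hypothetical whole-site values (bits), NOT the values from the paper.
wild_mean, wild_sd = half_site(16.0, 2.0)
selected_mean, selected_sd = half_site(20.0, 2.0)

t, df = pooled_t(selected_mean, selected_sd, 32, wild_mean, wild_sd, 14)
print(df, round(t, 2))
```

With 14 and 32 half-site samples the test has 14 + 32 - 2 = 44 degrees of freedom, matching the figure quoted in the text.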

Even so, it is clear that the sequence conservation is not uniformly proportional across the two logos. One possibility that might account for the observed conservation at the edges of the experimentally derived sites is that OxyR is still affected by the constant flanking sequences of the cloning vector and binds to one or the other side in some cases. The discrepancy may also reflect different conditions between in vivo evolutionary factors and the in vitro gel shift experiment. For example, no spermidine was used in the gel shift experiment[9], yet it is well known that spermidine is important for precise recognition by other DNA binding proteins[26,27]. Comparison of the wild-type sequence logo to a series of random gel shift sequence logos might be used to determine precisely what the in vivo binding conditions are.

The case of OxyR demonstrates the usefulness of sequence logos as a replacement for consensus sequences. The pattern bound by the protein is difficult to detect by eye (Fig. 1), and no agreement could be found for a consensus sequence. In contrast, the sequence logos are created automatically and without any ambiguity. They show clear and easily interpretable patterns. Because information theory is quantitative, statistical tests can be applied to collections of binding site sequences.

A consensus replaces the natural frequencies of bases with arbitrarily chosen ones[28]. For example, in our data set[11], CRP position -6 has 2 As, 2 Cs, 44 Gs and 10 Ts. Taking the consensus alters this to 100% G and 0% of the other bases. When the consensus sequence G was chosen by this method, a subtle pattern of sequence conservation was lost[29,30,31,32,33,21,19]. That the T contact occurs naturally apparently went unrecognized until the present work, although the mechanism of the contact was already understood from mutations[20].
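What the consensus discards can be seen by computing the uncertainty of the CRP position -6 counts quoted above, before and after the consensus is taken. This is a minimal sketch with hypothetical function names; it omits the small-sample correction used for the published logos.

```python
import math

# Base counts at CRP position -6, as given in the text.
crp_minus6 = {"A": 2, "C": 2, "G": 44, "T": 10}


def frequencies(counts):
    """Convert base counts to frequencies."""
    total = sum(counts.values())
    return {base: count / total for base, count in counts.items()}


def consensus(counts):
    """Replace the observed frequencies with 100% of the most common base."""
    top = max(counts, key=counts.get)
    return {base: (1.0 if base == top else 0.0) for base in counts}


def uncertainty(freqs):
    """Shannon uncertainty in bits, skipping zero-frequency bases."""
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)


observed = uncertainty(frequencies(crp_minus6))
as_consensus = uncertainty(consensus(crp_minus6))
print(round(observed, 3), round(2.0 - observed, 3))
print(as_consensus, 2.0 - as_consensus)
```

The observed frequencies leave about 1.07 bits of uncertainty, so the position is conserved by a little under 1 bit, whereas the consensus G implies a full 2 bits and hides the 10 naturally occurring Ts.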

The art of predicting specific base contacts is well known[17] but the pervasive use of consensus sequences in the modern molecular biology literature has prevented full use of the available sequence data. In the case of CRP discussed earlier, the subtle G to T switch, which probably destroys one hydrogen bond while keeping the other, was missed because only the G was retained in the consensus model. In contrast, because it visually displays the relevant information in a compact, quantitative form, the sequence logo allows direct interpretation of the data and leads to specific predictions that can guide experimentation. Further, when anomalies appear, the logo displays them so blatantly that new phenomena are revealed[11]. Even correlations between positions in a binding site[15] could be presented in a three-dimensional sequence logo, but software to generate this display has not been written yet.

As seen by the sequence logo, OxyR does not have a special kind of binding site as has been suggested[7]. OxyR merely happens to have a long binding site with low overall information content, so it tends to have low sequence conservation per position. As a consequence of the Second Law of Thermodynamics, DNA-protein contacts tend to spread out over the available surface on an evolutionary time scale. With at least 4 major grooves and three minor grooves to make contacts in, OxyR can ``afford'' to have many small contacts. Paradoxically, a mathematically rigorous theorem shows that having many small contacts like those used by OxyR can improve sequence discrimination[34]. Furthermore, many binding sites are like OxyR in that they have variations in their information content at different positions. This is immediately apparent upon inspection of splice junction sequence logos[15] and the ``gallery'' of DNA recognition sequence logos[11].

The arbitrary and artificial distinctions between strong and weak binding sites, between the ``core'' and the periphery of a site and between the inside and outside of binding site ``boxes'' that have been fostered by the use of consensus sequences are eliminated when one adopts the concept that sequence conservation is a real number that can be measured precisely in bits of information.

DNA sequences to which the OxyR protein binds under oxidizing conditions were analyzed by the sequence logo method, a quantitative graphic technique based on information theory. A sequence logo shows both the sequence conservation and the frequencies of bases at each position in a site. Unlike the consensus sequence, the sequence logo analysis revealed that OxyR should bind to four major grooves of DNA. This was later confirmed by experiments. Detailed interpretation of the sequence logo also allowed the prediction of likely major and minor groove OxyR-DNA base contacts, consistent with available experimental results. Because the sequence logo shows the original base frequencies in a clear, easily interpreted graphic that does not distort the data, highly refined analysis of binding site contacts becomes easy. Not only can these methods be applied to any DNA sequence binding site, they can also be applied to sites on RNA and proteins.

I thank Paul N. Hengen, Denise Rubens, Paul A. Smith, R. Michael Stephens, and R. E. Wolf for useful comments on the manuscript.

**Figure 1:** Aligned list of OxyR binding sequences.
7 OxyR wild-type binding sites (odd numbers) and their complementary sequences (even numbers) were listed by the alist program. The numbers in the bar on the top are read vertically and give the position in the binding site, running from -26 to +26. The GenBank accession number, the coordinate of the zero base, and the number of each sequence are given on the left-hand side of the figure.

**Figure 2:** Sequence logos for OxyR binding sequences.
(A) 7 OxyR wild-type binding sites and their complementary sequences. The total sequence conservation, obtained by adding together the stack heights to determine the ``area'' under the logo (equation (2)), is […] bits per site for the range -22 to +22, with error calculated by program rsim according to [15]. Methylations of guanines at N7 that interfere with OxyR binding are indicated by open circles ($\circ$) and methylations of adenines at N3 that interfere with OxyR binding are indicated by filled triangles ($\blacktriangle$)[35,7]. DNase I protected regions for the sites in Tartaglia et al.[7] are shown by arrows. (B) 38 randomly synthesized sequences selected by OxyR protein, 30 bases wide, and their complementary sequences. […] bits per site in the range -15 to +15. (C) 16 randomly synthesized sequences selected by OxyR protein, 45 bases wide, and their complements. […] bits per site in the range -22 to +22.

**Figure 3:** DNA base pairs
DNA base pairs drawn by the dnag program using coordinates for B-DNA[36] with atomic radii[37]. Short line segments indicate hydrogen bonds between the bases. a: acceptor of hydrogen bond; d: donor of hydrogen bond. The scale is shown by a 1 Å diameter dashed circle placed on the helical axis. The two-fold dyad axis is indicated by a dashed line. Rotation by 180° on this axis brings the backbone sugars and phosphates into register again. The next base pair is 3.38 Å above the page and is rotated 36° counterclockwise.