If one calculates the information in many binding sites, an interesting pattern emerges [20]: the information often comes in two peaks. The peaks are about 10 base pairs apart, which is the distance over which the DNA helix twists once. DNA has two grooves, a wide one and a narrow one, called the major and minor grooves respectively. Using experimental data I found that the peaks of information correspond to places where a major groove faces the protein [20]. (See Fig. 1.2 for an example.)
This effect can be explained by inspecting the structure of bases [21]. There are enough asymmetrical chemical moieties in the major groove to allow all four of the bases to be completely distinguished. Thus any base pair from the set AT, TA, CG and GC is distinct from any other pair in the set. But because of symmetry in the minor groove it is difficult or impossible for a protein contact there to tell AT from TA, and likewise CG is indistinguishable from GC. So a protein can pick 1 of the 4 bases when approaching the DNA from the major groove, and it can make $\log_2 4 = 2$ bits of choice, but from the minor groove it can make only 1 bit of choice because it can distinguish AT from GC but not the orientation ($\log_2 2 = 1$). This shows up in the information curves as a dip that does not go higher than 1 bit where minor grooves face the protein. In contrast, the major groove positions often show sequence conservation near 2 bits.
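The arithmetic behind the two limits can be checked with a short sketch. It assumes the distinguishable classes are equally likely, so the number of bits is just the base-2 logarithm of the number of classes; the function name `bits_of_choice` is made up for illustration.

```python
import math

def bits_of_choice(n_distinguishable: int) -> float:
    """Bits needed to pick one of n equally likely, distinguishable classes.

    Assumption for illustration: every class is equally likely.
    """
    if n_distinguishable < 1:
        raise ValueError("need at least one distinguishable class")
    return math.log2(n_distinguishable)

# Major groove: AT, TA, CG and GC can all be told apart (4 classes).
major_groove_limit = bits_of_choice(4)

# Minor groove: AT looks like TA and CG looks like GC, leaving
# only 2 classes ({AT, TA} versus {CG, GC}).
minor_groove_limit = bits_of_choice(2)

print(f"major groove: {major_groove_limit} bits")
print(f"minor groove: {minor_groove_limit} bits")
```

These are upper bounds on the conservation at a position; a real binding site may show less than the limit.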
There is another effect that the information curves show: as one moves across the binding site the curve rises and falls like a sine wave, following the twist of the DNA. This pretty effect can be explained by understanding how proteins bind DNA and how they evolve [22,23].
Proteins first have to locate the DNA and then they will often skim along it before they find and bind to a specific site. They move around by Brownian motion and also bounce towards and away from the DNA. So during the evolution of the protein it is easiest to develop contacts with the middle of a major groove, because there are many possibilities there. However, given a particular direction of approach to the DNA, contacts farther towards the back side (on the opposite ``face'') would be harder to form and would develop more rarely. So we would expect the DNA accessibility for the major groove to go from 2 bits (when a major groove faces the protein) to zero (when a minor groove faces the protein). The same kind of effect occurs at the same time for the minor groove, but its peak is at 1 bit, where the minor groove faces the protein. The sum of these effects is a sine wave from 2 bits for the major groove down to 1 bit for the minor groove, as observed. The patterns of sequence conservation in DNA follow simple physical principles.
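The way the two accessibility curves add up can be sketched numerically. The raised-cosine shape, the exact 10 base pair period, and all names below are assumptions chosen for illustration; the text fixes only the end points (2 bits and zero for the major groove, a 1 bit peak for the minor groove).

```python
import math

# Assumption: one helical turn is exactly 10 bp (the text says "about 10").
HELIX_PERIOD_BP = 10.0

def _phase(position_bp: float) -> float:
    """Helical phase in radians; position 0 is where a major groove faces the protein."""
    return 2.0 * math.pi * position_bp / HELIX_PERIOD_BP

def major_groove_bits(position_bp: float) -> float:
    """Major groove accessibility: 2 bits when facing the protein, 0 on the far side."""
    return 2.0 * (1.0 + math.cos(_phase(position_bp))) / 2.0

def minor_groove_bits(position_bp: float) -> float:
    """Minor groove accessibility: 1 bit when facing the protein, 0 on the far side."""
    return 1.0 * (1.0 - math.cos(_phase(position_bp))) / 2.0

def envelope_bits(position_bp: float) -> float:
    """Sum of both contributions, equal to 1.5 + 0.5 * cos(phase)."""
    return major_groove_bits(position_bp) + minor_groove_bits(position_bp)

for pos in range(0, 11):
    print(f"{pos:2d} bp: {envelope_bits(pos):.2f} bits")
```

Under these assumptions the sum oscillates between 2 bits (major groove facing the protein) and 1 bit (minor groove facing the protein), which matches the range described above.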