
On Being Blind

Why weren't the waves noticed before? The sine waves in binding site sequences cannot be seen with a method often used to handle sequences. Most molecular biologists will collect binding sites or other sequences, align them, and then determine the most frequent base at each position. This is called a `consensus sequence'.

Suppose that a position in a binding site has 70% A, 10% C, 10% G and 10% T. Then if we make a consensus model of this position, we could call it `A'. This means that when we come to look at new binding sites, 30% of the time we will not recognize the site! If a binding site had 10 positions like this, then we would be wrong (1 - 0.7¹⁰) = 97% of the time! Yet this method is extremely widespread in the molecular biology literature.
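The arithmetic above can be checked with a minimal sketch. It assumes, as the example implicitly does, that the positions are independent and that the consensus model only accepts a site matching the most frequent base at every position. The function name `miss_probability` is introduced here purely for illustration.

```python
def miss_probability(consensus_freq: float, positions: int) -> float:
    """Chance that a real site fails to match the consensus at one or more positions.

    Assumes independent positions, each showing the consensus base
    with probability `consensus_freq`.
    """
    return 1.0 - consensus_freq ** positions


# One position with 70% A is missed 30% of the time;
# ten such positions are missed about 97% of the time.
for n in (1, 5, 10):
    print(f"{n:2d} positions: missed {miss_probability(0.7, n):.1%} of the time")
```

The miss rate climbs quickly with the length of the site, which is why a consensus that looks reasonable at each position can fail badly on whole sites.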

For example, a Fis binding site in the tgt/sec promoter was missed even though four pieces of experimental data pointed to the site. Although the site was 2 bits stronger than an average Fis site, it was overlooked because it did not match the consensus used by the authors [24]. We tested the site experimentally and found that it does indeed bind to Fis [25]. Likewise, the sine waves were missed before information analysis was done because creating a consensus sequence smashes the delicate sequence conservation in natural binding sites. Surprisingly, in retrospect, information theory provides good ``instrumentation'' for understanding the biology of DNA sequences.

In addition, information theory has been shown to be quite useful for biomedical applications. My colleague Pete Rogan found a paper that claimed to have identified a T to C change at a splice acceptor site as the cause of colon cancer. Presumably the authors thought this because the most frequent base at that position is a T. They apparently forgot that almost 50% of the natural sites have a C, so when they came across the T to C change they misinterpreted it as a mutation. Using information theory we were able to show that this is unlikely [26]. Our prediction was confirmed by experimental work which showed that of 20 normal people, 2 people had the change. If the initial claim had been made in a doctor's office it would have been a misdiagnosis, with legal ramifications. Since that time we have analyzed many splice junctions in a variety of genes and we have found that the information theory approach is powerful [27,28,29].

Consensus sequences apparently cause some scientists to make a classical scientific error. The first time that promoter binding site sequences were obtained (by David Pribnow) they were aligned. How can one deal with this fuzzy data? One way is to simplify the data by making a model, the consensus sequence. Although biologists are well aware that consensus sequences frequently fail, they apparently don't recognize that the problem is with the model itself. As a consequence they will often write that there is a consensus site in such and such a location and that, for example, a protein binds to the consensus [31]. That is, they think that the model (a consensus sequence) is the same as the reality (a binding site). But a model of reality is not reality itself. This problem has a Zen-like quality, since even our perceptions are models of reality. Indeed, it is now thought that our minds are running a controlled hallucination that is continuously matching data coming from our senses, and when there is no input or a mismatch, some rather odd illusions occur [32].

We have developed two models that use information theory to get away from the errors caused by using consensus sequences. The first is a graphic called a sequence logo [33]. (An example is Fig. 1.2.) Sequence logos show an average picture of binding sites. Fortunately the mathematics of information theory also allows one to compute the information for individual binding sites and these models are called sequence walkers [34,24]. Many examples of logos and walkers can be found in the references or at my web site.
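As a rough illustration of the two kinds of model, the sketch below computes the average information at a position (the quantity a logo displays as stack height) and the information of an individual site (the quantity a walker displays). It is a simplified sketch, not the published method: it assumes four equally likely bases as the background (2 bits maximum per position), omits the small-sample correction, and uses the hypothetical 70/10/10/10 frequencies from the earlier example at all ten positions. The function names and the two test sequences are invented for illustration.

```python
import math

# Hypothetical frequency table: ten positions, each 70% A and 10% C, G, T.
COLUMN = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
FREQS = [COLUMN] * 10


def position_information(column: dict) -> float:
    """Average information at one position in bits: 2 minus the uncertainty."""
    uncertainty = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - uncertainty


def site_information(site: str, freqs: list) -> float:
    """Information of one site in bits: sum of 2 + log2(frequency of its base)."""
    total = 0.0
    for base, column in zip(site, freqs):
        p = column[base]
        if p == 0:
            return float("-inf")  # base never observed at this position
        total += 2.0 + math.log2(p)
    return total


consensus_site = "AAAAAAAAAA"
typical_site = "CAAGAATAAA"  # 7 A's and 3 other bases, as expected on average

print(f"per-position average: {position_information(COLUMN):.2f} bits")
print(f"consensus site:       {site_information(consensus_site, FREQS):.1f} bits")
print(f"typical site:         {site_information(typical_site, FREQS):.1f} bits")
```

Under these assumptions the consensus site scores about 14.9 bits, while a typical site with three non-consensus bases scores about 6.4 bits, which equals the average over all sites. A consensus model would reject the typical site outright; the information measure still identifies it as an ordinary member of the set.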

Consensus sequences are dangerous to use and should be avoided. Using the best available instrumentation can be critical to science. We should always be aware that we are working with models, because no model fully captures reality.


Tom Schneider
2003-04-04