\documentclass[runningheads]{cl2emult}
\newcommand{\theversion}{{version = 1.23 of lessons2000.tex 2003 April 4 }}
% origin 2000 April 16
%RECOMMENDED%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{makeidx} % allows index generation
\usepackage{graphicx} % standard LaTeX graphics tool
% for including eps-figure files
\usepackage{subeqnar} % subnumbers individual equations
% within an array
\usepackage{multicol} % used for the two-column index
\usepackage{cropmark} % cropmarks for pages without
% pagenumbers
\usepackage{math} % placeholder for figures
\makeindex % used for the subject index
% please use the style sprmidx.sty with
% your makeindex program
%upright Greek letters (example below: upright "mu")
\newcommand{\euler}[1]{{\usefont{U}{eur}{m}{n}#1}}
\newcommand{\eulerbold}[1]{{\usefont{U}{eur}{b}{n}#1}}
\newcommand{\umu}{\mbox{\euler{\char22}}}
\newcommand{\umub}{\mbox{\eulerbold{\char22}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This is a sample input file for your contribution to a multi-
% author book to be published by Springer Verlag.
%
% Please use it as a template for your own input, and please
% follow the instructions for the formal editing of your
% manuscript as described in the file "1readme".
%
% Please send the Tex and figure files of your manuscript
% together with any additional style files as well as the
% PS file to the editor of your book.
%
% He or she will collect all contributions for the planned
% book, possibly compile them all in one go and pass the
% complete set of manuscripts on to Springer.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%OPTIONAL%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%\usepackage{amstex} % useful for coding complex math
%\mathindent\parindent % needed in case "Amstex" is used
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%AUTHOR_STYLES_AND_DEFINITIONS%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%Please reduce your own definitions and macros to an absolute
%minimum since otherwise the editor will find it rather
%strenuous to compile all individual contributions to a
%single book file
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{html}
\bibliographystyle{unsrt}
\newcommand{\rfrequency}{R_{frequency}}
\newcommand{\rsequence}{R_{sequence}}
\newcommand{\todo}{\rule{0.5em}{1ex}}
\newcommand{\todobf}[1]{{\rule{0.5em}{1ex}\textbf{ #1}}}
\begin{document}
%
\title{Some Lessons for Molecular Biology from Information Theory}
%
%
\toctitle{Some Lessons for Molecular Biology from Information Theory}
% \protect\newline in the Particle Deflection Plane}
% allows explicit linebreak for the table of content
%
%
\titlerunning{Lessons for Molecular Biology}
% allows abbreviation of title, if the full title is too long
% to fit in the running head
%
\author{Thomas D. Schneider}
%
\authorrunning{T. D. Schneider}
% if there are more than two authors,
% please abbreviate author list for running head
%
%
\institute{Frederick Cancer Research and Development Center,
P. O. Box B,
Frederick, MD 21702-1201.
toms@alum.mit.edu,
https://alum.mit.edu/www/toms/
% email: \latex{toms@alum.mit.edu.}
% \html{
% {
% \htmladdnormallink
% {toms@alum.mit.edu}
% {mailto:toms@alum.mit.edu}
% }
% }
% \latex{https://alum.mit.edu/www/toms/}
% \html{
% {
% \htmladdnormallink
% {https://alum.mit.edu/www/toms/}
% {https://alum.mit.edu/www/toms/}
% }
% }
}
\maketitle % typesets the title of the contribution
\vspace{12pt}
\noindent
\fbox{\theversion} \\
\fbox{\parbox{11.3cm}{
This paper is a chapter in a book, a festschrift in honour of Prof. J.
N. Kapur, ``Entropy Measures, Maximum Entropy and Emerging
Applications'', which has been published by Springer \cite{Schneider.lessons2003}.
Dr. Karmeshu (Professor, School of Computer and Systems Sciences, Jawaharlal Nehru
University, New Delhi-110067) is the editor. The book is in the
series ``Studies in Fuzziness and Soft Computing'', edited by Prof.
Janusz Kacprzyk.
}
}
\begin{abstract}
\index{abstract}
Applying information theory to molecular biology problems
clarifies many issues.
The topics addressed are:
how there can be precision in molecular interactions,
how much pattern is stored in the DNA for genetic control systems,
and
the roles of
theory violations,
instrumentation,
and
models
in science.
% The abstract\index{abstract} should summarize the contents of the paper
% in at least 70 and at most 150 words; neither too long
% nor too short but to the point!
\end{abstract}
This paper is a short review of a few of the lessons I've learned
from applying Shannon's information theory
to molecular biology. Since there are so many distinct results,
I call this emerging field
`molecular information theory'.
Many of the references and figures
can be found at my web site \cite{T.D.Schneider.web.site},
%\latex{https://alum.mit.edu/www/toms/}\html{{\htmladdnormallink
%{https://alum.mit.edu/www/toms/}
%{https://alum.mit.edu/www/toms/}}},
along with an earlier review \cite{Schneider.nano2}
and a primer on information theory
\cite{SchneiderPrimer}.
\section{
Precision in Biology}
Information theory was first described by Claude Shannon
in 1948 \cite{Shannon1948}. It sets out a mathematical
way to measure the choices made in a system. Although Shannon
concentrated on communications, the mathematics applies equally
well to other fields \cite{Pierce1980}.
In particular, all of the theorems apply
in biology because the same constraints occur in biology
as in communication. For example, if I call you on the phone
and it is a bad connection, I may say `let me call you back'.
Then I hang up. I may even complain to the phone company
who then rips out the bad wires. So the process of
\emph{killing the phone line}
is equivalent to
\emph{selecting against a specific phenotype} in biology.
A second example is the copying of a key.
In biology that's called `replication', and sometimes there are `mutations'.
We go to a hardware store and have a key copied,
but we get home only to find that it doesn't fit the door.
When we return to the person who copied it, they
throw the key away (kill it) and start fresh.
This kind of selection does not occur in straight physics.
It turns out that the
requirement of being able to make
distinct selections is critical to Shannon's channel capacity
theorem \cite{Shannon1949}.
Shannon defined the channel capacity, $C$ (bits per second)
as the maximum
rate that information can be sent through a communications
channel in the presence of thermal noise.
The theorem has two parts.
The first part says that
if the rate $R$ at which one would like to send data
is greater than $C$, one will fail:
at most $C$ bits per second will get through.
The second part is surprising. It says that as long
as $R$ is less than \emph{or equal to} $C$ the error rate
may be made as low as one desires.
The way that Shannon envisioned attaining this result was by encoding
the message before transmission
and decoding it afterwards.
Encoding methods have been explored in the ensuing 50 years
\cite{Gappmair1999,Verdu.McLaughlin1998},
and their successful application is
responsible for the accuracy of our solar-system spanning
communications systems.
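Shannon's claim that coding can drive the error rate down is easiest to see with the simplest (and least efficient) code: send each bit three times and take a majority vote at the receiver. The sketch below only illustrates the principle, not a capacity-achieving code, and the raw bit-flip probability p = 0.1 is an arbitrary assumption:

```python
# Toy illustration of how encoding reduces errors: a 3-fold repetition
# code with majority-vote decoding (far from capacity-achieving).
def decoded_error_rate(p):
    """Probability that the majority vote over 3 noisy copies of a bit
    is wrong, i.e. that 2 or 3 of the copies were flipped."""
    return 3 * p**2 * (1 - p) + p**3

p = 0.1                       # assumed raw bit-flip probability
print(decoded_error_rate(p))  # decoded error rate is well below p
```

The decoded error rate falls from 10\% to under 3\%, at the cost of sending each bit three times; more sophisticated codes approach the capacity $C$ without paying so heavily in rate.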
To construct the channel capacity theorem, Shannon assigned each message
to a point in a high dimensional space.
Suppose that we have a voltmeter
that can be connected by a cable to a battery with a switch.
The switch has two
states, on and off, and so we can send 1 bit of information.
In geometrical terms, we can record the state
(voltage)
as one of two points on a line, such as $X=0$ and $X=1$.
Suppose now that we send two pulses, $X$ and $Y$.
This allows for 4 possibilities,
00, 01, 10 and 11, and these form a square on a plane. If we send
100 pulses, then any particular sequence will be a point in
a 100 dimensional space (hyperspace).
If I send you a message, I first encode it as a string of 1s
and 0s and then send it down the wire. But the wire is hot
and this disturbs the signal \cite{Nyquist1928,Johnson1928}.
So instead of $X$ volts
you would receive
$X \pm \sigma_X$,
a variation around $X$.
There would be a different variation for $Y$:
$Y \pm \sigma_Y$.
$\sigma_X$
and
$\sigma_Y$
are independent because thermal noise does not correlate over time.
Because they are the sum of many random molecular impacts,
for 100 pulses the $\sigma$s would have a Gaussian distribution
if they were plotted on one axis.
But because they are independent,
and the geometrical representation of independence is a right angle,
this represents 100 different directions in the high dimensional
space.
There is no particular direction in the high dimensional space
that is favored by the noise, so it turns out
that the original message will come to the receiver somewhere
on a sphere around the original point
\cite{Shannon1949,Schneider.ccmm,Schneider.nano2}.
What Shannon recognized is that these little noise
spheres have very sharply defined edges.
This is an effect of the high dimensionality:
in traversing from the center of the sphere to the surface
there are so many ways to go that essentially everything is on the surface
\cite{Brillouin1962,Callen1985,Schneider.ccmm}.
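The claim that ``essentially everything is on the surface'' can be checked with a one-line calculation: since volume scales as radius to the $n$th power, the fraction of an $n$-dimensional sphere's volume lying in a thin outer shell of relative thickness eps is $1 - (1 - \mbox{eps})^n$, which approaches 1 as $n$ grows. A minimal sketch (the 5\% shell thickness is an arbitrary choice):

```python
# Fraction of an n-dimensional sphere's volume in an outer shell of
# relative thickness eps: the inner sphere of radius (1 - eps) holds
# only (1 - eps)**n of the total volume.
def outer_shell_fraction(n, eps=0.05):
    return 1.0 - (1.0 - eps) ** n

for n in (3, 10, 100):
    print(n, outer_shell_fraction(n))
```

For 100 dimensions more than 99\% of the volume lies within the outer 5\% shell, which is why the noise spheres have such sharply defined edges.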
If one packs the message spheres
together so that they don't touch (with some error because they
are still somewhat fuzzy) then one can get the channel capacity.
The positions in hyperspace
that we choose for the messages is the \emph{encoding}.
If we were to allow the spheres to intersect (by encoding in a poor way) then
the receiver wouldn't be able to distinguish overlapping messages.
The crucial
point is that we must choose non-overlapping spheres.
This only matters in human and animal communications systems
where failure can mean death.
It does not happen to rocks on the moon because there is
no consequence for `failure' in that case.
So Shannon's channel capacity theorem only applies when there is
a living creature associated with the system.
From this I conclude that Shannon is a biologist
and that his theorem is about biology.
The capacity theorem can be constructed for biological molecules
that interact or have different states \cite{Schneider.ccmm}.
This means that these molecular machines are capable of making precise
choices. Indeed, biologists know of many amazingly specific
interactions; the theorem shows that not only is this possible
but that
\textbf{biological systems can evolve to have as few
errors as necessary for survival}.
%%% figure %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[phtb] % h here; t top; b bottom; p page of floats
%\htmlimage{transparent,thumbnail=1.0}
%\vspace{15cm}
%\special{psfile="sequencelogo.ps"
% hoffset=-20 voffset=-20
% hscale=60 vscale=60
% angle=0}
\framebox{\parbox{11.3cm}{
\begin{enumerate}
\item
The number of bases $b \in \{a,c,g,t\}$
at each position $l$ in a set of aligned binding sites is
called $n(b,l)$.
The total number of sequences at a given position is
\begin{equation}
n(l) = \sum_{b=a}^{t} n(b,l),
\label{eqn.nl}
\end{equation}
where the sum is over all 4 bases $b$.
In Fig. \ref{fig.sequencelogo},
the range of $l$ is from $-9$ to $+9$ bases
and
$n(l) = 12$ for all positions.
Often data will be missing, in which case $n(l)$ will vary with position
$l$.
\item
The frequency of bases at each position is then computed
as
\begin{equation}
f(b,l) = \frac{n(b,l)}{n(l)}.
\label{eqn.fbl}
\end{equation}
\item
Shannon's uncertainty \cite{SchneiderPrimer,Shannon1948}
is estimated from
\begin{equation}
H = -\sum_{i=1}^{M} f_i \log_2 f_i + e(n)
\; \; \; \; \; \; \mbox{{\rm(bits/symbol)}}
\label{eqn.H}
\end{equation}
for $M$ symbols,
where $e(n)$ is a correction
for replacing the probability of the $i^{\rm th}$ symbol with a frequency $f_i$,
which leads to a
small-sample bias when the number of samples $n$ is small
\cite{Schneider1986}.
\item
Protein-DNA interactions are modeled with two thermodynamic states,
\textit{before}
and
\textit{after}
binding \cite{Schneider.nano2}.
Before a protein binds DNA, all four bases are possible,
so $f_i$ is the frequency of each base in the genome, about $0.25$,
and equation \ref{eqn.H} reduces to:
\begin{equation}
H_{before} \cong 2
\; \; \; \; \; \; \mbox{{\rm(bits/base)}}.
\label{eqn.Hbefore}
\end{equation}
After binding, the uncertainty is computed by equation \ref{eqn.H}
for each position $l$ across the set of aligned binding sites,
using equation \ref{eqn.fbl} and $f_i = f(b,l)$:
\begin{equation}
H_{after} =
H(l) = -\sum_{b=a}^{t} f(b,l) \log_2 f(b,l) + e(n(l))
\; \; \; \; \; \; \mbox{{\rm(bits/base)}}
\label{eqn.Hafter}
\end{equation}
\item
The information at each position is
the \emph{decrease in uncertainty}
from \emph{before} to \emph{after} binding:
\begin{equation}
\rsequence(l) = H_{before} - H_{after}
\; \; \; \; \; \; \mbox{{\rm(bits/base)}}
\label{eqn.rsequencel}
\end{equation}
$R$ stands for a `rate', in this case information gain in bits \emph{per base}.
\item
If the positions in a binding site are independent (which is generally true,
but can be tested \cite{Stephens.Schneider.Splice})
then the total information at the binding sites
is the sum of the information over all positions:
\begin{equation}
\rsequence = \sum_{l} \rsequence(l)
\; \; \; \; \; \; \mbox{{\rm(bits/site)}}.
\label{eqn.rsequence}
\end{equation}
\end{enumerate}
% Although xdvi and ghostview shows a complete box,
% when I print to a printer the bottom of the box has
% holes on the left and right side. This apparently is a bug
% because the last text is an equation. So to get around this
% the following line of code was inserted.
\vspace{1pt} % fix the gap that the equation causes.
}}
\caption{Method of computing information content at protein binding
sites ($\rsequence$) from DNA sequences.}
\label{box.infocomputation}
\end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
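The six steps of Fig. \ref{box.infocomputation} translate directly into a short program. The sketch below follows equations \ref{eqn.fbl} through \ref{eqn.rsequence} but omits the small-sample correction $e(n)$ for brevity, and the aligned sequences are invented for illustration only:

```python
import math

def rsequence(sites):
    """Information content (bits per site) of a set of aligned binding
    sites, following the steps of the figure; the e(n) small-sample
    correction is omitted in this sketch."""
    length = len(sites[0])
    total = 0.0
    for l in range(length):
        column = [s[l] for s in sites]
        h_after = 0.0
        for b in "acgt":
            f = column.count(b) / len(column)   # f(b,l), eqn. fbl
            if f > 0:
                h_after -= f * math.log2(f)     # H(l), eqn. Hafter
        total += 2.0 - h_after                  # H_before = 2 bits/base
    return total

sites = ["tataat", "tacaat", "tatact", "tatgat"]  # invented example data
print(round(rsequence(sites), 2))
```

Perfectly conserved positions contribute 2 bits each; positions where all four bases are equally frequent contribute nothing.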
\section{The Address is the Message}
Keys select one lock in a set of locks and so are capable
(with a little motive force from us) of making a `choice'.
The base 2 logarithm of the number of choices is the number
of bits.
(More details about information theory are
described in a \emph{Primer} \cite{SchneiderPrimer}.)
In a similar way, there are many proteins that locate
and stick to specific spots on the genome.
These proteins turn on and off genes and perform many
other functions.
When one collects the DNA sequences
from these spots,
which are typically 10 to 20 base pairs long,
one finds that they are not all exactly the same.
Using Shannon's methods, we can calculate the amount
of information in the binding sites, and I call this $\rsequence$
because it is a rate of information measured in units of bits per site as
computed from the sequences \cite{Schneider1986}.
(See figure \ref{box.infocomputation} for the details of this computation.)
For example, in our cells the DNA is copied to RNA and then
big chunks of the RNA are cut out. This splicing operation
depends on patterns at the two ends of the segment that gets removed.
One of the end spots is called the donor
and the other is called the acceptor.
Let's focus on the acceptor because the story there is
simple (what's happening at the donor is beyond the scope of this paper).
Acceptor sites can be described by about 9.4 bits of information
on the average
\cite{Stephens.Schneider.Splice}.
Why is it that number?
A way to answer this is to see how the information is used.
In this case acceptor sites occur at a frequency of roughly
one every 812 positions
along the RNA.
So the splicing machinery
has to pick one spot from 812 spots, or $\log_2 812 \approx 9.7$ bits;
this is called $\rfrequency$ (bits per site).
So
\textbf{the amount of
pattern at a binding site
($\rsequence$)
is just enough for it to be found in the genome
($\rfrequency$)}.
Also,
notice that we are using the fact that
the capacity theorem says that it is possible for the sites to be distinguished
from the rest of the genome.
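The $\rfrequency$ side of the comparison is just a logarithm of the ratio of potential positions to actual sites. A minimal sketch, using the acceptor-site numbers from the text:

```python
import math

def rfrequency(potential_positions, sites=1):
    """Bits needed to pick the true sites out of all potential positions."""
    return math.log2(potential_positions / sites)

# one acceptor site per ~812 positions along the RNA (from the text)
print(round(rfrequency(812), 1))  # prints 9.7
```

The measured $\rsequence$ of about 9.4 bits per site is close to this 9.7 bits, which is the observation the section describes.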
\section{Breaking the Rules}
Within 5 days of discovering that
$\rsequence \approx \rfrequency$
for a number of genetic systems,
I found an apparent exception
\cite{Schneider1986}.
The virus T7 infects
the bacterium
\emph{Escherichia coli} and replaces the host RNA polymerase
with its own. These T7 polymerases bind to sites that have
about $\rsequence = 35.4$ bits of information on the average.
If we compute
how much information is needed to locate the sites, it is only
$\rfrequency = 16.5$ bits.
So there is twice as much information at the sites as is needed
to find them.
The idea that
$\rsequence \approx \rfrequency$
is the first hypothesis of molecular information theory.
As in physics, if we are building a theory and we find a violation,
we have two choices: junk the theory or recognize that we
have discovered a new phenomenon.
One possibility would be that the T7 polymerase really uses
all the information at its binding sites. I tested this idea
at the lab bench
by making many variations of the promoters and then seeing how
much information
is
left among those that still function
strongly. The result was $18 \pm 2$ bits \cite{Schneider1989},
which is reasonably close to
$\rfrequency$.
So the polymerase does not use all of the information available to
it in the DNA!
An analogy, due to Matt Yarus, is that if we have a town
with 1000 houses we should expect to see $\log_{10}1000 = 3$ digits
on each house so that the mail can be delivered.
(The analogy as is does not match the biology perfectly, but one
can change it to match
\cite{Schneider.nano2}.)
Suppose we come across a town and count 1000 houses, but
each house has 6 digits on it. A simple explanation is that
there are two delivery systems that do not share
digits with each other.
In biological terms, this means that there could be another
protein binding at T7 promoters. We are looking for it in the
lab.
Some years after making this discovery, I asked one of my
students, Nate Herman, to analyze the repeat sequences in
a replicating ring of DNA called the F plasmid that makes
bacteria male. (Yes, they grow little pili ...)
He did the analysis
but did not do the binding sites
I wanted because we were both ignorant of
F biology at that time.
Nate found that the \emph{incD} repeats contain 60 bits
of information but only 20 bits would be needed to find the sites.
The implication is that three proteins bind there.
Surprisingly,
when we looked in the literature we found that
an experiment had already been done that shows three proteins
bind to that DNA \cite{Hayakawa1985,Herman.Schneider1992}!
It seems that
\textbf{we can predict the minimum number of proteins
that bind to DNA}.
%%% figure %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[htb] % h here; t top; b bottom; p page of floats
\htmlimage{transparent,thumbnail=1.0}
\vspace{15cm}
\special{psfile="sequencelogo.ps"
hoffset=-20 voffset=-20
hscale=60 vscale=60
angle=0}
\caption{Sequence logo for the 6 sequences (and their complements)
bound by both the bacteriophage $\lambda$ cI repressor and the cro proteins.
The sequences are given $5'$ to $3'$.
The method of computing the stack heights is given in
Fig. \ref{box.infocomputation}.}
\label{fig.sequencelogo}
\end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Waves in DNA Patterns}
If one calculates the information in many binding sites
an interesting pattern emerges \cite{Papp.helixrepa}:
the information often comes in two peaks.
The peaks are about 10 base pairs apart, which is the distance
over which the DNA helix twists once.
DNA has two grooves, a wide one
and a narrow one, called the major and minor groove respectively.
Using experimental data
I found that the peaks of information
correspond to places where a major groove faces the protein
\cite{Papp.helixrepa}.
(See Fig. \ref{fig.sequencelogo} for an example.)
This effect can be explained by inspecting the structure of bases
\cite{Seeman1976}.
There are enough asymmetrical chemical moieties in the major groove to allow
all four of the bases to be completely distinguished.
Thus any base pair from the set
AT, TA, CG and GC
is distinct from
any other pair in the set.
But because of symmetry in the minor groove it is
difficult or impossible for a protein contact there to tell AT from TA,
while CG is indistinguishable from GC.
So a protein can pick 1 of the 4
bases when approaching the DNA
from the major groove and it can make
$\log_2 4 = 2$ bits of choices,
but from the minor groove it can make only 1
bit of choice because it can distinguish AT from
GC but not the orientation ($\log_2 2 = 1$).
This shows up in the information curves
as a dip that does not go higher than 1 bit where minor grooves
face the protein.
In contrast, the major groove positions often show sequence
conservation near 2 bits.
There is another effect that the information curves show:
as one moves across the binding site the curve increases and decreases
as a sine wave
according to the twist of the DNA.
This pretty effect can be explained by understanding how proteins
bind DNA and how they evolve \cite{Schneider.oxyr,Schneider.ev2000}.
Proteins first have to locate the DNA and then they will often skim
along it before they find and bind to a specific site.
They move around by Brownian motion and also bounce towards
and away from the DNA.
So during the evolution of the protein it is easiest to develop
contacts with the middle of a major groove, because there are
many possibilities there.
However, given a particular direction of approach to the DNA,
contacts more towards the back side
(on the opposite ``face'')
would be harder to form and would
develop more rarely.
So we would expect the DNA accessibility for the major groove
to go from 2 bits (when a major groove faces the protein)
to zero (when a minor groove faces the protein).
The same kind of effect occurs
at the same time
for the minor groove but the peak
is at 1 bit. The sum of these effects is a sine wave from 2 bits
for the major groove down to 1 bit for the minor groove,
as observed.
\textbf{The patterns of sequence conservation in DNA follow simple
physical principles.}
\section{On Being Blind}
Why weren't the waves noticed before?
The sine waves in binding site sequences cannot be seen
with a method often used to handle sequences.
Most molecular biologists will collect binding sites or other
sequences, align them, and then determine the most frequent
base at each position. This is called a `consensus sequence'.
Suppose that a position in a binding site has
70\% A,
10\% C,
10\% G
and
10\% T.
Then if we make a consensus model of this position, we could call it
`A'. This means that when we come to look at new binding sites,
30\% of the time we will not recognize the site!
If a binding site had
10 positions like this, then we would be wrong
$(1 - 0.7^{10}) \approx 97$\% of the time!
Yet this method is extremely widespread in the molecular
biology literature.
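The failure rate of a consensus model follows from simple probability: if each position independently matches the consensus with probability 0.7, a 10-position site survives all ten tests with probability $0.7^{10}$. A quick check of the figures in the text:

```python
# Chance that a real 10-position site fails to match its consensus when
# each position independently carries the consensus base 70% of the time.
p_match, positions = 0.7, 10
p_miss = 1.0 - p_match ** positions
print(f"{p_miss:.0%}")   # about 97% of real sites rejected
```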
For example, a Fis binding site in the \emph{tgt/sec} promoter
was missed even though four pieces of experimental data
pointed to the site.
Although the site was 2 bits stronger
than an average Fis site, it was overlooked because
it did not match the consensus used by the authors \cite{Schneider.walker}.
We tested the site experimentally and found that it does indeed bind to Fis
\cite{Hengen.fisinfo}.
Likewise the sine waves were missed
before information analysis was done
because creating
a consensus sequence smashes the delicate sequence conservation
in natural binding sites.
Surprisingly, in retrospect, information theory provides
good ``instrumentation'' for understanding the biology of DNA sequences.
%This is not a game.
In addition,
information theory has been shown to be quite useful for biomedical
applications.
My colleague Pete Rogan found a paper that claimed to have
identified a T to C change at a splice acceptor site
as the cause of colon cancer.
Presumably, the reason that the authors
thought this is that the most frequent
base at that position is a T.
Then they apparently forgot that almost 50\%
of the natural sites have a C,
so when they came across the T to C change
it was misinterpreted as a mutation.
Using information theory we were able to show that this is unlikely
\cite{Rogan.Schneider.hmsh2.1995}. Our prediction was confirmed
by experimental work which showed that of 20 normal people, 2 people
had the change. If the initial claim had been made in a doctor's
office it would have been a misdiagnosis, with legal ramifications.
Since that time we have analyzed many splice junctions
in a variety of genes
and we have found that the information theory approach
is powerful
\cite{Rogan.Faux.Schneider1998,Kannabiran.Hejtmancik1998,Allikmets.Dean1998,%
Khan.Kraemer1998}.
Consensus sequences apparently cause
some scientists to make
a classical scientific error.
The first time that promoter
binding site sequences were obtained (by David Pribnow),
they were aligned, and the pattern was fuzzy. How can one deal with such fuzzy data?
One way is to simplify the data by making a model, the consensus sequence.
Although biologists are well aware that these frequently fail,
they apparently don't recognize that the problem is with the model itself,
and as a consequence they will often write that there is a consensus
site in such and such a location
and that, for example, a protein binds to the consensus \cite{Speck.Messer1997}.
That is, they think that the \emph{model} (a consensus sequence)
is the same as the \emph{reality} (a binding site).
But a model of reality is not reality itself.
This problem has a Zen-like quality, since even our perceptions
are models of reality.
Indeed, it is now thought that
our minds are running
a controlled hallucination
that is continuously matching data coming from our senses,
and when there is no input or a mismatch, some rather odd illusions occur
\cite{Ramachandran.Blakeslee1998}.
We have developed two models that use information theory to get
away from the errors caused by using consensus sequences.
The first is a graphic called a sequence logo \cite{Schneider.Stephens1990}.
(An example is Fig. \ref{fig.sequencelogo}.)
Sequence logos show an average picture of binding sites.
Fortunately the mathematics of information theory also allows one
to compute the information for individual binding sites
and
these models are called sequence walkers \cite{Schneider.Ri,Schneider.walker}.
Many examples of logos and walkers
can be found in the references or at my web site.
\textbf{Consensus sequences are dangerous to use and should be avoided.
Using the best available instrumentation can be critical to science.
We should always be aware that we are working
with models, because no model fully captures reality.}
\section{Acknowledgments}
I thank Karen Lewis, Ilya Lyakhov, Ryan Shultzaberger,
Herb Schneider,
Denise Rubens,
Shu Ouyang
and
Pete Lemkin
for comments on the manuscript.
% *******************************************************************************
% the raggedright makes the URLs typeset without big blanks:
\begin{raggedright}
\begin{thebibliography}{10}
\addcontentsline{toc}{section}{References}
\bibitem{Schneider.lessons2003}
T.~D. Schneider.
\newblock Some lessons for molecular biology from information theory.
\newblock In J.~Kacprzyk, editor, {\em Entropy Measures, Maximum Entropy
Principle and Emerging Applications. Special Series on Studies in Fuzziness
and Soft Computing. (Festschrift Volume in honour of Professor J.N. Kapur,
Jawaharlal Nehru University, India)}, volume 119, pages 229--237, New York,
2003. Springer-Verlag.
\newblock Errata to the book: the two figures given in the paper are missing
from the book!
\bibitem{T.D.Schneider.web.site}
T.~D. Schneider, 2000.
\newblock \htmladdnormallink {https://alum.mit.edu/www/toms/}
{https://alum.mit.edu/www/toms/}.
\bibitem{Schneider.nano2}
T.~D. Schneider.
\newblock Sequence logos, machine/channel capacity, {Maxwell}'s demon, and
molecular computers: a review of the theory of molecular machines.
\newblock {\em Nanotechnology}, 5:1--18, 1994.
\newblock \htmladdnormallink
{https://alum.mit.edu/www/toms/paper/nano2/}
{https://alum.mit.edu/www/toms/paper/nano2/}.
\bibitem{SchneiderPrimer}
T.~D. Schneider.
\newblock {\em Information Theory Primer}.
\newblock \htmladdnormallink
{https://alum.mit.edu/www/toms/paper/primer/}
{https://alum.mit.edu/www/toms/paper/primer/}, 1995.
\bibitem{Shannon1948}
C.~E. Shannon.
\newblock A mathematical theory of communication.
\newblock {\em Bell System Tech. J.}, 27:379--423, 623--656, 1948.
\newblock \htmladdnormallink
{http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html}
{http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html}.
\bibitem{Pierce1980}
J.~R. Pierce.
\newblock {\em An Introduction to Information Theory: Symbols, Signals and
Noise}.
\newblock Dover Publications, Inc., New York, second edition, 1980.
\bibitem{Shannon1949}
C.~E. Shannon.
\newblock Communication in the presence of noise.
\newblock {\em Proc. IRE}, 37:10--21, 1949.
\bibitem{Gappmair1999}
W.~Gappmair.
\newblock {Claude E. Shannon: The} 50th anniversary of information theory.
\newblock {\em IEEE Communications Magazine}, 37(4):102--105, April 1999.
\bibitem{Verdu.McLaughlin1998}
S.~Verd\'{u} and Steven~W. McLaughlin.
\newblock {\em Information Theory: 50 Years of Discovery}.
\newblock IEEE Press, New York, 1998.
\bibitem{Nyquist1928}
H.~Nyquist.
\newblock Thermal agitation of electric charge in conductors.
\newblock {\em Physical Review}, 32:110--113, 1928.
\bibitem{Johnson1928}
J.~B. Johnson.
\newblock Thermal agitation of electricity in conductors.
\newblock {\em Physical Review}, 32:97--109, 1928.
\bibitem{Schneider.ccmm}
T.~D. Schneider.
\newblock Theory of molecular machines. {I. Channel} capacity of molecular
machines.
\newblock {\em J. Theor. Biol.}, 148:83--123, 1991.
\newblock \htmladdnormallink {https://alum.mit.edu/www/toms/paper/ccmm/}
{https://alum.mit.edu/www/toms/paper/ccmm/}.
\bibitem{Brillouin1962}
L.~Brillouin.
\newblock {\em Science and Information Theory}.
\newblock Academic Press, Inc., New York, second edition, 1962.
\bibitem{Callen1985}
H.~B. Callen.
\newblock {\em Thermodynamics and an Introduction to Thermostatistics}.
\newblock John Wiley \& Sons, Ltd., N. Y., second edition, 1985.
\bibitem{Schneider1986}
T.~D. Schneider, G.~D. Stormo, L.~Gold, and A.~Ehrenfeucht.
\newblock Information content of binding sites on nucleotide sequences.
\newblock {\em J. Mol. Biol.}, 188:415--431, 1986.
\newblock \htmladdnormallink
{https://alum.mit.edu/www/toms/paper/schneider1986/}
{https://alum.mit.edu/www/toms/paper/schneider1986/}.
\bibitem{Stephens.Schneider.Splice}
R.~M. Stephens and T.~D. Schneider.
\newblock Features of spliceosome evolution and function inferred from an
analysis of the information at human splice sites.
\newblock {\em J. Mol. Biol.}, 228:1124--1136, 1992.
\newblock \htmladdnormallink
{https://alum.mit.edu/www/toms/paper/splice/}
{https://alum.mit.edu/www/toms/paper/splice/}.
\bibitem{Schneider1989}
T.~D. Schneider and G.~D. Stormo.
\newblock Excess information at bacteriophage {T7} genomic promoters detected
by a random cloning technique.
\newblock {\em Nucleic Acids Res.}, 17:659--674, 1989.
\bibitem{Hayakawa1985}
Y.~Hayakawa, T.~Murotsu, and K.~Matsubara.
\newblock Mini-{F} protein that binds to a unique region for partition of
mini-{F} plasmid {DNA}.
\newblock {\em J. Bacteriol.}, 163:349--354, 1985.
\bibitem{Herman.Schneider1992}
N.~D. Herman and T.~D. Schneider.
\newblock High information conservation implies that at least three proteins
bind independently to {F} plasmid {{\em incD\/}} repeats.
\newblock {\em J. Bacteriol.}, 174:3558--3560, 1992.
\bibitem{Papp.helixrepa}
P.~P. Papp, D.~K. Chattoraj, and T.~D. Schneider.
\newblock Information analysis of sequences that bind the replication initiator
{RepA}.
\newblock {\em J. Mol. Biol.}, 233:219--230, 1993.
\bibitem{Seeman1976}
N.~C. Seeman, J.~M. Rosenberg, and A.~Rich.
\newblock Sequence-specific recognition of double helical nucleic acids by
proteins.
\newblock {\em Proc. Natl. Acad. Sci. USA}, 73:804--808, 1976.
\bibitem{Schneider.oxyr}
T.~D. Schneider.
\newblock Reading of {DNA} sequence logos: Prediction of major groove binding
by information theory.
\newblock {\em Meth. Enzym.}, 274:445--455, 1996.
\newblock \htmladdnormallink {https://alum.mit.edu/www/toms/paper/oxyr/}
{https://alum.mit.edu/www/toms/paper/oxyr/}.
\bibitem{Schneider.ev2000}
T.~D. Schneider.
\newblock Evolution of biological information.
\newblock {\em Nucleic Acids Res.}, 28(14):2794--2799, 2000.
\newblock \htmladdnormallink {https://alum.mit.edu/www/toms/paper/ev/}
{https://alum.mit.edu/www/toms/paper/ev/}.
\bibitem{Schneider.walker}
T.~D. Schneider.
\newblock Sequence walkers: a graphical method to display how binding proteins
interact with {DNA} or {RNA} sequences.
\newblock {\em Nucleic Acids Res.}, 25:4408--4415, 1997.
\newblock \htmladdnormallink
{https://alum.mit.edu/www/toms/paper/walker/}
{https://alum.mit.edu/www/toms/paper/walker/}, erratum: NAR 26(4):
1135, 1998.
\bibitem{Hengen.fisinfo}
P.~N. Hengen, S.~L. Bartram, L.~E. Stewart, and T.~D. Schneider.
\newblock Information analysis of {Fis} binding sites.
\newblock {\em Nucleic Acids Res.}, 25(24):4994--5002, 1997.
\newblock \htmladdnormallink
{https://alum.mit.edu/www/toms/paper/fisinfo/}
{https://alum.mit.edu/www/toms/paper/fisinfo/}.
\bibitem{Rogan.Schneider.hmsh2.1995}
P.~K. Rogan and T.~D. Schneider.
\newblock Using information content and base frequencies to distinguish
mutations from genetic polymorphisms in splice junction recognition sites.
\newblock {\em Human Mutation}, 6:74--76, 1995.
\newblock \htmladdnormallink
{https://alum.mit.edu/www/toms/paper/colonsplice/}
{https://alum.mit.edu/www/toms/paper/colonsplice/}.
\bibitem{Rogan.Faux.Schneider1998}
P.~K. Rogan, B.~M. Faux, and T.~D. Schneider.
\newblock Information analysis of human splice site mutations.
\newblock {\em Human Mutation}, 12:153--171, 1998.
\newblock \htmladdnormallink {https://alum.mit.edu/www/toms/paper/rfs/}
{https://alum.mit.edu/www/toms/paper/rfs/}.
\bibitem{Kannabiran.Hejtmancik1998}
C.~Kannabiran, P.~K. Rogan, L.~Olmos, S.~Basti, G.~N. Rao, M.~Kaiser-Kupfer,
and J.~F. Hejtmancik.
\newblock {Autosomal dominant zonular cataract with sutural opacities is
  associated with a splice mutation in the $\beta$A3/A1-crystallin gene}.
\newblock {\em Mol Vis}, 4:21, 1998.
\bibitem{Allikmets.Dean1998}
R.~Allikmets, W.~W. Wasserman, A.~Hutchinson, P.~Smallwood, J.~Nathans, P.~K.
Rogan, T.~D. Schneider, and M.~Dean.
\newblock Organization of the {ABCR} gene: analysis of promoter and splice
junction sequences.
\newblock {\em Gene}, 215:111--122, 1998.
\newblock \htmladdnormallink {https://alum.mit.edu/www/toms/paper/abcr/}
{https://alum.mit.edu/www/toms/paper/abcr/}.
\bibitem{Khan.Kraemer1998}
S.~G. Khan, H.~L. Levy, R.~Legerski, E.~Quackenbush, J.~T. Reardon, S.~Emmert,
A.~Sancar, L.~Li, T.~D. Schneider, J.~E. Cleaver, and K.~H. Kraemer.
\newblock {Xeroderma Pigmentosum Group C} splice mutation associated with
  mutism and hypoglycinemia -- {A} new syndrome?
\newblock {\em J. Investigative Dermatology}, 111:791--796, 1998.
\bibitem{Speck.Messer1997}
C.~Speck, C.~Weigel, and W.~Messer.
\newblock {From footprint to toeprint: a close-up of the DnaA box, the binding
site for the bacterial initiator protein DnaA}.
\newblock {\em Nucleic Acids Res.}, 25:3242--3247, 1997.
\bibitem{Ramachandran.Blakeslee1998}
V.~S. Ramachandran and S.~Blakeslee.
\newblock {\em Phantoms in the Brain: Probing the Mysteries of the Human Mind}.
\newblock William Morrow \& Co, New York, 1998.
\bibitem{Schneider.Stephens1990}
T.~D. Schneider and R.~M. Stephens.
\newblock Sequence logos: A new way to display consensus sequences.
\newblock {\em Nucleic Acids Res.}, 18:6097--6100, 1990.
\newblock \htmladdnormallink
{https://alum.mit.edu/www/toms/paper/logopaper/}
{https://alum.mit.edu/www/toms/paper/logopaper/}.
\bibitem{Schneider.Ri}
T.~D. Schneider.
\newblock Information content of individual genetic sequences.
\newblock {\em J. Theor. Biol.}, 189(4):427--441, 1997.
\newblock \htmladdnormallink {https://alum.mit.edu/www/toms/paper/ri/}
{https://alum.mit.edu/www/toms/paper/ri/}.
\end{thebibliography}
\end{raggedright}
\end{document}