**Thomas D. Schneider**

**version = 2.32 of ridebate.tex 1999 November 3
Journal of Theoretical Biology, 201: 87-92, 1999**

Although information theory was developed more than 50 years ago
[Shannon, 1948,Shannon, 1949],
it is widely accepted
[Gappmair, 1999],
and a complete compendium of Claude
Shannon's works
was recently published
[Sloane & Wyner, 1993].
The application of information theory to understanding binding sites
of proteins on DNA or RNA was published more than 10 years ago
[Schneider *et al.*, 1986], and
since then it has been profitably used to study
many genetic systems
(see
http://www.lecb.ncifcrf.gov//).
Shannon measured information as an average property of signals
passing through a communications channel,
so a natural extension is to understand the information contributed
by individual symbols.
The same extension can be applied to the study of binding sites
as an ``individual information theory''
[Schneider, 1997*a*,Schneider, 1997*b*]
and this has also been successfully used to understand a variety
of genetic and medically relevant systems
[Hengen *et al.*, 1997,Rogan *et al.*, 1998,Allikmets *et al.*, 1998,Shultzaberger & Schneider, 1999,Zheng *et al.*, 1999].
Dr. Stormo subsequently published a letter in this journal
promoting an alternative to the Shannon approach
and pointing out some consequences of that choice
[Stormo, 1998].
In this letter I will address other consequences and interpretations
of the two approaches.
However,
before addressing the deep and difficult issues that Dr. Stormo has
raised,
which we have been discussing for more than 15 years,
I would like to make some small factual corrections.

First,
the Staden method
[Staden, 1984]
*is* discussed in my
*J. Theor. Biol.*
paper [Schneider, 1997*a*].
Staden's method has no cutoff, while the
individual information (*R*_{i})
method has a natural one and
although they are similar, no one derived the *R*_{i} formula from Staden's
approach. I did not derive the *R*_{i} method from Staden; it is
a natural extension of information theory inspired by Tribus
[Tribus, 1961].
The connection between the information
contributed by individual binding sites
(as represented by the
sequence walker
computer graphics
[Schneider, 1997*b*])
and their ensemble average
(as represented by the
sequence logo
computer graphics
[Schneider & Stephens, 1990])
is not obvious from the Staden approach,
nor is the relationship to energy
[Schneider, 1991*b*].

Second,
in his letter
[Stormo, 1998]
Dr. Stormo implied that I ``claim an inequality
relationship with the enthalpy of binding''.
My papers do not claim any relationship with enthalpy; indeed
I have not published the word ``enthalpy'' before now.
While it is possible for *q* in the Second Law
to refer to enthalpy (the increase in entropy of the surroundings of a system),
the more appropriate measure for molecular machines is the
total dissipation, and this corresponds to the free energy.
(In this letter I use the terms energy and free energy synonymously.)
At this point it would appear that we finally agree,
but information is not energy as will be discussed in section II below.

**I. What Does Dr. Stormo's I_{seq} Measure?**

**1. I_{seq} is not a state function.**

Suppose a recognizer makes a series of steps between regions of a genome that have different base compositions and eventually returns to its starting point. The *I*_{seq} gained around this loop is the sum of the *I*_{seq} values of the individual steps:

*I*_{seq,loop} = Σ_{steps} *I*_{seq,step}   (1)

since ``information'' is (supposedly) additive for each independent event, and each step gives a zero or positive value. This holds irrespective of whether or not the last step is independent, because the function is nonnegative. Therefore the sum around a loop is nonnegative:

*I*_{seq,loop} ≥ 0.   (2)

The only condition where *I*_{seq,loop} = 0 is where no step had a change. By making many excursions to different composition regions of the genome, a recognizer would gain an arbitrarily large (and variable) information by Dr. Stormo's measure. In contrast, free energy and entropy are state functions: their change around any closed loop is exactly zero, so their values depend only on the state of the system and not on the path by which it was reached.
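The loop argument can be sketched numerically. This is a hedged illustration, assuming *I*_{seq} for one step takes the relative-entropy form Σ_{b} *f*(*b*) log_{2}[*f*(*b*)/*p*(*b*)], with two region compositions invented for the example (not taken from the letter):

```python
from math import log2

def iseq_step(f, p):
    """Per-position relative entropy (bits) for one step:
    sum_b f(b) * log2(f(b) / p(b)).  Nonnegative by Gibbs' inequality."""
    return sum(fb * log2(fb / pb) for fb, pb in zip(f, p))

# Hypothetical compositions (A, C, G, T) of two genome regions.
at_rich = [0.35, 0.15, 0.15, 0.35]
gc_rich = [0.15, 0.35, 0.35, 0.15]

# Round trip: at_rich -> gc_rich -> back to at_rich.
loop_total = iseq_step(gc_rich, at_rich) + iseq_step(at_rich, gc_rich)
print(f"I_seq around the loop: {loop_total:.3f} bits")  # positive, not zero
```

A state function would give exactly zero around any closed loop; here each leg contributes a positive amount, so the loop total grows with every excursion.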

**2. I_{seq} is not an information theory measure.**
Shannon's uncertainty

*H* = - Σ_{i} *p*_{i} log_{2} *p*_{i}   (3)

is related to the physical entropy if the probabilities correspond to the microstates of the system, so that *S* = *k*_{B} ln(2) *H* [Schneider, 1991*b*].
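As a quick numerical check (a Python sketch, not part of the letter), the uncertainty of four equiprobable DNA bases comes out to 2 bits per base:

```python
from math import log2

def uncertainty(probs):
    """Shannon uncertainty H = -sum_i p_i * log2(p_i), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Four equiprobable bases: H = log2(4) = 2 bits per base.
print(uncertainty([0.25, 0.25, 0.25, 0.25]))  # 2.0
```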

The theory I work with differs from that of Dr. Stormo in that it uses a
definition of information that is path independent. A molecular machine --
including not only genetic recognizers but also rhodopsin, myosin,
*etc.*
-- dissipates energy into its surroundings as it makes choices
[Schneider, 1991*a*].
The information *R* (a rate of information, following Shannon's original notation) is
a decrease in uncertainty:

*R* = *H*_{before} - *H*_{after}.   (4)

For protein binding on a nucleic acid, *H*_{after} is the uncertainty that
remains at the binding sites, computed from the aligned site sequences.

The formula ascribed to *I*_{seq} by Dr. Stormo is

*I*_{seq} = Σ_{l} Σ_{b} *f*(*b*, *l*) log_{2} [ *f*(*b*, *l*) / *p*(*b*) ]   (5)

where supposedly *f*(*b*, *l*) is the probability of base *b* being bound at
position *l* of the sites and *p*(*b*) is the probability of base *b* in the
genome.

Note that *R* can be computed
from the genomic uncertainty

*H*_{g} = - Σ_{b} *p*(*b*) log_{2} *p*(*b*)   (6)

in which case, contrary to Dr. Stormo's claim, it does take the base
composition of the genome into account.

In other words there are three possible formulas:

*R* = Σ_{l} [ 2 - *H*(*l*) ]   (7)

*R* = Σ_{l} [ *H*_{g} - *H*(*l*) ]   (8)

*I*_{seq} = Σ_{l} Σ_{b} *f*(*b*, *l*) log_{2} [ *f*(*b*, *l*) / *p*(*b*) ]   (9)

where *H*(*l*) = - Σ_{b} *f*(*b*, *l*) log_{2} *f*(*b*, *l*) is the uncertainty
at position *l* of the binding sites.
Formula
(7)
would be the strict molecular machine view in which contact is
not made before binding [Schneider, 1991*a*,Schneider, 1994],
so that the uncertainty before binding is log_{2}(4) = 2
bits per base.
This raises the issue
of how it is known that there are 4 bases. However, the situation is equivalent to
determining the channel capacity and therefore follows Shannon in that
sense.
Modification of bases, for example by methylation or glycosylation,
does not increase the information capacity of DNA beyond
2 bits per base since the modifications depend on the sequence itself,
for example in the
methylation of adenine by Dam methylase at 5' GATC 3'.
However, increasing the number of symbols everywhere by adding
new bases would increase the information, as has been done
experimentally [Piccirilli *et al.*, 1990].

In formula
(8)
*H*_{g} can be used to cancel the `background' around a binding
site due to genomic composition skew [Schneider *et al.*, 1986],
but this is dangerous
because we don't know what causes the skew.
For example, it could be caused by
a nucleosome
binding pattern everywhere in the genome
and therefore real information is there. This
leaves us with the difficult, perhaps unresolvable, technical problem of
separating and identifying
the information of other binding sites in such genomes. A similar
difficult situation is to use purely theoretical means to
distinguish ribosome binding site patterns from the
downstream codon biases that occur with 3 base repetition.
Aside from the toeprint experiment [Hartz *et al.*, 1988]
one doesn't know exactly
where the 3' edge of the ribosome is
[Rudd & Schneider, 1992],
and it is not clear that complicated
subtraction or extraction schemes
would provide fair models close to the
initiation codon since translation or protein chains
may be different when they are just starting
as
compared to later on.
Experimental approaches to determine the patterns, such as SELEX,
are also presently inadequate
[Schneider, 1996,Shultzaberger & Schneider, 1999].

Formula (9) is not a true Shannon uncertainty of the form - Σ_{i} *p*_{i} log_{2} *p*_{i}, and it is not a state function.

Thus formulas (7) and (8) appear reasonable, but formula (9) does not, and it does not match the physics discussed in Section II below.
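The disagreement can be made concrete with a toy single-position calculation. This is a sketch with invented frequencies, assuming formula (7) takes the form Σ_{l}[2 - *H*(*l*)], formula (8) the form Σ_{l}[*H*_{g} - *H*(*l*)], and formula (9) the relative-entropy *I*_{seq}:

```python
from math import log2

def H(probs):
    """Shannon uncertainty in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical frequencies at one binding-site position (A, C, G, T)
# and a skewed genomic composition; neither is from the letter.
f = [0.7, 0.1, 0.1, 0.1]   # f(b, l) at the site
p = [0.5, 0.2, 0.2, 0.1]   # p(b) in the genome

r7 = 2.0 - H(f)                                        # formula (7)
r8 = H(p) - H(f)                                       # formula (8)
i9 = sum(fb * log2(fb / pb) for fb, pb in zip(f, p))   # formula (9)

print(f"(7) {r7:.3f}  (8) {r8:.3f}  (9) {i9:.3f} bits")
```

With this skewed composition the three formulas give three different numbers, so the choice among them matters.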

**3. I_{seq} can violate the channel capacity theorem.**
Shannon's channel capacity theorem
provides an upper bound on the information that can be transmitted
[Shannon, 1948,Shannon, 1949].
It has been used to explain the observed precision
of molecular systems [Schneider, 1991*a*].

When discussing the computation for GGGG,
Dr. Stormo
does not give a justification for having
more than 2 bits/base other than having
*R*_{sequence} (the average information at a set of binding sites)
equal
*R*_{frequency} (the information needed to locate the binding sites on the genome).
There are now a number of clear cases where
*R*_{sequence} does not equal
*R*_{frequency} for good biological reasons
[Schneider *et al.*, 1986,Schneider & Stormo, 1989,Herman & Schneider, 1992,Stephens & Schneider, 1992],
so forcing one's
formula to make them equal means that one could miss important biological
phenomena.
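As a sketch of what *R*_{frequency} measures, the bits needed to locate a set of sites can be computed as a base-2 logarithm; the genome size and site count below are invented for illustration, not taken from the letter:

```python
from math import log2

genome_positions = 4_700_000   # hypothetical number of possible positions
n_sites = 300                  # hypothetical number of actual binding sites

# Bits needed to pick the sites out of the genome.
r_frequency = log2(genome_positions / n_sites)
print(f"R_frequency = {r_frequency:.1f} bits")  # ~ 13.9
```

Whether *R*_{sequence} measured from the sites matches this number is an empirical question, which is the point of the paragraph above.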

**4. Interpreting I_{seq} as a macroscopic measure made by an observer.**
I understand that it is not
Dr. Stormo's intent
to model the observational process, but it is worthwhile understanding
the implications of this possible interpretation.
Formulas like *I*_{seq} would then represent the state of knowledge of a
macroscopic observer watching the recognizer, rather than a physical property
of the recognizer itself.
**5. I_{seq} can measure prejudice.**

**6. I_{seq} as a global free energy measure.**
Dr. Stormo (private communication)
indicates that

**II. What is the inequality that Dr. Stormo disputes?**

The inequality is a version of the
Second Law of Thermodynamics, given in a previous
*J. Theor. Biol.*
paper [Schneider, 1991*b*].
The relationship derived from both the Second Law and (surprisingly!) from
Shannon's channel capacity equation is:

*q* ≥ *k*_{B} *T* ln(2) *R*   (joules)   (10)

where *k*_{B} is Boltzmann's constant, *T* is the absolute temperature, *q* is
the heat dissipated to the surroundings, and *R* is the information gained in
bits.

A coin is a useful example for understanding this. A coin can carry one bit of information, since it has 2 states and log_{2}(2) = 1 bit. Consider a coin flipping in the air or bouncing around in a box. In such a condition it has no particular state and so its uncertainty is 1 bit. To `store' information in the coin, it must come to rest on one or the other face. This requires that the energy in the coin be allowed to flow out to the surrounding environment. The point here is that the initial energy of the coin can have different values relative to the final value. The Second Law tells us that there is a minimum energy that must be dissipated per bit (*k*_{B} *T* ln(2) joules), but there can be extra dissipation that is merely wasted because under all conditions no more than 1 bit can be stored in the coin. With even a small inefficiency, the relationship between energy dissipated and information gained will be an inequality, contrary to Dr. Stormo's claim (see [Tribus & McIrvine, 1971]).
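The minimum dissipation per bit can be evaluated numerically; this is a sketch, and the room-temperature figure of 298 K is my assumption for illustration:

```python
from math import log

k_B = 1.380649e-23   # Boltzmann's constant, joules per kelvin
T = 298.0            # assumed room temperature, kelvin

# Minimum heat that must be dissipated to gain one bit (Second Law bound).
e_min = k_B * T * log(2)   # joules per bit
print(f"k_B * T * ln(2) = {e_min:.2e} joules per bit")  # ~ 2.85e-21
```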

A coin is also a good analogy for the situation of a protein binding to DNA.
Before specific binding, the protein/DNA complex has high energy, while after
binding at specific DNA sites it has lower energy. The excess energy must be
dissipated to the surroundings for the molecule to stick,
since if the energy
were
not dissipated the molecule would move on.
As with the coin, there can be an
excess dissipation so there is no
*a priori*
relationship between energy and
information aside from the Second Law bound.

If, in attempting to model binding energetics,
*p*(*b*) and
*f*(*b*, *l*) are to represent the time-average of various bases bound by the protein,
then the non-equivalence of energy and information means
that it is not correct to assume that these are the same
as the base frequencies observed in the genome and in binding sites, respectively,
since those correspond to information.
In this case, these probabilities are not yet experimentally accessible
and the measure Dr. Stormo proposes cannot be made.

On the other hand, these probabilities are usually presented as
estimatable from observed base frequencies, in which case
Dr. Stormo
is working entirely on the information side of
the energy/information equation (10)
to compute his ``specific free energy of binding''.
In this interpretation, *I*_{seq} cannot be a measure of energy.
Because of the Second Law inequality,
the *only* way to
know what the real energy is, is to go and make direct measurements of it.

**Acknowledgments**

I thank Lakshmanan Iyer for comments and for useful discussions leading to equation (2), and Elaine Bucheimer, John S. Garavelli, Denise Rubens, Peter K. Rogan, John Spouge, Bruce Shapiro, and Ryan Shultzaberger for comments on the manuscript.