The first thing needed is the rectification of names.
-- Confucius, Analects 13:3
Nature Chemical Biology 5, 521 - 525 (2009); The Rectification of Names
I might note also that some of the literature is confused and some of it is just plain wrong.
--- John Pierce, Symbols, Signals and Noise: The Nature and Process of Communication, 1961, preface, x.
Information theory and molecular biology touch on a huge number of topics. As a result there are many ways that one can get into intellectual trouble, and many of these are widely repeated in the literature. This page is devoted to listing the pitfalls that I have come across and needed to solve to create a consistent theory. Not everything in the literature is correct!
Using ambiguous or poor terminology
A sequence logo is not a consensus sequence. Despite the title of our original paper, "Sequence Logos: A New Way to Display Consensus Sequences", a sequence logo is not a consensus. The strictest consensus may be read from the top letters, the anti-consensus from the bottom letters, and any combination from the letters in between.
Confusing a model with reality: consensus sequences. The main example is confusing a consensus sequence (a model) with a binding site (a natural phenomenon). See The Consensus Sequence Hall of Fame and the paper Consensus Sequence Zen.
Using the popular meaning of the term 'information'. In physics it is well understood that the term 'force' has a precise technical definition, and this allows one to write Newton's famous equation F = M A (force is mass times acceleration). This is quite different from the popular use of the term force as in 'the force of my rhetoric'. It is clear that usually the phrase 'my rhetoric' is not meant to be an acceleration applied to the mass of your brain! Likewise, Shannon defined information in a precise technical sense. Beware of writers who slip from the technical definition into the popular one. You can tell when someone is being precise by seeing if they report the amount of information in bits. If what they are saying is not in bits or they don't indicate exactly how to compute the bits, then they are probably using the popular meaning.
Thinking that information (R) is the same as uncertainty (H). Because of noise, after a communication there is always some uncertainty remaining, H_{after}, and this must be subtracted from the uncertainty before the communication is sent, H_{before}; the information is the difference, R = H_{before} - H_{after}. This pitfall has led many authors to conclude that information is randomness. Examples:
'Information' is, of course, not the very opposite of randomness. Elitzur is using the word 'information' in the semantic sense, as a synonym for knowledge or meaning. Everyone knows that a random sequence, that is, one chosen without intersymbol restrictions or influence, carries the most information in the sense used by Shannon and in computer technology. ...to which I (Tom Schneider) responded:
Here you have made the mistake of setting H_{after} to zero. So a random sequence going into a receiver does not decrease the uncertainty of the receiver, and so no information is received. But a message does allow for the decrease. Even the same signal can be information to one receiver and noise to another, depending on the receiver!
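The point can be made concrete with a small sketch (the probabilities below are made up for illustration): information is the decrease in uncertainty, so a random input, which leaves H_{after} equal to H_{before}, delivers zero information.

```python
import math

def uncertainty(probs):
    """Shannon uncertainty H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Before the site is characterized, the four bases are equally likely.
H_before = uncertainty([0.25, 0.25, 0.25, 0.25])   # 2 bits

# After: one base dominates, but noise leaves some uncertainty.
H_after = uncertainty([0.90, 0.05, 0.03, 0.02])

# Information is the DECREASE in uncertainty, not the uncertainty itself.
R = H_before - H_after
print(f"R = {R:.2f} bits")

# A random sequence leaves the receiver as uncertain as before:
assert uncertainty([0.25] * 4) - H_before == 0     # R = 0, no information
```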
Treating uncertainty (H) and entropy (S) as identical OR treating them as completely unrelated. The former philosophy is clearly incorrect because uncertainty has units of bits per symbol while entropy has units of Joules per Kelvin. The latter philosophy is overcome by noting that the two can be related if one can correlate the probabilities of the microstates of the system under consideration with the probabilities of the symbols. See Theory of Molecular Machines. II. Energy Dissipation from Molecular Machines (J. Theor. Biol. 148, 125-137, 1991) for how to do this.
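As a sketch of that connection (assuming, per the paper, that symbol probabilities can be equated with microstate probabilities), one bit of uncertainty corresponds to k_B ln 2 Joules per Kelvin of entropy:

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def bits_to_entropy(h_bits):
    """Convert an uncertainty in bits to thermodynamic entropy in J/K.
    Valid only when symbol probabilities match microstate probabilities."""
    return h_bits * K_B * math.log(2)

print(bits_to_entropy(1.0))  # ~9.57e-24 J/K per bit
```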
There's a story told by Tribus about the origin of this confusion:
What's in a name? In the case of Shannon's measure the naming was not accidental. In 1961 one of us (Tribus) asked Shannon what he had thought about when he had finally confirmed his famous measure. Shannon replied: "My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one knows what entropy really is, so in a debate you will always have the advantage.'"
-- M. Tribus and E. C. McIrvine, Energy and Information, Sci. Am., 225, 3, 179-188, September, 1971. https://doi.org/10.1038/scientificamerican0971-179
Using the term "Shannon entropy". Although Shannon himself did this, it was a mistake because it leads to thinking that the thermodynamic entropy is the same as the "Shannon entropy", which invites the two extreme classes of error described above: treating the two as identical or treating them as completely unrelated.
Ignoring the number zero. Molecular biologists often do not include zero in their counting systems. Surprisingly, zero was invented several thousand years ago. Physicists are shocked when I tell them that to some molecular biologists, counting goes like this: -3, -2, -1, +1, +2, +3 ...
"I'm a mathematician. There are counting numbers. We always start counting at zero."
--- Professor Carol Wood from Wesleyan University
Methods for how to treat zero coordinate systems are given in the glossary. If one creates a sequence logo without a zero, then one will be seriously bitten later on when one starts using sequence walkers, because the location of a sequence walker has to be specified and the natural place to do this is the zero base.
Thinking that bits are merely a measure of statistical non-randomness. One can compute the significance of a position in a binding site as the number of z scores above background (e.g., for splice junctions, see splice). However, this prevents one from thinking of the bits as a measure of sequence conservation, which is a different thing. Aside from small-sample effects, which can be corrected, the average number of bits in a binding site does not change as the sample size changes. By contrast, the error bars on a sequence logo show the significance of the conservation.
Maxwell's Demon. There is a huge literature on Maxwell's Demon and it is full of errors, too many to list here. The basic problem is that the people who write about the Demon are not molecular biologists; they are physicists and philosophers who do not know molecular biology, so they are not thinking in realistic molecular terms. If one treats the demon as a real physical being or device, then it is clear that there are natural analogues for the things he has to do, and none of these violate the Second Law of Thermodynamics. If one does not treat the demon as a real physical device, then one has violated known physics already and so violation of the Second Law is not surprising. See nano2 for a detailed debunking of the Demon.
The meaning of ΔS in the ΔG equation.
It is well known from thermodynamics that the free energy is:
ΔG = ΔH - T ΔS
Often people talk about ΔS in this equation as "the" entropy. This is misleading if not downright incorrect.
ΔS in the above equation is the entropy change of the system:
ΔS = ΔS_{system}
ΔH corresponds to the entropy change of the surroundings:
ΔH = ΔH_{system} = -T ΔS_{surroundings}
so the total free energy change is:
ΔG_{system} = ΔH_{system} -T ΔS_{system}
= -T ΔS_{surroundings} -T ΔS_{system}
= -T ΔS_{total}
This is why ΔG_{system} corresponds to the total entropy change, and it is why one can use the sign of ΔG_{system} to predict the direction of a chemical reaction.
So ΔH_{system} is misnamed since it is about what happens outside the system.
The pitfall is to think or say that ΔS_{system} is "the" entropy change. It's not, since it is only part of the total entropy change.
Reference:
@book{Darnell1986, author = "J. Darnell and H. Lodish and D. Baltimore", title = "Molecular Cell Biology", publisher = "Scientific American Books, Inc.", address = "N. Y.", year = "1986"} See pages 36-38.
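A quick numeric check of the bookkeeping above (the entropy values are invented for illustration; entropies in J/K, temperature in K):

```python
T = 298.0               # temperature, K
dS_system = -50.0       # entropy change of the system, J/K
dS_surroundings = 80.0  # entropy change of the surroundings, J/K

# dH_system tracks the surroundings: dH_system = -T * dS_surroundings
dH_system = -T * dS_surroundings
dG_system = dH_system - T * dS_system

# dG_system equals -T times the TOTAL entropy change:
dS_total = dS_system + dS_surroundings
assert abs(dG_system - (-T * dS_total)) < 1e-9

print(dG_system < 0)  # True: total entropy rises, so the reaction proceeds
```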
Entropy is not "disorder"; it is a measure of the dispersal of energy by Dr. Frank L. Lambert. An entropy increase MIGHT lead to disorder (by that I mean the scattering of matter) but then - as in living things - it might not!
How can we relate this idea to molecular information theory? 'Disorder' is the pattern (or mess) left behind after energy dissipates away. The measure Rsequence (the information content of a binding site) is a measure of the residue of energy dissipation left as a pattern in the DNA (by mutation and selection) when a protein binds to DNA. On the other hand, Rfrequency, the information required to find a set of binding sites, corresponds to the decrease of the positional entropy of the protein. To drive this decrease, the entropy of the surroundings must increase more, by dissipation of energy. After the energy has dissipated away, the protein is bound. So the protein bound at the specific genetic control points represents 'ordering'. This concept applies in general to the way life dissipates energy to survive.
Confusing the global ΔG with single-molecule binding. It is common to write the reaction
EcoRI + DNA <--> EcoRI.DNA
and talk about the global ΔG. However, this tells us nothing about how a single molecule binds to the DNA. A single molecule will find a binding site IRRESPECTIVE OF THE TOTAL CONCENTRATIONS OF OTHER MOLECULES IN THE SOLUTION. In other words, the global ΔG is NOT relevant to the problem of how EcoRI finds its binding site. This is widely misunderstood in the literature.
"Per cent identity" does not take into account that amino acids are almost always not equally probable and for this reason leads to illusions. Mutual entropy is the correct measure of "similarity".
--- H. P. Yockey. Information theory, evolution and the origin of life. Information Sciences, 141:219-225, 2002.
The term 'entropy' should not be used, but otherwise the statement is correct. This means that the basis of the widely used phylogenetic tree generating programs, such as Clustal, is unreliable. These programs begin by pairwise comparison of the percent identity of proteins.
We were wondering if you could point us in the right direction. We are doing some SELEX experiments using rounds of selection with random oligos to determine the DNA binding sites of a zinc finger protein. Do you know of any web sites that can easily determine a possible consensus from such sequences?
OK, I can help you with this, thanks for asking, but it is important to understand two things. First, if you create a consensus sequence after having done your beautiful SELEX, you will be throwing out most of your hard-earned data! See the Consensus Sequence Zen paper and also the entry on consensus sequences on this page.
You wouldn't say that you walked 5 today, would you? 5 what?
"The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey."
There are papers that have used the natural log and others that have used log base 2 for measuring information in biology, so it is important to indicate the units. If you don't state the units (as in "3 bits" or "19 bits per site") your paper will not be precise enough for someone to replicate the work. The word 'bits' is very important to have after every number.
The so-called 'relative entropy' (also known as the log likelihood or Kullback-Leibler divergence) has become popular for measuring the distributions of bases or amino acids. This computation has the form
∑_{i} P_{i} log_{2} (P_{i}/Q_{i})
where P_{i} is the frequency of amino acid i at a given position in a protein motif and Q_{i} is the frequency of amino acid i in proteins in general. The problem with this measure is that it gives results that are not consistent with information theory. For example, the maximum information required to identify one protein out of 20 is log_{2} 20 = 4.3 bits. Yet this statistical measure can give more than 5 bits. So it is incorrect to assign the units of bits to the results of this measure.
A simple example makes the situation clearer. Consider a coin, which can show either heads or tails. In basic information theory there are two possible states (ignoring the coin landing on an edge), and so there cannot be more than 1 bit of information stored in the coin. However, by appropriate choice of Q_{i} the relative entropy can give values greater than 1. Of course it is impossible for a coin to store more than 1 bit of information, so relative entropy does not give results in bits and should never be reported that way.
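Here is the coin computed explicitly (the numbers are arbitrary, chosen only so that P and Q are badly mismatched):

```python
import math

def relative_entropy(P, Q):
    """Kullback-Leibler divergence sum(P_i * log2(P_i / Q_i))."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.99, 0.01]  # observed heads/tails frequencies
Q = [0.01, 0.99]  # a badly mismatched background distribution

D = relative_entropy(P, Q)
print(f"{D:.2f}")  # about 6.50, far above the coin's 1-bit maximum
```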
This does not mean that the log likelihood is not sometimes a useful statistical measure, just that if it is used the results are not compatible with Shannon information theory (except in the case when the Q_{i} are equally likely).
The relative entropy can be rewritten
(-∑_{i} P_{i} log_{2} Q_{i}) - (-∑_{i} P_{i} log_{2} P_{i})
The second half is recognizable as Shannon's "uncertainty" but the first half is not.
Energy is a state function. That is, it is determined by the current state of a system. If P and Q correspond to probabilities of two conditions (e.g. a protein bound to specific DNA sequences or non-specifically on DNA, as in a sequence logo) then it is clear that the first term (-∑ P_{i} log_{2} Q_{i}) is a mixture of the two states and therefore not a state function. So it is not reasonable to compare the relative entropy to energy.
If one insists on using relative entropy, then the computed values cannot be related to energy. Shannon's channel capacity and the rest of molecular information theory slip out of one's grasp, and one cannot study the efficiency of molecular machines, because 'relative entropy' is the wrong measure and the results do not fit the theory.
Probably the most convenient one is a Sequence Logo [752], in which the height of each letter indicates the degree of its conservation, whereas the total height of each column represents the statistical importance of the given position (Figure 3.2)
--- Eugene V. Koonin and Michael Y. Galperin, Sequence - Evolution - Function, page 67
Information is not energy! When a coin is set on a table, it can store 1 bit of information (in noisy thermal motion conditions it cannot stay on its edge). Before the coin is set on the table, when it could be set to either state, it has some potential and/or kinetic energy. Setting the coin on the table in a stable state requires that this energy be removed from the coin, and ultimately it will be dissipated as heat into the environment. If the coin initially has a greater height or a higher velocity, then more energy will have to be dissipated to stabilize it on one face or the other. However, in either the low or the high energy case the coin can store only 1 bit of information. So there is an inequality relationship between energy and information. It turns out that this relationship can be expressed as a version of the second law of thermodynamics. See Theory of Molecular Machines. II. Energy Dissipation from Molecular Machines for two derivations of the exact relationship. The bottom line is that energy is not the same as information, since there is a minimum energy dissipation required per bit but the actual energy dissipation can be larger.
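The minimum in that inequality can be sketched numerically: the second-law bound is k_B T ln 2 Joules dissipated per bit gained (the temperature value below is chosen only for illustration):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def min_joules_per_bit(T):
    """Lower bound on energy dissipated per bit gained: kB * T * ln(2)."""
    return K_B * T * math.log(2)

E_min = min_joules_per_bit(298.0)  # ~2.85e-21 J/bit at room temperature

# A real machine may dissipate far more and still gain only the same bit:
E_actual = 10 * E_min
print(E_actual >= E_min)  # True: the relation is an inequality, not equality
```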
This opens the important question of what the actual relationship is between energy and information in molecular systems. This was solved and published in 2010, see: 70% efficiency of bistate molecular machines explained by information theory, high dimensional geometry and evolutionary convergence.
Examples of this pitfall:
@article{Stormo2000, author = "G. D. Stormo", title = "{DNA binding sites: representation and discovery}", journal = "Bioinformatics", volume = "16", pages = "16--23", pmid = "10812473", year = "2000"}
@article{Wasserman.Sandelin2004, author = "W. W. Wasserman and A. Sandelin", title = "{Applied bioinformatics for the identification of regulatory elements}", journal = "Nat Rev Genet", volume = "5", pages = "276--287", pmid = "15131651", year = "2004"}
We pause to assure the reader that there is nothing mysterious about n-dimensional space. A point in n-dimensional space R^{n} is simply a string of n real numbers ...
-- J. H. Conway and N. J. A. Sloane, "Sphere Packings, Lattices and Groups" Springer-Verlag, third edition, New York, ISBN 0-387-98585-9, 1998, page 3.
http://neilsloane.com/doc/splag.html
Modeling or depicting free energy surfaces as two-dimensional. Such surfaces are high dimensional, and this has severe effects on the shape of the path. If the individual valleys are Gaussian, the final shape is a sphere. See Theory of Molecular Machines. I. Channel Capacity of Molecular Machines.
Here is a simple case that blows away the silly ideas about 'erasure' that the physicists talk about.
Consider a coin. Set it on a table. It can store 1 bit of information.
Now to set it there from a point above the table, you have to ALLOW the potential and kinetic energy of the coin to dissipate out as noise/heat into the rest of the universe. If you don't dissipate the potential energy, the coin is not yet on the table. If you don't dissipate the kinetic energy, the coin will bounce around and so it can't store the information yet.
Setting a coin to Heads or Tails by placing it on a table dissipates energy. For some reason this simple idea escapes physicists who use the term 'erasure'.
Now what is it to 'erase'? Well, call heads 1 and tails 0. Then when we set a bunch of coins to all 0 (by setting them all to have tails pointing up), we have 'erased' whatever might have been stored in them before. But that costs just as much dissipation as storing a pattern.
'Erasure' is no different than storing a pattern of bits; it costs just as much.
For this reason I dump the term 'erasure'.
Despite a lot of silly language to the contrary, you cannot capture the dissipated energy. The entropy of the coin goes down while the entropy of the universe goes up, but more so.
FURTHERMORE, if you start the coin 1m above the table, you will dissipate a certain amount of energy when you store the 1 bit. If you start at 2m above the table, you will dissipate twice as much energy. However, in both cases the information stored in the final state of the coin on the table is the same: 1 bit. This demonstrates clearly that information is not the same thing as energy as some people seem to think. Furthermore, the process in which the coin starts higher is less efficient than when the coin starts lower. This is the key to understanding the isothermal efficiency of molecules which is discussed in the paper emmgeo: 70% efficiency of bistate molecular machines explained by information theory, high dimensional geometry and evolutionary convergence.
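A sketch of that arithmetic (the coin mass is a made-up value):

```python
g = 9.81       # gravitational acceleration, m/s^2
mass = 0.005   # kg, a hypothetical coin

def dissipated(height_m):
    """Potential energy that must be dissipated to settle the coin."""
    return mass * g * height_m

E1, E2 = dissipated(1.0), dissipated(2.0)
bits_stored = 1  # the same single bit is stored in either case

print(E2 / E1)   # 2.0: twice the dissipation from 2 m, same 1 bit stored
print(E1 / E2)   # 0.5: the higher drop is half as efficient
```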
Schneider Lab
origin: 2002 March 13
updated: version = 1.66 of pitfalls.html 2018 Jan 30