ev: Evolution of Biological Information:
Frequently Asked Questions

Briefly, what is Ev? Ev is a computer program that allows one to model the way that information is gained in living organisms by natural selection. The example used is the patterns in DNA to which proteins bind to regulate genes. This is a well-understood system and so it makes a good demonstration of evolution. Also, the mathematics is precise and gives quantitative results that match the results seen in nature.
What is information? To do real science, we need precise definitions. The one used in the Ev program is Shannon's information, measured in bits. This measure has been around since 1948 and it is well respected. All modern communications and data storage systems, including satellite communications, CDs, DVDs, MP3, cell phones, etc., are based on Shannon's theory. This measure has been successfully used to study many biological systems. See the rest of this web site, especially our papers, for many examples.
Isn't everything 'information'? No, that way lies madness. Here's the definition I'm starting to use to distinguish between physical phenomena and data or information that can be measured using Shannon's method:
Phenomena are recorded as data (information) when the state of a device (including things like rhodopsin in the eye and CCDs) associated with a living organism is changed by the phenomena.
I'm only a few thousand years behind in this thought:
"By convention there is colour, by convention sweetness, by convention bitterness, but in reality there are atoms and space."
--- Democritus - 400 B.C.E.
What information are we talking about in the Ev program? There are two information measures, Rsequence and Rfrequency. The only information gain measured in the Ev program is from the patterns in the binding sites. This measurement is called Rsequence, and it is the information that we are interested in tracking in the Ev model.
Doesn't the program contain that information from the start? No. The program starts with random genetic sequences, and when one measures the information in the binding sites (Rsequence), it is approximately zero. (There will be small fluctuations because of the small sample size.) (See: Evolution Fairytale Forum)
But isn't Rfrequency an information measure too? Very perceptive of you. Rfrequency is a measure of the information needed to locate sites in the genome:
Rfrequency = -log₂(γ / G)
Because Rfrequency is a function of the size of the genome (the number of potential binding sites is G) and the number of sites (γ), it is fixed when the model begins and (usually) is not changed during an evolution run. So it doesn't teach us about how Rsequence changes. However, Rsequence does evolve towards Rfrequency as you can see in Figure 2b of the Ev paper and in the figure to the right. The dashed line shows Rfrequency, the green curve shows Rsequence.
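As a worked example of the formula above, here is a minimal Python sketch. The function name `rfrequency` is ours, not taken from the Ev source, and the inputs are illustrative: 16 sites matches the standard Java run mentioned later in this FAQ, while a genome of 256 positions is an assumption.

```python
import math

def rfrequency(num_sites: int, genome_positions: int) -> float:
    """Bits needed to locate num_sites (gamma) among genome_positions (G)."""
    if not 0 < num_sites <= genome_positions:
        raise ValueError("need 0 < num_sites <= genome_positions")
    return -math.log2(num_sites / genome_positions)

# Illustrative values only: 16 sites in an assumed 256-position genome.
print(rfrequency(16, 256))  # 4.0
```

Note that nothing about the sequences themselves enters this calculation, which is why Rfrequency stays fixed during a run while Rsequence changes.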
Aren't you surprised that the information gain Rsequence is exactly as predicted by Rfrequency? Yup.
Did you know that Rsequence would evolve to Rfrequency when you first ran the program? No, I ran the program to see if this would happen or not. I was testing my PhD thesis. If the program had failed, my thesis would have been in jeopardy!
But you set up the size of the genome and the number of sites, so didn't you put information into the organisms that way? Nope. More precisely, the input parameters define Rfrequency, which is determined by information put into the program, but that is not the information being measured from the organisms. Remember that we are measuring Rsequence from patterns in the genome, and this starts out near zero bits, as you can see from the green curve in the graph. Also, the size of the genome and number of required sites can be set to a wide variety of values and yet Rsequence still evolves towards Rfrequency. This only happens by replication, mutation and selection, demonstrating that those factors are necessary and sufficient for information gain to occur.

Replication, mutation and selection
are necessary and sufficient for information gain to occur.
This process is called evolution.

Where is the environment in this picture of evolution of binding sites? The size of the genome and the number of required sites are the 'environment' from the viewpoint of the binding site recognizer. This is, of course, an exact mirror of the situation in nature. DNA recognition proteins can be activated to bind or blocked from binding DNA by outside factors (e.g. LacI), but once that has taken place, the recognizers function by locating positions on the DNA. So they are buffered from the external environment and they only face the problem of locating their sites on the genome. The Ev organism recognizers have the same challenge.
Does the Special Rule smuggle information into the Ev program? This claim, by William Dembski, is answered in the on-line paper Effect of Ties on the Evolution of Information by the Ev program. Basically, changing the rule still gives an information gain, so Dembski's prediction was wrong.
Has Dembski ever acknowledged this error? Not to my knowledge.
Don't scientists admit their errors? Generally, yes, by publishing a retraction explaining what happened.
Don't you make errors too? Do you admit them? Yes and Yes, see: Schneider Lab Errata and Corrigenda.
If you had a different recognition method would you get a different result? No, so long as the recognition function gives a finely graded and ordered response to input sequences. In the Ev program, recognition is done using a weight matrix of numbers encoded in the genomes. In nature, DNA is copied to RNA, the RNA is translated into a polypeptide and then the polypeptide folds to make a protein. Finally, the protein recognizes the binding sites by physical interactions with the DNA. We already know that when the recognition method is the natural one, Rsequence is close to Rfrequency (see Schneider 1986, flexrbs). Even these vastly different mechanisms give the same results, so the answer is no. However, you are quite welcome to put a different recognition method into the Ev program source code and see what happens. If you do that you might be able to publish the results!
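To make "a finely graded and ordered response" concrete, here is a minimal Python sketch of weight-matrix recognition. It is not the Ev source code: the names `score` and `find_sites`, the three-position matrix, the toy genome and the "at or above the threshold" comparison are all assumptions made for illustration.

```python
def score(window, weights):
    """Add up the weight assigned to the base seen at each position."""
    return sum(position[base] for position, base in zip(weights, window))

def find_sites(genome, weights, threshold):
    """Return start positions whose window scores at or above the threshold."""
    width = len(weights)
    return [i for i in range(len(genome) - width + 1)
            if score(genome[i:i + width], weights) >= threshold]

# Hypothetical 3-position matrix favouring the pattern "ACG".
weights = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
genome = "TTACGTTACTT"  # toy sequence, not an Ev genome

print(find_sites(genome, weights, threshold=6))  # [2]
print(find_sites(genome, weights, threshold=3))  # [2, 7]
```

The partial match "ACT" scores lower than the full match "ACG" but higher than unrelated windows, so small sequence changes give small score changes. That graded ordering is the property the answer above asks of any replacement recognition method.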
Why don't you do a real biological experiment instead of just a computer model? The primary reason is that we don't have infinite resources and time. If you have the resources (a molecular biology lab), are interested in doing an experiment, and would like to discuss it, please contact me. The second reason is that nature has already done experiments, and we generally see that Rsequence is close to Rfrequency in real examples (Schneider 1986, flexrbs). The third reason is that many people have already done related evolutionary experiments, such as SELEX and similar experiments (Carothers JM, Oestreich SC, Davis JH, Szostak JW. Informational complexity and functional activity of RNA structures. J Am Chem Soc. 2004 Apr 28;126(16):5130-7.), though to my knowledge no one has tested whether Rsequence evolves to Rfrequency in vivo.
If you were to change the Ev program by making X into Y, then I predict that there won't be an information gain. Could you change the Ev program for me? No. Don't be lazy, go do it yourself!
The so-called random numbers really are not random, they are made by an algorithm. So is there 'information' imparted by the random number generator? No. First, you can use a different random number series by changing a parameter in the program. You can also substitute in a different random number generator. Finally, you could supply random numbers from a nuclear source. This is available from HotBits: Genuine random numbers, generated by radioactive decay! None of these changes should affect the results. If they do, suspect that you have a bad random number generator!
Isn't the standard Ev mutation rate of one base change per genome per generation excessive? No. If you think about it (or try it yourself) you will see that if you slow it down you get the same results: Rsequence still will evolve towards Rfrequency. Of course it will take longer to get the results.
Isn't the Ev mutation rate much higher than natural rates? It's only 10-fold faster than HIV. Interestingly, there are mutations in the bacteriophage T4 DNA polymerase that reduce mutation rates. So the rate of mutation is itself under evolutionary control (though not in the Ev program).
Won't a slower evolution take too long in nature? No. For practical reasons we usually use a tiny population in Ev, generally only 16 organisms. In nature there are usually populations of millions. For example, in the lab a single cubic centimeter (ml, a milliliter) of E. coli culture can easily contain 10⁸ bacteria. (That's 100 million.) With an error rate of one in 10⁶ (i.e., one in a million) at each genetic location, there will be plenty of variation to drive evolution. Notice that we have 6 billion people on the planet, so there is lots of opportunity for us to continue evolving. (Have you been wearing your seatbelt? People who don't wear seatbelts are being selected against ...)
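The arithmetic behind "plenty of variation" can be spelled out in a few lines of Python. The variable names are ours, and the sketch assumes that every cell in the culture is copied once per generation.

```python
# Figures quoted in the answer above.
cells_per_ml = 10**8          # E. coli in one milliliter of culture
one_error_in = 10**6          # error rate: one in a million per genetic location

# Expected number of cells carrying a change at any one given location,
# assuming each cell is copied once (our simplifying assumption).
expected_variants_per_location = cells_per_ml // one_error_in
print(expected_variants_per_location)  # 100
```

So even a single milliliter is expected to contain about a hundred variants at each genetic location in every generation, compared with one base change per generation in each of the 16 organisms of a typical Ev run.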
If you had a reasonably sized genome would you find that there won't be an information gain? No. Don't be lazy, go try it yourself! But notice that it will take a lot more computation, and the runs may take some years unless you write a version that uses parallel processors.
Where did you get that cool dinosaur picture? http://www.dinosauria.com/gallery/chris/chris.html. It is copyrighted and is used with permission.
Do you believe in evolution? No. I don't need to believe it. It's blatantly written in tons of evidence. See: Do you believe in Evolution?. (google: riggins do you believe in evolution, alternative link)
Is there an easy way to run the program myself without lots of work? Yes! You can Run an Evolutionary Model on Your Own Computer. This is a Java version of the program, and it runs on Suns, Macs, Windows and Linux (Ubuntu: you will need to install Java - follow the directions).

Can the mistakes be expressed as Type I and Type II errors? Yes. Let the Null Hypothesis (Ho) be: "there is no site at this position in the genome". Then, following Type I and II errors in HyperStat Online we have:

                                    | True state of null hypothesis
Statistical decision                | Ho true = there is no site | Ho false = there is a site
Reject Ho = site found              | Type I error               | Correct
Do not reject Ho = site not found   | Correct                    | Type II error

The color coding is the one used in the Java version:

green for sites found in the right place,
red for sites missed from the right place,
yellow for sites found in the wrong place,
but blue (the remaining case: no site and none found) is normally not displayed.
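The decision table and the color coding can be combined into one small classifier. This is a sketch rather than code from the Java version: the function names are ours, and assigning blue to the "no site, none found" cell is inferred by elimination from the three colors listed above.

```python
def classify(is_site, found):
    """Map one genome position to (statistical outcome, display color)."""
    if is_site and found:
        return "correct", "green"        # site found in the right place
    if is_site:
        return "Type II error", "red"    # site missed
    if found:
        return "Type I error", "yellow"  # site found in the wrong place
    return "correct", "blue"             # inferred: normally not displayed

def count_mistakes(true_sites, found_sites, genome_positions):
    """Mistakes are the Type I plus Type II errors over all positions."""
    outcomes = [classify(p in true_sites, p in found_sites)[0]
                for p in range(genome_positions)]
    return sum(outcome != "correct" for outcome in outcomes)

# Hypothetical example: sites at 2 and 7, recognizer fires at 2 and 5.
print(count_mistakes({2, 7}, {2, 5}, 10))  # 2
```

In the example, position 5 is a Type I error and position 7 is a Type II error, giving two mistakes.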

References:

google: type II errors
Type I and II errors in HyperStat Online

Why do the sequence logos sometimes go below zero? The computed information has to be corrected for small sample size. In the method used, this makes small negative deviations. See:
- Schneider.Zen2002 figure 3
- small sample correction
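A short Python sketch shows how the correction can push a position below zero. It uses the common large-sample approximation e(n) = (4 - 1) / (2 ln 2 · n) for the correction; that choice is an assumption here, and the software that draws the logos may compute the correction differently (for example, exactly for small n). The function names are ours.

```python
import math

def small_sample_correction(n, alphabet_size=4):
    """Approximate bias (in bits) of an entropy estimated from n sequences."""
    return (alphabet_size - 1) / (2 * math.log(2) * n)

def position_information(counts):
    """Corrected information (bits) at one position, from base counts."""
    n = sum(counts)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
    max_entropy = math.log2(len(counts))  # 2 bits for four bases
    return max_entropy - (entropy + small_sample_correction(n, len(counts)))

# 16 sequences with perfectly even base counts: observed entropy is the
# full 2 bits, so the correction alone makes the result slightly negative.
print(position_information([4, 4, 4, 4]))

# 16 sequences that all carry the same base: close to 2 bits.
print(position_information([16, 0, 0, 0]))
```

A position that happens to look exactly random therefore lands a little below zero, which is what appears in the logo.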
Why are there so few mistakes at the beginning? Cristi asked:
I ran the program with a few different seeds, and the best organism is at the first step already in a great shape, with only around 20 mistakes. I think that is not a reasonable starting state for the population; the best organism at the first step should have at least about 200 mistakes, if not be even closer to the maximum number of mistakes. (Unfortunately, I cannot modify the threshold to deal with that, and I am not going to try more seeds either, since it does not appear to go anywhere far from those values.)
You didn't say what your parameters were, but suppose that you have 16 sites and 64 organisms as in the standard Java run. Sorting gives the best organism, of course, so right away you have a strong bias. Why 20? I guess that this is most easily "accomplished" by having a weight matrix that does not recognize ANYTHING, or has little recognition capability. If it didn't recognize anything there would be exactly 16 mistakes. This could happen by having a very high initial threshold. If it accidentally recognized 4 more sites in the wrong locations, that would account for your 20. This is a hypothesis, so you can test it by looking closely at the organism that has that situation. I do agree it is a somewhat curious effect. Would it happen in nature? Sure. All that has to happen is that a recognition protein is duplicated (apparently a common occurrence, since we see lots of nearly identical genes in various organisms and the recombination mechanism for doing this is pretty well understood). Then one copy diverges so that it doesn't recognize much at all on the DNA. As it then starts to locate a few spots, if it matches, WHOOSH, selection takes over and it locks on. This effect occurs in Ev too, of course.
What would happen if the threshold were forced to always be zero? Evidently, it doesn't make much difference. See: Zero Threshold.
Why is the number of initial mistakes often the number of sites? Paul asked. When the organisms are generated randomly at the beginning of a run, some will have a high weight matrix threshold, and this means that their weight matrix cannot recognize anything. In that case, the non-sites are (correctly) not recognized and the sites are missed too, so the initial number of mistakes of the best organism is the number of sites. In a large enough initial random population of creatures, it is likely that one has a high threshold. That creature is likely to make the fewest mistakes and so that one is displayed. Here's an example: There were 64 creatures and 16 sites. The Pascal version of the program was run 2000 times, once a second, using the timeseed so that a different initial random number started each run. About 16% (317 in 2000) of the runs started with 16 mistakes. About 1% (15 in 2000) of the runs started with fewer than 16 mistakes.
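The effect of a very high threshold can be checked with a few lines of Python. This is a sketch under stated assumptions, not the Pascal or Java code: the 256-position genome, the evenly spaced site positions, the made-up scores and the "at or above the threshold" rule are all hypothetical.

```python
def mistakes_at_threshold(scores, site_positions, threshold):
    """Count missed sites plus positions wrongly recognized as sites."""
    found = {i for i, s in enumerate(scores) if s >= threshold}
    missed_sites = len(site_positions - found)
    false_hits = len(found - site_positions)
    return missed_sites + false_hits

# Assumed layout: 16 sites spread over 256 positions.
sites = set(range(0, 256, 16))
# Arbitrary deterministic scores between -50 and 50, standing in for
# what a random initial weight matrix might produce.
scores = [(i * 37) % 101 - 50 for i in range(256)]

# Threshold far above every score: nothing is recognized, so the only
# mistakes are the 16 missed sites.
print(mistakes_at_threshold(scores, sites, threshold=1000))   # 16

# Threshold far below every score: everything is recognized, so every
# non-site becomes a mistake.
print(mistakes_at_threshold(scores, sites, threshold=-1000))  # 240
```

With few sites in a large genome, recognizing nothing is far cheaper than recognizing everything, which is why a high-threshold creature tends to be the best of a random initial population.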
How is the distribution of threshold values related to the distribution of mistakes initially? I presume the lower the threshold, the higher the number of mistakes. Paul asked. Yup! But it's not a strong effect; notice the regression line. (Density plot of the same data.)


Schneider Lab
origin: 2005 May 24
updated: 2013 May 08
