Dissection of "A Vivisection of the ev Computer Organism: Identifying Sources of Active Information"

Thomas D. Schneider
[Figure: The cycle of evolution as done in the Ev program. A circle of arrows runs from mutate to evaluate to sort to kill to replicate and back to mutate. An inner arrow running from evaluate through sort to kill is labeled 'selection'.]

This is a brief (and perhaps preliminary) response to:

@article{Montanez.Marks2010,
author = "G. {Monta\~{n}ez}
and W. Ewert
and W. Dembski
and R. Marks",
title = "{A Vivisection of the ev Computer Organism: Identifying
Sources of Active Information}",
journal = "BIO-Complexity",
volume = "2010",
pages = "1--6",
url = "http://bio-complexity.org/ojs/index.php/main/article/view/36",
year = "2010"}

The abstract of the above paper claims:

"ev is an evolutionary search algorithm proposed to simulate biological evolution. As such, researchers have claimed that it demonstrates that a blind, unguided search is able to generate new information."

That's not what is claimed in the original paper, Evolution of Biological Information.

There are two parts to evolution: replication with variation AND selection. In addition to mutational variation, the ev program has selection, so it does not do an 'unguided search'.

[Figure 2 of the paper Evolution of Biological Information: information (bits per site) ranging from -1.0 to 6.0 bits versus generation running from 0 to 2000. A dashed line at 4.0 bits is Rfrequency. The evolution of the binding sites is shown by a green curve that starts near zero bits, evolves to around 4.0 bits by 1000 generations, and then oscillates around 4 bits. Selection covers this entire range. A second, red curve starts at generation 1000 without selection and decays exponentially to near zero bits.]

Chris Adami has pointed out that the genetic information in biological systems comes from the environment. In the case of Ev, the information comes from the size of the genome (G) and the number of sites (γ), as stated clearly in the paper. From G and γ one computes Rfrequency = log2(G/γ) bits per site, the information needed to locate the sites in the genome. The information measured in the sites (Rsequence, bits per site) starts at zero, and entirely by an evolutionary process Rsequence converges on Rfrequency. In the figure to the right, Rfrequency is shown by the dashed line and the evolving Rsequence is the green curve. At 1000 generations the population was duplicated; in one copy selection continued (horizontal noisy green curve) and in the other selection was turned off (exponentially decaying noisy red curve). Thus the information gain depends on selection and is not blind and unguided. The selection is based on biologically sensible criteria: having functional DNA binding sites and not having extra ones. So Ev models the natural situation.
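The Rfrequency calculation can be checked in a couple of lines. This is a minimal sketch using the Ev run's parameters given elsewhere on this page (a genome of 256 potential positions and 16 binding sites):

```python
from math import log2

G = 256       # number of potential binding-site positions in the Ev genome
gamma = 16    # number of binding sites
rfrequency = log2(G / gamma)   # bits per site needed to locate the sites
print(rfrequency)              # 4.0, matching the dashed line in Figure 2
```

This is the 4.00 bits/site that the evolving Rsequence converges on.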

You can try this experiment yourself on any computer in less than 10 seconds (!) by using the new
Java version of Evj
  1. Click on the link above; a window opens
  2. Wait for a second window to open
  3. Click on "Speed ^" (the up arrow) several times to set the Speed to 21 (left side control panel on top)
  4. Click on "Run" (left side control panel on top)
Result: The evolution rapidly goes to completion (by generation 675) with
          Rfrequency = 4.00 bits/site
          Rsequence  ≈ 4 bits/site (fluctuating)
So the information in the binding sites (Rsequence) evolves to match the predicted value (Rfrequency).
See the Guide to Evj for more details about running this program.
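The mutate, evaluate, sort, kill, replicate cycle shown in the figure at the top can be sketched as a toy loop. This is an illustrative stand-in, not the actual Ev/Evj code: the fitness here is a simple count of bits that disagree with a fixed target, standing in for Ev's count of misrecognized binding sites, and all parameters are invented for the sketch:

```python
import random

random.seed(0)

TARGET = [1] * 16            # stand-in 'environment' the population adapts to
POP, GENOME, GENERATIONS = 32, 16, 200

def mistakes(genome):
    """Number of positions disagreeing with the target (lower is better)."""
    return sum(g != t for g, t in zip(genome, TARGET))

population = [[random.randint(0, 1) for _ in range(GENOME)]
              for _ in range(POP)]

for generation in range(GENERATIONS):
    # mutate: flip one random position in every genome
    for genome in population:
        i = random.randrange(GENOME)
        genome[i] ^= 1
    # evaluate + sort: rank genomes by number of mistakes (this is selection)
    population.sort(key=mistakes)
    # kill the worst half and replicate the best half in their place
    half = POP // 2
    population[half:] = [list(g) for g in population[:half]]

best = mistakes(population[0])
print(best)   # near zero after convergence
```

Turning off the sort/kill/replicate steps while keeping mutation reproduces the red curve of Figure 2: the mistakes drift back up, because variation alone destroys the accumulated match to the target.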

Note that the size of the genome is determined by the number of functions required for survival in an environment. For example, the bacterium E. coli has about γ = 4,000 genes. It doesn't need more than that to survive in its environment. We know that parasites have fewer genes because they can use nutrients from their host. So E. coli doesn't lose genes either, and γ is fixed by the environment, as is the size of the genome. E. coli has a genome of about G = 4.6 × 10^6 base pairs, each of which codes for roughly only one strand of mRNA. So Rfrequency = log2(4.6 × 10^6 / 4000) = 10.1 bits per site. This information is determined by the environment. The information in a verified set of E. coli genes is Rsequence = 10.35 (±0.16) bits. So Rfrequency is within two standard deviations of Rsequence. The information from the environment becomes embedded in the genome by natural selection. The Ev/Evj programs demonstrate this process. (For further details on this computation, see Anatomy of Escherichia coli Ribosome Binding Sites.)
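The E. coli arithmetic can be checked directly. With these rounded inputs the ratio comes out at about 10.2 bits per site, close to the quoted 10.1 figure and well within two standard deviations of the measured Rsequence:

```python
from math import log2

G = 4.6e6        # approximate E. coli genome size in base pairs
gamma = 4000     # approximate number of genes
rfrequency = log2(G / gamma)         # about 10.2 bits per site

rsequence, sd = 10.35, 0.16          # measured value for a verified gene set
print(abs(rsequence - rfrequency) < 2 * sd)   # True: within two std devs
```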

Conclusion

A common misconception about how evolution works is to say it is 'unguided', as Montañez et al. claim. This entirely misses Darwin's concept of variation and natural selection. It is clear that variation, being a randomizing process, will destroy information. However, natural selection is sufficient to overcome the variation. This is because the amount of variation is also under genetic control and is selected to be optimal. We know this because there are mutations of bacteriophage T4 DNA polymerase (and other polymerases) that improve their fidelity. So the basic premise of Montanez.Marks2010 is incorrect. This is amply demonstrated by the Evj program in a few seconds. Furthermore, the result from Evj, that the information in binding sites evolves to match that needed to locate the binding sites, is observed in a number of natural systems (Schneider et al. 1986, Stephens & Schneider 1992, Shultzaberger et al. 2001), so Evj models the natural results.


Notes on Montanez.Marks2010
  1. Page 2: The definitions are not the ones Shannon used. They are not adequate for computing the information at the binding sites. However, since P = γ/G is the probability of a binding site, Rfrequency = -log2 P, which is of the same form.
  2. Page 2: "genome consisting of a 256 base string (excluding five extra bases at the end that are not part of the genome proper)" The 5 extra bases are part of the genome and the weight matrix is sensitive to what is in them. The reason for them is to make the number of potential sites exactly 256 and yet allow the weight matrix to be in all positions. An alternative would be to make a circular genome.
  3. Page 2: The coordinates given do not match the original figure coordinates. The actual coordinates, taken from the file (dated Jul 10, 2000) that generated the original Ev paper figure are:
    132 141 148 158 164 174 181 191 201 208 214 223 232 240 248 256
    
    The differences between these numbers are:
    9  7 10  6 10  7 10 10  7  6  9  9  8  8  8
    
    The coordinates reported in Montanez.Marks2010 are:
    1 10 17 26 33 43 50 60 70 76 83 92 101 109 117 125
    
    The differences between these numbers are:
    9  7  9  7  10  7 10 10  6  7  9  9  8  8  8
    
    Two of the differences disagree (the 3rd and 4th, and the 9th and 10th), so two of the sites were misplaced by one base. However, this error will not change the results: Ev and Evj are entirely different programs yet give the same results, as do repeat runs of either program with different initial random seeds.
  4. Page 3: "In the search for the binding sites, the target sequences are fixed at the beginning of the search. The weights, bias, and remaining genome sequence are all allowed to vary." This is not what Ev does: ALL sequences, including the sites, are allowed to vary. DNA recognition proteins coevolve with their binding sites, so the target sequences are not fixed.
  5. A concern about the last point: is this unnatural constraint acknowledged in the remaining text? They present multiple models of what ev is, but ev as published was only one algorithm.
  6. Page 4: "We show here that the evolutionary algorithm originally used for ev performs worse than simple stochastic hill climbing." While this may be true, it's irrelevant, since directed stochastic hill climbing is not a mechanism available to natural genomes. The evolutionary process does something like that by using multiple organisms, each varying in a different direction. The ones that vary the wrong way are lost and the remainder survive. So it's 'inefficient' but it works. There is no way for an evolutionary mechanism to work within one organism on the whole organism. The immune system is within a single organism, but it too has multiple entities out of which useful ones are selected.
  7. Page 6. They do not compute the information in the binding sites. So they didn't evaluate the relevant information (Rsequence) at all.
    As far as ev can be viewed as a model for biological processes in nature, it provides little evidence for the ability of a Darwinian search to generate new information.
    This is incorrect. They aren't looking at the right information. What they call "external knowledge" is Rfrequency, which represents information outside the genetic control system, in general outside of or directed from the environment. But it is absolutely clear from looking at the sequence logo at the start of any Evj run that the information inside the genome starts at zero and then increases to the requisite amount by evolution.
    Rather, it demonstrates that preexisting sources of information can be re-used and exploited, with varying degrees of efficiency, by a suitably designed search process, biased computation structure, and tuned parameter set.
    So that's exactly what the evolutionary process is doing - adapting the organism to the environment by capturing information about the environment in its genome.
  8. They appear to be claiming that Ev is not an efficient algorithm compared to others out there. If so, they missed the point of Ev being a model of natural genetic control systems! It may be inefficient, but it gets the job done, and the evolution is rapid anyway!
  9. As a double check, I modified Ev (the original Pascal program) to ignore the threshold by setting it to zero. The evolution still occurred and Rsequence still converged to Rfrequency. So a lot of what is said in this paper is not relevant.
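The spacing comparison in note 3 can be reproduced mechanically. This sketch takes the two coordinate lists from that note and reports the positions (counting from zero) where their successive differences disagree:

```python
# Coordinates from the file that generated the original Ev paper figure:
original = [132, 141, 148, 158, 164, 174, 181, 191,
            201, 208, 214, 223, 232, 240, 248, 256]
# Coordinates reported in Montanez.Marks2010:
reported = [1, 10, 17, 26, 33, 43, 50, 60,
            70, 76, 83, 92, 101, 109, 117, 125]

def spacings(xs):
    """Successive differences between adjacent coordinates."""
    return [b - a for a, b in zip(xs, xs[1:])]

d_orig, d_rep = spacings(original), spacings(reported)
mismatches = [i for i, (a, b) in enumerate(zip(d_orig, d_rep)) if a != b]
print(mismatches)   # [2, 3, 8, 9]: two adjacent pairs, i.e. two sites off by one base
```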

Summary

Aside from their propensity to veer away from the actual biological situation, the main flaw in this paper is an apparent misunderstanding of what Ev is doing: namely, what information is being measured and that there are two different measures. The authors only worked with Rfrequency and ignored Rsequence. They apparently didn't compute information from the sequences. But it is the increase of Rsequence that is of primary importance to understand. Thanks to Chris Adami, we clearly understand that information gained in genomes reflects 'information' in the environment. I put environmental 'information' in quotes because it is not clear that information is meaningful entirely outside the context of a living organism. An organism interprets its surroundings, and that 'information' guides the evolution of its genome.


Response by Robert Marks

The autopsy of a dissection of a vivisection: response to Schneider's response, by Robert Marks (2011-09-03)

"We agree with Schneider that information is gained by the genome through extraction of it from the environment."
Since this is what happens in nature, the game is over. Saying that the mechanism is a perceptron, etc., just obscures the issues. Ev models what happens in nature, as shown by my previous work on binding sites, Information Content of Binding Sites.


Other papers and ideas by Dembski have been evaluated previously on related pages.



Schneider Lab

origin: 2011 Feb 16
updated: version = 1.07 of Montanez.Marks2010.html 2013 Nov 12