By downloading this code you agree to the
Source Code Use License (PDF). |
{ version = 4.13; (* of ev.p 2011 Sep 28}
(* begin module describe.ev *)
(*
name
ev: evolution of binding sites
synopsis
ev(list: out, all: inout, evp: inout, output: out)
files
list: a record of the evolutionary events that occurred. See the
description of evp for the kinds of data that can be printed. The data
may be graphed as desired using the xyplo program. The first few lines
of the file form an informative header. All of these lines begin with
an asterisk, '*', which acts as a comment. The data themselves are
organized into individual lines broken by spaces into a set of tokens.
These are described in the header.
all: All the variables, genome sequences and genetic structure to allow
continuation of the evolution. Start this as an empty file. Under
unix you can use:
touch all
After the ev program has been run, the all file will be filled with the
last generation. If you run ev again, it will use the current all file
to continue. To make a movie, step by 1 or more generations and use
evd to extract information from the all file. When you want to start
over, clear the file out. To do this under unix you can:
echo -n "" > all
evp: parameters to control the program, one per line:
parameterversion: The version number of the program. This allows the
user to be warned if an old parameter file is used.
When possible, the program will upgrade old parameter files.
Number of creatures (c, integer).
Number of potential binding sites per creature (G=Genome, integer).
Number of sites per creature (gamma, integer).
Width of the recognizer in bases (width, integer).
Bases per integer of the recognizer weights (bpi, integer).
Mutation rate. The original method gives the rate in hits per
creature per generation (mu, integer).
New method as of 2006 Jun 28: mutations can be specified in
two ways: per genome or per base.
This is determined by the first character on the line.
if the first character is:
'g': mutations per genome per generation
'b': mutations per base per generation
If the first character is a digit or a space, then the
original method (g, mutations per genome) is used.
Each of these is followed by an integer.
When mutations per base are given, the form is mutations
in a defined length of sequence.
{mmm}
Method: Provide mutations per base as a real number: "1 in
every ______ bases" Compute: m = genome size * mutations /
bases. To create mutations in an organism, split m into
integer (mi) and decimal (md) parts. Do the integer part of m
mutations. Chose a random number r between 0 and 1. If r <=
md, do an additional mutation. In this way the requested
mutations will be done on average.
Note: The original method was implemented for speed.
However, it is not natural in the sense that polymerases and
other factors work on a per base basis. So the new method is
more reasonable to use if one is exploring effects of genome
size.
Seed: a real number between 0 and 1 used to start the random
number generator. The date and time is used if this
number is outside 0 to 1.
Cycles: number of additional generations to run (cycles, integer).
Display interval: for example, 10 means every 10th generation.
Display control: the first 7 characters on the line control
the kind of data printed to the list file:
a = display average number of mistakes and the standard
deviation for the population.
c = display changes in the number of mistakes.
The current Rsequence is given if r (below) is turned on.
This allows graphs of Rsequence vs mistakes to be made.
g = display genomic uncertainty, Hg. If this deviates much
from 2.0, then the model is probably bad.
i = display individuals' mistakes
o = display orbits: information of individual sites is shown
r = display information (Rsequence, bits)
s = current status (range of mistakes) is printed to the output
file.
m = current status (range of mistakes) is printed to the list
file.
These may be in any order. Any other characters (eg, blanks) are
ignored.
Selecting: boolean. If true (ie the first character is 't'), then the
organisms are sorted by their mistakes. If false, then the
organisms are randomly sorted. Normally this should be 'true', but
it does allow one to switch the selection off suddenly and watch
random drift (with no evolution) and the decay of existing patterns
by entropy increase. Selecting is true unless the first character
on the line is an 'f'.
StorageFrequency: The frequency (every so many generations) with which
to store a copy of everything in the all file. If the computer
crashes part way through a long run, then the run can be continued
from the last storage. Of course, a storage is always made at the
end of the evolution.
RandomSites: If the first character is 'r' then the gamma sites are
placed randomly. 'n' means that the sites are placed randomly,
but not overlapping. 'a' means that the sites are placed in a
regular array with constant spacing. In either case sites are
placed outside the recognizer gene, and there are exactly gamma
sites. That is, two randomly placed sites cannot be in the same
place. This avoids conflicts that would make interpretation of the
results difficult.
dummy: real.
This function has been removed. Just use 0 as the value.
specialrule: char. SPECIAL RULE parameter. This parameter (introduced
[2001 June 6]) determines what to do when two organisms tie for
survival. If the first character on the line is:
r - random survival of one organism
b - both survive (original method)
s - one survives depending on the sorting algorithm. This is an
arbitrary function of how the quicksort routine works and is
therefore generally not a good way to make the decision.
See below for more details.
haltoncondition: char. This parameter (introduced
[2006 June 24]) causes ev to halt when a given condition
has occured.
If the first character on the line is:
- none - no halting condition
r Rs>=Rf - the best creature has Rs at least equal to Rf
m mistakes 0 - the best creature makes no mistakes
b both r and m
Note: As of version 3.68 the program can automatically upgrade
the parameter file.
output: messages to the user, including warnings about conditions,
If the display control in evp (see above) includes 'o', then
the generation number and the range of mistakes are given.
If the display control includes 'a', then the mean and standard
deviation of the mistakes are also given.
description
A population of evolving creatures is modelled. Each creature
consists of a genome made of the 4 bases. All creatures have a
certain number of binding sites, and the recognizer for the sites
is encoded by a gene in each genome. The genomes are completely
random at first. The recognizer of each creature is translated
from the gene form to a perceptron-like weight matrix. This matrix
is scanned across the genome. The number of mistakes made is
counted. There are two kinds of mistake:
how many times the recognizer misses a real site
and
how many times a non-site is detected by the recognizer.
These are weighted equally. (If they were weighted differently it
should affect the rate but not the final product of the model.)
All creatures are ranked by their number of mistakes. The half of
the population with the most mistakes dies; the other half
reproduces to take over the empty positions. Then all creatures
are mutated and the process repeats.
The integer weights of the recognizer are stored as base 4 numbers
in twos complement notation. a=00, c=01, g=10, t=11. If 'bases
per integer' were 3, then aaa encodes 0, acg is 6, etc. txx and
gxx (where x is any base) are negative numbers; ttt is -1.
The threshold for recognition of a site is encoded in the genome
just after the individual weights. It is encoded by one integer.
************************************************************************
SPECIAL RULE: if the bugs have the same number of mistakes, reproduction
(by replacement) does not take place. This ensures that the quicksort
algorithm does not affect who takes over the population, and it also
preserves the diversity of the population. [1988 October 26] Without
this, the population quickly is taken over and evolution is extremely
slow!
[2001 June 6] In response to William Dembski's objection (see below for a
link) that this rule is inserting information into the results, a new
parameter is now available to turn off this rule.
What does this rule mean? It means that in a duel it is possible for two
bugs to have a tie. This is not an unreasonable result, and it frequently
happens in the natural world! What does removing the rule mean? It means
that the bugs will duel to the death, even if it is an arbitrary death!
Clearly this cannot affect the overall evolution, but it might slow things
down. Indeed it is interesting because it is often the case that in
fights the combatants are not killed.
************************************************************************
TO RUN THE PROGRAM, create an empty all file (see the description of the
all file for how to do this) and obtain a copy of an evp file (see the
"see also" section below for links to examples). Note that in parameter
files everything to the right of the actual parameter is ignored on each
line. Do not delete this material, it serves as a comment to remind you
what the parameter does! Then you can run the ev program. The results
are in the list and all files. To see the contents of the all file (which
will be a binary on your computer) use the evd program. You can use the
xyplo program to plot items from the list file.
HISTORICAL NOTE: After getting the idea that Rsequence is close to
Rfrequency from data on ribosome binding sites (on 1982 December
7), it became clear that this had to have evolved. Eventually I
recognized that Rsequence must evolve toward a more or less set
Rfrequency. (See the paper Schneider.ev2000 for a discussion of
this.) The challenge was to test this hypothesis. My first
thinking was that a model of the evolution would take thousands of
years on a computer. However, I read the book "Ever Since Darwin,
Reflections in Natural History" by Stephen J. Gould and found in
there an interesting chapter "Is the Cambrian Explosion a Sigmoid
Fraud?" (Gould1977.sigmoid) This described Gould and Eldredge's
idea of punctuated equilibrium. I realized that with selection, if
an organism has an advantage it can take over a population
exponentially. By 1984 I had written up my thesis (on the
observation that Rs ~ Rf, Schneider1986) and had a week before my
defense in which I had nothing more to do. So I wrote the ev
program and it showed that Rs does indeed evolve toward Rf. I
don't recall that I even mentioned this during the defense, but it
was a nice backup!
documentation
For information calculations, see:
@article{Schneider1986,
author = "T. D. Schneider
and G. D. Stormo
and L. Gold
and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide sequences",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431",
year = "1986"}
The inspirational article is:
@inproceedings{Gould1977.sigmoid,
author = "S. J. Gould",
title = " Is the Cambrian Explosion a Sigmoid Fraud?",
booktitle = "Ever Since Darwin, Reflections in Natural History",
publisher = "W. W. Norton \& Co.",
address = "N. Y.",
name ="Stephen Jay",
pages = "126-133",
year = "1977"}
Applying the Perceptron (a simple neural net) to binding sites:
@article{StormoPerceptron1982,
author = "G. D. Stormo
and T. D. Schneider
and L. Gold
and A. Ehrenfeucht",
title = "Use of the {`Perceptron'} algorithm to distinguish translational
initiation sites in {{\em E. coli}}",
journal = "Nucl. Acids Res.",
volume = "10",
pages = "2997-3011",
year = "1982"}
The evolution paper describing this program is:
@article{Schneider.ev2000,
author = "T. D. Schneider",
title = "Evolution of Biological Information",
journal = "Nucleic Acids Res",
volume = "28",
number = "14",
pages = "2794-2799",
year = "2000"}
examples
A lovely (but long) evolution can be had with the following evp:
*******************************************************************************
3.42 version of ev that this parameter file is designed for.
32 NUMBER OF CREATURES
1024 NUMBER OF BASES PER CREATURE, G
64 NUMBER OF SITES PER CREATURE, gamma
6 WIDTH OF THE RECOGNIZER IN BASES
5 BASES PER INTEGER OF THE RECOGNIZER
1 MUTATION RATE IN HITS PER CREATURE PER GENERATION
0.50 SEED FOR THE RANDOM NUMBER GENERATOR
40000 CYCLES
10 DISPLAY INTERVAL
cgrs5678 a=av, c=change, i=indivls, g=Hg, r=Rs, o=orbit, s=status, m=mistakes
true SELECTING
1000 STORAGE FREQUENCY
a r = random site placement, n = no overlap random, a = array
0 sigma
b SPECIAL RULE for ties: r: random, b: both, s: sort
- halt on condition: - none; r Rs>=Rf; m mistakes 0; b both r and m
*******************************************************************************
The list file may be plotted with this parameter file for xyplo, the xyplop:
*******************************************************************************
3 6 zerox zeroy graph coordinate center
x 0 40000 zx min max (character, real, real) if zx='x' then set xaxis
y -1 6 zy min max (character, real, real) if zy='y' then set yaxis
10 28 1 1 xinterval yinterval xsubintervals ysubintervals: axis intervals
8 8 xwidth ywidth width of numbers in characters
0 2 xdecimal ydecimal number of decimal places
15 15 xsize ysize size of axes in cm
generation
Rsequence (bits) | Hg (bits) near 2 | mistakes/gamma are connected circles
a zc 'c' crosshairs, axXyYnN
n 2 zxl base if zxl='l' then make x axis log to the given base
n 2 zyl base if zyl='l' then make y axis log to the given base
*********************************************************************
1 3 xcolumn ycolumn columns of xyin that determine plot location
2 symbol column the xyin column to read symbols from
0 0 xscolumn yscolumn columns of xyin that determine the symbol size
0 0 0 hue saturation brightness columns for color manipulation
*********************************************************************
symbol to plot 'c'=circle, 'b','d'=box, 'x', '+', 'I', 'f', 'g'
r symbol flag character in xyin that indicates that this symbol
0.05 symbol sizex side in inches on the x axis of the symbol.
0.05 symbol sizey as for the x axis, get size from yscolumn
cl 0.05 connection (example for connection is c- 0.05 for dashed 0.05 inch)
n 0.05 linetype size linetype l.-in and size of dashes or dots
*********************************************************************
symbol to plot 'c'=circle, 'b','d'=box, 'x', '+', 'I', 'f', 'g'
g symbol flag character in xyin that indicates that this symbol
0.05 symbol sizex side in inches on the x axis of the symbol.
0.05 symbol sizey as for the x axis, get size from yscolumn
cl 0.05 connection (example for connection is c- 0.05 for dashed 0.05 inch)
n 0.05 linetype size linetype l.-in and size of dashes or dots
*********************************************************************
c symbol to plot 'c'=circle, 'b','d'=box, 'x', '+', 'I', 'f', 'g'
c symbol flag character in xyin that indicates that this symbol
0.05 symbol sizex side in inches on the x axis of the symbol.
0.05 symbol sizey as for the x axis, get size from yscolumn
cl 0.05 connection (example for connection is c- 0.05 for dashed 0.05 inch)
n 0.05 linetype size linetype l.-in and size of dashes or dots
*********************************************************************
.
**** version 8.50 of xyplop: uses cm for distances
. 0 6 0.10
. 0 5 0.10
- 0 4 0.10
. 0 3 0.10
. 0 2 0.10
. 0 1 0.10
l 0 0 1
*******************************************************************************
Note that dotted lines are placed across the graph for each bit, but a
dashed line is put for Rfrequence (= 4 bits). The vertical axis represents
the three kinds of data in the list file.
see also
OTHER PROGRAMS
Program to display the current population: evd.p
Program to graph the list file: xyplo.p
EXAMPLE FILES
A very early original test run:
parameter file ev.original.evp
list ev.original.list
output from the evd.p display program:
display ev.original.display
evfeatures ev.original.evfeatures
genomes ev.original.genomes
sites ev.original.sites
Example ev parameter file: ev.example.evp
Example xyplop parameter file given above: ev.example.xyplop
Parameter file for selective phase in the paper: evp.selection
Parameter file for atrophy phase in the paper: evp.atrophy
LINKS TO THE PAPER
The "Evolution of Biological Information" paper (with active hypertext
links in references): https://alum.mit.edu/www/toms/paper/ev/
The "Evolution of Biological Information" paper at Nucleic Acids
Research: http://nar.oupjournals.org/cgi/content/full/28/14/2794
TO MAKE LOGOS:
Briefly, use the evd.p program to get the sequences (in the sites file)
followed by makebk.p (preferably in automatic mode) to create a Delila
book. Then use the delila system to create the logo:
https://alum.mit.edu/www/toms/delila.html. More details are given in
the evd.p program documentation.
***********************************************************************
COMPILING AND TIME FUNCTIONS:
https://alum.mit.edu/www/toms/delila.html#How.To.Compile
***********************************************************************
RELATED WORKS:
William Dembski's objection:
http://www.metanexus.net/archives/message_fs.asp?list=views&listtype=Magazine&action=sp_simple_archive_&page=1&ARCHIVEID=3294
Page testing William Dembski's claim:
https://alum.mit.edu/www/toms/paper/ev/dembski
author
Thomas Dana Schneider
bugs
none known (well, actually it's full of them... ;-)
technical notes
RANDOM NUMBERS: The program can use the date and time to create random
numbers. These are computer system calls and may not function correctly
on other computers. One solution is to simply delete the calls to disable
them if you are having trouble compiling. You can always set different
seeds to start the random numbers by hand. See the link above for modules
to change the time function for different compilers.
SPEED IMPROVEMENT: 2002 April 3. I noticed that not all cases of passing
the e: everything were by var. They were being passed by value. Since
this means that a *copy* of "everything" is made for each call, this
should be very expensive! I ran the program (version 3.71) on an
evp.original and it took 9.5 or 9.6 user seconds on my Sun Ultra 60
according to the Unix time function. I then changed the variable e:
everything in the dorseq procedure to be var (passed as a variable
pointer) and the time dropped to 5.7 seconds! Here are the successive
improvements for the changes:
no change 9.51+/-0.08 seconds in 10 tries
dorseq 5.72+/-0.04 seconds in 10 tries !!!
puteverything 5.74+/-0.05 seconds in 10 tries
listparams 4.12+/-0.04 seconds in 10 tries
dori 4.05+/-0.05 seconds in 10 tries
dogenomich 2.01+/-0.03 seconds in 10 tries !!!
makemoreheader 2.01+/-0.03 seconds in 10 tries
putout 2.00+/-0.02 seconds in 20 tries (2.005 seconds)
dori 2.01+/-0.03 seconds in 20 tries change fbl
Wow, nearly a 10 fold speed up ...
*)
(* end module describe.ev version = 2.50; (@ of ev 1988 oct 6 *)
{This manual page was created by makman 1.45}
{created by htmlink 1.62}