By downloading this code you agree to the
Source Code Use License (PDF). |
{ version = 2.83; (* of ri.p 2013 Jun 06}
(* begin module describe.ri *)
(*
name
ri: Rindividual is calculated for every site in the aligned book
synopsis
ri(inst: in, book: in, rsdata: in, values: in, rip: in, wave: in,
rixyin: out, sequ: out, ribl: out, riinst: out, output: out)
files
inst: delila instructions of the form 'get from 56 -5 to 56 +10;'
(This file may be empty, in which case the sequences will be
aligned by their 5' ends.)
book: the book generated by delila using inst
rsdata: data file from rseq program
values: a file containing the values of the objects to which the Ri
values are to be compared. The file may be empty. The file
can contain some values and then terminate. Further values
will be assumed to be zero. This allows one to copy a rixyin file
produced by ri in to the values file. Then one adds a new site
runs delila etc and when ri runs again it will get the old values
lined up to the new values, with zero for the one that was
not there previously.
rip: Parameters to control the program. The file must contain the
following parameters, one per line:
* parameterversion: The version number of the program. This allows
the user to be warned if an old parameter file is used.
* name (string in quotes): The name of the weight matrix.
* thefrom, theto (integers): The FROM and TO over which to do the Ri
calculation. These must not exceed either of those in the inst/book
or the rsdata. The FROM value must be less than or equal to the TO
value.
* column (integer): defines the column of the values file to use.
* lowerRi, upperRi (two reals): the lowest and highest Ri evaluation to
report to rixyin, sequ and riinst. If the first character of the
line is 'a' then all evaluations are reported. Otherwise two real
numbers are expected. Sequences within this range are printed to
rixyin, sequ and riinst depending also on the fifth parameter.
* lowerValue,upperValue (two reals): the lowest and highest Value
evaluation to report to rixyin, sequ and riinst. If the first
character of the line is 'a' then all evaluations are reported.
Otherwise two real numbers are expected. Sequences within this range
are printed to rixyin, sequ and riinst depending also on the fifth
parameter.
* printsequ (character): determines whether or not to produce any raw
sequences in the sequ file. If the first character of the line is
'p', sequences selected according to the third and fouth parameters
are printed to sequ file. (This is a complete on-off switch for the
sequ file.)
* printrixyin (character): determines whether or not to print the
sequence of the site being analyzed. If the first character is 'p'
then the sequence is printed to the rixyin file.
* partials (character): determines whether or not to print sequences
which have a partial site. The problem is that if there is part of a
site, then the Ri value is questionable, depending on where the
deletion was. The best analysis would not use a partial site, as it
messes up the statistics. If the first character is:
n Don't print the line at all.
i Keep the line, but force the Ri value to be -infinity.
This allows the lines of rixyin to be correlated to the values
still.
- (any other character): print as it is.
* niot (character (one of 'slnb') and/or real): negative infinity or
t: determines what to do when f(b,l) = 0. Positions for which
f(b,l) = 0 will have negative infinity in the Ri(b,l) table. The
letter 's' means to use an "extended" Rodger Staden's method of
giving f(b,l) = 1/(n+t), where t is a non-negative integer following
the 's'. When t = 0, this gives Staden's original method. Using
t = 1 would often give frequencies that are too low. The value of
t=2 corresponds to the Laplace Law of Succession.
The ``law of succession of Laplace''
\cite{Feller1968.Laplace,Papoulis1990} states that given $n$ trials
in which there were $k$ results of one kind, the best estimate for
the probability in another trial is $(k+1)/(n+2)$. In the present
case, we need the probability of the absence of a particular base
when searching for {\em another\/} binding site, so $k = 0$ and the
best estimate is $1/(n+2)$. For this reason we set $t=2$ for most
purposes.
If instead of 's' for Staden's extended method, 'l' is used, the
program will use the full Laplace Law of sucession
f(b,l) = (k+1)/(n+2) = (n(b,l)+1)/(n(l)+2)
as the estimate for the probability for k items out of n total.
This OVERESTIMATES the information content and so is NOT to be used
in general. (This can be easily demonstrated by looking at the
effect on the average of the distribution.)
If the character is 'b' then Staden's method is used as above but
Ehnb is not added. This is ONLY to be used to simulate the bug
corrected on 2001 May 10! (See Technical Notes below.)
If there is no 's', 'l' or 'b', (or when the line begins with 'n')
then the program expects a number which the value for negative
infinity. It should be a value sufficiently below zero so that
sites that are being excluded from the definition according to
f(b,l) are separated from the true sites. -1000 is a useful value,
as it will always displace sites with exceptions far away from
zero.
* alignmenttype (character): defines how to align the pieces. See the
alist program for the detailed logic. There are three choices, as in
alist:
'f' (for 'first') then the sequences are always aligned by their
first base.
'i' then the sequences are aligned by the delila instructions. If
the inst file is empty, alignment is forced to the 'b' mode.
'b' (for 'internal') then the alignment is on the internal zero of
the book's sequence. This option is to be used when "default
coordinate zero" is used in the Delila instructions.
* Ribound (real): the Ri boundary for this definition, bits.
The lower bound on Ri to find in the book.
Use 0 to represent the Second Law of Thermodynamic bound.
Use -500 or lower to locate all sites.
* Zbound (real): the Z boundary for this definition.
The lower bound on Z (standard deviation) to find in the
book.
Use 100 to locate all sites.
* Pbound (real): the probability boundary for this definition.
The upper probability P to find in the book.
Use 1.00 to locate all sites.
Ribound, Zbound and Pbound are used in conjunction, so 0, 100,
1.00 respectively would locate all sites greater than zero
bits, with no other constraint. Most of the time we adjust
Ribound and do not play with the others.
wave: Define a cosine wave. See makelogo for the definition of this
file. These data are simply copied to the end of the ribl file.
This allows the lister program to create a wave under the walkers If
you use the same wave definition for makelogo, the same wave in the
logo will be on the walker.
rixyin: input to the xyplo program. The file contains these columns of data:
1 piece number
2 piece name
3 sequence of region analyzed
4 length of region analyzed on this piece
5 aligning coordinate on the piece
6 Rindividual for the piece
7 value from the values file (or 0 if values is empty)
8 standard deviation of Rindividual for that sequence.
zzzqqq describe Ri SD computation here. This computation is EXPERIMENTAL.
The statistics is NOT PROVEN and the program is not even guaranteed
to compute this unproven value correctly!
********************************************************************************
2002 Jan 24
risd = Ri standard deviation
the test directory contains a functioning routine
that works - in pascal. It needs to be boiled down more.
2002 Jan 29
Here is the comptuation, at a position l with n(l) sequences
and f(b,l) = n(b,l)/n(l).
If the frequency is zero, then use
f := 1/(2+n(l))
otherwise use
f := f(b,l)
The (corrected to account for n(l) variation) formula given in the perl
program is
variance = ((1-f)/f)/n(l)
or
variance = ((1/f-1)/n(l)
for most cases where n(b,l) > 0
For simplicity let
b = n(b,l)
n = n(l)
variance = ((1/f(b,l)-1)/n(l)
= ((1/(n(b,l)/n(l))-1)/n
= ((1/(b/n))-1)/n
= (1/b)- (1/n)
which is amusingly simple.
********************************************************************************
sequ: the raw sequences reported to rixyin if any selection is made
(fourth line of rip file). These end in periods, so they can be
given to makebk to create a book.
ribl: weight matrix Ri(b,l). The file begins with the name of the weight
matrix. It contains the information content for each base b at
each position l, in bits. Lines that start with * are notes. The next
line contains the matrix FROM and TO coordinates (thefrom and theto).
This is followed by the matrix in the order A, C, G, T from FROM to TO.
At the end of each line of the matrix is the position (integer)
followed by the numbers of bases (A, C, G, T) at that position.
After the matrix, real numbers on individual lines report:
Ri mean (Rsequence of selected region)
Ri standard deviation
Ri of consensus sequence
Ri of anticonsensus sequence
Ri average for random (equiprobable) sequence
the value of negative infinity
These are all for the given range.
(Note: Although the mean Ri for the sites is Rsequence, to get a good
estimate of this, it is better to use the value calculated by the rseq
program because that is less sensitive to missing sequence data.)
Then one integer is given; n the number of sequences.
Then, the symmetry of the matrix is defined by the first character
on the final line:
a: asymmetric
o: odd symmetry (the zero base is in the center)
e: even symmetry (the zero base is left of the center)
The file contains any number of wave definitions, copied
from the wave file.
The file ends with a period, ".".
riinst: Delila instructions for the selected subset of sites according to
the parameters lowerRi, upperRi, lowerValue, upperValue. The gets are
given in this form:
piece K01789; get from 283 -FROM to same +TO direction +; (@ 4.555974 bits @)
This allows the user to substitute a range for the FROM and TO tags.
For example, using the unix sed:
cat riinst | sed -e "s/FROM/50/" | sed -e "s/TO/50/" > newinst
gives results like this:
piece K01789; get from 283 -50 to same +50 direction +; (@ 4.555974 bits @)
See bugs.
(In the actual instructions the '@' is replaced with *; it
can't be in these examples because they are comments in
the source code.)
output: messages to the user
description
The program determines the individual informations of the sites in the book
as aligned by the instructions, according to the frequency table given in
the rsdata file. The program calculates the Ri(b,l) table:
Ri(b,l) := 2 - (- log2( f(b,l)))
and sums this up for each sequence. Ri is defined so that the average of
the Ri's for a set of sequences is Rsequence. However, if the sequences are
incomplete, the average will probably be less than Rsequence. The rixyin
output is ready to read into the xyplo program for plotting and linear
regression. The ribl matrix is ready to be used to scan sequences with the
scan program.
The program can be used in subtle ways. For example, one can analyze the
individual information of the left half of a binding site. This result can
then be used in the values file to compare against the analysis of the right
side of a binding site.
examples
rip:
2.54 version of ri that this parameter file is designed for.
"Fis" name of the ribl matrix
-10 +10 From-to range to do the evaluation
1 column of the values file to copy to xyin
a 0 1000 lowest to highest Ri to put in xyin and sequ (a = any)
a -1000 +1000 lowest to highest Value to put in xyin and sequ (a = any)
n p means print sequence to the sequ file
p p means print sequence to the xyin file
- -: accept all sites; n: no partials; i: partials -> -infinity
s 2 s: use Staden's Method, f(b,l)=1/(n+t); else negative infinity
i alignmenttype first, book, instructions
-5000 Ribound: real; the Ri boundary for this definition
100 Zbound: real; the Z boundary for this definition
1.00 Pbound: real; the probability boundary for this definition
documentation
@article{Schneider.Ri,
author = "T. D. Schneider",
title = "Information Content of Individual Genetic Sequences",
journal = "J. Theor. Biol.",
volume = "189",
number = "4",
pages = "427-441",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/paper/ri/}
{https://alum.mit.edu/www/toms/paper/ri/}",
comment = "indiv.tex",
comment = "Submitted, April 1997",
year = "1997"}
@article{Schneider.walker,
author = "T. D. Schneider",
title = "Sequence Walkers:
a graphical method to display how binding proteins
interact with {DNA} or {RNA} sequences",
journal = "Nucl. Acids Res.",
volume = "25",
comment = "walker.tex, November 1, issue 21",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/paper/walker/}
{https://alum.mit.edu/www/toms/paper/walker/},
erratum: NAR 26(4): 1135, 1998",
pages = "4408-4415",
year = "1997"}
@article{Staden1984,
author = "R. Staden",
title = "Computer methods to locate signals in nucleic acid sequences",
journal = "Nucl. Acids Res.",
volume = "12",
pages = "505-519",
year = "1984"}
@inproceedings{Feller1968.Laplace,
author = "William Feller",
booktitle = "An Introduction to Probability Theory and Its Applications",
volume = "I",
edition = "3rd",
publisher = "John Wiley \& Sons, Inc.",
address = "New York",
isbn = "0-471-25708-7",
pages = "123-124",
year = "1968"}
@book{Papoulis1990,
author = "Athanasios Papoulis",
title = "Probability \& Statistics",
publisher = "Prentice Hall",
address = "Englewood Cliffs, NJ",
comment = "Englewood Cliffs NJ 07632
QA273.P197 1990 / ISBN 0-13-711698-5.
The Laplace law of succession is on page 173",
year = "1990"}
see also
program that generates the rsdata file: rseq.p
program to plot the rixyin data file: xyplo.p
program that uses the ribl file for searches: scan.p
program that uses the ribl to create an interactive
display of sequence walkers: makewalker.p
program that uses the output of the scan program
to display multiple sequence walkers and other features
and marks along a sequence: lister.p
program that should be run to get a sequence logo
before attempting to use this program to make walkers: makelogo.p
modules used internally to this program are in: prgmod.p
For people who have the ri program installed, example files are in your
ii/delila directory under under files beginning with 'fis'.
author
Thomas D. Schneider
bugs
In riinst the organism and chromosome names are not written because they
are not available by the alignment method of reading. For now the
instructions are given as:
organism ORG; chromosome CHR;
Of course this only will work for a single organism, but that is useful!
technical notes
2001 May 10: Weight Matrix Bug corrected
The weight function is Ri = Enbh+log2(f) where f is the frequency of a
base in an aligned position, Enbh ~ 2 (2 with small sample
correction). If f is zero the function goes to negative infinity.
Since that is too extreme - another sequence might contain the base -
in the Staden-like method we use an estimate for f in that case,
1/(n+2) which is the best estimate of the frequency given the small
sample size.
Example of the bug. With n = 18, (and putting Enhb=2 for simplicity)
we have Ri = 2 + log2(1/20) = -2.32. Oh no! The weights are -4.32!
I looked into the code (the routine is Ricalc) and found that there is
a bug! I forgot to add the 'Enbh' part.
Versions of Ri up to 2.64 incorrectly used Ri = log2(1/(n+2)) instead
of the correct function when the number of bases is zero. This means
that the scan program has been missing some sites because they were
weighted too low. To allow for comparison to previous versions of the
code, the niot parameter can be set to 'b'. This should not be used
normally.
*)
(* end module describe.ri *)
{This manual page was created by makman 1.45}
{created by htmlink 1.62}