Delila Program: ri

ri program

Documentation for the ri program is below, with links to related programs in the "see also" section.

{   version = 2.83; (* of ri.p 2013 Jun 06}

(* begin module describe.ri *)
(*
name
   ri: Rindividual is calculated for every site in the aligned book

synopsis
   ri(inst: in, book: in, rsdata: in, values: in, rip: in, wave: in,
      rixyin: out, sequ: out, ribl: out, riinst: out, output: out)

files
   inst: delila instructions of the form 'get from 56 -5 to 56 +10;'
      (This file may be empty, in which case the sequences will be
      aligned by their 5' ends.)

   book: the book generated by delila using inst

   rsdata: data file from rseq program

   values: a file containing the values of the objects to which the Ri
      values are to be compared.  The file may be empty.  The file
      can contain some values and then terminate.  Further values
      will be assumed to be zero.  This allows one to copy a rixyin file
      produced by ri in to the values file.  Then one adds a new site
      runs delila etc and when ri runs again it will get the old values
      lined up to the new values, with zero for the one that was
      not there previously.

   rip:  Parameters to control the program.  The file must contain the
      following parameters, one per line:

      * parameterversion:  The version number of the program.  This allows
        the user to be warned if an old parameter file is used.

      * name (string in quotes): The name of the weight matrix.

      * thefrom, theto (integers): The FROM and TO over which to do the Ri
        calculation.  These must not exceed either of those in the inst/book
        or the rsdata.  The FROM value must be less than or equal to the TO
        value.

      * column (integer): defines the column of the values file to use.

      * lowerRi, upperRi (two reals): the lowest and highest Ri evaluation to
        report to rixyin, sequ and riinst.  If the first character of the
        line is 'a' then all evaluations are reported.  Otherwise two real
        numbers are expected.  Sequences within this range are printed to
        rixyin, sequ and riinst depending also on the fifth parameter.

      * lowerValue,upperValue (two reals): the lowest and highest Value
        evaluation to report to rixyin, sequ and riinst.  If the first
        character of the line is 'a' then all evaluations are reported.
        Otherwise two real numbers are expected.  Sequences within this range
        are printed to rixyin, sequ and riinst depending also on the fifth
        parameter.

      * printsequ (character):  determines whether or not to produce any raw
        sequences in the sequ file.  If the first character of the line is
        'p', sequences selected according to the third and fouth parameters
        are printed to sequ file.  (This is a complete on-off switch for the
        sequ file.)

      * printrixyin (character):  determines whether or not to print the
        sequence of the site being analyzed.  If the first character is 'p'
        then the sequence is printed to the rixyin file.

      * partials (character): determines whether or not to print sequences
        which have a partial site.  The problem is that if there is part of a
        site, then the Ri value is questionable, depending on where the
        deletion was.  The best analysis would not use a partial site, as it
        messes up the statistics.  If the first character is:

         n  Don't print the line at all.
         i  Keep the line, but force the Ri value to be -infinity.
            This allows the lines of rixyin to be correlated to the values
            still.
         -  (any other character): print as it is.

      * niot (character (one of 'slnb') and/or real): negative infinity or
         t:  determines what to do when f(b,l) = 0.  Positions for which
         f(b,l) = 0 will have negative infinity in the Ri(b,l) table.  The
         letter 's' means to use an "extended" Rodger Staden's method of
         giving f(b,l) = 1/(n+t), where t is a non-negative integer following
         the 's'.  When t = 0, this gives Staden's original method.  Using
         t = 1 would often give frequencies that are too low.  The value of
         t=2 corresponds to the Laplace Law of Succession.

         The ``law of succession of Laplace''
         \cite{Feller1968.Laplace,Papoulis1990} states that given $n$ trials
         in which there were $k$ results of one kind, the best estimate for
         the probability in another trial is $(k+1)/(n+2)$.  In the present
         case, we need the probability of the absence of a particular base
         when searching for {\em another\/} binding site, so $k = 0$ and the
         best estimate is $1/(n+2)$.  For this reason we set $t=2$ for most
         purposes.

         If instead of 's' for Staden's extended method, 'l' is used, the
         program will use the full Laplace Law of sucession

             f(b,l) = (k+1)/(n+2) = (n(b,l)+1)/(n(l)+2)

         as the estimate for the probability for k items out of n total.
         This OVERESTIMATES the information content and so is NOT to be used
         in general.  (This can be easily demonstrated by looking at the
         effect on the average of the distribution.)

         If the character is 'b' then Staden's method is used as above but
         Ehnb is not added.  This is ONLY to be used to simulate the bug
         corrected on 2001 May 10!  (See Technical Notes below.)

         If there is no 's', 'l' or 'b', (or when the line begins with 'n')
         then the program expects a number which the value for negative
         infinity.  It should be a value sufficiently below zero so that
         sites that are being excluded from the definition according to
         f(b,l) are separated from the true sites.  -1000 is a useful value,
         as it will always displace sites with exceptions far away from
         zero.

      * alignmenttype (character): defines how to align the pieces.  See the
        alist program for the detailed logic.  There are three choices, as in
        alist:

         'f' (for 'first') then the sequences are always aligned by their
         first base.

         'i' then the sequences are aligned by the delila instructions.  If
         the inst file is empty, alignment is forced to the 'b' mode.

         'b' (for 'internal') then the alignment is on the internal zero of
         the book's sequence.  This option is to be used when "default
         coordinate zero" is used in the Delila instructions.

       * Ribound (real): the Ri boundary for this definition, bits.
            The lower bound on Ri to find in the book.
            Use 0 to represent the Second Law of Thermodynamic bound.
            Use -500 or lower to locate all sites.

       * Zbound (real): the Z boundary for this definition.
            The lower bound on Z (standard deviation) to find in the
            book.
            Use 100 to locate all sites.

       * Pbound (real): the probability boundary for this definition.
            The upper probability P to find in the book.
            Use 1.00 to locate all sites.

       Ribound, Zbound and Pbound are used in conjunction, so 0, 100,
       1.00 respectively would locate all sites greater than zero
       bits, with no other constraint.  Most of the time we adjust
       Ribound and do not play with the others.

   wave: Define a cosine wave.  See makelogo for the definition of this
         file.  These data are simply copied to the end of the ribl file.
         This allows the lister program to create a wave under the walkers If
         you use the same wave definition for makelogo, the same wave in the
         logo will be on the walker.

   rixyin: input to the xyplo program.  The file contains these columns of data:
         1 piece number
         2 piece name
         3 sequence of region analyzed
         4 length of region analyzed on this piece
         5 aligning coordinate on the piece
         6 Rindividual for the piece
         7 value from the values file (or 0 if values is empty)
         8 standard deviation of Rindividual for that sequence.

zzzqqq describe Ri SD computation here.  This computation is EXPERIMENTAL.
The statistics is NOT PROVEN and the program is not even guaranteed
to compute this unproven value correctly!

********************************************************************************
2002 Jan 24

risd = Ri standard deviation

the test directory contains a functioning routine
that works - in pascal.  It needs to be boiled down more.

2002 Jan 29

Here is the comptuation, at a position l with n(l) sequences
and f(b,l) = n(b,l)/n(l).
If the frequency is zero, then use

f := 1/(2+n(l))
otherwise use
f := f(b,l)

The (corrected to account for n(l) variation) formula given in the perl
program is

variance = ((1-f)/f)/n(l)

or

variance = ((1/f-1)/n(l)

for most cases where n(b,l) > 0

For simplicity let
b = n(b,l)
n = n(l)

variance = ((1/f(b,l)-1)/n(l)
         = ((1/(n(b,l)/n(l))-1)/n
         = ((1/(b/n))-1)/n
         = (1/b)- (1/n)

which is amusingly simple.
********************************************************************************

   sequ: the raw sequences reported to rixyin if any selection is made
      (fourth line of rip file).  These end in periods, so they can be
      given to makebk to create a book.

   ribl: weight matrix Ri(b,l).  The file begins with the name of the weight
      matrix.  It contains the information content for each base b at
      each position l, in bits.  Lines that start with * are notes.  The next
      line contains the matrix FROM and TO coordinates (thefrom and theto).

      This is followed by the matrix in the order A, C, G, T from FROM to TO.

      At the end of each line of the matrix is the position (integer)
      followed by the numbers of bases (A, C, G, T) at that position.

      After the matrix, real numbers on individual lines report:
         Ri mean (Rsequence of selected region)
         Ri standard deviation
         Ri of consensus sequence
         Ri of anticonsensus sequence
         Ri average for random (equiprobable) sequence
         the value of negative infinity
      These are all for the given range.
      (Note: Although the mean Ri for the sites is Rsequence, to get a good
      estimate of this, it is better to use the value calculated by the rseq
      program because that is less sensitive to missing sequence data.)

      Then one integer is given; n the number of sequences.

      Then, the symmetry of the matrix is defined by the first character
      on the final line:
          a: asymmetric
          o: odd symmetry (the zero base is in the center)
          e: even symmetry (the zero base is left of the center)

      The file contains any number of wave definitions, copied
      from the wave file.

      The file ends with a period, ".".

   riinst: Delila instructions for the selected subset of sites according to
      the parameters lowerRi, upperRi, lowerValue, upperValue.  The gets are
      given in this form:


piece K01789; get from 283 -FROM to same +TO direction +; (@ 4.555974 bits @)

      This allows the user to substitute a range for the FROM and TO tags.
      For example, using the unix sed:

         cat riinst | sed -e "s/FROM/50/" | sed -e "s/TO/50/" > newinst

      gives results like this:

piece K01789; get from 283 -50 to same +50 direction +; (@ 4.555974 bits @)

      See bugs.
      (In the actual instructions the '@' is replaced with *; it
      can't be in these examples because they are comments in
      the source code.)

   output: messages to the user

description
   The program determines the individual informations of the sites in the book
   as aligned by the instructions, according to the frequency table given in
   the rsdata file.  The program calculates the Ri(b,l) table:

       Ri(b,l) := 2 - (- log2( f(b,l)))

   and sums this up for each sequence.  Ri is defined so that the average of
   the Ri's for a set of sequences is Rsequence.  However, if the sequences are
   incomplete, the average will probably be less than Rsequence.  The rixyin
   output is ready to read into the xyplo program for plotting and linear
   regression.  The ribl matrix is ready to be used to scan sequences with the
   scan program.

   The program can be used in subtle ways.  For example, one can analyze the
   individual information of the left half of a binding site.  This result can
   then be used in the values file to compare against the analysis of the right
   side of a binding site.

examples

rip:

2.54          version of ri that this parameter file is designed for.
"Fis"         name of the ribl matrix
-10 +10       From-to range to do the evaluation
1             column of the values file to copy to xyin
a 0 1000      lowest to highest Ri to put in xyin and sequ (a = any)
a -1000 +1000 lowest to highest Value to put in xyin and sequ (a = any)
n             p means print sequence to the sequ file
p             p means print sequence to the xyin file
-             -: accept all sites; n: no partials; i: partials -> -infinity
s 2           s: use Staden's Method, f(b,l)=1/(n+t); else negative infinity
i             alignmenttype first, book, instructions
-5000         Ribound: real; the Ri boundary for this definition
100           Zbound:  real; the Z boundary for this definition
1.00          Pbound:  real; the probability boundary for this definition

documentation

@article{Schneider.Ri,
author = "T. D. Schneider",
title = "Information Content of Individual Genetic Sequences",
journal = "J. Theor. Biol.",
volume = "189",
number = "4",
pages = "427-441",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/paper/ri/}
{https://alum.mit.edu/www/toms/paper/ri/}",
comment = "indiv.tex",
comment = "Submitted, April 1997",
year = "1997"}

@article{Schneider.walker,
author = "T. D. Schneider",
title = "Sequence Walkers:
a graphical method to display how binding proteins
interact with {DNA} or {RNA} sequences",
journal = "Nucl. Acids Res.",
volume = "25",
comment = "walker.tex, November 1, issue 21",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/paper/walker/}
{https://alum.mit.edu/www/toms/paper/walker/},
erratum: NAR 26(4): 1135, 1998",
pages = "4408-4415",
year = "1997"}

@article{Staden1984,
author = "R. Staden",
title = "Computer methods to locate signals in nucleic acid sequences",
journal = "Nucl. Acids Res.",
volume = "12",
pages = "505-519",
year = "1984"}

@inproceedings{Feller1968.Laplace,
author = "William Feller",
booktitle = "An Introduction to Probability Theory and Its Applications",
volume = "I",
edition = "3rd",
publisher = "John Wiley \& Sons, Inc.",
address = "New York",
isbn = "0-471-25708-7",
pages = "123-124",
year = "1968"}

@book{Papoulis1990,
author = "Athanasios Papoulis",
title = "Probability \& Statistics",
publisher = "Prentice Hall",
address = "Englewood Cliffs, NJ",
comment = "Englewood Cliffs NJ 07632
QA273.P197 1990 / ISBN 0-13-711698-5.
The Laplace law of succession is on page 173",
year = "1990"}

see also
   program that generates the rsdata file:  rseq.p

   program to plot the rixyin data file:    xyplo.p

   program that uses the ribl file for searches:  scan.p

   program that uses the ribl to create an interactive
    display of sequence walkers:  makewalker.p

   program that uses the output of the scan program
    to display multiple sequence walkers and other features
    and marks along a sequence:  lister.p

   program that should be run to get a sequence logo
    before attempting to use this program to make walkers:  makelogo.p

   modules used internally to this program are in:  prgmod.p

   For people who have the ri program installed, example files are in your
   ii/delila directory under under files beginning with 'fis'.

author

   Thomas D. Schneider

bugs

   In riinst the organism and chromosome names are not written because they
   are not available by the alignment method of reading.  For now the
   instructions are given as:
       organism ORG; chromosome CHR;
   Of course this only will work for a single organism, but that is useful!

technical notes

   2001 May 10:  Weight Matrix Bug corrected

The weight function is Ri = Enbh+log2(f) where f is the frequency of a
base in an aligned position, Enbh ~ 2 (2 with small sample
correction).  If f is zero the function goes to negative infinity.
Since that is too extreme - another sequence might contain the base -
in the Staden-like method we use an estimate for f in that case,
1/(n+2) which is the best estimate of the frequency given the small
sample size.

Example of the bug.  With n = 18, (and putting Enhb=2 for simplicity)
we have Ri = 2 + log2(1/20) = -2.32.  Oh no!  The weights are -4.32!
I looked into the code (the routine is Ricalc) and found that there is
a bug!  I forgot to add the 'Enbh' part.

Versions of Ri up to 2.64 incorrectly used Ri = log2(1/(n+2)) instead
of the correct function when the number of bases is zero.  This means
that the scan program has been missing some sites because they were
weighted too low.  To allow for comparison to previous versions of the
code, the niot parameter can be set to 'b'.  This should not be used
normally.

*)
(* end module describe.ri *)
{This manual page was created by makman 1.45}


{created by htmlink 1.62}