Delila Program: diana

diana program

Documentation for the diana program is below, with links to related programs in the "see also" section.

{   version = 2.21; (* of diana.p 2013 Apr 10}

(* begin module describe.diana *)
(*
name

   diana: dinucleotide analysis of an aligned book

synopsis

   diana(book: in, inst: in, protseq: in, dianap: in, da: out, output: out);

files

   book: the standard delila book that is to be analyzed

   inst: the instruction set that was used to make the book

   protseq: aligned sequences in the protseq format (see alpro.p).
      The sequences can only contain bases a, c, g, t.
      This file may be empty, in which case the book and inst are used.
      Likewise, if the book is empty, the protseq will be used.
      One of the two must be empty.

   dianap: Parameters to control the program, one per line:

      0. The version of diana that this parameter file is designed for.
         If the program finds an old version, it will *upgrade* the
         dianap file.

      1. fromparam and toparam (two integers), determine the from-to range
      over which to do the analysis.  If dianap is empty, the range from book
      and inst is used.

      2. fromprotseq and toprotseq (two integers) define the range of the
      protseq.  They are only used if the protseq file is used.

      3. dicontrol:  The first character of the third line controls printing to da:
         'd': just sets of 16 correlation data are printed (ie, only d in column
              1 of the da file).
         'i': just information about the correlation is printed (ie, only i in
              column 1 of the da file).
         'b': both data and information printed

      4. colorslope, colorconstant: two real numbers on the fourth line control
         the reported value of the normalized information content (column 13
         below).  This allows one to adjust the range of colors produced in a
         color plot.  For PostScript colors, 0.84 and 0.16 respectively give
         yellow ranging up to red.  Once Rnormalized has been calculated,
         it is replaced by the value (Rnormalized times colorslope) added to
         colorconstant.

      5.  alignmenttype: the first character on the first line, defines how
          to align the pieces.  See the alist program for the detailed
          logic.  There are three choices, as in the alist program:

          'f' (for 'first') then the sequences are always aligned by their
          first base.

          'i' then the sequences are aligned by the delila instructions.  If
          the inst file is empty, alignment is forced to the 'b' mode.

          'b' (for 'book') then the alignment is on the internal zero of
          the book's sequence.  This option is to be used when "default
          coordinate zero" is used in the Delila instructions.

      6.  rangecontrol: if the first character on the first line is an 'r'
          then the next two tokens must be two real numbers
          rangelower and rangeupper that define the range of Rnormalized
          values to use.  When Rnormalized is between (inclusive)
          rangelower and rangeupper, column 17 is set to 'INN'
          otherwise it is 'OUT'.  This allows one to grep the range.
      7.  maxsequences: (integer) maximum number of sequences to read.
      8.  maxdistance: (integer) maximum distance between positions
          to compute.  Making this small speeds up the total computation.
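
   The colorslope/colorconstant rescaling and the rangecontrol test described
   above can be sketched in Python as follows.  This is only an illustration,
   not code from diana.p: the function names are hypothetical, and whether
   diana applies the range test before or after the color rescaling is not
   stated here, so the sketch simply tests whatever value it is given.

```python
def rescale_for_color(rnormalized, colorslope, colorconstant):
    # Reported value = (Rnormalized times colorslope) added to colorconstant.
    return rnormalized * colorslope + colorconstant


def range_flag(value, rangelower, rangeupper):
    # Inclusive test that yields the column 17 label.
    if rangelower <= value <= rangeupper:
        return "INN"
    return "OUT"


# With the PostScript example values 0.84 and 0.16, an Rnormalized of 0
# is reported as 0.16 and an Rnormalized of 1 is reported as about 1.0.
low = rescale_for_color(0.0, 0.84, 0.16)
high = rescale_for_color(1.0, 0.84, 0.16)
```

   Because the test is inclusive, a value exactly equal to rangelower or
   rangeupper is labeled INN.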

   da: the di-analysis file that contains the output triangular array.
          column 1: Tells what that row of data is.  There are three choices:
                n  : the row is a normal data element in the triangle.
                d  : the row is an element on the diagonal of the triangle,
                   where the two coordinates are equal.
                i  : the row contains the information content correlating the
                   two positions defined by columns 3 and 4.
          column 2: Tells what the dinucleotide is.  There are 16 dinucleotides
                aa, ac, ag, at, ... , tt as well as `in' or `id', which denote
                rows that contain information content.  `id' means that the
                information is for the diagonal, while `in' are off-diagonal.
                Combining "in" with "id" gives the entire information triangle.
          column 3: The position on the sequence that corresponds to the x
                coordinate on a Cartesian graph.
          column 4: The position on the sequence that corresponds to the y
                coordinate on a Cartesian graph.
          column 5: A column of constant 1's usable by xyplo (usually xscolumn).
          column 6: A column of constant 1's usable by xyplo (usually yscolumn).
          column 7: The number of data points at a position
          column 8: On dinucleotide rows, the frequency of the dinucleotide
                 in column 2 at position (column 3, column 4).  On
                 information rows, the information at that position (bits):
                 off the diagonal this is the correlation information
                 Rcorrelation with small sample correction (ssc16); on the
                 diagonal it is Rsequence.
                 (2008 Jul 4: "without small sample correction" was here, but
                 it appears to be in the code as ssc4.)
          column 9: One minus the frequency (or 1 - column 8).  If this is
                an information column, then this is the chi-square value.
          column 10: A column of constants usable by xyplo.  If this is in an
                 information row, then this column is the number of degrees of
                 freedom at that position.
          column 11: A column of constant 1's usable by xyplo (usually hue).
          column 12: A column of constant 0's usable by xyplo (usually saturation).
          column 13: Rnormalized = Rcorrelation/4 or Rsequence/2
                 Information from column 8 normalized by dividing by the
                 maximum possible information (4 for non-diagonal, 2 for
                 diagonal).  This gives a number between 0 and 1.  This number
                 is then multiplied by the parameter colorslope and then added
                 to colorconstant.
          column 14: 1-Rnormalized; 1 minus column 13
          column 15: correlation coefficient for x to y on information (in or
                     id) rows
          column 16: A column of constant 1's usable by xyplo
                     (saturation alternative).
          column 17: INN if rangelower <= Rnormalized <= rangeupper
                     OUT otherwise
   output: error messages from the program
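
   As an illustration of the da column layout listed above, the following
   Python sketch splits one row into named fields.  It assumes that the 17
   columns are separated by whitespace, which is not stated in this page.
   The field names are hypothetical labels, and the sample row is invented
   for the example; it is not real diana output.

```python
from collections import namedtuple

# Hypothetical labels for the 17 columns of a da row.
DaRow = namedtuple("DaRow", [
    "kind", "dinucleotide", "x", "y", "const5", "const6", "count",
    "value", "complement", "const10_or_df", "hue", "saturation",
    "rnormalized", "one_minus_rnormalized", "corrcoef", "const16",
    "range_flag",
])


def parse_da_line(line):
    """Split one da line into a DaRow; return None unless it has 17 fields."""
    fields = line.split()
    if len(fields) != 17:
        return None
    row = DaRow._make(fields)
    # Convert only the fields whose meaning is unambiguous; the rest stay
    # as strings because their printed format is not documented here.
    return row._replace(
        x=int(row.x),
        y=int(row.y),
        value=float(row.value),
        rnormalized=float(row.rnormalized),
    )


# Invented sample row, for illustration only.
sample = "i in -3 5 1 1 120 0.2500 3.7000 9 1 0 0.0625 0.9375 0.1000 1 OUT"
row = parse_da_line(sample)
```

   Lines that do not split into exactly 17 fields are skipped by returning
   None, so any header or comment lines in a real file would need to be
   handled according to the actual file format.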

description
   Diana goes through a book and looks at relationships of dinucleotide pairs
   within the sequences.  The output of the program is in the form of a
   triangular array rather than the entire square of x to y relationships.
   This saves half of the calculations.  For every pair of coordinates, the
   frequency of the dinucleotide pair is tabulated.  The program also
   calculates the chi-square for the dinucleotide given the expected
   mono-nucleotide frequencies.  The correlation information is calculated as
   the information in the dinucleotide frequencies less the information in each
   of the two mononucleotides.  The output of the program may be sent into
   xyplo for graphics.  Note the distinction between the `in' and `id'
   information columns.  Information in a binding site is usually going to
   appear on the diagonal, so by making this distinction, one can eliminate the
   diagonal information peaks for statistical analysis with genhis.
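
   The calculation described above can be sketched in Python for a single
   pair of positions.  This is a minimal sketch under stated assumptions, not
   the code in diana.p: the function name is hypothetical, positions are
   0-based string indices rather than alignment coordinates, sequences are
   assumed to contain only a, c, g, t, and no small sample correction is
   applied.

```python
from collections import Counter
from math import log2

BASES = "acgt"


def pair_statistics(sequences, x, y):
    """Return (dinucleotide frequencies, Rcorrelation in bits, chi-square)
    for columns x and y of equal-length aligned sequences."""
    n = len(sequences)
    pairs = Counter(s[x] + s[y] for s in sequences)
    left = Counter(s[x] for s in sequences)
    right = Counter(s[y] for s in sequences)

    def info(counts, hmax):
        # Information = maximum uncertainty minus observed uncertainty.
        h = -sum(c / n * log2(c / n) for c in counts.values())
        return hmax - h

    # Dinucleotide information less the two mononucleotide informations.
    rcorrelation = info(pairs, 4.0) - info(left, 2.0) - info(right, 2.0)

    # Chi-square of the observed pair counts against the counts expected
    # from the mononucleotide frequencies at the two positions.
    chisq = 0.0
    for a in BASES:
        for b in BASES:
            expected = left[a] * right[b] / n
            if expected > 0:
                chisq += (pairs[a + b] - expected) ** 2 / expected

    freqs = dict((d, c / n) for d, c in pairs.items())
    return freqs, rcorrelation, chisq


# Two perfectly correlated positions, each uniform over the four bases.
freqs, rcorr, chisq = pair_statistics(["ac", "ca", "gt", "tg"], 0, 1)
```

   In this toy example each base at the first position fully determines the
   base at the second, so the uncorrected correlation information is 2 bits.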

   The program now has an error correction for small sample size.
   See Schneider1986 p. 427, eqn A6. s = 4 or s = 16.
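
   As a rough illustration of the size of such a correction, the sketch below
   assumes the commonly used approximate form e(n) = (s - 1) / (2 ln(2) n)
   bits, where n is the number of sequences and s is the number of symbols
   (4 for mononucleotides, 16 for dinucleotides).  Whether this is exactly
   the form used in diana.p or in the cited equation is an assumption; the
   function name is hypothetical.

```python
from math import log


def small_sample_correction(n, s):
    # Approximate correction in bits, assuming e(n) = (s - 1) / (2 ln(2) n).
    return (s - 1) / (2.0 * log(2.0) * n)


# For the same number of sequences, the s = 16 correction is five times
# the s = 4 correction, since (16 - 1) / (4 - 1) = 5.
e4 = small_sample_correction(100, 4)
e16 = small_sample_correction(100, 16)
```

   Under this form the correction shrinks in proportion to 1/n, so it matters
   most for small sets of aligned sequences.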

examples

   See xyplop.diana and xyplop.diana.color for example xyplop files.

   To create an xyin file when dicontrol is 'b', extract all lines from the
   da file that contain ' in '.  To do this in Unix use:

   grep ' in ' da > xyin

   That will not include the diagonal, which contains Rsequence and so normally
   should not be compared to the correlations.  To include the diagonal use:

   grep i da > xyin


   To sort the output file to find the most significant
   correlations, one can use the Unix commands:

   grep in da | sort -k 9,9 -n -r > da.sorted

   This sorts numerically on column 9 (the chi-square value on information
   rows).  The -r puts the most significant ones first.
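
   Where grep and sort are not available, the same selection can be sketched
   in Python.  This is an illustrative equivalent, assuming whitespace
   separated columns; the function name is hypothetical and the sample lines
   are invented, truncated rows rather than real diana output.

```python
def sort_information_rows(lines, key_column=9):
    """Keep off-diagonal information rows (column 2 equal to 'in') and sort
    them by the given 1-based column, largest value first."""
    rows = [line.split() for line in lines]
    needed = max(2, key_column)
    kept = [r for r in rows if len(r) >= needed and r[1] == "in"]
    return sorted(kept, key=lambda r: float(r[key_column - 1]), reverse=True)


# Invented sample rows showing only the first nine columns.
sample_lines = [
    "i in 1 2 1 1 50 0.10 12.5",
    "i in 1 3 1 1 50 0.30 40.0",
    "i id 1 1 1 1 50 1.20 0.0",
    "d aa 1 2 1 1 50 0.25 0.75",
]
ranked = sort_information_rows(sample_lines)

# For a real file one could pass open("da") as the lines argument.
```

   Matching on column 2 exactly, rather than searching the whole line, avoids
   accidentally selecting rows where the letters appear elsewhere.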

documentation

@article{Stephens.Schneider.Splice,
author = "R. M. Stephens
  and T. D. Schneider",
title = "Features of spliceosome evolution and function
inferred from an analysis of the information at human splice sites",
journal = "J. Mol. Biol.",
volume = "228",
pages = "1124--1136",
year = "1992"}

see also

   Explanation of Diana:
   https://alum.mit.edu/www/toms/diana.html

   Example dianap file:  dianap

   Example xyplop files used in conjunction with this program:
   xyplop.diana, xyplop.diana.color

   The program to plot the output: xyplo.p

   Program to list aligned sequences: alist.p

   Program to graph histograms: genhis.p

   The alpro program defines the protseq format: alpro.p

   Related programs for plotting the data in three dimensions:
   da3d.p, da3drotate.p

author
   R. Michael Stephens
   Modified by Tom Schneider to have user requested range and:
   faster calculation in main loop [1995 Jan 9],
   new input routine that reads protseq format [1995 Jan 9].

bugs

  It would be nice to allow a restricted list of pairs to be done.

  protseq has no definition of coordinates, and this forces the use of the
  fromprotseq and toprotseq.  A definition for multiply aligned sequences with
  coordinates should be created.

technical notes

*)
(* end module describe.diana *)
{This manual page was created by makman 1.45}

