By downloading this code you agree to the
Source Code Use License (PDF). |
{ version = 2.21; (* of diana.p 2013 Apr 10}
(* begin module describe.diana *)
(*
name
diana: dinucleotide analysis of an aligned book
synopsis
diana(book: in, inst: in, protseq: in, dianap: in, da: out, output: out);
files
book: the standard delila book that is to be analyzed
inst: the instruction set that was used to make the book
protseq: aligned sequences in the protseq format (see alpro.p).
The sequences can only contain bases a, c, g, t.
This file may be empty, in which case the book and inst are used.
Likewise, if the book is empty, the protseq will be used.
One of the two must be empty.
dianap: Parameters to control the program, one per line:
0. The version of diana that this parameter file is designed for.
If the program finds an old version, it will *upgrade* the
dianap file.
1. fromparam and toparam (two integers), determine the from-to range
over which to do the analysis. If dianap is empty, the range from book
and inst is used.
2. fromprotseq and toprotseq (two integers) define the range of the
protseq. They are only used if the protseq file is used.
3. dicontrol: The first character of the third line controls printing to da:
'd': just sets of 16 correlation data are printed (ie, only d in column
1 of the da file).
'i': just information about the correlation is printed (ie, only i in
column 1 of the da file).
'b': both data and information printed
4. colorslope, colorconstant: two real numbers on the fourth line control
the reported value of the normalized information content (column 13
below). This allows one to adjust the range of colors produced in a
color plot. For PostScript colors, 0.84 and 0.16 respecitvely gives
yellow ranging up to red. Once the Rnormalized has been calculated,
it is replaced by the value: (Rnormalized times colorslope) added to
colorconstant.
5. alignmenttype: the first character on the first line, defines how
to align the pieces. See the alist program for the detailed
logic. There are three choices, as in the alist program:
'f' (for 'first') then the sequences are always aligned by their
first base.
'i' then the sequences are aligned by the delila instructions. If
the inst file is empty, alignment is forced to the 'b' mode.
'b' (for 'internal') then the alignment is on the internal zero of
the book's sequence. This option is to be used when "default
coordinate zero" is used in the Delila instructions.
6. rangecontrol: if the first character on the first line is an 'r'
then the next two tokens must be two real numbers
rangelower and rangeupper that define the range of Rnormalized
values to use. When Rnormalized is between (inclusive)
rangelower and rangeupper, column 17 is set to 'INN'
otherwise it is 'OUT'. This allows one to grep the range.
7. maxsequences: (integer) maximum number of sequences to read.
8. maxdistance: (integer) maximum number distance between positions
to compute. By making this small the total computation is faster.
da: the di-analysis file that contains the output triangular array.
column 1: Tells what that row of data is. There are three choices:
n : the column is a normal data element in the triangle.
d : the column is an element on the diagonal of the triangle,
where the two coordinates are equal.
i : the column contains the information content correlating the
two positions defined by columns 3 and 4.
column 2: Tells what the dinucleotide is. There are 16 dinucleotides
aa, ac, ag, at, ... , tt as well as `in' or `id', which denote
columns that contain information content. `id' means that the
information is for the diagonal, while `in' are off-diagonal.
Combining "in" with "id" gives the entire information triangle.
column 3: The position on the sequence that corresponds to the x
coordinate on a Cartesian graph.
column 4: The position on the sequence that corresponds to the y
coordinate on a Cartesian graph.
column 5: A column of constant 1's usable by xyplo (usually xscolumn).
column 6: A column of constant 1's usable by xyplo (usually xscolumn).
column 7: The number of data points at a position
column 8: Rcorrelation (with small sample correction) (bits)
The frequency of the dinucleotide in column 2 at position
(column 3, column 4). If the column is an information column,
then this is the information at that position.
When the position
is off diagonal, this is the correlation information
Rcorrelation with small sample correction (ssc16).
When the position is on diagonal it is Rsequence.
(2008 Jul 4: "without small sample correction" was here, but
it appears to be in the code as ssc4.)
column 9: One minus the frequency (or 1 - column 8). If this is
an information column, then this is the chi-square value.
column 10: A column of constants usable by xyplo. If this is in an
information row, then this column is the number of degrees of
freedom at that position
column 11: A column of constant 1's usable by xyplo (usually hue).
column 12: A column of constant 0's usable by xyplo (usually saturation).
column 13: Rnormalized = Rcorrelation/4 or 1-(Rsequence/2)
Information from column 8 normalized by dividing by the
maximum possible information (4 for non-diagonal, 2 for
diagonal). This gives a number between 0 and 1. This number
is then multiplied by the parameter colorslope and then added
to colorconstant.
column 14: 1-Rnormalized; 1 minus column 13
column 15: correlation coefficient for x to y on information (in or
id) rows
column 16: A column of constant 1's usable by xyplo
(saturation alternative).
column 17: INN if rangelower <= Rcorrelation <= rangeupper
OUT otherwise
output: error messages from the program
description
Diana goes through a book and looks at relationships of dinucleotide pairs
within the sequences. The output of the program is in the form of a
triangular array rather than the entire square of x to y relationships.
This saves half of the calculations. For every pair of coordinates, the
frequency of the dinucleotide pair is tabulated. The program also
calculates the chi-square for the dinucleotide given the expected
mono-nucleotide frequencies. The correlation information is calculated as
the information in the dinucleotide frequencies less the information in each
of the two mononucleotides. The output of the program may be sent into
xyplo for graphics. Note the distinction between the `in' and `id'
information columns. Information in a binding site is usually going to
appear on the diagonal, so by making this distinction, one can eliminate the
diagonal information peaks for statstical analysis with genhis.
The program now has an error correction for small sample size.
See Schneider1986 p. 427, eqn A6. s = 4 or s = 16.
examples
See xyplop.diana and xyplop.diana.color for example xyplop files.
To create a xyin file when the di control is 'b', extract all lines from the
da file that contain ' in '. To do this in Unix use:
grep ' in ' da > xyin
That will not include the diagonal, which contains Rsequence and so normally
should not be compared to the correlations. To include the diagonal use:
grep i da > xyin
To sort the output files to find the most significant
correlations, one can use the Unix commands:
grep in da | sort +8 -9 -n -r | cat > da.sorted
The -r puts the most significant ones first.
documentation
@article{Stephens.Schneider.Splice,
author = "R. M. Stephens
and T. D. Schneider",
title = "Features of spliceosome evolution and function
inferred from an analysis of the information at human splice sites",
journal = "J. Mol. Biol.",
volume = "228",
pages = "1124--1136",
year = "1992"}
see also
Explanation of Diana:
https://alum.mit.edu/www/toms/diana.html
Example dianap file: dianap
Example xyplop files used in conjunction with this program:
xyplop.diana, xyplop.diana.color
The program to plot the output: xyplo.p
Program to list aligned sequences: alist.p
Program to graph histograms: genhis.p
The alpro program defines the protseq format: alpro.p
Related programs for plotting the data in three dimensions:
da3d.p, da3drotate.p
author
R. Michael Stephens
Modified by Tom Schneider to have user requested range and:
faster calculation in main loop [1995 Jan 9],
new input routine that reads protseq format [1995 Jan 9].
bugs
It would be nice to allow a restricted list of pairs to be done.
protseq has no definition of coordinates, and this forces the use of the
fromprotseq and toprotseq. A definition for multiply aligned sequences with
coordinates should be created.
technical notes
*)
(* end module describe.diana *)
{This manual page was created by makman 1.45}
{created by htmlink 1.62}