By downloading this code you agree to the
Source Code Use License (PDF). |
{ version = 2.21; (* of diana.p 2013 Apr 10} (* begin module describe.diana *) (* name diana: dinucleotide analysis of an aligned book synopsis diana(book: in, inst: in, protseq: in, dianap: in, da: out, output: out); files book: the standard delila book that is to be analyzed inst: the instruction set that was used to make the book protseq: aligned sequences in the protseq format (see alpro.p). The sequences can only contain bases a, c, g, t. This file may be empty, in which case the book and inst are used. Likewise, if the book is empty, the protseq will be used. One of the two must be empty. dianap: Parameters to control the program, one per line: 0. The version of diana that this parameter file is designed for. If the program finds an old version, it will *upgrade* the dianap file. 1. fromparam and toparam (two integers), determine the from-to range over which to do the analysis. If dianap is empty, the range from book and inst is used. 2. fromprotseq and toprotseq (two integers) define the range of the protseq. They are only used if the protseq file is used. 3. dicontrol: The first character of the third line controls printing to da: 'd': just sets of 16 correlation data are printed (ie, only d in column 1 of the da file). 'i': just information about the correlation is printed (ie, only i in column 1 of the da file). 'b': both data and information printed 4. colorslope, colorconstant: two real numbers on the fourth line control the reported value of the normalized information content (column 13 below). This allows one to adjust the range of colors produced in a color plot. For PostScript colors, 0.84 and 0.16 respecitvely gives yellow ranging up to red. Once the Rnormalized has been calculated, it is replaced by the value: (Rnormalized times colorslope) added to colorconstant. 5. alignmenttype: the first character on the first line, defines how to align the pieces. See the alist program for the detailed logic. There are three choices, as in the alist program: 'f' (for 'first') then the sequences are always aligned by their first base. 'i' then the sequences are aligned by the delila instructions. If the inst file is empty, alignment is forced to the 'b' mode. 'b' (for 'internal') then the alignment is on the internal zero of the book's sequence. This option is to be used when "default coordinate zero" is used in the Delila instructions. 6. rangecontrol: if the first character on the first line is an 'r' then the next two tokens must be two real numbers rangelower and rangeupper that define the range of Rnormalized values to use. When Rnormalized is between (inclusive) rangelower and rangeupper, column 17 is set to 'INN' otherwise it is 'OUT'. This allows one to grep the range. 7. maxsequences: (integer) maximum number of sequences to read. 8. maxdistance: (integer) maximum number distance between positions to compute. By making this small the total computation is faster. da: the di-analysis file that contains the output triangular array. column 1: Tells what that row of data is. There are three choices: n : the column is a normal data element in the triangle. d : the column is an element on the diagonal of the triangle, where the two coordinates are equal. i : the column contains the information content correlating the two positions defined by columns 3 and 4. column 2: Tells what the dinucleotide is. There are 16 dinucleotides aa, ac, ag, at, ... , tt as well as `in' or `id', which denote columns that contain information content. `id' means that the information is for the diagonal, while `in' are off-diagonal. Combining "in" with "id" gives the entire information triangle. column 3: The position on the sequence that corresponds to the x coordinate on a Cartesian graph. column 4: The position on the sequence that corresponds to the y coordinate on a Cartesian graph. column 5: A column of constant 1's usable by xyplo (usually xscolumn). column 6: A column of constant 1's usable by xyplo (usually xscolumn). column 7: The number of data points at a position column 8: Rcorrelation (with small sample correction) (bits) The frequency of the dinucleotide in column 2 at position (column 3, column 4). If the column is an information column, then this is the information at that position. When the position is off diagonal, this is the correlation information Rcorrelation with small sample correction (ssc16). When the position is on diagonal it is Rsequence. (2008 Jul 4: "without small sample correction" was here, but it appears to be in the code as ssc4.) column 9: One minus the frequency (or 1 - column 8). If this is an information column, then this is the chi-square value. column 10: A column of constants usable by xyplo. If this is in an information row, then this column is the number of degrees of freedom at that position column 11: A column of constant 1's usable by xyplo (usually hue). column 12: A column of constant 0's usable by xyplo (usually saturation). column 13: Rnormalized = Rcorrelation/4 or 1-(Rsequence/2) Information from column 8 normalized by dividing by the maximum possible information (4 for non-diagonal, 2 for diagonal). This gives a number between 0 and 1. This number is then multiplied by the parameter colorslope and then added to colorconstant. column 14: 1-Rnormalized; 1 minus column 13 column 15: correlation coefficient for x to y on information (in or id) rows column 16: A column of constant 1's usable by xyplo (saturation alternative). column 17: INN if rangelower <= Rcorrelation <= rangeupper OUT otherwise output: error messages from the program description Diana goes through a book and looks at relationships of dinucleotide pairs within the sequences. The output of the program is in the form of a triangular array rather than the entire square of x to y relationships. This saves half of the calculations. For every pair of coordinates, the frequency of the dinucleotide pair is tabulated. The program also calculates the chi-square for the dinucleotide given the expected mono-nucleotide frequencies. The correlation information is calculated as the information in the dinucleotide frequencies less the information in each of the two mononucleotides. The output of the program may be sent into xyplo for graphics. Note the distinction between the `in' and `id' information columns. Information in a binding site is usually going to appear on the diagonal, so by making this distinction, one can eliminate the diagonal information peaks for statstical analysis with genhis. The program now has an error correction for small sample size. See Schneider1986 p. 427, eqn A6. s = 4 or s = 16. examples See xyplop.diana and xyplop.diana.color for example xyplop files. To create a xyin file when the di control is 'b', extract all lines from the da file that contain ' in '. To do this in Unix use: grep ' in ' da > xyin That will not include the diagonal, which contains Rsequence and so normally should not be compared to the correlations. To include the diagonal use: grep i da > xyin To sort the output files to find the most significant correlations, one can use the Unix commands: grep in da | sort +8 -9 -n -r | cat > da.sorted The -r puts the most significant ones first. documentation @article{Stephens.Schneider.Splice, author = "R. M. Stephens and T. D. Schneider", title = "Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites", journal = "J. Mol. Biol.", volume = "228", pages = "1124--1136", year = "1992"} see also Explanation of Diana: https://alum.mit.edu/www/toms/diana.html Example dianap file: dianap Example xyplop files used in conjunction with this program: xyplop.diana, xyplop.diana.color The program to plot the output: xyplo.p Program to list aligned sequences: alist.p Program to graph histograms: genhis.p The alpro program defines the protseq format: alpro.p Related programs for plotting the data in three dimensions: da3d.p, da3drotate.p author R. Michael Stephens Modified by Tom Schneider to have user requested range and: faster calculation in main loop [1995 Jan 9], new input routine that reads protseq format [1995 Jan 9]. bugs It would be nice to allow a restricted list of pairs to be done. protseq has no definition of coordinates, and this forces the use of the fromprotseq and toprotseq. A definition for multiply aligned sequences with coordinates should be created. technical notes *) (* end module describe.diana *) {This manual page was created by makman 1.45}{created by htmlink 1.62}