By downloading this code you agree to the
Source Code Use License (PDF). |
{ version = 2.83; (* of ri.p 2013 Jun 06} (* begin module describe.ri *) (* name ri: Rindividual is calculated for every site in the aligned book synopsis ri(inst: in, book: in, rsdata: in, values: in, rip: in, wave: in, rixyin: out, sequ: out, ribl: out, riinst: out, output: out) files inst: delila instructions of the form 'get from 56 -5 to 56 +10;' (This file may be empty, in which case the sequences will be aligned by their 5' ends.) book: the book generated by delila using inst rsdata: data file from rseq program values: a file containing the values of the objects to which the Ri values are to be compared. The file may be empty. The file can contain some values and then terminate. Further values will be assumed to be zero. This allows one to copy a rixyin file produced by ri in to the values file. Then one adds a new site runs delila etc and when ri runs again it will get the old values lined up to the new values, with zero for the one that was not there previously. rip: Parameters to control the program. The file must contain the following parameters, one per line: * parameterversion: The version number of the program. This allows the user to be warned if an old parameter file is used. * name (string in quotes): The name of the weight matrix. * thefrom, theto (integers): The FROM and TO over which to do the Ri calculation. These must not exceed either of those in the inst/book or the rsdata. The FROM value must be less than or equal to the TO value. * column (integer): defines the column of the values file to use. * lowerRi, upperRi (two reals): the lowest and highest Ri evaluation to report to rixyin, sequ and riinst. If the first character of the line is 'a' then all evaluations are reported. Otherwise two real numbers are expected. Sequences within this range are printed to rixyin, sequ and riinst depending also on the fifth parameter. * lowerValue,upperValue (two reals): the lowest and highest Value evaluation to report to rixyin, sequ and riinst. If the first character of the line is 'a' then all evaluations are reported. Otherwise two real numbers are expected. Sequences within this range are printed to rixyin, sequ and riinst depending also on the fifth parameter. * printsequ (character): determines whether or not to produce any raw sequences in the sequ file. If the first character of the line is 'p', sequences selected according to the third and fouth parameters are printed to sequ file. (This is a complete on-off switch for the sequ file.) * printrixyin (character): determines whether or not to print the sequence of the site being analyzed. If the first character is 'p' then the sequence is printed to the rixyin file. * partials (character): determines whether or not to print sequences which have a partial site. The problem is that if there is part of a site, then the Ri value is questionable, depending on where the deletion was. The best analysis would not use a partial site, as it messes up the statistics. If the first character is: n Don't print the line at all. i Keep the line, but force the Ri value to be -infinity. This allows the lines of rixyin to be correlated to the values still. - (any other character): print as it is. * niot (character (one of 'slnb') and/or real): negative infinity or t: determines what to do when f(b,l) = 0. Positions for which f(b,l) = 0 will have negative infinity in the Ri(b,l) table. The letter 's' means to use an "extended" Rodger Staden's method of giving f(b,l) = 1/(n+t), where t is a non-negative integer following the 's'. When t = 0, this gives Staden's original method. Using t = 1 would often give frequencies that are too low. The value of t=2 corresponds to the Laplace Law of Succession. The ``law of succession of Laplace'' \cite{Feller1968.Laplace,Papoulis1990} states that given $n$ trials in which there were $k$ results of one kind, the best estimate for the probability in another trial is $(k+1)/(n+2)$. In the present case, we need the probability of the absence of a particular base when searching for {\em another\/} binding site, so $k = 0$ and the best estimate is $1/(n+2)$. For this reason we set $t=2$ for most purposes. If instead of 's' for Staden's extended method, 'l' is used, the program will use the full Laplace Law of sucession f(b,l) = (k+1)/(n+2) = (n(b,l)+1)/(n(l)+2) as the estimate for the probability for k items out of n total. This OVERESTIMATES the information content and so is NOT to be used in general. (This can be easily demonstrated by looking at the effect on the average of the distribution.) If the character is 'b' then Staden's method is used as above but Ehnb is not added. This is ONLY to be used to simulate the bug corrected on 2001 May 10! (See Technical Notes below.) If there is no 's', 'l' or 'b', (or when the line begins with 'n') then the program expects a number which the value for negative infinity. It should be a value sufficiently below zero so that sites that are being excluded from the definition according to f(b,l) are separated from the true sites. -1000 is a useful value, as it will always displace sites with exceptions far away from zero. * alignmenttype (character): defines how to align the pieces. See the alist program for the detailed logic. There are three choices, as in alist: 'f' (for 'first') then the sequences are always aligned by their first base. 'i' then the sequences are aligned by the delila instructions. If the inst file is empty, alignment is forced to the 'b' mode. 'b' (for 'internal') then the alignment is on the internal zero of the book's sequence. This option is to be used when "default coordinate zero" is used in the Delila instructions. * Ribound (real): the Ri boundary for this definition, bits. The lower bound on Ri to find in the book. Use 0 to represent the Second Law of Thermodynamic bound. Use -500 or lower to locate all sites. * Zbound (real): the Z boundary for this definition. The lower bound on Z (standard deviation) to find in the book. Use 100 to locate all sites. * Pbound (real): the probability boundary for this definition. The upper probability P to find in the book. Use 1.00 to locate all sites. Ribound, Zbound and Pbound are used in conjunction, so 0, 100, 1.00 respectively would locate all sites greater than zero bits, with no other constraint. Most of the time we adjust Ribound and do not play with the others. wave: Define a cosine wave. See makelogo for the definition of this file. These data are simply copied to the end of the ribl file. This allows the lister program to create a wave under the walkers If you use the same wave definition for makelogo, the same wave in the logo will be on the walker. rixyin: input to the xyplo program. The file contains these columns of data: 1 piece number 2 piece name 3 sequence of region analyzed 4 length of region analyzed on this piece 5 aligning coordinate on the piece 6 Rindividual for the piece 7 value from the values file (or 0 if values is empty) 8 standard deviation of Rindividual for that sequence. zzzqqq describe Ri SD computation here. This computation is EXPERIMENTAL. The statistics is NOT PROVEN and the program is not even guaranteed to compute this unproven value correctly! ******************************************************************************** 2002 Jan 24 risd = Ri standard deviation the test directory contains a functioning routine that works - in pascal. It needs to be boiled down more. 2002 Jan 29 Here is the comptuation, at a position l with n(l) sequences and f(b,l) = n(b,l)/n(l). If the frequency is zero, then use f := 1/(2+n(l)) otherwise use f := f(b,l) The (corrected to account for n(l) variation) formula given in the perl program is variance = ((1-f)/f)/n(l) or variance = ((1/f-1)/n(l) for most cases where n(b,l) > 0 For simplicity let b = n(b,l) n = n(l) variance = ((1/f(b,l)-1)/n(l) = ((1/(n(b,l)/n(l))-1)/n = ((1/(b/n))-1)/n = (1/b)- (1/n) which is amusingly simple. ******************************************************************************** sequ: the raw sequences reported to rixyin if any selection is made (fourth line of rip file). These end in periods, so they can be given to makebk to create a book. ribl: weight matrix Ri(b,l). The file begins with the name of the weight matrix. It contains the information content for each base b at each position l, in bits. Lines that start with * are notes. The next line contains the matrix FROM and TO coordinates (thefrom and theto). This is followed by the matrix in the order A, C, G, T from FROM to TO. At the end of each line of the matrix is the position (integer) followed by the numbers of bases (A, C, G, T) at that position. After the matrix, real numbers on individual lines report: Ri mean (Rsequence of selected region) Ri standard deviation Ri of consensus sequence Ri of anticonsensus sequence Ri average for random (equiprobable) sequence the value of negative infinity These are all for the given range. (Note: Although the mean Ri for the sites is Rsequence, to get a good estimate of this, it is better to use the value calculated by the rseq program because that is less sensitive to missing sequence data.) Then one integer is given; n the number of sequences. Then, the symmetry of the matrix is defined by the first character on the final line: a: asymmetric o: odd symmetry (the zero base is in the center) e: even symmetry (the zero base is left of the center) The file contains any number of wave definitions, copied from the wave file. The file ends with a period, ".". riinst: Delila instructions for the selected subset of sites according to the parameters lowerRi, upperRi, lowerValue, upperValue. The gets are given in this form: piece K01789; get from 283 -FROM to same +TO direction +; (@ 4.555974 bits @) This allows the user to substitute a range for the FROM and TO tags. For example, using the unix sed: cat riinst | sed -e "s/FROM/50/" | sed -e "s/TO/50/" > newinst gives results like this: piece K01789; get from 283 -50 to same +50 direction +; (@ 4.555974 bits @) See bugs. (In the actual instructions the '@' is replaced with *; it can't be in these examples because they are comments in the source code.) output: messages to the user description The program determines the individual informations of the sites in the book as aligned by the instructions, according to the frequency table given in the rsdata file. The program calculates the Ri(b,l) table: Ri(b,l) := 2 - (- log2( f(b,l))) and sums this up for each sequence. Ri is defined so that the average of the Ri's for a set of sequences is Rsequence. However, if the sequences are incomplete, the average will probably be less than Rsequence. The rixyin output is ready to read into the xyplo program for plotting and linear regression. The ribl matrix is ready to be used to scan sequences with the scan program. The program can be used in subtle ways. For example, one can analyze the individual information of the left half of a binding site. This result can then be used in the values file to compare against the analysis of the right side of a binding site. examples rip: 2.54 version of ri that this parameter file is designed for. "Fis" name of the ribl matrix -10 +10 From-to range to do the evaluation 1 column of the values file to copy to xyin a 0 1000 lowest to highest Ri to put in xyin and sequ (a = any) a -1000 +1000 lowest to highest Value to put in xyin and sequ (a = any) n p means print sequence to the sequ file p p means print sequence to the xyin file - -: accept all sites; n: no partials; i: partials -> -infinity s 2 s: use Staden's Method, f(b,l)=1/(n+t); else negative infinity i alignmenttype first, book, instructions -5000 Ribound: real; the Ri boundary for this definition 100 Zbound: real; the Z boundary for this definition 1.00 Pbound: real; the probability boundary for this definition documentation @article{Schneider.Ri, author = "T. D. Schneider", title = "Information Content of Individual Genetic Sequences", journal = "J. Theor. Biol.", volume = "189", number = "4", pages = "427-441", note = "\htmladdnormallink {https://alum.mit.edu/www/toms/paper/ri/} {https://alum.mit.edu/www/toms/paper/ri/}", comment = "indiv.tex", comment = "Submitted, April 1997", year = "1997"} @article{Schneider.walker, author = "T. D. Schneider", title = "Sequence Walkers: a graphical method to display how binding proteins interact with {DNA} or {RNA} sequences", journal = "Nucl. Acids Res.", volume = "25", comment = "walker.tex, November 1, issue 21", note = "\htmladdnormallink {https://alum.mit.edu/www/toms/paper/walker/} {https://alum.mit.edu/www/toms/paper/walker/}, erratum: NAR 26(4): 1135, 1998", pages = "4408-4415", year = "1997"} @article{Staden1984, author = "R. Staden", title = "Computer methods to locate signals in nucleic acid sequences", journal = "Nucl. Acids Res.", volume = "12", pages = "505-519", year = "1984"} @inproceedings{Feller1968.Laplace, author = "William Feller", booktitle = "An Introduction to Probability Theory and Its Applications", volume = "I", edition = "3rd", publisher = "John Wiley \& Sons, Inc.", address = "New York", isbn = "0-471-25708-7", pages = "123-124", year = "1968"} @book{Papoulis1990, author = "Athanasios Papoulis", title = "Probability \& Statistics", publisher = "Prentice Hall", address = "Englewood Cliffs, NJ", comment = "Englewood Cliffs NJ 07632 QA273.P197 1990 / ISBN 0-13-711698-5. The Laplace law of succession is on page 173", year = "1990"} see also program that generates the rsdata file: rseq.p program to plot the rixyin data file: xyplo.p program that uses the ribl file for searches: scan.p program that uses the ribl to create an interactive display of sequence walkers: makewalker.p program that uses the output of the scan program to display multiple sequence walkers and other features and marks along a sequence: lister.p program that should be run to get a sequence logo before attempting to use this program to make walkers: makelogo.p modules used internally to this program are in: prgmod.p For people who have the ri program installed, example files are in your ii/delila directory under under files beginning with 'fis'. author Thomas D. Schneider bugs In riinst the organism and chromosome names are not written because they are not available by the alignment method of reading. For now the instructions are given as: organism ORG; chromosome CHR; Of course this only will work for a single organism, but that is useful! technical notes 2001 May 10: Weight Matrix Bug corrected The weight function is Ri = Enbh+log2(f) where f is the frequency of a base in an aligned position, Enbh ~ 2 (2 with small sample correction). If f is zero the function goes to negative infinity. Since that is too extreme - another sequence might contain the base - in the Staden-like method we use an estimate for f in that case, 1/(n+2) which is the best estimate of the frequency given the small sample size. Example of the bug. With n = 18, (and putting Enhb=2 for simplicity) we have Ri = 2 + log2(1/20) = -2.32. Oh no! The weights are -4.32! I looked into the code (the routine is Ricalc) and found that there is a bug! I forgot to add the 'Enbh' part. Versions of Ri up to 2.64 incorrectly used Ri = log2(1/(n+2)) instead of the correct function when the number of bases is zero. This means that the scan program has been missing some sites because they were weighted too low. To allow for comparison to previous versions of the code, the niot parameter can be set to 'b'. This should not be used normally. *) (* end module describe.ri *) {This manual page was created by makman 1.45}{created by htmlink 1.62}