By downloading this code you agree to the
Source Code Use License (PDF).
{ version = 2.72; (* of malign.p 2022 May 21}

(* begin module describe.malign *)
(*
name
   malign: optimal alignment of a book, based on minimum uncertainty

synopsis
   malign(inst: in, book: in, malignp: in, uncert: out, newalign: out,
          optalign: out, optinst: out, malignxyin: out, output: out)

files
   inst: delila instructions of the form 'get from 56 -5 to 56 +10;'

   book: the book generated by delila using inst

   malignp: parameter file with the following parameters:
      winleft, winright: left and right ends of window for calculating
         uncertainty, relative to aligned base
      shiftmin, shiftmax: minimum and maximum shift of aligned base
      iseed: integer random seed
      nranseq: number of random sequences, or 0 to use sequences in book
      nshuffle: number of times to redo alignment after random shuffle
      ifpaired: 1 to treat each pair of sequences as complementary
         strands, 0 not to
      standout: output run #, pass # and H to standard output every pass
         if 1, every run if 0, or not at all if -1
      npassout: output H and alignment every npassout passes to file
         newalign, or only at end of runs if zero, or not at all if -1
      nshiftout: output L and H(L) every nshiftout sequence shifts
         (to file uncert), or only at end of passes if zero, or not
         at all if -1
      tolerance: tolerance in change of H
      ntolpass: maximum number of passes with change below tolerance

      new parameter allowed but not required (default is i):
      alignmenttype: char;
         'f' means alignment by First internal coordinate base,
         'b' means alignment by Book,
         'i' means alignment by Instructions.
         See the alist program for more information. Normally one will
         align by delila instructions. If this parameter is 'f', then
         the first base of the book is considered the zero base and if
         it is 'b' then the zero base is given by the coordinates of
         each piece in the book.

   uncert: uncertainty as function of position, for the last run, at the
      end of each pass or after selected number of sequence shifts.
      Controlled by variable nshiftout.

   newalign: values of H and the relative alignments; starting, final,
      and intermediate if selected. Controlled by variable npassout.

   optalign: user-readable listing of unique optimal relative alignments
      and number of times each was achieved

   optinst: list of unique optimal alignments in absolute coordinates,
      to be used to make inst file for selected alignment.
      This file is like optalign, but the coordinates are for the
      original sequence.

   malignxyin: a list of the number of occurrences of alignments and
      their H values in bits. This may be plotted with xyplo, as
      described in the paper. Each line contains these numbers:
         rank: from 1 to the number of alignment classes
         occurrences: how many times the class was found
         H: the uncertainty of the alignment, in bits
         R: the information content of the alignment, in bits,
            with small sample correction.

description

   Given a book of aligned sequences, this program searches for the
   alignment of the sequences that has the lowest uncertainty, i.e. the
   highest value of Rsequence. The user specifies the "window" of bases
   within which uncertainty is calculated, and the maximum number of
   bases that each sequence is allowed to shift from the original
   alignment. The program considers each sequence in turn, shifting it
   to an alignment with minimum uncertainty while holding the other
   sequences fixed. A "pass" is complete when all sequences have been
   considered. A "run" is complete when no alignments have changed in
   the preceding pass, and the alignment is then considered "optimal".
   The first run starts with the original alignment; every run after
   that starts with a "shuffled" alignment obtained by shifting each
   sequence independently by a random amount between the allowed limits.
   The program maintains a list of all of the unique optimal alignments
   achieved from these starting alignments, and it outputs them in order
   of increasing uncertainty.
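
   The pass/run relaxation described above can be sketched in a few lines
   of Python. This is only an illustration, not the malign source: the
   function names, the argument order and the strict-improvement tie rule
   are assumptions made here, H is computed from raw base frequencies
   with no small sample correction, and the shuffled restarts, the
   ifpaired option and the common realignment of all sequences are
   left out.

```python
import math
from collections import Counter


def window_uncertainty(seqs, shifts, zero, winleft, winright):
    """Total uncertainty H (bits) over the window, summed across positions.

    Base frequencies are used directly; no small sample correction is applied.
    """
    total = 0.0
    for offset in range(winleft, winright + 1):
        column = Counter(s[zero + sh + offset] for s, sh in zip(seqs, shifts))
        n = sum(column.values())
        total -= sum(c / n * math.log2(c / n) for c in column.values())
    return total


def relax_alignment(seqs, zero, winleft, winright, shiftmin, shiftmax,
                    max_passes=100):
    """One 'run': repeat passes until a pass changes no alignment.

    zero is the 0-based index of the aligned base in every sequence
    (a simplification assumed for this sketch).
    """
    shortest = min(len(s) for s in seqs)
    lowest = zero + min(shiftmin, 0) + winleft
    highest = zero + max(shiftmax, 0) + winright
    if lowest < 0 or highest >= shortest:
        raise ValueError("window plus shift range runs off a sequence")

    shifts = [0] * len(seqs)          # start from the original alignment
    best_h = window_uncertainty(seqs, shifts, zero, winleft, winright)
    for _ in range(max_passes):
        changed = False
        for i in range(len(seqs)):    # consider each sequence in turn
            original = shifts[i]
            best_shift = original
            for candidate in range(shiftmin, shiftmax + 1):
                shifts[i] = candidate  # others are held fixed
                h = window_uncertainty(seqs, shifts, zero, winleft, winright)
                if h < best_h - 1e-12:  # keep only strict improvements
                    best_shift, best_h = candidate, h
            shifts[i] = best_shift
            if best_shift != original:
                changed = True
        if not changed:               # no alignment changed: run is complete
            break
    return shifts, best_h


# Toy data (made up for this sketch): three aligned copies of a site and
# one copy displaced by a single base.
seqs = ["ACGTGAATTCAGTC"] * 3 + ["TACGTGAATTCAGT"]
shifts, h = relax_alignment(seqs, zero=4, winleft=0, winright=5,
                            shiftmin=-2, shiftmax=2)
print(shifts, round(h, 3))  # -> [0, 0, 0, 1] 0.0
```

   Because a sequence moves only when H strictly decreases, H never
   increases during a run, so a run always ends; it may still end in a
   local minimum, which is why the program repeats runs from shuffled
   starting alignments.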

   In version 2.33 and earlier, the program did not keep track of the
   organism and chromosome names in bestinst. This file is now
   superseded by the malin program, which copies the inst file to cinst
   and modifies it according to one of the alignments in optinst.

   Statistical testing

   We have found a method to work with malign that gives reliable
   results. First run malign many times (e.g. 100) on the sites of
   interest using the timeseed (with at least 1 second delay between
   runs so that the timeseed changes). Collect the information content
   distribution. Then extract sequences of the same length from random
   places on the host organism, or use comp to get the composition of
   the host and the markov program to generate a random set. Run malign
   again with the same number of sequences and parameters. If you find
   that the two distributions differ significantly (using the ttest
   program) then you've got something. This was useful for us in a paper
   we are just finishing: in one case we see a distinct pattern clearly
   distinguishable from the random host sequences, and in another the
   distributions were identical.
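
   As a rough illustration of the comparison step, the sketch below
   computes a two-sample t statistic for two lists of information
   content values. It is not the ttest program: the function name is
   hypothetical, and the choice of Welch's unequal-variance form is an
   assumption made here, since this page does not say which test ttest
   performs.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom.

    a, b: lists of information content values (bits), for example one
    value per malign run on the real sites and on the random control.
    """
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

   A large absolute t for the given degrees of freedom suggests that the
   two distributions differ; the significance level still has to be
   looked up or computed separately.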

********************************************************************************

   Summary of file output:

   Malign produces: uncert, newalign, optalign, optinst, malignxyin, output

   -- output --
      Line 7 of "malignp" controls output
      Parameter:  1 - every run, pass, and uncertainty will be output
                      to the screen
                  0 - only the lowest uncertainty run will be output
                 -1 - nothing will be output
      *NOTE* no file will be produced regardless of the parameter

   -- newalign --
      Line 8 of "malignp" controls newalign
      Parameter:  1 - produces a full newalign file
                  0 - produces a smaller newalign file (about half the size)
                 -1 - produces no newalign file
      *NOTE* this is the largest file produced and is unnecessary

   -- uncert --
      Line 9 of "malignp" controls uncert
      Parameter:  1 - produces a full uncert file
                  0 - produces a smaller uncert file (about half the size)
                 -1 - produces an empty uncert file
      *NOTE* file will be produced regardless of the parameter, however
             this file is large and unnecessary

   -- optalign --
      *NOTE* this file will always be produced and is needed to run malin

   -- optinst --
      *NOTE* this file will always be produced and is needed to run malin

   -- malignxyin --
      *NOTE* this file will always be produced and can be used to plot
             data using xyplo

   If you set your parameters so that newalign and uncert are not
   created, this can save some space.

   (Thanks to Brent M. Jewett for compiling this information on
   2001 Feb 7.)

********************************************************************************

documentation

   A paper describing the algorithm in detail is available from
   https://alum.mit.edu/www/toms/papers/malign/malign.pdf

   @article{Schneider.Mastronarde.malign,
   author = "T. D. Schneider and D. Mastronarde",
   title = "{Fast} multiple alignment of ungapped {DNA} sequences
   using information theory and a relaxation method",
   journal = "Discrete Applied Mathematics",
   note = "https://alum.mit.edu/www/toms/papers/malign",
   volume = "71",
   pages = "259-268",
   year = "1996"}

   The use of malign to align sequences with a subtle pattern is
   described in:

   @article{Toledano1994,
   author = "M. B. Toledano and I. Kullik and F. Trinh and P. T. Baird
   and T. D. Schneider and G. Storz",
   title = "Redox-Dependent Shift of {OxyR-DNA} Contacts Along An
   Extended {DNA} Binding Site: A Mechanism for Differential Promoter
   Selection",
   journal = "Cell",
   volume = "78",
   pages = "897-909",
   year = "1994"}

   For how the information content and small sample correction are
   computed:

   @article{Schneider1986,
   author = "T. D. Schneider and G. D. Stormo and L. Gold
   and A. Ehrenfeucht",
   title = "Information content of binding sites on nucleotide
   sequences",
   journal = "J. Mol. Biol.",
   volume = "188",
   pages = "415-431",
   year = "1986"}

see also

   Paper Schneider.Mastronarde.malign:
   https://alum.mit.edu/www/toms/paper/malign/

   Program to graph the malignxyin file: xyplo.p
   You can use the malign.xyplop file for the xyplop and the malignxyin
   for the xyin. Set the xyplom to be empty. I ALWAYS make this graph
   to see what is going on.

   Program to make delila instructions from nth alignment of malign:
   malin.p

   Example parameter file, malignp: malignp

   A COMPLETE SET FOR DEMONSTRATING THE PROGRAM

   malign.inst
      instructions for grabbing the first 6 EcoRI sites on the E. coli
      genome, but messed up by setting the last digit to zero

   malign.book
      The Delila book corresponding to malign.inst

   malign.malignp
      Parameter file for malign to realign the inst and book.

   If one uses malin to pick the first alignment, one finds that
   they are correctly realigned:

                          -                   +
                          1--------- +++++++++1
                          098765432101234567890
                          .....................
   EcoRI U00096  3842 + 1 cgacctgccggaattcagcct
         U00096 12889 + 2 tctggttgaagaattcaagaa
         U00096 32545 + 3 tcagggtatcgaattcgacta
         U00096 50237 + 4 ggtattcagcgaattccacga
         U00096 56282 + 5 agaggtagcggaattcgttct
         U00096 96860 + 6 gctacgtcaggaattcctgct

   Program for listing aligned sequences, as above: alist.p

   Program for comparing two distributions: ttest.p

author

   David Mastronarde and Tom Schneider

bugs

   The realignment algorithm, which shifts all sequences by the same
   amount to attempt to keep the window near its original position, is
   somewhat ad hoc in nature and the effects of different settings for
   its parameters have not been explored.

   If the window spans two real sites with competing alignments, many
   optimal but meaningless alignments with similar uncertainties may be
   obtained.

   The random sequences can't be examined.

   For the computation of Rsequence, composition is assumed to be
   equiprobable; there is no provision for reading in a cmp file yet.

technical notes

   The Rsequence values in the malignxyin file include the small sample
   correction. The choice between the estimate and the more exact
   computation is determined by constant "kickover".

   The constant maxlen is one longer than the longest sequence.

   The constant maxnseq is the maximum number of sequences.

*)
(* end module describe.malign *)
{This manual page was created by makman 1.45}{created by htmlink 1.62}