By downloading this code you agree to the
Source Code Use License (PDF). |
{ version = 2.72; (* of malign.p 2022 May 21}
(* begin module describe.malign *)
(*
name
malign: optimal alignment of a book, based on minimum uncertainty
synopsis
malign(inst: in, book: in, malignp: in, uncert: out, newalign: out,
optalign: out, optinst: out, malignxyin: out, output: out)
files
inst: delila instructions of the form 'get from 56 -5 to 56 +10;'
book: the book generated by delila using inst
malignp: parameter file with the following parameters:
winleft, winright: left and right ends of window for calculating
uncertainty, relative to aligned base
shiftmin, shiftmax: minimum and maximum shift of aligned base
iseed: integer random seed
nranseq: number of random sequences, or 0 to use sequences in book
nshuffle: number of times to redo alignment after random shuffle
ifpaired: 1 to treat each pair of sequences as complementary strands,
0 not to
standout: output run #, pass # and H to standard output every pass
if 1, every run if 0, or not at all if -1
npassout: output H and alignment every npassout passes to file newalign,
or only at end of runs if zero, or not at all if -1
nshiftout: output L and H(L) every nshiftout sequence shifts (to file
uncert), or only at end of passes if zero, or not at all if -1
tolerance: tolerance in change of H
ntolpass: maximum number of passes with change below tolerance
new parameter allowed but not required (default is i):
alignmenttype: char; 'f' means alignment by First internal coordinate
base, 'b' means alignment by Book, 'i' means alignment by
Instructions. See the alist program for more information.
Normally one will align by delila instructions.
If this parameter is 'f', then the first base of the book is
considered the zero base and if it is 'b' then the zero base is
given by the coordinates of each piece in the book.
uncert: uncertainty as function of position, for the last run, at the
end of each pass or after selected number of sequence shifts
Controlled by variable nshiftout.
newalign: values of H and the relative alignments; starting, final, and
intermediate if selected.
Controlled by variable npassout.
optalign: user-readable listing of unique optimal relative alignments
and number of times each was achieved
optinst: list of unique optimal alignments in absolute coordinates,
to be used to make inst file for selected alignment
This file is like optalign, but the coordinates are for the
original sequence.
malignxyin: a list of the number of occurrences of alignments and their
H values in bits. This may be plotted with xyplo, as described
in the paper. Each line contains these numbers:
rank: from 1 to the number of alignment classes
occurrences: how many times the class was found
H: the uncertainty of the alignment, in bits
R: the information content of the alignment, in bits
with small sample correction.
description
Given a book of aligned sequences, this program searches for the alignment
of the sequences that has the lowest uncertainty, i.e. the highest value of
Rsequence. The user specifies the "window" of bases within which
uncertainty is calculated, and the maximum number of bases that each
sequence is allowed to shift from the original alignment. The program
considers each sequence in turn, shifting it to an alignment with minimum
uncertainty while holding the other sequences fixed. A "pass" is complete
when all sequences have been considered. A "run" is complete when no
alignments have changed in the preceding pass, and the alignment is then
considered "optimal". The first run starts with the original alignment;
every run after that starts with a "shuffled" alignment obtained by shifting
each sequence independently by a random amount between the allowed limits.
The program maintains a list of all of the unique optimal alignments
achieved from these starting alignments, and it outputs them in order of
increasing uncertainty.
In version 2.33 and earlier, the program did not keep track of the
organism and chromosome names in bestinst. This file is now superceeded
by the malin program which copies the inst file to cinst and modifies
it according to one of the alignments in optinst.
Statistical testing
We have found a method to work with malign that gives reliable
results. First run malign many times (e.g. 100) on the sites of
interest using the timeseed (with at least 1 second delay between
runs so that the timeseed changes). Collect the information
content distribution. Then extract the same length sequences from
random places on the host organism or use comp to get the
composition of the host and the markov program to generate a random
set. Run malign again with the same number of sequences and
parameters.
If you find that the two distributions differ significantly (using
the ttest program) then you've got something. This was useful for
us in a paper we are just finishing - in one case we see a distinct
pattern clearly distinguishable from the random host sequences and
in another the distributions were identical.
********************************************************************************
Summary of file output:
Malign produces:
uncert, newalign, optalign, optinst, malignxyin, output
-- output --
Line 7 of "malignp" controls output
Parameter:
1 - every run, pass, and uncertainty will be outputed to the
screen
0 - only the lowest uncertainty run will be outputed
-1 - nothing will be outputed
*NOTE* no file will be produced regardless of the parameter
-- newalign --
Line 8 of "malignp" controls newalign
Parameter:
1 - produces a full newalign file
0 - produces a smaller newalign file (about half the size)
-1 - produces no newalign file
*NOTE* this is the largest file produced and is unnecessary
-- uncert --
Line 9 of "malignp" controls uncert
Parameter:
1 - produces a full uncert file
0 - produces a smaller uncert file (about half the size)
-1 - produces an empty uncert file
*NOTE* file will be produced regardless of the parameter,
however this file is large and unnecessary
-- optalign --
*NOTE* this file will always be produced and is needed to run malin
-- optinst --
*NOTE* this file will always be produced and is needed to
run malin
-- malignxyin --
*NOTE* this file will always be produced and can be used to
plot data using xyplo
If you set our parameters so newalign and uncert are not created,
this can save some space.
(Thanks to Brent M. Jewett for compiling this information on 2001 Feb 7.)
********************************************************************************
documentation
A paper describing the algorithm in detail is available from
<A href="https://alum.mit.edu/www/toms/papers/malign/malign.pdf"
>https://alum.mit.edu/www/toms/papers/malign/malign.pdf</A>
@article{Schneider.Mastronarde.malign,
author = "T. D. Schneider
and D. Mastronarde",
title = "{Fast} multiple alignment of ungapped {DNA}
sequences using information theory and a relaxation method",
journal = "Discrete Applied Mathematics",
note = "https://alum.mit.edu/www/toms/papers/malign",
volume = "71",
pages = "259-268",
year = "1996"}
The use of malign to align sequences with a subtle pattern is described in:
@article{Toledano1994,
author = "M. B. Toledano
and I. Kullik
and F. Trinh
and P. T. Baird
and T. D. Schneider
and G. Storz",
title = "Redox-Dependent Shift of {OxyR-DNA} Contacts Along
An Extended {DNA} Binding Site:
A Mechanism for Differential Promoter Selection",
journal = "Cell",
volume = "78",
pages = "897-909",
year = "1994"}
For how the information content and small sample correction are computed:
@article{Schneider1986,
author = "T. D. Schneider
and G. D. Stormo
and L. Gold
and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide sequences",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431",
year = "1986"}
see also
Paper Schneider.Mastronarde.malign:
https://alum.mit.edu/www/toms/paper/malign/
Program to graph the malignxyin file: xyplo.p
You can use the malign.xyplop file for the xyplop
and the malignxyin for the xyin. Set the xyplom to be empty.
I ALWAYS make this graph to see what is going on.
Program to make delila instructions from nth alignment of malign:
malin.p
Example parameter file, malignp: malignp
A COMPLETE SET FOR DEMONSTRATING THE PROGRAM
malign.inst instructions for grabbing the first 6 EcoRI sites on
the E. coli genome, but messed up by setting the last
digit to zero
malign.book The Delila book corresponding to malign.inst
malign.malignp Parameter file for malign to realign the inst and book.
If one uses malin to pick the first alignment, one finds that they are
correctly realigned:
- +
1--------- +++++++++1
098765432101234567890
.....................
EcoRI U00096 3842 + 1 cgacctgccggaattcagcct
U00096 12889 + 2 tctggttgaagaattcaagaa
U00096 32545 + 3 tcagggtatcgaattcgacta
U00096 50237 + 4 ggtattcagcgaattccacga
U00096 56282 + 5 agaggtagcggaattcgttct
U00096 96860 + 6 gctacgtcaggaattcctgct
Program for listing aligned sequences, as above: alist.p
Program for comparing two distributions: ttest.p
author
David Mastronarde and Tom Schneider
bugs
The realignment algorithm, which shifts all sequences by the same amount
to attempt to keep the window near its original position, is somewhat
ad hoc in nature and the effects of different settings for it parameters
have not been explored. If the window spans two real sites with competing
alignments, many optimal but meaningless alignments with similar
uncertainties may be obtained. The random sequences can't be examined.
For the computation of Rsequence, composition is assumed to be
equiprobable, there is no provision for reading in a cmp file yet.
technical notes
The malignxyin file Rsequence has the small sample correction. The choice
between the estimate and the more exact computation is determined
by constant "kickover".
The constant maxlen is one longer than the longest sequence.
The constant maxnseq is the maximum number of sequences.
*)
(* end module describe.malign *)
{This manual page was created by makman 1.45}
{created by htmlink 1.62}