By downloading this code you agree to the
Source Code Use License (PDF).
{ version = 2.37; (* of palinf.p 2013 Jul 25}
(* begin module describe.palinf *)
(*
name
palinf: find palindromes, based on information theory
synopsis
palinf(book: in, palinfp: in,
fout: out, palinfeatures: out, output: out)
files
book: a book from the Delila system
palinfp: parameters to control palinf, one per line
1. The minimum rsequence of the palindrome to detect.
Alternatively, if the number is negative, it is the
desired significance of the detected peaks, given in
standard deviations.
2. (Optional) size (integer). The largest size palindrome allowed;
in base pairs across both halves of the site. If omitted, the
entire sequence is used (which may be very expensive).
If this number is even, the next higher odd number will be used.
3. (Optional) If the first character of this line is an 'm' then
palinf will plot palindrome size (m) versus information content
(rsequence). A sharply rising curve indicates a good palindrome.
An 'x' means plot position (x) versus information content (rsequence).
Any other character, such as 'n', means list
the detected palindromes.
fout: Locations of palindromes.
In the m mode, the coordinate location of significant palindromes
(i.e. ones that passed the criterion) is given, followed by a graph
that shows the structure and significance of the palindrome from
center to the outside:
at position 725
1 2 3
m even odd<0.1.2.3.4.5.6.7.8.9.0.1.2.3.4.5.6.7.8.9.0.1.2.3.4.5.6.7.8.9.0
1 -0.5 -0.5=. 1 2 3 4 5 . . . . .
2 1.0 1.0 . = 2 3 4 5 . . . . .
3 2.5 0.5 .o 1 e2 3. 4 5 . . . . .
4 4.0 0.0 o 1 2 e3. 4 5 . . . . .
5 5.5 -0.5o. 1 2 .e3 4 5 . . . .
6 7.0 -1.0o. 1 2 . 3 e 4 5 . . . .
7 8.5 0.5 .o 1 2 3 e 4 5 . . . .
8 8.0 2.0 . o1 2 3e 4 5 . . . .
9 7.5 3.5 . 1 o 2 e 4 5 . . . .
10 7.0 3.0 . 1o 2 e3 4 5 . . . .
11 6.5 4.5 . 1 o. 2e 3 . 4 5 . . .
12 6.0 6.0 . 1 . = 3 . 4 5 . . .
13 7.5 7.5 . 1 . 2 = 3 . 4 5 . . .
14 7.0 9.0 . 1 . 2 e o . 4 5 . . .
15 8.5 10.5 . 1 . 2 e .o 4 . 5 . . .
16 8.0 12.0 . 1 . 2 e .3 o 4 . 5 . . .
17 9.5 13.5 . 1 . 2 e.3 o4 . 5 . . .
18 9.0 15.0 . 1 . 2 e .3 4 o 5 . . .
19 8.5 16.5 . 1 . 2e . 3 . 4o 5 . .
20 8.0 18.0 . 1 . e . 3 . 4 o 5 . .
21 7.5 19.5 . 1 . e2 . 3 . 4 o5 . .
22 7.0 21.0 . 1 . e 2 . 3 . 4 5 o . .
23 6.5 22.5 . 1 . e 2 . 3 . 4 5 o . .
24 8.0 22.0 . 1 . e . 3 . 4 5 o . .
25 7.5 21.5 . 1. e 2 . 3 . 4 . o 5 . .
26 7.0 21.0 . 1. e 2 . 3 . 4 . o 5 . .
27 8.5 20.5 . 1. e2 . 3 . 4 .o 5 . .
28 8.0 20.0 . 1. e 2 . 3 . 4 o 5 . .
29 7.5 19.5 . 1. e 2 . 3 . 4 o. 5 . .
30 7.0 21.0 . 1. e 2 . 3 . 4 . o 5 . .
31 6.5 20.5 . 1 e 2 3 4o 5 .
32 6.0 22.0 . 1 e 2 3 4 o 5 .
33 7.5 23.5 . 1 e 2 3 4 o 5 .
34 7.0 25.0 . 1 e 2 3 4 o .
35 6.5 24.5 . 1 e 2 3 4 o5 .
at 725 25.0 bits
The horizontal axis is in bits and the vertical axis is in bases. The
digits 1 to 5 mark standard deviations; 'e' and 'o' mark the even and
odd values, with '=' where the two coincide. With this chart one can
determine the significance of each palindrome. Clearly there is a strong
(nearly 5 standard deviations) odd palindrome at coordinate 725.
In the x mode, the sequence is given:
1 2
x bp even odd<0.1.2.3.4.5.6.7.8.9.0.1.2.3.4.5.6.7.8.9.0.1.2.3.4.5.6.7.8
2 a -0.5 1.5e. 1o2 3 4 5 . . . .
3 c -0.5 -0.5=. 1 2 3. 4 5 . . . .
4 a -0.5 3.0e. 1 o 3. 4 5 . . . .
5 g -0.5 1.5e. o1 2 . 3 4 5 . . .
6 t -0.5 1.5e. o1 2 . 3 4 5 . . .
7 a 1.5 1.5 . = 1 2 3 4 5 . . .
8 a -0.5 3.5e. 1 o 2 3 4 5 . . .
9 g 0.5 -0.5o.e 1 2 3 4 5 . . .
10 a -0.5 1.5e. o 1 2 3 4 5 . . .
11 c -0.5 1.5e. o 1 . 2 3 . 4 5 . .
12 g 3.0 1.5 . o e . 2 3 . 4 5 . .
13 g 6.5 -0.5o. 1 . 2e 3 . 4 5 . .
Here the horizontal axis is again in bits, but the vertical
axis is the location on the sequence (which is why the bp column
shows the bases).
In the n mode, only a summary of the palindrome locations is provided:
even odd palindromes
at 537 21.0 bits
at 547 24.5 bits
at 707 22.5 bits
at 725 25.0 bits
at 1101 21.0 bits
at 1180 24.0 bits
at 1279 21.0 bits
at 1322 24.5 bits
palinfeatures: The locations of palindromes in the features format that
the lister program uses. Pass these to lister and the palindrome will
be drawn on your sequence listing.
The features are listed in this format:
define "odd60.K00042" "-" "(((|)))" "(((|)))" -3 -2 -1 0 1 2 3
@ K00042 60 +1 "odd60.K00042" " 4.5 bits"
define "even547.K00042" "-" "((()))" "((()))" -3 -2 -1 0 1 2
@ K00042 547 +1 "even547.K00042" " 4.5 bits"
output: messages to the user.
description
Each piece of the book is searched for imperfect palindromes with
significance determined by the first parameter in palinfp. There are
two kinds of palindrome: even and odd, referring to the size of the
palindrome in bases. An odd palindrome will have a central base, while
an even one will not have one. Method of use: search without the 'm'
option to pick out sites of interest. Then use 'm' under 'stringent
conditions' or on a smaller fragment to see the structure of the
palindrome. The final r value will be the maximum of r values for all
smaller palindromes. Note: equiprobable compositions are assumed for
e(hnb).
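The search described above can be sketched in Python. This is a minimal
illustration, not the palinf source: the function names are hypothetical,
the per-pair scores (+1.5 bits for a complementary pair, -0.5 bits
otherwise) are inferred from the 0.75 bit match value given under Theory,
and the central base of an odd palindrome is assumed to contribute nothing.

```python
# Hypothetical sketch; names and per-pair scores are assumptions.
COMPLEMENT = dict(zip("acgt", "tgca"))

MATCH_BITS = 0.75      # one matching position between two sequences
MISMATCH_BITS = -0.25  # assumed: 0.75 bits minus 1 bit of uncertainty


def arm_scores(seq, left, right, max_m):
    """Cumulative bits for palindromes growing outward from one centre.

    left and right index the innermost pair of bases. Each step outward
    covers two positions, so it adds twice the per-position score.
    """
    seq = seq.lower()
    scores = []
    total = 0.0
    for m in range(max_m):
        i, j = left - m, right + m
        if i < 0 or j >= len(seq):
            break  # ran off the end of the piece
        if seq[i] == COMPLEMENT.get(seq[j]):
            total += 2 * MATCH_BITS
        else:
            total += 2 * MISMATCH_BITS
        scores.append(total)
    return scores


def best_palindrome(seq, left, right, max_m):
    """Maximum cumulative score over all smaller palindromes."""
    scores = arm_scores(seq, left, right, max_m)
    return max(scores) if scores else 0.0


# Even palindrome: the centre lies between indexes 2 and 3 of GAATTC.
ecori_even = arm_scores("GAATTC", 2, 3, 3)   # [1.5, 3.0, 4.5]
# Odd palindrome: centred on index 2 of GACTC, centre base not scored.
hinfi_odd = arm_scores("GACTC", 1, 3, 2)     # [1.5, 3.0]
```

Taking the maximum of the cumulative scores is what lets an imperfect
palindrome survive a few mismatches near its edges.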
Theory:
When there are a large number of sequences, the information needed to
choose one of the 4 bases is log2(4) = 2 bits. In contrast, for only
two sequences (n = 2), the information measure is severely biased.
This reflects the statistical likelihood of finding matches. One
quarter of the time two randomly chosen bases will match. In
information theory terms, this means that a match counts only as 0.75
bits (see reference Schneider1986 appendix figure A2). So, for
example, the restriction site for EcoRI, GAATTC is 6 x 2 = 12 bits when
taken from many examples of the site (as when EcoRI binds). However,
as a single sequence, it only counts as 6 x 0.75 = 4.5 bits. This
effect prevents one from identifying spurious palindromes, but it is,
unfortunately, not intuitive.
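The 0.75 bit figure can be checked by enumeration. The sketch below is
illustrative only; the function names are hypothetical and not taken from
palinf. It averages the uncertainty of every ordered pair of bases under
the equiprobable assumption, then derives the per-position scores and the
chance spread of one palindrome pair. Whether palinf computes its standard
deviations exactly this way is an assumption.

```python
from itertools import product
from math import log2, sqrt

BASES = "acgt"


def uncertainty(column):
    """Shannon uncertainty in bits of the base frequencies in a column."""
    n = len(column)
    h = 0.0
    for b in set(column):
        f = column.count(b) / n
        h -= f * log2(f)
    return h


def expected_uncertainty_two():
    """Mean uncertainty of two equiprobable random bases: e(hnb), n = 2."""
    pairs = list(product(BASES, repeat=2))
    return sum(uncertainty(p) for p in pairs) / len(pairs)


E_HNB = expected_uncertainty_two()  # 0.75 bits
MATCH = E_HNB - 0.0                 # matched position: uncertainty 0
MISMATCH = E_HNB - 1.0              # mismatched position: uncertainty 1


def random_pair_sd():
    """Standard deviation in bits of one palindrome pair found by chance.

    A pair spans two positions and matches one quarter of the time.
    """
    hit, miss = 2 * MATCH, 2 * MISMATCH
    mean = 0.25 * hit + 0.75 * miss
    var = 0.25 * (hit - mean) ** 2 + 0.75 * (miss - mean) ** 2
    return sqrt(var)


ecori_bits = 6 * MATCH  # GAATTC as a single sequence: 4.5 bits
```

Under these assumptions a random pair averages 0 bits with a standard
deviation of about 0.87 bits, so m independent pairs would spread by about
0.87 times the square root of m.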
To avoid duplicate definitions as much as possible, the feature names
(such as odd60.K00042) now include the name of the piece in which the
palindrome is found.
examples
The parameters
21 positive: bits minimum to find; negative: st.dev out to find
71 largest size palindrome to find (measured from center to edge in bases)
m n=indicate detected palindromes; x=show sequence; m=show palindromes
palinfp: parameters for palinf
will locate the E. coli lac operator uniquely in the 401 bases
surrounding the start of the lacZ transcript.
The inverted repeats of pSC101 in GenBank K00042 are located with the
same [13/35/m] parameters at coordinates 707 and 725. (Other things
are found as well; they have been ignored in the literature because
they don't match the inverted repeats.)
The parameters [4.5/6/n] will locate 6-base palindromes.
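A parameter file like the ones above could be read as follows. This is a
hypothetical sketch, not code from palinf: the function name and the
returned tuple are assumptions. It also assumes that the value is the first
whitespace-separated token on lines 1 and 2, that the rest of each line is
a comment, that blank lines are skipped, and that a missing third line
means list mode.

```python
def parse_palinfp(text):
    """Parse palinfp text into (threshold, in_std_devs, size, mode).

    size is None when line 2 is absent (search the whole sequence).
    mode is 'n' when line 3 is absent; that default is an assumption.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("palinfp needs at least one parameter line")

    # Line 1: positive = minimum bits, negative = standard deviations.
    value = float(lines[0].split()[0])
    in_std_devs = value < 0
    threshold = abs(value)

    # Line 2 (optional): largest palindrome size; even sizes round up.
    size = None
    if len(lines) > 1:
        size = int(lines[1].split()[0])
        if size % 2 == 0:
            size += 1

    # Line 3 (optional): only the first character of the line matters.
    mode = "n"
    if len(lines) > 2:
        first = lines[2][0]
        mode = first if first in ("m", "x") else "n"

    return threshold, in_std_devs, size, mode


example = "21 minimum bits\n71 largest size\nm show palindromes\n"
params = parse_palinfp(example)  # (21.0, False, 71, 'm')
```

Returning the sign of the first parameter as a separate flag keeps the
bits threshold and the standard deviation threshold from being confused
later in a search.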
documentation
Schneider, T.D., G.D. Stormo, L. Gold and A. Ehrenfeucht (1986)
The information content of binding sites on nucleotide sequences.
J. Mol. Biol. 188: 415-431.
see also
Example parameter file: palinfp
Program to display the palindrome features: lister.p
author
Thomas D. Schneider and Karen A. Lewis
bugs
If parameter 2 is very large, spurious sites will be found.
technical notes
Limiting the size of the palindrome will increase the search speed.
*)
(* end module describe.palinf *)
{This manual page was created by makman 1.45}