By downloading this code you agree to the
Source Code Use License (PDF).
{ version = 2.09; (* of alpro.p 2020 Mar 26}
(* begin module describe.alpro *)
(*
name
alpro: frequency and information of aligned sequences
synopsis
alpro(protseq: in, alprop: inout, symvec: out, sequ: output, output: out)
files
protseq: Aligned sequences in one of two formats.
The first line, intended for identification of the entire data set, is
skipped. This header line must begin with an asterisk '*' or '>'.
When the header begins with '>', fasta format is used, otherwise the
original protseq format is used.
In the original protseq format, the remaining lines are used for the
sequences. They are divided into `entries'. The beginning of an entry
has any (positive) number of identification lines, each of which begins
with an asterisk '*'. The sequence follows. Gaps are indicated with
dashes (-). The end of the sequence is indicated by a period. The
program automatically figures out what the sequences are so that the
correct kind of information calculation can be made. Sequences can be
DNA (ACGT - 4 characters), RNA (with U - 4 characters), protein (20
characters) or alphabetic (26 characters).
Fasta format has two differences. First, all identification lines begin
with '>'. Second, sequences do not end with a period. Instead, they
end with the next sequence entry identifier (i.e. another '>') or the end
of the file. In this format dashes '-' or dots '.' may be used
as the alignment character.
If fasta format is used then the dots represent bases of the
first sequence. (New as of 2007 jul 16; previously the dot
became a dash.)
Spaces are allowed in the sequence, but they are ignored.
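The two input formats described above can be sketched in Python. This is
a minimal illustration, not the code alpro uses: the function name
read_aligned is hypothetical, and accepting a sequence that has no
identification line is an assumption made here for leniency.

```python
def read_aligned(text):
    """Return (id_lines, sequences) from original protseq or fasta text."""
    lines = text.splitlines()
    if not lines or lines[0][:1] not in ("*", ">"):
        raise ValueError("header line must begin with '*' or '>'")
    fasta = lines[0].startswith(">")
    id_mark = ">" if fasta else "*"
    entries = []       # each entry is [list of id lines, sequence text]
    new_entry = True   # the next id or sequence line opens a new entry
    for line in lines[1:]:   # the header line itself is skipped
        if line.startswith(id_mark):
            # An id line after sequence text (fasta) or after a period
            # (original format) starts the next entry.
            if new_entry or entries[-1][1]:
                entries.append([[], ""])
                new_entry = False
            entries[-1][0].append(line[1:].strip())
        elif line.strip():
            if new_entry:   # assumption: tolerate a sequence with no id line
                entries.append([[], ""])
                new_entry = False
            entries[-1][1] += line.replace(" ", "")   # spaces are ignored
            if not fasta and "." in line:
                new_entry = True   # a period ends an original-format sequence
    sequences = []
    for _, seq in entries:
        if not fasta:
            seq = seq.split(".")[0]
        elif sequences:
            first = sequences[0]
            # In fasta input a dot stands for the first sequence's symbol.
            seq = "".join(first[i] if c == "." and i < len(first) else c
                          for i, c in enumerate(seq))
        sequences.append(seq)
    return [ids for ids, _ in entries], sequences
```

Given a fasta file whose first sequence is ACGTAA, a later entry written
as ..-TCC is read by this sketch as AC-TCC.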
alprop: parameters to control the program, a series of lines:
1. parameterversion: The version number of the program. This allows the
user to be warned if an old parameter file is used.
2. alignment: alignment point for the sequences. This allows one to
assign the numbering in the symvec.
3. normalization: 4 integers (a, c, g, t) giving the relative
frequencies of random sequence to normalize against. Use "1 1 1 1"
normally. If the
data represent randomized chemical synthesis and there are biases in
the bases, use the base biases. Normalization is performed on the
frequencies using equations 1 and 2 of Schneider1989:
fo(b,l) = rho(b) fi(b,l) (1)
f'o(b,l) = f'i(b,l) rho(b) / sum_b [f'i(b,l) rho(b)] (2)
For fi(b,l) being the frequencies defined by the second line, fo(b,l)
should = 1/4 for DNA. This defines rho(b). (Note: rho is not a
function of position for this program, so rho(b) not rho(b,l).)
DO NOT USE THIS FEATURE UNLESS YOU HAVE GENERATED SYNTHETIC RANDOM
SEQUENCE as in Schneider1989. USE 1:1:1:1 NORMALLY.
See Schneider.ridebate1999 for further discussion.
4. varlogo: If the first letter is 'v' then the makelogo
program will produce a 'varlogo'. This method was invented by
Peter Shenkin (Shenkin.Mastrandrea1991). In a regular sequence
logo the vertical scale is the information content. However in
some systems, as in the immunoglobulin variable regions, one is
not interested in the conservation, but rather the degree of
variability. This is best expressed as the uncertainty Hafter
rather than the information R = Hbefore - Hafter. Basically, it
"turns over" the curve.
5. genomic composition: 4 integers (gna, gnc, gng, gnt) giving the
numbers of A, C, G and T in the genome of the organism from which the DNA
or RNA sequences come. This genomic composition is used to
compute Hgenome. The information content is Rsequence(l) = Hgenome -
(H(b,l) + e(n)), where e(n) is the small sample correction. You can
use '1 1 1 1' to set an equiprobable genome. See
Schneider.ridebate1999 for a discussion of relevant issues.
6. sequ: If the first letter is 's' then create a file called
sequ that contains the full sequences followed by a period which
can be used by makebk to create a Delila book. This allows one
to convert periods from the protseq into the corresponding
letter of the initial sequence.
Old versions of alpro will be automatically upgraded to new versions
if you set the version number to less than 1.
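The normalization in parameter 3 can be sketched as follows. This is one
reading of equations 1 and 2, not alpro's own code: the function name
normalize and the dictionary layout are hypothetical, and rho(b) is
derived here by assuming that the four integers describe the random
sequence whose normalized frequencies should all become 1/4.

```python
def normalize(observed, random_counts):
    """Apply equation (2) at one position.

    observed:      base -> observed frequency f'i(b,l)
    random_counts: base -> relative frequency of the random sequence
                   (the four integers of parameter 3)
    """
    total = sum(random_counts.values())
    # Equation (1) with fo = 1/4: rho(b) = (1/4) / fi(b).
    rho = {b: 0.25 / (random_counts[b] / total) for b in random_counts}
    weighted = {b: observed[b] * rho[b] for b in observed}
    norm = sum(weighted.values())   # denominator of equation (2)
    return {b: w / norm for b, w in weighted.items()}
```

With "1 1 1 1" every rho(b) is equal, so the observed frequencies come
back unchanged. With "2 1 1 1" a position whose frequencies match that
bias (0.4, 0.2, 0.2, 0.2) is normalized to 0.25 for each base.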
symvec: Table of frequencies and information content. The information
measure is corrected for small sample size (Schneider et al, 1986).
The format of this file is the same as produced by dalvec.
sequ: raw sequences followed by periods for creating
a delila book. This is generated only if the 6th parameter
is 's'.
output: messages to the user
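The information measure written to symvec, Rsequence(l) = Hgenome -
(H(b,l) + e(n)), can be sketched as below. The function names are
hypothetical. The correction e(n) uses the large-sample approximation
(s - 1) / (2 ln(2) n) for s symbols, which is an assumption made for
this sketch; alpro may compute e(n) differently, especially for small n.

```python
import math

def uncertainty(counts):
    """Shannon uncertainty in bits of a list of symbol counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def small_sample_correction(n, symbols):
    """Approximate e(n) in bits for n sequences and a given alphabet size."""
    return (symbols - 1) / (2 * math.log(2) * n)

def rsequence(position_counts, genome_counts):
    """Information in bits at one position."""
    n = sum(position_counts)
    hgenome = uncertainty(genome_counts)
    hafter = uncertainty(position_counts)
    return hgenome - (hafter + small_sample_correction(n, len(genome_counts)))
```

For a varlogo (parameter 4) the quantity of interest is the uncertainty
Hafter itself, which corresponds to uncertainty(position_counts) in this
sketch, rather than the difference returned by rsequence.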
description
Take an aligned set of sequences and produce input to the makelogo program
for producing a sequence logo. Small sample size and odd genomic
composition are accounted for.
The program will take lines that begin with '>' to accommodate fasta format.
However, sequences still must end with a period.
This program provides a 'short cut' for making logos. The "longer" route
(in terms of numbers of programs and complexity, but not significantly
time to compute) is formed by these Delila programs:
dbbk.p, catal.p, delila.p, alist.p, encode.p, rseq.p, dalvec.p
* dbbk converts from genbank to delila format
* catal creates a delila library
* delila extracts the precise sequences you want (powerful!)
* alist shows the extracted, aligned sequences
* encode converts the aligned sequences to binary vectors
* rseq converts the binary vectors to a table of computed information
* dalvec converts the table of computed information to a symvec
Why use alpro? Because it is currently the *only* way to get a protein
sequence logo, and it is currently the only way to handle sequences with
gaps in them (someday Delila will do these things). Why use the above
Delila programs? Because they provide much more flexibility for choosing
the range of sites (via Delila) and interfacing with the sequence walker
programs (via the information table, rsdata).
examples
* Example protseq file
* This is an example sequence.
AG-EGCTT.
* This is the second example sequence.
* It is the last one.
YLREBS-A.
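The manual does not say how the sequence type is detected. One plausible
rule, shown here only as an assumption, is to pick the smallest of the
four alphabets that contains every symbol seen. The name guess_alphabet
is hypothetical, as are the choices to fold case and to ignore gap
characters.

```python
DNA = set("ACGT")
RNA = set("ACGU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard amino acids

def guess_alphabet(sequences):
    """Return 'DNA', 'RNA', 'protein' or 'alphabetic' for aligned sequences."""
    # Assumption: case is folded, and gaps, dots and spaces are not symbols.
    symbols = set("".join(sequences).upper()) - set("-. ")
    for name, alphabet in (("DNA", DNA), ("RNA", RNA), ("protein", PROTEIN)):
        if symbols <= alphabet:
            return name
    return "alphabetic"
```

Under this rule the two example sequences above would be classed as
alphabetic, because the B in the second sequence is not one of the 20
standard amino acid letters.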
Example parameter file (NOTE CHANGED FORMAT AS OF 1999 NOVEMBER 29!):
1.71 version of alpro that this parameter file is designed for.
1 alignment point
1 1 1 1 normalization bases
normal a first letter 'v' will give varlogo
1 1 1 1 genomic composition
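A parameter file laid out like this can be read line by line, taking
only the leading fields and ignoring the trailing commentary. The sketch
below is illustrative: read_alprop and its dictionary keys are
hypothetical names, and treating the alignment point as an integer and
the sixth line as optional are assumptions.

```python
def read_alprop(text):
    """Parse the parameter lines into a dictionary."""
    lines = text.splitlines()
    params = {}
    params["version"] = float(lines[0].split()[0])
    params["alignment"] = int(lines[1].split()[0])
    params["normalization"] = [int(x) for x in lines[2].split()[:4]]
    params["varlogo"] = lines[3][:1] == "v"    # first letter 'v'
    params["genome"] = [int(x) for x in lines[4].split()[:4]]
    # Assumption: a missing sixth line means no sequ file is wanted.
    params["sequ"] = len(lines) > 5 and lines[5][:1] == "s"
    return params
```

Applied to the five lines of the example, this gives version 1.71,
alignment point 1, "1 1 1 1" for both the normalization and the genomic
composition, and no varlogo or sequ output.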
The files globin.protseq (see below) and protseq.fasta are working examples.
Use protseq.makelogop and colors.protein with makelogo. If you also use
protein.wave as the wave file, you can see how much the logo corresponds to
an alpha helix.
documentation
@article{Hein1990,
author = "Jotun Hein",
title = "Unified approach to alignment and phylogenies",
journal = "Methods Enzymol",
volume = "183",
pages = "626-645",
year = "1990"}
@article{Schneider1986,
author = "T. D. Schneider
and G. D. Stormo
and L. Gold
and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide sequences",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431",
year = "1986"}
@article{Schneider.Stephens.Logo,
author = "T. D. Schneider
and R. M. Stephens",
title = "Sequence Logos: A New Way to Display Consensus Sequences",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "6097-6100",
year = "1990"}
@article{Schneider1989,
author = "T. D. Schneider
and G. D. Stormo",
title = "Excess Information at Bacteriophage {T7} Genomic Promoters
Detected by a Random Cloning Technique",
year = "1989",
journal = "Nucl. Acids Res.",
volume = "17",
pages = "659-674"}
@article{Schneider.ridebate1999,
author = "T. D. Schneider",
title = "Measuring Molecular Information",
journal = "Journal of Theoretical Biology",
volume = "201",
pages = "87-92",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/paper/ridebate/}
{https://alum.mit.edu/www/toms/paper/ridebate/}",
year = "1999"}
@article{Shenkin.Mastrandrea1991,
author = "P. S. Shenkin
and B. Erman
and L. D. Mastrandrea",
title = "{Information-theoretical entropy as a measure of sequence
variability}",
journal = "Proteins",
volume = "11",
pages = "297--313",
pmid = "1758884",
comment = "was Shenkin1991",
year = "1991"}
see also
Standard parameter file: alprop
PROTEIN EXAMPLE: THE GLOBINS
To try the alpro program, use the standard alprop
with a copy of these files as your protseq:
Example input file for protseq: globin.protseq
Example like globin.protseq but in fasta format: globin.fasta
The symvec file generated by alpro with this globin data should be
close to or identical with this symvec: globin.symvec
Then you can use the program that makes the logo, makelogo.p, to
create a logo. You will need these files:
symvec (from above or from the archive): globin.symvec
marks (currently empty): globin.marks
colors file to use for proteins: protein.colors
wave file to use for proteins: protein.wave
makelogop (parameter file to use for this globin example): globin.makelogop
NOTE: each file needs to have the name that makelogo expects. Get
the file and rename it.
After you run makelogo, the resulting sequence logo should be like this:
globin.logo.ps
Read the manual page on makelogo.p to learn how to control the display
more.
There is a more powerful way to make DNA logos. See:
https://alum.mit.edu/www/toms/logoprograms.html
Related programs:
dbbk.p, catal.p, delila.p, alist.p, encode.p, rseq.p, dalvec.p
What the heck is Pascal system error 0?
See:
https://alum.mit.edu/www/toms/pascalp2c.html#system.error.0
Michael Sauder <michael_sauder@stromix.com> has generously written two
perl scripts to convert files into the protseq format that alpro uses:
convert from CLUSTAL format to protseq: clustalw2alpro.pl
convert from MSF format to protseq: msf2alpro.pl
author
Dr. Thomas D. Schneider
Laboratory of Experimental and Computational Biology
permanent email: toms@alum.mit.edu
https://alum.mit.edu/www/toms/
bugs
technical notes
Historical note: The program originally only created a vector that
contained the characters of the alphabet, so the output was called an
'alvec'. To reflect the use of symbols, the name of the output file was
changed to symvec, but I like 'alpro', and 'prosym' is so awkward that I
decided to keep the name alpro. Later I generalized the program to handle
DNA or RNA or alphabetic sequences, but kept the name. Now it might be
considered to be the 'alignment professional'. Oh well.
The feature which adjusts the stack height when there is a small amount of
data (described in the second paragraph of page 6100 of the logo paper)
has been removed now because the ability to display the variance as a
standard deviation by makelogo alerts the person that the position has
little data in it. Thanks to Peter Shenkin for the suggestion.
The original feature was described as follows:
"Positions that contain mostly spacer characters for the alignment are
also reduced in weight by multiplying the information by the maximum
number of sequences and dividing it by the actual number at the spacer
position. Thus if there are 10,000 sequences, a position with 200 A's
would be close to 2 bits of pattern. However, since the position
only represents 2% of the sequences, this program would only give it a
weight of 0.02*2 = 0.04 bits. A better method is not known. However,
this prevents one from being fooled by positions that don't appear in
most sequences."
*)
(* end module describe.alpro *)