By downloading this code you agree to the
Source Code Use License (PDF).
{version = 2.20; (* of dalvec.p 1995 June 24}
(* begin module describe.dalvec *)
(*
name
dalvec: converts Rseq rsdata file to symvec format
synopsis
dalvec(rsdata: in, dalvecp: in, symvec: out, output: out)
files
rsdata: data file from rseq program
dalvecp: parameters to control dalvec
If empty, then the normal sequence logo will be produced.
If the first character of the first line is a 'c', then a chi-logo
is produced. The height of this logo is the information. The
heights of the individual letters are, however, not the frequencies,
but rather their partial chi-square values. The expected value
is 1/4 of the number of characters. This is compared to the observed
value by:
partial chi-square = (observed - expected)^2 / expected
These partial values are normalized and placed in symvec in place of
the relative frequencies. Thus the significance of each letter is
used. When the observed is less than expected, the reported value
is made negative. Makelogo prints these characters upside down.
symvec: reformatting of the rsdata file for input to the makelogo program.
A series of header lines beginning with an asterisk ("*") are produced.
The next line contains one integer which is the number of symbols
in the vector (4 for DNA or RNA, 20 for proteins).
After this, the format of the file is a series of entries. Each entry
has two parts. The first part is on one line and contains
position total information
position: the position number
total: the sum of the values in the vector
information: the information content of the vector.
The remaining parameters on the line are from the rsdata file:
rs: sum of rsl
varhnb: variance of rsl
sumvar: sum of varhnb
ehnb: 2-e(n)
The second part consists of a series of symbol lines. The
number of these matches the number of symbols (4 in the case of DNA),
representing the numbers of bases or amino acids at the position in
an aligned set of sequences. Each line begins with the character of the
symbol, followed by the number of that symbol.
output: messages to the user
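The symvec layout described above can be sketched in a few lines of Python.
This is only an illustration, not the dalvec source: the function name
format_symvec, the single-space field separation, and the five-decimal
information value are assumptions, and the rs, varhnb, sumvar and ehnb fields
that dalvec copies from rsdata are left out. The counts and the information
value in the example are made up.

```python
def format_symvec(entries, symbols="acgt", header=("* symvec sketch",)):
    """Return symvec-style text for (position, counts, information) entries.

    Hypothetical helper: field widths and number formatting are assumptions.
    """
    lines = list(header)              # header lines begin with an asterisk
    lines.append(str(len(symbols)))   # number of symbols in the vector
    for position, counts, information in entries:
        total = sum(counts[s] for s in symbols)   # sum of values in the vector
        # First part of an entry: position, total, information.
        lines.append(f"{position} {total} {information:.5f}")
        # Second part: one line per symbol, the symbol then its number.
        for s in symbols:
            lines.append(f"{s} {counts[s]}")
    return "\n".join(lines) + "\n"


# Illustrative values only; 0.41 is a placeholder, not a computed result.
text = format_symvec([(-1, {"a": 10, "c": 2, "g": 3, "t": 5}, 0.41)])
```

Printing text shows one header line, the symbol count, one entry line and
four symbol lines.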
description
Convert the rsdata file from rseq into a format that the makelogo program
can use. The format is a 'symbol vector'.
ChiLogos: If you leave the parameter file empty, then the standard sequence
logo will be created. However, if the first letter of the file is a 'c',
then a new kind of logo will emerge from makelogo: the chi-logo. The overall
height is as it was before. The height of each individual letter differs:
instead of being proportional to the frequency of the letter, it is
proportional to the significance of the letter, based on the chi-square
test. The expected number of each letter is the number of letters at
that position, n(l), divided by 4 (for simplicity!). The observed number
comes from the rsdata file. The partial chi-square is
(observed - expected)^2 / expected. Note that the sum of the partials is the
normal chi-square. Bases that contribute strongly are therefore drawn large.
Bases that are underrepresented are printed UPSIDE DOWN, so you can (usually)
tell at a glance that you have a chi-logo. The chi-logo allows one to see the
importance of the infrequent letters. The technical mechanism for making a
letter upside down is to make its number negative in the symvec file.
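The chi-logo arithmetic can be tried out with a short Python sketch. It is
not the dalvec code: the name chi_logo_values is hypothetical, and the
normalization (dividing each partial by their sum, so the magnitudes add up
to 1) is an assumption, since the text above only says that the partial
values are normalized. The counts in the example are made up.

```python
def chi_logo_values(counts, symbols="acgt"):
    """Return (signed normalized partials, chi-square) for one position.

    Assumption: partials are normalized by dividing by their sum.
    """
    n = sum(counts[s] for s in symbols)      # n(l): letters at this position
    if n == 0:
        return {s: 0.0 for s in symbols}, 0.0
    expected = n / len(symbols)              # n(l)/4 for DNA
    partial = {s: (counts[s] - expected) ** 2 / expected for s in symbols}
    chi_square = sum(partial.values())       # sum of partials = chi-square
    values = {}
    for s in symbols:
        share = partial[s] / chi_square if chi_square > 0 else 0.0
        # Observed below expected: negative value, printed upside down.
        values[s] = -share if counts[s] < expected else share
    return values, chi_square


# Illustrative counts only: expected is 20/4 = 5 for every base.
values, chi_square = chi_logo_values({"a": 10, "c": 2, "g": 3, "t": 5})
```

With these counts the partials are 5, 1.8, 0.8 and 0, so the chi-square is
7.6; a is positive, c and g are negative, and t is zero.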
examples
see also
rseq.p, makelogo.p
author
Thomas D. Schneider
bugs
The program originally only created a vector that contained the characters
of the alphabet, so the output was called an 'alvec'. To reflect the use of
symbols, the name of the output file was changed to symvec, but I like
'dalvec', and 'dsymvec' is so awkward that I decided to keep the name
dalvec.
*)
(* end module describe.dalvec *)
{This manual page was created by makman 1.45}