By downloading this code you agree to the
Source Code Use License (PDF).
{ version = 9.67; (* of makelogo.p 2023 Jan 02}
(* begin module describe.makelogo *)
(*
name
makelogo: make a graphical `sequence logo' for aligned sequences
synopsis
makelogo(symvec: in, makelogop: in, colors: in, marks: in,
wave: in, logo: out, output: out)
files
symvec: A "symbol vector" file usually created by the alpro or dalvec
program. Makelogo will ignore any number of header lines that begin
with "*". The next line contains one number (k) that defines the number
of letters in the alphabet. The remainder of the file then defines the
composition of letters at each position in the set of aligned sequences.
Each composition begins with 4 numbers on one line:
1. position (integer);
2. number of sequences at that position (integer);
3. information content of the position (real);
4. variance of the information content (real).
This is followed by k lines. The first character on the line
is the character. This is followed by the number of that character
at that position.
Example:
* position, number of sequences, information Rs, variance of Rs
4 number of symbols in DNA or RNA
-100 86 -0.00820 6.3319e-04
a 27
c 18
g 20
t 21
-99 86 -0.00436 6.3319e-04
a 26
c 19
g 17
t 24
* If the symvec file is empty, the alphabet is printed as a test.
* If the error bar values are negative, they are not displayed. This
allows the sites program to control the display when it would not be
appropriate.
* If the number of a symbol is negative in symvec, then the symbol will
be rotated 180 degrees before being printed. The absolute value is used
by makelogo to determine the height. This allows statistical tests
that find rare symbols to be significant to show that a symbol is
rare by drawing it upside down. Notice that ACGT are all easy to
distinguish from their upside down versions, but unfortunately this is
not always true for protein sequences. Program dalvec contains a switch
for turning the letters over in the ChiLogo.
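The symvec layout described above can be read with a short script. The
following Python is only a sketch of that layout, not code taken from
makelogo; the function name parse_symvec and the record field names are
made up for illustration. It assumes that header lines starting with "*"
appear only at the top of the file and that each symbol line starts with
the symbol in its first column.

```python
def parse_symvec(text):
    """Parse symvec text into (alphabet size k, list of position records)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    i = 0
    # Header lines that begin with "*" are ignored.
    while i < len(lines) and lines[i].lstrip().startswith("*"):
        i += 1
    if i == len(lines):
        return 0, []                     # empty symvec: nothing to parse
    k = int(lines[i].split()[0])         # text after the number is a comment
    i += 1
    records = []
    while i < len(lines):
        pos, nseq, info, var = lines[i].split()[:4]
        counts, flipped = {}, set()
        for ln in lines[i + 1 : i + 1 + k]:
            symbol = ln[0]               # first character on the line
            n = int(ln[1:].split()[0])
            if n < 0:
                flipped.add(symbol)      # negative count: drawn upside down
            counts[symbol] = abs(n)      # absolute value sets the height
        records.append({
            "position": int(pos), "nseq": int(nseq),
            "info": float(info), "var": float(var),
            "counts": counts, "flipped": flipped,
        })
        i += 1 + k
    return k, records
```

Applied to the example above, this yields k = 4 and two records, the
first at position -100 with counts a=27, c=18, g=20, t=21.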
makelogop: parameters to control the program.
line 1: contains the lowest to highest range of the binding site to do
the logo graph. (FROM to TO range)
line 2: bar: sequence coordinate before which to print a vertical bar
NOTE: the vertical bar takes up a small amount of horizontal
space. However, to make sure that marks are placed correctly,
the logo is not offset. The bar will overwrite the previous
stack and the next stack will overwrite the bar.
To remove the bar, just set its location outside the range
of the display.
line 3: xcorner and ycorner. This is the coordinate of the lower left
hand corner of the logo (in cm). These should be real numbers.
z xcorner ycorner zerobase
If the first letter of this line is "z", then the program
expects three numbers: xcorner, ycorner and zerobase. Zerobase
is a real number defining the position on the sequence that the
zero of the coordinate system is to be set. For example,
setting zerobase to 0 (zero) will place the center of the 0 at
xcorner, ycorner. This special feature allows the logo to be
precisely placed relative to other logos so that they can be
aligned one above another in a figure.
line 4: rotation: angle in degrees to rotate the logo. Warning:
rotations other than by multiples of 90 degrees may produce
incorrect logos because character scaling depends on the
orientation of the characters. (Essentially, it's a design
fault of PostScript.)
line 5: charwidth: (real, > 0) the width of the logo characters, in cm
squashspray: (real, > 1.0) optional.
2018 Jul 26: If there is a second number after the
charwidth, this becomes squashspray. Make this value
larger than '1.0' if you are getting a line to the right
of your logo; these are squashed letters that are too
small. After being squashed they are sprayed to the
right of the logo for unknown reasons. This appears to
be a bug in PostScript interpreters and printers. If a
character height is smaller than squashspray, it is drawn
as a solid rectangle. Most of the time a user will not
notice this. You can see them by making squashspray a
large value (e.g. 80). See the technical notes for how
and why to use this.
line 6: barheight barwidth: (real, > 0) height of the vertical bar, in
cm, and its width, in cm.
WARNING: if the barwidth is too big, it can cover the
smaller tic marks.
line 7: barbits: (real) The height of the vertical bar, in bits, is
given by the absolute value of barbits. If barbits is positive,
an "I-beam" will appear at the top of the symbol stack. The
I-beam indicates one standard deviation of the stack height,
based entirely on how small the sample of sequences is. If the
value of barbits is negative, the I-beam is not displayed. Not
knowing how big the sampling effects are can fool one, so one
should usually have the I-beam, even if it is ugly.
WARNING: it is not known how to calculate the error for data
derived from a dirty DNA synthesis experiment (see
Schneider1989, reference given below). In that case the error
could be calculated (in program sites) from the number of
sequences, so that the error bar would be an underestimate of
the variation. Unfortunately, when I tried this, people
interpreted the error bar as the size they saw, so this does not
work well visually. Therefore when data come from the sites
program, the I-beam is suppressed.
The combination of barheight and barbits determines the size
of the logo in bits per centimeter. Both must be specified even
if no vertical bar is desired.
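The relation between barheight, barbits and the vertical scale can be
written out as a small calculation. This Python sketch only restates the
description above; the function name logo_scale is hypothetical and is
not part of makelogo.

```python
def logo_scale(barheight_cm, barbits):
    """Return (centimeters per bit, True if the I-beam is displayed)."""
    if barheight_cm <= 0 or barbits == 0:
        raise ValueError("barheight must be > 0 and barbits must be nonzero")
    # The bar is barheight cm tall and represents |barbits| bits.
    cm_per_bit = barheight_cm / abs(barbits)
    # The sign of barbits only switches the I-beam on or off.
    show_ibeam = barbits > 0
    return cm_per_bit, show_ibeam
```

For example, barheight 10 with barbits 2 gives 5 cm per bit with the
I-beam shown, while barbits -2 gives the same scale without the I-beam.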
line 8: Ibeamfraction: real. The fraction of the vertical part of the
Ibeam to draw. When it is 1, the Ibeam is normal. When it is
zero, no vertical line is drawn. At 0.1, only 10 percent of the
top half and 10 percent of the bottom half of the Ibeam is
drawn, for a total of 10 percent of the entire ibeam. More
precisely, this number is the fraction of a standard deviation
to draw. Negative values will reverse the direction of the part
drawn, making a 'thumbtack'. (Note: if this parameter is
missing, as in old makelogop files, the program will ignore
it.) I thank Shmuel Pietrokovski (Structural Biology
department, Weizmann Institute of Science, 76100 Rehovot -
ISRAEL, bppietro@dapsas1.weizmann.ac.il) for suggesting this
method, and for the code to do it. See further description
below.
Note: This parameter can be skipped. The code looks for
a number at this position in the parameter file. If
there is a number, the Ibeamfraction is read. Otherwise,
the Ibeamfraction defaults to 1.0.
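One way to picture the Ibeamfraction is to list the two vertical
segments that get drawn, measured in standard deviations from the top of
the stack. The Python below is a sketch of the geometry as described
above, not the PostScript that makelogo emits; the function name
ibeam_segments is hypothetical, and it assumes that each segment starts
at its crossbar and runs toward the stack top for positive fractions.

```python
def ibeam_segments(ibeamfraction=1.0):
    """Vertical I-beam segments as (start, end) pairs, in standard deviations.

    0 is the top of the stack; the crossbars sit at +1 and -1.
    """
    f = ibeamfraction
    upper = (1.0, 1.0 - f)     # begins at the upper crossbar
    lower = (-1.0, -1.0 + f)   # begins at the lower crossbar
    return upper, lower
```

With 1.0 the two segments meet at the stack top (a normal I-beam), with
0 they have zero length, and with -1.0 they extend outward to 2 standard
deviations (the 'thumbtack').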
line 9: barends: if the first character on the line is a 'b', then bars
are put before and after each line, in addition to the other
bar. The first bar on each line is labeled with tic marks and
the number of bits. If you don't want this, you can remove the
call to maketic in the logo. This is easily done in Unix with
grep -v maketic.startline logo > logo.without.tics
That is, the PostScript code that generates the tic marks is on
one line and there is a comment containing "maketic.startline".
The grep removes that entire line from the logo file. Likewise,
the bars at the start and end of the lines can be removed with:
grep -v makebar.startline logo > logo.without.start.bar
grep -v makebar.endline logo > logo.without.end.bar
If barends is:
b - put a bar on both left and right sides of the logo
l - left only
r - right only
n - no bar on either side
One can control tic marks that are not numbered. These
are called 'subtics' and they are controlled by the
second character on the line.
If the second character on the line (ticcommand) is:
t - it is followed by two numbers: subticsBig and
subticsSmall.
Both numbers define the number of intervals of sub-tic
marks to show for each vertical bit of
the bar.
subticsBig is the number of intervals for big subtics.
These are the same size as the numbered tics.
subticsSmall is the number of intervals for small subtics.
These are half the size of the numbered tics.
Examples:
't 2 10' will put a big tic at 0.5 and 1.5 bits
and small tic marks every 0.1 bit. This is the
default.
't 2 2' will put a big tic at 0.5 and 1.5 bits.
(There will also be small tic marks but since they
are in the same location as the big ones, you
would not see them.)
't 1 1' will make the tic marks fall under the
numbered tic marks so none are visible.
WARNING: if the barwidth is 0.1 (the previous
standard) then the tic marks will get covered. A
barwidth of 0.05 works.
Any other character for ticcommand will be the same as
't 2 10', so this is the default.
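The subtic examples above follow from dividing each bit into the given
number of intervals. The Python sketch below reproduces that arithmetic;
the function name visible_subtics is hypothetical, and it assumes that
numbered tics sit at whole bits and hide any subtic at the same height.

```python
from fractions import Fraction

def visible_subtics(bar_bits, intervals):
    """Subtic heights (in bits) that do not coincide with a numbered tic."""
    positions = []
    for i in range(bar_bits * intervals + 1):
        if i % intervals != 0:             # whole bits carry numbered tics
            positions.append(Fraction(i, intervals))
    return positions
```

For a 2 bit bar, 2 intervals give tics at 1/2 and 3/2 bits, 10 intervals
give 18 visible tics spaced 0.1 bit apart, and 1 interval gives none.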
line 10: showingbox: if the first character on the line is an 's', then
show a box around each character. This is useful to check that
the heights of the letters are correct and to distinguish the
letters from each other when amino acids are represented.
If the character is an 'f', then the box is filled and no
character is shown. This is useful for showing 'logos' of
extremely large size where the individual character is not
readable, but the color is.
line 11: outline: If the first character is 'o' then the characters show
up in outline form. Otherwise, they are solid.
The outline of an entire stack can be turned on or off using the
marks file. The command is toggleoutline and it is treated as a
user defined command. The first parameter is the position, the
remaining three must be given but are ignored. The state of the
outlining will apply to the stack following the given position.
For example,
U 0 0 0 0 toggleoutline
U 1 0 0 0 toggleoutline
will set position 1 to be the reverse of the rest of the logo.
(New as of 1999 April 12)
line 12: caps: if the first letter is 'c' then alphabetic characters are
converted to capital form.
line 13: stacksperline: number of character stacks per line output. A
"stack" is a vertical set of characters. A "line" is a series
of stacks. One may have several lines per page (next
parameter). Special note: This value is used to do the
centering of strings. For a range of -23 to +19, you have to
set it to (19)-(-23)+1 = 43 to get your title centered
correctly. You can get the program to tell you the number '43'
by setting stacksperline very large, in which case it realizes
there is something wrong and does the calculation.
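The centering arithmetic can be checked before editing makelogop. This
Python sketch uses hypothetical function names; the second function
assumes that stacks simply wrap onto a new line once stacksperline is
reached, which is how the parameters above read but is not verified
against the program.

```python
import math

def stacks_in_range(from_pos, to_pos):
    """Number of stacks in the inclusive FROM to TO range."""
    return to_pos - from_pos + 1

def lines_needed(from_pos, to_pos, stacksperline):
    """Lines required if stacks wrap after stacksperline (an assumption)."""
    return math.ceil(stacks_in_range(from_pos, to_pos) / stacksperline)
```

For the range -23 to +19 the first function returns 43, the value to use
for stacksperline when the title should be centered on a one-line logo.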
line 14: linesperpage: number of lines per page output
line 15: linemove: line separation relative to the barheight. Note: This
affects the BoundingBox discussed below.
line 16: numbering: if the first letter is 'n' then each stack is
numbered. Otherwise, the number is suppressed in a PostScript
if statement. This allows you to modify the logo file by hand
to reinstate numbering for only the positions you want by
removing or changing the if statement calls to makenumber. For
example,
numbering {(6) makenumber} if
is the PostScript for making the number "6" under the global
numbering control. To make "6" always be there, change it to:
true {(6) makenumber} if
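The hand edit described above can also be scripted. The following Python
is a sketch that rewrites the logo text for chosen positions; the
function name force_numbers is hypothetical, and it assumes the logo
file uses exactly the spacing shown in the example line.

```python
def force_numbers(logo_text, positions):
    """Make the listed positions always numbered in the logo PostScript."""
    wanted = {str(p) for p in positions}
    out = []
    for line in logo_text.splitlines():
        for p in wanted:
            old = "numbering {(" + p + ") makenumber} if"
            new = "true {(" + p + ") makenumber} if"
            line = line.replace(old, new)   # other positions are untouched
        out.append(line)
    return "\n".join(out)
```

Run with numbering switched off in makelogop, this leaves numbers only
at the positions passed in.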
line 17: shrinking: (real) Factor by which to shrink the characters. If
shrinking <= 0 or shrinking >= 1 then the characters exactly fit
into the box. If shrinking > 0 and shrinking < 1, the
characters are shrunk inside the box. To use this feature, the
parameter showingbox must be on, so that the user does not create a
logo whose height is misleading.
line 18: strings: the number of user defined strings to follow. Each
string definition takes up two lines. The first is the (x,y)
coordinate of the string, the second is the string itself. The
coordinates are in centimeters relative to the coordinate
transforms performed above. (This way, the title position stays
the same relative to the logo.)
line 18+strings+1: (x,y,s) coordinates of first user defined string (if
strings >= 1) followed by the factor by which to scale the
string. A factor of 1 means no scaling. In addition, if the x
coordinate is less than or equal to -1000, then the string is
centered by using the string width, the stacksperline and
charwidth. Note! To allow more parameters after the strings, it is
no longer possible to turn off the strings by setting the number of
strings to 0 while leaving the string lines in the file. If strings
is zero, the string lines must be removed.
line 18+strings+2: the first user defined string (if strings >= 1)
line 18+strings+3: (x,y,s) coordinates of second user defined string (if
strings >= 2)
line 18+strings+4: the second user defined string (if strings >= 2)
Special string controls:
\i italics toggle
To make italics, use \i twice, around the text.
\n 5 give number of sequences at coordinate 5.
More than one \n can be used for different coordinates.
If out of range, give maximum in symvec.
\\ produce backslash
\160 produce the Greek letter pi from the PostScript Symbol font.
These fonts are listed on pages 270 to 273 of the "Red" book
(see references, below).
\r produce Rsequence
\s produce standard deviation
\d decimal places: must be followed immediately by the number of
decimal places to use for the next \r or \s
Example:
\n 0 \i E. coli\i LexA binding sites
will give the number of lexA sites at coordinate 0
and make "E. coli" in italics.
\d2 Rs = \r +/- \s bits
will look like this:
Rs = 5.72 +/- 3.46 bits
For advanced users:
HOW TO MAKE ITALICS IN YOUR STRINGS using PostScript
To allow for italics, use a string like this:
38\) \( E. coli \) IT \(LexA binding sites
This will make the words "E. coli" in
Helvetica-BoldItalic font, but leave "38" and "LexA
binding sites" in Helvetica-Bold. See the technical
notes for how this works. The toggle form "\i" uses the
same method, but simplifies it for the user. This method
allows one to create any PostScript commands.
line 18+2*strings+1: edgecontrol edgeleft, edgeright, edgelow, edgehigh:
edgecontrol is a single character that controls how the bounding
box of the figure is handled. If it is 'p' then the bounding
box will be the page parameters defined in constants inside the
program (llx, lly, urx, ury). Otherwise, there are four real
numbers that define the edges around the logo in cm. To allow a
sequence logo to be embedded into another figure, its size must
be defined in PostScript (with %%BoundingBox). The basic logo
fits within a rectangle, but the numbers along the bottom,
symbols and labels may be anywhere outside. By setting these
four numbers, the edges are defined.
line 18+2*strings+2: ShowEnds: a single character
d: show for DNA 5' and 3' on the logo
p: show for protein N and C on the logo
otherwise: nothing is shown.
line 18+2*strings+3: formcontrol: a single character that determines
the overall form control of the output.
See discussion below and the examples.
n: normallogo. standard sequence logo (or any other character)
v: varlogo. See discussion below for what this is.
e: equallogo. All stack heights are at the maximum.
Of course this loses the useful data about the exact
sequence conservation (measured in bits) at each
position.
r: rarelogo. Plot (1-Pi) for each symbol instead of Pi.
See discussion below.
R: rareequallogo. As with r, but equal stack heights.
WARNING: To avoid missing important biological
discoveries, BEFORE using the equallogo and rarelogo
parameters read this page:
https://alum.mit.edu/www/toms/logorecommendations.html
To avoid a user thinking that a symbol is used when it is
not, for r and R a '.' is plotted instead of the letter
when Pi = 0. This shows up as a black rectangle.
This parameter was implemented on 2011 Mar 09.
The remainder of the file is ignored and may contain comments.
colors: Defines the color of each character printed. Any number of lines
that begin with an asterisk [*] can be used as comments to identify the
file or portions of the file. Put into the file one line for each
character that is to have a color other than black. The line must
contain:
character red green blue
The last three parameters are real values between 0 and 1 (inclusive).
The values depend on the PostScript interpreter, but 0 means black and a
value of 1 means the most bright.
To assign the asterisk a color, precede it with a backslash [as \*].
To assign the backslash a color, precede it with a backslash [as \\].
If the file is empty, the logo is made in black and white and the lower
half of the I-beam error bar is made white so that when it is inside the
letters it is visible.
To make any letter invisible, assign it any color less than zero, for
example -1 -1 -1. This is different than black, which is 0 0 0 and
white which is 1 1 1. The error bar will still be displayed.
Each of the symbols A, C, G and T can represent either DNA or
amino acids. To distinguish between them, the lister program
uses lower case in the colors file for DNA/RNA and upper case for
amino acids. This is now fully implemented in makelogo. Note
that the usual sequence logo for DNA has upper case letters.
This is done using the caps parameter. New as of 2007 Mar 31.
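The colors file rules above (comment lines, backslash escapes, negative
values for invisible letters) can be illustrated with a small reader.
This Python is a sketch rather than the parser used by makelogo; the
function name parse_colors is hypothetical, and it assumes that the
symbol is in the first column and that any negative component makes the
letter invisible.

```python
def parse_colors(text):
    """Map each symbol to its RGB triple and a visibility flag."""
    colors = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("*"):
            continue                          # blank line or comment
        if line.startswith("\\"):
            symbol, rest = line[1], line[2:]  # escaped asterisk or backslash
        else:
            symbol, rest = line[0], line[1:]
        r, g, b = (float(x) for x in rest.split()[:3])
        colors[symbol] = {"rgb": (r, g, b), "visible": min(r, g, b) >= 0}
    return colors
```

Symbols that do not appear in the result keep the default color, black.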
marks: an empty file means no marks are made. Otherwise, a series of
lines contain data that define marks to be placed on the output:
symbol and kind: the first two characters on the line define
the symbol and then how to draw the symbol. The symbols are:
c circle
b box
l line
t triangle
s square
u Begin a user defined symbol. Define a symbol yourself in
PostScript. The PostScript code may extend over several lines.
The end of the code is given by the character "!" at the start
of a line. (The rest of the "!" containing line is ignored.)
This allows one simply to insert some pre-tested PostScript
between "u" and "!" lines of the marks file. The code will be
passed 4 coordinates and any other parameters given in the U
line (defined below).
U Call the user defined symbol. The U must be followed by 4
coordinates numbers: x1 y1 x2 y2. The x1 and x2 are in bases,
while y1 and y2 are in bits. The remainder of the line is
copied to the logo file, so you can have more parameters there.
End the line with the name of one of your defined symbols.
* a comment line
% a comment line
The drawing types are:
s stroke
f fill
d dash
If marksymbol is c, t or s, three more parameters are required:
base coordinate: a real number that determines the center of the
mark
bits coordinate: a real number that determines the position of the
mark in bits.
scale: a positive real number in units of bases that is the
diameter of the circle or the diameter of a circle that the
equilateral triangle would be inscribed in. For the square, it
is the side. By using units of bases, these marks
automatically will fit between bases on the logo, as the
charwidth is changed or other scaling is done.
If marksymbol is b, l or U, 4 more parameters are required:
base coordinate: a real number that determines end 1
bits coordinate: a real number that determines end 1
base coordinate: a real number that determines end 2
bits coordinate: a real number that determines end 2
A line is drawn from end 1 to end 2; for a box, the two ends define
the diagonal. Note that the center of a base is defined as an integer,
so one must add 0.5 to base coordinates to put a box around a
base. You may make the user symbol use these coordinates however you
want.
********************************************************************
* The symbols MUST be in increasing order of position in the site! *
********************************************************************
The symbols must be given in the order of their use in bases. If
symbols are missing from the logo, check their order.
Since symbols are drawn concurrently with the logo letters, drawing a
box or line symbol that has an end 2 to the left of the current
position (which is end 1) will draw over the letters (because the
letter was already drawn), while drawing to the right will draw under
the letters (because the base is drawn over later).
There is a special predefined user mark that allows one to toggle
stacks between regular and outlined characters; see the outline
parameter of makelogop.
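Because out-of-order marks are a common reason for missing symbols, it
can help to check a marks file before running makelogo. The Python below
is a sketch of such a check; the function names are hypothetical, and it
assumes that the ordering applies to the first base coordinate of each
mark and that equal positions are acceptable.

```python
def mark_positions(text):
    """First base coordinate of each mark, in file order."""
    positions = []
    in_user_code = False
    for line in text.splitlines():
        if in_user_code:
            # A line starting with "!" ends the user defined symbol.
            in_user_code = not line.startswith("!")
            continue
        if not line.strip() or line[0] in "*%":
            continue                           # blank or comment line
        if line[0] == "u":
            in_user_code = True                # PostScript definition follows
            continue
        positions.append(float(line.split()[1]))
    return positions

def marks_in_order(text):
    """True if the marks never step back to an earlier base position."""
    pos = mark_positions(text)
    return all(a <= b for a, b in zip(pos, pos[1:]))
```

If marks_in_order returns False, sort the mark lines by their first
coordinate before running makelogo.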
wave: Define a cosine wave over the graph. Empty file means no cosine
wave, otherwise the parameters of the wave are given one per line:
extreme: char; h or l, the extreme high or low point on the curve
defined by the wavelocation and wavebit
wavelocation: real; the location in bases of the extreme
wavebit: real; the location in bits of the extreme
waveamplitude: real; the amplitude of the wave in bits
wavelength: real (positive); the wave length of the wave in bases
dash: real; the size of dashes in cm. Zero or negative means no
dashes.
If the first character on the line is 'd' then a
new method of dash control is applied. In this case there
are three parameters:
dashon: real; the size of dashes ON segment in cm.
Zero or negative means no dashes.
dashoff: real; the size of dashes OFF segment in cm.
dashoffset: real; the offset for dashing.
These parameters follow the PostScript Language Reference Manual,
Second Edition, page 500. Dashes start with the ON segment,
followed by the OFF segment. They are shifted by the offset,
which is the amount into the dash cycle to start.
NOTE: The distances are defined along the length of the cosine,
which is a function of the waveamplitude, bits per cm (barbits)
and wavelength and bitsperbase. For now it is simplest to
empirically first determine the dashon and dashoff values that give
repeats every wavelength, then set the dashoffset.
thickness: real; thickness of the wave in cm. Zero or negative means
the value defaults to PostScript line thickness.
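The wave parameters define a cosine curve in (bases, bits) coordinates.
The Python below is a sketch of one reading of those parameters, not the
PostScript that makelogo generates; the function name wave_height is
hypothetical, and it assumes that waveamplitude is the usual cosine
amplitude, so the curve spans twice the amplitude from trough to crest.

```python
import math

def wave_height(x, extreme, wavelocation, wavebit, waveamplitude, wavelength):
    """Height of the wave in bits at base coordinate x."""
    phase = 2.0 * math.pi * (x - wavelocation) / wavelength
    if extreme == "h":
        # Crest at (wavelocation, wavebit); the curve falls away from it.
        return wavebit - waveamplitude * (1.0 - math.cos(phase))
    # extreme == "l": trough at (wavelocation, wavebit); the curve rises.
    return wavebit + waveamplitude * (1.0 - math.cos(phase))
```

Under this reading, a wave with extreme l at base 2 and 1.0 bits,
amplitude 0.5 and wavelength 10.4 starts at 1.0 bits at base 2, peaks
half a wavelength later and returns to 1.0 bits at base 12.4.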
logo: the output file, a PostScript program to display the logo.
The last line of the file gives:
Rsequence = area under the logo (bits)
small sampling error (bits)
range from, (bases)
range to, (bases)
information density = Rsequence /(two times bases in range)
output: messages to the user
description
The makelogo program generates a `sequence logo' for a set of aligned
sequences. A full description is in the documentation paper. The input
is a `symvec', or symbol vector, that contains the information at each
position and the numbers of each symbol. The output is in the graphics
language PostScript.
The program now indicates the small sample error in the logo by a small
'I-beam' overlaid on the top of the logo. Although the user may turn
this off to make pretty logos, I strongly recommend use of it to avoid
being fooled by small amounts of data.
********************************************************************************
Making A Logo As Part of Another Figure
---------------------------------------
The normal logo file is designed to stand by itself. However, it is often
desirable to incorporate the logo as part of another figure. The
difficulty is that the stand-alone logo PostScript program will erase the
page (which wipes out any previous figure drawing) and show the page
(which prints the page right after the logo). To prevent these actions,
the lines of PostScript code which do this have comments that contain the
word REMOVE. All you have to do is remove these lines and your logo will
be able to fit into your figure. In Unix this can be easily done by:
grep -v REMOVE logo > logo.ps
If you do this, then it is advisable to do the erasepage and the showpage
yourself. A convenient way to do this is to have several files that
contain postscript commands, and to use a shell script to concatenate them
together:
cat start.ps logo.ps end.ps > myfigure.ps
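Where grep and cat are not available, the same two steps can be done in
a few lines of Python. This is a sketch equivalent to the commands
above; the function names are hypothetical, and start_ps and end_ps
stand for the contents of your own start.ps and end.ps files.

```python
def strip_remove_lines(logo_text):
    """Drop every line containing REMOVE, like 'grep -v REMOVE'."""
    kept = [ln for ln in logo_text.splitlines() if "REMOVE" not in ln]
    return "\n".join(kept) + "\n"

def build_figure(start_ps, logo_text, end_ps):
    """Concatenate the pieces, like 'cat start.ps logo.ps end.ps'."""
    return start_ps + strip_remove_lines(logo_text) + end_ps
```

The erasepage and showpage calls then belong in start_ps and end_ps, as
described above.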
If you have a large number of logos together in one figure, you can reduce
the size of the final figure by another trick. Logo files begin with a
header which is the same from one figure to the next assuming you don't
change colors/letter combinations. So the first logo in the figure must
contain this header, but later ones don't really need it. You can remove
the header material by using the censor program:
censor < logo.ps > logo.no.header.ps
EXAMPLE:
Suppose that you have two logo files, 1 and 2. Then to join them, you can
use the unix commands:
grep -v REMOVE 1 > 3
censor < 2 >> 3
echo "showpage" >> 3
The grep removes the REMOVE lines from file 1 and puts the rest into the
start of file 3. The censor removes the duplicate PostScript definitions
from file 2 and appends the remainder to the end of 3. Finally, the echo
puts a 'showpage' command on the end of the file so that the printer will
print the page (otherwise you won't get any printout).
********************************************************************************
Playing with Ibeams
-------------------
Shmuel Pietrokovski (bppietro@dapsas1.weizmann.ac.il) suggested that the
middle of the Ibeams be removable so that it doesn't get in the way of
logos. That is, a normal Ibeam looks like:
-----
  |
  |
  |
  |
  |
  |
-----
This is sitting on the top of the sequence logo stack of letters. This is
obtained by setting the Ibeamfraction to 1.0. Shmuel suggested that there
be a parameter to remove the vertical part or to have it partway:
-----
  |
  |

  |
  |
-----
This is obtained by setting the Ibeamfraction to 0.6. Setting
Ibeamfraction to -1.0, puts the vertical parts OUTSIDE the bars. This way
one can read one standard deviation of the stack and also have a mark at
(for example) 2 standard deviations out at the tips of the thumb tacks:
  |
  |
-----

-----
  |
  |
********************************************************************************
How do I disable the error bar?
-------------------------------
Set barbits negative. If I were to do it again I'd separate the
variables. For example, -2 gives a height of 2 bits for the bar but
no error bars are drawn.
********************************************************************************
How do I label the residues every 5, for example 0, 5, 10, 15 ...
-----------------------------------------------------------------
There isn't a way to do this directly since I like having all positions
labeled because it is less work for the reader to figure out where things
are. However, you can remove all numbering (set the numbering parameter
to anything but 'n'). Then you can use the marks file to put numbers
where you want. See: marks.lettering for a mechanism that I put together
for this. (There is a link from the 'See Also' section below.) You could
even rotate the numbers if you know how to program PostScript. If you get
a nice working example, I can add it to my set. If not, you *might*
convince me to generate the marks file if you describe what you want and
marks.lettering doesn't do it ;-).
********************************************************************************
How do I set the default paper size (A4 or letter)?
---------------------------------------------------
The simplest thing is to place the logo wherever you want on the page.
You can set the box boundaries with the edgecontrol variables.
You can also set the PostScript page size by changing the four constants:
llx, lly, urx and ury. This would require a recompile. These numbers are
in 'points', one point is 1/72 inch (I know, silly!) but you can convert
precisely to cm by multiplying by 2.54/72.
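The conversion between points and centimeters is simple enough to keep
as a helper when choosing values for llx, lly, urx and ury. The Python
below is a sketch of that arithmetic; the names are hypothetical and not
part of makelogo.

```python
CM_PER_POINT = 2.54 / 72.0   # one point is 1/72 inch, one inch is 2.54 cm

def points_to_cm(points):
    """Convert PostScript points to centimeters."""
    return points * CM_PER_POINT

def cm_to_points(cm):
    """Convert centimeters to PostScript points."""
    return cm / CM_PER_POINT
```

For example, the 612 point width of US letter paper is 21.59 cm, and the
21.0 cm width of A4 paper is about 595.3 points.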
********************************************************************************
How do I make a logo that has several lines?
--------------------------------------------
If you are working with a protein or a very long DNA sequence, you might
consider setting linesperpage to more than 1 and adjusting stacksperline
and linemove accordingly.
********************************************************************************
rarelogo:
Sometimes one would like to examine the rare symbols. This is one
technique for doing so. A parameter called 'formcontrol' is set to
'r' to use this.
In a conventional logo, for the bases A, C, G, T the heights are
set to the conservation. Call this "1" so that A+C+G+T = 1.
A "rare logo" graphs:
(1-A)
(1-C)
(1-G)
(1-T)
The sum of these is 4 - (A+C+G+T) = 3. That's a bit strange, but ok!
It says that you plot each symbol with a height:
conservation*(1-Pi)/(M-1)
Where M is the number of symbols in the alphabet.
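The rare logo height formula can be tried on a single position. The
Python below is a sketch of the formula as stated above, not the code in
makelogo; the function name rare_heights is hypothetical, and freqs is
assumed to hold the frequency Pi of every symbol in the alphabet.

```python
def rare_heights(conservation, freqs):
    """Symbol heights for a rare logo: conservation*(1-Pi)/(M-1)."""
    m = len(freqs)                       # M, the alphabet size
    return {s: conservation * (1.0 - p) / (m - 1) for s, p in freqs.items()}
```

Dividing by M-1 makes the heights add up to the conservation again, so
the stack is as tall as in a normal logo while the rarest symbol gets
the largest share.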
varlogo: If the first letter is 'v' then the makelogo program will
produce a 'varlogo'. This method was invented by Peter Shenkin
(Shenkin.Mastrandrea1991). In a regular sequence logo the vertical
scale is the information content. However in some systems, as in
the immunoglobulin variable regions, one is not interested in the
conservation, but rather the degree of variability. This is best
expressed as the uncertainty Hafter(l) rather than the information
R(l) = Hbefore - Hafter(l) (where 'l' is the position in the
sequence alignment). Basically, it "turns over" the curve. This
is also implemented in alpro.
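The two quantities can be computed from the symbol counts at one
position. The Python below is a sketch with hypothetical function names;
it assumes Hbefore = log2(M) for equally likely symbols and applies no
small sample correction, so its results will not exactly match the
information values stored in a symvec.

```python
import math

def uncertainty_after(counts):
    """Hafter(l) in bits, from the symbol counts at one position."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def information(counts):
    """R(l) = Hbefore - Hafter(l), with Hbefore = log2(alphabet size)."""
    h_before = math.log2(len(counts))    # assumes equiprobable symbols
    return h_before - uncertainty_after(counts)
```

A fully conserved DNA position gives Hafter = 0 and R = 2 bits, while
equal counts of all four bases give Hafter = 2 bits and R = 0, which is
the sense in which the varlogo turns the curve over.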
********************************************************************************
@article{Shenkin.Mastrandrea1991,
author = "P. S. Shenkin
and B. Erman
and L. D. Mastrandrea",
title = "{Information-theoretical entropy as a measure of sequence
variability}",
journal = "Proteins",
volume = "11",
pages = "297--313",
pmid = "1758884",
year = "1991"}
see also
Example sequence logos:
A Gallery of Sequence Logos:
https://alum.mit.edu/www/toms/sequencelogo.html
Glossary definition of Sequence Logo:
https://alum.mit.edu/www/toms/glossary.html#sequence_logo
-----------------------
FORM CONTROL FOR SEQUENCE LOGOS
controlled by parameter formcontrol
WARNING: To avoid missing important biological discoveries, BEFORE
using the equallogo and rarelogo parameters read this:
https://alum.mit.edu/www/toms/logorecommendations.html
Normal logo (normallogo):
Note: the sine wavelength is 3.6 amino acids, corresponding to an alpha helix.
Variable logo (varlogo):
Plot Hafter(l) instead of R(l).
Equal logo (equallogo):
Equal stack heights. Note that sequence conservation data is lost.
SEE WARNING ABOVE!
Rare logo (rarelogo):
Plot 1-Pi instead of Pi. Normal stack heights.
Rare-Equal logo (rareequallogo):
Plot 1-Pi instead of Pi, equal stack heights.
SEE WARNING ABOVE!
-----------------------
FULL WORKING EXAMPLE
This is a full test of makelogo.
1. obtain these files:
lambdacicro.colors lambdacicro.makelogop lambdacicro.symvec
lambdacicro-logo.ps lambdacicro.marks lambdacicro.wave
2. Except for the lambdacicro-logo.ps file,
copy these to files without the 'lambdacicro.'.
3. Run makelogo.
4. Except for the version number, makelogo should create a logo file
identical to lambdacicro-logo.ps.
Unix commands for doing the test are:
cp lambdacicro.colors colors
cp lambdacicro.makelogop makelogop
cp lambdacicro.symvec symvec
cp lambdacicro.marks marks
cp lambdacicro.wave wave
makelogo
diff lambdacicro-logo.ps logo
-----------------------
Related programs:
There are several ways to get the symvec file, this is described in:
https://alum.mit.edu/www/toms/logoprograms.html
1. The Alpro route to making logos: alpro.p
2. The Delila route to making logos:
dbbk.p, catal.p, delila.p, alist.p, encode.p, rseq.p, dalvec.p
3. A program that creates a symvec from a list of words is:
alword.p
-----------------------
To PRINT LOGOS see:
https://alum.mit.edu/www/toms/postscript.html
-----------------------
Other related programs:
rsgra.p, sites.p, censor.p, rav.p
Example input files:
symvec, makelogop, colors, wave, marks
Some demonstration input files:
symvec.demo, colors.demo, makelogop.demo, wave.demo, marks.demo
Resulting output file:
logo.demo
Example output files, in postscript:
logo
Other examples and useful control files:
colors.protein
marks.arrow
marks.ellipse
marks.lettering
marks.plusminus
marks.symbols
marks.userdefined
author
Thomas D. Schneider, Ph.D.
toms@alum.mit.edu (permanent)
https://alum.mit.edu/www/toms (permanent)
examples
makelogop parameters:
-15 2 FROM to TO range to make the logo over
1 sequence coordinate before which to put a bar on the logo
15 2 (xcorner, ycorner) lower left hand corner of the logo (in cm)
90 rotation: angle to rotate the graph
1.0 charwidth: (real, > 0) the width of the logo characters, in cm
10 0.1 barheight, barwidth: (real, > 0) height of vertical bar, in cm
2 barbits: (real) height of the vertical bar, in bits; < 0: no I-beam
no bars barends: if 'b' put bars before and after each line
show showingbox: if 's' show a dashed box around each character; f = fill
no outline outline: if 'o' make each character as an outline
100 stacksperline: number of character stacks per line output
1 linesperpage: number of lines per page output
1.1 linemove: line separation relative to the barheight
numbers numbering: if the first letter is 'n' then each stack is numbered
1 shrinking: factor by which to shrink characters inside dashed box
2 strings: the number of user defined strings to follow
2 14 1 coordinates of the first string (in cm)
First TITLE
3 13 1 coordinates of the second string (in cm)
SECOND TITLE
n 2 1 2 1 edgecontrol (p=page), edgeleft, edgeright, edgelow, edgehigh in cm
d d: 5' 3'; p: N C; else: nothing shown on ends
makelogop.dna: parameters for the makelogo program, version 8.31 or higher
colors:
* Color scheme for logos of DNA (for the makelogo program).
* color order is red-green-blue
*
* green:
A 0 1 0
a 0 1 0
*
* blue:
C 0 0 1
c 0 0 1
*
* red:
T 1 0 0
t 1 0 0
*
* orange:
G 1 0.7 0
g 1 0.7 0
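To make the layout of the colors file concrete, here is a small Python sketch
that reads it into a table. It assumes, based only on the example above, that
lines beginning with an asterisk are comments and that every other line holds
a symbol followed by red, green and blue values. The name parse_colors is
hypothetical, and this is not how makelogo itself reads the file.

```python
def parse_colors(text):
    """Map each symbol to a (red, green, blue) tuple of floats."""
    table = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("*"):
            continue  # skip blank lines and '*' comment lines
        symbol, red, green, blue = line.split()[:4]
        table[symbol] = (float(red), float(green), float(blue))
    return table
```

Upper and lower case symbols are separate entries, which is why the example
file lists both.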
wave:
l extreme: char; h or l, the high or low extreme to be defined
2 wavelocation: real; the location in bases of the extreme
1.0 wavebit: real; the location in bits of the extreme
0.5 waveamplitude: real; the amplitude of the wave in bits
10.4 wavelength: real; the wave length of the wave in bases
0 dash: real; the size of dashes in cm. dash <= 0 means no dashes
0.1 thickness: real; thickness of the wave in cm. <=0: default.
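The wave parameters can be read as describing a cosine curve. The Python
sketch below shows one plausible reading. It assumes that the named extreme
(crest for h, trough for l) sits at the point (wavelocation, wavebit) and that
waveamplitude is half the crest-to-trough distance. Both points are
assumptions made for illustration, and the function name wave_height is
hypothetical.

```python
import math


def wave_height(position, extreme, wavelocation, wavebit, waveamplitude, wavelength):
    """Height of the wave in bits at a given base position.

    Assumes a cosine whose crest ('h') or trough ('l') sits at
    (wavelocation, wavebit) and whose amplitude is half the
    crest-to-trough distance.
    """
    phase = 2.0 * math.pi * (position - wavelocation) / wavelength
    if extreme == "h":
        return wavebit - waveamplitude * (1.0 - math.cos(phase))
    if extreme == "l":
        return wavebit + waveamplitude * (1.0 - math.cos(phase))
    raise ValueError("extreme must be 'h' or 'l'")
```

Under these assumptions the example file puts a trough of 1.0 bit at base 2,
and the curve repeats every 10.4 bases.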
marks:
* example marks file for makelogo 8.06 and higher
*
* square stroked, filled and dotted:
ss -2 -0.40 0.5
sf -1 -0.30 0.5
sd 0 -0.20 0.5
*
* circle stroked, filled and dotted:
cs 1 -0.40 0.5
cf 2 -0.30 0.5
cd 3 -0.20 0.5
*
* triangle stroked, filled and dotted:
ts 4 -0.40 0.5
tf 5 -0.30 0.5
td 6 -0.20 0.5
*
* box stroked, filled and dotted base to base:
bs 7 -0.40 8 0
bf 8 -0.30 9 0
bd 9 -0.20 10 0
*
* line stroked, filled and dotted base to base:
ls 10 -0.40 11 0
lf 11 -0.30 12 0
ld 12 -0.20 13 0
*
* box stroked, filled and dotted, around bases:
bs 13.5 -0.40 14.5 0
bf 14.5 -0.30 15.5 0
bd 15.5 -0.20 16.5 0
*
* line stroked, filled and dotted, around bases:
ls 16.5 -0.40 17.5 0
lf 17.5 -0.30 18.5 0
ld 18.5 -0.20 19.5 0
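The marks example follows a regular pattern that a short Python sketch can
make explicit. It is inferred only from the comments in the example: the first
letter of the code picks the shape, the second picks the style, and the
numbers follow. The sketch leaves the numbers uninterpreted, covers only the
codes shown above, and uses the hypothetical name parse_marks. Other mark
types, such as those in marks.arrow or marks.userdefined, are not handled.

```python
SHAPES = {"s": "square", "c": "circle", "t": "triangle", "b": "box", "l": "line"}
STYLES = {"s": "stroked", "f": "filled", "d": "dotted"}


def parse_marks(text):
    """Return one dict per mark line; '*' comment lines are skipped."""
    marks = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("*"):
            continue
        code, *fields = line.split()
        numbers = [float(field) for field in fields]
        marks.append({
            "shape": SHAPES[code[0]],   # KeyError for codes not shown above
            "style": STYLES[code[1]],
            "numbers": numbers,
        })
    return marks
```

In the example, squares, circles and triangles carry three numbers, while
boxes and lines carry four because they run from one point to another.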
A test symvec is provided with the program, file 'symvec.demo', to be run
with 'colors.demo' and 'makelogop.demo'.
documentation
Description of Logos:
@article{Schneider.Stephens1990,
author = "T. D. Schneider
and R. M. Stephens",
title = "Sequence Logos: A New Way to Display Consensus Sequences",
journal = "Nucleic Acids Res.",
volume = "18",
pages = "6097--6100",
pmid = "2172928",
pmcid = "PMC332411",
note = "\htmladdnormallink
{https://alum.mit.edu/www/toms/papers/logopaper/}
{https://alum.mit.edu/www/toms/papers/logopaper/}",
year = "1990"}
Use of wave:
@article{Papp.Schneider1993,
author = "P. P. Papp
and D. K. Chattoraj
and T. D. Schneider",
title = "{Information analysis of sequences that bind the replication
initiator RepA}",
journal = "J. Mol. Biol.",
volume = "233",
pages = "219--230",
pmid = "8377199",
comment = "Cover of 233, number 2!",
year = "1993"}
Dirty DNA synthesis experiments:
@article{Schneider.Stormo1989,
author = "T. D. Schneider
and G. D. Stormo",
title = "{Excess information at bacteriophage T7 genomic promoters
detected by a random cloning technique}",
journal = "Nucleic Acids Res.",
volume = "17",
pages = "659--674",
pmid = "2915926",
pmcid = "PMC331610",
year = "1989"}
The Blue Book:
@book{PostScriptTutorial1985,
author = "{Adobe Systems Incorporated}",
title = "PostScript Language Tutorial and Cookbook",
publisher = "Addison-Wesley Publishing Company",
address = "Reading, Massachusetts",
callnumber = "QA76.73.P67P68",
isbn = "0-201-10179-3",
year = "1985"}
The Red Book:
@book{PostScriptManual1985,
author = "{Adobe Systems Incorporated}",
title = "PostScript Language Reference Manual",
publisher = "Addison-Wesley Publishing Company",
address = "Reading, Massachusetts",
callnumber = "QA76.73.P67P67",
isbn = "0-201-10174-2",
year = "1985"}
bugs
Some chi-logo (upside down characters) do not display on OpenWindows, but
do print ok on the Apple LaserWriter IIntx. The reason is completely
obscure.
A bug in NeWS 1.1 is that characters that are scaled too small are forced
to be big. This messes up the logo and can be confusing. Another bug in
NeWS 1.1 prevents one from using the outline, but the dashed boxes will
show up. Sometimes displaying a logo in NeWS 1.1 on a Sun 4 will cause an
'illegal instruction', after which one is thrown completely off the
computer. The source of this is not known, since it is not repeatable.
The first two bugs are resolved under OpenWindows 2; the third has not
been observed. These NeWS bugs do not apply to the Apple LaserWriter
IIntx, which prints everything correctly.
* MISSING LOGO LETTER PROBLEM
The OpenWindows PostScript on a Sun workstation will mess up displaying a
stack of letters if the vertical movement is too small. The result is
that the letters above that point are missing. This occurs if there is a
highly conserved base and very few other bases. The result is a huge gap
where the highly conserved base should be. Other printers do fine, so
this is a problem with the Sun implementation of PostScript (will they
ever get it right???). If you don't have this window system, set the
constant gooddisplay to true. If you do want the logos to show up
properly on the screen, use false. Unfortunately, this will mean that the
vertical translation for the small letters won't be done, so the display
will be very slightly wrong.
* The freeware program Ghostview will sometimes refuse to print some
bases, but they come out just fine on many printers.
*******************************************************************
* Eric Miller (esm@unity.ncsu.edu, http://www.mbio.ncsu.edu/esm) pointed
out (2000 Dec 15):
> Aesthetically, the error bars at the bottom of the logo (little to no
> information regions) obscure the base coordinate line.
Yes that's bothered me at times also.
> For a given logo, the error bars are / appear the same length, probably
> as a function of the number of sequences present in the alignment, since
> each position is represented in each sequence.
That's correct.
> It would be preferable to have the logo error bar in a single location
> (since they are the same),
No they aren't all the same. The delila system handles blanks, where no
sequence is known or reported. So error bars tend to be bigger away from
the center of the logo where there are fewer sequences. Some examples are
in the Gallery, especially the 8 E. coli sequence logos.
> maybe off any letter of the sequence (above a specified coordinate
> position, at a specified bit height), or just on the high part of the
> logo. I need to check the makelogop to see if the error bars can be
> removed or modified.
One can remove the bars, though of course one goes blind at that point.
Moving them is an interesting idea. Of course the problem arises in the cases
where there is low information content, so that wouldn't work. If one had a
lower bound, then explaining it to people would be complex - one's eye
would see it more than the background! Also, one could not judge the
background against the bars. One solution might be to block the bar below
zero, but then I'm worried that partial bars may be misinterpreted. So
you raise a good issue but I don't know a good solution. Fortunately it
is for the most part aesthetic as you say - one can figure out the
numbering.
*******************************************************************
technical notes
* HERE'S HOW ITALIC STRINGS WORK. User defined strings have to be
rendered into PostScript. To indicate that a region of the string is to
be done in italics, one must gain access to the PostScript machinery. For
example:
38\) \( E. coli \) IT \(LexA binding sites (extra parenthesis)
The first "\)" after the "38" switches to the PostScript interpreter. The
backslash "\" is used as an "escape" character, telling makelogo that the
following character is to be interpreted as PostScript. (Otherwise
makelogo would protect the character and you would just get a
parenthesis.) Likewise, the string
\( E. coli \)
is interpreted as a PostScript string. At that point there will be
two strings on the stack, the (38) and the "( E. coli )". There is
a special function defined in makelogo called IT. IT takes these
two strings and shows the first in Helvetica-Bold and the second in
Helvetica-BoldItalic. After that we must return to normal typing,
and this is done with "\(" just before "LexA". The general form
for using PostScript commands is therefore
\) postscriptstuff \(
That is, the parentheses always match backwards. The code (procedure
postscriptstring) is curious and interesting because it starts with a
string like this:
38\) \( E. coli \) IT \(LexA binding sites (extra parenthesis)
and converts it to the following valid PostScript:
(38) ( E. coli ) IT (LexA binding sites \(extra parenthesis\))
The escape character typed by the user is removed from parentheses, while
unprotected parentheses get escape characters!
Why not let the user type raw postscript? Because they would have to
remember to type a \ in front of various characters, and this would often
lead to programs that would bomb.
Note that one can define ANY function one would like by this means!
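The conversion described above can be sketched in a few lines of Python. This
is an illustration of the rule as stated here, not a transcription of the
Pascal procedure postscriptstring. The function name postscript_string is
hypothetical, and passing any other backslash through unchanged is an
assumption.

```python
def postscript_string(user_string):
    """Convert a makelogop user string to PostScript source text."""
    out = ["("]  # open the first PostScript string
    i = 0
    while i < len(user_string):
        ch = user_string[i]
        nxt = user_string[i + 1] if i + 1 < len(user_string) else ""
        if ch == "\\" and nxt in ("(", ")"):
            out.append(nxt)  # user-escaped parenthesis: emit it raw
            i += 2
        elif ch in "()":
            out.append("\\" + ch)  # unprotected parenthesis: protect it
            i += 1
        else:
            out.append(ch)  # everything else is copied unchanged
            i += 1
    out.append(")")  # close the last PostScript string
    return "".join(out)
```

Applied to the example string with the italic E. coli, this sketch yields the
valid PostScript line shown above.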
* Unfortunately PostScript fonts are not exactly the same height. Thus if
A and T are the standard, then C and G hang above and below the line.
This has been solved in this version of makelogo. As a consequence, the
user never needs to determine any character sizes empirically, and the
logos should work on any PostScript printer.
Special thanks go to the following people for their help in solving this
problem:
Kevin Andresen [kevina@apple.com]
"The problem facing you is that, while the PostScript language is more or
less standard, the font shapes depend on the designer, type vendor, or
language implementation. The fonts used in NeWS are not exactly the same
as those from Adobe, which are not the same as those from Bitstream, which
are not the same as the original lead type, etc. (This is an
industry-wide issue.) One way to compensate for this in PostScript is to
use the charpath and pathbbox operators and scale appropriately."
He provided a program, which I then rewrote and generalized. That version
almost worked, but not quite. This was solved by:
finlay@Eng.Sun.COM (John Finlay) who said:
"It would appear that the calculation of the pathbbox for characters
varies with the scale of the characters (I don't know why exactly but
would speculate that there's probably some weirdness with the font hints
and scaling). I modified your postscript to iterate once on the size and
recalculate the pathbbox at the scaled size. Seems to printout OK (inside
the boxes) on a LWI, LWII and in NeWS2.0 (though NeWS still seems to get
the wide slightly wrong)."
shiva@well.sf.ca.us (Kenneth Porter) was also involved and actively
interested. My apologies if I have forgotten someone else who
contributed.
The letter I and the vertical bar (|) are treated specially since
in the Helvetica-Bold font they are rectangles and would completely
fill the character space. In addition, the letter I is centered by
makelogo.
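The idea of measuring the character, scaling it, and then measuring again at
the new size can be sketched in Python. In the sketch, measure_height stands
in for a PostScript charpath and pathbbox measurement at a given font scale.
It is a hypothetical callable supplied by the caller, and fit_height is an
illustrative name. The real work is done in PostScript by the code that
makelogo emits.

```python
def fit_height(measure_height, target_height, passes=2):
    """Find a font scale whose measured glyph height is close to the target.

    measure_height(scale) returns the glyph height measured at that scale.
    Each pass corrects the scale by the ratio of wanted to measured height,
    so the second pass repairs any error left when height is not exactly
    proportional to scale.
    """
    scale = 1.0
    for _ in range(passes):
        scale *= target_height / measure_height(scale)
    return scale
```

If the measured height were exactly proportional to the scale, a single pass
would be enough. The second pass is what handles the scale-dependent
bounding boxes mentioned in the quotation.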
* Thanks go to Joe Mack for suggesting numbering and titles (strings) and
to Pete Lemkin and Wojciech Kasprzak for pointing out that the shrink
option would be helpful. Thanks to Jeff Haemer for pointing out that the
PostScript program should begin with '%!', and for suggesting that the
string fonts should be different from the logos themselves.
* As of version 8.12, makelogo produces encapsulated PostScript. This
allows the logo to be more safely embedded in other figures. The
BoundingBox, which defines the region a figure occupies on a page, is
computed from the basic size of the logo. The width is computed from the
charwidth and stacksperline. The height is computed from barheight,
linesperpage and linemove. The linemove parameter is used only if
linesperpage is more than 1. The edge parameters are then added around
all edges. This allows the numbering and labels to be inside the
BoundingBox. The figure can be rotated by -90 or +90 degrees. Other
rotations result in a BoundingBox that is page-sized. Note that rotations
can place much of the logo outside of the page. The bounding box will not
show parts outside of the page, so this can be confusing. To see roughly
where the logo will appear on the page, use -89 or +89 angles.
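A rough Python sketch of the size calculation for an unrotated logo follows.
The text above names the inputs but not the exact formula, so the height rule
used here (each line after the first adds linemove times barheight) is an
assumption, as are the function names. The placement of the box relative to
xcorner and ycorner is not modeled.

```python
POINTS_PER_CM = 72.0 / 2.54  # PostScript points per centimeter


def logo_size_cm(charwidth, stacksperline, barheight, linesperpage, linemove,
                 edgeleft, edgeright, edgelow, edgehigh):
    """Return (width, height) of the logo plus its edges, in cm."""
    width = charwidth * stacksperline
    # Assumption: each extra line adds linemove * barheight of height,
    # so linemove has no effect when linesperpage is 1.
    height = barheight * (1.0 + linemove * (linesperpage - 1))
    return (edgeleft + width + edgeright, edgelow + height + edgehigh)


def cm_to_points(length_cm):
    """Convert a length in cm to PostScript points."""
    return length_cm * POINTS_PER_CM
```

With the example makelogop values (charwidth 1.0, stacksperline 100,
barheight 10, linesperpage 1, edges 2 1 2 1) this gives a region of 103 cm by
13 cm.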
* Constant centertrigger determines the value of the base position of a
string at which the string will be centered.
* 2006 Oct 25. Very small values of Rs(l) = rsl < 0.00005 cause
ghostview to crash. Changing rsl would alter the sum, so that is
not a solution. The solution is to restrict the minimum stack size
drawn to the constant minimumStackSize.
* 2018 Jul 25. Very small character sizes cause a 'squash-spray'
effect. The effect is a thin colored line extending to the right
of the sequence logo (a spray), usually at zero bits corresponding
to a letter whose height is very small (squash). That is, the
letter is squashed and sprayed to the right. This did not happen
with MacOS some time ago but more recently it has occurred with both
Skim and Adobe viewers. Experimenting with calls to numchar
suggested that it occurs when the charheight is less than the
constant squashspray in points. So the code now simply does not
show the character if the height is below that value. It appears
that squashspray is best to be set to 1 point. When the effect
occurs, the squashed near-zero height letters are displayed anyway
as "junk" along the bottom of the logo since they have been sorted
to be small. When they are removed, the larger letters do not move
vertically, but the "junk" disappears. This was confirmed using
the flicker technique: https://alum.mit.edu/www/toms/flicker.html
The example used to solve this bug is shown by this flicker:
https://alum.mit.edu/www/toms/images/squashspray.gif
* 2018 Jul 26. squashspray becomes a hidden variable next to
charwidth. The default (squashspraydefault) is 1 point but this
doesn't always work for unknown reasons.
* 2018 Aug 08. When a character is smaller than squashspray it
will be drawn as a filled rectangle instead of being invisible.
Users won't notice this much since the rectangle will generally be
small.
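The squashspray rule as of 2018 Aug 08 amounts to a single comparison. The
Python sketch below only illustrates that decision. The names are
hypothetical, and the actual test is carried out by makelogo and the
PostScript it writes.

```python
SQUASHSPRAY_DEFAULT = 1.0  # points; the default value named in the notes above


def character_rendering(charheight_points, squashspray=SQUASHSPRAY_DEFAULT):
    """Return 'letter' or 'rectangle' for a character of the given height."""
    if charheight_points < squashspray:
        return "rectangle"  # too small to draw safely as a glyph
    return "letter"
```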
*)
(* end module describe.makelogo *)
{This manual page was created by makman 1.45}
{created by htmlink 1.62}