version = 1.63 of README 2010 Jul 21
# 2010 Jul 21, 1.63: upgrade example
# 2009 Jun 24, 1.62: ftp archive merged into website
# 2002 Oct 13, 1.61: upgrade due!
# 1996 Jul 25, 1.60: previous version

INFORMATION ABOUT DELILA SYSTEM PROGRAMS

  Dr. Thomas D. Schneider
  toms@alum.mit.edu
  https://alum.mit.edu/www/toms/
_____________________________________________________________________________

Context:  what's this all about?

The delila directory contains all of Tom Schneider's Delila system
programs and also a bunch of other stuff.  To study binding sites
using information theory we need to gather together the sequences of
the sites.  The first step is to make a database.  I did this in 1978
(!) and then realized that I needed a way to extract subsets of the
database for analysis.  This was before I started using information
theory (about 1980).  At that time I wanted to do statistical analyses
of ribosome binding sites.  I had put in a bunch of ribosome binding
sites into the computer for study, and realized that to edit the
sequences by hand would be likely to introduce lots of errors, and
would also be a pain.  So I realized that there could be a program
that allows one to extract exactly the portions of the database
desired for a particular analysis.  I devised Delila to do that.
Delila uses a language (DEoxyribonucleic acid LIbrary LAnguage) which
a human writes.  Then Delila runs somewhat like a compiler and
extracts the requested fragments.  That's all that Delila does.  Other
programs then analyze the data, eventually producing a sequence logo,
for example.  Sequence logos are a graphical method that replaces
consensus sequences.  Consensus sequences are a (very poor) way to
show the nucleic acid patterns to which proteins or protein/RNA
complexes bind.  The database work got taken over by GenBank, but
Delila is still useful.  I ended up being a GenBank advisor for a
while because of my database work, but GenBank still is not up to the
standards implied by the Delila system.  Some day GenBank will have
all objects in the feature table named, for example.  (See the philgen
paper in the archive for more details about what an advanced database
would look like.)

A bibliography is in the file Tom.Schneider.bib.  The papers
Schneider1982 and Schneider1984 describe Delila.  Papers on
information theory are SchneiderPrimer, Schneider1986, Schneider1989,
Schneider.ccmm, Schneider.edmm among others.  Check out
Schneider.Stephens.Logo before you look at anything else.  We send a
package of papers on request, see the file cover.ps.  The Logo paper
is in the package but not the Delila papers.
______________________________________________________________________

SOURCES OF INFORMATION:

This README file is kept in an anonymous ftp archive in the pub/delila
directory.  There are several information files in the pub/delila directory:
(.Z means the file is Unix compressed.)

   README - This file, which contains an overview of what's in the archive.  There
     are also lists of files that begin with the word README.
   bionet.info-theory.faq.Z - frequently asked questions about the news group
   delman - The DELila MANual.
   libdef - Definition of the delila library system, including BNF's.
   moddef - Definition of a portable mechanism for moving plug-in
      text modules around from one program to the next.

Also, the source code of every program has descriptive information
near the top.  This is duplicated in delman, but since I do not update
delman very often, some of the descriptions there lag behind the
source code.  Always look in the source code for the definitive
information on how to run a program.

_____________________________________________________________________________

OBTAINING FILES

Delila files are available on Internet by anonymous ftp from
ftp.ncifcrf.gov in the directory pub/delila.  In the ftp directory,
most files are compressed by the Unix compress command.  (So they end
with a ".Z")  Don't forget to use the binary transfer mode when you
get them.

The uncompress program can be obtained by anonymous ftp from
"ftp.uu.net" in "compress.tar".  There is also a "help" file there.
For VAX VMS users, it may also be obtained from genbank.bio.net in
directory pub/vms as the file "lzdcmp.exe" (Contact Dave
Kristofferson, biosci-help@net.bio.net for more information.)

The files are also available to people on BITNET from Dan Davison
(davison@uh.edu) on "gene-server%bchs.uh.edu@CUNYVM" (many thanks to
Dan for this service).  A reference for this is:

@article{Davison1990,
author = "D. B. Davison
 and J. E. Chappelear",
title = "The {Genbank}-Server at the {University of Houston}",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "1571-1572",
year = "1990"}

I will announce significant upgrades on the newsgroups bionet.general,
bionet.info-theory and sci.bio.

If you do obtain any programs, please email me about your
experiences, so I can improve the archive for you.

______________________________________________________________________

ABOUT TRANSLATION TO C

Many Delila programs can now be translated into C!  Successfully translated
programs will be placed in the archive only on request.  You should first see
if the program is there, then make sure that it has the same version number as
the Pascal version.  If they differ, send me email.

The p2c translator and library are available from:
    David Gillespie (daveg@csvax.cs.caltech.edu)
You can obtain it by
    anonymous ftp to csvax.caltech.edu
in the pub directory.

To compile the programs you need:

    include file: p2c.h
    pascal library: p2clib.c

In your home directory (assuming Unix operating system) you will need to have
a control file for p2c (it's called .p2crc) with the following lines:

LiteralFiles 2
NestedComments 2
StructFiles 1
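
One way to create this control file from the shell is sketched below.  The
function name write_p2crc is made up for this illustration and is not part of
p2c or the Delila archive; the three settings are exactly the ones listed
above.  The function refuses to overwrite a file that already exists.

```shell
# write_p2crc is a hypothetical helper, not part of p2c or Delila.
# It writes the three settings listed above to the given file
# (default: $HOME/.p2crc) unless that file already exists.
write_p2crc() {
    rc="${1:-$HOME/.p2crc}"
    if [ -e "$rc" ]; then
        echo "$rc already exists; not overwriting" >&2
        return 1
    fi
    cat > "$rc" <<'EOF'
LiteralFiles 2
NestedComments 2
StructFiles 1
EOF
}
```

Calling write_p2crc with no argument writes to .p2crc in your home directory.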

Translate and compile.  You may get a warning about "SYSTEM" which you can
ignore.  I have translated makelogo and some other programs.  Date and time
functions may be a problem since they are system dependent.  The functions in
cdatemod.p allow correct conversions.  Please report problems to me or David
Gillespie.  Good luck!

When you need empty files, use
  : > thefilename
('echo > thefilename' would write a newline into the file.)

p2c is not available on BITNET so far as I know, but perhaps Dan Davison would
be willing to make it available if someone wants it.

The file tc is my Unix script for translation and compiling.
Try the program test.c first to see that you can get it to work.
_____________________________________________________________________________

WHAT IS THE DELILA SYSTEM?

The Delila System is a large group of about 150 programs designed for the study
of binding sites of proteins (and other things) on DNA or RNA.  One of the most
fun tools in the collection is called makelogo.  It creates the 'sequence
logos' described in the paper by Schneider and Stephens 1990 (references are
listed at the end of this file).  Not all Delila System programs are in the ftp
directory (or on whatever media you used to obtain this file).  If you would
like me to make more available, just send me a note.  Some of them are listed in
the two papers on the Delila system, Schneider1982 and Schneider1984.  The
Delila manual, called delman, describes how to use the delila programs, and
lists most of the tools available.  Delman tends to get out of date though, so
there are usually more tools than are listed, and some of the tools may be more
advanced than is described there.  UP TO DATE DOCUMENTATION FOR EACH PROGRAM IS
ALWAYS INSIDE THE SOURCE CODE OF EACH PROGRAM.  (Program names end with ".p".)

All programs are in Pascal.  The only non-standard routines are the date and
time calls.  The programs have worked on a number of UNIX systems, and I have
been careful to avoid compiler and operating system dependencies, so they
should work on any machine that has a good Pascal compiler.  If you have
trouble compiling, contact me.  Difficulties can sometimes be fixed on this
end, so that when you get an update the problem will no longer exist.  (If you
don't tell me about bugs and their fixes, then the problem will bother you
again the next time!)  Many of the programs can also be translated into C (see
ABOUT TRANSLATION TO C above).

The paper on Logos is in:
  logo.tex logo.bbl
the figures are in:
  globin.logo.Z  lambcro.logo.Z  ribo.logo.Z  t7.logo.Z
These are all in the graphics language PostScript.

The minimal set of files for doing Sequence Logos is listed in the file
README.logo.  These files allow you to demonstrate logos and create protein
logos.

* Uncompress all the files if they were compressed.
* Compile the programs, which all end with a '.p'

DEMONSTRATING THE LOGO PROGRAMS

'ALPHABET' DEMONSTRATION
The first demonstration requires only an empty data file and two parameter
files.  Later demonstrations require more and more inputs.
Create empty files named 'symvec' and 'marks'.  In Unix:
    echo -n "" > symvec
    echo -n "" > marks
Then copy (cp) the following 'demo' files into the common working names:
    cp makelogop.alphabet makelogop
    cp colors.alphabet colors
The makelogop file contains parameters to control the makelogo program.  The
colors file defines the color of each symbol.  Now you can run
the makelogo program.  Print the resulting logo file on your PostScript
printer.

'DEMO' DEMONSTRATION
This demonstrates reading from the symvec file.  Copy (cp) the following 'demo'
files into the common working names:
    cp symvec.demo symvec
    cp makelogop.demo makelogop
    cp colors.demo colors
Run makelogo and print the resulting logo file on your PostScript printer.

'PROTEIN' DEMONSTRATION
This demonstrates creating the symvec file from aligned protein sequences (the
protseq file).  The alpro and dalvec programs both produce a 'symbol vector'
(called symvec) that is input to the makelogo program.  This demonstrates the
use of alpro for proteins:
    cp protseq.globin protseq
    cp makelogop.protein makelogop
    cp colors.protein colors
Run alpro (makes the symvec), makelogo (makes the logo) and print the logo.

'DNA' DEMONSTRATION
This demonstrates creating the symvec file from aligned DNA sequences
(the rsdata file).
    cp rsdata.dna rsdata
    cp makelogop.dna makelogop
    cp colors.dna colors
Be sure you have the following files:
    dalvecp
    marks
    wave
Run dalvec (makes the symvec; it should be identical to symvec.dna), makelogo
(makes the logo) and print the logo.
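
The demonstrations above share one pattern: a file kept as 'name.suffix' is
copied to the working name 'name'.  The sketch below wraps that pattern in a
shell function.  The name use_demo is made up for this illustration and is not
a script in the archive.

```shell
# use_demo is a hypothetical helper, not part of the Delila archive.
# Usage: use_demo SUFFIX FILE...
# Copies FILE.SUFFIX to FILE for each FILE, e.g. makelogop.dna -> makelogop.
# Stops with an error if one of the demo files is missing.
use_demo() {
    suffix="$1"
    shift
    for name in "$@"; do
        if [ ! -f "$name.$suffix" ]; then
            echo "missing demo file: $name.$suffix" >&2
            return 1
        fi
        cp "$name.$suffix" "$name" || return 1
    done
}
```

With this, the DNA demonstration setup becomes
    use_demo dna rsdata makelogop colors
and the protein one needs two calls because protseq uses a different suffix:
    use_demo protein makelogop colors
    use_demo globin protseq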

YOUR OWN DEMONSTRATION
You can play with any of the examples and modify the logos.  Many Delila
programs have a control file, called the parameter file.  They are identified
by a suffix 'p'.  So the control file for the makelogo program is makelogop.
The parameters are spread over several lines in the file.  For example, the
location of the vertical bar in a logo is the second line of makelogop:

1          sequence coordinate before which to put a bar on the logo

Makelogo reads the 1 and then ignores the rest of the line, which is a comment
to remind the user what the parameter does.  Leave the comments in to help you
as you edit the parameter file.  When a program is changed, the parameter file
will often change, so you should switch to the new format.
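
Because the value comes first and the comment follows on the same line, a
parameter can be pulled out of the file with standard Unix tools.  The sketch
below is only an illustration; param_value is a made-up name, and it prints
just the first value on a line, so it is not suited to lines that hold several
values.

```shell
# param_value is a hypothetical helper, not part of Delila.
# Usage: param_value FILE LINE
# Prints the first whitespace-separated field on the given line of a
# parameter file, that is, the value without its trailing comment.
param_value() {
    awk -v n="$2" 'NR == n { print $1; exit }' "$1"
}
```

For the line shown above, 'param_value makelogop 2' would print 1.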

TO CREATE AN RSDATA FILE YOU WILL NEED MANY OTHER PROGRAMS.
This is because rsdata files are created using the Delila system of DNA analysis
programs (Schneider1982, Schneider1984).  Additional files and programs needed
to do DNA sequence logos are listed in the file README.delila.  Other programs
will be placed in the archive on request.

CREATING LOGOS FROM GENBANK DATA: introduction to the Delila System.
Following is a description of how, starting from a set of Genbank entries, you
can create logos.  More information is in the delila manual, delman.  The
program dbbk will convert a set of GenBank (or perhaps still EMBL) files (in a
file named 'db') into the Delila format (in a file named 'l1').  (Delila was
written before GenBank existed, and the programs have not been converted yet.)

Once you have made your 'l1' file, create empty files for 'l2', 'l3' and
'catalp'.  The catal program will create a 'catalogue' of what is in the 'l1'
file in file 'cat1', and make a corresponding library in file 'lib1'.  (You can
make several catalogues at once.)

Run catal to create 6 files: lib1, lib2, lib3, cat1, cat2 and cat3.  This set
of 6 files is your specialized database.  The Delila program will extract
fragments from the database for analysis by other programs.  One must tell
Delila how to do this in a set of instructions written in the language Delila.
Since Delila reaches into a database (a 'library') to extract sequences, it is
called the 'librarian', and the set of sequences it produces, which are all
contained in a single file, is called a 'book'.

There are several example Delila instructions, ex1in, ex2in,...
The 'ex7in' file is:

title "ex7: aligned book";
organism ecoli; chromosome ecoli;
piece lac;
get from 29 -5 to 29 +10; (* laci rbs *)
get from 1234 -5 to 1234 +10; (* lacz rbs *)

Each instruction ends with a semicolon.  The first instruction defines the
title of the book to be created.  This is required by Delila.  The next
instructions define the organism and chromosome that one is interested in.
Unfortunately GenBank cannot be accessed this way yet, so the names created by
dbbk from the database may be a bit odd.  You can find out the names from the
'humcat' file produced by the catal program.  The third line of the
instructions above defines the piece of DNA that you want to get something
from.  (This is roughly equivalent to a GenBank entry.)

The next two lines actually tell Delila the regions to extract.  They say "Move
to position 29, go 5 bases before there (ie, 24).  That is to be the first
base.  Then move to 29 + 10, (ie, 39).  That is to be the last base.  Grab the
sequence from the first to the last base".  The second get instruction says a
similar thing.  The result is always 5' to 3'.  You tell Delila to get the
complementary sequence simply by saying:

get from 1234 +10 to 1234 -5 direction -;

Things inside "(* ... *)" are comments ignored by Delila.
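
To check the coordinates of a get instruction by hand, the arithmetic described
above can be done in the shell.  This is only a sketch: get_range is a made-up
name, and it assumes plain addition and subtraction on the piece's numbering,
as in the explanation above.

```shell
# get_range is a hypothetical helper for checking coordinates by hand;
# it is not part of Delila.
# Usage: get_range POSITION BEFORE AFTER
# Prints the first and last positions selected by
#   get from POSITION -BEFORE to POSITION +AFTER;
get_range() {
    echo "$(( $1 - $2 )) $(( $1 + $3 ))"
}

get_range 29 5 10      # prints: 24 39
get_range 1234 5 10    # prints: 1229 1244
```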

You may write as many instructions as you desire, getting as many sequence
fragments as you may want from various organisms.  More details are available
in the Delila Manual ("delman").

Write Delila instructions for the sites you want to analyze in the file
'inst'.  Be sure to go FAR outside the known region of binding to avoid
chopping off part of the site!  DO NOT STICK TO THE KNOWN "BOX" region.  Go at
least 50 bases in front and behind it.  You may be surprised!  This way, you
will get a feeling for what the background looks like.  It is especially
important to do if you don't have many example sites, because seeing a horribly
noisy background will allow you to avoid over-interpreting your data.
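
As an illustration of the advice above, the sketch below prints get
instructions that reach the same distance to either side of each site.  The
name wide_gets is made up, and position 2500 is an invented example; the output
follows the form of the get lines in ex7in and still needs the title, organism,
chromosome and piece instructions in front of it.

```shell
# wide_gets is a hypothetical helper, not part of Delila.
# Usage: wide_gets FLANK POSITION...
# Prints one get instruction per position, reaching FLANK bases to
# either side, in the same form as the get lines of the ex7in example.
wide_gets() {
    flank="$1"
    shift
    for pos in "$@"; do
        printf 'get from %s -%s to %s +%s;\n' "$pos" "$flank" "$pos" "$flank"
    done
}

wide_gets 50 1234 2500     # 2500 is a made-up position
# prints:
# get from 1234 -50 to 1234 +50;
# get from 2500 -50 to 2500 +50;
```

Make sure the piece actually extends that far on both sides of each position.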

To make writing instructions easier, the catal program now generates all the
specification instructions (ie everything except the get commands) in the file
humin.  You can make a copy of this file and modify it.  This saves typing
names, and avoids errors.

Once you have your library ('lib1', 'lib2', 'lib3') your catalogue ('cat1',
'cat2', 'cat3') and instructions ('inst'), you can run Delila.  Delila will
create a listing of what happened ('listing') and your 'book'.  The programs
lister and count can be used to check the book.  I almost never look at the
book directly.  Rather, I use programs to look at the data.  This allows the
form of a book to be made optimal for computer programs.
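
Before running Delila it can help to confirm that all seven input files named
above are present.  The check below is a sketch; check_delila_inputs is a
made-up name, and it only tests that the files exist in the current directory,
not that their contents are valid.

```shell
# check_delila_inputs is a hypothetical pre-flight check, not part of Delila.
# It reports any of the seven input files named above that are missing
# from the current directory and fails if at least one is absent.
check_delila_inputs() {
    missing=0
    for f in lib1 lib2 lib3 cat1 cat2 cat3 inst; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It can be used as
    check_delila_inputs && delila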

The combination of this 'book' and the 'inst' file forms an 'aligned book'.
Create an empty file called 'namebook' and run the alist program.  The 'list'
file produced should contain the fragments you want, all nicely aligned with
the numbering on top written vertically.  There is a PostScript color version
in clist.  YOU CAN PROCEED ONLY AFTER YOU HAVE GOTTEN THIS TO WORK.

Create a parameter file for the encode program in file 'encodep':
    -5 20
    1
    1
    1
    1
The -5 is the position to analyze FROM, 20 is the TO.  Run the encode program
to encode the sequences in file 'encseq'.  Create an empty 'cmp' file and run
rseq.  The output is the 'rsdata' file for use with dalvec and then makelogo.
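
The encodep file can be typed into an editor, or written from the shell as
sketched below.  The name write_encodep is made up for this illustration.  Only
the FROM and TO values on the first line are varied; the four lines containing
1 are copied unchanged from the example above, and the sketch assumes that the
leading spaces shown there are not significant.

```shell
# write_encodep is a hypothetical helper, not part of Delila.
# Usage: write_encodep FROM TO
# Writes an 'encodep' file in the current directory with the given range
# on the first line and the four lines of 1 copied from the example above.
write_encodep() {
    printf '%s %s\n1\n1\n1\n1\n' "$1" "$2" > encodep
}
```

The example above corresponds to
    write_encodep -5 20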

Please tell me if you have difficulties with any of this.  All comments are
welcome, and will help the next person.
_____________________________________________________________________________

Summary: TO CREATE SEQUENCE LOGOS from GenBank files, you will need the
following programs:

   dbbk
   catal
   dbinst (This program will automatically make delila instructions
    from feature tables.)
   delila
   alist
   encode
   rseq
   dalvec
   makelogo

in that order.

Remember that if you modify one of the earlier data files, you must run ALL of
the later programs to have the data flow into your logo.  For example, if you
modify one of your instructions, and you run all the programs but forget to run
rseq, your logo will still look the same and you will be very puzzled!
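
One way to avoid forgetting a step is to let the shell list everything that
must be rerun.  The sketch below is an illustration only: stages_from is a
made-up name, and dbinst is left out of its list because the walk-through above
writes the instructions by hand.

```shell
# stages_from is a hypothetical helper, not part of Delila.
# Usage: stages_from PROGRAM
# Prints PROGRAM and every later program in the order listed above.
# Fails if PROGRAM is not in the list.
stages_from() {
    found=0
    for prog in dbbk catal delila alist encode rseq dalvec makelogo; do
        [ "$prog" = "$1" ] && found=1
        [ "$found" -eq 1 ] && echo "$prog"
    done
    [ "$found" -eq 1 ]
}

stages_from encode
# prints encode, rseq, dalvec and makelogo, one per line
```

After editing 'inst', for example, the later programs could be rerun with
    for p in $(stages_from delila); do "$p" || break; done
assuming each program is on your path and reports failure through its exit
status.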

NOTE that delila, encode and rseq cannot handle gaps or protein sequences.
To do that, use the alpro program followed by makelogo.
_____________________________________________________________________________

Scripts

The best method is to establish a Delila directory where you keep copies of all
the original things.  Then you can use scripts that are in the archive:

   put - put a copy of a delila file to the delila directory
         (only for the sys admin to do!)
   get - copy and change permissions of a delila file
   ck - check source code of a program or delila file
   lk - create links easily
   lkdelila - create links easily to delila directory

Do your work on each particular binding site in a different directory.  You can
use lkdelila to make links or "get" to get a copy of the file.

For example I might make a directory:

 mkdir xer
 cd xer
 # create file with xer related genbank entries in db
 dbbk
 touch l2 l3 catalp     # touch creates empty files
 catal                  # the delila library is now ready
 cp catin inst
 vi inst                # modify the prebuilt catin
 delila
 alist
 gv clist               # use ghostview to look at the color listing
 vi encodep             # make the encodep
 encode
 touch cmp
 rseq
 touch dalvecp
 dalvec
 touch wave marks
 lkdelila colors        # there will be no changes to colors, so a link is ok
 get makelogop          # I intend to modify makelogop, so I use get
 makelogo               # first shot through
 gv logo
 vi makelogop           # fix it this time
 makelogo
 ...

_____________________________________________________________________________