version = 1.63 of README 2010 Jul 21
# 2010 Jul 21, 1.63: upgrade example
# 2009 Jun 24, 1.62: ftp archive merged into website
# 2002 Oct 13, 1.61: upgrade due!
# 1996 Jul 25, 1.60: previous version

INFORMATION ABOUT DELILA SYSTEM PROGRAMS

Dr. Thomas D. Schneider
toms@alum.mit.edu
https://alum.mit.edu/www/toms/

_____________________________________________________________________________

Context: what's this all about?

The delila directory contains all of Tom Schneider's Delila system
programs and also a bunch of other stuff.

To study binding sites using information theory we need to gather
together the sequences of the sites. The first step is to make a
database. I did this in 1978 (!) and then realized that I needed a way
to extract subsets of the database for analysis. This was before I
started using information theory (about 1980). At that time I wanted to
do statistical analyses of ribosome binding sites. I had put in a bunch
of ribosome binding sites into the computer for study, and realized that
to edit the sequences by hand would be likely to introduce lots of
errors, and would also be a pain. So I realized that there could be a
program that allows one to extract exactly the portions of the database
desired for a particular analysis. I devised Delila to do that.

Delila uses a language (DEoxyribonucleic acid LIbrary LAnguage) which a
human writes. Then Delila runs somewhat like a compiler and extracts
the requested fragments. That's all that Delila does. Other programs
then analyze the data, eventually producing a sequence logo, for
example. Sequence logos are a graphical method that replaces consensus
sequences. Consensus sequences are a (very poor) way to show the
nucleic acid patterns to which proteins or protein/RNA complexes bind.

The database work got taken over by GenBank, but Delila is still useful.
I ended up being a GenBank advisor for a while because of my database
work, but GenBank still is not up to the standards implied by the Delila
system.
Some day GenBank will have all objects in the feature table named, for
example. (See the philgen paper in the archive for more details about
what an advanced database would look like.)

A bibliography is in the file Tom.Schneider.bib. The papers
Schneider1982 and Schneider1984 describe Delila. Papers on information
theory are SchneiderPrimer, Schneider1986, Schneider1989,
Schneider.ccmm, Schneider.edmm among others. Check out
Schneider.Stephens.Logo before you look at anything else. We send a
package of papers on request, see the file cover.ps. The Logo paper is
in the package but not the Delila papers.

______________________________________________________________________

SOURCES OF INFORMATION:

This README file is kept in an anonymous ftp archive in the pub/delila
directory. There are several information files in the pub/delila
directory: (.Z means the file is Unix compressed.)

README - This file, which contains an overview of what's in the archive.
         There are also lists of files that begin with the word README.
bionet.info-theory.faq.Z - frequently asked questions about the news group
delman - The DELila MANual.
libdef - Definition of the delila library system, including BNF's.
moddef - Definition of a portable mechanism for moving plug-in text
         modules around from one program to the next.

Also, the source code of every program has descriptive information near
the top. This is duplicated in delman, but since I do not update delman
very often, some of the descriptions there lag behind the source code.
Always look in the source code for the definitive information on how to
run a program.

_____________________________________________________________________________

OBTAINING FILES

Delila files are available on Internet by anonymous ftp from
ftp.ncifcrf.gov in the directory pub/delila.

In the ftp directory, most files are compressed by the Unix compress
command. (So they end with a ".Z") Don't forget to use the binary
transfer mode when you get them.
The uncompress program can be obtained by anonymous ftp from
"ftp.uu.net" in "compress.tar". There is also a "help" file there.
For VAX VMS users, it may also be obtained from genbank.bio.net in
directory pub/vms as the file "lzdcmp.exe" (Contact Dave Kristofferson,
biosci-help@net.bio.net for more information.)

The files are also available to people on BITNET from Dan Davison
(davison@uh.edu) on "gene-server%bchs.uh.edu@CUNYVM" (many thanks to Dan
for this service). A reference for this is:

@article{Davison1990,
author = "D. B. Davison
 and J. E. Chappelear",
title = "The {Genbank}-Server at the {University of Houston}",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "1571-1572",
year = "1990"}

I will announce significant upgrades on the newsgroups bionet.general,
bionet.info-theory and sci.bio.

If you do obtain any programs, please email back to me your experiences,
so I can improve the archive for you.

______________________________________________________________________

ABOUT TRANSLATION TO C

Many Delila programs can now be translated into C! Successfully
translated programs will be placed in the archive only on request. You
should first see if the program is there, then make sure that it has the
same version number as the pascal version. If they differ send me
email.

The p2c translator and library is available from:

   David Gillespie (daveg@csvax.cs.caltech.edu)

You can obtain it by anonymous ftp to csvax.caltech.edu in the pub
directory. To compile the programs you need:

   include file:   p2c.h
   pascal library: p2clib.c

In your home directory (assuming Unix operating system) you will need to
have a control file for p2c (it's called .p2crc) with the following
lines:

   LiteralFiles 2
   NestedComments 2
   StructFiles 1

Translate and compile. You may get a warning about "SYSTEM" which you
can ignore. I have translated makelogo and some other programs. Date
and time functions may be a problem since they are system dependent.
The functions in cdatemod.p allow correct conversions. Please report
problems to me or David Gillespie. Good luck!

When you need empty files, use

   echo > thefilename

p2c is not available on BITNET so far as I know, but perhaps Dan Davison
would be willing to make it available if someone wants it.

The file tc is my Unix script for translation and compiling. Try the
program test.c first to see that you can get it to work.

_____________________________________________________________________________

WHAT IS THE DELILA SYSTEM?

The Delila System is a large group of about 150 programs designed for
the study of binding sites of proteins (and other things) on DNA or RNA.
One of the most fun tools in the collection is called makelogo. It
creates the 'sequence logos' described in the paper by Stephens and
Schneider 1990 (references are listed at the end of this file).

Not all Delila System programs are in the ftp directory (or on whatever
media you used to obtain this file). If you would like me to make more
available, just send me a note. Some of them are listed in the two
papers on the Delila system, Schneider1982 and Schneider1984.

The Delila manual, called delman, describes how to use the delila
programs, and lists most of the tools available. Delman tends to get
out of date though, so there are usually more tools than are listed, and
some of the tools may be more advanced than is described there.

UP TO DATE DOCUMENTATION FOR EACH PROGRAM IS ALWAYS INSIDE THE SOURCE
CODE OF EACH PROGRAM. (Program names end with ".p".)

All programs are in Pascal. The only non-standard routines are the date
and time calls. These have worked on a number of UNIX systems, but I
have been careful to avoid compiler and operating system dependencies,
so they should work on any machine that has a good Pascal compiler. If
you have trouble compiling, contact me. Difficulties can sometimes be
fixed on this end, so that when you get an update the problem will no
longer exist.
(If you don't tell me bugs and their fixes, then the problem will bother
you again the next time!) The programs are also available in C.

The paper on Logos is in:

   logo.tex
   logo.bbl

the figures are in:

   globin.logo.Z
   lambcro.logo.Z
   ribo.logo.Z
   t7.logo.Z

These are all in the graphics language PostScript.

The minimal set of files for doing Sequence Logos are listed in the file
README.logo. These files allow you to demonstrate logos and create
protein logos.

* Uncompress all the files if they were compressed.
* Compile the programs, which all end with a '.p'

DEMONSTRATING THE LOGO PROGRAMS

'ALPHABET' DEMONSTRATION

The first demonstration requires only an empty data file and two
parameter files. Later demonstrations require more and more inputs.
Create empty files named 'symvec' and 'marks'. In Unix:

   echo -n "" > symvec
   echo -n "" > marks

then copy (cp) the following 'demo' files into the common working names:

   cp makelogop.alphabet makelogop
   cp colors.alphabet colors

The makelogop file contains parameters to control the makelogo program.
The colors file contains the definitions of colors of each symbol.

Now you can run the makelogo program. Print the resulting logo file on
your PostScript printer.

'DEMO' DEMONSTRATION

This demonstrates reading from the symvec file. Copy (cp) the following
'demo' files into the common working names:

   cp symvec.demo symvec
   cp makelogop.demo makelogop
   cp colors.demo colors

run makelogo and print the resulting logo file on your PostScript
printer.

'PROTEIN' DEMONSTRATION

This demonstrates creating the symvec file from aligned protein
sequences, the protseq. The alpro and dalvec programs both produce a
'symbol vector' (called symvec) that is input to the makelogo program.
This demonstrates the use of alpro for proteins:

   cp protseq.globin protseq
   cp makelogop.protein makelogop
   cp colors.protein colors

run alpro (makes the symvec), makelogo (makes the logo) and print the
logo.
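
The copy steps in the demonstrations above all follow one pattern: a
file with a demo suffix is copied to its common working name. The
sketch below wraps that pattern in a small shell function. The name
'stage' is made up for illustration and is not one of the scripts in the
archive; the only assumption is that the demo files sit in the current
directory.

```shell
# stage SUFFIX BASE...
# Copy each BASE.SUFFIX to the working name BASE.
# Stops with an error message if a demo file is missing, so a
# half-staged demonstration is noticed before makelogo is run.
stage() {
    suffix=$1
    shift
    for base in "$@"; do
        if [ ! -f "$base.$suffix" ]; then
            echo "missing demo file: $base.$suffix" >&2
            return 1
        fi
        cp "$base.$suffix" "$base" || return 1
    done
}
```

With this, the protein demonstration would be staged by "stage globin
protseq" followed by "stage protein makelogop colors", after which alpro
and makelogo are run as described.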
'DNA' DEMONSTRATION

This demonstrates creating the symvec file from aligned DNA sequences,
the rsdata.

   cp rsdata.dna rsdata
   cp makelogop.dna makelogop
   cp colors.dna colors

Be sure you have the following files:

   dalvecp
   marks
   wave

run dalvec (makes the symvec, it should be identical to symvec.dna),
makelogo (makes the logo) and print the logo.

YOUR OWN DEMONSTRATION

You can play with any of the examples and modify the logos. Many Delila
programs have a control file, called the parameter file. They are
identified by a suffix 'p'. So the control file for the makelogo
program is makelogop. The parameters are spread over several lines in
the file. For example, the location of the vertical bar in a logo is
the second line of makelogop:

   1 sequence coordinate before which to put a bar on the logo

Makelogo reads the 1 and then ignores the rest of the line, which is a
comment to remind the user what the parameter does. Leave the comments
in to help you as you edit the parameter file. When a program is
changed, the parameter file will often change, so you should switch to
the new format.

TO CREATE AN RSDATA FILE YOU WILL NEED MANY OTHER PROGRAMS. This is
because they are created using the Delila system of DNA analysis
programs (Schneider1982, Schneider1984). Additional files and programs
needed to do DNA sequence logos are listed in the file README.delila.
Other programs will be placed in the archive on request.

CREATING LOGOS FROM GENBANK DATA: introduction to the Delila System.

Following is a description of how, starting from a set of GenBank
entries, you can create logos. More information is in the delila
manual, delman.

The program dbbk will convert a set of GenBank (or perhaps still EMBL)
files (in a file named 'db') into the Delila format (in a file named
'l1'). (Delila was written before GenBank existed, and the programs
have not been converted yet.)

Once you have made your 'l1' file, create empty files for 'l2', 'l3' and
'catalp'.
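
The step just described can be sketched as a small shell function that
first checks that dbbk really produced an 'l1' file and then creates the
three empty inputs. The function name prepare_catal_inputs is made up
for illustration and is not part of the Delila system. Leaving existing
files untouched is a design choice of this sketch, so that a second or
third library already placed in 'l2' or 'l3' is not wiped out.

```shell
# prepare_catal_inputs
# Check that l1 exists and is non-empty, then create zero-byte
# l2, l3 and catalp files unless they are already present.
prepare_catal_inputs() {
    if [ ! -s l1 ]; then
        echo "l1 is missing or empty; run dbbk on the db file first" >&2
        return 1
    fi
    for f in l2 l3 catalp; do
        # ': > file' makes a zero-byte file; existing files are left alone
        [ -e "$f" ] || : > "$f"
    done
}
```

Run it in the working directory after dbbk; if it returns without a
message, the inputs for catal are in place.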
The catal program will create a 'catalogue' of what is in the 'l1' file
in file 'cat1', and make a corresponding library in file 'lib1'. (You
can make several catalogues at once.) Run catal to create 6 files:
lib1, lib2, lib3, cat1, cat2 and cat3. This set of 6 files is your
specialized database.

The Delila program will extract fragments from the database for analysis
by other programs. One must tell Delila how to do this in a set of
instructions written in the language Delila. Since Delila reaches into
a database (a 'library') to extract sequences, it is called the
'librarian', and the set of sequences it produces, which are all
contained in a single file, is called a 'book'.

There are several example Delila instructions, ex1in, ex2in,... The
'ex7in' file is:

   title "ex7: aligned book";
   organism ecoli; chromosome ecoli;
   piece lac;
   get from 29 -5 to 29 +10; (* laci rbs *)
   get from 1234 -5 to 1234 +10; (* lacz rbs *)

Each instruction ends with a semicolon. The first instruction defines
the title of the book to be created. This is required by Delila. The
next instructions define the organism and chromosome that one is
interested in. Unfortunately GenBank cannot be accessed this way yet,
so the names created by dbbk from the database may be a bit odd. You
can find out the names from the 'humcat' file produced by the catal
program. The third line of the instructions above defines the piece of
DNA that you want to get something from. (This is roughly equivalent to
a GenBank entry.) The next two lines actually tell Delila the regions
to extract. They say "Move to position 29, go 5 bases before there (ie
24). That is to be the first base. Then move to 29 + 10, (ie, 39).
That is to be the last base. Grab the sequence from the first to the
last base". The second get instruction says a similar thing. The
result is always 5' to 3'. You tell Delila to get the complementary
sequence simply by saying:

   get from 1234 +10 to 1234 -5 direction -;

Things inside "(* ...
*)" are comments ignored by Delila. You may write as many instructions
as you desire, getting as many sequence fragments as you may want from
various organisms. More details are available in the Delila Manual
("delman").

Write Delila instructions for the sites you want to analyze in the file
'inst'. Be sure to go FAR outside the known region of binding to avoid
chopping off part of the site! DO NOT STICK TO THE KNOWN "BOX" region.
Go at least 50 bases in front and behind it. You may be surprised!
This way, you will get a feeling for what the background looks like. It
is especially important to do if you don't have many example sites,
because seeing a horribly noisy background will allow you to avoid
over-interpreting your data.

To make writing instructions easier, the catal program now generates all
the specification instructions (ie everything except the get commands)
in the file humin. You can make a copy of this file and modify it.
This saves typing names, and avoids errors.

Once you have your library ('lib1', 'lib2', 'lib3') your catalogue
('cat1', 'cat2', 'cat3') and instructions ('inst'), you can run Delila.
Delila will create a listing of what happened ('listing') and your
'book'. The programs lister and count can be used to check the book.
I almost never look at the book directly. Rather, I use programs to
look at the data. This allows the form of a book to be made optimal for
computer programs.

The combination of this 'book' and the 'inst' files form an 'aligned
book'. Create an empty file called 'namebook' and run the alist
program. The 'list' file produced should contain the fragments you
want, all nicely aligned with the numbering on top written vertically.
There is a PostScript color version in clist.

YOU CAN PROCEED ONLY AFTER YOU HAVE GOTTEN THIS TO WORK.

Create a parameter file for the encode program in file 'encodep':

   -5 20 1 1 1 1

The -5 is the position to analyze FROM, 20 is the TO. Run the encode
program to encode the sequences in file 'encseq'.
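
As a cross-check on the get instructions written for 'inst', the offsets
can be turned into absolute coordinates with a little shell arithmetic.
This is only a sketch: the function names get_range and make_get are
made up for illustration and are not part of the Delila system, and it
assumes a simple linear integer numbering of the piece.

```shell
# get_range POSITION BEFORE AFTER
# Print the first and last absolute coordinates selected by
#   get from POSITION -BEFORE to POSITION +AFTER;
get_range() {
    echo "$(($1 - $2)) $(($1 + $3))"
}

# make_get POSITION FLANK
# Print a get instruction that reaches FLANK bases to either side
# of POSITION, following the advice to go well outside the site.
make_get() {
    echo "get from $1 -$2 to $1 +$2;"
}

get_range 29 5 10     # prints: 24 39
make_get 1234 50      # prints: get from 1234 -50 to 1234 +50;
```

The first call reproduces the worked example for the laci site: the
fragment runs from base 24 to base 39.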
Create an empty 'cmp' file and run rseq. The output is the 'rsdata'
file for use with dalvec and then makelogo.

Please tell me if you have difficulties with any of this. All comments
are welcome, and will help the next person.

_____________________________________________________________________________

Summary:

TO CREATE SEQUENCE LOGOS from GenBank files, you will need the following
programs:

   dbbk
   catal
   dbinst   (This program will automatically make delila
            instructions from feature tables.)
   delila
   alist
   encode
   rseq
   dalvec
   makelogo

in that order. Remember that if you modify one of the earlier data
files, you must run ALL of the later programs to have the data flow into
your logo. For example, if you modify one of your instructions, and you
run all the programs but forget to run rseq, your logo will still look
the same and you will be very puzzled!

NOTE that delila, encode and rseq cannot handle gaps or protein
sequences. To do that, use the alpro program followed by makelogo.

_____________________________________________________________________________

Scripts

The best method is to establish a Delila directory where you keep copies
of all the original things. Then you can use scripts that are in the
archive:

   put      - put a copy of a delila file to the delila directory
              (only for the sys admin to do!)
   get      - copy and change permissions of a delila file
   ck       - check source code of a program or delila file
   lk       - create links easily
   lkdelila - create links easily to delila directory

Do your work on each particular binding site in a different directory.
You can use lkdelila to make links or "get" to get a copy of the file.
For example I might make a directory:

   mkdir xer
   cd xer
   # create file with xer related genbank entries in db
   dbbk
   touch l2 l3 catalp # touch creates empty files
   catal
   # the delila library is now ready
   cp catin inst
   vi inst # modify the prebuilt catin
   delila
   alist
   gv clist # use ghostview to look at the color listing
   vi encodep # make the encodep
   encode
   touch cmp
   rseq
   touch dalvecp
   dalvec
   touch wave marks
   lkdelila colors # there will be no changes to colors, so a link is ok
   get makelogop # I intend to modify makelogop, so I use get
   makelogo # first shot through
   gv logo
   vi makelogop # fix it this time
   makelogo
   ...

_____________________________________________________________________________
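
Because every later program must be rerun after an earlier file changes,
a quick check of file times can catch a forgotten step before you puzzle
over an unchanged logo. The sketch below is not one of the archive
scripts; the name stale_outputs is made up for illustration. It takes
the data files in pipeline order and reports any file that is older than
the one before it. It relies on the "-nt" (newer than) test, which bash
and most other Unix shells provide.

```shell
# stale_outputs FILE...
# FILEs are given in pipeline order. Report every file that is
# older than its predecessor, which means the step that writes it
# was not rerun after the predecessor changed.
# Pairs with a missing file are skipped silently.
stale_outputs() {
    prev=""
    for f in "$@"; do
        if [ -n "$prev" ] && [ -e "$prev" ] && [ -e "$f" ] \
           && [ "$prev" -nt "$f" ]; then
            echo "$f is older than $prev: rerun the step that writes $f"
        fi
        prev=$f
    done
}
```

For the session above one might call it as "stale_outputs db l1 lib1
book encseq rsdata symvec logo". That order of files is taken from the
descriptions in this README; adjust it if your own chain differs.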