version = 5.06 of delman1 2014 Mar 06

DELMAN 1

THE DELILA SYSTEM MANUAL

THOMAS D. SCHNEIDER

COPYRIGHT (C) 1993

1. Don't Panic! You don't have to absorb this all at once!
2. There is an index at the end of any printed copy of Delman!
3. To create Delman2, see file aa.p
(end of version)
INTRO
(end of delman.intro)
DELILA SYSTEM MANUAL OUTLINE

INTRO: Introduction To The Delila System
   OUTLINE: Outline For The Delila Manual
   DESCRIPTION: What Is The Delila System?
   ORGANIZATION: Organization Of The Manual
   POLICY: Our Policies, A Disclaimer, Obtaining The Delila System, Our Address And Acknowledgements

TRANSPORT: Transportation Of The Delila System
   REQUIREMENTS: What You Will Need To Get The Delila System Running
   TAPE.FORMATS: Tape Data Formats

ASSEMBLY: Assembly Of The Delila System Programs
   INTRO: What We Mean By Assembly
   CHACHA: Changing Characters And Getting The First Program Running
   REMBLA: Removing Excess Blanks From Files
   WORCHA: The Reserved Word Problem
   MODULE: Module Libraries - What They Are And How To Use Them
   EXAMPLE: An Example Of Constructing A Delila System Program
   PROBLEMS: Problems That May Arise During Assembly

GUIDE: Hello, Computer - A Guide To The New User
   INTRO: Introduction To The Guide And Your Computer
   ADVICE: Advice And Tips To The New User
   DELILA: How To Use The Delila System On Your Computer

PROGRAM: System Independent Notes On Programming
   ESSAY: Suggestions On How To Learn And Do Programming
   FABLE: A Fairy Tale For Programmers
(end of delman.intro.outline.1)
USE: Uses Of The Delila System
   INTRO: Introduction
   STRUCTURE: Library Structure: Trees, Nested And Named Objects
   LANGUAGE: Delila - The Language
   AUXILIARY.PROGRAMS: Lister And Search
   DATA.FLOW: Data Flow And Data Loops
   COORDINATES: The Coordinate System Of A PIECE
   CONTROL: How To Control The Responses Of Delila
   COMPARISON: Ways To Compare Sequences
   ALIGNED.BOOKS: How To Make And Use Aligned Books
   PERCEPTRON: Use Of The Pattern Programs
   ENCODE: Use Of The Fabulous And Powerful Encode Program
   DBPULL: Using The Data Base Extraction Programs
   SEARCH: Using The Search Program

CONSTRUCTION: Constructing Your Own Libraries
   INTRO: Introduction
   STRUCTURE: More On Library Structure - Logical Vs Physical Structure
   CATAL: Making New Libraries - The Catalogue Program
   EXAMPLE: An Example Of Constructing Delila Libraries
   DATA.ENTRY: Using Your Own Data
   LIBRARY.DESIGN: Making A Delila Data Base
   [FORM...]: The Forms For Library Module Entry

DESCRIBE: Program And Data Descriptions
   CONVENTIONS: Notation For Naming, Writing And Running Programs
   SHORT.CLUSTER: Short Clustered Descriptions Of Delila System Files
   DOCUMENTATION: How Programs Are Documented
      The format for documentation in the Delila System is in file aa.p at the start of the Delman2 manual.

INDEX: An Alphabetical Listing Of The Pages In The Manual. (See The Page Named DELMAN.INTRO.ORGANIZATION For How To Generate The Index.)
(end of delman.intro.outline.2)
WHAT IS THE DELILA SYSTEM? The Delila System is a collection of Pascal programs and data originally written at the University of Colorado, Boulder that allows one to manipulate and study sets of nucleic-acid sequences. A set of sequences is called a library. There is a librarian, and "her" name is Delila. One gives Delila a list of instructions that name desired fragments. Delila then searches the library, collects all the sequences together and produces a "book". The book may then be searched for patterns, listed with translation to amino acids, or studied in various ways using programs other than Delila ("auxiliary" programs). Since books may be small, these analyses can be efficient. Books have the same form as libraries. In other words, libraries have a particular structure so that Delila can work with them. Books have that same structure. For example, given a Master DNA sequence library one can use Delila to make a subset such as a transcript library, containing sequences of mRNA. From the transcript library subsets for gene initiation regions can be made and these are guaranteed to be sequences from mRNA. During all these manipulations the numbering of the sequences remains consistent so that one can refer back to the original library or the literature. (The technical differences between libraries and books will be discussed later.) Any auxiliary program that searches a library will know about the structure of the library. Using this structure and the search results, the program can write Delila instructions that specify the locations of the found objects. Once again, using Delila, one can loop back and create a book of these objects. Also, the instructions (instead of the sequences) can be manipulated by various programs. A NOTE FOR PROGRAMMERS Each auxiliary program that reads a book or library knows about the library structure. 
To make programming easy, a set of routines was written as an interface between the actual database (kept in a file) and the program calls and variables. These "book reading routines" are kept together in what we call a Module Library, containing many chunks of Pascal code. Each module performs certain kinds of tasks. The modules are transferred from the module library into the source code of each auxiliary program by using the Module program. In this way all changes to the interface packages can be made once in the Module Library, followed by a series of transfers. We may send the Delila System with modules removed because there is no reason to send duplicate code. After transportation you would assemble the programs. We hope that this section gave you a rough overview of what the Delila System can do. Many more details and examples can be found in the sections that follow.(end of delman.intro.description)
libdef - the definition of the Delila Library System (a file) moddef - the definition of the Module Transfer System (a file) doodle.info - describes Pascal graphics portable under UNIX Some of the Delila programs and the method of moving modules around are described in these papers: Schneider, T.D., G.D. Stormo, J.S. Haemer and L. Gold. (1982) A design for computer nucleic-acid sequence storage, retrieval and manipulation. Nucleic Acids Research, 10: 3013-3024. Schneider, T.D., G.D. Stormo, M.A. Yarus, and L. Gold (1984) Delila system tools. Nucleic Acids Research, 12: 129-140. Some related papers are: Stormo, G.D., T.D. Schneider and L.M. Gold (1982) Characterization of translational initiation sites in E. coli. Nucleic Acids Research, 10: 2971-2996. Stormo, G.D., T.D. Schneider, L. Gold and A. Ehrenfeucht (1982) Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Research, 10: 2997-3011. Clift, B., D. Haussler, R. McConnell, T. D. Schneider and G. D. Stormo (1986) Sequence Landscapes. Nucleic Acids Research, 14: 141-158. Schneider, T.D., G.D. Stormo, L. Gold and A. Ehrenfeucht (1986) The information content of binding sites on nucleotide sequences. J. Mol. Biol. 188: 415-431. Stormo, G.D., T.D. Schneider and L. Gold (1986) Quantitative analysis of the relationship between nucleotide sequence and functional activity Nucleic Acids Research, 14: 6661-6679. T. D. Schneider (1988) Information and entropy of patterns in genetic switches. In G. J. Erickson and C. R. Smith, editors, Maximum-Entropy and Bayesian Methods in Science and Engineering, volume 2, pages 147--154, Dordrecht, The Netherlands, Kluwer Academic Publishers. T. D. Schneider and G. D. Stormo (1989) Excess information at bacteriophage T7 genomic promoters detected by a random cloning technique. Nucleic Acids Research, 17:659--674. Reference for Dotmat, Helix, Matrix and Keymat: J. V. Maizel, Jr. and R. P. 
Lenk PNAS 78: 7665-7669 (1981) A reference for Index: L. J. Korn, C. L. Queen and M. N. Wegman PNAS 74: 4401-4405 (1977)(end of delman.intro.references)
ORGANIZATION OF THE MANUAL

The Delila Manual is broken into several somewhat independent sections. When Delman is paged by program PBREAK (see Technical Notes below) you will find an index at the end. We anticipate at least two kinds of reader:

1) The builder who wants to get a Delila System running on a local computer. The section on transportation will help you get the data into your computer. The section on assembly will guide you through the difficult task of getting the programs running. At that point the Delila Libraries will still not be ready to use: you must construct catalogues as described in the section on CONSTRUCTING YOUR OWN LIBRARIES (DELMAN.CONSTRUCTION). Finally you will be able to use the Delila System. We suggest that you first look over the entire manual and associated documents. Then begin the transport. Good luck!

2) The user who wants to use a Delila System that is already running on a local computer. You may be interested in looking over the sections on transportation and assembly of the system, but this is not necessary. If you don't know anything about using the computer you should start at DELMAN.GUIDE. In any case, read the section on USE OF THE DELILA SYSTEM (DELMAN.USE).

Each program is described in a separate manual, Delman 2.

TECHNICAL NOTES (These will not be useful to people just starting.)

1. The section DELMAN.GUIDE must be rewritten after transportation to a new computer system.

2. DELMAN is physically broken into a set of modules. Each module is a page of the manual. The individual pages can be extracted (or transferred and rearranged) by using the program MODULE, as described in the documents MODDEF and DESCRIBE.MODULE. The pages may be looked at on-line with the SHOW program (DESCRIBE.SHOW).
The manual or extracted modules may be broken into pages for output to a lineprinter by using the PBREAK program with a parameter file containing: (* begin module 1 There is no closing "*)" in the trigger because many different module names may follow the trigger, so the trigger is for the common part of the module beginnings. You can generate another index of the contents of this manual in the List file of program Module if you use Delman as the Modlib and a copy of Delman as Sin. (See MODDEF for the definitions of these files.)(end of delman.intro.organization)
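The trigger idea can be sketched in Python. This is only an illustration and not the PBREAK source: the function name split_pages is hypothetical, and the closing "(* end module ... *)" lines in the sample are an assumption (see MODDEF for the real module definitions). A new page starts at every line that begins with the common part of the module beginnings.

```python
def split_pages(lines, trigger="(* begin module"):
    """Group lines into pages; each line starting with the trigger opens a new page."""
    pages = []
    current = []
    for line in lines:
        # Only the common prefix is tested, so any module name may follow it.
        if line.startswith(trigger) and current:
            pages.append(current)
            current = []
        current.append(line)
    if current:
        pages.append(current)
    return pages


# Sample input; the exact delimiter lines are assumed for this sketch.
sample = [
    "(* begin module delman.intro *)",
    "INTRO",
    "(* end module delman.intro *)",
    "(* begin module delman.intro.outline.1 *)",
    "DELILA SYSTEM MANUAL OUTLINE",
    "(* end module delman.intro.outline.1 *)",
]
pages = split_pages(sample)
```

Here the sample splits into two pages, one per module, because the test looks only at the start of each line and never at the module name.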
OBTAINING THE DELILA SYSTEM

The Delila system is available from https://alum.mit.edu/www/toms

OBTAINING THE DELILA SYSTEM BY TAPE

We prefer not to have to write tapes or disks, but we will send the Delila System by tape as a single package if you do not have ftp access. Under most circumstances we cannot send parts of the system or subsets of the data. Please send us a tape as described in delman.transport.tape.formats, and we will write out the entire current version and send it back to you. There is no fee. You may redistribute the system. If you receive a copy of the system from someone else, you may want to check back with us to see if there have been any major changes or corrections. Referring to the version number of the program or documentation will help us know if there were any changes.

DISCLAIMER

No claim or guarantee is made that Delila System programs and data are free of error. Although we send source code, we cannot guarantee that this code will compile and run on all computers. We believe that our code is reasonably efficient, but we cannot be responsible for any costs due to using the Delila System. We do not offer programming support, though we are willing to answer questions about the Delila System. We would appreciate a detailed description of any program errors (bugs) or data errors that you encounter.

OUR ADDRESS

Thomas D. Schneider, Ph.D.
toms@alum.mit.edu
https://alum.mit.edu/www/toms

ACKNOWLEDGEMENTS

Jeff Haemer, Mike Aden and Gary Stormo were instrumental in the original design of the Delila system. Many people have helped us by reading and commenting on this manual. We would like to thank: Ginny Fonte, Larry Gold, Jeff Haemer, John Hoffhines, Jane Hessler (VA), Brent Hughes, Billie Lemmon, Melissa Mockensturm, Sandy Parkinson (UT), Pat Roche, Herb Schneider, Susan Scolman, Sidney Shinedling, Britta Singer, Rosemary Sweeney, and Mike Yarus.
Computer time and resources were generously provided by the University of Colorado at Boulder, and the Frederick Biomedical Supercomputing Center. Funds for this project were provided through grants NIH 1 R01 GM28755, NIH 5 R01 GM19963 and ACS NP-178D.(end of delman.intro.policy)
Please use this page to write comments you have about the manual and the Delila system. Our address is on page delman.intro.policy. Thank you. Name: Date:(end of delman.intro.comments)
TRANSPORT
(end of delman.transport)
TRANSPORTATION - WHAT YOU WILL NEED

If you have obtained the Delila System by computer tape, you will need some way of moving the data on the tape into your computer. We suggest that you find someone who has already dealt with tapes.

All Delila System programs are written in the language Pascal. There are many books available on this language, but the definition of the language is in:

   K. Jensen and N. Wirth
   Pascal User Manual and Report
   Springer-Verlag, New York 1978

Some of the Delila programs have been automatically translated to C. See the README file for further details.

To run Pascal programs you will need a Pascal compiler on your computer, and enough memory to use it. It is impossible to make an accurate estimate of the memory requirements, because this depends on the computer system. However, we once set up an older version of the entire system on two computers:

   CDC Cyber/KRONOS    5000 pru x 640 char/pru = 3,200,000 characters
   DIGITAL VAX/VMS     7000 blocks x 512 char/block = 3,584,000 characters

Since then more programs have been added, and we find roughly:

   4,300,000 characters of source code and files
   5,300,000 bytes of compiled code

on a Pyramid 90x computer running UNIX. Since these estimates include object code, it is possible that the amount you require will be more or less. The estimates do not include memory required for running the system.

Since transportation of programs from one computer to another is still a tricky business, we recommend that either you learn about tapes, your computer, and Pascal, or that you find local people who know about these things and are willing to give you help.

The first Delila system file on the tape is called AAA (the name guarantees that it will be first). It lists the names of all the Delila files on the tape, in the order that they were taped. Following AAA the other files are in alphabetical order. Files are described in the manual section DELMAN.DESCRIBE.
If you keep notes on difficulties that you encounter and how each was solved, transportation of future versions of the Delila System will be easier.(end of delman.transport.requirements)
TAPE DATA FORMATS

We send the Delila System (programs and data) out on tape. Send us a standard 2400 foot tape. We will send you back the tape with the format:

   9 track
   1600 bits per inch
   Unlabeled
   Standard ASCII character set
   80 characters per record
   10 records per block

We can also send UNIX tar tapes. The first file on the tape lists the names of all the files on the tape.
(end of delman.transport.tape.formats)
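With 80 characters per record and 10 records per block, each full block holds 800 characters. The following Python sketch shows how one block could be cut back into lines once it has been read from the tape. It is not part of the Delila System: the name block_to_lines is hypothetical, and it assumes that each record is one line padded with blanks to 80 characters, which this page does not state.

```python
RECORD_LENGTH = 80        # characters per record (from the tape format)
RECORDS_PER_BLOCK = 10    # records per block (from the tape format)
BLOCK_LENGTH = RECORD_LENGTH * RECORDS_PER_BLOCK  # 800 characters per full block


def block_to_lines(block):
    """Split one block of ASCII bytes into lines, one per 80-character record.

    Assumption: records are blank-padded lines. The last block of a file
    may hold fewer than 10 records, which this loop handles naturally.
    """
    text = block.decode("ascii")
    records = [text[i:i + RECORD_LENGTH] for i in range(0, len(text), RECORD_LENGTH)]
    # Drop the padding blanks at the end of each record.
    return [record.rstrip(" ") for record in records]


# A short block holding two padded records.
example_block = ("program demo;".ljust(RECORD_LENGTH)
                 + "begin end.".ljust(RECORD_LENGTH)).encode("ascii")
example_lines = block_to_lines(example_block)
```

How the raw blocks are actually read depends on your computer and tape drive, so that step is left out of the sketch.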
ASSEMBLY
(end of delman.assembly)
ASSEMBLY OF THE DELILA SYSTEM PROGRAMS

At this point we will assume that all the programs and data are in files on your computer. Be sure to read the section in PROGRAMS AND DATA DESCRIPTIONS (DELMAN.DESCRIBE.CONVENTIONS) that discusses our file naming and running conventions.

This section will guide you in the construction of the Delila System programs. There are several stages to this process:

   changing characters - making sure that all the characters are correct

   removing blanks - blank characters at the end of lines can be removed to speed processing and save memory.

   changing words - changing the words that your compiler thinks are reserved words in Pascal (but aren't in standard Pascal...)

   module corrections - making sure that modular chunks of code function correctly on your computer.

   module transfers - inserting chunks of code into programs

   compilation and debugging - making the programs and finding out why things don't work ("If something can go wrong, it will." - Murphy)

We have written some tools to aid you in this process - but to use the tools you must first get some of them running - so the first steps must be done by hand. Remember to take dated notes about your problems and how they were solved.

USE OF COMMAND FILES

Most computer systems allow one to put commands in a file and execute them. If you can do this, it will speed up assembly enormously. One such "command" file could contain instructions to remove blanks, change characters, change words, transfer modules and perhaps even try to compile. However, it would be better to have several command files, each of which did a small part, giving you more flexibility.
(end of delman.assembly.intro)
CHANGING CHARACTERS

When characters are written to tape they are encoded as binary strings. When your computer reads the tape, the characters are decoded for storage on your computer. If the decoding does not exactly reverse the encoding, then the characters you receive will not be the same as the ones that we sent. For example, you may have a pound sign for each exclamation mark that we sent.

Your first task is to find out what changes occurred (if any). To aid you, we provided a list of characters with English descriptions in the file 'chars'. Look at this file and write down the changes required.

Use the editor on your computer to correct the characters in the file CHACHAS. Now try to compile CHACHAS. Determine the reasons for any errors. (For example, you may have to switch double and single quotes to satisfy the compiler or you may have to remove the non-standard linelimit call.)

The CHACHA program will now assist you in converting characters in the files from the tape. You should try it out on chars, remembering not to destroy the original file.

NOTE: Some Pascal compilers may not allow programs that read "nonstandard" characters. (Example: small characters.) You may be able to get around this by setting compiler defaults.
(end of delman.assembly.chacha)
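The kind of conversion described above can be sketched in Python. This is not the CHACHA source, and the format of Chacha's own change list is not described on this page; the function name change_characters and the dictionary of changes are assumptions made for the illustration.

```python
def change_characters(text, changes):
    """Replace each character listed in changes; all others pass through unchanged.

    changes maps a received character to the character that was sent.
    All replacements happen in one pass, so a swap of two characters
    (for example " and ') does not undo itself.
    """
    table = str.maketrans(changes)
    return text.translate(table)


# The example from the text: a pound sign arrived for each exclamation mark.
restored = change_characters("Don't Panic#", {"#": "!"})

# Swapping double and single quotes in a single pass.
swapped = change_characters("writeln(\"ok\")", {'"': "'", "'": '"'})
```

A single-pass table is used because doing the two quote changes one after the other would convert every quote to the same character.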
REMOVING EXCESS BLANKS FROM FILES The files that you get off the tape may have extra blanks (spaces) at the ends of lines. This may be due to transportation itself, or the source computer may add extra blanks to lines. Although these blanks will not affect the function of most programs, they will slow down program execution and use up extra memory. Transportation can also add blank lines to the end of the file. Some programs will object to this. Catal is one example. The program Rembla (remove blanks) will remove all blanks from the ends of lines in a file, and any extra blank lines at the end. We recommend that you include this as a step during assembly of programs. It should also be done for data files, especially the libraries.(end of delman.assembly.rembla)
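What Rembla is described as doing can be sketched in a few lines of Python. This is an illustration only, not the Rembla source; the name remove_blanks is hypothetical and the lines are assumed to be held without their end-of-line characters.

```python
def remove_blanks(lines):
    """Strip blanks from the end of every line, then drop blank lines at the end of the file."""
    stripped = [line.rstrip(" ") for line in lines]
    # Blank lines inside the file are kept; only the trailing ones are removed.
    while stripped and stripped[-1] == "":
        stripped.pop()
    return stripped


cleaned = remove_blanks(["title   ", "", "  data  ", "    ", ""])
```

Blanks at the start of a line are left alone, since only blanks at the ends of lines and blank lines at the end of the file are excess.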
THE RESERVED WORD PROBLEM The language Pascal defines certain words (such as PROGRAM, VAR, BEGIN and END) to be reserved words. These words cannot be used as variable names. This in itself presents no difficulties for portability. However, your Pascal compiler (like ours) may reserve more words than just the standard set. If one of the Delila System programs uses a non-standard reserved word of your compiler, then the program will not compile. You will not have to change all these names by hand because we have sent a program to do it automatically. Non-standard reserved words should be listed somewhere in the manual for your Pascal compiler. Use this list and the program WORCHA to remove all the reserved names. We suggest using new names that are not likely to appear in a program. Example: MODULE could be converted to ZMODULE without loss of meaning. ZMODULE is not likely to be already used in a program. Worcha will not alter literals or comments, so the program's operation will not be affected by this change. If one makes the changes with a standard editor, then the program may not act as described in this manual. (We hope that those people who design compilers will consider this problem in the future.)(end of delman.assembly.worcha)
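The reason a standard editor is risky can be seen in a small Python sketch of the idea. This is not the Worcha source: the name change_words, the regular expression, and the dictionary of changes are all assumptions for the illustration. It treats { ... } and (* ... *) as comments and '...' as literals, and changes a word only when it stands outside both.

```python
import re

# One token per match: a comment, a quoted literal, or a word.
TOKEN = re.compile(
    r"""
    (?P<comment>\{.*?\}|\(\*.*?\*\))    # Pascal comments in either bracket style
  | (?P<literal>'(?:[^']|'')*')         # quoted literal; two quotes inside mean one quote
  | (?P<word>[A-Za-z][A-Za-z0-9_]*)     # identifier or reserved word
    """,
    re.VERBOSE | re.DOTALL,
)


def change_words(source, changes):
    """Rename whole words outside comments and literals; Pascal ignores letter case."""
    lookup = {old.lower(): new for old, new in changes.items()}

    def replace(match):
        if match.lastgroup == "word":
            return lookup.get(match.group().lower(), match.group())
        return match.group()  # comments and literals are copied unchanged

    return TOKEN.sub(replace, source)


before = "var module: integer; { module count } begin writeln('module'); MODULE := 1 end."
after = change_words(before, {"module": "zmodule"})
```

In this sample the variable is renamed in both places it is used, while the comment and the printed text still say "module", so the program's output is unchanged.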
ASSEMBLY USING MODULES First, familiarize yourself with DELMAN.DESCRIBE.CONVENTIONS. You are now ready to assemble a Delila auxiliary program. The raw source LISTERR cannot be compiled as it now stands because it is missing a set of replaceable chunks of code (called modules) to read books (the book reading interface modules). These are to be found in DELMODS, as stated in the first few lines of LISTERR. Notice that DELMODS is a program - compile and run it. This will almost certainly fail. Correct those modules that cause problems. See the section on assembly problems. Modules can be moved around using the MODULE program. The details of this process are described in MODDEF, which you should study now. --------------------------- READ MODDEF NOW --------------------------------(end of delman.assembly.module.1)
Prepare to do the module transfers by compiling MODULES. All programs should be tested on small inputs at first. Test the Module program with the example module source and library: MODULE(EXSIN,EXMODLI,EXSOUT,EXCT,LIST,OUTPUT) Exsout should be identical to the sout example in ModDef. Examine list and exsout. Now try: MODULE(LISTERR, DELMODS, LISTERS, DELCAT, OUTPUT) The OUTPUT file will tell you the progress MODULE makes during the transfer. Modules in DELMODS will be copied into the right places of LISTERR and the result will be LISTERS (LISTER with inserts - source code). It will be useful to save DELCAT for further transfers from DELMODS. Compile LISTERS. Run the LISTER (using the default parameters): LISTER(EX0BK, EX0LIT) The file EX0LIT is a listing of the example book EX0BK. It should be identical to EX0LI. The possible exception is the begin-page character: some computers use a 1 to indicate jump to the next page, while others use control-L. We would now like to know that LISTER works correctly. To do this requires a comparison program. MERGE will do. However, to construct MERGE requires modules from PRGMODS. Compiling PRGMODS and running it will test interactive i/o. The procedures in PRGMODS that may need modification are PROMPT, READCHAR and READLINE, in decreasing order of system dependence. You should modify LINELIMIT and HALT by transferring the corrected modules from DELMODS into PRGMODS. Prepare PRGMODS and run it. Prepare MERGE and use it to prove that EX0LIT = EX0LI. You may now construct the rest of the programs. Note that some of them use several module libraries. For the next stage of setting up the Delila System compile CATALS, LOOCATS and DELILAS. You must now construct the libraries: skip to CONSTRUCTING YOUR OWN LIBRARIES, (DELMAN.CONSTRUCTION). NOTE FOR A SECOND TRANSPORTATION If you obtain a later version of the Delila System, then Delmods and other module libraries are likely to be altered. 
You will want to replace modules in the new DELMODS and PRGMODS with your own (system dependent) versions. If you did this directly, you would also replace corrections and changes to DELMODS. To avoid this problem, simply construct a small module library (containing for example LINELIMIT, DATETIME modules and the interaction modules). Then use this to change DELMODS and PRGMODS.(end of delman.assembly.module.2)
AN EXAMPLE OF CONSTRUCTING A DELILA SYSTEM PROGRAM

In this example we show the series of steps used to set up a Delila system program, given that the module libraries are ready (that is, they compile and run). The example is for Patser, which requires both Delmods and Auxmods. We assume that the tools needed to do this are already set up, as discussed on the previous pages. As noted in DELMAN.ASSEMBLY.INTRO, it is frequently possible to automate these steps.

1. Change Characters

      chacha(patserr,patser1,chachap)

   Chachap must contain the changes you determined earlier.

2. Remove Blanks

      rembla(patser1,patser2)

3. Change Words

      worcha(patser2,patser3,worchap)

   Worchap must contain a list of special reserved words and what they are to become.

4. Insert Modules

      module(patser3,auxmods,patser4,auxcat)
      module(patser4,delmods,patsers,delcat)

   Auxcat and delcat will be generated by Module if they were empty. You can reuse them later with their respective module libraries. The module libraries needed are listed in the first few lines of each program. It is not necessary to pick up the DESCRIBE module to compile the program.

5. Compile

   Patsers is now complete source code, ready to compile.
(end of delman.assembly.example)
ASSEMBLY PROBLEMS Transportation and assembly problems occur most often because of unavoidable system dependent features of particular Pascal compilers. INTERACTIVE INPUT For interactive input we wrote several modules that work on our computer (INTERACT in PRGMODS). These procedures may or may not be transportable, so you may have to modify them. For example, interactive input on a cyber Pascal compiler requires the file name "input/" - you would have to remove the "/" for your compiler. (This is no longer necessary, as the source code is now under UNIX which does not require this.) DATE AND TIME PROCEDURES The module for date and time calls (module PACKAGE.DATETIME in DELMODS) must be rewritten. We strongly recommend that you keep the same form for the dates in libraries so that these routines remain interfaces. Changing the form of the date would make transportation of libraries difficult because they would not have the same structure in different locations. Modules that will work on a VAX computer are in VAXMODS. You may find it easier to adapt these to your computer rather than the ones that are in Delmods. If your computer does not have a clock, the simplest way to get this module running is to add DATE and TIME procedures in the form called by READDATETIME. These dummy procedures could return either a fixed time or a random time made by a true random number generator. The date and time is used to uniquely identify books and some data files. QUOTES CDC Cyber Pascal compilers require double quotes(") where the standard is the single quote ('). SOLUTION: use CHACHA to convert: " to ' and ' to " In some cases you will have to use two single quotes so that Pascal prints a single quote. Some programs that print 5' and 3' are Lister, Helix, Matrix and Dotmat. To convert, simply alter the constant called 'prime'.(end of delman.assembly.problems.1)
LINELIMIT In CDC Cyber Pascal compilers, output to files is limited to 1000 lines unless the LINELIMIT procedure is called. Your compiler may not require or recognize this silliness. SOLUTION: The calls to linelimit are isolated to the procedure UNLIMITLN in the module by the same name in DELMODS and PRGMODS. Simply surround the call (inside the modules!!!) with comments. INTERNAL FILES (thanks to Sandy Parkinson) An "internal file", for the discussion here, is a file used by a Pascal program as a scratch pad. It is not connected to the outside world. Some computer systems and their Pascal compiler require that all files be connected to the outside, as they are not capable of creating temporary files. At least two Delila programs use internal files: Module and Split. Correction of this problem requires some programming. It may not be possible to do it for Split. COMPARISONS OF PACKED ARRAYS May cause you some problems. One solution is to use arrays that are not packed and to write your own comparison procedure. THINGS THAT WE HAVE NOT THOUGHT OF... Please tell us! Our address is in DELMAN.INTRO.POLICY. For notes on the writing of transportable programs see DELMAN.PROGRAM and DELMAN.DESCRIBE.CONVENTIONS.WRITING.(end of delman.assembly.problems.2)
GUIDE
(end of delman.guide)
HELLO COMPUTER - A GUIDE TO THE NEW USER ABOUT THIS SECTION: This section is a guide to using the computer. Whenever you have questions about the computer, this is the place to look, because the rest of the manual is about the Delila System ONLY. That is to say, we have split this manual into several parts - and it will not help for you to look for the right thing in the wrong part. The reason for this is that the information about the Delila System can be moved from one computer to another (just like the Delila System) but information about computers usually cannot be moved. DELMAN.GUIDE must be REWRITTEN for other computers and operating systems. ABOUT THIS COMPUTER: This manual section is written specifically for UNIX operating systems. (UNIX is a trademark of Bell Laboratories.) OTHER DOCUMENTS AND RESOURCES: In general, ask around. Type help to get pointers. Learn how to use the UNIX manual program (man). The apropos program is useful for finding things. There are hundreds of books on UNIX. Find one you like. Many people seem to like: UNIX for People by P. Birns, P. Brown and J. C. C. Muster Prentice-Hall, Inc, 1985 The easiest way to learn to use a computer is to use the computer! Obtain a login identification and plunge in. DO NOT REVEAL YOUR PASSWORD TO ANYONE!!!(end of delman.guide.intro)
SOME ADVICE TO A NEW COMPUTER USER: 1) YOU CAN'T HURT THE COMPUTER. Don't hesitate to try things and to play around! 2) After you learn how to get on and off the computer your best bet is to get a firm grip on what files are, how you can make them and how to manipulate them. The easiest way to understand what is happening is to watch it happen. You should use the commands that display your files after each file manipulation - until you have a good feeling about what is happening. If you do this you will quickly become confident about what you are doing. 3) A lot of the general principles that you pick up will be similar on other computers. 4) Be wary of the characters you type. Notice that a zero (0) is NOT the same as the capital letter O - the computer can tell them apart. This is also true for a one (1) and the small l. 5) Do not do any serious work while you learn to use the computer. You are likely to destroy some of your files. That will hurt you and not the computer. Loss of good data can be terribly frustrating. 6) If you have a problem TRY A SIMPLER CASE, TRY TO ISOLATE THE PROBLEM. 7) An experienced advisor is worth a thousand hours of computer time. UNCRITICAL ACCEPTANCE OF COMPUTER RESULTS "So useful has the computer become in all branches of statistical analysis that there may be some tendency to forget that even it has its limitations. The computer cannot work magic--not yet anyway. It will do only what it is instructed to do, and the validity of the results is determined by the accuracy and adequacy of the data put in and the wisdom of the people writing the instructions. Granted, the computer can perform a great many calculations much more rapidly than mere mortals can do them. Nevertheless, speed of computational work is not the same thing as infallibility in aiding with the decision-making process. A statistical critic, of all people, should guard against being overawed by the news that certain information was turned out by a computer. 
The mere fact that computers are being used these days even to cast horoscopes should be ample proof that a computer is no more immune to spewing out nonsense than are real flesh-and-blood people." -from FLAWS AND FALLACIES IN STATISTICAL THINKING by Stephen K. Campbell (N.J. Prentice-Hall Inc., 1974), p. 182(end of delman.guide.advice)
HOW TO USE THE DELILA SYSTEM ON THIS COMPUTER Computer: Cutterjohn and Sparky. The Delila System programs and documentation are kept in the directory ~toms/delila The binary forms (which you can run) are in ~toms/bin If you put this directory in your path, then they will simply be commands.(end of delman.guide.delila)
PROGRAM(end of delman.program)
SUGGESTIONS ON HOW TO LEARN AND DO PROGRAMMING
(An Essay By Tom Schneider)

ABOUT LANGUAGES

A computer language is the meeting ground between the absolutely rigid requirements of a computer (it must be told exactly what to do) and the ambiguous and flexible uses of human languages (such as "go jump in a lake", "pour me a cup" etc.). Recently many academic institutions in the USA have allowed students to substitute computer languages for a knowledge of human languages. Although a knowledge of computers is becoming increasingly important in our society, this change is short-sighted: no computer language is anywhere near as powerful or beautiful as those practiced by humans. With dedication one can easily learn twenty computer "languages" in a few years, whereas the polyglot is rare indeed. It is important to learn both kinds of language. For one to substitute FORTRAN for French is preposterous cheating.

HOW DO LANGUAGES WORK? COMPILERS

Every kind of computer has its own internal "machine" language. It is difficult for a person to write or read this because it consists of long stretches of ones and zeros:

   0100101010111010000011 10110111101001110010100101001010...

Every "bit" (a one or a zero) must be exactly right or the machine will not operate correctly. Most people can't deal with such immense amounts of detail. The solution is to force the computer to keep track of the details and let the person think in word-like and sentence-like units:

   IF SUNNY THEN REJOICE ELSE MOPE;

Once one has written a set of sentences in a "higher" level language, one must have the computer convert them to its own internal machine language (this is not strictly true, but we will only discuss one method here). The process is called compiling. A self-contained and consistent set of "sentences" and "paragraphs" is called a program. Obviously one also needs a program to do the compiling - that program is called a compiler. For example, one relatively modern language is called Pascal.
A Pascal compiler sits ("resides") in ("on" - so much for jargon) a particular computer. It converts statements made in the Pascal language into machine zeros and ones for that computer (and only that computer). In other words, it converts a SOURCE code into an OBJECT code. The object code can be made to operate ("run") only on one kind of computer. (Note: the word "code" means "program". Also, on some computers one must convert the object code into "executable" code before it can be run.)

(Here is something to puzzle over. It is now common practice to write a compiler in the same language that the compiler compiles. The Pascal compiler was written in Pascal. It's like pulling oneself out of the mud by the bootstraps... how did it start?)

WHY PASCAL?

One of the first languages written was called FORTRAN. In its day (the 1950's) it was a great boon because one no longer needed to write in machine language (or even one step up, assembly). Since that time many new ideas have been incorporated into languages. Some of them (such as recursion and complex data types) fall outside the range that FORTRAN can handle. This evolution is to be expected. Yet people still try to teach an old dog new tricks, so there have been a series of "improvements" to FORTRAN. The result is a great mish-mash of dialects. For these reasons (and other things like the dread FORMAT statement) it is difficult (although not impossible) to write good transportable code in FORTRAN. ("Transportable" or "machine independent" means that the program will work on several different computers.)

Pascal is a more modern language, so it includes recently developed concepts. One can write excellent, crystal-clear code in this language. Unfortunately this property does not prevent one from writing poor and obscure code!

TOPDOWNING: How To Write Clear Code

There are as many ways to write code as there are people. Yet a few simple principles allow one to organize one's thoughts quickly and efficiently.
Writing a program is just like ... writing an outline. One starts at the "top" by writing the main things to be done:

   Tom's Day
      I. Morning
      II. Travel To Work
      III. Work
      IV. Travel Back Home
      V. Evening

Then one writes the first section:

   I. Morning
      A. Get Up
      B. Shower
      C. Get Dressed
      D. Eat
      E. Put On Coat

This is repeated for the other sections. Eventually we get even deeper:

   I. Morning
      A. Get Up
         1. Huh?
         2. Open eyes
         3. Yawn
         ...

In Pascal, one dispenses with the numbering of sections. Instead, each section has a name. A section is called a procedure. Since you can read all about procedures, I won't go into more detail here. The main advantage to this method is that if one is careful, each procedure is isolated from all the others. There is only one thing to think about at a time.

SPAGHETTI PROGRAMMING

Many computer languages, including Pascal, allow one to jump from one statement to others in the program. These GOTO statements invariably lead to poor programs because one creates nests of GOTO's that jump all over the place. These can be difficult to figure out. I have seen a case where a professional programmer didn't know about an inefficient series of jumps that he had written. Even large companies sell code that is a tangled mess. Modern programmers have found that the solution is amazingly simple:

   DON'T USE GOTO'S

The Delila system programs use only one GOTO, in a procedure named HALT which terminates the program by jumping to the end of the program. This is necessary because Pascal does not provide for a program abort procedure. (Pascal HALT is not standard.) There are NO other circumstances when a GOTO is required!!

A METHOD FOR WRITING PROGRAMS

This is what I do when I write a program: I have a stack of old computer paper (or standard size paper, not printer size). I write one procedure on each sheet. An entire procedure is "no longer than" one page. In fact, any procedure longer than a page is usually a warning that I need more procedures.
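The outline of Tom's Day above can be turned into a program skeleton directly. What follows is only a sketch in standard Pascal: the procedure names, the VERSION value and the check inside WORK are invented for illustration, and the HALT procedure is a guess at the shape described above (a single GOTO to a label at the end of the program), not a copy of Delila System code. Some compilers restrict jumps out of a procedure and may need to be told to accept this one.

```pascal
program tomsday(output);
(* tomsday: a top-down skeleton. The main program reads like the
   outline, and each section of the outline is a procedure.
   All names are illustrative. *)

label 1; (* the end of the program, the only GOTO target *)

const
  version = 1.00; (* change this whenever the source changes *)

procedure halt;
(* stop the program by jumping to the end of the main program *)
begin
  writeln(output, ' program halt.');
  goto 1
end;

procedure getup;
(* I.A. of the outline *)
begin
  writeln(output, 'huh? open eyes. yawn.')
end;

procedure morning;
(* I. of the outline *)
begin
  getup;
  writeln(output, 'shower, get dressed, eat, put on coat')
end;

procedure travel(homeward: boolean);
(* II. and IV. of the outline share one procedure *)
begin
  if homeward then writeln(output, 'travel back home')
  else writeln(output, 'travel to work')
end;

procedure work(hours: integer);
(* III. of the outline; an impossible request stops the program *)
begin
  if (hours < 0) or (hours > 24) then halt;
  writeln(output, 'work for ', hours:1, ' hours')
end;

procedure evening;
(* V. of the outline: defined, but the details are not written yet *)
begin
end;

begin
  writeln(output, 'tomsday ', version:4:2);
  morning;
  travel(false);
  work(8);
  travel(true);
  evening;
1:
end.
```

Each procedure fits easily on one page, and EVENING shows how a procedure can be defined at the top level long before its details exist.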
It is not necessary at first to write the details of every procedure, only to define the procedures. Starting from the top I work down a ways, realize that I need a set of primitive procedures (e.g. to manipulate text lines) so I define them, but the way they work can be written later. So as the highest levels of the program are formed, the lower levels are defined. Eventually it is time to write details of the lower levels. Sometimes the higher level can be simplified as the lower levels become clearer.

As you can tell from this description, one begins from the top, but the entire structure changes as one goes. Don't be afraid to toss out a procedure that's no good - it's only one page and the paper can be recycled. The last point is important: be flexible. Don't keep banging your head against a logical dilemma. I have often outlined a whole program - and then tossed it out because there was a better solution. Learn when to drop an approach. Clues: you find yourself trying to do many things at once; the primitive procedures that you have devised are awkward to use; and you find it impossible to document a procedure.

Document a procedure??

DOCUMENTATION: The Key To Immortal Code

Even in a high level language like Pascal, it is possible to have a functioning program that is not easy to understand. To define a procedure I often write down the name of the procedure, the variables (pieces of information to be manipulated) that it uses and then a few English sentences that define exactly how the variables are to be used. This is all one needs for the higher levels of the outline. Those written sentences are called comments. They are part of the documentation required to make the program easy to write and ... easy to read. It is impossible to overemphasize the importance of documentation because nobody EVER does enough (me included). If you don't document, within a short time (e.g.
a month to half a year) you will have forgotten the details of the program - and it will be painful to figure it out again. Worse than that - nobody else will be able to work with it! It is not hard to write out what you are trying to do in a particular section of code or procedure, and it has a real advantage: one is forced to think clearly.

There are several places in a program that ought to have comments:

   PROGRAM STATEMENT - the program should state its purpose in life, how it should be used, who wrote it and the date of the latest version. Some technical details can be included.

   CONSTANTS - Include a constant called VERSION and CHANGE THIS EVERY TIME THAT YOU CHANGE THE SOURCE CODE. Write the version to all output from the program. This will assure that all output can be unambiguously associated with a particular version of the program. This will save you many headaches! (Note: some computers keep track of file versions. FILE VERSIONS WILL NOT SUBSTITUTE FOR AN INTERNAL CONSTANT because the program output is not affected and it is not transportable.)

   All CONSTANTS, TYPES and VARIABLES should have a short description of their purpose. DON'T USE ONE VARIABLE FOR TWO PURPOSES - you will be unable to document these cases properly and the code will be confusing.

   Each PROCEDURE or FUNCTION should have a short description that tells how to use it and gives the purpose of each passed variable.

**************************************************************************
* SUMMARY: programming is vastly simplified by using two simple tactics: *
*          topdowning and documentation.                                 *
**************************************************************************

A NOTE ON DATA STRUCTURES

Higher level languages, such as Pascal (but not FORTRAN) allow one to describe data in forms (structures) that resemble the way one thinks about the problem.
To take advantage of these facilities, it pays to name each "variable" (a structured box into which data is put) and "type" (the structure of the box) carefully. A good name will make operations on the variable obvious, and errors will stand out because they will "sound" wrong.

LOCATING ERRORS: Debugging

Even with top down programming and documentation, errors are made. These are called "bugs". There are several kinds:

   SYNTAX - the compiler will yell at you for things like spelling mistakes
   BOMBING - the program stops abruptly when it should not
   LOGIC - the program produces strange results
   SUBTLE - the program can't handle certain rare conditions correctly

SYNTAX - It helps to check what you type in. Since I put one procedure per hand written page, this is the easiest unit to check. Many subtle bugs can also be caught this way.

BOMBING - It is often obvious where the program died. Work backwards through the logic to find the error. Clear, top-down code makes this much easier: one can often tell immediately where the problem is. Tracing also can help. See below.

LOGIC and SUBTLE - Some computer systems allow one to trace the path that the computer follows through a program. So far I have not found these useful because they are cumbersome and they put out too much data. A few well placed write statements will trace the program flow quite well. (A "write statement" could print the value of a variable out for you and tell you where the computer currently is in the program.) In Pascal, one method is to make a global constant:

   DEBUGGING = TRUE; (* FOR DEBUGGING PURPOSES *)

and use it this way:

   IF DEBUGGING THEN WRITELN(OUTPUT, 'BEGIN PROCEDURE CIRCLE');

By changing the value of DEBUGGING one can turn the trace on and off. To turn off an individual trace point, one can "comment it out":

   (* IF DEBUGGING THEN WRITELN(OUTPUT, 'BEGIN PROCEDURE CIRCLE'); *)

The symbols "(*" and "*)" will make Pascal ignore the contents, because they become comments.
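Put together, the DEBUGGING switch and the VERSION constant recommended earlier might look like the sketch below. It is only an illustration: the program name TRACEDEMO, the body of CIRCLE and the value of VERSION are invented here, and standard Pascal delimits strings with apostrophes, so the sketch uses them.

```pascal
program tracedemo(output);
(* tracedemo: show a trace switch and a version constant.
   A sketch only; the contents of circle are made up. *)

const
  version = 1.00;     (* change with every change to the source *)
  debugging = true;   (* for debugging purposes: true turns traces on *)
  pivalue = 3.14159;  (* ratio of circumference to diameter *)

var
  area: real; (* the area returned by circle *)

procedure circle(radius: real; var area: real);
(* return in area the area of a circle with the given radius *)
begin
  if debugging then writeln(output, 'begin procedure circle');
  area := pivalue * radius * radius;
  if debugging then writeln(output, 'radius = ', radius:8:3);
  (* if debugging then writeln(output, 'area = ', area:8:3); *)
end;

begin
  (* every output begins with the version of the program *)
  writeln(output, 'tracedemo ', version:4:2);
  circle(2.0, area);
  writeln(output, 'area: ', area:8:3)
end.
```

Setting DEBUGGING to FALSE silences both active trace points at once, while the third stays commented out until it is needed again.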
The advantage of commenting a trace point out, rather than removing the statement, is that one can reactivate it easily.

By far the most time-saving method is to write clear, well documented code.

TESTING CODE

It is often worthwhile to test a program on a small set of examples that one has worked out by hand. You should be aware however, that correct answers to tests do not prove that the program is correct. (This may seem obvious, but it is an easy mistake to make.) Sometimes one can prove the correctness of a program. This is a current field of research in computer science.

HOW TO READ MANUALS

Obtain your own copy of the manual and begin to read. Get a general idea of how the language, editor or system works. Don't worry about details yet. As soon as you have an idea about how to do something, try it on the computer. Play. Later on, you can read through the manual seriously if you want. However there is often a lot of detail that you would have to memorize. It is simpler to know that something can be done (by reading it once lightly) and to look it up when you need to do it.

WRITING TRANSPORTABLE PROGRAMS

A program written for one computer may not run on another computer because the compilers for the two computers may not understand the same language. Moving a program from one computer to another is called transportation. If you are going to the trouble and effort to write a good program, then you may as well make it easy for other people to use it. Your program would then be transportable.

Obviously to be transportable, a program must be well written and documented. That is not all. You must avoid all the fancy "features" that your compiler advertises, because no one else has these. If you are forced to use some feature, then isolate it to a few replaceable procedures. We have provided you with a transportable(!) mechanism for replacing chunks of code like this - see the document MODDEF and the MODULE program.

PROGRAM MAINTENANCE... SENILITY... AND DEATH.
The most costly aspect of using computer programs is not their initial writing, but maintaining them once they are written. This is well documented in the literature. But why should a program need maintenance? Isn't it fixed text that does not change? In the simplest sense this is true. But over time, bugs in the code are found and fixed, and needs and expectations change. Programs are not static; they evolve. Good programming techniques and documentation make maintenance easier during the lifetime of a program, but eventually the program becomes so hard to change that one must scrap it altogether and start a fresh design. So programs have a birth, a life of use and maintenance and, finally, a senility before they die.

REFERENCES

"Pascal User Manual and Report", Second Edition, Kathleen Jensen and Niklaus Wirth. Springer-Verlag, 1978.

"Software Tools in Pascal", Brian W. Kernighan and P. J. Plauger. Addison-Wesley Publishing Co., 1981.

"Algorithms + Data Structures = Programs", Niklaus Wirth. Prentice-Hall, Inc., 1976.

"Structured Programming", O. J. Dahl, E. W. Dijkstra and C. A. R. Hoare. Academic Press, London, 1977.

"Selected Writings on Computing: A Personal Perspective", E. W. Dijkstra. Springer-Verlag, New York, 1982.
(end of delman.program.essay)
A Fairy Tale For Programmers The Three Most Important Concepts for Writing Good Code 1. Put comments in your code. 2. Don't ever forget that six months from now your program will be useless even to you without comments. 3. Several people who published a rather well known article on using computers to study sequences (and whose names shall remain unsaid to protect the guilty) sent their programs to us two years after they had published their article. It turned out that we could not use their programs directly because we did not have available the language that they used. It was necessary to translate each line of code into our language before we could use their program. Ok, fine, we know how to do that. But despite the fact that these were old programs that they had been working on for a long time, there were almost no comments in their code. That made the translation 100 times more difficult!! One sees an equation in the code - what does it mean? If they do something in a funny way, was it a mistake or is it important to do it that way? What a headache! We threw out their programs and wrote our own. MORAL: Code that is not documented in English will not survive in the long run. Therefore: Put In Comments. Comment As You Code, NOT AFTERWARDS - Comments Are Part Of The Code. Change The Comments When You Change The Code, NEVER PUT THIS OFF. Epilogue Years later, out of curiosity, the program called CODE (COmment DEnsity) was written. We were startled to discover that the frequency of characters devoted to comments in our code averages around 30 percent!(end of delman.program.fable)
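The idea of measuring comment density can be sketched in a few lines of Pascal. This is not the CODE program itself, whose source is not shown here; it is a guess at the simplest possible version. It assumes that comments are written only with (* and *), it does not recognize quoted strings, and it does not count the ends of lines as characters.

```pascal
program density(input, output);
(* density: read a Pascal source from input and report the percent
   of its characters that lie inside comments, delimiters included.
   A sketch only, not the Delila System program CODE. *)

var
  previous, current: char; (* the last two characters read *)
  incomment: boolean;      (* true while inside a comment *)
  total: integer;          (* characters read so far *)
  commented: integer;      (* characters that belong to comments *)

begin
  previous := ' ';
  incomment := false;
  total := 0;
  commented := 0;

  while not eof(input) do
    if eoln(input) then begin
      readln(input);
      previous := ' ' (* a delimiter cannot span two lines *)
    end
    else begin
      read(input, current);
      total := total + 1;
      if incomment then begin
        commented := commented + 1;
        if (previous = '*') and (current = ')') then begin
          incomment := false;
          current := ' ' (* do not reuse this character in a delimiter *)
        end
      end
      else if (previous = '(') and (current = '*') then begin
        incomment := true;
        commented := commented + 2; (* count both delimiter characters *)
        current := ' ' (* so that the three characters (, * and ) do not close it *)
      end;
      previous := current
    end;

  if total > 0 then
    writeln(output, 'comment density: ',
                    100.0 * commented / total:5:1, ' percent')
  else
    writeln(output, 'no characters read')
end.
```

Run with a source file as its input, it gives a rough figure to compare against the 30 percent mentioned in the epilogue.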
USE(end of delman.use)
Use Of The Delila System INTRODUCTION This section of the Delila Manual assumes that you have read the introduction to the manual, that a Delila System is running on your computer, and that you know how to get on the computer, to make files, to modify and correct files, and to run programs (See DELMAN.GUIDE.). There are several sources of information that you can keep in mind: 1) The papers in DELMAN.INTRO.REFERENCES will show you how we have used the Delila System. 2) LIBDEF. This is a technical specification of Delila and the libraries. However, there is a set of detailed examples that can be read profitably without reading all the definitions. 3) The section of DELMAN called Program and Data Descriptions (DELMAN.DESCRIBE) lists everything that is available to you. Whenever you want a tool to do something, that is the place to look. In this section we will first discuss the structure of a Delila Library and how you can find your pet (pet's?) sequence in it. Next we describe how to tell Delila to go and fetch your sequences. We will then discuss programs that let you study the sequences. The sequence analysis will bring us back to Delila.(end of delman.use.intro)
LIBRARY STRUCTURE

Think about a tree. The trunk spreads into a series of branches, sticks and twigs. A Delila library looks something like that, except that there are several kinds of branch, stick and twig, much as each twig ends in a leaf, bud or a flower. We have given names to the kinds of branches and leaves in Delila libraries.

Near the trunk there are the ORGANISM and the RECOGNITION-CLASS. An ORGANISM is a cluster of data pertaining to a real-world organism. The term "organism" is somewhat ambiguous, so it is a matter of taste as to the classification of some creatures (is a virus a traveling plasmid?). In our library T4, T7 and E. coli information is stored in ORGANISMs. A RECOGNITION-CLASS is a cluster of data about any process that recognizes specific nucleic-acid sequences. These include chemical modification and restriction enzymes. (At present this portion of the library is not fully implemented, so we will not discuss it further.)

The library structure can be diagrammed in a schema:

   A-->>--B means A has one or more of B.
   C--->--D means C has one of D.

   LIBRARY-->>--ORGANISM-->>--CHROMOSOME
   LIBRARY-->>--RECOGNITION-CLASS-->>--ENZYME

   CHROMOSOME-->>--MARKER
   CHROMOSOME-->>--TRANSCRIPT
   CHROMOSOME-->>--GENE
   CHROMOSOME-->>--PIECE

   Below this level MARKER, TRANSCRIPT and GENE are each tied to PIECE, and the schema ends in a row of three SEQUENCEs.
(end of delman.use.structure.1)
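For readers who think more easily in Pascal than in diagrams, the ORGANISM side of the schema can be pictured as nested records. This is only one possible picture: every type and field name below is invented for illustration (the real definitions are in LIBDEF), and MARKER, TRANSCRIPT and RECOGNITION-CLASS are left out to keep the sketch short.

```pascal
program libshape(output);
(* libshape: one possible picture of the library tree as Pascal records.
   Every type and field name is invented for illustration. *)

type
  name = packed array[1..10] of char; (* names are padded with blanks *)

  pieceptr = ^piece;
  piece = record
    key: name;
    next: pieceptr          (* the next PIECE in the same CHROMOSOME *)
  end;

  geneptr = ^gene;
  gene = record
    key: name;
    home: pieceptr;         (* the PIECE that this GENE points to *)
    next: geneptr           (* the next GENE in the same CHROMOSOME *)
  end;

  chromosomeptr = ^chromosome;
  chromosome = record
    key: name;
    pieces: pieceptr;       (* one or more PIECEs *)
    genes: geneptr;         (* one or more GENEs *)
    next: chromosomeptr     (* the next CHROMOSOME in the ORGANISM *)
  end;

  organism = record
    key: name;
    chromosomes: chromosomeptr (* one or more CHROMOSOMEs *)
  end;

var
  ecoli: organism;

begin
  (* build ORGANISM ECOLI; CHROMOSOME ECOLI; PIECE LAC; GENE LACI *)
  ecoli.key := 'ECOLI     ';
  new(ecoli.chromosomes);
  with ecoli.chromosomes^ do begin
    key := 'ECOLI     ';
    next := nil;
    new(pieces);
    pieces^.key := 'LAC       ';
    pieces^.next := nil;
    new(genes);
    genes^.key := 'LACI      ';
    genes^.home := pieces;
    genes^.next := nil
  end;

  with ecoli.chromosomes^ do
    writeln(output, 'GENE ', genes^.key, ' is on PIECE ', genes^.home^.key)
end.
```

The HOME pointer is where the picture stops being a pure tree: a GENE hangs from its CHROMOSOME but also refers sideways to a PIECE.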
In this schema you can see that ORGANISMs have one or more CHROMOSOME branches. Once again, the term CHROMOSOME is intended to be somewhat flexible. In Delila it means a complete biological unit of nucleic acid, either DNA or RNA. For example, we refer to both the CHROMOSOME ECOLI (the 5 million base one) and the CHROMOSOME PBR322 (the 4.3kb plasmid). Notice that real-world chromosomes are "inside" their organism. In the same way, one can think of CHROMOSOMEs as being inside their ORGANISM and ORGANISMs as being inside a library. You may think of a Delila Library either as a tree or a series of objects, one nested inside the other. A little reflection will show that these are equivalent because one can convert from one form to the other.

Every ORGANISM and CHROMOSOME has a name by which it can be identified. For example, T4 is the name of the coliphage of rII fame, while ECOLI is the name for Escherichia coli. There is other information stored at these branch points as well. An ORGANISM tells us the genetic map units used, such as centiMorgan or kilobasepair. The CHROMOSOME goes on to specify the beginning and ending of the corresponding chromosome in the given units.

Now we will delve inside a CHROMOSOME. There are MARKERs, TRANSCRIPTs, GENEs and PIECEs. What is going on? So far we have been leaning toward a description of an ideal situation where all the nucleic-acid sequence information of a chromosome would be stored inside a single data object -- a PIECE. Although this fits small phages such as PHIX174 and FD, it is nowhere near true even for ECOLI. There are many disconnected fragments of E. coli sequence now known. As sequencing progresses, the fragments will connect more and more until the entire sequence is known. So a PIECE may be either the entire sequence information in a CHROMOSOME or only one of many fragments. In this way we can store sequences in their natural arrangement, and still accommodate data that is fragmented due to technical limitations.
As more sequence is obtained, the SEQUENCE inside a PIECE is extended or fused to neighboring PIECEs. Like all the other library objects, a PIECE has a name, usually related to its biological functions. To keep all the fragments straight, each PIECE tells its location on the genetic map. The nucleic-acid sequence is stored inside a SEQUENCE, written 5' to 3'. Besides these data, each PIECE stores a useful set of information: a coordinate system. For the purposes of identification, every published sequence is given a set of consecutive integers corresponding to basepairs or bases along the DNA or RNA sequence. This numbering scheme is captured in the coordinates of each PIECE. Using Delila, subfragments of a PIECE can be easily obtained. These are also PIECEs and every base in the new PIECE has the same number that its parent did. This has WONDERFUL consequences: every printout can refer to the original published literature. It is also easy to compare the results from several analyses.(end of delman.use.structure.2)
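The way a subfragment keeps the numbering of its parent can be sketched in Pascal. The record layout, the procedure names, the twelve-base sequence and the coordinates 101 to 112 are all invented for this illustration; Delila stores PIECEs in its own format.

```pascal
program subpiece(output);
(* subpiece: a fragment cut from a PIECE keeps the parent numbering.
   The representation and the example data are made up. *)

const
  maxbases = 12; (* the most bases a piece can hold in this sketch *)

type
  piece = record
    first: integer; (* coordinate of the first stored base *)
    count: integer; (* number of bases stored *)
    bases: packed array[1..maxbases] of char
  end;

var
  parent, child: piece;

procedure cut(var source, fragment: piece; fromcoord, tocoord: integer);
(* copy the bases numbered fromcoord to tocoord from source into
   fragment. The fragment begins at fromcoord, so every base keeps
   the number that it had in the source. *)
var
  c: integer; (* a coordinate in the source *)
begin
  fragment.first := fromcoord;
  fragment.count := tocoord - fromcoord + 1;
  for c := fromcoord to tocoord do
    fragment.bases[c - fromcoord + 1] := source.bases[c - source.first + 1]
end;

function baseat(var p: piece; coordinate: integer): char;
(* the base found at the given coordinate of piece p *)
begin
  baseat := p.bases[coordinate - p.first + 1]
end;

begin
  parent.first := 101;
  parent.count := 12;
  parent.bases := 'GTGAAACCAGTA';

  cut(parent, child, 104, 110);

  (* the same coordinate names the same base in parent and child *)
  writeln(output, 'parent base 105: ', baseat(parent, 105));
  writeln(output, 'child  base 105: ', baseat(child, 105))
end.
```

Because the child remembers where it began rather than starting again at 1, both lines report the same base.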
Let's move on to the GENE, one of the other data-objects inside a CHROMOSOME. A GENE defines the endpoints of the genetic information of a protein in the SEQUENCE of a PIECE. For example, in ORGANISM ECOLI; CHROMOSOME ECOLI there is a PIECE LAC. The GENE LACI refers to this PIECE by pointing to the first G of the GTG and the A of the TGA. A TRANSCRIPT is similar to a GENE, but it defines any region transcribed into mRNA. For consistency, we consider a tRNA to be a TRANSCRIPT and not a GENE. GENE is reserved for the coding sequence of polypeptide products. Suppose that a mutation is known for your favorite sequence. The MARKER is designed to record the change made by the mutation. MARKERs can also record splice junctions and other interesting sequence features. In the future Delila will allow one to obtain both a sequence and its mutated forms using MARKERs. Notice that MARKERs, TRANSCRIPTs and GENEs all refer or point to a particular PIECE. Each PIECE therefore has a "family" of related branches. It is here that the tree-like structure of the library begins to break down: some of the branches are connected to one another in a kind of network. Now it is time to become practical. Obtain a copy of HUMCAT. This is a catalogue of the library, the HUMan's CATalogue. (Delila also has one for herself). Look around HUMCAT. Notice that it is organized by ORGANISM, CHROMOSOME, and so forth. Find a GENE or TRANSCRIPT that you are interested in. In the next section you will learn how to obtain it to play with.(end of delman.use.structure.3)
DELILA - THE LANGUAGE

WHY WRITTEN INSTRUCTIONS?

One of our major design decisions was the use of written instructions for the librarian. While we realize that this is somewhat forbidding to a new user, it does have several advantages over direct interactive use. One is that it is easier to correct mistakes in the list of sequences that are to go into the book than it is to change sequences by hand. Corrections to instructions are done with a text editor. Also, the amount of information necessary to obtain a fragment of sequence is usually less than the information in the sequence itself, so storing instructions instead of sequences is efficient. Another advantage is that a complete and concise record may be kept. As we will see later, the instructions can also be generated by auxiliary programs, allowing one to automate many complex manipulations.

WHAT IS THE DELILA LANGUAGE?

This section describes the use of the language Delila: DEoxyribonucleic-acid LIbrary LAnguage. The language is not as complex or comprehensive as a natural language such as English or French. It was designed for a particular task: telling a nucleic-acid data base manager - the librarian - the set of fragments that one wants to collect for study. (The name Delila is an anachronism that we can't bear to part with...)

Since the library is structured like a tree, the language must allow one to specify individual branches. Eventually a particular PIECE will be identified, and one can request one or more fragments from the PIECE. Let us look at an example:

   TITLE "EX1: THE LACI GENE";
   ORGANISM ECOLI;
   CHROMOSOME ECOLI;
   GENE LACI;
   GET ALL GENE;

(Note: this instruction set is kept in the file EX1IN, so you can try it. All EXn examples are sent with the Delila System.)

Statements in Delila end with a semicolon (;) - there are five statements above. The first statement will give a title to the book. The next three specify a particular GENE in the library structure.
One thinks of this as a series of steps climbing the library tree. Starting at the "root" of the library, we first named the ORGANISM ECOLI. This moves us out to that ORGANISM. Then the CHROMOSOME was chosen to be ECOLI - the main chromosome (as opposed to a plasmid such as PBR322). Next, the particular gene, lacI, is specified by "GENE LACI;". As we noted in the section on structure, GENEs point to the particular PIECE that they reside on. GENE LACI points to the PIECE LAC. Although we need not know this for the request, Delila knows it automatically. When the GET is performed, Delila will obtain the sequence of lacI from the G of the GTG through the A of the TGA. After Delila has read each of these statements, the information about the object (ORGANISM, CHROMOSOME or GENE) is put into the book. The GET generates a PIECE that is also placed into the book.(end of delman.use.language.1)
TRY IT OUT

Type a file containing Delila instructions that specify the gene you chose at the end of the section on library structure. For this discussion, we will use the name EX1IN, although you may use another name. Find the entry on Delila (DESCRIBE.DELILA) in the back of this manual and run it:

   delila(ex1in,ex1bo,ex1dl)

Look at the ex1dl file. This is the Delila Listing. The first line will look like this:

   82/01/21 23:17:51 DELILA 1.20 PASS 1 PAGE 1

Delila performs two passes through the instructions. Pass 1 checks for spelling and syntax errors. If you made a typing mistake, it will be noted in the listing and Delila will not begin Pass 2. Should Pass 1 be successful, then Pass 2 begins.

Notice that there are several lines that look something like this:

   * 81/01/18 22:29:26, 80/11/19 22:17:46, LIBRARY 1: BACTERIOPHAGE
   * 81/01/18 22:29:26, 80/11/19 22:17:46, LIBRARY 2: E. COLI AND S. TYPHIMURIUM

These are the full titles of the libraries from which you are pulling sequences. Each title has three parts separated by commas:

   1) the instant (date and time in descending order) that the library was created.
   2) the instant that the PARENT of this library was created.
   3) the title of the library.

Notice that Delila also prints the current date and time at the top of the listing (if your system has these functions). The first line of a book or library contains its full title. For this example, this is:

   * 82/01/21 23:17:51, 81/01/18 22:29:26, EX1: THE LACI GENE

What is the "genealogy" of the book that you obtained?

Back to the listing, Pass 1. The instructions that you typed are repeated on the listing. To the left are two columns of numbers - the leftmost is the line number and the next is the statement number (there can be several statements on one line or one line may contain only part of a statement). This information is sometimes useful.

Now let's look at the listing, Pass 2.
Notice that the instructions that you typed are repeated again, but that there are extra lines inserted. In Pass 1 Delila checked for typing errors, while in Pass 2 Delila pulls out data items and places them into the book. As each item is put into the book, it is given a number:

   2 2 ORGANISM ECOLI;
   #1

This is useful for some auxiliary programs. We will discuss control of the numbering in a later section. If your instructions worked then there will be two other numbers just below the get:

   5 5 GET ALL GENE;
   #4 ^29^1111

These numbers show you the numbers of the beginning base (29) and the ending base (1111) for the PIECE put into the book.
(end of delman.use.language.2)
RANGE DEFAULTS

It is quite possible that you got an error message at this point:

   4 4 GENE LACZ;
   5 5 GET ALL GENE;
   #4 ^1234^100000
   ---ERROR(S)---------------------------^206^203
   203: OUT OF RANGE AND DEFAULT RANGE = HALT
   206: WE DO NOT KNOW THIS LIMIT (A WARNING)

This indicates that only part of the gene you are interested in exists in the library. Delila detects the fact that one end of the GENE goes off the end of its PIECE, and says that this limit (the end of the gene) is unknown. (This is indicated by the 100000.) Normally Delila will HALT when this situation is discovered. You can change this by using the instruction:

   DEFAULT OUT-OF-RANGE REDUCE-RANGE;

anywhere before the problem but after the TITLE. This resets the default response to an out of range situation. In REDUCE-RANGE mode, Delila will attempt to find the closest edge of the PIECE and use that. The listing will show a record of what Delila does:

   6 6 GET ALL GENE;
   #4 ^1234^100000^1419
   ---ERROR(S)---------------------------^206^208
   206: WE DO NOT KNOW THIS LIMIT (A WARNING)
   208: OUT OF RANGE AND DEFAULT RANGE = REDUCE (A WARNING)

In this case the PIECE in the book begins at 1234 and ends at 1419. To cause Delila to continue without putting any PIECE down in the book one would use:

   DEFAULT OUT-OF-RANGE CONTINUE;

You may use several default statements to affect how Delila responds. To reset the default to halting, use HALT instead of CONTINUE or REDUCE-RANGE. (See DELMAN.USE.CONTROL)

Use the programs COUNT and LISTER to look at your book.
(end of delman.use.language.3)
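The three responses to an out of range request can be pictured with a short Pascal sketch. It is an interpretation of the behavior described above, not Delila's own code: the type and procedure names are invented, REDUCE-RANGE is modeled as simple clipping to the ends of the PIECE, and the PIECE is assumed (hypothetically) to begin at base 1. Only the numbers 1234, 100000 and 1419 come from the listing.

```pascal
program rangedemo(output);
(* rangedemo: three ways to respond to a request that runs off the
   end of a PIECE. A sketch of the idea only; names are invented. *)

type
  rangedefault = (haltrange, continuerange, reducerange);

var
  frombase, tobase: integer; (* the requested fragment *)
  placed: boolean;           (* true if a PIECE goes into the book *)

procedure applyrange(mode: rangedefault;
                     piecebegin, pieceend: integer;
                     var first, last: integer;
                     var ok: boolean);
(* check the request first..last against the piece. It is assumed
   that first <= last and that the request overlaps the piece.
   ok is returned false if no PIECE should be put into the book. *)
begin
  ok := (first >= piecebegin) and (last <= pieceend);
  if not ok then
    case mode of
      haltrange:
        writeln(output, 'out of range and default range = halt');
      continuerange:
        writeln(output, 'out of range: request skipped');
      reducerange:
        begin
          (* use the closest edge of the piece *)
          if first < piecebegin then first := piecebegin;
          if last > pieceend then last := pieceend;
          ok := true;
          writeln(output, 'out of range and default range = reduce')
        end
    end
end;

begin
  frombase := 1234;
  tobase := 100000; (* an unknown limit, as in the listing *)
  applyrange(reducerange, 1, 1419, frombase, tobase, placed);
  if placed then
    writeln(output, 'PIECE from ', frombase:1, ' to ', tobase:1)
end.
```

With the REDUCE-RANGE mode the request is trimmed to the span 1234 to 1419, the same span as in the second listing; with either of the other two modes no PIECE would be reported.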
MORE ON INSTRUCTIONS There are several ways to obtain sequences in a book. For example one could use: TITLE "EX2: AN ABSOLUTE GET"; (* FIRST WE WILL SPECIFY THE LAC PIECE: *) ORGANISM ECOLI; CHROMOSOME ECOLI; PIECE LAC; (* NEXT WE WILL REQUEST A PARTICULAR FRAGMENT OF THAT PIECE: *) GET FROM 29 (* THE BEGINNING ABSOLUTE POSITION *) TO 1111; (* THE ENDING ABSOLUTE POSITION *) There are several things to note about these instructions. First, there are 5 instructions and four comments. A comment is the text between a (* and a *). You should use comments freely to document what you are doing. This is made easy by the fact that comments can extend over several lines. Delila ignores comments. Several instructions can be put on one line (the specifications, above) and one instruction can be spread over several lines (the request). The GET above defines two basepairs in the LAC sequence. The sequence between (and including) these bases is put into the book. Delila always puts sequence in the book 5' to 3'. Thus to get the complement of the instructions above, one simply uses: GET FROM 1111 TO 29; RELATIVE VERSUS ABSOLUTE REQUESTS In contrast to EX2 we could write: TITLE "EX3: A RELATIVE GET"; ORGANISM ECOLI; CHROMOSOME ECOLI; GENE LACI; GET FROM GENE BEGINNING TO GENE ENDING; In this case we did not state absolute numbers to define our book. Yet in all three examples (EX1, EX2, and EX3) the same PIECE will be generated in the book. There are two ways to define a base in a sequence. One is to give its exact coordinate as in EX2. That is called an ABSOLUTE reference. The other way is to define the distance from a fixed point, as in EX3: a RELATIVE reference. Both absolute and relative referencing have advantages and disadvantages. Using absolute coordinates allows us to pinpoint particular bases. However, Delila libraries evolve over time, and when two previously separate PIECEs are fused, only one coordinate system is kept. An absolute reference will not last. 
On the other hand, a relative reference will last because the GENE BEGINNING will always be the start of the gene no matter what happens to the actual coordinate system.(end of delman.use.language.4)
FORMS OF REQUESTS By now you may have noticed that there are two kinds of GET: GET ALL ... ; GET FROM ... TO ... ; The two positions of the FROM-TO form are independent as long as one refers to locations on the same PIECE. In absolute terms one can say GET FROM -22 TO 56; (* ABSOLUTE *) or one can make it relative to a gene beginning: GET FROM GENE BEGINNING - 10 TO GENE BEGINNING + 5; One can even write instructions relative to an absolute location: GET FROM 56 - 10 TO 56 + 5; This is to be pronounced "get from fifty-six minus ten to fifty-six plus five". We will come back to this form later. MARKERs, GENEs, TRANSCRIPTs and PIECEs all have a BEGINNING and an ENDING that you can use. For example, TITLE "EX4: NON-CODING LAC LEADER"; ORGANISM ECOLI; CHROMOSOME ECOLI; GENE LACZ; (* NOW DELILA KNOWS THE PIECE *) TRANSCRIPT LACZ; GET FROM TRANSCRIPT BEGINNING TO GENE BEGINNING -1; Notice that both a GENE and a TRANSCRIPT can be specified at the same time. AMBIGUOUS DIRECTIONS Consider the circular genome of ORGANISM G4. The numbering of the PIECE is from 1 to 5577. Suppose that you asked for: TITLE "G4 COORDINATE PUZZLE"; ORGANISM G4; CHROMOSOME G4; PIECE G4; GET FROM 1 TO 10; This is ambiguous! There are TWO PIECES that run from 1 to 10: one clockwise and the other counterclockwise. In this case Delila will supply you with the clockwise fragment. However to be more specific in one's request, one would write: GET FROM 1 TO 10 DIRECTION +; or GET FROM 1 TO 10 DIRECTION -; But there are still two other possibilities! GET FROM 10 TO 1 DIRECTION +; GET FROM 10 TO 1 DIRECTION -; Delila is capable of handling most requests like these. (Certain of the most complex cases remain to be solved.)(end of delman.use.language.5)
RESPECIFICATION What if one wanted to specify more than one "leaf" (GENE, TRANSCRIPT, or MARKER) at one time? Then one would use: TITLE "EX5: THE REGION BETWEEN LACI AND LACZ"; ORGANISM ECOLI; CHROMOSOME ECOLI; PIECE LAC; (* NOW DELILA KNOWS THE PIECE *) GET FROM (GENE LACI) ENDING + 1 TO (GENE LACZ) BEGINNING - 1; This form is called a "respecification", to distinguish it from a specification. MULTIPLE REQUESTS After Delila has completed a GET, as in the last few examples, the specifications are still in effect and one can do more GETs, change the specification, more GETs, etc: TITLE "EX6: MULTIPLE SPECIFICATION AND REQUESTS"; ORGANISM ECOLI; CHROMOSOME PBR322; GENE AMPR; GET ALL GENE; (* GET GENE OF BETA-LACTAMASE *) CHROMOSOME ECOLI; (* CHANGE SPECIFICATION *) TRANSCRIPT 16SRRNAB; GET ALL TRANSCRIPT; (* 16S RRNA *) TRANSCRIPT 23SRRNAB; GET ALL TRANSCRIPT; (* 23S RRNA *) ORGANISM PHIX174; CHROMOSOME PHIX174; (* GET TWO OVERLAPPING GENES *) GENE A; GET ALL GENE; GENE B; GET ALL GENE; WHEN DOES DELILA ACT? During Pass 2, Delila places the various items into the book. Thus as ORGANISM, CHROMOSOME, GENE or TRANSCRIPT instructions are read, they are executed immediately. This is not true for the PIECE in the example EX3 because at that point Delila does not know the endpoints of the sequence desired. Delila "knows" which PIECE you are interested in, but not what particular bases. When Delila reads the GET, the bases become apparent. You can see this in the Pass 2 listing: a PIECE is not given a number, rather the number is listed for the GET that generates the PIECE in the book. The numbers are for objects in the book, not for those in the library.(end of delman.use.language.6)
AUXILIARY PROGRAMS: LISTER AND SEARCH In the section on language, we discussed how one can use Delila to generate books containing sequences one is interested in. It is difficult to read the sequences in a book because they are in an awkward (from your viewpoint) compressed format. In everyday use, we almost never look inside a book because there is a much easier way: generate a fancy listing using the program LISTER. In the section on the Delila language you used LISTER to look at the books that you generated. (If you have not done this, then you should do it now.) Like other programs, LISTER will print sequence 5' to 3'. If you want the complement, it is easy to use Delila to obtain it. LISTER is an example of an auxiliary program. In contrast, Delila is the center of the Delila System. The purpose of Delila is the manipulation of sequence information. Other "auxiliary" programs perform tasks such as making listings or doing analyses. These programs are explained in DELMAN.DESCRIBE. The only other auxiliary program that we will discuss here is the SEARCH program. SEARCH will search a book for a simple pattern. As you will recall, books have the same structure as libraries. As SEARCH proceeds to look into an ORGANISM it will know the name of the ORGANISM: ORGANISM ECOLI; Then it will enter the CHROMOSOME: CHROMOSOME PBR322; Finally it begins to search a PIECE: PIECE PBR322; In other words, SEARCH can write Delila instructions that trace the search path. Suppose that we had told SEARCH to search for the pattern 5' AAGCTT 3' (HindIII). We also tell it that the FROM should be -5 and the TO +10. When SEARCH finds the site it can then write: GET FROM 29 -5 TO 29 +10 DIRECTION +; 29 is the position of the first A of AAGCTT in PBR322. These Delila instructions are an answer to the search! You should try this and the other auxiliary programs.(end of delman.use.auxiliary.programs)
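To make the idea of "instructions as an answer" concrete, here is a minimal Python sketch. It is not the SEARCH program: the function name search_instructions and the toy sequence are assumptions made for illustration, only exact matches on the given strand are found, and the ORGANISM, CHROMOSOME and PIECE lines are left out.

```python
def search_instructions(sequence, pattern, from_offset, to_offset, first_coordinate=1):
    """Return one GET instruction per occurrence of pattern in sequence.

    Each site is numbered by the first base of the pattern, counting the
    first base of the sequence as first_coordinate.
    """
    instructions = []
    start = sequence.find(pattern)
    while start != -1:
        position = first_coordinate + start
        # %+d always writes the sign, giving "-5" and "+10".
        instructions.append(
            "GET FROM %d %+d TO %d %+d DIRECTION +;"
            % (position, from_offset, position, to_offset)
        )
        start = sequence.find(pattern, start + 1)
    return instructions


# Toy sequence (not PBR322) with AAGCTT placed so that its first A is base 29.
toy = "G" * 28 + "AAGCTT" + "G" * 12
for line in search_instructions(toy, "AAGCTT", -5, 10):
    print(line)
```

For the toy sequence this writes the same instruction as in the text, because the site was placed at base 29 on purpose.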
DATA FLOW AND DATA LOOPS In the section on Auxiliary programs we discussed the use of the SEARCH program to locate patterns in books. The search results appear in three ways: on the screen, in a file for printing, and as Delila instructions. These instructions can be given to Delila to generate the sequences of found sites. One can view this entire process as a flow of data between one program and the next. Since this manual can not have (nice) line figures, we strongly urge you to look at the flow figures in the published papers listed in DELMAN.INTRO.DESCRIPTION. Connecting parts of the Delila system together is much like playing with tinkertoys. Data flowing in the Delila system can pass through a program several times. Our first example was the conversion of a book to a library and the subsequent extraction of book subsets. The SEARCH program provides a more complex case where searching of a book generates Delila instructions that can be used to create a new book. The new book is the set of located sequences. This cyclic string of events is called a loop. Once you are acquainted with these data flow loops you can look at the SEPA program. This program deals entirely with Delila instructions of the form: GET FROM 56 -40 to 56 +60; along with ORGANISM, CHROMOSOME and PIECE specifications. The SEARCH program produces instructions in this form. SEPA is used to separate instruction sets. For example, suppose you are interested in all the AluI (5' AGCT 3') sites that are not part of PvuII (5' CAGCTG 3') sites. You have used DELILA and SEARCH to generate two sets of instructions, ALUIMIX and PVUII. You then can use SEPA to get the set that you want: SEPA(PVUII,ALUIMIX,PVUIIO,ALUI) PVUIIO would be a reorganized non-redundant list of the PvuII instructions, and ALUI would list all AluI sites that are not PvuII sites. Both our second and third papers describe the way that we use SEPA. 
(Note: to do a search like this one must be sure that the sites are numbered the same way. The search rule for AluI would be #AGCT, while the search for PvuII would be C#AGCTG. The # symbol tells SEARCH to write the number of the following base in the instructions. This forces the SEARCH program to number the same A in the two cases.)(end of delman.use.data.flow)
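The separation of AluI sites from PvuII sites can be sketched in Python as a set difference. This is only an illustration of the idea: SEPA itself works on files of Delila instructions, whereas this sketch works directly on coordinates, and the function name numbered_sites and the toy sequence are assumptions. The "#" in each rule marks the base whose number is recorded, as in the note above.

```python
def numbered_sites(sequence, rule):
    """Find every match of a rule and number it by the base after '#'."""
    mark = rule.index("#")              # how far the numbered base is into the pattern
    pattern = rule.replace("#", "")
    sites = set()
    start = sequence.find(pattern)
    while start != -1:
        sites.add(start + mark + 1)     # 1-based coordinate of the marked base
        start = sequence.find(pattern, start + 1)
    return sites


toy = "TTAGCTTTCAGCTGTTAGCTAA"          # toy sequence, not from any library
alui_mix = numbered_sites(toy, "#AGCT")     # all AGCT sites, PvuII included
pvuii = numbered_sites(toy, "C#AGCTG")      # numbered at the same A
alui = sorted(alui_mix - pvuii)             # AluI sites that are not PvuII sites
print(sorted(alui_mix), sorted(pvuii), alui)
```

If the PvuII rule were written as #CAGCTG instead, the two sets would be numbered one base apart and the difference would remove nothing, which is why the numbering must agree.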
THE COORDINATE SYSTEM OF A PIECE In the sections on library structure and the Delila language, we kept touching on the topic of coordinate systems for PIECEs. Delila is required to maintain the numbering of sequence fragments, and a coordinate system is the means to do so. This is not a simple problem, for one must handle both linear and circular genomes. For the new user, it suffices to know that Delila can do that, and you could skip this section. Let us start with the simpler case, a linear PIECE. The SEQUENCE in the library is numbered consecutively from 1 to 100. So far so good: we need to record three pieces of information: CONFIGURATION: LINEAR BEGINNING: 1 ENDING: 100 Any subset of the PIECE such as: GET FROM 40 TO 50; will also be linear and can be handled by these three variables. Notice that one could: GET FROM 50 TO 40; to obtain a complement. In that case the BEGINNING is greater than the ENDING and the numbering decreases. What if the CONFIGURATION is CIRCULAR? Then based on our discussion about ambiguous directions, we should at least add a DIRECTION: + for linear sub-fragments. However the situation can be worse than that! Let us imagine a circular PIECE in the library. It is numbered 1 to 100 in the direction 5' to 3' of one DNA strand. We then make a request: GET FROM 10 TO 90 DIRECTION -; The PIECE to be placed in the book is 21 bases long, with descending numbers, EXCEPT for a COMPLETELY UNPREDICTABLE DISCONTINUITY where the numbering jumps from 1 to 100. Some more information about the "parent" coordinates must be stored.(end of delman.use.coordinates.1)
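The discontinuity is easy to see if one writes out the numbers. The following Python sketch walks around a circle numbered 1 to 100; the function name circular_numbering is an assumption made for illustration, and the sketch only lists coordinates, it does not reproduce how Delila stores them.

```python
def circular_numbering(first, last, direction, size):
    """List the coordinates met going from first to last on a circle 1..size.

    direction is "+" for ascending numbers and "-" for descending numbers.
    A request with first equal to last is treated as a single base.
    """
    step = 1 if direction == "+" else -1
    numbers = [first]
    current = first
    while current != last:
        # Shift to 0-based, step around the circle, shift back to 1-based.
        current = (current - 1 + step) % size + 1
        numbers.append(current)
    return numbers


fragment = circular_numbering(10, 90, "-", 100)
print(len(fragment))        # number of bases in the fragment
print(fragment)             # descending, with the jump from 1 to 100
```

The list runs 10, 9, ... 1, then 100, 99, ... 90, so a BEGINNING and an ENDING alone cannot describe it without knowing the size of the parent circle.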
The problem is to record the necessary coordinate information and to avoid becoming confused. In the Delila System, the numbering of each PIECE has two parts: a COORDINATE part and a PIECE part. The COORDINATE part defines the location of a sequenced region on the genetic map. Once that is established, the PIECE part tells what fragment is stored in the PIECE. Both parts are transmitted to the book by Delila, but the coordinate part is fixed and unchanging while the PIECE part will vary depending on the fragment. In summary so far: COORDINATE part = defines the relation of coordinates to the genetic map PIECE part = defines the relation of SEQUENCE to the COORDINATE part For the coordinate part: GENETIC MAP BEGINNING This number locates the beginning nucleotide of the coordinate system on the genetic map. We use these numbers to order the PIECEs in our Master library. The COORDINATE CONFIGURATION refers to the topological shape of the coordinates. A linear genetic map could only have PIECEs with linear coordinates. For a circular genetic map, circular coordinates may be chosen, but when only a portion of the sequence is known, each PIECE may be more conveniently handled as a linear coordinate system. A COORDINATE DIRECTION defines the orientation of the numbering system with respect to the genetic map. + means "in the same direction as", - means "in the opposite direction as". The COORDINATE BEGINNING and COORDINATE ENDING nucleotides are integers that specify the limits of the coordinate system. They are usually the ends of the largest known contiguous sequence. The BEGINNING base corresponds to the genetic map beginning, the bases are consecutively numbered, and the ENDING is always greater than the BEGINNING number. The coordinate system described above provides a framework for stating the exact numbering of the SEQUENCE in a PIECE. 
This also requires four items of information: configuration, direction, beginning and ending, all relative to the coordinate system. The PIECE CONFIGURATION may be circular only if the coordinate configuration is also circular. When the coordinates are linear, the PIECE must also be linear. The PIECE DIRECTION may be + or - with respect to the coordinates, representing homology or complementarity to the coordinate system. The PIECE BEGINNING and ENDING are the numbers of the endpoints of the SEQUENCE. Both must lie within the bounds set by the COORDINATE BEGINNING and ENDING. The BEGINNING is always the 5' end of the molecule.(end of delman.use.coordinates.2)
It turns out that this system handles all the confusing cases noted earlier. To write out the nine values of coordinates we will keep this order: (GENETIC MAP BEGINNING, COORDINATE CONFIGURATION, COORDINATE DIRECTION, COORDINATE BEGINNING, COORDINATE ENDING, PIECE CONFIGURATION, PIECE DIRECTION, PIECE BEGINNING, PIECE ENDING) The linear piece that we began this section with would be: (1,LINEAR,+,1,100,LINEAR,+,1,100) (The GENETIC MAP BEGINNING and COORDINATE DIRECTION are arbitrary.) The first subset was "GET FROM 40 TO 50;": (1,LINEAR,+,1,100,LINEAR,+,40,50) The complement: "GET FROM 50 TO 40;" is: (1,LINEAR,+,1,100,LINEAR,-,50,40) The circular PIECE is: (1,CIRCULAR,+,1,100,CIRCULAR,+,1,100) The request GET FROM 10 TO 90 DIRECTION -; would make: (1,CIRCULAR,+,1,100,LINEAR,-,10,90) You should work out the results for the other three possible requests on this circular PIECE: GET FROM 10 TO 90 DIRECTION +; GET FROM 90 TO 10 DIRECTION +; GET FROM 90 TO 10 DIRECTION -; HINT: It helps to make diagrams. The catalogue program, described in DESCRIBE.CATAL, will list the coordinate systems for pieces of a book or library in tabular format.(end of delman.use.coordinates.3)
HOW TO CONTROL THE RESPONSES OF DELILA There are several situations in which Delila manipulates the information in a library in a way that may not always be what one wants. That is, there are certain things that Delila does in the absence of any instructions. These default actions can be changed by using a special class of instructions - they are called default resets. There are four basic kinds of default (as defined in LIBDEF) but we will discuss only three of them here. OUT-OF-RANGE DEFAULT We discussed this default in the section on the Delila language (DELMAN.USE.LANGUAGE). A request may be outside the limits of a PIECE in a library for two reasons: 1) The place is outside the coordinate system and is therefore unsequenced (Delila calls it "unknown"). 2) The place is within the coordinates, but the PIECE does not extend that far in the particular library being used. In either case, Delila's actions will be based on the RANGE default: DEFAULT OUT-OF-RANGE REDUCE-RANGE; Delila will attempt to find the nearest edges of the PIECE and use these. (NOTE: there are known bugs associated with this process, although it works in almost all cases.) DEFAULT OUT-OF-RANGE CONTINUE; Delila will not place the requested PIECE in the book, and will continue to process any further instructions. DEFAULT OUT-OF-RANGE HALT; Delila will stop processing instructions. The book will not be useable by auxiliary programs. In all cases, a warning message is put into the listing. KEY DEFAULT One can use this default to prevent the information about MARKERs, TRANSCRIPTs and GENEs from going into the book. For example: DEFAULT KEY GENE OFF; will turn off printing of the GENE information. The various data items in a library will contain free form notes about the object. (You can use the REFER program to look at these.) This command can also be used to turn off the NOTEs when one wants to reduce the size of the resulting book.(end of delman.use.control.1)
NUMBERING DEFAULT In the section on language we discussed the numbering of the items going into a book. This command is used to control the numbering. One can turn it on or off: DEFAULT NUMBERING OFF; (* NOTHING FROM HERE ON WILL BE NUMBERED *) One can set numbering for particular items: DEFAULT NUMBERING PIECE; (* ONLY PIECES WILL BE NUMBERED *) DEFAULT NUMBERING TRANSCRIPT GENE; (* BOTH TRANSCRIPTS AND GENES WILL BE NUMBERED *) To make numbering more flexible, one can reset the number that the next item will get: DEFAULT NUMBERING 27; (* THE NEXT ITEM WILL BE NUMBERED 27 *) This default can be used to make sure that particular items will have the same numbers in different books. The number will be put into the notes of the item as the first line in the notes. This allows them to be easily found by auxiliary programs. NOTE INSERTION One can put one's own notes into the next object placed in the book by using: NOTE "THIS IS THE REPLICATION ORIGIN FROM PHIX174"; GET FROM ... Since this is not a default reset, it does not use the word "default". The new notes will follow the notes that were in the library. By turning off notes from the library, and using note insertion, one can replace notes in a library. Notes in PIECEs can be seen with program REFER. One can put these default or note insertion statements anywhere in a set of Delila instructions. More details on these and other commands can be found in LIBDEF. All the defaults have initial values:

   default type    initial value
   ============    ==============
   KEY             NOTE ON
                   MARKER ON
                   TRANSCRIPT ON
                   GENE ON
   OUT-OF-RANGE    HALT
   NUMBERING       ON, 1, ALL

(end of delman.use.control.2)
SEQUENCE COMPARISONS AND STRUCTURE ANALYSIS The purpose of this section is to point out auxiliary programs that can be used to compare two sequences or find structures in a sequence. Sequence comparisons can be done with DOTMAT, which forms all possible pairs between sequences in two books. For each pair, one sequence is put on the X axis of a coordinate system and the other is on the Y axis. Both 5' ends are at the origin and X runs down the printout page while Y runs across the page. (Simply rotate the page 90 degrees counter-clockwise to get standard Cartesian coordinates.) The sequences are compared for complementarity at each possible (X,Y) pair formed between the two sequences. A "dot" is placed at a coordinate if pairing can occur. Notice that the display will be symmetrical around the line Y = X. Long stretches of pairing will run on diagonals (along segments of lines Y = -X + C). To look for homology using DOTMAT, use DELILA to obtain the complement of one of the pieces. DOTMAT produces all possible pairings. Sometimes one wants to eliminate the short helices, to make finding the longer ones easier. The pair of programs HELIX and MATRIX will do this. One can use these two programs to find overlaps between sequences obtained by shot-gun cloning. Put the complete sequence on the X axis book and 20 bases from each end of the other sequence in the Y axis book. Search for long oligos, say 15 or longer. If there is a significant overlap, you will get a response from HELIX. Another program that can be used for comparisons is the INDEX program. With this tool you can make an index of the locations of the oligonucleotides in a book. The measure of the similarity between oligonucleotides in the final alphabetized list of oligos is related to sequence homologies. This method is extremely powerful. MATRIX/HELIX vs INDEX MATRIX/HELIX advantage: The 2 dimensional plot is easy to look at. disadvantage: It is slow. 
For two sequences M and N bases long, a dot matrix operation takes MxN operations. It is so-called Order N Squared in computation time since the time to compare a sequence with itself is a function of the square of the sequence length. INDEX advantage: It is fast, since the sorting algorithm is order NlogN. disadvantage: One can't get a feeling for the results easily. One method is to mark listings made with LISTER.(end of delman.use.comparison)
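A dot matrix of the kind described above can be sketched in a few lines of Python. This is not DOTMAT: the function name dot_matrix is an assumption, and only the pairs A-T and G-C are allowed to pair here (the real program may allow others). The double loop makes the M x N cost visible, since every base of one sequence is compared with every base of the other.

```python
# Assumption: only Watson-Crick pairs count as "pairing can occur".
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def dot_matrix(x_sequence, y_sequence):
    """Return rows of True/False; row i, column j is True if X[i] pairs with Y[j]."""
    return [[(x, y) in PAIRS for y in y_sequence] for x in x_sequence]


def print_matrix(matrix):
    """X runs down the page and Y runs across it, as in the text."""
    for row in matrix:
        print("".join("." if dot else " " for dot in row))


# A sequence compared with itself; GAATTC can pair with itself end to end.
matrix = dot_matrix("GAATTC", "GAATTC")
print_matrix(matrix)
```

For GAATTC against itself the dots along the line Y = -X + C form one unbroken diagonal, and the picture is symmetrical around Y = X.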
HOW TO MAKE AND USE ALIGNED BOOKS WHAT IS AN ALIGNED BOOK? To perform statistical analysis on sequence sites (e.g. ribosome binding sites, promoters, splice junctions, etc.) one needs a way to align a set of PIECEs in a book. For ribosome binding sites, we have used the A of the AUG or various points in the Shine/Dalgarno. A book is aligned by choosing one base from each PIECE to be the alignment point. The alignment bases could be chosen by a list of coordinates, but we have found that there are advantages to using Delila instructions to specify the base: TITLE "EX7: ALIGNED BOOK"; ORGANISM ECOLI; CHROMOSOME ECOLI; PIECE LAC; GET FROM 29 -5 TO 29 +10; (* LACI RBS *) GET FROM 1234 -5 TO 1234 +10; (* LACZ RBS *) Here, the zero point for LACI alignment is base 29 and for LACZ it is base 1234. The "from parameter" is -5 and the "to parameter" is +10. The instructions allow one to align the book that is created from the instructions. WARNING: the instructions must follow a rigid format; this is described in DELMODS in module info.align, along with details on how to write programs using aligned books. (See also DELMAN.USE.DATA.FLOW and DESCRIBE.ALIST) AUXILIARY PROGRAMS FOR ALIGNED BOOKS After generating an aligned book (a book and an aligning instruction set) one can list it using program ALIST or obtain a histogram that tells the composition of the book at each point relative to the aligned base with HIST. A chi-squared analysis of an aligned book is done using HISTAN. GENERATING A SET OF ALIGNED RIBOSOME BINDING SITES We have provided the instructions for creating a set of aligned gene starts, in file GAIN. GAIN was originally created from instructions of the form: ORGANISM ...; CHROMOSOME ...; GENE ...; GET FROM GENE BEGIN TO GENE BEGIN +2; ... This is file GRIN (genes relative to begin instructions). The resulting book was searched (one would use SEARCH with a rule of (A/G/T)TG ) to generate the instructions in aligned form. 
GAIN was then made by replacing the from-position with the word FIRST and the to-position with LAST. To use GAIN you must first create the transcript library from file TRAIN (TRAnscript library Instructions, use DELILA with LIB1 and LIB2). Then replace FIRST and LAST with the desired range. Notice that there are a few cases, marked "SPECIAL" that you must deal with individually. Notice also, that genes that are oriented in the direction opposite the PIECE had to be set up by hand (this may be automated someday). The instructions could now be named GAIN1, and DELILA can be used to generate the aligned book. A detailed example of these operations is given in DELMAN.CONSTRUCTION.EXAMPLE.(end of delman.use.aligned.books)
USE OF THE PATTERN PROGRAMS "Perceptron" is the name given to a class of algorithms for pattern recognition with learning capabilities. Minsky and Papert have written an excellent book on the topic ("Perceptrons", MIT Press, 1969) which explores both the limitations and potentials of the method. They also prove the "Perceptron Convergence Theorem" which guarantees that a solution will be found if one exists. We have written an article (Stormo et al., 1982, Nucleic Acids Research, 10: 2997-3011) which describes our use of the algorithm to investigate translational initiation sites. The algorithm takes as input patterns which can be divided into two classes, and finds a "Weighting Function" which serves to distinguish the patterns in the two classes. More rigorously, if we encode a sequence into a string of bits, S, the algorithm attempts to find a W such that W*S >= T (some "threshold") if and only if S belongs to one class of the two classes of sequences. We mean by "*" the dot, or inner product of S and W, which are vectors of the same dimensions. If we start with two sets of sequences, S+ and S-, and an arbitrary W and T, the algorithm can be described by the following three step procedure: Test: choose a sequence S from S+ or S-, if S is in S+ and W*S >= T go to Test, if S is in S+ and W*S < T go to Add, if S is in S- and W*S < T go to Test, if S is in S- and W*S >= T go to Subtract; Add: replace W by W + S, go to Test; Subtract: replace W by W - S, go to Test. An example of this process is shown in our NAR paper (reference given above). (Note: this process can be done without goto's...) The program which implements the perceptron algorithm to work on sequences is called PatLrn. 
Other programs which use the output of PatLrn are: PatLst - a lister program for the output of PatLrn; PatAna - does some simple analyses of the output of PatLrn; PatVal - evaluates the aligned sequences in a book by the PatLrn output; PatSer - searches a book for sites which are evaluated with a given PatLrn W output to be above some user specified value.(end of delman.use.perceptron.1)
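The Test/Add/Subtract procedure can be written without goto's. The following Python sketch is one way to do it; it is not PatLrn. The names encode_bits and perceptron, the toy sequences, the starting W of all zeros and the threshold of 1 are assumptions chosen for illustration, and each base is encoded as four bits in the order A, C, G, T. Instead of choosing sequences one at a time, the sketch sweeps through S+ and then S- repeatedly and stops after a full sweep with no change.

```python
BITS = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def encode_bits(sequence):
    """Encode a sequence as a flat list of bits, four per base."""
    return [bit for base in sequence for bit in BITS[base]]


def dot(w, s):
    """Inner product W*S of two vectors of the same length."""
    return sum(wi * si for wi, si in zip(w, s))


def perceptron(positives, negatives, w, threshold, max_sweeps=1000):
    """Return (W, number of changes) once W separates the two classes."""
    w = list(w)
    changes = 0
    for _ in range(max_sweeps):
        changed = False
        for s in positives:
            if dot(w, s) < threshold:       # Add: replace W by W + S
                w = [wi + si for wi, si in zip(w, s)]
                changed, changes = True, changes + 1
        for s in negatives:
            if dot(w, s) >= threshold:      # Subtract: replace W by W - S
                w = [wi - si for wi, si in zip(w, s)]
                changed, changes = True, changes + 1
        if not changed:
            return w, changes
    raise RuntimeError("no separating W found within max_sweeps")


# Toy classes (not from the paper): S+ and S- of two 3-base sequences each.
s_plus = [encode_bits(s) for s in ("ATG", "GTG")]
s_minus = [encode_bits(s) for s in ("ATC", "CTG")]
w, changes = perceptron(s_plus, s_minus, [0] * 12, threshold=1)
print(w, changes)
```

The max_sweeps limit is there because the convergence theorem only promises an answer when a solution exists; for classes that cannot be separated the loop would otherwise never end.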
EXAMPLES FOR THE PATTERN PROGRAMS The files "exspbk" and "exsnbk" are the sets of positive and negative sequences used in the example of Figure 1 of our "Perceptron" paper (NAR 10, 2997-3011). The file "expa1" contains the initial pattern from that same example. Given these files and the program "PatLrn" you can recreate the example thusly: PatLrn(exspbk,a,exsnbk,b,pat,expa1). The file "pat" should be identical (except for the date/time) to the file "expa2" that we have provided. You can check that with the "Merge" program if you want. It is also identical to the solution pattern from the example and it keeps track of the number of changes needed to get to that solution. The files "a" and "b" are empty in this case, because we are aligning the sequences by their first bases. If we wanted to align them by any other base those files would contain the instructions which generated the sequences (see DELMAN.USE.ALIGNED.BOOK). Now use the program "PatAna" to do some simple analyses of the pattern. PatAna(pat,patan). The file "patan" is identical to the file expan2 that we provided. It contains some useful information about the pattern, such as the minimum and maximum sequence values which could be obtained from this pattern, as well as the average value expected for random sequences and a feeling for the distribution of values. The program "PatVal" will use a pattern to evaluate a book of sites. Try: PatVal(exspbk,a,pat,valp). and PatVal(exsnbk,b,pat,valn). "valp" is the evaluation of each sequence of the positive class, and "valn" is the evaluation of each of the negative class sequences. Check with the example in the paper to see that they are correct. Again the "a" and "b" files are empty because we are aligning by the first base of the sequences. The program "PatSer" will use a pattern to search through a sequence, using each base in turn as the aligned base. 
Those sites which are evaluated above some minimum, either set by the user or taken to be the minimum functional from the pattern itself, are identified. Furthermore, instructions to get those sites so identified are written to the file "inst". Try this on an example file: PatSer(exsebk,pat,val,inst). notice that when the pattern extends beyond the sequence the sites are still evaluated, but the user is notified of the over-extension. The program "PatLst" is used to make nice horizontal printings of the patterns, such as for use as publishable figures. Try this on the W51 matrix which is from the paper and which we provide. Read the page DESCRIBE.PATLST to see how to set the width of the pattern printed to a page to whatever you want.(end of delman.use.perceptron.2)
A NOTE ABOUT SIGNIFICANCE While the example we provide in the paper, and that you have just done, is convenient for demonstrating the method, separating two sets of two sequences, each five long, is in fact trivial. Try: PatLrn(exspbk,a,exsnbk,b,newpat). "newpat" is identical to "expa0" that we provided, and as you can see is not interesting. The mathematical problem of when it becomes significant that one can separate two sets of sequences is still an open problem, but we can say some things. As the number of sequences in each class gets larger the probability of separation decreases, as it does when the number of nucleotides in each sequence diminishes. As a good rule of thumb we like to have more sequences in the smallest class (usually the functional class) than there are nucleotides in any one of the sequences. Under these conditions one can be reasonably confident that a solution pattern is likely to identify features of biological significance.(end of delman.use.perceptron.3)
USE OF THE "ENCODE" PROGRAM The program Encode was written to allow a user to encode sequences into strings of integers in a flexible way. For instance, one can encode the sequences as mono-, di-, tri-, or higher oligonucleotides. One can assign specific oligos to certain positions or record only that they are within some "window" of positions. Within a window all the oligos may be counted or only some, such as only those "in frame". The program takes as input the book of sequences and the instruction set which generated it and which specifies the alignment. If the instruction file is empty then all the sequences are aligned by their first bases. The other input file, which must be non-empty, is the parameter file "EncodeP" which specifies how the sequences are to be encoded. It is the options of the parameter file which give the program its flexibility and power, and so they should be thoroughly understood. The parameter file may contain any number of individual parameter records, each of which will in turn be applied to each sequence in the book. This allows one to encode different regions of the sequences differently, or to encode one region in more than one way. 
Each parameter record has five pieces of information, each written on a separate line: line 1 - the range over which this parameter record is to operate; this line has two integers which are the bases, relative to the aligned base, for which to use this encoding; line 2 - the size of the window; the window begins at the start of the range and contains this many nucleotides in it; the number of each base, or oligo, which occurs in this window is written to the output; note that positional information within the window is lost, so that if exact position is needed the window size should be 1; line 3 - the shift to the next window; this specifies how many bases to move the window over to its next position; this is repeated until the window begins beyond the end of the range; line 4 - this specifies the coding level, and the arrangement of the bases to be coded; the coding level is the number of bases in the oligos which are encoded, i.e., 1 means monos are encoded, 2 means dis are encoded, ...; for coding levels greater than 1 the user may allow for skips between the encoded bases; for instance, one may want to encode as di-nucleotides bases which are separated by a nucleotide; this would be declared on this line by writing "2 : 1"; likewise, one could encode as a tri-nucleotide the first bases of three consecutive codons by the line "3 : 2 2", where the 3 indicates the coding level (tri-nucleotides) and the 2's represent the number of bases skipped between each encoded base; if there is no colon after the coding level declaration, all skips are assumed to be 0; line 5 - the shift to the next coding site; this allows the user to not count every occurrence of the oligos in the window, but rather to move some number of bases to the next encoded site; if all the oligos are wanted, this number should be 1. The above line information constitutes a single parameter record. The parameter file may contain any number of these records concatenated together. 
Each sequence will be encoded by the entire list of parameter records and the resulting string of integers will be written to the "EncSeq" file. The encoded string for each sequence ends with a special "end of sequence" symbol, which is listed in the file header. For examples of how this program works see "DELMAN.USE.ENCODE.2".(end of delman.use.encode.1)
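The five-line record layout described above can be sketched in a few lines of Python. This is only an illustration of the file format as the text describes it, not the Encode program's own reader: the function name parse_encodep and the dictionary keys are hypothetical, and the sketch assumes that blank lines may be ignored and that no comments follow the parameters.

```python
def parse_encodep(text):
    """Split EncodeP-style text into parameter records (illustrative only)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or len(lines) % 5:
        raise ValueError("EncodeP must hold whole five-line records")
    records = []
    for i in range(0, len(lines), 5):
        range_line, window, window_shift, coding, coding_shift = lines[i:i + 5]
        # line 1: two integers, the range relative to the aligned base
        start, stop = (int(tok) for tok in range_line.split())
        # line 4: coding level, optionally followed by ": skip skip ..."
        level_text, _, skip_text = coding.partition(":")
        level = int(level_text)
        # no colon means all skips are 0
        skips = [int(tok) for tok in skip_text.split()] or [0] * (level - 1)
        if len(skips) != level - 1:
            raise ValueError("need one skip between each pair of encoded bases")
        records.append({
            "range": (start, stop),
            "window": int(window),
            "window_shift": int(window_shift),
            "level": level,
            "skips": skips,
            "coding_shift": int(coding_shift),
        })
    return records


recs = parse_encodep("-10 -6\n5\n5\n2\n1\n0 11\n12\n12\n3 : 2 2\n3\n")
print(recs[1]["level"], recs[1]["skips"])
```

The second record in the usage lines is the "3 : 2 2" case from the text: coding level 3 with two bases skipped between each encoded base.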
EXAMPLES OF USING THE "ENCODE" PROGRAM

The files "ExEncIn" and "ExEncBk" contain the sequence around the beginning of the rIIB gene of T4, and the instructions which align this sequence by the ATG of the gene. The aligned sequence looks like:

---                   ++
111--------- +++++++++11
210987654321012345678901
........................
ATAAGGAAAATTATGTACAATATT

Notice that the 0 base is the A of the ATG (this is what we aligned by) and that our sequence contains the 12 preceding bases and the 11 following. This is through the fourth amino acid of the protein.

If we wanted to encode only the mono-nucleotides of the initiation codon we would make our parameter file:

0 2
1
1
1
1

This would give the encoding:

1 0 0 0 0 0 0 1 0 0 1 0 -1

Notice the -1 which specifies the end of the encoded sequence. Each group of 4 integers before that specifies which base occurs at one of the three encoded positions. The A is encoded as 1 0 0 0, the T as 0 0 0 1, and the G as 0 0 1 0.

If we wanted to know the number of each mono-nucleotide in this whole region and we didn't care about their positions, we would encode as:

-12 11
24
24
1
1

This would give the encoding:

12 1 3 8 -1

Notice that this is really just the composition of the sequence, since our window covers the entire sequence. We could get the di-nucleotide composition with the parameters:

-12 11
24
24
2
1

and get the encoding:

5 1 1 5 1 0 0 0 1 0 1 1 4 0 1 2 -1

Notice that this encoded string is a vector of 16 integers (up to the end of sequence mark, -1). The number in each element of the vector is the number of each di-nucleotide in the sequence, in the order AA,AC,AG...TC,TG,TT.

Examples continued in DELMAN.USE.ENCODE.3.(end of delman.use.encode.2)
Examples of using the "encode" program, continued from DELMAN.USE.ENCODE.2.

To encode the di-nucleotide composition of the Shine and Dalgarno region and also the mono-nucleotides of the coding sequence, each in its own position, we would make this list of parameters:

-10 -6
5
5
2
1
0 11
1
1
1
1

This would give us the encoding:

2 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 -1

Here the first 16 integers are the di-nucleotide composition of the Shine and Dalgarno region, and appended to that are the mono-nucleotide encodings for each position of the coding sequence.

We could get the di-nucleotides of successive codon first positions by:

0 11
12
12
2 : 2
3

or we could get the codon composition by:

0 11
12
12
3
3

or we could get the di-nucleotide encoding of the first and last position of each codon, including the position of the codon by:

0 11
3
3
2 : 1
3

These are left as exercises to the user, and it is encouraged that the user make up other tests and try them until this program is easy to use.(end of delman.use.encode.3)
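The worked rIIB examples can be checked with a small Python sketch of the windowing scheme. It is not the Encode program itself: the function name encode and the tuple layout of a record are hypothetical, and two behaviours are assumptions inferred from the examples rather than stated rules. First, an oligo is counted when its first base lies in the window, even if it runs past the window end. Second, an oligo that runs off the end of the sequence is not counted.

```python
from itertools import product

BASES = "ACGT"
END_OF_SEQUENCE = -1  # the end mark used in the manual's examples


def encode(seq, first_pos, records):
    """Encode one aligned sequence (illustrative sketch).

    seq       -- bases of the sequence
    first_pos -- coordinate of seq[0] relative to the aligned (0) base
    records   -- tuples (start, stop, window, window_shift,
                 level, skips, coding_shift); skips may be ()
    """
    out = []
    for start, stop, window, window_shift, level, skips, coding_shift in records:
        if window_shift < 1 or coding_shift < 1:
            raise ValueError("shifts must be at least 1")
        skips = list(skips) or [0] * (level - 1)
        # offsets of the encoded bases from the first base of the oligo
        offsets = [0]
        for skip in skips:
            offsets.append(offsets[-1] + skip + 1)
        # column order AA, AC, AG, ... TT (shown here for level 2)
        column = {"".join(p): i
                  for i, p in enumerate(product(BASES, repeat=level))}
        w = start
        while w <= stop:  # stop once the window begins beyond the range
            counts = [0] * len(column)
            for pos in range(w, w + window, coding_shift):
                idx = [pos - first_pos + off for off in offsets]
                if idx[0] < 0 or idx[-1] >= len(seq):
                    continue  # assumption: oligo off the sequence is skipped
                counts[column["".join(seq[i] for i in idx)]] += 1
            out.extend(counts)
            w += window_shift
    out.append(END_OF_SEQUENCE)
    return out


RIIB = "ATAAGGAAAATTATGTACAATATT"  # positions -12 to +11

print(encode(RIIB, -12, [(0, 2, 1, 1, 1, (), 1)]))
print(encode(RIIB, -12, [(-12, 11, 24, 24, 2, (), 1)]))
```

Under these assumptions the sketch gives the same integer strings as the initiation codon, composition, and Shine and Dalgarno examples. The exercises can be tried the same way, for instance with the record (0, 11, 12, 12, 2, (2,), 3).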
In addition to Delila, there are at least two other generally available large nucleic acid sequence data bases. The DB program system handles both the European Molecular Biology Laboratory (EMBL) libraries and those of the Genetic Sequence Databank (GenBank(TM)). If you want to contact someone who helps operate these data bases use the following addresses:

GenBank
c/o Computer Systems Division
Bolt Beranek and Newman Inc.
10 Moulton St.
Cambridge, Ma. 02238 USA

Graham Cameron
European Molecular Biology Laboratory
Postfach 10.2209,
D-6900 Heidelberg, West Germany

The DB program system is a small set of programs. DBcat prepares catalogs for DBpull. DBpull extracts part or all of an entry of either EMBL or GenBank format. DBbk converts database entries into the Delila book form that Delila programs use. All of these programs handle both data base formats even when both occur together in the same library.

At this point, please obtain some sample library entries from both data bases and look them over. EMBL and GenBank libraries are arranged in series of entries, each entry possessing a unique entry id, a nucleic acid sequence, and other miscellaneous information. Most of the lines in the libraries start with a word or abbreviated code that indicates what kind of information the line contains. The following definitions will clarify these points.

Library definitions:

Entry: An entry starts with a line which begins with an "ID" (EMBL) or a "LOCUS" (GenBank). All subsequent lines are part of the entry until the line that contains simply "//". "//" is the entry terminus code for both data bases.

Entry id: On the first line of each entry, after the "LOCUS" or the "ID", come a few spaces and then a weird looking word or code that may or may not resemble a familiar biological name. This is the entry id. It is the name the entry is known by and it is what DBpull uses to identify which entries it will extract.

Line codes: The phrases "ID" and "LOCUS" are line codes.
There are other line codes in each entry such as "REFERENCE" and "ORIGIN" in GenBank and "DE" and "SQ" in EMBL. Some lines do not have a code and some have one, but it is indented. Other lines have codes, but there is no other information on the line. These special cases will be discussed below in the definition of line code request instructions.

Now that you are familiar with the data bases you can understand the DBpull instruction set. Each instruction takes up only one line. Each line does one of two things: either it indicates what entry type (GenBank or EMBL) is requested on the following lines, or it makes an actual request for part or all of an entry identified by its entry id. Please note that the following definitions will be made clearer by referring to the examples that follow.(end of delman.use.dbpull.define)
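The library definitions above (an entry runs from an "ID" or "LOCUS" line through a "//" line, and the entry id is the word after the line code) can be sketched in Python. This is an illustration only and not how DBpull is written. The function names and the sample entries are made up for the example, and the sample lines only imitate the general shape of library entries.

```python
def split_entries(lines):
    """Group library lines into entries: an ID/LOCUS line through '//'."""
    entries, current = [], None
    for line in lines:
        words = line.split()
        if current is None:
            # wait for the start of the next entry
            if words and words[0] in ("ID", "LOCUS"):
                current = [line]
        else:
            current.append(line)
            if line.strip() == "//":  # entry terminus for both data bases
                entries.append(current)
                current = None
    return entries


def entry_id(entry):
    """The entry id is the word following ID or LOCUS on the first line."""
    return entry[0].split()[1]


# Made-up sample lines, for illustration only.
sample = [
    "ID   EXAMPLE1   hypothetical EMBL-style entry",
    "DE   made-up description",
    "SQ   made-up sequence header",
    "     acgtacgt",
    "//",
    "LOCUS       EXAMPLE2   hypothetical GenBank-style entry",
    "ORIGIN",
    "        1 acgtacgt",
    "//",
]

for entry in split_entries(sample):
    print(entry_id(entry), len(entry))
```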
Note: Instructions are entirely upper case because the computer system on which DBpull was designed required it.

Instructions that determine entry request type of succeeding lines:

EMBL: This indicates that requests for entries somewhere in the EMBL libraries will be on the following lines.

GENBANK: Same for requests found in the GenBank libraries.

GENB: Same as "GENBANK".

Instructions that tell which entries are to be pulled:

Entry id: An instruction line beginning with an entry id will pull part or all of that entry. The parts extracted will depend on which of the "instructions that define extraction" (defined below) follows the id on the same line.

Wildcard id: This request looks like an entry id request but somewhere in the entry name are one or two "*" symbols. The "*" represents any number of unspecified characters. It may be inserted at the beginning of the id, at the end, or at both the beginning and the end but not the middle. (Confused? See instruction example 3 below.)

EVERY: The word "EVERY" at the start of a request line calls for every entry of a particular entry type. (See instruction example 4.)

Instructions that define extraction:

Line codes: Following the instruction that tells which entry or entries are to be pulled, on the same line, come instructions that structure the extraction. One or more line codes occurring in this space will result in the lines of the entry which have matching codes being pulled. GenBank line codes are actually words. The full word or an abbreviation will work, but the abbreviation cannot be shorter than 3 letters. "LOC", for instance, will pull the "LOCUS" line while "LO" would not. When there are one or more lines in the entry directly below a pulled line that either do not possess a line code, possess indented codes, or possess the code "xx", these additional lines will be extracted also.

RAW: Instead of line codes one can simply insert the word "RAW".
This will pull only the sequence of the entry without origin or coordinate labels. The sequence will end with a "." to separate it from other sequences and to make it suitable for input into Makebk. (see delman.describe.makebk) Also, if the first request of fin is "RAW", fout will have no dateline and therefore it will not make a suitable secondary data base for DBpull. ALL: Instead of "RAW" or line codes the word "ALL" will result in an entire entry being extracted.(end of delman.use.dbpull.instructions)
Instruction examples (DBpull input file Fin)

Example 1:

EMBL
ADCXXX ID DE SQ
GENBANK
M13 LOC REFERENCE
ANABANIFH LOCUS

Comments: The first and third lines indicate what types of entries are requested on the following lines. If, for instance, M13 were an EMBL entry this set of instructions would not find it.

Example 2:

GENB
T7 RAW
MS2 ALL

Comments: The two requested ids are not in alphabetical order and the DBpull output file fout will have the same order as the requests.

Example 3:

EMBL
*RNA SQ ID
*RNA* ID SQ
GENB
M* ORI SITES
GOOGOOGAGA ALL
T7 RAW

Comments: The character "*" is a wildcard; it represents any number of unspecified characters. The first request will grab any entry whose id ends in "RNA", the second any one that has "RNA" anywhere in it, and the third any id which starts with an "M". The fourth request is a joke and, like any other non-existent id, will yield a "not found" message and then halt the program. If there were no GenBank entry ids beginning with "M", a "not found" would appear but DBpull would not halt because this id request is a wildcard. The logic behind this distinction is that wildcards are used to search for the possible existence of an entry, but regular ids are used only for entries that are well known by the user. Note that "ORI" (origin) pulls sequence in GenBank and "SITES" tells you where the genes and other features are. "SQ ID" and "ID SQ" are equivalent; lines are pulled in the order that they occur.

Example 4:

EMBL
EVERY ID
GENB
EVERY LOC

Comments: This example would make a catalog for users of the entire EMBL and GenBank data bases. The catalog would be alphabetical because the catalog files used by DBpull (produced by DBcat) are presorted. If "catalogs for humans" are provided with your libraries do not try this example; it is very expensive. If you do try it, you might want to request additional line codes to "LOC" and "ID" for a more informative catalog.(end of delman.use.dbpull.examples)
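The wildcard and abbreviation rules used in these examples can be written out as a short Python sketch. It only restates the rules given in the text ("*" at the beginning, the end, or both, but not the middle; GenBank line code abbreviations of at least 3 letters) and is not taken from DBpull. The function names are hypothetical.

```python
def id_matches(request, entry_id):
    """Does a requested id, possibly with '*' wildcards, match an entry id?"""
    if "*" not in request:
        return request == entry_id
    core = request.strip("*")
    if "*" in core:
        raise ValueError("'*' may not appear in the middle of an id")
    at_start = request.startswith("*")
    at_end = request.endswith("*")
    if at_start and at_end:
        return core in entry_id          # *RNA*  : "RNA" anywhere
    if at_start:
        return entry_id.endswith(core)   # *RNA   : ends in "RNA"
    return entry_id.startswith(core)     # M*     : starts with "M"


def genbank_code_matches(request, line_code):
    """A GenBank line code may be abbreviated, but to no fewer than 3 letters."""
    return len(request) >= 3 and line_code.startswith(request)


print(id_matches("*RNA", "TRNA"), id_matches("M*", "M13"))
print(genbank_code_matches("LOC", "LOCUS"), genbank_code_matches("LO", "LOCUS"))
```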
Use of the Search Program

i. Searching DNA Sequences for Particular Strings

The search program works on books of sequences. Any search pattern will be looked for in each sequence of the book. Search patterns consist of strings of nucleotides, such as 'aatggct'. You may also specify ambiguous patterns, such as 'a or g', in either of two ways: '(a/g)' or 'r'. All possible ambiguities can be requested either way. From within the search program type 'l' to see the list of one-letter codes for each ambiguous base combination. One can also include in the search positions for which you don't care what the base is, indicated by 'n'. For instance, 'anc' would search for a and c separated by any base. One can also use 'e' (for extension) to vary the spacing between specified regions. The 'e' is treated both as an 'n' and as nothing. For example, 'aec' would search for both 'anc' and 'ac'. We used this feature to search for 'shine and dalgarno' sequences before 'atg's by specifying 'gga5n4eatg'. This means 'gga followed by 5 to 9 unspecified bases followed by atg'.

One can search for strings which are close to the specified sequence by allowing mismatches. This is done by typing 'm' as a search command, and then specifying how many mismatches are allowed. If there are regions within the specified sequence where you want no mismatches, this is stated by enclosing that region between '<' and '>'. For example, if mismatches were set to 1 and the pattern searched were 'aat<ggc>t', then the 'ggc' must be found exactly, but the rest of the pattern need only be within one of a perfect match.(end of delman.use.search.1)

The search program returns to you the positions of the matches found in the book. Unless otherwise specified, the position corresponds to the first base of the pattern. However, one can ask for the position to be another base by preceding that base by '#'.
For example, 'aa#atggct' would return as the position of the match the 'a' of the 'atg'. It is also possible to make searches for relations between bases. Six relations are allowed: identity (i); non-identity (ni); complementarity (c); non-complementarity (nc); complementarity including g-t pairs (w); and non-complementarity including g-t pairs (nw). Relational searches are specified by first the symbol '^', followed by the pattern position this base is to be related to, followed by the relation. For example, 'n^1i' would find all sites in which there is a repeated base (aa, cc, gg or tt). Notice that the base to which the relation refers must precede the point of the relation in the pattern. Searching for the pattern '5n^1c' would find sites of complementary bases separated by 4 unspecified bases. More information on search patterns and other commands in general can be obtained by typing 'help' while in the program.
ii. Creating Delila Instruction Files The search program also allows one to create instruction files so that the located sites may be put into a book for further analysis. This is especially useful when you want to include in the analysis regions around the sites. For instance, you could set the 'from' distance to -60 and the 'to' distance to +40. Then by searching for 'gga5n4e#atg' you would get the instructions necessary to obtain the sequences from -60 to +40 around the atg's which are preceded by Shine and Dalgarno sequences. Help on using this feature of the program can be obtained by typing 'd help' while in the program.(end of delman.use.search.2)
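Part of the pattern language can be sketched in Python by translating a pattern into a regular expression. This is a sketch under stated assumptions, not the search program's own method. It handles only plain bases, 'n', 'e', repeat counts, '(a/g)' style alternatives, the code 'r', and the '#' position marker. Mismatches and relational searches are left out, positions are reported as 0-based offsets into the sequence rather than Delila coordinates, and only one spacing is reported per starting point. The names to_regex and find_sites are hypothetical.

```python
import re

# optional count + symbol, or a parenthesised alternative, or the '#' marker
TOKEN = re.compile(r"(\d*)([acgtrne])|\(([acgt](?:/[acgt])*)\)|(#)")


def to_regex(pattern):
    """Translate a subset of the search pattern language to a regex string."""
    parts, pos = [], 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError("unsupported symbol %r at %d" % (pattern[pos], pos))
        count, sym, alts, mark = m.groups()
        if mark:
            parts.append("(?P<at>)")          # empty group marks the position
        elif alts:
            parts.append("[" + alts.replace("/", "") + "]")
        else:
            k = int(count) if count else 1
            if sym == "e":                     # an 'n' or nothing, k times
                parts.append("[acgt]{0,%d}" % k)
            else:
                cls = {"n": "[acgt]", "r": "[ag]"}.get(sym, sym)
                parts.append(cls if k == 1 else "%s{%d}" % (cls, k))
        pos = m.end()
    return "".join(parts)


def find_sites(pattern, seq):
    """Return 0-based positions of matches, honouring the '#' marker."""
    rx = re.compile("(?=" + to_regex(pattern) + ")")  # lookahead: overlaps ok
    marked = "at" in rx.groupindex
    return [m.start("at") if marked else m.start()
            for m in rx.finditer(seq.lower())]


print(to_regex("gga5n4eatg"))
print(find_sites("gga5n4e#atg", "ATAAGGAAAATTATGTACAATATT"))
```

In the usage lines the sequence is the rIIB example from DELMAN.USE.ENCODE.2. The reported offset 12 is the A of the ATG, which is position 0 in that example's coordinates.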
CONSTRUCTION(end of delman.construction)
CONSTRUCTION OF DELILA LIBRARIES

Introduction

This section assumes that you are familiar with DELMAN.USE. Construction of a Delila System Library involves several steps:

- Entry of the raw sequence data (twice)
- Correction of the sequences
- Gathering of the information about the sequences
- Creation of a "module" for insertion into the library (not the same module type as the ones used by program Module.)
- Insertion of the module
- Construction of a catalogue
- Checking that the library is correct.

When you are gathering the data to create part of a library (the library insertion module) you may find the forms in DELMAN.CONSTRUCTION.FORM useful. Use the Module program to make as many copies as required.

NOTES FOR TRANSPORTATION

Since the libraries that we send you have already been checked, you need only run the CATAL program (as discussed below) to generate the catalogues for these libraries. After that, Delila can be used.(end of delman.construction.intro)
MORE ON LIBRARY STRUCTURE - LOGICAL VS PHYSICAL STRUCTURE

In DELMAN.USE.STRUCTURE we discussed the structure of a Delila Library. The descriptions were about how the parts are connected, and what is inside each part. This is the logical structure of the data base. We did not discuss the details of how a library is actually constructed, because it is not necessary to know these things when working with the Delila System. The description of these details is the description of the physical structure of the data base. Since we do not yet have an extensive set of tools for constructing Delila Libraries, it is necessary to describe the physical structure enough so that you can build your own libraries. Because these details are rigorously stated in LIBDEF, most things are automated by program Makebk, and Catal does lots of checking, we will only discuss the general concepts here.

The logical structure of a library follows the schema shown in LIBDEF or DELMAN.USE.STRUCTURE. This structure is a two dimensional net. Libraries are implemented physically in files, and so are linear structures. If we exclude for the moment the references to a PIECE by MARKERs, TRANSCRIPTs and GENEs, then the library structure is a tree. Any tree can be represented as a nested series of objects in linear order:

ORGANISM (open parenthesis for an ORGANISM)
   CHROMOSOME (open parenthesis for a CHROMOSOME)
      GENE (open parenthesis for a GENE)
      GENE (close parenthesis for a GENE)
      PIECE (open parenthesis for a PIECE)
      PIECE (close parenthesis for a PIECE)
   CHROMOSOME (close parenthesis for a CHROMOSOME)
ORGANISM (close parenthesis for an ORGANISM)

If you look at any book (eg. EX0BK) or library (eg. LIB1) you will see this structure. Lines in a library either define the structure or are chunks of data (attributes). Attributes are signaled by an asterisk (*) as the first character on the line.

We must now allow various objects to refer to PIECEs. This is done by a reference to the name of the PIECE.
For example, one of the attributes in a GENE is the name of the PIECE that the GENE is on. (In cases where the GENE spans two PIECEs, we use two GENEs.) To simplify the operation of the CATAL program (to be described later) we have added one more rule. All objects that refer to a particular PIECE are called the "FAMILY" of the PIECE. The rule is that a FAMILY precedes its PIECE in the physical (file) implementation.(end of delman.construction.structure)
MAKING NEW LIBRARIES - THE CATALOGUE PROGRAM

The first technical difference between Libraries and Books in the Delila System is that Libraries have catalogues while Books do not. Catalogues serve several purposes. First, since they are a condensed list of the objects in a Library, they allow objects to be found quickly. There are catalogues for both Delila and for people (the latter is called a HUMCAT - HUMan's CATalogue). These are constructed by the program CATAL. Since a library may be constructed by hand, it is also convenient to check the Library's physical structure at the time the catalogue is made.

The Problem Of Duplicate Names

Using Delila, a Book may be easily constructed that contains two objects with the same name within the same structure (if they are in different structures, it won't matter). For example:

ORGANISM ECOLI; CHROMOSOME ECOLI;
GENE LACI; (* THIS IS ON PIECE LAC *)
GET ALL GENE DIRECTION HOMOLOGUOUS;
GET ALL GENE DIRECTION COMPLEMENT;

If this Book were to become a Library, then a reference to PIECE LAC would be ambiguous since there are two PIECEs with that name within the CHROMOSOME. The CATAL program detects these cases and makes the names differ by adding symbols to the names of second and subsequent duplicately named objects. The second technical difference between Books and Libraries is that Books may have duplicate names, while Libraries may not.

Notes For Transportation

Unknown ends of objects (such as a GENE) are represented in this version by a number that is off the end of the coordinates of the PIECE. For consistency, we have used +100000 or -100000 so that these can be more easily recognized (to our knowledge no continuous sequences are this long ... yet!). If your computer cannot handle integers this large, then you can reduce these numbers, as long as they are outside of the individual coordinates.(end of delman.construction.catal)
AN EXAMPLE OF CONSTRUCTING DELILA LIBRARIES

In this example we show the series of steps used to set up the Delila libraries provided on the tape. The special bracket notation ([...]) is used here to indicate the contents of a file. A slash (/) inside the brackets indicates the beginning of a new line in the file. Other notation is described in DELMAN.DESCRIBE.CONVENTIONS.

1. Generate Library Catalogues

catal(humcat,[ADVANCE DATES],lib1,cat1,newlib1,lib2,cat2,newlib2)
copy(newlib1,lib1)
copy(newlib2,lib2)

The humcat should be identical to or similar to the one we send. (Note: l3 is empty, and c3 and newlib3 will not be written, but your computer may require that these files exist as empty files in order to run Catal. A similar situation holds for Delila and many other programs.)

2. Build Transcript Book

delila(train,trabk,tradl,lib1,cat1,lib2,cat2)

There will be warnings that can be ignored at this point.

3. Build Transcript Library

catal(trahu,[ADVANCE DATES],trabk,tract,trali)

You will see a number of cases where duplicate names are resolved.

4. Test Grin File

delila(grin,grbk,grdl,trali,tract)
comp(grbk,cmp,[3])

cmp should show 140 ATG, 7 GTG, 2 TTG.

5. Test Gain File

Within the Gain file, the "FIRST", "LAST" and "SPECIAL" cases must be replaced by numbers. The WORCHA program comes in handy here, because it will do this easily:

worcha(gain,ga3in,[FIRST/0/LAST/2/SPECIAL/0])
delila(ga3in,ga3bk,ga3dl,trali,tract)
comp(ga3bk,cmp,[3])

cmp should be the same as for Grin.

6. Expanding Grin

You can now expand the "FIRST" to "LAST" region of Gain, taking care not to violate the "SPECIAL" cases.(end of delman.construction.example)
RULES OF RAW SEQUENCE INSERTION (1) A raw sequence is a file containing only the letters A, C, G or T (no U is allowed, use T). You may type these letters or a set of letters on the keyboard that is convenient (eg. 1234); then convert the letters to ACGT using the program CHACHA. (2) For reasons of transportability and readability, the length of each sequence line should not exceed the width of characters on a typical terminal: Do not type more than 60 bases per line. You can reformat the data with REFORM or MAKEBK. (3) Sequences can and should be entered in free format with spaces to improve the readability of the sequence during entry. This also helps in the corrections described below. Much later it helps one to find parts of the sequence during fusion of PIECEs. (4) Before entry, use a pencil to mark off intervals of sequence to type. This makes entry easier since there are rest points. I often check off each (or every other) interval as I go, so I rarely get lost and duplicate or delete intervals. If you can keep the lines like those in the paper, the sequence will be easier to check and correct later (but remember rule 2). (5) Two people should INDEPENDENTLY enter the sequence. Independence is important: one person will FREQUENTLY make the same mistake twice. Do not be fooled into entry of a sequence and its complement by one person. We have had two cases where the same deletion was entered in the same place by one person, even though he was typing the sequence and its complement. Have two people independently type the sequence and the complement. By doing it this way, you will also catch some typographical errors if you are using a published source. (Another method: if one person is to enter both strands, be sure that they are typed from two copies on which different intervals are used.) The method of independent entry allows automatic correction. It seems to be faster and more reliable than other methods. 
(6) I caught the deletions mentioned above by knowing how long the sequence should be. You should not rely on the computer for the length. Predict it and then check it. (7) The file names of the two copies should include the initials of the person who typed the file. See the example below. (8) A complemented or inverted strand may be re-complemented or re-inverted using the program REFORM. Note that the free format of (3) will be lost. You should use the reformatted sequence only for checking, and not for the final Library insertion, since you would lose the formatting if you did. (9) At this point you have two files of "raw" sequence. The sequences may be merged together and corrected using MERGE. FOR EXAMPLE: If the sequence was OMPA, TS and MA typed the raw copies, and the copy of MA contains the format desired for the Library, you could use MERGE like this: MERGE(OMPAMA,OMPATS,OMPA,GARBAGE) (10) Be sure to save all raw files (eg. OMPAMA, OMPATS, OMPA) until the library insertion is completed and taped or backed-up.(end of delman.construction.data.entry)
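The idea behind double entry, that two independently typed copies disagree wherever either typist slipped, can be illustrated with a short Python sketch. This is not the MERGE program and makes no claim about how MERGE works. It uses Python's standard difflib module to list the places where two raw copies differ once the free-format spaces are removed. The function names are hypothetical.

```python
import difflib


def clean(raw):
    """Drop free-format spacing and line breaks; insist on A, C, G, T."""
    bases = "".join(ch for ch in raw.upper() if not ch.isspace())
    bad = set(bases) - set("ACGT")
    if bad:
        raise ValueError("not raw sequence letters: %s" % sorted(bad))
    return bases


def discrepancies(copy1, copy2):
    """List (kind, position in copy 1, text in copy 1, text in copy 2)."""
    a, b = clean(copy1), clean(copy2)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, i1, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


# Two made-up copies; the second typist dropped one C.
print(discrepancies("ACGT ACGT", "ACGTAGT"))
print(len(clean("ACGT ACGT")))  # compare with the length you predicted
```

Each reported difference still has to be settled by looking at the source, as rules (5) and (6) require. The sketch only shows where to look.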
SEQUENCE INSERTION PROCEDURE

The following procedure assures the accurate and complete insertion of sequences into a Delila Library.

Overview of the method:

                  REFERENCE OBTAINED
                           :
      .....................*....................
      :                    :                   :
      V                    V                   V
      :                    :                   :
RAW SEQUENCE         RAW SEQUENCE         DESIGN BOOK
   COPY 1               COPY 2                 :
      :                    :                   :
      V                    V                   :
      :                    :                   :
   CHACHA               CHACHA                 :
      :                    :                   :
      V                    V                   :
      :                    :                   :
      :.......MERGE........:                   :
                :                              :
                V                              :
                :                              :
          RAW SEQUENCE                         :
         CORRECTED COPY                        :
                :                              :
                V                              V
                :............MAKEBK............:
                               :
                               V
                               :
                   LIBRARY INSERTION MODULE
                               :
                               V
                               :
                       LIBRARY INSERTION

I. Obtaining Sequences
   A. Sequences may be obtained from
      1) Publications and preprints
      2) Computer transfer
      3) Your lab
   B. One copy of the source article and the sequence (or two copies of the sequence when no paper is available) are to be made for entry to our reference shelf. The photocopies must be of GOOD quality, with NO loss of information.

II. Raw Sequence Insertion (See DELMAN.CONSTRUCTION.DATA.ENTRY for details)
   A. Double entry is preferred over other methods.
   B. Programs are available to make this easy: REFORM and MERGE. RAWBK may be used on the checked raw sequence to get results quickly.
   C. THE NAME OF THE GAME IS ACCURACY.

III. Book Design
   A. First be sure that you understand library structure and coordinate systems. See LIBDEF and DELMAN.USE.
   B. Use forms to write out inserted sections. These can be found in the sections that begin with "DELMAN.CONSTRUCTION.FORM".
   C. Check the library to see if you can fuse the new sequence to previous sequence.
   D. Decide on a coordinate system or fuse to previously defined coordinates. (NOTE: when there is no zero, add 1 to the negative numbers.) Write this information on the source copy for our reference shelf.
   E. Record the source of all fragments and special information (eg: no zero, negative numbers incremented) in the PIECE notes. Put a complete reference into the PIECE notes. Include the positions on the coordinate system, such as: (-1288 to -208)
   F. Record all MARKERs, TRANSCRIPTs and GENEs in your coordinates. Unknown values are either +100000 or -100000, depending on which end of the coordinates the value is beyond.
   G. Create the Library insertion module using MAKEBK. All MARKERs, TRANSCRIPTs and GENEs pointing to a PIECE must be placed immediately prior to the PIECE that they refer to. They are called the "family" of the PIECE. (Note: we call this piece of a Delila library a module, but this is not the same as the ones the Module program works with. The meaning should be clear from the context.)

IV. Insertion - With The Utmost Of Care
   A. Always insert whole Library insertion modules. Replace old parts of the library by modifying a module and reinserting it (with an editor).
   B. Quickly check the book structure for blatant errors.

V. Checking the new Library
   A. The catalogue program (CATAL) is used to check library structure and to generate human and librarian catalogues.
   B. Modules that contain only parts of books can be made into whole books by placing a shell around the module. Example: a PIECE and its family can be inserted into a shell of a fake ORGANISM and CHROMOSOME to check the PIECE structure.
   C. Correct modules are inserted into the library and CATAL is run on the entire library. Be sure that file CATALP is empty, to ensure that the dates are advanced.
   D. End point checking: all coordinate numbers should be checked. To do this, use DELILA to pull out: COORDINATE, PIECE, GENE, TRANSCRIPT and MARKER endpoints. This is painful, but it has caught many errors. Example: GET FROM GENE BEGINNING TO GENE BEGINNING +2; should give mostly ATG, and a few XTG. (SOMEDAY THIS MAY BE AUTOMATED)

VI. Listings Of The New Library
   These are often useful (program to use in parentheses)
   A. LIB (SHIFT)
   B. HUMCAT (CATAL)
   C. REF (REFER)
   D. LIS (LISTER) may be large.(end of delman.construction.library.design)
NAME: LIBDEF, 1980 JUNE 9

ORGANISM
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* GENETIC MAP UNITS (REAL)
(INSERT A SERIES OF ORGANISMS AT THIS POINT)
ORGANISM(end of delman.construction.form.organism)
NAME: LIBDEF, 1980 JUNE 9

CHROMOSOME
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* GENETIC MAP BEGINNING (REAL)
* GENETIC MAP ENDING (REAL)
(INSERT A SERIES OF MARKERS, GENES, TRANSCRIPTS, AND PIECES AT THIS POINT)
CHROMOSOME(end of delman.construction.form.chromosome)
NAME: LIBDEF, 1980 JUNE 9

MARKER
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* PIECE REFERENCE
* GENETIC MAP BEGINNING (REAL)
* DIRECTION (+/-)
* BEGINNING NUCLEOTIDE (INTEGER)
* ENDING NUCLEOTIDE (INTEGER)
* STATE (ON/OFF)
* PHENOTYPE
DNA
*
*
DNA
MARKER(end of delman.construction.form.marker)
NAME: LIBDEF, 1980 JUNE 9

TRANSCRIPT
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* PIECE REFERENCE
* GENETIC MAP BEGINNING (REAL)
* DIRECTION (+/-)
* BEGINNING NUCLEOTIDE (INTEGER)
* ENDING NUCLEOTIDE (INTEGER)
TRANSCRIPT(end of delman.construction.form.transcript)
NAME: LIBDEF, 1980 JUNE 9

GENE
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* PIECE REFERENCE
* GENETIC MAP BEGINNING (REAL)
* DIRECTION (+/-)
* BEGINNING NUCLEOTIDE (INTEGER)
* ENDING NUCLEOTIDE (INTEGER)
GENE(end of delman.construction.form.gene)
NAME: LIBDEF, 1980 JUNE 9

PIECE
* SHORT NAME
* LONG NAME
NOTE
* (NOTES INCLUDE PRECISE REFERENCE
* FOR EVERY BASE IN THE PIECE)
*
*
NOTE
* GENETIC MAP BEGINNING (REAL)
* COORDINATE CONFIGURATION (CIRCULAR/LINEAR)
* COORDINATE DIRECTION (+/-)
* COORDINATE BEGINNING (INTEGER)
* COORDINATE ENDING (INTEGER)
* PIECE CONFIGURATION (CIRCULAR/LINEAR)
* PIECE DIRECTION (+/-)
* PIECE BEGINNING (INTEGER)
* PIECE ENDING (INTEGER)
DNA
* (INSERT SEQUENCE HERE)
DNA
PIECE(end of delman.construction.form.piece)
DESCRIBE(end of delman.describe)
PROGRAM NAMING CONVENTIONS

Every Delila System program exists in several forms:

1) Raw source code - without modules inserted. Example: "lister.r" would be the raw code for the LISTER program. We are not sending code this way.
2) Pascal source code - with all modules inserted. This code is ready to compile. Example: "lister.p". (Our previous convention was to add an s to the end of the file name to indicate this.)
3) Compiled code. Our convention is to remove the suffix: "lister".

To simplify the manual, programs are listed under the compiled code name (lister).

PARAMETER FILE NAMES

A file that controls the operation of a program is called a parameter file. For LISTER this file is LISTERP. For SPLIT it is ... SPLITP (get it? HA! HA! sorry.)

RULES FOR PARAMETER FILES

1) If the file is not empty then the file must contain values for all parameters. With few exceptions, this should reduce the number of complex rules that one must deal with.
2) Each parameter is on its own line.
3) Parameters are left justified on the line.
4) A parameter may be followed by one or more spaces and then any comment. This lets the user write reminders of what the allowed values are.

WHY CAN'T DEFAULT PARAMETER VALUES BE STATED IN THIS MANUAL?

1) If default values are changed, then the manual must also be changed. Since there is no automatic mechanism to assure that these remain the same, it is likely that the change will be forgotten. The manual would then be out of date.
2) The manual entry defines the program but does not enforce details of operation. It is somewhat like the LIBDEF specification.
3) It is easy to find out what the defaults are since almost every program states the values used in its listing. Running a small test takes only two minutes.(end of delman.describe.conventions.naming-parameters)
PROGRAM WRITING CONVENTIONS

Program source code will always follow certain rules:

1) The first line(s) will be the Pascal PROGRAM statement.

2) The module libraries that are sources of the modules will be stated.

3) One of the global constants will be called VERSION.  This number or
   string identifies the particular version of the source code.  We
   change VERSION every time that we modify the source file.  The
   program name and VERSION are written to the OUTPUT file when the
   program runs.

4) There will be a document module that describes the program.  The
   module is identical to the one in this manual, such as
   DESCRIBE.LISTER.  It follows the format defined in
   DELMAN.DESCRIBE.DOCUMENTATION.PROGRAMS

5) All constants, types, variables, procedures, functions and sections
   of code will have comments that describe their function.

6) Interactive programs always have a HELP command.

FOR TRANSPORTATION:

1) Put non-standard features inside modules.

2) Program lines longer than 80 characters are avoided.  (NB: This is
   ALWAYS possible in PASCAL).  The FLAG program will detect any lines
   that are too long.

3) Reading into packed arrays is forbidden.  Read into unpacked arrays
   and pack or transfer values.

4) The Pascal Users Manual suggests that PASCAL identifiers "must
   differ over their first 8 characters."  There are two problems
   related to this.  Assume that the transport is from a computer that
   requires N characters to differ, where N > 8 (eg. 10).

   a) Transport to a computer that requires M < N may cause names like
      A23456789 and A2345678X to be considered identical, and
      compilation will be prevented.

   b) Transport to a computer that recognizes M > N will detect cases
      where one name was written two ways, with the difference in the
      last characters (between N and M).  The "most famous" such case
      was in CATAL: HUMCATLINE and HUMCATLINES were used on a computer
      where N = 10 and failed on computers where M > 10.

   The solution in both cases is to avoid names that differ only beyond
   their first 8 characters.

   Is somebody willing to write a program to detect this?
(end of delman.describe.conventions.writing)
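One possible shape for such a detector is sketched here in Python (not
Pascal, and not part of the Delila System).  It groups every
identifier-like word by its first 8 characters and reports groups that
hold more than one distinct name.  The function name and the regular
expression are assumptions for illustration.  The sketch does not skip
comments or quoted strings, so words inside them are checked too, and it
assumes identifiers consist of letters and digits only and are not case
sensitive.

```python
import re
from collections import defaultdict

# A letter followed by letters or digits (assumed identifier syntax).
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def colliding_identifiers(source, significant=8):
    """Return {prefix: [names]} for names that agree over their first
    `significant` characters but are not the same name."""
    groups = defaultdict(set)
    for name in IDENTIFIER.findall(source):
        name = name.lower()  # treat upper and lower case as equal
        groups[name[:significant]].add(name)
    return {prefix: sorted(names)
            for prefix, names in groups.items()
            if len(names) > 1}
```

Run on text containing HUMCATLINE and HUMCATLINES, the sketch reports
both names under the shared prefix "humcatli".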
PROGRAM RUNNING CONVENTIONS

In this manual we will use a single notation to mean running a program:

   lister(book,list)

means to run the program LISTER using a file named BOOK.  The program
will produce output to file LIST.  The names BOOK and LIST are not
necessarily the same as the file names declared in the source of LISTER
(LISTERS); we assume that the names are mapped one to one.  Also, file
names to the right may not always be mentioned, to simplify the
notation.  For example:

   edit(inst1)
      :
      :   (create Delila instructions in file INST1)
      :
   delila(inst1,book1,delist1)
          (Run DELILA to create a book named BOOK1 and a Delila listing
          DELIST1 that shows where the errors are.  The library and
          catalogue are not mentioned.)
   lister(book1,list1)
          (Run the auxiliary program LISTER.  OUTPUT and LISTERP are
          not mentioned.)

The file OUTPUT will always contain messages and diagnostics intended
for the CRT screen or teletype.  The file INPUT is always used for
interactive input by the programs.

To fully define the files that a program uses we will write:

   LISTER(BOOK: IN; LIST: OUT; LISTERP: IN; OUTPUT: OUT)

IN and OUT define the direction of information flow into or out of the
program.  INOUT would mean that the source file may be modified (such
as by an editor).  This is a symbolic way to represent the data flow
diagrammed in our papers (see DELMAN.INTRO.DESCRIPTION).

NOTE: The mapping of logical file name (the one the program knows) to
physical file name (the actual one the computer system uses) is
frequently done with an ASSIGN or LINK command in the job control
language of the computer.
(end of delman.describe.conventions.running)
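The full file definition notation is regular enough to be read by a
program.  The following Python sketch (not part of the Delila System;
the function name is an assumption) splits a definition such as the
LISTER example into the program name and a list of (file, direction)
pairs, accepting only the directions IN, OUT and INOUT named above.

```python
import re

DIRECTIONS = ("IN", "OUT", "INOUT")


def parse_file_definition(text):
    """Split 'PROGRAM(FILE: DIR; FILE: DIR; ...)' into its parts."""
    match = re.fullmatch(r"\s*(\w+)\s*\((.*)\)\s*", text)
    if match is None:
        raise ValueError("not of the form PROGRAM(FILE: DIRECTION; ...)")
    program, body = match.groups()
    files = []
    for item in body.split(";"):  # entries are separated by ';'
        name, _, direction = item.partition(":")
        name = name.strip()
        direction = direction.strip().upper()
        if not name or direction not in DIRECTIONS:
            raise ValueError("bad file entry: %r" % item)
        files.append((name, direction))
    return program, files
```

The list of pairs is one way to hold the logical file names before they
are mapped to physical file names on a particular computer.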
Short clustered descriptions of some Delila System files

DOCUMENTS
   AAA       Names Of Delila System Files
   chars     Character List
   delman1   Delila System Manual
   delman2   Delila System Manual, for program descriptions
   libdef    Delila Library System Definition
   moddef    Module Transfer System Definition

LIBRARIES
   humcat    Human's Catalogue For The Library
   lib1      Library 1: Bacteriophage
   lib2      Library 2: E. Coli And S. Typhimurium

DELILA INSTRUCTIONS
   train     Transcript Library Instructions
   grin      Gene Starts In Relative Form (Use Transcript Library)
   gain      Gene Starts In Absolute Form (Use Transcript Library)

SEARCH PROGRAM RULES
   genrule   Finds Genes And Non-Genes
   enzrule   Finds Restriction Enzyme Sites In Books

WEIGHT MATRICES FOR THE PERCEPTRON
   w101      101 Wide, Finds All Genes In Transcript Library
   w71       71 Wide, Finds All Genes In Transcript Library
   w51       51 Wide, Finds All Genes And Some Nongenes

EXAMPLES
   ex0bk     Example Book
   ex0hu     Example Catalogue For Humans
   ex0dl     Example Delila Listing
   ex0in     Example Instructions - To Create EX0BK
   ex0li     Example Listing From LISTER
   ex0lo     Example Loocat On Catalogue from EX0BK

EXAMPLE DELILA INSTRUCTIONS FOR DELMAN
   ex0in     "ex0: example"
   ex1in     "ex1: the laci gene"
   ex2in     "ex2: an absolute get"
   ex3in     "ex3: a relative get"
   ex4in     "ex4: non-coding lac leader"
   ex5in     "ex5: the region between laci and lacz"
   ex6in     "ex6: multiple specification and requests"
   ex7in     "ex7: aligned book"
   ex8in     "ex8: non-coding lac leader- via respecification"

EXAMPLES FOR TESTING THE MODULE PROGRAM
   exsin     example source in
   exmodli   example module library

EXAMPLES FOR TESTING AUXILIARY PROGRAMS
   expepin   Delila Instructions For Testing Pemowe

EXAMPLES FOR TESTING THE PERCEPTRON
   exspbk    Example Sequences Positive Book
   exsnbk    Example Sequences Negative Book
   expa0     Example Pattern 0, Learn EXSPBK Vs EXSNBK With Zero Start
   expa1     Example Pattern 1, An Initial Matrix For Learning
   expa2     Example Pattern 2, Learn EXSPBK Vs EXSNBK Using EXPA1 As Start
   expan2    Result Of Patana On EXPA2
   exsebk    A Book For Searching With EXPA2

EXAMPLES FOR TESTING ENCODE PROGRAMS
   exencin   Example Encode Instructions
   exencbk   The Book For EXENCIN
   exencen   Example Encoding Of EXENCBK

FONTS FOR BIGLET
   font      font for the biglet program
   phont     demonstration font for the biglet program

EXAMPLE PARAMETER FILES

Often a program will have a file associated with it that controls it
and is called a parameter file.  For example, the pbreak program uses a
parameter file called pbreakp.

Many programs have example files.  They are not listed here, but you
may want to look for them before you run the program.  An example is
the xyplo program, for which there are the files xyplop.demo,
xyin.demo, xyplop.test and xyin.test.

As programs are modified, this section will not always be up to date.
(end of delman.describe.short.cluster.files)
Short clustered descriptions of Delila System programs

Documentation exists as describe.[name]

MODULE LIBRARIES
   auxmod: modules for auxiliary programs
   delmod: delila module library
   doodle: pascal graphics library and preprocessor for pic under unix
   cybmod: specific module library for the cyber computer
   genmod: genbank access modules
   matmod: mathematics modules
   prgmod: programming modules for the delila system
   unixmod: specific module library for the unix operating system
   vaxmod: specific module library for the vax computer

MODULE MANIPULATION
   module: module replacement program
   makemod: create a set of empty modules from a list of names
   makman: make manual entries from a source code
   maknam: make manual entry names
   modin: generate modularized delila instructions for absolute sites
   modlen: determine module lengths
   nulldate: modules to neutralize the date-time functions
   pbreak: breaks a file into pages at a certain trigger phrase
   show: show modules in a module library
   undel: remove references to delman in modules

TOOLS
   biglet: text enlargement program
   calc: a calculator that propagates errors
   calico: character and line counts of a file
   cap: put capital letters inside quotes of a program
   censor: removes code from a program
   chacha: changes characters in a file
   code: find the comment density of a pascal program
   column: pull defined column from input
   concat: concatenate files together
   copy: copy one file to another file
   decat: break a file into 10 files
   decom: remove comment starts from within a comment
   difint: differences between integers
   flag: points out excessively long lines
   ll: line lengths
   lig: ligation theory
   lochas: look at characters in a file
   merge: compare two files and merge them
   nocom: remove comments
   number: add line numbers to a file
   rembla: remove blanks from ends of lines in a file
   repro: make multiple copies of a file
   same: counts the number of lines that are identical in two files
   shell: basic outline for a program
   shift: copy one file to another file, with a blank in front of each line
   short: find locations of short lines in a file
   shortline: make short lines out of long lines
   split: split a wide file into printable pages
   sqz: squeeze the input file to fit into fewer characters per line
   sumfile: sum of file sizes
   test: a simple test program for Pascal
   unshi: remove first column of characters from a file
   ver: look at the version of a program
   verbop: increment the version number of a program
   vernum: print the version number of a program
   versave: save the file under the version number
   unsqz: unsqueeze the input file
   whatch: what characters are in a file?
   worcha: word changing program
   wl: wrap lines in a file
   woco: word counting program
   wordlist: lists words in a file
   ww: word wrap

TOOLS FOR TEX
   notex: remove tex and latex constructs
   ref2bib: refer to bibtex converter
   sortbibtex: sort a bibtex database
   untex: remove tex and latex constructs
   untitle: remove titles from bbl file
   unverb: remove verbatim sections from a latex file

GRAPHICS
   doodle: pascal graphics library and preprocessor for pic under unix
   domod: doodle modules
   dops: pascal graphics library and preprocessor for postscript
   dosun: pascal graphics library and preprocessor for Sun graphics
   shrink: reduce size of postscript graphics
   genhis: general histogram plotter
   genpic: convert genhis output to pic input
   xyplo: plot x, y data
   log: convert columns of data to log
   dnag: graphics of dna

LIBRARIAN
   delila: the librarian for sequence manipulation
   catal: cataloguer of delila libraries, the catalogue program
   loocat: look at a catalogue

GENBANK
   dbbk: database to delila book conversion program
   dbcat: database catalog production and sorting program.
   dbfilter: filter GenBank databases to remove unwanted entries
   dbinst: extract Delila instructions from a GenBank database
   dblo: look at the catalogue of a genbank/embl database
   dbpull: database extraction program.

AUXILIARY PROGRAMS FOR DATA BASE CONSTRUCTION
   makebk: make a book from a file of sequences.
   rawbk: make a raw sequence into a book
   reform: raw sequences reformatted

AUXILIARY PROGRAMS FOR SEQUENCE LISTING
   lister: list the sequences of pieces in a book with translation
   parse: breaks a book into its components

AUXILIARY PROGRAMS FOR ALIGNED SEQUENCES
   alist: aligned listing of a book
   gap: gaps in aligned listing of a book
   hist: make a histogram of aligned sequences.
   histan: histogram analysis.
   malign: optimal alignment of a book, based on minimum uncertainty

AUXILIARY PROGRAMS FOR ANALYSIS
   cluster: cluster indana subindexes into groups of duplicate entries
   coda: composition file to data for genhis
   comp: determine the composition of a book.
   compan: composition analysis.
   count: counts the amount of sequence in a book
   frame: evaluator of potential reading frames
   indana: analysis of an index
   index: make an alphabetic list of oligonucleotides in a book
   pemowe: peptide molecular weights
   search: search a book for strings

AUXILIARY PROGRAMS FOR HELIXES
   dotmat: dot matrices of two books
   helix: find helices between sequences in two books
   keymat: keyed-matrices for helices between two books
   matrix: dot matrices for helices between two books
   rep: records repeats between sequences in two books
   sorth: sort helix list
   instal: delila instruction alignment

AUXILIARY PROGRAMS FOR PATTERN LEARNING
   patana: pattern analysis
   patlrn: pattern learning
   patlst: lister of patlrn output.
   patser: pattern searcher
   patval: pattern evaluations of aligned sequences

AUXILIARY PROGRAMS FOR ENCODED SEQUENCES
   encfrq: encoded sequence frequency analysis
   encode: encodes a book of sequences into strings of integers
   encsum: sum of the vectors of encoded sequences

AUXILIARY PROGRAMS FOR INFORMATION ANALYSIS
   calhnb: calculate e(hnb), var(hnb), ae(hnb), avar(hnb), e(n)
   frese: frequency table to sequ
   palinf: find palindromes, based on information theory
   rf: calculate Rfrequency
   rseq: rsequence calculated from encoded sequences
   rsim: Rsequence simulation
   rsgra: rsequence graph
   dalvec: converts Rseq rsdata file to symvec format
   makelogo: make a graphical `sequence logo' for aligned sequences
   ckhelix: check that the helix location is where one wants
   alpro: frequency and information of aligned protein sequences
   alword: frequency and information of aligned words
   dirty: calculate probabilities for dirty DNA synthesis
   sites: analyse sites from randomized sequence data base
   bkdb: convert a book to database format for the sites program
   siva: site information variance
   diana: dinucleotide analysis of an aligned book
   tri: test environment for triangle array
   digrab: diagonal grabs of diana data
   da3d: diana da file to 3d graphics
   dotsba: dots to database
   Ri: Rindividual is calculated for every site in the aligned book
   scan: scan a book with a wmatrix and generate a vector
   vfilt: vector filter
   tod: to database format for sites program
   winfo: window information curve

AUXILIARY PROGRAMS FOR OTHER USES
   refer: print the references in the pieces of a book
   sepa: separates delila instruction sets
   lenin: convert a list of lengths into Delila instructions

RANDOM NUMBERS AND SEQUENCES
   markov: markov chain generation of a dna sequence from composition.
   tstrnd: test random generator
   gentst: test random generator
   normal: generate normally distributed random numbers
   rndseq: generate random dna sequences
   aran: aligned random sequences

MATHEMATICS
   av: average integers
   binomial: produce the binomial probabilities for a found black to white ratio
   binplo: produce the binomial probabilities for a found black to white ratio
   cerf: complement of the error function
   cisq: circle to square
   chi: estimates chi squared from degrees of freedom
   linreg: linear regression
   mnomial: produce the multinomial distribution for base probabilities
   pcs: partial chi squared
   riden: ring density graph
   ring: z space ring
   sphere: plot density of shannon spheres
   stirling: test of stirling's formula
   zipf: Monte Carlo simulation for Peter Shenkin's problem

MISCELLANEOUS
   aa: not actually a program, this is the header page for Delila manual
   asciicode: converts ascii table to Pascal code
   binhex: convert binary to hex
   hexbin: convert hex to binary
   mstrip: remove control m's from a file
   epsclean: clean an eps file
   kenin: create Delila instructions from Kenn's all.gen instructions
   kenbk: book from a file of sequences of sequences provided by Kenn Rudd
   tipper: copy a file to the output file with special symbols at end
   todawg: change a book into dawg format
   ev: evolution of binding sites
   evd: evolution display
   makedate: make a date file
   makessbdate: make a date file from a Sample_Sheet.bin file

PROGRAMS TO CONTROL MACHINERY
   odti: munch od and time plates together for xyplo
   titer: analyse titertek optical density data
   spec: analyse two spectra from the camspec
   ssbread: read a sample sheet from the ABI sequencer
   tkod: read od values from tk data
(end of delman.describe.short.cluster.programs)