Delila Manual, Hypertext version
version = 5.06 of delman1 2014 Mar 06
DELMAN 1
THE DELILA SYSTEM MANUAL
THOMAS D. SCHNEIDER
COPYRIGHT (C) 1993
1. Don't Panic! You don't have to absorb this all at once!
2. There is an index at the end of any printed copy of Delman!
3. To create Delman2, see file aa.p
(end of version)
INTRO
(end of delman.intro)
DELILA SYSTEM MANUAL OUTLINE
INTRO: Introduction To The Delila System
OUTLINE: Outline For The Delila Manual
DESCRIPTION: What Is The Delila System?
ORGANIZATION: Organization Of The Manual
POLICY: Our Policies, A Disclaimer, Obtaining The Delila System,
Our Address And Acknowledgements
TRANSPORT: Transportation Of The Delila System
REQUIREMENTS: What You Will Need To Get The Delila System Running
TAPE.FORMATS: Tape Data Formats
ASSEMBLY: Assembly Of The Delila System Programs
INTRO: What We Mean By Assembly
CHACHA: Changing Characters And Getting The First Program Running
REMBLA: Removing Excess Blanks From Files
WORCHA: The Reserved Word Problem
MODULE: Module Libraries - What They Are And How To Use Them
EXAMPLE: An Example Of Constructing A Delila System Program
PROBLEMS: Problems That May Arise During Assembly
GUIDE: Hello, Computer - A Guide To The New User
INTRO: Introduction To The Guide And Your Computer
ADVICE: Advice And Tips To The New User
DELILA: How To Use The Delila System On Your Computer
PROGRAM: System Independent Notes On Programming
ESSAY: Suggestions On How To Learn And Do Programming
FABLE: A Fairy Tale For Programmers
(end of delman.intro.outline.1)
USE: Uses Of The Delila System
INTRO: Introduction
STRUCTURE: Library Structure: Trees, Nested And Named Objects
LANGUAGE: Delila - The Language
AUXILIARY.PROGRAMS: Lister And Search
DATA.FLOW: Data Flow And Data Loops
COORDINATES: The Coordinate System Of A PIECE
CONTROL: How To Control The Responses Of Delila
COMPARISON: Ways To Compare Sequences
ALIGNED.BOOKS: How To Make And Use Aligned Books
PERCEPTRON: Use Of The Pattern Programs
ENCODE: Use Of The Fabulous And Powerful Encode Program
DBPULL: Using The Data Base Extraction Programs
SEARCH: Using The Search Program
CONSTRUCTION: Constructing Your Own Libraries
INTRO: Introduction
STRUCTURE: More On Library Structure - Logical Vs Physical Structure
CATAL: Making New Libraries - The Catalogue Program
EXAMPLE: An Example Of Constructing Delila Libraries
DATA.ENTRY: Using Your Own Data
LIBRARY.DESIGN: Making A Delila Data Base
[FORM...]: The Forms For Library Module Entry
DESCRIBE: Program And Data Descriptions
CONVENTIONS: Notation For Naming, Writing And Running Programs
SHORT.CLUSTER: Short Clustered Descriptions Of Delila System Files
DOCUMENTATION: How Programs Are Documented
The format for documentation in the Delila System is in
file aa.p at the start of the Delman2 manual.
INDEX
An Alphabetical Listing Of The Pages In The Manual.
(See The Page Named DELMAN.INTRO.ORGANIZATION
For How To Generate The Index.)
(end of delman.intro.outline.2)
WHAT IS THE DELILA SYSTEM?
The Delila System is a collection of Pascal programs and data originally
written at the University of Colorado, Boulder that allows one to manipulate
and study sets of nucleic-acid sequences. A set of sequences is called a
library. There is a librarian, and "her" name is Delila. One gives Delila a
list of instructions that name desired fragments. Delila then searches the
library, collects all the sequences together and produces a "book". The book
may then be searched for patterns, listed with translation to amino acids, or
studied in various ways using programs other than Delila ("auxiliary"
programs). Since books may be small, these analyses can be efficient.
Books have the same form as libraries. In other words, libraries have a
particular structure so that Delila can work with them. Books have that same
structure. For example, given a Master DNA sequence library one can use
Delila to make a subset such as a transcript library, containing sequences of
mRNA. From the transcript library, subsets for gene initiation regions can be
made, and these are guaranteed to be sequences from mRNA. During all these
manipulations the numbering of the sequences remains consistent so that one
can refer back to the original library or the literature. (The technical
differences between libraries and books will be discussed later.)
Any auxiliary program that searches a library will know about the
structure of the library. Using this structure and the search results, the
program can write Delila instructions that specify the locations of the found
objects. Once again, using Delila, one can loop back and create a book of
these objects. Also, the instructions (instead of the sequences) can be
manipulated by various programs.
A NOTE FOR PROGRAMMERS
Each auxiliary program that reads a book or library knows about the
library structure. To make programming easy, a set of routines was written as
an interface between the actual database (kept in a file) and the program
calls and variables. These "book reading routines" are kept together in what
we call a Module Library, containing many chunks of Pascal code. Each module
performs certain kinds of tasks. The modules are transferred from the module
library into the source code of each auxiliary program by using the Module
program. In this way all changes to the interface packages can be made once
in the Module Library, followed by a series of transfers. We may send the
Delila System with modules removed because there is no reason to send
duplicate code. After transportation you would assemble the programs.
We hope that this section gave you a rough overview of what the Delila
System can do. Many more details and examples can be found in the sections
that follow.
(end of delman.intro.description)
libdef - the definition of the Delila Library System (a file)
moddef - the definition of the Module Transfer System (a file)
doodle.info - describes Pascal graphics portable under UNIX
Some of the Delila programs and the method of moving modules around
are described in these papers:
Schneider, T.D., G.D. Stormo, J.S. Haemer and L. Gold. (1982)
A design for computer nucleic-acid sequence storage, retrieval and
manipulation.
Nucleic Acids Research, 10: 3013-3024.
Schneider, T.D., G.D. Stormo, M.A. Yarus, and L. Gold (1984)
Delila system tools.
Nucleic Acids Research, 12: 129-140.
Some related papers are:
Stormo, G.D., T.D. Schneider and L.M. Gold (1982)
Characterization of translational initiation sites in E. coli.
Nucleic Acids Research, 10: 2971-2996.
Stormo, G.D., T.D. Schneider, L. Gold and A. Ehrenfeucht (1982)
Use of the 'Perceptron' algorithm to distinguish translational
initiation sites in E. coli.
Nucleic Acids Research, 10: 2997-3011.
Clift, B., D. Haussler, R. McConnell, T. D. Schneider and G. D. Stormo
(1986)
Sequence Landscapes.
Nucleic Acids Research, 14: 141-158.
Schneider, T.D., G.D. Stormo, L. Gold and A. Ehrenfeucht (1986)
The information content of binding sites on nucleotide sequences.
J. Mol. Biol. 188: 415-431.
Stormo, G.D., T.D. Schneider and L. Gold (1986)
Quantitative analysis of the relationship between nucleotide
sequence and functional activity.
Nucleic Acids Research, 14: 6661-6679.
T. D. Schneider (1988)
Information and entropy of patterns in genetic switches.
In G. J. Erickson and C. R. Smith,
editors, Maximum-Entropy and Bayesian Methods in Science
and Engineering, volume 2, pages 147-154,
Dordrecht, The Netherlands, Kluwer Academic Publishers.
T. D. Schneider and G. D. Stormo (1989)
Excess information at bacteriophage T7 genomic promoters detected
by a random cloning technique.
Nucleic Acids Research, 17: 659-674.
Reference for Dotmat, Helix, Matrix and Keymat:
J. V. Maizel, Jr. and R. P. Lenk
PNAS 78: 7665-7669 (1981)
A reference for Index:
L. J. Korn, C. L. Queen and M. N. Wegman
PNAS 74: 4401-4405 (1977)
(end of delman.intro.references)
ORGANIZATION OF THE MANUAL
The Delila Manual is broken into several somewhat independent sections.
When Delman is paged by program PBREAK (see Technical notes below) you will
find an index at the end. We anticipate at least two kinds of reader:
1) The builder who wants to get a Delila System running on a local computer.
The section on transportation will help you get the data into your computer.
The section on assembly will guide you through the difficult task of getting
the programs running. At that point the Delila Libraries will still not be
ready to use: you must construct catalogues as described in the section on
CONSTRUCTING YOUR OWN LIBRARIES (DELMAN.CONSTRUCTION). Finally you will be
able to use the Delila System. We suggest that you first look over the entire
manual and associated documents. Then begin the transport. Good luck!
2) The user who wants to use a Delila System that is already running on a
local computer. You may be interested in looking over the sections on
transportation and assembly of the system, but this is not necessary. If you
don't know anything about using the computer you should start at
DELMAN.GUIDE. In any case, read the section on USE OF THE DELILA SYSTEM
(DELMAN.USE).
Each program is described in a separate manual, Delman2.
TECHNICAL NOTES (These will not be useful to people just starting.)
1. The section DELMAN.GUIDE must be rewritten after transportation
to a new computer system.
2. DELMAN is physically broken into a set of modules. Each module
is a page of the manual. The individual pages can be extracted (or
transferred and rearranged) by using the program MODULE, as described
in the document MODDEF and DESCRIBE.MODULE. The pages may be looked
at on-line with the SHOW program (DESCRIBE.SHOW). The manual or
extracted modules may be broken into pages for output to a lineprinter by
using the PBREAK program with a parameter file containing:
(* begin module
1
There is no closing "*)" in the trigger because many different
module names may follow the trigger, so the trigger is for the common
part of the module beginnings.
You can generate another index of the contents of this manual in
the List file of program Module if you use Delman as the Modlib and a copy
of Delman as Sin. (See MODDEF for the definitions of these files.)
(end of delman.intro.organization)
OBTAINING THE DELILA SYSTEM
The Delila system is available from
https://alum.mit.edu/www/toms
OBTAINING THE DELILA SYSTEM BY TAPE
We prefer not to have to write tapes or disks, but we will send the
Delila System by tape as a single package if you do not have ftp access.
Under most circumstances we cannot send parts of the system or subsets of the
data. Please send us a tape as described in delman.transport.tape.formats,
and we will write out the entire current version and send it back to you.
There is no fee. You may redistribute the system. If you receive a copy of
the system from someone else, you may want to check back with us to see if
there have been any major changes or corrections. Referring to the version
number of the program or documentation will help us know if there were any
changes.
DISCLAIMER
No claim or guarantee is made that Delila System programs and data are
free of error. Although we send source code, we cannot guarantee that this
code will compile and run on all computers. We believe that our code is
reasonably efficient, but we cannot be responsible for any costs due to using
the Delila System. We do not offer programming support, though we are willing
to answer questions about the Delila System.
We would appreciate a detailed description of any program errors (bugs)
or data errors that you encounter.
OUR ADDRESS
Thomas D. Schneider, Ph.D.
toms@alum.mit.edu
https://alum.mit.edu/www/toms
ACKNOWLEDGEMENTS
Jeff Haemer, Mike Aden and Gary Stormo were instrumental in the
original design of the Delila system.
Many people have helped us by reading and commenting on this
manual. We would like to thank: Ginny Fonte, Larry Gold, Jeff
Haemer, John Hoffhines, Jane Hessler (VA), Brent Hughes, Billie
Lemmon, Melissa Mockensturm, Sandy Parkinson (UT), Pat Roche, Herb
Schneider, Susan Scolman, Sidney Shinedling, Britta Singer, Rosemary
Sweeney, and Mike Yarus.
Computer time and resources were generously provided by the
University of Colorado at Boulder, and the Frederick Biomedical
Supercomputing Center.
Funds for this project were provided through grants NIH 1 R01 GM28755,
NIH 5 R01 GM19963 and ACS NP-178D.
(end of delman.intro.policy)
Please use this page to write comments you have about the manual
and the Delila system. Our address is on page delman.intro.policy. Thank you.
Name: Date:
(end of delman.intro.comments)
TRANSPORT
(end of delman.transport)
TRANSPORTATION - WHAT YOU WILL NEED
If you have obtained the Delila System by computer tape, you will need
some way of moving the data on the tape into your computer. We suggest that
you find someone who has already dealt with tapes.
All Delila System programs are written in the language Pascal. There
are many books available on this language, but the definition of
the language is in:
K. Jensen and N. Wirth
Pascal User Manual and Report
Springer-Verlag, New York 1978
Some of the Delila programs have been automatically translated to C.
See the README file for further details.
To run Pascal programs you will need a Pascal compiler on your computer,
and enough memory to use it. It is impossible to make an accurate estimate
of the memory requirements, because this depends on the computer system.
However, we once set up an older version of the entire system on two computers:
CDC Cyber/KRONOS 5000 pru x 640 char/pru = 3,200,000 characters
DIGITAL VAX/VMS 7000 blocks x 512 char/block = 3,584,000 characters
Since then more programs have been added, and we find roughly:
4,300,000 characters of source code and files
5,300,000 bytes of compiled code on a Pyramid 90x computer running UNIX.
Since these estimates include object code, it is possible that the amount
you require will be more or less. The estimates do not include memory
required for running the system.
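The two older totals above are simply the number of storage units times the
characters per unit. The short Python sketch below repeats that arithmetic so
that you can substitute the unit size of your own system. The names used in
it are ours, chosen for illustration only.

```python
# Storage estimates quoted above, recomputed as units x characters per unit.
# The labels and the function name are illustrative, not part of Delila.
ESTIMATES = {
    "CDC Cyber/KRONOS": (5000, 640),   # (pru, characters per pru)
    "DIGITAL VAX/VMS": (7000, 512),    # (blocks, characters per block)
}


def total_characters(units, chars_per_unit):
    """Return the total number of characters occupied."""
    return units * chars_per_unit


if __name__ == "__main__":
    for system, (units, size) in ESTIMATES.items():
        print(f"{system}: {total_characters(units, size):,} characters")
```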
Since transportation of programs from one computer to another is
still a tricky business, we recommend that either you learn about
tapes, your computer, and Pascal, or that you find local people who
know about these things and are willing to give you help.
The first Delila system file on the tape is called AAA (the name
guarantees that it will be first). It lists the names of
all the Delila files on the tape, in the order that they were taped.
Following AAA the other files are in alphabetical order.
Files are described in the manual section DELMAN.DESCRIBE.
If you keep notes on difficulties that you encounter and
how each was solved, transportation of future versions of the
Delila System will be easier.
(end of delman.transport.requirements)
TAPE DATA FORMATS
We send the Delila System (programs and data) out on tape.
Send us a standard 2400 foot tape. We will send you back the tape with
the format:
9 track
1600 bits per inch
Unlabeled
Standard ASCII character set
80 characters per record
10 records per block
We can also send UNIX tar tapes.
The first file on the tape lists the names of all the files on the tape.
(end of delman.transport.tape.formats)
ASSEMBLY
(end of delman.assembly)
ASSEMBLY OF THE DELILA SYSTEM PROGRAMS
At this point we will assume that all the programs and data are in
files on your computer. Be sure to read the section in PROGRAMS AND
DATA DESCRIPTIONS (DELMAN.DESCRIBE.CONVENTIONS) that discusses our file
naming and running conventions.
This section will guide you in the construction of the Delila System
programs. There are several stages to this process:
changing characters - making sure that all the characters are correct
removing blanks - blank characters at the end of lines can be removed
to speed processing and save memory.
changing words - changing the words that your compiler thinks
are reserved words in Pascal (but aren't in standard Pascal...)
module corrections - making sure that modular chunks of code function
correctly on your computer.
module transfers - inserting chunks of code into programs
compilation and debugging - making the programs and finding out why
things don't work ("If something can go wrong, it will." - Murphy)
We have written some tools to aid you in this process - but to use the
tools you must first get some of them running - so the first steps must
be done by hand.
Remember to take dated notes about your problems and how they were
solved.
USE OF COMMAND FILES
Most computer systems allow one to put commands in a file and execute
them. If you can do this, it will speed up assembly enormously. One
such "command" file could contain instructions to remove blanks,
change characters, change words, transfer modules and perhaps even try
to compile. However, it would be better to have several command files,
each of which did a small part, giving you more flexibility.
(end of delman.assembly.intro)
CHANGING CHARACTERS
When characters are written to tape they are encoded as binary strings.
When your computer reads the tape, the characters are decoded for
storage on your computer. If the decoding does not exactly reverse
the encoding, then the characters you receive will not be the same as
the ones that we sent. For example, you may have a pound sign for each
exclamation mark that we sent. Your first task is to find out what
changes occurred (if any). To aid you, we provided a list of
characters with English descriptions in the file 'chars'.
Look at this file and write down the changes required.
Use the editor on your computer to correct the characters in the file
CHACHAS. Now try to compile CHACHAS. Determine the reasons for any
errors. (For example, you may have to switch double and single quotes
to satisfy the compiler or you may have to remove the non-standard linelimit
call.)
The CHACHA program will now assist you in converting characters in
the files from the tape. You should try it out on chars, remembering
not to destroy the original file. NOTE: Some Pascal compilers may
not allow programs that read "nonstandard" characters. (Example: lowercase
letters.) You may be able to get around this by setting compiler defaults.
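To make the idea of a character change concrete, here is a minimal Python
sketch. It is not the CHACHA program, and it does not read CHACHA's parameter
file; the function name and the dictionary of changes are assumptions made for
illustration. It shows one point that matters in practice: all changes should
be made in a single pass, so that two characters can be swapped without one
change undoing the other.

```python
def change_characters(text, changes):
    """Replace each character that is a key of `changes` by its value.

    All replacements happen in one pass over the text, so a pair of
    characters can be swapped (for example " and ') safely.
    """
    table = str.maketrans(changes)
    return text.translate(table)


if __name__ == "__main__":
    # Hypothetical damage: every exclamation mark arrived as a pound sign.
    print(change_characters("Don't Panic#", {"#": "!"}))
    # Swapping double and single quotes in a single pass.
    print(change_characters("writeln('hi')", {"'": '"', '"': "'"}))
```

Always write the result to a new file, so that the original from the tape is
kept.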
(end of delman.assembly.chacha)
REMOVING EXCESS BLANKS FROM FILES
The files that you get off the tape may have extra blanks (spaces) at
the ends of lines. This may be due to transportation itself, or the source
computer may add extra blanks to lines. Although these blanks will not
affect the function of most programs, they will slow down program
execution and use up extra memory.
Transportation can also add blank lines to the end of the file. Some
programs will object to this. Catal is one example.
The program Rembla (remove blanks) will remove all blanks from the ends
of lines in a file, and any extra blank lines at the end. We recommend that
you include this as a step during assembly of programs. It should
also be done for data files, especially the libraries.
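The effect described above can be sketched in a few lines of Python. This is
an illustration of the behavior, not the Rembla source; the function name is
ours, and we assume that blank lines inside the file are to be kept.

```python
def remove_blanks(text):
    """Strip trailing spaces from every line and drop blank lines at the end."""
    lines = [line.rstrip(" ") for line in text.split("\n")]
    # Discard empty lines at the end of the file only.
    while lines and lines[-1] == "":
        lines.pop()
    # End a non-empty file with exactly one newline.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    print(repr(remove_blanks("begin   \n  x := 1;  \nend.  \n\n   \n")))
```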
(end of delman.assembly.rembla)
THE RESERVED WORD PROBLEM
The language Pascal defines certain words (such as PROGRAM, VAR,
BEGIN and END) to be reserved words. These words cannot be used as
variable names. This in itself presents no difficulties for
portability. However, your Pascal compiler (like ours) may reserve more
words than just the standard set. If one of the Delila System programs
uses a non-standard reserved word of your compiler, then the program will
not compile. You will not have to change all these names by hand because
we have sent a program to do it automatically.
Non-standard reserved words should be listed somewhere in the manual for
your Pascal compiler. Use this list and the program WORCHA to remove all
the reserved names. We suggest using new names that are not likely
to appear in a program. Example: MODULE could be converted to
ZMODULE without loss of meaning. ZMODULE is not likely to be already used in
a program.
Worcha will not alter literals or comments, so the program's
operation will not be affected by this change. If one makes the
changes with a standard editor, then the program may not act as
described in this manual.
(We hope that those people who design compilers will consider this
problem in the future.)
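The difference between a word changer and a plain editor substitution is that
comments and literals are skipped. The Python sketch below shows one way to
get that behavior. It is not the Worcha source and does not read Worcha's
parameter file. It assumes Pascal comments written as { ... } or (* ... *),
string literals in single quotes with '' for an embedded quote, and names
that are compared without regard to letter case. The function and variable
names are ours.

```python
import re

# One pattern that recognizes, in order: comments, string literals, words.
TOKEN = re.compile(
    r"""
      (?P<comment>\{.*?\}|\(\*.*?\*\))   # Pascal comments in either style
    | (?P<literal>'(?:[^'\n]|'')*')      # quoted string, doubled quote inside
    | (?P<word>[A-Za-z][A-Za-z0-9_]*)    # identifiers and reserved words
    """,
    re.DOTALL | re.VERBOSE,
)


def change_words(source, renames):
    """Rename whole words in Pascal source, leaving comments and literals alone.

    `renames` maps old names to new names. Lookups are done in lower case
    because Pascal ignores letter case in identifiers.
    """
    lowered = {old.lower(): new for old, new in renames.items()}

    def replace(match):
        if match.lastgroup == "word":
            return lowered.get(match.group().lower(), match.group())
        return match.group()  # comment or literal: copy unchanged

    return TOKEN.sub(replace, source)


if __name__ == "__main__":
    line = "module := 1; { module } writeln('module');"
    print(change_words(line, {"MODULE": "ZMODULE"}))
```

Only the first occurrence of the word in the example line is renamed; the
comment and the quoted string are copied through untouched.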
(end of delman.assembly.worcha)
ASSEMBLY USING MODULES
First, familiarize yourself with DELMAN.DESCRIBE.CONVENTIONS.
You are now ready to assemble a Delila auxiliary program. The
raw source LISTERR cannot be compiled as it now stands because
it is missing a set of replaceable chunks of code (called modules) to read
books (the book reading interface modules). These are to be
found in DELMODS, as stated in the first few lines of LISTERR. Notice that
DELMODS is a program - compile and run it. This will almost certainly
fail. Correct those modules that cause problems. See the section on
assembly problems.
Modules can be moved around using the MODULE program. The details
of this process are described in MODDEF, which you should study now.
--------------------------- READ MODDEF NOW --------------------------------
(end of delman.assembly.module.1)
Prepare to do the module transfers by compiling MODULES.
All programs should be tested on small inputs at first.
Test the Module program with the example module source and library:
MODULE(EXSIN,EXMODLI,EXSOUT,EXCT,LIST,OUTPUT)
Exsout should be identical to the sout example in ModDef.
Examine list and exsout.
Now try:
MODULE(LISTERR, DELMODS, LISTERS, DELCAT, OUTPUT)
The OUTPUT file will tell you the progress MODULE makes during the
transfer. Modules in DELMODS will be copied into the right places of LISTERR
and the result will be LISTERS (LISTER with inserts - source code).
It will be useful to save DELCAT for further transfers from DELMODS.
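The idea of a transfer can be sketched in Python as follows. MODDEF is the
authority on how the Module program really works; this sketch is only a
simplified picture. It assumes that each module starts with a line containing
"(* begin module NAME *)" and ends with a line containing
"(* end module NAME *)", that the two markers are on separate lines, and that
modules are not nested. It does not use a catalogue or write a list file.
The function names are ours.

```python
import re

BEGIN = re.compile(r"\(\* begin module (\S+) \*\)")
END = re.compile(r"\(\* end module (\S+) \*\)")


def read_modules(library_lines):
    """Collect {name: lines} for every complete module in a module library."""
    modules, name, body = {}, None, []
    for line in library_lines:
        if name is None:
            started = BEGIN.search(line)
            if started:
                name, body = started.group(1), [line]
        else:
            body.append(line)
            ended = END.search(line)
            if ended and ended.group(1) == name:
                modules[name] = body
                name = None
    return modules


def transfer(source_lines, modules):
    """Copy the source, replacing each module that the library also holds."""
    out, skipping = [], None
    for line in source_lines:
        if skipping is not None:
            # Inside a module being replaced: drop lines up to its end marker.
            ended = END.search(line)
            if ended and ended.group(1) == skipping:
                skipping = None
            continue
        started = BEGIN.search(line)
        if started and started.group(1) in modules:
            out.extend(modules[started.group(1)])
            skipping = started.group(1)
        else:
            out.append(line)
    return out
```

Modules named in the source but missing from the library are copied through
unchanged in this sketch.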
Compile LISTERS. Run the LISTER (using the default parameters):
LISTER(EX0BK, EX0LIT)
The file EX0LIT is a listing of the example book EX0BK. It should be
identical to EX0LI. The possible exception is the begin-page character:
some computers use a 1 to indicate jump to the next page, while others
use control-L.
We would now like to know that LISTER works correctly. To do
this requires a comparison program. MERGE will do. However, to
construct MERGE requires modules from PRGMODS. Compiling PRGMODS and
running it will test interactive i/o. The procedures in PRGMODS
that may need modification are PROMPT, READCHAR and READLINE, in
decreasing order of system dependence. You should modify LINELIMIT
and HALT by transferring the corrected modules from DELMODS into
PRGMODS. Prepare PRGMODS and run it.
Prepare MERGE and use it to prove that EX0LIT = EX0LI.
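If you want a quick check before MERGE is running, any line by line
comparison will do. The Python sketch below is not MERGE. It reports the
first line at which two listings differ, and it can forgive the begin-page
difference mentioned above. We assume here that the begin-page character is
the first character of its line, either a 1 or a control-L, which may not
hold on your system. The names are ours.

```python
from itertools import zip_longest

# Assumption: the two begin-page conventions, found in the first column.
PAGE_BREAKS = ("1", "\f")


def first_difference(lines_a, lines_b, ignore_page_breaks=True):
    """Return the 1-based number of the first differing line, or None."""
    pairs = zip_longest(lines_a, lines_b)
    for number, (a, b) in enumerate(pairs, start=1):
        if a == b:
            continue
        if (ignore_page_breaks and a is not None and b is not None
                and a[:1] in PAGE_BREAKS and b[:1] in PAGE_BREAKS
                and a[1:] == b[1:]):
            continue  # same line apart from the begin-page character
        return number
    return None
```

A result of None means that the two listings agree; a missing line in the
shorter listing counts as a difference.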
You may now construct the rest of the programs. Note that some
of them use several module libraries. For the next stage of setting
up the Delila System compile CATALS, LOOCATS and DELILAS. You must
now construct the libraries: skip to CONSTRUCTING YOUR OWN LIBRARIES,
(DELMAN.CONSTRUCTION).
NOTE FOR A SECOND TRANSPORTATION
If you obtain a later version of the Delila System, then Delmods and
other module libraries are likely to be altered. You will want to replace
modules in the new DELMODS and PRGMODS with your own (system dependent)
versions. If you did this directly, you would also replace corrections
and changes to DELMODS. To avoid this problem, simply construct a small
module library (containing for example LINELIMIT, DATETIME modules and
the interaction modules). Then use this to change DELMODS and PRGMODS.
(end of delman.assembly.module.2)
AN EXAMPLE OF CONSTRUCTING A DELILA SYSTEM PROGRAM
In this example we show the series of steps used to set up a Delila
system program, given that the module libraries are ready (that is,
they compile and run). The example is for Patser, which requires both
Delmods and Auxmods. We assume that the tools needed to do this are
already set up, as discussed on the previous pages. As noted in
DELMAN.ASSEMBLY.INTRO, it is frequently possible to automate these steps.
1. Change Characters
chacha(patserr,patser1,chachap)
Chachap must contain the changes you determined earlier.
2. Remove Blanks
rembla(patser1,patser2)
3. Change Words
worcha(patser2,patser3,worchap)
Worchap must contain a list of special reserved words and what they
are to become.
4. Insert Modules
module(patser3,auxmods,patser4,auxcat)
module(patser4,delmods,patsers,delcat)
Auxcat and delcat will be generated by Module if they were empty. You
can reuse them later with their respective module libraries. The
module libraries needed are listed in the first few lines of each
program. It is not necessary to pick up the DESCRIBE module
to compile the program.
5. Compile
Patsers is now source code that is ready to compile.
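Because the steps above follow a fixed naming pattern, the text of a command
file can be generated. The Python sketch below only builds the command lines
in the notation of this page; it does not run anything, and how the commands
are executed depends on your computer. The function name and its arguments
are ours, and we assume at least one module library is given.

```python
def assembly_steps(name, module_libraries):
    """Return the commands that turn raw source NAMEr into source NAMEs.

    `module_libraries` is a list of (library, catalogue) pairs, applied in
    the order given. The last transfer writes the final source, NAMEs.
    """
    steps = [
        f"chacha({name}r,{name}1,chachap)",   # 1. change characters
        f"rembla({name}1,{name}2)",           # 2. remove blanks
        f"worcha({name}2,{name}3,worchap)",   # 3. change words
    ]
    current = 3
    for index, (library, catalogue) in enumerate(module_libraries):
        last = index == len(module_libraries) - 1
        target = f"{name}s" if last else f"{name}{current + 1}"
        # 4. insert modules, one library at a time
        steps.append(f"module({name}{current},{library},{target},{catalogue})")
        current += 1
    return steps


if __name__ == "__main__":
    libraries = [("auxmods", "auxcat"), ("delmods", "delcat")]
    for command in assembly_steps("patser", libraries):
        print(command)
```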
(end of delman.assembly.example)
ASSEMBLY PROBLEMS
Transportation and assembly problems occur most often because of
unavoidable system dependent features of particular Pascal compilers.
INTERACTIVE INPUT
For interactive input we wrote several modules that work on our computer
(INTERACT in PRGMODS). These procedures may or may not be transportable,
so you may have to modify them. For example, interactive input on a Cyber
Pascal compiler requires the file name "input/" - you would have to remove the
"/" for your compiler. (This is no longer necessary, as the source
code is now under UNIX which does not require this.)
DATE AND TIME PROCEDURES
The module for date and time calls (module PACKAGE.DATETIME in DELMODS)
must be rewritten. We strongly recommend that you keep the same form for
the dates in libraries so that these routines remain interfaces. Changing
the form of the date would make transportation of libraries difficult because
they would not have the same structure in different locations.
Modules that will work on a VAX computer are in VAXMODS. You may find
it easier to adapt these to your computer rather than the ones that
are in Delmods.
If your computer does not have a clock, the simplest way to get this
module running is to add DATE and TIME procedures in the form called
by READDATETIME. These dummy procedures could return either a fixed time
or a random time made by a true random number generator. The date
and time are used to uniquely identify books and some data files.
QUOTES
CDC Cyber Pascal compilers require double quotes (") where the standard is
the single quote (').
SOLUTION: use CHACHA to convert:
" to ' and ' to "
In some cases you will have to use two single quotes so that Pascal prints
a single quote. Some programs that print 5' and 3' are Lister, Helix,
Matrix and Dotmat. To convert, simply alter the constant called 'prime'.
(end of delman.assembly.problems.1)
LINELIMIT
In CDC Cyber Pascal compilers, output to files is limited to 1000
lines unless the LINELIMIT procedure
is called. Your compiler may not require or recognize this silliness.
SOLUTION: The calls to linelimit are isolated to the procedure
UNLIMITLN in the module by the same name in DELMODS and PRGMODS. Simply
surround the call (inside the modules!!!) with comments.
INTERNAL FILES (thanks to Sandy Parkinson)
An "internal file", for the discussion here, is a file used
by a Pascal program as a scratch pad. It is not connected to the
outside world. Some computer systems and their Pascal compilers
require that all files be connected to the outside, as they are not
capable of creating temporary files. At least two Delila programs
use internal files: Module and Split. Correction of this problem
requires some programming. It may not be possible to do it for Split.
COMPARISONS OF PACKED ARRAYS
Comparisons of packed arrays may cause you some problems. One solution is
to use arrays that are not packed and to write your own comparison procedure.
THINGS THAT WE HAVE NOT THOUGHT OF...
Please tell us! Our address is in DELMAN.INTRO.POLICY.
For notes on the writing of transportable programs see DELMAN.PROGRAM
and DELMAN.DESCRIBE.CONVENTIONS.WRITING.
(end of delman.assembly.problems.2)
GUIDE
(end of delman.guide)
HELLO COMPUTER - A GUIDE TO THE NEW USER
ABOUT THIS SECTION: This section is a guide to using the computer.
Whenever you have questions about the computer, this is the place to
look, because the rest of the manual is about the Delila System ONLY.
That is to say, we have split this manual into several parts - and it will
not help for you to look for the right thing in the wrong part. The
reason for this is that the information about the Delila System can be
moved from one computer to another (just like the Delila System) but
information about computers usually cannot be moved. DELMAN.GUIDE must be
REWRITTEN for other computers and operating systems.
ABOUT THIS COMPUTER: This manual section is written specifically for
UNIX operating systems. (UNIX is a trademark of Bell Laboratories.)
OTHER DOCUMENTS AND RESOURCES:
In general, ask around.
Type
help
to get pointers.
Learn how to use the UNIX manual program (man).
The apropos program is useful for finding things.
There are hundreds of books on UNIX. Find one you like. Many
people seem to like:
UNIX for People by P. Birns, P. Brown and J. C. C. Muster
Prentice-Hall, Inc, 1985
The easiest way to learn to use a computer is to use the computer!
Obtain a login identification and plunge in.
DO NOT REVEAL YOUR PASSWORD TO ANYONE!!!
(end of delman.guide.intro)
SOME ADVICE TO A NEW COMPUTER USER:
1) YOU CAN'T HURT THE COMPUTER. Don't hesitate to try things and
to play around!
2) After you learn how to get on and off the computer your best bet is to
get a firm grip on what files are, how you can make them and how to
manipulate them. The easiest way to understand what is happening is to watch
it happen. You should use the commands that display your files after each
file manipulation - until you have a good feeling about what is happening.
If you do this you will quickly become confident about what you are doing.
3) A lot of the general principles that you pick up will be similar
on other computers.
4) Be wary of the characters you type. Notice that a zero (0) is NOT
the same as the capital letter O - the computer can tell them apart.
This is also true for a one (1) and the lowercase letter l.
5) Do not do any serious work while you learn to use the computer. You
are likely to destroy some of your files. That will hurt you and not
the computer. Loss of good data can be terribly frustrating.
6) If you have a problem TRY A SIMPLER CASE,
TRY TO ISOLATE THE PROBLEM.
7) An experienced advisor is worth a thousand hours of computer time.
UNCRITICAL ACCEPTANCE OF COMPUTER RESULTS
"So useful has the computer become in all branches of statistical analysis
that there may be some tendency to forget that even it has its limitations.
The computer cannot work magic--not yet anyway. It will do only what it is
instructed to do, and the validity of the results is determined by the
accuracy and adequacy of the data put in and the wisdom of the people
writing the instructions. Granted, the computer can perform a great
many calculations much more rapidly than mere mortals can do them.
Nevertheless, speed of computational work is not the same thing as
infallibility in aiding with the decision-making process. A statistical
critic, of all people, should guard against being overawed by the news
that certain information was turned out by a computer. The mere fact
that computers are being used these days even to cast horoscopes should
be ample proof that a computer is no more immune to spewing out
nonsense than are real flesh-and-blood people."
-from FLAWS AND FALLACIES IN STATISTICAL THINKING
by Stephen K. Campbell (N.J. Prentice-Hall Inc., 1974), p. 182
(end of delman.guide.advice)
HOW TO USE THE DELILA SYSTEM ON THIS COMPUTER
Computer: Cutterjohn and Sparky.
The Delila System programs and documentation are kept in the directory
~toms/delila
The binary forms (which you can run) are in
~toms/bin
If you put this directory in your path, then they will simply be commands.
(end of delman.guide.delila)
PPPPPPP   RRRRRRR    OOOOOO    GGGGGG   RRRRRRR      AA     M      M
PP    PP  RR    RR  OO    OO  GG    GG  RR    RR    AAAA    MM    MM
PP    PP  RR    RR  OO    OO  GG        RR    RR   AA  AA   MMM  MMM
PP    PP  RR    RR  OO    OO  GG        RR    RR  AA    AA  MMMMMMMM
PP    PP  RR    RR  OO    OO  GG        RR    RR  AA    AA  MM MM MM
PPPPPPP   RRRRRRR   OO    OO  GG  GGGG  RRRRRRR   AAAAAAAA  MM    MM
PP        RR  RR    OO    OO  GG    GG  RR  RR    AA    AA  MM    MM
PP        RR   RR   OO    OO  GG    GG  RR   RR   AA    AA  MM    MM
PP        RR    RR   OOOOOO    GGGGGG   RR    RR  AA    AA  MM    MM
(end of delman.program)
SUGGESTIONS ON HOW TO LEARN AND DO PROGRAMMING
(An Essay By Tom Schneider)
ABOUT LANGUAGES
A computer language is the meeting ground between the absolutely
rigid requirements of a computer (it must be told exactly what to
do) and the ambiguous and flexible uses of human languages
(such as "go jump in a lake", "pour me a cup", etc.).
Recently many academic institutions in the USA have allowed students
to substitute computer languages for a knowledge of human languages.
Although a knowledge of computers is becoming increasingly important
in our society, this change is short sighted: no computer
language is anywhere near as powerful or beautiful as those
practiced by humans. With dedication one can easily learn twenty
computer "languages" in a few years, whereas the polyglot is rare
indeed. It is important to learn both kinds of language. For one to
substitute FORTRAN for French is preposterous cheating.
HOW DO LANGUAGES WORK? COMPILERS
Every kind of computer has its own internal "machine" language.
It is difficult for a person to write or read this because it
consists of long stretches of ones and zeros: 0100101010111010000011
10110111101001110010100101001010... Every "bit" (a one or a zero) must be
exactly right or the machine will not operate correctly. Most
people can't deal with such immense amounts of detail. The solution
is to force the computer to keep track of the details and let the person
think in word-like and sentence-like units:
IF SUNNY THEN REJOICE
ELSE MOPE;
Once one has written a set of sentences in a "higher" level language,
one must have the computer convert them to its own internal machine
language (this is not strictly true, but we will only discuss one
method here). The process is called compiling. A self-contained and
consistent set of "sentences" and "paragraphs" is called a program.
Obviously one also needs a program to do the compiling - that program
is called a compiler.
For example, one relatively modern language is called Pascal. A
Pascal compiler sits ("resides") in ("on" - so much for jargon)
a particular computer. It converts statements made in the Pascal
language into machine zeros and ones for that computer (and only
that computer). In other words, it converts a SOURCE code into an
OBJECT code. The object code can be made to operate ("run") only
on one kind of computer. (Note: the word "code" means "program". Also,
on some computers one must convert the object code into "executable"
code before it can be run.)
(Here is something to puzzle over. It is now common practice to write
a compiler in the same language that the compiler compiles. The
Pascal compiler was written in Pascal. It's like pulling oneself
out of the mud by the bootstraps... how did it start?)
WHY PASCAL?
One of the first languages written was called FORTRAN. In its day
(the 1950's) it was a great boon because one no longer needed to write
in machine language (or even one step up, assembly). Since that time
many new ideas have been incorporated into languages. Some of them
(such as recursion and complex data types) fall outside the range that
FORTRAN can handle. This evolution is to be expected. Yet people
still try to teach an old dog new tricks, so there has been a series of
"improvements" to FORTRAN. The result is a great mish-mash of
dialects. For these reasons (and other things like the dread
FORMAT statement) it is difficult (although not impossible) to write good
transportable code in FORTRAN. ("Transportable" or "machine independent"
means that the program will work on several different computers.)
Pascal is a more modern language, so it includes recently developed
concepts. One can write excellent crystal clear code in this language.
Unfortunately this property does not prevent one from writing poor and obscure
code!
TOPDOWNING: How To Write Clear Code
There are as many ways to write code as there are people. Yet a
few simple principles allow one to organize one's thoughts quickly
and efficiently.
Writing a program is just like ... writing an outline.
One starts at the "top" by writing the main things to be done:
Tom's Day
I. Morning
II. Travel To Work
III. Work
IV. Travel Back Home
V. Evening
Then one writes the first section:
I. Morning
A. Get Up
B. Shower
C. Get Dressed
D. Eat
E. Put On Coat
This is repeated for the other sections. Eventually we get even deeper:
I. Morning
A. Get Up
1. Huh?
2. Open eyes
3. Yawn
...
In Pascal, one dispenses with the numbering of sections. Instead,
each section has a name. A section is called a procedure. Since you
can read all about procedures, I won't go into more detail here.
The main advantage to this method is that if one is careful, each
procedure is isolated from all the others. There is only one thing to
think about at a time.
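As a rough illustration only, the outline above might be sketched like this.
The sketch is in Python rather than Pascal, it is not part of the Delila
System, and every function name in it is made up for the example. Each
outline heading becomes one small named section, and the top level only
mentions the main things to be done.

```python
# A sketch of top-down design: each outline heading becomes one function.
# Every name here is hypothetical; only the outline comes from the text.

def get_up():
    # The deepest level of the outline: I.A.1 through I.A.3.
    return ["huh?", "open eyes", "yawn"]

def morning():
    # Section I: the lower-level steps are called by name, in order.
    steps = get_up()
    steps += ["shower", "get dressed", "eat", "put on coat"]
    return steps

def toms_day():
    # The "top" of the outline: only the main things to be done.
    day = morning()
    day += ["travel to work", "work", "travel back home", "evening"]
    return day

print(toms_day())
```

Notice that toms_day can be read and checked without knowing how get_up
works. That isolation is the point of the method.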
SPAGHETTI PROGRAMMING
Many computer languages, including Pascal, allow one to jump from one
statement to others in the program. These GOTO statements invariably
lead to poor programs because one creates nests of GOTO's that jump
all over the place. These can be difficult to figure out. I
have seen a case where a professional programmer didn't know about an
inefficient series of jumps that he had written. Even large companies
sell code that is a tangled mess. Modern programmers have found that
the solution is amazingly simple:
DON'T USE GOTO'S
The Delila system programs use only one GOTO, in a procedure named HALT
which terminates the program by jumping to the end of the program. This
is necessary because Pascal does not provide for a program abort procedure.
(Pascal HALT is not standard.) There are NO other circumstances when a
GOTO is required!!
A METHOD FOR WRITING PROGRAMS
This is what I do when I write a program: I have a stack of old
computer paper (or standard size paper, not printer size). I write
one procedure on each sheet. An entire procedure is "no longer than"
one page. In fact, any procedure longer than a page is usually
a warning that I need more procedures. It is not necessary at first
to write the details of every procedure, only to define the
procedures. Starting from the top I work down a ways, realize that I
need a set of primitive procedures (e.g. to manipulate text lines)
so I define them, but the way they work can be written later. So
as the highest levels of the program are formed, the lower levels
are defined. Eventually it is time to write details of the lower
levels. Sometimes the higher level can be simplified as the lower
levels become clearer.
As you can tell from this description, one begins from the top, but
the entire structure changes as one goes. Don't be afraid to toss
out a procedure that's no good - it's only one page and the paper
can be recycled.
The last point is important: be flexible. Don't keep banging your
head against a logical dilemma. I have often outlined a whole
program - and then tossed it out because there was a
better solution. Learn when to drop. Clues: you find yourself
trying to do many things at once; the primitive procedures that
you have devised are awkward to use; and you find it impossible to
document a procedure.
Document a procedure??
DOCUMENTATION: The Key To Immortal Code
Even in a high level language like Pascal, it is possible to have a
functioning program that is not easy to understand. To define a procedure
I often write down the name of the procedure, the variables (pieces of
information to be manipulated) that it uses and then a few English sentences
that define exactly how the variables are to be used. This is all one needs
for the higher levels of the outline. Those written sentences are called
comments. They are part of the documentation required to make the program
easy to write and ... easy to read.
It is impossible to overemphasize the importance of documentation
because nobody EVER does enough (me included).
If you don't document, within a short time (e.g. a month to half
a year) you will have forgotten the details of the program - and it will be
painful to figure it out again. Worse than that - nobody else will be
able to work with it!
It is not hard to write out what you are trying to do in a particular
section of code or procedure, and it has a real advantage: one is
forced to think clearly.
There are several places in a program that ought to have comments:
PROGRAM STATEMENT - the program should state its purpose in life, how it
should be used, who wrote it and the date of the latest version. Some
technical details can be included.
CONSTANTS - Include a constant called VERSION and CHANGE THIS EVERY TIME
THAT YOU CHANGE THE SOURCE CODE. Write the version to all output from the
program. This will assure that all output can be unambiguously
associated with a particular version of the program. This will save you
many headaches! (Note: some computers keep track of file versions.
FILE VERSIONS WILL NOT SUBSTITUTE FOR AN INTERNAL CONSTANT because
the program output is not affected and it is not transportable.)
All CONSTANTS, TYPES and VARIABLES should have a short description of
their purpose. DON'T USE ONE VARIABLE FOR TWO PURPOSES - you will
be unable to document these cases properly and the code will be
confusing.
Each PROCEDURE or FUNCTION should have a short description that
tells how to use it and gives the purpose of each passed variable.
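The places for comments listed above can be seen together in a small sketch.
This one is in Python rather than Pascal, and the program name (versdemo),
the function and the version number are all invented for the illustration.
It assumes only what the text says: a program statement, a VERSION constant
that is written to all output, and a description for each procedure and
passed variable.

```python
"""versdemo: report the length of a sequence (a made-up example program).

Purpose: show where comments belong. Usage: call report(sequence).
The author and the date of the latest version would be recorded here.
"""

# VERSION must be changed every time the source code changes.
VERSION = 1.00

def report(sequence):
    """Return the output lines for one sequence.

    sequence: a string of bases, e.g. "ACGT". It is read, never changed.
    """
    lines = []
    # The version goes on all output, so that every result can be traced
    # to the exact version of the program that produced it.
    lines.append("versdemo %4.2f" % VERSION)
    lines.append("length: %d" % len(sequence))
    return lines

for line in report("ACGT"):
    print(line)
```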
*****************************************************************************
* SUMMARY: programming is vastly simplified by using two simple tactics: *
* topdowning and documentation. *
*****************************************************************************
A NOTE ON DATA STRUCTURES
Higher level languages, such as Pascal (but not FORTRAN), allow one to
describe data in forms (structures) that resemble the way one thinks
about the problem. To take advantage of these facilities, it pays to
name each "variable" (a structured box into which data is put) and "type"
(the structure of the box) carefully. A good name will make
operations on the variable obvious, and errors will stand out because
they will "sound" wrong.
LOCATING ERRORS: Debugging
Even with top down programming and documentation, errors are made.
These are called "bugs". There are several kinds:
SYNTAX - the compiler will yell at you for things like spelling mistakes
BOMBING - the program stops abruptly when it should not
LOGIC - the program produces strange results
SUBTLE - the program can't handle certain rare conditions correctly
SYNTAX - It helps to check what you type in. Since I put one procedure
per hand written page, this is the easiest unit to check. Many subtle
bugs can also be caught this way.
BOMBING - It is often obvious where the program died. Work backwards through
the logic to find the error. Clear, top-down code makes this much easier:
one can often tell immediately where the problem is. Tracing also can
help. See below.
LOGIC and SUBTLE - Some computer systems allow one to trace the path that
the computer follows through a program. So far I have not found these
useful because they are cumbersome and they put out too much data.
A few well placed write statements will trace the program flow quite well.
(A "write statement" could print the value of a variable out for you and
tell you where the computer currently is in the program.)
In Pascal, one method is to make a global constant:
DEBUGGING = TRUE; (* FOR DEBUGGING PURPOSES *)
and use it this way:
IF DEBUGGING THEN WRITELN(OUTPUT, 'BEGIN PROCEDURE CIRCLE');
By changing the value of DEBUGGING one can turn the trace on and off.
To turn off an individual trace point, one can "comment it out":
(* IF DEBUGGING THEN WRITELN(OUTPUT, 'BEGIN PROCEDURE CIRCLE'); *)
The symbols "(*" and "*)" will make Pascal ignore the contents,
because they become comments. The advantage of this over removing the
statement is that it allows one to reactivate it easily.
By far, the most time saving method is to write clear, well documented code.
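The same switch can be sketched in other languages. Here is a minimal
version in Python, given only as an illustration. The function circle_area
and its trace message are hypothetical and do not come from any Delila
program.

```python
# A global debugging switch, set once at the top of the program.
DEBUGGING = True  # for debugging purposes; set to False to silence the trace

def circle_area(radius):
    # A trace point: it reports where the program is and what it was given.
    if DEBUGGING:
        print("begin procedure circle_area, radius =", radius)
    return 3.14159 * radius * radius

print(circle_area(2.0))
```

As in the Pascal version, one change to the constant turns every trace
point on or off at once.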
TESTING CODE
It is often worthwhile to test a program on a small set of examples that
one has worked out by hand. You should be aware, however, that correct
answers to tests do not prove that the program is correct. (This may
seem obvious, but it is an easy mistake to make.) Sometimes one can
prove the correctness of a program. This is a current field of research
in computer science.
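Here is a small sketch of testing against examples worked out by hand. It is
in Python, and the helper count_bases is invented for the illustration. The
expected counts were obtained by counting the letters by eye.

```python
def count_bases(sequence):
    """Count each of the four bases in a sequence string (hypothetical helper)."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for base in sequence:
        counts[base] += 1
    return counts

# Cases small enough to work out by hand.
assert count_bases("") == {"A": 0, "C": 0, "G": 0, "T": 0}
assert count_bases("GTG") == {"A": 0, "C": 0, "G": 2, "T": 1}
assert count_bases("AACGT") == {"A": 2, "C": 1, "G": 1, "T": 1}
print("all hand-worked cases passed")
```

All three cases pass, yet the function is not proven correct: a sequence
typed in lower case, for example, would make it stop with an error, and no
test above reveals that.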
HOW TO READ MANUALS
Obtain your own copy of the manual and begin to read. Get a general idea
of how the language, editor or system works. Don't worry about details
yet. As soon as you have an idea about how to do something, try it on
the computer. Play. Later on, you can read through the manual seriously
if you want. However there is often a lot of detail that you would have
to memorize. It is simpler to know that something can be done (by reading
it once lightly) and to look it up when you need to do it.
WRITING TRANSPORTABLE PROGRAMS
A program written for one computer may not run on another computer
because the compilers for the two computers may not understand the
same language. Moving a program from one computer to another is called
transportation. If you are going to the trouble and effort to write a
good program, then you may as well make it easy for other people to use
it. Your program would then be transportable.
Obviously to be transportable, a program must be well written and
documented. That is not all. You must avoid all the fancy "features"
that your compiler advertises, because no one else has these. If you
are forced to use some feature, then isolate it to a few replaceable
procedures. We have provided you with a transportable(!) mechanism for
replacing chunks of code like this - see the document MODDEF and the MODULE
program.
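To show what we mean by isolating a feature, here is a sketch in Python. It
is not the MODULE mechanism itself, only the idea behind it. We assume, for
the sake of the example, that reading the clock is the system-dependent
feature. The names get_datetime and listing_header are hypothetical.

```python
import time

def get_datetime():
    """The one system-dependent routine: return the date and time as text.

    On a system without a clock, only this function would need replacing,
    for example by one that returns a fixed string.
    """
    return time.strftime("%y/%m/%d %H:%M:%S")

def listing_header(program, version):
    # Portable code: it never touches the clock directly.
    return "%s %s %4.2f" % (get_datetime(), program, version)

print(listing_header("DELILA", 1.20))
```

Everything outside get_datetime can move to another computer unchanged.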
PROGRAM MAINTENANCE... SENILITY... AND DEATH.
The most costly aspect of using computer programs is not their initial
writing, but maintaining them once they are written. This is well
documented in the literature. But why should programs need
maintenance? Aren't they fixed text that does not change? In the
simplest sense this is true. But over time, bugs in the code are found
and fixed, and needs and expectations change. Programs are not
static, they evolve. Good programming techniques and documentation
make maintenance easier during the life time of a program, but eventually the
program becomes so hard to change that one must scrap it altogether
and start a fresh design. So programs have a birth, a life of use and
maintenance and, finally, a senility before they die.
REFERENCES
"Pascal User Manual and Report", Second Edition, by Kathleen Jensen
and Niklaus Wirth. Springer-Verlag, 1978.
"Software Tools in Pascal", Brian W. Kernighan and P. J. Plauger.
Addison-Wesley Publishing Co. 1981.
"Algorithms + Data Structures = Programs", Niklaus Wirth.
Prentice-Hall, Inc., 1976.
"Structured Programming", O. J. Dahl, E. W. Dijkstra and C.A.R. Hoare,
Academic Press. London, 1977.
"Selected Writings on Computing: A Personal Perspective",
E. W. Dijkstra, Springer-Verlag, New York, 1982.
(end of delman.program.essay)
A Fairy Tale For Programmers
The Three Most Important Concepts
for Writing Good Code
1. Put comments in your code.
2. Don't ever forget that six months from now your program
will be useless even to you without comments.
3. Several people who published a rather well known article on
using computers to study sequences (and whose names shall remain unsaid
to protect the guilty) sent their programs to us two years after they
had published their article. It turned out that we could not use
their programs directly because we did not have available the language
that they used. It was necessary to translate each line of code into
our language before we could use their program. Ok, fine, we know how to
do that. But despite the fact that these were old programs that they had
been working on for a long time, there were almost no comments in
their code. That made the translation 100 times more difficult!!
One sees an equation in the code - what does it mean? If they do
something in a funny way, was it a mistake or is it important to
do it that way? What a headache!
We threw out their programs and wrote our own.
MORAL: Code that is not documented in English will
not survive in the long run. Therefore:
Put In Comments.
Comment As You Code, NOT AFTERWARDS - Comments Are Part Of The Code.
Change The Comments When You Change The Code, NEVER PUT THIS OFF.
Epilogue
Years later, out of curiosity, the program called CODE
(COmment DEnsity) was written. We were startled to discover that
the frequency of characters devoted to comments in our code
averages around 30 percent!
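For the curious, a comment density can be measured along the following
lines. This Python sketch is NOT the CODE program; it is our guess at the
simplest way to do such a count, assuming Pascal comments of the (* ... *)
kind and counting the delimiters as comment characters.

```python
def comment_density(source):
    """Fraction of characters in source that lie inside (* ... *) comments.

    The comment delimiters themselves are counted as comment characters.
    Quoted strings are not treated specially in this sketch.
    """
    inside = False
    comment_chars = 0
    i = 0
    while i < len(source):
        if not inside and source.startswith("(*", i):
            inside = True
            comment_chars += 2
            i += 2
        elif inside and source.startswith("*)", i):
            inside = False
            comment_chars += 2
            i += 2
        else:
            if inside:
                comment_chars += 1
            i += 1
    return comment_chars / len(source) if source else 0.0

example = "x := 1; (* one *)"
print(round(comment_density(example), 2))
```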
(end of delman.program.fable)
UU    UU   SSSSSS   EEEEEEEE
UU    UU  SS    SS  EE
UU    UU  SS        EE
UU    UU   SSSSSS   EEEE
UU    UU        SS  EE
UU    UU        SS  EE
UU    UU        SS  EE
UU    UU  SS    SS  EE
 UUUUUU    SSSSSS   EEEEEEEE
(end of delman.use)
Use Of The Delila System
INTRODUCTION
This section of the Delila Manual assumes that you have read the
introduction to the manual, that a Delila System is running on your
computer, and that you know how to get on the computer, to make
files, to modify and correct files, and to run programs (see DELMAN.GUIDE).
There are several sources of information that you can keep in mind:
1) The papers in DELMAN.INTRO.REFERENCES will show you
how we have used the Delila System.
2) LIBDEF. This is a technical specification of Delila and the
libraries. However, there is a set of detailed examples that
can be read profitably without reading all the definitions.
3) The section of DELMAN called Program and Data Descriptions
(DELMAN.DESCRIBE) lists everything that is available to you. Whenever
you want a tool to do something, that is the place to look.
In this section we will first discuss the structure of a Delila Library
and how you can find your pet (pet's?) sequence in it. Next we
describe how to tell Delila to go and fetch your sequences. We will
then discuss programs that let you study the sequences. The sequence
analysis will bring us back to Delila.
(end of delman.use.intro)
LIBRARY STRUCTURE
Think about a tree. The trunk spreads into a series of branches,
sticks and twigs. A Delila library looks something like that, except
that there are several kinds of branch, stick and twig, much as each
twig ends in a leaf, bud or a flower.
We have given names to the kinds of branches and leaves in Delila
libraries. Near the trunk there are the ORGANISM and the
RECOGNITION-CLASS. An ORGANISM is a cluster of data pertaining to a
real-world organism. The term "organism" is somewhat ambiguous, so it
is a matter of taste as to the classification of some creatures (is a
virus a traveling plasmid?). In our library T4, T7 and E. coli
information is stored in ORGANISMs.
A RECOGNITION-CLASS is a cluster of data about any process that
recognizes specific nucleic-acid sequences. These include chemical
modification and restriction enzymes. (At present this portion of
the library is not fully implemented, so we will not discuss it further.)
The library structure can be diagrammed in a schema:
A-->>--B means A has one or more of B.
C--->--D means C has one of D.
                                  LIBRARY
                                   :   :
                                   V   V
                                   V   V
                                   :   :
                     ..............:   :..............
                     :                               :
                 ORGANISM                    RECOGNITION-CLASS
                     :                               :
                     V                               V
                     V                               V
                     :                               :
                CHROMOSOME                           :
                  : : : :                            :
                  V V V V                            :
                  V V V V                            :
                  : : : :                            :
    ..............: : : :...........                 :
    :          .....: :....        :                 :
    :          :          :        :                 :
  MARKER   TRANSCRIPT    GENE    PIECE....         ENZYME
   :  :        :          :      : : :   :           :
   V  V        V          V      : : :   V           V
   :  :        :          :......: : :   :           :
   :  :        :...................: :   :           :
   :  :..............................:   :           :
   :                                     :           :
SEQUENCE                             SEQUENCE    SEQUENCE
(end of delman.use.structure.1)
In this schema you can see that ORGANISMs have one or more
CHROMOSOME branches. Once again, the term CHROMOSOME is intended to
be somewhat flexible. In Delila it means a complete biological
unit of nucleic acid, either DNA or RNA. For example, we refer to both the
CHROMOSOME ECOLI (the 5 million base one) and the CHROMOSOME PBR322 (the
4.3kb plasmid).
Notice that real-world chromosomes are "inside" their organism. In the
same way, one can think of CHROMOSOMEs as being inside their ORGANISM and
ORGANISMs to be inside a library. You may think of a Delila Library
either as a tree or a series of objects, one nested inside the other.
A little reflection will show that these are equivalent because one
can convert from one form to the other.
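The equivalence of the two views can be demonstrated with a toy library.
The Python sketch below is ours and is not how Delila stores a library. The
names ECOLI, T4 and LAC come from this manual, but the piece named EXAMPLE
and both functions are invented for the illustration.

```python
# A toy library as nested objects: each dict holds the things "inside" it.
# The contents are illustrative only and are not a real Delila library.
library = {
    "ECOLI": {"ECOLI": ["LAC"]},      # ORGANISM -> CHROMOSOME -> PIECEs
    "T4": {"T4": ["EXAMPLE"]},        # "EXAMPLE" is a made-up PIECE name
}

def to_paths(library):
    """Nested form -> tree form: one root-to-leaf path per PIECE."""
    paths = []
    for organism, chromosomes in library.items():
        for chromosome, pieces in chromosomes.items():
            for piece in pieces:
                paths.append((organism, chromosome, piece))
    return paths

def from_paths(paths):
    """Tree form -> nested form: rebuild the nesting from the paths."""
    library = {}
    for organism, chromosome, piece in paths:
        library.setdefault(organism, {}).setdefault(chromosome, []).append(piece)
    return library

paths = to_paths(library)
print(paths)
print(from_paths(paths) == library)
```

Each path is a walk from the trunk out to one leaf, and the round trip
gives back the nested form we started with.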
Every ORGANISM and CHROMOSOME has a name by which it can be identified.
For example, T4 is the name of the coliphage of rII fame, while ECOLI
is the name for Escherichia coli. There is other information stored
at these branch points as well. An ORGANISM tells us the genetic map units
used, such as centiMorgan or kilobasepair. The CHROMOSOME goes on to
specify the beginning and ending of the corresponding chromosome in
the given units.
Now we will delve inside a CHROMOSOME. There are MARKERs,
TRANSCRIPTs, GENEs and PIECEs. What is going on? So far we have
been leaning toward a description of an ideal situation where all
the nucleic-acid sequence information of a chromosome would be stored inside
a single data object -- a PIECE. Although this fits small phages such as
PHIX174 and FD, it is nowhere near true even for ECOLI. There are many
disconnected fragments of E. coli sequence now known. As sequencing progresses,
the fragments will connect more and more until the entire sequence is known.
So a PIECE may be either the entire sequence information in a CHROMOSOME
or only one of many fragments. In this way we can store sequences
in their natural arrangement, and still accommodate data that is
fragmented due to technical limitations. As more sequence is obtained,
the SEQUENCE inside a PIECE is extended or fused to neighboring PIECEs.
Like all the other library objects, a PIECE has a name, usually related
to its biological functions. To keep all the fragments straight, each
PIECE tells its location on the genetic map. The nucleic-acid
sequence is stored inside a SEQUENCE, written 5' to 3'. Besides these
data, each PIECE stores a useful set of information: a
coordinate system.
For the purposes of identification, every published sequence is given
a set of consecutive integers corresponding to basepairs or bases
along the DNA or RNA sequence. This numbering scheme is captured
in the coordinates of each PIECE. Using Delila, subfragments of a
PIECE can be easily obtained. These are also PIECEs and every base
in the new PIECE has the same number that its parent did. This has
WONDERFUL consequences: every printout can refer to the original
published literature. It is also easy to compare the results from
several analyses.
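The idea that a subfragment keeps its parent's numbering can be sketched as
follows. This is a Python illustration under simplifying assumptions of our
own: the piece is linear and numbered in increasing order, and the class
Piece, its methods and the sequence are all invented for the example.

```python
class Piece:
    """A stretch of sequence plus the coordinate of its first base."""

    def __init__(self, name, first, sequence):
        self.name = name
        self.first = first                      # coordinate of sequence[0]
        self.sequence = sequence
        self.last = first + len(sequence) - 1   # coordinate of the last base

    def base_at(self, coordinate):
        # Look a base up by its published number, not by its array index.
        return self.sequence[coordinate - self.first]

    def subfragment(self, start, stop):
        # The new piece keeps the parent's numbering: it begins at 'start'.
        if not (self.first <= start <= stop <= self.last):
            raise ValueError("request is outside the piece")
        lo = start - self.first
        hi = stop - self.first + 1
        return Piece(self.name, start, self.sequence[lo:hi])

parent = Piece("DEMO", 1, "AACCGGTTAC")   # made-up sequence, numbered 1 to 10
child = parent.subfragment(4, 8)
print(child.first, child.last, child.sequence)
print(parent.base_at(5) == child.base_at(5))
```

Base number 5 is the same base whether one asks the parent or the child,
which is why printouts from different analyses can be compared directly.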
(end of delman.use.structure.2)
Let's move on to the GENE, one of the other data-objects inside a
CHROMOSOME. A GENE defines the endpoints of the genetic information
of a protein in the SEQUENCE of a PIECE. For example, in ORGANISM ECOLI;
CHROMOSOME ECOLI there is a PIECE LAC. The GENE LACI refers to this
PIECE by pointing to the first G of the GTG and the A of the TGA.
A TRANSCRIPT is similar to a GENE, but it defines any region
transcribed into mRNA. For consistency, we consider a tRNA to be a
TRANSCRIPT and not a GENE. GENE is reserved for the coding sequence
of polypeptide products.
Suppose that a mutation is known for your favorite sequence. The
MARKER is designed to record the change made by the mutation.
MARKERs can also record splice junctions and other interesting
sequence features. In the future Delila will allow one to obtain
both a sequence and its mutated forms using MARKERs.
Notice that MARKERs, TRANSCRIPTs and GENEs all refer or point to
a particular PIECE. Each PIECE therefore has a "family" of related
branches. It is here that the tree-like structure of the library
begins to break down: some of the branches are connected to one
another in a kind of network.
Now it is time to become practical. Obtain a copy of HUMCAT. This
is a catalogue of the library, the HUMan's CATalogue. (Delila also
has one for herself). Look around HUMCAT. Notice that it is
organized by ORGANISM, CHROMOSOME, and so forth. Find a GENE or
TRANSCRIPT that you are interested in. In the next section you
will learn how to obtain it to play with.
(end of delman.use.structure.3)
DELILA - THE LANGUAGE
WHY WRITTEN INSTRUCTIONS?
One of our major design decisions was the use of written instructions
for the librarian. While we realize that this is somewhat forbidding
to a new user, it does have several advantages over direct interactive
use. One is that it is easier to correct mistakes in the list of
sequences that are to go into the book than it is to change sequences by
hand. Corrections to instructions are done with a text editor. Also, the
amount of information necessary to obtain a fragment of sequence is usually
less than the information in the sequence itself, so storing instructions
instead of sequences is efficient. Another advantage is that a complete
and concise record may be kept. As we will see later, the instructions can
also be generated by auxiliary programs, allowing one to automate many
complex manipulations.
WHAT IS THE DELILA LANGUAGE?
This section describes the use of the language Delila:
DEoxyribonucleic-acid
LIbrary
LAnguage.
The language is not as complex or comprehensive as a natural language
such as English or French. It was designed for a particular task:
telling a nucleic-acid data base manager - the librarian - the set of
fragments that one wants to collect for study. (The name Delila is an
anachronism that we can't bear to part with...)
Since the library is structured like a tree, the language must allow
one to specify individual branches. Eventually a particular PIECE
will be identified, and one can request one or more fragments from
the PIECE. Let us look at an example:
TITLE "EX1: THE LACI GENE";
ORGANISM ECOLI;
CHROMOSOME ECOLI;
GENE LACI;
GET ALL GENE;
(Note: this instruction set is kept in the file EX1IN, so you can
try it. All EXn examples are sent with the Delila System.)
Statements in Delila end with a semicolon (;) - there are five
statements above. The first statement will give a title to the book.
The next three specify a particular GENE in the library structure.
One thinks of this as a series of steps climbing the library tree.
Starting at the "root" of the library, we first named the ORGANISM
ECOLI. This moves us out to that ORGANISM. Then the CHROMOSOME
was chosen to be ECOLI - the main chromosome (as opposed to a
plasmid such as PBR322). Next, the particular gene, lacI, is
specified by "GENE LACI;".
As we noted in the section on structure, GENEs point to the
particular PIECE that they reside on. GENE LACI points to the PIECE LAC.
Although we need not know this for the request, Delila knows it
automatically. When the GET is performed, Delila will obtain the
sequence of lacI from the G of the GTG through the A of the TGA.
After Delila has read each of these statements, the information
about the object (ORGANISM, CHROMOSOME or GENE) is put into the
book. The GET generates a PIECE that is also placed into the book.
(end of delman.use.language.1)
TRY IT OUT
Type a file containing Delila instructions that specify the gene
you chose at the end of the section on library structure. For this
discussion, we will use the name EX1IN, although you may use another
name. Find the entry on Delila (DESCRIBE.DELILA) in the back of this
manual and run it:
delila(ex1in,ex1bo,ex1dl)
Look at the ex1dl file. This is the Delila Listing. The first
line will look like this:
82/01/21 23:17:51 DELILA 1.20 PASS 1 PAGE 1
Delila performs two passes through the instructions. Pass 1 checks for
spelling and syntax errors. If you made a typing mistake, it will be noted
in the listing and Delila will not begin Pass 2. Should Pass 1 be
successful, then Pass 2 begins. Notice that there are several lines that look
something like this:
* 81/01/18 22:29:26, 80/11/19 22:17:46, LIBRARY 1: BACTERIOPHAGE
* 81/01/18 22:29:26, 80/11/19 22:17:46, LIBRARY 2: E. COLI AND S. TYPHIMURIUM
These are the full titles of the libraries from which you are pulling
sequences. Each title has three parts separated by commas:
1) the instant (date and time in descending order) that the library
was created.
2) the instant that the PARENT of this library was created.
3) the title of the library.
Notice that Delila also prints the current date and time at the top
of the listing (if your system has these functions). The first line of a
book or library contains its full title. For this example, this is:
* 82/01/21 23:17:51, 81/01/18 22:29:26, EX1: THE LACI GENE
What is the "genealogy" of the book that you obtained?
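If you would like to check your answer mechanically, the three parts of a
full title can be pulled apart as in the Python sketch below. The function
parse_full_title is hypothetical. It assumes only the layout described
above: an asterisk, then two instants and a title separated by commas.

```python
def parse_full_title(line):
    """Split a full title line into (created, parent_created, title)."""
    # Drop the leading "*" and split on the first two commas only,
    # so that commas inside the title itself would be preserved.
    body = line.lstrip("* ").rstrip()
    created, parent, title = [part.strip() for part in body.split(",", 2)]
    return created, parent, title

library = parse_full_title(
    "* 81/01/18 22:29:26, 80/11/19 22:17:46, LIBRARY 2: E. COLI AND S. TYPHIMURIUM")
book = parse_full_title(
    "* 82/01/21 23:17:51, 81/01/18 22:29:26, EX1: THE LACI GENE")

# Compare the book's parent instant with the library's creation instant.
print(book[1] == library[0])
```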
Back to the listing, Pass 1. The instructions that you typed are
repeated on the listing. To the left are two columns of numbers -
the leftmost is the line number and the next is the statement number
(there can be several statements on one line or one line may contain
only part of a statement). This information is sometimes useful.
Now let's look at the listing, Pass 2. Notice that the instructions
that you typed are repeated again, but that there are extra lines
inserted. In Pass 1 Delila checked for typing errors, while in Pass
2 Delila pulls out data items and places them into the book. As
each item is put into the book, it is given a number:
2 2 ORGANISM ECOLI;
#1
This is useful for some auxiliary programs. We will discuss control of
the numbering in a later section.
If your instructions worked then there will be two other numbers just
below the get:
5 5 GET ALL GENE;
#4
^29^1111
These numbers show you the numbers of the beginning base (29) and
the ending base (1111) for the PIECE put into the book.
(end of delman.use.language.2)
RANGE DEFAULTS
It is quite possible that you got an error message at this point:
4 4 GENE LACZ;
5 5 GET ALL GENE;
#4
^1234^100000
---ERROR(S)---------------------------^206^203
203: OUT OF RANGE AND DEFAULT RANGE = HALT
206: WE DO NOT KNOW THIS LIMIT (A WARNING)
This indicates that only part of the gene you are interested in
exists in the library. Delila detects the fact that one end of
the GENE goes off the end of its PIECE, and says that this limit (the
end of the gene) is unknown. (This is indicated by the 100000.) Normally
Delila will HALT when this situation is discovered. You can change this by
using the instruction:
DEFAULT OUT-OF-RANGE REDUCE-RANGE;
anywhere before the problem but after the TITLE. This resets the default
response to an out of range situation.
In REDUCE-RANGE mode, Delila will attempt to find the closest edge
of the PIECE and use that. The listing will show a record of what
Delila does:
6 6 GET ALL GENE;
#4
^1234^100000^1419
---ERROR(S)---------------------------^206^208
206: WE DO NOT KNOW THIS LIMIT (A WARNING)
208: OUT OF RANGE AND DEFAULT RANGE = REDUCE (A WARNING)
In this case the PIECE in the book begins at 1234 and ends at 1419.
To cause Delila to continue without putting any PIECE down in the book
one would use:
DEFAULT OUT-OF-RANGE CONTINUE;
You may use several default statements to affect how Delila responds.
To reset the default to halting, use HALT instead of CONTINUE or
REDUCE-RANGE. (See DELMAN.USE.CONTROL)
Use the programs COUNT and LISTER to look at your book.
(end of delman.use.language.3)
MORE ON INSTRUCTIONS
There are several ways to obtain sequences in a book. For example
one could use:
TITLE "EX2: AN ABSOLUTE GET";
(* FIRST WE WILL SPECIFY THE LAC PIECE: *)
ORGANISM ECOLI; CHROMOSOME ECOLI; PIECE LAC;
(* NEXT WE WILL REQUEST A PARTICULAR FRAGMENT OF THAT PIECE: *)
GET
FROM 29 (* THE BEGINNING ABSOLUTE POSITION *)
TO 1111; (* THE ENDING ABSOLUTE POSITION *)
There are several things to note about these instructions. First, there
are five instructions and four comments. A comment is the text between
a (* and a *). You should use comments freely to document what you
are doing. This is made easy by the fact that comments can extend over
several lines. Delila ignores comments.
Several instructions can be put on one line (the specifications, above)
and one instruction can be spread over several lines (the request).
The GET above defines two basepairs in the LAC sequence. The sequence
between (and including) these bases is put into the book. Delila always
puts sequence in the book 5' to 3'. Thus, to get the complementary strand of
the fragment requested above, one simply reverses the endpoints:
GET FROM 1111 TO 29;
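The effect of reversing the endpoints can be sketched in a few lines of
Python. This is only an illustration, not Delila's own code: the helper
name get_fragment, the made-up sequence, and the assumption of a linear
PIECE stored as a single strand numbered upward from first_coordinate are
all ours.

```python
# Hypothetical helper, not part of the Delila System.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def get_fragment(sequence, first_coordinate, frm, to):
    """Return the bases from `frm` to `to` (inclusive), written 5' to 3'.

    `sequence` is the stored strand of a linear PIECE whose first base
    has the coordinate `first_coordinate` (an assumption of this sketch).
    """
    low, high = min(frm, to), max(frm, to)
    fragment = sequence[low - first_coordinate: high - first_coordinate + 1]
    if frm <= to:
        return fragment
    # Descending request: read the other strand, so reverse and complement.
    return "".join(COMPLEMENT[base] for base in reversed(fragment))

piece = "ATGCCGTA"                     # made-up sequence numbered 1 to 8
print(get_fragment(piece, 1, 2, 5))    # bases 2..5 of the stored strand
print(get_fragment(piece, 1, 5, 2))    # the complementary strand, 5' to 3'
```

Both calls cover the same four base pairs; only the strand that is
reported differs.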
RELATIVE VERSUS ABSOLUTE REQUESTS
In contrast to EX2 we could write:
TITLE "EX3: A RELATIVE GET";
ORGANISM ECOLI; CHROMOSOME ECOLI; GENE LACI;
GET FROM GENE BEGINNING
TO GENE ENDING;
In this case we did not state absolute numbers to define our book.
Yet in all three examples (EX1, EX2, and EX3) the same PIECE will be
generated in the book.
There are two ways to define a base in a sequence. One is to give
its exact coordinate as in EX2. That is called an ABSOLUTE reference.
The other way is to define the distance from a fixed point, as in
EX3: a RELATIVE reference.
Both absolute and relative referencing have advantages and disadvantages.
Using absolute coordinates allows us to pinpoint particular bases. However,
Delila libraries evolve over time, and when two previously separate
PIECEs are fused, only one coordinate system is kept. An absolute
reference will not last. On the other hand, a relative reference
will last because the GENE BEGINNING will always be the start of the
gene no matter what happens to the actual coordinate system.
(end of delman.use.language.4)
FORMS OF REQUESTS
By now you may have noticed that there are two kinds of GET:
GET ALL ... ;
GET FROM ... TO ... ;
The two positions of the FROM-TO form are independent as long as
one refers to locations on the same PIECE. In absolute terms one
can say
GET FROM -22 TO 56; (* ABSOLUTE *)
or one can make it relative to a gene beginning:
GET FROM GENE BEGINNING - 10
TO GENE BEGINNING + 5;
One can even write instructions relative to an absolute location:
GET FROM 56 - 10 TO 56 + 5;
This is to be pronounced "get from fifty-six minus ten to fifty-six plus
five". We will come back to this form later.
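The arithmetic behind the "anchor plus offset" form can be sketched as
follows. This is a small Python illustration under our own assumptions:
the function name resolve_get and the regular expression are
hypothetical, only purely numeric requests are handled (not GENE
BEGINNING and the like), and the offsets are treated as plain integer
addition.

```python
import re

# Hypothetical pattern for numeric requests such as
#   GET FROM 56 - 10 TO 56 + 5;
GET_FORM = re.compile(
    r"GET\s+FROM\s+(-?\d+)\s*([+-]\s*\d+)?\s+TO\s+(-?\d+)\s*([+-]\s*\d+)?\s*;"
)

def resolve_get(instruction):
    """Return the (from, to) coordinates named by a numeric GET."""
    match = GET_FORM.search(instruction)
    if match is None:
        raise ValueError("not a numeric GET FROM ... TO ... instruction")
    from_anchor, from_offset, to_anchor, to_offset = match.groups()

    def total(anchor, offset):
        value = int(anchor)
        if offset is not None:
            # Drop blanks between the sign and the digits, e.g. "- 10".
            value += int(re.sub(r"\s+", "", offset))
        return value

    return total(from_anchor, from_offset), total(to_anchor, to_offset)

print(resolve_get("GET FROM 56 - 10 TO 56 + 5;"))
print(resolve_get("GET FROM -22 TO 56;"))
```

Keeping the anchor (56) separate from the offsets is what later lets a
set of requests share one alignment point.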
MARKERs, GENEs, TRANSCRIPTs and PIECEs all have a BEGINNING and an
ENDING that you can use. For example,
TITLE "EX4: NON-CODING LAC LEADER";
ORGANISM ECOLI; CHROMOSOME ECOLI;
GENE LACZ; (* NOW DELILA KNOWS THE PIECE *)
TRANSCRIPT LACZ;
GET FROM TRANSCRIPT BEGINNING
TO GENE BEGINNING -1;
Notice that both a GENE and a TRANSCRIPT can be specified at the
same time.
AMBIGUOUS DIRECTIONS
Consider the circular genome of ORGANISM G4. The numbering of the
PIECE is from 1 to 5577. Suppose that you asked for:
TITLE "G4 COORDINATE PUZZLE";
ORGANISM G4; CHROMOSOME G4; PIECE G4;
GET FROM 1 TO 10;
This is ambiguous! There are TWO PIECES that run from 1 to 10:
one clockwise and the other counterclockwise. In this case Delila
will supply you with the clockwise fragment. However to be more
specific in one's request, one would write:
GET FROM 1 TO 10 DIRECTION +;
or
GET FROM 1 TO 10 DIRECTION -;
But there are still two other possibilities!
GET FROM 10 TO 1 DIRECTION +;
GET FROM 10 TO 1 DIRECTION -;
Delila is capable of handling most requests like these. (Certain
of the most complex cases remain to be solved.)
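To see why the four requests differ, one can walk around the circle one
base at a time. The Python sketch below is only an illustration of the
numbering, not of how Delila is implemented; the function name
circular_walk is ours, and the circle is assumed to be numbered
consecutively from 1 to 5577 as stated above.

```python
def circular_walk(frm, to, direction, first=1, last=5577):
    """List the coordinates met going from `frm` to `to` around a circle."""
    step = 1 if direction == "+" else -1
    size = last - first + 1
    position = frm
    coordinates = [position]
    while position != to:
        # Wrap around when stepping past either end of the numbering.
        position = (position - first + step) % size + first
        coordinates.append(position)
    return coordinates

for frm, to, direction in [(1, 10, "+"), (1, 10, "-"), (10, 1, "+"), (10, 1, "-")]:
    walk = circular_walk(frm, to, direction)
    print(frm, to, direction, len(walk), "bases")
```

Two of the requests give a short fragment of 10 bases and the other two
give the long way around (5569 bases), and within each pair the two
fragments lie on opposite strands.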
(end of delman.use.language.5)
RESPECIFICATION
What if one wanted to specify more than one "leaf" (GENE, TRANSCRIPT,
or MARKER) at one time? Then one would use:
TITLE "EX5: THE REGION BETWEEN LACI AND LACZ";
ORGANISM ECOLI; CHROMOSOME ECOLI;
PIECE LAC; (* NOW DELILA KNOWS THE PIECE *)
GET FROM (GENE LACI) ENDING + 1 TO (GENE LACZ) BEGINNING - 1;
This form is called a "respecification", to distinguish it from
a specification.
MULTIPLE REQUESTS
After Delila has completed a GET, as in the last few examples, the
specifications are still in effect and one can do more GETs,
change the specification, do more GETs, and so on:
TITLE "EX6: MULTIPLE SPECIFICATION AND REQUESTS";
ORGANISM ECOLI;
CHROMOSOME PBR322;
GENE AMPR; GET ALL GENE; (* GET GENE OF BETA-LACTAMASE *)
CHROMOSOME ECOLI; (* CHANGE SPECIFICATION *)
TRANSCRIPT 16SRRNAB; GET ALL TRANSCRIPT; (* 16S RRNA *)
TRANSCRIPT 23SRRNAB; GET ALL TRANSCRIPT; (* 23S RRNA *)
ORGANISM PHIX174;
CHROMOSOME PHIX174;
(* GET TWO OVERLAPPING GENES *)
GENE A; GET ALL GENE;
GENE B; GET ALL GENE;
WHEN DOES DELILA ACT?
During Pass 2, Delila places the various items into the book. Thus
as ORGANISM, CHROMOSOME, GENE or TRANSCRIPT instructions are read,
they are executed immediately. This is not true for the PIECE in the
example EX3 because at that point Delila does not know the endpoints
of the sequence desired. Delila "knows" which PIECE you are interested
in, but not what particular bases. When Delila reads the GET, the bases
become apparent. You can see this in the Pass 2 listing: a PIECE
is not given a number, rather the number is listed for the GET that
generates the PIECE in the book. The numbers are for objects in
the book, not for those in the library.
(end of delman.use.language.6)
AUXILIARY PROGRAMS: LISTER AND SEARCH
In the section on language, we discussed how one can use Delila to
generate books containing sequences one is interested in. It is difficult
to read the sequences in a book because they are in an awkward (from your
viewpoint) compressed format. In everyday use, we almost never look
inside a book because there is a much easier way: generate a fancy
listing using the program LISTER.
In the section on the Delila language you used LISTER to look
at the books that you generated. (If you have not done this, then
you should do it now.) Like other programs, LISTER will print
sequence 5' to 3'. If you want the complement, it is easy to use
Delila to obtain it.
LISTER is an example of an auxiliary program. In contrast, Delila is
the center of the Delila System. The purpose of Delila is the
manipulation of sequence information. Other "auxiliary" programs
perform tasks such as making listings or doing analyses. These
programs are explained in DELMAN.DESCRIBE.
The only other auxiliary program that we will discuss here is the
SEARCH program. SEARCH will search a book for a simple pattern. As
you will recall, books have the same structure as libraries. As
SEARCH proceeds to look into an ORGANISM it will know the name of the
ORGANISM:
ORGANISM ECOLI;
Then it will enter the CHROMOSOME:
CHROMOSOME PBR322;
Finally it begins to search a PIECE:
PIECE PBR322;
In other words, SEARCH can write Delila instructions that trace the
search path. Suppose that we had told SEARCH to search for the pattern
5' AAGCTT 3' (HindIII). We also tell it that the FROM should be -5 and
the TO +10. When SEARCH finds the site it can then write:
GET FROM 29 -5 TO 29 +10 DIRECTION +;
29 is the position of the first A of AAGCTT in PBR322.
These Delila instructions are an answer to the search!
You should try this and the other Auxiliary programs.
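The idea that a search result can be written as Delila instructions can
be sketched in Python. This is not the SEARCH program: the function name
search_instructions is hypothetical, only the given strand is scanned,
only exact patterns are matched, and the test sequence is made up so that
the site falls at position 29.

```python
def search_instructions(sequence, pattern, from_offset, to_offset, first=1):
    """Write one GET per occurrence of `pattern`, anchored at its first base."""
    instructions = []
    start = sequence.find(pattern)
    while start != -1:
        position = start + first        # coordinate of the first base of the site
        instructions.append(
            f"GET FROM {position} {from_offset:+d} "
            f"TO {position} {to_offset:+d} DIRECTION +;"
        )
        start = sequence.find(pattern, start + 1)
    return instructions

made_up = "C" * 28 + "AAGCTT" + "GGG"   # HindIII site placed at base 29
for line in search_instructions(made_up, "AAGCTT", -5, 10):
    print(line)
```

The printed line has the same form as the instruction shown above, so it
could be handed back to Delila together with the ORGANISM, CHROMOSOME and
PIECE specifications.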
(end of delman.use.auxiliary.programs)
DATA FLOW AND DATA LOOPS
In the section on Auxiliary programs we discussed the use of the
SEARCH program to locate patterns in books. The search results appear
in three ways: on the screen, in a file for printing, and as Delila
instructions. These instructions can be given to Delila to generate
the sequences of found sites. One can view this entire process as a
flow of data between one program and the next. Since this manual cannot
have (nice) line figures, we strongly urge you to look at the flow
figures in the published papers listed in DELMAN.INTRO.DESCRIPTION.
Connecting parts of the Delila system together is much like playing
with tinkertoys.
Data flowing in the Delila system can pass through a program several
times. Our first example was the conversion of a book to a library and
the subsequent extraction of book subsets. The SEARCH program
provides a more complex case where searching of a book generates
Delila instructions that can be used to create a new book. The new book
is the set of located sequences. This cyclic string of events is
called a loop.
Once you are acquainted with these data flow loops you can look at the
SEPA program. This program deals entirely with Delila instructions
of the form:
GET FROM 56 -40 TO 56 +60;
along with ORGANISM, CHROMOSOME and PIECE specifications. The
SEARCH program produces instructions in this form. SEPA is used to
separate instruction sets.
For example, suppose you are interested in all the AluI (5' AGCT 3')
sites that are not part of PvuII (5' CAGCTG 3') sites. You have used
DELILA and SEARCH to generate two sets of instructions, ALUIMIX and
PVUII. You then can use SEPA to get the set that you want:
SEPA(PVUII,ALUIMIX,PVUIIO,ALUI)
PVUIIO would be a reorganized non-redundant list of the PvuII
instructions, and ALUI would list all AluI sites that are not
PvuII sites. Both our second and third papers describe the way that
we use SEPA. (Note: to do a search like this one must be sure that the sites
are numbered the same way. The search rule for AluI would be #AGCT,
while the search for PvuII would be C#AGCTG. The # symbol tells SEARCH
to write the number of the following base in the instructions. This forces
the SEARCH program to number the same A in the two cases.)
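The role of the # symbol in this separation can be illustrated with a
short Python sketch. It is not SEPA or SEARCH: the function name
numbered_sites is ours, the sequence is made up, and the sketch works on
bare positions rather than on files of Delila instructions.

```python
def numbered_sites(sequence, rule, first=1):
    """Coordinates of the base following '#' for every match of `rule`."""
    marker = rule.index("#")            # how far into the pattern the numbered base is
    pattern = rule.replace("#", "")
    sites = set()
    start = sequence.find(pattern)
    while start != -1:
        sites.add(start + marker + first)
        start = sequence.find(pattern, start + 1)
    return sites

made_up = "TTAGCTTTCAGCTGTT"            # one lone AluI site, one PvuII site
alu_mix = numbered_sites(made_up, "#AGCT")
pvu = numbered_sites(made_up, "C#AGCTG")
alu_only = sorted(alu_mix - pvu)        # AluI sites that are not PvuII sites
print(sorted(alu_mix), sorted(pvu), alu_only)
```

Because both rules number the same A, the PvuII site is reported at the
same coordinate in both sets and drops out of the difference. Had the
PvuII rule been written #CAGCTG, the coordinates would differ by one and
nothing would be removed.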
(end of delman.use.data.flow)
THE COORDINATE SYSTEM OF A PIECE
In the sections on library structure and the Delila language, we kept
touching on the topic of coordinate systems for PIECEs. Delila is
required to maintain the numbering of sequence fragments, and a
coordinate system is the means to do so. This is not a simple problem,
for one must handle both linear and circular genomes. New users need
only know that Delila can do this, and may skip this section on a
first reading.
Let us start with the simpler case, a linear PIECE. The SEQUENCE
in the library is numbered consecutively from 1 to 100. So far so
good. We need to record three pieces of information:
CONFIGURATION: LINEAR
BEGINNING: 1
ENDING: 100
Any subset of the PIECE such as:
GET FROM 40 TO 50;
will also be linear and can be handled by these three variables.
Notice that one could:
GET FROM 50 TO 40;
to obtain a complement. In that case the BEGINNING is greater than
the ENDING and the numbering decreases.
What if the CONFIGURATION is CIRCULAR? Then based on our discussion
about ambiguous directions, we should at least add a
DIRECTION: +
for linear sub-fragments. However the situation can be worse than that!
Let us imagine a circular PIECE in the library. It is numbered 1 to
100 in the direction 5' to 3' of one DNA strand. We then make a
request:
GET FROM 10 TO 90 DIRECTION -;
The PIECE to be placed in the book is 21 bases long, with descending
numbers, EXCEPT for a COMPLETELY UNPREDICTABLE DISCONTINUITY where
the numbering jumps from 1 to 100. Some more information about the
"parent" coordinates must be stored.
(end of delman.use.coordinates.1)
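The length of a fragment taken from a circular PIECE, and whether its
numbering contains a jump, can be worked out by modular arithmetic. The
Python sketch below is our own illustration (the function names are
hypothetical) and assumes a circle numbered consecutively from first to
last, with a request whose two endpoints differ.

```python
def fragment_length(frm, to, direction, first=1, last=100):
    """Number of bases from `frm` to `to`, going in `direction` around the circle."""
    size = last - first + 1
    span = (to - frm) if direction == "+" else (frm - to)
    return span % size + 1

def crosses_origin(frm, to, direction):
    """True when the numbering jumps between the first and last coordinate."""
    return frm > to if direction == "+" else frm < to

# The request GET FROM 10 TO 90 DIRECTION -; on a circle numbered 1 to 100:
print(fragment_length(10, 90, "-"))     # 10, 9, ..., 1, 100, 99, ..., 90
print(crosses_origin(10, 90, "-"))
```

The endpoints 10 and 90 alone do not say that 1 is followed by 100, which
is why the extent of the parent numbering has to travel with the PIECE.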
The problem is to record the necessary coordinate information and to
avoid becoming confused. In the Delila System, the numbering of
each PIECE has two parts: a COORDINATE part and a PIECE part.
The COORDINATE part defines the location of a sequenced region on
the genetic map. Once that is established, the PIECE part tells what
fragment is stored in the PIECE. Both parts are transmitted to the
book by Delila, but the coordinate part is fixed and unchanging while the
PIECE part will vary depending on the fragment. In summary so far:
COORDINATE part = defines the relation of coordinates to the genetic map
PIECE part = defines the relation of SEQUENCE to the COORDINATE part
For the coordinate part:
GENETIC MAP BEGINNING This number locates the beginning nucleotide of the
coordinate system on the genetic map. We use these numbers to
order the PIECEs in our Master library.
The COORDINATE CONFIGURATION refers to the topological shape of the
coordinates. A linear genetic map could only have PIECEs with linear
coordinates. For a circular genetic map, circular coordinates may be
chosen, but when only a portion of the sequence is known, each PIECE may be
more conveniently handled as a linear coordinate system.
A COORDINATE DIRECTION defines the orientation of the numbering system with
respect to the genetic map. + means "in the same direction as", - means
"in the opposite direction as".
The COORDINATE BEGINNING and COORDINATE ENDING nucleotides are integers
that specify the limits of the coordinate system. They are usually
the ends of the largest known contiguous sequence. The BEGINNING base
corresponds to the genetic map beginning, the bases are consecutively
numbered, and the ENDING is always greater than the BEGINNING number.
The coordinate system described above provides a framework for stating
the exact numbering of the SEQUENCE in a PIECE. This also requires
four items of information: configuration, direction, beginning and
ending, all relative to the coordinate system.
The PIECE CONFIGURATION may be circular only if the coordinate
configuration is also circular. When the coordinates are linear, the
PIECE must also be linear.
The PIECE DIRECTION may be + or - with respect to the coordinates,
representing homology or complementarity to the coordinate system.
The PIECE BEGINNING and ENDING are the numbers of the endpoints of the
SEQUENCE. Both must lie within the bounds set by the COORDINATE BEGINNING and
ENDING. The BEGINNING is always the 5' end of the molecule.
(end of delman.use.coordinates.2)
It turns out that this system handles all the confusing cases noted
earlier. To write out the nine values of coordinates we will keep
this order:
(GENETIC MAP BEGINNING,
COORDINATE CONFIGURATION,
COORDINATE DIRECTION,
COORDINATE BEGINNING,
COORDINATE ENDING,
PIECE CONFIGURATION,
PIECE DIRECTION,
PIECE BEGINNING,
PIECE ENDING)
The linear piece that we began this section with would be:
(1,LINEAR,+,1,100,LINEAR,+,1,100)
(The GENETIC MAP BEGINNING and COORDINATE DIRECTION are arbitrary.)
The first subset was "GET FROM 40 TO 50;":
(1,LINEAR,+,1,100,LINEAR,+,40,50)
The complement: "GET FROM 50 TO 40;" is:
(1,LINEAR,+,1,100,LINEAR,-,50,40)
The circular PIECE is:
(1,CIRCULAR,+,1,100,CIRCULAR,+,1,100)
The request
GET FROM 10 TO 90 DIRECTION -;
would make:
(1,CIRCULAR,+,1,100,LINEAR,-,10,90)
You should work out the results for the other three possible requests on
this circular PIECE:
GET FROM 10 TO 90 DIRECTION +;
GET FROM 90 TO 10 DIRECTION +;
GET FROM 90 TO 10 DIRECTION -;
HINT: It helps to make diagrams.
The catalogue program, described in DESCRIBE.CATAL, will list
the coordinate systems for pieces of a book or library in tabular format.
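The nine values and the rules that tie them together can be written down
as a small record with consistency checks. The Python sketch below is
only an illustration: the class and field names are ours, it checks just
the three rules stated in this section, and it is not how Delila or the
catalogue program store the values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PieceCoordinates:
    # Field names are hypothetical; the order follows the list in this section.
    map_beginning: int
    coord_config: str
    coord_direction: str
    coord_beginning: int
    coord_ending: int
    piece_config: str
    piece_direction: str
    piece_beginning: int
    piece_ending: int

    def problems(self):
        """Return a list of violated rules (empty when all three hold)."""
        found = []
        if self.coord_ending <= self.coord_beginning:
            found.append("COORDINATE ENDING must exceed COORDINATE BEGINNING")
        if self.piece_config == "CIRCULAR" and self.coord_config != "CIRCULAR":
            found.append("a circular PIECE needs circular coordinates")
        for name, value in (("PIECE BEGINNING", self.piece_beginning),
                            ("PIECE ENDING", self.piece_ending)):
            if not self.coord_beginning <= value <= self.coord_ending:
                found.append(name + " lies outside the coordinate system")
        return found

subset = PieceCoordinates(1, "LINEAR", "+", 1, 100, "LINEAR", "+", 40, 50)
complement = PieceCoordinates(1, "LINEAR", "+", 1, 100, "LINEAR", "-", 50, 40)
bad = PieceCoordinates(1, "LINEAR", "+", 1, 100, "CIRCULAR", "+", 1, 100)
print(subset.problems(), complement.problems(), bad.problems())
```

The first two records are the linear examples given above and pass the
checks; the third breaks the rule that a PIECE may be circular only when
its coordinates are.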
(end of delman.use.coordinates.3)
HOW TO CONTROL THE RESPONSES OF DELILA
There are several situations in which Delila manipulates the information
in a library in a way that may not always be what one wants. That is,
there are certain things that Delila does in the absence of any instructions.
These default actions can be changed by using a special class of
instructions - they are called default resets. There are four basic
kinds of default (as defined in LIBDEF) but we will discuss only
three of them here.
OUT-OF-RANGE DEFAULT
We discussed this default in the section on the Delila language
(DELMAN.USE.LANGUAGE). A request may be outside the limits of a PIECE
in a library for two reasons:
1) The place is outside the coordinate system and is therefore
unsequenced (Delila calls it "unknown").
2) The place is within the coordinates, but the PIECE does not
extend that far in the particular library being used.
In either case, Delila's actions will be based on the RANGE default:
DEFAULT OUT-OF-RANGE REDUCE-RANGE;
Delila will attempt to find the nearest edges of the PIECE and use
these. (NOTE: there are known bugs associated with this process,
although it works in almost all cases.)
DEFAULT OUT-OF-RANGE CONTINUE;
Delila will not place the requested PIECE in the book, and will
continue to process any further instructions.
DEFAULT OUT-OF-RANGE HALT;
Delila will stop processing instructions. The book will not be useable
by auxiliary programs.
In all cases, a warning message is put into the listing.
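The three responses can be summarized in a short Python sketch. It is a
simplified model under our own assumptions, not Delila's algorithm: the
function name is hypothetical, the PIECE is taken to be linear with
ascending numbering, REDUCE-RANGE is modelled as moving each endpoint to
the nearest edge, and the PIECE beginning of 1 used in the example is
made up (only the ending, 1419, comes from the listing shown earlier).

```python
def apply_range_default(frm, to, piece_begin, piece_end, default="HALT"):
    """Return the (from, to) pair to fetch, or None when nothing is fetched."""
    def inside(position):
        return piece_begin <= position <= piece_end

    if inside(frm) and inside(to):
        return (frm, to)                 # request is in range; no default needed
    if default == "REDUCE-RANGE":
        def nearest_edge(position):
            return max(piece_begin, min(piece_end, position))
        return (nearest_edge(frm), nearest_edge(to))
    if default == "CONTINUE":
        return None                      # skip this request, keep processing
    raise RuntimeError("OUT OF RANGE AND DEFAULT RANGE = HALT")

print(apply_range_default(1234, 100000, 1, 1419, "REDUCE-RANGE"))
print(apply_range_default(1234, 100000, 1, 1419, "CONTINUE"))
```

With REDUCE-RANGE the request is trimmed to 1234 through 1419, as in the
listing in the language section; with CONTINUE nothing is placed in the
book; with HALT the sketch raises an error in place of stopping Delila.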
KEY DEFAULT
One can use this default to prevent the information about MARKERs,
TRANSCRIPTs and GENEs from going into the book. For example:
DEFAULT KEY GENE OFF;
will turn off printing of the GENE information. The various data
items in a library will contain free form notes about the object.
(You can use the REFER program to look at these.) This command can
also be used to turn off the NOTEs when one wants to reduce the size
of the resulting book.
(end of delman.use.control.1)
NUMBERING DEFAULT
In the section on language we discussed the numbering of the items going
into a book. This command is used to control the numbering. One can
turn it on or off:
DEFAULT NUMBERING OFF; (* NOTHING FROM HERE ON WILL BE NUMBERED *)
One can set numbering for particular items:
DEFAULT NUMBERING PIECE; (* ONLY PIECES WILL BE NUMBERED *)
DEFAULT NUMBERING TRANSCRIPT GENE; (* BOTH TRANSCRIPTS AND GENES
WILL BE NUMBERED *)
To make numbering more flexible, one can reset the number that the
next item will get:
DEFAULT NUMBERING 27; (* THE NEXT ITEM WILL BE NUMBERED 27 *)
This default can be used to make sure that particular items will
have the same numbers in different books.
The number will be put into the notes of the item as the first line
in the notes. This allows them to be easily found by auxiliary
programs.
NOTE INSERTION
One can put one's own notes into the next object placed in the book
by using:
NOTE "THIS IS THE REPLICATION ORIGIN FROM PHIX174";
GET FROM ...
Since this is not a default reset, it does not use the word "default".
The new notes will follow the notes that were in the library. By
turning off notes from the library, and using note insertion, one can replace
notes in a library. Notes in PIECEs can be seen with program REFER.
One can put these default or note insertion statements anywhere
in a set of Delila instructions. More details on these and other
commands can be found in LIBDEF.
All the defaults have initial values:
default type initial value
============ ==============
KEY
NOTE ON
MARKER ON
TRANSCRIPT ON
GENE ON
OUT-OF-RANGE HALT
NUMBERING ON, 1, ALL
(end of delman.use.control.2)
SEQUENCE COMPARISONS AND STRUCTURE ANALYSIS
The purpose of this section is to point out auxiliary programs that can
be used to compare two sequences or find structures in a sequence.
Sequence comparisons can be done with DOTMAT, which forms all possible
pairs between sequences in two books. For each pair, one sequence
is put on the X axis of a coordinate system and the other is on the Y
axis. Both 5' ends are at the origin and X runs down the printout
page while Y runs across the page. (Simply rotate the page 90 degrees
counter-clockwise to get standard Cartesian coordinates.) The
sequences are compared for complementarity at each possible (X,Y)
pair formed between the two sequences. A "dot" is placed at a coordinate
if pairing can occur. Notice that when a sequence is compared with itself
the display will be symmetrical around the line Y = X. Long stretches of
pairing will run on diagonals
(along segments of lines Y = -X + C). To look for homology using
DOTMAT, use DELILA to obtain the complement of one of the pieces.
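The layout of such a display can be sketched in Python. This is not
DOTMAT: the function name is ours, only A-T and G-C pairs are counted
(whether DOTMAT also allows other pairs is not stated here), and the two
short sequences are made up so that they pair along their whole length.

```python
# Only Watson-Crick pairs are counted in this sketch (an assumption).
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}

def dot_matrix(x_sequence, y_sequence):
    """One text row per X base (X runs down the page), one column per Y base."""
    return [
        "".join("*" if (x, y) in PAIRS else "." for y in y_sequence)
        for x in x_sequence
    ]

# GCTT is the complementary strand of AAGC, written 5' to 3'.
for row in dot_matrix("AAGC", "GCTT"):
    print(row)
```

The four dots that run from the upper right to the lower left are the
helix: as X increases Y decreases, which is the line Y = -X + C. The
nested loop over every (X,Y) pair is also the source of the MxN cost.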
DOTMAT produces all possible pairings. Sometimes one wants to
eliminate the short helixes, to make finding the longer ones easier.
The pair of programs HELIX and MATRIX will do this.
One can use these two programs to find overlaps between sequences
obtained by shot-gun cloning. Put the complete sequence on the X axis book
and 20 bases from each end of the other sequence in the Y axis book.
Search for long oligos, say 15 or longer. If there is a significant
overlap, you will get a response from HELIX.
Another program that can be used for comparisons is the INDEX program.
With this tool you can make an index of the locations of the oligo-
nucleotides in a book. The measure of the similarity between
oligonucleotides in the final alphabetized list of oligos is related
to sequence homologies. This method is extremely powerful.
MATRIX/HELIX vs INDEX
MATRIX/HELIX
advantage: The 2 dimensional plot is easy to look at.
disadvantage: It is slow. For two sequences M and N bases long, a
dot matrix operation takes MxN operations. It is so-called Order
N Squared in computation time since the time to compare a sequence
with itself is a function of the square of the sequence length.
INDEX
advantage: It is fast, since the sorting algorithm is order NlogN.
disadvantage: One can't get a feeling for the results easily. One
method is to mark listings made with LISTER.
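Why sorting finds repeats quickly can be seen in a small Python sketch.
It is not the INDEX program, whose output format and options are
described elsewhere; the function name and the test sequence are ours,
and fixed-length oligos are used for simplicity.

```python
def oligo_index(sequence, length, first=1):
    """Alphabetized list of (oligo, position) for every oligo of `length`."""
    entries = [
        (sequence[i:i + length], i + first)
        for i in range(len(sequence) - length + 1)
    ]
    return sorted(entries)               # the sort is the N log N step

for oligo, position in oligo_index("GATTACAGATTA", 4):
    print(oligo, position)
```

After sorting, identical oligos (here ATTA at 2 and 9, GATT at 1 and 8)
sit next to each other, so repeats are found by comparing neighbours in
the list rather than every position against every other.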
(end of delman.use.comparison)
HOW TO MAKE AND USE ALIGNED BOOKS
WHAT IS AN ALIGNED BOOK?
To perform statistical analysis on sequence sites (e.g., ribosome binding
sites, promoters, splice junctions, etc.) one needs a way to align a set
of PIECEs in a book. For ribosome binding sites, we have used the A of
the AUG or various points in the Shine/Dalgarno. A book is aligned by
choosing one base from each PIECE to be the alignment point. The alignment
bases could be chosen by a list of coordinates, but we have found that there
are advantages to using Delila instructions to specify the base:
TITLE "EX7: ALIGNED BOOK";
ORGANISM ECOLI; CHROMOSOME ECOLI;
PIECE LAC;
GET FROM 29 -5 TO 29 +10; (* LACI RBS *)
GET FROM 1234 -5 TO 1234 +10; (* LACZ RBS *)
Here, the zero point for LACI alignment is base 29 and for LACZ it is base
1234. The "from parameter" is -5 and the "to parameter" is +10.
The instructions allow one to align the book that is created from the
instructions. WARNING: the instructions must follow a rigid format; this
is described in DELMODS in module info.align, along with details on
how to write programs using aligned books.
(See also DELMAN.USE.DATA.FLOW and DESCRIBE.ALIST)
AUXILIARY PROGRAMS FOR ALIGNED BOOKS
After generating an aligned book (a book and an aligning instruction set)
one can list it using program ALIST or obtain a histogram that tells the
composition of the book at each point relative to the aligned base
with HIST. A chi-squared analysis of an aligned book is done using HISTAN.
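The kind of table such a histogram contains can be sketched in Python.
This is not the HIST program: the function name is ours, the sequences
are made up, and the aligned base of each sequence is given directly as
an index into the string rather than being read from an instruction set.

```python
from collections import Counter

def aligned_histogram(sequences, zero_indexes, frm, to):
    """Count the bases at each position `frm`..`to` relative to the aligned base."""
    table = {}
    for offset in range(frm, to + 1):
        counts = Counter()
        for sequence, zero in zip(sequences, zero_indexes):
            index = zero + offset
            if 0 <= index < len(sequence):   # skip positions off the end
                counts[sequence[index]] += 1
        table[offset] = counts
    return table

# Three made-up sites, each aligned on the A of an ATG.
sites = ["TTATGA", "CATGGG", "GGATGC"]
zeros = [2, 1, 2]
for offset, counts in aligned_histogram(sites, zeros, -1, 2).items():
    print(offset, dict(counts))
```

Position 0 is all A, position 1 all T and position 2 all G, because the
sites were aligned on the ATG; position -1 shows no preference.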
GENERATING A SET OF ALIGNED RIBOSOME BINDING SITES
We have provided the instructions for creating a set of aligned gene
starts, in file GAIN. GAIN was originally created from instructions
of the form:
ORGANISM ...; CHROMOSOME ...;
GENE ...;
GET FROM GENE BEGIN TO GENE BEGIN +2;
...
This is file GRIN (genes relative to begin instructions).
The resulting book was searched (one would use SEARCH with a rule of
(A/G/T)TG ) to generate the instructions in aligned form. GAIN was
then made by replacing the from-position with the word FIRST and the
to-position with LAST. To use GAIN you must first create the
transcript library from file TRAIN (TRAnscript library Instructions,
use DELILA with LIB1 and LIB2). Then replace FIRST and LAST with
the desired range. Notice that there are a few cases, marked
"SPECIAL" that you must deal with individually. Notice also, that genes
that are oriented in the direction opposite the PIECE had to be set up
by hand (this may be automated someday). The instructions could now
be named GAIN1, and DELILA can be used to generate the aligned book.
A detailed example of these operations is given in
DELMAN.CONSTRUCTION.EXAMPLE.
(end of delman.use.aligned.books)
USE OF THE PATTERN PROGRAMS
"Perceptron" is the name given to a class of algorithms for pattern
recognition with learning capabilities. Minsky and Papert have written an
excellent book on the topic ("Perceptrons", MIT Press, 1969) which explores
both the limitations and potentials of the method. They also prove the
"Perceptron Convergence Theorem" which guarantees that a solution will be
found if one exists. We have written an article (Stormo et al., 1982,
Nucleic Acids Research, 10: 2997-3011) which describes our use of the
algorithm to investigate translational initiation sites.
The algorithm takes as input patterns which can be divided into two
classes, and finds a "Weighting Function" which serves to distinguish the
patterns in the two classes. More rigorously, if we encode a sequence into
a string of bits, S, the algorithm attempts to find a W such that W*S >= T
(some "threshold") if and only if S belongs to one particular class of the
two. By "*" we mean the dot, or inner, product of S and W, which are
vectors of the same dimensions. If we start with two sets of sequences,
S+ and S-, and an arbitrary W and T, the algorithm can be described by
the following three step procedure:
Test: choose a sequence S from S+ or S-,
if S is in S+ and W*S >= T go to Test,
if S is in S+ and W*S < T go to Add,
if S is in S- and W*S < T go to Test,
if S is in S- and W*S >= T go to Subtract;
Add: replace W by W + S,
go to Test;
Subtract: replace W by W - S,
go to Test.
An example of this process is shown in our NAR paper (reference given above).
(Note: this process can be done without goto's...)
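As one way of writing the procedure without goto's, here is a minimal
Python sketch. It is not PatLrn: the function names are ours, the
encoding of each base as four bits is an assumption made for the
illustration, the sequences are visited in a fixed cycle, and a limit on
the number of passes is added because no solution need exist.

```python
def encode(sequence, alphabet="ACGT"):
    """Encode each base as four bits, one of which is set (an assumed encoding)."""
    bits = []
    for base in sequence:
        bits.extend(1 if base == letter else 0 for letter in alphabet)
    return bits

def dot(w, s):
    return sum(wi * si for wi, si in zip(w, s))

def learn(positives, negatives, w, threshold, max_passes=1000):
    """Cycle through S+ and S- until a whole pass needs no Add or Subtract."""
    w = list(w)
    for _ in range(max_passes):
        changed = False
        for s in positives:
            if dot(w, s) < threshold:        # Add step
                w = [wi + si for wi, si in zip(w, s)]
                changed = True
        for s in negatives:
            if dot(w, s) >= threshold:       # Subtract step
                w = [wi - si for wi, si in zip(w, s)]
                changed = True
        if not changed:
            return w
    raise RuntimeError("no separating W found within max_passes")

s_plus = [encode("AG"), encode("AA")]        # made-up example classes
s_minus = [encode("CT"), encode("CA")]
w = learn(s_plus, s_minus, [0] * 8, threshold=1)
print(w)
```

The Test step has become the two if statements inside the loops, and the
"go to Test" lines have become the next turn of the loop. The procedure
stops when a complete pass over both classes changes nothing.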
The program which implements the perceptron algorithm to work on
sequences is called PatLrn. Other programs which use the output of PatLrn
are:
PatLst - a lister program for the output of PatLrn;
PatAna - does some simple analyses of the output of PatLrn;
PatVal - evaluates the aligned sequences in a book by the PatLrn output;
PatSer - searches a book for sites which are evaluated with a given
PatLrn W output to be above some user specified value.
(end of delman.use.perceptron.1)
EXAMPLES FOR THE PATTERN PROGRAMS
The files "exspbk" and "exsnbk" are the sets of positive and negative
sequences used in the example of Figure 1 of our "Perceptron" paper (NAR 10,
2997-3011). The file "expa1" contains the initial pattern from that same
example. Given these files and the program "PatLrn" you can recreate
the example thusly:
PatLrn(exspbk,a,exsnbk,b,pat,expa1).
The file "pat" should be identical (except for the date/time) to the file
"expa2" that we have provided. You can check that with the "Merge" program
if you want. It is also identical to the solution pattern from the example
and it keeps track of the number of changes needed to get to that solution.
The files "a" and "b" are empty in this case, because we are aligning the
sequences by their first bases. If we wanted to align them by any other
base those files would contain the instructions which generated the sequences
(see DELMAN.USE.ALIGNED.BOOKS).
Now use the program "PatAna" to do some simple analyses of the pattern.
PatAna(pat,patan).
The file "patan" is identical to the file expan2 that we provided. It
contains some useful information about the pattern, such as the minimum and
maximum sequence values which could be obtained from this pattern, as well
as the average value expected for random sequences and a feeling for the
distribution of values.
The program "PatVal" will use a pattern to evaluate a book of sites.
Try:
PatVal(exspbk,a,pat,valp).
and
PatVal(exsnbk,b,pat,valn).
"valp" is the evaluation of each sequence of the positive class, and "valn"
is the evaluation of each of the negative class sequences. Check with the
example in the paper to see that they are correct. Again the "a" and "b"
files are empty because we are aligning by the first base of the sequences.
The program "PatSer" will use a pattern to search through a sequence,
using each base in turn as the aligned base. Those sites which are
evaluated above some minimum, either set by the user or taken to be the
minimum functional from the pattern itself, are identified. Furthermore,
instructions to get those sites so identified are written to the file "inst".
Try this on an example file:
PatSer(exsebk,pat,val,inst).
Notice that when the pattern extends beyond the sequence the sites are still
evaluated, but the user is notified of the over-extension.
The program "PatLst" is used to make nice horizontal printings of the
patterns, such as for use as publishable figures. Try this on the W51
matrix which is from the paper and which we provide. Read the page
DESCRIBE.PATLST to see how to set the width of the pattern printed to
a page to whatever you want.
(end of delman.use.perceptron.2)
A NOTE ABOUT SIGNIFICANCE
While the example we provide in the paper, and that you have just done,
is convenient for demonstrating the method, separating two sets of two
sequences, each five long, is in fact trivial. Try:
PatLrn(exspbk,a,exsnbk,b,newpat).
"newpat" is identical to "expa0" that we provided, and as you can see is
not interesting. The mathematical problem of when it becomes
significant that one can separate two sets of sequences is still an open
problem, but we can say some things. As the number of sequences in each
class gets larger the probability of separation decreases, as it does
when the number of nucleotides in each sequence diminishes. As a good
rule of thumb we like to have more sequences in the smaller class
(usually the functional class) than there are nucleotides in any one
of the sequences. Under these conditions one can be reasonably confident
that a solution pattern is likely to identify features of biological
significance.
(end of delman.use.perceptron.3)
USE OF THE "ENCODE" PROGRAM
The program Encode was written to allow a user to encode sequences into
strings of integers in a flexible way. For instance, one can encode
the sequences as mono-, di-, tri-, or higher oligonucleotides. One can
assign specific oligos to certain positions or record only that they are
within some "window" of positions. Within a window all the oligos may
be counted or only some, such as only those "in frame".
The program takes as input the book of sequences and the instruction set
which generated it and which specifies the alignment. If the instruction
file is empty then all the sequences are aligned by their first bases.
The other input file, which must be non-empty, is the parameter file
"EncodeP" which specifies how the sequences are to be encoded. It is
the options of the parameter file which give the program its flexibility
and power, and so they should be thoroughly understood.
The parameter file may contain any number of individual parameter records,
each of which will in turn be applied to each sequence in the book. This
allows one to encode different regions of the sequences differently, or
to encode one region in more than one way. Each parameter record has
five pieces of information, each written on a separate line:
line 1 - the range over which this parameter record is to operate; this
line has two integers which are the bases, relative to the
aligned base, for which to use this encoding;
line 2 - the size of the window; the window begins at the start of the
range and contains this many nucleotides in it; the number
of each base, or oligo, which occurs in this window is written
to the output; note that positional information within the
window is lost, so that if exact position is needed the window
size should be 1;
line 3 - the shift to the next window; this specifies how many bases
to move the window over to its next position; this is repeated
until the window begins beyond the end of the range;
line 4 - this specifies the coding level, and the arrangement of the
bases to be coded; the coding level is the number of bases in
the oligos which are encoded, i.e., 1 means monos are encoded,
2 means dis are encoded, ...; for coding levels greater than 1
the user may allow for skips between the encoded bases; for
instance, one may want to encode as di-nucleotides bases which
are separated by a nucleotide; this would be declared on this
line by writing "2 : 1"; likewise, one could encode as a tri-
nucleotide the first bases of three consecutive codons by the
line "3 : 2 2", where the 3 indicates the coding level (tri-
nucleotides) and the 2's represent the number of bases
skipped between each encoded base; if there is no colon after
the coding level declaration, all skips are assumed to be 0;
line 5 - the shift to the next coding site; this allows the user to
not count every occurrence of the oligos in the window, but
rather to move some number of bases to the next encoded site;
if all the oligos are wanted, this number should be 1.
The above line information constitutes a single parameter record. The
parameter file may contain any number of these records concatenated
together. Each sequence will be encoded by the entire list of parameter
records and the resulting string of integers will be written to the
"EncSeq" file. The encoded string for each sequence ends with a special
"end of sequence" symbol, which is listed in the file header.
For examples of how this program works see "DELMAN.USE.ENCODE.2".
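The five-line layout of a parameter record can be sketched in a few lines of
Python. This is only an illustration of the description above, not the code
of Encode itself: the function name "parse_records", the dictionary keys and
the assumption that each line holds nothing but the numbers (and the optional
colon) are ours.

```python
def parse_records(text):
    """Split the text of a parameter file into five-line records (a sketch)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 5 != 0:
        raise ValueError("each parameter record needs exactly five lines")
    records = []
    for i in range(0, len(lines), 5):
        first, last = (int(x) for x in lines[i].split())       # line 1: range
        window = int(lines[i + 1])                              # line 2: window size
        window_shift = int(lines[i + 2])                        # line 3: window shift
        level_text, _, skip_text = lines[i + 3].partition(":")  # line 4: level and skips
        level = int(level_text)
        skips = [int(x) for x in skip_text.split()]
        if not skips:                       # no colon: all skips are 0
            skips = [0] * (level - 1)
        if len(skips) != level - 1:
            raise ValueError("need one skip between each pair of encoded bases")
        offsets = [0]                       # positions of the encoded bases,
        for s in skips:                     # relative to the first one
            offsets.append(offsets[-1] + s + 1)
        site_shift = int(lines[i + 4])                          # line 5: site shift
        records.append({"range": (first, last), "window": window,
                        "window_shift": window_shift, "level": level,
                        "skips": skips, "offsets": offsets,
                        "site_shift": site_shift})
    return records
```

With this reading, the line "3 : 2 2" gives the offsets 0, 3 and 6, that is,
the first bases of three consecutive codons.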
(end of delman.use.encode.1)
EXAMPLES OF USING THE "ENCODE" PROGRAM
The files "ExEncIn" and "ExEncBk" contain the sequence around the beginning
of the rIIB gene of T4, and the instructions which align this sequence by
the ATG of the gene. The aligned sequence looks like:
---                   ++
111--------- +++++++++11
210987654321012345678901
........................
ATAAGGAAAATTATGTACAATATT
Notice that the 0 base is the A of the ATG (this is what we aligned by) and
that our sequence contains the 12 preceding bases and the 11 following. This
is through the fourth amino acid of the protein. If we wanted to encode only
the mono-nucleotides of the initiation codon we would make our parameter file:
0 2
1
1
1
1
this would give the encoding:
1 0 0 0 0 0 0 1 0 0 1 0 -1
Notice the -1, which marks the end of the encoded sequence. Each group of 4
integers before it specifies which base occurs at one of the three encoded
positions: the A is encoded as 1 0 0 0, the T as 0 0 0 1, and the G as 0 0 1 0.
If we wanted to know the number of each mono-nucleotide in this whole region
and we didn't care about their positions, we would encode as:
-12 11
24
24
1
1
This would give the encoding:
12 1 3 8 -1
Notice that this is really just the composition of the sequence, since our
window covers the entire sequence. We could get the di-nucleotide composition
with the parameters:
-12 11
24
24
2
1
and get the encoding:
5 1 1 5 1 0 0 0 1 0 1 1 4 0 1 2 -1
Notice that this encoded string is a vector of 16 integers (up to the end
of sequence mark, -1). The number in each element of the vector is the number
of each di-nucleotide in the sequence, in the order AA,AC,AG...TC,TG,TT.
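The three encodings above can be reproduced with a short Python sketch. It is
not the Encode program, only our reading of the parameter description. Two
points are assumptions: an oligo is counted when its first base lies inside
the window (and the whole oligo lies inside the sequence), and a window is cut
off at the end of the range. The names "encode" and "first" are ours.

```python
from itertools import product

def encode(seq, first, records, end_mark=-1):
    """Encode one aligned sequence; 'first' is the coordinate of seq[0].

    Each record is (from, to, window, window_shift, level, skips, site_shift);
    an empty 'skips' means that all skips are 0.
    """
    seq = seq.lower()
    out = []
    for lo, hi, window, window_shift, level, skips, site_shift in records:
        skips = list(skips) or [0] * (level - 1)
        offsets = [0]
        for s in skips:
            offsets.append(offsets[-1] + s + 1)
        # oligos in the order aa, ac, ag, at, ca, ... tt
        oligos = ["".join(p) for p in product("acgt", repeat=level)]
        index = {oligo: i for i, oligo in enumerate(oligos)}
        start = lo
        while start <= hi:                      # until the window begins beyond the range
            counts = [0] * len(oligos)
            stop = min(start + window - 1, hi)  # assumption: clip window to the range
            pos = start
            while pos <= stop:
                where = [pos + off - first for off in offsets]
                if where[0] >= 0 and where[-1] < len(seq):
                    counts[index["".join(seq[i] for i in where)]] += 1
                pos += site_shift
            out.extend(counts)
            start += window_shift
    out.append(end_mark)
    return out

riib = "ATAAGGAAAATTATGTACAATATT"                    # coordinates -12 to +11
print(encode(riib, -12, [(0, 2, 1, 1, 1, (), 1)]))      # initiation codon
print(encode(riib, -12, [(-12, 11, 24, 24, 1, (), 1)])) # mono composition
print(encode(riib, -12, [(-12, 11, 24, 24, 2, (), 1)])) # di composition
```

The three printed lists are the three encodings shown above, each ending with
the -1 mark.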
Examples continued in DELMAN.USE.ENCODE.3.
(end of delman.use.encode.2)
Examples of using the "encode" program, continued from
DELMAN.USE.ENCODE.2.
To encode the di-nucleotide composition of the Shine and Dalgarno region
and also the mono-nucleotides of the coding sequence, each in its own position,
we would make this list of parameters:
-10 -6
5
5
2
1
0 11
1
1
1
1
This would give us the encoding:
2 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1
1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1
-1
Here the first 16 integers are the di-nucleotide composition of the Shine and
Dalgarno region, and appended to that are the mono-nucleotide encodings for
each position of the coding sequence. We could get the di-nucleotides of
successive codon first positions by:
0 11
12
12
2 : 2
3
or we could get the codon composition by:
0 11
12
12
3
3
or we could get the di-nucleotide encoding of the first and last position of
each codon, including the position of the codon by:
0 11
3
3
2 : 1
3
These are left as exercises to the user, and it is encouraged that the user
make up other tests and try them until this program is easy to use.
(end of delman.use.encode.3)
In addition to Delila, there are at least two other generally available
large nucleic sequence data bases. The DB program system handles both the
European Molecular Biology Laboratory (EMBL) libraries and those of the
Genetic Sequence Databank (GenBank(TM)).
If you want to contact someone who helps operate these data bases use the
following addresses:
GenBank
c/o Computer Systems Division
Bolt Beranek and Newman Inc.
10 Moulton St.
Cambridge, Ma. 02238
USA
Graham Cameron
European Molecular Biology Laboratory
Postfach 10.2209, D-6900 Heidelberg, West Germany
The DB program system is a small set of programs. DBcat prepares
catalogs for DBpull. DBpull extracts part or all of an entry of either
EMBL or GenBank format. DBbk converts database entries into the Delila
book form that Delila programs use. All of these programs handle both data
base formats even when both occur together in the same library.
At this point, please obtain some sample library entries from both data
bases and look them over.
EMBL and GenBank libraries are arranged in series of entries, each entry
possessing a unique entry id, a nucleic acid sequence, and other miscellaneous
information. Most of the lines in the libraries start with a word or abbreviated
code that indicates what kind of information the line contains. The
following definitions will clarify these points.
Library definitions:
Entry: An entry starts with a line which begins with an "ID" (EMBL) or a
"LOCUS" (GenBank). All subsequent lines are part of the entry until the
line that contains simply "//". "//" is the entry terminus code for both
data bases.
Entry id: On the first line of each entry, after the "LOCUS" or the "ID",
come a few spaces and then a weird looking word or code that may or may not
resemble a familiar biological name. This is the entry id. It is the name the
entry is known by, and it is what DBpull uses to identify which entries it
will extract.
Line codes: The phrases "ID" and "LOCUS" are line codes. There are other line
codes in each entry, such as "REFERENCE" and "ORIGIN" in GenBank and "DE" and
"SQ" in EMBL. Some lines do not have a code, and some have one that is
indented. Other lines have codes, but there is no other information on the line.
These special cases will be discussed below in the definition of line code
request instructions.
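The definitions above are enough to split a library into entries. The Python
sketch below is not part of the DB program system; it only illustrates the
rules (an entry starts at an "ID" or "LOCUS" line, the entry id is the next
word, and "//" ends the entry). The function name and the dictionary keys are
ours.

```python
def split_entries(text):
    """Return the entries of a mixed EMBL/GenBank library (a sketch)."""
    entries = []
    current = None
    for line in text.splitlines():
        words = line.split()
        if current is None:
            # an entry starts with an unindented "ID" (EMBL) or "LOCUS" (GenBank)
            if len(words) >= 2 and line.startswith(("ID", "LOCUS")) \
                    and words[0] in ("ID", "LOCUS"):
                kind = "EMBL" if words[0] == "ID" else "GENBANK"
                current = {"type": kind, "id": words[1], "lines": [line]}
        else:
            current["lines"].append(line)
            if line.strip() == "//":        # entry terminus code
                entries.append(current)
                current = None
    return entries
```

Both formats are handled in one pass, so the two kinds of entry may occur
together in the same file.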
Now that you are familiar with the data bases you can understand the DBpull
instruction set. Each instruction takes up only one line. Each line does one
of two things; either it indicates what entry type (GenBank or EMBL) is
requested on the following lines or it makes an actual request for part or
all of an entry identified by its entry id. Please note that the following
definitions will be made clearer by referring to the examples that follow.
(end of delman.use.dbpull.define)
Note: Instructions are entirely upper case because the computer system on
which DBpull was designed required it.
Instructions that determine entry request type of succeeding lines:
EMBL: This indicates that requests for entries somewhere in the EMBL
libraries will be on the following lines.
GENBANK: Same for requests found in the GenBank libraries.
GENB: Same as "GENBANK".
Instructions that tell which entries are to be pulled:
Entry id: An instruction line beginning with an entry id will pull part
or all of that entry. The parts extracted will depend on which of the
"instructions that define extraction" (defined below) follows the id on
the same line.
Wildcard id: This request looks like an entry id request but somewhere
in the entry name are one or two "*" symbols. The "*" represents any
number of unspecified characters. It may be inserted at the beginning of
the id, at the end, or at both the beginning and the end but not the
middle. (Confused? see instructions example 3 below)
EVERY: The word "EVERY" at the start of a request line calls for every
entry of a particular entry type. (See instruction example 4)
Instructions that define extraction:
Line codes: Following the instruction that tells which entry or entries
are to be pulled, on the same line, come instructions that structure
the extraction. One or more line codes occurring in this space will result
in the lines of the entry which have matching codes being pulled. GenBank
line codes are actually words. The full word or an abbreviation will work,
but the abbreviation cannot be shorter than 3 letters. "LOC", for instance,
will pull the "LOCUS" line while "LO" would not. When there are one
or more lines in the entry directly below a pulled line that either
do not possess a line code, possess indented codes, or possess the code "xx",
these additional lines will be extracted also.
RAW: Instead of line codes one can simply insert the word "RAW". This will
pull only the sequence of the entry without origin or coordinate labels.
The sequence will end with a "." to separate it from other sequences and to
make it suitable for input into Makebk. (see delman.describe.makebk) Also,
if the first request of fin is "RAW", fout will have no dateline and
therefore it will not make a suitable secondary data base for DBpull.
ALL: Instead of "RAW" or line codes the word "ALL" will result in an
entire entry being extracted.
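The instruction set is simple enough to read with a few lines of Python. The
sketch below only illustrates the rules just given and is not how DBpull
itself is written; the function name "parse_requests" and the form of the
result are ours.

```python
def parse_requests(text):
    """Read DBpull-style instructions into (entry type, id, codes) triples."""
    entry_type = None
    requests = []
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if len(words) == 1 and words[0] in ("EMBL", "GENBANK", "GENB"):
            # sets the entry type for the requests on the following lines
            entry_type = "EMBL" if words[0] == "EMBL" else "GENBANK"
        else:
            if entry_type is None:
                raise ValueError("request given before EMBL or GENBANK")
            # first word: entry id, wildcard id or EVERY
            # remaining words: line codes, RAW or ALL
            requests.append((entry_type, words[0], words[1:]))
    return requests
```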
(end of delman.use.dbpull.instructions)
Instruction examples (DBpull input file Fin)
Example 1:
EMBL
ADCXXX ID DE SQ
GENBANK
M13 LOC REFERENCE
ANABANIFH LOCUS
Comments: The first and third lines indicate what types of entries are
requested on the following lines. If, for instance, M13 were an EMBL entry
this set of instructions would not find it.
Example 2:
GENB
T7 RAW
MS2 ALL
Comments: The two requested ids are not in alphabetical order and the
DBpull output file fout will have the same order as the requests.
Example 3:
EMBL
*RNA SQ ID
*RNA* ID SQ
GENB
M* ORI SITES
GOOGOOGAGA ALL
T7 RAW
Comments: The character "*" is a wildcard; it represents any number of
unspecified characters.
The first request will grab any entry whose id ends in "RNA", the
second any one that has "RNA" anywhere in it, and the third any id which
starts with an "M". The fourth request is a joke and, like any other
nonexistent id, will yield a "not found" message and then halt the program. If
there were no GenBank entry ids beginning with "M", a "not found" would appear
but DBpull would not halt because this id request is a wildcard. The logic
behind this distinction is that wildcards are used to search for the
possible existence of an entry, but regular ids are used only for entries
that are well known by the user. Note that "ORI" (origin) pulls sequence in
GenBank and "SITES" tells you where the genes and other features are. "SQ ID"
and "ID SQ" are equivalent; lines are pulled in the order that they occur.
Example 4:
EMBL
EVERY ID
GENB
EVERY LOC
Comments: This example would make a catalog for users of the entire EMBL
and GenBank data bases. The catalog would be alphabetical because the
catalog files used by DBpull (produced by DBcat) are presorted. If
"catalogs for humans" are provided with your libraries do not try this
example; it is very expensive. If you do try it, you might want to request
additional line codes to "LOC" and "ID" for a more informative catalog.
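The wildcard and abbreviation rules used in these examples can be written out
as two small Python functions. They are a sketch of the rules as stated in
this manual, not DBpull's own code, and the function names are ours. We assume
that EMBL codes must match exactly, since they are only two letters long.

```python
def id_matches(request, entry_id):
    """True if an id request (plain, wildcard or EVERY) selects entry_id."""
    if request == "EVERY":
        return True
    lead = request.startswith("*")
    trail = request.endswith("*")
    core = request.strip("*")
    if lead and trail:
        return core in entry_id            # "*RNA*": RNA anywhere in the id
    if lead:
        return entry_id.endswith(core)     # "*RNA": id ends in RNA
    if trail:
        return entry_id.startswith(core)   # "M*": id starts with M
    return entry_id == request

def code_matches(request_code, line_code, entry_type):
    """True if a requested line code selects a line with the given code."""
    if entry_type == "GENBANK":
        # full word, or an abbreviation of at least 3 letters
        return len(request_code) >= 3 and line_code.startswith(request_code)
    return request_code == line_code
```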
(end of delman.use.dbpull.examples)
Use of the Search Program
i. searching dna sequences for particular strings
The search program works on books of sequences. Any search pattern
will be looked for in each sequence of the book. Search patterns consist
of strings of nucleotides, such as 'aatggct'. You may also specify
ambiguous patterns, such as 'a or g', in either of two ways: '(a/g)' or
'r'. All possible ambiguities can be asked for, by either way. From
within the search program type 'l' to see the list of one-letter codes
for each ambiguous base combination. One can also include in the search
positions for which you don't care what the base is, indicated by 'n'.
For instance, 'anc' would search for a and c separated by any base. One
can also use 'e' (for extension) to vary the spacing between specified
regions. The 'e' is considered to be an 'n' and also as nothing. For
example, 'aec' would search for both 'anc' and 'ac'. We used this feature
to search for 'shine and dalgarno' sequences before 'atg's by specifying
'gga5n4eatg'. This means 'gga followed by 5 to 9 unspecified bases followed
by atg'.
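One way to see what such a pattern means is to translate it into a regular
expression. The Python sketch below handles only the features described so
far (plain bases, 'n', 'e', 'r', the '(a/g)' form and a leading repeat
count); it is our illustration and not how Search is implemented. The other
one-letter ambiguity codes are left out because they are listed only inside
the program.

```python
import re

# an optional repeat count, then "(a/g)"-style choices or a single symbol
TOKEN = re.compile(r"(\d*)(\([acgt](?:/[acgt])*\)|[acgtrne])")

def pattern_to_regex(pattern):
    """Translate a simple search pattern into a regular expression."""
    parts = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"cannot read pattern at position {pos}")
        count = int(m.group(1)) if m.group(1) else 1
        symbol = m.group(2)
        if symbol == "e":
            # an extension is an 'n' and also nothing
            parts.append("[acgt]{0,%d}" % count)
        else:
            if symbol == "n":
                cls = "[acgt]"
            elif symbol == "r":
                cls = "[ag]"
            elif symbol.startswith("("):
                cls = "[" + symbol[1:-1].replace("/", "") + "]"
            else:
                cls = symbol
            parts.append(cls + ("{%d}" % count if count > 1 else ""))
        pos = m.end()
    return "".join(parts)

print(pattern_to_regex("gga5n4eatg"))   # gga[acgt]{5}[acgt]{0,4}atg
```

The result matches 'gga', then 5 to 9 bases, then 'atg', as described above.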
One can search for strings which are close to the specified one by allowing
mismatches to the specified sequence. This is done by typing 'm' as a
search command, and then specifying how many mismatches are allowed. If
there are regions within the specified sequence where you want no mismatches,
this is stated by enclosing that region between '<' and '>'. For example,
if mismatches were set to 1 and the pattern searched were 'aa<ggc>tt', then
the 'ggc' must be found exactly, but the rest of the pattern need only be
within one of a perfect match.
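The mismatch rule can be illustrated with a short Python sketch. It assumes a
pattern made only of plain bases with an optional '<' ... '>' region, such as
'aa<ggc>tt', and it reports 0-based offsets into a string rather than book
coordinates. The function name is ours; this is not the code of Search.

```python
def find_with_mismatches(pattern, seq, max_mismatches):
    """Offsets in seq where pattern matches with few enough mismatches."""
    bases = []
    exact = []              # True for bases inside '<' ... '>'
    protected = False
    for ch in pattern:
        if ch == "<":
            protected = True
        elif ch == ">":
            protected = False
        else:
            bases.append(ch)
            exact.append(protected)
    hits = []
    n = len(bases)
    for start in range(len(seq) - n + 1):
        mismatches = 0
        ok = True
        for base, must_match, found in zip(bases, exact, seq[start:start + n]):
            if base != found:
                if must_match:      # no mismatch allowed in this region
                    ok = False
                    break
                mismatches += 1
        if ok and mismatches <= max_mismatches:
            hits.append(start)
    return hits
```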
The search program returns to you the positions of the matches found in
the book. Unless otherwise specified, the position corresponds to the first
base of the pattern. However, one can ask for the position to be another
base by preceding that base by '#'. For example, 'aa#atggct' would return
as the position of the match the 'a' of the 'atg'.
It is also possible to search for relations between bases. Six
relations are allowed: identity (i); non-identity (ni); complementarity (c);
non-complementarity (nc); complementarity including g-t pairs (w); and
non-complementarity including g-t pairs (nw).
Relational searches are specified by first
the symbol '^', followed by the pattern position this base is to be related
to, followed by the relation. For example, 'n^1i' would find all sites in
which there is a repeated base (aa, cc, gg or tt). Notice that the base
to which the relation refers must precede the point of the relation in the
pattern. Searching for the pattern '5n^1c' would find sites of complementary
bases separated by 4 unspecified bases.
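The six relations can be written as a small Python table. The sketch below
covers only patterns of the form used in the two examples (some number of
'n' followed by one relation) and returns 0-based offsets; the function names
are ours, and Search itself may well work differently.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def related(x, y, relation):
    """True if bases x and y stand in the named relation."""
    identical = x == y
    complementary = COMPLEMENT[x] == y
    wobble = complementary or {x, y} == {"g", "t"}   # also allow g-t pairs
    return {"i": identical, "ni": not identical,
            "c": complementary, "nc": not complementary,
            "w": wobble, "nw": not wobble}[relation]

def find_relation(seq, n_before, ref, relation):
    """Sites matching '<n_before>n^<ref><relation>', e.g. (5, 1, 'c') for '5n^1c'.

    The related base sits just after the n_before unspecified bases, and
    'ref' is the 1-based pattern position it is compared with.
    """
    hits = []
    for start in range(len(seq) - n_before):
        if related(seq[start + ref - 1], seq[start + n_before], relation):
            hits.append(start)
    return hits
```

For 'n^1i' the call is find_relation(seq, 1, 1, "i"), and for '5n^1c' it is
find_relation(seq, 5, 1, "c"): positions 1 and 6 of the pattern are compared,
which leaves 4 unspecified bases between them.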
More information on search patterns and other commands in general
can be obtained by typing 'help' while in the program.
(end of delman.use.search.1)
ii. Creating Delila Instruction Files
The search program also allows one to create instruction files so
that the located sites may be put into a book for further analysis. This
is especially useful when you want to include in the analysis regions around
the sites. For instance, you could set the 'from' distance to -60 and the
'to' distance to +40. Then by searching for 'gga5n4e#atg' you would get
the instructions necessary to obtain the sequences from -60 to +40 around
the atg's which are preceded by Shine and Dalgarno sequences. Help on
using this feature of the program can be obtained by typing 'd help' while in
the program.
(end of delman.use.search.2)
cccccc oooooo n nn
cc cc oo oo nn nn
cc oo oo nnn nn
cc oo oo nnnn nn
cc oo oo nn nn nn
cc oo oo nn nnnn --------
cc oo oo nn nnn
cc cc oo oo nn nn
cccccc oooooo nn nn
ssssss tttttttt rrrrrrr uu uu cccccc tttttttt
ss ss tt rr rr uu uu cc cc tt
ss tt rr rr uu uu cc tt
ssssss tt rr rr uu uu cc tt
ss tt rr rr uu uu cc tt
ss tt rrrrrrr uu uu cc tt --------
ss tt rr rr uu uu cc tt
ss ss tt rr rr uu uu cc cc tt
ssssss tt rr rr uuuuuu cccccc tt
iiiiiiii oooooo n nn
ii oo oo nn nn
ii oo oo nnn nn
ii oo oo nnnn nn
ii oo oo nn nn nn
ii oo oo nn nnnn
ii oo oo nn nnn
ii oo oo nn nn
iiiiiiii oooooo nn nn
(end of delman.construction)
CONSTRUCTION OF DELILA LIBRARIES
Introduction
This section assumes that you are familiar with DELMAN.USE.
Construction of a Delila System Library involves several steps:
- Entry of the raw sequence data (twice)
- Correction of the sequences
- Gathering of the information about the sequences
- Creation of a "module" for insertion into the library
(not the same module type as the ones used by program Module.)
- Insertion of the module
- Construction of a catalogue
- Checking that the library is correct.
When you are gathering the data to create part of a library
(the library insertion module) you may find the forms in
DELMAN.CONSTRUCTION.FORM useful. Use the Module program to make
as many copies as required.
NOTES FOR TRANSPORTATION
Since the libraries that we send you have already been checked, you
need only run the CATAL program (as discussed below) to generate the
catalogues for these libraries. After that, Delila can be used.
(end of delman.construction.intro)
MORE ON LIBRARY STRUCTURE - LOGICAL VS PHYSICAL STRUCTURE
In DELMAN.USE.STRUCTURE we discussed the structure of a Delila
Library. The descriptions were about how the parts are connected,
and what is inside each part. This is the logical structure of the
data base. We did not discuss the details of how a library is actually
constructed, because it is not necessary to know these things when
working with the Delila System. The description of these details
is the description of the physical structure of the data base.
Since we do not yet have an extensive set of tools for constructing
Delila Libraries, it is necessary to describe the physical structure
enough so that you can build your own libraries. Because these details
are rigorously stated in LIBDEF, most things are automated by program
Makebk, and Catal does lots of checking, we will only discuss the general
concepts here.
The logical structure of a library follows the schema shown in LIBDEF
or DELMAN.USE.STRUCTURE. This structure is a two dimensional net.
Libraries are implemented physically in files, and so are linear
structures. If we exclude for the moment the references to a PIECE
by MARKERs, TRANSCRIPTs and GENEs, then the library structure is a
tree. Any tree can be represented as a nested series of objects
in linear order:
ORGANISM (open parenthesis for an ORGANISM)
CHROMOSOME (open parenthesis for a CHROMOSOME)
GENE (open parenthesis for a GENE)
GENE (close parenthesis for a GENE)
PIECE (open parenthesis for a PIECE)
PIECE (close parenthesis for a PIECE)
CHROMOSOME (close parenthesis for a CHROMOSOME)
ORGANISM (close parenthesis for an ORGANISM)
If you look at any book (eg. EX0BK) or library (eg. LIB1) you will
see this structure. Lines in a library either define the structure
or are chunks of data (attributes). Attributes are signaled by an
asterisk (*) as the first character on the line.
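The nesting can be checked mechanically with a stack, since the same word
opens and closes an object. The Python sketch below is far simpler than the
checking done by Catal and is only meant to illustrate the idea. It assumes
that every line not starting with an asterisk begins with a structure word,
and the function name is ours.

```python
def check_nesting(lines):
    """Return (depth, word) for each object opened; raise if one is unclosed."""
    stack = []
    outline = []
    for line in lines:
        if not line.strip() or line.startswith("*"):
            continue                      # blank line or attribute
        word = line.split()[0]
        if stack and stack[-1] == word:
            stack.pop()                   # close parenthesis
        else:
            outline.append((len(stack), word))
            stack.append(word)            # open parenthesis
    if stack:
        raise ValueError("unclosed objects: " + " ".join(stack))
    return outline
```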
We must now allow various objects to refer to PIECEs. This is done
by a reference to the name of the PIECE. For example, one of the
attributes in a GENE is the name of the PIECE that the GENE is on.
(In cases where the GENE spans two PIECEs, we use two GENEs.)
To simplify the operation of the CATAL program (to be described later)
we have added one more rule. All objects that refer to a particular
PIECE are called the "FAMILY" of the PIECE. The rule is that a
FAMILY precedes its PIECE in the physical (file) implementation.
(end of delman.construction.structure)
MAKING NEW LIBRARIES - THE CATALOGUE PROGRAM
The first technical difference between Libraries and Books in the Delila
System is that Libraries have catalogues while Books do not. Catalogues
serve several purposes. First, since they are a condensed list of
the objects in a Library, they allow objects to be found quickly.
There are catalogues for both Delila and for people (the latter is
called a HUMCAT - HUMan's CATalogue). These are constructed by the
program CATAL.
Since a library may be constructed by hand, it is also convenient to
check the Library's physical structure at the time the catalogue is made.
The Problem Of Duplicate Names
Using Delila, a Book may be easily constructed that contains two objects
with the same name within the same structure (if they are in different
structures, it won't matter). For example:
ORGANISM ECOLI;
CHROMOSOME ECOLI;
GENE LACI; (* THIS IS ON PIECE LAC *)
GET ALL GENE DIRECTION HOMOLOGOUS;
GET ALL GENE DIRECTION COMPLEMENT;
If this Book were to become a Library, then a reference to PIECE LAC
would be ambiguous since there are two PIECEs with that name within the
CHROMOSOME. The CATAL program detects these cases and makes the names differ
by adding symbols to the names of second and subsequent duplicately named
objects. The second technical difference between Books and Libraries is that
Books may have duplicate names, while Libraries may not.
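The idea of making duplicate names differ can be sketched in Python as
follows. The manual does not say here which symbols Catal adds, so the "+"
used below is purely a stand-in, and the function name is ours.

```python
def make_unique(names, symbol="+"):
    """Rename second and later duplicates by appending a symbol (assumed)."""
    seen = set()
    result = []
    for name in names:
        new_name = name
        while new_name in seen:       # keep adding symbols until it is unique
            new_name += symbol
        seen.add(new_name)
        result.append(new_name)
    return result
```

The first object keeps its name, so references made before the duplicate
appeared still point at the same object.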
Notes For Transportation
Unknown ends of objects (such as a GENE) are represented in this
version by a number that is off the end of the coordinates of
the PIECE. For consistency, we have used +100000 or -100000 so
that these can be more easily recognized (to our knowledge no
continuous sequences are this long ... yet!). If your computer
cannot handle integers this large, then you can reduce these
numbers, as long as they are outside of the individual coordinates.
(end of delman.construction.catal)
AN EXAMPLE OF CONSTRUCTING DELILA LIBRARIES
In this example we show the series of steps used to set up the Delila
libraries provided on the tape. The special bracket notation ([...])
is used here to indicate the contents of a file. A slash (/) inside
the brackets indicates the beginning of a new line in the file.
Other notation is described in DELMAN.DESCRIBE.CONVENTIONS.
1. Generate Library Catalogues
catal(humcat,[ADVANCE DATES],lib1,cat1,newlib1,lib2,cat2,newlib2)
copy(newlib1,lib1)
copy(newlib2,lib2)
The humcat should be identical to or similar to the one we send.
(Note: l3 is empty, and c3 and newlib3 will not be written, but your
computer may require that these files exist as empty files in order to
run Catal. A similar situation holds for Delila and many other programs.)
2. Build Transcript Book
delila(train,trabk,tradl,lib1,cat1,lib2,cat2)
There will be warnings that can be ignored at this point.
3. Build Transcript Library
catal(trahu,[ADVANCE DATES],trabk,tract,trali)
You will see a number of cases where duplicate names are resolved.
4. Test Grin File
delila(grin,grbk,grdl,trali,tract)
comp(grbk,cmp,[3])
cmp should show 140 ATG, 7 GTG, 2 TTG.
5. Test Gain File
Within the Gain file, the "FIRST", "LAST" and "SPECIAL" cases must be
replaced by numbers. The WORCHA program comes in handy here, because it will
do this easily:
worcha(gain,ga3in,[FIRST/0/LAST/2/SPECIAL/0])
delila(ga3in,ga3bk,ga3dl,trali,tract)
comp(ga3bk,cmp,[3])
cmp should be the same as for Grin.
6. Expanding Grin
You can now expand the "FIRST" to "LAST" region of Gain, taking care not
to violate the "SPECIAL" cases.
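The word replacement done in step 5 can be imitated with a few lines of
Python. This is a sketch of whole-word substitution only, not a description
of how WORCHA works; the function name and the sample instruction line in the
usage note are made up.

```python
import re

WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def replace_words(text, spec):
    """Replace whole words; spec is 'OLD/NEW/OLD/NEW...' as in the brackets."""
    parts = spec.split("/")
    if len(parts) % 2 != 0:
        raise ValueError("spec must hold pairs of old and new words")
    table = dict(zip(parts[0::2], parts[1::2]))
    # words not in the table are copied unchanged
    return WORD.sub(lambda m: table.get(m.group(0), m.group(0)), text)
```

For a made-up line "FROM ATG -FIRST TO ATG +LAST", the specification
"FIRST/0/LAST/2/SPECIAL/0" gives "FROM ATG -0 TO ATG +2".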
(end of delman.construction.example)
RULES OF RAW SEQUENCE INSERTION
(1) A raw sequence is a file containing only the letters A, C, G or T
(no U is allowed, use T). You may type these letters or a set of
letters on the keyboard that is convenient (eg. 1234); then convert
the letters to ACGT using the program CHACHA.
(2) For reasons of transportability and readability, the length
of each sequence line should not exceed the width of characters on a
typical terminal: Do not type more than 60 bases per line. You can reformat
the data with REFORM or MAKEBK.
(3) Sequences can and should be entered in free format with spaces
to improve the readability of the sequence during entry. This
also helps in the corrections described below. Much later it helps one to
find parts of the sequence during fusion of PIECEs.
(4) Before entry, use a pencil to mark off intervals of sequence to
type. This makes entry easier since there are rest points. I often
check off each (or every other) interval as I go, so I rarely get
lost and duplicate or delete intervals. If you can keep the lines like those
in the paper, the sequence will be easier to check and correct later
(but remember rule 2).
(5) Two people should INDEPENDENTLY enter the sequence.
Independence is important: one person will FREQUENTLY make the
same mistake twice. Do not be fooled into entry of a sequence and
its complement by one person. We have had two cases where the same deletion
was entered in the same place by one person, even though he was typing
the sequence and its complement. Have two people independently
type the sequence and the complement. By doing it this way, you
will also catch some typographical errors if you are using a published
source. (Another method: if one person is to enter both strands, be
sure that they are typed from two copies on which different intervals
are used.)
The method of independent entry allows automatic correction. It seems
to be faster and more reliable than other methods.
(6) I caught the deletions mentioned above by knowing how long the
sequence should be. You should not rely on the computer for the
length. Predict it and then check it.
(7) The file names of the two copies should include the
initials of the person who typed the file. See the example below.
(8) A complemented or inverted strand may be re-complemented or
re-inverted using the program REFORM. Note that the free format
of (3) will be lost. You should use the reformatted sequence only
for checking, and not for the final Library insertion, since you
would lose the formatting if you did.
(9) At this point you have two files of "raw" sequence. The sequences
may be merged together and corrected using MERGE.
FOR EXAMPLE: If the sequence was OMPA, TS and MA typed the raw
copies, and the copy of MA contains the format desired for the
Library, you could use MERGE like this:
MERGE(OMPAMA,OMPATS,OMPA,GARBAGE)
(10) Be sure to save all raw files (eg. OMPAMA, OMPATS, OMPA) until
the library insertion is completed and taped or backed-up.
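Several of the rules above are easy to try out in Python before using the
real programs. The sketch below is not CHACHA, REFORM or MERGE: it only
converts a keypad entry to ACGT, complements a strand, and compares two
copies position by position. The mapping 1234 to ACGT and all function names
are our assumptions.

```python
def clean(raw):
    """Remove spacing, convert 1234 to ACGT (assumed mapping), check letters."""
    text = "".join(raw.split()).upper().translate(str.maketrans("1234", "ACGT"))
    bad = set(text) - set("ACGT")
    if bad:
        raise ValueError("not allowed in a raw sequence: " + "".join(sorted(bad)))
    return text

def complement_strand(seq):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def compare_copies(copy1, copy2):
    """Return (same length?, 1-based positions at which the copies differ)."""
    differences = [i + 1 for i, (a, b) in enumerate(zip(copy1, copy2)) if a != b]
    return len(copy1) == len(copy2), differences
```

If the second person typed the complement, apply complement_strand to that
copy first. A length that differs from the one you predicted (rule 6) points
to an insertion or deletion, which a position-by-position comparison cannot
locate by itself.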
(end of delman.construction.data.entry)
SEQUENCE INSERTION PROCEDURE
The following procedure assures the accurate and complete insertion
of sequences into a Delila Library. Overview of the method:
REFERENCE OBTAINED
:
.....................*....................
: : :
V V V
: : :
RAW SEQUENCE RAW SEQUENCE DESIGN BOOK
COPY 1 COPY 2 :
: : :
V V :
: : :
CHACHA CHACHA :
: : :
V V :
: : :
:.......MERGE........: :
: :
V :
: :
RAW SEQUENCE :
CORRECTED COPY :
: :
V V
:............MAKEBK............:
:
V
:
LIBRARY INSERTION MODULE
:
V
:
LIBRARY INSERTION
I. Obtaining Sequences
A. Sequences may be obtained from
1) Publications and preprints
2) Computer transfer
3) Your lab
B. One copy of the source article and the sequence (or two copies of
the sequence when no paper is available) are to be made for entry to
our reference shelf. The photocopies must be of GOOD quality, with
NO loss of information.
II. Raw Sequence Insertion (See DELMAN.CONSTRUCTION.DATA.ENTRY for details)
A. Double entry is preferred over other methods.
B. Programs are available to make this easy: REFORM and MERGE.
RAWBK may be used on the checked raw sequence to get results quickly.
C. THE NAME OF THE GAME IS ACCURACY.
III. Book Design
A. First be sure that you understand library structure and coordinate
systems. See LIBDEF and DELMAN.USE.
B. Use forms to write out inserted sections. These can be found in the
sections that begin with "DELMAN.CONSTRUCTION.FORM".
C. Check the library to see if you can fuse the new sequence to
previous sequence.
D. Decide on a coordinate system or fuse to previously defined
coordinates. (NOTE: when there is no zero, add 1 to the negative numbers.)
Write this information on the source copy for our reference shelf.
E. Record the source of all fragments and special information (eg:
no zero, negative numbers incremented) in the PIECE notes.
Put a complete reference into the PIECE notes. Include
the positions on the coordinate system, such as: (-1288 to -208)
F. Record all MARKERs, TRANSCRIPTs and GENEs in your coordinates.
Unknown values are either +100000 or -100000, depending on which
end of the coordinates the value is beyond.
G. Create the Library insertion module using MAKEBK. All MARKERs,
TRANSCRIPTs and GENEs pointing to a PIECE must be placed immediately
prior to the PIECE that they refer to. They are called the "family"
of the PIECE. (Note: we call this piece of a Delila library a
module, but this is not the same as the ones the Module program works
with. The meaning should be clear from the context.)
IV. Insertion - With The Utmost Of Care
A. Always insert whole Library insertion modules. Replace old parts of
the library by modifying a module and reinserting it (with an editor).
B. Quickly check the book structure for blatant errors.
V. Checking the new Library
A. The catalogue program (CATAL) is used to check library structure
and to generate human and librarian catalogues.
B. Modules that contain only parts of books can be made into whole
books by placing a shell around the module. Example: a PIECE and its
family can be inserted into a shell of a fake ORGANISM and CHROMOSOME
to check the PIECE structure.
C. Correct modules are inserted into the library and CATAL is run on
the entire library. Be sure that file CATALP is empty, to ensure that
the dates are advanced.
D. End point checking: all coordinate numbers should be checked.
To do this, use DELILA to pull out: COORDINATE, PIECE, GENE,
TRANSCRIPT and MARKER endpoints. This is painful, but it has caught
many errors. Example:
GET FROM GENE BEGINNING TO GENE BEGINNING +2;
should give mostly ATG, and a few XTG. (SOMEDAY THIS MAY BE AUTOMATED)
VI. Listings Of The New Library
These are often useful (program to use in parentheses)
A. LIB (SHIFT)
B. HUMCAT (CATAL)
C. REF (REFER)
D. LIS (LISTER) may be large.
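Two of the steps above involve simple arithmetic that can be sketched in
Python: the "no zero" rule of III.D and the tally of gene beginnings in V.D.
These functions are illustrations only; their names are ours and they are
not part of the Delila System.

```python
from collections import Counter

def add_zero(coordinate):
    """Convert a numbering without zero (..., -2, -1, +1, ...) to one with zero."""
    if coordinate == 0:
        raise ValueError("a numbering without zero cannot contain 0")
    # add 1 to the negative numbers; positive numbers are unchanged
    return coordinate + 1 if coordinate < 0 else coordinate

def tally_starts(sequences):
    """Count the first three bases of each sequence (mostly ATG is expected)."""
    return Counter(seq[:3].upper() for seq in sequences)
```

For example, a fragment published as -1288 to -208 in a numbering without
zero would run from -1287 to -207 after the conversion.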
(end of delman.construction.library.design)
NAME: LIBDEF, 1980 JUNE 9
ORGANISM
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* GENETIC MAP UNITS (REAL)
(INSERT A SERIES OF
ORGANISMS AT THIS
POINT)
ORGANISM
(end of delman.construction.form.organism)
NAME: LIBDEF, 1980 JUNE 9
CHROMOSOME
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* GENETIC MAP BEGINNING (REAL)
* GENETIC MAP ENDING (REAL)
(INSERT A SERIES OF
MARKERS, GENES, TRANSCRIPTS,
AND PIECES AT THIS POINT)
CHROMOSOME
(end of delman.construction.form.chromosome)
NAME: LIBDEF, 1980 JUNE 9
MARKER
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* PIECE REFERENCE
* GENETIC MAP BEGINNING (REAL)
* DIRECTION (+/-)
* BEGINNING NUCLEOTIDE (INTEGER)
* ENDING NUCLEOTIDE (INTEGER)
* STATE (ON/OFF)
* PHENOTYPE
DNA
*
*
DNA
MARKER
(end of delman.construction.form.marker)
NAME: LIBDEF, 1980 JUNE 9
TRANSCRIPT
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* PIECE REFERENCE
* GENETIC MAP BEGINNING (REAL)
* DIRECTION (+/-)
* BEGINNING NUCLEOTIDE (INTEGER)
* ENDING NUCLEOTIDE (INTEGER)
TRANSCRIPT
(end of delman.construction.form.transcript)
NAME: LIBDEF, 1980 JUNE 9
GENE
* SHORT NAME
* LONG NAME
NOTE
*
*
*
*
NOTE
* PIECE REFERENCE
* GENETIC MAP BEGINNING (REAL)
* DIRECTION (+/-)
* BEGINNING NUCLEOTIDE (INTEGER)
* ENDING NUCLEOTIDE (INTEGER)
GENE
(end of delman.construction.form.gene)
NAME: LIBDEF, 1980 JUNE 9
PIECE
* SHORT NAME
* LONG NAME
NOTE
* (NOTES INCLUDE PRECISE REFERENCE
* FOR EVERY BASE IN THE PIECE)
*
*
NOTE
* GENETIC MAP BEGINNING (REAL)
* COORDINATE CONFIGURATION
(CIRCULAR/LINEAR)
* COORDINATE DIRECTION (+/-)
* COORDINATE BEGINNING (INTEGER)
* COORDINATE ENDING (INTEGER)
* PIECE CONFIGURATION
(CIRCULAR/LINEAR)
* PIECE DIRECTION (+/-)
* PIECE BEGINNING (INTEGER)
* PIECE ENDING (INTEGER)
DNA
* (INSERT SEQUENCE HERE)
DNA
PIECE
(end of delman.construction.form.piece)
DDDDDDD EEEEEEEE SSSSSS CCCCCC RRRRRRR IIIIIIII BBBBBBB EEEEEEEE
DD DD EE SS SS CC CC RR RR II BB BB EE
DD DD EE SS CC RR RR II BB BB EE
DD DD EEEE SSSSSS CC RR RR II BBBBBBB EEEE
DD DD EE SS CC RR RR II BB BB EE
DD DD EE SS CC RRRRRRR II BB BB EE
DD DD EE SS CC RR RR II BB BB EE
DD DD EE SS SS CC CC RR RR II BB BB EE
DDDDDDD EEEEEEEE SSSSSS CCCCCC RR RR IIIIIIII BBBBBBB EEEEEEEE
(end of delman.describe)
PROGRAM NAMING CONVENTIONS
Every Delila System program exists in several forms:
1) Raw source code - without modules inserted. Example: "lister.r"
would be the raw code for the LISTER program. We are not sending code
this way.
2) Pascal source code - with all modules inserted. This code is ready
to compile. Example: "lister.p". (Our previous convention was to add
an s to the end of the file name to indicate this.)
3) Compiled code. Our convention is to remove the suffix: "lister".
To simplify the manual, programs are listed under the compiled code
name (lister).
PARAMETER FILE NAMES
A file that controls the operation of a program is called a parameter
file. For LISTER this file is LISTERP. For SPLIT it is ...
SPLITP (get it? HA! HA! sorry.)
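The naming conventions above can be sketched in a few lines of Python.
This is only an illustration: the function name and the dictionary keys
are our own invention, and folding names to lower case is an assumption
(the manual writes both LISTERP and pbreakp).

```python
def program_file_names(program):
    """Return the file names implied by the conventions for one program."""
    base = program.lower()  # assumption: file names are kept in lower case
    return {
        "raw_source": base + ".r",      # modules not yet inserted
        "pascal_source": base + ".p",   # modules inserted, ready to compile
        "compiled": base,               # suffix removed
        "parameter_file": base + "p",   # e.g. LISTER -> LISTERP
    }


names = program_file_names("LISTER")
print(names["pascal_source"], names["parameter_file"])
```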
RULES FOR PARAMETER FILES
1) If the file is not empty then the file must contain values for all
parameters. With few exceptions, this should reduce the number of complex
rules that one must deal with.
2) Each parameter is on its own line.
3) Parameters are left justified on the line.
4) A parameter may be followed by one or more spaces and then any
comment. This lets the user write reminders of what the allowed
values are.
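A reader for files that obey these four rules might look like the
following Python sketch. It is not taken from any Delila program. The
parameter names are hypothetical, we assume that a value never contains
a blank, and we assume that an empty file means "use the defaults".

```python
def read_parameters(text, names):
    """Parse parameter-file text into a dict keyed by the given names.

    An empty file returns None, which we take to mean "use the defaults".
    """
    lines = text.splitlines()
    if not any(line.strip() for line in lines):
        return None
    if len(lines) < len(names):
        raise ValueError("a non-empty parameter file must give every parameter")
    values = {}
    for name, line in zip(names, lines):
        # Rule 3: the parameter starts in the first column.
        if not line or line[0] == " ":
            raise ValueError("parameter for %s is not left justified" % name)
        # Rule 4: everything after the first blank is a comment.
        values[name] = line.split(" ", 1)[0]
    return values


example = "60 characters per line\nn  number the lines (y/n)\n"
print(read_parameters(example, ["width", "numbering"]))
```

The values are returned as strings; converting them to integers or
characters is left to the program that knows what each line means.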
WHY CAN'T DEFAULT PARAMETER VALUES BE STATED IN THIS MANUAL?
1) If default values are changed, then the manual must also be changed.
Since there is no automatic mechanism to assure that these remain
the same, it is likely that the update will be forgotten. The manual would
then be out of date.
2) The manual entry defines the program but does not enforce details
of operation. It is somewhat like the LIBDEF specification.
3) It is easy to find out what the defaults are since almost every
program states the values used in its listing. Running a small test
takes only two minutes.
(end of delman.describe.conventions.naming-parameters)
PROGRAM WRITING CONVENTIONS
Program source code will always follow certain rules:
1) The first line(s) will be the Pascal PROGRAM statement.
2) The module libraries that are sources of the modules will be stated.
3) One of the global constants will be called VERSION. This number
or string identifies the particular version of the source code. We
change VERSION every time that we modify the source file. The program
name and VERSION are written to the OUTPUT file when the program runs.
4) There will be a document module that describes the program.
The module is identical to the one in this manual such as
DESCRIBE.LISTER
It follows the format defined in
DELMAN.DESCRIBE.DOCUMENTATION.PROGRAMS
5) All constants, types, variables, procedures, functions
and sections of code will have comments that describe their function.
6) Interactive programs always have a HELP command.
FOR TRANSPORTATION:
1) Put non-standard features inside modules.
2) Program lines longer than 80 characters are avoided. (NB: This is ALWAYS
possible in PASCAL). The FLAG program will detect any lines that are too long.
3) Reading into packed arrays is forbidden. Read into unpacked arrays
and pack or transfer values.
4) The Pascal Users Manual suggests that PASCAL identifiers "must
differ over their first 8 characters." There are two problems related
to this. Assume that the transport is from a computer that requires
N characters to differ, where N > 8 (e.g. 10).
a) Transport to a computer that requires M < N may cause names like A23456789
and A2345678X to be considered identical, and compilation will be prevented.
b) Transport to a computer that recognizes M > N will detect cases
where one name was written two ways, with the difference in the last
characters (between N and M). The "most famous" such case was
in CATAL: HUMCATLINE and HUMCATLINES were used on a computer where
N = 10 and failed on computers where M > 10.
The solution in both cases is to avoid names that differ beyond
8 characters. Is somebody willing to write a program to detect this?
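One way such a detector might look is sketched below in Python (not in
Pascal, and not part of the Delila System). It is deliberately crude: it
treats every word in the source as an identifier, so reserved words,
comments and quoted strings are not skipped. A second function checks
line lengths in the spirit of FLAG, but it is not the FLAG program.

```python
import re
from collections import defaultdict

# A letter followed by letters or digits, starting at a word boundary.
IDENTIFIER = re.compile(r"\b[A-Za-z][A-Za-z0-9]*")


def colliding_identifiers(source, significant=8):
    """Group distinct names that agree in their first `significant` characters."""
    groups = defaultdict(set)
    for word in IDENTIFIER.findall(source):
        word = word.upper()  # Pascal does not distinguish case in names
        groups[word[:significant]].add(word)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def long_lines(source, limit=80):
    """Return the 1-based numbers of lines longer than `limit` characters."""
    return [n for n, line in enumerate(source.splitlines(), 1) if len(line) > limit]


sample = "var humcatline, humcatlines: integer; a23456789, a2345678x: char;"
for group in colliding_identifiers(sample):
    print(" ".join(group))
```

Calling colliding_identifiers with significant=10 still reports the
HUMCATLINE pair, which matches the CATAL story above: the two names only
become distinct when more than 10 characters are recognized.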
(end of delman.describe.conventions.writing)
PROGRAM RUNNING CONVENTIONS
In this manual we will use a single notation to mean running a program:
lister(book,list)
means to run the program LISTER using a file named BOOK. The program
will produce output to file LIST.
The names BOOK and LIST are not necessarily the same as the file names
declared in the source of LISTER (LISTERS); we assume that the names
are mapped one to one. Also, to simplify the notation, file names to the
right may not always be mentioned. For example:
edit(inst1)
:
: (create Delila instructions in file INST1)
:
delila(inst1,book1,delist1)
(Run DELILA to create a book named BOOK1 and
a Delila listing DELIST1 that shows where the errors are.
The library and catalogue are not mentioned.)
lister(book1,list1)
(Run the auxiliary program LISTER.
OUTPUT and LISTERP are not mentioned.)
The file OUTPUT will always contain messages and diagnostics intended
for the CRT screen or teletype.
The file INPUT is always used for interactive input by the programs.
To fully define the files that a program uses we will write:
LISTER(BOOK: IN; LIST: OUT; LISTERP: IN; OUTPUT: OUT)
IN and OUT define the direction of information flow into or out of
the program. INOUT would mean that the source file may be modified
(such as by an editor). This is a symbolic way to represent the data
flow diagrammed in our papers (see DELMAN.INTRO.DESCRIPTION).
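Because the full notation is regular, it can be read mechanically. The
Python sketch below is only an illustration (the function name is
hypothetical and no Delila program does this). It splits a definition
into the program name and a table of file directions.

```python
def parse_file_definition(text):
    """Split 'PROG(FILE: DIR; ...)' into the program name and a file table."""
    name, _, rest = text.strip().partition("(")
    if not rest.endswith(")"):
        raise ValueError("missing closing parenthesis")
    files = {}
    for item in rest[:-1].split(";"):
        logical, _, direction = item.partition(":")
        direction = direction.strip().upper()
        if direction not in ("IN", "OUT", "INOUT"):
            raise ValueError("unknown direction: %r" % direction)
        files[logical.strip()] = direction
    return name.strip(), files


program, files = parse_file_definition(
    "LISTER(BOOK: IN; LIST: OUT; LISTERP: IN; OUTPUT: OUT)")
# Files the program reads: everything that is not purely an output.
inputs = [f for f, d in files.items() if d != "OUT"]
print(program, inputs)
```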
NOTE: The mapping of logical file name (the one the program knows) to
physical file name (the actual one the computer system uses) is
frequently done with an ASSIGN or LINK command in the job control language of
the computer.
(end of delman.describe.conventions.running)
Short clustered descriptions of some Delila System files
DOCUMENTS
AAA Names Of Delila System Files
chars Character List
delman1 Delila System Manual
delman2 Delila System Manual, for program descriptions
libdef Delila Library System Definition
moddef Module Transfer System Definition
LIBRARIES
humcat Human's Catalogue For The Library
lib1 Library 1: Bacteriophage
lib2 Library 2: E. Coli And S. Typhimurium
DELILA INSTRUCTIONS
train Transcript Library Instructions
grin Gene Starts In Relative Form (Use Transcript Library)
gain Gene Starts In Absolute Form (Use Transcript Library)
SEARCH PROGRAM RULES
genrule Finds Genes And Non-Genes
enzrule Finds Restriction Enzyme Sites In Books
WEIGHT MATRICES FOR THE PERCEPTRON
w101 101 Wide, Finds All Genes In Transcript Library
w71 71 Wide, Finds All Genes In Transcript Library
w51 51 Wide, Finds All Genes And Some Nongenes
EXAMPLES
ex0bk Example Book
ex0hu Example Catalogue For Humans
ex0dl Example Delila Listing
ex0in Example Instructions - To Create EX0BK
ex0li Example Listing From LISTER
ex0lo Example Loocat On Catalogue from EX0BK
EXAMPLE DELILA INSTRUCTIONS FOR DELMAN
ex0in "ex0: example"
ex1in "ex1: the laci gene"
ex2in "ex2: an absolute get"
ex3in "ex3: a relative get"
ex4in "ex4: non-coding lac leader"
ex5in "ex5: the region between laci and lacz"
ex6in "ex6: multiple specification and requests"
ex7in "ex7: aligned book"
ex8in "ex8: non-coding lac leader- via respecification"
EXAMPLES FOR TESTING THE MODULE PROGRAM
exsin example source in
exmodli example module library
EXAMPLES FOR TESTING AUXILIARY PROGRAMS
expepin Delila Instructions For Testing Pemowe
EXAMPLES FOR TESTING THE PERCEPTRON
exspbk Example Sequences Positive Book
exsnbk Example Sequences Negative Book
expa0 Example Pattern 0, Learn EXSPBK Vs EXSNBK With Zero Start
expa1 Example Pattern 1, An Initial Matrix For Learning
expa2 Example Pattern 2, Learn EXSPBK Vs EXSNBK Using EXPA1 As Start
expan2 Result Of Patana On EXPA2
exsebk A Book For Searching With EXPA2
EXAMPLES FOR TESTING ENCODE PROGRAMS
exencin Example Encode Instructions
exencbk The Book For EXENCIN
exencen Example Encoding Of EXENCBK
FONTS FOR BIGLET
font font for the biglet program
phont demonstration font for the biglet program
EXAMPLE PARAMETER FILES
Often a program has an associated file that controls its operation;
this is called a parameter file. For example, the
pbreak program uses a parameter file called pbreakp. Many programs
have example files. They are not listed here, but you may want
to look for them before you run the program. An example is the xyplo
program, for which there are the files xyplop.demo, xyin.demo,
xyplop.test and xyin.test.
As programs are modified, this section will not always be up to date.
(end of delman.describe.short.cluster.files)
Short clustered descriptions of Delila System programs
Documentation exists as describe.[name]
MODULE LIBRARIES
auxmod: modules for auxiliary programs
delmod: delila module library
doodle: pascal graphics library and preprocessor for pic under unix
cybmod: specific module library for the cyber computer
genmod: genbank access modules
matmod: mathematics modules
prgmod: programming modules for the delila system
unixmod: specific module library for the unix operating system
vaxmod: specific module library for the vax computer
MODULE MANIPULATION
module: module replacement program
makemod: create a set of empty modules from a list of names
makman: make manual entries from a source code
maknam: make manual entry names
modin: generate modularized delila instructions for absolute sites
modlen: determine module lengths
nulldate: modules to neutralize the date-time functions
pbreak: breaks a file into pages at a certain trigger phrase
show: show modules in a module library
undel: remove references to delman in modules
TOOLS
biglet: text enlargement program
calc: a calculator that propagates errors
calico: character and line counts of a file
cap: put capital letters inside quotes of a program
censor: removes code from a program
chacha: changes characters in a file
code: find the comment density of a pascal program
column: pull defined column from input
concat: concatenate files together
copy: copy one file to another file
decat: break a file into 10 files
decom: remove comment starts from within a comment
difint: differences between integers
flag: points out excessively long lines
ll: line lengths
lig: ligation theory
lochas: look at characters in a file
merge: compare two files and merge them
nocom: remove comments
number: add line numbers to a file
rembla: remove blanks from ends of lines in a file
repro: make multiple copies of a file
same: counts the number of lines that are identical in two files
shell: basic outline for a program
shift: copy one file to another file, with a blank in front of each line
short: find locations of short lines in a file
shortline: make short lines out of long lines
split: split a wide file into printable pages
sqz: squeeze the input file to fit into fewer characters per line
sumfile: sum of file sizes
test: a simple test program for Pascal
unshi: remove first column of characters from a file
ver: look at the version of a program
verbop: increment the version number of a program
vernum: print the version number of a program
versave: save the file under the version number
unsqz: unsqueeze the input file
whatch: what characters are in a file?
worcha: word changing program
wl: wrap lines in a file
woco: word counting program
wordlist: lists words in a file
ww: word wrap
TOOLS FOR TEX
notex: remove tex and latex constructs
ref2bib: refer to bibtex converter
sortbibtex: sort a bibtex database
untex: remove tex and latex constructs
untitle: remove titles from bbl file
unverb: remove verbatim sections from a latex file
GRAPHICS
doodle: pascal graphics library and preprocessor for pic under unix
domod: doodle modules
dops: pascal graphics library and preprocessor for postscript
dosun: pascal graphics library and preprocessor for Sun graphics
shrink: reduce size of postscript graphics
genhis: general histogram plotter
genpic: convert genhis output to pic input
xyplo: plot x, y data
log: convert columns of data to log
dnag: graphics of dna
LIBRARIAN
delila: the librarian for sequence manipulation
catal: cataloguer of delila libraries, the catalogue program
loocat: look at a catalogue
GENBANK
dbbk: database to delila book conversion program
dbcat: database catalog production and sorting program.
dbfilter: filter GenBank databases to remove unwanted entries
dbinst: extract Delila instructions from a GenBank database
dblo: look at the catalogue of a genbank/embl database
dbpull: database extraction program.
AUXILIARY PROGRAMS FOR DATA BASE CONSTRUCTION
makebk: make a book from a file of sequences.
rawbk: make a raw sequence into a book
reform: raw sequences reformatted
AUXILIARY PROGRAMS FOR SEQUENCE LISTING
lister: list the sequences of pieces in a book with translation
parse: breaks a book into its components
AUXILIARY PROGRAMS FOR ALIGNED SEQUENCES
alist: aligned listing of a book
gap: gaps in aligned listing of a book
hist: make a histogram of aligned sequences.
histan: histogram analysis.
malign: optimal alignment of a book, based on minimum uncertainty
AUXILIARY PROGRAMS FOR ANALYSIS
cluster: cluster indana subindexes into groups of duplicate entries
coda: composition file to data for genhis
comp: determine the composition of a book.
compan: composition analysis.
count: counts the amount of sequence in a book
frame: evaluator of potential reading frames
indana: analysis of an index
index: make an alphabetic list of oligonucleotides in a book
pemowe: peptide molecular weights
search: search a book for strings
AUXILIARY PROGRAMS FOR HELIXES
dotmat: dot matrices of two books
helix: find helices between sequences in two books
keymat: keyed-matrices for helices between two books
matrix: dot matrices for helices between two books
rep: records repeats between sequences in two books
sorth: sort helix list
instal: delila instruction alignment
AUXILIARY PROGRAMS FOR PATTERN LEARNING
patana: pattern analysis
patlrn: pattern learning
patlst: lister of patlrn output.
patser: pattern searcher
patval: pattern evaluations of aligned sequences
AUXILIARY PROGRAMS FOR ENCODED SEQUENCES
encfrq: encoded sequence frequency analysis
encode: encodes a book of sequences into strings of integers
encsum: sum of the vectors of encoded sequences
AUXILIARY PROGRAMS FOR INFORMATION ANALYSIS
calhnb: calculate e(hnb), var(hnb), ae(hnb), avar(hnb), e(n)
frese: frequency table to sequ
palinf: find palindromes, based on information theory
rf: calculate Rfrequency
rseq: rsequence calculated from encoded sequences
rsim: Rsequence simulation
rsgra: rsequence graph
dalvec: converts Rseq rsdata file to symvec format
makelogo: make a graphical `sequence logo' for aligned sequences
ckhelix: check that the helix location is where one wants
alpro: frequency and information of aligned protein sequences
alword: frequency and information of aligned words
dirty: calculate probabilities for dirty DNA synthesis
sites: analyse sites from randomized sequence data base
bkdb: convert a book to database format for the sites program
siva: site information variance
diana: dinucleotide analysis of an aligned book
tri: test environment for triangle array
digrab: diagonal grabs of diana data
da3d: diana da file to 3d graphics
dotsba: dots to database
Ri: Rindividual is calculated for every site in the aligned book
scan: scan a book with a wmatrix and generate a vector
vfilt: vector filter
tod: to database format for sites program
winfo: window information curve
AUXILIARY PROGRAMS FOR OTHER USES
refer: print the references in the pieces of a book
sepa: separates delila instruction sets
lenin: convert a list of lengths into Delila instructions
RANDOM NUMBERS AND SEQUENCES
markov: markov chain generation of a dna sequence from composition.
tstrnd: test random generator
gentst: test random generator
normal: generate normally distributed random numbers
rndseq: generate random dna sequences
aran: aligned random sequences
MATHEMATICS
av: average integers
binomial: produce the binomial probabilities for a found black to white ratio
binplo: produce the binomial probabilities for a found black to white ratio
cerf: complement of the error function
cisq: circle to square
chi: estimates chi squared from degrees of freedom
linreg: linear regression
mnomial: produce the multinomial distribution for base probabilities
pcs: partial chi squared
riden: ring density graph
ring: z space ring
sphere: plot density of shannon spheres
stirling: test of stirling's formula
zipf: Monte Carlo simulation for Peter Shenkin's problem
MISCELLANEOUS
aa: not actually a program, this is the header page for Delila manual
asciicode: converts ascii table to Pascal code
binhex: convert binary to hex
hexbin: convert hex to binary
mstrip: remove control m's from a file
epsclean: clean an eps file
kenin: create Delila instructions from Kenn's all.gen instructions
kenbk: book from a file of sequences provided by Kenn Rudd
tipper: copy a file to the output file with special symbols at end
todawg: change a book into dawg format
ev: evolution of binding sites
evd: evolution display
makedate: make a date file
makessbdate: make a date file from a Sample_Sheet.bin file
PROGRAMS TO CONTROL MACHINERY
odti: munch od and time plates together for xyplo
titer: analyse titertek optical density data
spec: analyse two spectra from the camspec
ssbread: read a sample sheet from the ABI sequencer
tkod: read od values from tk data
(end of delman.describe.short.cluster.programs)