By downloading this code you agree to the
Source Code Use License (PDF).
{ version = 3.61; (* of dbinst.p 2018 Mar 28}
(* begin module describe.dbinst *)
(*
name
dbinst: extract Delila instructions from a GenBank database
synopsis
dbinst(db: in,
binst: out, einst: out,
oinst: out, sinst: out,
olength: out, slength: out,
dbinstp: in, locuslist: out, missing: out,
featab: out,
output: out)
files
db: a set of GenBank entries
binst: instructions for finding the beginning of a feature
einst: instructions for finding the ending of a feature
oinst: instructions for finding the whole feature, called the "object".
They are given in the form "from begin + f to end + t" where f and t are
the "from" and "to" parameters given in dbinstp.
sinst: instructions for finding the regions between features, called
the "space". They have the same form as those of oinst.
olength: list of object lengths
slength: list of space lengths
dbinstp: parameters to control the program
First line: the name of the feature to use.
Second line: two integers, the base "from" and the base "to" relative to
the alignment point to write the instructions.
If "from" is larger than "to" then generic names "before" and "after"
are written. This allows one to make a generic file of instructions
to be copied and edited later.
Third line: The first 4 characters on the line control which instruction
files are to be written. To have all 4 on, use 'beos', for begin, end,
object and space. Any other character in a position means that the
corresponding file will not be written. The file will be rewritten
however. Thus beos means write all files, and bEos would not write
the einst file.
Fourth line: 2 characters without spaces that control which length
files are to be written. To have both on, use 'os', for object and space.
Any other character means that the corresponding file will not be
written. The file will be rewritten however.
Fifth line: If the first character is 'r' then remove obviously
duplicated instructions and object or space lengths. When alternative
splicing occurs, GenBank records the endpoint several times, so that
the sequence instructions are identical. By using this toggle switch,
such cases are eliminated.
Sixth line: If the first character is 'f' then the coordinates of the
instruction are written whether or not the object is off the end
of the sequence. This allows one to pick up objects that are
partially on a piece.
If the first character is 's' then select against the feature if
either end is missing. This makes the length list correspond
to the instruction set.
Seventh line: Alignment shift. This integer is added to the
from and to coordinates of the instructions written out.
Normally this should be 0. For example, if the zero
of splice donor sites is defined as the first base of the intron,
then instructions written from exon coordinates will have
a zero base that is 1 too low. By making the alignment shift
1, the instructions written out will match the expectations of
other programs.
Note: object coordinates are shifted accordingly; this may
not be quite what you want if you are using them from the olength
file! However, the length is not affected.
locuslist: a list of all the loci in the db that have features of interest.
This list can be used with dbpull to create reduced databases containing
only those entries that contain the features we want.
missing: Features that are listed under the database COMMENT are listed
here. These are "EMBL features not translated to GenBank features". We
do not consider these to be reliable. They are NOT included in the binst,
einst, olength or slength files.
featab: feature table, tab delimited
first line gives information about this program
second line is column labels
following lines have columns:
accession name begin end orientation(+/-)
output: messages to the user
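The featab layout described above (two header lines, then tab-delimited columns) can be read with a few lines of Python. This is only a sketch: the names `Feature` and `read_featab` and the sample rows are hypothetical, and only the header arrangement and the five column names come from the description.

```python
import csv
import io
from typing import List, NamedTuple


class Feature(NamedTuple):
    accession: str
    name: str
    begin: int
    end: int
    orientation: str  # "+" or "-"


def read_featab(handle) -> List[Feature]:
    """Read a featab-style table: 2 header lines, then 5 tab-delimited columns."""
    rows = csv.reader(handle, delimiter="\t")
    next(rows)  # first line: information about the program
    next(rows)  # second line: column labels
    features = []
    for row in rows:
        if len(row) < 5:
            continue  # skip blank or short lines
        acc, name, begin, end, orient = row[:5]
        features.append(Feature(acc, name, int(begin), int(end), orient))
    return features


# Hypothetical sample content, not real dbinst output.
sample = (
    "dbinst (sample header line)\n"
    "accession\tname\tbegin\tend\torientation\n"
    "X00001\tgeneA\t10\t250\t+\n"
    "X00001\tgeneB\t900\t400\t-\n"
)
features = read_featab(io.StringIO(sample))
```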
description
The GenBank entries in db are scanned, and Delila instructions are
generated, according to a desired feature table item. Four kinds of
instruction are made: beginning, ending, object and space. Beginning
appears only if the data for the beginning of the feature is in the db.
Ending appears only if the data for the ending of the feature is in the db.
Object appears only if both the beginning and ending are there. Space only
appears if there was an ending to the previous feature, and the current
feature has a beginning. Thus object and space instructions are guaranteed
to have "natural" lengths.
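The four rules above amount to a small decision table. The Python sketch below restates them; the function name `kinds_to_write` and its arguments are hypothetical and are not taken from dbinst.p.

```python
def kinds_to_write(begin_known, end_known, prev_end_known):
    """Return the instruction kinds allowed for one feature.

    begin_known / end_known: whether the feature's beginning / ending
    is present in the db.
    prev_end_known: whether the previous feature's ending is present
    (None when there is no previous feature).
    """
    kinds = []
    if begin_known:
        kinds.append("begin")
    if end_known:
        kinds.append("end")
    if begin_known and end_known:
        kinds.append("object")
    if prev_end_known and begin_known:
        kinds.append("space")
    return kinds
```

For instance, a feature whose beginning is missing but whose ending is present yields only an "end" instruction, whatever the state of the previous feature.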
The names of the pieces are now the ACCESSION number.
The names for the instructions are determined as follows. The GenBank
ORGANISM contains the two part genus/species name, such as:
ORGANISM Homo sapiens
The parts are joined into "Homo.sapiens", and this becomes the name of the
organism and chromosome in the instructions. The instructions for organism
and chromosome only change when the genus/species name changes in db. The
LOCUS name of the entry is picked up and used as the piece name. These
naming conventions are the ones generated automatically by the dbbk program,
so one need not think about it most of the time.
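The joining of genus and species described above can be sketched in Python as follows. The value of `GENUS_LIMIT` is a placeholder (the real constant is genuslimit in dbinst.p, whose value is not given here), and treating the limit as simple truncation of the genus is an assumption.

```python
GENUS_LIMIT = 10  # hypothetical value; the real constant is genuslimit in dbinst.p


def organism_name(organism_line, genus_limit=GENUS_LIMIT):
    """Join the genus and species of an ORGANISM line with a period."""
    words = organism_line.split()
    if words and words[0] == "ORGANISM":
        words = words[1:]  # drop the GenBank keyword
    genus, species = words[0], words[1]
    # Assumption: an over-long genus is simply truncated.
    return genus[:genus_limit] + "." + species


name = organism_name("ORGANISM Homo sapiens")
```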
In each entry, lines of the form:
pept < 1 46 Ig V-R-H region protein, exon x
are located and used to generate Delila "get" statements.
If a "<" appears before the first number, then no instruction is
written to binst, since the beginning point is before the GenBank sequence.
If a ">" appears before the second number, then no instruction is
written to einst, since the ending point is after the GenBank sequence.
If a "<" or ">" appears in the db, then no object instructions or
lengths are written.
If a ">" appears in the previous feature or a "<" appears in the current
feature, then no space instructions or lengths are written.
So for the above example, only one Delila instruction would be written:
get from 46 -10 to 46 +20;
if the dbinstp contained -10 20, and
get from 46 before to 46 after;
if the dbinstp contained 10 -20,
where "before" and "after" are replaced by the integers from dbinstp.
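The worked example above can be reproduced with a short Python sketch. The helper names are hypothetical, the location is assumed to arrive already split into tokens, and a flag is attributed to whichever number it precedes. The alignment shift is left out.

```python
def parse_location(tokens):
    """Parse location tokens such as ["<", "1", "46"].

    Returns (begin, end, begin_missing, end_missing), where a number is
    marked missing if a "<" or ">" flag immediately precedes it.
    """
    numbers = []
    missing = []
    pending = False
    for tok in tokens:
        if tok in ("<", ">"):
            pending = True
        else:
            numbers.append(int(tok))
            missing.append(pending)
            pending = False
    return numbers[0], numbers[1], missing[0], missing[1]


def get_statement(position, from_offset, to_offset):
    """Format one Delila get statement around an alignment point."""
    if from_offset > to_offset:
        # Generic form, meant to be copied and edited later.
        return "get from %d before to %d after;" % (position, position)
    return "get from %d %+d to %d %+d;" % (
        position, from_offset, position, to_offset)


def instructions_for(tokens, from_offset, to_offset):
    """List (file, statement) pairs for the begin and end of one feature."""
    begin, end, begin_missing, end_missing = parse_location(tokens)
    out = []
    if not begin_missing:
        out.append(("binst", get_statement(begin, from_offset, to_offset)))
    if not end_missing:
        out.append(("einst", get_statement(end, from_offset, to_offset)))
    return out


result = instructions_for(["<", "1", "46"], -10, 20)
```

With the tokens from the example line, only the einst statement is produced, because the beginning is flagged as lying before the sequence.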
examples
If dbinstp contains:
CDS the name of the feature to use.
-40 20 "from" and "to" to write the instructions.
beOS "beos" means begin, end, object, space instructions written
os "os" means object and space length file written
r "r" means remove obviously duplicated instructions.
F "f" = find-anyway. 's'= select AGAINST feature if either end missing
0 alignment shift: amount to shift the zero base.
then instructions to get coding sequence (CDS) starts (binst) and ends
(einst) from -40 to +20 will be made.
Instructions for the entire coding region, from -40 before the start of the
peptide to 20 bases after will not be written because O is capitalized and
so not recognized.
Instructions for the regions between peptides, from -40 inside each previous
peptide to 20 bases into the inside of the next peptide will not be written
because S is capitalized and so not recognized.
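To make the toggles in this example concrete, here is a hedged Python sketch that interprets the seven dbinstp lines. `Params` and `parse_dbinstp` are hypothetical names. The sketch assumes each parameter starts in the first column of its line and that the feature name is the first whitespace-delimited word. The trailing comments in the sample are shortened.

```python
from typing import NamedTuple


class Params(NamedTuple):
    feature: str
    from_base: int
    to_base: int
    write_begin: bool
    write_end: bool
    write_object: bool
    write_space: bool
    object_lengths: bool
    space_lengths: bool
    remove_duplicates: bool
    end_mode: str  # "f", "s", or "" when neither is given
    shift: int


def parse_dbinstp(text):
    """Interpret the seven parameter lines of a dbinstp file."""
    lines = text.splitlines()
    from_base, to_base = (int(x) for x in lines[1].split()[:2])
    files = lines[2][:4].ljust(4)    # positions must read b, e, o, s
    lengths = lines[3][:2].ljust(2)  # positions must read o, s
    sixth = lines[5][:1]
    return Params(
        lines[0].split()[0], from_base, to_base,
        files[0] == "b", files[1] == "e", files[2] == "o", files[3] == "s",
        lengths[0] == "o", lengths[1] == "s",
        lines[4][:1] == "r",
        sixth if sixth in ("f", "s") else "",
        int(lines[6].split()[0]),
    )


sample = (
    "CDS    feature name\n"
    "-40 20 from and to\n"
    "beOS   instruction files\n"
    "os     length files\n"
    "r      remove duplicates\n"
    "F      neither f nor s\n"
    "0      alignment shift\n"
)
params = parse_dbinstp(sample)
```

Because the comparison is case-sensitive, the capital O and S in beOS switch off the object and space instruction files, and the capital F selects neither find-anyway nor select-against.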
documentation
none
see also
dbbk.p
author
Thomas Dana Schneider
bugs
The program does not produce the instructions for space between the first
object and the beginning of the sequence or the space after the last object
in the sequence. This is possible (and perhaps should be controlled by a
parameter) but it would not produce "natural" lengths because those space
lengths depend on the length of the reported sequence.
It is not clear that spaces are done properly anymore. Possible bug
at "SPACE PROBLEM".
Genus names are limited to genuslimit characters (a program constant) to
avoid names longer than the standard Delila limit.
technical notes
The expected column locations of the complement flag in the database, (the
'before end of piece' and the 'after end of piece' flags) are given in the
program constants.
If a file is not written to, this version of the program will not
touch the file. Though this could lead to some confusion on
the part of an incautious user (who thinks the program wrote a file
when it did not), this does mean that the program will not create
any new files that are not necessary.
*)
(* end module describe.dbinst *)
{This manual page was created by makman 1.45}
{created by htmlink 1.62}