By downloading this code you agree to the Source Code Use License.
{ version = 2.38; (* of exon.p 2018 Mar 06}
(* begin module describe.exon *)
(*
name
exon: determine lengths of exons in GenBank entries
synopsis
exon(exonp: inout, db: in,
dinst: out, ainst: out, einst: out,
lengths: out, exonfeatures: out, output: out)
files
exonp: parameters to control the program, one per line:
0: parameterversion: The version number of the program. This allows the
user to be warned if an old parameter file is used.
(Introduced 2007 Dec 10.)
1: If 'n' then the end exons are not included. These do not
have reliable lengths.
Even if end exons are included, the program will never add the Delila
instructions for the very ends of the CDS, because these are not
reliable. Often they are CAP or polyA sites. Specifically, the
first coordinate of the CDS is likely to be a CAP and so should not
be added to the acceptors in ainst, while the last coordinate of the
CDS is likely to be a polyA site and so should not be added to the
donors in dinst.
2: if 'd' then gobs of debugging output are printed
to the output file. If 'v' then verbose output is given
but not debugging information. ('v' is true when debugging.)
3: Two constants, theDfromrange and theDtorange, that determine
the from and to range to be written for Donor Delila instructions.
4: Two constants, theAfromrange and theAtorange, that determine
the from and to range to be written for Acceptor Delila instructions.
5: If the first character is 'e' then exon features are also
used. If the second character is 'i' then intron features are
also used.
6: If the first character is 'a' (for alternative) then exon
features that have one end point the same are included. If it is not
'a' then only exons that are completely different are included.
7: 5 characters that determine how harshly entries are filtered.
The categories are:
single letter  name             string in GenBank:
p              putative         "putative"
n              notexperimental  "not_experimental"
g              geneprediction   "gene prediction"
u              unpublished      "Unpublished"
s              pseudo           "pseudo"
The letters 'pngus' are on the parameter line.
If a letter is capitalized, then any entry with that string
in it ANYWHERE will be killed. This is harsh but effective
at removing GenBank crap.
8: If the first character is 'n' (for "notes") then if there
is no /gene or /number for a feature, the program will
use the /note feature. WARNING: Despite 15 years of complaining
to GenBank, names in notes are NOT PARSABLE and may cause ill health.
9: If the first character is 'r', 'R' or 'm' (for "mRNA")
then the mRNA feature is used instead of CDS.
Otherwise CDS is used.
(Introduced 2007 Dec 10.)
If exonp is old (before having a parameter version) exon will
attempt to upgrade it.
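
The capitalized case of parameter 7 above can be sketched in Python as
follows. This is an illustration only, not the exon.p code: the function
name is hypothetical, the match is assumed to be an exact, case-sensitive
substring test, and the behavior of lowercase letters is not modeled
because this page does not spell it out.

```python
# Hypothetical sketch of the parameter 7 "harsh" filter; not the exon.p code.
# The letter/string pairs are copied from the table in this manual page.
CATEGORIES = [
    ("p", "putative"),
    ("n", "not_experimental"),
    ("g", "gene prediction"),
    ("u", "Unpublished"),
    ("s", "pseudo"),
]

def killed_by_harsh_filter(entry_text, param_line):
    """Return True if a capitalized letter on param_line selects a
    string that occurs anywhere in entry_text (assumed exact match)."""
    for letter, phrase in CATEGORIES:
        if letter.upper() in param_line and phrase in entry_text:
            return True
    return False
```

With the parameter line 'Pngus', an entry containing "putative" anywhere
would be dropped, while the same entry would survive this test with
'pngus'.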
db: a set of GenBank entries
lengths: A list of the exon lengths found in db.
dinst: Delila instructions for donor sites.
ainst: Delila instructions for acceptor sites.
einst: Delila instructions for exons. The acceptor from (theAfromrange)
and donor to (theDtorange) are used to extend beyond the exon
edges.
exonfeatures: Locations of exons in the format for the Lister program.
output: messages to the user
description
The program searches for 'CDS'. If the next word is 'join' it parses out
the parts of the CDS, determining the lengths of the exons. If
'complement' is found, the complementary exons are identified.
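
The length calculation described above can be sketched in Python. This is
a simplified illustration, not the parser in exon.p: the function name is
hypothetical, and it handles only plain 'start..end' coordinates, so
partial markers and references to other entries are not treated
specially.

```python
import re

def exon_lengths(location):
    """Parse a simple location such as 'join(10..20,31..45)' or
    'complement(join(10..20,31..45))'. Returns a pair:
    (is_complement, list of (start, end, length))."""
    is_complement = location.startswith("complement(")
    exons = []
    for start, end in re.findall(r"(\d+)\.\.(\d+)", location):
        s, e = int(start), int(end)
        # Coordinates are inclusive, so the length is end - start + 1.
        exons.append((s, e, e - s + 1))
    return is_complement, exons
```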
To ensure a clean data set, the program eliminates:
* single exons in a locus (unreliable data for lengths)
* exons which have one end not defined (< or > mark)
* exons at the beginning or end of the CDS (unreliable data)
* exons that are references to other entries.
* duplicate exons within a single locus
* exons that have any coordinates the same as other exons in the same
entry. This (arbitrarily) eliminates alternative splice cases.
To remove further junk from the database, entries that contain any of
these phrases are skipped:
'not_experimental'
'gene prediction'
'Unpublished'
GenBank contains many mRNA sequences masquerading as DNA. They can be
identified by zero length introns. They are ruthlessly eliminated.
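
The zero length intron test can be sketched as below. The function name
is hypothetical, and the sketch assumes the exons are already in
ascending order with inclusive coordinates.

```python
def has_zero_length_intron(exons):
    """exons: list of (start, end) pairs in ascending order.
    Return True if any two neighboring exons are directly adjacent."""
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        # The intron occupies the bases strictly between the two exons.
        if next_start - prev_end - 1 == 0:
            return True
    return False
```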
If a CDS has no /gene name in the feature table, it will be named like
this:
U00096.CDS.190-255 no /gene, no /number
If a CDS has a /gene name in the feature table, it would be nice to name
it like this:
U00096.thrA (this name can fail)
Unfortunately that alone will fail because all exons end up being named
the same! So if there is a /number the name will include it:
M95740.IDUA.exon-3 /gene, /number
If there is a /gene but no /number the range will be given:
M95740.IDUA.427-512 /gene, no /number
So there are three options for names.
* The exons are placed into ascending order.
* The Delila name command is used to name the pieces.
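
The three naming options can be sketched in Python. The function name and
argument order are hypothetical. The literal 'CDS' and 'exon-' parts are
copied from the examples above, and the case of a /number without a /gene
is not described on this page, so the sketch assumes it falls back to the
range form.

```python
def piece_name(accession, start, end, gene=None, number=None):
    """Build a name for one piece, following the three examples above."""
    if gene is None:
        # no /gene (assumption: /number alone is ignored)
        return "%s.CDS.%d-%d" % (accession, start, end)
    if number is not None:
        # /gene and /number
        return "%s.%s.exon-%s" % (accession, gene, number)
    # /gene but no /number
    return "%s.%s.%d-%d" % (accession, gene, start, end)
```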
examples
documentation
@article{Stephens.Schneider.Splice,
author = "R. M. Stephens
and T. D. Schneider",
title = "Features of spliceosome evolution and function
inferred from an analysis of the information at human splice sites",
journal = "J. Mol. Biol.",
volume = "228",
pages = "1124-1136",
year = "1992"}
see also
dbinst.p
author
Thomas Dana Schneider
bugs
technical notes
The program deals with alternative splicing by removing any exon that has
any coordinate the same as another exon.
The program can accept only a single type of organism to be put into the
instruction files. It's not clear that one would ever want to mix
organisms for this analysis!!
The zero coordinate for splice junctions follows the convention of
Stephens.Schneider.Splice: it is the base on the intron side of the splice
junction.
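
A sketch of this convention in Python, for the forward strand only. The
function name is hypothetical, and "the intron side" is read here as the
first intron base for a donor and the last intron base for an acceptor.
The very ends of the CDS are left out, as described under parameter 1.

```python
def splice_zero_coordinates(exons):
    """exons: ascending list of (start, end) pairs on the forward strand.
    Returns (donor_zeros, acceptor_zeros) for internal junctions only."""
    # Donor: first base of the intron that follows each exon but the last.
    donors = [end + 1 for (_, end) in exons[:-1]]
    # Acceptor: last base of the intron that precedes each exon but the first.
    acceptors = [start - 1 for (start, _) in exons[1:]]
    return donors, acceptors
```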
2007 Dec 05. The exon program would compile and run with exonmax
set to 15000. Unfortunately this is not enough for H. sapiens
chromosome 1 (NC_000001, 247249719 bp). With a larger exonmax the
program compiles (gpc) but gives a 'Segmentation Fault'. The reason (thanks to David
Bryant) is that the stack size in Unix is restricted. The Unix
command 'limit' gives 'stacksize 8192 kbytes'.
There are at least 3 solutions.
1. The Unix stack size can be increased by the command:
limit stacksize 65538
Doing so solved the problem.
Although this works, it requires changing an operating system
setting, so it is not very portable.
2. The exonmax determines the number of exonrecords:
fealist: array[1..exonmax] of exonrecord;
The exonrecords use the standard 'string' for the gene name, which
has an array of characters whose size is determined by constant
maxstring for which the default is 150. Setting maxstring to 20
solves the problem. Although this would work, perhaps it is best
to allow long names.
3. Put the array into the program heap, which is unlimited, instead
of the stack, which has the current limit. This requires a program
change. I implemented it by making the fealist be a pointer to an
array: 'fealist^.a' replaced 'fealist' throughout the code. This
worked.
Thanks to David Bryant for pointing out the situation and
explaining the possible solution of putting the data on the heap.
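
The arithmetic behind the maxstring solution can be sketched in Python.
The sketch counts only the name characters in each exonrecord and
assumes one byte per character and 1024 bytes per kbyte. The real record
holds other fields that this page does not list, so the result is only
an upper bound on how many records fit on the stack.

```python
STACK_BYTES = 8192 * 1024  # 'stacksize 8192 kbytes' from the limit command

def max_records(maxstring, stack_bytes=STACK_BYTES):
    """Upper bound on the number of exonrecords that fit on the stack
    if each record held nothing but a name of maxstring characters."""
    return stack_bytes // maxstring
```

Under these assumptions, max_records(150) is 55924 and max_records(20) is
419430, which is consistent with a smaller maxstring making room for a
much larger exonmax.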
*)
(* end module describe.exon *)
{This manual page was created by makman 1.45}