By downloading this code you agree to the
Source Code Use License (PDF).
{ version = 1.27; (* of mkdb.p 2011 Feb 04}
(* begin module describe.mkdb *)
(*
name
mkdb: read sequence; make GenBank entry with features for capitalized regions
synopsis
mkdb(sequ: in, mkdbp: in, entries: out, output: out)
files
sequ: raw DNA sequences in lower case except for objects
of interest marked in upper case. The program also accepts 'n'.
Sequences are separated by periods.
Each sequence may be preceded by a name line and a species line.
These lines can begin with '>' or '*', but this is not necessary.
Spaces, blank lines and numbers are ignored.
Other lines that begin with '>' or '*' are comments.
If there is no species name, the single name will be used
for the species also. This kludge allows the program to read
the fasta format. If there is no name at all (just sequence)
then a name will be assigned: nameste.
name: a string of characters to name the sequence.
organism: the species this sequence represents
entries: GenBank entries for the sequences, with features for
the capitalized regions marked as exons and features for the lower case
regions (not including primers) as introns.
mkdbp: parameters to control the program. The file must contain the
following parameters, one per line:
parameterversion: The version number of the program. This allows the
user to be warned if an old parameter file is used.
exoncutoff (integer): Capitalized regions longer than this
number of bases will be called exons, the others will be
called primers.
multipart (character): What to do if a name has spaces in it.
'i' ignore the rest of the name
'u' replace spaces with underscores
output: messages to the user
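The sequ reading rules and the multipart parameter described above can be sketched as follows. This is a minimal illustration in Python, not the program's own Pascal code. It assumes that name and species lines are always marked with '>' or '*' (the program does not require this), that unexpected characters are an error, and it uses a hypothetical default_name for sequences that have no name line. The function names are illustrative only.

```python
import re

BASES = "acgtnACGTN"  # letters the manual says the program accepts


def apply_multipart(name, multipart):
    """Treat spaces in a name as the multipart parameter describes."""
    name = name.strip()
    if multipart == "i":   # ignore the rest of the name
        return name.split(" ")[0]
    if multipart == "u":   # replace spaces with underscores
        return re.sub(r" +", "_", name)
    raise ValueError("multipart must be 'i' or 'u'")


def parse_sequ(text, multipart="u", default_name="unnamed"):
    """Return a list of (name, species, sequence) tuples."""
    records, headers, bases = [], [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith((">", "*")):
            # Before any bases, the first two marked lines are the name
            # and the species; every other marked line is a comment.
            if not bases and len(headers) < 2:
                headers.append(stripped[1:].strip())
            continue
        for ch in stripped:
            if ch == ".":
                # A period ends the current sequence.
                if bases:
                    if headers:
                        name = apply_multipart(headers[0], multipart)
                    else:
                        name = default_name  # hypothetical fallback name
                    # With no species line, the name doubles as the species.
                    species = headers[1] if len(headers) > 1 else name
                    records.append((name, species, "".join(bases)))
                headers, bases = [], []
            elif ch in BASES:
                bases.append(ch)  # case is kept: capitals mark the features
            elif not (ch.isspace() or ch.isdigit()):
                # Assumption: anything else is rejected rather than skipped.
                raise ValueError("unexpected character: " + repr(ch))
    return records
```

In this sketch a sequence that is not closed by a period is dropped, and the multipart rule is applied to the name only, since the example entry keeps the space in the species name.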
description
Sequences are often marked by people with capital letters to
indicate interesting regions (exons, primers, mutations, etc).
This program uses raw sequences to create simple flat-file GenBank
style entries with features marked by capital letters. Long
features are called 'exons' while short ones are called 'primers'.
The division between these two is given by the exoncutoff
parameter.
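The division into exons and primers can be sketched in a few lines. This is an illustrative Python sketch, not the program's own code; the function name capitalized_features is hypothetical. Coordinates are 1-based and inclusive, as in the entries file. The special treatment of capitals that abut either end of the sequence (see the technical notes) is not reproduced here.

```python
import re


def capitalized_features(sequence, exoncutoff):
    """Return (kind, start, end) for each run of capital letters.

    Coordinates are 1-based and inclusive.  A run longer than
    exoncutoff bases is called an exon; any other run is a primer.
    """
    features = []
    for match in re.finditer(r"[A-Z]+", sequence):
        length = match.end() - match.start()
        kind = "exon" if length > exoncutoff else "primer"
        # match.start() is 0-based, so add 1; match.end() is already
        # the 1-based position of the last capital in the run.
        features.append((kind, match.start() + 1, match.end()))
    return features
```

A run whose length equals exoncutoff is still a primer, because only runs longer than the cutoff are called exons.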
Example
If the sequ contains:
* T7stuff
* Bacteriophage T7
aacataaaggacacaatgcaatgaacattaccgacatcatgaacgctatc
gacgcaatcaaagcactgccaatctgtgaacttgacaagcgtcaaggtat
gcttatcgacttactggtcgagatggtcaacagcgagacgtgtgatggcg
agctaacCGAACTAAATCAGGCACttgagcatcaagattggtggactacc
ttgaagtgtctcacggctgacgcagggttcaagATGCTCGGTAATGGTCA
CTTCTCGGCTGCTTATAGTCACCCGCTGCTACCTAACAGAGTGATTAAGG
TGGGCTTTAAGAAAGAGGATTCAGGCGCAGCCTATACCGCATTCTgccgc
atgtatcagggtcgtcctggtatccctaacgtctacgatgtacagcgcca
cgctggatgctatacggtggtacttgacgcacttaaggattgcgagcgtt
tcaacaatgatgccCATTATAAATACGCTGAgattgcaagcgacatcatt
gattgcaattcggatgagcatgatgagttaactggatgggatggtgagtt
tgttgaaacttgtaaactaatccgcaagttctttgagggcatcgcctcat
.
The entries file will contain:
LOCUS T7stuff 600 bp DNA * mkdb 1.21
DEFINITION Bacteriophage T7
ACCESSION T7stuff
VERSION T7stuff.1
SOURCE Bacteriophage T7
ORGANISM Bacteriophage T7
FEATURES
primer 158..174
exon 234..345
primer 465..481
BASE COUNT 166 a 133 c 151 g 150 t 0 n
ORIGIN
1 aacataaagg acacaatgca atgaacatta ccgacatcat gaacgctatc gacgcaatca
61 aagcactgcc aatctgtgaa cttgacaagc gtcaaggtat gcttatcgac ttactggtcg
121 agatggtcaa cagcgagacg tgtgatggcg agctaaccga actaaatcag gcacttgagc
181 atcaagattg gtggactacc ttgaagtgtc tcacggctga cgcagggttc aagatgctcg
241 gtaatggtca cttctcggct gcttatagtc acccgctgct acctaacaga gtgattaagg
301 tgggctttaa gaaagaggat tcaggcgcag cctataccgc attctgccgc atgtatcagg
361 gtcgtcctgg tatccctaac gtctacgatg tacagcgcca cgctggatgc tatacggtgg
421 tacttgacgc acttaaggat tgcgagcgtt tcaacaatga tgcccattat aaatacgctg
481 agattgcaag cgacatcatt gattgcaatt cggatgagca tgatgagtta actggatggg
541 atggtgagtt tgttgaaact tgtaaactaa tccgcaagtt ctttgagggc atcgcctcat
//
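The BASE COUNT line and the ORIGIN block of the entry above can be reproduced with a short sketch. This is illustrative Python, not the program's own code, and the function names are hypothetical. It assumes single spaces between fields, as the entry appears on this page; real GenBank flat files align these fields in fixed-width columns.

```python
def base_count_line(sequence):
    """Build a BASE COUNT line; the sequence is folded to lower case first."""
    seq = sequence.lower()
    parts = ["%d %s" % (seq.count(base), base) for base in "acgtn"]
    return "BASE COUNT " + " ".join(parts)


def origin_lines(sequence, per_group=10, groups_per_line=6):
    """Split a sequence into ORIGIN lines of six groups of ten bases.

    Each line starts with the 1-based position of its first base.
    """
    seq = sequence.lower()
    per_line = per_group * groups_per_line
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        groups = [chunk[i:i + per_group]
                  for i in range(0, len(chunk), per_group)]
        lines.append("%d %s" % (start + 1, " ".join(groups)))
    return lines
```

Both functions lower the case of the sequence, matching the entry above, where the capitalized regions survive only as feature coordinates.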
documentation
see also
example parameter file: mkdbp
example sequ file: mkdb.sequ (rename it to 'sequ' to use it)
Program for listing the sequences: lister.p
Program for generating a search for capitalized sequences: capsmark.p
author
Thomas Dana Schneider
bugs
technical notes
Capitalization that abuts either end of the sequence will be
indicated in the entry as beyond the end. This way the ends of the
sequence will not be marked as donors or acceptors.
The maximum name and sequence lengths are the constants maxnamelength
and maxsequencelength respectively.
*)
(* end module describe.mkdb *)
(* begin module describe.const *)
maxnamelength = 100; (* maximum length name *)
maxsequencelength = 6000000 ; (* maximum sequence length *)
(* 253256583 human chromosome 2 length *)
(* Set the length to the maximum your computer can handle *)
debugging = false; (* set true to get debugging output *)
(* end module describe.const *)
{This manual page was created by makman 1.45}
{created by htmlink 1.62}