Delila Program: sites

By downloading this code you agree to the
Source Code Use License (PDF).

Documentation for the sites program is below, with links to related programs in the "see also" section.

{version = 8.09; (* of sites.p 2002 Mar 6}

(* begin module describe.sites *)
(*
name
   sites: analyse sites from randomized sequence data base

synopsis
   sites(database: in, standard: in,
         caps: out, latex: out, list: out, sorted: out,
         stats: out, tables: out, rsdata: out,
         sequ: out, makebkp: out, output: out)

files
   database: database consisting of DNA sequence data.
      The first line is the name of the database.
      The remaining lines consist of experimental packages.
      The start of a package is a line like:
          @ -27 11 -21 5 0.85
      The '@' must be left justified as the first character on the
      line.  The numbers are defined to be:

          @ FROM.range TO.range FROM.random TO.random fraction.canonical

      FROM.range: the coordinate of the first base reported in the database
      TO.range:   the coordinate of the last base reported in the database
      FROM.random: the coordinate of the first randomized base
      TO.random:   the coordinate of the last randomized base
      fraction.canonical:  the fraction of the canonical base during
                           chemical synthesis.
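
      As an illustration only, the package header line described above
      could be read as in the following Python sketch.  The names
      PackageHeader and parse_package_header are hypothetical and are not
      part of sites.p.

```python
from typing import NamedTuple


class PackageHeader(NamedTuple):
    """Fields of an '@' line, in the order given in the manual page."""
    from_range: int            # first base reported in the database
    to_range: int              # last base reported in the database
    from_random: int           # first randomized base
    to_random: int             # last randomized base
    fraction_canonical: float  # fraction of canonical base in synthesis


def parse_package_header(line: str) -> PackageHeader:
    """Parse a line such as '@ -27 11 -21 5 0.85'."""
    if not line.startswith("@"):
        raise ValueError("the '@' must be the first character on the line")
    fields = line[1:].split()
    if len(fields) != 5:
        raise ValueError("expected five numbers after '@'")
    a, b, c, d = (int(f) for f in fields[:4])
    return PackageHeader(a, b, c, d, float(fields[4]))
```

      For the example line above, parse_package_header returns a header
      with from_range = -27, to_range = 11, from_random = -21,
      to_random = 5 and fraction_canonical = 0.85.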

      The next line defines the canonical sequence which was 'randomized'.  It
      is in the format of the remaining sequences.  The first sequence in the
      package is always the standard, so do not forget to include it!

      The sequences follow the standard.  The format of the standard and the
      randomized sequences consists of:

      DNA sequence, plasmid name, primer, experiment, date (year, month, day)

      separated by one space each instead of commas.
      The sequence may contain any of the characters: "acgtxd.".
      "x" means that the base is not known.  "d" means that the base
      was deleted.  The program rejects sequences containing "x" or "d"
      (to keep the data pure), but the codes allow such sequences to be
      stored in the database.  "." means
      'the same as the standard sequence in this position'.  This allows
      one to enter sequences as a set of changes from the standard.
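
      The sequence line format and the meaning of "." can be sketched in
      Python as follows.  This is a minimal illustration, not the code of
      sites.p; the function names are hypothetical, and the sketch assumes
      that the date always occupies exactly three fields.

```python
def parse_sequence_line(line):
    """Split 'sequence plasmid primer experiment year month day'."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 space-separated fields")
    sequence, plasmid, primer, experiment = fields[:4]
    date = tuple(fields[4:])  # (year, month, day), kept as text
    bad = set(sequence) - set("acgtxd.")
    if bad:
        raise ValueError("unexpected characters: %s" % "".join(sorted(bad)))
    return sequence, plasmid, primer, experiment, date


def expand_dots(sequence, standard):
    """Replace each '.' with the standard base at the same position."""
    if len(sequence) != len(standard):
        raise ValueError("sequence and standard differ in length")
    return "".join(s if c == "." else c for c, s in zip(sequence, standard))


def is_pure(sequence):
    """True when the sequence has no unknown ('x') or deleted ('d') base."""
    return "x" not in sequence and "d" not in sequence
```

      A sequence for which is_pure returns False can still be stored in
      the database file; it would simply be left out of the analysis.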

      The next experimental package begins with another '@'.  The data from
      each experimental package are gathered as frequencies and normalized by
      using the given canonical base frequency.  The normalized frequencies
      from all the packages are averaged to produce the final results.  This
      allows one to combine several experiments; however, all
      experiments are given the same weight.  This is reasonable if the
      experiments have similar canonical frequencies and numbers of sequences,
      but is probably not correct if one experiment carries more "importance"
      than another.  A method of accounting for these different weightings is
      not known.
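
      The equal weighting described above can be illustrated with a short
      Python sketch.  It covers only the final averaging step.  The
      normalization by the canonical base frequency is not reproduced,
      because its formula is not given in this manual page.  The name
      average_tables and the table layout (one mapping from base to
      frequency per position) are assumptions made for the illustration.

```python
def average_tables(tables):
    """Average per-position base frequencies over packages.

    tables holds one frequency table per package; each table is a list
    with one base-to-frequency mapping per position.  Every package gets
    the same weight, whatever its number of sequences.
    """
    if not tables:
        raise ValueError("need at least one package")
    n = len(tables)
    length = len(tables[0])
    if any(len(t) != length for t in tables):
        raise ValueError("all packages must cover the same positions")
    averaged = []
    for i in range(length):
        row = dict()
        for base in "acgt":
            row[base] = sum(t[i][base] for t in tables) / n
        averaged.append(row)
    return averaged
```

      Because each package contributes one table, a package of ten
      sequences influences the result as much as a package of a thousand,
      which is the limitation noted above.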

   standard: Use the rsdata output of the rseq program from the natural
      sequences as your standard.  It is used for statistical comparison of the
      experiment to wild-type sequences.

   caps: listing of the database sorted and with capital letters showing
      changes from the standard and database errors.

   latex: just like list, but in a form that can be run through the typesetting
      program LaTeX.

   list: listing of the database in an easy-to-read format showing only the
      changes from the standard.  Also gives the tables of numbers of bases.

   sorted: the list sorted by sequence

   stats: frequency statistics of the database differences.
      summary of information results.

   tables: frequency tables for various stages of the normalization.

   rsdata:   This simulates the output of the rseq program by giving the
      numbers of bases (b) at each position (i).  When the frequency tables are
      normalized in this program, the effective number of sequences is lost.
      To make sure that the numbers reported in rsdata are accurate, they are
      multiplied by constant scaleup.  The table can be run through dalvec and
      makelogo to make a sequence logo.  The variance, varhnb, is set to be
      negative to indicate that no method is known for how to calculate it.  An
      earlier version of the program gave the minimum error based on the number
      of sequences in the database, but people tended to miss this fact when
      looking at the final sequence logo, so were unduly impressed by the
      data.

   sequ: raw sequences (after processing) ready for makebk

   makebkp: input for makebk to create the book

   output: messages to the user

description
   The function of the sites program is to gather, collate and analyze
   data from a randomization experiment.  See the reference given below.

   It was designed to help enter sequence data.  One may enter several copies
   of a particular sequence, and they will be joined together by merging their
   data.  Sequences of the same clone are identified by their common plasmid
   names. Inconsistent data are flagged.

   First the program sorts the data and checks that multiple entries are
   consistent with one another.  If they are not, the program halts and you
   should look into the caps file to figure out what is wrong.
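
   One way to picture the merging and the consistency check is the Python
   sketch below.  It is not taken from sites.p.  It assumes that "."
   characters have already been expanded against the standard, that an
   unknown base "x" in one entry may be filled in from another entry of
   the same plasmid, and that two different known bases at one position
   count as inconsistent.  The sketch raises an error where the program
   halts.

```python
def merge_reads(first, second):
    """Merge two reads of one clone; 'x' (unknown) yields to a known base."""
    if len(first) != len(second):
        raise ValueError("reads differ in length")
    merged = []
    for position, (a, b) in enumerate(zip(first, second)):
        if a == b or b == "x":
            merged.append(a)
        elif a == "x":
            merged.append(b)
        else:
            raise ValueError("inconsistent bases at index %d: %s vs %s"
                             % (position, a, b))
    return "".join(merged)


def merge_by_plasmid(entries):
    """entries: iterable of (plasmid_name, sequence) pairs."""
    merged = dict()
    for plasmid, sequence in entries:
        if plasmid in merged:
            merged[plasmid] = merge_reads(merged[plasmid], sequence)
        else:
            merged[plasmid] = sequence
    return merged
```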

   The program converts the database into a more readable form in list, and
   provides statistical analysis.  If the standard is:
gaattcaaattaatacgactcactatagggagaaagctt pTS37 kc7 ex100 87 nov 2
   and one of the database lines is:
gaattcaaattaattcgactcactttagggaaaaagctt pTS331 1204 ex394 87 nov 2
   the program presents the data in file list as:
..............t.........t......a....... pTS331 1204 ex394 87 nov 2
   which is more readable.  This allows entry as a sequence, but display
   in a form that is easy to understand.
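
   The conversion shown above amounts to a position-by-position comparison
   with the standard.  A minimal Python sketch follows; the name
   show_changes is hypothetical.

```python
def show_changes(sequence, standard):
    """Return the sequence with '.' wherever it matches the standard."""
    if len(sequence) != len(standard):
        raise ValueError("sequence and standard differ in length")
    return "".join("." if c == s else c for c, s in zip(sequence, standard))
```

   Applied to the pTS331 sequence and the pTS37 standard given above, it
   leaves only the three changed bases (t, t and a) visible.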

   If two primers are used, and data are found for both, then the
   name becomes 'both'.

   The stats file contains tables of the wild type frequencies and
   the experimental frequencies.

examples
   See database.t7 and standard.t7.

documentation

@article{Schneider1989,
author = "T. D. Schneider
 and G. D. Stormo",
title = "Excess Information at Bacteriophage {T7} Genomic Promoters
Detected by a Random Cloning Technique",
year = "1989",
journal = "Nucl. Acids Res.",
volume = "17",
pages = "659-674"}

see also
   Examples: database.t7 and standard.t7

   Related programs:
   siva.p, dalvec.p, makelogo.p, makebk.p

author
   Tom Schneider

bugs
   For sorting, all plasmid initials are ignored; sorting is by the plasmid
   number only.

   A correction for small sample size is not known for the normalized
   experimental data.  Certainly the method given in program Calhnb is not
   right.  Therefore, the program does not report the expected variation.

*)
(* end module describe.sites *)
{This manual page was created by makman 1.45}
{created by htmlink 1.62}