{version = 5.06; (* of cluster.p 1992 September 18}
(* begin module describe.cluster *)
(*
name
cluster: cluster indana subindexes into groups of duplicate entries
synopsis
cluster(clusterp: in, subind: in, inst: in, book: in,
pairs: out, clumps: out, output: out)
files
clusterp: The cluster parameter file that consists of the following:
FIRST LINE 'y' turns the flag on, 'n' turns it off
(debugging) allows one to look at raw data in the bags.
The debugging flag controls the printing of the raw data above the
regular output of the cluster program. The raw data is produced
solely by procedure showRAWbag and can be compared with the data in
the chart for correctness. Raw data consists of the series of
coordinate pairs in the bag and the sides they are matched on,
printed above the standard output structure.
example: - ( 630, 69) R
L ( 649, 88) - {20} {20}
*************************************
| 630 663
HUMUK | ----------
| 34
HUMUPA | ----------
| 69 102
*************************************
It is important to note that the raw data will only appear in the
pairs output file, and will not be written in clumps at all. This
means that parameter 3, writepairs, must also be turned on for
this flag to be effective.
SECOND LINE 'y' turns the flag on, 'n' turns it off
(showfragments) allows one to see pairs that are fragmented.
The showfragments toggle controls printing the outputs of pairs
with "imperfect" matches. That is, in some cases a repeating
sequence will match in several frames, causing repeated sequence
matching and producing a large list of coordinate pairs. This
list can be shown if the parameter is turned on, but the statement
"WARNING: sequence pairs are overmatched" will appear if it is
turned off. The actual sequences are shown in either case, so the
user can always do the comparison by hand. With the parameter
turned on, the output can be excessively long.
example: 1 acggatcgtgtgtgtgtgtgtgtgtacgatcggatcgat
2 acggatcgtgtgtgtgtgtgtgtgtacgatcggatcgat
These sequences will have matches between all of the 'gt' base
pairs, resulting in an overwhelming number of matches. The
maximum number of possible matches is found by taking the length
of the sequences and dividing it by the value in the overmatched
parameter (FIFTH LINE) times the number of instructions that
match between any two pieces in the dbinst. This results in
a maximum number of matches between any two pieces. Any pieces
above this limit can either have their output shown in full or
generate a warning message (see showfragments, SECOND LINE).
When turned off, showfragments suppresses the display of the
example case and of any other case that causes an excessive
number of matches.
THIRD LINE 'y' turns the flag on, 'n' turns it off
(writepairs) controls the printing of the pairs output file.
If writepairs is on, the original clustering pairlist will be
printed into the output file pairs. If it is off, this file will
not be printed. This parameter must be turned on to effectively
use the debugging parameter (see FIRST LINE).
FOURTH LINE 'y' turns the flag on, 'n' turns it off
(writeclumps) controls printing of the clumps output file.
If writeclumps is on, the original clustering pairlist will be
sent through the clumping procedures. The output file clumps will
contain the sequences involved in the matches on the pair in
addition to the clumped version of the pairlist. The clumping
process takes an excessive amount of time for very large files,
since the program must traverse the entire pairlist to find all
related pairs, then put the pairs on to the clumplist, then go
through the book and find sequences to match every instruction
in every pair of every clump. Although it is much easier to
determine which pieces are true repeats through use of the clumps
file, it is certainly possible to do so by simply using the pairs
output file.
FIFTH LINE any integer
(matchparameter) is the number of matches to be allowed
between two instructions. This can be determined by dividing the
sequence length from the book by the minimum window size from the
subindex, or a maximum number of matches between instructions can
be set. An integer less than or equal to 0 will calculate maximum
matches using the above method. Any number greater than 0 will be
used as the new maximum matches.
example: if the instructions call for the sequences
piece1: get from 100 -50 to 100 +50;
piece2: get from 200 -50 to 200 +50;
The sequence length is 101. If the windowsize read from the
subindex = 15, then 6 possible matches can occur between these
two instructions (101 div 15 = 6).
The TOTAL number of matches between two pieces is found by
multiplying matchparameter by the number of instructions in a
given pair. If a piece has more matches than this, it is
considered to be overmatched, the bag will not be printed, and the
statement 'WARNING: sequence pairs have too many matches.' will
appear. Overmatched pairs can be printed using the showfragments
parameter (see SECOND LINE).
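The limit described under FIFTH LINE can be sketched as follows. This is
only an illustration in Python, not the Pascal code of cluster.p. The
function names max_matches_per_instruction and is_overmatched are
hypothetical, and Python's // stands in for Pascal's div.

```python
def max_matches_per_instruction(matchparameter, sequence_length, windowsize):
    """Matches allowed between two instructions (hypothetical helper)."""
    if matchparameter <= 0:
        # Integer division, like Pascal's div: 101 div 15 = 6.
        return sequence_length // windowsize
    # Any value greater than 0 is used directly as the new maximum.
    return matchparameter


def is_overmatched(match_count, per_instruction_limit, instruction_count):
    """True when a pair has more matches than the total limit."""
    total_limit = per_instruction_limit * instruction_count
    return match_count > total_limit


# Worked example from the text: "get from 100 -50 to 100 +50" spans 101 bases.
limit = max_matches_per_instruction(0, 101, 15)
```

With a window size of 15 the per-instruction limit is 6, so under this
sketch a pair built from two instructions would be treated as overmatched
once it has more than 12 matches.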
subind: a subindex from the indana program matching the inst and the book
inst: a set of delila instructions that correspond to the book
book: a delila book that contains the sequences being clumped
pairs: the output list of paired sequences
clumps: the output list of clumped sequences
output: When errors occur, the program halts and produces an error message
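The five-line layout of clusterp described above can be read with a few
lines of code. The sketch below is a Python illustration and not part of
cluster.p. The names ClusterParams and parse_clusterp are hypothetical, and
it assumes that only the first character of each flag line and the first
token of the fifth line are significant.

```python
from typing import NamedTuple


class ClusterParams(NamedTuple):
    debugging: bool
    showfragments: bool
    writepairs: bool
    writeclumps: bool
    matchparameter: int


def parse_clusterp(text):
    """Parse the five clusterp lines (illustrative only)."""
    lines = text.splitlines()
    if len(lines) < 5:
        raise ValueError("clusterp needs five lines")
    flags = []
    for line in lines[:4]:
        # Assumption: only the first non-blank character matters.
        first = line.strip()[:1]
        if first not in ("y", "n"):
            raise ValueError("flag lines must start with 'y' or 'n'")
        flags.append(first == "y")
    matchparameter = int(lines[4].split()[0])
    return ClusterParams(flags[0], flags[1], flags[2], flags[3], matchparameter)


params = parse_clusterp("y\nn\ny\nn\n0\n")
# The debugging output only appears in pairs, so it needs writepairs as well.
debug_effective = params.debugging and params.writepairs
```

Here debugging and writepairs are both on, so the raw data would appear in
the pairs file; a matchparameter of 0 asks for the limit to be calculated
from the sequence length and window size.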
description
Duplicate entries in the subind subindex are clustered into a unified list
of pairs and copied to output files as sequence numbers, lengths, and
sequence base pairs.
Pairs are determined by the indana program, which marks sequence
similarities with an '*'. Cluster takes the subindex and shows the
coordinate range and length of the similarity by pairs. The pairs file is
a list of relationships between two sequences; the clumps file takes this
list of pairs and groups related ones together. The seqalign modules of the
program then access the book and get the corresponding sequences to print
out with the instruction number and piece name.
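The step from pairs to clumps amounts to grouping pieces that are linked,
directly or indirectly, by any pair. The Python sketch below illustrates
that idea with a union-find structure. It is not the method used in
cluster.p, which is described above as traversing the pairlist. The function
name clump_pairs and the piece names PIECE3 to PIECE5 are hypothetical.

```python
def clump_pairs(pairs):
    """Group sequence names that are linked, directly or not, by any pair."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clumps = {}
    for name in parent:
        clumps.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clumps.values())


pairs = [("HUMUK", "HUMUPA"), ("HUMUPA", "PIECE3"), ("PIECE4", "PIECE5")]
clumps = clump_pairs(pairs)
```

In this example HUMUK and PIECE3 never appear in the same pair, but both
are paired with HUMUPA, so all three end up in one clump while PIECE4 and
PIECE5 form a second one.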
documentation
none
see also
index.p, indana.p
author
R. Michael Stephens
bugs
None currently known.
technical notes
The read for the indana window size is based on the '[' character before
the number in the subind heading. Any changes to indana that alter this
format must be reflected in the getwindowsize procedure.
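The dependence on the '[' character can be illustrated with a small Python
sketch. It is not the getwindowsize procedure itself. The function name
get_window_size and the sample heading text are hypothetical, and the real
indana heading layout may differ.

```python
def get_window_size(heading):
    """Return the integer that follows the first '[' in a heading line."""
    start = heading.find("[")
    if start == -1:
        raise ValueError("no '[' found in subindex heading")
    digits = ""
    for ch in heading[start + 1:].lstrip():
        if ch not in "0123456789":
            break
        digits += ch
    if not digits:
        raise ValueError("no number after '['")
    return int(digits)


# Hypothetical heading text; the real indana heading layout may differ.
window = get_window_size("subindex of example [15] window")
```

If indana stopped writing the '[' before the window size, a reader of this
kind would fail or pick up the wrong number, which is why such a change
would have to be mirrored in getwindowsize.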
*)
(* end module describe.cluster *)
{This manual page was created by makman 1.45}