Instructions on How to Refine a Flexible Information-Theory Based Ribosome Binding Model From an Initially Annotated Genome.

By Ryan Shultzaberger
version = 1.05 of refineinst.html 2005 Oct 13

Introduction:

These instructions were written to help scientists familiar with the programs used in Dr. Thomas Schneider's lab refine a flexible information-theory based ribosome binding model from an initially annotated genome, as done previously (Shultzaberger.Schneider2001). If you are unfamiliar with the programs used in Dr. Schneider's lab, you can learn more about them at: https://alum.mit.edu/www/toms/ .

Theory:

Rigid information-theory based models are not appropriate for modeling flexible binding systems, like ribosomes. Flexible binding systems are those that contain more than one binding element with some variable distance between those elements. Prokaryotic ribosome initiation relies on the 30S recognition of the correct start codon and an rRNA-mRNA hybridization between the 16S rRNA and a region upstream of the start codon (the Shine-Dalgarno). The spacing between the Shine-Dalgarno and the start codon varies, making this system a flexible binding system. There is an in-depth description of the mathematics used to characterize flexible binding systems in Shultzaberger.Schneider2001 .

The number of sequenced genomes is increasing rapidly. These genomes are often annotated by computer algorithms that are fairly, but not completely, accurate. These programs work by looking at reading frame size, G/C content in the third position, and homology to genes in other organisms. The resulting initial annotations often contain genes with incorrectly placed starts. The methodology described in this document can be used to help correct these misplaced starts.

A flexible binding model can be made of all starts in an initially annotated genome using methods previously discussed (Shultzaberger.Schneider2001). The individual information score of each site within the set can then be determined. These individual information scores reflect the relative strength of each site within the total set. The relative frequency of each nucleotide base at each position is determined for the whole set, creating a weight matrix. Each site is then compared to the weight matrix to determine the individual score for that site.
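The weight-matrix scoring described above can be sketched in Python. This is only an illustration under stated assumptions, not the code used by the ri program: it treats a single rigid matrix, assumes uppercase aligned DNA sites of equal length, uses weights of the form 2 + log2 f(b, l), and omits the small-sample correction. The function names and the four example sites are invented for the illustration.

```python
import math
from collections import Counter

def weight_matrix(sites):
    """Build per-position weights 2 + log2 f(b, l) from aligned sites.

    Assumption: no small-sample correction is applied in this sketch.
    """
    n = len(sites)
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in "ACGT":
            freq = counts[base] / n
            # A base never seen at this position gets minus infinity.
            column[base] = 2 + math.log2(freq) if freq > 0 else float("-inf")
        matrix.append(column)
    return matrix

def individual_information(site, matrix):
    """Sum the weight of the base found at each position of one site."""
    return sum(column[base] for column, base in zip(matrix, site))

# Made-up example sites, only to show the calculation.
sites = ["AGGAGG", "AGGAGA", "TGGAGG", "AGCAGG"]
matrix = weight_matrix(sites)
scores = [individual_information(s, matrix) for s in sites]
```

A site that carries the common base at every position scores higher than one that carries a rare base, which is the sense in which the score reflects the relative strength of the site within the set.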

If the site has an individual information score greater than zero bits, then this site is thought to be real. A positive information score corresponds to a negative delta G. Conversely, if a site has a negative score (or a positive delta G), then the site is not real and the ribosome will not bind to it.

The refinement method, which is the subject of these instructions, is a means to identify, remove, and eventually adjust these incorrect sites. After the model is made, those sites with negative information are identified and removed, and the model is rebuilt. The individual scores of each remaining site are then re-evaluated against the rebuilt model. Sites that now have negative information, as compared to the new total set, are identified and removed in turn. This process of rebuilding and removing continues until no negative sites remain in the set. Once no more negative sites remain, the model is said to be refined.
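The rebuild-and-remove cycle can be sketched as a loop. This is a minimal sketch under assumptions, not the procedure as implemented by the lab programs: it scores sites against one rigid matrix with weights 2 + log2 f(b, l) and no small-sample correction, whereas the real model combines the Shine-Dalgarno, the initiation region, and their spacing. Keeping sites that score exactly zero is also an assumption, and the site list is invented.

```python
import math
from collections import Counter

def score_sites(sites):
    """Score every site against the matrix built from the same set."""
    n = len(sites)
    columns = [Counter(site[pos] for site in sites)
               for pos in range(len(sites[0]))]
    return [
        sum(2 + math.log2(col[base] / n) for col, base in zip(columns, site))
        for site in sites
    ]

def refine(sites):
    """Rebuild the model and drop negative sites until none remain."""
    rounds = 0
    while sites:
        scores = score_sites(sites)
        # Assumption: a score of exactly zero is not treated as negative.
        kept = [s for s, r in zip(sites, scores) if r >= 0]
        if len(kept) == len(sites):
            break  # no negative sites left: the model is refined
        sites = kept
        rounds += 1
    return sites, rounds

# Made-up set: 31 identical sites plus one outlier that scores negative.
initial = ["AGGAGG"] * 31 + ["CTTCTT"]
refined, rounds = refine(initial)
```

Each pass corresponds to one round in the instructions below, so the value of rounds plays the role of the number in inst.atg.N.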

This refined model is assumed to be made of correct sites, and it can be further used to predict the correct initiation site for the removed, presumably incorrectly annotated, sites. Examples of this are also given in Shultzaberger.Schneider2001 .

Programs used:

wgetac
exon
instshift
catal
delila
alist
encode
rseq
dalvec
makelogo
ri
embed
malign
malin
diffinst
genhis
xyplo
diri
makeinst

Instructions:

A. Make a clean directory to work in (dir xxx), and make the following directories within this new directory:

1. malignsd
2. rounds
3. discan

(Here is a diagram of the final directory structure .)

B. In directory xxx you will make your first model, which is an aligned listing of gene starts from GenBank. To do this do the following:

1. Use wgetac to get your selected genome off of GenBank. This works by typing "wgetac ACCESSION_NUMBER".

2. Copy this file, which is named ACCESSION_NUMBER, to db.

3. Run exon on db, but make sure that the "acceptor range is +/- 100" in the exonp file.

4. Use instshift so that the first base in the p-site is the zero position in your model.

5. Generate your lib and cat files that support your new genome.

6. Manipulate the ainst file so that it is in the following format:

example inst

7. Using this inst file, make a logo using run.logo (current version). (run.logo version 1.18 was originally used.)

8. Copy this inst file to rounds/inst.atg.0 and to malignsd/inst .

C. With this inst file you can multiple align out your Shine-Dalgarno. To malign the Shine-Dalgarno do the following in the malignsd directory:

1. Embed out the region associated with the P-site, so that it does not bias your re-alignment (like in the figure below). Make sure to copy the embedbk to book. (example embedp file)

Figure 1: The region associated with the p-site is embedded out.

2. Run malign over the region of the SD, and look at the malignxyin output to determine the best alignment. (example malignp file)

3. Run malin to make an inst of the best one.

4. Copy this realigned inst to ../rounds/inst.sd.0

5. Now go into the rounds directory and use instshift to shift the maligned inst file so that the central base is at the zero position. You now have two inst files (inst.atg.0 and inst.sd.0). Your SD model should look like the following:

Figure 2: Your model should look like this, notice how the central position is at base zero.

D. Go into your discan directory and make two new directories:

1. simul.rounds
2. diri
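The directories named in steps A and D can be created in one go. The following is a small convenience sketch, not part of the original procedure; the function name make_layout is hypothetical, and "xxx" stands for whatever you call your working directory.

```python
from pathlib import Path

def make_layout(root):
    """Create the working directories named in steps A and D under root."""
    root = Path(root)
    for sub in ("malignsd", "rounds", "discan/simul.rounds", "discan/diri"):
        # parents=True also creates root and discan when they are missing.
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root

if __name__ == "__main__":
    make_layout("xxx")
```

Because exist_ok is set, running it again on an existing working directory leaves the files already there untouched.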

E. In simul.rounds/ you will generate your rixyin files to be used by diri. Do the following:

1. Copy your inst.sd.0 file to inst in the simul.rounds directory, and to insta in the simul.rounds directory.

2. Run delila, alist, encode, rseq, dalvec, makelogo, and ri to generate an rixyin file for later use. Make sure that your rip file covers the range that you want for the SD region. Copy this file to ../diri/bri .

3. Copy your inst.atg.0 file to inst in the simul.rounds directory, and to instb in the simul.rounds directory.

4. Run delila, alist, encode, rseq, dalvec, makelogo, and ri to generate an rixyin file for later use. Make sure that your rip file covers the range that you want for the initiation region. Copy this file to ../diri/ari .

5. You need to generate a distribution of the spacings between the SD and ATG models. To do this, run diffinst and genhis on the insta and instb files you made in steps 1 and 3. Copy the histog output to ../diri/histog .
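What the spacing distribution in step 5 represents can be sketched as follows. This is not the diffinst or genhis code; it assumes paired coordinates on the same strand, and the coordinates are invented. Turning the observed frequency of a spacing into a penalty of -log2(frequency) is shown as one plausible use of the histogram; check Shultzaberger.Schneider2001 for the exact form used by diri, including any small-sample correction.

```python
import math
from collections import Counter

def spacing_histogram(sd_positions, start_positions):
    """Count the distance (start - SD) for each paired site."""
    return Counter(start - sd
                   for sd, start in zip(sd_positions, start_positions))

def gap_surprisal(histogram):
    """Convert counts to -log2(relative frequency), in bits.

    Assumption: no small-sample correction is applied in this sketch.
    """
    total = sum(histogram.values())
    return {d: -math.log2(count / total) for d, count in histogram.items()}

# Made-up zero coordinates for four SD sites and their start codons.
sd = [100, 250, 391, 512]
starts = [109, 259, 401, 521]
hist = spacing_histogram(sd, starts)
surprisal = gap_surprisal(hist)
```

Common spacings receive a small penalty and rare spacings a large one, so a site pair with an unusual gap contributes less to the total score.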

F. Now that you have your rixyin files for both the SD and AUG part of the model, as well as the distribution of the spacings, you can determine the individual information contents of each site in the model. In order to determine the individual information content, do the following:

1. In the diri directory, run diri and look at the dcout file.

G. Now you only want those sites in the dcout file that have positive information.

1. The data input file for makeinst is composed of a coordinate and an information score. Since you are only interested in those sites that are positive, you can grep -v out the negative scores. Here is a possible script that can be used to make this data file: possible script

2. Run makeinst on the new data file (the data.atg output from the above script) to generate new, refined instructions for these sites. Notice that these instructions do not have header material; you need to insert it.

3. Copy this new inst to xxx/rounds/inst.atg.1 . This is the inst file for the initiation region after the first round of cleaning.
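The filtering in step G.1 can also be sketched in Python instead of grep. The layout of dcout is not described here, so the sketch assumes a hypothetical layout: whitespace-separated columns with the coordinate and the information score at positions you pass in, and comment lines that begin with "*". Adjust both to match your actual dcout file. The function name and the example lines are invented.

```python
def positive_sites(lines, coord_col=0, score_col=1):
    """Return "coordinate score" strings for sites with a positive score.

    Assumption: the column positions and the "*" comment marker are
    hypothetical and must be matched to the real dcout layout.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if not fields or line.lstrip().startswith("*"):
            continue  # skip blank lines and comment lines
        try:
            coord = fields[coord_col]
            score = float(fields[score_col])
        except (IndexError, ValueError):
            continue  # skip lines that do not hold a numeric score
        if score > 0:
            kept.append(f"{coord} {fields[score_col]}")
    return kept

# Made-up lines standing in for dcout content.
example = ["* hypothetical dcout-like lines", "1523 7.4", "2871 -1.2", "4410 0.3"]
data_atg = positive_sites(example)
```

The returned lines hold a coordinate and an information score, which is the form described above for the makeinst data file.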

H. Use this inst file like you used inst.atg.0, and repeat steps C-G, skipping step D. Each time you repeat, make sure to raise your increments by 1, so that after the second round you would have inst.atg.2 and inst.sd.2. Continue to repeat until you have no more negative sites in your model. This is your refined model.


Schneider Lab
origin: 2002 Jan 22
updated: see top.