
flexrbs: Anatomy of Escherichia coli Ribosome Binding Sites

A quantitative model of ribosome binding uniformly
accounts for the statistics of the Shine-Dalgarno, the
initiation region and the variable spacing between them.
Two sequence logos are shown for the Shine-Dalgarno and
initiation region of E.  coli ribosome binding sites.  They
are separated by a histogram depicting the frequency of
different distances between the two parts.  In combination
these three components represent a flexible model for
ribosome binding sites.  Remarkably, the Shine-Dalgarno
matches the 3' end of the 16S rRNA depicted below it as 3'
A U U C C U C C 5'.  This figure was proposed as the JMB
cover but was not accepted.
@article{Shultzaberger.Schneider2001,
author = "R. K. Shultzaberger
 and R. E. Bucheimer
 and K. E. Rudd
 and T. D. Schneider",
title = "{Anatomy of \emph{Escherichia coli}
Ribosome Binding Sites}",
journal = "J. Mol. Biol.",
volume = "313",
pages = "215--228",
pmid = "11601857",
note = "\htmladdnormallink
{http://dx.doi.org/10.1006/jmbi.2001.5040}
{http://dx.doi.org/10.1006/jmbi.2001.5040}",
year = "2001"}

PDF Preprint copy.

Published 2001 October 16 at JMB, Abstract at PubMed

Summary of the flexible method: The basic observation is that the distance from the Shine-Dalgarno (SD) to the Initiation Region (IR, the start codon and the region around it) is variable. One can, therefore, make a probability distribution of the distances, as shown above. One can compute the Shannon uncertainty of any distribution. This uncertainty remains after binding, so it is to be subtracted from the sum of the other components. Furthermore, the ideas about individual information apply too, and so one can build flexible sequence walker models. These work very well. The interesting thing is that one does not need to do any training to get these models. One starts from proven binding sites and gets the model directly. In contrast, training methods require that one provide examples of sequences that do not contain the site. However, such examples are very difficult to obtain in general, so the training is probably contaminated with weak but functional sites. The information theory method avoids the problem.
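The subtraction described in the summary can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code used for the paper: the function names are ours, and the distance counts and bit values are made up.

```python
import math

def shannon_uncertainty(counts):
    """Shannon uncertainty H = -sum(p * log2(p)), in bits, of a count histogram."""
    total = sum(counts)
    h = 0.0
    for n in counts:
        if n > 0:  # empty bins contribute nothing to the sum
            p = n / total
            h -= p * math.log2(p)
    return h

def flexible_information(r_sd, r_ir, gap_counts):
    """Information of the two rigid parts minus the uncertainty of the spacing."""
    return r_sd + r_ir - shannon_uncertainty(gap_counts)

# Hypothetical numbers, for illustration only.
gap_counts = [5, 20, 40, 25, 10]  # sites seen at five different SD-to-IR distances
r_sd, r_ir = 4.0, 7.0             # made-up information of the SD and IR parts, in bits
print(flexible_information(r_sd, r_ir, gap_counts))
```

A sharply peaked distance distribution has low uncertainty and costs little; a broad one costs more, because more of the spacing remains undetermined after binding.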


Data Table new as of 2005 Sep 17.

We provide a table of data for the rbseg12 model for the U00096 E. coli K-12 MG1655 sequence that contains the following information:

30 nucleotides upstream of gene start
Location of the gene start (Start)
Location of the Shine-Dalgarno (SD)
Orientation of the gene (Orient)
Strength of the SD (Ri(SD))
Distance between the SD and the ATG (Gap)
Total strength of RBS including ATG (Ri(total))

Here are the first two lines of the table (the header line and the first data row):

*Sequence -30 to +2           ATG  Start  SD   Orient  Ri(SD)   Gap    Ri(total)
cagataaaaattacagagtacacaacatccatg  190    175  1       5.47471  -15.0  5.86032

The first codon of every gene is the last three bases of the sequence.  The SD coordinate corresponds to the central "G" in the SD (refer back to the ribosome paper), and the spacing is the difference between this base and the first base of the start codon (usually the "A" in "ATG").
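For readers who want to load the table, here is a minimal Python sketch. It assumes the file is whitespace-separated with the seven columns shown in the sample, and that lines starting with "*" are header lines; both assumptions are inferred from the two lines above, not from a format specification. The names RbsRow, parse_row and read_table are ours.

```python
from typing import NamedTuple

class RbsRow(NamedTuple):
    sequence: str    # bases -30 to +2; the last three are the start codon
    start: int       # coordinate of the first base of the start codon
    sd: int          # coordinate of the central "G" of the Shine-Dalgarno
    orient: int      # orientation of the gene
    ri_sd: float     # strength of the SD, in bits
    gap: float       # distance between the SD and the start codon
    ri_total: float  # total strength of the site, in bits

def parse_row(line):
    """Split one data line into typed fields (assumed column order)."""
    f = line.split()
    return RbsRow(f[0], int(f[1]), int(f[2]), int(f[3]),
                  float(f[4]), float(f[5]), float(f[6]))

def read_table(lines):
    """Parse all data lines, skipping blank lines and '*' header lines."""
    return [parse_row(ln) for ln in lines
            if ln.strip() and not ln.startswith("*")]

sample = [
    "*Sequence -30 to +2           ATG  Start  SD   Orient  Ri(SD)   Gap    Ri(total)",
    "cagataaaaattacagagtacacaacatccatg  190    175  1       5.47471  -15.0  5.86032",
]
row = read_table(sample)[0]
print(row.sequence[-3:], row.sd - row.start, row.gap)
```

In the sample row the gap equals SD minus Start; whether the same sign convention holds for genes in the other orientation is not checked here.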

sd_table.txt


Refined Computation of Rsequence and Rfrequency for E. coli ribosome binding sites: new as of 2005 Aug 23.
The original computation of Rsequence and Rfrequency for E. coli ribosome binding sites in Schneider1986 gave Rsequence = 11.0 bits and Rfrequency = 10.6 bits. The computation is changed in two ways now. First, the original data set contained all known ribosome binding sites, including those in bacteriophage. However, we now know that bacteriophage ribosome binding sites have higher information content than chromosomal ones, so Rsequence should be somewhat lower. Indeed, the estimates are now:
9.28(+/-0.06) bits (flexible EcoGene12 refined evaluating EcoGene12 set)
10.17(+/-0.14) bits (flexible EcoGene12 evaluating verified set)
10.35(+/-0.16) bits (flexible verified evaluating verified set)
as given in this paper. The other change is that the entire genome has now been sequenced: Escherichia coli K-12 (NC_000913) is 4639675 bp and contains about 4242 genes, so Rfrequency = log2(4639675/4242) = 10.10 bits. This is remarkably close to the values for Rsequence! (EcoGene 12 contained 4122 genes, giving Rfrequency = 10.14 bits, which makes no difference to these results.)
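The Rfrequency values quoted above can be checked with a short Python calculation. It assumes the usual definition, Rfrequency = log2(G/gamma), where G is the genome size in base pairs and gamma is the number of sites (here, one site per gene); the function name is ours.

```python
import math

def rfrequency(genome_size, number_of_sites):
    """Bits needed to pick number_of_sites positions out of genome_size positions."""
    return math.log2(genome_size / number_of_sites)

genome_size = 4639675  # E. coli K-12 (NC_000913), in bp

print(round(rfrequency(genome_size, 4242), 2))  # current gene count
print(round(rfrequency(genome_size, 4122), 2))  # EcoGene 12 gene count
```

The two gene counts differ by 120 genes, yet the results differ by only about 0.04 bits, because the logarithm is insensitive to small changes in the number of sites.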


CORRECTION: Under Materials and Methods, page 225, third paragraph, the reference to Blattner points to reference 37 instead of 33. For some reason we wrote that reference as 'Blattner et al. (1997)' and it got typeset incorrectly. Such errors never occur in LaTeX, which we use all the time; they occur frequently when people get involved, as apparently happened in this case. Unfortunately we missed the alteration at the proof stage. I, for one, am so used to the perfect referencing mechanism of LaTeX that I don't even think about checking such things anymore. But with humans involved, nothing is safe.


Other pointers:

Tom Schneider



Schneider Lab

origin: 2000 Jan 31
updated: 2019 Jun 21