(b) Formula for R_frequency

If a genome contains G bases, there are M=G ways that its sequence can be aligned or G potential binding sites. If these are all equally likely, then P_i = 1/G and formula (1) reduces to:

$H_{gf} = \log_2 G$

(7)

If the genome contains $\gamma$ sites, we assume that the probabilities of binding to each site are equal and that the probability of significant binding to other sequences is zero. This allows formula (1) to be reduced to:

$H_{sf} = \log_2 \gamma$

(8)

(One property of H is that it is at a maximum when the probabilities are equal. Thus both H_gf and H_sf are maxima.)
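The maximum property can be checked numerically. The sketch below assumes that formula (1) is the usual Shannon uncertainty, $H = -\sum_i P_i \log_2 P_i$; the function name `uncertainty_bits` and the example distributions are hypothetical and chosen only for illustration.

```python
import math

def uncertainty_bits(probs):
    """Shannon uncertainty in bits (assumed form of formula (1)).

    Terms with zero probability are skipped, since p * log2(p) -> 0 as p -> 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

M = 8  # hypothetical number of equally likely positions
uniform = [1 / M] * M
# Hypothetical unequal distribution over the same 8 positions (sums to 1).
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02]

h_uniform = uncertainty_bits(uniform)  # equals log2(M) for equal probabilities
h_skewed = uncertainty_bits(skewed)    # strictly smaller than log2(M)

print(f"uniform: {h_uniform:.3f} bits, log2(M) = {math.log2(M):.3f} bits")
print(f"skewed:  {h_skewed:.3f} bits")
```

With equal probabilities the sum collapses to the logarithm of the number of outcomes, which is the step used to obtain equations (7) and (8).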

The decrease in positional uncertainty during binding or alignment is the difference:

R_frequency	=	$H_{gf} - H_{sf}$
	=	$\log_2 (G / \gamma) = -\log_2 f$	(9)

where $f = \gamma / G$, the number of sites divided by the number of potential sites, is the frequency of sites in the genome.
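Equation (9) is straightforward to evaluate. The following is a minimal sketch, not the author's program: the function name `r_frequency`, its argument names, and the example numbers are assumptions made for illustration.

```python
import math

def r_frequency(potential_sites, sites):
    """Return R_frequency in bits, following equation (9).

    potential_sites -- G, the number of potential binding sites
    sites           -- the number of sites among those G positions
    """
    if not 0 < sites <= potential_sites:
        raise ValueError("need 0 < sites <= potential_sites")
    f = sites / potential_sites  # frequency of sites in the genome
    return -math.log2(f)

# Hypothetical example: 4 sites among 1024 potential sites.
example_bits = r_frequency(1024, 4)
print(example_bits)  # 8.0
```

The guard on the arguments reflects the assumption behind equation (8): every site is one of the G potential positions, so the frequency lies between 1/G and 1.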

R_frequency is the amount of information needed to pick the $\gamma$ sites out of G possible sites. As the number of sites in the genome increases, the information needed to find a site decreases. As long as the simplifying assumption for equation (8) holds and $\gamma$ is restricted to the number of known sites (that is, $\gamma$ is not an estimate), equation (9) gives an upper bound on R_frequency since some sites may exist that are not now known. A second property of this formula is that R_frequency is insensitive to small changes in G or $\gamma$. The frequency of sites must change by a factor of two to alter R_frequency by only one bit. The largest possible value of R_frequency occurs for a single site in the genome: $\log_2 G$. (For E. coli, R_frequency = 22.9 bits in this case.) On the other hand, if all positions in the genome were sites, one would not need any information to find them, and R_frequency would be zero.
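These properties can be illustrated with a few evaluations of equation (9). In the sketch below, the value of G is an assumption: it is back-calculated so that a single site gives roughly the 22.9 bits quoted for E. coli, and is not taken from the text. The site counts 100, 105, and 200 are hypothetical.

```python
import math

def r_frequency(potential_sites, sites):
    """R_frequency in bits: log2 of potential sites over actual sites."""
    return math.log2(potential_sites / sites)

G = 7.8e6  # assumed number of potential sites (chosen so log2(G) is about 22.9)

single_site = r_frequency(G, 1)  # largest possible value, log2(G)
all_sites = r_frequency(G, G)    # every position is a site: zero bits

# Doubling the number of sites lowers R_frequency by exactly one bit.
drop_per_doubling = r_frequency(G, 100) - r_frequency(G, 200)

# A 5% change in the number of sites moves R_frequency by well under 0.1 bit.
small_change = r_frequency(G, 100) - r_frequency(G, 105)

print(f"single site:       {single_site:.1f} bits")
print(f"all positions:     {all_sites:.1f} bits")
print(f"drop per doubling: {drop_per_doubling:.3f} bits")
print(f"5% more sites:     {small_change:.3f} bits")
```

Because the dependence is logarithmic, a few undiscovered sites among many known ones change the result only slightly, which is why equation (9) is a useful upper bound even when the count of known sites is incomplete.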

The number of potential binding sites (G) is twice the number of base pairs in a DNA genome because there are two orientations for a recognizer to bind at each base pair. A symmetrical recognizer on DNA has two ways to bind each base pair, and both ways are used at a binding site. Here, $\gamma$ is twice the number of conventional binding sites. An asymmetric recognizer on DNA will use only one orientation at any particular base pair. In this case, $\gamma$ is equal to the number of binding sites. On RNA there is only one possible orientation. Thus G and $\gamma$ reflect not only the genome size and number of binding sites but also the symmetries of the recognizer and nucleic acid.
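The counting rules in this paragraph can be written down directly. The sketch below is an illustration under stated assumptions: the function name `site_counts`, its parameters, and the example genome (1000 positions, 10 conventional sites) are hypothetical, and a symmetric recognizer on RNA is treated as an error because the text gives only one orientation for RNA.

```python
import math

def site_counts(positions, conventional_sites, nucleic_acid="DNA", symmetric=False):
    """Return (G, sites): potential sites and actual sites, counting orientations.

    positions          -- base pairs (DNA) or bases (RNA) in the genome
    conventional_sites -- binding sites as conventionally counted
    """
    if nucleic_acid == "RNA":
        if symmetric:
            raise ValueError("only one orientation is possible on RNA")
        return positions, conventional_sites
    if nucleic_acid != "DNA":
        raise ValueError("nucleic_acid must be 'DNA' or 'RNA'")
    G = 2 * positions  # two orientations at each base pair
    sites = 2 * conventional_sites if symmetric else conventional_sites
    return G, sites

def r_frequency(G, sites):
    return math.log2(G / sites)

# Hypothetical genome: 1000 positions with 10 conventional sites.
symmetric_dna = site_counts(1000, 10, "DNA", symmetric=True)    # (2000, 20)
asymmetric_dna = site_counts(1000, 10, "DNA", symmetric=False)  # (2000, 10)
rna = site_counts(1000, 10, "RNA")                              # (1000, 10)

r_symmetric = r_frequency(*symmetric_dna)
r_asymmetric = r_frequency(*asymmetric_dna)
print(f"symmetric: {r_symmetric:.2f} bits, asymmetric: {r_asymmetric:.2f} bits")
```

Under these rules an asymmetric recognizer on DNA needs exactly one bit more than a symmetric one with the same number of conventional sites, since it must also select one of the two orientations.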

Tom Schneider
2002-10-16