Keys select one lock in a set of locks and so are capable (with a little motive force from us) of making a 'choice'. The base 2 logarithm of the number of choices is the number of bits. (More details about information theory are described in a Primer.)
In a similar way, there are many proteins that locate and stick to specific spots on the genome. These proteins turn genes on and off and perform many other functions. When one collects the DNA sequences from these spots, which are typically 10 to 20 base pairs long, one finds that they are not all exactly the same. Using Shannon's methods, we can calculate the amount of information in the binding sites, and I call this Rsequence because it is a rate of information measured in units of bits per site as computed from the sequences. (See figure 1.1 for the details of this computation.)
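As a rough illustration of this kind of calculation, here is a minimal Python sketch. It assumes the sites are already aligned and of equal length, that the four bases are equally likely in the background (so the maximum uncertainty is 2 bits per position), and it leaves out any small-sample correction. The function name rsequence and the toy sequences are made up for illustration; they are not data from real binding sites.

```python
import math
from collections import Counter

def rsequence(sites):
    """Sum over positions of (2 - H) in bits, for equal-length aligned DNA sites.

    Assumptions (not from the text): equiprobable background bases and
    no small-sample correction.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        n = sum(counts.values())
        # Shannon uncertainty at this position, in bits
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        # Information is the drop from the 2-bit maximum for four bases
        total += 2.0 - h
    return total

# Hypothetical toy alignment, for illustration only
toy_sites = ["ACGT", "ACGA", "ACTT", "ACGT"]
print(f"Rsequence = {rsequence(toy_sites):.2f} bits per site")
```

A position where every site has the same base contributes the full 2 bits, and a position where all four bases appear equally often contributes nothing.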
For example, in our cells the DNA is copied to RNA and then big chunks of the RNA are cut out. This splicing operation depends on patterns at the two ends of the segment that gets removed. One of the end spots is called the donor and the other is called the acceptor. Let's focus on the acceptor because the story there is simple (what's happening at the donor is beyond the scope of this paper). Acceptor sites can be described by about 9.4 bits of information on average. Why that number?
A way to answer this is to see how the information is used. In this case there are acceptor sites with a frequency of roughly one every 812 positions along the RNA, on average. So the splicing machinery has to pick one spot from 812 spots, or log2 812 = 9.7 bits; this is called Rfrequency (bits per site). So the amount of pattern at a binding site (Rsequence) is just enough for it to be found in the genome (Rfrequency). Notice also that we are relying on the capacity theorem, which says that it is possible for the sites to be distinguished from the rest of the genome.
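The arithmetic behind Rfrequency can be checked directly: it is the base 2 logarithm of the number of positions the machinery must choose among. The short Python sketch below uses only the two numbers given in the text (812 positions per site and 9.4 bits); the function name rfrequency is chosen here for illustration.

```python
import math

def rfrequency(positions_per_site):
    """Bits needed to pick one site out of this many possible positions."""
    return math.log2(positions_per_site)

positions_per_site = 812   # from the text: about one acceptor per 812 positions
r_sequence = 9.4           # bits per site, from the text

r_frequency = rfrequency(positions_per_site)
print(f"Rfrequency = {r_frequency:.2f} bits per site")
print(f"Rsequence / Rfrequency = {r_sequence / r_frequency:.2f}")
```

The ratio comes out close to 1, which is the sense in which the pattern at the sites is just enough for them to be found.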