Information theory was first described by Claude Shannon
in 1948 [5]. It sets out a mathematical
way to measure the choices made in a system. Although Shannon
concentrated on communications, the mathematics applies equally
well to other fields [6].
In particular, all of the theorems apply
in biology because the same constraints occur in biology
as in communication. For example, if I call you on the phone
and it is a bad connection, I may say `let me call you back'.
Then I hang up. I may even complain to the phone company
who then rips out the bad wires. So the process of
*killing the phone line*
is equivalent to
*selecting against a specific phenotype* in biology.

A second example is the copying of a key. In biology that's called `replication', and sometimes there are `mutations'. We go to a hardware store and have a key copied, but we get home only to find that it doesn't fit the door. When we return to the person who copied it, they throw the key away (kill it) and start fresh.

This kind of selection does not occur in straight physics.
It turns out that the
requirement of being able to make
distinct selections is critical to Shannon's channel capacity
theorem [7].
Shannon defined the channel capacity, *C* (bits per second),
as the maximum
rate at which information can be sent through a communications
channel in the presence of thermal noise.
The theorem has two parts.
The first part says that
if *R*, the rate at which one would like to send data,
is greater than *C*, one will fail:
at most *C* bits per second will get through.
The second part is surprising. It says that as long
as *R* is less than *or equal to* *C* the error rate
may be made as low as one desires.
The way that Shannon envisioned attaining this result was by encoding
the message before transmission
and decoding it afterwards.
Encoding methods have been explored in the ensuing 50 years
[8,9],
and their successful application is
responsible for the accuracy of our solar-system-spanning
communications systems.
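
For a band-limited channel with Gaussian noise, the capacity is given by Shannon's formula *C* = *W* log₂(1 + *P*/*N*), where *W* is the bandwidth and *P*/*N* is the signal-to-noise power ratio. A minimal sketch (the telephone-grade numbers below are illustrative values, not taken from the text):

```python
import math

def channel_capacity(bandwidth_hz, signal_power, noise_power):
    """Shannon capacity C = W * log2(1 + P/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + signal_power / noise_power)

# Illustrative telephone-grade channel: 3000 Hz bandwidth, 30 dB signal-to-noise.
snr = 10 ** (30 / 10)            # 30 dB is a power ratio of 1000
C = channel_capacity(3000, snr, 1.0)
print(round(C))                  # about 30,000 bits per second
```

This is the familiar reason that voice-grade telephone lines top out near 30 kilobits per second no matter how clever the modem.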

To construct the channel capacity theorem, Shannon assigned each message
to a point in a high dimensional space.
Suppose that we have a voltmeter
that can be connected by a cable to a battery through a switch.
The switch has two
states, on and off, and so we can send 1 bit of information.
In geometrical terms, we can record the state
(voltage)
as one of two points on a line, such as *X*=0 and *X*=1.
Suppose now that we send two pulses, *X* and *Y*.
This allows for 4 possibilities,
00, 01, 10 and 11, and these form a square on a plane. If we send
100 pulses, then any particular sequence will be a point in
a 100 dimensional space (hyperspace).
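
The geometry above can be sketched directly, mapping each pulse to one coordinate (the function name here is illustrative):

```python
from itertools import product

def message_point(bits):
    """Map a string of 1s and 0s to a corner of a hypercube."""
    return tuple(int(b) for b in bits)

# Two pulses X and Y: 4 possible messages, the corners of a square.
square = [message_point(''.join(p)) for p in product('01', repeat=2)]
print(square)        # [(0, 0), (0, 1), (1, 0), (1, 1)]

# 100 pulses: any particular sequence is a single point in 100-dimensional space.
point = message_point('10' * 50)
print(len(point))    # 100 coordinates, one per pulse
```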

If I send you a message, I first encode it as a string of 1s
and 0s and then send it down the wire. But the wire is hot
and this disturbs the signal [10,11].
So instead of *X* volts
you would receive *X* + *ε_X*,
a variation around *X*.
There would be a different variation for *Y*: *Y* + *ε_Y*.
*ε_X* and *ε_Y*
are independent because thermal noise does not correlate over time.
Because they are the sum of many random molecular impacts,
for 100 pulses the *ε*'s would have a Gaussian distribution
if they were plotted on one axis.
But because they are independent,
and the geometrical representation of independence is a right angle,
this represents 100 different directions in the high dimensional
space.
There is no particular direction in the high dimensional space
that is favored by the noise, so it turns out
that the original message will come to the receiver somewhere
on a sphere around the original point
[7,12,3].
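
This can be checked numerically: add independent Gaussian noise to each of 100 coordinates and measure how far the received point lands from the transmitted one (the noise standard deviation below is an arbitrary choice for the sketch):

```python
import math
import random

random.seed(1)
n = 100        # number of pulses, one dimension each
sigma = 0.1    # assumed per-pulse thermal noise standard deviation

# Transmit the all-zeros message; each coordinate is disturbed independently.
noise = [random.gauss(0.0, sigma) for _ in range(n)]

# Distance of the received point from the transmitted point.
radius = math.sqrt(sum(e * e for e in noise))
print(radius)  # close to sigma * sqrt(n) = 1.0
```

The received point lands very near the sphere of radius *σ*√*n* around the transmitted point, which is the noise sphere described above.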

What Shannon recognized is that these little noise
spheres have very sharply defined edges.
This is an effect of the high dimensionality:
in traversing from the center of the sphere to the surface
there are so many ways to go that essentially everything is on the surface
[13,14,12].
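
The sharpness of the edges follows from how volume scales with radius: an *n*-dimensional sphere's volume grows as *r*ⁿ, so the fraction of the volume lying in the outer 10% shell is 1 − 0.9ⁿ, which rushes toward 1 as *n* grows. A quick check:

```python
# Fraction of an n-dimensional sphere's volume in its outer 10% shell.
for n in (2, 10, 100):
    outer_shell = 1 - 0.9 ** n
    print(n, outer_shell)
# In 100 dimensions more than 99.99% of the volume lies
# within 10% of the surface.
```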
If one packs the message spheres
together so that they don't touch (with some error because they
are still somewhat fuzzy) then one can get the channel capacity.
The positions in hyperspace
that we choose for the messages are the *encoding*.
If we were to allow the spheres to intersect (by encoding in a poor way) then
the receiver wouldn't be able to distinguish overlapping messages.
The crucial
point is that we must choose non-overlapping spheres.
This only matters in human and animal communications systems
where failure can mean death.
It does not happen to rocks on the moon because there is
no consequence for `failure' in that case.
So Shannon's channel capacity theorem only applies when there is
a living creature associated with the system.
From this I conclude that Shannon is a biologist
and that his theorem is about biology.

The capacity theorem can be constructed for biological molecules
that interact or have different states [12].
This means that these molecular machines are capable of making precise
choices. Indeed, biologists know of many amazingly specific
interactions; the theorem shows that not only is this possible
but that
**biological systems can evolve to have as few
errors as necessary for survival**.