# Biological Information Theory (BIT) gives a natural binding site cutoff

A Cancer Data Science Laboratory (CDSL) Zoom talk: Monday August 16 3:00 PM EST
https://nih.zoomgov.com/j/1614867690

Slides: cutofftalk.pdf
Video (from this website): cutofftalk.mp4

Abstract: Information theory is a branch of mathematics initiated by Claude Shannon in a famous 1948 paper. Shannon developed several important theorems about information, measured in bits. A bit is the choice between two equally likely possibilities. In 1986 Tom Schneider showed how to measure the information in a set of aligned DNA or RNA binding sites and discovered that the number of bits could be predicted from the size of the genome and the number of sites in the genome. The binding site information evolves to match the information needed to find the sites in the genome. While trying to understand why splicing donor and acceptor consensus sequences were identical, Mike Stephens and Tom invented the sequence logo graphic. Sequence logos represent a mathematical average of binding sites, so Tom eventually realized how to construct a theory of individual binding sites. The corresponding graphic is a sequence walker. Tom compared binding site information to binding energy and found a constant ratio. This led to the discovery that many biological systems are 70% efficient. To understand why this is so, a deeper dive into information theory was needed. In 1949 Shannon published a short, beautiful paper showing how messaging systems can be modeled by packed spheres in a high dimensional space. This led to the channel capacity equation and theorem (errors can be as low as desired) and eventually to essentially all modern communications. Tom found that molecules also have a capacity, with a similar capacity equation and corresponding theorem. Following Felker (1954), one can then derive a relationship between information and energy: the minimum energy dissipation required to gain a bit is Emin = Kb T ln 2 (joules/bit). Tom realized that the same equation can be derived from the Second Law of Thermodynamics, so the Emin equation is one of many versions of the Second Law (Jaynes1988).
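As a rough numerical illustration of the Emin relation, the following Python sketch evaluates Emin = Kb T ln 2 at a chosen temperature. The function names and the 310 K example temperature are assumptions made for illustration; they do not come from the talk.

```python
import math

# Physical constants (exact values in the 2019 SI definition).
BOLTZMANN_J_PER_K = 1.380649e-23
AVOGADRO_PER_MOL = 6.02214076e23
JOULES_PER_KCAL = 4184.0  # thermochemical kilocalorie


def emin_joules_per_bit(temperature_kelvin):
    """Minimum energy dissipation per bit gained: Emin = Kb T ln 2."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


def emin_kcal_per_mol_per_bit(temperature_kelvin):
    """The same bound expressed per mole of molecules, in kcal/mol per bit."""
    joules_per_mol = emin_joules_per_bit(temperature_kelvin) * AVOGADRO_PER_MOL
    return joules_per_mol / JOULES_PER_KCAL


# Assumed example temperature (about 37 degrees C); not taken from the talk.
BODY_TEMPERATURE_K = 310.0

emin_j = emin_joules_per_bit(BODY_TEMPERATURE_K)
emin_kcal = emin_kcal_per_mol_per_bit(BODY_TEMPERATURE_K)
print(f"Emin at {BODY_TEMPERATURE_K} K: {emin_j:.3e} J/bit")
print(f"Emin at {BODY_TEMPERATURE_K} K: {emin_kcal:.3f} kcal/mol per bit")
```

The per-mole form is included only because binding energies are commonly quoted in kcal/mol; the relation itself is the one stated in joules per bit.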
This can be used to understand the 70% efficiency of biological systems, but that's a story for another time. For this talk, the Second Law can be applied to the individual information measure. Sites above zero bits are predicted to be bound and sites below zero are not bound. Unlike the arbitrary "scoring" systems people use, this provides a theoretical basis for a natural binding site strength cutoff.
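The zero-bit cutoff can be sketched in Python as follows. This is a simplified illustration, not the method as published: it assumes equiprobable genomic bases (so each position can contribute at most 2 bits), replaces the small-sample correction with a pseudocount chosen here only to avoid log2(0), and uses a made-up set of aligned sites. All function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

# Hypothetical aligned binding sites, invented purely for illustration.
TOY_SITES = ["TATAAT", "TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]


def weight_matrix(sites, pseudocount=1.0):
    """Per-position weights in bits: 2 + log2(frequency of base).

    Assumes equiprobable bases in the genome. The pseudocount is an
    assumption of this sketch; it stands in for a proper treatment of
    small sample sizes and unseen bases.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for position in range(length):
        counts = Counter(site[position] for site in sites)
        matrix.append(
            {b: 2.0 + math.log2((counts[b] + pseudocount) / total) for b in BASES}
        )
    return matrix


def individual_information(sequence, matrix):
    """Sum the weights of the bases in one candidate site (bits)."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence length must match the weight matrix")
    return sum(column[base] for column, base in zip(matrix, sequence))


def predicted_bound(sequence, matrix):
    """Apply the natural cutoff: above zero bits is predicted to be bound."""
    return individual_information(sequence, matrix) > 0.0


matrix = weight_matrix(TOY_SITES)
for candidate in ("TATAAT", "GCGCGC"):
    bits = individual_information(candidate, matrix)
    print(f"{candidate}: {bits:+.2f} bits, bound={predicted_bound(candidate, matrix)}")
```

The point of the sketch is that the cutoff is not a tuned threshold: the sign of the individual information decides the prediction, so no arbitrary score boundary has to be chosen.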

BIO: Thomas D. Schneider was a winner in the 1974 Westinghouse Science Talent Search for work on an artificial life form. He then received the B.S. degree in biology from the Massachusetts Institute of Technology in 1978 and the Ph.D. degree in Molecular Biology from the University of Colorado, Boulder, CO, USA in 1984, followed by postdoctoral research in Boulder. He is now a Senior Investigator at the National Institutes of Health, National Cancer Institute, Center for Cancer Research, RNA Biology Laboratory in Frederick, MD, USA. He is interested in discovering the underlying mathematics of biology and his motto is "Living things are too beautiful for there not to be a mathematics that describes them." His PhD thesis showed that the information in genetic binding sites on DNA or RNA, measured in bits, is just sufficient for them to be found in the genome (Schneider.Ehrenfeucht1986), which he demonstrated using a computer model that evolves binding sites in a few seconds (Schneider.ev2000). After coming to NIH he invented the widely used sequence logo with then-high school student R. Michael Stephens (Schneider.Stephens1990). Logos show a graphical picture of the average information of binding sites; another invention is the sequence walker, which shows the information in single binding sites (Schneider-walker1997, Schneider-ri1997). He and his lab invented several patented nanotechnologies: an ATP-powered rotating molecular wheel, a molecular computer, the self-contained molecular DNA or RNA Medusa(TM) sequencer and a general molecular detector called a nanoprobe. His current work is to understand the relationship between information and energy, which is measured as the isothermal efficiency. His website is permanently pointed to by https://alum.mit.edu/www/toms

Related papers:

• primer: Explains uncertainty H formula
• logo: sequence logos
• ev: evolution of binding sites
• edmm: second law Emin = Kb T ln 2 (see also ccmm:)
• ri: individual information theory
• walker: sequence walkers graphic
• flexrbs: flexible sequence walkers, ribosome binding sites
• flexprom: flexible sequence walkers, sigma 70 promoters
• sigma38: flexible sequence walkers, sigma 38 promoters
• emmgeo: gives references 25 and 26 for rhodopsin cooling in picoseconds. 70% efficiency built from the ccmm and edmm paper concepts.
• zen: problems with consensus sequences
• shannonbiologist: How Claude Shannon inadvertently used a biological criterion to get the channel capacity!

Schneider Lab

origin:    2021 Aug 15
updated: 2021 Aug 17