How (and why) to find a needle in a haystack

The proper functioning of organisms depends on a complex game of hide-and-seek conducted inside the cell. Biologists are now beginning to draw on information theory to develop a better understanding of the rules of this game.

SUPPOSE you really had to find a needle in a haystack. How would you do it? And how big would the needle have to be for you to be in with a chance of finding it? These may seem like silly questions. But a version of this problem occurs at every moment within the cells of organisms. Here, "you" are a protein molecule with the vital job of switching genes on and off; the "haystack" is all of the DNA in the cell; and the "needle" is a particular fragment of DNA, often not longer than five or six genetic letters (out of, in the case of humans, roughly 3 billion) that the protein must find before it can do its job.
     Although scientists know a lot about individual genes and what they do, they know much less about the broader co-ordination of activities within the cell. As a first step, they are eager to find out how protein molecules conduct their vital search. The cells of any organism will not work properly unless a few hundred such molecules are able successfully to find their "binding sites" on the DNA. If even one were to fail, this might (depending on the gene it is in charge of) totally disrupt the cell’s activities. Other genes would not be switched on at the right times, and still others might never get switched off.
     So how does the protein organise its search? Again, take the haystack. Assume it is an ideal (ie, not a realistic) haystack, one in which it is as easy to check the bottom and the middle as it is to check the sides. In that case one way to search would be to select a spot randomly, inspect it, and if it is needle-free, to step back, select another spot and so on. Eventually, you would find the needle, but it would probably have taken you a lot of time.
     A better way would be to look around for a while in the vicinity of each selected spot. You would then waste less time and energy stepping to and from the haystack. And if (this would be a truly unrealistic haystack) the haystack were one-dimensional, so that you could only move forwards and backwards along it, your chance of success would actually be quite high, even if the direction of each step was random.
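     The benefit of looking around each spot can be illustrated with a toy simulation. The sketch below is not taken from any experiment: it assumes a circular, one-dimensional haystack of `n_sites` positions, a searcher that lands at a random position, takes `slide_steps` random steps to the left or right, and then hops to a new random position. All of the names and numbers are illustrative assumptions.

```python
import random


def search(n_sites, target, slide_steps, rng):
    """Search a circular 1-D track for `target`.

    Returns (landings, inspections): how many times the searcher had to
    step back and pick a new random spot, and how many sites it looked at.
    """
    landings = 0
    inspections = 0
    while True:
        landings += 1
        position = rng.randrange(n_sites)  # select a spot at random
        for _ in range(slide_steps + 1):
            inspections += 1
            if position == target:
                return landings, inspections
            # Take one random step forwards or backwards along the track.
            position = (position + rng.choice((-1, 1))) % n_sites


def average_landings(slide_steps, trials=200, n_sites=1000, seed=1):
    """Average number of landings needed over many independent searches."""
    rng = random.Random(seed)
    total = sum(
        search(n_sites, n_sites // 2, slide_steps, rng)[0]
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    for slide_steps in (0, 10, 100):
        print(slide_steps, average_landings(slide_steps))
```

     With `slide_steps` set to zero the searcher behaves like the first strategy (inspect one spot, step back, try again). Larger values trade fewer trips to and from the haystack for some repeated inspection of the same neighbourhood.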
     Although there is some evidence that protein molecules do indeed search DNA as if it were a one-dimensional haystack, the absence of good techniques for observing individual biological molecules in action has made this hard to verify. Recently, however, Carlos Bustamante, a biophysicist at the Howard Hughes Medical Institute at the University of Oregon, in Eugene, and his colleagues there and at the University of California, Santa Barbara, have found a way to observe proteins as they search for their binding sites.
     To do this, the researchers adapted a device known as an atomic-force microscope. Like a record player, this has a stylus; if a molecule is passed underneath its tip, the stylus bobs up and down over the lumps and bumps of the atoms. A laser is bounced off the tip of the stylus to amplify the minute ups and downs so that the shape of the molecule can be made into an image, and even into a moving image.
     Using such a device to look at active biological molecules is difficult, however. The tip of the stylus sticks to the molecules, and it is hard to look at the molecules when they are in liquid (their natural state) because it is difficult to get them to stay put. Dr Bustamante and his colleagues got around the first problem by coating the tip with carbon. They got around the second by putting the molecules of interest (a protein called RNA polymerase and a fragment of DNA) into a solution and placing them on top of a perfectly flat crystal. The molecules settle on the crystal and can then be spied on.
     RNA polymerase plays a big part in the process of transcription, the first step in the switching on of a gene, when the information carried in the DNA is copied into a molecule called RNA. In order to start this process, the polymerase must first find a binding site known as a "promoter", a fragment of DNA in front of the gene.
     To see how the polymerase seeks out the promoter, Dr Bustamante and his colleagues played a cruel trick. The sequence of DNA they selected did not actually contain a promoter. When they came to watch the film produced by their special microscope, they saw the polymerase land on the DNA, and then slide up and down along it, jostled randomly in either direction by the thermal energy of the solution. From time to time, it would detach itself, and then settle somewhere else and start hunting again, alas in vain.
     But what does the promoter have to be like for the polymerase to find it? That, to return briefly to the farmyard, depends both on the size of the haystack (that is, on the size of the genome, ie, how much DNA there is to sort through) and on the number of needles (that is, on how many binding sites for one particular type of protein there are). For a genome of a given size, a binding site will have to be much more conspicuous if it is alone than if it is one of many. It will, in other words, need to contain more information.
     DNA is a particularly apt material for information theorists to sink their teeth into. The molecule carries a signal encoded in an alphabet of four genetic "bases", A, T, C and G. At any given position along the molecule, one of these letters must be present. But which?
     An information theorist would answer thus: if each letter is equally likely to occur, the uncertainty is complete and a searcher does not know which of the letters to expect. But if the same letter always appears in a given position in the molecule, there is no uncertainty.
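     In information theory this uncertainty is conventionally measured in bits, using Shannon's formula. The short sketch below applies it to the two extremes just described; the function name is an illustrative assumption.

```python
from math import log2


def uncertainty_bits(probabilities):
    """Shannon uncertainty, in bits, of one position in the molecule.

    `probabilities` holds the chance of seeing each of the four letters.
    Letters that never occur contribute nothing to the sum.
    """
    return sum(-p * log2(p) for p in probabilities if p > 0)


# Each letter equally likely: the uncertainty is complete.
print(uncertainty_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0

# The same letter always appears: there is no uncertainty.
print(uncertainty_bits([1.0, 0.0, 0.0, 0.0]))  # 0.0
```

     Two bits is therefore the most that a single position in a four-letter alphabet can carry.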
     A researcher trying to find out how much information a protein needs in order to recognise a binding site amidst all the noise and jostling of a busy cell can try to answer in two different ways. One way is to predict exactly how much information the site has to contain in order for a protein to detect it. This of course depends only on the frequency of the binding sites in the genome, that is, on the size of the genome and on how many such sites it contains. The second is to examine the site and discover what it contains.
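     The first kind of prediction can be sketched in a few lines. The calculation below assumes that, before the search begins, every position in the genome is an equally likely place for a site to start; under that assumption the information needed to single out the sites is the base-two logarithm of the genome size divided by the number of sites. The function name is illustrative.

```python
from math import log2


def bits_needed(genome_size, number_of_sites):
    """Information, in bits, needed to pick out the binding sites.

    Assumes every position in the genome is, a priori, an equally
    likely place for a site to begin.
    """
    return log2(genome_size / number_of_sites)


# A single site in a genome of roughly 3 billion letters.
print(round(bits_needed(3_000_000_000, 1), 1))

# The same genome with a thousand sites for the same protein.
print(round(bits_needed(3_000_000_000, 1000), 1))
```

     Each tenfold increase in the number of sites lowers the requirement by a little over three bits, which is the sense in which a lone needle must be more conspicuous than one of many.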
     This second kind of analysis is done by lining up lots of known binding sites for a particular protein, comparing them position by position, and so finding out which letter is most likely to occur at which position, and how probable it is that a different letter may sometimes crop up instead. This tells you how much information is present at each position: variable positions contain little information while constant positions contain lots. To find out how much information is contained in an average binding site, just add up the information for each position.
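     A minimal sketch of this line-up-and-add procedure follows. It takes the information at a position to be two bits (the maximum for four letters) minus the uncertainty observed there, and it ignores the correction that real analyses apply when only a few sites are available. The example sequences are invented for illustration and are not real binding sites.

```python
from collections import Counter
from math import log2


def position_information(column):
    """Information, in bits, at one position of the aligned sites."""
    counts = Counter(column)
    total = len(column)
    uncertainty = sum(
        -(count / total) * log2(count / total) for count in counts.values()
    )
    return 2.0 - uncertainty  # 2 bits is the maximum for four letters


def site_information(aligned_sites):
    """Per-position information for a list of equal-length sites."""
    return [position_information(column) for column in zip(*aligned_sites)]


if __name__ == "__main__":
    # Hypothetical aligned sites, made up for this example.
    sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
    per_position = site_information(sites)
    print([round(bits, 2) for bits in per_position])
    print("average site:", round(sum(per_position), 2), "bits")
```

     Positions where every site agrees score the full two bits; positions where the letters vary score less.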
     But since some variation is permitted at most positions within a binding site, the total number of different possible versions of a site is often large. And the trouble is that neither of the above approaches tells you much about any particular binding site that a protein may be looking for. For instance, is the binding site just a rare variant, or does it contain mutations that have reduced the information so much that the protein can no longer recognise it?
     Recently, Dr Schneider has addressed this problem, too. Using some mathematical wizardry, he has taken the average information of a binding site, and worked backwards to evaluate any individual example. This procedure turns out to have surprising power. It means that Dr Schneider can tell whether or not the signal from any particular binding site is too noisy to be useful; in other words, whether it contains a damaging mutation. It allows him to design novel binding sites that will contain enough information. And it allows Dr Schneider to pretend he is a protein looking for a binding site. Using his technique, he can wander along known lengths of DNA (on his computer, not in solution). In doing so, he has already discovered several binding sites that no one knew were there.
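     The article does not spell out the mathematics, so the following is only a sketch of how such a scan might work, not Dr Schneider's actual program. It assumes that each letter at each position is given a weight of two bits plus the base-two logarithm of how often that letter appears there among the known sites, and that a candidate's score is the sum of its weights. The zero-bit threshold, the function names and the example sequences are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def weight_matrix(aligned_sites):
    """Per-position weights, in bits, built from known aligned sites."""
    matrix = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column)
        matrix.append({
            # A letter never seen at this position gets minus infinity;
            # real analyses soften this with a small-sample correction.
            base: 2.0 + log2(counts[base] / total) if counts[base] else float("-inf")
            for base in "ACGT"
        })
    return matrix


def individual_information(sequence, matrix):
    """Score one candidate site by adding up its per-position weights."""
    return sum(weights[base] for base, weights in zip(sequence, matrix))


def scan(dna, matrix, threshold=0.0):
    """Wander along `dna` and report windows scoring above `threshold`."""
    width = len(matrix)
    hits = []
    for start in range(len(dna) - width + 1):
        window = dna[start:start + width]
        score = individual_information(window, matrix)
        if score > threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    # Hypothetical known sites and a made-up stretch of DNA.
    known_sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
    matrix = weight_matrix(known_sites)
    for hit in scan("CCGTATAATGCGGATGATCC", matrix):
        print(hit)
```

     Averaged over the known sites themselves, these individual scores come back to the information of the average site, which is the sense in which the method works backwards from the average to the individual.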

This article was published in The Economist April 5th-11th 1997, British version: p. 105-107, American version: p. 73-75, Asian version: p. 79-81.

Permission was granted to post this article at the National Cancer Institute (USA).

Schneider Lab.
origin: 1997 October 28
updated: 2012 Feb 09