Sequences that are "similar" to true sites might compete with the true sites for binding to the recognizer. For example, the E. coli genome should contain about 1,000 EcoRI restriction enzyme sites (GAATTC), but that same genome should also contain about 18,000 sequences that differ from an EcoRI site at a single nucleotide. Site recognition by, and action of, EcoRI within E. coli must discriminate well enough against the more abundant similar sites to avoid a fragmented genome (Pingoud, 1985). Restriction enzymes have enough specificity to do this. It seems that many recognizers do not, because operator mutations may decrease binding by only 20-fold (Flashman, 1978), and most single base changes in promoters and ribosome binding sites decrease synthesis by 2- to 20-fold (Mulligan et al., 1984; Stormo, 1985). Binding to similar sites would degrade the function of the entire system. For repressors, binding to pseudo-operators would increase the chances of gratuitously inhibiting transcription, and the pseudo-operators may also serve as a sink for the recognizer. For ribosomes, binding to similar sites within mRNAs would lead to the expression of inactive protein fragments.
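The EcoRI estimate above can be reproduced with a short calculation. The sketch below is illustrative only: it assumes equiprobable, independent bases and a genome length of about 4.7 million base pairs, a figure not stated in this section. The names `expected_sites` and `GENOME_LENGTH` are hypothetical. Because GAATTC is its own reverse complement, scanning one strand is enough here.

```python
from math import comb

# Assumed E. coli genome length in base pairs (not given in the text).
GENOME_LENGTH = 4_700_000


def expected_sites(genome_length: int, site_length: int, mismatches: int = 0) -> float:
    """Expected number of positions that differ from a fixed site at exactly
    `mismatches` places, assuming equiprobable and independent bases."""
    # Choose which positions differ, then one of 3 alternative bases at each.
    variants = comb(site_length, mismatches) * 3 ** mismatches
    return genome_length * variants / 4 ** site_length


exact = expected_sites(GENOME_LENGTH, 6)        # true EcoRI sites
one_off = expected_sites(GENOME_LENGTH, 6, 1)   # one nucleotide removed

print(f"expected exact sites:        {exact:,.0f}")
print(f"expected one-mismatch sites: {one_off:,.0f}")
print(f"ratio:                       {one_off / exact:.0f}")
```

Under these assumptions the exact count comes out slightly above the rounded figure of 1,000, and the one-mismatch sequences outnumber the true sites by a factor of 18 (6 positions times 3 alternative bases), which is where the figure of about 18,000 comes from.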
There are several solutions to the problem of avoiding many similar sites when the recognizer has limited specificity (Linn and Riggs, 1975). It is possible that similar sites are hidden so that they do not interfere. For example, mRNA secondary structure could prevent ribosomes from inspecting sites similar to ribosome binding sites (Gold et al., 1981). Chromatin structure may occlude the DNA, so that repressors do not actually have as many potential binding sites as the number of base pairs. A related possibility is that similar sites do not exist in the genome. For example, if a repressor's binding site is composed of oligos that are relatively rare in the genome, the number of similar sites could be much smaller than expected from mononucleotide information alone. Any such special effects constrain the genome to particular oligonucleotide patterns. Discrimination against some oligonucleotides might account for the observed non-random distribution of oligonucleotides in the genome (Grantham et al., 1981; Stormo et al., 1982a; Fickett, 1982; Nussinov, 1984). Finally, von Hippel (1979) pointed out that recognizers could enhance site selectivity by binding to longer sites. If a repressor were to recognize a fifteen base pair long sequence in E. coli, not only could its site be unique, but there might not be any sites with one mismatch. When this strategy is used, one expects Rsequence to exceed Rfrequency. The sampling error correction we made may have led to an underestimate of Rsequence (see Fig. 1). It is also possible that Rsequence would be larger if it were calculated from longer oligos, rather than mononucleotides. We are usually prevented from making that measurement because the sampling error variance increases rapidly. Still, our results suggest that Rsequence is usually close to Rfrequency.
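The fifteen base pair argument can be checked with the same kind of arithmetic. This is a rough sketch under stated assumptions, not a calculation taken from the text: it assumes equiprobable, independent bases, a genome of about 4.7 million base pairs, and a non-palindromic site that can lie on either strand. It also takes Rfrequency to be log2(G/γ), with G the number of potential positions and γ the number of sites, and treats a fully specified base as contributing 2 bits. All names are hypothetical.

```python
from math import comb, log2

GENOME_BP = 4_700_000        # assumed genome length, not given in the text
POSITIONS = 2 * GENOME_BP    # a non-palindromic site can lie on either strand
SITE_LENGTH = 15


def expected_matches(positions: int, site_length: int, mismatches: int = 0) -> float:
    """Expected chance occurrences of a fixed site with exactly `mismatches`
    differences, assuming equiprobable and independent bases."""
    variants = comb(site_length, mismatches) * 3 ** mismatches
    return positions * variants / 4 ** site_length


exact_15 = expected_matches(POSITIONS, SITE_LENGTH)
one_off_15 = expected_matches(POSITIONS, SITE_LENGTH, 1)

# Information needed to single out one site among all positions (assumed form).
r_frequency = log2(POSITIONS / 1)
# Upper bound on information in a fully specified 15-base site: 2 bits per base.
r_sequence_max = 2 * SITE_LENGTH

print(f"expected chance exact matches: {exact_15:.4f}")
print(f"expected one-mismatch matches: {one_off_15:.2f}")
print(f"Rfrequency for a unique site:  {r_frequency:.1f} bits")
print(f"maximum Rsequence, 15 bases:   {r_sequence_max} bits")
```

With these assumptions fewer than one chance occurrence is expected even among the one-mismatch sequences, and the fully specified site carries several bits more than the roughly 23 bits needed to locate a single site, which is the sense in which Rsequence would exceed Rfrequency.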