The range is the nucleic-acid region over which the sum of Rsequence(L) is taken. If the range is larger than the binding site, the Rsequence(L) fluctuations outside the site will cancel each other on the average. On the other hand, if the range is too small, information content will be lost. That is, one must be sure not to delete part of the site.
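The summation over a range can be illustrated with a minimal sketch. It assumes equiprobable bases (a maximum of 2 bits per position) and omits any small-sample correction, so it is a simplification and not the exact calculation used here. The function names `rsequence` and `rsequence_total` and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def rsequence(aligned, position):
    """Information (bits) at one aligned position.

    Assumption: the four bases are equiprobable in the background (2 bits
    maximum) and no small-sample correction is applied.
    """
    column = [seq[position] for seq in aligned]
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - entropy


def rsequence_total(aligned, start, end):
    """Sum Rsequence(L) over the half-open range start <= L < end."""
    return sum(rsequence(aligned, L) for L in range(start, end))


# Toy alignment (illustrative data only): two conserved positions,
# one partly conserved position, and one unconserved position.
aligned = ["ACGT", "ACGA", "ACCC", "ACTG"]

total_full = rsequence_total(aligned, 0, 4)   # range covers the whole site
total_short = rsequence_total(aligned, 1, 4)  # range cuts off position 0
```

With this toy alignment, shortening the range so that it excludes the first conserved position lowers the total by the 2 bits that position contributes, which is the loss described above.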
Determining the range of a site is difficult because experimental methods such as deletion analysis, chemical protection, or footprinting do not define the exact region contacted. It is dangerous to judge the range by eye from the sequences themselves or from the Rsequence(L) curves derived from a small sequence collection (note that some positions of Fig. 1c show the same information content as the 1-bit valley). To avoid these difficulties, we have added 5 bases to both sides of the largest range suggested by experimental data. Consequently, the results will be more variable than they might otherwise have been, but it is unlikely that part of a site will be lost. On the average the background will be cancelled, although in specific cases it may not be. In the cases where two sites are adjacent, we extend the range to just before the point of overlap. If adjacent sites do interpenetrate, then some of the information content is lost.
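The rule for choosing the range can be sketched as follows. This is an illustration under stated assumptions: coordinates are inclusive integers relative to the alignment, and the optional `left_limit` and `right_limit` arguments (hypothetical names) give the last position that does not yet overlap an adjacent site. The coordinates in the example are invented.

```python
def padded_range(exp_start, exp_end, pad=5, left_limit=None, right_limit=None):
    """Extend the largest experimentally suggested range by `pad` bases
    on each side, stopping just before an adjacent site if one is given.

    All coordinates are inclusive. `left_limit` and `right_limit` are
    hypothetical parameters marking the positions just before the point
    of overlap with a neighbouring site.
    """
    start = exp_start - pad
    end = exp_end + pad
    if left_limit is not None:
        start = max(start, left_limit)
    if right_limit is not None:
        end = min(end, right_limit)
    return start, end


# Isolated site: experimental data suggest positions -10 to +12.
isolated = padded_range(-10, 12)

# Same site with a neighbouring site beginning at +15, so the range
# may extend no further than +14.
adjacent = padded_range(-10, 12, right_limit=14)
```

In the second case the range is clipped at the neighbouring site, so any part of the site that interpenetrates its neighbour is excluded from the sum.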
When it is likely that a site is symmetrical, both the sequence and its complement are used in the analysis. This doubles the number of sequences available and refines the answer. If we had arbitrarily chosen an orientation for each sequence, we might have biased the results.
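A minimal sketch of this doubling is given below. It assumes that "complement" means the complementary strand read in the 5' to 3' direction (the reverse complement) and that each sequence is already aligned with the presumed axis of symmetry at the centre of the window, so that the complemented copy stays in register. The helper names are hypothetical.

```python
# Translation table for the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_complements(sequences):
    """Return each sequence followed by its reverse complement,
    doubling the size of the collection."""
    doubled = []
    for seq in sequences:
        doubled.append(seq)
        doubled.append(reverse_complement(seq))
    return doubled


# Illustrative data only.
sites = ["GATTC", "AACGTT"]
doubled = with_complements(sites)
```

Because both orientations of every sequence enter the collection, no arbitrary choice of orientation is needed, and the resulting position frequencies are symmetrical about the centre of the window.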