(ii) Graphs of R_sequence and Correction for Sampling Error

In Fig. 1, we show the curve R_sequence(L) for 61 (a), 17 (b) or 6 (c) HincII sites (GTPyPuAC; Roberts, 1983) chosen from the left end of bacteriophage T7 (Dunn and Studier, 1983). Here, the G's in the HincII sites have been placed at position L = 0, and R_sequence(L) was calculated for 20 bases on either side. There are two major 2-bit peaks of information content surrounding a 1-bit valley in curve (a). None of the curves goes to zero (the solid straight line) outside the sites, although they come close at several points. This effect is not small: for six sites (Fig. 1c) the background is at 0.44 bits per base, so that with sequences 41 bases long, R_sequence will be overestimated by 18 bits. A sampling error correction for H_s(L) (e(n), Appendix I) can be joined with H_g to give the final formula:

**Figure 1:** Information content, R_sequence(L) in bits/base, at various positions (L) in and around *Hin*cII sites [GT(T/C)(A/G)AC].
The numbers of bases at each position, n(B,L), are given. The sites were obtained starting at the left end of the bacteriophage T7 DNA sequence (Dunn and Studier, 1983), and only one orientation of each site was used. The left-most base in each site (G) was placed at position 0 in each case, and the sequence was examined for 20 nucleotides in each direction from this base. The solid lines mark zero without the sampling error correction; the dashed lines mark zero when the correction is made. The bars extend one standard deviation above and below R_sequence(L) and show the variation of the sampling error correction. (a) 61 sites, R_sequence = 10.7 ± 0.2 bits; (b) 17 sites, R_sequence = 9.9 ± 0.7 bits; (c) 6 sites, R_sequence = 8.3 ± 2.0 bits.
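The 2-bit peaks and the 1-bit valley seen in curve (a) follow directly from the base counts n(B,L) at each position. The sketch below is only an illustration under stated assumptions, not the authors' program: it takes H_g = 2 bits (equiprobable genomic bases), treats e(n) as a non-negative amount added to H_s(L) so that the correction lowers R_sequence(L), and uses hypothetical function names and idealized counts rather than the T7 data.

```python
import math

BASES = "ACGT"

def uncertainty_bits(counts):
    """H_s(L) in bits, from a dict of base counts n(B, L) at one position."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no sequences at this position")
    h = 0.0
    for base in BASES:
        c = counts.get(base, 0)
        if c > 0:
            f = c / n
            h -= f * math.log2(f)
    return h

def r_sequence_at(counts, e_n=0.0, h_g=2.0):
    """Information at one position, assumed here to be H_g - (H_s(L) + e(n))."""
    return h_g - (uncertainty_bits(counts) + e_n)

# Idealized counts for six hypothetical sites matching GTPyPuAC exactly,
# with the Py and Pu positions split evenly between their two bases.
ideal_site = [
    {"G": 6},
    {"T": 6},
    {"T": 3, "C": 3},   # Py
    {"A": 3, "G": 3},   # Pu
    {"A": 6},
    {"C": 6},
]

per_position = [r_sequence_at(c) for c in ideal_site]
total_uncorrected = sum(per_position)

# Same invariant position with the six-site background quoted in the text
# (0.44 bits per base) supplied as e(n).
corrected_invariant = r_sequence_at({"G": 6}, e_n=0.44)
```

Without the correction, an invariant position contributes 2 bits and an evenly split two-base position contributes 1 bit, which is the peak-valley-peak shape of the site.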

The standard deviation reported for each R_sequence is based on the variance of H_nb (Appendix I), which is sensitive to the number of sequence examples but not to the actual sequences. It is only a measure of the variance in the correction for small sample sizes; the variation in the information content of individual sites will be described elsewhere. The variance of the sampling correction is shown in all figures as a bar extending one standard deviation above and below the R_sequence(L) curve.
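The six-site figures quoted in this section (a background of 0.44 bits per base, an overestimate of about 18 bits over 41 positions, and a standard deviation of 2.0 bits) can be checked with a short calculation. What follows is a minimal sketch and should not be read as the procedure of Appendix I: it assumes four equiprobable bases (H_g = 2 bits), takes e(n) = H_g - E(H_nb), assumes the 41 positions are independent so that their variances add, and uses hypothetical function names.

```python
import math

def compositions(n, parts=4):
    """Yield every tuple of `parts` non-negative counts that sum to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest

def entropy_bits(counts):
    """Uncertainty in bits of the frequencies implied by a tuple of counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def h_nb_moments(n):
    """Exact mean and variance of H_nb for n bases drawn equiprobably."""
    total = 4 ** n
    mean = 0.0
    mean_sq = 0.0
    for counts in compositions(n):
        # Multinomial coefficient: number of sequences with these base counts.
        ways = math.factorial(n)
        for c in counts:
            ways //= math.factorial(c)
        p = ways / total
        h = entropy_bits(counts)
        mean += p * h
        mean_sq += p * h * h
    return mean, mean_sq - mean * mean

n_sites = 6
n_positions = 41  # positions -20 .. +20

mean_h, var_h = h_nb_moments(n_sites)
e_n = 2.0 - mean_h                         # about 0.44 bits per base
overestimate = n_positions * e_n           # about 18 bits
sd_total = math.sqrt(n_positions * var_h)  # about 2.0 bits
```

Because only the number of sites enters the calculation, the same call with 17 or 61 sites gives the corresponding correction and bar height without reference to the sequences themselves.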