The Small Sample Correction for Uncertainty and Information Measures

Question: What is the small sample correction?

Glossary definition of small sample correction
Examples of the problem to help your intuition:
- Suppose you have 6 random DNA sequences, consisting of the 4 nucleotides, 20 bases long. Then what is the probability that one will compute zero information at some position? ... Well, the probability is zero because one can't get exactly equiprobable bases from 6 sequences. So one will "see" some "information" in the sequences when there is none there. This is one of the reasons that consensus sequences are so dangerous to use! See: Consensus Sequence Zen, figure 3.
- In the case above, the best one could have in the first 4 sequences is 4 bases that are all different. The fifth base would have to match one of these 4, and 1/4 of the time the sixth base would match that same one. In that case 50% of the sequences (3 of 6) would have "A", for example, at that position. Again, one would see "information" where there is none.
- In proteins the maximum information is log₂20 = 4.32 bits. If you have 58 protein sequences, you don't know what the NEXT sequence will contain; it may have a different amino acid there. So the best estimate for the information is lower, about 4.08 bits.
So the small sample correction is a correction to the information or the uncertainty measure to account for this effect. In terms of statistics, the uncertainty measure is biased when there are small numbers of samples.
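
The first example can be checked by brute force. The sketch below is an illustration only: the names are made up for this page and are not taken from any of the programs mentioned here. It assumes equiprobable bases and an uncertainty before of 2 bits, enumerates every possible column of 6 bases, and reports the apparent information with no correction applied.

```python
from collections import Counter
from itertools import product
from math import log2

BASES = "ACGT"
N_SEQUENCES = 6          # the 6 random sequences of the example
H_BEFORE = log2(len(BASES))  # 2 bits, assuming equiprobable bases


def uncertainty(column):
    """Shannon uncertainty (bits) computed from observed frequencies."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


# Every possible column of 6 bases; all 4**6 are equally likely
# when the sequences are random and the bases equiprobable.
apparent = [H_BEFORE - uncertainty(col)
            for col in product(BASES, repeat=N_SEQUENCES)]

zero_columns = sum(1 for r in apparent if r < 1e-12)
min_information = min(apparent)
mean_information = sum(apparent) / len(apparent)

print("columns with zero apparent information:", zero_columns)
print("smallest apparent information (bits):", round(min_information, 4))
print("average apparent information (bits):", round(mean_information, 4))
```

No column reaches zero, and the average over all columns is the bias that the small sample correction has to remove.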

Philosophically, the uncertainty measure devised by Shannon is for population probabilities. But if we are to work with real data, the best we have are measured frequencies. The small sample correction is used to correct for this switch.

Question: I have a question regarding the effect small sampling has on information content in motif alignments. If a distribution of observed is generated from a small sample then often the sequence logo is noisy (poor choice of word).
No, it's a reasonable choice of words. It fits the idea of noise on a signal, and it shows up the same way. Also, the logo is always noisy. You can see this by increasing the range of a sequence logo way beyond the ends of the site. We typically start in the range -200 to +200.

Question: I seem to remember reading that samples of less than 30 should be corrected.
I apply corrections to all samples. The correction is, of course, smaller for larger samples. In my programs (for 4 symbol DNA) I switch between an approximate correction (n > 50) and a more precise but computationally expensive algorithm (n <= 50).
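
To show how the correction shrinks with sample size, here is a minimal sketch. It assumes the common large-sample form of the bias, (s - 1) / (2 n ln 2) bits for s symbols and n samples; the function names are hypothetical and the programs named on this page may differ in detail.

```python
from math import log, log2


def approximate_correction(n, s=4):
    """Assumed large-sample bias of the uncertainty, in bits.

    n is the number of sequences, s the number of symbols
    (4 for DNA, 20 for proteins).
    """
    return (s - 1) / (2.0 * log(2) * n)


def corrected_maximum(n, s):
    """Largest information one should expect to measure from n samples."""
    return log2(s) - approximate_correction(n, s)


# The correction falls off as 1/n for DNA.
for n in (10, 30, 50, 100, 1000):
    print(n, round(approximate_correction(n), 4))

# The protein example: 20 amino acids, 58 sequences.
protein_estimate = corrected_maximum(58, 20)
print("protein estimate (bits):", round(protein_estimate, 2))
```

With 20 symbols and 58 sequences this lands close to the protein figure quoted in the examples at the top of the page.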

Question: How is the correction done?
See:
```
@article{Schneider1986,
author = "T. D. Schneider
 and G. D. Stormo
 and L. Gold
 and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide sequences",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431",
year = "1986"}
```
especially Figure 1 and the Appendix.

Programs. A program that does the correction (for 4 symbols) and produces a table is calhnb. The correction is used by rseq and alpro.

Correction for Information Measure. The small sample correction applies to the uncertainty function. It should be applied to both the before and after uncertainties for computing the information. In most cases the uncertainty before is computed from a high number of samples or is assumed to be from equiprobable samples, in which case the correction is unnecessary.
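
As a sketch of how this fits together for one position of a DNA alignment: the code below assumes the uncertainty before is exactly 2 bits (equiprobable bases, so it needs no correction) and adds the correction to the uncertainty after. For n <= 50 it computes the expected uncertainty exactly by enumerating all base compositions; for larger n it uses the assumed approximation (s - 1) / (2 n ln 2). All names, including the `threshold` parameter, are hypothetical and this is not the code of calhnb, rseq or alpro.

```python
from math import factorial, log, log2


def uncertainty(counts):
    """Shannon uncertainty (bits) from a tuple of symbol counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def compositions(n, s):
    """Yield every tuple of s non-negative counts that sums to n."""
    if s == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, s - 1):
            yield (first,) + rest


def exact_correction(n, s=4):
    """log2(s) minus the exact expected uncertainty of n equiprobable draws."""
    expected = 0.0
    for counts in compositions(n, s):
        ways = factorial(n)
        for c in counts:
            ways //= factorial(c)          # multinomial coefficient
        expected += ways / s ** n * uncertainty(counts)
    return log2(s) - expected


def approximate_correction(n, s=4):
    """Assumed large-sample form of the same correction."""
    return (s - 1) / (2.0 * log(2) * n)


def information(counts, threshold=50):
    """Corrected information (bits) at one position."""
    n, s = sum(counts), len(counts)
    if n <= threshold:
        e = exact_correction(n, s)
    else:
        e = approximate_correction(n, s)
    h_before = log2(s)                     # assumed equiprobable, uncorrected
    h_after = uncertainty(counts) + e      # corrected uncertainty after
    return h_before - h_after


print(information((6, 0, 0, 0)))    # fully conserved, 6 sequences
print(information((2, 2, 1, 1)))    # as flat as 6 sequences can be
print(information((100, 0, 0, 0)))  # fully conserved, 100 sequences
```

A column that happens to look flatter than chance, such as the second one, comes out slightly negative after correction. That is expected: the correction removes the average bias, not the noise at each position.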

Thanks to Mark Schreiber (Bioinformatics, AgResearch Invermay, PO Box 50034, Mosgiel, New Zealand, PH: +64 3 489 9175, mark.schreiber@agresearch.co.nz) for the questions (above) that inspired this page.

Question: Is a 4 bit sequence conservation more significant than 3 bits?

The significance of N bits depends on the number of sequences input, so it cannot be judged from the number of bits alone. For example, if there are only 10 sequences, a 0.25 bit sequence conservation at one base position will be less significant (p = 0.1066) than if there are 20 sequences that have a conservation of 0.25 bits (p = 0.0041) (Schneider1986). The significance can be computed for the average (as shown by a sequence logo), but a method for individuals (as shown by sequence walkers) has not been published. The significance is related to the small sample problem because the sampling determines the error in the information measure.

Schneider Lab
origin: 2001 May 31
updated: 2011 Jul 14