Question: What is the small sample correction?
The small sample correction is a correction to the information (or uncertainty) measure that accounts for the effect of having only a limited number of samples. In statistical terms, the uncertainty measure is biased when the number of samples is small.
Philosophically, the uncertainty measure devised by Shannon is for population probabilities. But if we are to work with real data, the best we have are measured frequencies. The small sample correction compensates for substituting frequencies for probabilities.
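The difference between probabilities and frequencies can be seen in a small example. This is only an illustrative sketch: the function name uncertainty_bits and the counts are made up for the example and do not come from the lab's programs. With 4 symbols the uncertainty computed from frequencies can never exceed 2 bits, so sampling fluctuations can only pull it down, and on average it underestimates the population value.

```python
import math

def uncertainty_bits(counts):
    """Shannon uncertainty in bits, computed from observed counts.

    Hypothetical helper for illustration: the observed frequencies
    stand in for the unknown population probabilities.
    """
    n = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            f = c / n
            h -= f * math.log2(f)
    return h

# Population assumed equiprobable over A, C, G, T: true uncertainty is 2 bits.
print(uncertainty_bits([25, 25, 25, 25]))  # 2.0

# A plausible sample of only 4 sequences drawn from that same population:
print(uncertainty_bits([2, 1, 1, 0]))      # 1.5
```

The second sample appears to carry 0.5 bits of conservation even though the population has none. That apparent signal is the bias the correction removes.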
Question: I have a question regarding the effect small sampling has on information content in motif alignments. If a distribution of observed is generated from a small sample then often the sequence logo is noisy (poor choice of word).
No, it's a reasonable choice of words. It fits the idea of a noise on a signal, and shows up the same way. Also, the logo is always noisy. You can see this by increasing the range of a sequence logo way beyond the ends of the site. We typically start in the range -200 to +200.
Question: I seem to remember reading that samples of less than 30 should be corrected.
I apply corrections to all samples. The correction is, of course, smaller for larger samples. In my programs (for 4 symbol DNA) I switch between an approximate correction (n > 50) and a more precise but computationally expensive algorithm (n <= 50).
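The two regimes can be sketched as follows. This is not the code of calhnb or rseq, and all function names are hypothetical. It assumes equiprobable bases, takes the "precise" method to be a full enumeration of every way n sequences can be divided among the 4 symbols, and takes the "approximate" correction to be the standard first-order bias term (s - 1)/(2 n ln 2) bits for s symbols. Only the switch at n = 50 comes from the text above.

```python
import math

def expected_uncertainty_exact(n):
    """Exact mean sample uncertainty (bits) for n >= 1 draws from
    4 equiprobable symbols, by enumerating all compositions."""
    total = 0.0
    n_fact = math.factorial(n)
    for na in range(n + 1):
        for nc in range(n - na + 1):
            for ng in range(n - na - nc + 1):
                nt = n - na - nc - ng
                counts = (na, nc, ng, nt)
                # Number of distinct sequences with this composition.
                ways = n_fact
                for k in counts:
                    ways //= math.factorial(k)
                prob = ways / 4 ** n
                h = -sum((k / n) * math.log2(k / n) for k in counts if k > 0)
                total += prob * h
    return total

def correction_exact(n):
    """Expected shortfall of the sample uncertainty below 2 bits."""
    return 2.0 - expected_uncertainty_exact(n)

def correction_approx(n, symbols=4):
    """First-order approximation of the same shortfall (assumed form)."""
    return (symbols - 1) / (2 * math.log(2) * n)

def small_sample_correction(n):
    """Switch between the two methods at n = 50, as described above."""
    return correction_approx(n) if n > 50 else correction_exact(n)

for n in (2, 10, 50, 200):
    print(n, small_sample_correction(n), correction_approx(n))
```

The enumeration visits on the order of n^3/6 compositions, which is why it is only worth running for small n. By n = 50 the two values differ by well under 0.005 bits in this sketch.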
Question: How is the correction done?
See:
@article{Schneider1986,
  author = "T. D. Schneider and G. D. Stormo and L. Gold and A. Ehrenfeucht",
  title = "Information content of binding sites on nucleotide sequences",
  journal = "J. Mol. Biol.",
  volume = "188",
  pages = "415-431",
  year = "1986"}
especially Figure 1 and the Appendix.
Programs. A program that does the correction (for 4 symbols) and produces a table is calhnb. The correction is used by rseq and alpro.
Correction for Information Measure. The small sample correction applies to the uncertainty function. It should be applied to both the before and after uncertainties for computing the information. In most cases the uncertainty before is computed from a high number of samples or is assumed to be from equiprobable samples, in which case the correction is unnecessary.
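As a rough sketch of how the correction enters the information calculation: the function name information_bits and its parameters are hypothetical, and the sign convention (adding the correction to each uncertainty estimated from a sample) is an assumption chosen to offset the downward bias. The uncertainty before is taken as 2 bits from equiprobable bases, so only the after term is corrected here.

```python
import math

def information_bits(counts_after, h_before=2.0, e_before=0.0, e_after=0.0):
    """Information (bits) at one position: corrected uncertainty before
    minus corrected uncertainty after.

    e_before and e_after are small sample corrections supplied by the
    caller; e_before stays 0 when h_before is assumed, not measured.
    """
    n = sum(counts_after)
    h_after = -sum((c / n) * math.log2(c / n) for c in counts_after if c > 0)
    return (h_before + e_before) - (h_after + e_after)

counts = [48, 6, 6, 0]               # made-up base counts at one position
n = sum(counts)                      # 60 sequences
e = 3 / (2 * math.log(2) * n)        # assumed approximate correction, n > 50

print(information_bits(counts))              # uncorrected
print(information_bits(counts, e_after=e))   # corrected, lower by e
```

Leaving out the correction overstates the information by exactly e bits at every position, which adds up across a wide site.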
Thanks to Mark Schreiber (Bioinformatics, AgResearch Invermay, PO Box 50034, Mosgiel, New Zealand, PH: +64 3 489 9175, mark.schreiber@agresearch.co.nz) for the questions (above) that inspired this page.
Question: Is a 4 bit sequence conservation more significant than 3 bits?
The significance of N bits depends on the number of sequences input, so it cannot be judged from the number of bits alone. For example, if there are only 10 sequences, a 0.25 bit sequence conservation at one base position will be less significant (p = 0.1066) than if there are 20 sequences that have a conservation of 0.25 bits (p = 0.0041) (Schneider1986). The significance can be computed for the average (as shown by a sequence logo) but a method for individuals (as shown by sequence walkers) has not been published. The significance is related to the small sample problem because the sampling determines the error in the information measure.
Schneider Lab
origin: 2001 May 31
updated: 2011 Jul 14