Modeling splice site and transcription factor binding site variation by information theory.

Program Nr: 942 from 2002 ASHG Annual Meeting

We have validated information theory-based models for human acceptor and donor splice sites and for NF-kB heterodimer binding sites. The average information describes the range of variation among sites having a common function, whereas the information content of a single site (R_i) measures its conservation within a family of binding sites. Because R_i is related to the free energy of binding, the strengths of different sites can be compared directly by their R_i values. The splice site models comprise automatically curated sets of donor (n=111,772) and acceptor (n=108,079) sites from all known genes in the human genome draft sequence. These comprehensive models accurately predict the effects of mutations, polymorphisms and cryptic splicing, including variants that only partially abolish splicing and often produce milder clinical phenotypes. The NF-kB model was derived initially from previously known strong sites and then iteratively refined by incorporating binding sites that were predicted by the initial model and validated by EMSA studies. The NF-kB model accurately rank-orders the strengths of known binding sites in competitor EMSAs, and distinguishes promoters of genes regulated by NF-kB from those of genes whose transcription is not known to be induced. The model was validated by detecting known (and previously unrecognized) sites in the promoters of each of 13 NF-kB-regulated genes that were excluded from the initial model. The most sensitive and specific information theory-based models are built from sites spanning a wide range of binding affinities. A CCAAT-box protein binding site model (n=175) based on the gene that results in HPFH. Many other transcription factor binding site sets collated in TRANSFAC are biased towards strong binding sites. More representative models will be required to detect weaker binding sites and to reliably assess the effects of mutations. Supported by PHS R01 ES10885-02 and the Merck Genome Research Foundation.
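
To make the R_i comparison concrete, the following is a minimal sketch of how individual information might be computed from a set of aligned sites. It is an illustration under stated assumptions, not the software used in this work: the weight for base b at position l is taken as 2 + log2 f(b,l) bits, the small-sample correction is omitted, a pseudocount (an assumption) is added to avoid taking the log of zero, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def weight_matrix(sites, pseudocount=1.0):
    """Build a per-position weight table (in bits) from aligned sites.

    Each weight is 2 + log2 f(b, l), where f(b, l) is the frequency of
    base b at position l. The pseudocount is an assumption of this
    sketch; the small-sample correction is deliberately left out.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must be aligned to the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        matrix.append(
            {b: 2.0 + math.log2((counts[b] + pseudocount) / total) for b in BASES}
        )
    return matrix


def individual_information(site, matrix):
    """Return R_i (bits) for one site by summing its per-position weights."""
    if len(site) != len(matrix):
        raise ValueError("site length does not match the model")
    return sum(column[base] for base, column in zip(site, matrix))


# Invented toy alignment (not data from the abstract).
toy_sites = ["GGGACTTTCC", "GGGAATTTCC", "GGGACTTCCC", "GGGGCTTTCC"]
matrix = weight_matrix(toy_sites)

# Rank hypothetical candidate sites from strongest to weakest by R_i.
candidates = ["GGGAATTCCC", "ATATATATAT", "GGGACTTTCC"]
ranked = sorted(
    candidates, key=lambda s: individual_information(s, matrix), reverse=True
)
for site in ranked:
    print(site, round(individual_information(site, matrix), 2))
```

In this sketch a candidate matching the most frequent base at every position receives the highest R_i, while a sequence that mostly uses unobserved bases scores below zero bits, which is the sense in which sites can be rank-ordered by R_i.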