Modeling splice site and transcription factor binding site variation by information theory

Peter K. Rogan, Dept. of Pediatrics, School of Medicine; Bioinformatics, School of Interdisciplinary Computer Science and Engineering; University of Missouri-Kansas City. Children's Mercy Hospital, Kansas City, MO 64108

Information theory-based models of nucleic acid binding sites comprehensively describe the functional variation present in such sites. The information content of a single site (Ri) measures its conservation within a family of binding sites. The strengths of different sites can be directly compared based on their respective Ri values, since Ri is related to thermodynamic entropy, and therefore to the free energy of binding. We have developed and validated models for human acceptor and donor splice sites and, more recently, for NF-kB heterodimer binding sites. In contrast with consensus sequences, these models are defined by sites that represent a wide range of binding affinities. The splice site models comprise a set of curated sites in all known genes in the 10/7/00 human genome draft. Much of our previous work has focused on analysis of human mRNA splice site variants: distinguishing polymorphisms from mutations, and leaky mutations from null mutations and cryptic site activation. We have used these tools to predict the clinical severity of mutations responsible for a wide variety of inherited disorders.
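To make the Ri calculation concrete, the following is a minimal Python sketch, not the software used in this work. It builds a weight matrix with entries 2 + log2 f(b,l), where f(b,l) is the frequency of base b at position l in a set of aligned sites, and sums the entries selected by a sequence to obtain its Ri in bits. The function names, the toy donor-like sites, and the pseudocount (used here in place of a small-sample correction) are assumptions made for illustration only.

```python
import math

BASES = "ACGT"


def build_weight_matrix(sites, pseudocount=0.5):
    """Build per-position weights, 2 + log2 f(b, l), in bits.

    `sites` is a list of aligned, equal-length DNA strings.
    The pseudocount is an illustrative assumption that keeps unseen
    bases from producing log2(0); it stands in for a proper
    small-sample correction.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must be aligned to the same length")
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        matrix.append(
            {base: 2.0 + math.log2(counts[base] / total) for base in BASES}
        )
    return matrix


def individual_information(matrix, sequence):
    """Return Ri (bits) of one sequence: the sum of its per-position weights."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence length must match the model length")
    return sum(column[base] for column, base in zip(matrix, sequence))


# Hypothetical toy alignment; real models are built from many curated sites.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
toy_matrix = build_weight_matrix(toy_sites)

ri_natural = individual_information(toy_matrix, "GTAAGT")
ri_variant = individual_information(toy_matrix, "GCAAGT")  # T>C at position 2
delta_ri = ri_variant - ri_natural  # negative: the variant site is weaker
```

Comparing the Ri of a natural site with that of its variant, as in the last three lines, is the kind of calculation used to judge whether a sequence change weakens or abolishes a site. Sites with Ri at or below zero are not expected to be bound.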

We are also developing information models for transcription factor binding sites to assess the effects of promoter sequence variation on transcription of genes responsible for drug metabolism. The NF-kB model was derived first from previously known strong sites and then refined using binding sites predicted from the initial model and validated by EMSA studies. The splicing models accurately predict the effects of mutations, polymorphisms and cryptic splicing, including variants that partially abolish splicing and often produce milder clinical phenotypes. The NF-kB model accurately rank-orders the strengths of known binding sites in competitor EMSA assays and distinguishes promoters of genes regulated by NF-kB from those in which transcription is not known to be induced. A promoter polymorphism that derepresses transcription of the CYP2D6 gene abolishes an NF-kB binding site on the antisense strand. The sensitivity and specificity of information theory-based models can be attributed to the inclusion of both weak and strong binding sites. We find that information models of other transcription factor binding sites collated in TRANSFAC are biased towards strong binding sites. More representative models will be required to accurately predict the effects of variation in such sites. Supported by PHS R01 ES10885-02 and the Merck Genome Research Foundation.

Keywords: Information theory, mRNA splicing, NF-kB, mutations