Information analysis of human splice site mutations

Peter K. Rogan,¹ ² Brian M. Faux,* and Thomas D. Schneider ³

Splice site nucleotide substitutions can be analyzed by comparing the individual information contents (R_i, bits) of the normal and variant splice junction sequences [Rogan and Schneider, 1995]. In the present study, we related splicing abnormalities to changes in R_i values of 111 previously reported splice site substitutions in 41 different genes. Mutant donor and acceptor sites have significantly less information than their normal counterparts. With one possible exception, primary mutant sites with < 2.4 bits were not spliced. Sites with R_i values $\geq 2.4$ bits but less than the corresponding natural site usually decreased but did not abolish splicing. Substitutions that produced small changes in R_i probably do not impair splicing and are often polymorphisms. The R_i values of activated cryptic sites were generally comparable to or greater than those of the corresponding natural splice sites. Information analysis revealed pre-existing cryptic splice junctions that are used instead of the mutated natural site. Other cryptic sites were created or strengthened by sequence changes that simultaneously altered the natural site. Comparison between normal and mutant splice site R_i values distinguishes substitutions that impair splicing from those which do not, distinguishes null alleles from those that are partially functional, and detects activated cryptic splice sites.

keywords: information theory, mRNA splicing, donor, acceptor, cryptic, mutation, polymorphism, walker

Mutations at splice sites make a significant contribution to human genetic disease, since approximately 15% of disease-causing point mutations affect pre-mRNA splicing [Krawczak et al., 1992]. Mutations in splice sites decrease recognition of the adjacent exon and consequently inhibit splicing of the adjacent intron [Talerico and Berget, 1990,Carothers et al., 1993]. Splice site mutations may result in exon skipping, activation of cryptic splice sites, creation of a pseudo-exon within an intron, or intron retention [Nakai and Sakamoto, 1994]: (1) Exon skipping, the most frequent outcome, is thought to result from failure of the normal and mutant splice sites to define an exon. (2) Most cryptic mutations activate splice sites of the same type and are typically located within a few hundred nucleotides of the natural site. This distance is probably limited by restrictions on the length of the resultant exon [Hawkins, 1988,Berget, 1995]. (3) Occasionally, mutations that are further away from the natural splice site create cryptic sites that are activated in the presence of a nearby cryptic splice site of opposite polarity, producing a novel non-coding exon within the intron. (4) Splice site mutations in very short or terminal introns can result in intron retention [Dominski and Kole, 1991]. In these instances, additional sequence elements may be required for normal splicing [Black, 1991,Black, 1992,Sterner and Berget, 1993].

Essential elements in donor and acceptor splice junctions have been defined by consensus sequences [Mount, 1982], by analysis of nucleotide frequencies at each position in a splice site [Senapathy et al., 1990] and by neural network prediction [Brunak et al., 1990]. Each of these methods has limitations. Although the GT and AG positions adjacent to donor and acceptor splice junctions are highly conserved, other positions are more variable [Mount, 1982,Stephens and Schneider, 1992]. The consensus sequence approximates the nucleotide frequencies at each position, and so it excludes the contributions of less frequent nucleotides present in a proportion of natural splice sites. Splice site sequences that deviate from the consensus do not necessarily produce significantly lower amounts of spliced mRNA [Rogan and Schneider, 1995]. Training a neural network requires sequences of both binding sites and sequences that are not bound [Stormo et al., 1982,Brunak et al., 1990]. Generally, non-bound sequences are taken to be those remaining after binding sites have been identified. However, these sequences do contain functional sites [Schneider, 1997b,Hengen et al., 1997], so neural networks may be inappropriately trained on overlapping data sets.

In contrast, information-theory based models of donor and acceptor splice sites require only functional sites and show which nucleotides are permissible at both highly-conserved and variable positions of these sites [Stephens and Schneider, 1992]. Information is the only measure of sequence conservation which is additive [Shannon, 1948]. The information content (R_i, in bits) of a member of a sequence family describes the degree to which that member contributes to the conservation of the entire family [Schneider, 1997a,Schneider, 1997b]. R_i is the dot product of a weight matrix derived from the nucleotide frequencies at each position of a splice site sequence database and the vector of a particular sequence. Individual information is related to thermodynamic entropy and therefore to the free energy of binding [Schneider, 1994,Schneider, 1997a]. Since splice sites are recognized prior to intron excision [Berget, 1995], the sequence of the splice site dictates the strength of the spliceosome-splice junction interaction and thus splice site use. It is our thesis that the strength of this interaction is related to the information content of the splice junction.

A group of sites with similar sequence and function can be described and quantified by their corresponding distribution of individual information contents. The mean of this distribution of R_i values is $7.92 \pm 0.09$ bits for the 10 nucleotide long splice donor sites and $9.35 \pm 0.12$ bits for the 28 nucleotide long acceptor sequences [Stephens and Schneider, 1992,Schneider, 1997a], representing the average amount of information required for splicing, R_sequence [Schneider et al., 1986,Schneider, 1995,Schneider, 1994]. Strong splice sites have R_i values $\gg R_{sequence}$; weak sites have R_i values $\ll R_{sequence}$. Non-functional sites have R_i values less than or equal to zero [Schneider, 1994,Schneider, 1997a]. Since mutations at splice sites lessen or abolish splicing at those sites, we investigated whether the R_i values of mutant splice sites were related to defects in mRNA processing and whether mutant, cryptic and the corresponding natural splice sites could be ordered based on their respective R_i values.

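Information content is defined as the number of choices needed to describe a sequence pattern, using a logarithmic scale in bits [Schneider et al., 1986,Schneider, 1995]. A set of either donor or acceptor splice junction recognition sites is aligned and the frequencies of bases at each position are determined. The weight matrix used to model the splice junctions is computed from

$R_{iw}(b,l) = 2 - \left[ -\log_2 f(b,l) + e(n(l)) \right]$ (bits per base),

where f(b,l) is the frequency of base b at position l of the aligned sites and e(n(l)) is a correction for the finite number of sequences n(l) at that position [Schneider, 1997a].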

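The individual information of a sequence j is the dot product between the sequence and the weight matrix:

$R_i(j) = \sum_{l} \sum_{b=A}^{T} s(b,l,j) \, R_{iw}(b,l)$ (bits per site),

where s(b,l,j) is 1 if base b occurs at position l of sequence j and 0 otherwise.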

The mean of the distribution of R_i values of natural sites is R_sequence [Schneider, 1997b,Schneider, 1997a]. The distribution of R_i values is approximately Gaussian; however, it is bounded below by zero bits and above by the R_i value of the consensus sequence.
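The weight matrix and dot product described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: it assumes the form R_iw(b,l) = 2 - (-log2 f(b,l) + e(n(l))), approximates the small-sample correction as e(n) = 3 / (2 ln(2) n), and uses a pseudocount for unseen bases. The function names, the pseudocount and the five toy donor-like sequences are all hypothetical.

```python
import math

BASES = "ACGT"

def small_sample_correction(n):
    # Approximate e(n) for a 4-letter alphabet; an assumption of this sketch.
    return 3.0 / (2.0 * math.log(2) * n)

def weight_matrix(sites, pseudocount=0.5):
    """Return a list (one dict per position) of weights R_iw(b,l) in bits."""
    n = len(sites)
    e_n = small_sample_correction(n)
    matrix = []
    for l in range(len(sites[0])):
        column = [site[l] for site in sites]
        weights = {}
        for b in BASES:
            # The pseudocount avoids log2(0) for bases never seen at position l
            # (an illustrative choice, not taken from the paper).
            f = (column.count(b) + pseudocount) / (n + 4 * pseudocount)
            weights[b] = 2.0 - (-math.log2(f) + e_n)
        matrix.append(weights)
    return matrix

def individual_information(seq, matrix):
    """R_i of one sequence: the sum of the weights selected by its bases."""
    return sum(matrix[l][b] for l, b in enumerate(seq))

# Toy alignment (hypothetical sequences, not from the splice site database).
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT", "CAGGTGAGG"]
matrix = weight_matrix(toy_sites)
normal_ri = individual_information("CAGGTAAGT", matrix)
mutant_ri = individual_information("CAGATAAGT", matrix)  # G to A at the invariant G
print(f"normal {normal_ri:.2f} bits, mutant {mutant_ri:.2f} bits")
```

Because R_i is a sum over positions, a single substitution changes R_i by exactly the difference between two weights in one column, which is why the contribution of each position can be inspected separately.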

The null R_i distribution was determined by creating a random 10,000 nucleotide sequence with a Markov chain process that maintained the same mono- and dinucleotide composition as the human splice junction database [Stephens and Schneider, 1992]. The means of the splice donor and acceptor null distributions were respectively […] and […] bits. The probability of observing either a donor or acceptor site with R_i > 0 in this random sequence was 0.02 (Z = 2.0).
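One plausible way to build such a random sequence is a first-order Markov chain whose transition counts are taken from a template sequence, so that mono- and dinucleotide composition is approximately preserved. The sketch below is an assumption about the procedure, not a description of the program actually used; the function name and the short template string are hypothetical.

```python
import random
from collections import Counter, defaultdict

def markov_sequence(template, length, seed=0):
    """Generate a random sequence from a first-order Markov chain fitted to template."""
    rng = random.Random(seed)
    transitions = defaultdict(Counter)
    for a, b in zip(template, template[1:]):
        transitions[a][b] += 1  # dinucleotide counts define the transitions
    mono = Counter(template)
    current = rng.choices(list(mono), weights=list(mono.values()))[0]
    out = [current]
    while len(out) < length:
        options = transitions[current]
        if not options:
            # Base only seen at the end of the template: fall back to
            # mononucleotide composition.
            options = mono
        current = rng.choices(list(options), weights=list(options.values()))[0]
        out.append(current)
    return "".join(out)

# Hypothetical template standing in for the concatenated splice junction database.
template = "ACGTTGCAAGGTCCATGGTAAGTCCTTTCAG" * 20
null_seq = markov_sequence(template, 10_000)
print(null_seq[:60])
```

Scoring every site-length window of the generated sequence with the donor or acceptor weight matrix would then give the null distribution of R_i values.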

The effects of nucleotide substitutions can be evaluated by comparing the individual information of the common and variant alleles. The minimum fold change in binding affinity of two sites is $2^{\Delta R_i}$, where $\Delta R_i$ is the difference between their respective individual information contents [Schneider, 1997a].
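As a small worked example, the minimum fold change is 2 raised to the difference in bits. The helper name below is hypothetical; the 4.1 bit difference is the one quoted elsewhere in the text for the CFTR IVS 8 alleles, where it is described as at least a 17 fold change.

```python
def min_fold_change(ri_a, ri_b):
    """Lower bound on the fold difference in binding affinity between two sites (R_i in bits)."""
    return 2.0 ** abs(ri_a - ri_b)

# A 4.1 bit difference gives a bound of roughly 17 fold.
cftr_fold = min_fold_change(4.1, 0.0)
print(f"{cftr_fold:.1f}")
```

The result is a lower bound because the relationship between information and binding energy is an inequality.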

Computational tools have been developed to investigate and display individual information. The R_iw(b,l) matrices were first computed from a set of 1799 splice donor and 1744 acceptor sequences [Stephens and Schneider, 1992]. To scan for potential sites or to determine the effects of a sequence change on the normal and neighboring sites, the individual information content of the donor or acceptor motif is computed for every site-length window in the sequence. To assess the effects of various substitutions on a specific donor or acceptor site, R_i was computed for the normal and variant sites with the program Scan and displayed with MakeWalker, DNAPlot and Lister (Schneider.walker; https://alum.mit.edu/www/toms/).

The Scan program uses the R_iw(b,l) matrix to evaluate the individual information (R_i) at each position in a sequence. For each evaluation, it also computes the number of standard deviations away from R_sequence (Z score), and the one-tailed probability (p) of observing a normal splice site with that value of R_i. Sequences with R_i values that are either significantly greater or less than R_sequence have low probabilities of belonging to the natural population of sites.
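The scanning step can be sketched as a sliding window over the sequence. This is not the Scan program itself: the function name, the three-position toy matrix and the values of R_sequence and its standard deviation are hypothetical, positions are reported as window start offsets, and the one-tailed probability assumes the Gaussian approximation mentioned in the text.

```python
import math

def scan(sequence, matrix, r_sequence, sigma):
    """Return (start, R_i, Z, one-tailed p) for every site-length window."""
    width = len(matrix)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        ri = sum(matrix[l][b] for l, b in enumerate(window))
        z = (ri - r_sequence) / sigma
        # Tail area of a standard normal beyond |Z|, on one side only.
        p = 0.5 * math.erfc(abs(z) / math.sqrt(2))
        results.append((start, ri, z, p))
    return results

# Hypothetical 3-position weight matrix (bits); real matrices span 10 or 28 positions.
toy_matrix = [
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -2.0},
    {"A": -3.0, "C": -3.0, "G": 2.0, "T": -3.0},
    {"A": -3.0, "C": -3.0, "G": -3.0, "T": 2.0},
]
hits = scan("ACGGTAC", toy_matrix, r_sequence=4.0, sigma=1.5)
best = max(hits, key=lambda h: h[1])
print(best)
```

Running the same scan on the normal and the variant sequence and comparing the two result lists shows the effect of a substitution on the natural site and on every neighboring window at once.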

A walker graphically shows the contributions of each position to a binding site. In the display (generated by Makewalker or Lister), favorable contacts between the spliceosome and a test sequence are indicated by letters that extend upwards, while positions that are predicted to make unfavorable contacts are shown by inverted letters. Makewalker is interactive and shows one walker at a time, while Lister displays multiple walkers aligned with sequences and annotated by coding regions (e.g. Figs 1-4).
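A text-only approximation of a walker lists the weight contributed by each base of a test sequence and marks whether the letter would be drawn upright or inverted. The sketch below is only an illustration of the idea; the function name and the toy matrix are hypothetical, and the real programs draw letters with heights proportional to the weights.

```python
def walker_rows(seq, matrix):
    """Return (position, base, weight in bits, orientation) for each base of seq."""
    rows = []
    for l, base in enumerate(seq):
        weight = matrix[l][base]
        rows.append((l, base, weight, "up" if weight >= 0 else "inverted"))
    return rows

# Hypothetical 3-position weight matrix (bits).
toy_matrix = [
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -2.0},
    {"A": -3.0, "C": -3.0, "G": 2.0, "T": -3.0},
    {"A": -3.0, "C": -3.0, "G": -3.0, "T": 2.0},
]
rows = walker_rows("GAT", toy_matrix)
for position, base, weight, orientation in rows:
    print(f"{position:2d} {base} {weight:+.1f} {orientation}")
total_ri = sum(row[2] for row in rows)
```

In this toy case one unfavorable position is outweighed by two favorable ones, so the total stays positive.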

Human splice site mutations were chosen from published reports for which corresponding genomic sequence data were available. Only a subset of reported mutations could be analyzed, as sufficient intron sequences were often unavailable (<26 nucleotides for acceptor sites, <7 nucleotides for donor sites). To investigate the relationship between R_i value and splice site use, studies that evaluated expression of the mutant mRNA were selected whenever possible. A sequence interval (>100 nucleotides) surrounding the splice junction was scanned to detect potential cryptic splice sites in the vicinity of the natural site. Larger sequence windows were used for cryptic sites known to occur further away from the natural site (e.g. Table 2, #24).

Two mutations could not be analyzed because there were discrepancies at corresponding splice site sequences from different reports. A mutation in the IVS 10 acceptor of the hexosaminidase B gene could not be analyzed because the natural acceptor site had negative information content in one of the sequences [Neote et al., 1988,Proia, 1988]. A similar inconsistency was found in two different versions of the IVS 5 acceptor sequence of the protein kinase C gene [Foster et al., 1985,Soria et al., 1993].

Natural and variant sites with R_i > 0 were compared with R_sequence [Stephens and Schneider, 1992] by using the Z statistic and associated probability of observing a site with a particular R_i value [Schneider, 1997a].

Primary mutations for either donor or acceptor sites were analyzed by determining the average differences in R_i values ( $\overline{\Delta R_i}$ ) of natural versus mutant sequences. Significance was evaluated using a paired t-test. Mutations in which cryptic splicing was either predicted or demonstrated experimentally were excluded to avoid biasing estimation of $\overline{\Delta R_i}$, since cryptic splicing can alter natural splice site use in the absence of a change in the information content of that site.
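The paired comparison can be sketched with the Python standard library. The function name is hypothetical, and the data are illustrative: the first two pairs are the LFA1 and COL1A2 values quoted in the text, while the last two pairs are invented to make the example run. The sketch returns the mean difference, the t statistic and the degrees of freedom; the p value would be read from a t distribution, which the standard library does not provide.

```python
import math
import statistics

def paired_t(natural, mutant):
    """Mean of (mutant - natural), paired t statistic, and degrees of freedom."""
    diffs = [m - n for n, m in zip(natural, mutant)]
    mean_diff = statistics.mean(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / standard_error, len(diffs) - 1

# R_i values in bits; the last two pairs are made up for illustration.
natural_ri = [8.6, 5.4, 9.0, 7.1]
mutant_ri = [4.2, 2.4, 1.0, 3.5]
mean_diff, t_stat, dof = paired_t(natural_ri, mutant_ri)
print(f"mean difference {mean_diff:.2f} bits, t = {t_stat:.2f}, df = {dof}")
```

Pairing each mutant site with its own natural site removes the large site-to-site variation in natural R_i values from the comparison.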

The observed distributions of the locations of cryptic donor and acceptor sites were compared with a model that assumes that these sites are equally likely to occur upstream or downstream of the natural site. Significance was evaluated with the binomial distribution.

Different mutation reports measured splice site use directly by either cDNA sequencing, reverse transcription-PCR, primer-extension, S1 nuclease analyses, or allele-specific hybridization. Direct comparisons of natural and mutant splicing patterns were not always available. In some instances, the effect of the mutation was measured indirectly using either Northern hybridization (Table 1, #46, 47, 49; Table 3, #4), antigen immunoprecipitation or protein levels (Table 1, #18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 49; Table 2, #40, 41, 42, 43; Table 3, #2, 3, 4) or measurements of enzymatic activity (Table 1, #18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 29, 30, 34, 35; Table 2, #40, 41, 42, 43; Table 3, #2, 3). Functional analyses of splicing were not reported for mutations #31, 32, 54 and 55 in Table 1, #14, 15, 23, 34-38, and 44, 45, 46 in Table 2, and #1, 7 and 8 (the natural site at 2621) in Table 3.

Several categories of mutations were distinguished by individual information analysis. A total of 111 nucleotide substitutions were evaluated. Fifty-seven mutations were nucleotide substitutions that solely altered use of the natural splice site and did not create cryptic splice sites (designated as primary splice site mutations, Table 1). Activated cryptic splice sites were predicted for 46 different mutations, 33 of which were corroborated experimentally (Table 2). Eight nucleotide substitutions were predicted not to alter splicing (Table 3).

Differences in information content of natural and mutant splice sites. Many of the primary splice junction mutants that showed complete exon skipping (residual splicing: -) had R_i values $\leq 0$ bits (Table 1, #2, 3, 11, 12, 15, 16, 17, 19, 35). However, there are primary mutant donor and acceptor sites that were not used that have mostly small positive R_i values (Table 1, #4, 5, 14, 20, 21, 24, 36, 38, 40, 43, 47). This suggests that recognition of splice donor and acceptor sites requires more than zero bits.

Mutations that reduce or completely abolish splicing have significantly lower R_i values than the corresponding natural sites. The average difference in R_i between primary mutant and natural donor sites is $-7.67$ bits (n = 45) and for acceptor sites it is $-5.97$ bits (n = 12), and these differences are significant (p<0.0001 for both $\overline{\Delta R_i}$ values). R_i values of primary acceptor mutations range from a minimum of -2.90 bits to a maximum of 11.75 bits, whereas donor mutations have a lower range from -14.25 to 6.87 bits.

We considered the possibility that the strength of a natural splice site, i.e. its R_i value, might be related to its susceptibility to mutational inactivation. 15 of 24 (62%) natural sites in Table 1 with R_i values > R_sequence were inactivated by mutation or had mutant R_i values $\leq 0$, compared to 22 of 29 (76%) natural sites with R_i values < R_sequence. Inactivation of splicing is primarily determined by the specific nucleotide substitutions that occur at those sites; however, weak natural splice sites may be more susceptible than strong sites to mutations that abolish splicing.

Amount of information required for splicing. The minimum quantity of information required for splicing, R_i,min, was defined by comparing the R_i values of inactivating to leaky primary mutations (cryptic splicing mutations were excluded because activation of cryptic sites may affect natural site use). R_i,min is bounded by the maximum information content of a non-functional site and the minimum quantity of information required to produce normal transcripts.

The following minimally functional sites had small positive R_i values: A mutation at the exon 5 donor site ([…] bits) in the HEXA gene results in a low level (3%) of normal mRNA (Table 1, #41). Similarly, a mutation at the exon 4 acceptor site ([…] bits) in the APOE gene results in 5% of normal splicing (Table 2, #2) and a mutation at the IVS 14 donor site ([…] bits) in COL1A1 decreases (by 50-60%) but does not abolish normal splicing (Table 1, #9). Furthermore, a mutant 2.4 bit acceptor site in the IDS gene (Table 2, #30) is associated with a moderately abnormal phenotype (the other allele is null), consistent with production of some normal mRNA. Finally, a mutation at the IVS 6 acceptor in COL1A2 reduces the R_i value of the splice site from 5.4 to 2.4 bits and results in a mild form of Ehlers-Danlos (type VII) syndrome due to 50% exon skipping (Table 1, #13; Fig. 1). Splicing at this site is completely impaired in vitro at 39 °C and restored at 30 °C. The temperature sensitivity of this mutation indicates that this 2.4 bit sequence is weakly bound by the spliceosome.

By contrast, mutations at the exon 1 donor splice site in the CAT gene (Table 1, #4; […] bits) and in IVS 33 of COL1A2 (Table 1, #14; 3.2 bits) completely abolish mRNA splicing. The R_i value of this COL1A2 mutation is inconsistent with the result found for mutation #13, since the mutation with lower information content would be expected to be inactive. This difference may not be significant depending on the (unknown) precision of the R_iw(b,l) matrix; however, it seems more likely that residual splicing at the mutated site in mutation #14 went undetected. Residual splicing was observed at several mutant splice sites with R_i values greater than 2.4 bits and less than 3.2 bits (Table 1, #9, 41 and 52). These splice junction mutations define a range of values for R_i,min of either donor or acceptor sites. Although the confidence interval around R_i,min is unknown, donor and acceptor splice sites with R_i > 2.4 bits are rarely found in a set of random sequences with human dinucleotide composition (p=0.008). To simplify comparisons between R_i,min and other R_i values, we use $R_{i,min} = 2.4$ bits.

Leaky splicing. To determine whether the information present in a mutant site was related to splice site use, the R_i values of mutated splice sites that inactivated splicing were compared with R_i values of leaky splice sites. Completely inactivated sites generally had R_i values less than R_i,min (e.g. Table 1, #46), whereas mutations with R_i values greater than R_i,min reduced but generally did not abolish splicing. For example, a G $\rightarrow$ C point mutation in the exon 2 donor site of the LFA1 gene (Table 1, #44) decreased R_i from 8.6 to 4.2 bits and this mutation is leaky, i.e. 3% of the normal spliced product is detected from this allele [Kishimoto et al., 1989]. Likewise, a patient with mild cholesterol storage disease was homozygous for a donor site mutation in the LIPA gene ([…] bits; Table 1, #45; Fig. 2). Mutations #1, 6, 7, 9, 10, 13, 18, 22, 26, 34, 41, 44, 45, 50, 52, and 56 (Table 1) and #2, 3, 4, 7, 9, 16, 21, 23, 27, 28, 30, 32 and 41 (Table 2), which have R_i values $\geq R_{i,min}$, are leaky at the respective natural splice sites. The average decrease in R_i values is smaller for primary mutations that result in reduced levels of normally spliced mRNA; $\overline{\Delta R_i}$ is […] bits for donor sites (n = 12; versus -7.67 for all donor sites) and […] bits for acceptor sites (n = 4; versus -5.97 for all acceptor sites). When cryptic splice site mutations that result in residual splicing at the natural site are considered in addition, the change is negligible: […] bits (n = 14) for donor sites and […] bits (n = 15) for acceptor sites.

Quantitative relationship. The quantitative relationship between splice site use and information content is illustrated by the polymorphic alleles in IVS 8 of the CFTR gene (Table 1, #6; Fig. 3). The frequency of exon 9 skipping is inversely related to the length of the polypyrimidine tract of the upstream acceptor site [Chu et al., 1993,Chillon et al., 1995,Rave-Harel et al., 1997]. This is not surprising since the length of a homopolymeric polypyrimidine tract has also been related to splice site strength [Dominski and Kole, 1991]. The 4.1 bit difference between the R_i values of the shortest and longest alleles accounts for the lower amount of spliced mRNA from the shorter allele and is probably related to the phenotype of congenital bilateral absence of the vas deferens in male homozygotes. A 4.1 bit reduction in information would correspond to at least a 17 fold ( $2^{4.1}$ ) decrease in splicing, assuming minimal conversion of information to energy dissipated [Schneider, 1991b,Schneider, 1994]. This corresponds closely to the relative amounts of mRNA produced by the shortest (5T) and longest (9T) alleles [Chillon et al., 1995].

Only two exceptional mutations were found in which $R_i \gg R_{i,min}$, although these sites were reportedly not used (Table 1, #5 [11.6 bits], #43 [5.7 bits]). The minimum predicted decreases of 3 and 11 fold, respectively, in binding affinity would not be expected to completely abolish splicing at these sites. Reduced amounts of splicing can occur at mutant splice sites with R_i > R_i,min, although a modest decrease in R_i at a splice site can apparently sometimes inactivate splicing.
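The thresholds discussed above suggest a rough three-way call for a primary splice site substitution. The sketch below is an illustration only: the function name and the `neutral_delta` cutoff for a "small" change are hypothetical (the text does not give a numerical cutoff), and, as the exceptions above show, the rule will not classify every reported mutation correctly.

```python
R_I_MIN = 2.4  # bits; the working value of R_i,min used in the text

def classify_site(natural_ri, mutant_ri, neutral_delta=1.0):
    """Rough prediction for a primary splice site substitution (R_i in bits)."""
    if mutant_ri < R_I_MIN:
        return "predicted non-functional"
    if natural_ri - mutant_ri <= neutral_delta:
        # neutral_delta is a hypothetical cutoff for a small change in R_i.
        return "likely neutral"
    return "predicted leaky"

# LFA1 exon 2 donor and COL1A2 IVS 6 acceptor values are quoted in the text.
print(classify_site(8.6, 4.2))
print(classify_site(5.4, 2.4))
# Hypothetical values for the remaining two categories.
print(classify_site(8.0, -1.0))
print(classify_site(8.0, 7.6))
```

Both quoted examples fall in the leaky category, in agreement with the residual splicing reported for them.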

Categories of cryptic splice sites. R_i analysis detected secondary cryptic splice sites that are activated by mutation in or adjacent to the natural primary splice site. This indicates that the R_i values of activated cryptic sites may be determined with an information model derived from natural splice sites [Stephens and Schneider, 1992]. Table 2 shows 33 experimentally-identified cryptic sites confirmed by information analysis of the respective genomic sequences (section A), and 13 mutations that were predicted by R_i analysis to exhibit cryptic splicing (section B). For example, a mutation at position 35066 of the adenosine deaminase gene (Table 2, #1) does not alter the R_i value of the natural splice site (at 35099), but creates a secondary cryptic site of similar strength at position 35067. There were 7 additional mutations in which a new cryptic site was either created or predicted without altering the R_i value of the natural splice site (Table 2, #12, 14, 15, 26, 31, 40, 43). Activation of cryptic sites can also prevent splicing at natural sites by promoting exon skipping (e.g. in 79% of transcripts resulting from a mutation in the iduronate-2-sulfatase gene; Table 2, #26; [Jonsson et al., 1995]). Exon skipping mutations occurred predominantly at donor splice sites (7 of 8) and in each instance, a cryptic site was created upstream whose R_i value exceeded or was similar to that of the natural site.
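Detection of candidate cryptic sites can be sketched as a comparison of R_i values across a scanned interval before and after the substitution. In the sketch below the positions follow the adenosine deaminase example, but every R_i value is invented, and the function name and the `tolerance` cutoff for "similar strength" are hypothetical.

```python
def find_cryptic_candidates(scan_result, natural_pos, tolerance=0.5):
    """Positions whose R_i is positive and within `tolerance` bits of, or above, the natural site.

    scan_result maps position -> R_i in bits for sites of the same type.
    """
    natural_ri = scan_result[natural_pos]
    return sorted(
        pos
        for pos, ri in scan_result.items()
        if pos != natural_pos and ri > 0 and ri >= natural_ri - tolerance
    )

# Invented R_i values: the natural site at 35099 is unchanged, while the
# substitution creates a site of similar strength at 35067.
normal_scan = {35067: -2.0, 35080: 1.1, 35099: 7.5}
mutant_scan = {35067: 7.2, 35080: 1.1, 35099: 7.5}
before = find_cryptic_candidates(normal_scan, natural_pos=35099)
after = find_cryptic_candidates(mutant_scan, natural_pos=35099)
print(before, after)
```

A site that appears only in the second list is one that the substitution created or strengthened; a site present in both lists would be a pre-existing candidate.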

Susceptibility to activation. Of 31 experimentally-verified cryptic splicing mutations (Table 2A, excluding #5 and 6), there are 19 natural splice sites whose R_i values exceeded that of the cryptic site prior to its activation ([…] bits). For the remaining 12 mutations (10 of which involve the same site in HBB), the inactive cryptic sites exceed the natural site by only an $\overline{\Delta R_i}$ of […] bits. Furthermore, the differences in R_i values between natural and cryptic sites prior to mutational activation are much smaller for donor sites ([…], n = 17 for donors vs. […], n = 15 for acceptors). Likewise, cryptic donors were activated by an increase of […] bits (n=5), whereas cryptic acceptor sites were activated by […] bits (n=10). From these observations it would appear that donor sites may be more susceptible to the effects of neighboring cryptic sites.

Distance effects. Cryptic sites activated by a mutation that weakens the natural site must reside within a few hundred nucleotides of the natural splice site, since the novel exon is restricted in length [Hawkins, 1988,Berget, 1995]. For example, a strong cryptic acceptor in intron 2 of the $\beta$-hemoglobin gene is activated by mutations at the exon 3 acceptor 271 nucleotides downstream (Table 2, #24). Mutation at a natural site can, however, activate sites that are further away when a cryptic exon is created. For example, mutation at the exon 3 acceptor of the CFTR gene activates a cryptic, non-coding exon in intron 3 (2,354 nucleotides downstream of exon 3 and 19,329 nucleotides upstream of exon 4; Table 2, #3).

Exceptions. Although pre-existing or novel cryptic sites with R_i values less than that of the strongest local splice site were usually not recognized, there were exceptions. Infrequently, a weaker cryptic site can interfere with a natural site, even when the natural site is strengthened by the mutation (e.g. Table 2, #16). For example, activated cryptic sites with R_i values lower than those of the natural splice site after mutation may sometimes be used (Table 2, #1, 3, 4, 6, 9, 16, 23, 32). In at least one instance (Table 2, #1), a cryptic acceptor site upstream of the natural site is predominantly used despite the fact that both sites have similar R_i values, which suggests that the cryptic site is recognized first. Conversely, the R_i value of the exon 1 donor in the $\beta$-globin gene is less than that of an upstream cryptic site (Table 2, #12-15, 17-22); however, this cryptic site is not activated unless it is strengthened or the donor is weakened. These exceptions suggest that besides direct competition between the cryptic and natural splice sites, other factors can influence splice site selection.

Nucleotide substitutions that do not significantly alter the R_i value of a natural site are expected to produce functional rather than mutant sites [Rogan and Schneider, 1995]. Given that such substitutions are not likely to be deleterious, they may be polymorphic in the germline, as has been shown for a sequence change in an hMSH2 splice acceptor site [Leach et al., 1993]. We identified other nucleotide substitutions that did not significantly alter the R_i value (Table 3):

Splicing patterns for nucleotide substitutions #1, 2, 3, and 7 (Table 3) were not reported; however, based on information analysis, these changes would not be predicted to alter mRNA splicing. The substitutions either maintain or increase the information content of the natural splice site. The R_i values of the proposed cryptic sites for substitutions #1, 2, and 8 were either negative or unchanged, suggesting that they are not activated by these substitutions. A proposed cryptic site in exon 3 of the p53 gene (substitution #7) is significantly weaker than the natural acceptor site (by 6.14 bits) and has an R_i value only slightly larger than R_i,min. It would seem unlikely that this cryptic site is preferentially used.

The number of bits in a splice site is related to the amount of splicing at that site. Previously, we demonstrated that a polymorphic splice junction variant caused little change in information [Rogan and Schneider, 1995]. The present study extends this finding and shows that mutant splice sites often contain significantly less information than their corresponding natural sites. Further, cryptic splice sites are activated by increases in information or by decreases at the natural splice site, and the information at activated cryptic sites is often comparable to or exceeds the natural site.

Predicting the effects of mutations. A required step of information analysis is to compute the total information over all positions in a site. This value must then be compared with that of other sites prior to concluding that a substitution that changes a positive to a negative weighting is deleterious (compare Tables 1 and 2 to Table 3). Functional splice sites can have nucleotides with negative weightings (e.g. Fig. 1, position 63) that are offset by strong contributions at other positions (e.g. Fig. 1, position 64), as we have shown for other binding sites (Figure 2 in Schneider.walker, Hengen.fisinfo). Statistical analyses of the distributions of point mutations in splice sites are useful [Krawczak et al., 1992] but can sometimes obscure these compensating effects. Within a binding site, the context of a mutation can be as important as the mutation itself.

The difference between the observed value of R_i,min (2.4 bits) and its expected value (zero bits) may have a biological basis. However, this difference could also be explained by errors in the database used to create the splice weight matrices [Schneider, 1997b], statistical limitations of the data and matrices, motifs that are different from the majority of sites [Hall and Padgett, 1994], or intrinsic limits to the precision of splice site recognition [Schneider, 1991a]. Although the standard deviation of R_sequence can be determined [Stephens and Schneider, 1992], the confidence intervals on individual R_i values are unknown. These intervals are expected to be larger at the lower and upper bounds of the R_i distribution, where fewer functional splice sites are observed. The existence of a natural site with R_i < R_i,min (2.2 bits; Table 2, #26) and an exon-skipping mutation with R_i > R_i,min (3.2 bits; Table 1, #14) suggests that R_i,min is not known precisely. The error ( |R_i - R_i,min| ) may be as little as 0.2 bits (R_i = 2.2 bits; Table 2, #26), but it might be as much as 2.4 bits (R_i = 0 bits; Schneider.Ri).

Susceptibility to mutation. Donor sites may be more susceptible to inactivation than acceptor sites. The R_i values of mutant donor sites are more likely than mutant acceptors to be less than R_i,min. Natural donors possess less information than acceptors [Stephens and Schneider, 1992] and the average decrease in information due to mutation at donor sites exceeds the reduction in R_i at acceptors. Information is also less densely distributed across acceptor splice sites (0.3 bits per nucleotide) than in donor sites (0.8 bits per nucleotide), so changes at acceptors often have a smaller effect on R_i. Significantly more primary mutations in donor sites (n=45) than acceptor sites (n=12) were found, as has been noted [Krawczak et al., 1992,Nakai and Sakamoto, 1994].

Cryptic splicing. The R_i values of most novel cryptic donor sites exceeded or were similar to those of the corresponding natural sites. Although similar results were also inferred from Shapiro-Senapathy consensus values [Krawczak et al., 1992], information analysis detects fewer incorrect cryptic splice sites [O'Neill et al., 1998], more accurately discriminates true sites from non-sites, and visually depicts both changes (Fig. 4).

An exon is initially defined by recognizing the acceptor [Berget, 1995]. Cryptic acceptor sites occur either upstream (n = 9) or downstream (n = 7) of the natural site (p = 0.4), suggesting that they are not located by scanning [Stephens and Schneider, 1992]. The exon definition model predicts that the spliceosome then scans downstream until a strong donor site is located [Robberson et al., 1990,Niwa et al., 1992], so a novel cryptic donor site created downstream of an intact natural site should not be recognized unless the natural site is mutated. In all cases, a decrease in the information content of the natural donor site activated pre-existing cryptic sites downstream (Table 2A). Furthermore, cryptic donor sites were activated more frequently upstream of the natural site (15 of 20; p=0.02). The idea that the splicing machinery selects for the strongest local acceptor splice site and scans for donors is supported by R_i analysis.
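The two p values quoted above are consistent with a one-sided binomial test against equal upstream and downstream probabilities, as the short check below shows. That the authors used exactly this one-sided form is an inference from the numbers, and the function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p=0.5):
    """Probability of observing k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Cryptic acceptors: 9 upstream of 16 in total (9 upstream, 7 downstream).
acceptor_p = binomial_upper_tail(9, 16)
# Cryptic donors: 15 upstream of 20.
donor_p = binomial_upper_tail(15, 20)
print(f"acceptors p = {acceptor_p:.2f}, donors p = {donor_p:.2f}")
```

The acceptor result is compatible with a symmetric distribution around the natural site, whereas the donor result indicates an excess of upstream cryptic sites.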

Nucleotide substitutions within 17 natural acceptor sites have been shown to create or strengthen adjacent cryptic sites that are thereby activated (see Results: II. Detection of cryptic splice sites). Only acceptors were found, perhaps because the variable polypyrimidine tract potentiates spliceosome recognition at many positions, whereas donor sites have high information density and a non-repeating sequence pattern [Stephens and Schneider, 1992]. For this reason, weaker cryptic sites are often found near natural acceptor sites (e.g., Fig. 4). Mutations involving the natural acceptor sometimes strengthen and activate these cryptic sites. The resulting aberrant exons may in some cases have been misidentified as natural splice products (e.g., Table 2B), since their length and sequence would differ by only a few nucleotides from the normal mRNA.

Conclusion. We have shown that individual information theory can be used to rank normal and mutant splice junctions. As a consequence, silent polymorphisms can be distinguished from true mutations, changes in individual information are related to splice site use, and activated cryptic splice sites can be detected. These distinctions are possible because the information measure is related to the thermodynamic entropy, and therefore can be connected to the binding energy [Szilard, 1964,Schneider, 1991a,Schneider, 1991b,Schneider, 1994]. The information in the splice site should be related to the specific binding interaction between the spliceosome and the site [Berg and von Hippel, 1987,Berg and von Hippel, 1988a,Berg, 1988,Berg and von Hippel, 1988b]. However, the relationship is an inequality (the second law of thermodynamics [Schneider, 1991b,Schneider, 1994]) and can only be explored empirically at this stage. The correlation between information measures and measured thermodynamic parameters is expected to relate genotypes to phenotypes in genetic disorders more precisely.
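As a worked form of the inequality mentioned above, the second-law bound is often written as a minimum heat dissipated per bit of information gained. This is a sketch in assumed notation: $q$ (heat, negative when dissipated), $R$ (information gained, in bits), $k_B$ (Boltzmann's constant) and $T$ (absolute temperature) are symbols introduced here for illustration and are not defined elsewhere in this text.

```latex
\begin{displaymath}
-q \;\ge\; k_B T \ln(2)\, R
\quad\mbox{(joules)},
\qquad\mbox{so that}\qquad
\mathcal{E}_{min} = k_B T \ln 2
\quad\mbox{(joules per bit).}
\end{displaymath}
```

At 37 °C (T of about 310 K) this bound is roughly 3 × 10^-21 joules per bit, or about 1.8 kJ/mol per bit. Because it is only a lower bound on dissipation, a measured energy sets an upper limit on the information that can be gained rather than fixing its value.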

We thank Greg Alvord for statistical consulting, and Kenn Kraemer and Howard Young for reading the manuscript. Grant support is acknowledged from the Public Health Service (CA74683) and the American Cancer Society (DHP-132) to P.K.R. We thank the Frederick Biomedical Supercomputing Center for access to computer resources and support services.

**Figure 1:** A primary splice junction mutation represented by sequence walkers.
A G→A mutation 1 nucleotide upstream of the exon 6 donor of the COL1A2 gene [GenBank accession number M35391] results in 50% exon skipping and Ehlers-Danlos syndrome, Type VII (Table 1, #13). This substitution, which significantly reduced the R_i value, defines the lower threshold of information required for splice site recognition since it is temperature sensitive, being non-functional at 39 °C but functional at 30 °C. The splice sites are shown by walkers [Schneider, 1997b] in which the height of a letter is the contribution of that base to the total conservation of the site. The upper bound of the vertical rectangles is at +2 bits, and their lower bound is at -3 bits. Letters that are upside down and point downwards represent negative contributions. The upper walker shows the normal site; the lower one displays the mutant sequence. The black arrow shows the position of the mutation (boxed). The dashed arrow represents the coding region.

**Figure 2:** A leaky splice junction mutation.
A G→A mutation 1 nucleotide upstream of the exon 8 donor site of the lysosomal lipase gene [LIPA; U04292] results in mild cholesterol ester storage disease with 4-9% enzymatic activity (Table 1, #45). The reduction in information content is significant even though the R_i value is still much greater than R_i,min.

**Figure 3:** Polymorphic variation that affects splicing.
Splicing varies among 3 common alleles that differ in length in the polymorphic polythymidine tract of the IVS 8 acceptor of the gene encoding the cystic fibrosis transmembrane conductance regulator [CFTR; M55114] (Table 1, #6). The shortest allele (bottom walker) shows 90% outsplicing of exon 9 and is associated with congenital absence of the vas deferens. Individuals with the two longer alleles have a normal phenotype, although the 7T allele produces less mRNA than the 9T allele. Exon 9 begins at the base indicated by the left bracket and dashes.

**Figure 4:** Cryptic site creation concurrent with mutation of the natural site.
An A→G mutation in intron 3 of the iduronate 2-sulfatase gene [IDS; L35485] significantly decreases the information content of the IVS 3 acceptor while simultaneously creating a strong cryptic site at the position of the mutation, 1 nucleotide upstream from the natural splice junction (Table 2, #27). The upper two walkers show a pre-existing cryptic site at position 5153 and a natural site at 5154. The lower two walkers show the activated cryptic site at 5153 and the mutant site at 5154. For simplicity, only sites with greater than 4.3 bits are shown. In addition, an unused 4.2 bit site at position 5155 is reduced to 2.5 bits as a consequence of the mutation. The lower bound of the vertical rectangles is at -7 bits.
