In eukaryotes, messenger RNAs are generated by a process that includes coordinated splicing and 3′ end formation. Factors essential for the splicing of mRNA precursors (pre-mRNA) in eukaryotes have been identified primarily through the study of nuclear extracts derived from mammalian cells and Saccharomyces cerevisiae genetics. Here, we identify homologues of most known pre-mRNA processing factors in the recently completed sequence of the Drosophila genome. The set of proteins required for RNA processing shows remarkably little variation among eukaryotic species, and individual proteins are highly conserved. In general, proteins involved in the mechanics of RNA processing are even more conserved than proteins involved in the interpretation of RNA processing signals. The genome does not appear to contain a gene for the U11 RNA, or for a protein unique to the U11 snRNP, which raises the possibility that the U12-dependent spliceosome functions without U11 in Drosophila.
Most RNA processing factors have been identified in either nuclear splicing extracts derived from mammalian cells or in Saccharomyces cerevisiae (Burge et al. 1999; Kambach et al. 1999; Minvielle-Sebastia and Keller 1999). However, Drosophila is extensively used for genetic investigations of complex and regulated splicing. In this review, we survey the recently completed Drosophila sequence (Adams et al. 2000) for sequences related to factors identified in these other systems. In many cases, functional data for the Drosophila protein are not available, and our assignments are based on the best match among genomes. We have not included genes that have been identified in Drosophila for which there is evidence of a role in splicing (see, for example, the list presented in Burnette et al. 1999). This analysis yields a list of 27 genes that encode small nuclear RNAs (snRNAs; see Table) and a list of 99 genes that encode proteins involved in RNA processing (see Table). Our survey confirms that the components of the RNA processing machinery are highly conserved. Very few factors identified in other species are absent from the Drosophila genome. In general, the Drosophila proteins are more closely related to their vertebrate counterparts than to the Saccharomyces cerevisiae proteins.
Protein sequences of known yeast and human splicing factors were used to query the annotated set of predicted Drosophila proteins using BLASTP, and the nucleotide sequence of the genome using TBLASTN, on the NCBI server (Altschul et al. 1997; http://www.ncbi.nlm.nih.gov/). All identified Drosophila genes were used to query the nonredundant database to establish the optimal yeast and human matches. Alignments were generated using the BLAST 2 Sequences option (Tatusova et al. 1999) or LALIGN (Huang and Miller 1991). Cytological positions were taken from GadFly (http://hedgehog.lbl.gov:8000) or FlyBase (http://flybase.bio.indiana.edu/), or deduced from the positions of flanking genes.
To identify snRNA genes, the Drosophila genome was queried using modified blastn parameters (parameter set A: -r 10 -q -11 -W 8 -G 100 -E 50; B: -r 10 -q -11 -W 7 -G 5 -E 20; C: -r 10 -q -11 -W 7 -G 15 -E 4; D: -r 7 -q -14 -W 7 -G 7 -E 3; and E: -r 4 -q -5 -W 8 -G 10 -E 2).
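The five parameter sets can be kept in a small lookup table. The Python sketch below only assembles argument lists and runs nothing. It assumes the legacy NCBI `blastall` program, in which `-r` and `-q` set the match reward and mismatch penalty, `-W` the word size, and `-G` and `-E` the gap opening and extension costs. The query file and database names are hypothetical placeholders, not names used in this study.

```python
# Parameter sets A-E as listed in the text:
# (match reward, mismatch penalty, word size, gap open cost, gap extend cost).
PARAMETER_SETS = {
    "A": (10, -11, 8, 100, 50),
    "B": (10, -11, 7, 5, 20),
    "C": (10, -11, 7, 15, 4),
    "D": (7, -14, 7, 7, 3),
    "E": (4, -5, 8, 10, 2),
}


def blastn_args(set_name, query="snRNA_query.fa", database="dmel_genome"):
    """Build a legacy-blastall argument list for one parameter set.

    The default file and database names are hypothetical; nothing is executed.
    """
    reward, penalty, word, gap_open, gap_extend = PARAMETER_SETS[set_name]
    return [
        "blastall", "-p", "blastn",
        "-i", query, "-d", database,
        "-r", str(reward), "-q", str(penalty),
        "-W", str(word), "-G", str(gap_open), "-E", str(gap_extend),
    ]


if __name__ == "__main__":
    # Print one command line per parameter set for inspection.
    for name in sorted(PARAMETER_SETS):
        print(name, " ".join(blastn_args(name)))
```

Keeping the sets in one table makes it straightforward to rerun the same query under each scoring scheme and compare which candidates recur.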
A curated database containing these results will be available at http://www.wam.umd.edu/~smount/DmRNAfactors/table.html.
Results and Discussion
Major and Minor snRNP Components
Two types of spliceosomes have been previously described (Burge et al. 1999). The more common U2-type spliceosome is responsible for splicing the majority of introns, and the U12-type spliceosome is responsible for splicing a minor class of rare introns (perhaps 0.1% in both humans and flies). The Drosophila genome contains multiple copies of the five U snRNAs found in the major class spliceosomes. We found five genes for U1, six genes for U2, three genes for U4, seven genes for U5, and three genes for U6 (Table). With the exception of U4-25F and the U5 genes (which were previously known only by in situ hybridization), these genes had been described previously (Alonso et al. 1984; Das et al. 1987; Saba et al. 1986; Saluz et al. 1988; Lo and Mount 1990). The variant U4-25F has only 69% identity with the major form of fly U4 (Saba et al. 1986), and 68% with human U4. Although the possibility that these new snRNA genes are pseudogenes cannot be ruled out, they appear likely to be functional because of their highly conserved promoters. In the case of U4-25F, some of the variation includes compensatory changes that allow formation of conserved stem-loop structures. There are four clusters of snRNA genes, including one at 38AB with two U2, one U4, and two U5 genes within 6 kb.
The Drosophila genome also contains introns that resemble the minor class (or U12) introns first identified in mammals (Adams et al. 2000). These are recognized by the U12-type spliceosome including U11, U12, U4atac, and U6atac snRNAs in place of U1, U2, U4, and U6 (Hall and Padgett 1994; Tarn and Steitz 1996). Identification of snRNAs for the U12-type spliceosome in the genome involved modification of the standard parameters for BLASTN (see Methods). It was possible to find one gene for U12 snRNA, one gene for U6atac snRNA, and one gene for U4atac snRNA. These are almost certainly authentic genes, as critical sequences are conserved. In addition, the highly conserved snRNA promoter is present in each case, including a 9/10 or perfect match to the PSE consensus TAATTCCCAA, which is ∼52 nucleotides upstream of the start (Jensen et al. 1998; Lo and Mount 1990). In contrast, no gene for the U11 snRNA was found. Consistent with the absence of a U11 snRNA, we also failed to find the U11 35-kD–specific protein (accession No. NP_008951; Will et al. 1999). In fact, the U11 snRNP, which functions in 5′ splice site recognition, may not be required for splicing. The highly conserved minor class 5′ splice sites could be recognized by an unknown protein that acts during the early steps of splicing, by the U6atac snRNA alone, or by both. This mechanism would be analogous to a situation seen in vitro, where certain vertebrate introns can be processed in the absence of U1 snRNP if the 5′ splice sites can be recognized by U6 snRNA (Crispino et al. 1994; Tarn and Steitz 1994).
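The promoter criterion described above (a 9/10 or perfect match to the PSE consensus, roughly 52 nucleotides upstream of the start) can be illustrated with a short ungapped scan. This is a minimal sketch and not the procedure used in this survey: the function name is hypothetical, and measuring the offset from the first base of the matching window to the transcription start is an assumed convention.

```python
PSE_CONSENSUS = "TAATTCCCAA"  # consensus quoted in the text


def best_pse_match(upstream, consensus=PSE_CONSENSUS):
    """Return (matches, offset, window) for the best ungapped match.

    `upstream` is the sequence 5' of the transcription start, written 5'->3',
    with its last base immediately preceding the start site. `offset` is the
    distance in nucleotides from the first base of the window to the start
    site (an assumed convention). Returns None if `upstream` is too short.
    """
    upstream = upstream.upper()
    k = len(consensus)
    best = None
    for i in range(len(upstream) - k + 1):
        window = upstream[i:i + k]
        matches = sum(a == b for a, b in zip(window, consensus))
        offset = len(upstream) - i
        # Strict comparison keeps the most upstream window when scores tie.
        if best is None or matches > best[0]:
            best = (matches, offset, window)
    return best


if __name__ == "__main__":
    # Toy upstream region (hypothetical): a 9/10 element placed 52 nt upstream.
    toy = "G" * 8 + "TAATTCCGAA" + "C" * 42
    print(best_pse_match(toy))
```

A candidate would pass the stated criterion when `matches` is 9 or 10 and `offset` lies close to 52.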
Each snRNP contains a set of Sm core proteins shared with the other snRNPs and a set of proteins that are specific to that snRNP. All 15 known proteins of the Sm family were identified and are highly conserved. These include seven Sm proteins that bind to the U1, U2, U4, and U5 snRNPs (B, D1, D2, D3, E, F, and G); the seven related LSm proteins found in the U6 snRNP (LSM2-LSM8); and the CaSm/LSM1p protein (Bouveret et al. 2000; Tharun et al. 2000). These and subsequent matches are shown in Table. The table reports, for each Drosophila gene, the GenBank accession number and expectation value (the expected number of matches this good or better; Altschul et al. 1997) for the best human and yeast (Saccharomyces cerevisiae) match.
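As a reminder of what the tabulated expectation values mean, the sketch below evaluates the standard BLAST relation between a normalized bit score and the expected number of chance matches, E = m × n × 2^(−S′), where m is the query length and n the database length. It ignores the edge-effect corrections that BLAST applies to both lengths, and the numbers in the example are hypothetical rather than values from the table.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), without effective-length corrections.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical example: a 300-residue query against 10 million residues.
    for bits in (30, 50, 100):
        print(bits, expect_value(bits, 300, 10_000_000))
```

Each additional bit of score halves the expectation value, which is why strongly conserved factors yield values many orders of magnitude below one.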
In addition to the Sm proteins, each snRNP also contains a set of snRNP-specific proteins. As expected, orthologues of the proteins that are present in both the vertebrate and Saccharomyces cerevisiae snRNPs are easily identified in Drosophila on the basis of their extensive sequence similarity. The one exception is that a single Drosophila protein, encoded by the sans-fille (snf) gene, corresponds to both the U1 snRNP-U1A and U2 snRNP-U2B′′ proteins (Polycarpou-Schwarz et al. 1996; Stitzinger et al. 1999); no additional homologues were found.
Interestingly, the Saccharomyces cerevisiae U1 snRNP is more complex than the vertebrate U1 snRNP, with seven additional protein components that are not found in the purified vertebrate U1 snRNP (Gottschalk et al. 1998; Rigaut et al. 1999). Only two of these proteins, Luc7 and Prp40, have easily identifiable Drosophila orthologues. The Drosophila orthologue of Luc7, CG7564, is 33% identical to the entire Luc7 protein (Fortes et al. 1999), and a second Drosophila Luc7-related protein is 21% identical to the yeast protein (Fortes et al. 1999). We identified a single Prp40-like gene in the Drosophila database. CG3542 shares 23% identity with the entire yeast Prp40 protein and 41% sequence identity over its entire length of 757 amino acids with the human protein, FBP11. FBP11 was initially identified because it contains a tyrosine-rich WW domain and, like Prp40, it interacts with the splicing factor SF1 (Bedford et al. 1997). These observations suggest that the function of these proteins in forming bridges between 5′ splice sites and the branchpoint may be conserved (Abovich and Rosbash 1997). It is likely that these Drosophila proteins, like their human homologues, would not be found in purified U1 snRNPs but nevertheless share a function with their yeast counterparts. A second human Prp40-like protein, FBP21, has been described in the literature (Bedford et al. 1998). FBP21 is more closely related to the Drosophila CG4291, with 28% identity over the entire length of the 338–amino acid protein. Because similarity between FBP11/CG3542 and FBP21/CG4291 is limited to the WW repeats, FBP21/CG4291 is unlikely to be related to Prp40. Consistent with this idea, human FBP21 has been found to stably associate with U2 snRNPs and, therefore, may function at a later stage of spliceosome assembly than does Prp40 (Bedford et al. 1998).
Searches with the yeast U1 snRNP proteins Prp39 and Prp42 identified only a single homologous sequence. Prp39 and Prp42 belong to a family of TPR repeat proteins (McLean and Rymond 1998) and share 25% sequence identity with each other over a ∼270–amino acid region that includes several copies of the TPR repeat motif. We identified a single Drosophila protein, encoded by the CG1646 gene, that shares 25% sequence identity with Prp39 and Prp42 over the same ∼270 amino acids, and is the best match between the Saccharomyces cerevisiae and Drosophila genomes. The Drosophila crooked neck protein (Crn) is another TPR repeat protein; its yeast homologue has been shown to act later in spliceosome assembly (Chung et al. 1999).
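The percent identity figures quoted in this section are computed from pairwise alignments. As a minimal illustration, the function below counts identical residues over the aligned, non-gap columns of two already-aligned sequences. The choice of denominator is an assumption made for this sketch; alignment programs such as LALIGN may instead report identity over the full alignment length, including gaps.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap.

    Both inputs must come from the same pairwise alignment and therefore
    have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # Toy aligned fragments (hypothetical, not taken from Prp39 or Prp42).
    print(percent_identity("MKT-LLV", "MRTALIV"))
```

Restricting the inputs to a defined region, such as the ∼270 residues spanning the TPR repeats, gives a regional identity of the kind quoted for Prp39, Prp42, and CG1646.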
Surprisingly, three Saccharomyces cerevisiae U1 snRNP proteins have no clear counterparts in the Drosophila database: we found no Drosophila protein whose best match in the S. cerevisiae genome is Snu71, Snu56, or Nam8. Recent work on Snu56 and Nam8 suggests that these proteins contact the pre-mRNA directly and may anchor the U1 snRNP onto the substrate (Puig et al. 1999; Zhang and Rosbash 1999), a function that could be dispensable in metazoans because it could be provided by the SR proteins. Alternatively, a similar function may be provided by proteins that do not appear to be orthologues, such as Drosophila rox8 in the case of Nam8 (Drosophila rox8 matches three other yeast proteins better than Nam8). Proteins in the U2 snRNP, U5 snRNP, and U4/U6.U5 tri-snRNP are generally highly conserved, and no significant differences between Drosophila and other species were revealed by our analysis (Table).
Proteins Required for Splice Site Selection
SR proteins are splicing factors that contain either one or two characteristic RNA-binding domains and an RS domain. These proteins are among the earliest acting proteins in spliceosome assembly (Zahler et al. 1992; Graveley et al. 1999; Tacke and Manley 1999). There are 11 well-characterized mammalian SR proteins: 9G8, SRp20, ASF/SF2, SC35, SRp30c, p54, SRp40, SRp55, SRp75 (for review see Mount 1997), NSSR1, and NSSR2 (Komatsu et al. 1999). Individual SR proteins differ with respect to the sequence specificity of their RNA-binding domains, and with respect to their ability to recognize and activate different exonic splicing enhancer sequences. We have identified seven SR protein genes in the Drosophila genome. These include the previously described B52, RBP1, SRp54 (Kennedy et al. 1998), and X16/9G8 (Vorbruggen et al. 2000) genes, as well as Drosophila orthologues of ASF/SF2 and SC35. In addition, we have identified a novel gene, CG1987, that is 95% identical to RBP1.
Phosphorylation of SR proteins is thought to play an important role in controlling spliceosome assembly (Stojdl and Bell 1999; Yeakley et al. 1999). Both SRPK and LAMMER (or CLK) kinases phosphorylate SR proteins. We have identified three kinases of the SRPK type (CG8174, CG9085, and CG8565) and only one LAMMER kinase, the previously described Doa kinase (Du et al. 1998).
A variety of proteins bind to pre-mRNA (also known as hnRNA), and many of these proteins, defined as hnRNP proteins, have been shown to influence splicing, typically by inhibition of splicing events near their binding sites (Chen et al. 1999). A number of Drosophila hnRNP proteins have been described (e.g., Matunis et al. 1992), and some nuclear RNA-binding proteins without clear homologues in mammalian species have unambiguous roles in the regulation of splicing (e.g., SXL). However, because it is impossible to determine from sequence alone whether a given RNA-binding protein is likely to function in splicing, or even to reside in the nucleus, we have not undertaken an analysis of these proteins. They are discussed in the accompanying article by Lasko (2000).
Genome Contents: Parallels and Differences
The results of our search for RNA processing factors known from studies in mammalian extracts and Saccharomyces cerevisiae genetics indicate that very few RNA processing factors are absent from the Drosophila genome. Indeed, our survey reveals remarkably little variation in this list among yeast, flies, and mammals. As expected, Drosophila proteins are more closely related to their vertebrate counterparts than to the Saccharomyces cerevisiae proteins.
The extensive conservation of the components of the spliceosome between vertebrates and Drosophila supports the suggestion that the primary mode of regulating splicing takes place at the level of spliceosome assembly (Lopez 1998; Staley and Guthrie 1998). Some of these factors, such as SR proteins, which regulate assembly of the spliceosome on many different RNAs, are well conserved and are easily identifiable. Missing from these tables are the factors that regulate the splicing of specific RNAs. These factors are less likely to be well conserved and, indeed, some may prove to be organism specific (e.g., SXL and TRA). Even the variation we observe among proteins and RNAs with clearly established roles in splicing, per se, is weighted towards the early events in splice site selection. For example, the Drosophila genome is missing a set of U1 snRNP proteins that are found in the yeast U1 snRNP but not in the vertebrate U1 snRNP, and the genome does not appear to contain a gene for the U11 RNA, or for the single known protein unique to the U11 snRNP, suggesting that U12 functions without U11 in Drosophila. Here again, variation is observed in components that function in splice site selection.
We thank Jonathan Roberts (University of Maryland, College Park, MD) for help automating the web searches and for help with the website. We thank Jo Ann Wise (Case Western Reserve University, Cleveland, OH), Javier Lopez (Carnegie Mellon University, Pittsburgh, PA), and Jim Manley (Columbia University, New York, NY) for discussions and comments on the manuscript. We apologize to those whose relevant publications could not be cited due to space limitations.
This work was supported by GM37991-11 to SMM and NSF-MCB9904565 to H.K. Salz.
Abbreviations used in this paper: hnRNA, heterogeneous RNA; pre-mRNA, precursors to mRNA; snRNA, small nuclear RNA; snRNP, small nuclear ribonucleoprotein.