The recombination signals (RS) that guide V(D)J recombination are phylogenetically conserved but retain a surprising degree of sequence variability, especially in the nonamer and spacer. To characterize RS variability, we computed the position-wise information, a measure correlated with sequence conservation, for each nucleotide position in an RS alignment and demonstrate that most position-wise information is present in the RS heptamers and nonamers. We have previously demonstrated significant correlations between RS positions and here show that statistical models of the correlation structure that underlies RS variability efficiently identify physiologic and cryptic RS and accurately predict the recombination efficiencies of natural and synthetic RS. In scans of mouse and human genomes, these models identify a highly conserved family of repetitive DNA as an unexpected source of frequent, cryptic RS that rearrange both in extrachromosomal substrates and in their genomic context.
The rearrangements of V, D, and J gene segments are mediated by RAG1 and RAG2, products of the recombination activating genes, Rag-1 and Rag-2 (for a review, see reference 1). RAG1 and RAG2 function as a DNA recombinase (2, 3) that recognizes recombination signals (RS) consisting of conserved nucleotide heptamers and nonamers separated by less conserved strings of 12 ± 1 or 23 ± 1 nucleotides (4, 5). Efficient physiologic V(D)J recombination occurs only between 12- and 23-bp spacer signals, defining the 12/23 rule (6).
DNA recombination is phylogenetically ancient and widespread. RAG1 has substantial homologies with Hin, which mediates DNA inversions in Salmonella (7, 8), and the inversion signals recognized by Hin are similar to the consensus RS with a 23-bp spacer (23-RS; reference 9). The V(D)J recombinase has also been shown to have latent transposase activity (10–12). These findings support the idea that RAG1 and RAG2 originated from a transposable element that was captured and enslaved by the vertebrate immune system; physiologic RS associated with Ig and Tcr gene segments are thought to be relics of this process. Other functional signals have been located at genomic sites that are not adjacent to a V, D, or J gene segment. These cryptic RS (cRS) may be responsible for some chromosomal translocations (for a review, see reference 13) and receptor editing (for a review, see reference 14). Undoubtedly, some of these cRS arise by chance; others, however, may be traced to the evolutionary origins of V(D)J recombination.
Physiologic RS are highly variable and the import of this genetic diversity is not understood (for a review, see reference 15). Studies using extrachromosomal recombination assays have shown that, while mutating the CAC trinucleotide in the first three positions of the RS heptamer dramatically reduced recombination efficiency, at least some mutations were tolerated at every other RS position and, at many positions, mutations had almost no effect on recombination (16–18). Consistent with the observation that nonamer positions are more tolerant to changes than heptamer positions, known cRS often contain recognizable heptamers but lack identifiable nonamers (14). Because of this variability, cRS can only be identified empirically, by observing their participation in illegitimate rearrangements.
We demonstrated previously that strong pair-wise correlations exist between RS positions, especially among positions in 23-RS (19). To understand the significance of these correlations, we developed statistical models of the correlation structure underlying RS variability; these models indicate that higher order correlations, between three or more positions, are also present (19). While most positions in the RS are correlated with at least one other position, the correlations can be ranked by their relative strength. Strong correlations substantially overlap sites of DNA ethylation/methylation interference present in RS complexed with RAG1/RAG2 (19, 20), suggesting that the correlations may be relevant to recombinase/RS interaction.
Our models compute a recombination signal information content (RIC) score to rate the potential function of any RS-length sequence and also to serve as a search procedure for RS. Retrospective analyses indicated that RIC scores are strongly correlated with RS efficiency and could locate known physiologic- and cRS in their genomic contexts (19). Here, we show that our models of RS structure accurately predict the activity of physiologic RS and identify new, functional cRS in the mammalian genome. In addition, we demonstrate Ig- and Tcr-associated patterns of RS variability that could influence receptor rearrangement. For the first time, the identity and efficacy of RS can be predicted from DNA sequence by a precise and rigorous algorithm. The ability to predict RS function and efficiency from these models opens the possibility of directed mutational analyses of RS structure and suggests that recombinase/RS interaction depends upon the cooperative influence of widely dispersed nucleotides in the RS.
Materials And Methods
RS Sequence Set.
We analyzed 356 physiologic mouse RS from all Tcr and Ig loci (available at http://www.duke.edu/~lgcowell). A detailed description of the data can be found in reference 19. When 12- and 23-RS were analyzed as a pooled set, positions 1 through 13 of 12- and 23-RS were aligned, and positions 14 through 28 of 12-RS were aligned to positions 25 through 39 of 23-RS.
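The pooled alignment described above can be illustrated with a minimal sketch. It assumes that the 11 alignment columns with no 12-RS counterpart (columns 14 through 24) are filled with a gap character; the function name, the gap symbol, and the example spacer are hypothetical and are not taken from the original analysis.

```python
def pool_12rs(seq12: str, gap: str = "-") -> str:
    """Pad a 28-bp 12-RS to the 39-column pooled alignment.

    Positions 1-13 keep their columns; positions 14-28 are shifted to
    columns 25-39. Columns 14-24 are filled with a gap symbol (assumption).
    """
    if len(seq12) != 28:
        raise ValueError("a 12-RS must be 28 bp (7 + 12 + 9)")
    return seq12[:13] + gap * 11 + seq12[13:]


# Consensus heptamer and nonamer around an arbitrary, made-up 12-bp spacer.
example_12rs = "CACAGTG" + "CTACAGACTGGA" + "ACAAAAACC"
pooled = pool_12rs(example_12rs)
```

After padding, a 12-RS and a 23-RS can be compared column by column over all 39 positions.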
Genomic Sequence Set.
The following mouse DNA sequences were analyzed in this study: 212,133 bp of chromosome 8 sequence (NCBI accession no. AC084823), 199,101 bp of sequence from the Tcr β locus (accession no. AE000665), and 3,926 bp from the DH locus (accession no. AF018146).
Calculation of Position-wise Information.
Information (I) is calculated from the Shannon entropy (21). The Shannon entropy at the ith position in an alignment is given by
$$H_i = -\sum_{j \in \{A,C,G,T\}} P_{i,j} \log_4 P_{i,j}$$
where P_{i,j} is the probability of nucleotide j at position i. The genomic entropy is
$$H_{Genome} = -\sum_{j \in \{A,C,G,T\}} q_j \log_4 q_j$$
where q_j is the probability of nucleotide j in the genome. The position-wise information content (22) is computed as I_i = H_{Genome} − H_i; with base-4 logarithms, the unit is 2 bits. For a DNA sequence alignment, I is correlated with sequence conservation: maximum I is 1 and indicates strict conservation; minimum I is 0 and indicates that no nucleotide is more frequent than any other.
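As an illustration of the calculation described above, the following is a minimal sketch rather than the code used in this study. It assumes base-4 logarithms, so that a strictly conserved position scores 1 when genomic nucleotide frequencies are uniform, and it estimates the position-wise probabilities as simple column frequencies. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def entropy(probs, base=4):
    """Shannon entropy of a discrete distribution (zero terms are skipped)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def positionwise_information(alignment, genome_freqs):
    """Return I_i = H_genome - H_i for every column of an ungapped alignment."""
    h_genome = entropy(genome_freqs.values())
    n = len(alignment)
    info = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        h_i = entropy(c / n for c in counts.values())
        info.append(h_genome - h_i)
    return info


# Toy alignment: three invariant positions followed by one fully variable position.
toy_alignment = ["CACA", "CACC", "CACG", "CACT"]
uniform_genome = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_info = positionwise_information(toy_alignment, uniform_genome)
```

In this toy case the three invariant columns carry the maximum information and the fourth column, where every nucleotide is equally frequent, carries none.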
Statistical Models of RS Structure.
We developed statistical models of RS correlation structure for 12-RS and 23-RS (19). Briefly, each model computes a score for any sequence of appropriate length (i.e. 28-bp sequences for the 12-RS model and 39-bp sequences for the 23-RS model) by taking the natural logarithm of the probability of observing the sequence as estimated by the nucleotide composition of the RS sequence set. The smallest model assumes that all nucleotide positions in RS are independent and is based on the set of probability distributions for the four nucleotides at each RS position, i.e. the probability of observing nucleotide X at RS position i for all positions i. The models were enlarged by the step-wise incorporation of correlation between one RS position and at least one other RS position. Correlations are included in the models by forming joint probability distributions for the correlated positions, e.g. the probability of observing nucleotide X at position i and nucleotide Y at position k, or the probability of observing nucleotides X, Y, and Z at positions i, k, and l, respectively. Joint probability distributions are formed when they increase the average probability of observing the set of physiologic RS. The final RS models define the set of probability distributions that assign the highest average probability to physiologic 12- and 23-RS.
The score (log P) for a sequence is a value between -∞ and 0. If RS were strictly conserved, sequences identical to the RS would have log P = 0 and all other sequences would have log P = -∞. RS are not strictly conserved, however, but the models were selected such that RS have higher log P on average than non-RS. We define the log P of a sequence as its RIC. RIC is computed as follows: RIC12 = ln[P1 P2 P3,15,25 P4,5 P6,28 P7,8,19 P9,26 P10,12 P11,27 P13,14,23 P16,17,18 P20,21,22 P24] for 12-RS and RIC23 = ln[P1 P2 P3 P4,14 P5,39 P6 P7,24,25 P8,9,21 P10,16 P11,12 P13,22 P15,23 P17,18 P19,27,30,31,32,33,37 P20,26 P28,29 P34,38 P35,36] for 23-RS. P1 is the marginal probability distribution for the four nucleotides at position 1, and P3,15,25 is the joint probability distribution for the 64 triplets at positions 3, 15, and 25. The presence of the joint probability function indicates that these three positions are correlated in the RS alignment. Correlation between positions may be observed because the positions act cooperatively to influence recombination or because RS share a common ancestry.
Only very low levels of extrachromosomal recombination are observed for RS not beginning with CAC (16, 17), so it is often assumed that CAC at positions 1–3 of the heptamer is required for recombination. The set of functional, physiologic RS reported by Ramsden et al. (18), however, includes an RS with heptamer CAGAGTG. Our models therefore require only CA at positions 1 and 2: they assign a probability of 0, and hence RIC = -∞, to any sequence not beginning with CA.
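The structure of the RIC calculation can be sketched as follows. The partition of the 28 positions of a 12-RS is copied from the RIC12 expression above; everything else (the function name, the data layout, and the toy probabilities) is a hypothetical illustration, not the fitted model, whose probability tables are not reproduced here.

```python
import math

# Grouping of 12-RS positions (1-based) taken from the RIC12 expression.
RIC12_PARTITION = [
    (1,), (2,), (3, 15, 25), (4, 5), (6, 28), (7, 8, 19), (9, 26),
    (10, 12), (11, 27), (13, 14, 23), (16, 17, 18), (20, 21, 22), (24,),
]


def ric(seq, factors):
    """Sum of log probabilities over marginal and joint factors.

    `factors` is a list of (positions, distribution) pairs, where positions
    are 1-based and the distribution maps the nucleotides observed at those
    positions (joined into a string) to a probability.
    """
    if not seq.startswith("CA"):
        return -math.inf  # sequences not beginning with CA get probability 0
    log_p = 0.0
    for positions, dist in factors:
        key = "".join(seq[p - 1] for p in positions)
        p = dist.get(key, 0.0)
        if p == 0.0:
            return -math.inf
        log_p += math.log(p)
    return log_p


# Toy 4-position model with made-up probabilities: P1 * P2 * P3,4.
toy_factors = [
    ((1,), {"C": 1.0}),
    ((2,), {"A": 1.0}),
    ((3, 4), {"CA": 0.5, "CG": 0.25, "TA": 0.25}),
]
toy_score = ric("CACA", toy_factors)  # equals ln(0.5)
```

Representing each factor by its position tuple keeps marginal and joint terms in one loop, and makes it easy to check that a partition covers every position exactly once.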
Correlation Between RIC and Recombination Efficiency.
Spearman's rank correlation coefficient (rS) was used to detect correlation between RIC and measured recombination efficiencies.
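For readers unfamiliar with the statistic, the sketch below shows one standard way to compute Spearman's rank correlation: rank both variables (averaging the ranks of ties) and take the Pearson correlation of the ranks. It is a generic illustration, not the software used in this study, and the example values are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's r_S: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented RIC scores and recombination efficiencies (%), for illustration only.
example_ric = [-10.0, -20.0, -30.0, -40.0]
example_r = [5.0, 3.0, 1.0, 0.1]
example_rs = spearman(example_ric, example_r)
```

Because only ranks enter the calculation, r_S detects any monotonic relationship between RIC and efficiency and does not require the relationship to be linear.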
Recombinationally Active Cell Lines.
The 103/BCL2 cell line was obtained by the transformation of mouse pre-B cells with a temperature-sensitive Abelson murine leukemia virus and the subsequent transfection with human Bcl-2 (23). 103/BCL2 proliferates at 34°C and expresses low levels of recombinase mRNA, protein, and activity (D. Ramsden, personal communication, and unpublished data). At 34°C, 103/BCL2 can support the rearrangement of efficient extrachromosomal recombination substrates (e.g., pJH290) but not rearrangement at the endogenous Igκ locus as detected by Southern blotting (23; and unpublished data). After as little as 12 h at 39°C, 103/BCL2 upregulates recombinase activity and rearranges the endogenous Igκ locus (23). 103/BCL2 cells were maintained at 34°C in RPMI 1640 supplemented with 10% FCS, 100 U/ml penicillin and streptomycin, 0.5 mg/ml Geneticin, and 0.55 μM 2-mercaptoethanol.
5B3 cells are M12 cells stably transfected with tetracycline-sensitive Rag1 and Rag2 (Tet-R1 and Tet-R2) vectors (24) modified to encode a RAG2-GFP fusion protein capable of supporting V(D)J recombination (25). Upon culture in the absence of tetracycline, 5B3 cells become GFP-positive and exhibit rearrangements of the endogenous Vλ and Jλ loci (25). 5B3 cells were cultured at 37°C in supplemented RPMI1640 (10% FCS, 100 U/ml penicillin and streptomycin, 3 mM histidinol, 10% glutamine, and 0.55 μM 2-mercaptoethanol) with or without tetracycline (0.5 μg/ml).
Extrachromosomal Recombination Assay.
We measured the efficiency of 18 physiologic and 9 synthetic RS by standard methods using extrachromosomal recombination templates. Briefly, recombination efficiencies of 12- and 23-RS were determined in pJH290 or a variant, p290T (see below), by the method of Hesse et al. (26). Both plasmids are coding joint substrates. In pJH290, a prokaryotic terminator of transcription is flanked by a 12- and a 23-RS; when pJH290 is transfected into recombination-competent 103/BCL2 (23), V(D)J recombination deletes a 300-bp fragment containing the RS and intervening sequence; free coding ends are recombined to form a coding joint in place of the deleted fragment. After alkaline lysis extraction of the plasmid, DH10B bacteria are transformed and the rearrangement status of plasmids in single bacterial colonies is assessed by PCR; 900-bp products represent intact pJH290, and 600-bp products indicate deletional rearrangements.
RS variants were introduced into plasmids by ligating representative 12- or 23-RS oligomers (Integrated DNA Technologies) into pJH290 digested with SalI or BamHI (NEB), respectively. SalI and BamHI restriction sites flank the 12-RS and 23-RS, respectively. Two additional modifications are present in p290T 23-RS variants, both created by the insertion of the 23-RS oligomer into the pJH290 backbone. All 23-RS oligomers carried a 4-bp deletion between the BamHI adhesive end and the nonamer, and both BamHI adhesive ends of 23-RS inserts were modified (to GGATCT). This modification produces a T substitution at the coding and signal flank of the 23-RS and renders the p290T plasmid resistant to BamHI digestion. These modifications, most likely the T substitution at the coding flank (27, 28), result in a 2-fold decrease in recombination efficiency compared to pJH290 (unpublished data). All pJH290 and p290T RS variants were confirmed by DNA sequencing.
10 μg of pJH290, p290T, or their RS variants were electroporated into 5 × 10⁶ 103/BCL2 cells. 103/BCL2 cells were washed with RPMI 1640 supplemented with 25 mM HEPES (Invitrogen) and resuspended to 1 × 10⁷ cells/ml. 0.5 ml of this suspension was transferred into electroporation cuvettes (0.4 cm; Bio-Rad Laboratories) and incubated with 10 μg of recombination substrate for 5 min at room temperature and 10 min on ice. Samples were electroporated (250 V, 960 μF, 0 Ω), and transfectant cells were immediately chilled on ice (10 min), diluted to 5.5 ml with supplemented RPMI 1640, and incubated at 34°C overnight. Transfectant cultures were transferred to flasks containing 25 ml of supplemented RPMI for 24 h at 34°C and then incubated at 39°C for 48 h. Plasmid DNA was recovered by alkaline lysis extraction (29) and digested with DpnI (NEB) in a total volume of 100 μl. Digested plasmid DNA was purified by phenol:chloroform extraction (29) into a 10 μl volume of water. DH10B bacteria (Invitrogen) were transformed with 1 μl of digested, purified plasmid. Bacterial transformants were streaked on ampicillin (50 μg/ml) Luria broth (LB) agar plates.
Determination of Recombination Frequencies.
We measured recombination efficiency (R) by two similar methods. Analysis of low (R < 1%) efficiency cRS was based on bacterial colonies carrying plasmid substrates that impart constitutive ampicillin resistance (ampʳ) and conditional chloramphenicol resistance (camʳ) (26). R was estimated by the ratio of bacterial transformants exhibiting conditional (ampʳ camʳ) and constitutive (ampʳ) drug resistance (26); final values for R were averaged from ≥5 independent electroporations.
For high efficiency (R > 1%) physiologic and synthetic RS, we screened recombination templates in ampr bacterial colonies directly by PCR. This approach reduces the assay's sensitivity (frequencies < 0.3%) but precludes selection for spurious double-resistance. For the estimation of R by PCR, ampr colonies were randomly picked and expanded overnight at 37°C in 150 μl of LB containing 50 μg/ml ampicillin. The region of pJH290 and p290T flanked by the 12- and 23-RS was amplified by common PCR primers (290For, 5′-ATTAATGCAGCTGGCACG-3′, and 290Rev, 5′-CACTATCCCATATCACCA-3′) using Taq polymerase (Invitrogen). Amplifications were performed on 5 μl of template in a 50 μl reaction. Cycling parameters were: 94°C, 5 min; 28 cycles of 94°C, 1 min, 55°C, 1 min, and 72°C, 1 min. 10 min at 72°C ended the PCR program. PCR products were electrophoresed over 1% agarose gels to identify unmodified (900-bp product) and rearranged (600-bp product) plasmids. Final values for R (nos. rearranged plasmids ÷ [nos. rearranged plasmids + nos. unmodified plasmids]) were averaged from five independent electroporations. All 600-bp products from rearranged plasmids were sequenced and possessed typical coding joints (unpublished data).
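The PCR-based estimate of R reduces to simple arithmetic, sketched below. The function names and the colony counts are hypothetical; only the formula (rearranged ÷ [rearranged + unmodified], averaged over independent electroporations) comes from the text.

```python
def recombination_frequency(n_rearranged, n_unmodified):
    """R for one electroporation: rearranged / (rearranged + unmodified)."""
    total = n_rearranged + n_unmodified
    if total == 0:
        raise ValueError("no colonies were scored")
    return n_rearranged / total


def mean_recombination_frequency(counts):
    """Average R over independent electroporations.

    `counts` is a list of (n_rearranged, n_unmodified) pairs, one pair per
    electroporation, scored as 600-bp and 900-bp PCR products respectively.
    """
    values = [recombination_frequency(r, u) for r, u in counts]
    return sum(values) / len(values)


# Invented colony counts for five electroporations (illustration only).
example_counts = [(6, 94), (4, 96), (5, 95), (7, 93), (3, 97)]
example_mean_r = mean_recombination_frequency(example_counts)
```

Averaging the per-electroporation ratios, rather than pooling all colonies, gives each independent electroporation equal weight regardless of how many colonies it yielded.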
To assay for RAG-induced, double-strand DNA breaks (dsb) in 5B3 cells, various ligation-mediated PCR (LM-PCR) reactions were performed (30-33). Briefly, genomic DNA was isolated from 5B3 cells cultured for 48 h in the presence (Teton) or absence (Tetoff) of tetracycline; the recovered DNA was subsequently ligated to the BW-LC linker (BW-LC1, 5′-AGCAACTGACGTGGAATCGCCAGAC-3′; BW-LC2, 5′-GTCTGGCGATTCC-3′; references 30 and 31). Ligated and unligated controls were amplified using locus-specific primer sets and Thermalase polymerase (Invitrogen). The locus-specific primers were γ-satellite (BW-LCHlong and γ1long: 5′-ACTGACGTGGAATCGCCAGACCAC-3′ and 5′-TTCCGTGATTTTCAGTTTTCTCGCC-3′, respectively), Vλ (25, 32), and Dβ (33). The PCR program for the amplification of γ-satellite DNA was: 98°C, 2 min, 28 cycles of 98°C, 30 s, 66°C, 30 s, and 72°C, 30 s and termination by 72°C for 10 min.
PCR products were electrophoresed over 0.8% agarose gels and transferred to nylon membranes (PerkinElmer; reference 29). LM-PCR products containing γ-satellite DNA were detected by hybridization (30, 31) with a 32P-labeled probe (BW-LCγ: 5′-GGAATCGCCAGACCACTGTAGGACCTGGAA-3′) that overlaps the BW-LC linker and a portion of the 234-bp γ-satellite repeat (34); hybridization was quantitated in a Storm phosphoimager (Amersham Biosciences). Hybridizations specific for Vλ and Dβ PCR products were carried out as described (25, 32, 33).
LM-PCR products were gel purified (QIAGEN) and ligated into the pCR2.1TOPO vector (Invitrogen) following the manufacturer's directions. TOP10 bacteria (Invitrogen) were transformed with the pCR2.1TOPO plasmid carrying LM-PCR inserts and streaked onto LB-agar plates supplemented with ampicillin (50 μg/ml) for blue/white colony selection as directed by the manufacturer. Single white colonies were picked and expanded overnight at 37°C in 3 ml LB supplemented with ampicillin (50 μg/ml). Cloned plasmid inserts were then purified by alkaline lysis extraction (QIAGEN) and sequenced with the M13 reverse primer by the Duke University DNA Sequencing Facility.
Patterned Genetic Variability in Mouse RS.
Alignment of 356 physiologic RS from all mouse Ig and Tcr loci reveals extensive sequence variability (19). To characterize this variability, we computed Ii for each nucleotide position in the RS alignment. Ii is proportional to sequence conservation; maximally informative positions are invariant whereas at minimally informative positions, nucleotides are present at frequencies equal to their genomic usage. The distribution of Ii along the RS alignment is shown in Fig. 1. Ii averaged over the length of the RS is 0.34. The heptamer has a higher mean position-wise information than the nonamer (I̅H = 0.78; I̅N = 0.53), and relatively little position-wise information (I̅S = 0.15) is contained in the spacer (Fig. 1). Different alignments of 12- and 23-bp spacers did not increase IS (unpublished data), and we did not find that 12-bp spacers are most similar to the first 12 nucleotides of 23-bp spacers (18). Separate alignments of 12- and 23-RS reveal greater conservation in 12-RS (I̅12 = 0.49; I̅23 = 0.33), and separate alignments of Ig and Tcr RS reveal greater conservation in Ig RS (I̅Ig = 0.46; I̅TCR = 0.30).
We computed Ii for each position in the alignment under two additional nucleotide classifications: the strength of hydrogen bonding, weak (T:A) vs. strong (G:C), and purine/pyrimidine. Higher position-wise information under one of these schemes would result if selection maintained nucleotide properties rather than particular nucleotides. I̅ was 0.28 under the weak/strong classification and 0.31 under purine/pyrimidine classification, indicating conservation for specific nucleotides. The distribution of Ii along the RS under both classifications was indistinguishable from that shown in Fig. 1 (unpublished data).
For any 28- or 39-bp sequence, the corresponding RS model computes a score, RIC12 or RIC23, respectively (19). For mouse 12-RS, the mean RIC12 score (R̅I̅C̅12) is –18.47; the highest RIC12 is associated with Vκ4-86 (−8.02) and the lowest with Jβ1-2 (−48.16). RIC12 scores are ranked in Fig. 2 A. The 100 highest RIC12 scores are similar, but the remaining 101 decrease more rapidly. When R̅I̅C̅12 and the mean rank for each locus containing 12-RS are plotted (Fig. 2 B), there is substantial overlap between RIC12 scores and between ranks across loci (Fig. 2, A and B). Nonetheless, RIC scores for Ig and Tcr RS are clearly separated. Ig 12-RS have higher RIC12 on average than Tcr 12-RS (−13.72 and −27.98, respectively; Fig. 2 B); the higher scores for Ig RS are consistent with their lower variability.
RIC is based on the product of probabilities; the longer 23-RS therefore have lower RIC values than 12-RS. R̅I̅C̅23 for the mouse RS studied is –32.39. The RS associated with Dδ1 receives the lowest RIC23 (−69.68), and that of VH1S60 receives the highest (−15.83). The range of RIC23 scores is broader than for 12-RS (54 vs. 40 RIC units), consistent with higher sequence variability of 23-RS. RIC23 scores within loci are also more variable than observed for 12-RS, and Ig 23-RS tend to have higher RIC23 values than Tcr 23-RS (unpublished data).
Resolution of RS from Surrounding DNA.
To resolve RS, we characterized the RIC12 and RIC23 distributions of non-RS DNA (19). From these background RIC distributions, we set threshold RIC scores, −40 for RIC12 and −60 for RIC23, that balance the numbers of physiologic RS with subthreshold scores and non-RS with scores above threshold (19). Putative RS are resolved from the genomic background by RIC ≥ threshold. Only five of the 356 (1.4 × 10⁻²) physiologic RS score below threshold, and the frequency of non-RS DNA sequences having RIC above threshold is 5 × 10⁻⁴ (19). This is not a false positive rate, however, as these high scoring sequences may function as RS.
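A threshold-based search of this kind can be sketched as a sliding-window scan. This is a minimal illustration under stated assumptions: the scoring function is passed in as a parameter (a toy stand-in is used here in place of the RIC models), only one strand is scanned, and all names are hypothetical.

```python
def scan_for_rs(sequence, score_fn, window, threshold):
    """Score every window of the given length; keep those at or above threshold.

    Returns (start, candidate, score) tuples with 1-based start positions.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        candidate = sequence[start:start + window]
        score = score_fn(candidate)
        if score >= threshold:
            hits.append((start + 1, candidate, score))
    return hits


def toy_score(candidate):
    """Stand-in scorer (not a RIC model): rewards a consensus heptamer."""
    return -10.0 if candidate.startswith("CACAGTG") else -100.0


# Made-up 36-bp sequence with one consensus heptamer starting at position 6.
toy_sequence = "TTTTT" + "CACAGTG" + "T" * 24
toy_hits = scan_for_rs(toy_sequence, toy_score, window=28, threshold=-40.0)
```

With window lengths of 28 and 39 and thresholds of −40 and −60, the same loop would apply to 12-RS and 23-RS searches, respectively; scanning the reverse complement as well would be needed to cover both orientations.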
To identify known, functional RS, we searched >400 kb of genomic DNA containing 39 physiologic RS to determine whether RIC scores could resolve 12- and 23-RS (19). Fig. 3 shows RIC12 values for a region of sequence AE000665 containing Jβ1-1 (−29.77) and Jβ1-2 (−48.16) and a region of sequence AF018146 containing DHFL16.1 (5′: −13.63, 3′: −15.53). Jβ1-2 is the lowest scoring physiologic 12-RS. Thus, RIC scores efficiently resolve physiologic 12-RS from the genomic background; results for 23-RS are similar (unpublished data). The only RS scoring below threshold are those associated with pseudogenes (19). RS flanking pseudogenes cannot be selected, and we expect their RIC to be below threshold but above background.
Prediction of RS Efficiencies.
Previously, we computed Spearman's rank correlation between RIC and published recombination frequencies (17); RIC12 and RIC23 scores correlated well with extrachromosomal measurements of R (19). These published frequencies, however, were determined for a single 12- and 23-RS pair; most other RS tested differed from these RS by only 1-2 point mutations (17).
To determine whether RIC scores predict the functional efficiency of highly variable physiologic RS, we calculated RIC for 28 physiologic and synthetic RS and determined R for each in a standard extrachromosomal assay (Fig. 4, and Table I). RIC12 scores for 10 physiologic 12-RS correlated strongly with recombination (rS = 0.81) explaining 66% of the observed variation. RIC23 scores for 18 physiologic and synthetic 23-RS also correlated well (rS = 0.76), explaining 58% of recombination variability.
To determine if RIC predicts function in RS not used for model development, we computed rS for physiologic and synthetic 23-RS separately. Recombination efficiencies of synthetic RS were predicted with very high accuracy, rS = 0.93, explaining 86% of observed variability. Thus, RIC scores are effective predictors of RS function, even when diverse or synthetic signals are analyzed.
Nucleotide differences in heptamers, nonamers, or spacers can profoundly and synergistically affect recombination (35). Nevertheless, RIC accurately predicts dissimilar recombination efficiencies in similar RS and similar efficiencies in RS that differ substantially. For example, 290Tspaγ and 290Tspa3 (Table I) share consensus heptamers and nonamers but differ in their spacers. RIC23 scores correctly predict that 290Tspa3 (RIC23 = −27.7; R = 6.1%) will rearrange with higher efficiency than 290Tspaγ (RIC23 = −49.8; R = 0.9%). The physiologic 23-RS 2305 and 2310 (Table I) differ from each other by only two nucleotides in a non-consensus nonamer; the large difference in their RIC23 scores (10 RIC23 units) is consistent with their very different recombination efficiencies (<0.004 and 3.5%, respectively). Reciprocally, the 12-RS 1206 and 1207 (Table I) have consensus heptamers but differ at nine positions in their spacers and two in their nonamers; whereas the 1206 nonamer is consensus, the 1207 nonamer is not. Despite these differences, RIC12 for these RS differ by <4 RIC12 units, and their recombination efficiencies are similar (Table I).
Recognition of Known cRS.
The ability to predict R for RS that are similar or dissimilar implies that our statistical models capture some fundamental quality(ies) of RS structure. To test the limits of our models, we used RIC scores to identify cRS and to predict cRS activity. cRS were not used for model development and are likely under less stringent selection than physiologic RS.
Lewis et al. (36) reported 14 cRS (one 23-RS and 13 12-RS) that mediated illegitimate V(D)J recombination in plasmids transfected into RAG-expressing cells. To determine retrospectively whether these cRS could be resolved from the plasmid backbone, we computed RIC scores for each cRS. The 23-cRS scored −53.84, well above the physiologic threshold, indicating strong RS function. The average RIC12 score for the 12-cRS was −50.3, below the physiologic threshold but well above the mean of −60.07 for non-RS DNA (19). These RIC scores could not be compared to the activity of the plasmid cRS because those data have not been reported (36). Nonetheless, the results show that the (fortuitous) cRS present in the pJH288 plasmid could be identified by RIC and that our models might be used to search prospectively for cRS.
To extend our analyses of cRS, we searched 234 mouse and 229 human VH gene segments (37) for cRS in 3′→5′ orientation (Table II); 12-cRS near the 3′ end of VH gene segments can mediate receptor editing (38, 39). Our search located 51 (out of 111,990 possible signals) potential cRS with RIC12 > −40.0 (Table II). Virtually all (50/51) of these were from mouse VH gene segments, a bias that may reflect the mouse data set used for model-building. Half (26/51) of these 12-cRS lie within 12 bp of the VH segment's 3′ end where receptor editing is observed. The cRS with the highest RIC12 (−29.28) is located 6 bp from the 3′ end of mouse VH2S5 (Fig. 5).
The number of cRS in VH gene segments with RIC12 > −40 probably underestimates the prevalence of functional signals; we expect cRS to have lower scores than physiologic RS. For example, a 12-cRS in the 3H9 transgene has RIC12 = −45.32 but is known to mediate VH replacement (39). RIC12 scores > −45 identify 290 cRS in human (123) and mouse (167) VH segments and indicate that >50% of human and mouse VH gene families contain gene segments with potentially functional cRS (Table II). In VH families where cRS can be identified, fully 30–100% of gene segments (including allelic forms) carry potential cRS, a finding consistent with their conservation for H-chain editing (39, 40).
To determine if RIC12 scores could predict function in VH cRS, five potential 12-cRS from mouse (VH2S2, VH2S5, VH5S1) or human (VH3-64, VH7-81) VH gene segments were tested in an extrachromosomal recombination assay (Table III). The 12-cRS tested had RIC12 scores ranging from −29.3 to −40.2 and were located near the 3′ end of a VH gene segment. The Jβ2-2 12-RS (RIC12 = −37.8) and the cRS present in VH 3H9 (RIC12 = −45.3; reference 39) were also tested.
All the putative cRS, except that present in 3H9 (i.e., RIC12 ≥ −40.3), rearranged in pJH290 (Table III). Both human VH cRS and the mouse VH2S5 and VH5S1 cRS supported deletional rearrangements with efficiencies (0.4–0.6%) equivalent to the Jβ2-2 12-RS (0.7%). Rearrangements of the third mouse cRS (VH2S2) were observed at 10-fold lower frequencies (0.03%); all rearranged plasmids were sequenced and confirmed to contain bona fide coding joints (unpublished data). Thus, RIC scores identified functional VH cRS even though these cryptic sequences were not used to generate our RS models. The absence (<0.01%) of detectable rearrangements to the known cRS of 3H9 (RIC12 = −45.3) suggests that the number of cRS capable of supporting VH replacement in vivo may well exceed our estimate of 290 functional signals (Table II).
Prospective Identification of Novel cRS.
The ability of RIC scores to identify functional cRS in VH gene segments indicated that our models might locate unknown cRS. We therefore searched >10.5 Mb of mouse and human cDNA and genomic DNA for potential 12- and 23-cRS. Using RIC scores that indicate physiologic thresholds of activity (RIC12 ≥ −40 and RIC23 ≥ −60; reference 19), we identified 4,746 12-cRS and 16,439 23-cRS, yielding a frequency of 5 × 10⁻⁴ cRS/bp. This value is lower than that estimated by Lewis et al., 1.7 × 10⁻³ (36), from illegitimate rearrangements in plasmids but indicates that some 0.5–1 × 10⁶ cRS capable of efficient rearrangement are present in the mammalian genome.
Some of the potential cRS identified by this search are embedded in the 234-bp repeat of mouse γ-satellite DNA (R̅I̅C̅23 = –64.2 ± 5.1; reference 34) and in a highly similar repeat present in the human genome (41). Some of these potential cRS, e.g. γ01, γ12, and γMD (Table III), have RIC23 scores (−59.9 to −53.2) indicative of efficient recombination; we cloned three of these cRS into pJH290 (p290γ01, p290γ12, p290γMD) to determine their recombination efficiencies (Table III).
Two of the three γ-satellite cRS mediated detectable levels of V(D)J recombination (Table III). p290γ01 had a recombination efficiency (0.6%) equivalent to the mouse Jβ2-2 RS (Table III), and p290-γMD also rearranged, albeit 30-fold less efficiently (0.02%). We were unable to detect rearrangement (<0.01%) in p290-γ12. Sequencing confirmed that all rearrangements were to the heptamer-like motif of the γ-satellite cRS rather than spurious rearrangements to cryptic signals in the plasmid backbone (unpublished data).
To determine if endogenous γ-satellite DNA could rearrange in vivo, we performed an LM-PCR (31) specific for signal end cleavage in the γ-satellite repeat using the recombinase-inducible cell line, 5B3 (25). After 48 h of culture in the absence of tetracycline (Tetoff), ∼70% of 5B3 cells become GFP+, indicating the production of transgenic RAG1 and RAG2:GFP (Fig. 6 A). Ligase-dependent PCR products consistent with blunt-ended dsb at γ-satellite cRS heptamers are present at low levels even in 5B3 cells cultured in medium containing tetracycline (Teton), but analogous dsb in the endogenous Igλ and Tcrβ loci are undetectable (Fig. 6 B). Under Tetoff conditions, γ-satellite dsb increase 8- to 16-fold and dsb in the Igλ locus become abundant. In our hands, Tetoff 5B3 cells also exhibit low levels of dsb in the endogenous Dβ loci (Fig. 6 B). Thus, dsb consistent with cleavage at the γ-satellite heptamer-like element are induced in 5B3 cells under conditions that promote rearrangement intermediates in the endogenous Igλ and Tcrβ loci. Sequence analysis confirmed that LM-PCR products produced from Vλ signal end (SE) and Dβ5'SE-specific primers represented authentic recombination intermediates (unpublished data). We interpret the presence of γ-satellite dsb under Teton conditions as the result of imperfect silencing of the RAG transgenes and the extraordinary abundance of γ-satellite DNA in the mouse genome (34) but cannot exclude the possibility that γ-satellite DNA is exceptionally fragile.
If γ-satellite DNA were exceptionally prone to mechanical shearing or to cleavage by mechanisms unrelated to V(D)J recombination, we should observe frequent dsb at sites other than the γ-satellite heptamer. We therefore cloned and sequenced equal numbers (n = 17) of LM-PCR products recovered from 5B3 cells grown under Teton or Tetoff conditions and compared them to the γ-satellite consensus generated in Vector NTI (Informax) from the 31 published γ-satellite repeat elements (34; Fig. 7 A). Comparison to the consensus sequence permitted our detection of artifactual CAC-bearing heptamers introduced during PCR amplification (42). Our sequence analysis demonstrated that virtually all (32/34) of the sequenced LM-PCR products represented the BW-LC linker fused to a γ-satellite cRS heptamer-like element (Fig. 7 A). Half (9/17) of the γ-satellite sequences recovered from 5B3 cells under Teton conditions are repeats; five repeats were recovered under Tetoff conditions (Fig. 7 A). Identical γ-satellite motifs were amplified under Teton and Tetoff conditions, even though LM-PCR product was increased ∼10-fold in Tetoff cells (Fig. 6). Recovery of identical LM-PCR products under Teton and Tetoff conditions is consistent with low levels of constitutive recombinase activity in 5B3. Alignment of Teton and Tetoff γ-satellite LM-PCR products illustrates the lower diversity of the Teton group (Fig. 7 A) and suggests that these sequences might represent the more efficient group, even though average RIC23 scores for the Teton and Tetoff γ-satellite signals do not significantly (P > 0.5) differ (−74.4 versus −72.9, respectively).
Comparisons of each γ-satellite LM-PCR product (Fig. 7 B) to the consensus γ-satellite repeat demonstrate favored sites for linker ligation; with a single exception all of these are compatible with recombinase-mediated DNA cleavage (Fig. 7 B). The 234-bp γ-satellite repeat contains nine CA dinucleotides that represent potential cRS and are evaluated by the 23-RS model (Fig. 7 B); the canonical CAC trinucleotide of the RS heptamer is found at six of these (Fig. 7 B). Two Tetoff LM-PCR products indicated ligations to sites other than a CAC trinucleotide. The first, at position 96, indicated linker ligation to a consensus AAC and the second (position 98) to CAT (Fig. 7 B). These LM-PCR products are atypical for recombination intermediates and may represent dsb unassociated with recombinase activity. Of the six potential γ-satellite cRS beginning with CAC, three – at positions 44, 102, and 158 (Fig. 7 B) – account for most (81%, 26/32) dsb events. The 23-cRS at positions 44 and 158 of the consensus repeat exhibit near-physiologic RIC23 scores and were predicted by our model. The model predicts only a low recombination frequency for the major cRS at position 102 of the γ-satellite consensus (Fig. 7 B). We also scored the 31 published γ-satellite repeat elements (34) used to generate the consensus. 26 of the 31 have a CA, and thus a potential cRS, at position 102. The average score for the 26 cRS is –75.86, and the model predicts a higher recombination efficiency for 15 of the 26 than for the consensus position 102 cRS. Thus, to our knowledge, this is the first prospective identification of a cRS based solely on primary sequence analysis.
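The enumeration of candidate sites described above can be illustrated with a minimal sketch. It assumes that a potential 23-cRS is any 39-nt window (7-nt heptamer, 23-nt spacer, 9-nt nonamer) that begins with CA on the strand being read, and it flags windows that begin with the canonical CAC. The function name `candidate_rs_windows` and the toy sequence are hypothetical and are not taken from the γ-satellite consensus; scoring of each window by the 23-RS model is not reproduced here.

```python
RS23_LENGTH = 7 + 23 + 9  # heptamer + 23-bp spacer + nonamer = 39 nt


def candidate_rs_windows(seq, length=RS23_LENGTH):
    """List (1-based start, window, starts_with_CAC) for every full-length
    window that begins with the CA dinucleotide (hypothetical helper)."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        if window.startswith("CA"):
            hits.append((start + 1, window, window.startswith("CAC")))
    return hits


# Toy sequence invented for illustration only (not real gamma-satellite DNA).
toy = "CACAGTG" + "A" * 23 + "ACAAAAACC" + "GGCATTT"
hits = candidate_rs_windows(toy)
for position, window, has_cac in hits:
    print(position, window[:7], "CAC" if has_cac else "CA only")
```

Each window returned by such a scan would then be passed to the scoring model; only the forward strand is scanned in this sketch.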
The information I present in individual positions of the RS alignment is predominantly in the RS heptamer and nonamer (43; Fig. 1). Genetic variability elsewhere reduces the average Ii to 34% of the I̅ expected if all RS were identical. We find this level of I̅ to be surprisingly low for a signal mediating the introduction of dsb in DNA; promiscuous recombination driven by poorly regulated DNA cleavage would cause significant damage to the cell. We have previously shown that RS positions are correlated (19); these correlations could increase the specificity of the signal contained in the RS and reduce promiscuous binding. We introduced a model of RS correlation structure that computes a score, RIC, for any RS-length sequence (19). RIC efficiently identifies physiologic RS and known cRS (Figs. 2, 3, and 5; Tables I and III) and is strongly predictive of recombination efficiency (Fig. 4). Together these results suggest that RIC captures biologically important RS characteristics. In fact, the strongest correlations in the models overlap regions of RS/recombinase contact (19, 20).
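As an illustration of the two quantities discussed here, the following sketch computes position-wise information and the mutual information between two alignment columns. It assumes the standard Shannon formulation for DNA, Ii = 2 − Hi bits, with no small-sample correction; the exact estimators used for the RS alignment may differ. The function names and the toy alignments are hypothetical.

```python
import math
from collections import Counter

MAX_BITS = 2.0  # log2(4): information at a perfectly conserved DNA position


def positionwise_information(alignment):
    """Return I_i = 2 - H_i (in bits) for each column of an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    n = len(alignment)
    info = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(MAX_BITS - entropy)
    return info


def mutual_information(alignment, i, j):
    """Return the mutual information (bits) between columns i and j."""
    n = len(alignment)
    pair = Counter((seq[i], seq[j]) for seq in alignment)
    left = Counter(seq[i] for seq in alignment)
    right = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), c in pair.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi


# Toy alignment invented for illustration: three conserved columns, one random.
toy = ["CACA", "CACC", "CACG", "CACT"]
info = positionwise_information(toy)
# Average information as a fraction of the value for identical sequences.
fraction_of_max = sum(info) / (MAX_BITS * len(info))
print(info, fraction_of_max)
```

In this toy case the average information is 75% of the maximum; the 34% figure reported for the RS alignment is the analogous ratio computed over all RS positions.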
On average, RIC scores for Ig RS are higher than for Tcr RS, possibly due to the slight overrepresentation (56%) of Ig RS in our data set. We doubt, however, that this small surplus alone could be the cause. Genetic variability in Tcr RS is greater than in Ig RS. Why? It is unlikely that recombinase-RS interaction differs in B and T cells, but if it did, discrete patterns of mutual information (MI) in Tcr and Ig RS should exist. Instead, I and MI are patterned similarly for all RS groups in our data set, and we observe lower levels of sequence conservation in Tcr RS. It may be that genetic variability in Tcr RS is expanded to influence the TCR repertoire (44). While Ig RS could also bias the Ig repertoire (15, 45, 46), the MHC-restriction of TCR may have favored biased associations of Tcr V, D, and J gene segments (47-49). Increased variability in Tcr RS could serve to increase favored Tcr rearrangements by preferentially guiding rearrangement partners.
We also find greater variability among 23-RS than among 12-RS. The recombinase may interact differently with 12-RS than with 23-RS, due to their different lengths and/or to enforce the 12/23 rule, resulting in more stringent sequence constraints for 12-RS. For example, Swanson and Desiderio (20) observed ethylation/methylation interference at 11/12 spacer positions in 12-RS but at only 3 positions in 23-bp spacers. RS spacers may be bent when bound to the recombinase (50); this bending, or other structural constraints such as rotational phasing, may constrain the shorter spacers more severely. HMG1 and HMG2 have a more pronounced effect on the binding and bending of 23-RS than of 12-RS (51), and the RS positions contacted by HMG1 differ between the two types of RS (52).
It is also possible that the increased variability among 23-RS results from a unique role in the regulation of ordered assembly and/or allelic exclusion at the H, β, and δ loci. There is strong evidence that 12-RS regulate the precise targeting of Dβ gene segments to Jβ gene segments and Vβ gene segments to Dβ gene segments (47, 48). These results do not rule out a role for 23-RS, however, and they demonstrate that RS can play a significant role in regulating ordered assembly at the β locus. The high level of variability among 23-RS can be explained by hypothesizing that there are two groups of 23-RS, those that participate in the first stage of rearrangement (JH, Dβ, and Vδ) and those that participate in the second stage of rearrangement (VH, Vβ, and Dδ). The specificity of the signal in the two sets of RS may differ, or the V-to-DJ type rearrangers may simply be less efficient. Indeed, Liang et al. (53) have recently demonstrated that recombinase activity mediated by core or full length RAG2 distinguishes between D→J and V→DJ 23-RS groups. This finding supports the notion that patterned variability among RS could provide a mechanism for regulating receptor assembly and allelic exclusion at the β, H, and δ loci (53).
RIC scores for physiologic RS lie well outside background distributions (Fig. 2 B), allowing us to define thresholds that discriminate between RS and non-RS. When correlations between positions in RS are ignored, as in consensus models, scores for non-RS increase and resolution of RS becomes problematic (19). Not only do higher RIC values identify physiologic RS located in the Tcr and Ig loci (19; Fig. 3), but RIC scores are also highly correlated with recombination efficiency (19; Fig. 4). Determinations of 12- and 23-RS efficiencies in a standard extrachromosomal recombination assay (26) revealed very high correlations between measured and predicted recombination efficiencies (rS = 0.81 and 0.76 for 12- and 23-RS, respectively) even for synthetic signals not present in nature (Fig. 4). Analogous models that ignore associations between nucleotide positions in RS never predict recombination better than RIC and sometimes predict it much less well (19).
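The rank correlation reported above (rS, Spearman's rank correlation coefficient) can be sketched as the Pearson correlation of the ranks of two variables, with tied values given their average rank. The helper names below are hypothetical, and the paired scores and efficiencies are invented numbers used only to exercise the code; they are not data from Fig. 4.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical model scores and measured efficiencies (illustrative values only).
scores = [-40.0, -35.0, -30.0, -25.0, -20.0]
efficiencies = [0.05, 0.01, 0.2, 1.0, 3.0]
print(spearman(scores, efficiencies))
```

Because only ranks enter the calculation, the coefficient is insensitive to the scale on which recombination efficiency is measured, which suits comparisons between a log-scale score and a percentage.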
Of particular interest is the ability of RIC to predict dissimilar recombination efficiencies in similar RS. The physiologic RS p290T-2305 and p290T-2310 differ at only two nonamer positions (Table I). These signals are respectively associated with the VH7S4 and VH7S3 gene segments (54, 55); in the unselected B cell repertoire, VH7S3 is ∼8-fold more frequent than VH7S4 (54, 55). Even though p290T-2305 and p290T-2310 RS share 95% sequence identity, their RIC23 scores are very different: −32.7 and −22.5, respectively (Table I). This difference correlates with their relative activities in extrachromosomal substrates (Table I and Fig. 4) and with their usage in vivo (54, 55).
Given the ability of our models to identify physiologic RS and accurately predict their efficiencies, we searched mouse and human VH gene segments to determine if the models could also identify embedded 12-cRS (14). RIC12 scores located known and novel VH cRS and predicted that efficient 3′ cRS, located where receptor editing could result in a functional H chain, are common (Fig. 5 and Table II). Our genomic scans indicate that >50% of gene segments comprising six mouse and three human VH gene families contain putative cRS in this location (Table II), a result consistent with other analyses based on heptamer-like motifs conserved in Ig loci (for a review, see reference 14).
In contrast to searches for cRS “heptamers”, our models predict recombination efficiencies based on the entire RS sequence, allowing for the identification of cRS in VH gene segments that are likely to function. All five of the potential VH-associated cRS that we selected for testing in an extrachromosomal recombination assay exhibited detectable activity; four at levels similar to physiologic RS (Table II). To our knowledge, this is the first evidence that VH cRS can support V(D)J recombination efficiently, at levels near that of some physiologic RS. Our findings are consistent with models of B cell development where VH replacement contributes significantly to the BCR repertoire (14).
Previously, a single VH-associated cRS was predicted by Feeney and colleagues (56) based on the presence of a heptamer-like motif (CACAGTA) and its 3′ location in a VH gene segment. No recombination events mediated by this cRS were detected (56), but the identical cRS was functional in our hands (p290-m5S1, Table II). We speculate that the detection method used by Nadel et al. (56) was less sensitive than our own to infrequent recombination events.
The prevalence of cRS at the 3′ end of VH gene segments has led to speculation that these signals are conserved for VH gene replacement (14). Studies of IgH knock-in mice (39, 57, 58) have clearly demonstrated the possibility of VH gene replacement in vivo; however, the strong selective forces acting on B cells in these animals may emphasize rare or antigen-independent replacements (39, 56-58). For example, it is not clear if the VH replacements observed in IgH knock-in mice occur at a stage of B cell development consistent with (self) antigen-driven selection (39, 56-58).
We also identified cRS within the 234-bp repeat of γ-satellite DNA. γ-satellite DNA is a highly repetitive, tandemly arrayed element that comprises ∼6% of the mouse genome (34). A highly similar (∼95%) repeat is also present in human DNA (reference 41 and unpublished data), suggesting that this repeat is phylogenetically conserved. The abundance and conservation of this complex DNA motif suggest that the γ-satellite repeat may represent a link between physiologic RS and the transposon ancestor of RAG1/2 (11, 12). cRS in γ-satellite DNA rearrange with variable efficiencies in extrachromosomal substrates (Table III), but at least one γ-satellite cRS, p290γ01, rearranges as efficiently as the Jβ2-2 RS (Table III). γ-satellite cRS can function in vivo; LM-PCR products consistent with γ-satellite rearrangement intermediates are substantially increased in 5B3 cells under the Tetoff culture conditions that up-regulate expression of RAG1 and RAG2:GFP and initiate V(D)J rearrangement in the endogenous Igλ and Tcrβ loci (Figs. 6 and 7). At least some γ-satellite DNA is accessible to enzymatic machinery, as demonstrated by abundant γ-satellite RNA transcripts and recurrent integration of active transgenes into γ-satellite DNA (59–61). Functional cRS are also present in CA dinucleotide repeats (36). In contrast to CA repeats, however, γ-satellite cRS are complex, closely resemble physiologic RS, are more abundant than CA repeats (34, 62), and rearrange more efficiently (Table III and reference 36). The abundance of γ-satellite cRS may make them a common substrate for illegitimate V(D)J rearrangement and a potential site for RAG-mediated genomic remodeling (63), or a frequent and safe site where RAG-induced dsb can harmlessly rearrange.
RIC23 scores identified two of the three efficient cRS in the γ-satellite consensus. The model's prediction of only low recombination efficiency for the major cRS at position 102 of the consensus indicates that further work is necessary to model and understand RAG/RS interaction. Nonetheless, even in their current iteration, our statistical models are capable of identifying functional RS in the genome and offer the basis for rational analyses of RS structure by mutagenesis.
RS variability is sufficient to preclude exhaustive measurements of recombination efficiencies and effective searches for cRS. Except for the 12/23 rule (6) and the requirement for a CAC heptamer (16, 17), RS function cannot be predicted (64, 65). Surprisingly, methods for the characterization of variable DNA sequence motifs have been available for 20 years (66–71), but until now, RS have only been described using consensus methods (18). RIC and these older probabilistic methods (66–71) will always outperform consensus methods for representing variable DNA motifs because they do not censor the information present in genetic diversity. Additionally, RIC incorporates correlation structures ignored by previous methods, increasing its ability to resolve and evaluate DNA motifs (19). RIC accurately identifies RS and predicts recombination efficiencies for physiologic, synthetic, and cRS (Figs. 4 and 6; Tables I and III). In addition, the statistical models that generate RIC can scan genomes for cRS. Our frequency estimates for fortuitous 12- and 23-RS (1–4 × 10⁻⁴) are 10-fold below earlier, empirical estimates (36). This higher level of discrimination is important when searching for cRS that may participate in illegitimate rearrangements, e.g., potential cRS in VH gene segments and at breakpoints of chromosomal translocations (13). Statistical models of RS structure are designed to aid empirical studies by focusing experiments on the most promising candidate structures; RIC's place in the study of V(D)J recombination is to identify and prospectively evaluate RS, ending roundups of the usual suspects.
We are grateful to Dr. D. Ramsden (University of North Carolina, Chapel Hill) for expert advice and the pJH290 substrate, and to Dr. E. Oltz (Vanderbilt University) who provided the 5B3 cell line. We are also grateful to Dr. N. Rosenberg (Tufts University) for the 103/BCL2 cell line. We thank Drs. D. Ramsden and M. Schlissel (University of California, Berkeley) for their comments on the manuscript.
L.G. Cowell received a Bioinformatics and Genome Technology postdoctoral fellowship from Duke University and support from National Institutes of Health training grant T32 AI52077. This work was supported in part by U.S. Public Health Service grants AI24335 and AI49326 (to G. Kelsoe).
L.G. Cowell and M. Davila contributed equally to this work.
Abbreviations used in this paper: ampr, ampicillin resistant; camr, chloramphenicol resistant; cRS, cryptic RS; dsb, double-strand breaks; I, position-wise information; MI, mutual information; LM-PCR, ligation-mediated PCR; RIC, RS information content; RS, recombination signal; rS, Spearman's rank correlation coefficient; 12-RS, 12-bp spacer RS; 23-RS, 23-bp spacer RS.