The complete nucleotide sequence of the 957-kb DNA of the human immunoglobulin heavy chain variable (VH) region locus was determined and 43 novel VH segments were identified. The region contains 123 VH segments classifiable into seven different families, of which 79 are pseudogenes. Of the 44 VH segments with an open reading frame, 39 are expressed as heavy chain proteins and 1 as mRNA, while the remaining 4 are not found in immunoglobulin cDNAs. Combinatorial diversity of VH region was calculated to be ∼6,000. Conservation of the promoter and recombination signal sequences was observed to be higher in functional VH segments than in pseudogenes. Phylogenetic analysis of 114 VH segments clearly showed clustering of the VH segments of each family. However, an independent branch in the tree contained a single VH, V4-44.1P, sharing similar levels of homology to human VH families and to those of other vertebrates. Comparison between different copies of homologous units that appear repeatedly across the locus clearly demonstrates that dynamic DNA reorganization of the locus took place at least eight times between 133 and 10 million years ago. One nonimmunoglobulin gene of unknown function was identified in the intergenic region.
During the vertebrate immune response, Ig and TCR play central roles in antigen recognition. The NH2-terminal portion of their subunits is called the V region because of its diverse amino acid sequence required for interaction with a diverse spectrum of antigens. Generation of the primary V-region repertoire depends on the common genetic basis and molecular mechanisms characteristic of these antigen-receptor molecules (1–3). First, the V regions are encoded by two or three genetic segments, namely V, D, and J segments, each of which comprises multiple copies and provides the repertoire before somatic mutation. Second, during the ontogeny of lymphocytes, each one of these segments is chosen to undergo a somatic recombination event called V-(D)-J recombination, giving rise to the combinatorial and junctional amino acid diversity. Upon encounter with antigens, further diversification and refinement of the Ig repertoire is accomplished by a process known as affinity maturation, which includes somatic hypermutation, receptor editing, somatic gene conversion, and clonal selection (1, 2). In contrast, the V-region diversity of TCR is fixed through the selection process in the thymus and maintained without any modification (3). Although these molecules are likely to be derived from a common ancestral receptor molecule, much more complex molecular mechanisms are used for the refinement of the V-region repertoire of Ig than TCR after maturation of lymphocytes.
The Ig molecule is encoded by three independent gene loci, namely Igκ and Igλ genes for the L chain and IgH genes for the H chain, which are located on chromosome 2 (4, 5), chromosome 22 (5, 6), and chromosome 14 (7), respectively. Each of these loci spans a large DNA region of one to a few megabases (Mb) (8–12). Although antibody function is determined by the complementation of L and H chains, accumulating evidence suggests that the major contribution to the generation of the diversity and specificity of Ig is from the H-chain molecule. The existence of an additional set of gene segments, namely D segments, and their involvement in V-D-J recombination enormously increase the sequence variability of the VH region. Receptor editing by rearrangement of the silent allele has not been reported for the H-chain locus (13, 14), possibly indicating a critical role of the H chain for the antigen specificity of the Ig molecule.
It is, therefore, important to have the complete structure of the human VH locus in order to understand the origin and behavior of the human immune repertoire. In addition, such studies will be useful in designing humanized antibodies. One of the best examples is the establishment and analysis of the xenomouse, which has deletions of the endogenous JH and Jκ loci but carries human VH and Vκ segments as transgenes to produce known human antibodies (15). Knowledge of the number and organization of germline VH and Vκ segments is essential to test the correlation between the germline repertoire and B lymphocyte repertoire formation in vivo.
Comparison of nucleotide sequences of the 5′-regulatory region of VH segments may tell us how human VH segments are transcriptionally regulated. Because the recombination signal sequences (RSS) flanking the germline gene segments play a key role in V-D-J rearrangement, it is interesting to test the correlation between the usage of individual VH segments and the sequence variation within the RSS. Existence of a novel VH family may provide additional V-region diversity. Isolation of polymorphic markers along the locus will greatly facilitate IgH haplotyping and subsequent systematic genetic analyses to examine the possible association between polymorphisms of the VH locus and susceptibility to immune disorders. It is also feasible to search for somatic gene conversions that have not yet been demonstrated in humans, the most critical test of which would be the extensive sequence comparison between germline and rearranged VH segments.
From an evolutionary point of view, nucleotide sequence comparison between different parts of the locus will enable us to trace evolution of this multigene family by DNA reorganization. It would also be very interesting to clarify the origin and nature of the translocated VH loci on chromosomes 15 and 16 (16, 17). The existence of many VH pseudogenes suggests frequent gene conversions during the evolution of the VH locus (18). Moreover, comparative analysis of the structure and organization of the human VH locus with those of other species or with other multigene loci (Vκ, Vλ, and TCR) will provide clues for further understanding the molecular mechanisms that govern the evolution of multigene families. Finally, the VH locus that lies adjacent to the 14q telomere may provide a suitable candidate for the study of the structure of a human telomere.
To address the above questions, earlier studies on the organization of the human VH locus have resulted in the completion of the physical map of the entire locus by isolation of yeast artificial chromosome (YAC) clones (8, 9). Here, we report the determination of the complete nucleotide sequence of the 957-kb DNA encompassing the human VH locus consisting of 123 VH segments. This permitted the classification of the VH segments according to their structure and utilization into 39 functional, 1 transcribed, 4 open reading frame (ORF), and 79 pseudogenes. Both frequent DNA reorganization after mammalian divergence and high levels of repetitive elements were revealed. We also identified a putative ancestral VH segment that is distantly related to VH segments of other vertebrates as well as to those of humans. A single exon-encoded nonimmunoglobulin gene of unknown function was mapped in the JH proximal part.
Materials And Methods
Isolation of the Distal Part of the VH Locus.
The JH-distal region contains one member each of the VH2 and VH5 families, which can serve as markers of missing DNA because these families contain relatively few members (8). Probes and primers specific to these two families were used for the initial screening of cosmid (19) and ICRF YAC libraries (20) as described (21). A 125-kb contig consisting of four cosmids (M146, U22-1, U22, and M83) and one YAC clone (13.3), which does not overlap with the JH-proximal 0.8-Mb region, was obtained. The remaining gap between Y24/Y6 and YAC 13.3 was filled with the P1 clones A1 and H10, obtained by screening a human bacteriophage P1 library (22) with primers corresponding to the 5′ terminus of Y24 and the coding sequence of the V2-70 segment. A probe representing the human telomere repeat was synthesized as described previously (23) and used for hybridization.
Nucleotide Sequencing of the VH Locus.
Two different methods were used to determine the nucleotide sequence. The 637-kb regions for which plasmid subclones were available were sequenced by a primer walking method. The remaining DNAs were sequenced by a combination of shotgun and primer walking methods as follows: (a) insert DNA (average size ∼3–4 kb) for the shotgun libraries was obtained from cosmid and P1 clones either by partial digestion with Sau3AI or by mechanical shearing and subsequent fractionation by agarose gel electrophoresis; (b) plasmid DNA of 96 shotgun clones from each cosmid or P1 clone was used for the first round of sequencing analysis with vector primers at both ends, and the 192 sequences obtained were assembled into contigs by Sequencher software (Gene Codes Corporation, Inc., Ann Arbor, MI); (c) the remaining gaps between contigs were then filled by primer walking, using as a template the plasmid DNA of shotgun clones that bridge different contigs. Accuracy of the nucleotide sequence was estimated to be 99.98% by comparison of the two sequences of 23-kb DNA between the V6-1 segment and D gene cluster from two independent cosmid clones.
Identification of Nonimmunoglobulin and Repetitive Sequences.
Eight nonimmunoglobulin sequences were identified by BLASTN and analyzed in detail using GENETYX-MAC version 9.0 (Software Development Co. Ltd., Tokyo, Japan). The content and distribution of genome-wide repetitive sequences were extensively searched by CENSOR (24) at the Genetic Information Research Institute (Palo Alto, CA) as well as by dot matrix analysis.
Molecular Evolutionary Analysis.
Optimal alignment of nucleotide sequences was obtained by visual inspections maximizing the sequence homology between any pair of the VH segments. Intron sequences were not included for the analysis. The evolutionary distance ka was calculated by the simple Poisson model correction as ka = −ln[1 − (4/3)Ka] (25), where Ka represents the nonsynonymous substitution per site between sequences compared. The usage of Ks (synonymous substitution per site) for the alignment is not appropriate because Ks is saturated in many pairs of the VH segments compared. The evolutionary tree was inferred by the neighbor-joining method (26).
To estimate the divergence time between VH3/VH4 units, optimal alignment of spacer sequences was obtained as described (27, 28) together with visual inspections, and the tree was constructed by the neighbor-joining method (26). The divergence time of duplicated copies (T) was estimated by the equation T = k/2v, where k = −ln[1 − (4/3)K] and K represents the simple nucleotide difference between the sequences compared. The evolutionary rate v was calculated as 1.4 × 10⁻⁹ per site per year by comparison of spacer DNA sequences among primate β-globin gene clusters (human, orangutan, Old World monkey, and New World monkey) (data not shown).
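The divergence-time estimate described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example K value are hypothetical, and the correction is coded exactly as printed in the text. The standard Jukes-Cantor correction carries an additional 3/4 prefactor, so the sketch exposes a `prefactor` argument (default 1.0, matching the printed formula) rather than assuming which form was actually used.

```python
import math

# Evolutionary rate quoted in the text (substitutions per site per year).
RATE_PER_SITE_PER_YEAR = 1.4e-9


def corrected_distance(K, prefactor=1.0):
    """Multiple-hit correction k = -prefactor * ln(1 - (4/3) K).

    prefactor=1.0 reproduces the formula as printed in the text;
    prefactor=0.75 gives the standard Jukes-Cantor form (an assumption,
    not something stated in the text).
    """
    if not 0.0 <= K < 0.75:
        raise ValueError("K must lie in [0, 0.75)")
    return -prefactor * math.log(1.0 - (4.0 / 3.0) * K)


def divergence_time(K, rate=RATE_PER_SITE_PER_YEAR, prefactor=1.0):
    """Divergence time T = k / (2 v), in years."""
    return corrected_distance(K, prefactor) / (2.0 * rate)


def distance_for_time(T, rate=RATE_PER_SITE_PER_YEAR):
    """Inverse relation: the corrected distance k implied by a time T."""
    return 2.0 * rate * T


if __name__ == "__main__":
    # Hypothetical observed difference of 2% between two spacer copies.
    print(f"T = {divergence_time(0.02) / 1e6:.1f} million years")
    # Corrected distance implied by a 10-million-year divergence.
    print(f"k = {distance_for_time(10e6):.3f}")
```

With the quoted rate, a divergence of 10 million years corresponds to a corrected distance of about 0.028 substitutions per site, which gives a feel for how similar the most recently duplicated units must be.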
Results And Discussion
Complete Mapping of the Human VH Locus.
Previously, we isolated and analyzed the JH-proximal 0.8-Mb region of the human VH locus. 64 VH segments (V6-1 to V3-64) have been completely sequenced and were categorized into 33 structurally functional and 31 pseudo VH segments (8). In this study, we further extended the region by screening and characterization of human YAC, P1, and cosmid clones (Fig. 1). The newly isolated region encompasses the 170-kb DNA upstream of the V3-64 segment, and its JH-distal end hybridized with a human telomere repeat probe. Physical mapping and Southern blot analysis identified 16 VH segments and 9 additional DNA fragments that weakly hybridize with human VH probes within the 170-kb region (Fig. 1). Subsequent comparison of the physical map with that of yIgH6 (9) revealed that the content and organization of VH segments are almost identical except that yIgH6 carries an additional 7 kb of DNA at the telomeric end. Since the yIgH6 clone was isolated by the activity of the human telomere in yeast (9), it may extend to the 14q terminus.
Complete Nucleotide Sequence of the Human VH Locus.
The complete sequence of 957,090 bp between the JH1 segment and the telomeric part of chromosome 14q was determined. The region contains a total of 123 VH segments of 7 different families (Fig. 1 and Table 1). The VH segments are localized in the 883-kb DNA between 73 and 956 kb upstream of the JH cluster. The 5′-most VH segment, V3-82P, is located only 1,480 bp downstream of the 5′ terminus of YAC 13.3. Highly interspersed organization of the VH segments belonging to seven different families was confirmed (19). The lengths of intergenic regions are quite variable; the average distance between neighboring VH segments is ∼6.8 kb, with the longest being 41.4 kb (V1-2/V4-1.1P) and the shortest 418 bp (V3-67.2P/V4-67.1P). However, a clustered distribution of VH segments is not evident, unlike the human Igλ locus where five Vλ clusters are physically separated by long spacer DNA (11, 12). The transcriptional polarities of the 123 VH segments are the same as that of the JH segments, unlike the human Igκ locus in which distal 36 Vκ segments are in an inverted transcriptional orientation because of the gross inversion of 440-kb DNA (10).
In the JH-distal 170-kb region, the existence of 17 VH segments (V3-65P to V7-81) was suggested by earlier physical mapping studies (9). We identified 16 of them at the positions proposed (9) and classified them into 7 structurally functional VH segments and 9 pseudogenes. However, we failed to identify the VH sequence corresponding to the V7-77 segment even though the physical maps of the corresponding portions are exactly identical.
Many DNA fragments of YACs and cosmid clones in the VH locus weakly hybridized with VH probes (although such hybridization was not detectable by Southern blotting of human genomic DNA), suggesting the presence of additional VH-related sequences including possible novel human VH families. Indeed, the mouse Q52 family does not have human counterparts that show >66% nucleotide sequence homology (18). We identified 43 such VH-related sequences in the total locus and classified them into 19 VH3, 22 VH4, and 1 VH7 segments according to the homology to known 7 VH families, making the total number of the human VH segments to be 123 (Tables 1 and 2). Unfortunately, all of these 43 newly identified VH segments have defects (Table 2) and thus are categorized as pseudogenes, excluding the possibility of novel VH families in humans. Of note, only three VH segments (V3-30.2P, V3-33.2P, and V7-34.1P) have the basic VH structure while the other 40 VH segments contain the truncation.
The D region gene cluster consists of 26 D segments within the 39-kb DNA between 53 and 14 kb upstream of the JH segments (Fig. 1). Because all of the 26 D segments belong to some of the known six families, we named each D segment based on the family and localized duplication unit, thus following the nomenclature of earlier studies (29, 30). The organization and numbering of D segments are in accordance with that proposed in earlier studies (29, 30) and confirmed by nucleotide sequencing analysis (31) in that the D cluster consists of four copies of a 9-kb element containing a set of six D family segments in the order 5′-DM-D(LR)-DXP-DA-DK-DN-3′. An extensive analysis of D segment usage in VH cDNAs successfully classified the 27 D segments, including the unique DQ52 segment that is located in the JH gene cluster (32), into 25 functional and 2 pseudogenes (31).
The Total Number of the Functional Human VH Segments.
The definition of a functional VH segment is important and essential for determining their number in the human VH locus because there is some discrepancy regarding the classification of VH segments into functional or pseudogene segments, a discrepancy which is in part attributable to the incomplete nucleotide sequences of some VH segments (8, 9, 33). However, given the complete sequences of all the VH segments, we can propose the following criteria for the functional VH segment. The functional VH segment should have an intact exon-intron structure, a complete ORF, and no fatal defects in RSS. In addition, expression of the VH segment should be confirmed by identification of the given VH sequence in data bases of full-length VH cDNA. Identification of a partial cDNA sequence is not sufficient because a part of the V3-47P sequence is found in the VH cDNA database even though V3-47P must be a pseudogene because of a point mutation at the initiation codon (ATG to AGG) (18). Transcription of a rearranged TCR Vβ segment which carries a defect in splicing signal sequence has also been demonstrated (34). Needless to say, the best proof for the functional VH segment is to identify its sequence in the IgH protein database, although some VH segments might be difficult to identify because of hypermutation.
Therefore, we looked for the full-length VH cDNAs and proteins that correspond to the 40 structurally functional VH segments mentioned above. 37 of them fulfill the requirement for the functional VH segment since they are utilized for H-chain polypeptides (Tables 1 and 2). The V4-28 segment shares >97% homology with partial VH cDNA sequences in the database. Although it appears to be transcribed, its translation product remains to be identified. Hence, it is classified as the second group, transcribed. The V3-35 and V7-81 segments did not correspond to any VH cDNAs. To be conservative, however, we allowed their possible usage in the V-D-J rearrangement and classified them as the third ORF group (Tables 1 and 2) (discussed below).
Previous classifications of V1-24, V1-58, V3-16, V3-38, and V5-78P were corrected in this study. The V1-24 and V1-58 segments containing a complete ORF except for the abnormal splicing signal GC/AG (18) were found in a recently published IgH protein and therefore categorized as functional. We include the V3-16 and V3-38 segments in the ORF group even though they contain highly diverged RSS heptamers (discussed below). The V5-78P segment, which completely loses RSS by replacement of an unknown sequence, was not identified in cDNAs and therefore is classified as pseudogene. Conversely, none of the 79 pseudogenes that have defects in the VH gene structure itself were found in full-length VH cDNAs. Taken together, the corrected numbers of the functional, transcribed, ORF, and pseudogenes are 39, 1, 4, and 79, respectively (Table 1).
This raises the issue of how many functional VH segments are required for the full antibody repertoire in humans. The immune response of transgenic mice carrying human Ig YAC clone (xenomice) gives some hints on this (15). The xenomouse II strains with 35 functional VH and 18 functional Vκ segments develop human adultlike antibody repertoires with high levels of mature B lymphocytes and high-affinity human antibodies against diverse antigens while the xenomouse I strains bearing 5 functional VH and 3 functional Vκ segments are capable of only modest immune responses. This strongly suggests the importance of the complete germline V gene repertoire in the highly diverse human antibody response.
Analysis of the 5′-regulatory Sequences.
The 5′ flanking region of VH segments plays an important role in the regulation of H-chain gene expression. Already, we have shown the striking conservation of the upstream sequences of VH segments in a family-specific manner (18). This region contains two cis-acting elements, namely the octamer motif, which is essential for correct transcription of Ig genes, and the TATA box required by the general transcription machinery (35). In this study, 500 bp of 5′-flanking sequences from the 79 VH segments without 5′ truncation were aligned to identify and compare these two motifs, as well as unknown conserved sequences across VH families that might also act as cis-acting elements. We found that 40 out of the 44 functional, transcribed, or ORF VH segments contain an octamer sequence 100% identical to the consensus (ATGCAAAT) (Table 2). Slightly less conserved are ATCCAAAT in V3-53, AGGCAAAT in V6-1, and ATGCAGGT in V3-20 although these three VH segments have been shown to be translated. The V3-38 segment, an ORF group member, completely loses the octamer and TATA motifs due to 5′-truncation. Because this VH segment also contains a point mutation in a critical site of the RSS (discussed below), it might not be capable of rearrangement or transcription. Interestingly, the octamer sequence of pseudogenes appears much less conserved. In the 33 VH pseudogenes with the octamer motif in their 5′-flanking regions, as many as 15 have diverged octamer sequences (Table 2).
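The octamer comparison described above amounts to counting mismatches against the consensus. A minimal sketch follows, using only the consensus and the three variant octamers quoted in the text; the function name `octamer_mismatches` is hypothetical.

```python
# Octamer consensus quoted in the text.
OCTAMER_CONSENSUS = "ATGCAAAT"


def octamer_mismatches(motif, consensus=OCTAMER_CONSENSUS):
    """Count positions at which a motif differs from the consensus."""
    if len(motif) != len(consensus):
        raise ValueError("motif and consensus must have the same length")
    return sum(a != b for a, b in zip(motif.upper(), consensus))


# Variant octamers of three translated VH segments, as given in the text.
variants = {
    "V3-53": "ATCCAAAT",
    "V6-1": "AGGCAAAT",
    "V3-20": "ATGCAGGT",
}

if __name__ == "__main__":
    for name, motif in variants.items():
        print(name, motif, octamer_mismatches(motif))
```

By this count V3-53 and V6-1 each differ from the consensus at one position and V3-20 at two, consistent with their description as only slightly less conserved.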
In contrast, the distance between the octamer sequence and the TATA consensus, as well as the sequence of the TATA box itself, are well conserved within the same VH family but vary between different VH families. Another motif, the heptamer, which is reported to be essential for full VH promoter activity in mouse lymphoid cells (36), is found 2 bp upstream of the octamer in the VH1 and VH7 family members only, confirming our previous observation (18) that the heptamer is not essential for the expression of H-chain genes in humans. We could not find any other conserved nucleotide motifs across the seven VH families by nucleotide sequence alignment. However, such novel cis-acting elements may be identified by investigation of the correlation between promoter activity and nucleotide sequence variation of the 5′-flanking region.
RSS and V-D-J Rearrangement.
The RSS of VH segments is located immediately downstream of the coding region sequence and is composed of conserved heptamer (CACAGTG) and nonamer (ACAAAAACC) sequences, which are separated by 23-bp spacer nucleotides. Recent in vitro analysis of RSS (37, 38) clearly demonstrated that the first three positions of the heptamer and the fifth and sixth positions of the nonamer are critical for efficient V-(D)-J recombination. Among the 123 VH segments identified in this study, 108 have RSS heptamer and nonamer signals (Table 2). All 40 of the functional or transcribed VH segments maintain the first three nucleotides of the heptamer signal (CAC) intact, although five of them are slightly different from the consensus in the four 3′ nucleotides (AGTG). Slightly more variation can be seen in the RSS nonamers of these 40 VH segments as follows; (a) the fifth and sixth positions are highly conserved except for two VH2 segments; (b) C is more frequently used than A at the fourth position; (c) the VH1 segments have a family-specific nonamer, TCAGAAACC capable of V-D-J rearrangement. The G nucleotide at the fifth position of V2-26 and V2-70 appears to maintain recombination efficiency as shown in the human Vλ genes (12). In the ORF group, V3-35 and V7-81 contain mutated RSS heptamers (CACTGAG and CACCATG, respectively) although their first three heptamer nucleotides retain the CAC consensus. It is not clear whether this might affect the efficiency of the V-D-J recombination. In contrast, the first three positions of the heptamer signals of V3-16 (TCC) and V3-38 (TAC) have diverged even though their nonamer signals remain well conserved. This might be the reason why these two VH segments cannot be found in functional V-D-J rearrangements. In all 44 VH segments, the spacer length is strictly maintained at 23 nucleotides (Table 2). 
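The RSS criteria discussed above (the first three heptamer positions, the fifth and sixth nonamer positions, and a 23-bp spacer) can be expressed as a simple check. This is a sketch under stated assumptions rather than the procedure used in the study: the function names are hypothetical, tolerance of G at the fifth nonamer position follows the remark about V2-26 and V2-70, and the strict 23-bp spacer rule is a simplification.

```python
# Consensus signals quoted in the text.
HEPTAMER_CONSENSUS = "CACAGTG"
NONAMER_CONSENSUS = "ACAAAAACC"


def heptamer_core_intact(heptamer):
    """True if the first three heptamer positions match CAC."""
    return heptamer.upper().startswith(HEPTAMER_CONSENSUS[:3])


def nonamer_core_intact(nonamer, tolerate_g5=True):
    """True if nonamer positions 5 and 6 match the consensus (A, A).

    tolerate_g5 allows G at position 5, as the text suggests for the
    two VH2 segments (an assumption about how to encode that remark).
    """
    n = nonamer.upper()
    if len(n) != len(NONAMER_CONSENSUS):
        return False
    fifth_ok = n[4] == "A" or (tolerate_g5 and n[4] == "G")
    return fifth_ok and n[5] == "A"


def rss_looks_competent(heptamer, nonamer, spacer_len, allowed_spacers=(23,)):
    """Combine the three checks; a strict 23-bp spacer rule by default."""
    return (
        heptamer_core_intact(heptamer)
        and nonamer_core_intact(nonamer)
        and spacer_len in allowed_spacers
    )


if __name__ == "__main__":
    # Heptamers (or heptamer prefixes) quoted in the text.
    for name, hept in [("V3-35", "CACTGAG"), ("V7-81", "CACCATG"),
                       ("V3-16", "TCC"), ("V3-38", "TAC")]:
        print(name, heptamer_core_intact(hept))
    # The VH1 family-specific nonamer keeps positions 5 and 6 intact.
    print(nonamer_core_intact("TCAGAAACC"))
```

Passing `allowed_spacers=(22, 23, 24)` relaxes the spacer rule, which may be appropriate where 22- or 24-bp spacers are known to be used.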
Given the numbers of human VH, D, and JH segments now established, the combinatorial diversity of the human VH genes can be calculated as 40 (functional/transcribed VH segments) × 25 (functional D segments) × 6 (JH segments) = 6,000. Of course, this value is only approximate, as the number of VH and D segments shows allelic variation.
The RSS of the 64 pseudogenes appear much more diverged (Table 2). First, 26 VH pseudogenes carry mutation(s) at one or more of the five critical positions (A to G mutation at the fifth nucleotide of the nonamer is excluded). In addition, V3-36P and V3-76.1P have lost the nonamer signal due to truncation in the 23-bp spacer. Second, the spacer length is not well conserved; 7, 6, 2, and 1 VH segments have spacer lengths of 24, 22, 20, and 17 bp, respectively, although the usage of V segments with 22- or 24-bp spacers has been demonstrated in human Vλ and TCRβ loci (12, 34). Surprisingly, however, as many as 28 VH pseudogenes carry the complete RSS with 23-bp spacer nucleotides, which corresponds to ∼35% of the pseudogenes. Assuming V-D-J recombination to be equally possible for any of the 70 VH segments and pseudogenes with authentic RSS, the probability of productive V to DJ rearrangement per allele can be calculated as 1/3 (frame) × [40 (functional/transcribed VH segments)/70] = 0.19 or 19%.
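The two back-of-the-envelope figures in this section, the combinatorial diversity and the per-allele probability of a productive V to DJ rearrangement, can be reproduced directly from the counts given in the text. The function names below are hypothetical, and the defaults simply restate the numbers quoted above.

```python
from fractions import Fraction


def combinatorial_diversity(n_vh=40, n_d=25, n_jh=6):
    """Number of V-D-J combinations from the segment counts in the text."""
    return n_vh * n_d * n_jh


def productive_fraction(n_functional=40, n_rearrangeable=70,
                        in_frame=Fraction(1, 3)):
    """Probability that a V to DJ joint is in frame and uses a
    functional or transcribed VH segment, assuming every segment with
    an authentic RSS rearranges equally often."""
    return in_frame * Fraction(n_functional, n_rearrangeable)


if __name__ == "__main__":
    print(combinatorial_diversity())
    p = productive_fraction()
    print(p, f"{float(p):.2f}")
```

Keeping the probability as an exact fraction (4/21) makes it clear that the quoted 19% is a rounded value and that it depends entirely on the equal-usage assumption.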
Evolution of the Human VH Locus.
To clarify the evolutionary trail of the human VH locus, we constructed a phylogenetic tree based on the nucleotide sequence alignment of 114 VH segments that do not have large truncations within the coding region. The phylogenetic tree showed the presence of three VH clusters corresponding to VHI, VHII, and VHIII subgroups (39), which are further subdivided into seven VH families: VH1/VH5/VH7, VH2/VH4/VH6, and VH3 families, respectively, in agreement with the previous proposal (18, 39) (Fig. 2 A). The VH6 segment is branched off from the cluster of VH4 segments in this tree. However, we confirmed that the VH6 family forms an independent branch when the tree is constructed based on the simple nucleotide and amino acid differences (data not shown). Therefore, we consider this fluctuation as being due to a high level of homology between the two families. The 12 VH3 pseudogenes that have the 5′ truncation at the same position in their introns (Table 2) constitute an independent cluster of the VH3 family. Such clustering of truncated pseudogenes can also be observed in the VH4 family; a group of 13 VH4 segments containing the common 5′ truncation at amino acid number 10 (Table 2) again branched off from the common ancestor. These VH segments are scattered across the locus, suggesting the initial truncation in an ancestral VH segment and subsequent interspersion of duplicated copies throughout the locus.
Interestingly, the V4-44.1P segment appears to be so independent from the other three subgroups that it forms a fourth subgroup (Fig. 2 A). The V4-44.1P shared weak nucleotide sequence homology to the VH4 (<62.9%), the VH1 (<59.4%), and the VH3 (<58.2%) segments, and amino acid homology to the human VH segments did not exceed 40.6%. When the amino acid homology search for this VH segment was performed against protein databases, a similar level of homology was obtained with those of a variety of vertebrates, including: mouse (38.8%), rat (30.0%), rabbit (38.6%), dog (34.4%), Caiman (36.4%), Xenopus (33.7%), teleost fish (36.7%), and horned shark (28.6%). The presence of this VH pseudogene can be explained either by the possibility that the V4-44.1P segment is a putative ancestral VH segment or that VH segment is a very old pseudogene and the accumulation of the mutations has decreased its overall homology to the other human VH segments. Consideration of the V4-44.1P segment as the eighth family may be less likely because interspecies homology between corresponding VH families is usually much higher than that between different families within a single species (40, 41). However, it is premature to draw a conclusion based on amino acid comparisons of only a limited number of VH segments from other species.
Dot matrix analysis of the 957-kb sequence against itself did not reveal any large-scale genome duplication. However, we found 13 DNA sequences of variable length (4–24 kb) that appear repeatedly across the VH locus (Fig. 1). These homologous units constitute 67% of the entire locus and contain the DNA fragments previously shown to cross-hybridize with 14 intergenic probes by Southern blot analysis (42). Of note is the DNA sequence that appears 11 times in the region between 380 and 955 kb upstream of the JH segments (indicated by red boxes in Fig. 1) and contains a VH4 segment with the 5′ truncation flanked by a VH3 segment at its upstream end. Among them, the nucleotide sequence of the spacer DNA between the two VH segments is highly conserved in 10 different VH3/VH4 units. The spacer sequences were aligned to estimate the divergence time of these VH3/VH4 units.
As shown in Fig. 2 B, nine DNA duplication events took place between 132 and 10 million years ago. Of note, seven events occurred after the mammalian divergence 75 million years ago (43), demonstrating the recent high frequency reorganization of the human VH locus. The 48-kb DNA ranging between the V3-33.2P and V4-28 segments consists of four copies of the VH3/VH4 units (Fig. 1). According to the identity of the physical map between the upstream and downstream 24-kb DNA, each of which contains 2 VH3/VH4 units, the 48-kb region was considered to be generated by tandem duplication of the 24-kb DNA (8). Similar divergence times were obtained for the corresponding pairs: 13 million years ago between V3-32P/V4-31.1P and V3-29P/V4-28.1P, and 10 million years ago between V3-33.2P/V4-33.1P and V3-30.2P/V4-30.1P (Fig. 2 B). This strongly suggests an initial internal DNA duplication within the copy (73 million years ago) and a subsequent recent gross DNA duplication (∼10–13 million years ago). Clustering of two VH3/VH4 units is also seen in the 19-kb DNA between V3-54P and V4-51.2P. However, in this cluster the upstream V3-54P/V4-53.1P pair is nearest to the V3-33.2P/V4-33.1P and the V3-29P/V4-28.1P pairs (39 million years ago) while the V3-52P/V4-51.2P pair is most distantly related to the other nine copies (132 million years ago). This excludes the possibility of another gross duplication. The most recent duplication, which took place 10 million years ago between V3-33.2P/V4-33.1P and V3-30.2P/V4-30.1P, suggests the existence of both pairs in gorilla and chimpanzee but not in gibbon (44, 45). A similar calculation was performed between DNA regions containing the truncated VH segments, V3-67.3P/V3-67.2P and V3-5.2P/V3-5.1P, and the divergence time was found to be 61 million years ago, again after the divergence between mouse and human (data not shown).
Identification and Characterization of Nonimmunoglobulin Genes.
We identified eight DNA sequences in the 957 kb that are highly homologous to known DNA sequences in the databases. Three of them were mapped within the VH-rare downstream part (Fig. 1). The 7,883-bp cDNA of KIAA0125 (46) displayed 99.8% identity to the DNA sequence between the V6-1 segment and the D gene cluster. KIAA0125 is encoded by a single exon and its transcriptional orientation is in the opposite direction to that of the VH segments. This cDNA has several interesting features, including an extremely short putative protein coding region (77 amino acid residues) and, in contrast, very long 5′- and 3′-untranslated regions (1,289 and 6,087 nucleotides, respectively) (46). In addition, its 3′-untranslated region contains two tandem repeats of 68- and 48-bp units. Moreover, the expression of KIAA0125 is limited to lymphoid organs (46). It is interesting to investigate its physiological roles because these characteristics are often found in imprinted genes including the H19 gene whose transcripts work as an RNA component of ribonucleoprotein particle (47–49).
We also found two processed pseudogenes within the largest spacer DNA between the V1-2 and V4-1.1P segments (Fig. 1). The 681-bp DNA segment located ∼105 kb upstream of the JH segments is 94.9% homologous to the human ribosomal protein S8 cDNA (50). Another DNA segment of 2,348 bp located ∼133 kb upstream of the JH cluster shows 89.9–91.4% homology to a series of cDNAs of the metalloprotease-like, disintegrin-like, cysteine-rich protein family of Macaca (51) (Fig. 1). Two copies of a 1.7-kb sequence showing 77% homology to the 3′-half of human leukemia virus receptor 1 cDNA were identified in the spacer DNA between V1-18/V1-17P and V4-67.1P/V1-67 (52). These two distantly located DNA segments are 90.4% homologous and contain a common 5′-truncation, suggesting the integration of reverse transcribed human leukemia virus receptor 1 mRNA followed by truncation and DNA duplication. Similarly, three copies of a DNA segment that show >86% nucleotide sequence homology to the 3′-most 500 bp of the human golgin-245 cDNA are also scattered within the locus (53) (Fig. 1).
Structure of the Human 14q Subtelomeric Region.
Physical mapping studies (9) suggest that the 14q terminus is located several kilobases upstream of the 5′ end of the YAC clone 13.3. Indeed, we could not find the complete repeat of human telomere-specific hexanucleotides CCCTAA at the distal end. However, the 5′-most 873 nucleotides contained a divergent telomeric repeat array of 181 bp that shares 69.1% homology with the poly-(CCCTAA) sequence. Interestingly, this 873-bp DNA, which is unique within the VH locus, showed striking similarity to the telomeric regions of human chromosome 4p (92.4%), 4q (93.0%), and 22q (94.3%) that have been deposited in the GenBank/EMBL/DDBJ databases. Since the telomeric region is hyperrecombinogenic and the telomeric region of one chromosome in an individual often corresponds to that of another chromosome in others (54), these might represent alleles. It has been reported that the divergent telomeric repeat array appears several kilobases downstream of the complete telomere repeat (54). In chromosome 4p and 4q, the DNAs homologous to the above 873 bp are located 5 and 13 kb, respectively, downstream of the authentic telomere repeat. Taken together, the distance between the chromosome 14q terminus and the 5′ end of our contig would be ∼10 kb.
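As an illustration of what a percent homology to poly-(CCCTAA) measures, the following minimal Python sketch scores a sequence against a perfect tandem repeat of the hexanucleotide in each of its six phases and reports the best ungapped identity. It is not the method used in this study, which is not specified here and may have allowed gaps; the function name `best_phase_identity` and the toy sequence are hypothetical.

```python
def best_phase_identity(seq: str, unit: str = "CCCTAA") -> float:
    """Highest ungapped percent identity between seq and a perfect tandem
    repeat of unit, taken over every phase of the repeat."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    best_matches = 0
    for phase in range(len(unit)):
        # Compare each base with the repeat base expected at that position.
        matches = sum(
            1
            for i, base in enumerate(seq)
            if base == unit[(i + phase) % len(unit)]
        )
        best_matches = max(best_matches, matches)
    return 100.0 * best_matches / len(seq)


# Hypothetical toy array with two substitutions; this is NOT the 181-bp
# divergent array found in the contig.
toy_array = "CCCTAACCCTGACCCTAACCATAA"
print(f"{best_phase_identity(toy_array):.1f}% identity to poly-(CCCTAA)")
```

An ungapped comparison of this kind underestimates similarity when the array contains insertions or deletions, so a gapped alignment would be the more appropriate tool for a real divergent array.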
GC Content and Genome-wide Repetitive Elements.
Cytogenetically, band 14q32.33 is an early replicating and G + C–rich R band. Other characteristics of R bands are richness in housekeeping genes, G + C–rich third codon positions, and a SINE-rich/LINE-poor genome composition. However, studies on DNA replication classified the human VH locus as a G-bandlike gene locus that replicates at the late stage of S phase, whereas the CH locus was classified as R-bandlike (55). The third position in codons of VH segments is A + T rich, again inconsistent with the cytogenetic observation (Hayashida, H., unpublished observations). We found that the upstream 893 kb of this locus is A + T predominant (average 58.4% A + T), whereas the downstream 65 kb is rich in G + C nucleotides (average 58.6%). The high G + C percentage at the JH-proximal part appears to continue toward the CH gene region. The existence of polypurine/polypyrimidine tracts has been reported at the boundary of the A + T–rich class II and G + C–rich class III gene clusters of the human HLA locus (56). In the VH locus, however, such tracts are not evident at the boundary. Nonetheless, the boundary may contain a switch point for DNA replication timing and scaffold-associated regions.
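The base composition figures above can be related to one another with a short calculation. The sketch below, in Python, shows how a G + C percentage is obtained from a sequence and how the two regional averages combine into a length-weighted value for the whole locus. The function names are hypothetical, the example sequence is a toy, and the locus-wide value is derived here from the two reported averages rather than taken from the study.

```python
def gc_percent(seq: str) -> float:
    """G + C percentage of a sequence, ignoring any non-ACGT characters."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("no A, C, G or T bases in sequence")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def weighted_gc(segments) -> float:
    """Length-weighted G + C percentage for (length_kb, gc_percent) pairs."""
    segments = list(segments)
    total_kb = sum(length for length, _ in segments)
    return sum(length * gc for length, gc in segments) / total_kb


# Toy sequence, for illustration only.
print(f"toy G + C: {gc_percent('ATGCGCTA'):.1f}%")

# Figures from the text: the upstream 893 kb averages 58.4% A + T
# (that is, 41.6% G + C); the downstream 65 kb averages 58.6% G + C.
locus = [(893, 100.0 - 58.4), (65, 58.6)]
print(f"locus-wide G + C (derived): {weighted_gc(locus):.1f}%")
```

Because the A + T–rich part is more than ten times longer than the G + C–rich part, the weighted value stays close to that of the upstream region.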
We examined the content and distribution of various kinds of genome-wide repeats and identified 722 genome-wide repetitive elements, which correspond to as much as 41.8% of the entire locus. They are categorized into 136 SINE elements (133 Alu and 3 MIR), 340 LINE elements (338 LINE1 and 2 LINE2), 213 LTR elements (82 LTR-retrotransposon or MaLR, 69 retroviral LTR, and 62 retrovirus-like other LTR), 5 DNA transposons, and 25 medium reiteration frequency repetitive sequences (Fig. 1). The LINE1 element is the largest contributor, constituting 23.2% of the locus, whereas the number of Alu elements is much lower than that expected by random distribution (239 copies) and constitutes a relatively small fraction (3.4%). Of note, only 2 copies out of 338 LINE1 elements contain the complete LINE1 structure of ∼6 kb, whereas 278 copies are <1 kb in size.
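To give a sense of scale, the percentages above can be converted into approximate lengths of DNA, and the Alu count can be compared with the random expectation. The following Python sketch performs only this arithmetic on the figures quoted in the text; the variable names are hypothetical and the derived values are rounded estimates, not measurements reported by the study.

```python
LOCUS_KB = 957.0  # total length of the sequenced locus, from the text

# Fractions of the locus occupied by repeats, in percent, as reported above.
percent_of_locus = {"all repeats": 41.8, "LINE1": 23.2, "Alu": 3.4}

# Approximate length covered by each class (derived, not reported).
covered_kb = {name: LOCUS_KB * pct / 100.0 for name, pct in percent_of_locus.items()}
for name, kb in covered_kb.items():
    print(f"{name}: about {kb:.0f} kb")

# Observed versus randomly expected Alu copies, both from the text.
alu_observed, alu_expected = 133, 239
alu_ratio = alu_observed / alu_expected
print(f"Alu observed/expected: {alu_ratio:.2f}")

# LINE1 copies by size: 2 complete (about 6 kb) and 278 shorter than 1 kb,
# so the remainder are incomplete copies of at least 1 kb.
line1_total, line1_complete, line1_short = 338, 2, 278
line1_intermediate = line1_total - line1_complete - line1_short
print(f"incomplete LINE1 copies of at least 1 kb: {line1_intermediate}")
```

Read this way, the repeats cover roughly 400 kb of the locus, more than half of which is LINE1, while Alu elements are present at a little over half the number expected by chance.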
The much larger number of LINE1 elements identified in this study than in the previous Southern blot analysis (42) (44 Alu and 11 LINE1 hybridizing DNA fragments in the JH-proximal 730-kb DNA) is due to the probe used in the previous study, which corresponds to the portion conserved between LINE1 subfamilies and therefore failed to detect smaller copies lacking that portion. In the case of Alu elements, the difference in number mainly reflects the presence of multiple Alu elements in a single restriction fragment. In the human TCR Vβ locus, the LINE1-rich/Alu-poor structure is consistent with its chromosomal location in a G band (34). The discordance between the nucleotide and cytogenetic analyses in the human VH locus may be attributed to the extraordinary chromosome structure of the subtelomeric region. This locus is also rich in LTR elements (13.3% in total). Possible involvement of retroelements in gross changes of genome structure has been suggested recently (57). Abundant LTR elements may explain in part the dramatic difference in the organization of the VH loci between humans and mice.
We thank Dr. Chris T. Amemiya for screening P1 library; Dr. Ted Choi for kind donation of the YAC 13.3 clone; Mr. Hiroshi Suga, Drs. Akira Shimizu, Nobuo Nomura, Yoshimichi Ikemura and Fuyuki Ishikawa for valuable comments; Dr. Melvin Cohn for critical reading of the manuscript; Dr. Jean Thierry-Mieg for extensive BLAST and repetitive analysis; Dr. Masazumi Takahashi for computer scripts; and Ms. Hiroe Ohori-Kurooka for technical assistance.
This work was supported in part by grants from the Ministry of Education, Science, Sports, and Culture of Japan and from the Science and Technology Agency of Japan.
F. Matsuda's current address is Centre National de Genotypage, BP191-2, rue Gaston Cremieux, 91000 Evry Cedex, France. K. Ishii's current address is JST Laboratory, Kitasato University Faculty of Science, Kitasato 1-15-1, Sagamihara 228-8555, Japan.
Abbreviations used in this paper: Mb, megabase; ORF, open reading frame; RSS, recombination signal sequences; YAC, yeast artificial chromosome.
Address correspondence to Tasuku Honjo, Department of Medical Chemistry, Kyoto University Graduate School of Medicine, Yoshida, Sakyo-ku, Kyoto 60601, Japan. Phone: 81-75-753-4371; Fax: 81-75-753-4388; E-mail: firstname.lastname@example.org