Structure and Genetics of the Partially Duplicated Gene RP Located Immediately Upstream of the Complement C4A and the C4B Genes in the HLA Class III Region
MOLECULAR CLONING, EXON-INTRON STRUCTURE, COMPOSITE RETROPOSON, AND BREAKPOINT OF GENE DUPLICATION*

The correlation of many HLA-associated autoimmune and genetic diseases with the polymorphic complement C4 genes may be attributed to the presence of disease susceptibility genes in the close proximity of C4. We have cloned and characterized a pair of partially duplicated genes, RP1 and RP2, located 611 base pairs upstream of the human C4A and C4B genes, respectively. The putative RP protein, consisting of 364 amino acid residues, is basic and highly hydrophilic. There is a bipartite nuclear localization signal at residues 114-131 and therefore RP may be a nuclear protein. Northern blot analysis suggested that RP is ubiquitously expressed. The 5' region of the RP1 gene is CpG rich, which is a characteristic of housekeeping genes. The RP1 gene contains nine exons. Located in the fourth intron is a cluster of Alu elements, and a newly defined composite retroposon SVA with a SINE, multiple copies of GC-rich VNTRs and an Alu element altogether enclosed by direct terminal repeats. Members of SVA are also present in the complement C2 gene located about 20 kilobases upstream of RP1 in the HLA and in the cytochrome CYP1A1 gene. Determination of the DNA sequences for RP2 from two different HLA haplotypes revealed identical hybrid sequences which resulted from fusion of RP with the tenascin-like Gene X and truncation of the 5' regions of both genes. Cumulative data suggest that the four tandemly arranged genes RP, complement C4, steroid 21-hydroxylase (CYP21), and Gene X altogether form a modular structure, RCCX. The number of RCCX modules varies from one to three or more in the population. Absence of the truncated genes RP2 and Gene XA has been detected in genomes with single RCCX modules. Duplication of the RCCX modules probably occurred before the speciation of great apes and humans, as they contain the same breakpoint region of RP and Gene X gene duplication.

Autoimmune, genetic, and malignant diseases are associated with the major histocompatibility complex (MHC) in humans (also known as the HLA) (1-5). This may be attributed to: (i) the presence of disease susceptibility genes, oncogenes, or tumor suppressor genes in the MHC; or (ii) the variable efficiencies of the MHC class I, class II, and possibly class III molecules for antigen presentation. Indeed, more than 36 new genes have been identified in the MHC and many of those gene products have been found to be involved in important cellular processes (6). Characterization of these novel genes and identification of their biological functions are essential to understanding the molecular basis of MHC-linked diseases.
The fourth component of complement (C4) is a structural subunit for the multi-molecular C3 and C5 convertases of complement activation in the immune response (7). The C4 genes are located in the class III region of the MHC (8). In humans there are two tandem C4 loci, Locus I and Locus II, which are about 12 kb apart. Locus I codes for C4A and Locus II codes for C4B, although there are exceptions (9-11). Three kb downstream of the C4A and C4B genes are the cytochrome P450 21-hydroxylase (CYP21) A and B genes. CYP21A is a pseudogene due to deleterious mutations in exons. Homozygous mutations and/or deletions of the CYP21B genes may result in congenital adrenal hyperplasia (CAH), characterized by salt-losing and/or virilizing phenomena (12-14). Concurrent deletions of C4 and CYP21 genes are well documented phenomena that may lead to various [...] followed by a stretch of GT-rich sequence that is a characteristic feature for the 3' end of a mammalian gene (19).
Here we report the cloning and characterization of the novel gene(s) located immediately upstream of the C4 genes. This novel gene RP is partially duplicated in the MHC and the truncated version is often involved in a gene duplication or gene deletion process concurrent with neighboring genes. The similarity in physical locations suggests that RP is probably identical to G11 assigned by cosmid mapping (20).

MATERIALS AND METHODS
[...] (23).
Determination of RP cDNA sequence from clone R1.1 was achieved by shotgun cloning of randomly sonicated DNA fragments into M13 SmaI-cut and phosphatased vector, followed by single-stranded DNA dideoxy sequencing (24, 25). Gel readings were assembled with Staden's DNA Analysis software (26).
Determination of the 5' Sequence of RP cDNA
Total RNAs were isolated from cultured cells MOLT4 (T leukemia cell line) and from HT29 (colon carcinoma cell line) by guanidine isothiocyanate lysis and CsCl ultracentrifugation (27). Reverse transcription was carried out using oligo(dT) as the first primer and 1-5 µg of RNA according to the Perkin Elmer Cetus (Norwalk, CT) RNA PCR protocol (28). Amplification of RP cDNA was achieved by two rounds of PCR. The first PCR was performed using primers corresponding to the 3' end of the RP cDNA (HRP3) and the 5' end of the RP1 gene (1.6K-F1 or 1.6K-F2).
Approximately 10% of the product from the first PCR was used for the second PCR using "nested" RP primers, 1.6K-F2 and RPR5, and 0.8K-F1 and RPR5, corresponding to the 5' end of the RP1 gene (1.6K-F2 and 0.8K-F1) and to RP1 Exon 3 (RPR5), respectively. The 1.6K-F1 and 1.6K-F2 sequences were assumed to be present in the RP1 transcript because of the presence of the Kozak consensus for an initiation codon (29), although the RP cDNA sequence obtained revealed that these primers are not close to the predicted RP initiation codon (Figs. 2 and 4). PCR conditions were: 1 cycle at 94 °C for 5 min; 30 cycles at 94 °C for 1 min, 54 °C for 1 min, and 72 °C for 1 min; and 1 cycle at 72 °C for 10 min. The PCR products were cloned into TA cloning vector (Invitrogen) and sequenced.
Northern Blot Analysis
Total RNAs were isolated from cultured cell lines MOLT4, HepG2 (hepatoma cell line), U937 (monocytic cell line), and RPMI 8402 (pre-T lymphocytic cell line) as described (27). Poly(A+) RNA from MOLT4 was a gift from Dr. Caroline Bilsland (Harvard University). Human liver RNA was prepared as described previously (30). For each sample, ~25 µg of total RNA or 2 µg of poly(A+) RNA was resolved by formaldehyde-agarose gel (0.8%) electrophoresis and blotted to Hybond N membrane (Amersham). The blot was hybridized with the R1.1 cDNA probe and washed twice at room temperature with 2 x SSC, 0.1% SDS and twice at 65 °C with 1 x SSC, 0.5% SDS (27).

Sequence Determination of RP1 and RP2 Genes
RP1 Gene
Restriction fragments containing the entire RP1 gene were subcloned from cos 3A3 and designated pSH13 and pBH4 (Fig. 3). The majority of DNA sequences for the RP1 gene were obtained by shotgun cloning of sonicated DNA fragments into M13 mp18 or into Bluescript KS vectors (Stratagene, La Jolla, CA) and single-stranded DNA dideoxy sequencing (24, 25). Sequenase kit (U. S. Biochemical Corp., Cleveland, OH) and [35S]dATP were employed for sequencing reactions. Gel readings were compiled with Staden's DNA sequence analysis software (26). Gaps in sequence contigs were filled by further subcloning of the appropriate restriction fragments as shown in Fig. 3 and by primer walking, using new primers based on known DNA sequences for sequencing reactions. Sequence contigs were joined together through sequence determination of PCR-amplified DNA fragments overlapping the junctions (shown by thick bars in Fig. 3). Overall, each nucleotide was determined more than three times and confirmed by sequences from both strands.
RP2 Sequence from cos 4A3
A 12-kb BglII fragment spanning the intergenic region between two C4B genes from cos 4A3 (9, 31) was subcloned into pUC18. From this subclone, pBRP7, a 5-kb BamHI restriction fragment was isolated and its sonicated fragments were shotgun cloned into pBluescript KS vector by blunt-end ligation. Single-stranded DNAs were prepared after rescue with M13 K07. Gel readings were obtained by SpeedReader and DNA sequences were analyzed with PC/GENE software (Intelligenetics, La Jolla, CA).
RP2 Sequence from λ-JM2a
A 5.3-kb TaqI fragment corresponding to the upstream region of a C4B5 gene from λ-JM2a (32) was subcloned into pUC18 vector. From this plasmid, the sequence of a 2.1-kb BamHI-TaqI fragment was determined after shotgun cloning into M13 mp18 SmaI-cut vector. Gel readings were assembled with Intelligenetics PC/GENE software.
Isolation of Genomic DNA
Human genomic DNAs were isolated following standard protocols (27) from cultured cell lines HepG2 and MOLT4, and from peripheral blood of normal individuals (B1, L1, and SCO1), congenital adrenal hyperplasia patients (CAH-E1, CAH-2, and CAH-3), Prader Willi patients (PW-1 and PW-2), and a nasopharyngeal carcinoma patient (NPC-A). Appropriate consents from blood donors were obtained according to approved protocols by the Institutional Board of Columbus Children's Hospital.
Mouse genomic DNA was isolated from myeloma cell line NSO; African green monkey genomic DNA was isolated from kidney cell line COS 7; and cotton top tamarin DNA was isolated from an Epstein-Barr virus transformed cell line NPC-LC (33). Orangutan and chimpanzee DNAs were prepared from cultured cell lines PUT2 and WES, respectively, obtained from the American Type Culture Collection.

PCR of Genomic DNAs
Locations of possible breakpoints for Gene XA and RP2 hybrid regions in primate and mouse genomes were determined by PCR (35) with primers YMR1 and HRP3. For each reaction, 500-1,000 ng of genomic DNA and ~250 ng of each primer were used. Conditions for PCR were: 1 cycle at 94 °C for 5 min; 30 cycles at 94 °C for 1 min, 54 °C for 1 min, and 72 °C for 1 min and 15 s; and 1 cycle at 72 °C for 10 min.
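As a rough planning aid, the hold times of the genomic PCR program above can be summed. The sketch below is only an illustration: the function name `program_minutes` and the data layout are hypothetical, and ramp times between temperatures are ignored, so an actual run would take longer.

```python
def program_minutes(steps):
    """Sum hold times for a cycling program.

    steps: list of (number_of_cycles, [hold_minutes, ...]) tuples.
    Ramp times between temperatures are not modeled (assumption).
    """
    return sum(cycles * sum(holds) for cycles, holds in steps)


# Hold times as quoted in the text for the genomic PCR.
genomic_pcr = [
    (1, [5.0]),              # initial denaturation at 94 °C
    (30, [1.0, 1.0, 1.25]),  # 94 °C, 54 °C, and 72 °C (1 min 15 s)
    (1, [10.0]),             # final extension at 72 °C
]

total_minutes = program_minutes(genomic_pcr)
print(f"Total hold time: {total_minutes} min")
```

Under these assumptions the program holds for a little under two hours in total.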
Protein and DNA Sequence Analyses
Comparison of the RP amino acid and DNA sequences with national data bases was performed with the GCG FASTA program at the Pittsburgh Supercomputer Center. A dendrogram of the RP variable number tandem repeats (VNTR) was generated with the PC/GENE program Clustal. Other sequence analysis programs used included DOTPLOT, BESTFIT, PILEUP, PRETTY, and PUBLISH of the GCG package (36).

RESULTS
Isolation of cDNA Clones for the Novel Gene RP Upstream of C4
A 653-bp BstEII-AccI fragment, which is 274 bp upstream of the major transcriptional start site of the C4A gene, was used to screen cDNA libraries. R1.1 was isolated from the RPMI [...] polyadenylation signal. The 3' sequence of the R1.1 cDNA clone is identical to the 5' regulatory sequences for the C4A and the C4B genes. This implies that there may be a pair of duplicated genes, designated RP1 and RP2, present immediately upstream of the C4A and the C4B genes, respectively.
The R1.1 cDNA was used as a probe for Northern blot analysis to investigate the expression of RP transcripts isolated from cell lines of different origins (Fig. 1). This probe hybridized to a message of 1.6-1.8 kb in size from RNA samples isolated from the liver (lane 1), hepatoma cell line HepG2, lymphocytic cell lines MOLT4 and RPMI 8402, and monocytic cell line U937 (lanes 3-6). RP transcripts of similar size were also detected from Northern blot analysis of RNA samples from colon carcinoma cell line HT29 and neuroblastoma cell line IMR32 (data not shown). Thus, RP appears to be ubiquitously expressed with a major transcript size of about 1.6-1.8 kb.
Since R1.1 contains the poly(A) tail and is only 1.1 kb in size, the 5' region of the RP full-length cDNA is missing in this clone. Determination of the RP 5' cDNA sequence was achieved by reverse transcriptase-PCR (28) with RNAs isolated from HT29 and MOLT4, using PCR primers derived from the upstream sequence of cDNA clone R1.1, and from the 5' sequence of the RP1 gene. The coding sequence for RP1 is shown in Fig. 2.
There is an open reading frame coding for 364 amino acid residues. An in-frame stop codon is located 33 nucleotides 5' to the putative initiation codon. Thus, it is likely that the predicted amino acid sequence for RP is complete.
The RP protein is extremely hydrophilic in its NH2-terminal half. There are several alternating hydrophilic and hydrophobic regions in the carboxyl half of the protein. Overall, there are 53 positively charged residues (Arg or Lys) and 37 negatively charged residues (Asp or Glu) in the RP protein. Thus, there is a net positive charge of 16 in the RP protein. In addition, there are 8 histidine residues which may also be positively charged.
Located between residues 114 and 131 is a bipartite, positively charged nuclear localization signal (37), KRHHLIPPETFGVKRRRKR. Hence, the RP protein may be targeted to the nucleus. The RP protein is rich in Gly (9.6%), Pro (6.9%), and Leu/Ile (12.6%) residues. While most of the Pro and Gly residues cluster at the NH2-terminal portion, the majority of Leu/Ile residues are present at the COOH-terminal region.
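The charge bookkeeping described above (Arg/Lys counted as positive, Asp/Glu as negative, His set aside) can be sketched as follows. This is a minimal illustration, not the analysis software used in the study. The function names are hypothetical, and the peptide is the nuclear localization signal as printed in the text, with the hyphen in the printed peptide assumed to be a line-break artifact and dropped.

```python
from collections import Counter

POSITIVE = "KR"  # Lys, Arg
NEGATIVE = "DE"  # Asp, Glu


def net_charge(seq):
    """Net charge from Lys/Arg versus Asp/Glu counts; His is not counted."""
    counts = Counter(seq.upper())
    positive = sum(counts[aa] for aa in POSITIVE)
    negative = sum(counts[aa] for aa in NEGATIVE)
    return positive - negative


def percent_composition(seq, residues):
    """Percentage of positions occupied by any of the given residues."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(aa) for aa in residues) / len(seq)


# Whole-protein counts reported in the text.
reported_positive, reported_negative = 53, 37
reported_net = reported_positive - reported_negative

# Nuclear localization signal peptide quoted in the text.
nls = "KRHHLIPPETFGVKRRRKR"
print(reported_net, net_charge(nls), round(percent_composition(nls, "KR"), 1))
```

Applied to the full 364-residue sequence in Fig. 2, the same functions would be expected to reproduce the residue counts and the Gly, Pro, and Leu/Ile percentages quoted above.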
There are no N-linked glycosylation sites in the predicted RP sequence. Search of RP for protein modification sites with the PC/GENE PROSITE program revealed that there is a single potential tyrosine kinase phosphorylation site at residue 231; five potential kinase C phosphorylation sites at residues 80, 105, 112, 145, and 277; five potential casein kinase II phosphorylation sites at residues 6, 76, 85, 183, and 277; two amidation sites at residues 61 and 323; and nine potential N-myristoylation sites at residues 25, 27, 65, 69, 70, 74, 104, 144, and 211. Whether the endogenous RP protein is post-translationally modified as revealed by these potential sites remains to be determined.
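A motif search of the kind described above can be illustrated with the N-glycosylation sequon, commonly written in PROSITE notation as N-{P}-[ST]-{P}. The sketch below is an assumption-laden illustration using a regular expression, not the PC/GENE program itself. The function name `n_glycosylation_sites` and the test peptide are hypothetical.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead makes overlapping sequons visible to finditer.
SEQUON = re.compile(r"(?=N[^P][ST][^P])")


def n_glycosylation_sites(protein):
    """Return 1-based positions of Asn residues that start a sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Hypothetical peptide: NGSA matches at position 3; NPTQ is excluded by the Pro.
print(n_glycosylation_sites("MKNGSALNPTQ"))
```

An empty list for a full-length sequence would correspond to the statement that no N-linked glycosylation sites are predicted.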
Gene Structure of RP1
The human RP1 gene located upstream of a C4A3 gene in a cosmid clone was subcloned into plasmids (Fig. 3). The DNA sequence for the entire RP1 gene and the intergenic region between C4A and RP has been completely determined. A sequence of 12,118 bp is shown in Fig. 4. The RP1 gene consists of 9 exons (Fig. 5). The 5' boundary for Exon 1 has not been defined precisely, partly because of the multiple initiation sites of transcription (data not shown). The 3' end of Exon 1 was located because a continuous cDNA sequence spanning 191 bp for the 3' region of Exon 1 and about 70 bp for the 5' region of Exon 2 has been obtained (data not shown). The putative initiation codon is 128 bp downstream of the splice junction of Exon 2. The coding capacity of the exons ranges from 14 amino acid residues (for Exon 9) to 74 amino acid residues (for Exon 2). Exon 9 also contains a 3'-untranslated region of 397 bp. The size of the introns ranges from 85 bp for Intron 2 to 6,136 bp for Intron 4 (Table I).
There are several notable features of the RP1 gene. First, the 5' region of the RP1 gene is rich in CpG sequences. For example, there are 130 CpG dinucleotides from nucleotide 500-3,000, compared with 30 from nucleotide 8,500-11,000 (Fig. 4). The CpG sequences are generally under-represented in mammalian DNA and their presence (at the 5' region of a gene) is strongly correlated with housekeeping genes (38). This suggests that RP is involved in an essential function.
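The CpG comparison above amounts to counting CG dinucleotides in two windows of equal length. A minimal sketch is given below. The function name `count_cpg` and the toy sequence are hypothetical, and coordinates are assumed to be 1-based and inclusive, as in Fig. 4.

```python
def count_cpg(seq, start, end):
    """Count CG dinucleotides lying entirely within [start, end].

    Coordinates are 1-based and inclusive (assumption).
    """
    window = seq[start - 1:end].upper()
    return window.count("CG")


# Toy sequence for illustration only; it is not taken from the RP1 gene.
toy = "ACGCGTTCGA"
print(count_cpg(toy, 1, 10))

# Counts reported in the text for two 2,501-nucleotide windows.
fold_enrichment = 130 / 30
print(round(fold_enrichment, 1))
```

With the RP1 sequence loaded as a string, `count_cpg(seq, 500, 3000)` and `count_cpg(seq, 8500, 11000)` would be the two calls corresponding to the windows in the text.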
Second, there are eight complete copies of Alu elements present in Intron 4 and one of these elements (copy number 6) is actually located within another Alu element (copy number 5). Among the eight Alu elements, copy numbers 1, 2, and 8 belong to the Alu-J subfamily that is similar to the 7SL DNA (39). Copy number 4 appears to be an integral component of a composite retroelement and will be discussed below. The other four Alu elements belong to the Alu-S subfamily, with copy number 6 categorized to the b branch and copy number 3 categorized to the c branch (39). Alu-Sb is considered to be a young branch of the Alu family, which is consistent with our data, as the incorporation of copy number 6 into the RP gene has to be after the presence of copy number 5. Copy number 3 has an atypical trimeric structure in contrast to the dimeric structure for most Alu elements. The additional structure was labeled Alu-3.1 in Fig. 4.
Third, there is a stretch of highly repetitive DNA sequences located between nucleotides 5,717 and 6,579 (Fig. 4). These repeats can be illustrated by multiple diagonal lines in a dotplot analysis (36) (Fig. 6C). There are 21 copies of very similar but nonidentical tandem repeats, each of which has a GC content of 72-84% and a size of 35-45 bp. A dendrogram showing the relatedness of these 21 tandem repeats is presented in Fig. 6A. These 21 copies of tandem repeats together are flanked by hexameric sequences, TGGGCA (boxed in Fig. 4). The presence of these hexamers precisely at both ends of the 35-45-bp tandem repeats suggests that these tandem repeats as a whole might exist as a structural unit in a transposition/retroposition event, which resulted in the generation of a hallmark, the (hexameric) direct repeats (40).
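The diagonal lines that tandem repeats produce in a self-comparison dotplot can be illustrated with a simplified exact-match version. This is a sketch under stated assumptions and not the GCG DOTPLOT procedure: the functions, the word size, and the 10-nucleotide repeat unit are all hypothetical. Because the repeats in RP1 are nonidentical, a real analysis would need to tolerate mismatches.

```python
def gc_percent(seq):
    """Percentage of G and C residues in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def dotplot_matches(seq, word=8):
    """Return (i, j) pairs, i < j, whose word-length windows are identical."""
    seq = seq.upper()
    index = {}
    for i in range(len(seq) - word + 1):
        index.setdefault(seq[i:i + word], []).append(i)
    pairs = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                pairs.append((positions[a], positions[b]))
    return sorted(pairs)


def diagonal_offsets(pairs):
    """Distinct distances from the main diagonal on which matches fall."""
    return sorted({j - i for i, j in pairs})


# Hypothetical 10-nt unit repeated three times in tandem.
unit = "GGCCTGACGT"
tandem = unit * 3
print(gc_percent(unit), diagonal_offsets(dotplot_matches(tandem)))
```

Each offset in the output corresponds to one off-diagonal line, spaced at multiples of the repeat unit length, which is the pattern described for Fig. 6C.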
A Composite Retroposon Is Present in the RP1 Gene
Comparison of the tandem repeat sequences in the RP1 gene with the GenBank data base through the GCG FASTA program (36) revealed striking similarities with a group of nonviral retroelements, SINE-R11, -R14, and -R19 (41), but arranged in the reverse orientation. Members of the SINE-R elements contain two basic components: (a) 3-6 copies of VNTR which are highly similar to those 35-45-bp tandem repeats present in the RP1 gene; and (b) a short interspersed repeat element (SINE) of ~490 bp that is homologous to the genomic region between the env gene and the 3'-long terminal repeat of an endogenous retrovirus HERV-K10 (42). The SINE sequence in RP1 is only 135 bp in size. Most of the sequence corresponding to the HERV-K10 long terminal repeat region in the SINE-R retroposons is absent in the RP1 gene. Immediately preceding the SINE element in RP is a stretch of T residues of 16 nucleotides.
In addition to the RP1 gene, the SINE-R related sequences have also been detected in introns of two other human genes, the complement C2 (43) and the cytochrome P450 CYP1A1 (44) genes. The copy number for the VNTRs varies from 16 in CYP1A1, 17 in complement C2 (B allele), 21 in RP1, to 23 in another complement C2 gene (A allele). Analysis of the RP, C2, and CYP1A1 genomic sequences reveals a more complex organization of the reiterative sequences than those of SINE-Rs. Located immediately downstream of the VNTRs in each gene is a highly conserved region of 370-372 bp (Fig. 7) with sequence identities of about 95%. Present in each of these sequences are three stretches of Alu-related sequences of 25, 54, and 246 bp (Fig. 7) and a less well-defined sequence of 32 bp. One of the Alu-related sequences (i.e. 246 bp; Alu copy number 4 in Fig. 4) appears to be an entire Alu element and is flanked by a pair of target site repeats. This particular Alu element in RP1, C2, or CYP1A1 genes is unusual in that there are many conserved [...] and 281-297; two mini-insertions at nucleotides 121-122 and 258-262; and 34 scattered point mutations, when compared with the consensus Alu sequence (39) (Fig. 8). [...] enclosed by a direct repeat sequence of 13 bp, GATAATTCCACTA (boxed in Fig. 4). Homologous organization enclosed by a pair of direct repeats of 18 bp is present in the complement C2 gene. Hence, these retroelements appear to form a family of retroposons with discrete, composite units (i.e. SINE, VNTRs, and Alu) proliferated in the human genome. This composite retroposon is named SVA in light of its composition (Fig. 7). Part of the DNA sequence for the SVA retroposon in the CYP1A1 gene is not available, but the existing data reveal a structure very similar to the SVA-RP and SVA-C2 (also known as SINE.R-C2).

RP Gene Is Duplicated in Most Human and Primate Genomes
Since RP sequences are present upstream of the complement C4A and the C4B genes, there may be two copies of RP genes in a haploid genome. A Southern blot analysis of BamHI-digested genomic DNAs, which were isolated from human peripheral blood lymphocytes from CAH patients (lanes 1-3), Prader Willi patients (lanes 4 and 5), a nasopharyngeal carcinoma patient (lane 6), normal individuals (lanes 7, 8, and 10), human tumor cell lines (lanes 9 and 11), an African green monkey cell line (COS7), and a cotton top tamarin cell line (NPC-LC), using a RP cDNA probe (R1.1) is shown in Fig. 9. Two distinct RP-specific BamHI fragments of 9.6 and 5.0 kb in size were detected in most human samples, but only a single 9.6-kb fragment was detected in the genomes of CAH-E1 (lane 1) and of SCO1 (lane 10). CAH-E1 was a congenital adrenal hyperplasia patient with homozygous deletion of the CYP21B genes. SCO1 was a normal individual who was typed as HLA B8 DR3 C4AQ0 C4B1 (there is a homozygous deletion of C4A genes in this individual). Subsequent restriction mapping and DNA sequencing data revealed that the 9.6-kb BamHI fragment corresponds to the RP1 gene, while the 5.0-kb BamHI fragment corresponds to the RP2 gene. Two RP-specific BamHI fragments of 9.3 and 4.7 kb were detected in African green monkey (lane 12), but a single fragment of 10 kb was detected in cotton top tamarin (lane 13). These results suggest that the RP genes are duplicated in the majority of the human population, but in some individuals only the RP1 gene is present. They also suggest that there may be two RP genes in an Old World monkey, the African green monkey, but a single RP gene in a New World monkey, the cotton top tamarin. The same conclusion was obtained from a genomic Southern blot analysis of TaqI-digested DNAs with the R1.1 probe (data not shown).
Partial Gene Duplications of RP and Gene X
Located in the approximately 12-kb intergenic region between C4A and C4B are the CYP21A pseudogene, RP sequence, and Gene XA which overlaps CYP21A at the 3' ends. CYP21A is about 3.2 kb in size and located 3.0 kb downstream of the C4A gene; thus, Gene XA and the RP2 gene are localized in a region of 6 kb. This observation appeared paradoxical as the size of the RP1 gene is about 11.5 kb (Fig. 4), while that of the Gene XB may be as large as 70 kb (45, 46).
In order to solve this puzzle, a 12-kb BglII fragment was subcloned from cos 4A3, which corresponds to the intergenic region between two C4B genes in an unusual haplotype C4A2 C4B1 C4B2 (9) (Fig. 10A). The RP2-specific 5.0-kb BamHI restriction fragment (Fig. 9) is located in this subclone, which was completely sequenced by shotgun cloning and dideoxy sequencing. A comparison of this sequence with those for RP and Gene X cDNAs and the RP1 gene reveals a hybrid sequence derived from RP and Gene X (Figs. 10B and 11). Specifically, this 4,971-bp sequence contains a 1,566-bp fragment corresponding to part of the 5'-untranslated region of a C4 gene (42 bp), the RP-C4 intergenic region (611 bp), and Exon 7-Exon 9 of the RP1 gene (913 bp), which is fused to a 3,405-bp fragment corresponding to the 3' region of Gene X (Fig. 10B). The 5' ends for both RP and Gene X in this hybrid region are truncated and therefore duplications for these two genes are incomplete. With respect to the RP1 gene, the breakpoint of gene duplication for the RP2 sequence is located in Exon 7 (Fig. 11) and is 2,093 bp downstream of the Alu clusters and SVA element (Fig. 4). A DNA sequence corresponding to Gene X cDNA is found 797 bp upstream of RP2. Nine hypothetical Gene X exons, Exons a to i, can be deduced from this 4,971-bp BamHI fragment when compared with the published cDNA sequence (Fig. 10B) (18). Gene XA is arranged in the opposite orientation with respect to RP, C4, and CYP21 genes. The first 332 bp of the published Gene X (partial) cDNA sequence is absent in the Gene XA sequence and therefore the Gene XB 5' exons are absent in Gene XA. In addition, there is an internal deletion of 91 bp in the hypothetical Exon e. The truncation of the 5' region and the internal deletion may change the reading frame and result in premature termination with respect to Gene XB cDNA (18). It remains to be determined whether the changes in Gene XA would lead to the generation of a new protein product.
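The segment lengths quoted for the hybrid BamHI fragment can be checked for internal consistency with a few lines of arithmetic. The sketch below uses only the sizes stated in the text. The variable names are illustrative.

```python
# RP-derived side of the hybrid, in base pairs, as stated in the text.
rp_side = {
    "C4 5'-untranslated region (part)": 42,
    "RP-C4 intergenic region": 611,
    "RP1 Exon 7 to Exon 9": 913,
}
gene_x_side = 3405  # fragment corresponding to the 3' region of Gene X

rp_total = sum(rp_side.values())
hybrid_total = rp_total + gene_x_side

print(rp_total, hybrid_total)
```

The two sums agree with the 1,566-bp RP-derived fragment and the 4,971-bp total reported for the sequenced fragment.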
An independent study by Gitelman and colleagues (17) has shown that the DNA sequence 5' to the duplication breakpoint corresponds to intronic sequence of Gene XB. Thus, the Gene XA-RP2 hybrid was formed by a recombination at a Gene X intron and Exon 7 of RP.
To determine if there is a common breakpoint for gene duplication of Gene X-RP-C4-(CYP21), the sequence of a 2.1-kb TaqI-BamHI restriction fragment from λ-JM2a spanning the RP2 sequence and the 5' region of a C4B5 gene of the C4A4 C4B5 haplotype (32) was determined (Fig. 10C). This fragment covers the entire 913-bp RP2 sequence and also 573 bp of the Gene X sequence, and its sequence is identical to the corresponding region obtained from cos 4A3, except for the presence of two point mutations. In other words, the breakpoint of the hybrid Gene XA-RP2 in a C4A4 C4B5 haplotype is identical to that of the C4A2 C4B1 C4B2 haplotype. (Further analysis of the polymorphism of Gene XA-RP2 sequences will be published elsewhere.) Thus, the genomic region between C4A and C4B genes contains pseudogenes or truncated sequences for three different genes, i.e. CYP21A, Gene XA, and RP2. The tandem genes for RP, C4, CYP21, and Gene X appear to form a four-gene module RCCX that may be duplicated together in the MHC class III region (Fig. 12). However, duplications for the flanking RP and Gene X are incomplete. Homozygous deletions of RP2 in individuals CAH-E1 and SCO1 were concurrent with those of Gene XA, because the 5.0-kb BamHI restriction fragment containing Gene XA-RP2 sequences was absent in these individuals (Fig. 9). We have also found homozygous deletions of a C4 gene and a CYP21 gene in the genomes of CAH-E1 and SCO1. In other words, these individuals have single RCCX modular structures.
A Common Breakpoint Region for Duplication of the RCCX Modules in Great Apes
To determine if the RP-C4-CYP21-Gene X modules are duplicated with a Gene XA-RP2 hybrid region in humans and apes, PCR was performed with a set of primers corresponding to the RP (Exon 9) at one end and to Gene X at the other end (Fig. 10D). As shown in Fig. 13 (Panel A), a 1.36-kb fragment was amplified from cosmid DNA (cos 5) that spans a long C4A and a short C4B gene (lane 1), from human genomic DNAs with RCCX bimodular structures, e.g. Raji (lane 3) and HepG2 (lane 4), and from chimpanzee (lane 5) and orangutan genomic DNAs (lane 6). Southern blot analysis of the samples shown in Panel A using the R1.1 probe confirmed that the 1.36-kb fragment contained RP sequence (Panel B). Thus, there is a common breakpoint region for gene duplication of the RCCX modules in the great apes. On the other hand, no amplified products were detected from human genomic DNA, CAH-E1, with a single RCCX modular structure (lane 2), or from mouse genomic DNA (lane 7). The former was expected because the corresponding PCR primers in a single modular haplotype are oriented in a head-to-head configuration, located 20-30 kb apart, and therefore could not be amplified by PCR. The breakpoint of gene duplication for mouse RP is undetermined, but available sequence data corresponding to the 5' regions of the C4 and the Slp genes exclude the possibility of an identical breakpoint as in humans and apes.

DISCUSSION
Here we report the identification, cloning, and characterization of the novel gene RP located 611 bp upstream of the human complement component C4A and the C4B genes in the class III region of the HLA. The unusual modular duplication (and deletion) of RP together with its neighboring genes Gene X, complement C4, and steroid 21-hydroxylase CYP21, and the association of the HLA with autoimmune and genetic diseases motivate an intensive investigation of the structure, genetics, and function of RP.
FIG. 10. Partial gene duplications of Gene X and RP. A, a subclone of a 12-kb BglII restriction fragment from cos 4A3 spanning the intergenic region between two C4B genes; B, hypothetical exon-intron structures for Gene XA and RP2 in a 5.0-kb BamHI restriction fragment that has been completely sequenced (data base accession number L26263). The breakpoint for gene duplication for RP and Gene X at the chimeric region is marked by an arrow. C, the relative position of a 2.14-kb TaqI-BamHI restriction fragment from clone A-JM2a corresponding to the intergenic region between the C4A4 and C4B5 genes (32). This fragment has been completely sequenced (data base accession number 26262). D, the relative location of the PCR primers used to determine the breakpoint of gene duplication for the Gene XA-RP2 (please refer to Fig. 13).
Although the deduced amino acid sequence of RP does not reveal striking similarities to any known proteins, it sheds light on the properties and possible function of this ubiquitously expressed molecule. The presence of a bipartite nuclear localization signal suggests that RP may be a nuclear protein. The highly hydrophilic and basic nature of the protein implies that the protein might interact with negatively charged molecules such as DNA or acidic domains of transcriptional factors. A comparison of the RP protein sequence with other protein sequences in national data bases revealed that the NH2-terminal portion of 157 residues in RP is 22.9% identical to the hypothetical 119.5-kDa uvr-A protein of the bacterium Micrococcus (47), while the carboxyl portion of RP is about 20% identical to the yeast RAD7 protein (48). Both uvr-A and RAD7 are involved in DNA repair. Similar to RP, the RAD7 sequence contains a bipartite nuclear localization signal and a leucine-rich region (48, 49). Mutation of the RAD7 gene in yeast resulted in decreased proficiency of excision repair of DNA damaged by UV light (48). While analogs for many of the components involved in the DNA repair mechanism of yeast have been isolated, the human analog of the yeast RAD7 has not been cloned.
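The identity figures quoted above (22.9% over 157 residues, and about 20%) depend on how identity is counted over an alignment. The following is a minimal Python sketch of one common convention, in which identical residues are divided by the number of aligned, gap-free columns. The function name and the toy sequences are hypothetical illustrations and are not the RP, uvr-A, or RAD7 sequences; the alignment program and the denominator actually used for the reported values are not stated in the text.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns in which neither sequence has a gap.

    Both inputs must already be aligned (equal length, '-' marks a gap).
    This is one of several conventions; it is an assumption, not the
    method reported in the paper.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * identical / compared


# Hypothetical toy alignment (NOT real RP or uvr-A sequence data).
toy_a = "MKRL-SAEK"
toy_b = "MRRLQSGEK"
print(round(percent_identity(toy_a, toy_b), 1))  # 6 of 8 gap-free columns match -> 75.0
```

Under this convention, a value of 22.9% over a 157-residue window would correspond to roughly 36 identical positions (36/157 ≈ 0.229); the exact count is not given in the text, so this figure is only an illustration of the arithmetic.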
Immunological disorders such as systemic lupus erythematosus (50, 51), immunoglobulin IgA deficiency and common variable immunodeficiencies (52, 53), and malfunctions in reproduction, such as recurrent spontaneous abortions (54), have been related to the null alleles of C4A. In this case null alleles of C4A imply a gross deletion of a C4A gene together with other genes in the RCCX module, or mutations of the C4A gene that may also be concurrent with mutations of RP, CYP21, or Gene X. A typical example of the latter can be found in an HLA B44 haplotype where the conversion of a C4B gene to C4A in the second C4 locus was concurrent with mutations of the CYP21B gene (10, 14). The diversity of the disorders correlated with the C4A null alleles implies deficiencies of different genes in the close proximity of C4A, and/or the presence of a malfunctioning gene with widespread functional properties.
To this end malfunctions of a transcriptional factor, a DNA repair protein, or a molecule involved in the signal transduction pathway would result in the described disorders. A deficiency of the putative nuclear protein RP could be related to some of these problems.
The mouse also contains RP genes upstream of the C4 and the Slp genes, although the breakpoint of gene duplication or deletion for RP appears to be different from that in humans.⁴ Whether both RP genes in the mouse are functional is yet to be determined. It was shown that a crossover at the mouse C4-CYP21 region led to the lethality of homozygous embryos (55), which suggests the presence of essential gene(s) at the region of crossover. On the other hand, breeding experiments for congenic rats revealed the existence of a growth and reproduction complex (grc) with several genes closely linked to, if not present in, the MHC. One of the genes in the grc has been inferred to be a tumor suppressor gene (56, 57). Our zoo-blot experiment suggested that there are two copies of RP genes in the haploid genome of the rat.⁵ The physical location and the structural information together suggest that genes of the RCCX modules could be related to the grc.
The concept for the modular organization of the C4 and CYP21 genes was first suggested by Klein and colleagues (15). This study extends the concept of modular gene duplication to include the genes flanking C4 and CYP21, namely RP and Gene X. In an RCCX bimodular (or trimodular) structure, the breakpoint of the four-gene duplication is present at Exon 7 of the RP1 gene and an intron of Gene XB. This resulted in the complete duplication of a C4 gene and a CYP21 gene, but only partial duplications of RP and Gene X. The truncated sequences, RP2 and Gene XA, form a chimeric hybrid at the intergenic region of the two C4 genes.
⁴ L. Shen, R. Chen, and C. Y. Yu, manuscript in preparation. ⁵ C. Y. Yu, unpublished data.
This modular duplication/deletion pattern involving four structurally and functionally unrelated genes is intriguing. Partial gene duplication has been suggested to be one of the major mechanisms leading to genetic diseases (58). This is because a partially duplicated DNA or a pseudogene DNA sequence may mutate without selection pressure and those (deleterious) mutations can be incorporated into the functional gene through recombinations or gene conversions, as observed in the CYP21 genes (reviewed in Ref. 59). Thus, the Gene XA-RP2 hybrid sequences could play a role in disrupting the gene function and in the genetic instabilities of the RCCX modules in the population.
Although the bimodular structures of RCCX are prevalent in the population, the single modular structures account for 10-30% of the human genomes (60). Single RCCX modular structures consist of the intact RP1 and Gene XB loci, and varied types of the complement C4 and CYP21 genes.³ In those single RCCX modular genomes, the absence of CYP21B leads to CAH, while the absence of C4A is a predisposing factor for systemic lupus erythematosus. Thus, it is important to understand the mechanism leading to the deletion of genes of the RCCX modules.
In many situations a repetitive DNA element such as an Alu element or an endogenous retrovirus was found at or proximal to the breakpoint of DNA rearrangements (58). In the RCCX modules, a cluster of Alu elements and a composite retroposon SVA with 21 copies of VNTRs are present at Intron 4 of the RP1 gene, which is 2,093 bp upstream of the corresponding breakpoint of RP gene duplication. Notably, dimeric sequences for Alu elements have not been reported in the C4A, CYP21A, Gene XA, RP2, C4B, and CYP21B genes, a genomic region more than 50 kb in size.
Elucidation of the composite retroposon SVA was the result of a deliberate sequence comparison with DNA sequences in the GenBank data base. In contrast to a simpler retroposon SINE-R, the composite retroposon SVA in the RP1, C2, or CYP1A1 genes contains 16-23 copies of unusually GC-rich VNTRs and also additional sequences with an Alu element characterized by distinct deletions and mutations among SVA retroposons. The SINE and Alu elements are arranged in opposite, head-to-head configurations. Since possible gene products of the SVA have not been defined at this stage, the sense DNA strand of SVA cannot be specified with confidence.

However, the presence of multiple T residues at one end of the SVA could reflect the presence of a poly(A) structure similar to that of a messenger RNA that was reverse transcribed and subsequently incorporated into the human genome. If this were the case, the SVA would be orientated in the opposite direction with respect to the resident genes RP, C2, and CYP1A1. The striking similarities in the organization and high sequence identities of SVA among the three genes suggest that SVA may be a recently evolved retroposon. All three SVA elements described above are located in an Alu-rich region. For example, the SVA-RP is present within an Alu cluster and the entire repetitive DNA region spans 4.4 kb in size. The SVA-C2 and SVA-CYP1A1 have acquired an additional structure with 7 copies of hexameric sequences located immediately after the SVA-specific Alu element (Fig. 7). The SVA-RP contains only 135 bp of the 490-bp SINE element present in SINE-Rs. The missing region consists of a responsive element for glucocorticoids (41, 42) and therefore the gene activity of RP may not be induced by these steroids. Whether the SVA element and its GC-rich VNTRs play a role in the function of RP, or in the unusually frequent RCCX modular variations such as gene duplications and/or deletions and polymorphisms, remains to be determined. Only three SVA retroposons have been elucidated so far, but two of them (i.e. SVA-RP and SVA-C2) are localized about 20 kb apart in the polymorphic HLA class III region. It is also of considerable interest to note that about seven copies of the SVA-related VNTR sequences are found close to the meiotic recombinational breakpoint of the HLA DQB1 gene in the DR7 DQw2 haplotype (61).
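The orientation argument above (a run of T residues at one end of the element, read on the gene's strand, corresponds to a poly(A) tail on the opposite strand) can be illustrated with a short Python sketch. This is a hedged illustration only: the function names, the minimum run length of 6, and the toy sequence are assumptions made for the example and are not taken from the SVA sequences or from the analysis performed in this study.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def leading_run(seq: str, base: str) -> int:
    """Length of the uninterrupted run of `base` at the start of `seq`."""
    count = 0
    for ch in seq.upper():
        if ch != base:
            break
        count += 1
    return count


def infer_sense_strand(seq: str, min_run: int = 6):
    """Orient `seq` so that a poly(A) tract sits at its 3' end.

    Returns the sequence as given if it already ends in a poly(A) run,
    its reverse complement if it begins with a poly(T) run, and None if
    neither end provides evidence. `min_run` is an arbitrary threshold.
    """
    if leading_run(seq[::-1], "A") >= min_run:
        return seq
    if leading_run(seq, "T") >= min_run:
        return reverse_complement(seq)
    return None


# Hypothetical toy element (NOT an actual SVA sequence): T-rich 5' end.
toy = "TTTTTTTTGGCCGCACGT"
print(infer_sense_strand(toy))  # ACGTGCGGCCAAAAAAAA
```

In the toy case the element as read begins with a poly(T) run, so its presumed sense strand is the reverse complement, which mirrors the suggestion that the SVA may lie in the opposite orientation to the resident gene. A terminal homopolymer run is only circumstantial evidence, consistent with the caution expressed in the text.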