Identification of an Evolutionarily Conserved Domain in Human Lens Epithelium-derived Growth Factor/Transcriptional Co-activator p75 (LEDGF/p75) That Binds HIV-1 Integrase

Human lens epithelium-derived growth factor/transcriptional co-activator p75 (LEDGF/p75) protein was recently identified as a binding partner for HIV-1 integrase (IN) in human cells. In this work, we used biochemical and bioinformatic approaches to define the domain organization of LEDGF/p75. Using limited proteolysis and deletion mutagenesis we show that the protein contains a pair of evolutionarily conserved domains, together accounting for about 35% of its sequence. Whereas the N-terminal PWWP domain had been recognized previously, the second domain is novel. It comprises ~80 amino acid residues and is both necessary and sufficient for binding to HIV-1 IN. Strikingly, the integrase binding domain (IBD) is not unique to LEDGF/p75, as a second human protein, hepatoma-derived growth factor-related protein 2 (HRP2), contains a homologous sequence. LEDGF/p75 and HRP2 IBDs avidly bound HIV-1 IN in an in vitro GST pull-down assay and each full-length protein potently stimulated HIV-1 IN activity in vitro. LEDGF/p75 and HRP2 are predicted to share a similar domain organization.

genome to a host cell chromosome (for reviews see . Its activity is essential for viral replication and spread in primary cells and contributes to the persistence of the viral infection in vivo (4,5). Blocking HIV-1 IN activity by specific inhibitors was shown to arrest viral spread in cell culture (6). HIV-1 IN, which belongs to the transposase family of DNA transferases, is structurally and mechanistically similar to Mu phage and Tn5 transposases. Its active site, formed by the three acidic residues Asp64, Asp116, and Glu152 (known as the "DDE motif"), is located within the structurally conserved catalytic core domain. The core domain is contained within residues 50-212 of the protein and is flanked by the N-terminal HHCC-type zinc finger and the C-terminal DNA binding domains. Similar to the related bacterial transposases, retroviral INs form multimers, although the true stoichiometry of IN within the retroviral preintegration complex (PIC) is not known (7).
Mutations in HIV-1 IN display a wide range of phenotypes, affecting viral replication at the integration step (class I mutants), or causing various pleiotropic effects on virion morphogenesis and reverse transcription (class II) (8). The pleiotropic phenotypes of many IN mutants suggest that IN might have additional functions in viral replication. Thus, a role for IN in reverse transcription has been proposed (9). The complex phenotypes of class II mutants could potentially be explained by failure of the mutant INs to interact with viral reverse transcriptase and/or a host cell factor(s). A number of cellular and viral proteins were suggested to participate in retroviral integration (for a review see Ref. 10). Furthermore, several proteins were reported to directly interact with HIV-1 IN, including viral reverse transcriptase (9), a component of the SWI-SNF chromatin-remodeling complex INI1 (11), uracil DNA glycosylase UNG2 (12), heat shock protein HSP60 (13), a DNA repair protein Rad18 (14), a Polycomb group protein EED (15), and lens epithelium-derived growth factor/transcriptional coactivator p75 (LEDGF/p75) (Ref. 16, for a review see Ref. 17). The exact roles of these proteins and their importance to viral replication have yet to be determined. However, when HIV-1 or feline immunodeficiency virus (FIV) INs are expressed separately from other viral proteins, endogenous host-cell LEDGF/p75 appears to be the dominant interactor, accounting for their nuclear/chromosomal accumulation (16,18,19). LEDGF/p75 protein markedly stimulated HIV-1 IN activity in vitro and was recently reported to be associated with functional HIV-1 PICs (16,19). These data cumulatively suggest that LEDGF/p75, and possibly its homologs, act as cellular host factors in retroviral replication, likely at the level of chromosomal targeting and/or integration of viral cDNA (17).
LEDGF/p75 belongs to a family of hepatoma-derived growth factor (HDGF)-related proteins (HRPs). Six mammalian HRPs are known: HDGF, HRP1, HRP2, HRP3, HRP4, and LEDGF/p75 (see Supplementary Table I) (20,21). The characteristic feature of these proteins is a high degree of sequence homology within their N-terminal 90-95 residues, spanning the PWWP domain (InterPro accession number IPR000313) (20,22). This domain, named for a conserved although not invariant Pro-Trp-Trp-Pro motif, extends for about 70 residues (22). Several dozen other PWWP domain-containing proteins have been described, including Wolf-Hirschhorn syndrome candidate 1 gene product WHSC1, mismatch repair protein MSH6, mammalian DNA methyltransferases Dnmt3a and Dnmt3b, and a plant homolog of ataxia telangiectasia-mutated protein kinase (22). PWWP domains seem to be distantly related to the Tudor and Chromo domains and are thought to mediate protein-protein interactions involved in regulation of chromatin structure (22,23). Notably, the PWWP domain of Dnmt3b methyltransferase was recently shown to be essential for chromatin association of the protein (24). Recognizable orthologs of mammalian HRPs seem to be present in all vertebrates, although proteins containing PWWP domains are more widespread and occur throughout eukaryota, including yeast (22). Apart from their homologous N-terminal PWWP domains, HRPs show little sequence similarity.
Cellular functions of LEDGF/p75 and other HRPs have not been studied in detail. Like other PWWP domain-containing proteins, HRPs are imported into the nucleus (21,22,25,26). Chromatin association has so far been demonstrated only for LEDGF/p75 (16,18,27). LEDGF/p75 was implicated in the regulation of expression of stress response-related genes, such as Hsp27, αB-crystallin, and antioxidant protein 2, presumably through binding to heat shock and stress-related regulatory elements in the promoters of the target genes (28,29). Overexpression of LEDGF/p75 was reported to enhance cell viability under conditions of serum starvation, thermal, and oxidative stress (28,30). During apoptosis, LEDGF/p75 is subject to cleavage by caspases that abolishes its activity as a cell-survival factor (30).
In this work, we studied the evolutionary conservation and domain organization of LEDGF/p75. We identified the HIV-1 IN binding domain (IBD) in this protein and found that another human protein, HRP2, can bind HIV-1 IN via a homologous domain and stimulate its activity in vitro.

EXPERIMENTAL PROCEDURES
Isolation and Sequence Analysis of LEDGF/p75 and HRP2 cDNAs-A wealth of expressed sequence tags (ESTs) representing fragments of cDNAs encoding homologs of human LEDGF/p75 from various vertebrate sources were readily identified by searching the NCBI sequence database with translating basic local alignment search tool (BLAST) (www.ncbi.nlm.nih.gov/BLAST). ESTs with the following GenBank™ accession numbers gave sufficient sequence information to design primers for PCR amplification of the complete coding regions of Gallus gallus LEDGF/p75 cDNA: BU112216, AJ394255, BU332575, BU129859, and CN231179. ESTs for the Xenopus laevis ortholog were: BJ042207, BJ055881, BX852842, BU912001, and BQ729240. Total RNA isolated from G. gallus pro-B cell line DT40 or kidney tissue from an adult male X. laevis specimen was reverse-transcribed using random-primed SuperScript III reverse transcriptase (Invitrogen). The complete coding region of chicken LEDGF/p75 cDNA was amplified using Expand DNA polymerase (Roche Applied Science) and primers: 5′-CACGGCGGCGAGACAAC/5′-AGATTTCAAATGCAATCCTCTTC. The primers used to amplify the frog cDNA were: 5′-TGCCTGAATTTCGTCGAG and 5′-AATGACCACACGAGTGTGA. The resulting PCR fragments were subcloned into the pCR4-TOPO vector (Invitrogen), and three independent clones for each cDNA were sequenced.
The 3′-terminal part of the Danio rerio (zebrafish) HRP2 cDNA was obtained from the following ESTs: BI886683, AW154327, CD586524, BQ480774, and AL916221. The resulting contig was used to search through the zebrafish genome assembly (www.ensembl.org/Danio_rerio/) using nucleotide BLAST. The gene was identified on chromosome 22 and four exons containing the available portion of HRP2 cDNA sequence could be readily matched to the genomic sequence. Only one fragment predicted to encode a PWWP domain could be identified within the upstream 30 kb genomic sequence using PROSITE (us.expasy.org/prosite/). Assuming this sequence to represent the beginning of the HRP2 open reading frame (ORF), two sets of PCR primers were designed to amplify the cDNA: 5′-GTGGACGGATAGAAACG/5′-GAAGGAAGCCAAGGTGTG and 5′-AACGAGCAGAACGAGGAG/5′-GTTTGTGAGCATAAAAGGAG. Random-primed cDNA prepared from a sample of D. rerio kidney total RNA was used as template. PCRs with both primer pairs readily amplified fragments with the expected size of about 2.1 kb. Sequence analysis agreed with the chromosomal sequence and confirmed homology to human and mouse HRP2. The D. rerio HRP2 gene spans about 24.6 kb on chromosome 22; the coding region of the cDNA is derived from 18 exons. The complete ORF from the X. laevis HRP2 cDNA was reconstructed from the following ESTs: CA789647, AW643477, BJ039312, BJ626283, BJ619541, BF612425, BU916767, CA981423, BJ054046, BJ642669, BE678817, BE678996, BX853753, BF426654, BJ622360, CD363094, BJ639036, BG234506, BJ086552, BJ050372, and BG812065. Partial G. gallus HRP2 cDNA sequence was obtained from a contig of the following ESTs: BU261466, BU324804, BU392278, CD727797, BU351707, BU347241, AI981158, BU141299, BU428133, and BU236024.
DNA Constructs for Bacterial Protein Expression-All glutathione S-transferase (GST)-LEDGF/p75 fusion constructs used for protein expression in this work were based on the pGEX-4T1 vector (Amersham Biosciences). The full-length LEDGF/p75 ORF and its fragments were PCR-amplified using Pfu-Ultra DNA polymerase (Stratagene). Sense primers were designed to incorporate a BamHI restriction site followed directly by the first codon of the relevant LEDGF/p75 fragment; antisense primers contained a stop codon (TGA) directly following the last codon. PCR fragments were digested with BamHI and subcloned between BamHI and SmaI sites of pGEX-4T1.
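The primer layout described above (a BamHI site placed directly before the first codon of the fragment in the sense primer, and a TGA stop codon directly after the last codon in the antisense primer) can be illustrated with a short Python sketch. The annealing length, the 5′ clamp, and the function names are assumptions made for illustration; the actual primers used for the LEDGF/p75 constructs are not listed in the text.

```python
# Illustrative sketch only. The annealing length, the 5' clamp "GCGT", and all
# function names are assumptions; they are not taken from the LEDGF/p75 constructs.
BAMHI_SITE = "GGATCC"
STOP_CODON = "TGA"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def design_primers(orf: str, first_res: int, last_res: int,
                   anneal_len: int = 18, clamp: str = "GCGT"):
    """Build sense/antisense primers for the fragment first_res..last_res.

    orf is the coding sequence starting at codon 1; residue numbers are
    1-based and inclusive.
    """
    fragment = orf.upper()[3 * (first_res - 1):3 * last_res]
    if len(fragment) < anneal_len:
        raise ValueError("fragment is shorter than the annealing length")
    # Sense primer: clamp, BamHI site, then the first codons of the fragment.
    sense = clamp + BAMHI_SITE + fragment[:anneal_len]
    # Antisense primer: reverse complement of (last codons + stop codon),
    # so that TGA directly follows the last codon on the coding strand.
    antisense = reverse_complement(fragment[-anneal_len:] + STOP_CODON)
    return sense, antisense
```

Because the antisense primer is written 5′ to 3′ on the opposite strand, the stop codon appears at its 5′ end as TCA, the reverse complement of TGA.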
To clone the putative IBD of HRP2, a fragment coding for residues 470-593 of the human protein was PCR-amplified from random-primed HeLa cDNA using Expand DNA polymerase and the following primers: 5′-GCGTGGATCCTCCGTGGAGGAGAAGCTGCAG/5′-CCCTCACTTGTCCTCCGCCTTCTCC. The resulting PCR fragment was digested with BamHI and ligated into BamHI/SmaI-digested pGEX-4T1. The full-length HRP2 ORF was PCR-amplified from cDNA clone MGC2641 (American Type Culture Collection) using primers 5′-GCGTGGATCCATGCCACACGCCTTCAAGCC and 5′-GCTCAGCTCTCCTCGTCCAGGGCCTC; the PCR fragment digested with BamHI was subcloned between BamHI and SmaI sites of pGEX-6P3, resulting in pCP-GSTHRP2. For expression of non-tagged full-length HRP2, the entire HRP2 ORF was amplified using primers 5′-TGCCACACGCCTTCAAGCC and 5′-GTTTTCACCGTCATCAC; the resulting PCR fragment was digested with XhoI and cloned between NdeI and XhoI sites of pRSETB (Invitrogen) (the vector NdeI terminus was filled in using T4 DNA polymerase), giving pCP-NatHRP2. Non-modified pGEX-4T1 was used to produce GST as a control for pull-down experiments. Plasmids pCPNat75 and pKB-IN6H were described previously (18).
DNA Constructs for Expression in Human Cells-Plasmids pBHA-P75 and pCPHA-HRP2 expressed human LEDGF/p75 and HRP2 with N-terminal influenza hemagglutinin (HA) tags, respectively, under the control of the human cytomegalovirus immediate-early promoter. To make pBHA-P75, the LEDGF/p75 ORF was PCR-amplified using 5′-CCGCGGATCCGACACCATGGCATACCCATACGACGTCCCAGACTACGCTACTCGCGATTTCAAACCTGGAGACC/5′-ATAAGAATGCGGCCGCCTAGTTATCTAGTGTAGAATCC and Pfu-Ultra DNA polymerase. The resulting amplicon was digested with BamHI and NotI and ligated into BamHI/NotI-digested pcDNA6-V5-HisB (Invitrogen). The BamHI/XhoI fragment of pCP-GSTHRP2 carrying the entire HRP2 ORF was re-cloned between BglII/XhoI sites of the pCPHA-NLS vector, fusing the 5′-end of the HRP2 ORF directly to the HA tag coding sequence, resulting in pCPHA-HRP2. The HA tag fusion vector pCPHA-NLS was made by first disrupting the BglII site in pcDNA6-V5-HisB by digesting it with BglII, filling in using Pfu polymerase, and religation, resulting in pcDNA6ΔBgl. A DNA fragment obtained by annealing synthetic oligonucleotides 5′-CGGGAAGCTTAGACACCATGGCCTACCCTTACGACGTGCCCGACTACGCCAGATCTG and 5′-GGTGGGATCCCTCCACCTTCCGCTTCTTCTTGGGAGGGCCAGATCTGGCGTAGTCG followed by extension using Sequenase Version 2.0 T7 DNA polymerase (Amersham Biosciences) was restricted with HindIII and BamHI and then ligated with HindIII/BamHI-digested pcDNA6ΔBgl. The resulting pCPHA-NLS vector encodes the HA tag fused to the simian virus 40 large T antigen nuclear localization signal (NLS), with an intervening BglII restriction site. The construct from Bram et al. (38) was used to express human cyclophilin A (CypA) with a C-terminal HA tag. This plasmid will be referred to here as pCypA-HA. The construct pED-FLAG-IN was used to express FLAG-tagged HIV-1 IN (39). All expression constructs were verified to be free of inadvertent mutations by sequencing.
Non-tagged HRP2 was induced in Rosetta2 (DE3) cells (Novagen) for 3 h at 28°C by addition of 0.25 mM isopropyl-thio-β-D-galactopyranoside. The bacteria were disrupted by sonication in 1 M NaCl, 50 mM NaH2PO4, 5 mM DTT, 0.3 mM PMSF, pH 7.7. The lysate precleared by centrifugation at 15,000 rpm for 30 min was diluted with 50 mM NaH2PO4, pH 7.2 to reduce conductivity to 24 mS/cm and injected into a 5-ml HiTrap heparin column. Bound proteins were eluted with a linear salt gradient in 50 mM NaH2PO4, pH 7.2. HRP2 protein eluting at ~500 mM NaCl was collected, the peak fractions were pooled, diluted 1:3 in 50 mM NaH2PO4, pH 7.2 and injected into a 5-ml HiTrap SP-Sepharose column. The protein was eluted with a linear gradient of NaCl from 0.15 to 1.0 M in 50 mM NaH2PO4, pH 7.2. Fractions containing HRP2 were pooled, concentrated using a Centricon device, and further separated on a Superdex 200HR column (Amersham Biosciences) at 0.25 ml/min in 250 mM NaCl, 50 mM NaH2PO4, pH 7.2. The purified protein was concentrated to 3.5 mg/ml, supplemented with 10% glycerol and stored at −70°C after flash-freezing in liquid nitrogen. Non-tagged LEDGF/p75 and His6-tagged HIV-1 IN were produced in E. coli strain BL21(DE3), pLysS using pCPNat75 and pKB-IN6H, respectively, and purified according to published procedures (18,40). LEDGF-(326-530), LEDGF-(347-471), and HRP2-(470-593) fragments released from GST by digestion with thrombin (Sigma-Aldrich) were further purified by cation exchange chromatography on SP-Sepharose using a linear 0.1-0.5 M NaCl gradient in 50 mM sodium phosphate buffer, pH 7.2. Protein concentrations were determined using the Bradford colorimetric assay (Bio-Rad) employing bovine serum albumin (BSA) as a standard.
Bands excised from Coomassie Blue R250-stained membranes were subjected to Edman degradation in a Procise protein sequencer (Applied Biosystems). In-gel trypsin digestion and peptide extraction were done as described (41). To determine molecular masses of the intact TR1 and TR2 peptides, a mixture of digestion products was separated on a 2.1 × 250 mm Vydac C8 column. Fractions containing the TR1 and TR2 fragments were analyzed by matrix-assisted laser desorption/ionization mass spectrometry (MALDI MS).
GST Pull-down Assay-Purified GST fusion proteins were adsorbed onto glutathione-Sepharose beads (Amersham Biosciences) in 200 mM NaCl, 5 mM DTT, 25 mM Tris-HCl, pH 7.3, using 125 μl (settled volume) of beads per 40 μg of protein. After 4 h at 4°C, the beads were washed in excess buffer and stored on ice. To test for IN binding, 10 μl of glutathione-Sepharose beads carrying GST fusion proteins were resuspended in 200 μl of cold PD buffer (150 mM NaCl, 5 mM MgCl2, 5 mM DTT, 0.1% Nonidet P40, 25 mM Tris-HCl, pH 7.4) containing 10 μg of BSA. After addition of 3.8 μg of His6-tagged HIV-1 IN the samples were gently rocked for 1.5-2 h at 4°C and left for an additional 15-30 min without mixing. After careful aspiration of the supernatant, the settled beads were resuspended in 700 μl of fresh PD buffer and allowed to sediment without centrifugation. The wash was repeated twice and bound proteins were eluted in SDS-containing sample buffer and analyzed by SDS-PAGE. In certain cases IN pull-down was confirmed by Western blotting using polyclonal anti-IN serum (42).
Cell Transfection and Immunoprecipitation-293T cells were maintained in Dulbecco's modified Eagle's medium containing 10% fetal calf serum (Invitrogen), 5 units/ml penicillin and 5 μg/ml streptomycin. 293T cells grown in 6-well dishes to 30-50% confluency were transfected with 0.5 μg of pCypA-HA, pBHA-P75, or pCPHA-HRP2 along with 0.5 μg of pED-FLAG-IN per well using FuGENE 6 transfection reagent (Roche Applied Science). Twenty-four hours post-transfection, cells were washed in cold phosphate-buffered saline, and lysed in 400 μl of cell lysis buffer (500 mM NaCl, 0.5% Triton X-100, 50 mM HEPES pH 7.9, 5% glycerol, 2 mM MgCl2, 25 mM β-glycerophosphate, 1 mM sodium orthovanadate, supplemented with complete protease inhibitor mixture (Roche Applied Science)). The extracts were centrifuged at 19,000 × g to remove cell debris and precleared by incubation with 4 μl (settled volume) of protein G-Sepharose beads (Amersham Biosciences). Precleared supernatants were incubated with 4 μg of mouse anti-HA 12CA5 antibody (Roche Applied Science) at 4°C, 4 μl of protein G-Sepharose beads were added, and the samples were left rocking for an additional hour. The beads were washed three times in cell lysis buffer and four times in reduced salt buffer (cell lysis buffer modified to contain 150 mM NaCl, 0.1% Triton X-100, and 0.1% Nonidet P-40). Whole cell extracts and immunoprecipitated proteins were resolved in 4-20% SDS-polyacrylamide gels. Following semi-dry transfer to polyvinylidene difluoride membranes, HA-tagged CypA, LEDGF/p75, and HRP2 proteins were detected by Western blotting using anti-HA 3F10 antibody conjugated to horseradish peroxidase (Roche Applied Science) and Western Lightning chemiluminescent reagent plus (PerkinElmer Life Sciences). FLAG-tagged IN was detected with anti-FLAG M2 antibody (Sigma-Aldrich) and goat anti-mouse IgG horseradish peroxidase conjugate (Jackson ImmunoResearch Laboratories).

RESULTS
Conservation of LEDGF/p75 Protein-Sequences of several mammalian LEDGF/p75 orthologs were available in public sequence databases. We identified ESTs representing partial cDNA sequences of G. gallus and X. laevis LEDGF/p75 cDNAs, which allowed us to clone and sequence complete LEDGF/p75 cDNAs from these species. On the basis of the obtained cDNA sequences, chicken and frog LEDGF/p75 were predicted to be composed of 579 and 564 amino acids, respectively, both somewhat larger than the 530-residue human ortholog. Alignment of the predicted amino acid sequences revealed ~48% identity between mammalian, avian, and amphibian LEDGF/p75 proteins (Supplementary Fig. S1). The plot in Fig. 1A summarizes this alignment by showing the degree of conservation along the protein sequence. Three regions of homology were evident (highlighted as shaded boxes in Fig. 1A). The most conserved fragment spanning residues 1-94 (conserved region I), which showed about 89% identity between human, chicken, and frog, corresponded to the PWWP domain (22). A 105-residue region spanning residues 351-455 displayed about 87% identity (region III). In addition, a short fragment involving residues 178-197 (region II) showed significant homology. Intuitively, these most conserved regions likely represent functional and/or structural determinants within the protein. The most variable regions encompassed an internal fragment flanking the PWWP domain (residues 94-177 in human LEDGF/p75, showing only about 13% identity) and the 60 C-terminal residues of the protein (20% identity). The single conserved feature of the first hypervariable region was a 7-residue sequence, RRGRKRK (residues 146-152), which partially overlaps the NLS in human LEDGF/p75 (residues 148-156) (43). Both chicken and frog LEDGF/p75 contain an insertion of 39 amino acids within the first hypervariable region (Supplementary Fig. S1).
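A conservation plot of this kind can be approximated by counting fully identical alignment columns within a window. The Python sketch below is a minimal illustration, assuming a sliding window of 20 alignment positions and treating any column that contains a gap as non-identical; the function name, the window handling, and the gap treatment are assumptions rather than the procedure used to draw Fig. 1A.

```python
# Minimal sketch, not the procedure used for Fig. 1A. The sliding-window
# scheme and the treatment of gap columns are assumptions.
def windowed_identity(aligned, window=20):
    """Percent of identical columns in each sliding window of an alignment.

    aligned: list of equal-length aligned sequences, with '-' for gaps.
    Returns one percentage per window start position.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as identical only if every sequence has the same
    # residue at that position and that residue is not a gap.
    identical = [len(set(col)) == 1 and col[0] != "-" for col in zip(*aligned)]
    if length < window:
        return []
    return [100.0 * sum(identical[i:i + window]) / window
            for i in range(length - window + 1)]
```

With three short toy sequences and a window of 5, `windowed_identity(["ACDEFGHIKL", "ACDEFGHIKV", "ACD-FGHIKL"], window=5)` returns `[80.0, 80.0, 80.0, 80.0, 100.0, 80.0]`.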
Secondary Structure Prediction-We used the protein structure prediction program package available through the PredictProtein server (31) to analyze possible structural elements in LEDGF/p75. The protein is uncommonly rich in charged amino acids, which account for about 42% of its sequence. Thus, it was not surprising that two extensive loop regions were predicted by the NORSp program (residues 177-250 and 440-530, Fig. 1B). Furthermore, analysis by PROFsec identified only relatively short regions that are likely to be involved in stable secondary structure (Fig. 1B). By homology, the N-terminal 90 residues of the protein are known to constitute a PWWP domain (22). The prediction of β-strand elements followed by α-helices in that region is consistent with this, since a five-stranded β-barrel core and a C-terminal bundle of α-helices are conserved structural features of PWWP domains (44). The region encompassing residues 347-423 and matching homology region III (Fig. 1A) was predicted with high confidence to pack into four or five α-helices. Of note, this fragment and the N-terminal PWWP domain span the two mostly hydrophobic regions of LEDGF/p75, with average hydrophobicity indices above zero (data not shown).
Limited Proteolysis of LEDGF/p75-We used limited proteolysis (45) to probe the domain organization of LEDGF/p75. As the protein is rich in charged amino acids, a cleavage site for trypsin is predicted on average every 4-5 residues. Considering all Lys and Arg residues, the largest hypothetical LEDGF/p75 tryptic peptide was just 25 residues (Thr477-Lys501) with a molecular mass of about 2.6 kDa. We found that recombinant human LEDGF/p75 was indeed very sensitive to trypsin. A mass ratio of 250:1 of LEDGF/p75:protease yielded final proteolyzed products as well as semi-stable intermediates (Fig. 2A). The protease was quenched at different time points by addition of PMSF and reaction products were analyzed using Tris-glycine or Tricine SDS-PAGE. As quantified by densitometry of Coomassie-stained gels, ~60-70% of the protein was degraded after a relatively short exposure to trypsin (compare lanes 1 and 6 in Fig. 2A). As proteolysis proceeded, two distinct polypeptides, TR1 and TR2, with apparent molecular masses close to 10 kDa gradually accumulated at the expense of the intermediate cleavage products (Fig. 2A). Both TR1 and TR2 fragments persisted even after overnight digestion under these conditions (data not shown). N-terminal sequencing of TR1 and TR2 fragments revealed that TR1 was derived from the N-terminal part of LEDGF/p75, having the same N-terminal sequence as the full protein, i.e. NH2-Met-Thr-Arg-Asp-Phe. TR2 originated from the C-terminal portion of the protein and contained two overlapping N termini: NH2-Lys-Arg-Glu-Thr-Ser-Met- and NH2-Glu-Thr-Ser-Met-Asp-Ser-, corresponding to trypsin cleavage at peptide bonds Lys342-Lys343 and Arg344-Glu345, respectively. To identify the C termini of the fragments, TR1 and TR2 were purified by reverse phase high performance liquid chromatography and their masses were determined by MALDI-MS. The molecular mass of the TR1 fragment was 11,424 ± 10 Da. The TR2 product represented a mixture of fragments of 11,348 ± 10 and 11,632 ± 10 Da. These data allowed us to unambiguously map the C termini of the TR fragments to LEDGF/p75 residues Lys100 for TR1 and Lys442 for TR2. Indeed, the calculated molecular mass of Met1-Lys100 was 11,429.3 Da, whereas the masses of Lys343-Lys442 and Glu345-Lys442 fragments were 11,641.2 and 11,356.8 Da, respectively, which matched the experimentally determined masses well within confidence intervals. When a deletion mutant retaining the 205 C-terminal residues of LEDGF/p75 (residues 326-530) was exposed to trypsin, only the TR2 fragment was obtained (Fig. 2B, lanes 3-7). This result confirmed that TR1 and TR2 resided in the N- and C-terminal regions of LEDGF/p75, respectively. Although more than a dozen potential trypsin cleavage sites exist within fragments 1-100 and 345-442 of LEDGF/p75, both appeared to resist proteolysis, indicating that both are involved in stable structures.
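The reasoning used to map the fragment C termini can be sketched in Python: with the N terminus fixed by Edman sequencing, each candidate tryptic C terminus (a Lys or Arg) gives a calculated average mass that is compared with the MALDI value. The sketch assumes unmodified peptides, standard average residue masses, and cleavage after every Lys and Arg regardless of the following residue; the function names and the toy sequence in the usage note are hypothetical and are not LEDGF/p75 data.

```python
# Minimal sketch under stated assumptions: unmodified peptides, average
# residue masses, cleavage after every K/R. Function names are hypothetical.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added per free peptide


def tryptic_peptides(seq):
    """Split seq after every Lys/Arg; return (start, end, peptide), 1-based."""
    peptides, start = [], 0
    for i, residue in enumerate(seq):
        if residue in "KR":
            peptides.append((start + 1, i + 1, seq[start:i + 1]))
            start = i + 1
    if start < len(seq):
        peptides.append((start + 1, len(seq), seq[start:]))
    return peptides


def average_mass(peptide):
    """Average molecular mass (Da) of an unmodified peptide."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in peptide) + WATER


def map_c_terminus(seq, n_start, observed_mass, tolerance=10.0):
    """List tryptic C termini whose calculated mass matches the observed one.

    n_start is the 1-based N-terminal residue known from sequencing.
    Returns (end_position, calculated_mass) pairs within the tolerance.
    """
    hits = []
    for end in range(n_start, len(seq) + 1):
        if seq[end - 1] in "KR":
            mass = average_mass(seq[n_start - 1:end])
            if abs(mass - observed_mass) <= tolerance:
                hits.append((end, round(mass, 1)))
    return hits
```

For a toy sequence, `tryptic_peptides("MTRDFKAGEK")` yields the peptides MTR, DFK, and AGEK, and `max(tryptic_peptides(seq), key=lambda p: len(p[2]))` picks out the largest hypothetical tryptic peptide of any sequence.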
In addition to trypsin, we tested proteinase K, thrombin, chymotrypsin, and Arg C proteases (data not shown). Unlike trypsin, digestion with proteinase K did not result in stable proteolytic products; however, transient fragments of about 10 kDa in size were observed. Incubation of GST-LEDGF/p75 with thrombin resulted in multiple cuts within the putative loop region adjoining the PWWP domain. Chymotrypsin and Arg C proteases appeared less active than trypsin, and although the fragments obtained confirmed the tryptic map, the cleavage patterns were more complex and longer incubation times were necessary to allow for accumulation of final products.

TR2 Is the Functional LEDGF/p75 IBD-To identify region(s) of LEDGF/p75 involved in the interaction with HIV-1 IN, we prepared a series of LEDGF/p75 deletion mutants.
Mutants were expressed and purified as GST fusions, preadsorbed onto glutathione-Sepharose beads, and tested for their ability to pull down recombinant HIV-1 IN. As can be seen from Fig. 3A, both the full-length protein (residues 1-530) and the mutant lacking the variable 59 C-terminal residues (1-471) readily bound HIV-1 IN (Fig. 3A, lanes 9 and 12). However, a more extended deletion from the C terminus disrupted interaction with IN, as LEDGF-(1-325) lacking 205 residues failed to pull down IN (lane 10). This result corroborates the previous finding that LEDGF/p52, an alternative splice form containing a unique 8-residue tail in place of LEDGF/p75 residues 326-530, did not bind HIV-1 IN (18). Furthermore, the C-terminal fragment of LEDGF/p75 containing residues 326-530 was sufficient to pull down HIV-1 IN (lane 11). By making another set of deletions, the IN binding function of LEDGF/p75 was mapped to just 83 amino acids, spanning residues 347-429 (Fig. 3B, lane 15; see also Supplementary Fig. S1). Importantly, this fragment lies within conserved region III of LEDGF/p75 (Fig. 1A) and the TR2 fragment (Fig. 2; see also Fig. 1B for summary). We found that further truncations from the N terminus of 347-429 abolished the interaction with IN (lane 16) and reduced the solubility of the recombinant protein (data not shown). Deletions from the C terminus of this fragment, on the other hand, profoundly affected stability of GST fusion proteins in E. coli (data not shown). These observations indicated that residues 347-429 of LEDGF/p75 span the IBD and comprise the minimal sequence required for its proper folding.
Of note, full-length LEDGF/p75, as well as deletion mutants containing the interdomain region (residues 150-325), were only marginally stable when expressed in bacteria, even when the temperature of induction was reduced. The bulk of the GST fusions recovered by adsorption to glutathione-Sepharose represented various proteolytic fragments. Due to dimerization of GST, it was not feasible to completely remove proteolytic fragments from preparations of GST-LEDGF/p75 or fusions with LEDGF-(1-325) or LEDGF-(1-471) even after additional heparin affinity and cation exchange chromatography (Fig. 3A).
Identification of HRP2 as a Second IBD-containing Protein-Using translated BLAST to search for human cDNAs encoding polypeptides with homology to the LEDGF/p75 IBD, we found that a second HDGF-related protein, HRP2, contains a very similar sequence within its C-terminal region. Because this region of homology is relatively short and occurs within largely divergent sequences, the similarity within C-terminal regions of LEDGF/p75 and HRP2 remained unnoticed until now. Fig. 4A presents an alignment of the human LEDGF/p75 IBD with the related sequence from HRP2 and includes their respective orthologs from different species. Human LEDGF/p75 and HRP2 proteins are about 48% identical within this region, and, considering conservative amino acid substitutions, the similarity exceeds 70%. Furthermore, predicted secondary structural elements within the two putative IBDs matched very well, with both domains demonstrating high α-helical content (Fig. 4A). We identified several ESTs encoding an HRP2 ortholog from D. rerio, which allowed us to clone and sequence its complete coding region. In addition, HRP2 cDNA from X. laevis could be completely reconstructed from available ESTs (see "Experimental Procedures"). Sequence alignment of human, frog, and fish HRP2 revealed high degrees of sequence conservation within the PWWP and IBD-like regions (regions I and III, Fig. 4B) (for a complete alignment see Supplementary Fig. S2). An approximately 20-amino acid region of homology (region II) was similar to homology region II in LEDGF/p75, with each region containing several conserved Pro, Arg, and Lys residues. HRP2 region IV, however, appears unique to this protein. In addition, we also identified a hypothetical 475-residue protein CG7946 from Drosophila melanogaster (GenBank™ accession NP_651768, UniGene cluster Dm.4512) that contains an IBD-related sequence. This fragment, spanning CG7946 residues 318-400, shared about 21% identical and 46% similar residues with the HRP2 IBD (not shown). Intriguingly, since this protein is also predicted to possess an N-terminal PWWP domain, it likely represents an insect ortholog of HRP2. Additional searches using InterProScan and SMART (Simple Modular Architecture Research Tool) revealed homology between the IBDs and the N-terminal domain of TFIIS (SMART accession SM00509). Although the E-values reported by SMART for these hits were relatively high, equating to 3.1 and 1.2 for human LEDGF/p75 and HRP2 IBDs, respectively, the N-terminal domain of TFIIS seems to represent their closest relative among known protein domains. The TFIIS domain family includes four-helix bundle domains of TFIIS, elongin A, and CRSP70 (46).
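Percent identity and percent similarity between two aligned domains, as quoted above, can be computed along the following lines. This Python sketch is illustrative only: the grouping of amino acids into conservative-substitution classes and the choice to ignore gapped positions in the denominator are assumptions, since the text does not state which similarity criteria were applied.

```python
# Illustrative sketch. The conservative-substitution groups and the handling
# of gapped positions are assumptions, not the criteria used in this work.
SIMILARITY_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG"]


def identity_and_similarity(seq_a, seq_b):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Positions with a gap ('-') in either sequence are skipped. Identical
    residues also count as similar.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    group_of = {aa: idx for idx, group in enumerate(SIMILARITY_GROUPS)
                for aa in group}
    compared = identical = similar = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
            similar += 1
        elif a in group_of and group_of[a] == group_of.get(b):
            similar += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * identical / compared, 100.0 * similar / compared
```

For the toy pair `"KLDE-S"` and `"RLEEAT"`, five ungapped positions are compared, two are identical and the other three fall into the same group, so the function returns `(40.0, 100.0)`.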
To find out whether the putative IBD of HRP2 has affinity for HIV-1 IN, we fused a fragment spanning HRP2 residues 470-593 to GST and tested it in our pull-down assay. As seen in Fig. 5A [...] activity in the absence of organic solvents and polyethylene glycol (lanes 1-7, Fig. 6A). The DNA substrate was a linearized plasmid containing HIV-1 U3 and U5 sequences at its termini (mini-HIV, see Ref. 47). Strand-transfer products represented a range of branched molecules resulting from inter- and intramolecular integration of the substrate DNA and were readily visualized in agarose gels by staining with ethidium bromide (Fig. 6A). In the absence of LEDGF/p75, only a marginal level of product formation was detected (compare lanes 2 and 4-7 in Fig. 6A). Maximum stimulation was seen at 0.2 μM LEDGF/p75 (lane 5, Fig. 6A). LEDGF/p75-dependent stimulation of IN strand transfer activity was efficiently blocked by a specific IN inhibitor, diketo acid L-731,988 (6), at submicromolar concentrations (Fig. 6B).
FIG. 4. HRP2 contains a conserved IBD. A, multiple sequence alignment of the conserved IBD regions of LEDGF/p75 and HRP2 orthologs. Amino acid coordinates were given when complete sequence information was available. Amino acid residues identical between all aligned sequences are highlighted in black. Positions conserved through substitution are shown in gray. The code for the consensus is: −, negatively charged; +, positively charged; ±, charged; !, hydrophobic; ¥, polar; @, aromatic. The taxonomic alignment is: chicken, G. gallus; lizard, Anolis sagrei; frog, X. laevis; fish, D. rerio. Sequences with the following GenBank™ accession numbers were used: AAC25167 (human LEDGF/p75), NP_116020 (human HRP2), and CF775831 (EST from A. sagrei). The coding regions of G. gallus and X. laevis LEDGF/p75 and D. rerio HRP2 were determined in this work. The sequence of the complete X. laevis HRP2 cDNA and a 3′-portion of the G. gallus HRP2 cDNA coding region were reconstructed from available ESTs. α-Helical regions for human LEDGF/p75 and HRP2 as predicted by PROFsec are shown as boxes above the alignments. B, sequence conservation between mammalian, amphibian, and fish orthologs of HRP2. Percentages of identical residues along the alignment of human, X. laevis, and D. rerio HRP2 proteins, calculated in windows of 20 residues. The complete alignment is shown in Supplementary Fig. S2.
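The windowed conservation profile described in the legend to Fig. 4B (percentage of identical residues in windows of 20 alignment positions) can be illustrated with a minimal Python sketch. The function name `windowed_identity`, the use of consecutive non-overlapping windows, and the rule that a gapped column never counts as identical are assumptions made for illustration; the legend does not specify these details, and the toy sequences below are not taken from the actual alignment.

```python
from typing import List


def windowed_identity(aligned: List[str], window: int = 20) -> List[float]:
    """Percent of fully identical columns per window of a multiple alignment.

    Assumptions (not stated in the source): windows are consecutive and
    non-overlapping, and a column containing a gap ('-') is never identical.
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    # One boolean per alignment column: True if all residues match and none is a gap.
    identical = [
        len(set(column)) == 1 and column[0] != "-"
        for column in zip(*aligned)
    ]

    profile = []
    for start in range(0, length, window):
        chunk = identical[start:start + window]
        profile.append(100.0 * sum(chunk) / len(chunk))
    return profile


if __name__ == "__main__":
    # Hypothetical toy alignment of three sequences, window of 4 columns.
    toy = ["MKRPAAQL", "MKRPSAQL", "MKKP-AQL"]
    print(windowed_identity(toy, window=4))
```

Plotting the returned list against window position would give a profile of the kind shown in Fig. 4B, with conserved regions appearing as peaks.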
To test whether HRP2 can stimulate HIV-1 IN in vitro, we purified full-length HRP2 protein (Fig. 6E). As can be seen from Fig. 6F, HRP2 [...] Reaction products were treated and visualized as in Fig. 6F.

DISCUSSION
Sequence alignments, in silico secondary structure prediction, and limited proteolysis collectively suggest that LEDGF/p75 contains a pair of small structural domains: an N-terminal PWWP domain (residues 1-90), the existence of which had been recognized on the basis of sequence homology, and a novel domain that mediates interaction with HIV-1 IN. Remarkably, these two domains encompass only about 35% of the protein sequence. Recombinant LEDGF/p75 displays high sensitivity to proteolysis, suggesting that a large portion of the protein exists as flexible regions or loops. Of note, we did not detect a stable interaction between the PWWP and IBD domains in a GST pull-down assay (data not shown), suggesting that the domains are relatively independent in the full-length protein.
We think that such flexibility might be related to the function of the protein in vivo, allowing the domains to associate with and link together components of various complexes. Two putative loop regions with no regular secondary structure were suggested by in silico analysis of LEDGF/p75 (Fig. 1B). Interestingly, proteins containing extended loops are statistically associated with transcription regulatory functions (48). In addition to the PWWP and the IBD domains, an internal 20-residue fragment of LEDGF/p75 (residues 178-197) displayed significant sequence conservation (region II, Fig. 1A). This 20-amino acid fragment contains five Pro residues and is thus unlikely to adopt an independent secondary structure. Its high Pro and Arg content makes it similar to the AT hook motif of the HMGA proteins. Given the sequence conservation of region II, we speculate that it is important for LEDGF/p75 function. One likely possibility is that it represents a part of the DNA binding determinant of LEDGF/p75.
The LEDGF/p75 IBD comprises about 80 residues and is predicted to fold into four or five α-helices (Figs. 1B and 4). The minimal fragment that bound HIV-1 IN via GST pull-down spanned residues Ser347-Val429. This is in agreement with a previous report that LEDGF/p52 protein lacking residues 326-530 neither bound HIV-1 IN in vitro nor co-localized with it in live cells (18). Intriguingly, we identified a homologous sequence within another HDGF-related protein, HRP2, which likewise displayed affinity for HIV-1 IN. Thus, in addition to the N-terminal PWWP domains, LEDGF/p75 and HRP2 share conserved C-terminal domains, suggesting a close evolutionary and probable functional relationship between these proteins. Although we did not analyze susceptibility of HRP2 to proteases, analysis of its predicted amino acid sequence suggests that its domain organization is similar to that of LEDGF/p75. Alignment of HRP2 orthologs from mammalian, amphibian, and fish sources showed a high degree of sequence conservation within the PWWP and IBD regions (Fig. 4B, see also Supplementary Fig. S2). Two additional fragments with significant interspecies homology (regions II and IV, Fig. 4B) were present in this protein. While HRP2 homology region II was clearly related to LEDGF/p75 region II, containing similarly spaced Pro and charged residues (Supplementary Figs. S1 and S2), region IV appears unique to HRP2. An extended α-helix involving residues Glu321-Arg356 is predicted in this fragment. Thus, it is likely that HRP2 possesses an additional small structural domain. The sequences connecting the conserved regions in HRP2 contain multiple low complexity elements composed of Pro, Ser, or Ser-Asp repeats, suggesting high flexibility (Supplementary Fig. S2). Low complexity sequences are common to eukaryotic proteins and are thought to be natively disordered (49). Such sequences are usually not conserved, in accordance with their putative roles as flexible hinges.
A high prevalence of simple sequences in HRP2 explains the overall low degree of sequence conservation between orthologs compared with that of LEDGF/p75 (see Supplementary Table I and Figs. S1 and S2). In silico analysis of amino acid sequences of other HRPs suggests that although they do not possess IBD-like domains, α-helical elements are located within C-terminal regions of HDGF and HRP1 (data not shown), suggesting the presence of a second functional domain within these proteins as well.
Like HDGF, all HRPs seem to have mitogenic activity in cell culture (21,25,30). It is presently unclear whether the growth factor activity of such proteins that lack classical secretory signals is related to their functions in vivo (20). The original observation that LEDGF/p75 co-purified from HeLa nuclear extracts together with the transcription co-activator PC4 provided a clue that the protein might be involved in transcriptional regulation (50). More recently, LEDGF/p75 was reported to bind to heat shock and stress-related elements within promoter regions of the AOP2, Hsp27, and αB-crystallin genes and trans-activate their expression (28,29). Although an earlier study isolated LEDGF/p75 from a lens epithelial cDNA library, expression of the protein is clearly not limited to lens. In contrast to the protein's name, cDNA clones encoding LEDGF/p75 have been isolated from a wide range of primary and transformed mouse and human tissues at all stages of development (refer to EST collections associated with the UniGene entries from Supplementary Table I). Sequences derived from 215 cDNA clones, suggesting several alternative LEDGF splice variants, exist in the AceView database (for up to date information consult www.ncbi.nlm.nih.gov/IEB/Research/Acembly/). While the most abundant splice form, supported by 170 cDNA clones, encodes LEDGF/p75, only 12 cDNAs are derived from p52 mRNA. Although a detailed expression analysis of individual splice forms will require a specialized study, it would appear that LEDGF/p75 is the dominant protein product of the PSIP1 gene in most tissues.
According to the large numbers of human and mouse ESTs corresponding to LEDGF/p75 and HRP2, these proteins are ubiquitously expressed at relatively high levels (see Supplementary Table I). Although the HRP2 IBD displayed an apparent high affinity for HIV-1 IN by GST pull-down (Fig. 5A), results of co-immunoprecipitation experiments suggested that LEDGF/p75 was a more potent IN interactor than was full-length HRP2 in human cells (Fig. 5B). This was not entirely unexpected, as depletion of endogenous LEDGF/p75 alone by siRNA efficiently disrupted the nuclear and chromosomal accumulation of HIV-1 and FIV IN in cells (18,19). However, LEDGF/p75 and HRP2 proteins stimulated HIV-1 IN to a comparable degree in vitro (Fig. 6F). Based on this result we speculate that binding of IN to HIV-1 cDNA termini might stabilize the HRP2-IN interaction. HRP2 could potentially explain the failure of persistent siRNA-mediated knockdowns of LEDGF/p75 to reduce viral replication (19). It would also be interesting to determine if LEDGF/p75 and/or HRP2 modulate the enzymatic activity of FIV and other retro/lenti-viral INs (19).
It was demonstrated that HIV-1 displays a significant bias toward integration into active genes (51,52). A somewhat similar, but not identical, integration specificity was observed for murine leukemia virus, which prefers to integrate within transcription start regions in the human genome (52). On a practical level, specificity for integration within or near active genes poses a problem in developing retroviral vector-based gene therapies (53). Distant relatives of retroviruses, yeast retrotransposons, present the best studied paradigm of targeted integration in eukaryotes (reviewed in Ref. 54). At least in the case of the Ty5 retrotransposon, a specific interaction between Ty5 IN and the chromosomal protein Sir4p determines the specificity of retrotransposition into silent chromatin (55,56).
Integration of another yeast retrotransposon, Ty3, which has a preference for RNA polymerase III transcription start sites, is controlled by a TFIIIB transcription factor complex, although the interacting determinant on the retrotransposon side is not known (57). Putative chromodomains were identified in the C-terminal regions of INs from many LTR retrotransposons, such as fungal Cft1 and Skippy, and were hypothesized to mediate the targeting of their integration (58). In this context, a model involving a chromatin binding protein as a targeting factor for retroviral integration seems quite plausible. LEDGF/p75, a chromosomal protein and a putative regulator of transcription that binds lentiviral INs in live cells, represents such a candidate factor (16-19). Identification of LEDGF/p75 as a component of HIV-1 PICs encourages further research, as it remains to be seen whether LEDGF/p75 and/or its close relative HRP2 play role(s) in PIC formation or targeting during retroviral infection (19).