Protein and Carbohydrate Structural Analysis of a Recombinant Soluble CD4 Receptor by Mass Spectrometry*

The primary structure of a soluble form of the CD4 receptor (sCD4) expressed in Chinese hamster ovary cells has been confirmed by mass spectrometric peptide mapping and tandem mass spectrometry. These studies corroborated 95% of the 369-amino acid-long sequence and established the fidelity of translation of the NH2 and COOH termini, including the absence of "ragged ends." The arrangement of the three disulfide bonds in recombinant sCD4 was also established by mass spectrometry and comparative high performance liquid chromatography mapping and shown to be identical to that expected from previous studies of intrachain disulfide bonding in T4 antigens derived from sheep and mouse. No other arrangements of disulfides were detected. Carbohydrate mapping by mass spectrometry was used to establish that both potential Asn-linked glycosylation sites in sCD4 (Asn271 and Asn300) have oligosaccharides attached. Structural characterization by mass spectrometry and methylation analysis of the heterogeneous family of oligosaccharides at each of the specific attachment sites indicates that the major components of both families of oligosaccharides have biantennary structures.

The human CD4 receptor is a 55-kilodalton glycoprotein found predominantly on a subset of mature, thymus-derived (T) lymphocytes and to a lesser extent on monocyte- and macrophage-related cells. T lymphocytes are involved in the recognition of antigens presented by class II major histocompatibility complex molecules, and substantial evidence indicates that the CD4 receptor directly interacts with class II major histocompatibility complex antigens, thereby mediating an efficient immune response (1-3). In man, CD4 also serves as the receptor for the human immunodeficiency virus (HIV) (4-7). Numerous studies have provided evidence for such direct interaction. Certain monoclonal antibodies directed against CD4, such as Leu3a and OKT4A, block HIV infection.
* This work was supported in part by Grant GM-39526-02 from the National Institutes of Health (to S. A. C.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
‡ To whom correspondence and reprint requests should be addressed.
Interaction of CD4+ cells with HIV-1 is mediated by gp120, the surface glycoprotein component of the viral envelope glycoprotein gp120/160, to which CD4 binds with high affinity (Kd ~10^-9 M) (8). The CD4 receptor glycoprotein can be coprecipitated with anti-gp120 antibodies from lysed cells infected with HIV-1, and, conversely, gp120 can be precipitated with anti-CD4 monoclonals (6). These and other data strongly indicate that interaction of CD4 with gp120 is a critical step in viral infection and destruction of CD4+ lymphocyte populations.
We and others (8-13) have hypothesized that a recombinant soluble form of the CD4 receptor (sCD4) will, by binding viral gp120, block virus binding, infection, and virus-mediated cell fusion. A variety of recombinant sCD4 molecules lacking the membrane-spanning and intracellular domains have been cloned and expressed and shown to inhibit the binding of HIV-1 to CD4+ cells, prevent formation of syncytia, and block HIV-1 infectivity in vitro (8-13). More recently, sCD4 has been shown to block diverse strains of HIV-1, HIV-2, and the simian virus SIV; however, infection of certain brain and muscle cell lines in vitro could not be blocked by either sCD4 or anti-CD4 antibodies, suggesting that the virus may infect these cells by a mechanism not involving direct interaction of gp120 with CD4 (14).
The amino acid sequence of human CD4 has been deduced by sequencing of the cDNA coding for the protein. The CD4 precursor consists of an NH2-terminal hydrophobic signal domain, an extracellular domain 370 amino acids in length that has limited sequence homology to the immunoglobulin variable and joining regions, a hydrophobic transmembrane domain, and a charged intracellular domain consisting of 38 amino acids (15,16). There are 6 cysteine residues in the extracellular domain of CD4 that, by analogy to the reported arrangement of intrachain disulfide bonds in mouse and sheep CD4 (17), are expected to form three disulfide bonds between successive pairs of cysteines. One or more of these disulfide bonds is critical for the binding of gp120 to CD4, presumably due to stabilization of the tertiary structure of the binding region located in the NH2-terminal region of the extracellular portion of the protein (18-22). The protein also contains 2 asparagine residues in the consensus sequence (Asn-X-Ser/Thr, where X = any amino acid except Pro) required for attachment of carbohydrate (23).
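The consensus rule quoted above (Asn-X-Ser/Thr, with X any residue except Pro) can be applied mechanically to any deduced sequence. The following is a minimal sketch of such a scan; the function name `find_sequons` and the toy sequence are illustrative assumptions and are not taken from the sCD4 sequence or from the methods of this work.

```python
import re

# Asn-X-Ser/Thr, where X is any residue except Pro.
# The lookahead keeps matches zero-width after the Asn, so that
# overlapping sequons (e.g. N-N-S-S) are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return the 1-based positions of Asn residues that sit in a sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical toy sequence used only to exercise the rule.
toy_sequence = "MKNLTAANPSGGNGTNNSS"
print(find_sequons(toy_sequence))
```

A scan of this kind only identifies potential sites; whether carbohydrate is actually attached has to be established experimentally.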
To date, only limited structural analyses of human CD4-related proteins have been reported. Here we provide the first detailed structural characterization of a recombinant sCD4 in which mass spectrometry and tandem high performance mass spectrometry have been used to corroborate the primary structure, including disulfide bond arrangement, determine the location and extent of Asn-linked glycosylation, and characterize the major structural classes of carbohydrate at the two specific attachment sites. Detailed sequence and stereochemical analysis by exoglycosidase digestion of the major and minor glycoforms present at the two attachment sites will be presented elsewhere.

RESULTS AND DISCUSSION
Soluble CD4 was expressed by dihydrofolate reductase coamplification in Chinese hamster ovary (CHO) cells as described previously (9). The protein was purified to apparent homogeneity (Fig. 1; see "Experimental Methods" in the Miniprint) with an overall recovery of 5-10 mg of sCD4/liter of serum-free conditioned medium. Measured amino acid composition and NH2-terminal sequence were consistent with the predicted sequence of the molecule (29). Carbohydrate composition analysis employing the method of Chaplin (44) indicated the presence of mannose, galactose, N-acetylglucosamine, fucose, and N-acetylneuraminic acid. No N-acetylgalactosamine was detected, suggesting that our preparation of recombinant sCD4 contains only N-linked (versus O-linked) oligosaccharides.
C-T. Yuen, S. A. Carr, and T. Feizi, manuscript in preparation.
Portions of this paper (including "Experimental Procedures," Figs. 1-3 and 6-10, and Footnote 7) are presented in miniprint at the end of this paper. Miniprint is easily read with the aid of a standard magnifying glass. Full size photocopies are included in the microfilm edition of the Journal that is available from Waverly Press.
Peptide Mapping by FABMS - Peptide molecular weight determination by fast atom bombardment mass spectrometry (FABMS, also referred to as liquid secondary ion mass spectrometry), coupled with sequence analysis of specific peptides by tandem MS, is an ideal complement to Edman degradation for structural characterization of proteins (24-28). The strategy as it was applied to recombinant sCD4 is illustrated in Fig. 2, A and B. In the FABMS peptide mapping procedure, the molecular weights of peptides in digests are determined by FABMS and fitted (based on established rules for cleavage by the specific enzyme or chemistry employed) to the known or deduced sequence of the protein. This fitting is accomplished with the aid of computer programs which, given the reaction conditions employed and the predicted sequence of the protein as input, produce lists of molecular weights and sequence locations of the expected peptides. The protein is usually reduced and alkylated prior to proteolysis (unless the intent is to assign disulfide bonds, see below) in order to increase its susceptibility to cleavage. The proteases employed in the present study were trypsin and Staphylococcus aureus V8, individually or in combination. These enzymes were also used on samples that had not previously been reduced and alkylated in order to define the disulfide bond arrangements (see Fig. 2B and below) and on samples of sCD4 deglycosylated with peptide N-glycosidase F to define carbohydrate attachment sites (see below). Following analysis of the complex proteolytic digests by FABMS (for example, see Fig. 9), the mixtures were fractionated by reversed phase HPLC (Figs. 3 and 8) and the resulting fractions reanalyzed by FABMS. Signals for peptides that are not detected by direct analysis of the entire digest are often observed following HPLC fractionation, resulting in greater coverage of the amino acid sequence. In addition, signals for peptides observed previously in the complex mixture are often much stronger in the simplified mixtures. The FABMS methods presented here are, in general, sufficiently sensitive and reliable to detect alterations present at >5% in the regions of the protein sequence mapped (i.e. the 95% mapped in the present work), provided that such changes result in a mass shift from that predicted based on the DNA or cDNA sequence (24,25,28).
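The fitting step described above, in which a program predicts the expected peptides and their masses from the sequence and the cleavage rules, can be sketched as follows. This is a hedged illustration and not the software used in this work: the function names, the 0.3-Da tolerance, the toy sequence, and the use of monoisotopic masses with unmodified Cys are all assumptions. The cleavage rule coded here is the common one for trypsin (after Lys or Arg, but not before Pro).

```python
# Monoisotopic residue masses (Da). Cys is listed unmodified; for a reduced
# and alkylated sample the mass of the alkyl group would have to be added.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # terminal H and OH, added once per peptide
PROTON = 1.00728   # charge carrier of the (M + H)+ ion


def tryptic_peptides(sequence):
    """Cleave after Lys/Arg unless the next residue is Pro.

    Returns (start, end, peptide) tuples with 1-based inclusive positions.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        at_end = i == len(sequence) - 1
        cleave = residue in "KR" and not at_end and sequence[i + 1] != "P"
        if cleave or at_end:
            peptides.append((start + 1, i + 1, sequence[start:i + 1]))
            start = i + 1
    return peptides


def mh_plus(peptide):
    """Monoisotopic (M + H)+ of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


def assign_signals(sequence, observed_mz, tolerance=0.3):
    """Map each observed m/z to the predicted peptides within the tolerance."""
    predicted = [(s, e, p, mh_plus(p)) for s, e, p in tryptic_peptides(sequence)]
    return {
        mz: [(s, e, p) for s, e, p, mh in predicted if abs(mh - mz) <= tolerance]
        for mz in observed_mz
    }


toy_sequence = "AKRPGKLMR"  # hypothetical sequence, not part of sCD4
for start, end, peptide in tryptic_peptides(toy_sequence):
    print(f"{start}-{end}\t{peptide}\t{mh_plus(peptide):.2f}")
```

Signals that cannot be assigned to any predicted peptide are the ones of interest in practice, since they may indicate incomplete cleavage, a modification, or a sequence alteration.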
FABMS peptide and carbohydrate mapping data for recombinant sCD4 are summarized in Fig. 4. The amino acid sequence defined by these studies corresponds to residues +3 to +371 of the sequence by Maddon et al. (15) but with a Lys at +3, consistent with the NH2-terminal sequence of human CD4 (11,16,29). In addition, we have renumbered the sequence of our recombinant sCD4 beginning with Lys1-Lys2-Val3... to be consistent with the NH2-terminal sequence of the mature, expressed protein identified in these studies by MS and by Edman sequence analysis in other studies (11,29). Altogether, approximately 95% of the primary structure was confirmed in the present studies. Approximately 79% of the sequence was corroborated by FABMS of the 6-h tryptic digest of reduced and carboxymethylated (RCM) sCD4 and HPLC fractions derived therefrom (Fig. 3). Tryptic peptides derived from regions of sCD4 containing either Asn271 or Asn300, the two potential Asn-linked glycosylation sites, were not detected until peptide N-glycosidase F was used to release carbohydrate from the glycoprotein (see below). All significant signals observed in these FABMS data could be assigned to the deduced sequence of the glycoprotein (solid underlines, Fig. 4). Additional coverage of the glycoprotein was obtained by FABMS of S. aureus V8-digested RCM-sCD4 and cyanogen bromide-cleaved RCM-sCD4 (Fig. 4; only peptides yielding additional coverage or confirming the COOH terminus are shown). The former digest yielded the NH2-terminal tridecapeptide of MHA = 1401.7 (subscript A = chemical average mass, see "Experimental Methods"). Extended or NH2-terminally modified forms of this peptide were not detected by MS or NH2-terminal sequence analysis (I. Y. Huang, unpublished observation), indicating that the expected NH2-terminal sequence is the only one present in this preparation of the glycoprotein. Four peptide signals at m/z 999.6, 1442.7A, 3073.5A, and 4241.9A confirm that Val369 is the COOH terminus of our recombinant sCD4; COOH-terminally extended or processed forms of the glycoprotein are not observed. The 5% of the sCD4 sequence not mapped consists of small, hydrophilic tryptic peptides (two dipeptides, one tripeptide, two tetrapeptides, and a single pentapeptide) and an amino acid. In general, small hydrophilic peptides and amino acids are difficult to detect by FABMS (25). With the exception of the amino acid, each of these peptides has been identified by composition and Edman sequence analysis of early eluting HPLC fractions of the tryptic digest (data not shown). The sequence of the COOH-terminal nonapeptide was established by tandem high performance mass spectrometry with a VG ZAB-SE 4F four-sector double focusing mass spectrometer. The signal at m/z 999.6 was mass selected from either the complex digest or an HPLC fraction containing this peptide using the first double focusing mass spectrometer. The mass selected parent was fragmented by high energy (10 keV, laboratory frame of reference) collisions with helium in the collision region between the two mass spectrometers. Daughter ion spectra are obtained at the final collector at the end of the second mass spectrometer by a computer-generated B/E linked scan of MS-2 such that the ratio of magnetic (B) field to electric (E) field is maintained constant. The sequence Val-xLeu-Pro-Thr-Trp-Ser-Thr-Pro-Val (where xLeu = Ile or Leu) could be defined by interpretation of the resulting daughter ion mass spectrum (Fig. 5). The interested reader is referred to Ref. 24 for a detailed discussion of the fragmentation processes observed for peptides by tandem high performance mass spectrometry.
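As a worked check on the COOH-terminal nonapeptide, the monoisotopic (M + H)+ of Val-xLeu-Pro-Thr-Trp-Ser-Thr-Pro-Val can be computed from standard residue masses and compared with the selected parent at m/z 999.6. The sketch below also lists a simple b/y fragment ladder. Two assumptions should be kept in mind: xLeu is entered as Leu (Ile and Leu are isobaric, so the parent mass is unaffected), and only b- and y-type ions are generated, whereas high energy collisions produce additional ion series that are not modeled here. The helper names are illustrative.

```python
# Monoisotopic residue masses (Da) for the residues in the nonapeptide.
RESIDUE_MASS = {
    "V": 99.06841, "L": 113.08406, "P": 97.05276,
    "T": 101.04768, "W": 186.07931, "S": 87.03203,
}
WATER = 18.01056
PROTON = 1.00728

NONAPEPTIDE = "VLPTWSTPV"  # xLeu written as L; I and L have the same mass


def mh_plus(peptide):
    """Monoisotopic (M + H)+ of the intact peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


def b_ions(peptide):
    """b1..b(n-1): NH2-terminal fragments, residues plus one proton."""
    return [sum(RESIDUE_MASS[r] for r in peptide[:i]) + PROTON
            for i in range(1, len(peptide))]


def y_ions(peptide):
    """y1..y(n-1): COOH-terminal fragments, residues plus water and a proton."""
    return [sum(RESIDUE_MASS[r] for r in peptide[-i:]) + WATER + PROTON
            for i in range(1, len(peptide))]


print(f"(M + H)+ = {mh_plus(NONAPEPTIDE):.2f}")
for i, (b, y) in enumerate(zip(b_ions(NONAPEPTIDE), y_ions(NONAPEPTIDE)), start=1):
    print(f"b{i} = {b:8.2f}    y{i} = {y:8.2f}")
```

The computed parent mass rounds to 999.6, in agreement with the observed signal.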
Several minor signals caused by our handling of the protein were also detected in these studies. Specifically, ≤15% of Met292, Met314, and Met342 (but not Met249) in sCD4 were oxidized to the corresponding sulfoxides (based on relative peak heights of the signals corresponding to oxidized versus nonoxidized peptide) during the carboxymethylation procedure. These side products gave rise to low abundance satellite signals 16 Da above the Met-containing tryptic peptides Leu283-Lys299, Leu313-Lys318, Glu330-Lys360, and Ala332-Lys360 (Fig. 4). Absence of these satellite signals in the FABMS data obtained on digests of native (not reduced and carboxymethylated) sCD4 (see below) demonstrated that these oxidation products are not present in the protein as purified.
Identification of Glycosylation Sites and Structural Classification of Carbohydrates at Specific Attachment Sites - The sites of attachment of Asn-linked oligosaccharides in sCD4 glycoprotein were determined by FABMS carbohydrate mapping (30). In this technique, peptides containing Asn-linked carbohydrate are detected by comparing the FAB mass spectra obtained before and after treatment of the glycoprotein with peptide:N-glycosidase F (PNGase F) (31, 32), which cleaves the β-aspartylglycosylamine linkage of all known types of Asn-linked sugars and converts the attachment site Asn to Asp, which weighs 1 dalton more. New peaks appear in the mass spectra after treatment with the glycosidase that correspond to formerly glycosylated peptides. The technique will thus detect and locate Asn residues to which carbohydrate is attached independent of whether the Asn is present in the consensus sequence (Asn-X-Ser/Thr) or not. Sequence coverage of the protein is also increased (30).
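The two ingredients of the carbohydrate mapping comparison, the 1-Da Asn-to-Asp shift and the search for signals that appear only after PNGase F treatment, can be illustrated with a short sketch. The function names, the 0.3-Da tolerance, and the peak lists are assumptions made for illustration; the peak lists in particular are invented and are not data from this work.

```python
# Monoisotopic residue masses (Da).
ASN_RESIDUE = 114.04293
ASP_RESIDUE = 115.02694
ASN_TO_ASP_SHIFT = ASP_RESIDUE - ASN_RESIDUE  # about +0.984 Da per site


def mh_after_pngase(mh_with_asn, sites=1):
    """(M + H)+ expected once `sites` attachment-site Asn are converted to Asp."""
    return mh_with_asn + sites * ASN_TO_ASP_SHIFT


def new_signals(before, after, tolerance=0.3):
    """Signals present after glycosidase treatment but absent before it."""
    return [mz for mz in after
            if all(abs(mz - ref) > tolerance for ref in before)]


# Invented peak lists for illustration only.
spectrum_before = [812.4, 1203.6]
spectrum_after = [812.4, 1203.6, 1490.7]
print(new_signals(spectrum_before, spectrum_after))
print(f"Asn -> Asp shift: {ASN_TO_ASP_SHIFT:+.3f} Da")
```

A new signal is assigned to a formerly glycosylated peptide when its mass matches the value predicted for the sequence with Asp in place of the attachment-site Asn.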
Four new signals were detected in the mixtures of tryptic peptides after PNGase F digestion. Two of these new peptide signals, at m/z 2832.3A and 2960.5A, correspond to tryptic peptides Leu253-Lys279 and Lys252-Lys279, respectively, in which Asn271 had been converted to Asp by the action of the glycosidase upon release of the oligosaccharide (Fig. 4). Similarly, the other two new signals at m/z 1490.7 (1491.7A) and 2161.4A are tryptic peptides in which Asn300 has been converted to Asp during release of the attached oligosaccharide. No significant signals derived from these regions of the glycoprotein were detected prior to PNGase F digestion, suggesting that recombinant sCD4 is more than 90% glycosylated at each of these two sites (30). Carbohydrate attached to the tryptic glycopeptide Asn300-Lys312 was inefficiently released by PNGase F, as expected based on the location of the oligosaccharide on the NH2-terminal residue of the peptide (33). Complete removal of carbohydrate from this site was effected by PNGase F digestion of RCM-sCD4 prior to tryptic digestion (see Fig. 2A). Deglycosylated native sCD4 had an apparent mass 4000 Da smaller than native sCD4 on SDS-PAGE. The protein becomes somewhat less soluble at pH 8 after removal of the carbohydrate. The precipitate formed is solubilized after 2 h of tryptic digestion.
Composition in terms of hexose, deoxyhexose (dHex), N-acetylhexosamine, and N-acetylneuraminic acid (NeuAc) and molecular heterogeneity for oligosaccharide chains from the two specific attachment sites in sCD4 glycoproteins were obtained by FABMS carbohydrate "fingerprinting" (34). Potential glycopeptides were identified by comparing the reversed phase HPLC profile of the tryptic digest of RCM-sCD4 with that of a sample of RCM-sCD4 that had been sequentially digested with PNGase F, then trypsin (Fig. 3). Peaks in the tryptic digest that disappeared or were greatly attenuated in the chromatogram of the sample digested with the glycosidase are likely to be glycopeptides with carbohydrate linked to Asn.
Four putative glycopeptide-containing fractions were preparatively isolated from the tryptic digest and analyzed by FABMS. Signals for intact glycopeptides were observed in each of these fractions by FABMS (for example, see Fig. 6 and Table I). The data indicate that both attachment sites in recombinant sCD4 (Asn271 and Asn300) have nearly identical families of oligosaccharides attached with the following general composition: (NeuAc)m(HexNAc)4(Hex)5(dHex)n, where m = 0-2 and n = 0, 1. The relative intensities of the parent signals in the FABMS data suggest that the most abundant oligosaccharides have m = 1 and n = 1 at each site, but that the Asn300 site has proportionately more deoxyhexose (Table I).
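The general composition given above defines a family of six oligosaccharides (m = 0-2, n = 0, 1). The sketch below enumerates that family and computes a mass for each member. It rests on stated assumptions: the masses are monoisotopic values for the free, underivatized reducing oligosaccharide (residue masses plus one water), not for the glycopeptides or the permethylated derivatives actually measured, so the numbers are not expected to reproduce the entries of Table I. The names used are illustrative.

```python
from itertools import product

# Monoisotopic residue (anhydro) masses of the monosaccharide classes, in Da.
SUGAR_RESIDUE = {
    "Hex": 162.05282,
    "HexNAc": 203.07937,
    "dHex": 146.05791,
    "NeuAc": 291.09542,
}
WATER = 18.01056


def oligosaccharide_mass(neuac, dhex, hexnac=4, hexose=5):
    """Mass of free (NeuAc)m(HexNAc)4(Hex)5(dHex)n with m=neuac, n=dhex."""
    return (neuac * SUGAR_RESIDUE["NeuAc"]
            + hexnac * SUGAR_RESIDUE["HexNAc"]
            + hexose * SUGAR_RESIDUE["Hex"]
            + dhex * SUGAR_RESIDUE["dHex"]
            + WATER)


# m = 0-2 sialic acids, n = 0 or 1 deoxyhexose.
family = {(m, n): oligosaccharide_mass(m, n)
          for m, n in product(range(3), range(2))}

for (m, n), mass in sorted(family.items()):
    print(f"NeuAc{m} HexNAc4 Hex5 dHex{n}: {mass:.2f} Da")
```

Successive members of the family differ by one NeuAc (291.10 Da) or one dHex (146.06 Da), which is the spacing used to recognize such heterogeneity in a mass spectrum.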
The compositions of the oligosaccharides at each of the specific attachment sites were further verified by permethylation of the PNGase F-treated HPLC-derived glycopeptide fractions and FABMS of the extracted derivatized carbohydrates (for example, see Fig. 7). The molecular species observed are listed in Table I. Methylation analysis of the permethylated oligosaccharides from each attachment site gave linkage types indicative of biantennary complex carbohydrates (Table II). Together with the compositions provided by the FABMS analyses of the underivatized glycopeptides and permethylated oligosaccharides from each attachment site, these data lead us to conclude that the major carbohydrate structures at both Asn271 and Asn300 are biantennary complex sugars heterogeneous in NeuAc and Fuc content.
It has been reported, based in part on glycosidase specificity, that the major oligosaccharides in recombinant sCD4 expressed in CHO cells have triantennary structures (42). A disialylated, monofucosylated biantennary carbohydrate has recently been shown to be the major (95%) carbohydrate constituent of human interferon-β secreted by a genetically engineered CHO cell line (35).
Fractionation of PNGase F-released oligosaccharides using high performance anion-exchange chromatography (see "Experimental Methods") also indicated that each site has minor amounts of carbohydrates bearing three sialic acid groups, consistent with the presence of trace amounts of triantennary structures (region S3, Fig. 10). The molecular weights and fragmentation observed in the FAB mass spectra of the major components in the neutral, mono-, and disialylated fractions (Fig. 10) are consistent with the structures outlined above. Furthermore, the FABMS analyses demonstrated that biantennary oligosaccharides containing fucose attached to the reducing-end GlcNAc elute earlier than their nonfucosylated analogs. This observation is consistent with the reported order of elution using high performance anion-exchange chromatography of fucosylated versus nonfucosylated oligosaccharides in which the fucose is linked α-1,3 to GlcNAc in the antennae (41).
Location of Disulfide Bonds - The FABMS peptide mapping approach has also been used to identify the locations of disulfide bonds in sCD4. The method (Fig. 2B) involves cleavage of the protein under conditions known to minimize disulfide reduction and reshuffling (37,38). In the absence of free thiols, procedures involving acidic or mildly basic conditions such as trypsin digestion can be used with minimal concern for disulfide rearrangement. The goal is to obtain disulfide-linked peptides each containing only a single -S-S- bridge. The FAB mass spectra of these mixtures exhibit (M + H)+ for intact, inter-, and intramolecularly disulfide-bonded peptides. Reduced forms of the peptides are also often observed in these spectra, even in the absence of added base, because -S-S- bridges of disulfide-linked peptides are prone to reduction in the liquid matrix under FABMS conditions. If signals for the constituent thiol peptides are not detected, the sample is reduced in the dithiothreitol/dithioerythritol or thioglycerol matrix by addition of triethylamine or ammonium hydroxide. Comparison of the spectra prior to and after reduction often enables assignment of the constituent cysteinyl peptides and how they are linked together.
S. A. Carr and J. R. Barr, unpublished observations.
In the present study 2 mg of recombinant native sCD4 was deglycosylated with PNGase F, dialyzed, and then sequentially digested with trypsin and S. aureus V8. The sample was analyzed by FABMS after each proteolytic step prior to and after incubation with dithiothreitol at basic pH (Fig. 9). To check for the presence of peptides in other disulfide-bonded arrangements, the trypsin/V8 digest was analyzed by reversed phase HPLC and compared with the HPLC of the same digest after incubating for two hours with dithiothreitol at basic pH (Fig. 8). Regions of the HPLC map of the nonreduced protein digest that disappeared after reduction were preparatively fractionated by HPLC and analyzed by FABMS. The individual fractions were also reduced on the probe to determine the masses of the constituent peptides. The data (Table III) indicate the following arrangements of disulfides: Cys16-Cys84, Cys130-Cys159, and Cys303-Cys345 (Fig. 4). This arrangement is identical to that expected from previous studies of intrachain disulfide bonding in T4 antigens derived from sheep and mouse (17). No signals corresponding to disulfide-bonded peptides were detected in the mass range 800-4500 Da for fractions 52-54 (Fig. 8). These fractions most likely consist of large, incompletely digested disulfide-bonded peptides with molecular weights beyond the mass range analyzed.
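The comparison of spectra before and after reduction relies on a simple mass relationship: two cysteinyl peptides joined by one disulfide bridge weigh two hydrogen atoms less than the sum of the reduced peptides. The sketch below applies that relationship to find which pairs of reduced peptides could account for a disulfide-linked signal. The function names, the 0.5-Da tolerance, and the numerical values are illustrative assumptions and are not taken from Table III.

```python
from itertools import combinations

HYDROGEN = 1.00783  # mass of one H atom lost from each thiol on oxidation
PROTON = 1.00728    # charge carrier of the (M + H)+ ion


def linked_mh(neutral_a, neutral_b):
    """(M + H)+ of two reduced peptides (neutral masses) joined by one -S-S- bridge."""
    return neutral_a + neutral_b - 2 * HYDROGEN + PROTON


def candidate_pairs(linked_mz, reduced_neutral_masses, tolerance=0.5):
    """Pairs of reduced peptides whose disulfide-linked (M + H)+ matches linked_mz."""
    return [(a, b) for a, b in combinations(reduced_neutral_masses, 2)
            if abs(linked_mh(a, b) - linked_mz) <= tolerance]


# Invented values for illustration only.
reduced_peptides = [1000.0, 1500.0, 2000.0]
print(candidate_pairs(2999.0, reduced_peptides))
```

For a peptide containing an intrachain bridge the same reasoning gives a shift of two hydrogens (about 2 Da) between the oxidized and reduced forms of a single peptide.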
The work described here illustrates the unique strengths of mass spectrometry and tandem mass spectrometry for sequence analysis of recombinant proteins and for characterization of posttranslational modifications such as disulfide bonds and carbohydrates. The methodology employed is particularly useful for rapid characterization of the class (hybrid, oligomannose, complex) and branching type (biantennary, triantennary, etc.) of the carbohydrates at specific sites in glycoproteins and can be used to rapidly (on the order of a few weeks) compare and define the structural class of carbohydrates on the same recombinant glycoprotein expressed in different cell lines or under different cell growth and harvesting conditions. Although cloning and sequencing of the gene coding for virtually any desired protein can now be accomplished with tremendous speed and efficiency, structural characterization of the recombinant protein product is often a relatively slow step. Amino-terminal and carboxyl-terminal sequencing of the protein product are required to establish identity, but these are only starting points, and internal sequences must be verified to be sure of the fidelity of the sequence. Automated Edman sequence analysis is not sufficient, since many native or induced protein modifications cannot be identified by this technique. Clearly, fast, sensitive, and reliable procedures are necessary to bridge the analytical gap between protein chemistry and molecular genetics. Mass spectrometry and tandem mass spectrometry when used in conjunction with conventional chemical and biochemical approaches can help to fill this gap, particularly when each approach is used so as to take advantage of its unique strengths.
Acknowledgements-We wish to thank V. Dodia, R. hacker, A.