Human complement factor I: analysis of cDNA-derived primary structure and assignment of its gene to chromosome 4.

Factor I is a serine proteinase of complement which together with one of several specific cofactors cleaves activation products of the third and fourth components of complement (C3b and C4b) and modulates the activity of C3 convertase. A heterodimer glycoprotein (Mr = 88,000), factor I is synthesized as a single-chain precursor, prepro-I, which undergoes intracellular proteolytic processing. The human hepatoma line HepG2, however, secretes predominantly the single-chain precursor pro-I. In order to determine the molecular basis for this apparent processing defect, factor I cDNA clones were isolated from a HepG2 mRNA-derived library. Sequencing of the largest insert, HI1971, revealed that it contains 14 base pairs of 5' untranslated region, the complete coding sequence for the 583-residue prepro-I (NH2-signal peptide-heavy chain-linking peptide-light chain-COOH), two polyadenylation signals within the 200-base pair 3' untranslated region, and a portion of poly(A) tail. Analysis of the derived protein structure 1) reveals a mosaic multidomain structure of the heavy chain; 2) demonstrates structural similarity between intracellular conversion of pro-I and activation of other serine proteinase zymogens; and 3) indicates that the light chain of factor I resembles most closely the active subunit of tissue plasminogen activator among all serine proteinases and factor D among complement proteinases. Furthermore, this protein sequence was compared to the sequences of factor I cDNA clones isolated from normal human liver libraries and found to be identical. By exclusion, this defines as cellular the basis for the inefficient processing of pro-I by the HepG2 line. Chromosomal localization by the somatic cell hybrid method maps the factor I gene to chromosome 4.

Factor I is a heterodimeric glycoprotein (Mr = 88,000) with a carbohydrate content of 11-27% (5, 8). It is present in plasma as an active enzyme at a concentration of ~3.5 mg/dl (5). A partial NH2-terminal sequence of the heavy chain and the near complete sequence of the light chain have been determined (9, 10). The light chain sequence shows blocks of homology with the active subunits of other serine proteinases. However, unlike other serine proteinases of coagulation, fibrinolysis, and complement (with the exception of complement factor D), it is present in the circulation as an active enzyme (11). Recently, a polymorphism of this protein in the Japanese population has been described (12, 13). Biosynthesis of human factor I has been reported in liver and primary monocyte cultures (8, 14, 15).
A detailed analysis of its biosynthesis under cell-free conditions and in three human hepatoma lines indicates that it is synthesized as a single-chain precursor and undergoes proteolytic processing and N-linked glycosylation (8). Each of three hepatoma lines, HepG2, Hep3B, and PLC/PRF/5, secretes both precursor and mature forms, but unlike many other proproteins, pro-I has not been identified in plasma (16, 17). Moreover, variation in the proportion of factor I secreted as pro-I was observed among various cell lines, and in one of the lines, HepG2, pro-I is the predominant secreted form. Analysis of the processing of other single-chain precursors by these hepatoma lines showed that the ratios of secreted, completely processed, partially processed, and/or proprotein forms did not vary among them, indicating that this phenomenon is protein-specific. Xenopus oocytes injected with HepG2 mRNA secrete, however, primarily the processed two-chain protein. Thus, the basis for the inefficient processing of pro-I by the HepG2 line has not been determined and may be intrinsic to the produced factor I or the proteolytic processing mechanism itself.
In order to test whether a structural defect is the cause of this phenomenon and to permit a further analysis of factor I structure, we isolated factor I cDNA clones from HepG2 and normal human liver libraries. These clones were used to determine the nucleotide sequence of the factor I mRNA, derive the complete primary structure of its translation product, and determine the chromosomal localization of the factor I gene.

MATERIALS AND METHODS
Oligonucleotide Preparation-The oligonucleotide mixtures for the library screening and sequence-specific primers were synthesized by automated phosphoramidite chemistry on a DNA synthesizer model 380B (Applied Biosystems), deblocked by incubation with 1 volume of 10 N ammonium hydroxide for 6 h at 55 °C, and then lyophilized. Some of the oligonucleotides were further purified by acrylamide gel electrophoresis, electroelution, desalting, and lyophilization. For the Northern blot hybridization and cDNA library screening the probe was labeled with [32P]dATP (Du Pont-New England Nuclear) by the T4 polynucleotide kinase method (18).
Identification of the Factor I mRNA-Cytoplasmic RNA from the human hepatoma cell line HepG2 was isolated by phenol/chloroform extraction and analyzed by formaldehyde agarose gel electrophoresis and Northern blot hybridization using the 32P-labeled oligonucleotide mixture as a probe (19).
Isolation of Factor I cDNA Clones-A λgt10 cDNA library prepared from the human hepatoma line HepG2 mRNA (20) was screened to isolate factor I clones (21). The screening was done with a 32P-labeled oligonucleotide mixture under stringency calculated for the sequence with a minimal G/C content (19). Positive clones were plaque-purified, and the phage DNA were isolated by the plate-lysis method (19). The HI1971 insert was isolated by low-melt agarose gel purification and subcloned into the pUC-18 vector (22).
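The stringency choice described above, calculated for the member of the mixture with the lowest G/C content, can be sketched as follows. This is a minimal illustration that assumes the Wallace rule (2 °C per A/T position, 4 °C per G/C position) as the estimate; the text only cites its hybridization reference (19) and does not state which formula was used. The probe string, the function names, and the IUPAC lookup table are hypothetical and are not the 23-mer used in this work.

```python
from itertools import product

# IUPAC degenerate base codes mapped to the concrete bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(degenerate):
    """Return every concrete sequence encoded by a degenerate oligonucleotide."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in degenerate))]


def gc_count(seq):
    """Count G and C positions in a concrete sequence."""
    return sum(base in "GC" for base in seq)


def wallace_td(seq):
    """Wallace-rule estimate in degC (an assumption, not the paper's stated method)."""
    gc = gc_count(seq)
    return 2 * (len(seq) - gc) + 4 * gc


def min_gc_member(degenerate):
    """Pick the mixture member with the fewest G/C positions."""
    return min(expand(degenerate), key=gc_count)


# Hypothetical 11-mer mixture with two degenerate positions (R and Y).
probe = "ATGAARTGYGC"
weakest = min_gc_member(probe)
print(len(expand(probe)), weakest, wallace_td(weakest))
```

Setting the wash temperature from the weakest member keeps every sequence in the mixture able to hybridize, at the cost of tolerating more background from the G/C-rich members.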
Nucleotide Sequence Analysis-The HI1971 insert was subcloned into the M13mp18 phage vector (23), and each strand was sequenced by the dideoxy chain termination method using universal and sequence-specific 17- to 20-mer primers and [35S]dATP (24). The analysis of the nucleotide and derived amino acid sequences was done using BIONET.
Chromosomal Localization of Factor I Gene-DNA from human-mouse and human-hamster hybrid cells containing known complements of human chromosomes (26, 27) was analyzed by Southern blotting hybridization (19) for human factor I sequences using the pUC-18-derived insert labeled by the random hexanucleotide method (28).

RESULTS
Isolation of Factor I cDNA Clones and Determination of the Derived mRNA and Protein Structures-Based upon the amino acid sequence of the NH2-terminal region of the light chain (10), a degenerate 23-mer complementary oligonucleotide probe was synthesized (Fig. 1A). When the 32P-labeled mixture was hybridized to a Northern blot containing HepG2 RNA, a single band corresponding to an mRNA of ~2.3 kilobases was identified (Fig. 1B). (The identical result was obtained using the cDNA probe as well.) The same 32P-labeled oligonucleotide probe was used to screen 6 × 10^5 phage of the human hepatoma HepG2 cDNA library. Thirteen primary isolates were purified by serial plating and screening, and the phage DNA were isolated. Digestion of the phage DNA with EcoRI permitted assessment of the insert size in the positive clones. The largest insert, derived from the λc2HI1971 clone, was subcloned into M13mp18 and analyzed by the dideoxy method using universal and sequence-specific primers.
Within the 68-base pair 5' region there is an in-frame methionine codon beginning at base 13 which delineates a peptide with the following features characteristic of a signal peptide (29): (a) length of 18 residues; (b) a high proportion (67%) of hydrophobic residues with an uninterrupted stretch in the middle portion; and (c) a cysteine residue at its end. In addition, the pentanucleotide CCAAC preceding the methionine codon is similar to the translation initiation consensus sequence CC(A/G)CC (30). These features strongly suggest that this peptide represents the entire signal peptide and that the HI1971 insert, therefore, contains sequence coding for the entire primary translation product of factor I mRNA, prepro-I, consisting of 583 residues. The 200-base pair 3' untranslated region contains two polyadenylation signals (AATAAA) starting at positions -140 and -22 from the polyadenylation tail. This implies a possible heterogeneity in the length of factor I mRNA as has been observed for complement C1r and other plasma protein mRNAs (31). Comparison of the size of this insert and the estimated size of factor I mRNA (~2.3 kilobases) suggests that the latter contains an ~300-base 5' untranslated region.
The organization of the factor I subunits within the single-chain precursor, NH2-heavy chain-light chain-COOH (Fig. 1C), is similar to that of other proteinase zymogens of the coagulation, fibrinolytic, and complement systems (1, 32, 33). However, in contrast to the latter, which contain a single basic residue at the site of cleavage/activation, pro-I contains a sequence of four basic amino acids (Arg-Arg-Lys-Arg) between its subunits. This is characteristic of linking peptides in several single-chain complement (pro-C3, pro-C4, and pro-C5) and retroviral (env) protein precursors but not mature proteins (34, 35). While the COOH-terminal amino acid analysis of some of these proteins shows a complete removal of the linking peptide (36), in at least one, human coagulation profactor X, the first arginine residue of an identical tetrapeptide is retained as the carboxyl terminus of the NH2-terminal subunit in the mature protein (37). An unambiguous number of residues (and amino acid composition) could, therefore, be estimated only for the light (244) but not the heavy chain (≤321).
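The length bookkeeping behind these statements can be sketched with the figures quoted in the text. This is only an illustrative check: the function name and dictionary keys are hypothetical, and counting a 3-base stop codon separately from the 200-base pair 3' untranslated region is an assumption, as is treating the mRNA size as exactly 2,300 bases.

```python
def length_bookkeeping():
    """Recompute residue and base counts from the numbers quoted in the text."""
    prepro_residues = 583      # complete primary translation product
    signal_residues = 18       # signal peptide
    light_residues = 244       # light chain
    linker_residues = 4        # Arg-Arg-Lys-Arg linking peptide
    utr5_bp = 14               # 5' untranslated bases present in the insert
    utr3_bp = 200              # 3' untranslated region
    stop_codon_bp = 3          # ASSUMPTION: not counted within the 3' region
    mrna_bases = 2300          # Northern blot estimate, taken as exact here

    # Upper bound: the whole linking peptide stays on the heavy chain.
    heavy_max = prepro_residues - signal_residues - light_residues
    # Lower bound: the linking peptide is removed completely.
    heavy_min = heavy_max - linker_residues
    # 5' untranslated bases plus the signal peptide coding bases.
    five_prime_region_bp = utr5_bp + 3 * signal_residues
    # Insert length excluding the cloned portion of the poly(A) tail.
    insert_min_bp = utr5_bp + 3 * prepro_residues + stop_codon_bp + utr3_bp

    return {
        "heavy_max": heavy_max,
        "heavy_min": heavy_min,
        "five_prime_region_bp": five_prime_region_bp,
        "insert_min_bp": insert_min_bp,
        "unaccounted_bases": mrna_bases - insert_min_bp,
    }


print(length_bookkeeping())
```

Under these assumptions the heavy chain is bounded at 321 residues, the 5' region comes to 68 bases, and the bases not covered by the insert (which also include any poly(A) tail beyond the cloned portion) are of the same order as the ~300-base estimate.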
The ~23-kDa difference between the sizes of unglycosylated and the fully glycosylated pro-I has been determined to be due to N-linked glycosylation (8). Three potential N-linked glycosylation sites (Asn-X-Thr/Ser (38)) are present in each of the chains: heavy, residues 52, 85, and 159, and light, residues 125, 155, and 197. Though generally only a third of the potential sites is glycosylated and the aspartic acid residue following in the light chain indicates a less likely site of glycosylation, utilization of most of the acceptor sites would be required to account for the estimated carbohydrate content. Both chains are likely to be glycosylated since the sizes of each observed by sodium dodecyl sulfate/polyacrylamide gel electrophoresis (50 and 38 kDa) are larger than those predicted by the sequence data (36 and 27 kDa).
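A scan for the Asn-X-Thr/Ser acceptor motif mentioned above might look like the following sketch. The peptide is a made-up test string, not factor I sequence, and the function name is hypothetical. Excluding proline at the X position is a common convention that is assumed here; the text itself states only Asn-X-Thr/Ser.

```python
import re

# Lookahead so that overlapping motifs (e.g. N-N-T-S) are all reported.
# ASSUMPTION: proline is excluded at the X position.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein):
    """Return 1-based positions of Asn residues in Asn-X-Thr/Ser motifs."""
    return [m.start() + 1 for m in SEQUON.finditer(protein)]


# Hypothetical peptide in one-letter code, for illustration only.
toy = "MNGSANPTKNNTS"
print(find_sequons(toy))
```

A match marks only a potential site; as the paragraph notes, whether a given acceptor is actually used has to be inferred from other evidence such as the carbohydrate content.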

Comparison of the derived amino acid sequence with that reported for 30 NH2-terminal residues of the heavy chain (9) shows identity with the exception of Gln" which is glutamic acid in our sequence. This discrepancy might represent a post-translational modification of the heavy chain. The derived primary structure of the light chain is identical but for variance at three sites to the reported near complete sequence (10). Our sequence shows glutamic acid absent at position 86 but present at position 95. Also our sequence shows a valine residue at the carboxyl terminus while the protein data show the penultimate asparagine residue in that position.
Comparison of the Factor I Structure Derived from the Nucleotide Sequences of HepG2 and Adult Human Liver cDNA Clones-To determine whether the structure of pro-I produced by the HepG2 line is different from the normal, we isolated factor I cDNA clones from a normal human adult liver cDNA library (39). Comparison of the derived amino acid sequence of prepro-I synthesized by HepG2 with partial sequences derived from the clones isolated from this library and a complete sequence derived from clones isolated from another human liver library (56) demonstrates complete identity (data not shown). The identity of the factor I sequences derived from the HepG2 and human liver cDNA clones suggests by exclusion that the inefficient processing of pro-I by the HepG2 line is the result of a cellular processing defect and not a change in the protein. This is in contrast to the known cases of hyperproinsulinemia which are caused by mutations at the cleavage sites (40). The fact that no variation has been observed in the processing of other single-chain precursors by the three hepatoma cell lines (8) further suggests that the proteolytic processing of single-chain precursors might be protein-specific.
Fig. 3. ... (lanes 1 and 4), human-hamster, and human-mouse hybrids containing (lanes 2 and 5) and lacking human chromosome 4 (lanes 3 and 6) were digested with HindIII (lanes 1-3) and SacI restriction endonucleases (lanes 4-6). Combined patterns of human and rodent factor I sequences are observed for the DNA isolated from hybrids containing human chromosome 4 (lanes 2 and 5). Patterns identical to those of the DNA from hybrids lacking human chromosome 4 were observed also for the DNA isolated from rodent nonhybrid cells. On the left, positions of the HindIII digestion fragments of λ phage are shown. kbp, kilobase pairs.
Table I. ... techniques (26, 27). Hybrid DNAs were also analyzed with DNA probes for each of the human autosomes and the X chromosome.
The column designations refer to presence (+/) and absence (-/) of human factor I sequence or presence (/+) and absence (/-) of a human chromosome.
For calculation of discordant fractions, hybrids with a rearranged chromosome or in which the chromosome was present in less than 15% of the cells or the characteristic isozyme or DNA probe was weakly positive were excluded.
For the single clone derived from fusion with fibroblasts from the X/13 translocation carrier, this category represents the der13 chromosome.
For the 17 hybrid clones derived from fusions with leukocytes from the two different X/19 translocation carriers, this category represents the der19 translocation chromosomes.
The X category includes hybrids with an intact X and those with derX translocation chromosomes.
Fig. 5. ... heavy chain (H), II, V, and VI, with homology to a region 77-113 in C9 and the consensus sequence of seven tandem repeats in the LDL receptor binding region. A truncation of domain II and transposition of some cysteine residues in domains II and V can be seen. B, alignment of domain III in the heavy chain with consensus sequences found in all (*) or at least three (unmarked) of 11 other proteins (see the text). C, decapeptide with a 90% homology found in the heavy chain and in the G2 protein of Punta Toro Phlebovirus. D, E, F, pairs of identical tetrapeptides found, respectively, within the heavy chain, heavy and light chains (L), and the light chain. The second tetrapeptide is flanked by a conservative substitution.
Chromosomal Localization of the Factor I Gene-The factor I gene was localized by Southern blot hybridization analysis of hybrid cell DNA (19) using the 32P-labeled HI1971 insert as a probe (Fig. 3) followed by concordance analysis (Table I). Comparison of the segregation pattern of the human factor I sequence with human chromosomes shows the highest concordance fraction for chromosome 4 (0.95), followed by the Y chromosome (0.79) and chromosome 1 (0.75). Only a single hybrid was discordant for factor I and chromosome 4, likely due to rearrangement of chromosome 4. No other complement protein nor a serine proteinase has so far been mapped to this chromosome. This indicates that the factor I gene is not a member of a cluster of genes for other regulatory complement proteins (C4BP, CR1, and H) located on chromosome 1 (41).
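The concordance analysis used for the chromosome assignment can be illustrated with a short sketch. The hybrid panel below is invented for demonstration and does not reproduce Table I; the function name and the use of `None` to mark excluded hybrids (for example those with a rearranged or weakly positive chromosome) are assumptions about how such data might be encoded.

```python
def concordance(gene_present, chrom_present):
    """Fraction of scored hybrids in which the gene signal and the
    chromosome are both present or both absent (+/+ or -/-)."""
    if len(gene_present) != len(chrom_present):
        raise ValueError("both lists must describe the same hybrids")
    # None marks a hybrid excluded from scoring for this chromosome.
    scored = [(g, c) for g, c in zip(gene_present, chrom_present) if c is not None]
    if not scored:
        raise ValueError("no scorable hybrids")
    return sum(g == c for g, c in scored) / len(scored)


# Hypothetical panel of five hybrids (True = human sequence detected).
factor_i = [True, True, False, False, True]
chrom_a = [True, True, False, False, False]   # agrees in 4 of 5 hybrids
chrom_b = [False, True, True, None, True]     # one hybrid excluded

print(concordance(factor_i, chrom_a))
print(concordance(factor_i, chrom_b))
```

The chromosome whose presence tracks the gene signal most closely across the panel gives the highest fraction, and the discordant fraction is simply one minus this value.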

DISCUSSION
Analysis of a human complement factor I cDNA clone isolated from a HepG2 library has resulted in a determination of the structure of factor I mRNA and derivation of a complete amino acid sequence of its translation product prepro-I. It has permitted characterization of the signal peptide, determination of the organization of subunits in the single-chain precursor, and assignment of the factor I gene to chromosome 4. Consistent with biosynthetic studies, structural analysis of prepro-I indicates that factor I is synthesized as a single-chain precursor, which is processed by two proteolytic steps (prepro-I to pro-I to native I) and an N-linked glycosylation. Comparison of pro-I and chymotrypsinogen sequences suggests the following similarity (Fig. 4). During the activation of chymotrypsinogen, peptide 9-15 is excised (42). This results in the generation of an NH2-terminal octapeptide linked to the rest of the molecule via its Cys1-Cys122 disulfide bond, with Ile16 becoming the NH2 terminus of a peptide analogous to the active subunit of other serine proteinases. A similar structural change appears to result from the proteolytic processing of pro-I when, following cleavage, the linking peptide is removed and a cysteine residue on the heavy chain (which we presume to form the disulfide link with the light chain) marks a 9- to 10-residue COOH-terminal peptide. Further, as similarities in these regions among various serine proteinases are observed, we note here the triplet Cys-Gly-Val found in both factor I and at the NH2 terminus of chymotrypsinogen. By analogy, then, this triplet probably marks the beginning of the serine proteinase domain in pro-I. These observations suggest that the intracellular processing of pro-I to the native form is structurally similar to zymogen activation. Whether this is also true functionally remains to be determined. Factor I activity has been detected in HepG2 supernatants, but pro-I and native protein were not separated in those studies (15).
It will be interesting to compare the processing of factor I and single chain 24-kDa factor D, the only other complement proteinase also secreted as an active enzyme (43).
The noncatalytic subunits of serine proteinases are known to contain a variety of domains (32,33). The heavy chain of factor I contains 29 cysteine residues (11%) and hence is capable of forming 14 intrachain disulfide links which are likely to contribute to domain structure.
Accordingly, we analyzed (by BIONET and manually) the primary structure for sequence homology with other proteins and propose six domains in factor I heavy chain (Fig. 5, A and B).
Three of these domains, II (26-60), V (196-238), and VI (239-276) (Fig. 5A), are homologous to the ~40-residue domain found once in another complement protein C9 (44) and repeated seven times in tandem in the binding region of the LDL receptor (45, 46). As defined in these proteins, each domain contains overall 6 conserved cysteines among other consensus residues. Though amino acid sequence homology between complement proteins interacting with factor I and the ligands of the LDL receptor (apoE and apoB) has not been found (47, 48), the presence of these domains in factor I might imply a higher order structural similarity among them.
Domain III (71-136) contains 7 out of 10 conserved residues of an ~60-residue sequence found (primarily) in tandem repeats (from 1 up to 20 times or more) in seven complement (factors B and H, C2, C1r, C4BP, CR1, and CR2) and four noncomplement proteins (β2-glycoprotein I, interleukin-2 receptor, coagulation factor XIII, and haptoglobin), as shown in Fig. 5B (41, 49, 50). In addition, this domain contains 6 residues of a consensus sequence found in the repeats of at least three of these proteins (C4BP, factors B and H, β2I and/or coagulation factor XIII) (57). The presence of this domain among all of these proteins does not necessarily imply an unknown common function. It might be that the conserved residues constitute a structural framework for the unique residues defining the functional specificity.
The heavy chain also contains six basic di- and tripeptides (Fig. 6). The functional significance of these peptides is not certain, but they may contribute to domain delineation or represent sites of cleavage/inactivation or binding.
The disulfide linkages of the heavy chain may be predicted only for domain III (Fig. 6). We presume that disulfide linkages within II, V, and VI are confined within each domain. The remaining 6 cysteine residues we propose form linkages to generate the simplest secondary structure (Fig. 6). Four of these (by analogy with C1r domains I and III) would define distinct domains I and IV (31). The remaining pair then would link the outer regions of domain III and together with the other two disulfide bridges form a kringle-like module found in other serine proteinases (32). This proposition is supported by the fact that a haptoglobin region containing the consensus residues of domain III (49) is also similar to the kringles of other serine proteinases (52).
The abbreviation used is: LDL, low density lipoprotein.
The protein sequence was further analyzed for similarities with other proteins and internal homology using BIONET. A remarkable 9-out-of-10-residue match was found between a decapeptide starting at position 78 (domain III) and a region in the G2 protein of Punta Toro Phlebovirus (Fig. 5C) with a respective nucleotide match of 73% (53). The protein similarity suggests a possible function for that region in factor I and/or a pathophysiological role of that region in the virus.
Analysis for internal homology demonstrates three pairs of identical tetrapeptides (Fig. 5, D-F). The significance of these observations is not clear, but the respective locations of the first pair of tetrapeptides in the heavy chain at the beginning and the end of domain III (Figs. 5D and 6) and, additionally, the degree of nucleotide homology (11 out of 12) suggest a remnant of a duplicated domain III.
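A search for internal repeats of this kind can be sketched as a simple index of every tetrapeptide and the positions at which it occurs. The sequence below is a made-up string used only to exercise the function; the function name is hypothetical and this is not the BIONET procedure used in the analysis.

```python
from collections import defaultdict


def repeated_kmers(protein, k=4):
    """Map each k-residue peptide occurring more than once to its
    1-based start positions."""
    positions = defaultdict(list)
    for i in range(len(protein) - k + 1):
        positions[protein[i:i + k]].append(i + 1)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


# Hypothetical sequence in one-letter code, for illustration only.
toy = "ACDEFGHACDEKLM"
print(repeated_kmers(toy))
```

Short exact repeats can arise by chance in a protein of several hundred residues, which is why the text leans on the additional evidence of position within the domain and nucleotide-level identity.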
The complete sequence of the light chain of factor I is, as expected (10), homologous to the catalytic subunits of other serine proteinases with conservation of 26 of 29 invariant residues (42). It is closest in sequence to tissue plasminogen activator (41% identical residues), plasma kallikrein (37%), and coagulation factor XII (34%) among all serine proteinases and to factor D (28%) among the complement proteins (11, 25, 54, 55). The substitutions of Gly6' and Leu'" (chymotrypsinogen numbering) by Thr" and Metg6, respectively, might be significant for the higher order structure and hence enzymatic specificity of factor I, while the third substitution, V a P for is a conservative one. Though the light chain structurally and functionally resembles trypsin (9), the placement of its cysteine residues (and hence their disulfide linkage) and other structural features (Fig. 4) demonstrate unique chymotrypsin features as well. The disulfide linkages for the 11 cysteine residues of the light chains are, therefore, proposed by analogy with chymotrypsin and are identical to t-PA (Fig. 6).
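Percentages of identical residues such as those quoted above are computed from a pairwise alignment. The sketch below shows one way to do the counting, assuming the two sequences are already aligned to equal length with `-` as the gap character. The fragments are invented, the function name is hypothetical, and the choice of denominator (aligned positions without gaps) is an assumption; the text does not say which convention was used.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over aligned positions where
    neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(a, b) for a, b in zip(aligned_a, aligned_b)
                if a != gap and b != gap]
    if not compared:
        raise ValueError("no ungapped positions to compare")
    identical = sum(a == b for a, b in compared)
    return 100.0 * identical / len(compared)


# Hypothetical aligned fragments in one-letter code, for illustration only.
print(percent_identity("IVGGK-CSH", "IVNGEDCPH"))
```

Because the value depends on both the alignment and the denominator, percentages from different sources are comparable only when the same convention is used throughout.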
In summary, we propose that factor I contains up to six domains within its heavy chain. Three of its domains (II, V, and VI) are homologous to the C9-like domains of the LDL receptor but have not been found in other serine proteinases. This, together with other features (synthesis of an active enzyme, solitary presence of its gene on chromosome 4), indicates that factor I is a highly diverged and specialized member of the serine proteinase family. By analogy with the other multidomain proteins that evolved via an exon-shuffling mechanism (46), the domains of the factor I gene are expected to be encoded by distinct exons.