Diversity of CysD domains in gel‐forming mucins

CysD domains are disulfide‐rich modules embedded within long O‐glycosylated regions of mucin glycoproteins. CysD domains are thought to mediate intermolecular adhesion during the intracellular bioassembly of mucin polymers and perhaps also after secretion in extracellular mucus hydrogels. The human genome encodes 18 CysD domains distributed across three different mucins. To date, experimental structural information is available only for the first CysD domain (CysD1) of the intestinal mucin MUC2, which is one of the most divergent of the CysDs. To provide experimental data on a CysD that is representative of a larger branch of the fold family, we determined the crystal structure of the seventh CysD domain (CysD7) from MUC5AC, a mucin found in the respiratory tract and stomach. The MUC5AC CysD7 structure revealed a single calcium‐binding site, contrasting with the two sites in MUC2 CysD1. The MUC5AC CysD7 structure also contained an additional α‐helix absent from MUC2 CysD1, with potential functional implications for intermolecular interactions. Lastly, the experimental structure emphasized the flexibility of the loop analogous to the main adhesion loop of MUC2 CysD1, suggesting that both sequence divergence and physical plasticity in this region may contribute to the adaptation of mucin CysD domains.


Introduction
Gel-forming mucin glycoproteins are the primary macromolecular constituents of the mucus that coats exposed epithelial surfaces in the body, including the respiratory, digestive, and reproductive tracts [1,2]. Humans express five gel-forming mucins: MUC2, MUC5B, MUC5AC, MUC6, and MUC19. These mucins share the following properties: they are extremely large (> 5000 amino acids), contain conserved domains at each end responsible for disulfide-mediated polymerization, and have an extensive central region rich in proline and O-glycosylated threonine and serine amino acids (PTS region). Another feature common to the mucins MUC2, MUC5B, and MUC5AC, but not to MUC6 or MUC19, is multiple copies of an additional, cysteine-rich domain, called CysD, interspersed within the PTS region (Fig. 1A). Mucins differ, however, in the number of CysD domains they contain: the human intestinal mucin, MUC2, has two CysD domains, whereas the respiratory mucins, MUC5B and MUC5AC, have seven and nine, respectively.
CysD domains are about 100 amino acids in length and contain five disulfide bonds (Fig. 1B). The structure of the first CysD domain of MUC2 (CysD1), determined using X-ray crystallography (PDB 6TM6), revealed an elongated fold with two three-stranded β-sheets and two calcium-binding sites [3]. The termini of the domain are close to one another, linked by a disulfide bond. A conserved WXXW motif participates in cation-π interactions. In the context of a larger MUC2 structure, determined using cryo-electron microscopy (cryo-EM), it was shown that the CysD1 domain is critical for the assembly of MUC2 polymers; CysD1 interacts with other parts of MUC2 to stabilize a beaded filament proposed to represent an intracellular assembly intermediate in polymerization [3]. Deletion or mutation of CysD1 disrupts filamentation, which may lead to the formation of abnormal mucus and contribute to the development of disease [3]. Intermolecular interactions between CysD1 and the rest of the filament are mediated by exposed aromatic amino acid side chains on loops distal to the domain termini. CysD domains have also been reported to self-associate [4], and engineered molecules containing multiple CysD domains were shown to stiffen mucus hydrogels [5,6]. Together, these observations support the idea that CysD domains are involved in the adhesion of mucin molecules to one another. Importantly, the loops participating in CysD1 adhesion in the MUC2 cryo-EM structure show amino acid sequence variability compared to other CysD domains. It is, therefore, likely that diversification of intermolecular adhesion, and adaptation of mucins to various organs and biological functions, is accomplished, at least in part, by the divergence of CysD domains.
To gain further insight into the CysD family, we crystallized CysD7 from MUC5AC, which has high amino acid sequence identity to multiple CysD domains in this mucin (Fig. 1C). MUC5AC CysD7 is at least 97% identical to CysD3, CysD5, CysD8, and CysD9 in the same protein (compared over the regions spanning the second to the ninth cysteines in the domains). MUC5AC CysD7 is also at least 53% identical to CysD3, CysD4, CysD5, CysD6, and CysD7 of MUC5B. When compared with MUC2 CysD1, MUC5AC CysD7 shows < 40% amino acid identity, despite the conservation of the disulfide-bonded cysteines, WXXW motif, and other key residues in the fold. Therefore, MUC5AC CysD7 is an appropriate representative for the exploration of CysD structural diversity in human gel-forming mucins.
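The identity comparison described above can be illustrated with a minimal sketch. It assumes two pre-aligned sequences of equal length with `-` for gaps, restricts the comparison to the window from the second to the penultimate cysteine of the reference sequence, and divides by the number of ungapped columns in that window. The function names, the choice of denominator, and the toy sequences are assumptions made for illustration; they are not the authors' procedure or real mucin sequences.

```python
def cys_window(aligned_ref):
    """Return (start, end) alignment columns spanning the second through
    the penultimate cysteine of the reference sequence."""
    cys_cols = [i for i, aa in enumerate(aligned_ref) if aa == "C"]
    if len(cys_cols) < 4:
        raise ValueError("need at least four cysteines to define the window")
    return cys_cols[1], cys_cols[-2]


def percent_identity(aligned_ref, aligned_other):
    """Percent identity over the cysteine-bounded window.

    Assumption: columns containing a gap in either sequence are skipped,
    so the denominator is the number of ungapped column pairs.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("sequences must come from the same alignment")
    start, end = cys_window(aligned_ref)
    pairs = [
        (a, b)
        for a, b in zip(aligned_ref[start:end + 1], aligned_other[start:end + 1])
        if a != "-" and b != "-"
    ]
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical toy alignment (not real CysD sequences).
toy_ref = "ACDCEFCGHCK"
toy_other = "ACDCQFCGHCK"
toy_identity = percent_identity(toy_ref, toy_other)  # window is "CEFC" vs "CQFC"
```

Restricting the window in this way avoids the terminal cysteines, whose alignment can incur large gap penalties, as noted in the legend to Fig. 1C.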

Crystallization and structure of MUC5AC CysD7
The CysD7 domain of MUC5AC (amino acids 3518-3626 of UniProt P98088 [7]) was expressed in HEK293F cells and purified from the culture supernatant (see Materials and methods). After the removal of the His6 tag used for purification, the protein was concentrated and crystallized. Crystals grew as thin rods, from which diffraction data were collected to 1.7 Å resolution (Table 1). The structure was solved by molecular replacement using the AlphaFold2 prediction for the domain [8] (Fig. 2A). After refinement and rebuilding into the electron density map, the only substantial difference in backbone conformation between the experimental CysD7 structure and the AlphaFold2 prediction was a shift of the small helix α2 (residues 3579-3582; amino acid sequence IEHL) about 2.6 Å along the long axis of the domain. This shift was due either to interaction of helix α2 with helix α1 of a symmetry-related molecule in the crystal or to an intrinsic difference in the rotamers of I3579 and L3582 compared to the predicted structure. Differences of this magnitude are common between experimental and AlphaFold2 structures [9]. Another possible contributing factor is the redox state of the nearby disulfide between C3569 and C3588. Despite protein production in the oxidizing environment of the mammalian secretory pathway and data collection without exposure to reducing agents or synchrotron X-rays, this solvent-accessible disulfide was reduced in a fraction (~0.25) of the molecules in the crystal. Partial occupancy was detected for a C3569 rotamer that is incompatible with both the disulfide and the position of I3579 in the AlphaFold2 structure (Fig. 2B).

MUC5AC CysD7 binds a single calcium ion
A motivation for determining the structure of the MUC5AC CysD7 domain experimentally was to establish the number of calcium ions bound. Two calcium ions, 7.4 Å from one another, were previously observed in MUC2 CysD1 [3] (Fig. 3A). One of the MUC2 CysD1 calcium-binding sites is a complete octahedral coordination shell consisting of four amino acid side chains and two backbone carbonyls, whereas the other utilizes only two side chains, with three other ligands provided by backbone carbonyls. The crystal structure of MUC5AC CysD7 revealed only a single coordinated calcium ion, in the octahedral cage (Fig. 3A). One of the two aspartates that bind the second calcium ion in MUC2 CysD1 is conserved in MUC5AC CysD7, but the other aspartate is replaced by a histidine (Fig. 3B). Though not involved in calcium coordination, the conserved MUC5AC CysD7 aspartate shares a second function with its counterpart in MUC2 CysD1: the interaction with an N-H group at the edge of strand β2 (Fig. 3A).
In MUC2 CysD1, the loop preceding strand β2 is folded over to surround the second calcium ion (Fig. 3C). In MUC5AC CysD7, this loop remains extended such that its tip is about 10 Å from its counterpart in MUC2 CysD1. The extension of the loop places it closer to a large loop with exposed hydrophobic residues (Fig. 3B).
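The distinction drawn above between side-chain and backbone-carbonyl ligands can be sketched as a simple distance search around a calcium ion. The example below is illustrative only: the 2.6 Å cutoff is an assumed value chosen to bracket typical Ca-O distances, the coordinates describe an idealized octahedron rather than the CysD7 site, and the residue labels are hypothetical. The one convention taken as given is that the backbone carbonyl oxygen is named `O` in PDB files.

```python
import math


def coordination_shell(ion_xyz, atoms, cutoff=2.6):
    """List atoms within `cutoff` Å of an ion and classify each ligand.

    `atoms` is a list of (residue_label, atom_name, (x, y, z)) tuples.
    The atom name "O" is treated as a backbone carbonyl oxygen; any other
    name is treated as a side-chain atom (an illustrative simplification).
    """
    shell = []
    for label, atom_name, xyz in atoms:
        distance = math.dist(ion_xyz, xyz)
        if distance <= cutoff:
            kind = "backbone carbonyl" if atom_name == "O" else "side chain"
            shell.append((label, atom_name, kind, round(distance, 2)))
    return shell


# Idealized, hypothetical octahedral site: six oxygens 2.4 Å from the ion,
# plus one distant atom that should be ignored.
toy_ion = (0.0, 0.0, 0.0)
toy_atoms = [
    ("Asp-a", "OD1", (2.4, 0.0, 0.0)),
    ("Asp-b", "OD2", (-2.4, 0.0, 0.0)),
    ("Glu-c", "OE1", (0.0, 2.4, 0.0)),
    ("Asn-d", "OD1", (0.0, -2.4, 0.0)),
    ("Gly-e", "O", (0.0, 0.0, 2.4)),
    ("Thr-f", "O", (0.0, 0.0, -2.4)),
    ("Leu-g", "CD1", (5.0, 5.0, 5.0)),
]
toy_shell = coordination_shell(toy_ion, toy_atoms)
toy_side_chains = sum(1 for entry in toy_shell if entry[2] == "side chain")
```

In this toy case the shell holds six ligands, four from side chains and two from backbone carbonyls, which mirrors the composition reported for the conserved octahedral site.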

Additional structural differences between MUC5AC CysD7 and MUC2 CysD1
Another significant difference between MUC5AC CysD7 and MUC2 CysD1, which was predicted accurately by AlphaFold2, is the presence of an extra α-helix (helix α1) following strand β2 in CysD7 (Fig. 2A). This helix is predicted for all CysD domains in human gel-forming mucins except those in MUC2. In the MUC2 amino-terminal region beaded filament structure [3], two CysD1 domains sit within a few Å of one another at the center of each bead (Fig. 4A). Inspection of these two domains reveals that if helix α1 were present in MUC2 CysD1, it would clash with the second calcium-binding site of the neighboring CysD1. Notably, when MUC5AC CysD7 is superposed on MUC2 CysD1 in the MUC2 filament (PDB 6TM2), helix α1 is accommodated by the lack of the second calcium-binding site and extension of the loop (Fig. 3C), which makes room for the helical insertion (Fig. 4A).
Another difference between MUC5AC CysD7 and MUC2 CysD1 is the set of residues participating in cation-π interactions with the WXXW motif. In both proteins, an arginine side chain arising from strand β6 inserts between the two tryptophan side chains. However, in MUC2 CysD1, a lysine two amino acids upstream of the arginine in strand β6 interacts with the second tryptophan, whereas in MUC5AC CysD7, an arginine arising from strand β3 interacts with this tryptophan (Fig. 4B). Interestingly, the arginine in strand β3 is conserved in MUC2 CysD1 (Fig. 4C), raising the possibility that this β3 arginine could substitute for the β6 lysine, though experimental evidence for this idea is currently lacking. Regarding other CysD domains, the β3 arginine is found in all CysDs except for MUC5B CysD2, which has a glutamine at this position (Fig. 4C). MUC5B CysD2 instead has an arginine at the same position as the lysine in strand β6 of MUC2 CysD1. Therefore, it appears that only MUC2 CysD1 and MUC5B CysD2 have diverged in the basic amino acids that interact with the conserved WXXW motif.

Table 1. Crystallographic and refinement statistics. Values in parentheses refer to the highest resolution shell. The authors acknowledge that higher resolution data could have been obtained from this crystal, but the data collected and analyzed were sufficient for the conclusions made herein. The Ramachandran outlier is I3557, found in a region of high-quality backbone electron density. The proximity of the isoleucine to the buried disulfide C3558-C3617 may constrain the I3557 backbone dihedral angles.

Fig. 3. MUC5AC CysD7 binds one calcium ion. Structure images were generated using CHIMERAX [17]. (A) Calcium-coordinating environment of MUC5AC CysD7 compared to MUC2 CysD1. (B) Amino acid sequence alignment of MUC5AC CysD7 and MUC2 CysD1 highlighting calcium-coordinating amino acid side chains. Amino acids in the conserved site appear on a red background, and amino acids in the second site of MUC2 CysD1 appear on an orange background. Amino acids contributing backbone carbonyl groups to calcium coordination are indicated with red spheres for the conserved site and orange spheres for the second site in MUC2 CysD1. Conserved calcium-binding residues are numbered according to their position in the MUC5AC CysD7 sequence. Remaining amino acid identities are indicated by a gray background. Amino acids participating in protein-protein interactions in MUC2 amino-terminal region filaments are indicated by blue squares. Sequences were obtained from UniProt [7] (MUC2: Q02817; MUC5AC: P98088) and aligned using T-COFFEE [18]. (C) Loop (red arrow) of MUC5AC CysD7 (orange) is extended rather than folded over a calcium ion as in MUC2 CysD1 (beige). Exposed hydrophobic side chains in loops are shown as sticks and labeled for MUC5AC CysD7.

Flexibility in the MUC5AC CysD7 adhesion loop
CysD domains display a large loop distal from the domain termini. This loop contains exposed hydrophobic side chains (aromatics or prolines) (Fig. 3C) and was found buried at the molecular interface in the MUC2 [3] and MUC5B (unpublished data) filaments. The loop, therefore, appears to be a primary contributor to CysD domain adhesion, though additional regions also participate in binding [3]. In maps calculated from MUC5AC CysD7 crystal diffraction data, the loop exhibited poor electron density, despite its location near crystal contacts with well-ordered regions of other molecules. F3605 appeared to be disordered, and its side chain was not modeled. These observations suggest that the configuration of the loop may adapt to its binding site during mucin assembly or to aid in post-secretion interactions.

Discussion
The molecular mechanism by which CysD domains contribute to mucin assembly has begun to be revealed by high-resolution structures spanning the first of these domains in MUC2 [3] and MUC5B (unpublished data). However, the functions and binding capabilities of the many additional CysD domains present in MUC5B and MUC5AC (Fig. 1A) are poorly understood. It is not yet clear whether CysD domains can functionally substitute for one another, overlap in their activities, or have distinct roles in mucin assembly. Even CysD domains that share 99% amino acid sequence identity may be constrained to different binding sites or influenced in other manners by the lengths and natures of the PTS regions flanking them. Nevertheless, we hypothesized that insights can be gained by studying CysD domains in isolation and analyzing them in the conceptual framework of known CysD binding sites.
An example of such insight involves the presence or absence of helix α1, the α-helix following strand β2 in CysD domains. The lack of helix α1 in MUC2 CysD1 appears to accommodate the juxtaposition of two CysD1 domains docked at the middle of each bead in the filament formed by the amino-terminal region of MUC2 [3], preventing a steric clash with the second calcium-binding site of the neighboring CysD1 (Fig. 4A). The α1 helix is present in MUC5AC CysD7, and, as evident by homology and AlphaFold2 prediction, in most other CysD domains. However, based on greater similarity to MUC5AC CysD7 than to MUC2 CysD1 at the relevant amino acid positions, most other CysD domains also appear to lack the second calcium-binding site. Superposing the structure of MUC5AC CysD7 onto MUC2 CysD1 in the context of the beaded filament showed that the lack of the second calcium-binding site compensates for the presence of helix α1. There are no major steric clashes, aside from those readily resolvable by changes in side-chain rotamers, between the two copies of the CysD7 domain (Fig. 4A). This analysis does not prove that MUC5AC CysD7 naturally binds mucin-beaded filaments in the same position and orientation as MUC2 CysD1, nor that these two segments of the domain have indeed co-evolved for this purpose, but it does demonstrate that such a docking pose is possible. It remains to be determined at what stage in mucin assembly or mucus production, and in what manner, CysD regions downstream of CysD1 participate.
Another general insight into CysD domains obtained by comparing two divergent representatives of the family is the variability in calcium-binding stoichiometry. MUC2 CysD1 was previously observed to bind two calcium ions [3], whereas MUC5AC CysD7, as described here, bound only a single calcium ion. The coordination of the second calcium ion in MUC2 CysD1 is dominated by backbone carbonyls. Consequently, the presence or absence of a suitable coordination shell is more difficult to discern from amino acid sequence alignments compared to the first calcium-binding site, which is a well-organized octahedral shell containing four amino acid side chains. Moreover, the conservation in MUC5AC CysD7 of one of the two calcium-coordinating side chains of the second calcium-binding site in MUC2 CysD1 was found to be due to a different function of this side chain, distinct from calcium binding. Specifically, the acidic side chain interacts with an exposed backbone N-H group in a nearby edge strand of a β-sheet, helping to cap the sheet (Fig. 3A). In MUC2 CysD1, this acidic side chain simultaneously engages in calcium binding and sheet capping.
Information on large mucin assemblies obtained by cryo-EM is useful for illuminating the organizing principles of these glycoproteins [3]. However, the cryo-EM structures have not provided very high-resolution information on peripheral (but essential) parts of the mucin supramolecular assemblies, such as the CysD1 domains. Therefore, complementing cryo-EM structures with detailed crystallographic analysis is useful. To date, no experimental structure of any part of MUC5AC is available, aside from a short PTS peptide crystallized in complex with a glycosyltransferase (PDB 5AJO). The analysis presented herein will aid in modeling and interpretation of future structural studies on MUC5AC. Already, this study expands our understanding of the evolution of mucin CysD domains, with a focus on their structural diversity, flexibility, and calcium ion binding capabilities.

Materials and methods
MUC5AC CysD7 protein production and purification
The MUC5AC CysD7 domain was produced using a protocol similar to that used to produce MUC2 CysD1 [3]. The coding sequence for residues 3518-3626 of human MUC5AC was inserted into the pcDNA3.1 plasmid downstream of a segment encoding the sequence MRRCNSGSGPPPSLLLLLLWLLAVPGANAAPQGHHHHHHENLYFQGG, which includes the signal sequence from the protein QSOX1, a His6 tag for purification, a tobacco etch virus (TEV) protease cleavage site for tag removal, and a glycine linker. TEV protease cleavage left two non-native glycines fused to the amino terminus of the CysD7 domain. Plasmids were propagated in and purified from cells of the Escherichia coli XL1-Blue strain (Promega, Madison, WI, USA). For protein production, FreeStyle™ 293-F cells (Thermo Fisher Scientific, Waltham, MA, USA) at a concentration of 1 × 10⁶ cells per mL were transiently transfected using the PEI Max reagent (Polysciences, Inc., Warrington, PA, USA) with a 1 : 3 ratio (w/w) of DNA to PEI. Six days after transfection, the culture medium was collected and centrifuged for 10 min at 500 g to pellet cells. The supernatant was then centrifuged for 10 min at 3000 g to pellet any remaining particulate matter. The supernatant from this second centrifugation was filtered through a 0.22 µm filter, and the His6-tagged proteins were purified by nickel-nitrilotriacetic acid (Ni-NTA) chromatography. The eluted protein was dialyzed against 0.5× phosphate-buffered saline at room temperature overnight at a 25 : 1 molar ratio with His6-tagged TEV protease, produced in-house. After dialysis, the protein was separated by Ni-NTA chromatography from the TEV protease, cleaved His6 tag, and any remaining uncleaved material. The eluate was collected, the cleaved protein was concentrated to 10 mg·mL⁻¹, and the buffer was exchanged to 10 mM Tris, pH 7.5, and 20 mM NaCl for crystallization.
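The layout of the leader segment quoted above can be checked with a short sketch. It takes the sequence as written in the text (without whitespace), locates the His6 tag and the canonical TEV recognition motif ENLYFQ, and returns what remains downstream of the cut, given that TEV protease cleaves after the glutamine of that motif. The function and constant names are illustrative and not part of the authors' workflow.

```python
LEADER = "MRRCNSGSGPPPSLLLLLLWLLAVPGANAAPQGHHHHHHENLYFQGG"
HIS_TAG = "HHHHHH"
TEV_SITE = "ENLYFQ"  # TEV protease cuts immediately after the final Q


def residual_after_tev(leader):
    """Return the residues of `leader` that remain attached to the
    downstream protein after TEV cleavage."""
    his_start = leader.index(HIS_TAG)
    tev_start = leader.index(TEV_SITE)
    if his_start > tev_start:
        raise ValueError("His tag must precede the TEV site to be removed")
    cut = tev_start + len(TEV_SITE)
    return leader[cut:]


residual = residual_after_tev(LEADER)
```

For this leader the residual is the two-glycine linker, in agreement with the statement that cleavage leaves two non-native glycines at the amino terminus of CysD7.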

MUC5AC CysD7 crystallization and structure determination
The hanging drop vapor diffusion method was used for crystallization. Initial CysD7 microcrystals were obtained from 200 mM ammonium chloride and 20% PEG 3350 in the JCSG-plus™ Screen (Molecular Dimensions, Holland, OH, USA). Conditions were optimized to produce rod-shaped crystals over a well solution containing 24% PEG 3350, 150 mM ammonium chloride, and 100 mM sodium acetate buffer, pH 5.1. These crystals were then seeded into a drop over a well solution containing 32% PEG 3350, 150 mM ammonium chloride, 100 mM sodium acetate buffer, pH 5.1, and 5% ethylene glycol. For cryo-protection, a crystal grown under these conditions was transferred to the same solution except containing 20% ethylene glycol, mounted in a loop, and flash-frozen in a 100 K nitrogen stream. Diffraction data were collected at 100 K on an in-house Rigaku liquid-metal-jet X-ray Synergy System with HyPix Arc 150° detector. ALPHAFOLD v2.0 [9] accessed via COLABFOLD [10] was used to produce the CysD7 model used for molecular replacement. Temperature factors were initially set to 12 to approximate the estimated Wilson B factor. The initial model was then iteratively rebuilt and refined using COOT [11] and PHENIX [12], respectively. PHENIX was run with individual B factor refinement and automated water picking. One water was manually changed to chloride upon inspection of Fo-Fc map density and appropriate coordination distances. Occupancy refinement was carried out to assess the fractions of alternate conformations. Model geometry was assessed using MOLPROBITY [13].
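The step of assigning a uniform temperature factor to a predicted model before molecular replacement can be sketched as a rewrite of the B-factor field in PDB coordinate records. The sketch below is an assumption about how such a step might be scripted, not the authors' procedure; it relies only on the fixed-width PDB convention that the temperature factor occupies columns 61-66 of ATOM and HETATM records. The function name and the example record are hypothetical.

```python
def set_b_factor(pdb_line, b_value):
    """Return a PDB ATOM/HETATM record with its temperature factor replaced.

    The temperature factor occupies columns 61-66 (1-based) of the record,
    i.e. the Python slice [60:66]. Other record types are returned unchanged.
    The trailing newline, if any, is dropped from rewritten records.
    """
    if not pdb_line.startswith(("ATOM", "HETATM")):
        return pdb_line
    padded = pdb_line.rstrip("\n").ljust(66)
    return padded[:60] + "%6.2f" % b_value + padded[66:]


# Hypothetical record assembled with the standard column widths:
# serial, atom name, altLoc, residue, chain, resSeq, iCode, x, y, z, occ, B.
example_record = "ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f" % (
    1, " CA", "", "GLY", "A", 1, "", 11.104, 6.134, -6.504, 1.00, 35.00
)
updated_record = set_b_factor(example_record, 12.0)
```

Applying the function to every line of a model file would set all atomic B factors to 12 while leaving coordinates and occupancies untouched.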

Fig. 1. CysD domains. (A) The number and positions of the CysD domains (orange) in human gel-forming mucins are illustrated. MUC5AC (UniProt P98088) is 5654 amino acids in length. Dark gray boxes are D assemblies (sets of characteristic domains found in mucins, von Willebrand factor, and other proteins) [14]. Pink boxes are a region known as VWCN [15], tan boxes are sets of VWC domains, and brown boxes are CTCK domains [16]. (B) The amino acid sequence of MUC5AC CysD7 is shown with gray and black lines representing disulfide bonds between the indicated cysteines. Amino acid numbering is according to full-length MUC5AC (UniProt P98088). (C) Table of percent amino acid sequence identities between CysD domains in mucins. Identities were calculated from the second cysteine to the penultimate cysteine in each domain due to large gap penalties in some cases for aligning the first and last cysteines (see Fig. 4 for full alignments). CysD domains with high percent sequence identities (> 97%) to MUC5AC CysD7 are highlighted in dark green. Moderate percent sequence identities (53% and 58%) are highlighted in lighter green. The relatively low sequence identity between MUC5AC CysD7 and MUC2 CysD1 is highlighted in dark orange.

Fig. 2. MUC5AC CysD7 domain structure and reduction of disulfides. Images were generated using CHIMERAX [17]. (A) Cartoon of MUC5AC CysD7 with disulfides shown as spheres. Secondary structures are numbered according to their order in the primary structure, and the amino ("N") and carboxy ("C") termini are labeled. The green sphere is a bound calcium ion. (B) The 2Fo-Fc crystallographic map (1.4 σ) is presented within 1.6 Å of a segment of the MUC5AC CysD7 structure. The AlphaFold2 model of this segment (lime green) would clash with the observed alternate conformation of Cys3569 in a minor reduced fraction of the protein in the crystal.

Fig. 4. Diversity of secondary structural elements and cation-π interactions in CysD domains. Green spheres are calcium ions. Structure images were generated using CHIMERAX [17]. (A) Three beads of the MUC2 amino-terminal region beaded filament [3] are shown with the CysD1 domains colored teal. Below, the MUC2 CysD1 domains (cyan and teal cartoons) are presented in the orientation they assume in the filament. Superposition of the MUC5AC CysD7 structure (orange) onto one or both MUC2 CysD1 domains shows that the clash between the additional α-helix present in CysD7 and the neighboring domain (red arrow) is relieved by the loss of the second calcium-binding site (green arrow). (B) Cation-π interactions in MUC5AC CysD7 vs. MUC2 CysD1, viewed from two angles. (C) Sequence alignment of human CysD domains with amino acids observed or predicted to participate in cation-π interactions highlighted in cyan. Cysteines are highlighted yellow. The positions of the α1 helix (when present), the second calcium-binding loop (when present), and the adhesion loop are indicated. Sequences were obtained from UniProt [7] (MUC2: Q02817; MUC5B: Q9HC84; MUC5AC: P98088) and aligned using T-COFFEE. Alignments were then manually adjusted, making the fewest possible changes to align cysteines.