Integrative computational framework to decipher the functions of shell proteins in biomineralization

Mollusk shells contain biominerals with remarkable mechanical properties enabled by a small fraction of embedded organic matrix proteins. However, the speci ﬁ c molecular functions of most shell proteins have remained elusive. Traditional genomics and functional studies are extremely laborious to identify key components. To address this, we developed an in-silico pipeline integrating protein structure modeling, molecular dynamics simulations, and machine learning to elucidate the critical ion protein interactions governing shell formation. Using the pearl oyster Pinctada fucata as a test case, our framework successfully recapitulated known protein functions and predicted roles of uncharacterized proteins to guide future experiments. Moreover, the pipeline ’ s modular design enables versatile applications for rapidly elucidating structure-function relationships in diverse biomineralization systems, complementing conventional wet-lab methods. Overall, this computational approach leverages automatic simulations and analytics to unlock molecular insights into shell protein ion dynamics, accelerating the discovery of key crystallization regulators for bioinspired materials design.


Introduction
Biominerals are biogenic composites of inorganic constituents and embedded organic matters.The latter endows biominerals extraordinary mechanical properties despite their low content and therefore is of great interests for scientists to design new materials following similar strategy (Huang et al., 2019;Zhao et al., 2020;Zhou et al., 2022).The organic components in biominerals, also refer to organic matrix, play a pivotal role in mineral deposition such as crystal nucleation, orientation, polymorph selection and morphology modification (Addadi and Weiner, 2014) Understanding how organic matrix precisely affect the crystallization is crucial to bioinspired material synthesis.
Mollusks are masters in producing plentiful biominerals with various microstructures, textures, and shapes.Their functions include protection, feeding, buoyancy, mating and vision, making them a great reservoir for bioinspired materials (Lowenstam and Weiner, 1989).For example, the shiny nacreous layer in some bivalves, gastropods and cephalopods, namely the mother of pearl, have been extensively studied both on the fundamental basis of its fracture-resistance and architectural assembly, and on the biomimetic applications of nacre-like materials (Finnemore et al., 2012;Cartwright, 2016).Despite the large progress in making nacre-like structure of improved mechanical properties, the natural nacre is far beyond completely replicated, especially regarding the stiffness and toughness, which may due to the unique crystallization process in nacre.
The formation of shell structure in mollusks is regulated by the shell matrix, which contains polysaccharides, proteins, and lipids (Marin et al., 2012).Shell proteins are the key components controlling shell formation, as revealed by intensive studies in the Pinctada genus (Liu et al., 2015;Kintsu et al., 2017;Mariom et al., 2019).In the past three decades, hundreds of shell proteins have been identified by biochemical methods or by omics tools (Liu and Zhang, 2021).Yet their exact roles in controlling CaCO 3 crystallization are not clear, except for a few members.This may be ascribed to the time and effort consuming procedures of functional characterization of shell proteins, as well as the relatively limited methods to explore the protein's function.Basically, the function of a shell protein is revealed by its amino acid sequence and their performance in in vitro CaCO 3 crystallization, which may further be supported by the expression of the related gene either under normal situation or under treatments of shell damage or RNAi knock-down.Such methods have been successful in characterizing the potential roles of shell proteins in crystallization or in some particular cases in immunity (Bahn et al., 2017;Yang et al., 2020).
However, the rapid expansion of numbers of shell proteins identified by omics tools, especially proteomics, in the recent years has challenged the traditional characterizing methods, making it a really tough and nearly inaccessible task to reveal the roles of all the identified shell proteins (Liu and Zhang, 2021).Moreover, the traditional characterization methods cannot fully elucidate the molecular mechanism of the crystallization controlling by the shell proteins at molecular or even atomic level.Molecular dynamics is a time-saving method for studying the movement patterns and interactions of proteins at the atomic level, and can be used to explore the intricate structure of protein molecules, the relationships between atoms, and how these relationships affect the physical properties and biological functions of proteins (Eastman et al., 2017).For example, by understanding the dynamic behavior of proteins, scientists can predict and design new drug molecules that can effectively interact with specific proteins, thereby achieving the goal of treating diseases (Sinha et al., 2022).Moreover, molecular dynamics simulations revealed that cement protein MrCP20 from barnacle Megabalanus rosa can sequester free Ca2+ and CO 3 2− ions on its highly charged surface through disorder −order interplay of the protein and ions, and thus regulating calcite deposition in the barnacle base plate (Harini et al., 2019).To fill the gap between the biomineralization genome and the protein functions, here we presented a systematic framework integrating the state-of-the-art protein folding algorithm, molecular dynamics (MD) simulations and machine learning analysis, in attempt to describe the interaction among proteins and the inorganic ions (calcium, bicarbonate, and magnesium).

Sequence analysis
We firstly retrieved all the protein sequences of Pinctada fucata from Uniprot, with a total of 1116 entries (up to August of 2023).Out of all proteins, we did a literature search and gathered all the cases with experimental results for influence on calcite or aragonite crystallization (see Table 1 and Supplement Material for details).Also, we included some potentially functional proteins from previous in-depth proteomic analysis of shell matrix proteins (Liu et al., 2015) to examine their functions with our protocol.Then we processed the sequences with two approaches: MAFFT (Katoh et al., 2002) method to perform the multiple sequence alignment and obtain the distance matrix; and large protein sequence language model ESM2b (Lin et al., 2023) to process the sequences and obtained the embedding vectors.Then, the distance matrix and the embedding vectors were subject to t-SNE (Van der Maaten and Hinton, 2008) for visualization using Scikit-learn package (Pedregosa et al., 2011).

Structure modeling and molecular dynamics simulations
All protein structures used for MD simulations were predicted using Uni-Fold (Li et al., 2022).To perform large scale MD simulations with various settings, we developed an automated pipeline named ProtIon based on OpenMM (Eastman et al., 2017), and the source code can be found here: (https:// github.com/Dongwentao96/ProtIon).The standard protocol is shown in Figure 1: the input structure was firstly examined to repair the missing residues or atoms with PDBfixer (https:// github.com/openmm/pdbfixer),while determining the protonation states depending on the pre-defined pH value.This procedure is particularly useful if experimental structures are used, which often contain missing residues or atoms.Then, the protein was put into a box with 1.0 nm buffer padding and the ions of choices were added at the pre-defined concentration [in our case 8mM CaCl 2 and 16mM NaHCO 3 for calcite and an additional 50mM MgCl 2 for aragonite based on our previous study (Yi et al., 2022)], along with counter-ions to maintain the electro-neutrality of the systems.For simulations, the energy minimization was firstly performed, followed by equilibration under NVT ensemble for 100ps and NPT (300K, 1bar) ensemble for 500ps, with heavy atoms fixed using Langevin dynamics (Zhang et al., 2019) and Monte-Carlo barostat (Chow and Ferguson, 1995), Amber FF14SB (Maier et al., 2015) forcefield and TIP3P (Jorgensen et al., 1983) water model.Finally, production simulations were performed under NPT conditions (300K, 1 bar).In our case, 100ns simulations were performed for each case to capture the local distribution of ions.Upon simulation completion, the integrated script can automatically generate the density map of ion distribution throughout the simulation, as well as perform routine trajectory analysis.The package support various solvent or small organic molecules, with GAFF2 (Wang et al., 2004) forcefield by default and parameters from Li et.al. (Li et al., 2015) for ions to automatically do the parameterization.The trajectory was processed with MDTraj (McGibbon et al., 2015) and MDanalysis (Michaud-Agrawal et al., 2011), and all visualization was performed using VMD (Humphrey et al., 1996).

Ion distribution calculation
To analyze the ion distribution over the simulations, we applied a 4 Å cut-off around each amino acid and counted the ion numbers for each snapshot as upper bound considering the van der waals radius of the studying ions.Then the count numbers were divided by the total number of ions then the average counts over all snapshots were calculated for further analysis and visualization.The Volmap module in VMD was used to perform the density calculation.

Machine learning analysis
We used the statistical domain features from the Python package TSFEL (Barandas et al., 2020) to systematically embed the residue-level ion enrichment capability for each system from the molecular dynamics simulation trajectories into feature vectors of the same length, with the domain set to statistical, fs set to 1, window_spliter set to False, overlap set to 0, and the rest as default values.The features include histogram, interquartile range, mean absolute deviation, median absolute deviation, root mean square, standard deviation, variance, ECDF percentile count and ECDF slope.Similar to sequence analysis, we used the t-SNE method with the Python package Scikit-learn for visualization using Matplotlib (Hunter, 2007) and Seaborn (Waskom, 2021) packages.

Results
Proteins usually exert physiological functions through some specific side chains of the amino acid residues.Therefore, elucidating the ion interaction maps of shell proteins is essential to deciphering the molecular mechanisms of shell proteins on biomineralization.Our computational simulations aimed to capture this key biophysical process and predict protein functions from the perspectives of structure, dynamics and ion distributions.However, most shell proteins identified so far in mollusk lack known homologous proteins, because only a few mollusk species have relatively complete draft genomes.At the same time, most of them do not have the three-dimensional structure analyzed by experiment.Recent advancement in protein folding algorithm paved the way for structural and dynamic studies towards our goal, since most matrix proteins are poorly characterized, and with no resolved experimental structures.Among several state-of-the-art folding algorithms (Baek et al., 2021;Li et al., 2022;Lin et al., 2023), we used Uni-Fold (Li et al., 2022) to process our sequences on the online platform Hermite (https://hermite.dp.tech).With the predicted structures we employed our ProtIon pipeline to perform MD simulations in calcite and aragonite crystallization solutions with corresponding ion concentrations (see Methods).Usually, the folded structures contain uncertainties, especially in the loop regions, but such deviation can be restored within the simulation process governed by the physical forcefield (Wang et al., 2022).Indeed, we observed the RMSD may change significantly but reached plateau within 100ns simulation time in almost all systems (Supplementary Figure 1 and 2).Subsequent analysis of the trajectories revealed distinct ion binding capabilities correlated with promotion or inhibition of crystal growth.
The results show that proteins known to promote calcite or aragonite growth exhibited localized enrichments of both Ca 2+ and HCO 3 -ions in certain regions (Figure 2).In contrast, proteins inhibiting mineralization showed accumulation of single ion types or spatially distant enrichment regions.Although other factors may also affect the specific functions such as the interactions between the proteins and the substrate or the geometric shape or size of the proteins.While most proteins of interests contain rich charged, especially acidic amino acids in the sequences, it is only through the 3D structures, combined with the MD simulations that can we examine whether the protein can gather both ions (Ca 2+ and HCO 3 -) to the same regions.

MD simulations explains past functional studies
In the past few decades, the impact of some matrix proteins on CaCO 3 precipitation have been characterized (see Table 1).Then, we examined these matrix proteins in the table by molecular dynamics model.It should be noted that, Aspein, PNU5, and PfN44 all displayed concurrent accumulations of large amounts of Ca 2+ and HCO 3 -locally (Figure 2A), agreeing with their reported promoting role in the calcite growth (Isowa et al., 2012;Pan et al., 2014;Shuai et al., 2023).Moreover, PNU5 showed co-enrichment of Ca 2+ , Mg 2+ , and HCO 3 -(Figure 2D), consistent with its reported aragonite crystallization promotion (Shuai et al., 2023).It is particularly interesting to see that when Mg 2+ ion was present, the binding affinity of PfN44 protein in the system altered significantly, due to the Mg 2+ ions binding to the central beta sheet and thus disrupting the overall scaffold (Figure 2E).This phenomenon is in alignment with previous study reporting that PfN44 can stabilize magnesium-calcite to inhibit the crystallization of aragonite, explaining its opposite functions in calcite and aragonite formation (Pan et al., 2014).
PfN23 is a basic SMP identified from the P. fucata shell, which can specifically induce the crystallization of aragonite, and its positively charged C-terminal is supposed to be the key functional region (Fang et al., 2012).Coincide with this, we found that the Cterminus of basic protein N23 selectively accumulated Ca 2+ and HCO 3 -only in the aragonite solution (Figure 2C).Compared to its aragonite conformation, N23 in the calcite solution without Mg 2+ exhibited larger structural changes and lost the C-terminal Ca 2+ enrichment, and instead accumulate HCO 3 -in more regions.This change in anion binding might be involved in the previous reported promotion of calcite dissolution (Fang et al., 2012).
Similarly, mantle protein N25 also displayed localized Ca 2+ and HCO 3 -enrichments, but with lower overall levels and more directional focus compared to the abovementioned proteins when subjected to calcite crystallization solution (Figure 2C).Previous evidence indicates that N25 can block growing sites for forming crystal layer, thus increasing the energy cost for deposition and decreasing growth rates of some crystal faces of CaCO 3 (Yang et al., 2019).Therefore, the observed ion binding regions of N25 may be involved in interaction with the crystal layer.Additionally, N25 and lysine-rich matrix protein 7 showed localized single ion enrichments in the aragonite solution, consistent with their reported inhibition of aragonite crystal growth (Liang et al., 2016).PNU9 showed no significant ion accumulation in calcite crystallization solution (Figure 2B), agreeing with the functional study reported previously (Kong et al., 2019).In the Mg 2 + -containing aragonite solution, PNU9 exhibited only minor, dispersed enrichments of Mg 2+ and HCO 3 -.Similarly, matrix protein PfY2 displayed local accumulation of single ion types in calcite.Such discrete single ion binding likely underlies their inhibiting effect of crystal growth (Yan et al., 2017).Likewise, nacrein in the aragonite forming condition showed the ability to locally enrich single ions in multiple separated regions (Figure 2E), consistent with its reported inhibition of aragonite ProtIon overall workflow.The input sequences can be processed with protein folding algorithm like Uni-Fold.Then, the structures were processed by PDBFixer, and the MD systems were built with ions and other molecules added and the whole system solvated.After equilibration and production runs, the trajectories were processed to display the ion density maps and subject to further analysis.
crystal growth (Blank et al., 2003).Prismalin-14 was expressed only at the mantle edge primarily present in the prismatic layer composed of columnar calcite surrounded by organic matrices (Suzuki et al., 2004).Our results show that Prismalin-14 has two minor HCO 3 -enriching regions sharing a planar Ca 2+ site (Figure 2C).Another distinct region with high single Ca 2+ accumulation may modulate the inhibition of calcium carbonate crystallization.
In the calcite solution, PNU7 only showed Ca 2+ enrichment (Figure 2C), and is predicted to play an inhibition role.However, in the aragonite simulation, PNU7's C-terminus exhibited two sites with high Ca 2+ /HCO 3 -accumulation and minor surrounding Mg 2+ enrichment (Figure 2E).It also displayed a region near the Nterminus with high HCO 3 -accumulation.In vitro crystallization experiment showed that PNU7 can modify calcite morphology and stabilize large vaterite particles when Mg 2+ is absent, while at lower Mg 2+ concentration, large amounts of tiny crystals were formed (Yi et al., 2022).When the C-terminus of PNU7 was deleted, the growth of calcite and vaterite were inhibited.Our observed binding patterns of this protein agree with these reported functions.
In the aragonite solution, matrix protein Y2 exhibited localized co-enrichment of Ca 2+ and HCO 3 -, along with minor co-enrichment of Ca 2+ and Mg 2+ (Figure 2E).Purified recombinant rPfY2 protein has been found to significantly suppressed CaCO 3 precipitation rate and participated in the crystal nucleation process (Yan et al., 2017).Additionally, the morphology of crystals was modified, and the transformation of amorphous calcium carbonate (ACC) to calcite or aragonite was inhibited.We hypothesize the functional domains enriching Ca 2+ and HCO 3 -can regulate CaCO 3 precipitation rate, representing a more specialized form of growth inhibition.

PCA analysis on MD simulation data well separated proteins of different functions
Comparing with previous studies on the shell matrix proteins, we that proteins with different functions exhibit highly correlated results in molecular dynamics simulations, suggesting that molecular dynamics models can be used to deduce the potential functions of some uncharacterized shell proteins and to predict their possible effects on the growth of calcite or aragonite.Thus, we recorded the frequency of ion occurrences within proximity to each residue during the calcite and aragonite crystallization simulations as a characterization of the ion enrichment capability for each residue (Supplementary Figures 3 and 4).We then used TSFEL to extract features as protein representations (See Methods).Additionally, we selected traditional distance matrices based on sequence alignment and embeddings from pre-trained protein language models as alternative protein representations.
Subsequently, we utilized t-SNE for unsupervised dimensionality reduction of the protein representations from the three methods.As seen in Figures 3A, D, the matrix proteins were grouped into two groups of promoting crystallization/inhibiting crystallization by using the multi-sequence comparison distance matrix as a parameter, and there was a certain degree of differentiation between the two types of proteins under aragonite crystallization condition, but it was not ideal in the calcite environment.When embedding vectors are used as parameters, the resulting groupings have a large overlap, both in the calcite and aragonite growth conditions (Figures 3B, E).When molecular dynamics simulations of locus ion enrichment were used to describe the function of proteins, the two protein groups were well separated, with very little overlap between proteins with promoting or inhibiting crystallization properties (Figures 3C, F).Therefore, the ion enrichment capability representations have a natural advantage for depicting proteins affecting the growth of calcite versus aragonite.

ProtIon combined with machine learning to explore biomineralization proteins
Based on the molecular dynamics simulation features, we employed SVM (Figure 4) to build a model to classify the known proteins and applied to other proteins potentially related to biomineralization that were previously reported (Liu et al., 2015).Interestingly, PNU7, N23 and PfN44 are very close to the hyperplane, and these three have subtle and contradictory roles in calcite and aragonite, with both Ca 2+ and HCO 3 -bind to a specific domain and/or greatly affected by the presence of Mg 2+ .According to the SVM predictions (Figure 4) and ionic density maps, we speculated that MSI31, MSI80 and Nacrein (Figure 5A

A B
SVM classification of shell proteins under calcite (A) and aragonite (B) growth conditions.For calcite the promote, inhibit and no effect proteins were separated by two hyperplanes shown as two lines.For aragonite, only the promote and inhibit proteins were separated by the hyperplanes shown as a line, since no 'no effect' proteins were reported.In both cases, other unknown proteins were projected to predict their potential role.
and Supplementary Table 1) may promote calcite growth.They exhibited multiple regions of co-enrichment of Ca 2+ and HCO 3 ions and relatively large size, are consistent with the aforementioned cases of Aspein, PNU5 and PfN44 under calcite growth condition (Figure 2).As to Lysine-rich matrix protein 7 (KRMP7), it only showed enrichment of HCO 3 -ions and is likely to inhibit the growth of calcite.PfT has three small regions of coenrichment of Ca 2+ and HCO 3 -(Figure 5B), as well as a significant Ca 2+ ion enrichment region, which is relatively similar to the simulation result of Prismalin-14, indicating potential functions of changing crystal morphology and inhibiting calcite growth.
In terms of the impact on aragonite crystals, Aspein and MSI31 may promote aragonite growth, as predicted by the SVM model.As shown in Figure 5D, they both have regions of co-enrichment of Ca 2+ , HCO 3 -and Mg 2+ , similar to the simulation results of PNU5 and N23.Meanwhile, MSI80 and PfT do not have significant large co-enrichment regions, also consistent with the prediction.However, it should be noted that the SVM model may have limited predictive potential for some proteins.For example, there are weak cases like N-U6 and Prismin-1 (see Supplementary Figure 5 and Supplementary Table 1) under calcite growth condition, both having some ion enrichment regions but were predicted to have no effect on calcite crystal growth, probably due to the relatively small protein size that prevent the stable binding to the substrate.Also, when under aragonite growth condition, N-U6, Prismalin-14 and Prismin-1 all have some regions of co-enrichment of Ca 2+ and HCO 3 -, but no Mg 2+ was accumulated around these regions.Their predicted inhibition on aragonite growth may be ascribed to the relatively small size and ion enrichment region, or due to the attachment to the newly formed crystal layer, or reduce the precipitation rate of CaCO 3 as reported in previous studies (Yang et al., 2019;Yi et al., 2022).

Discussion
Mollusk shell formation is a complex biomineralization process precisely controlled by the organism to generate intricate composite materials with remarkable mechanical properties.The shell matrix comprises proteins, polysaccharides, and other macromolecules that regulate mineral deposition and crystal growth at multiple levels (Marin et al., 2007a(Marin et al., , 2007(Marin et al., , 2012)).At the heart of this process is the interaction between the organic matrix and the inorganic ions that make up the mineral phase.The availability and transport of ions, particularly calcium, carbonate, and magnesium, are critical determinants of shell construction (Cartwright and Checa, 2007).Calcium and bicarbonate ions are taken up from seawater by the mantle epithelium and transported to the calcification site via the extrapallial fluid (Marin et al., 2012).The organic matrix creates localized ion-rich environments to stimulate crystal nucleation.Matrix proteins selectively bind ions via acidic residues, acting as scaffolds for oriented crystal growth (Falini et al., 1996).Specific macromolecular conformations and ion binding motifs direct the polymorph selection between calcite and aragonite.
The precise spatial and temporal regulation of ion concentrations by the organic matrix is key to controlling crystallization kinetics and directing the assembly of composite shell microstructures (Nudelman and Sommerdijk, 2012).Tracking the dynamics of ion accumulation reveals mechanistic insights into how proteins influence crystal nucleation, orientation, phase, morphology and material properties.As Marin et al. discussed, the protein-ion interactions that direct crystallization occur at multiple length scales, from the nanoscale ion binding sites within individual proteins to the larger-scale accumulation patterns shaped by the 3D matrix scaffold (Marin et al., 2007b).
Although the full mechanism of biomineral formation is complicated (Yang et al., 2020), involving enzymatic processes, spatial patterning, and controlled ion accumulation, the general picture is that matrix proteins attach to a chitin-silk fibroin substrate and then concentrate Ca 2+ and HCO 3 -from the environment to form shell layers.This requires the proteins to stably bind both ions while presenting a large enough surface for crystal nucleation.If the protein can only gather anion or cation, it will drive the crystallization kinetics towards dissociation.The results in the present study indicate that the interactions between the side chain of the shell proteins and the inorganic ions are related to the regulatory role of the proteins.Magnesium can directly affect the precipitation of calcite and aragonite (Morse et al., 1997).Our study showed that some shell proteins, especially those from aragonitic nacre layers, interacted with Mg 2+ differently, which may explain their control of crystal polymorph selection during shell formation.Previous studies have shown that proteins rich in aspartic acid are thought to bind calcium ions, and proteins containing the EF-hand domain can also bind calcium ions (Hattan et al., 2001;Gotliv et al., 2005;Takeuchi et al., 2008).The binding of calcium or magnesium ions in turn changes the structure of the matrix protein.In addition, Lia Addadi and Steve Weiner's group have shown that the regular arrangement of amino acid residues in the tandem repeats containing asparagine is well matched to the crystal lattice of calcium carbonate with a specific crystal polymorph, thus determining the CaCO 3 polymorph (Addadi and Weiner, 1985).Consistently, tandem repeats containing DDRK can significantly affect the deposition of calcium carbonate (Tah et al., 2024).
The automated simulation and analysis tool ProtIon based on Python package OpenMM to study protein-ion interactions and performed consistency validation on some important proteins in the field of biomineralization, proving a high consistency between simulation results and experimental results.Based on the simulation analysis results, we can identify residues in proteins that have a greater impact on mineralization.The rich structural information provided by our method can be further used for understanding the detailed mechanism for shell formation protein design and domain fusion to incubate better varieties, as well as protein design and domain fusion to incubate better varieties.
The machine learning model we here developed to predict the function of matrix proteins is only theoretically tested, which warrant more experimental evidence in the future to support it.For example, the ion enrichment of specific amino acid sequence can be analyzed by truncating the protein to obtain mutant.Alternatively, we can also directly express the domain with ion affinity to verify its role.If the validity of our model can be verified experimentally, its application can be expanded.In the present study, we only use one shell protein in a single stimulation, but under natural biomineralization conditions, many proteins interact and regulate CaCO3 precipitation synergistically.Nevertheless, many shell proteins have disordered regions which usually exhibit as loops and coiled coils.Previous studied have shown that these disordered regions are vital in shell mineralization (Ndao et al., 2010;Brown et al., 2014), which is further supported by the results present here revealing the interaction of the disordered regions and the inorganic ions.However, we excluded some shell proteins with large proportion of disordered regions and have no 3D structures in this study (e.g.MSI60 and shematrin family) due to technical difficulties in the subsequent MD simulation.Because our machine learning model depends on the 3D structure of the target proteins, this shortcoming of our method limits its application to all shell proteins.Moreover, the training dataset of our model was relatively limited due to the fact that only a small part of the identified molluscan shell proteins has been characterized via in vitro crystallization experiment.Excluding shell protein other than P. fucata shell proteins also led to the limited training data.Therefore, more efforts are warranted to establish a routine method that is applicable to most shell protein from various molluscan genera.

Conclusion
On the top of the proteomic studies (Connors et al., 2012;Liu et al., 2015;Du et al., 2017), our framework can perform simulations under various conditions and reveal the ensemble average ion distributions over the protein surfaces, which in turn can be used to determine the particular role of the protein in shell formation.In particular, we tested some well-studied shell proteins and a few potentially functional proteins, for the cases of calcite and aragonite formation.Our simulations successfully captured the ion accumulation and correlate well with the experiments.In addition, we build up a machine learning model to further identify potential proteins with certain promoting or inhibiting mineral formation functions.Moreover, our molecular dynamics simulation module was based on OpenMM (Eastman et al., 2017), an open source and versatile simulation framework, allowing for convenient adjustment according to specific case.

FIGURE 2
FIGURE 2 Ionic density volmaps captured by molecular dynamics simulation.(A) Proteins promote growth of calcite.(B) Protein PNU 9, which has no effects on growth of calcite.(C) Proteins inhibit growth of calcite.(D) Proteins promote growth of aragonite.(E) Proteins inhibit growth of aragonite.For N23 and PNU7, the C-terminal domains were highlighted due to their distinct and important role in aragonite formation as mentioned in the main text.
FIGURE 3 t-SNE results of mineralization related proteins.The t-SNE dimensional reduction applied to (A, D) multisequence comparison distance matrix (B, E) esm2_t30_150M_UR50D embedding vectors (C, F) the molecular dynamics simulation trajectory extracted feature vectors, for calcite/ aragonite respectively.

FIGURE 5
FIGURE 5Ion density maps of unknown potential proteins (A) the potentially promoting proteins MSI31, Nacrein MSI80 for calcite formation; (B) to possibly inhibiting proteins lysine-rich matrix protein 7 and PfT for calcite formation; (C) the potentially promoting proteins Aspein and MSI31 for aragonite formation and (D) the potentially inhibiting proteins MSI80 and PfT for aragonite formation.

TABLE 1
Proteins used in the case study and their roles in CaCO 3 precipitation.