Introduction

With recent advances in machine learning (ML), material system design and development has undergone rapid acceleration1,2. However, one of the major challenges in applying ML to material system design lies in finding the appropriate design representations3. Most material design applications take advantage of quantitative (or numerical) design variables to represent material systems. In most cases, these quantitative descriptors (features) require either expert knowledge or data analysis to find the most appropriate ones. On the other hand, although most qualitative (or categorical) variables (e.g., chemical elements, chemical compositions) are more accessible than quantitative variables, it is challenging to directly include qualitative variables as a part of the design variables in automated materials design. Metal-organic frameworks (MOFs) are an example of such materials systems. MOFs are a class of porous crystalline materials that have been used extensively for gas storage4,5, gas separation6,7,8,9, and catalysis10,11,12. Because of their highly tunable nature, MOFs have been looked at as a potential solution for different applications such as carbon dioxide (CO2) capture and separation13,14. Using a vector notation in which each element corresponds to a qualitative design variable such as topology, node, and edge, MOFs can be represented with the sole usage of qualitative variables as shown in Fig. 1. However, the versatility and different possible combinations of the MOF building blocks lead to millions of candidates. To demonstrate a simple example, consider a MOF system with a topology that requires 2 nodes and 3 edges for construction. Selecting only 20 different building blocks for each node and edge leads to a combinatorial design space of more than 106 MOF candidates. Due to the high experimental cost, both in time and resources, computational approaches have been increasingly used to replace experimental exploration3.

Fig. 1: The qualitative representation and construction of metal-organic framework materials.
figure 1

Each MOF can be represented with a “vector” where each element (letter) in the vector corresponds to a choice of building block or topology. With the combination of different building blocks and topologies explored in this work, there are more than 104 hypothetical MOFs to be explored.

While high-throughput screening approaches15,16,17,18,19,20 and ML techniques21,22,23 have been utilized to computationally search or design for top-performing MOF structures in different applications, existing approaches usually rely on large data sets and high-dimensional physical descriptors to represent the material design space. These processes can be both time consuming and property specific, meaning that the ML models and descriptors are often not transferable to different design objectives. Finally, many ML models are viewed as ‘black boxes’ that are not easily interpretable for understanding how and why the model performs the way it is24,25,26. Therefore, a new and a generic computational approach that (i) employs a simple but descriptor-free (featureless) design representation, (ii) requires substantially smaller amount of data, and (iii) is easily interpretable would be highly useful for the design of MOFs.

Bayesian optimization (BO) has been shown to be effective for identifying the optimum candidates for materials systems with large design spaces and local optimums in different applications such as drug discovery, additive manufacturing, and genetics27,28. BO has also been used to identify high-performance MOFs29,30. However, previous works on MOFs require expert knowledge for the choice of appropriate physical descriptors (e.g., gravimetric surface area, largest included sphere diameter) as inputs for surrogate model training. Gaussian Process (GP) is a popular surrogate model choice for BO as it provides both predictions and uncertainty quantification, which are the two main components of the acquisition function for choosing samples when applying BO. However, GP models fall short when there are qualitative design variables. This bottleneck has been recently bypassed by the Latent Variable Gaussian Process (LVGP)31 approach, which can incorporate qualitative variables into GPs by implicitly mapping each qualitative variable into a quantitative space through low-dimensional latent variables. Specifically, as the influence of any qualitative variable on any quantitative response must be always due to some underlying, possibly high-dimensional, quantitative physical variables, the latent variable approach provides physics-based dimension reduction. Therefore, the latent variables and their locations in the latent space could provide physically meaningful information on how the qualitative variables influence the responses. In the context of latent space learning, the term “physically-meaningful” can be associated to explaining the cause-effect relationships between the design variables (inputs) and properties (output) through supervised learning, which is different from many existing latent space learnings through unsupervised learning methods. LVGP still possesses the qualities of a GP model in terms of providing fast surrogate modeling, capturing nonlinear responses, providing predictions, and quantifying uncertainties. Thus, LVGP bridges the gap for incorporating qualitative information into engineering design applications and has been already employed in data-driven materials design research32,33. Although LVGP and BO have been applied to materials design and development, its application has been limited to qualitative design variables with small number levels, i.e., the design options per variable.

Here we present the Latent Variable Gaussian Process Multi-Objective Batch Bayesian Optimization (LVGP-MOBBO) framework to perform rapid design of superior MOFs directly from the building blocks that construct the material. Specifically, we are interested in identifying the Pareto front for a multi-objective optimization and top-performing MOFs without any human intervention. We are particularly interested in examining the performance of the approach under both small and large numbers of levels for qualitative variables. We take advantage of the readily available qualitative building block information that is used to construct the MOFs and build an interpretable LVGP surrogate model that cooperates with MOBBO to adaptively lead towards promising MOF candidates for CO2 capture and separation. With the integration of batch BO, this work shows that descriptor-free LVGP can also be effectively extended to applications with substantial number of levels.

Results

Design spaces

To show the effectiveness of LVGP-MOBBO, we demonstrated our framework on a design space using the fof topology, which consists of 4 types of building blocks (BB). We used 7 organic nodes (Nodular BB1) and 4 inorganic nodes (Nodular BB2). There are also two types of edges in the fof topology, and we used 41 edges for Connecting BB1 and 42 edges for Connecting BB2. All the building block choices are displayed in Fig. 2. For the use of MOFs in carbon capture, two of the most important metrics are the CO2 working capacity and the CO2/N2 selectivity. Since we focused on method development, we calculated these properties for all MOFs in our design space in advance to aid in testing different variations of the search methods. The first design space, which we denoted as the Reduced Design Space (RDS) for validation purpose, consists of 1001 MOF designs that were specifically selected by choosing certain building blocks highlighted in red in Fig. 2 to demonstrate the interpretability of LVGP and the effectiveness of finding optimal MOF designs when combined with BO. The chosen building blocks have known similarities and differences in chemistry and structures, which are reflected in the distances between the building blocks in the latent space of LVGP. Further details of RDS are provided under the Performance section. The second design space, which we denoted as Entire Design Space (EDS) contains 47,740 MOF design candidates that were constructed by combining all available building blocks (7, 4, 41, 42) for the organic node, inorganic node, and the two edges (Fig. 2). Our framework, LVGP-MOBBO, was demonstrated on this design space to show the effectiveness of LVGP when a large number of building blocks (levels) are present.

Fig. 2: The design space of fof topology used in the study.
figure 2

The fof topology consists of four building block options (a Nodular Building Block 1, b Nodular Building Block 2, c Connecting Building Block 1 and Connecting Building Block 2). Building blocks highlighted in red are selected for the Reduced Design Space (RDS). For the Entire Design Space (EDS) all of the building blocks shown in the figure are used.

The two main goals of our design optimization are to identify (i) the Pareto front of MOF designs between the CO2 working capacity and CO2/N2 selectivity, and (ii) the top-performing MOFs. The Pareto front represents the set of MOF designs that possess properties that are superior to the rest of the design space but cannot be improved without sacrificing the other properties of interest34. Furthermore, the top-performing MOFs represent the MOF designs that are closest to the Utopian MOF design in the Euclidian space. The Utopian MOF corresponds to a hypothetical MOF design that possesses the maximum available property of all objectives, which is often not achievable and therefore considered to be “Utopian”. In many multi-objective optimization applications, identifying solutions near the Utopian point is used to evaluate the performance of algorithms in terms of optimizing both objectives at the same. As a result, the Utopian MOF is used as a reference point to identify MOF designs, which we denote as “top-performing” designs, that have high values of both properties.

Framework

We would like to explore a given design space with as few resources as possible. Thus, we implemented the LVGP-MOBBO framework to perform descriptor-free MOF design optimization with only qualitative representations of building blocks. Our proposed LVGP-MOBBO design exploration framework, consists of 5 major parts: Initial Design of Experiments, Property Evaluation, Latent Variable Gaussian Process, Multi-Objective Batch Bayesian Optimization, and Design Solution (Fig. 3).

Fig. 3: The Latent Variable Gaussian Process-Multi Objective Batch Bayesian Optimization (LVGP-MOBBO) framework.
figure 3

The initial set of materials, also known as DOE, is generated by optimal sliced Latin hypercube sampling. Property Evaluation includes MOF construction and prediction of their adsorption properties using Grand canonical Monte Carlo simulations. The LVGP builds the surrogate model that captures the relationship between the design and property space. MOBBO makes the next batch of MOF designs for property optimization. Design Solution analyzes the MOF designs and the latent spaces. The details of each box are explained thoroughly in the Methods section.

Initial Design of Experiments (DOE)

For a large design space optimization, the initial selection of design candidates plays a key role. Ideally, they should span the design space as much as possible, for which we employed the optimal sliced Latin hypercube sampling (OSLHS)35. The generated MOF designs were then passed into the next task for property evaluation. The detailed generation of the DOE is explained in the Methods sections.

Property Evaluation

Hypothetical MOFs were created using the ToBaCCo 3.0 package36, and the geometry optimization was carried out using the LAMMPS code37 with the UFF4MOF force fields38. Grand canonical Monte Carlo (GCMC) simulations were performed using the RASPA package39 to evaluate the CO2 working capacity and CO2/N2 selectivity properties. Further details of the property evaluation can be found in the Methods section.

Latent Variable Gaussian Process (LVGP)

Using the available MOF designs and their associated properties from the GCMC simulations, one LVGP model for each property was trained. Next, the properties of unexplored MOFs in the design space were predicted along with their quantified uncertainty in predictions, which are utilized by the MOBBO. The details of the LVGP modeling are provided in the Methods section.

Multi-Objective Batch Bayesian Optimization (MOBBO)

Utilizing both the predictions and the uncertainty estimates on the remaining candidates in the design space from the LVGP model, the MOBBO selects a batch of MOFs that has the highest Expected Maximin Improvement (EMI) values. The EMI is formulated in a way that both objectives have equal importance. A batch of \(B\) number of MOF designs with the highest EMI values is selected and passed on to the Property Evaluation task once again. The framework then continues in this cycle until the stopping criterion (e.g., number of MOFs identified) is reached. Further details and formulation of the MOBBO are provided under the Methods Section.

Design Solution

Once the optimization stopping criterion is reached, the identified design candidates are analyzed further to distinguish the Pareto front and the top-performing MOF designs. Finally, the latent space of each building block is visualized to make inferences about their influence on each property of interest.

Validation using a reduced design space (RDS)

Before applying our proposed methodology to a large design space, we validated the effectiveness of LVGP and BO on MOFs by implementing the optimization campaign on a relatively small design space. This space contains Connecting BBs that were handpicked to show the novelty of the methodology by validating the latent variables obtained at the end of the optimization campaign and assessing the efficiency of the methodology for designing MOFs that possess superior properties. All the Nodular BBs (7 and 4 levels for Nodular BB1 & BB2) and 6 building blocks from the Connecting BB1 & BB2 were selected for RDS. Specifically, we selected Connecting BBs labeled as {5, 7, 8, 28, 29, 41} (Fig. 2). The specifically selected Connecting BBs contain known differences and similarities in both structure and chemistry. The blocks {5, 7, 8} have similar molecular structures with different functional groups (-CN, -F, -NH2). Blocks {28} and {29} are extended structures of blocks {5} and {7}, respectively. Finally, block {41} is an empty building block, which facilitates a direct connection between the Nodular BB1 and BB2 and it is known to result in superior properties for some gas separation applications when included in the design of MOFs. As a result, we first aimed to validate the influence of different choices of building blocks on the properties through analyzing and explaining the spatial relationships between latent variables obtained from the LVGP models. Next, we aimed to demonstrate the effectiveness of LVGP for identifying optimal building blocks when used with BO.

The RDS contains 1001 MOF design candidates and three Pareto front MOFs designs. The property space with the known Pareto front and the Utopian designs is shown in Fig. 4a. In addition to demonstrating the effectiveness of LVGP for MOFs, the design optimization goal of this study was to identify both the Pareto and other top-performing designs. To account for all the possible Nodular BB1s, 7 MOFs were chosen for the initial DOE using OSLHS. Each of the 7 MOFs corresponds to one level of the Nodular BB1. Furthermore, due to the small number of candidates, we chose to add one MOF design, \(B=1\), during each iteration of BO. Batch BO was implemented on a larger design space, as discussed later in the paper.

Fig. 4: The LVGP-BO results for the Reduced Design Space (RDS) exploration.
figure 4

a The property space of the available MOF candidates. The known Utopian and Pareto front MOF designs are highlighted with black and red points, respectively. b Design optimization history for 10 top-performing and Pareto front MOF designs. The blue color represents the Pareto front search, and the red color represents the 10 top-performing design search.

The LVGP-BO design optimization campaign ran until the stopping criteria of identifying both the Pareto front and the 10 top-performing MOFs were satisfied. Starting with 7 initial MOFs in the DOE, this stopping criterion led to 44 iterations, which in return shows that a total of 51 (7 + 44) MOFs (5.1% of the design space) are explored as the next design candidates. Specifically, the LVGP-BO framework found all three Pareto front MOF designs in 42 iterations, which corresponds to 4.9% of the design space. The optimization history of identified MOF designs can be seen in Fig. 4b. The fast design exploration of the LVGP-BO demonstrated its capability of finding top-performing MOFs.

To verify the interpretability of LVGP models, we examined the latent spaces obtained from training the LVGP surrogate model for both properties after the 44-iteration optimization campaign. In addition, to validate the correctness of the latent variables obtained from the optimization campaign, we also trained the LVGP models on the entire RDS for both properties. We then compared the latent spaces obtained from these two instances. The comparison of the latent spaces for both objectives is shown in Fig. 5. The four large colored boxes represent the latent space obtained for each BB after training the LVGP, the two columns represent different properties, and the two rows represent the different training instances. By comparing boxes in each column, we observed that the latent space representations obtained at the end of the design optimization show differences with the latent spaces obtained from the LVGP model trained on the entire design space. This was an expected result since LVGP-BO is optimization driven. However, independent of orientation and the scale of z1 and z2 (the 2D latent space axis), the relative distances between latent variables, which reflect the relationships between design choices (building blocks) and their influence on the properties, are preserved after LVGP-BO. For example, for Connecting BB1, level {41} is far from the other levels for both properties in both training instances in Fig. 5. This similarity shows that even though the LVGP used in the design optimization framework was trained on a very small portion of the entire design space that is biased towards promising building block candidates, it can capture the underlying latent variables and the relationships between building blocks very well. This can be very advantageous for designers to understand and extract true meanings from the design decisions that our framework makes.

Fig. 5: The latent variables obtained from the Reduced Design Space (RDS) study.
figure 5

Each colored box shows the 4 building block design variables and red dots show their respective latent variables. The numbers represent the design choice for the specific building block and the legends for the numbers are found in the yellow box on the bottom of the figure. The axes z1 and z2 represent the 2D latent space obtained from the LVGP model. The 1st row represents the latent variables obtained by training LVGP on the all RDS and the 2nd row represents the latent variables obtained after 44 iterations of LVGP-BO on RDS. The 1st and 2nd columns represent the CO2 working capacity and CO2/N2 selectivity properties, respectively. Finally, the dashed boxes show the zoomed in images of clustered latent variables.

The next question then becomes, what do these latent variables represent? As we previously mentioned, since the influence of every qualitative variable on the quantitative response of interest must be due to some physical quantitative variables, the low dimensional latent variables could provide physically meaningful information regarding the cause-effect relationships between qualitative design variables (inputs) and properties (outputs). Specifically, the spatial relations (distances) between different qualitative design choices in the latent space can show similarity and differences regarding the influence of these properties on the response. Similarly, spatial relationships between latent variables can also imply the dimensionality of underlying physical descriptors. Figure 6 shows the importance of the input space, in terms of the textural characteristics, on the property space. For both the RDS and the EDS that will be demonstrated later, we found that most top-performing MOFs for CO2/N2 separation often have small pores, characterized by low values of the largest cavity diameter (LCD) and small gravimetric surface area (GSA). MOFs with smaller pores could result in stronger van der Waals interaction and thus favor CO2 adsorption over N2 adsorption. Knowing the importance of the input space on the latent space, we further investigated how different building blocks affect the pore size, and ultimately the latent space (Fig. 7). For the latent plot of Nodular BB1 (as shown in Fig. 5), we found that the distance among the blocks {1, 3, 4, 5} and {6, 7} are small, and block {2} is always far from the rest of the variables. Building blocks {2, 6, 7} are smaller blocks than {1, 3, 4, 5}, resulting in MOFs with smaller LCD (Fig. 7a). This could explain why blocks {1, 3, 4, 5} are always closer in the latent space than {2, 6, 7}. Moreover, block {2} is bulkier, with a branching -CH3 group, than block {6, 7}, resulting in MOFs with slightly smaller pores, and thus far away from the other building blocks. In the Connecting BB1 latent variable plots, we observed that the 5 blocks {5, 7, 8, 28, 29} formed a cluster and are located far away from the block {41}. Because {41} is an “empty” building block (Fig. 5), using block {41} resulted in MOFs with significantly smaller pores than other building blocks (Fig. 7b), and thus different in the property space.

Fig. 6: Structure – property relationship of the Entire Design Space (EDS) and Reduced Design Space (RDS) datasets.
figure 6

The CO2/N2 selectivity versus the CO2 working capacity for the EDS (a, b) and for the RDS (c, d), colored by the largest cavity diameter (LCD) (a, c) and the gravimetric surface area (GSA) (b, d).

Fig. 7: The distribution of the largest cavity diameters of 1001 MOFs in the Reduced Design Space (RDS) for different building blocks.
figure 7

Largest Cavity Diameter (LCD) distribution for (a) Nodular BB1 and (b) Connecting BB1 on RDS.

For Nodular BB2 and Connecting BB2, we found that the building blocks lead to minimal differences in the pore sizes (Supplementary Fig. 1), and thus LCD could not be used to explain the latent space. For Nodular BB2, the building blocks have the same shape and differ only in their metal elements {1: Co, 2: Cu, 3: Ni and 4: Zn}. A potential explanation for the latent space is the difference in Lennard-Jones parameters (Supplementary Table 1), in which Zn has an \(\varepsilon\) value of about one order of magnitude larger than the other elements, suggesting a stronger van der Waals interaction for Zn, which could favor CO2 adsorption over N2. As a result, block {4} (or Zn) is far apart from the other designs. Although the chemical identities of the building blocks in Connecting BB2 are the same as in Connecting BB1, Connecting BB2 has small effect on the pore size, and thus the property space. As a result, the points are evenly spread out in the latent space of Connecting BB2. Both the optimization performance and the physically interpretable model obtained from this design study demonstrate the effectiveness of LVGP and BO for further design applications.

Entire design space (EDS)

After confirming the effectiveness of our methodology on the RDS, we applied our framework to the entire design space (EDS) that contains 47,740 MOF candidates with fof topology through combination of 7, 4, 41 and 42 building blocks for Nodular BB1, Nodular BB2, Connecting BB1, and Connecting BB2, respectively. The MOF candidates and their respective properties can be seen in Fig. 8a. The design space contains 7 Pareto front MOF designs of interest. Incorporating our knowledge from previous LVGP implementations32,40 and considering the large number of available building blocks in the design space, we decided to match each edge (Connecting BB2, 42 options) with each metal node (Nodular BB2, 4 options), resulting in a total of 168 MOFs to be selected for the initial DOE. To create this DOE, we used OSLHS once again. After starting the LVGP-MOBBO framework with 168 MOFs, we proceeded by adding batches of \({B}=5\) new MOFs with highest Expected-Maximin Improvement (EMI) values until the design space search campaign reached the stopping criterion. The design optimization campaign stopped when the mean EMI values of the 5 MOFs that are selected for property evaluation in each iteration is less than a constant, \(\delta\), taken as \(\delta ={10}^{-5}\) in our study. We implemented this stopping criterion to account for the fact that in practice, it is not plausible to know if we recovered the entire Pareto front set without exploring all MOF designs in the Pareto front. As a result, the implemented stopping criterion tells us that our metamodels, LVGP, are confident that there can be no further improvements made to the MOF design optimization if more MOF designs are added into the framework.

Fig. 8: Performance of the LVGP-MOBBO on the Entire Design Space (EDS).
figure 8

a The property space of MOF design candidates along with Pareto front and Utopian MOF designs. b Percentage of top-performing MOFs identified after 66 iterations. Numbers on top of bars indicate the amount of identified top-performing MOF designs. c The initial DOE and the identified MOF designs after different numbers of iterations. d The building block representations of Pareto front MOFs and their crystal structures. Each MOF is represented as a vector [A-B-C-D-E], where each letter represents Nodular BB1, Nodular BB2, Connecting BB1, Connecting BB2, Topology respectively.

With the aforementioned stopping criterion, the LVGP-MOBBO design optimization campaign continued for 66 iterations, identifying 498 MOF designs in total, including the initial 168 MOFs. Our results show that by scanning only 1.04% of the entire design space, LVGP-MOBBO identified all MOF designs that lie on the Pareto front. Specifically, as seen in Fig. 8c, all the Pareto front designs are identified within 45 iterations, which corresponds to exploration of only 0.82% of the entire design space. This shows that our methodology is very effective and efficient. Although the initial DOE covers the MOF input design space as evenly as possible, the MOFs in the DOE are not distributed evenly in the property space (Fig. 8c). Figure 8d shows the images of the seven MOFs that lie in the Pareto front (Fig. 8c). The five-dimensional vector representation shows the design choices selected for the Nodular Building Block 1, Nodular Building Block 2, Connecting Building Block 1, Nodular Connecting Block 2, Topology respectively. The choices of building blocks can be found in Fig. 2. For some machine learning and optimization methods, this can be problematic, as we show later with the Random Forest approach. However, our methodology was swift in guiding the design decisions towards MOFs with high properties. Figure 8b shows the result of exploring the different number of top-performing MOF designs that are closest to the Utopian MOF design. The LVGP-MOBBO found all of the 25 top-performing MOFs. Furthermore, our methodology identified more than 97%, 87%, 80% of the 50, 100, 200 top-performing MOF designs, respectively. Finally, out of all 330 MOFs explored, 206 MOF designs (63.3%) belong to the 330 top-performing MOFs. The high efficiency in identifying a large number of solutions is advantageous due to two potential main reasons. First, it is possible that not every proposed MOF can be synthesized in the laboratory. Second, there are other criteria that must be addressed in practice beyond the CO2 working capacity and CO2/N2 selectivity, such as cost and stability. Thus, it is useful to have alternative promising candidates at hand, so that a practical solution can be found.

By looking further into the histogram of selected building blocks at the end of the optimization campaign (Supplementary Fig. 2), we observed a bias towards particular building blocks. Specifically, for Nodular BB2 and Connecting BB1, the blocks {2} and {41} are favored because all the Pareto front MOFs possess these building blocks. Therefore, LVGP-MOBBO can identify the promising building blocks effectively and choose them as the next MOF designs at the very early steps of the optimization campaign. Due to very low amounts of training data used for design optimization (starting with 0.35% and ending with only ~1% of the design space), the overall predictive capability of LVGP models is somewhat limited for the complete design spectrum. The parity plots and the mean absolute error (MAE), described by the mean of absolute difference between the predicted and the true property values of the remaining MOFs (47,242 MOFs) in the EDS that are unseen to the LVGP model, are shown in Supplementary Fig. 3. The relatively high MAE scores obtained from the LVGP models were expected because this framework is design (objective) oriented. For Bayesian optimization to perform well, the LVGP does not have to be accurate for all design candidates41. This is evident in our result because even though LVGP is not an accurate model for global predictive capability, the model is good enough to identify where in the design space to look for the optimal solution. The predictions and prediction uncertainties quantified by the LVGP model are satisfactory in the neighborhood of the optimum design candidates, which led to the promising Pareto solutions observed. The low accuracy in regions of the property space far from the optimum does not have a large effect on the overall performance in identifying top-performing materials.

The interpretability of the LVGP approach can be demonstrated using the results for the entire design space. At the end of the 66 iterations, we observed that the latent spaces of the Nodular BB1 (Fig. 9a, c) and BB2 (Supplementary Fig. 3a, c) converged to a final state. This means that after each iteration, the latent spaces obtained for these design variables did not change. On the other hand, we observed non-convergent latent spaces for Connecting BB1 & BB2 that contain 41 and 42 different design choices, respectively. This is because the LVGP model is trained with a very small percentage of the design space ( ~ 1%). The optimization campaign still works well although the latent spaces are not stable. Specifically, block {41} is always separate from the rest in the latent space plots, meaning that its superior effect on the properties is identified clearly. Furthermore, the non-converging behavior is observed for the blocks that have minimal effect on the performance properties. Therefore, our framework can identify the specific building blocks that are superior with a physical justification using the physics-aware LVGP approach.

Fig. 9: Latent variable plots after the LVGP-MOBBO campaign on the Entire Design Space (EDS).
figure 9

a Latent spaces of the Nodular BB1 for (a) the CO2 working capacity and (c) the CO2/N2 selectivity. Latent spaces of Connecting BB1 for (b) the CO2 working capacity and (d) the CO2/N2 selectivity.

The latent variable plots of the Nodular BB1 (Fig. 9a, c) and Connecting BB1 (Fig. 9b, d) can also be explained using the MOF textural properties. For Nodular BB1, blocks {1, 3, 4, 5} form a cluster in the latent space of the CO2 working capacity, while blocks {2, 6, 7} are spread out (Fig. 9a). A similar trend was observed in the RDS (Fig. 5), which we ascribed to the size of building blocks that determine the LCD of the MOFs. However, the latent space for the CO2/N2 selectivity changed slightly compared to the RDS. Specifically, blocks {3, 4} are away from blocks {1, 5} and become closer to block {6}, while the positions of the other building blocks remain similar. For Connecting BB1 (Fig. 9b, d), block {41} is distant from the other building blocks, which was also observed in the small dataset. Although some building blocks are also further apart from the clusters, their locations change from one iteration to another. The latent variables for the Nodular BB2 and the Connecting BB2 can be found in Supplementary Fig. 3. The Nodular BB2 plots can be explained by a similar reasoning as for the RDS plots, whereas the Connecting BB2 is non-convergent due to low training percentage of the LVGP.

Comparison with random forest and robustness of LVGP-MOBBO

We compared our LVGP approach with another ML approach, Random Forest (RF), which is also used for optimization problems with qualitative variables. To conduct a fair study between two ML models, the same descriptor-free (featureless) MOF representation is used for the inputs to the RF model. Further details of MOF representation are explained in the MOF Representation and Database Construction subsection under the Methods section. Both approaches employed the same MOBBO method defined previously. To conduct the study, we ran both frameworks 15 times on the EDS for 60 MOBBO iterations with 15 different initial DOEs. The Pareto front and top-performing MOF design exploration performance of the study can be seen in Fig. 10. We observed that the LVGP can identify all the Pareto front MOF designs whereas the RF approach fails to do so in most cases (Fig. 10a). The small confidence interval in the performance shows that the LVGP approach is robust and reliable in identifying the Pareto front candidates. On the other hand, the confidence interval of RF is large since some of the RF-MOBBO instances fail to identify any Pareto Front MOF designs. This is because RF-MOBBO is stuck in local optimum designs since the algorithm cannot predict beyond the training data, which usually contained initial DOEs with low properties. In contrast, LVGP was able to expand beyond the low property region towards the high property region by its capabilities of extrapolating beyond training data. Moreover, the Bayesian prediction of uncertainty provided by the LVGP compared to frequentist prediction of RF, leads to better and more effective design space exploration41. The LVGP is able to extrapolate uncertainty well, whereas RF fails to do so, as it provides a fixed uncertainty prediction for MOF designs outside of training data due to unavailable tree splitting. More importantly, the LVGP approach makes the correct design decisions at a faster rate compared to the RF approach, which is crucial if the cost of conducting simulations or experiments is very high. Similarly, for all number of top-performing MOF identification categories, the LVGP approach resulted in a better and more robust performance (Fig. 10b). As we already knew both the Pareto front and the top-performing designs beforehand, the two metrics we used could be considered as greedy metrics. In practice, since these designs are typically unknown, a well-known metric used for multi-objective optimization comparison between different algorithms is the hypervolume indicator42. Hypervolume indicator provides a scalar metric of how much hypervolume the Pareto front designs obtained by the algorithm dominate the reference point, where the reference point is usually chosen as the nadir point (known lowest values of objectives). Typically, larger values represent better Pareto front designs. Figure 10c shows the mean hypervolume indicator values obtained at each iteration of both methods in 15 different runs. We can easily observe that (i) the explored MOF designs by the LVGP approach span a much larger hypervolume at a much faster rate, and (ii) the true hypervolume can be achieved for all 15 different initializations compared to the RF approach.

Fig. 10: Comparative study with Random Forest and LVGP-MOBBO.
figure 10

a Pareto front and b top-performing MOF designs identified by the two methods after 15 different runs. c Hypervolume indicator comparison between two methods after 15 different runs.

The advantages provided by the LVGP approach are not limited to design optimization. The physical justification provided by the latent variables make the LVGP an interpretable and a favorable ML model for MOF design. The latent variables obtained at the end of the optimization campaign enabled us to gain physical insights behind the design decisions. Although RF is an explainable ML model, model agnostic methodologies are required to draw conclusions or understand its performance. Finally, since both models aim to perform design exploration in the most efficient way possible, the models are trained on the fly. As a result, we do not have the luxury to perform hyperparameter tuning, as it could require additional property evaluations, which contradicts our goal with efficient materials design optimization. All hyperparameters in LVGP modeling, including both the hyperparameters for regular GP modeling and the locations of different levels of a categorial variable in the latent variable space, are identified through maximum likelihood maximization with the available explored data. On the other hand, RF models could require external hyperparameter tuning through additional property evaluations, such as identifying the number of trees or depth of the trees in the model. This demonstrates another significant advantage of LVGP over RF or other models that require hyperparameter tuning since LVGP does hyperparameter tuning within itself. Thus, together with the better performance and accuracy results, the interpretability and efficiency of LVGP makes our approach more desirable and meaningful for materials design applications.

Discussion

Due to their versatile and tunable nature, MOFs have very large design spaces, and it is impossible to simulate or perform experiments for every MOF to find the promising candidates for an application of interest. Although numerous ML and high-throughput screening approaches exist, they require either large databases or property-specific descriptors. To tackle these challenges, we presented the LVGP-MOBBO framework to design superior MOFs by only employing qualitative representations of building blocks. The framework presented here provides three main advantages compared to current similar efforts: (i) the framework requires no specific descriptors and only uses the MOF building blocks to perform the adaptive design space search, (ii) the framework is application independent, meaning that it can be applied to any property without the need to select important descriptors for the application of interest, and (iii) the physically justifiable latent variable approach provides interpretability on how each building block influences the resulting performance properties. We demonstrated our framework on a design space with 47,740 MOF candidates. The LVGP-MOBBO successfully identified all Pareto front designs and more than 97% of the 50 top-performing MOF candidate designs by scanning only ~1% of the design space. Compared to Random Forest, LVGP has better performance and robustness, and provides interpretability regarding the design through physically justifiable latent variables. Finally, although we demonstrated our framework on a MOF design space with adsorption properties, LVGP-MOBBO can be applied to any property that requires time consuming simulations such as quantum mechanical calculations.

A key challenge in the presented framework lies in the high number of building blocks. When a large number of blocks are present, although the design optimization campaign works efficiently to identify the top building blocks and MOF designs, the LVGP model struggles to converge to a final latent space due to high number of parameter (latent variables) estimations during model fitting. We expect that by incorporating prior knowledge, when available, into the framework such as assigning prior known distributions to latent variables, the latent variable realizations can be more accurate43. We can also incorporate additional fingerprints, i.e., physical descriptors44,45,46,47, that can further differentiate the MOF candidates from each other to build more accurate LVGP models, which in return can further improve design optimization. For enhancing the original LVGP method, we developed, in our recent work, an approach that combines both categorial variables and physical descriptors to address the many-level challenge when using LVGP48. However, choosing the right fingerprints that can be uniquely mapped to MOF designs is often application specific and requires additional work. Therefore, due to the formulation of the LVGP modeling, the main goal of this paper was to search for MOF candidate designs that are known a prior. As a result, predicting the properties of MOFs with unseen building blocks through LVGP is a promising area of research. Finally, an interesting application of our framework would involve performing materials design and development through autonomous experimentation studies. As there is no human intervention in LVGP-MOBBO, and the experimental inputs can be both qualitative and quantitative, we envision that the method we presented here can help researchers guide their experiments efficiently.

Methods

MOF representation and database construction

To validate our implementation of LVGP-MOBBO, we created a design space using the fof topology, 7 organic nodes (Nodular BB1), 4 inorganic nodes (Nodular BB1), 41 edges (Connecting BB1), and 42 edges (Connecting BB2). The fof topology is a derived net of nbo topology, in which the tetratopic linker in nbo is decomposed into two organic nodes and two edges in fof. Further details on fof topology were discussed in literature49. From the combinations of the building blocks, we created 48,216 hypothetical MOFs. We eliminated 104 MOFs that had poor initial geometries and missing bonds. We performed geometry optimization on the remaining 48,112 MOFs, in which we found and eliminated 372 MOFs that collapsed after the geometry optimization. Therefore, 47,740 MOFs were considered for this study. Among the 47,740 MOFs, at least 8 MOFs were experimentally synthesized (Supplementary Table 2).

Each MOFs with fof topology consisted of an inorganic node, an organic node, and two edge blocks. Thus, to represent each MOF we use a 5-element ‘vector’ representation with integer encoding \([A-B-C-D-E\)], where each letter represents Nodular BB1, Nodular BB2, Connecting BB1, Connecting BB2, and Topology, respectively. For each letter, an integer value is assigned to represent a specific choice of building block and the choices of building blocks can be seen in Fig. 2. In this study, we kept the topology as a fixed variable to keep the design space at a reasonable size for comprehensive validation of the method. The letter “E” can be represented with “fof” or take the value of 1. A visualization of this representation is shown in Fig. 1.

For the initial set of materials, also known as design of experiments (DOE), that initialize the optimization framework, first an optimal Latin hypercube sample with specified number of samples and variables was created. Then, for each qualitative variable, the design space was sliced into \({p}_{i}\) sections, where \({p}_{i}\) represents the number of unique options for each qualitative variable. Each DOE design is assigned to a qualitative variable that falls under the sliced section. This approach enables us to select initial MOF designs that cover the design space as evenly as possible. An example DOE with two qualitative variables (Nodular BB1 & Nodular BB2) that each have 7 and 4 levels is shown in Fig. 3 under the Initial Design of Experiments box.

MOF construction and geometry optimization

MOFs were created using the topologically based crystal constructor (ToBaCCo 3.0)36 software. Geometry optimization was carried out to optimize the unit cell parameters and atomic position using LAMMPS37 with the UFF4MOF38 force field. For each structure, the geometry optimization was performed in a cycle that consisted of two steps, as recommended by Anderson et al.36. The unit cell parameters and atomic positions were first relaxed using a conjugate gradient (CG) algorithm, followed by atomic position relaxation using the FIRE algorithm (we chose a timestep of 0.1 fs). Each minimization converged only when the change in energy from the previous step to the current step divided by the current energy magnitude was less than 10−8 and the forces on atoms were less than 10−8 kcal mol−1 Å. The cycle stopped when the change in energy between the previous cycle and the current cycle was less than 10−8 kcal mol−1.

GCMC simulations

Grand canonical Monte Carlo (GCMC) simulation was carried out using the RASPA package39. Each simulation consisted of 10,000 equilibration cycles and 10,000 production cycles. The Monte Carlo moves used were translation, rotation, insertion, deletion, and random reinsertion. Lennard-Jones (LJ) and Coulombic interactions were used to calculate the energies between non-bonded atoms. LJ parameters between different atom types were computed using the Lorentz-Berthelot mixing rules. CO2 and N2 were modeled as three-site rigid molecules with charges on each site, using the LJ parameters and partial charges from the TraPPE force field50. LJ parameters for the framework atoms were from the Universal Force Field (UFF)51. Previous studies have shown that using UFF parameters for framework atoms can adequately demonstrate the interaction between MOFs and various adsorbates52,53,54,55,56,57. The framework atom partial charges were calculated using the PACMOF (Partial Atomic Charges in Metal-Organic Frameworks) software58. For each MOF, we carried out two GCMC simulations; the first was at the adsorption condition of 1 bar, 313 K, and a bulk molar composition of CO2: N2 = 0.15 : 0.85, and the second was at the desorption condition of 0.1 bar, 313 K, and a bulk molar composition of CO2 : N2 = 0.9 : 0.1.

We used the CO2 working capacity (\(\Delta {N}_{{{\rm{CO}}}_{2}}\)) and the CO2/N2 selectivity at adsorption (\({\alpha }_{{{\rm{CO}}}_{2}/{{\rm{N}}}_{2}}^{{\rm{ads}}}\)) as the criteria to determine top-performing MOFs for CO2/N2 separation. The two properties are defined as follows:

$$\Delta {N}_{{{\rm{CO}}}_{2}}={N}_{{{\rm{CO}}}_{2}}^{a{\rm{ds}}}-{N}_{{{\rm{CO}}}_{2}}^{{\rm{des}}}$$
(1)
$${\alpha }_{{{\rm{CO}}}_{2}/{{\rm{N}}}_{2}}^{{\rm{ads}}}=\frac{{N}_{{{CO}}_{2}}^{{\rm{ads}}}}{{N}_{{{\rm{N}}}_{2}}^{{\rm{ads}}}}\frac{{y}_{{{\rm{N}}}_{2}}^{{\rm{ads}}}}{{y}_{{{\rm{CO}}}_{2}}^{{\rm{ads}}}}$$
(2)

Here, \(\Delta {N}_{{{\rm{CO}}}_{2}}\) is the CO2 working capacity, \({N}_{{{\rm{CO}}}_{2}}^{{\rm{ads}}}\) and \({N}_{{{\rm{CO}}}_{2}}^{{\rm{des}}}\) are the CO2 adsorption loadings at the adsorption and desorption conditions, \({\alpha }_{{{\rm{CO}}}_{2}/{{\rm{N}}}_{2}}^{{\rm{ads}}}\) is the selectivity of CO2 over N2 at adsorption condition, \({N}_{{{\rm{N}}}_{2}}^{{\rm{ads}}}\) is the N2 loading at adsorption, and \({y}_{{{\rm{N}}}_{2}}^{{\rm{ads}}}\) and \({y}_{{{\rm{CO}}}_{2}}^{{\rm{ads}}}\) are the bulk mole fractions of N2 and CO2 at adsorption, respectively. While CO2 working capacity reflects how effective the MOF is at both capturing and releasing CO2, the selectivity determines how selectively the MOF can separate CO2 from the mixture of CO2 and N2.

Latent variable Gaussian process (LVGP)

One of the main contributions of this paper lies in the design optimization of MOFs using only the readily available qualitative representations of building blocks. On the other hand, due to the nature of the correlation functions, it is not possible to directly implement the building blocks into the Gaussian Process (GP) models as the difference between variables becomes unclear. Therefore, in this paper, we implemented the Latent Variable Gaussian Process (LVGP) to account for the qualitative variables in the GP model31. It is known that for every qualitative variable, there are underlying, possibly high-dimensional, quantitative variables that explain its effect on properties. The latent variable approach helps us to map the qualitative variables to a quantitative latent space. Consider a GP model input with \(t=\left[{t}_{1}^{q},{t}_{2}^{q},\ldots ,{t}_{n}^{q}\right]\in {R}^{q\times n}\) with \(n\) qualitative variables and \(q\) number of points, where each point here represents a unique MOF design. Each variable, \(\,{t}_{i}\), has \({p}_{i}\) unique levels (design choices) \({\{l}_{1}({t}_{i}),{l}_{2}({t}_{i}),\ldots {l}_{{p}_{i}}\left({t}_{i}\right)\}\) for \(i=1:n\), (e.g., Cu, Co, Ni options for nodular building block (\({p}_{i}=3\))). Then, each qualitative variable can be represented with a latent variable vector \({z}_{t}({l}_{{p}_{i}})=\{{z}_{t}^{i}({l}_{1}),\ldots ,{z}_{t}^{i}\left({l}_{{p}_{i}}\right)\}\) for \({z}^{i}\epsilon {R}^{k}\). The developers of the algorithm have stated that users are free to choose the dimensions of \({z}^{i}\) but also demonstrated that \(k=2\) is enough to represent the underlying high-dimensional quantitative space. Consequently, we chose \(k=2\). Thus, each level within a qualitive variable can be represented with a latent vector of \({z}_{t}({l}_{{p}_{i}})=\{{z}_{t}^{1}\left({l}_{{p}_{i}}\right),{z}_{t}^{2}\left({l}_{{p}_{i}}\right)\}\), and the input to the GP model becomes \(z\left(t\right)=\left[{z}_{t}^{1}\left({l}_{{p}_{1}}\right),{z}_{t}^{2}\left({l}_{{p}_{1}}\right),\ldots ,{z}_{t}^{1}\left({l}_{{p}_{n}}\right),{z}_{t}^{2}({l}_{{p}_{n}})\right]\in {R}^{q\times 2n}\). An illustration of the latent variable representation of qualitative building blocks is shown in Fig. 3 under the Latent Variable Gaussian Process box. Consider a typical single response Gaussian Process model, which consists of prior constant mean \(\mu\) and \({K}_{Z}(t)\), describing the mean response at any given point in the input space, and a zero-mean Gaussian Process with a covariance function \(K\left(t,{t}^{{\prime} }\right)\), respectively. The covariance function \(K\left(t,{t}^{{\prime} }\right)\) determines the relationship or the correlation between variables in the model. The covariance function can be further extended to \(K\left({t,t}^{{\prime} }\right)={\sigma }^{2}\cdot c\left(t,{t}^{{\prime} }\right)\) where the \({\sigma }^{2}\) represents the prior variance of the GP model and \(c\left(t,{t}^{{\prime} }\right)\) describes the correlation between each point in the model through the specified correlation function. To explain the relationship between each design candidate for this application, we have implemented the Gaussian correlation function:

$$c\left(t,{t}^{{\prime} }\right)=\exp \left\{-\mathop{\sum }\limits_{i=1}^{p}\left({{||}{z}_{t}^{1}\left({l}_{{p}_{i}}\right)-{z}_{{t}^{{\prime} }}^{1}\left({l}_{{p}_{i}}\right){||}}_{2}^{2}+{{||}{z}_{t}^{2}\left({l}_{{p}_{i}}\right)-{z}_{{t}^{{\prime} }}^{2}\left({l}_{{p}_{i}}\right){||}}_{2}^{2}\right)\right\}$$
(3)

The Gaussian correlation function shown in Eq. (3) evaluates the correlation between points \({t}\) and \({t\text{'}}\) based on 2-norm distance. The main reason behind choosing this correlation function is because we assume that points that are close in the spatial input space should also reflect a similar behavior in the output space as well. Along with the 2D mapped latent variables \(z=\left({z}_{t}^{1}\left(l\right),{z}_{t}^{2}\left(l\right)\right)\) for level \(l\) of each qualitative variable \(t\), the parameters\(,\mu\) and \(\sigma\) are estimated through Maximum Likelihood Estimation (MLE) of the log-likelihood function

$$l\left(\mu ,\sigma ,z\right)=-\frac{q}{2}\mathrm{ln}\left(\sigma \right)-\frac{1}{2}\mathrm{ln}\left|C\left(z\right)\right|-\frac{1}{2{\sigma }^{2}}{\left({\boldsymbol{y}}-\mu {\bf{1}}\right)}^{T}C{\left(z\right)}^{-1}\left({\boldsymbol{y}}-\mu {\bf{1}}\right)$$
(4)

where \(q\) is the number of samples, \(C\) is the \(q\times q\) correlation matrix with \({C}_{{ij}}=c\left({t}^{i},{t}^{j}\right)\) for \(i,j=\mathrm{1,2,3},\ldots ,q\), \({\boldsymbol{1}}\) is a vector of ones with dimensions of \(q\times 1\), and \({\boldsymbol{y}}\) is the observed response with dimensions of \(q\times 1\). Finally, the 2D quantitative latent variables are then used to construct a GP model that provides both prediction and statistical representation of uncertainty in the design space for Bayesian optimization.

Multi-objective batch bayesian optimization (MOBBO)

Bayesian Optimization is a well-known efficient, fast, and easy-to-implement optimization technique that has been used in numerous materials design applications. For single objective optimization, BO makes the decision on which design in the design space should be sampled next based on the choice of an acquisition function. Three well-known acquisition functions are Lower Confidence Bound59, Probability of Improvement60, and Expected Improvement (EI)61. With its ability to balance exploration of the design space and exploitation of the objective, EI has been a popular choice for most materials design applications. Considering the large MOF design space, we have also chosen EI as our base acquisition function. For a given candidate design \(x\), with its predicted objective value \({y}_{x}^{{\prime} }\) and quantified uncertainty \({\sigma }_{x}\) from the LVGP model, EI for single objective optimization can be calculated using,

$${EI}\left(x\right)=\left({{\rm{y}}}^{* }-{y}_{x}^{{\prime} }\right)\cdot \psi \left(\frac{\left({{\rm{y}}}^{* }-{y}_{x}^{{\prime} }\right)}{{\sigma }_{x}}\right)+{\sigma }_{x}\cdot \phi \left(\frac{\left({y}^{* }-{y}_{x}^{{\prime} }\right)}{{\sigma }_{x}}\right)$$
(5)

where \({y}^{* }\) is the best observed objective so far in the optimization campaign and \(\psi ,\phi\) represent the cumulative distribution function (CDF) and probability distribution function (PDF), respectively. As Eq. (5) shows, the EI function suggests a new design by not only considering the exploitation of the objective function, \(({y}^{* }-{y}_{x}^{{\prime} })\), but also the uncertainties associated with the design space, \({\sigma }_{x}\).

Often there are tradeoffs between objectives, meaning that one objective cannot be optimized without sacrificing the other one. This type of problem is also known as Pareto front optimization and is frequently observed in material systems34. Thus, for multi-objective optimization problems, the goal becomes discovering the Pareto front of the property space. Therefore, we have expanded the EI formulation by implementing the Expected-Maximin Improvement (EMI) acquisition function to serve as the balancer of the exploration and exploitation for multi-objective optimization. For the case of optimizing two objectives, the formulation of EMI is

$${EMI}\left({x}_{k}\right)=\min \left(\max \left({\boldsymbol{E}}{{\boldsymbol{I}}}_{{\boldsymbol{1}}},{\boldsymbol{E}}{{\boldsymbol{I}}}_{{\boldsymbol{2}}}\right),0\right)$$
(6)

where \({{EI}}_{j}\) corresponds to the Expected Improvement value of each objective \(j.\)The EI formula was used to compare the candidates with respect to the observed number of \(p\) Pareto front designs so far in the optimization campaign. Therefore, each \(E{I}_{j}\) is a vector of \(p\times 1\) that contains the EI values of a candidate design on the observed Pareto front designs for each objective\(.\)Lastly, we first take the maximum of EI’s for both objectives to observe the dominance of the candidate on the current Pareto frontier and then select the minimum of the maximum EIs to balance the multi-objective search. As a result, the EMI is formulated in a way that both objectives have equal importance. Eq. (6) selects the single best candidate in each multi-objective BO iteration.

Due to the large number of candidate designs and the cost of training GP models, it is not ideal to train the LVGP with a single design candidate at each iteration. Therefore, to extend single candidate BO to select a batch of promising candidates, we select \(B\) candidates that possess the highest EMI values in each iteration and use them as the next design candidates. A demonstration of a single MOBBO iteration is demonstrated in Fig. 3 under the Multi-Objective Batch Bayesian Optimization box.