Predicting Materials Properties Using a 3D Graph Neural Network with Invariant Representation

Accurate prediction of physical properties is critical for discovering and designing novel materials. Machine learning technologies have attracted significant attention in the materials science community for their potential for large-scale screening. The Graph Convolutional Neural Network (GCNN) is one of the most successful machine learning methods because of its flexibility and effectiveness in describing 3D structural data. Most existing GCNN models focus on the topological structure but overly simplify the three-dimensional geometric structure. However, in materials science, the 3D spatial distribution of atoms is crucial for determining the atomic states and interatomic forces. This paper proposes an adaptive GCNN with a novel convolution mechanism that simultaneously models atomic interactions among all neighboring atoms in three-dimensional space. We apply the proposed model to two distinctly challenging materials property prediction problems. The first is Henry's constant for gas adsorption in Metal-Organic Frameworks (MOFs), which is notoriously difficult because of its high sensitivity to atomic configurations. The second is the ion conductivity in solid-state crystal materials, which is difficult because few labeled data are available for training. The new model outperforms existing graph-based models on both datasets, suggesting that the critical three-dimensional geometric information is indeed captured.


I. INTRODUCTION
Accurate and rapid prediction of physical properties plays an essential role in discovering and designing novel materials. Ab initio methods [1], [2] are commonly used to predict materials properties. However, ab initio methods, such as Density-Functional Theory (DFT) [3] and wave function theories [4], are time-consuming. An alternative is to predict these properties using machine learning (ML) technologies [5]. By modeling the structures of materials and atomic interactions, ML algorithms are much faster than DFT calculations yet provide similar accuracy [6].
Although ML techniques have succeeded in many fields, such as computer vision and natural language processing, successfully applying them in materials science remains significantly challenging [7]. The atomic structures of materials are complex three-dimensional spatial structures with an unfixed number of atoms. Classic machine learning models, such as the Convolutional Neural Network (CNN), are better at processing fixed-size data. In contrast, graph-based models, such as the Message Passing Neural Network (MPNN) [8] and Graph Convolutional Neural Networks (GCNN) [9], have the advantage of flexibility in describing variable-size data. Moreover, the message passing (for MPNN) and convolution (for GCNN) can effectively simulate the atomic interactions in materials. Recently, graph-based models have attracted significant attention because of their flexibility and effectiveness in describing 3D structures formed by discrete particles [10]–[13].
Although the model's effectiveness has been well demonstrated, current GCNNs have some shortcomings. A GCNN describes the structure of a material with an undirected graph, where nodes represent the atoms and edges express the relationships between bonded atoms. The graph-based description is concise and efficient but ignores the material's three-dimensional geometry, which critically impacts its properties. In addition, the GCNN updates the state of each atom by accumulating the impacts of the surrounding bonded atoms. This design is based on the assumption that the surrounding atoms influence the current atom independently. In reality, however, the surrounding atoms exert their influence on the current atom simultaneously; a model that is truer to reality should be more effective [14].
This paper presents the k-Nearest Atoms Graph Convolution Network (k-NAGCN) for predicting the physical properties of materials. The proposed model improves the accuracy of property prediction by introducing three-dimensional topological structure information. We also designed an innovative convolutional structure to simultaneously calculate the influence of the surrounding atoms on the current atom. In addition, we developed an algorithm to calculate translation- and rotation-invariant descriptors for the local atomic distribution. We tested the new model on two distinctly challenging materials property prediction problems. The first is Henry's constant for gas adsorption in Metal-Organic Frameworks (MOFs), which is notoriously difficult because the results are highly sensitive to interatomic separations. The second is the prediction of ion conductivity in solid-state crystal materials, which is difficult due to the lack of labeled data available for training. The k-NAGCN model presents superior performance on both datasets, suggesting that it indeed captures important three-dimensional geometric information. The contributions of this paper are as follows.
• Proposed an end-to-end system for predicting the physical properties of materials.
• Modeled the three-dimensional materials structures with a graph-based model and proposed a novel convolution mechanism that better simulates physical laws.
• Developed a rotation- and translation-invariant descriptor to model the local topology of atoms.

II. RELATED WORK
The emergence of large-scale materials datasets [15]–[17] greatly empowers the application of novel machine learning technologies in materials science [18]. The representation of materials plays a critical role in building an effective machine learning system, and it is challenging due to the complexity of chemical structures. One strategy is to describe the materials using a group of physical properties. Cubuk et al. [19] trained a support vector machine (SVM) based on thirty features and then used transfer learning to obtain a formula-based predictor. The formula-based predictor offered significant efficiency advantages, being able to screen billions of materials in a short period. Allam et al. [10] employed a DFT modeling approach to predict basic quantum mechanical quantities; the predicted features were combined with structural features to train a neural network for predicting molecular electrode materials. Feature-based methods rely on expertise in materials science, and their performance also depends on the effectiveness of the chosen features. With large-scale data available, researchers in recent years have tended to use data-driven approaches that make greater use of low-level structural information. Hoffmann et al. [20], for instance, constructed a smooth and continuous three-dimensional density representation of each crystal based on the atomic positions and designed an autoencoder/decoder network to propose novel materials. The method benefits from detailed structural information but requires high computational power. Zheng et al. [21] represented the atomic configuration as a two-dimensional periodic table and normalized the table to mimic a digital image. They trained a multitask Convolutional Neural Network (CNN) model that outputs the lattice parameters and enthalpy of formation simultaneously. The experimental results showed that their method achieved competitive performance on ICSD [15] and OQMD [17]. Schütt et al.
[12] also described materials using periodic tables. The proposed architecture, SchNet, used continuous-filter convolutional layers to model local correlations and yielded excellent performance in predicting a group of molecular properties.
Graph-based models can handle variable-size inputs and thus have attracted much attention in chemistry. Xie et al. [13] proposed the Crystal Graph Convolutional Neural Network (CGCNN), which described crystals using undirected graphs whose vertices and edges correspond to atoms and bonds, respectively. CGCNN captured the atomic interactions using graph convolution layers, and a pooling layer embedded the graph into the feature space. The CGCNN model showed excellent performance on various properties, including formation energy, bandgap, Fermi energy, and elastic properties. Sanyal et al. [22] adapted the original CGCNN using multitask learning [23], and their experiments showed that sharing parameters in the lower layers improved the prediction performance.
Domain knowledge can be integrated into data-driven models to improve performance. Chen et al. [24] developed a universal model to predict the physical properties of both molecules and crystals by incorporating state variables, such as temperature, pressure, and entropy. Lee and Asahi [25] used transfer learning, pretraining a GCNN model on big data and then using it to predict target properties with relatively small data. Cho et al. [26] added a relative position matrix into the model and provided a mechanism to incorporate it into the feature update process. The proposed model, 3DGCN, showed good performance in predicting molecular properties and biochemical activities.

III. METHOD
This section explains the details of the proposed k-NAGCN. The model accepts the structure of a material as input. After modeling the structure, convolution operations are performed on the graph; finally, the features of each node are aggregated, and the target attributes are estimated. The following subsections discuss the graph-based structure representation, the convolution operations, and the algorithm that produces translation- and rotation-invariant structural descriptors in three-dimensional space.

A. GRAPH-BASED REPRESENTATION
To capture the three-dimensional topology of each material, we used the following graph-based representation: a material containing N atoms was represented by an undirected graph G = (V, E), where the vertices V = {v_1, ..., v_N} represent the atoms, and the edges E = {e_ij | i, j ∈ {1, ..., N}, i ≠ j} represent the connections between atoms (not necessarily bonds, as some atom pairs in our model may not be chemically bonded). Each vertex in the graph stores two pieces of information: the atomic feature of the atom and its spatial location in X. Each edge stores one edge feature (see Figure 1(a)).
We chose a hyperparameter M. For each type of element in the dataset, we used an M-dimensional feature vector to store the prior chemistry knowledge about the element. Such domain knowledge about the elements is independent of the structures of the materials. These vectors were calculated using the adaptive 'atom2vector' method [27]. With L element types in the datasets, these feature vectors form an L × M matrix, denoted by F, whose j-th row is the feature vector of element type j. If the atom at vertex v_i of the graph is an element of type j, the j-th row vector of F is used as the feature vector of vertex v_i. Technically, this was done by using one-hot encoding [28] to represent the element type: we construct an N × L matrix A = (a_ij) such that a_ij = 1 if the i-th atom in the input structure is of element type j, and 0 otherwise. Then, the i-th row vector of the product matrix AF gives the atomic feature of the i-th vertex.
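As a concrete illustration, the one-hot lookup above can be sketched in a few lines of NumPy. The matrix F below is filled with random numbers standing in for the atom2vector features, and all sizes are hypothetical.

```python
import numpy as np

# Hypothetical sizes: L element types, M-dimensional element features, N atoms.
L, M, N = 4, 8, 6
rng = np.random.default_rng(0)
F = rng.normal(size=(L, M))            # row j = feature vector of element type j
types = rng.integers(0, L, size=N)     # element type of each of the N atoms

# One-hot encoding A (N x L): a_ij = 1 iff the i-th atom is of element type j.
A = np.zeros((N, L))
A[np.arange(N), types] = 1.0

# Row i of A @ F is the atomic feature vector of the i-th vertex.
atom_features = A @ F
```

In practice the one-hot product reduces to a row lookup, so `F[types]` yields the same matrix without materializing A.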
The spatial distribution of the atoms in the material was represented by the N × 3 coordinate matrix X, in which the i-th row contains the atomic coordinates of the i-th atom relative to the unit cell of the material.
In most graph-based representations, an adjacency matrix is used to describe the chemical bonds. Our model replaced the adjacency matrix with a distance matrix because k-NAGCN calculates the graph convolution over each atom and its k nearest neighbors, whether the atoms are chemically bonded or not. The distance matrix is the N × N matrix E = (‖e_ij‖), where ‖e_ij‖ is the length of the edge e_ij, i.e., the distance between the i-th atom and the j-th atom in the material.
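A minimal NumPy sketch of the distance matrix and the k-nearest-neighbor lookup follows. The coordinates are toy values, and for brevity the sketch ignores periodic boundary conditions, which a crystal implementation would have to handle (e.g., with the minimum-image convention).

```python
import numpy as np

# Toy coordinate matrix X (N x 3); in the paper these are atomic positions.
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])

# Distance matrix E, with E[i, j] = ||x_i - x_j||.
diff = X[:, None, :] - X[None, :, :]
E = np.linalg.norm(diff, axis=-1)

# Indices of the k nearest neighbors of each atom (excluding the atom itself,
# which always sorts first with distance 0).
k = 2
knn = np.argsort(E, axis=1)[:, 1:k + 1]
```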

B. K-NEAREST NEIGHBOR CONVOLUTION
The graph convolution has the most significant impact on the performance of a graph convolutional neural network. This subsection explains the proposed convolution mechanism, which involves all of the nearest neighboring atoms simultaneously.
With the material represented by the graph G = (V, E), each vertex v_i is associated with the atomic feature f_i and coordinates x_i, and each edge e_ij is associated with the edge feature vector e_{i,j}, a 32-dimensional vector parameterized by pre-chosen numbers α_1, α_2, ..., α_32. Note the slight notational distinction between the edge e_ij and its feature vector e_{i,j}.
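The exact definition of the 32-dimensional edge feature is not reproduced above. In CGCNN-style models, a common choice is a Gaussian (radial basis) expansion of the interatomic distance, with the pre-chosen numbers playing the role of basis centers; the sketch below assumes that form, and the width parameter `gamma` is likewise an assumption, not a value from the paper.

```python
import numpy as np

def edge_feature(distance, centers, gamma=10.0):
    """Expand a scalar distance into Gaussian basis responses.

    `centers` stands in for the pre-chosen numbers alpha_1..alpha_32;
    `gamma` is an assumed width parameter. This is a sketch, not the
    paper's exact definition."""
    return np.exp(-gamma * (distance - centers) ** 2)

alphas = np.linspace(0.0, 8.0, 32)   # assumed basis centers, in angstroms
e_ij = edge_feature(2.5, alphas)
```

The response peaks at the center closest to the input distance, giving the network a smooth, distributed encoding of ‖e_ij‖.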
The atomic feature f_i associated with the i-th atom is updated according to equation (1),

f_i ← Conv(f_i, x_i^out, {(e_{i,j}, f_j, x_j^out) | j ∈ K_i}),   (1)

where Conv(·) is a combination of a linear transformation and a nonlinear activation function applied to the concatenated inputs, K_i is the index set of the k nearest vertices of the i-th vertex, and x^out denotes the transformed coordinates calculated using Algorithm 1. After the coordinate transformation, x_i^out goes through a linear embedding layer, and the output is added to the original atomic feature f_i.
The hyper-parameter k determines how many neighboring atoms are involved in the convolution. The proposed convolution considers the impacts of the k nearest neighboring atoms simultaneously and is thus a better simulation of real-world atomic interactions. Figure 1(b) presents the process of the convolution in k-NAGCN for k = 2.
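To make the mechanism concrete, here is a minimal NumPy sketch of one k-nearest-neighbor convolution step: each atom's feature is concatenated with the edge features, neighbor features, and locally transformed coordinates of its k nearest neighbors, then passed through a linear map and a ReLU. The weight shapes, the choice of ReLU, and the exact concatenation order are assumptions based on the description above, not the authors' implementation.

```python
import numpy as np

def knn_convolution(f, x_local, e_feat, knn, W, b):
    """One sketched k-NAGCN update.

    f:       (N, d) atomic features
    x_local: (N, k + 1, 3) invariant coordinates of each atom (row 0) and
             its k neighbors, expressed in the atom's own canonical frame
    e_feat:  (N, N, m) edge feature vectors
    knn:     (N, k) indices of each atom's k nearest neighbors
    W, b:    parameters of the linear map inside Conv(.)
    """
    N, k = knn.shape
    out = np.empty((N, b.shape[0]))
    for i in range(N):
        parts = [f[i], x_local[i, 0]]
        for slot, j in enumerate(knn[i]):
            parts += [e_feat[i, j], f[j], x_local[i, slot + 1]]
        z = np.concatenate(parts)            # concatenated Conv(.) input
        out[i] = np.maximum(W @ z + b, 0.0)  # linear map + ReLU activation
    return out

# Tiny example with hypothetical sizes: N = 4 atoms, d = 2, m = 3, k = 2.
rng = np.random.default_rng(0)
f = rng.normal(size=(4, 2))
x_local = rng.normal(size=(4, 3, 3))
e_feat = rng.normal(size=(4, 4, 3))
knn = np.array([[1, 2], [0, 2], [0, 1], [0, 1]])
in_dim = 2 + 3 + 2 * (3 + 2 + 3)             # length of the concatenated input
W, b = rng.normal(size=(5, in_dim)), np.zeros(5)
updated = knn_convolution(f, x_local, e_feat, knn, W, b)
```

Because every neighbor contributes to a single concatenated input, the linear map mixes all k neighbors at once rather than summing independent pairwise messages.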

C. ROTATIONAL INVARIANT LOCAL DESCRIPTOR
The convolution proposed in subsection III-B is sensitive to the order of atoms and to translation. In this subsection, we propose translation- and rotation-invariant descriptors for the local atomic distribution.
It is vital to keep the atom sequence identical when the local structure rotates or translates. Assume k = 2 and denote the two neighboring vertices of vertex v_i by v_j and v_k. The input of Conv(·) is the concatenated vector (f_i, X_i, e_{i,j}, f_j, X_j, e_{i,k}, f_k, X_k). If we rotate the geometric structure composed of v_i, v_j, and v_k along the axis that passes through v_i and the midpoint of v_j and v_k until v_j and v_k switch places, the chemical feature of the atom v_i does not change after the rotation, but the input of Conv(·) becomes (f_i, X_i, e_{i,k}, f_k, X_k, e_{i,j}, f_j, X_j). Inspired by the Scale-Invariant Feature Transform (SIFT) [29], which is widely used in computer vision, we introduce the following algorithm (Algorithm 1) to generate input sequences for Conv(·) that are invariant under translation and rotation.
For each atom v_i (i ∈ {1, ..., N}) in the structure, we calculate the distance between v_i and the remaining atoms {v_j} (j ≠ i), where N is the total number of atoms in the structure. We then sort all v_j according to distance and keep the k nearest atoms, where k is a hyper-parameter. Denote by x_i the position of atom v_i, and by x_i^1 and x_i^2 the positions of the closest and second-closest atoms to x_i. We first apply a translation T that moves x_i to the origin. Next, we calculate a matrix R_1 that rotates x_i^1 onto the positive z-axis according to Rodrigues' rotation formula [31]. Then, we calculate a matrix R_2 that rotates x_i^2 into the yz-plane, keeping x_i^1 fixed. Finally, we obtain the new coordinates X_i^out for every atom in the k-nearest neighborhood. The full procedure is listed in Algorithm 1; the distance-based sorting and the canonical rotations guarantee rotational invariance.

Algorithm 1 Translation- and rotation-invariant coordinate transformation
1: for each atom v_i, i ∈ {1, ..., N} do
2: Calculate the distances between v_i and all other atoms v_j (j ≠ i)
3: Sort v_j based on distance and get the k nearest neighbors
4: Transform the origin of the coordinate system to x_i
5: Calculate the transform matrix R_1 that rotates x_i^1 to [0, 0, |x_i^1|]^T using Rodrigues' rotation formula
6: Update the coordinates by X_i^{R_1} = R_1 × X_i
7: Calculate the transform matrix R_2 that rotates x_i^2 to the yz-plane, keeping x_i^1 fixed
8: Update the coordinates by X_i^out = R_2 × X_i^{R_1}
9: end for
10: return X^out
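The transformation above can be sketched directly with NumPy. The helper below is a plausible implementation, not the authors' code: it translates the center atom to the origin, rotates the nearest neighbor onto the positive z-axis with Rodrigues' formula, and then rotates about z until the second-nearest neighbor lies in the yz-plane.

```python
import numpy as np

def rodrigues(axis, theta):
    """Rotation matrix for angle theta about a unit axis (Rodrigues' formula)."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def invariant_transform(x_center, neighbors):
    """Return neighbor coordinates in a canonical rotation- and
    translation-invariant frame (assumes no exact distance ties)."""
    # Translate the center atom to the origin and sort neighbors by distance.
    rel = neighbors - x_center
    rel = rel[np.argsort(np.linalg.norm(rel, axis=1))]

    # R1: rotate the nearest neighbor onto the positive z-axis.
    v1 = rel[0] / np.linalg.norm(rel[0])
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(v1, z)
    if np.linalg.norm(axis) < 1e-12:          # already (anti-)parallel to z
        R1 = np.eye(3) if v1[2] > 0 else rodrigues(np.array([1.0, 0.0, 0.0]), np.pi)
    else:
        R1 = rodrigues(axis, np.arccos(np.clip(v1 @ z, -1.0, 1.0)))
    rel = rel @ R1.T

    # R2: rotate about z (leaving the first neighbor fixed) so the second
    # neighbor lands in the yz-plane with a non-negative y component.
    phi = np.arctan2(rel[1, 0], rel[1, 1])
    rel = rel @ rodrigues(z, phi).T
    return rel
```

Because distances and the resulting sort order are preserved under any rigid motion of the inputs, the returned coordinates are identical for the original structure and any rotated or translated copy of it.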

IV. EXPERIMENTS

A. EXPERIMENTAL DATA
The proposed k-NAGCN was tested on two experimental datasets. The first dataset, CoRE2019, contains 12763 MOFs, and the property of interest is Henry's constant [32]. Appendix A presents the details of the calculation of Henry's constant.
The proposed algorithm was also tested with a subset of the ICSD database built for ion-conductivity prediction. The properties of interest in the ICSD data include the minimum bond length, large Li site, and percolation radius. In detail, 1758 crystals were available for bond-length prediction, 2818 for percolation-radius prediction, and 894 for Li-site prediction. These properties were calculated through topological analysis and can be used as indicators of high-potential conductors. Readers can refer to [33] for more details on the calculation of these properties.

1) Data cleaning for CoRE2019.
The original CoRE2019 MOF database contains 12763 MOF structures, and the range of the Henry's constant covers more than 30 orders of magnitude. We examined the structure files and found that the broad range of Henry's constants was caused by unreasonable structures. In a small portion of MOFs, atoms were unreasonably close to each other, and the overlapping atoms led to high attraction and a high Henry's constant. Meanwhile, another group of MOFs had tightly confined geometries, and these thin structures resulted in extremely low Henry's constants (on the order of 10^-10).

This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited, and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3181750. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
We removed MOFs with atomic distances smaller than 0.8 Å to avoid overlapping atoms. We also filtered out the MOFs with extremely low Henry's constants by removing structures with a surface area of less than 50 Å² (meaning that there is very little space and surface adsorption is very unlikely to happen). The cleaned dataset contains 10136 MOFs. The details of the above sample-selection process can be found in Appendix B.

2) Value transformation of Henry's Constant
The Henry's constants still covered several orders of magnitude. During the backpropagation of the gradient, such extreme values would cause gradient explosions and make training infeasible. Existing research [34] predicted the value log10(H_cc) instead of the widely varying original value.
In this paper, we transformed the original value to log10(H_cc) and log10(H_cc + 1). The two terms are close for large H_cc, but the latter puts less weight on materials with low Henry's constants. The rationale for introducing log10(H_cc + 1) is that materials with high Henry's constants are favored for gas adsorption, while materials with extremely low Henry's constants are not of practical interest.
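A quick numeric check illustrates why log10(H_cc + 1) behaves this way; the values below are arbitrary examples, not dataset entries.

```python
import math

# For a large Henry's constant, the two transforms nearly coincide.
h_big = 1.0e6
assert abs(math.log10(h_big) - math.log10(h_big + 1)) < 1e-5

# For a tiny Henry's constant, log10(H) plunges toward -10 while
# log10(H + 1) stays near 0, de-emphasizing impractical materials.
h_small = 1.0e-10
assert abs(math.log10(h_small) + 10) < 1e-9
assert 0 < math.log10(h_small + 1) < 1e-9
```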
3) Experimental setup and parameter selection
10-fold cross-validation was used throughout the experiments to ensure the fairness of the evaluation. In detail, samples were randomly divided into a training set (80%), a validation set (10%), and a testing set (10%). The network was trained using the training set, and the performance was evaluated on the testing set. The validation set was used to select hyperparameters during the training process, such as the numbers of training epochs and convolution layers. The process was repeated ten times, and the performance was evaluated using the average Mean Absolute Error (MAE).
The following parameters were shared across all tasks. The batch size was set to 32 to avoid overfitting. Adam was used as the optimizer. The initial learning rate was set to 0.0003 and was reduced by a factor of 0.1 every 50 epochs.
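The learning-rate schedule amounts to a simple step decay; the one-line helper below reproduces it. (Equivalently, PyTorch's `StepLR` with `step_size=50, gamma=0.1` could be used — that equivalence is our reading, not stated in the paper.)

```python
def learning_rate(epoch, base_lr=3e-4, factor=0.1, step=50):
    """Step decay: the learning rate is cut by 10x every 50 epochs."""
    return base_lr * factor ** (epoch // step)
```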
The experiments were performed on a Microsoft Azure virtual machine with 24 vCPUs (2.5 GHz), 112 GB of RAM, and one V100 GPU (5120 CUDA cores and 32 GB of RAM).

C. RESULTS
We first predicted H_cc on CoRE2019. All 10136 MOFs were used in the experiments, and both log10(H_cc) and log10(H_cc + 1) were predicted. The performance was compared with two graph-based models designed for materials property prediction, CGCNN [13] and 3DGCN [26]. MT-CGCNN [22] was not included in the comparison because our CoRE2019 and ICSD datasets do not include multiple labels, and MT-CGCNN without multiple labels is equivalent to CGCNN. Both CGCNN and 3DGCN were implemented according to the original literature and adapted to fit the CoRE2019 data. For CGCNN, the network structure included one feature embedding layer, three graph convolution layers, and three fully connected layers, chosen based on the performance on the validation data. For 3DGCN, max-pooling was used because the original literature reported that max-pooling outperformed average pooling; our experiments confirmed this on the CoRE2019 and ICSD datasets. The k-NAGCN contained three graph convolution layers and two fully connected layers, and average pooling was used. 10-fold cross-validation was used for all three models. The performance was evaluated using three metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the R² score. The results are summarized in Table 1.
k-NAGCN outperformed CGCNN and 3DGCN on H_cc prediction for CoRE2019, with the lowest MAE and MAPE and the highest R² score. This result supports the two hypotheses made in the previous section. First, 3DGCN and k-NAGCN achieved better results in H_cc prediction than CGCNN, indicating that 3D topology plays an essential role in property prediction. Second, k-NAGCN with the new convolution mechanism achieved a lower MAE than 3DGCN, indicating that the proposed convolution mechanism better models realistic atomic interactions. All three approaches achieved smaller MAE on log10(H_cc + 1) than on log10(H_cc). This result was predictable because log10(H_cc + 1) has smaller variation than log10(H_cc). Since both transformations are one-to-one mappings, log10(H_cc + 1) is the better choice for the H_cc prediction task.
An ablation study was designed to evaluate the impacts of the new convolution mechanism and the 3D spatial information. The ablation study gradually removed the coordinate transformation, the spatial information, and the nearest-neighbor convolution from k-NAGCN (the original CGCNN can be seen as k-NAGCN with all three components removed) and compared the results with 3DGCN. The ablation study was conducted on the ICSD dataset, including the prediction of large Li-site, percolation radius, and minimum bond length. As before, MAE, MAPE, and R² scores were used to evaluate the performance. The results are summarized in Table 2.

According to [33], the enlarged Li-site originates from the large local space within the crystal structural framework, and the percolation radius is defined as the maximum radius of a sphere that can percolate across at least one direction of the structure. Both definitions involve three-dimensional topological information. The results of the ablation study showed that the k-NAGCN algorithm benefited from the involvement of spatial information and outperformed the models without spatial information. The 3DGCN model also performed better than the CGCNN model, which uses no spatial information.
We then compared the k-NAGCN without spatial information against the CGCNN model. The proposed convolution mechanism, which models all neighboring atoms at once, better reflects the physical laws and thus outperformed the CGCNN model, which only models pairwise atomic interactions. The k-NAGCN model without spatial information also outperformed, by a small margin, the 3DGCN model that contains spatial information.
Notably, the k-NAGCN without the coordinate transformation obtained the worst results among all competitors. This result reveals that spatial information helps the performance only when the order of atoms is invariant to rotation and translation.
For a visual presentation of the results, Figure 2 gives scatter plots of the predictions: Figure 2a presents the predictions of H_cc on the CoRE2019 dataset, and Figures 2b, 2c, and 2d present the predictions of minimum bond length, percolation radius, and large Li-site on the ICSD dataset, respectively.
The k-NAGCN model also shows good convergence and avoids overfitting well (see Figure 3).
The number of convolution layers is an important parameter that may impact the performance of the algorithm. However, the experimental results revealed that the number of convolution layers has a negligible effect on the results (see Figure 4). We also found that even the most simplified structure (two convolutional layers and one fully connected layer) fit the training data properly. We therefore used three convolutional layers and one fully connected layer throughout the rest of the experiments, because adding more convolutional layers led to overfitting the training data.
One crucial hyper-parameter is the number of neighboring atoms k used in the convolution (see equation (1)). The parameter k is vital because it determines the scale of the local structure, similar to the convolutional kernel size in a CNN.
To evaluate the impact of k, we repeated the experiments using the same setup with different scale parameters, namely k = 4, 8, 12, 16, and 32. The best average MAE achieved with each k is summarized in Figure 5.
The MAE was significantly larger when only a few neighboring atoms (e.g., four) were involved in the convolution. When k was too small, the convolution did not consider the target atom's whole local environment, and the model underperformed. The MAE was reduced as k increased and more atoms entered the calculation. However, beyond a certain neighborhood size (e.g., 12 to 16), further increments did not improve the performance, and the performance decreased significantly when k was increased to 32. There were two reasons for this deterioration. First, the convergence was affected by the increased number of parameters. Second, the atomic features were over-smoothed when the scale was too large.
One concern about the proposed method is the computational cost, which can be divided into two parts: graph building and the convolution operation. Graph building includes the data import, distance calculation, and coordinate transformation. The coordinate transformation is time-consuming and increases the total computational cost of k-NAGCN. However, graph building is a one-time computation and does not affect the training and inference time. We measured the running times of CGCNN and k-NAGCN on our platform (see Table 3). The k-NAGCN algorithm consumed more time in the graph-construction process for both the CoRE2019 and ICSD datasets.
We calculated the computational complexity of the graph convolutional layers of k-NAGCN and CGCNN separately. The batch size was set to 32, and the atom and edge feature lengths were set to 64 and 41, respectively. The average number of atoms per structure was set to 50, and the number of nearest neighbors was 12. The calculation ignores the activation functions and normalization layers.

V. DISCUSSION
The graph-based representation is concise and helpful in describing variable structures, and it effectively handles most materials analysis tasks. However, the undirected graph can hardly carry the higher-dimensional topology, which is vital to understanding the interactions of the atoms. This paper develops the k-NAGCN model, which incorporates three-dimensional topological information.
The three-dimensional information makes the process more complex than operating on the graph alone. One core problem is that the operation should be rotation invariant in three-dimensional space. Inspired by the classic SIFT algorithm, we propose a new descriptor that is rotation invariant. Moreover, both [26] and [20] reported reduced computational efficiency after taking 3D information into consideration. The additional computational cost of the new convolution depends on the total number of atoms n in the structure and the neighborhood scale s, which is a constant; thus, the computational complexity of the descriptor calculation is O(s × n). This cost does not affect the forward/backward process and has little impact on the training process. The concept of scale is one significant difference between existing models and the new model. Most existing methods loop over all chemically bonded atom pairs and sum up the atomic interactions. The k-NAGCN eliminates the adjacency matrix and uses the scale parameter to select the corresponding local structure. One may question whether a fixed scale parameter would limit the model's flexibility. However, the atomic interaction can be learned from data by considering the atom features and the 3D atomic configuration, and it does not need to be predefined.

VI. CONCLUSION AND FUTURE WORK
This work investigated the use of three-dimensional geometric information in graph-based models and proposed the novel k-NAGCN model. By comparing the experimental results, we found that 3D geometry plays a vital role in modeling the effects of interatomic interactions on the physical properties of materials. The proposed network structure reflects real-world interactions better than existing graph-based models, which largely ignore 3D topology.
Meanwhile, we developed an algorithm that generates rotation-invariant descriptors for the local atomic distribution. The invariant descriptor enhances the generalization of the prediction model by eliminating the ambiguity caused by rotation. The k-NAGCN model outperformed existing graph-based methods on a group of challenging materials property prediction tasks.
Our experiments showed that modeling a material's structure as a 3D point cloud is feasible. The proposed model can be extended to general tasks for predicting the physical properties of materials and can easily be embedded in other models, such as graph auto-encoders and graph generative networks. Meanwhile, the scale parameter helps discover the most significant local environment of the atoms and may lead to novel, meaningful chemical structures.
The future work involves two parts. On the one hand, the proposed model will be applied and validated on a series of property prediction tasks; it will also be used as the discriminator in a Generative Adversarial Network (GAN) that generates new MOFs. On the other hand, algorithms will be developed to enhance interpretability, which may prompt the discovery of physical laws.

APPENDIX A THE CALCULATION OF HENRY'S CONSTANT
Here, we present the Henry's constant in the dimensionless format calculated following [32], where β = 1/(k_B T), R represents the gas constant, k_B stands for the Boltzmann constant, T is the absolute temperature, and V_ext is the external potential. In this work, we consider hydrogen adsorption in MOF structures, where the external potential is mainly contributed by van der Waals interactions. The Lennard-Jones 12-6 potential V_LJ is therefore used to describe the pairwise additive potential for the interaction of a hydrogen molecule with each atom inside the MOF structure, with a cutoff distance of 12.9 Å.
The potential has the form V_LJ(r) = 4ε[(σ/r)^12 − (σ/r)^6], where r is the pairwise distance between the gas molecule and an atom in the MOF, and σ and ε are the LJ size and energy parameters. In this work, the Universal Force Field (UFF) is used for all atoms in the MOFs. For the hydrogen molecule, σ_H2 = 0.296 nm and ε_H2/k_B = 34.2 K. For interactions between different kinds of atoms, the Lorentz-Berthelot mixing rules were used.
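The LJ 12-6 potential and the Lorentz-Berthelot combination translate directly into code. This is a generic sketch of those standard formulas, not the authors' simulation code, and the cross-species parameters in the test are made-up examples.

```python
import math

def lorentz_berthelot(sigma_a, sigma_b, eps_a, eps_b):
    """Mixing rules: arithmetic mean for sigma, geometric mean for epsilon."""
    return 0.5 * (sigma_a + sigma_b), math.sqrt(eps_a * eps_b)

def lj_potential(r, sigma, epsilon):
    """Lennard-Jones 12-6 potential: V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

# Sanity checks with the H2 size parameter (epsilon in units of k_B * K):
sigma_h2, eps_h2 = 0.296, 34.2
v_zero = lj_potential(sigma_h2, sigma_h2, eps_h2)                # V(sigma) = 0
v_min = lj_potential(2 ** (1 / 6) * sigma_h2, sigma_h2, eps_h2)  # well depth -eps
```

The two checks confirm the textbook behavior: the potential crosses zero at r = σ and reaches its minimum of −ε at r = 2^(1/6) σ.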

APPENDIX B THE REMOVAL OF MOF STRUCTURES WITH EXTREME VALUES
The original CoRE2019 MOF database contains 12763 MOF structures. According to our calculation of Henry's constant, the range of Henry's constants covers more than 30 orders of magnitude (as shown in Figure B1). After examining the MOF structure files, we found that the broad range of Henry's constants was due to some unreasonable MOF structures. In Figure B1, we can see that most MOFs are distributed around a Henry's constant of order unity after the database cleanup. There is a small portion of MOFs with exceptionally high Henry's constants in Figure B1. A closer examination reveals that atoms are unreasonably close to each other in these structures. For example, in the MOF with the highest Henry's constant (RUBLEH), the oxygen-oxygen distance is only 0.366 Å, which is even shorter than the shortest covalent bond (hydrogen-hydrogen, 0.74 Å). The overlapping atom positions lead to unreasonably high attraction and an unreasonably high Henry's constant.
After removing the MOF structures with potentially overlapping atoms (with the smallest atomic distance below 0.8 Å), the distribution of Henry's constants is more reasonable, as shown in Figure B2: most MOFs with exceptionally high Henry's constants are filtered out. It is also worth noticing that some Henry's constant values are extremely low (on the order of 10^-10). The low Henry's constants of these MOFs are due to relatively small pore sizes or tightly confined geometries (e.g., XEKUO). Compared to other MOFs with similar void fractions, XEKUO has a relatively thin structure; therefore, most interactions inside the MOF are repulsive, resulting in an ultra-low Henry's constant. For example, compared with XEJJOM (void volume of 253 Å³), XEKUO (void volume of 185 Å³) has zero surface area, while XEJJOM has a surface area of 9.40 Å² when using a test particle with a diameter of 3.68 Å (the size of a nitrogen molecule). After removing MOFs with surface areas of less than 50 Å² (meaning that there is very little space and surface adsorption is very unlikely to happen), we filtered out all the MOFs with extremely low Henry's constants. The cleaned dataset contains 10136 MOF structures. The values of Henry's constant still cover several orders of magnitude, but we believe these values are physically plausible. The distribution of Henry's constants in the cleaned dataset is shown in Figure B3.