Background & Summary

Atomistic models are an essential tool for the prediction of thermodynamic, mechanical or biochemical properties of a substance. More recently, the use of pre-trained models has become increasingly popular due to their comparatively low complexity and high accuracy on modern hardware1,2,3,4,5,6. In order for such models to perform well, their empirical parameters require fitting to high-quality reference data. Depending on the application, reference data are either experimental, or come from computationally more expensive ab initio calculations. Although there are already a handful of large computational data sets covering small organic molecules7,8,9, such data are still scarce for larger periodic systems (cf. Materials Cloud Archive10,11 or the NOMAD database12,13). Motivated by this fact, we present a quantum-chemical data set for zeolites. Zeolites are porous materials composed of interconnected SiO4 or AlO4 tetrahedra. Their properties can be fine-tuned through synthesis of materials with specific pore size, or the inclusion of additional metal cation sites14,15,16,17. Because of their topology and synthetic flexibility, zeolites have various applications as adsorbents18,19,20 and catalysts17,21,22,23. To this day, a myriad of different zeolite framework types is available experimentally, and many more hypothetical structures can be derived24,25,26. The documentation of fundamental zeolite framework types and derived materials has led to the publication of the well-known Atlas of Zeolite Structures27 in several editions. The atlas lists each unique framework type by its three-letter code, as assigned by the Structure Commission of the International Zeolite Association (IZA). Today, its contents are available online at the Database of Zeolite Structures28, which we use as a source of initial structures for our data set.
In this first installment, we include properties for 204 out of the currently available 256 zeolite framework types in the database (a total of 226 unique geometries when also considering derived materials). Our descriptor provides the complete optimization trajectories for each system with atomic positions, lattice vectors, atomic gradients and stress tensors at each step. We envision future extensions of the data set to focus on derived geometries, covering structural defects and host-guest interactions.

Methods

Initial zeolite structures are collected from the public Database of Zeolite Structures28 in the Crystallographic Information File (CIF) format, before conversion to the XYZ format with the Atomic Simulation Environment29 (ASE) package. After selection of all systems with fewer than 301 atoms, each is manually filtered by removing redundant atom positions in case of fractional occupancies and adding missing hydrogen atoms where needed. Each structure’s coordinates and cell parameters are energy-minimized with the periodic density functional code BAND30, as implemented in the Amsterdam Modeling Suite31 (AMS). The calculations are performed with the revPBE functional32,33, a ‘Small’ frozen core and the double-ζ polarized (DZP) basis set. Grimme’s D3(BJ) dispersion correction34 is applied to all calculations. Previous research has shown that the selected level of theory can accurately reproduce zeolite geometries, albeit slightly overestimating the Si-O bond length (in the range of 2 pm) and yielding smaller Si-O-X angles (in the range of 5 degrees) when compared to experimental results35,36. At the same time, dispersion-corrected functionals are generally more accurate when describing adsorption processes37,38,39. For the optimization of the initial structures, geometry convergence criteria are left at their default values, namely 0.001 Hartree/Å, 0.00001 Hartree/Atom and 0.1 Å for atomic gradients, energy and atomic displacements, respectively. We use a Quasi-Newton optimizer40 in delocalized coordinate space for the initial optimizations. Cases of problematic convergence are restarted with the FIRE41 optimizer.
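The convergence criteria quoted above can be expressed as a simple check on the quantities of a single optimization step. The following is a minimal sketch rather than the procedure implemented in AMS: the names `Thresholds` and `is_converged` are hypothetical, only the three thresholds stated in the text are tested, and the displacement is assumed to be measured by its largest Cartesian component.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Thresholds:
    """Convergence thresholds as quoted in the text (hypothetical container)."""
    gradient: float = 1e-3  # Hartree/Å, largest absolute gradient component
    energy: float = 1e-5    # Hartree per atom, energy change between steps
    step: float = 0.1       # Å, largest absolute displacement component


def is_converged(
    gradients: Sequence[Sequence[float]],
    energy_change: float,
    displacements: Sequence[Sequence[float]],
    thresholds: Thresholds = Thresholds(),
) -> bool:
    """Return True if one optimization step satisfies all three criteria.

    gradients and displacements hold one (x, y, z) triple per atom;
    energy_change is the total energy difference to the previous step.
    """
    n_atoms = len(gradients)
    max_gradient = max(abs(c) for atom in gradients for c in atom)
    max_step = max(abs(c) for atom in displacements for c in atom)
    return (
        max_gradient <= thresholds.gradient
        and abs(energy_change) / n_atoms <= thresholds.energy
        and max_step <= thresholds.step
    )


# Toy two-atom example with made-up numbers
converged = is_converged(
    gradients=[[0.0005, 0.0, 0.0], [0.0, -0.0009, 0.0]],
    energy_change=-1.5e-5,
    displacements=[[0.01, 0.0, 0.0], [0.0, 0.05, 0.0]],
)
```

Applied to the last entry of a trajectory, a check of this kind corresponds to the three quantities evaluated in the technical validation of the data set.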

Data Records

The data is made available at the Materials Cloud Archive42. Each system’s trajectory is stored in an individual NumPy43 .npz file. Table 1 describes the data types held in each file, which stores the complete geometry optimization trajectory, including atomic coordinates, system energies, nuclear gradients, lattice vectors and stress tensors for each geometry optimization step. Entries at the first position correspond to the input structure; the last position holds the data for the final, optimized structure. Hirshfeld partial charges44 are provided for the final (optimized) geometries. Atomic coordinates and lattice vectors are stored in ångström; all other properties are stored in atomic units.
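As an illustration of the layout described above, the sketch below reads the first and last entries of one array from a .npz archive and converts an energy difference from Hartree to eV. The key names `energy` and `coordinates` and the helper `first_and_last` are assumptions made for this example; the actual keys are those listed in Table 1. A small synthetic archive stands in for a real file from the repository.

```python
import io

import numpy as np

HARTREE_TO_EV = 27.211386245988  # CODATA 2018 conversion factor


def first_and_last(npz_file, key):
    """Return the entries for the input (first) and optimized (last) structure."""
    with np.load(npz_file) as data:
        if key not in data.files:
            raise KeyError(f"{key!r} not found; available keys: {data.files}")
        values = data[key]
    return values[0], values[-1]


# Synthetic stand-in for a trajectory file (three steps, two atoms).
# The key names are hypothetical, not taken from the data set.
buffer = io.BytesIO()
np.savez(
    buffer,
    energy=np.array([-10.0, -10.4, -10.5]),  # Hartree
    coordinates=np.zeros((3, 2, 3)),         # Å
)
buffer.seek(0)

initial_energy, final_energy = first_and_last(buffer, "energy")
relaxation_ev = (final_energy - initial_energy) * HARTREE_TO_EV
```

For a file from the archive, the in-memory buffer would be replaced by the path to the downloaded .npz file, and `np.load(path).files` lists the keys that are actually present.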

Table 1 Overview of the data structures stored in a .npz file.

Technical Validation

The complete data set includes geometry optimizations of 226 systems, resulting in a total of 32550 geometries. System sizes range between 15 and 334 atoms (mean: 126). We illustrate the convergence of all reference calculations in Fig. 1, showing that all optimized systems are well within the defined convergence criteria. Elemental occurrences in the data set are listed in Table 2. Si-O and Si-Si distances as well as Si-O-Si angles are presented in Fig. 2 as the most prominent geometrical descriptors. As most of the initial structures from the IZA database are idealized geometries45, the Si-O bond distance distribution is sharply peaked, with a mean of roughly 161 pm (Fig. 2a, blue histogram). Long tails in the distribution vanish and the mean is shifted towards approximately 164 pm when considering geometry-optimized structures (Fig. 2a, orange histogram). Considering the Si-O-Si angles, a slight shift towards smaller values is observed (mean of 149 vs. 142 degrees, Fig. 2c). Both effects have been previously reported by Fischer et al.35,36 and are inherent to the selected level of theory. Distributions of the Si-Si distances in the second coordination sphere do not shift significantly when comparing initial and optimized geometries (Fig. 2b). Relative changes in the cell volumes are presented in Fig. 3 as the ratio of each system’s optimized-to-initial volume. Values below 1 translate to a shrinking unit cell as the optimization progresses. Overall, the geometrical descriptors are in good agreement with experimental data46,47,48,49,50,51. Additional averages for bond distances and angles are summarized in Tables 3 and 4, respectively. Distributions of energies, atomic gradients, cell volumes and stress tensors are depicted in Fig. 4. As expected from geometry optimization trajectories, all properties, with the exception of relative cell volumes, have a distinct mean close to zero.
Structures close to the initial input geometries contribute to the relatively high standard deviations. Evaluation of the relative cell volumes shows a shifted distribution, with roughly 76% of all structures having a larger volume than their respective optimized geometry. A detailed overview of all calculated structures, sorted by their IZA three-letter code, the system size and number of iterations is provided in Online Table 1.
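The optimized-to-initial volume ratio discussed above follows directly from the stored lattice vectors. The sketch below assumes that each set of lattice vectors is available as a 3×3 array with one vector per row; the function names are hypothetical and the cubic cells are made-up numbers, not taken from the data set.

```python
import numpy as np


def cell_volume(lattice):
    """Volume of a unit cell spanned by three lattice vectors (rows, in Å)."""
    lattice = np.asarray(lattice, dtype=float)
    if lattice.shape != (3, 3):
        raise ValueError("expected a 3x3 array of lattice vectors")
    # The absolute determinant equals |a . (b x c)|
    return abs(np.linalg.det(lattice))


def volume_ratio(initial_lattice, optimized_lattice):
    """Optimized-to-initial volume ratio; values below 1 mean the cell shrank."""
    return cell_volume(optimized_lattice) / cell_volume(initial_lattice)


# Toy example: a cubic cell contracting from 10.0 Å to 9.9 Å edge length
initial = np.eye(3) * 10.0
optimized = np.eye(3) * 9.9
ratio = volume_ratio(initial, optimized)  # 9.9**3 / 10.0**3
```

With the first and last lattice entries of a trajectory as input, this yields one value per system, which is the quantity whose distribution is described in the text.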

Fig. 1

Distribution of convergence criteria at the last optimization step for all calculated systems in the data set. Showing (a) the highest absolute component of all nuclear gradients, (b) change in system energy and (c) highest relative atomic displacement.

Table 2 Elemental occurrences in the complete data set.
Fig. 2

Distributions of (a) Si-O bond lengths, (b) Si-Si distances in the second coordination sphere and (c) Si-O-Si angles as calculated from all geometries in the data set. Blue and orange bars denote data from initial and optimized geometries, respectively. Mean μ and standard deviation σ are printed in the same color as the underlying data. N denotes the total sample size.

Fig. 3

Distribution of relative cell volumes per system as the quotient of optimized-to-initial cell volumes. Values below 1 describe a shrinking cell as the optimization progresses. The black line marks V/V0 = 1. Sample size is 226.

Table 3 Mean atomic bond length distributions and their standard deviations (std. dev.) in ångström.
Table 4 Mean Si-O-R angle distributions and their standard deviations (std. dev.) in degrees.
Fig. 4

Distributions of physical quantities in the data set. Showing (a) energy differences per atom, relative to the respective energy of the optimized system; (b) atomic gradient components; (c) unit cell volumes, relative to the optimized system’s volume; (d) stress tensor components. Data is printed on a logarithmic y-scale for a clear display of the distribution. Mean μ and standard deviation σ printed in the same units as the underlying data. N denotes the total sample size.

Usage Notes

No data points were filtered as outliers with regard to the distributions of chemical properties (see Fig. 4). Consecutive structures from the same optimization trajectory will be autocorrelated. The data repository provides an interactive plotting script, displaying the system energy, maximum absolute component of the nuclear gradients and the cell volume at every iteration step for each structure. This requires the Bokeh52 (v. 2.3.1) package for Python to be installed. SHA-1 hash sums are provided for each file to guarantee data integrity, as well as an example input script for a calculation with BAND. Naming conventions: Derived materials are referred to by their IZA three-letter code, e.g. H-EU-12 is tabulated as ETL_0. Leading non-alphabetical characters have been removed, e.g. *-ITN is tabulated as ITN.
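A downloaded file can be checked against its published SHA-1 hash sum with the Python standard library, and the rule for leading non-alphabetical characters can be reproduced with a regular expression. This is a minimal sketch: the function names are hypothetical, the format in which the repository lists its hash sums is not assumed here, and the mapping of derived materials (such as H-EU-12 to ETL_0) is not covered by the regular expression.

```python
import hashlib
import os
import re
import tempfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tabulated_name(framework_code):
    """Strip leading non-alphabetical characters, e.g. '*-ITN' -> 'ITN'."""
    return re.sub(r"^[^A-Za-z]+", "", framework_code)


# Demonstration on a temporary file containing the bytes b"abc"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    tmp_path = tmp.name
try:
    demo_digest = sha1_of_file(tmp_path)
finally:
    os.remove(tmp_path)

demo_name = tabulated_name("*-ITN")
```

For a file from the repository, the digest returned by `sha1_of_file` would be compared with the hash sum published alongside that file; a mismatch indicates a corrupted or incomplete download.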