Automated prediction of ground state spin for transition metal complexes

Exploiting crystallographic data repositories for large-scale quantum chemical computations requires the rapid and accurate extraction of the molecular structure, charge and spin from the crystallographic information file. Here, we develop a general approach to assign the ground state spin of transition metal complexes, in complement to our previous efforts on determining metal oxidation states and bond order within the cell2mol software. Starting from a database of 31k transition metal complexes extracted from the Cambridge Structural Database with cell2mol, we construct the TM-GSspin dataset, which contains 2063 mononuclear first row transition metal complexes and their computed ground state spins. TM-GSspin is highly diverse in terms of metals, metal oxidation states, coordination geometries, and coordination sphere compositions. Based on TM-GSspin, we identify correlations between structural and electronic features of the complexes and their ground state spins to develop a rule-based spin state assignment model. Leveraging this knowledge, we construct interpretable descriptors and build a statistical model achieving 98% cross-validated accuracy in predicting the ground state spin across the board. Our approach provides a practical way to determine the ground state spin of transition metal complexes directly from crystal structures without additional computations, thus enabling the automated use of crystallographic data for large-scale computations involving transition metal complexes.


S1 Dataset generation
The database 1 we previously extracted from the Cambridge Structural Database (CSD) contains 31,019 mononuclear complexes featuring eight different metal centers (Cr, Mn, Fe, Co, Ni, Cu, Ru, and Re), whose unit cells were successfully interpreted by cell2mol, and for which metal oxidation state (OS) matched the respective .ciffile.Before creating this database, we performed pre-filtering to exclude complexes missing hydrogens.Moreover, this database does not include any formally radical ligand or polynuclear complexes.In this work, we focused on first row transition metal complexes with five metal centers Cr, Mn, Fe, Co, and Ni whose d electron configurations ranging from d 4 to d 8 , which reduces the initial database from 31,019 to 17,214 entries.Among those, 1,211 complexes identified by cell2mol as containing haptic ligands were excluded.We then checked the coordination number of the complexes, defined as the number of atoms bound to the metal center.For 650 complexes, the coordination numbers determined by cell2mol differ from those displayed in the CSD software Conquest.These discrepancies did not impact the assignment of metal OS.Still, a correction procedure, guided by the coordination environments and metal-ligand bond lengths, was applied to identify the metal-coordinating atoms.17 complexes that showed a disagreement in the coordination number after correction were removed.34 additional nitrosyl complexes were removed to avoid potential spin contributions from typical non-innocent ligands. 2 The complexes' coordination geometry was assigned using a continuous shape measure, abbreviated as CShM, as implemented in the CoSymLib python library. 3 The CShM value is defined as the deviation from an ideal shape of a polyhedron, measured on a scale from 0 to 100.For each complex, the CShM values were obtained according to the selected reference shapes in the corresponding coordination number.We assign the coordination geometry to the shape characterized by the lowest CShM value.115 complexes exhibiting very significant distortions and featuring a deviation value higher than 7.0 were excluded after this step.This selected threshold is chosen to be considerably higher than the rule of thumb values defined in the literature, 4 in which CShM values below 1.0 are considered to indicate minor distortion from the reference shape, while values of up to 3.0 are considered to indicate important distortion.However, it is important to note that even for CShM values around or above 3.0, the reference shape still provides a reasonably good stereochemical description. 4 The remaining 15,837 complexes constitute the "original database", which shows large chemical diversity as illustrated in Figure S2 but also large redundancy.We performed stratified sampling to construct a more compact, albeit representative dataset and avoid the cost of the ground state spin computations for redundant complexes.We labeled the original database of 15,837 complexes based on the following classifiers: metal identity, metal OS, coordination number, coordination geometry, and composition of metal-coordinating atoms.This resulted in 1,633 distinct groups, where the complexes within each group shared the same aforementioned characteristics.The group size ranges from 1 to 616.Within each group, we chose the complexes with the lowest total number of electrons to minimize computational cost.In cases where multiple complexes possess an identical minimal electron count, the one closest to the ideal reference shape was selected.To further enrich the dataset, 628 complexes were additionally selected from the groups consisting of more than one member, prioritizing the complexes with lower electron counts until at least 1 % of the group was selected.As a result, a total of 2,261 complexes were selected for the density functional theory (DFT) computations.We employed five classifiers-metal identity, oxidation state, coordination number, coordination geometry, and the composition of metal-coordinating atoms-to select 2,261 entries from 1,633 groups, categorizing the original database of 15,837 complexes.Introducing additional classifiers, e.g. the formula of ligands, would significantly increase the number of groups to 13,512, covering over 85 % of the original database, which would lead to a large amount of redundant complexes (see below example).We thus opted to continue using the initial five classifiers.
We illustrated the chemical diversity within a group using the example of octahedral Cr(0) complexes that coordinate six carbon atoms.This group consists of 41 complexes and can be divided into 36 subgroups based on the chemical formulas of the ligands, as detailed S-8 in Table S1.The "Formula of Ligands" column reveals that, except for the 36th subgroup containing BIBPEI, the differences between subgroups are limited to changes in a single ligand.Consequently, we consider that sampling three complexes (highlighted in bold in Table S1) out of the 41 is sufficiently representative of the group as a whole in terms of chemical diversity.
Table S1: Subgroups in a group of octahedral Cr(0) complexes coordinating six carbon atoms based on the formula of ligands.Complexes highlighted in bold were selected for the DFT computations.
refcode N elec Formula of TM complex Formula of ligands subgroup size

S2 Ground state spin computations
The ground state spins of 2,261 complexes selected via stratified sampling (see above) were computed at the B3LYP*-D3(BJ)/def2-TZVP level.Table S2 shows accessible spin states based on the d electron configuration of the metal center.Out of 2,261 complexes, 184 complexes were removed due to convergence failures in all accessible spin states, small energy gap between possible ground state spins (Table S3), and significant spin contamination (Table S4).Significant spin contamination was defined as the expectation value of Ŝ2 that deviates more than 0.2 from the exact value of S(S + 1), 5 or more than 0.1 for the singlet and doublet ground states.
Finally after removing 184 of the 2,261 complexes, further inspection of the chemical names (as reported in the .ciffiles) and ligand compositions revealed that among the remaining entries, one contains radical ligands, and 13 have identical chemical compositions and nearly identical structures to others despite originating from different CSD entries.Consequently, these 14 complexes were excluded, bringing the total number of complexes in the final dataset to 2,063.

S-12
Due to the aforementioned filtering and exclusion processes, 78 groups in Table S5 were not represented in the final dataset, termed as the TM-GSspin dataset.Most of these unrepresented groups had a group size of 1 in the original database, leaving no alternatives available.For groups with more than one entry, DFT calculations were performed on one to three additional complexes.However, if these additional complexes also failed in the computations, those groups are noted in Table S5.Despite the reduction from 1,633 to 1,555 groups, the dataset still exhibits large chemical diversity (see Figure 2 in the main text).

S-14
The B3LYP* results were compared to TPSSh and M06L using the same def2-TZVP basis set.This comparison was conducted on a subset of 445 Fe complexes, as depicted in Figure S3.Notably, when excluding the complexes with significant spin contamination or small energy gaps between possible ground state spins, the number of low-spin (LS), intermediatespin (IS), and high-spin (HS) complexes is consistent across all three functionals.We also conducted additional computations to determine the ground state spins of nitrosyl complexes, which were originally excluded at the dataset curation process.Generally, the ground state spins of nitrosyl complexes favor the LS states (Table S6).cell2mol has assigned the ligand charge of 0 to linear nitrosyl and −1 to bent one based on the metal-nitrogenoxygen angle.However, in specific cases (highlighted in red in Table S6), we observed a mismatch between the ground state spin of the complex and the possible spin states of the metal ion.This discrepancy arises from a spin originating from a nitrosyl ligand, which was assigned to a linear type by cell2mol but appears to be a bent type.

S-18
To evaluate the impact of geometry optimization, we compared the effect of four different geometry optimization strategies in 10 exemplary complexes (see Table S7): no optimization of the crystallographic structure, optimization of only hydrogen atoms, optimization of all atoms except the metal and its coordinating atoms, and full geometry optimization.

S-19
Figure S5 shows that the effects of geometry optimization on spin-splitting energies are minor, and the resulting ground state spins remain unchanged.Since we do not consider spin crossover complexes in our dataset, which should be the systems where the spin state is most sensitive to geometry, we believe that the effect of crystal packing and geometry relaxation is minimal.
In the final TM-GS spin dataset, 32 complexes have hydrogens as coordinating atoms.
Geometry optimization for hydrogens in the LS state completed successfully for all these complexes, except for one Co(II) tetrahedral complex, which was optimized in the HS state instead.Final ground state spins were determined through single-point energy computations for different spin states, revealing 27 LS and 5 HS complexes.For the four complexes in the HS ground state, geometry optimization only for hydrogens in the LS state presented no issues.
Lastly, following iterative and automated corrections were applied in case of convergence failures: • We automatically turned on integral symmetry in the SCF if the run terminated due to no symmetry.
• We loosened the geometry convergence threshold if the geometry optimization of the hydrogen atoms did not converge.
• If the optimization failed due to the lack of a lower energy point, we automatically tried other spin states.For example, the square pyramidal Ni(II) complex (CSD refcode: TITRAQ) did not converge in the singlet state, so we refined the geometry in the triplet state, which then terminated normally.This correction happened only for 31 complexes.Figure S14 shows the ground state spins of 144 Fe(II) octahedral complexes depending on coordination sphere composition.Complexes where the major coordinating element is carbon (C) or phosphorus (P)-both strong field ligands-predominantly exhibit a LS state.

S3 TM-GSspin dataset
In contrast, complexes where the major coordinating elements are halogens or oxygen (as shown in Figure S14a: Cl6, O6) or a combination of weak field ligands (as shown in Figure S14c: Cl4-O2, O4-Br2, O4-Cl2) generally adopt a HS state.Complexes where nitrogen (N) or sulfur (S) dominate the coordination sphere display mixed ground state spins.Figure S18 shows the feature importances in models trained using F TM+CE+aSLATM .In the models trained on each metal subset, features that constitute more than 1 % within aSLATM primarily arise from two-body or three-body terms involving the central metal and one or two coordinating elements, such as carbon, nitrogen, oxygen, or phosphorus.On the other hand, in the model trained on the whole dataset, features representing more than 1 % within aSLATM predominantly stem from one-body terms associated with the different metal centers.

Figure
S1 depicts the distributions of deviation values of the coordination geometry determined for complexes with coordination numbers ranging from 2 to 8. The deviation values, and the metal-coordinating atoms are given for several examples (i.e., CSD refcodes LUMZEZ, OBIVAX, NODHUK, RAGLOD, REJPOM, VIDMIH, WUBFUX, BEJVAO, and ZIRKEQ).

Figure S1 :
Figure S1: Distribution of deviation values from the determined coordination geometry.The coordination number (CN) ranges from 2 to 8. Vertical dashed lines represent a threshold value of 7.0, defined to exclude complexes exhibiting very significant distortions from the reference shape.

Figure S2 :
Figure S2: Chemical diversity of the original database of 15,837 complexes.a. Percentage of oxidation state (OS) for each metal.The OS is denoted 0, I, II, or III.The color code indicates the number of d electrons in the metal ion.b.Frequency distribution of coordination geometries.c.Frequency distribution of metal-coordinating atom elemental identities.d.Number of coordinating elements in the first coordination sphere for each coordination number.

Figure
FigureS4shows the ground state spins of complexes with 3d, 4d, or 5d transition metal (TM) ions, which were determined at the B3LYP*-D3(BJ)/def2-TZVP level.As expected, the ground state spins of 3d coordination complexes with 0, 1, 2, 3, 9, or 10, d electrons are determined based on the number of unpaired electrons.Some 3d complexes containing haptic ligands with 2, 3 or 4 d electrons exhibit the LS state while 3d coordination complexes with d electron numbers of 2, 3, or 4 display triplet or quartet ground states.On the other hand, the ground state spins of 4d or 5d complexes are typically LS state.In the case of heavy metal complexes with haptic ligands, their ground state spins are predominantly in the LS state.

Figure S3 :
Figure S3: Benchmark of DFT functionals B3LYP*, TPSSh, and M06L.a.Comparison of ground state spins in 445 Fe complexes, computed with B3LYP*, TPSSh, and M06L using the def2-TZVP basis set.LS: low-spin, IS: intermediate-spin, HS: high-spin.L/I, I/H, and H/L denote complexes with energy gaps less than 5 kcal/mol between two low-lying spin states.b.Confusion matrices before (top) or after (bottom) filtering complexes with significant spin contamination or small energy gaps.In top panels, complexes with small energy gaps represent as LI (LS < IS), IL (IS < LS), IH (IS < HS), HI (HS < IS), HL (HS < LS), or LH (LS < HS).

Figure S4 :
Figure S4: Ground state spins of mononuclear TM complexes with 3d (top), 4d (middle), or 5d (bottom) metal center.The left panels show the ground state spins of coordination complexes, and the right panels show those of complexes with haptic ligands.

Figure S5 :
Figure S5: Impact of geometry optimization on spin-splitting energy in 10 exemplary complexes.No optimization (blue), optimization performed only for hydrogen atoms (orange), optimization performed on all atoms except the metal and its coordinating atoms (green), full geometry optimization for all atoms (red).

Figure S6 :
Figure S6: An overview of TM-GSspin dataset.a.The number of atoms, b. number of electrons, c. total molecular charges, and d. spin multiplicity.

Figure S7 :
Figure S7: Number of atoms in the 6,631 ligands in TM-GSspin dataset.

Figure S9 :
Figure S9: Ground state spin of d 4 Cr(II) and Mn(III) complexes.

Figure S10 :
Figure S10: Ground state spin of d 5 Mn(II) and Fe(III) complexes.

Figure S11 :
Figure S11: Ground state spin of d 6 Fe(II) and Co(III) complexes.

Figure S13 :
Figure S13: Ground state spin of d 8 Co(I) and Ni(II) complexes.

Figure S14 :
Figure S14: Ground state spins of 144 Fe(II) octahedral complexes depending on coordination sphere composition.a.All six coordinating atoms are identical.b.Five out of six coordinating atoms are the same element.c.Four out of six coordinating atoms are the same element.d.Three out of six coordinating atoms are the same element.e.Other coordination sphere compositions.Vertical lines demarcate figures according to the major coordinating element.

Figure S15 :
Figure S15: Histograms of relative metal radii (r rel ) and corresponding ground state spins for Fe(III) complexes with coordination numbers ranging from 3 to 6. N indicates the number of complexes used to plot each histogram.The color code represents the ground state spin: blue (singlet), yellow (triplet), and red (quintet).

Figure S17 :
Figure S17: Relationship between relative metal radius (r rel ) and average of metal radius (x M ) for different metal centers.xM =

Table S2 :
Accessible spin states with corresponding total spin quantum number S based on d n configuration of the metal center.LS: low-spin, IS: intermediate-spin, HS: high-spin.

Table S3 :
Excluded 35 complexes due to the small energy gap between possible ground state spins less than 5 kcal/mol.OS: metal oxidation state, N d : the number of d electrons, E HS−LS : energy difference between HS and LS state, E dif f : energy difference between two low-lying spin states, order: spin state ordering in an ascending order, where the left one has lower energy than right one.

Table S4 :
Excluded 22 complexes due to the significant spin contamination in the ground state.These complexes exhibit an expectation value of Ŝ2 deviated more than 0.2 from the exact value of S(S +1) (in case of S = 0 or S = 1/2, deviation more than 0.1).OS: metal oxidation state, N d : the number of d electrons, E HS−LS : energy difference between HS and LS state, E dif f : energy difference between two low-lying spin states, deviation: deviation of an expectation value of of Ŝ2 from the exact value of S(S + 1), S: total spin quantum number for the ground state spin of a given complex.

Table S5 :
Unrepresented 78 groups in the TM-GSspin dataset due to the exclusion of 198 complexes out of 2,261.OS: metal oxidation state, N d : the number of d electrons, CN: coordination number.

Table S6 :
Ground state spin of nitrosyl complexes initially excluded during the dataset curation process.OS: metal oxidation state, N d : the number of d electrons, N N O : the number of nitrosyl ligands present in the complex, E HS−LS : energy difference between HS and LS state, E dif f : energy difference between two low-lying spin states, deviation: deviation of an expectation value of Ŝ2 from the exact value of S(S + 1), S: total spin quantum number for the ground state spin of each complex.The complexes highlighted in red show a discrepancy between the ground state spin of the complex and the possible spin states of the metal ion due to a spin originating from the nitrosyl ligand.The complexes colored in blue represent cases where E dif f is less than 5 kcal/mol or where the computation failed to converge for one of the accessible spin states.

Table S7 :
Exemplary complexes to assess the impact of geometry optimization on spinsplitting energy.OS: metal oxidation state, N d : the number of d electrons, spin multiplicity in the ground state.

Table S8 :
6ovalent radii used in this work.These values are sourced from a previous analysis of experimental crystal structures.6Formetal elements, proposed covalent radii are based on the analysis of the complexes with coordination number 6.For Mn, Fe, and Co, we use the proposed covalent radii in the LS state.

Table S10 :
TableS9shows 2-coordinate complexes in TM-GSspin and their ligand information.We additionally searched the CSD for 2-coordinate mononuclear complexes with Cr, Mn, Fe, Co, or Ni.These complexes typically feature monodentate bulky ligands.This observation supports our hypothesis that 2-coordinate complexes with d electrons between 4 and 8 (excluding zero-valent complexes) generally possess bulky ligands, resulting in a weak ligand field and a HS ground state.We also identified a few Cu complexes with two halogen ligands, such as the Cu(I)Br 2 complex (CSD refcode: MIYWIE).Due to its d 10 electron configuration, a rule-based decision tree assigns its ground state spin as a singlet, regardless of the bulkiness of the ligands.We did not find any examples of these with Cr, Mn, Fe, Co, or Ni.Cut-off values based on the combination of the coordination geometry and the TM element.The cut-off values of Fe complexes are bold because we compute the cut-off values of other TM complexes by multiplying a factor f .The factor is defined as ratio of slope in FigureS17.For Cr or Mn, f = 0.719/0.757=0.95.For Co, f = 0.793/0.757=1.05.For Ni, f = 0.806/0.757=1.06.The cut-off values of Fe complexes were chosen based on the distribution of relative metal radii and corresponding ground state spins of Fe(II) (Figure3cin the main text) and Fe(III) (FigureS15), to minimize incorrect spin state assignments.

Table S11 :
Comparison of performances of random forest models using F CE and F CE plus the average number of atoms in ligands.10-fold cross-validated accuracies and standard deviations (SD) are shown for the TM-GSspin dataset and for each metal subset.

Table S13 :
Performance of random forest models trained on the TM-GSspin dataset and each metal subset for ground state spin prediction.Stratified 10-fold cross-validations were performed, where accuracy and F1 score indicate the mean accuracy and the mean weightedaveraged F1 score across the 10 validation sets in each dataset, respectively.SD is the mean standard deviation across the 10 validation sets in each dataset.Five different descriptors served as input features for the random forest models.F TM : a vector containing the metal atomic number, metal OS, and the number of d electrons, F CE : a vector containing coordination number, coordination geometry, and relative metal radius, F TM+CE : a vector incorporating F TM and F CE , aSLATM: atomic SLATM representation 7 of the metal center, calculated using the QML package 8 with a modified discretization grid (0.4 a.u.) and cutoff (4.0 a.u.), F TM+CE+aSLATM : a vector incorporating F TM+CE and aSLATM.The number of complexes within each dataset is shown in parentheses.To ensure reproducibility in training and testing the random forest models, we set a fixed random state for the randomized search on hyperparameters and stratified K-Fold cross-validation.

Table S14 :
43 complexes whose random forest model using F TM+CE results in incorrect prediction.OS: metal oxidation state, N d : the number of d electrons, r rel : relative metal radius, S DF T : DFT computed ground state spin, S Exp : experimentally characterized spin state, S M L : predicted ground state spin by random forest model using F TM+CE trained on the TM-GSspin dataset, Prob.: ML prediction probability.If S Exp was not found in the literature, it is indicated with a hyphen.S DF T that does not match S Exp is displayed in red, indicating that the ML model's reference data is incorrect.SCO indicates a spin-crossover complex.