Physics-based representations for machine learning properties of chemical reactions

Physics-based representations constructed using only atomic positions and nuclear charges (also known as quantum machine learning, QML) allow for the reliable and efficient inference of molecular properties from training data. Chemistry is a science rooted in chemical reactions, naturally involving multiple molecular species. Here, we extend QML’s capabilities to include the prediction of reaction properties by defining reaction representations: representations taking as input multiple molecules participating in a reaction, each represented by their corresponding atomic charges and three-dimensional coordinates. Several reaction representations are constructed from established molecular ones and benchmarked on four datasets representative of thermodynamic or kinetic reaction properties. One of these, the Hydroform-22-TS dataset (2350 energy barriers), is introduced as part of this work. The relevant ingredients for a high-performing reaction representation are extracted and used to construct the Bond-Based Reaction Representation ( B2R2 ) for the prediction of quantum-chemical properties of chemical reactions. Finally, variations of B2R2 with varying representation size vs. performance are provided.


Introduction
Physics-based or quantum machine learning (QML) [1][2][3][4] representations form a comprehensive class of chemical fingerprints that are inspired by fundamental laws of physics and basic laws of symmetry. These representations rely on the fact that all (static) information about a chemical system is uniquely encoded into the system-specific parameters that fix the electronic Schrödinger equation: nuclear charges (Z I ) and positions (R I ). Because QML representations are rooted in foundational laws of nature, they are extremely transferable and do not need to be adapted to each specific learning task. Given their transferability, generality, and deep connection to electronic targets, QML representations have been the forefront of machine learning applied to solve chemical problems [2][3][4][5][6][7].
Despite their conceptual and mathematical differences, existing QML representations always focus on encoding either an entire molecule ('global') or a molecule as a set of atomic environments ('local'). These representations can then accurately and efficiently predict molecular and atomic properties, such as atomisation energies [1,8,9], forces [10][11][12][13], potential energy surfaces [14][15][16], excited state properties [17], polarisabilities [18,19] and electron densities [20,21]. Free energies can be predicted with a Boltzmann-weighted ensemble of molecular representations [22]. While atoms and molecules are the instruments of chemistry, chemical reactions are its orchestra. The prediction of reaction properties such as reaction energies, activation barriers, changes in dipole moments, catalytic yields and turnover, are not yet routine within the framework of QML. As opposed to single molecule properties, reaction properties always include a notion of transformation: quantities change from reactants to products. These changes may be subtle, but they dictate chemistry. To be valuable within this context, a representation should capture both transformations from reactants and products, as well as differences between reactions.
Outside the domain of QML, the development of predictive models for reaction properties has been an active field for the last few years. Descriptors derived from 2D molecular graphs of reactants and products [23,24] are the standard choice for the prediction of reaction properties that are not highly sensitive to subtle changes in three-dimensional molecular structure. An alternative is expert-selected (ab-initio) descriptors [25,26], which often correlate well with reaction properties, but typically rely on a mechanistic understanding of the reaction, and are not transferable across reaction classes. In some cases, simple one-hot encoded descriptors [27,28] perform equally well to ab-initio descriptors for significantly lower computational cost. Finally, representations derived from graphs in deep learning models [29,30] have shown promising performance on several reaction properties, but are expensive to train, and tend to perform well only for large dataset sizes.
Compared to the above descriptors, QML reaction representations offer a greater degree of generality (vide supra). Recently [31], it has been demonstrated that reaction representations derived from molecular ones can be modified to better describe the transformational nature of chemical reactions. Inspired by this initial work, here we analyse the essential characteristics of accurate and efficient QML reaction representations in greater depth. Leveraging three key design principles, we propose the Bond-Based Reaction Representation (B 2 R 2 )-a specialised fingerprint for reaction property prediction.

Datasets
To assess the robust performance of the reaction representations, we use four datasets of chemical reactions which are categorised according to their reaction type, property and year of publication. For example, the Proparg-21-TS dataset corresponds to propargylation reactions (Proparg), was published in 2021 (21), and is labelled with barriers (TS). The only set that is not associated with barriers, the SN2-20, correspondingly is missing the -TS. The four sets are SN2-20, GDB7-20-TS, Proparg-21-TS and Hydroform-22-TS. All datasets are available alongside with the source code at https://github.com/lcmd-epfl/b2r2-reaction-rep. The data is also available separately at https://zenodo.org/record/6937747.

SN2-20
This dataset is adapted from the original set of reactants and transition states published as QMrxn by von Rudorff et al [32]. The reactants consists of (i) variations of ethane functionalised with four substituents and (ii) a nucleophile. The products consist of (iii) the corresponding substituted product and (iv) the leaving group. The lowest energy conformation of (i) and (iii), along with their corresponding energy were extracted from QMrxn. We then computed the energies of (ii) and (iv) at the MP2 [33]/6-311G(d) [34][35][36] using ORCA 4.0.1 [37,38], as per the procedure in QMrxn. The eventual dataset consists of 2670 reactions with corresponding reactant and product structures, and reaction energies. While barriers were published as part of the original work, these correspond to barriers between the reactant complex and transition state, rather than isolated reactants and transition state. Correspondingly, we focus on thermodynamics as a target, and consider kinetic properties in the other three datasets.

GDB7-20-TS
This dataset is taken from [39]. The dataset consists of 11 961 diverse organic reactions constructed from the GDB7 dataset [40][41][42], with corresponding energy barriers computed at the ωB97X-D3/def2-TZVP level. Unlike in the previous deep learning model [29], we do not mix forward and backward reactions to double the dataset size. As noted by Heid and Green [30], this practice results in misleading low prediction errors if, for example, a forward reaction is in the training set and a backward reaction in the test set.

Proparg-21-TS
This dataset is taken from [43]. It contains 760 structures of intermediates before and after the enantioselective transition state of the propargylation of benzaldehyde, as well as the barriers computed at the B97D/TZV(2p, 2d) level. As in our previous work [31], we target the energy barriers, which could eventually further been used to derive enantiomeric excess (e.e.) values.

Hydroform-22-TS
As part of this work, we built the Hydroform-22-TS dataset. It consists of 2350 structures of intermediates before and after the alkene insertion transition state in the catalytic cycle of olefin hydroformylation [44],  with associated barriers. The initial set of catalysts were designed from combinations of three group 9 metals (Co, Rh, Ir) with different phosphine/phosphite ligands. The ligands were designed using 29 different substituents appended in an R 1 /R 2 /R 2 fashion to form monodentate phosphine/phosphite ligands as in the original paper [44]. The step of the hydroformylation reaction of interest, shown in figure 1, converts an olefin-bound π-complex to a σ-complex via a 1,2-insertion process. Transition state geometries were generated from a model template through functionalisation with the phosphine/phosphite ligands using the AARON program [45,46]. The generated structures were then optimised in the gas phase at the PBE0 [47,48]-D3(BJ) [49,50]/def2-SVP [51] level followed by single points at the PBE0-D3(BJ)/def2-TZVP level including solvation (in benzene) using the SMD model [52] in Gaussian16 [53]. The π-and σ-intermediate complexes were optimised following the imaginary vibrational mode of the transition state, followed by geometry optimisation at the PBE0-D3(BJ)/def2-SVP level. Free energy corrections (using the def2-SVP basis set) for all species were determined using the rigid-rotor harmonic oscillator model [54] as implemented in the GoodVibes program, version 3.0.1 [55]. Default settings were used.
Finally, additional filters were applied. Some structures had missing hydrogen atoms after optimisation, which were removed. Additionally, many structures relaxed to the cis configuration between the CO and hydride group. The cis structures are not the relevant ones in the chemical process (thermodynamically higher than trans) but are local minima. These were removed such that only the trans structures remain. The final dataset is composed of cobalt (726), iridium (809), and rhodium (815) complexes for which the final barrier distribution is illustrated in figure 2.

General remarks
The SN2-20 and GDB7-20-TS datasets focus on small molecules (up to seven heavy atoms per molecule), whereas the Proparg-21-TS and Hydroform-22-TS sets contain larger complexes (up to 52 and 67 heavy atoms per molecule, respectively). The SN2-20 dataset, as many other examples in the literature [29,30,56], consists of textbook chemical reactions that are readily interpretable. However, robust reaction representations should also be capable of describing the chemistry of large molecules. While the GDB7-20-TS set is made of small molecules, it spans a broader range of chemical reactions with an associated large range of activation energies (0-200 kcal mol −1 ), making it the most challenging dataset of the four.

Molecular representations
The tested molecular representations are separated into three categories: (i) Coulomb Matrix (CM) [8,9] and Bag of Bond (BoB) [57]; (ii) SLATM [58] and FCHL19 [59,60] and (iii) SOAP [18,61]. The CM relies upon pairwise interactions using Coulomb potential terms [8,9]. The BoB takes CM terms and organises the elements of the CM into atom-pairwise types 'bags' [57]: e.g. all interactions between pairs of C atoms are organised into a C-C bag. SLATM and FCHL19 append higher-order interaction terms (between triplets of atoms) with different potentials [58][59][60]. SOAP rather considers atoms in molecules according to their neighbouring atom density. While SOAP was introduced in the context of its kernel, the power spectrum is often treated as a representation and can be fed into any arbitrary kernel functions [3,62].

Reaction representations
Representations are constructed either using only representations of reactants (X r ), products (X p ) or combinations of reactants and products (X d and X rp ). CM and BoB use an internal sorting of features by row-norm, which makes these representations non-additive. Representations of individual molecules are therefore concatenated to give reaction representations X r , X p and X rp where the latter is a concatenation of the first two. For the remaining representations, X r is a summation of the N reactant representations: (1) X p is a summation of the M product representations: and X d is a difference in the summed product and reactant representations: where X denotes the molecular representation. SOAP was generated using the dscribe python package [63]. The other molecular representations were generated using the QML python package [64]. The default parameters are used for all representations. Our B 2 R 2 representation (vide infra) is available as part of the code repository. The most important parameter, the cut-off radius R cut , is optimised for each dataset on a grid. Optimal R cut values are provided in the supplementary information.

Machine learning models
In all cases, Kernel-Ridge Regression (KRR) models were used. While other models, like deep neural networks, might also be suitable, data-efficient KRR has historically dominated the physics-based machine learning landscape [3,65] with benchmark studies [66] demonstrating its superior performance on prototypical quantum chemistry datasets. For this reason, we choose to initially benchmark our reaction representations using KRR models only.
All KRR models are trained with a Gaussian kernel K. The predicted property p for a reaction X, either the reaction energy or barrier, is then given by: where the coefficients α are learned from the training set: The kernel width σ and regularisation parameter λ are optimised using five-fold cross validation. In other words, after initially shuffling the datasets, they are split into five folds, where the first 4 (80%) are used for testing, and the last (20%) for validation. The best hyperparameters are those with a minimal MAE on the test set across the five folds. Details of the hyperparameters tested and the eventual optimal values are given in the supplementary material.
For all datasets except the Hydroform-22-TS set, learning curves are generated using standard ten-fold cross validation. The last fold (10%) of the data is held out as a test set. Of the 90% available for train, increasing fractions of the data are included to generate the learning curves. The MAEs on the test set are averaged over the ten folds. Learning curves for the Hydroform-22-TS set are generated in a different way, owing to its multi-modal nature (see figure 2). The data is split into ten folds as before, where the last fold is always held out as a test set. Rather than including the first 200, 400, etc points for training, the most diverse 200, 400, etc points are included at a time. Farthest point sampling (FPS) is used greedily to determine the most diverse points in the training set. The out-of-sample MAEs are again averaged over the ten folds.

From molecular to reaction representations
In order to maximise transferability, we benchmark the selected reaction representations (see section 2.2) on the four datasets covering a range of reaction types, properties and molecular sizes. The learning curves in figure 3 report the performance of one molecular representation per category (BoB, SLATM and SOAP) but the remaining ones (CM and FCHL19) can be found in the supplementary material.
With the kernel models used herein, thermodynamic properties can be learned to higher accuracy than kinetic properties as illustrated by the lowest prediction errors obtained for the SN2-20 set. This result is expected, as the reaction energy (thermodynamic) depends on the isolated reactant and product structures and not on the reaction barrier. In contrast, other test sets with energy barriers as a target (e.g. Proparg-21-TS, Hydroform-22-TS) are more difficult to learn, as they depend on transition state structures [31]. As noted by previous authors [29,30], the large errors observed for the GDB7-20-TS dataset arise from the inherent challenging nature of this set caused by the significant spread in the target property (0-200 kcal mol −1 ). While pre-training combined with deep neural network architectures can offer improvements for reaction tasks [30], such a comparison is beyond the scope of this work dedicated to physics-based reaction representations.
In the following analysis, we identify three key design principles for an effective reaction representation: first, the difference between products and reactants should be meaningful (i.e. the representation should be additive). Second, emphasis should be placed on the pairwise (two-body) interactions. Third, distinct pairwise interactions should be distinguished. Each of these ingredients and their motivation are discussed in the next sections.

Representations of meaningful differences
In figure 3, it is evident that representations including both reactants and products (i.e. X rp and X d ) systematically outperform the equivalent representations based on the reactants or products only (i.e. X r and X p ). For most datasets, the slope of X rp and X d models are steeper, leading to an improved performance over X r and X p . For the Proparg-21-TS set, it is instead the offset of X d that improves the predictions. Previous efforts in the literature suggested that a single reactant might be sufficient to learn reaction properties [56,67]. This was motivated for example by Hammond's postulate [56,68], where the transition state (and therefore reaction barrier) should closely resemble either the reactant or product. In general, a description of both sides of the chemical equation is fundamental, as neither the reactant nor the product consistently resemble the transition state. Additionally, in a given reaction dataset, the same set of reactants may have multiple products. A representation based on reactants alone therefore leads to a direct violation of the representation uniqueness and injectivity. Therefore a transferable reaction representation must describe all of the participating molecules in the reaction.
As mentioned in section 2.2.2, some representations, like CM and BoB, use an internal sorting procedure which renders the notion of difference between representations meaningless. Representations based on additive potentials (FCHL and SLATM) or densities (SOAP) allow for a suitable notion of difference. Figure 3 illustrates that difference-based representations X d outperform those that concatenate reactants and products X rp in the prediction of reaction properties. The additivity criterion is well-justified from physical laws, as most reaction properties are algebraically additive. For instance, reaction energies are defined as: If a representation is additive, then we can write similarly: Assuming that molecular representations contain sufficient information to regress molecular energies, then the algebraic sum of the molecular fingerprints should also correlate well with the algebraic sum of the energies. Reaction barriers are not an explicit algebraic difference like reaction energies. However, transition state structures often can be reasonably approximated as a (weighted) interpolation between reactant and product structures, analogous to interpolation in the Nudged Elastic Band [69] method. This allows X d to resemble X TS : In the simplest case, a = b = 0.5. In previous work [31], it was found that adapting a ̸ = b ̸ = 0.5 did not improve the performance of the reaction representation. Taking the simplest interpolation allows for a consistent definition of X d which performs well for thermodynamic and kinetic property prediction. Additionally, X d has the same dimensions as the single-molecule representation equivalent X, whereas X rp doubles the number of features. X d thereby enhances the predictive performance of reaction properties in the same number of features as single-molecule features.

Importance of two-body features
While representations incorporating higher-order terms (SLATM, SOAP) outperform those with two-body interactions only (BoB) in figure 3, it is known that the two-body interactions are critical, and that higher order interactions only modestly improve upon the initial necessary terms [3]. For chemical reactions in particular, the predominant interactions are bonds breaking and forming. We hypothesised that only two-body interactions were necessary for good predictive performance of reaction properties. To this aim, we tested the two-body terms of SLATM (SLATM (2) d ) and compared it to full SLATM d . Figure 4 illustrates that SLATM (2) d indeed performs almost equivalently to SLATM d across the four datasets. The eventual MAE of SLATM (2) d and SLATM d models is very close. For the smaller molecule datasets (SN2-20 and GDB7-20-TS), the slope of SLATM (2) d is a little more shallow, but not significantly so. We note that SLATM places a higher weight on short-range interactions by describing pairwise interactions using the London potential 1/R 6 [70] which diminishes the description of longer-range interactions, which might be needed to allow SLATM (2) d to match SLATM d in all cases. Removing or reducing this higher weight on shorter-range interactions might allow for a two-body only representation to be sufficiently accurate.

Separation of distinct two-body features
The two-body terms of SLATM d are separated into pairwise bags of element types (C-C, C-N, C-O, etc). This separation allows for the isolation of the description of important bonds that are involved in a reaction. In figure 5, we leverage the conceptual simplicity of an S N 2 reaction to look at SLATM r , SLATM p and SLATM d two-body features. A C-F bond is broken and C-H bond is formed, while all other bonds remain unchanged during the reaction. The pairwise interaction bags are organised into a fixed order, such that the same bond type (e.g. C-H) is at the same position in the reactant and product representation vector. The The pairwise bagging approach provides clearly separated and interpretable features, but has relatively inefficient computational scaling at O(n 2 ), where n is the number of unique elements in the dataset. We present the B 2 R 2 representation first with the same scaling, and later propose alternative approaches: at O(n), still separating two-body features, and at O(1), reducing feature separation and slightly suffering in performance accordingly.

The bond-based reaction representation
SLATM d is a robust reaction representation as a consequence of three key ingredients: (i) a meaningful difference, (ii) an emphasis on two-body interactions and (iii) separation of the relevant two-body features. These ingredients enable SLATM d to capture and amplify changes in individual bonding environments, which are the ultimate drivers of reaction properties. Relying upon these concepts, we present the Bond-Based Reaction Representation (B 2 R 2 ), a dedicated reaction representation built on the notion of difference in pairwise interactions between reactants and products. There are three variations depending on the bagging strategy: (i) the canonical variation, with pairwise bags (O(n 2 )); (ii) a linear variation (O(n)) and (iii) a constant-size variation (O(1)). In all cases, the B 2 R 2 is significantly smaller than SLATM or other higher-order potential based representations which scale as O(n 3 ).

Canonical B 2 R 2
The canonical variant of B 2 R 2 employs the same bagging strategy as SLATM: that is, by pairwise element types. To describe an interaction between a pair of atoms I and J (in a reactant or product molecule), with nuclear charges Z I and Z J respectively, only the distance between the atoms R IJ is needed, since the element types are encoded in the bag. Here, we use simple Gaussian functions to encode pairwise interactions, which we found to offer good predictive accuracy without the need for physically-informed potential terms (as is typical in QML representations [8,9,71,72]). As outlined in section 3.1, it is the additivity of the functions employed that is responsible for their performance. We choose Gaussian functions centred on the bond between two atoms (µ = R IJ /2), with a standard deviation that we found to perform well on average across the datasets tested (σ = R IJ /8).
For each bag, all p ∈ P bond distances R p in the products and r ∈ R bond distances R r in the reactants are collected and used to construct a difference: To emphasize the relevant bond distances, only R IJ < R cut are included in the representation. R cut is optimised for each dataset (see supplementary material), but typically lies between 3 and 5 Å. This suggests that most information needed to predict reaction properties is local, i.e. within the range of a few bond lengths. Depending on the dataset, in some cases longer-range interactions are needed to describe the structural rearrangements associated with bond-breaking and -making. Unlike in SLATM, the shorter-range interactions are not weighted. Instead, R cut accounts for the nature of bonding interactions per dataset. For example, if longer-range interactions dominate, a longer R cut would be optimal. Figure 6 illustrates B 2 R 2 's desirable properties based on diagnostics introduced in our previous work [31]. Dissimilarity plots were introduced to determine whether a representation correlates with the target property. These plots map the pairwise Euclidean distance between representations against the pairwise difference between target values. A suitable representation should recognise that a small distance between representations should correspond to a small difference in the target property. Plots are constructed for SLATM (2) r , SLATM (2) d and B 2 R 2 on the GDB7-20-TS dataset (plots for other sets are given in the supplementary material). Both SLATM (2) d and B 2 R 2 exhibit ideal behaviour, whereas SLATM (2) r results in a noisy dissimilarity plot which actually increases the distance between representations around zero barrier difference.
Correspondingly, in figure 7, except for minor performance differences, SLATM (2) d and B 2 R 2 perform similarly well, while SLATM (2) r fails to accurately predict the target property. Except for the SN2-20 set, the B 2 R 2 curves have the same slope as SLATM d , or even steeper (Proparg-21-TS). While B 2 R 2 does not offer an overall improvement in performance vs. SLATM d , it does encapsulate the same critical information in far fewer features (vide infra).

Linear-and constant-size versions
While pairwise bags are the canonical choice, and they allow for interpretable features, they also scale as O(n 2 ). While less than SLATM's O(n 3 ), other bagging strategies reduce the scaling.
To reduce the scaling to O(n), B 2 R 2 l uses bags constructed using elements rather than pairwise elements. In other words, bags contain information about Z I only rather than about Z I and Z J . As a consequence, information about Z J should be included in the representation. To this aim, we use a skew-normal distribution rather than a Gaussian distribution to describe pairwise interaction terms: where ϕ(x) is the standard normal probability density function: and Φ(x) is the cumulative distribution function: The degree of skewness is modulated by the second nuclear charge Z J , which also amplifies the magnitude of the function. A similar form is used for the third no bags B 2 R 2 n representation, where interaction terms are no longer separated by pairs of elements (as in B 2 R 2 ) or elements (as in B 2 R 2 l ), instead collected in the same fixed-size vector. Since there is no bagging to isolate Z I , the magnitude is modulated by Z I rather than Z J as in B 2 R 2 l :

Evaluation and performance
The features of B 2 R 2 n correspond to equation (14) evaluated for each grid point equally spaced between 0 Å and R cut . The same grid-spacing of 0.03 Å is used in all B 2 R 2 variations, such that the length of B 2 R 2 n is determined only by the cut-off R cut . Similarly, for B 2 R 2 and B 2 R 2 l , the features of each bag are evaluated using equations (10) and (11) respectively across an equally spaced grid. Their size is determined by both R cut and Features of the three B 2 R 2 variants are illustrated for an S N 2 reaction example in figure 8. Akin to SLATM (2) d , B 2 R 2 emphasises bonds breaking and forming in pairwise bags. B 2 R 2 l encodes the same information, but collects it in element bags, where now the relevant features are in the H, C and F bags. Finally, B 2 R 2 n no longer separates the features into bags, and simply collects all pairwise interactions into the same feature vector. Moving between variations subsequently reduces the representation size, all of which are significantly smaller than the baseline SLATM d . Figure 9 compares the predictive accuracy of the three B 2 R 2 variants. Interestingly, B 2 R 2 l achieves nearly identical performance to B 2 R 2 . Except for the GDB7-20-TS and Proparg-21-TS sets, the slopes are also equally steep. This suggests that incorporating information about Z J in the functional form of the interaction terms, and bagging only by Z I , is as effective as incorporating information about both Z I and Z J . A similar strategy is employed in the design of FCHL19 [10], which uses element bags rather than pairwise bags, and typically achieves similar performance to SLATM in the prediction of molecular properties. FCHL19 does not construct a single function for each pairwise interaction, but rather, in the spirit of ACSF [73], a set of radial basis functions for each unique element type. B 2 R 2 l maintains the single function concept of SLATM while employing an intelligent bagging strategy similar to that of FCHL19. The B 2 R 2 l results thus demonstrates the relevance of this approach for reaction properties.
Finally, B 2 R 2 n (variant with no bags) achieves surprisingly good predictive capabilities. As illustrated for an S N 2 example in figure 8, features that are otherwise separated into four (B 2 R 2 ) or three (B 2 R 2 l ) distinct bags now overlap. Nevertheless, the overall reduction in performance is not drastic, especially for the Hydroform-22-TS and Proparg-21-TS sets. Both sets correspond to larger molecules containing a more diverse set of chemical elements than those in SN2-20 and GDB7-20-TS. As a consequence of the diversity in chemical elements, there are a larger range in bond lengths, likely preventing significant overlap in the same features and allowing for the effects of different pairwise interactions to be distinguished. For GDB7-20-TS, B 2 R 2 n exhibits unusual behaviour, with increased out-of-sample MAE for intermediate training set sizes. Since this is the most challenging dataset, B 2 R 2 n is not the most suitable choice here. In any case, such oscillations at intermediate training set sizes can likely be corrected with FPS learning curves as done for the Hydroform-22-TS set.
In practice, all three variations of B 2 R 2 are available but we recommend the users to default to B 2 R 2 l for a compromise in representation size and predictive capability. Overall, the B 2 R 2 series emphasize the necessary attributes for an effective reaction representation and constitutes a first step towards exploring more sophisticated functional forms. We note that the B 2 R 2 representations might not perform well in all cases. For example, they do not naturally encode symmetry of reaction energies: a model trained on a dataset of forward reactions with their corresponding reaction energies will not naturally predict the same reaction energy with opposing sign for a dataset of backward reactions. Additionally, while a trained model tested on reactions consisting of identical reactants and products should recognise that the reaction energy should be zero, it will not necessarily recognise that the barrier should not be. Previous authors [29,30] addressed such issues by including both forward and backward reactions in the training set. However, it might be feasible to encode such symmetries in the representation directly.

Conclusion
We systematically explore the construction of reaction representations from existing molecular ones, and benchmark their performance on four illustrative datasets of chemical reaction properties. One of these, the Hydroform-22-TS (2350 hydroformylation reaction barriers), is introduced as part of this work. Key design principles for a high-performing, transferable reaction representation are extracted: (i) a meaningful notion of difference between products and reactants (i.e. additivity); (ii) an emphasis on bonding (pairwise) interactions and (iii) effective separation of the description of these bonding interactions. We use these principles to design the Bond-Based Reaction Representation (B 2 R 2 ) and illustrate its competitive performance. Finally, strategies are proposed to manipulate the size of the representation while maintaining excellent predictive capabilities. We expect these findings will expand QML from a molecule-focused discipline to a reaction-focused one.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://doi. org/10.5281/zenodo.6627913.