A Primer about Machine Learning in Catalysis – A Tutorial with Code

Based on a well-edited dataset from the literature by Schmack et al.,[1] this manuscript provides a tutorial-like introduction to Machine Learning (ML) and Data Science (DS) built around actual programming code in the Python programming language. The study not only illustrates a typical ML workflow, but also shows important tasks like hyperparameter tuning and data preprocessing, which often take up much of the time of an actual study. Moreover, the study spans from classical ML methods to Deep Learning with Neural Networks.


Introduction
Machine Learning (ML) is a growing area in nearly all fields of science, but also in public life in general. Everyone uses, for example, the recommendation systems of Netflix or Amazon. Other examples are personal assistants like Siri or Alexa, or the algorithms behind Google's search engine. The examples seem endless and extend to applications like self-driving cars. In Chemistry and Catalysis, too, more and more applications for Machine Learning appear. Reviews and conceptual publications [2][3][4] do not move away from Computational Chemistry or experimental Catalysis but shift more weight towards the exploration of the available data and the definition of proper catalyst descriptors.
Very often ML is used at the border of Computational and Solid State Chemistry [5,6], as the descriptors of the materials are well defined and larger material libraries are easy to prepare. ML is even more prominent in the field of Computational (Heterogeneous) Catalysis and Chemistry [7], as this field has the power to generate large datasets in silico. Here the limiting factor is often not the ML part of the studies but the computationally expensive Quantum Chemical calculations. There are also studies from other fields of Chemistry and Catalysis: with a suitable dataset, water oxidation catalysts are predictable with ML [8], and even in Organic Synthesis there are approaches to make ML-based predictions. [9] But especially for people newly approaching this field from their respective discipline, Data Science (DS) and Machine Learning sometimes seem to be some kind of arcane art. Yet when comparing, for example, a Neural Network with the mathematics behind Computational Chemistry, the latter is far more complex. This study will try to convince the inclined reader that ML and DS can be a valuable addition to the Chemist's toolbox. The respective algorithms will of course not solve every problem in Chemistry, but they can help to see and visualize trends that are sometimes well hidden in the data.
As the starting point of this study, no artificial data will be used, but a well-edited dataset compiled from experiments from the literature, initially collected by Zavyalova et al. [10] The dataset used in this manuscript deals with the oxidative coupling of methane (OCM) and has more than 1,000 entries. One conclusion drawn from the data is that there are 18 key elements for OCM, namely Sr, Ba, Mg, Ca, La, Nd, Sm, Ga, Bi, Mo, W, Mn, Re, Li, Na, Cs, F and Cl, which turned out to be important for a good performance of the OCM reaction. In a first publication based on the original data, Kondratenko et al. [11] used just a fraction of the dataset to gain more insights. Finally, Schmack et al. [1] arrived at an even more curated dataset based on the original version. Based on their statistical analysis they showed that a good OCM catalyst contains at least two elements, one of which must be able to form a carbonate at the OCM reaction temperature. The second element must be thermally stable at the relevant conditions. This leads to OCM catalysts composed of a thermally stable oxide support together with an active species able to form a carbonate. More information about the original data can be found in the respective manuscripts. The OCM reaction is still under active research because of the relative abundance of methane, and there are, for example, newer studies combining DS approaches with High-Throughput Screening. [12] This study will not reveal new trends in the history of the OCM but will draw some conclusions based purely on the published data, where possible with cross-reference to the original manuscripts. The author will try to illustrate a ML workflow with a sensible choice of algorithms and tools, knowing that the same is also achievable with other tools, for example Matlab instead of Python.
A Jupyter notebook consists of a browser-based interface and a kernel that executes the code; the kernel does not have to run on the same computer as the browser interface but can also live on a remote high-performance machine. Please note that Jupyter notebooks have the power to mix programming code, Markdown text (a lightweight markup language) and LaTeX for the narrative around the code. Moreover, it is possible to add videos, pictures, widgets and so on for a rich user experience. Apart from being a tool for programming, Jupyter notebooks are an especially useful tool for teaching. For DS and ML, Python is close to being the standard programming language.
This manuscript is meant to be somewhere between a concept and a tutorial, so there will be things that cannot be explained in depth because of length restrictions. Python and Jupyter Notebooks are well documented, and the author encourages the reader to learn more from books or online resources. Now and then semicolons are added to the source code to make the output more readable; they are not needed for functional code (most of the time). The choice of the libraries used is intentionally kept simple. For the plotting, for example, Matplotlib is chosen instead of Altair or Bokeh. Experience from lecturing on topics related to programming shows that one main hurdle is getting started, and simplicity helps in the first place.

Data preprocessing
As a first step, the source data has to be imported and preprocessed in a way that the ML algorithm of choice can use the dataset. This is often a very important step in every ML study, because preprocessing helps to get an overview of the data, to find outliers and other peculiarities of the respective dataset. To start the project we will first import some libraries: Pandas [14] to work with the tabulated data, NumPy [15,16] for array-type elements and the respective math, and Matplotlib [17] for the visualization of the data.
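The imports, together with the version printout mentioned below, might look as follows; the aliases pd, np and plt are the common community conventions:

```python
# Core libraries of the tutorial: Pandas for tabular data,
# NumPy for array math, Matplotlib for plotting.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Print the version numbers to simplify reproducibility.
print("pandas     :", pd.__version__)
print("numpy      :", np.__version__)
print("matplotlib :", matplotlib.__version__)
```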
The next code block is included to print out the version numbers of the libraries used to simplify the reproducibility of this study.
The code below, starting with %, is a 'magic command' of the Jupyter Notebook and makes all plots appear directly in the notebook.
Although ML methods are often statistical methods, it is important to end up with a reproducible and deterministic study. Therefore it is important to provide seeds for the random number generators working behind the scenes in some algorithms and procedures. If a dataset is very large and well balanced this is sometimes not important, but for medium and small datasets it can be an issue. Therefore we first fix the global random seed to an arbitrary value of 42. Additional seeds will be fixed later in the study where needed, with some additional comments. Next, we restrict the Pandas library to show only five columns of the respective tabular output. This restriction is only necessary in the context of this publication, for better readability.
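These two settings can be sketched in a few lines:

```python
import numpy as np
import pandas as pd

# Fix the global random seed to an arbitrary value.
np.random.seed(42)

# Show at most five columns of tabular output; this is only
# for readability in the manuscript, not needed for the analysis.
pd.set_option("display.max_columns", 5)
```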
The authors of the original study [1] did not only publish the manuscript and an electronic supplementary file; an Excel file is also available online. This link is now put into the variable url. Now we load the actual dataset from the respective website. For this the read_excel function from Pandas is used. It takes the URL where the file lives (or simply a filename) and the name of the sheet we want to use as arguments; of course there are more possible arguments for fine-tuning. The loaded table is stored in the Pandas dataframe raw_data. Now we print the first five rows of raw_data with the Pandas head function. The table lists every used element and its respective amount. For a ML approach it would help to have a table with one column per element and its amount in the sample, no matter whether it is a cation, an anion, support material or something else. With a table like that we could use the data as features, meaning the input data for ML algorithms. To get such a table we will first create pivoted tables with Pandas for all ions and the supports. The pivot function will sort every original column for us with respect to the elements. Now we create a Python list pivot_list holding the single pivoted tables.
And now we combine all of the pivoted tables with the Pandas concat function into one large dataframe.
Pandas, like NumPy, is array-oriented, so we can simply divide the concat_pivot_lists dataframe by 100 and put the result into a new dataframe called composition.
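The pivot-concat-divide pattern can be sketched on a tiny made-up table; the column names (Cation1, Cation1 at%) are illustrative assumptions, not the exact headers of the Excel file:

```python
import pandas as pd

# Miniature stand-in for the raw table: one column holds the
# element name, the other its amount (column names are assumed).
raw = pd.DataFrame({
    "Cation1":     ["Mg", "La", "Mg"],
    "Cation1 at%": [90.0, 80.0, 100.0],
})

# Pivot so that every element becomes its own column.
pivot = raw.pivot(columns="Cation1", values="Cation1 at%")

# In the study one pivoted table per original column pair is
# collected in pivot_list and then concatenated.
pivot_list = [pivot]
concat_pivot_lists = pd.concat(pivot_list, axis=1)

# Molar ratios between 0 and 1 instead of percentages.
composition = concat_pivot_lists / 100
print(composition.fillna(0))
```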
When we print this dataframe we can see that it is now organized by the chemical elements, with the molar ratio listed below each element.
For an overall dataframe we now extract the remaining publication numbers and the reaction conditions. To do so we use the Pandas iloc indexer, which allows us to extract data in the same way as from a NumPy array. The respective data is then put into the dataframes pub_nr and reaction_data. Now we create a list called cleaned_list holding all the single data parts, and finally we concatenate the three parts into one single dataset called data_cleaned.
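A minimal sketch of this step, assuming a made-up column layout (the real positions depend on the Excel file):

```python
import pandas as pd

# Stand-in for the raw table: publication number, a raw
# composition column and a reaction condition (layout assumed).
raw_data = pd.DataFrame({
    "Pub. Nr.": [1, 2],
    "Cation1":  ["Mg", "La"],
    "T / K":    [1023, 998],
})
composition = pd.DataFrame({"Mg": [0.9, 0.0], "La": [0.0, 0.8]})

# iloc extracts positionally, like indexing a NumPy array.
pub_nr        = raw_data.iloc[:, 0:1]   # first column
reaction_data = raw_data.iloc[:, 2:]    # condition columns

cleaned_list = [pub_nr, composition, reaction_data]
data_cleaned = pd.concat(cleaned_list, axis=1)
print(data_cleaned.columns.tolist())
```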
The printout of this dataframe indicates that all information is now stored in single columns, which is easier to analyze than a mixture of columns and rows. Optionally we can still continue to work with the single dataframes like composition, which are kept in memory.

Data inspection
Now we can start to work with the data. For example, let us calculate the mean of all the elements in the composition dataframe. Next we sort_values in descending order to make the elements with the highest amounts appear first. From this we take the first 18 rows, and by using the keys we get the names of the elements. All this can be done in a single line of code, leading to Mg, Ca and Si being the most prominent elements in the composition array. We will compare these 18 elements with the findings of one of the original publications [10] a little later.
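The one-liner might look like this on a tiny made-up composition table (in the study head(18) is used instead of head(2)):

```python
import pandas as pd

# Tiny stand-in composition dataframe (values are made up).
composition = pd.DataFrame({
    "Mg": [0.9, 0.5, 0.7],
    "Ca": [0.1, 0.5, 0.2],
    "Si": [0.0, 0.0, 0.1],
    "La": [0.0, 0.0, 0.05],
})

# Mean per element, sorted descending, top rows, element names.
top_elements = composition.mean().sort_values(ascending=False).head(2).keys()
print(list(top_elements))  # ['Mg', 'Ca']
```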
Visualization is a key part of every ML workflow, and so it is here for a first inspection of part of the data. Let us take the nine elements present in the largest amounts in the dataset. We will loop over these nine entries and plot, for each of them, the selectivity to C2 components on the y-axis against the respective molar amount on the x-axis. Every plot contains all samples from the dataset.
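The plotting loop could be sketched as follows; the element list and the data are stand-ins, and only the subplot pattern matters:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # no interactive display needed here
import matplotlib.pyplot as plt

# Made-up data standing in for the real composition and targets.
rng = np.random.default_rng(42)
elements = ["Mg", "Ca", "Si", "La", "Sr", "Ba", "Na", "Al", "Mn"]
composition = pd.DataFrame(rng.random((50, 9)), columns=elements)
s_c2 = rng.random(50) * 100  # selectivity to C2 in %

# One subplot per element: selectivity vs. molar amount.
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for ax, element in zip(axes.ravel(), elements):
    ax.scatter(composition[element], s_c2, s=10)
    ax.set_xlabel(element)
    ax.set_ylabel("S(C2) / %")
fig.tight_layout()
```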
From the inspection (Figure 1) there are no clear trends to see. For all nine elements the dots are spread widely over the whole composition axis, and for each element there are clearly good candidates as well as bad ones. In the next step we will use ML algorithms to try to uncover some trends.

Unsupervised Learning
In Unsupervised Learning the respective algorithm never gets to see the targets (output values) of the dataset. It only gets to see the features (input variables, like e.g. the composition). This results in a situation where the algorithm has to figure out any dependencies in the data by itself. In the next passage several algorithms will be combined with each other: * First a K-Means clustering algorithm to group similar samples together; * Second a Principal Component Analysis (PCA) and a t-Distributed Stochastic Neighbor Embedding [18] (TSNE) to reduce the dimensionality of the dataset.
To use the respective algorithms the corresponding classes will be imported from Scikit-Learn. [19] Scikit-Learn is a ML library popular in the Python community for its easy application, even for people not familiar with the field, and it is quite popular among more experienced users too. Now that we have imported the respective classes we have to make some use of them. Let us start with the K-Means clustering. The algorithm tries to separate the data into clusters where each data point has a minimal distance to a cluster center. But how do we find the right number of clusters? A typical approach is to plot an "Elbow curve". To do so we loop over several cluster counts n_cluster, in this case between 1 and 20 clusters, and in every pass of the loop we initialize the K-Means algorithm with a different cluster count. Next the algorithm is fit to the composition array, and last we calculate a score for each pass and store it in a list. One of the comfortable things about Scikit-Learn is that most of the algorithms have a scoring method included to evaluate the quality of the fit; in all cases a higher score means a better fit (for K-Means the score is the negative of the summed squared distances to the cluster centers). So after this step we have a score for each cluster count. Now the score is plotted over the respective cluster count. To make the "Elbow curve" look nicer, the gradient of the score is calculated via NumPy.
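The loop for the elbow curve could look like this; random stand-in data replaces the composition array, and n_init is set explicitly, which is an addition for reproducibility:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.random((100, 5))  # stand-in for the composition array

scores = []
cluster_range = range(1, 21)
for n_cluster in cluster_range:
    kmeans = KMeans(n_clusters=n_cluster, n_init=10, random_state=42)
    kmeans.fit(X)
    # For K-Means the score is the negative inertia: higher is better.
    scores.append(kmeans.score(X))

# The gradient of the score makes the elbow easier to spot.
gradient = np.gradient(scores)
```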
It can be seen that the line goes all the way down until a cluster count of around 7 (Figure 2). Choosing fewer clusters results in a worse score, and choosing more than 7 clusters does not lead to a better result. So a cluster count of 7 is chosen and a final algorithm is fit with this value. Indeed, in the original publications about 10 groups of similar catalysts were found with completely different methods. Having this in mind, we are in good agreement with the previous knowledge.
The next code also shows nicely how most Scikit-Learn algorithms work: * First an instance of the respective algorithm is initialized, here KMeans, and the parameters are set; * next the instance of the algorithm is fit to the respective data; * and last some predictions, transformations or scores are calculated, depending on the algorithm. In this case we end up with a cluster number for each observation in the dataset, so that we know which observation belongs to which cluster.
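The three steps above can be sketched as follows, again on stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.random((100, 5))  # stand-in for the composition array

# 1) initialize the algorithm with the chosen parameters,
kmeans = KMeans(n_clusters=7, n_init=10, random_state=42)
# 2) fit to the data and 3) predict one cluster label per row.
cluster_labels = kmeans.fit_predict(X)
print(cluster_labels[:10])
```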
Unfortunately the dataframe for the elemental composition now has 68 columns and 1802 entries, which means we have a 68-dimensional dataset, and it is not straightforward to do a 2D visualization directly. But Unsupervised Learning also includes many algorithms for dimensionality reduction. In this work a classical one, Principal Component Analysis (PCA), will be used first. The algorithm tries to project the variables onto a new coordinate system whose axes, the principal components, have a high variance. Each axis of the PCA represents a linear combination of the input features, and there are ways to extract even more information from these linear combinations. This will be neglected for now, because then the PCA would have to be programmed a little differently; we will only use the power of PCA to reduce dimensions.
To use a PCA, first a PCA instance is initialized with two components, then it is fitted to the composition dataframe, and finally the data is transformed into the 2D space.
An alternative algorithm is the t-Distributed Stochastic Neighbor Embedding (TSNE). It models which data points are neighbors in the high-dimensional space as a probability distribution, and the algorithm then tries to reproduce this distribution in the 2D space. The programming procedure is, apart from the arguments set, the same as for the PCA above.
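Both reductions follow the same call pattern; a sketch with random stand-in data (the perplexity value is an assumption and must be smaller than the number of samples):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.random((60, 10))  # stand-in for the 68-dimensional data

# PCA: initialize with two components, fit, transform to 2D.
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

# TSNE: same pattern, different arguments.
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
X_tsne = tsne.fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```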
It is now possible to plot the dimensionality reduction from both algorithms and color the data points according to the cluster they belong to (Figure 3). From the PCA in the left graph it seems that there is not too much variance in the data at all, yet all the clusters are nicely separable. The TSNE algorithm also nicely reduces the 68-dimensional data into the plane and separates it into the calculated clusters. Of course it is not perfect, but especially for TSNE no hyperparameters were adjusted, which could improve the separability even further. What is lost through the dimensionality reduction is the information about the chemistry and the composition. But the original composition can be colored with respect to the cluster numbers we found (Figure 4). When we do so we can, for example, recognize that one cluster (first row, first column, cluster 0) is rich in Magnesium, one cluster (first row, second column, cluster 5) includes lots of Calcium, and one cluster (second row, second column, cluster 3) is rich in Aluminum. With this neat little trick we now know much more about the elemental composition of each of the clusters we found. It is clear to see that every cluster always contains more than one element: the clustering does not head for classes containing just one element but for multi-component catalyst classes. This makes the interpretation sometimes difficult.
Coming back to the original publication [10], with its proposed 18 key elements for the OCM reaction, and comparing them with the elements available in the largest proportions in this study, it shows that Sr, Ba, Mg, Ca, La, Nd, Sm, Mn, Li, Na and Cl are present in both studies. So even by looking at just the data, without any prior knowledge about OCM, we would be able to make an educated guess for some OCM catalysts. Looking at the hypothesis from Schmack et al. [1] about one stable oxide support and one active species that can form a carbonate, we could also guess a support like Alumina and active metals like Li, Mg, Ca, Sr and Ba from the visualizations we just did. Schmack et al. also verified their hypothesis experimentally, and an initial guess based on the cluster mapping would be a valuable experimental starting point.

Supervised Learning
So far we have gained quite some insight into the dataset, even though it is 68-dimensional, but now it is time to use Supervised algorithms to make predictions based on that dataset. It can be quite challenging to choose a good algorithm, but there are guidelines like the one from Scikit-Learn [22] that cover more algorithms than the ones used in this study.
For most Supervised algorithms it is common practice to split the dataset into a training set and a test set. The idea behind this is that the test set mimics new data that the trained algorithm has never seen, in order to evaluate its performance. Of course it is possible to do the splitting, shuffling and so on by hand, but Scikit-Learn provides a method called train_test_split.
So first we put all the features, the composition, into a feature array X and fill empty values with zeros. In the same manner we put the selectivity to C2 products into a target array y and divide all values by 100 to be in the range between 0 and 1. The feature and target arrays are then split into a training and a test set with train_test_split. Very often the shuffling of the data before the split is an issue, and so it is here. Fixing the random_state for the shuffling to 42 leads to more stable results and is already part of the hyperparameter tuning; the intention behind this is the same as for fixing the global random seed. A typical test set contains between 20 and 30 % of the samples. Putting more samples into the test set increases the risk of overfitting, because the training set becomes smaller.
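A sketch of the split on stand-in data (the 25 % test size is one choice from the typical 20 to 30 % range):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
composition = pd.DataFrame(rng.random((100, 5)))
s_c2 = pd.Series(rng.random(100) * 100)  # selectivity in %

X = composition.fillna(0).values   # features, empty values -> 0
y = s_c2.values / 100              # targets scaled to [0, 1]

# 25 % test set, shuffling fixed for reproducible results.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)
```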

Support Vector Machine
Now that we have a decently split dataset we will use a Support Vector Machine (SVM; for regression, SVR) [21] as a first model. Unfortunately the default parameters are not very useful, so they have to be adjusted. Therefore the function RandomizedSearchCV will be used to help us with the hyperparameter search.
Although we could simply initiate an instance of the SVM, set some values as before, fit it to the data, look at the score, and repeat until we find good values, this would be tedious. Hyperparameter optimization is therefore of great importance for ML. To make the search for good hyperparameters easier we can use one of the methods available in Scikit-Learn, like RandomizedSearchCV. To do so we create a dictionary with the hyperparameters to search, then an instance of the SVM is initiated with the static parameters that will not be changed during the search. Next an instance of RandomizedSearchCV is initiated with the SVM and the hyperparameters as arguments, and upon calling the fit method this instance uses the training data to look for the best hyperparameters with respect to the score. There are other methods around, like GridSearchCV. The difference is that the randomized search takes random combinations from the proposed hyperparameters, while the grid method tries all of them in a brute-force fashion. The randomized search is chosen here because it is the faster one when no parameters are known.
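A minimal sketch of such a search; the parameter ranges, n_iter and the stand-in data are assumptions:

```python
import numpy as np
from scipy.stats import uniform
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(42)
X_train, y_train = rng.random((80, 5)), rng.random(80)

# Dictionary of hyperparameters to sample from (ranges assumed).
param_distributions = {
    "C":     uniform(0.1, 10),
    "gamma": uniform(0.1, 20),
}

svr = SVR(kernel="rbf")  # static parameters stay fixed
search = RandomizedSearchCV(svr, param_distributions,
                            n_iter=10, cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)
```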
This search leads to a C constant of nearly 1 and a gamma value of about 13. Now we fix the parameters to the best found values and initiate a new SVM instance with exactly these values.
And now we train the algorithm again on the training dataset.
With the trained SVM at hand we can now make predictions, first on the training set and then on the so far unseen test dataset. To see the performance of the trained algorithm we take a look at the scores.
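These steps might look as follows; the hyperparameters are carried over from the search described in the text, while the data is again a random stand-in:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X_train, y_train = rng.random((80, 5)), rng.random(80)
X_test,  y_test  = rng.random((20, 5)), rng.random(20)

# Fix the best hyperparameters found by the search.
svr = SVR(kernel="rbf", C=1.0, gamma=13)
svr.fit(X_train, y_train)

y_pred_train = svr.predict(X_train)
y_pred_test  = svr.predict(X_test)

# R^2-type score on the training and on the test set.
print(svr.score(X_train, y_train), svr.score(X_test, y_test))
```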
A score of about 62 % is at least better than guessing the right combination from 68 elements, but it is not really good, and the score for the test set is even worse. Visualizing the trend in a parity plot shows the same (Figure 5).

Random Forests
Maybe the SVM is not the right algorithm for this kind of dataset, and this will happen quite often with real-life examples like this one, even though an SVM is often well suited for smaller datasets like this. Another aspect not regarded in this study is the data itself. Here only the composition of the catalyst is used, while the reaction parameters are completely neglected. For example, the reaction temperature is a very important variable in OCM, and there is a high probability that including the reaction parameters would improve the study. Please keep that in mind as we now stay with the elemental composition. We will try a Random Forests algorithm instead of an SVM. Random Forests is known to work even with very large datasets, and it is quite popular in Chemistry, as there is a good chance that the result is easy to interpret.
After the import of the RandomForestRegressor we can look for decent hyperparameters as we did before: first a dictionary of possible parameters is defined, then the regressor and after that the search algorithm are initiated. Finally the search is fitted to the training data.
We always get an n_estimators of around 20 as the best result, and this is now fixed for the training of the final Random Forest. Now we can again fit the final regressor to the training data.
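The search and the final fit can be sketched as follows; the search range and the stand-in data are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(42)
X_train, y_train = rng.random((80, 5)), rng.random(80)

# Search over the number of trees (range is an assumption).
param_distributions = {"n_estimators": range(5, 100)}
forest = RandomForestRegressor(random_state=42)
search = RandomizedSearchCV(forest, param_distributions,
                            n_iter=10, cv=5, random_state=42)
search.fit(X_train, y_train)

# Fix the best found value and refit the final regressor.
final_forest = RandomForestRegressor(
    n_estimators=search.best_params_["n_estimators"], random_state=42)
final_forest.fit(X_train, y_train)
```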
Of course we can make predictions with the trained algorithm and calculate the score for the training and the test dataset.
Plotting the results from Random Forests looks a little better, though (Figure 6). To compare the performance of the two algorithms, some more metrics are calculated. The first metric is a simple mean_squared_error and the second the r2_score, which is basically the same score we already calculated, but here we use the r2_score explicitly to be concise. The scores are calculated for both the training and the test datasets for comparison.
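The two metrics are one call each; a sketch with made-up targets and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, standing in for the real ones.
y_train      = np.array([0.1, 0.4, 0.5, 0.8])
y_pred_train = np.array([0.2, 0.4, 0.4, 0.7])

mse = mean_squared_error(y_train, y_pred_train)
r2  = r2_score(y_train, y_pred_train)
print(f"MSE: {mse:.4f}, R2: {r2:.4f}")
```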
What we see when plotting the metrics is again that the Random Forests is slightly better than the SVM (Figure 7). Another thing to mention is that the scores on the test data are for both algorithms a little worse than for the training data. This can be a hint that both algorithms are not overfitting and are still able to generalize to new and unknown data, which is exactly the aim of Supervised Learning.
But are there measures to be sure that the algorithms are not overfitting? The randomized search includes a concept called cross-validation. For each tested parameter combination, the training data is split into parts; five parts is a common choice. Four of the parts serve as a new training set and one as a test set for the parameter search. Every parameter combination is thus tested five times, each time with another fraction as the test set. Then we can evaluate the mean score over the five calculations and find the best combination.
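These mean scores are already stored by the search and might be accessed like this (search setup and data as in the earlier sketches, both assumptions):

```python
import numpy as np
from scipy.stats import uniform
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(42)
X_train, y_train = rng.random((80, 5)), rng.random(80)

# cv=5: five-fold cross-validation for every parameter combination.
search = RandomizedSearchCV(
    SVR(kernel="rbf"),
    {"C": uniform(0.1, 10), "gamma": uniform(0.1, 20)},
    n_iter=10, cv=5, random_state=42)
search.fit(X_train, y_train)

# One mean test score per tested combination, averaged over folds.
mean_scores = search.cv_results_["mean_test_score"]
print(mean_scores)
```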
When we take a look at the mean test score results (Figure 8) for the Support Vector Machine, we see that the score is always around 25 %, which is close to the value of the final version of the SVM with respect to the test set. For Random Forests it looks a little different, because the scores vary much more. Good scores of about 27 to 30 % come close to the final regressor on the test set. The bad values are a hint that for Random Forests a good performance depends very much on the shuffling of this particular dataset, and indeed, when the random_state in the train_test_split is set to different seeds, the values vary even more. We will continue with a random seed of 42 and more stable results. After evaluating the cross-validation results we can be quite sure that both algorithms are not overfitting.
Especially Random Forests can give some more information about the features we put into the algorithm. For example, we can ask which of the 73 elements in the dataset are the most decisive and have the highest impact on the predictions of the algorithm. To get this information we can use the SelectFromModel method from Scikit-Learn. It gives us a selection of the most important features in the dataset, in this case a subgroup of about 20 elements, and the output is ranked so that the feature with the highest importance comes first. Again, this observation is well in line with the findings from Ref. [10] about the key elements and from Ref. [1] about the combinations of support and active components. For example, Ca on Alumina would be a good experimental candidate, and the experimental data from Ref. [1] supports this guess. This is the initial step of a process called feature selection, which becomes more and more important the more features (input variables) a dataset has. Of course it can be beneficial to have more features, but at some point the calculation times get too long and it becomes mandatory to select the most important features.
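A sketch of feature selection on a toy dataset where the target is constructed to depend mainly on one element, so that element should be selected:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(42)
elements = ["Mg", "Ca", "Si", "La", "Sr"]
X = pd.DataFrame(rng.random((80, 5)), columns=elements)
# Target depends strongly on Mg (constructed for illustration).
y = 3 * X["Mg"].values + 0.1 * rng.random(80)

# Keep only features whose importance is above the mean importance.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=20, random_state=42))
selector.fit(X, y)
selected = X.columns[selector.get_support()]
print(list(selected))
```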
Up to now we could not beat the roughly 80 % accuracy of the Random Forest algorithm on the training set. Can we do any better?

Deep Learning to the rescue?
What can we try now to improve the quality of our prediction? Well, everybody is talking about Deep Learning nowadays, so maybe we should try this? We will, soon, but first: what is it all about? Machine Learning has been available for quite a while now, and indeed some algorithms are pretty old. There are also further techniques, for example Neural Networks (NN). NNs have also existed for quite some time, and a simple NN looks more or less like this: * an input layer of artificial neurons; * a hidden layer where the values of the input layer are multiplied with a weight array. If the value in a neuron is high enough, it gets activated by an activation function and the value of this neuron propagates to * an output neuron delivering the final value.
In principle one can model any arbitrary function with a NN, but in the past it was hard to adjust the weight arrays in the middle of the networks. Then an idea came up of how to adjust the weights with something called back-propagation [22], combined with a gradient descent optimization. This made NNs better usable, but it was still computationally expensive until the use of accelerator units like Graphics Processing Units (GPUs) led to a tremendous speedup. This enabled researchers to add more and more hidden layers to their NNs and to create more complex structures. If we have more than one hidden layer in a NN, we are generally talking about Deep Learning. [23] We will now try to see whether Deep Learning can solve our problem more efficiently than the classical methods.
Nowadays it is not necessary to program a NN from scratch; there are libraries out there to do the heavy lifting for the researcher. Examples are PyTorch (by Facebook) or Tensorflow (by Google) [24], but more libraries are available. Here we will use Tensorflow through the Keras API. This means the AI library Tensorflow will do the calculations and the Keras library helps to define the structure of the NN, which is a user-friendly way to get started. But first the libraries need to be imported. The next line is intended to clear the Tensorflow model; this is just done for the manuscript to have a fresh model in each pass of the code.
The imports could be a little easier, as it would be sufficient to just import tensorflow with a suitable alias, but that would lead to much longer code lines in the box below. Moreover, it is a little more narrative to import the additional libraries with proper aliases. Now we define a model with the Keras functional API, which is an easy way to get started. The model contains * one input layer with 73 neurons, * two hidden layers, again with 73 neurons each, * one output layer, * Dropout layers after each dense layer, * ReLU as activation function.
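The architecture above might be sketched with the Keras functional API as follows; the dropout rate is an assumption, and the exact parameter count of the original model may differ from this sketch:

```python
from tensorflow import keras
from tensorflow.keras import layers

keras.backend.clear_session()  # fresh model for each run

# Functional API: each layer is called on the previous tensor.
inputs = keras.Input(shape=(73,))
x = layers.Dense(73, activation="relu")(inputs)
x = layers.Dropout(0.2)(x)       # dropout rate is an assumption
x = layers.Dense(73, activation="relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1)(x)     # predicted C2 selectivity

model = keras.Model(inputs, outputs)
model.summary()
```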
Dense layer means that every neuron in one layer is connected with every neuron in the next layer; this is often displayed by arrows. Dropout layers can be imagined like a stroke: when the NN is trained, the connection to some neurons is cut in each training pass, forcing the NN to learn alternative ways through the network. This is a measure against overfitting. The output layer gathers everything together and tells us the selectivity we are aiming at.
The structure of a NN is one of the main hyperparameters to adjust in such a study. Now the model gets compiled with a loss function, the measure of optimization; in this case it is the mean absolute error mae. Moreover we need a suitable optimizer. For a regression problem like the one we are dealing with right here, rmsprop works well. Of course the parameters can always be optimized, but for the time being we can think of rmsprop as a suitable variant of gradient descent. Now we can take a look at what the NN model looks like.

Now it is time to train the NN with its more than 15,000 parameters. To do so we call fit on the model and feed it with the training features and targets. The training iterates several times over the data; each iteration is called an epoch, and the number of epochs is set to 500. With the validation_split a fraction of the training data is put aside to monitor the performance of the training. The loss function will always just go down, but on the validation dataset we can get indications that we are, for example, overfitting or underfitting. The use of a validation set follows the same idea as the test set, but it is only used internally for the fitting of the NN and is not equal to the test set. Although a size of about 20 to 30 % of the training set would be good for the validation set, one often finds smaller validation datasets. Especially when working with smaller datasets this is quite common, in order not to lose too many samples that are needed for a proper training. Here 10 % was reasonable.

Now we make a plot of the loss function and the validation loss function (Figure 9). We can see that over the epochs the loss function always goes down, leading to a higher accuracy. On the other side, the validation loss has a slight upward trend after about 70 epochs. This can be a hint that the NN is overfitting, which means it learns the training data by heart because it is large enough to store it all in its huge number of parameters.
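Compile and fit might be sketched like this; a much smaller stand-in model, random data and only a few epochs are used here so that the sketch runs quickly, while the study uses the real features and 500 epochs:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

keras.backend.clear_session()

# Small stand-in model (the real one is defined above in the text).
inputs = keras.Input(shape=(73,))
x = layers.Dense(73, activation="relu")(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# mae as loss, rmsprop as optimizer, as described in the text.
model.compile(optimizer="rmsprop", loss="mae")

rng = np.random.default_rng(42)
X_train = rng.random((64, 73)).astype("float32")
y_train = rng.random(64).astype("float32")

# 10 % of the training data is held out as validation set.
history = model.fit(X_train, y_train, epochs=5,
                    validation_split=0.1, verbose=0)
print(sorted(history.history))  # ['loss', 'val_loss']
```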
The validation data mimics unknown test data and therefore we will get a low score on the test data with a NN that is overfitting, but it will look good on the training data.
To test this, we make some predictions with the trained NN on the training and the test data and get the score.
And we get a score of around 70 % on the training data and about 23 % on the test data. So the training data looks good, but the test data does not. What went wrong? When we do the training again with just 80 epochs we end up with a 50 % score for the training data and around 22 % for the test data, which is a clear hint that we are overfitting with the model we built when training it over 500 epochs.
When we now make a parity plot for the test data it does not look much better than, for example, the SVM (Figure 10). On the other hand, the tree-based algorithm looks better and is easier to interpret than the NN. Of course the NN is not fully optimized by the author, but it clearly shows that just using Deep Learning is no guarantee of success and can even make the study very complicated.

Conclusions
I hope I could guide you through a Data Science study based on literature data, illustrating also some pitfalls in such a study. I hope I could show you that in Data Science you have to spend much time on the data, with preprocessing, inspection and visualization, and that you will need your scientific knowledge for the science fraction in the phrase Data Science to gather more knowledge. I wanted to illustrate that the tuning of the various hyperparameters is very important but can also be very tedious. Moreover, I wanted to illustrate that Deep Learning is indeed very powerful, but because of its complexity it can be prone to errors and has to be used prudently. Finally, I hope I could convince you that Data Science and Machine Learning can be a valuable addition to a Chemist's toolbox, and that there is more out there than Excel and Origin.