Machine Learning CICY Threefolds

The latest techniques from Neural Networks and Support Vector Machines (SVM) are used to investigate geometric properties of Complete Intersection Calabi-Yau (CICY) threefolds, a class of manifolds that facilitate string model building. An advanced neural network classifier and SVM are employed to (1) learn Hodge numbers and report a remarkable improvement over previous efforts, (2) query for favourability, and (3) predict discrete symmetries, a highly imbalanced problem for which both the Synthetic Minority Oversampling Technique (SMOTE) and permutations of the CICY matrix are used to decrease the class imbalance and improve performance. In each case study, we employ a genetic algorithm to optimise the hyperparameters of the neural network. We demonstrate that our approach provides quick diagnostic tools capable of shortlisting quasi-realistic string models based on compactification over smooth CICYs, and further supports the paradigm that classes of problems in algebraic geometry can be machine learned.


Introduction
String theory supplies a framework for quantum gravity. Finding our universe among the myriad of possible, consistent realisations of a four dimensional low-energy limit of string theory constitutes the vacuum selection problem. Most of the vacua that populate the string landscape are false in that they lead to physics vastly different from what we observe in Nature. We have so far been unable to construct even one solution that reproduces all of the known features of particle physics and cosmology in detail. The challenge of identifying suitable string vacua is a problem in big data that invites a machine learning approach.
Calabi-Yau threefolds occupy a central rôle in the study of the string landscape. In particular, Standard Model-like theories can be engineered from compactification on these geometries.
As such, Calabi-Yau manifolds have been the subject of extensive study over the past three decades. Vast datasets of their properties have been constructed, warranting a deep-learning approach [1,2], wherein a paradigm of machine learning computational algebraic geometry has been advocated. In this paper, we employ feedforward neural networks and support vector machines to probe a subclass of these manifolds to extract topological quantities. We summarise these techniques below.
• Inspired by their biological counterparts, artificial Neural Networks constitute a class of machine learning techniques capable of dealing with both classification and regression problems. In practice, they can be thought of as highly complex functions acting on an input vector to produce an output vector. There are several types of neural networks, but in this work we employ feedforward neural networks, wherein information moves in the forward direction from the input nodes to the output nodes via hidden layers. We provide a brief overview of feedforward neural networks in Appendix A.
• Support Vector Machines (SVMs), in contrast to neural networks, take a more geometric approach to machine learning. SVMs work by constructing hyperplanes that partition the feature space and can be adapted to act as both classifiers and regressors.
A brief overview is presented in Appendix B.
The manifolds of interest to us are the Complete Intersection Calabi-Yau threefolds (CICYs), which we review in the following section. The CICYs generalise the famous quintic as well as Yau's construction of the Calabi-Yau threefold embedded in P^3 × P^3 [9]. The simplicity of their description makes this class of geometries particularly amenable to the tools of machine learning.
The choice of CICYs is however mainly guided by other considerations. First, the CICYs constitute a sizeable collection of Calabi-Yau manifolds and are in fact the first such large dataset in algebraic geometry. Second, many properties of the CICYs have already been computed over the years, like their Hodge numbers [9,10] and discrete isometries [11][12][13][14]. The Hodge numbers of their quotients by freely acting discrete isometries have also been computed [11,[15][16][17][18].
In addition, the CICYs provide a playground for string model building. The construction of stable holomorphic vector [19][20][21][22][23] and monad bundles [21] over smooth favourable CICYs has produced several quasi-realistic heterotic string derived Standard Models through intermediate GUTs. These constitute another large dataset based on these manifolds. Furthermore, the Hodge numbers of CICYs were recently shown to be machine learnable to a reasonable degree of accuracy using a primitive neural network of the multi-layer perceptron type [1]. In this paper, we consider whether a more powerful machine learning tool (like a more complex neural network) or an SVM yields significantly better results. We wish to learn the extent to which such topological properties of CICYs are machine learnable, with the foresight that machine learning techniques can become a powerful tool in constructing ever more realistic string models, as well as helping understand Calabi-Yau manifolds in their own right.
Guided by these considerations, we conduct three case studies over the class of CICYs. We first apply SVMs and neural networks to machine learn the Hodge number h^{1,1} of CICYs. We then attempt to learn whether a CICY is favourably embedded in a product of projective spaces, and whether a given CICY admits a quotient by a freely acting discrete symmetry.
The paper is structured as follows. In Section 2, we provide a brief overview of CICYs and the datasets over them relevant to this work. In Section 3, we discuss the metrics for our machine learning paradigms. Finally, in Section 4, we present our results.

The CICY Dataset
A CICY threefold is a Calabi-Yau manifold embedded in a product of complex projective spaces, referred to as the ambient space. The embedding is given by the zero locus of a set of homogeneous polynomials over the combined set of homogeneous coordinates of the projective spaces. The deformation class of a CICY is then captured by a configuration matrix (1), which collects the multi-degrees q^r_a of the K polynomials over the m projective factors P^{n_r}. In order for the configuration matrix in (1) to describe a CICY threefold, we require that ∑_r n_r − K = 3. In addition, the vanishing of the first Chern class is accomplished by demanding that ∑_a q^r_a = n_r + 1, for each r ∈ {1, . . . , m}. There are 7890 CICY configuration matrices in the CICY list (available online at [24]). At least 2590 of these are known to be distinct as classical manifolds.
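As a concrete illustration of these two conditions, the check below verifies them for the quintic and for Yau's manifold in P^3 × P^3. The helper function and the list-of-lists encoding (rows indexed by polynomial, columns by projective factor) are our own illustrative conventions, not code from the paper.

```python
# Check the two defining conditions on a CICY configuration matrix:
# (i)  sum_r n_r - K = 3          (complex dimension three),
# (ii) sum_a q^r_a = n_r + 1      (vanishing first Chern class), for each r.
# Illustrative helper, not code from the paper.

def is_cicy_threefold(n, q):
    """n: list of projective-space dimensions n_r;
    q: K rows (one per polynomial), each row listing the degrees q^r_a."""
    K = len(q)                       # number of defining polynomials
    dim_ok = sum(n) - K == 3         # condition (i)
    c1_ok = all(sum(row[r] for row in q) == n[r] + 1 for r in range(len(n)))
    return dim_ok and c1_ok

# The quintic: one degree-5 polynomial in P^4.
print(is_cicy_threefold([4], [[5]]))
# Yau's manifold in P^3 x P^3: three polynomials of multi-degrees (3,0), (0,3), (1,1).
print(is_cicy_threefold([3, 3], [[3, 0], [0, 3], [1, 1]]))
```

Both checks return True; a degree-4 hypersurface in P^4 fails condition (ii), since 4 ≠ 4 + 1.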
The Hodge numbers h^{p,q} of a Calabi-Yau manifold are the dimensions of its Dolbeault cohomology groups H^{p,q}. A related topological quantity is the Euler characteristic χ, which for a threefold reduces to χ = 2(h^{1,1} − h^{2,1}) in terms of the two independent Hodge numbers h^{1,1} and h^{2,1}.
If the entire second cohomology of the CICY descends from that of the ambient space A = P^{n_1} × · · · × P^{n_m}, then we identify the CICY as favourable. There are 4874 favourable CICYs [24]. As an aside, we note that it was shown recently that all but 48 CICY configuration matrices can be brought to a favourable form through ineffective splittings [25]. The remaining 48 can be seen to be favourably embedded in a product of del Pezzo surfaces. The favourable CICY list is also available online [26]. (In this paper, we will not be concerned with this new list of CICY configuration matrices.) The favourable CICYs have been especially amenable to the construction of stable holomorphic vector and monad bundles, leading to several quasi-realistic heterotic string models.
Discrete symmetries are one of the key components of string model building. The breaking of the GUT group to the Standard Model gauge group proceeds via discrete Wilson lines, and as such requires a non-simply connected compactification space. Prior to the classification efforts [11,12], almost all known Calabi-Yau manifolds were simply connected. The classification resulted in identifying all CICYs that admit a quotient by a freely acting symmetry, totalling 195 in number, 2.5% of the total, creating a highly unbalanced dataset. 31 distinct symmetry groups were found, the largest being of order 32. Counting inequivalent projective representations of the various groups acting on the CICYs, a total of 1695 CICY quotients were obtained [24].
A CICY quotient might admit further discrete symmetries that survive the breaking of the string gauge group to the Standard Model gauge group. These in particular are phenomenologically interesting, since they may address questions related to the stability of the proton via R-symmetries and the structure of the mixing matrices via non-Abelian discrete symmetries.
A classification of the remnant symmetries of the 1695 CICY quotients found that 381 of them had nontrivial remnant symmetry groups [13], leading to a more balanced dataset based on symmetry. We will however focus on the first symmetry dataset, available at [24], purely on the grounds that it is much larger than the latter dataset.

Benchmarking Models
In order to benchmark and compare the performance of each machine learning approach, we use several standard metrics. For regression problems we make use of both the root mean square error (RMS) and the coefficient of determination (R²) to assess performance:

RMS = √( (1/N) ∑_i (y^pred_i − y_i)² ),   R² = 1 − ∑_i (y^pred_i − y_i)² / ∑_i (y_i − ȳ)²,

where y_i and y^pred_i stand for actual and predicted values, with i taking values in 1 to N, and ȳ stands for the average of all y_i. A rudimentary binary accuracy is also computed by rounding the predicted value and counting the results in agreement with the data. As this accuracy is a binary success or failure, we can use this measure to calculate a Wilson confidence interval. Define

( p + z²/2n ± z √( p(1−p)/n + z²/4n² ) ) / ( 1 + z²/n ),

where p is the probability of a successful prediction, n the number of entries in the dataset, and z the probit of the normal distribution (e.g., for a 99% confidence interval, z = 2.575829). The upper and the lower bounds of this interval are denoted by WUB and WLB respectively.
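These regression metrics and the Wilson interval take only a few lines to compute. The sketch below is an illustrative pure-Python implementation (function names are ours, not the paper's):

```python
import math

# RMS error, coefficient of determination R^2, and the Wilson confidence
# interval on a binary (rounded) accuracy. Illustrative sketch.
def rms(y, y_pred):
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(y, y_pred)) / len(y))

def r_squared(y, y_pred):
    y_bar = sum(y) / len(y)
    ss_res = sum((p - a) ** 2 for a, p in zip(y, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def wilson_bounds(p, n, z=2.575829):      # z = probit for a 99% interval
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half   # (WLB, WUB)
```

For example, a perfect accuracy p = 1 on n = 100 entries still gives WLB ≈ 0.94, reflecting the finite sample size.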
For classifiers, we have addressed two distinct types of problems in this paper, namely balanced and imbalanced problems. Balanced problems are those where the numbers of elements in the true and false classes are comparable in size. Imbalanced problems, or the so-called needle in a haystack, are the opposite case. It is important to make this distinction, since models trained on imbalanced problems can easily achieve a high accuracy, but accuracy would be a poor measure of performance. Predictions fall into the four entries of the confusion matrix of Table 1, the true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn), whose elements we use to define several performance metrics:

Accuracy := (tp + tn) / (tp + tn + fp + fn),   TPR := tp / (tp + fn),   FPR := fp / (fp + tn),

where TPR (FPR) stand for True (False) Positive Rate, the former also known as recall. For balanced problems, accuracy is the go-to performance metric, along with its associated Wilson confidence interval. However, for imbalanced problems, we use F-values and AUC. We define

precision := tp / (tp + fp),   F := 2 · precision · recall / (precision + recall),

while AUC is the area under the receiver operating characteristic (ROC) curve that plots TPR against FPR. F-values vary from 0 to 1, whereas AUC ranges from 0.5 to 1. We will discuss these in greater detail in Section 4.3.2.
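In code, these confusion-matrix metrics amount to a few counting operations. The following is an illustrative pure-Python sketch (the function names and the 0/1 label encoding are our own conventions):

```python
# Confusion-matrix metrics: accuracy, TPR (recall), FPR, precision and the
# F-value (harmonic mean of precision and recall). Illustrative sketch.
def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)                  # recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_value = 2 * precision * tpr / (precision + tpr)
    return accuracy, tpr, fpr, f_value
```

On an imbalanced set, a model that misses half the true class can still score a high accuracy while its F-value reveals the weakness, which is exactly why F and AUC are preferred below.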

Case Studies
We conduct three case studies over the CICY threefolds. Given a CICY threefold X, we explicitly try to learn the topological quantity h^{1,1}(X), the Hodge number that captures the dimension of the Kähler structure moduli space of X. We then attempt a (balanced) binary query, asking whether a given manifold is favourable. Finally, we attempt an (imbalanced) binary query about whether a CICY threefold X admits a quotient X/G by a freely acting discrete isometry group G.

Machine Learning Hodge Numbers
As noted in Section 2, the only independent Hodge numbers of a Calabi-Yau threefold are h^{1,1} and h^{2,1}. We attempt to machine learn these. For a given configuration matrix (1) describing a CICY, the Euler characteristic χ = 2(h^{1,1} − h^{2,1}) can be computed from a simple combinatorial formula [27]. Thus, it is sufficient to learn only one of the Hodge numbers. We choose to learn h^{1,1}, since it takes values in a smaller range of integers than h^{2,1}.

Architectures
To determine the Hodge numbers we use regression machine learning techniques to predict a continuous output with the CICY configuration matrix (1) as the input. The optimal SVM hyperparameters were found by hand to be a Gaussian kernel with σ = 2.74, C = 10, and ε = 0.01.
Optimal neural network hyperparameters were found with a genetic algorithm, leading to an overall architecture of five hidden layers with 876, 461, 437, 929, and 404 neurons, respectively.
The algorithm also found that a ReLU (rectified linear unit) activation layer and a dropout layer with dropout 0.2072 between each neuron layer give optimal results. (See Appendix C for a description of hyperparameters.) A neural network classifier was also used. To achieve this, rather than using one output neuron, as is the case for a binary classifier or regressor, we use an output layer with 20 neurons (since h^{1,1} ∈ {0, . . . , 19}), each neuron mapping to 0/1, with the location of the 1 corresponding to a unique h^{1,1} value. Note this effectively adds extra information to the input, as we are explicitly fixing the range of allowed h^{1,1} values. For a large enough training data size this is not an issue, as we could extract this information from the training data (choose the output size from the largest h^{1,1} in the training data; for a large enough sample it is likely to contain h^{1,1} = 19).
Moreover, for a small training data size, if only h^{1,1} values less than a given number are present in the data, the model will not be able to learn the missing values anyway; this would happen with a continuous output regression model as well.
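The one-hot output encoding described above can be sketched in a few lines; the helper names are our own illustrative choices:

```python
# Encode h^{1,1} as a 20-neuron one-hot output vector, and decode a network
# output by taking the argmax over the output neurons. Illustrative sketch.
def one_hot(h11, n_classes=20):
    v = [0] * n_classes
    v[h11] = 1
    return v

def decode(v):
    # the location of the largest activation is the predicted h^{1,1}
    return max(range(len(v)), key=lambda i: v[i])
```

In practice the final layer outputs real activations rather than exact 0/1 values, so the argmax decoding is what turns the 20-neuron layer into a single h^{1,1} prediction.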
The genetic algorithm is used to find the optimal classifier architecture. Surprisingly, it finds that adding several convolution layers leads to the best performance. This is unexpected, as convolution layers look for features which are translationally or rotationally invariant (for example, in number recognition they may learn to detect rounded edges and associate these with a zero). Our CICY configuration matrices do not exhibit these symmetries, and this is the only result in the paper where convolution layers lead to better rather than worse results. The optimal architecture was found to be four convolution layers with 57, 56, 55, and 43 feature maps, respectively, all with a kernel size of 3 × 3. These layers were followed by two hidden fully connected layers containing 169 and 491 neurons, and the output layer. ReLU activations and a dropout of 0.5 were included between every layer, with the last layer using a sigmoid activation. Training on a laptop computer's CPU took less than 10 minutes, and execution on the validation set after training takes seconds.

Table 2: Wilson confidence intervals evaluated with a validation set of 0.25 of the total data (1972 entries). Errors were obtained by averaging over 100 different random cross-validation splits using a cluster.

Outcomes
Our results are summarised in Figures 1 and 2 and in Table 2. Clearly, the validation accuracy improves as the training set increases in size. The histograms in Figure 2 show that the model slightly overpredicts at larger values of h^{1,1}.
We contrast our findings with the preliminary results of a previous case study by one of the authors [1,2], in which a Mathematica-implemented neural network of the multi-layer perceptron type was used to machine learn h^{1,1}. In that work, a training data size of 0.63 (5000) was used, and a test accuracy of 77% was obtained. Note this accuracy is against the entire dataset after seeing only the training set, whereas we compute validation accuracies against only the unseen validation data.

Machine Learning Favourability

Following from the discussion in Section 2, we now study the binary query: given a CICY threefold configuration matrix (1), can we deduce if the CICY is favourably embedded in the product of projective spaces? We could already attempt to predict if a configuration is favourable with the results of Section 4.1, by predicting h^{1,1} explicitly and comparing it to the number of components of A. However, we rephrase the problem as a binary query, taking the CICY configuration matrix as the input and returning 0 or 1 as the output.
An optimal SVM architecture was found by hand to use a Gaussian kernel with σ = 3 and C = 0. The neural network architecture was also found by hand: a simple one-hidden-layer neural network with 985 neurons, ReLU activation, dropout of 0.46, and sigmoid activation at the output layer gave the best results.
Results are summarised in Figure 3 and Table 3. Remarkably, after seeing only 5% of the training data (400 entries), the models are capable of extrapolating to the full dataset with an accuracy ∼ 80%. This analysis took less than a minute on a laptop computer. Since computing the Hodge numbers directly was a time consuming and nontrivial problem [27], this is a prime example of how applying machine learning could shortlist different configurations for further study in the hypothetical situation of an incomplete dataset.

Machine Learning Discrete Symmetries
The symmetry data resulting from the classifications [11,12] presents various properties that we can try to machine learn. An ideal machine learning model would be able to replicate the classification algorithm, giving us a list of every symmetry group which gives a quotient for a given manifold. However, this is a highly imbalanced problem, as only a tiny fraction of the 7890 CICYs admit any specific symmetry group. Thus, we first try a more basic question: given a CICY configuration, can we predict if the CICY admits any freely acting group? This is still most definitely a needle in a haystack problem, as only 2.5% of the data belongs to the true class.
In an effort to overcome this large class imbalance, we generate new synthetic data belonging to the positive class. We try two separate methods to achieve this: sampling techniques and permutations of the CICY matrix.
Sampling techniques preprocess the data to reduce the class imbalance. For example, downsampling drops entries randomly from the false class, increasing the fraction of true entries at the cost of lost information. Upsampling clones entries from the true class to achieve the same effect. This is effectively the same as associating a larger penalty (cost) to misclassifying entries in the minority class. Here, we use Synthetic Minority Oversampling Technique (SMOTE) [28] to boost performance.

SMOTE
SMOTE is similar to upsampling in that it increases the entries in the minority class, as opposed to downsampling. However, rather than purely cloning entries, new synthetic entries are created from the coordinates of entries in the feature space. Thus the technique is ignorant of the actual input data and generalises to any machine learning problem. We refer to different amounts of SMOTE by an integer multiple of 100. In this notation, SMOTE 100 refers to doubling the minority class (a 100% increase), SMOTE 200 refers to tripling the minority class, and so on.

SMOTE Algorithm
1. For each entry in the minority class x_i, calculate its k nearest neighbours y_k in the feature space (i.e., reshape the 12 × 15 zero-padded CICY configuration matrix into a vector x_i, and find the nearest neighbours in the resulting 180-dimensional vector space).
2. Calculate the difference vectors y_k − x_i and rescale these by a random number n_k ∈ (0, 1).
3. Pick at random one point x_i + n_k (y_k − x_i), which lies on the segment between x_i and the chosen neighbour, and keep this as a new synthetic point.

The results obtained here all trivially achieved validation accuracies ∼ 99%. As noted in Section 3, this is misleading, and instead we should use AUC and F-values as our metrics.
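The three steps above can be sketched as follows. This is a minimal pure-Python illustration with our own function names (a real analysis would typically use a library implementation such as imbalanced-learn):

```python
import random

# Minimal SMOTE sketch: for each minority point, interpolate towards one of
# its k nearest minority-class neighbours by a random factor in (0, 1).
def smote(minority, amount=100, k=5, rng=random):
    new_points = []
    for _ in range(amount // 100):        # SMOTE 100 doubles, 200 triples, ...
        for x in minority:
            # step 1: k nearest neighbours of x within the minority class
            others = sorted((y for y in minority if y is not x),
                            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, y)))
            neighbours = others[:k]
            # steps 2-3: pick a neighbour and interpolate by n in (0, 1)
            y = rng.choice(neighbours)
            n = rng.random()
            new_points.append([a + n * (b - a) for a, b in zip(x, y)])
    return new_points
```

Each synthetic point lies on a segment between two existing minority-class points, so the feature-space region occupied by the minority class is filled in rather than merely re-weighted.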
However, after processing the data with a sampling technique and training the model, we would only obtain one point (FPR, TPR) to plot on a ROC curve. Thus, to generate the full ROC curve, we vary the output threshold of the model to sweep through the entire range of values.
More explicitly, for an SVM, we modify the classifying function sgn(f(x)) to sgn(f(x) − t). For neural networks, we modify the final sigmoid activation layer (8) in the same fashion. Sweeping through all values of t, we generate the entire ROC curve and a range of F-values, thus obtaining the desired metrics. Figure 4 shows the profile of a good ROC curve.

Table 4: Metrics for predicting freely acting symmetries. Errors were obtained by averaging over 100 random cross validation splits using a cluster.
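The threshold sweep described above can be sketched directly: classify by sgn(f(x) − t), record (FPR, TPR) at each t, and integrate the resulting curve. The following is an illustrative pure-Python version (names and the trapezium-rule AUC are our own choices):

```python
# Sweep a decision threshold t over real-valued model outputs f(x) to trace
# the ROC curve, then estimate AUC by the trapezium rule. Illustrative sketch.
def roc_curve(scores, labels, n_steps=100):
    lo, hi = min(scores), max(scores)
    points = []
    for i in range(n_steps + 1):
        t = lo + (hi - lo) * i / n_steps
        preds = [1 if s >= t else 0 for s in scores]   # sgn(f(x) - t)
        tp = sum(1 for p, l in zip(preds, labels) if p and l)
        fp = sum(1 for p, l in zip(preds, labels) if p and not l)
        tpr = tp / sum(labels)
        fpr = fp / (len(labels) - sum(labels))
        points.append((fpr, tpr))
    points.sort()
    return points

def auc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

For perfectly separable toy scores the swept curve hugs the top-left corner and the estimated AUC is 1.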

Permutations
From the definition of the CICY configuration matrix (1), we note that row and column permutations of this matrix represent the same CICY. Thus we can reduce the class imbalance by simply including these permutations in the training data set. In this paper we use the same scheme for different amounts of PERM as we do for SMOTE; that is, PERM 100 doubles the entries in the minority class, so one new permuted matrix is generated for each entry belonging to the positive class. PERM 200 creates two new permuted matrices for each entry in the positive class. Whether a row or column permutation is used is decided randomly.
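Generating one such permuted copy is straightforward; the sketch below randomly applies either a row or a column permutation, as described above (the helper name and list-of-lists encoding are our own illustrative conventions):

```python
import random

# Produce a permuted copy of a CICY configuration matrix. Row and column
# permutations both represent the same manifold, so either is a valid
# augmentation of the minority class. Illustrative sketch.
def permute_configuration(matrix, rng=random):
    m = [row[:] for row in matrix]           # copy; leave the original intact
    if rng.random() < 0.5:
        rng.shuffle(m)                        # permute rows
    else:
        cols = list(range(len(m[0])))         # permute columns
        rng.shuffle(cols)
        m = [[row[c] for c in cols] for row in m]
    return m
```

The permuted matrix has the same shape and the same multiset of entries, so labels carry over unchanged, which is what makes this a safe augmentation.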

Outcomes
Optimal SVM hyperparameters were found by hand to be a Gaussian kernel with σ = 7.5 and C = 0.
A genetic algorithm found the optimal neural network architecture to be three hidden layers. PERM results are summarised in Table 5 and Figure 6. Note these results are not averaged over several runs and are thus noisy. We see that for 80% of the training data used (the same training size as used for the SMOTE runs), the F-values are of the order 0.3 − 0.4. This is a slight improvement over SMOTE (see also the PERM 100,000 results in Table 5).

Conclusion

This work serves as a proof of concept for exploring the geometric features of Calabi-Yau manifolds using machine learning beyond binary classifiers and feedforward neural networks.
In future work, we intend to apply the same techniques to study the Kreuzer-Skarke [29] list of half a billion reflexive polytopes and the toric Calabi-Yau threefolds obtained from this dataset [30]. Work in progress extends the investigations in this paper to the CICY fourfolds [31] and cohomology of bundles over CICY threefolds.

A Brief Overview of Neural Networks
Neural networks are one branch of machine learning techniques, capable of dealing with both classification and regression problems. There are several different types of neural networks, but they all act as a non-trivial function f (v in ) = v out . We proceed to discuss feedforward neural networks. We first consider a single neuron, which for an input x outputs σ(x · w + b). Here σ is the activation function, w the weights (for a weighted sum of the inputs), and b the bias.

Feedforward Neural Networks
Activation functions are applied to the resulting sum, typically mapping to the interval [0, 1].
This mimics a real neuron, which is either firing or not. The neuron can then act as a classifier.

A bias is included to offset the resulting weighted sum so as to stay in the active region of the activation function. To be more explicit, consider the sigmoid activation. If we have a large input vector, without a bias, applying a sigmoid activation function will tend to map the output to 1 due to the large number of entries in the sum, which may not be the correct response we expect from the neuron. We could just decrease the weights, but training the net can stagnate without a bias due to the vanishing gradient near σ(x) = 0 or 1.
We generalise to multiple neurons by promoting the weight vector to a weight matrix. This collection of neurons is known as a layer. We denote the output of the i-th neuron in this layer as σ_i = σ( ∑_j W_ij x_j + b_i ). Extending to several layers, we take the output of the previous layer as the input to the next, applying a different weight matrix and bias vector as we propagate through. Note that all internal layers between the input and output layers are referred to as hidden layers.
Denote the output of the i-th neuron in the n-th layer as σ^n_i, with σ^0_i = x_i the input vector.
This concludes how a fully connected neural network generates its output from a given input.
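The layer-by-layer propagation just described can be sketched in a few lines of pure Python; the encoding of layers as (weight matrix, bias vector) pairs is our own illustrative choice:

```python
import math

# Forward pass through a fully connected feedforward network: each layer
# applies a weight matrix and bias, then an elementwise sigmoid activation.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """layers: list of (W, b) pairs, W given as a list of rows (one per neuron)."""
    out = x
    for W, b in layers:
        out = [sigmoid(sum(w_j * o_j for w_j, o_j in zip(w, out)) + b_i)
               for w, b_i in zip(W, b)]
    return out
```

A single neuron with zero weight and zero bias outputs σ(0) = 0.5 regardless of its input, which illustrates why the bias and weights must be trained.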
Training a network to give the desired output thus consists of adjusting the weight and bias values. This is achieved by the back propagation algorithm.

Back Propagation in Neural Networks
To decide how to adjust the weights and biases of a neural network when training, we define a cost function. Standard cost functions include mean squared error and cross-entropy (categorical and binary). Back propagation is an algorithm to find parameter adjustments by minimising the cost function. It is so named as adjustments are first made to the last layer and then successive layers moving backwards.
To illustrate the approach, consider a network with M layers and a mean squared error cost function

E = (1/2N) ∑_entries ∑_i (σ^M_i − t_i)².

Here N is the number of training entries and t the expected output for a given entry. Taking derivatives, shifts in the last weight matrix are controlled by

∂E/∂W^M_ij = (σ^M_i − t_i) σ′(z^M_i) σ^{M−1}_j,

where z^M_i = ∑_j W^M_ij σ^{M−1}_j + b^M_i is the weighted input to the last layer. Working backwards, shifts in the second to last weight matrix take the same form with an extra sum over the last layer. We define

∆^M_i := (σ^M_i − t_i) σ′(z^M_i),   ∆^m_i := σ′(z^m_i) ∑_j W^{m+1}_ji ∆^{m+1}_j.

Therefore by induction we can write, for an arbitrary layer m,

∂E/∂W^m_ij = ∆^m_i σ^{m−1}_j,   ∂E/∂b^m_i = ∆^m_i.

This defines the back propagation approach. By utilising our neural network's final output and the expected output, we can calculate the ∆s successively, starting from the last layer and working backwards. We shift the weight values in the direction the gradient is descending to minimise the error function. Thus shifts are given by

W^m_ij → W^m_ij − η ∂E/∂W^m_ij,   b^m_i → b^m_i − η ∂E/∂b^m_i,

with η the learning rate (effectively a proportionality constant fixing the magnitude of shifts in gradient descent). Care must be taken when choosing the learning rate. A rate too small leads to slow convergence and the possibility of becoming trapped in a local minimum. A rate too large leads to fluctuations in errors and poor convergence, as the steps taken in parameter space are too large, effectively jumping over minima.
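One gradient-descent step of this back propagation recursion, for sigmoid layers and a single training entry, can be sketched as follows (a pure-Python illustration with our own layer encoding, not the paper's implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One back propagation step for layers encoded as (W, b) pairs, W a list of
# rows. Uses mean squared error and sigmoid activations; sigma' = o (1 - o).
def backprop_step(x, t, layers, eta=0.5):
    # forward pass, storing every layer's output
    outs = [x]
    for W, b in layers:
        outs.append([sigmoid(sum(wj * oj for wj, oj in zip(w, outs[-1])) + bi)
                     for w, bi in zip(W, b)])
    # Delta for the last layer: (sigma^M_i - t_i) sigma'(z^M_i)
    delta = [(o - ti) * o * (1 - o) for o, ti in zip(outs[-1], t)]
    for m in range(len(layers) - 1, -1, -1):
        W, b = layers[m]
        if m > 0:
            # compute the next Delta BEFORE updating W (it needs the old weights)
            prev_delta = [outs[m][i] * (1 - outs[m][i]) *
                          sum(W[j][i] * delta[j] for j in range(len(W)))
                          for i in range(len(outs[m]))]
        # shift weights and biases opposite the gradient
        for i in range(len(W)):
            for j in range(len(W[i])):
                W[i][j] -= eta * delta[i] * outs[m][j]
            b[i] -= eta * delta[i]
        if m > 0:
            delta = prev_delta
```

Repeatedly applying `backprop_step` to a single neuron with target 0 drives its output towards 0, illustrating gradient descent in its simplest setting.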
Note that parameter shifts are dependent on the gradient of the activation function. For activation functions such as sigmoid or tanh this then drives the output of a neuron to its minimal or maximal value, as parameter shifts become increasingly small due to the vanishing gradient. This is advantageous in an output layer where we may want to use binary classification.
However, if neurons in hidden layers are driven to their min/max too early in training, it can effectively make them useless, as their weights will not shift with any further training. This is known as the flat spot problem and is why the ReLU activation function has become increasingly popular.

Convolution Neural Networks

Convolution layers look for features in the input which are translationally invariant. To be more explicit, consider a two dimensional input (matrix). A kernel will be a grid of size n × n (arbitrary). This grid convolves across the input matrix, taking the smaller matrix the grid is covering as the input for a neuron in the convolution layer, as shown in Figure 8. The output generated by convolving the kernel across the input matrix and feeding the weighted sums through activations is called a feature map. Importantly, the weights connecting the two layers must be the same, regardless of where the kernel is located. Thus it is as though the kernel window is scanning the input matrix for smaller features which are translationally invariant. For example, in number recognition, the network may learn to associate rounded edges with a zero. What these features are in reality depends on the weights learned during training. A single convolution layer will usually use several feature maps to generate the input for the next layer.

Overfitting
To improve a network's predicting power against unseen data, overfitting must be avoided.
Overfitting occurs during training when accuracy against the training dataset continues to grow but accuracy against unseen data stops improving. The network is not learning general features of the data anymore. This occurs when the complexity of the net architecture has more computing potential than required. The opposite problem is underfitting, using too small a network which is incapable of learning data to high accuracy.
An obvious solution to overfitting is early stopping, cutting the training short once accuracy against unseen data ceases to improve. However, we also wish to delay overfitting such that this accuracy is as large as possible after stopping.
In this paper we also make use of dropout to avoid overfitting. Dropout is a technique where neurons in a given layer have a probability of being switched off during one training round. This forces neurons to learn more general features about the dataset and can decrease overfitting [32].
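Dropout itself is a one-line operation per layer. The sketch below uses the common "inverted dropout" convention, rescaling surviving activations by 1/(1−p) so that expected activations match test time; this rescaling convention is our own illustrative choice, not a statement about the paper's implementation:

```python
import random

# Dropout: zero each neuron's output with probability p during a training
# pass, rescaling survivors by 1/(1-p) ("inverted dropout"). Illustrative sketch.
def dropout(outputs, p, rng=random):
    return [0.0 if rng.random() < p else o / (1.0 - p) for o in outputs]
```

At test time the layer is simply left untouched, since the rescaling during training already matches the expected activation.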

B Brief Overview of Support Vector Machines
In contrast to neural networks, support vector machines (SVMs) take a more geometric approach. They can act as both classifiers and regressors, but it is more instructive to begin this discussion with classifiers. While a neural network classifier essentially fits a large number of parameters (weights and biases) to obtain a desired function f(v_in) = 0, 1, an SVM tries to establish an optimal hyperplane separating clusters of points in the feature space (the n-dimensional Euclidean space to which the n-dimensional input vector belongs). Points lying on one side of the plane are identified with one class, and vice versa for the other class. Thus a vanilla SVM is only capable of acting as a binary classifier for linearly separable data. This is somewhat restrictive, but the approach can be generalised to non-linearly separable data via the so-called kernel trick, and likewise can be extended to deal with multiple classes [33] (see Figure 9).

Figure 9: Example SVM separation boundary calculated using our Cvxopt implementation with a randomly generated data set (Gaussian kernel, non-linearly separable data).
We wish to separate points x_i with a hyperplane based on a classification of true/false, which we represent with the labelling y_i = ±1. First define a hyperplane

f(x) = w · x + b = 0,   (15)

where w is the normal vector to the hyperplane. Support vectors are the points in the feature space lying closest to the hyperplane on either side, which we denote as x^±_i. Define the margin as the distance between these two vectors projected along the normal w, i.e., (x^+_i − x^−_i) · w/|w|. There is typically not a unique hyperplane we could choose to separate labelled points in the feature space, but the optimal hyperplane is the one which maximises the margin. This is because it is more desirable to have points lie as far from the separating plane as possible, as points close to the boundary could be easily misclassified. Note that the condition defining a hyperplane (15) is not unique, as a rescaling α(w · x + b) = 0 describes the same hyperplane. Thus we can rescale the normal vector such that f(x^±_i) = ±1, and the margin reduces to 2/|w|. Moreover, with such a rescaling, the SVM acts as a classifier on an input x through the function sgn(f(x)) = ±1. Maximising the margin thus corresponds to minimising |w|, with the constraint that each point is correctly classified. This wholly defines the problem, which can be stated as

minimise (1/2)|w|²,   subject to y_i (w · x_i + b) ≥ 1 for all i.

This is a quadratic programming problem with well known algorithms to solve it. Reformulating this problem with Lagrange multipliers α_i,

L(w, b, α) = (1/2)|w|² − ∑_i α_i [ y_i (w · x_i + b) − 1 ],

leads to the dual problem:

maximise ∑_i α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j x_i · x_j,   subject to α_j ≥ 0, ∑_j α_j y_j = 0.
With our classifying function now being sgn(f(x)) = sgn( ∑_i α_i y_i x_i · x + b ), this is again a quadratic programming problem. In this study we solve the dual problem using the Python package Cvxopt, which implements a quadratic programming algorithm to solve such problems.
The dual approach is much more illuminating, as it turns out the only α_i which are non-zero correspond to the support vectors [33] (hence the name support vector machine). This makes SVMs rather efficient: unlike a neural net, which requires a vast number of parameters to be tuned, an SVM is fully specified by its support vectors and is ignorant of the rest of the data. Moreover, the quadratic programming optimisation implemented via Cvxopt ensures the minimum found is a global one.
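The paper solves this dual problem with Cvxopt; purely for illustration, the sketch below maximises the dual objective on a two-point toy data set with a crude projected gradient ascent (our own stand-in for a proper quadratic programming solver, adequate only for this symmetric toy example):

```python
# Maximise W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0, by gradient ascent with
# a crude projection onto the constraints. Illustrative toy only.
def svm_dual_toy(xs, ys, steps=2000, lr=0.01):
    n = len(xs)
    K = [[sum(a * b for a, b in zip(xs[i], xs[j])) for j in range(n)]
         for i in range(n)]
    alpha = [0.0] * n
    for _ in range(steps):
        grad = [1.0 - ys[i] * sum(alpha[j] * ys[j] * K[i][j] for j in range(n))
                for i in range(n)]
        alpha = [a + lr * g for a, g in zip(alpha, grad)]
        # project onto sum_j alpha_j y_j = 0, then clamp alpha_j >= 0
        s = sum(a * y for a, y in zip(alpha, ys)) / n
        alpha = [max(0.0, a - s * y) for a, y in zip(alpha, ys)]
    w = [sum(alpha[i] * ys[i] * xs[i][d] for i in range(n))
         for d in range(len(xs[0]))]
    return alpha, w
```

For the points x = ±1 with labels y = ±1, both points are support vectors: the multipliers converge to α = (0.5, 0.5), giving w = 1, b = 0 and margin 2/|w| = 2, exactly the distance between the two points.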
The dual approach also enables us to generalise to non-linearly separable data rather trivially.
In theory, this is achieved by mapping points in the feature space into a higher dimensional feature space where the points are linearly separable, finding the optimal hyperplane, and then mapping back into the original feature space. However, in the dual approach, only the dot product between vectors in the feature space is used. Thus in practice we can avoid the mapping procedure, as we only need the effective dot product in the higher dimensional space, known as a kernel. Thus by replacing x_i · x with Ker(x_i, x) we can deal with non-linearly separable data at almost no extra computational cost. This is known as the kernel trick. Common kernels include the linear kernel x_i · x, the polynomial kernel (x_i · x + c)^d, and the Gaussian kernel Ker(x_i, x) = exp(−|x_i − x|²/2σ²). In our study of CICYs we exclusively use the Gaussian kernel, as this leads to the best results. SVMs can also act as a linear regressor by finding a function f(x) = w · x + b to fit to the data. Analogous to the above discussion, one can frame this as an optimisation problem by choosing the flattest line which fits the data within an allowed residue ε. Likewise, one can make use of Lagrange multipliers and the kernel trick to act as a non-linear regressor too.
Note in the above discussion we have avoided the concept of slack. In order to avoid overfitting to the training data, one can allow a few points in the training data to be misclassified so as not to constrain the separating hyperplane too much, allowing for better generalisation to unseen data. In practice this is quantified by replacing the condition α_i ≥ 0 with 0 ≤ α_i ≤ C, where C is the cost variable [33].

SVM Regressors
The optimisation problem for a linear SVM regressor follows from finding the flattest possible function f(x) = w · x + b which fits the data within a residue ε. As |∇f|² = |w|², this flatness condition reduces to the problem

minimise (1/2)|w|²,   subject to |y_i − w · x_i − b| ≤ ε for all i.

Introducing Lagrange multipliers α_i, α*_i for the two sides of the constraint leads to the dual problem:

maximise −(1/2) ∑_{i,j} (α_i − α*_i)(α_j − α*_j) x_i · x_j − ε ∑_i (α_i + α*_i) + ∑_i y_i (α_i − α*_i),

subject to the conditions

∑_i (α_i − α*_i) = 0,   0 ≤ α_i, α*_i ≤ C.

Thus, identical to the classifier case, this optimisation problem can be implemented with Cvxopt. As the dual problem again only contains a dot product between two entries in the feature space, we can use the kernel trick to generalise this approach to fit non-linear functions.

C Hyperparameter Optimisation
While both neural networks and SVMs are trained algorithmically, as outlined in Appendices A and B, certain variables must be set by hand prior to training. These are known as hyperparameters. Examples include the net architecture (number of hidden layers and neurons in them) and the dropout rate for feedforward neural networks; the kernel size and number of feature maps for convolution layers; and the cost variable, kernel type and kernel parameters for SVMs. In this study we make use of a genetic algorithm, which effectively begins as a random search but then makes an informed decision about how to mutate parameters to increase accuracy.
In general, a genetic algorithm evolves a population of creatures, each creature having associated with it a list of parameters and a score. Each new generation consists of the top scorers from the previous generation and children which are bred from the last generation. More specifically, to breed better models we create an initial population of models, each being initialised with a random set of hyperparameters. Each model is trained to its early stopping point and its validation accuracy recorded. The models with the top 20% (arbitrary) validation accuracy are kept, along with a few others by chance.
Breeding then consists of pairing the surviving models into parents and forming new models by choosing each hyperparameter randomly from one of the parents. Bred models also have a small chance to randomly mutate their parameters. For a genetic algorithm to be successful, it is crucial to allow mutations and a few low scoring nets into the next generation. This ensures there is enough variance in the parameters available to avoid local minima.
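The selection, breeding and mutation steps above can be sketched in a few lines of Python. In the toy below, each creature is a list of integer "hyperparameters" and the score function stands in for a trained model's validation accuracy; all names and population settings are illustrative choices, not the paper's implementation:

```python
import random

# Minimal genetic search: keep the top scorers plus a few low scorers by
# chance, breed children by per-gene crossover, and mutate occasionally.
def evolve(score, n_params, pop_size=20, generations=30, keep=0.2,
           mutate_prob=0.1, rng=random):
    pop = [[rng.randint(0, 9) for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: max(2, int(keep * pop_size))]
        # keep a couple of low scorers by chance, to preserve variance
        survivors = survivors + rng.sample(pop[len(survivors):], 2)
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(mum, dad)]     # crossover
            if rng.random() < mutate_prob:
                child[rng.randrange(n_params)] = rng.randint(0, 9)   # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=score)
```

With a score that rewards genes close to a target value, the population reliably converges near the optimum within a few dozen generations, despite the search space being far larger than the population.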