Getting CICY High

Supervised machine learning can be used to predict properties of string geometries with previously unknown features. Using the complete intersection Calabi-Yau (CICY) threefold dataset as a theoretical laboratory for this investigation, we use low $h^{1,1}$ geometries for training and validate on geometries with large $h^{1,1}$. Neural networks and Support Vector Machines successfully predict trends in the number of K\"ahler parameters of CICY threefolds. The numerical accuracy of machine learning improves upon seeding the training set with a small number of samples at higher $h^{1,1}$.


Introduction
Ever since Kaluza and Klein extended the original insight of Einstein, we regard the fundamental forces as having an intrinsically geometric origin. The modern realization of this paradigm is the compactification of superstring theory down to four dimensions in order to recover the particle physics probed in experiments and inferred from astrophysical observations. In the most straightforward approach consistent with low energy supersymmetry, the six extra dimensions predicted by string theory comprise a compact Calabi-Yau threefold. Geometric and topological properties of the Calabi-Yau threefold determine features of the four dimensional effective action. For example, the Euler character of the geometry fixes the number of generations of light particles. Starting from the work of [1] and [2], numerous constructions of this type replicate the matter spectrum and gauge symmetries that we observe in Nature [3-10]. Naïve extrapolation of even the simplest class of models suggests that there are $10^{23}$ (nearly a mole's worth of) superstring derived Standard Models [11].
The vacuum selection problem, to find a principle that explicates which solution of the fundamental theory constitutes our world and how and why this came to be, remains an outstanding puzzle. It is also unknown what the typical string compactification looks like and how closely this solution resembles the one we actually inhabit. There are 7890 complete intersection Calabi-Yau (CICY) threefolds realized as the zero locus of polynomials in complex projective space. There are an unknown number of toric Calabi-Yau threefolds obtained from triangulation [12,13] of the 473 800 776 reflexive polytopes in $\mathbb{R}^4$ tabulated by Kreuzer and Skarke [14].
Other Calabi-Yau spaces are neither CICY nor toric. The largest available database [15,16] describes only the toric Calabi-Yau geometries with Hodge number $h^{1,1} \le 6$. While [17] explores the shape of the full Kreuzer-Skarke dataset, it suffices to notice that the distribution peaks sharply, and 910 113 of the polytopes sit at $(h^{1,1}, h^{2,1}) = (27, 27)$. The explicit Standard Model constructions to date meanwhile correspond to geometries whose Hodge numbers are $O(1)$ rather than $O(10)$. These are atypical as manifolds with small Hodge numbers are sparse.
Recently, a promising new approach to studying the vacuum selection problem has emerged.
The development of Big Data techniques in computer science and the broad applicability of these methods to such disparate fields as art, finance, chess and Go, linguistics, medicine, music, experimental particle physics, and zoology invites us to also use these tools to investigate aspects of string phenomenology and string mathematics. In particular, the paradigm of machine learning the landscape by using neural networks to study algebraic geometry, potentially bypassing expensive computations such as Gröbner bases, was proposed in [18,19] (cf. a pedagogical introduction in [20]). Already, there has been a significant amount of work in this direction, ranging from the studies of CICY geometries to the computation of line bundle cohomologies of toric hypersurfaces [18-33]. These studies have relied upon a multitude of algebro-geometric databases collected over the past few decades. A large fraction of such machine learning aided studies of the string landscape has been through the lens of neural networks with a variety of architectures [18,19,21-28]. A host of other techniques such as linear and logistic regression, Support Vector Machines (SVMs), and random forests, to name a few, have also been used, sometimes in conjunction with neural networks [21,25,28-31,33].
In our previous work [21], using the CICY threefolds as a testbed, we answered the following questions. Given the configuration matrix which defines a CICY threefold, can machine learning techniques compute the Hodge numbers of the geometry? Can the machine deduce whether the geometry is favorable, viz., does the number of projective space factors in the ambient space equal $h^{1,1}$? This property is important because such geometries accommodate the construction of stable vector bundles for string model building. Can the machine determine which geometries enjoy discrete symmetries, which are crucial for introducing Wilson lines that break the GUT symmetry to the Standard Model group? We find that even with 50% of the data for training, neural network classifiers identify the Hodge numbers at better than 80% accuracy. We select favorability with SVMs with more than 90% accuracy. Because CICYs with discrete symmetries are relatively rare ($\sim 2.5\%$ of all cases) [34], correctly isolating only these geometries is a comparatively less successful effort.
Heuristically, all of these investigations unfold as follows. We segregate the dataset into two disjoint parts: a training set $T$ and its complement $T^c$, used for validation. The machine is taught the associations $\{a_1, a_2, \ldots, a_n\} \longrightarrow \{b_1, b_2, \ldots, b_n\}$ for elements $a_i \in T$. Based on what it has learned about the training set, the machine then tries to determine the $b_j$ corresponding to the unseen elements $a_j \in T^c$. The selection of the elements in $T$ is performed at random at the outset. and why triangulating polytopes to populate the toric Calabi-Yau database [16] stopped at $h^{1,1} = 6$. One estimate of the total number of triangulations of the Kreuzer-Skarke dataset is $10^{10505}$ [33]. While there are $10^{8}$ reflexive polytopes associated to toric Calabi-Yau threefolds, the best guess in the literature is that there are $10^{18}$ reflexive polytopes whose triangulations yield toric Calabi-Yau fourfolds relevant for F-theory model building [36]. There may be $10^{3000}$ distinctly resolvable base geometries [37]. The scale of these numbers renders any systematic survey of the string landscape unfeasible. We would therefore like to develop techniques such that the training and validation sets are different in character. We aim to train with the easy cases and use the machine to predict solutions to harder problems for which the calculations are more intricate or where the answers could be unknown. We want as well to measure how reliable the results are when we segregate the data in this imbalanced way. By organizing the CICY dataset into a low $h^{1,1}$ training set and a high $h^{1,1}$ validation set, we report on progress in this effort.
The structure of this letter is as follows. In Section 2, we review the CICY threefolds. In Section 3, we describe the machine learning architectures we employ. In Section 4, we present the results of our investigation, which focuses on determining $h^{1,1}$ starting from the configuration matrix as the input. In Section 5, we provide a brief discussion and a prospectus for future work.

Complete Intersection Calabi-Yau Threefolds
For completeness, we briefly recall the relevant geometry. We refer the reader to [38] for a pedagogical review and references therein to the original literature.
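A Calabi-Yau manifold admits a Ricci flat Kähler metric. We enforce this requirement by ensuring that the first Chern class vanishes. The simplest example of a compact Calabi-Yau threefold is the Fermat quintic in $\mathbb{P}^4$:
$$\sum_{i=1}^{5} z_i^5 = 0 , \tag{2}$$
where $(z_1, \ldots, z_5) \sim \lambda (z_1, \ldots, z_5)$ are coordinates on projective space and $\lambda \in \mathbb{C}^*$. As (2) is a homogeneous equation of degree five, we designate this geometry $\mathbb{P}^4(5)_{-200}$. The subscript denotes the Euler character $\chi = 2(h^{1,1} - h^{2,1})$. This is the prototype example of a class of geometries. Consider the configuration matrix
$$X = \left[ \begin{array}{c|ccc} n_1 & q^1_1 & \cdots & q^1_K \\ \vdots & \vdots & \ddots & \vdots \\ n_\ell & q^\ell_1 & \cdots & q^\ell_K \end{array} \right] , \qquad q^i_a \ge 0 .$$
The zero locus of a set of $K$ homogeneous polynomials defined by the given matrix over the combined set of coordinates in the product of the projective spaces $\mathbb{P}^{n_i}$, where the $a$-th polynomial has degree $q^i_a$ in the coordinates of $\mathbb{P}^{n_i}$, is a complete intersection Calabi-Yau threefold provided that
$$\sum_{i=1}^{\ell} n_i - K = 3 , \qquad \sum_{a=1}^{K} q^i_a = n_i + 1 \quad \text{for each } i .$$
The former condition imposes the requirement that the manifold is a complete intersection threefold while the latter guarantees that $c_1 = 0$. The simplest geometries obtained in this manner are
$$\mathbb{P}^4(5)_{-200} , \quad \mathbb{P}^5(2,4)_{-176} , \quad \mathbb{P}^5(3,3)_{-144} , \quad \mathbb{P}^6(2,2,3)_{-144} , \quad \mathbb{P}^7(2,2,2,2)_{-128} .$$
The Tian-Yau manifold is another example of a CICY threefold:
$$\left[ \begin{array}{c|ccc} 3 & 3 & 0 & 1 \\ 3 & 0 & 3 & 1 \end{array} \right]_{-18} : \qquad a^{ijk} w_i w_j w_k = 0 , \quad b^{ijk} z_i z_j z_k = 0 , \quad c^{ij} w_i z_j = 0 ,$$
where $w$ and $z$ are homogeneous coordinates on each of the two $\mathbb{P}^3$s and $a$, $b$, $c$ are generic coefficients.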
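The two conditions on a configuration matrix can be checked mechanically. The following is a minimal sketch, assuming the standard statement of the conditions (the ambient dimensions minus the number of polynomials equals three, and the degrees in each row sum to the corresponding projective dimension plus one). The function name `is_cicy_threefold` and the argument layout are illustrative choices, not notation from the text.

```python
def is_cicy_threefold(ambient_dims, degrees):
    """Check the two CICY threefold conditions on a configuration matrix.

    ambient_dims[i] is the dimension n_i of the i-th projective space factor.
    degrees[i][a] is the degree of the a-th polynomial in the coordinates
    of the i-th factor. (Layout assumed for this sketch.)
    """
    if len(degrees) != len(ambient_dims) or not degrees:
        raise ValueError("need one row of degrees per projective space factor")
    num_polys = len(degrees[0])
    if any(len(row) != num_polys for row in degrees):
        raise ValueError("all rows must have the same number of columns")

    # Complete intersection of complex dimension three.
    is_threefold = sum(ambient_dims) - num_polys == 3
    # Vanishing first Chern class: each row sums to n_i + 1.
    c1_vanishes = all(sum(row) == n + 1 for n, row in zip(ambient_dims, degrees))
    return is_threefold and c1_vanishes


quintic_ok = is_cicy_threefold([4], [[5]])
tian_yau_ok = is_cicy_threefold([3, 3], [[3, 0, 1], [0, 3, 1]])
quartic_in_p4_ok = is_cicy_threefold([4], [[4]])  # fails the c_1 = 0 condition
```

The quintic and the Tian-Yau configuration pass both checks, while a quartic in $\mathbb{P}^4$ has the right dimension but a non-vanishing first Chern class.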
For CICY threefolds, the size of the configuration matrix $X$ is constrained. We find that Here, $N_1$ counts the number of $\mathbb{P}^1$ factors and $N_a$ counts the number of other projective space factors. There are 7890 configuration matrices ranging in size from $1 \times 1$ (the quintic) to $12 \times 15$ with elements $q^i_a \in [0, 5]$. In this dataset, we find 70 distinct Euler characters $\chi \in [-200, 0]$ and 266 distinct Hodge pairs $(h^{1,1}, h^{2,1})$. The topological invariant $h^{1,1}$ counts the number of two cycles and four cycles and accounts for the Kähler deformations, whereas $h^{2,1}$ counts the number of three cycles and accounts for the complex structure deformations. These are, respectively, the size and shape parameters of the geometry. Within the set of CICY threefolds, mirror symmetry (invariance under the interchange $h^{1,1} \leftrightarrow h^{2,1}$) is not a property of the dataset. As $\chi$ is never positive, $h^{1,1} \le h^{2,1}$ for any given CICY threefold. The Euler character is a cubic expression in the elements of the configuration matrix. Calculating $h^{1,1}$ and $h^{2,1}$ is conceptually straightforward but requires some care [39-44]. One of the goals of applying machine learning to this dataset is to circumvent the necessity of studious sequence chasing.
Of the CICY threefolds, 195 possess freely acting symmetries; 37 different finite groups appear, ranging from $\mathbb{Z}_2$ to $\mathbb{Z}_8 \rtimes H_8$ [34]. A number of CICY threefolds also admit non-freely acting symmetries [45,46]. We tabulate the number of geometries with each value of $h^{1,1}$ in Table 1. Among the CICY threefolds, 4874 out of the 7890 are favorable, i.e., $h^{1,1}$ equals the number of $\mathbb{P}^n$ factors in the ambient space. Notice that this is a slightly different definition of favorable than others that have appeared in the literature: [10], for example, defines a geometry as favorable when its second cohomology class descends from that of the ambient space $\mathcal{A} = \mathbb{P}^{n_1} \times \ldots \times \mathbb{P}^{n_\ell}$. Our definition misses those geometries that can be made favorable by splitting the CICY configuration matrix further, or by thinking of the CICY as a hypersurface in del Pezzo products. We finally note that $h^{2,1}$ ranges over a larger interval than $h^{1,1}$. Figure 1 plots the number of geometries at a given $h^{2,1}$. Knowing $h^{1,1}$, once we compute $\chi$, the Hodge number $h^{2,1}$ is of course redundant information. The goal of machine learning is to determine topological invariants and properties like favorability using the configuration matrix as an input.
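The redundancy of $h^{2,1}$ follows directly from $\chi = 2(h^{1,1} - h^{2,1})$. As a small illustration (the helper name `h21_from_euler` is hypothetical), the relation can be inverted as follows, using the quintic with $h^{1,1} = 1$ and $\chi = -200$ as the worked case.

```python
def h21_from_euler(h11, chi):
    """Invert chi = 2 (h11 - h21) to recover h21 for a Calabi-Yau threefold."""
    if chi % 2 != 0:
        raise ValueError("the Euler character of a Calabi-Yau threefold is even")
    return h11 - chi // 2


# Quintic: h^{1,1} = 1 and chi = -200.
quintic_h21 = h21_from_euler(1, -200)
# Tian-Yau manifold: h^{1,1} = 14 and chi = -18 (well-known values).
tian_yau_h21 = h21_from_euler(14, -18)
```

This is why the experiments only need to target $h^{1,1}$: once $\chi$ has been computed from the configuration matrix, $h^{2,1}$ carries no additional information.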
The CICY fourfolds are catalogued in [47]. (There are 921 497 configuration matrices, most of which correspond to elliptically fibered Calabi-Yau spaces.) Fourfolds have four non-trivial Hodge numbers of which three are independent:
$$h^{2,2} = 2 \left( 22 + 2 h^{1,1} + 2 h^{3,1} - h^{2,1} \right) .$$
Identifying all of the discrete symmetries in this dataset has not been accomplished. Thus, there is a potential benefit to applying machine learning to this effort as well. This is work in progress.
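The dependence among the fourfold Hodge numbers can be made concrete with a small helper. The sketch below assumes the standard relation $h^{2,2} = 2(22 + 2h^{1,1} + 2h^{3,1} - h^{2,1})$ for Calabi-Yau fourfolds. The function name and the choice of the sextic in $\mathbb{P}^5$ as an example are illustrative and not taken from the text.

```python
def h22_fourfold(h11, h21, h31):
    """Dependent Hodge number of a Calabi-Yau fourfold (standard relation)."""
    return 2 * (22 + 2 * h11 + 2 * h31 - h21)


# Sextic hypersurface in P^5: (h11, h21, h31) = (1, 0, 426).
sextic_h22 = h22_fourfold(1, 0, 426)
```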

Neural Networks and Support Vector Machines
We briefly summarize the main ideas behind the machine learning tools we have employed in this paper (and its predecessor [21]), namely, neural networks and Support Vector Machines (SVMs). Neural networks and SVMs can function as both classifiers and regressors, and as such have been the subject of active research in the machine learning community for several years. We point the reader to the Appendices of [21] for further details. The reader familiar with these techniques can skip ahead to Section 4 in which we record our results.

Feed-forward Neural Networks
A neural network can be thought of as a non-trivial function $f$ acting on an input vector $v_{\rm in}$ to produce an output vector $v_{\rm out}$, that is, $f(v_{\rm in}) = v_{\rm out}$. The most successful neural network model is the feed-forward neural network, alternately known as the multi-layer perceptron.
Architecturally, as the name indicates, a multi-layer perceptron consists of multiple layers, each of which is a collection of a number of nodes called neurons. A multi-layer perceptron has an input layer (whose nodes correspond to the components of the input vector $v_{\rm in}$), a number of hidden layers, and an output layer (whose nodes correspond to the components of the output vector $v_{\rm out}$). In a feed-forward neural network, information always moves in one direction, from the input layer to the output layer. Every neuron in a given layer is connected to every neuron in the adjacent layers. Such connections are parameterized by weights denoted by the vector $w$.
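The output of the $i$-th neuron in the $n$-th layer is
$$\sigma^n_i = \theta \Big( \sum_j w^n_{ij} \, \sigma^{n-1}_j + b^n_i \Big) ,$$
where $\theta$ is the activation function, $w^n_{ij}$ is the weight associated with the connection between the $i$-th neuron in the $n$-th layer and the $j$-th neuron in the $(n-1)$-th layer, and $b^n_i$ denotes the bias in the activation function for the $i$-th neuron in the $n$-th layer. This multilayer feedforward architecture is what allows multi-layer perceptrons to be universal approximators [48].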
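The layer-by-layer propagation described above can be sketched in a few lines. This is an illustrative forward pass only: each neuron applies an activation function to a weighted sum of the previous layer's outputs plus a bias. The choice of `tanh` as the activation and the toy weights are assumptions made for the example, not the architecture used in the experiments.

```python
import math


def forward(v_in, weights, biases, activation=math.tanh):
    """Propagate an input vector through a fully connected network.

    weights[n][i][j] connects neuron j of the previous layer to neuron i
    of layer n; biases[n][i] is the bias of neuron i in layer n.
    """
    sigma = list(v_in)
    for layer_weights, layer_biases in zip(weights, biases):
        sigma = [
            activation(sum(w_ij * s_j for w_ij, s_j in zip(row, sigma)) + b_i)
            for row, b_i in zip(layer_weights, layer_biases)
        ]
    return sigma


# Toy network: 2 inputs -> 2 hidden neurons -> 1 output (hypothetical values).
toy_weights = [[[1.0, -1.0], [0.5, 0.5]], [[1.0, 1.0]]]
toy_biases = [[0.0, 0.0], [0.0]]
output = forward([1.0, 1.0], toy_weights, toy_biases)
```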
Learning happens when the multi-layer perceptron is trained to output desired vectors.
Consider a multi-layer perceptron with $m$ layers being trained on a set of size $n$. For the $i$-th training example, denote the output vector of the multi-layer perceptron by $\sigma^m|_i$, and the desired output vector by $t_i$. The mean squared error cost function can then be defined as
$$E := \frac{1}{n} \sum_{i=1}^{n} \big| \sigma^m|_i - t_i \big|^2 . \tag{12}$$
The idea is to minimize $E$ by adjusting the weights ($w^k_{ij}$) and biases ($b^k_i$) in the multi-layer perceptron. In general this is done efficiently via gradient descent using the technique of error back propagation. Without going into derivations, here we simply inform the reader that the prescribed adjustments, or shifts in the values of the weights and biases for a gradient descent step, are given by
$$\Delta w^k_{ij} = - \eta \, \frac{\partial E}{\partial w^k_{ij}} , \qquad \Delta b^k_i = - \eta \, \frac{\partial E}{\partial b^k_i} ,$$
where $\eta$ is the learning rate. The parameter $\eta$ should be chosen judiciously since it has a strong bearing on the convergence of the gradient descent algorithm and its ability to find the true minima. Once all the weights and biases have been set in this way, the neural network is trained.
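To make the gradient descent step concrete, the sketch below trains the simplest possible case: a single linear neuron with one weight and one bias, for which the gradient of the mean squared error can be written by hand instead of obtained through back propagation. The data, the learning rate, and the number of steps are hypothetical values chosen for illustration.

```python
def mean_squared_error(w, b, inputs, targets):
    """Mean squared error of the linear model w * x + b."""
    return sum((w * x + b - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)


def gradient_step(w, b, inputs, targets, eta):
    """Shift w and b by -eta times the gradient of the mean squared error."""
    n = len(inputs)
    grad_w = sum(2.0 * (w * x + b - t) * x for x, t in zip(inputs, targets)) / n
    grad_b = sum(2.0 * (w * x + b - t) for x, t in zip(inputs, targets)) / n
    return w - eta * grad_w, b - eta * grad_b


# Toy data generated by t = 2 x + 1.
inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
initial_error = mean_squared_error(w, b, inputs, targets)
for _ in range(2000):
    w, b = gradient_step(w, b, inputs, targets, eta=0.05)
final_error = mean_squared_error(w, b, inputs, targets)
```

With a learning rate that is too large the same loop diverges, which is one way to see why $\eta$ must be chosen with care.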
There are neural network architectures which allow for layers in which the neurons do not receive a weighted sum from all the neurons in the previous layer, but employ a kernel (grid) that restricts the neurons that can contribute. Such neural networks are called convolutional neural networks. These are best suited to data whose inputs exhibit translation or rotational invariances, and are thus suited to problems in image recognition.
Sometimes, the complexity of the neural network is such that it possesses more computing potential than is actually required. This leads to the problem of overfitting, wherein the accuracy of the neural network against unseen data stops improving despite growing training accuracy.
The technique of dropout provides a way to counter this by randomly dropping neurons, along with their connections, from the neural network during training. This is a proven strategy against overfitting and tries to force the neurons to learn more general features of the dataset [49].
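A minimal sketch of dropout applied to one layer's activations is given below. It uses the common "inverted" convention, in which surviving activations are rescaled by the keep probability so that their expected value is unchanged. That rescaling, the function name, and the drop probability are assumptions of this example and are not specified in the text.

```python
import random


def dropout(activations, drop_prob, rng):
    """Randomly zero activations during training (inverted dropout)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must lie in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Each neuron survives independently with probability keep_prob.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
unchanged = dropout(layer_output, drop_prob=0.0, rng=rng)
```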
Optimizing the error function (12) even for a relatively simple architecture involves the tuning of a large number of weights and biases (often running into tens of thousands), which can be a drawback of neural networks. SVMs, which we discuss next, take a geometric approach to learning, and typically do not require as many parameters.

Support Vector Machines
Support Vector Machines (SVMs) are natural binary classifiers and are thought to be one of the best off-the-shelf supervised machine learning algorithms. The simplest SVM is a binary classifier for linearly separable data. The classification is performed by finding an optimal hyperplane that can separate clusters of points from the two classes in the feature space. This can be extended to tackle non-linearly separable data (using the so-called kernel trick) and data that have multiple classes [50].
The simplest situation to consider is the binary classification of points in $\mathbb{R}^n$. We begin by defining a hyperplane $H$ with a normal vector $w$ by
$$H := \{ x \in \mathbb{R}^n \mid f(x) := w \cdot x + b = 0 \} .$$
The idea is to find the $H$ such that data points in the two classes lie as far from it as possible.
Alternately, one maximizes the margin, which is the distance along the normal vector $w$ between the two vectors that are the closest to the hyperplane $H$ on either side. Such vectors are called support vectors and it turns out that they fully specify the SVM. If we denote them by $x_\pm$, corresponding to the two classes, the margin is then given by $M := w \cdot (x_+ - x_-)/|w|$. Since rescaling $w$ and $b$ by the same factor does not change $H$, one can rescale them such that $f(x_\pm) = \pm 1$.
This reduces the margin $M$ to $2/|w|$. Thus an $x_i \in \mathbb{R}^n$ is classified by the SVM using the function $\mathrm{sign}(f(x_i)) \in \{-1, 1\}$, as belonging to the positive or negative class. We denote the result of classification of $x_i$, that is, $\mathrm{sign}(f(x_i))$, as $y_i$. An alternate statement of the problem is then

Optimization Problem: minimize $\frac{1}{2} |w|^2$ subject to $y_i \, (w \cdot x_i + b) \ge 1$ for all $i$.

Since the objective function of the above optimization problem is convex, the solution is relatively straightforward, using standard algorithms. One can recast this problem using Lagrange multipliers $\Theta_i$ as follows:
$$L = \frac{1}{2} |w|^2 - \sum_i \Theta_i \big( y_i \, (w \cdot x_i + b) - 1 \big) ,$$
which presents the

Dual Optimization Problem: maximize $\sum_i \Theta_i - \frac{1}{2} \sum_{i,j} \Theta_i \Theta_j \, y_i y_j \, x_i \cdot x_j$ subject to $\Theta_j \ge 0$ and $\sum_j \Theta_j y_j = 0$.

The classifying function is $\mathrm{sign}(f(x)) := \mathrm{sign} \big( \sum_i \Theta_i y_i \, x_i \cdot x + b \big)$. It turns out that the only non-zero $\Theta_i$ correspond to the support vectors.
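The classifying function can be illustrated with a toy problem that has exactly two support vectors, $x_+ = (1, 1)$ and $x_- = (-1, -1)$. For this configuration the multipliers can be worked out by hand: requiring $f(x_\pm) = \pm 1$ together with $\sum_j \Theta_j y_j = 0$ gives $\Theta_\pm = 1/4$ and $b = 0$. The sketch below hard-codes these values instead of solving the dual problem numerically, and the function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decision_function(x, support_vectors, labels, thetas, b):
    """f(x) = sum_i Theta_i y_i (x_i . x) + b over the support vectors."""
    return sum(t * y * dot(sv, x) for sv, y, t in zip(support_vectors, labels, thetas)) + b


def classify(x, support_vectors, labels, thetas, b):
    return 1 if decision_function(x, support_vectors, labels, thetas, b) >= 0 else -1


support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
thetas = [0.25, 0.25]  # worked out by hand for this toy problem
b = 0.0

# Normal vector w = sum_i Theta_i y_i x_i and the resulting margin 2 / |w|.
w = [sum(t * y * sv[k] for sv, y, t in zip(support_vectors, labels, thetas)) for k in range(2)]
margin = 2.0 / math.sqrt(dot(w, w))

f_plus = decision_function(support_vectors[0], support_vectors, labels, thetas, b)
f_minus = decision_function(support_vectors[1], support_vectors, labels, thetas, b)
```

The margin equals the distance between the two support vectors, $2\sqrt{2}$, as expected when the hyperplane bisects the segment joining them.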
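In order to deal with non-linearly separable data, one could map points in the feature space to a higher dimensional feature space where the data is linearly separable. Once the optimal hyperplane $H$ is found, one can then map back to the initial feature space. The kernel trick implies that this is equivalent to solving the dual optimization problem after replacing the dot product $x_i \cdot x$ by $\mathrm{Ker}(x_i, x)$. A common form for Ker is the Gaussian
$$\mathrm{Ker}(x_i, x) = \exp \left( - \frac{|x_i - x|^2}{2 \sigma^2} \right) ,$$
with $\sigma$ a width parameter, which is what we employ in this work, since it leads to the best results.

SVMs can act as linear regressors by attempting to fit the flattest function $f(x) := w \cdot x + b$ to the data within a residue $\epsilon$. With $y_i$ now denoting the target value for $x_i$, this is equivalent to the optimization problem
$$\text{minimize } \frac{1}{2} |w|^2 \quad \text{subject to} \quad | y_i - (w \cdot x_i + b) | \le \epsilon ,$$
since $|\nabla f|^2 = |w|^2$. Similar to the case of SVM classifiers above, one can introduce Lagrange multipliers here and devise a dual version of the problem:
$$L = \frac{1}{2} |w|^2 - \sum_i \Theta_i \big( \epsilon - y_i + w \cdot x_i + b \big) - \sum_i \Theta^*_i \big( \epsilon + y_i - w \cdot x_i - b \big)$$
leads to the dual problem
$$\text{maximize } - \frac{1}{2} \sum_{i,j} (\Theta_i - \Theta^*_i)(\Theta_j - \Theta^*_j) \, x_i \cdot x_j - \epsilon \sum_i (\Theta_i + \Theta^*_i) + \sum_i y_i \, (\Theta_i - \Theta^*_i)$$
subject to the conditions $\Theta_i, \Theta^*_i \ge 0$ and $\sum_i (\Theta_i - \Theta^*_i) = 0$. As with the SVM classifier, one can employ the kernel trick to fit non-linear functions to the data.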
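The effect of replacing the dot product by a Gaussian kernel can be seen on an XOR-like arrangement of four points, which no hyperplane separates in the original feature space. The sketch below assumes the normalization $\exp(-|u - v|^2 / 2\sigma^2)$ for the kernel. The multipliers are hypothetical values chosen to satisfy $\sum_j \Theta_j y_j = 0$ and are not the solution of the dual problem.

```python
import math


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel exp(-|u - v|^2 / (2 sigma^2)); normalization assumed."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_decision_function(x, points, labels, thetas, b, kernel):
    """f(x) with the dot product x_i . x replaced by kernel(x_i, x)."""
    return sum(t * y * kernel(p, x) for p, y, t in zip(points, labels, thetas)) + b


# XOR-like data: opposite corners of the unit square share a label.
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]
thetas = [1.0, 1.0, 1.0, 1.0]  # hypothetical multipliers, sum_j theta_j y_j = 0
b = 0.0

predictions = [
    1 if kernel_decision_function(p, points, labels, thetas, b, gaussian_kernel) >= 0 else -1
    for p in points
]
```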
To avoid overfitting in SVMs and allow better generalization to unseen data, one can allow a few training points to be misclassified. This has the effect of avoiding over-constraining the hyperplane $H$, and is achieved by replacing the condition $\Theta_i \ge 0$ in the dual optimization problem by $0 \le \Theta_i \le C$, where $C$ is called the cost variable.

Architecture
Our analysis in this paper involves a neural network regressor as well as a classifier, and an SVM regressor. The architectures for the regressors are similar to those in [21]. We use the

We use machine learning to compute the Hodge number $h^{1,1}$ of CICY threefolds. Training on the configuration matrices at low $h^{1,1}$, the algorithms successfully predict trends in the distributions of Hodge numbers at higher $h^{1,1}$, but do not provide accuracy comparable to the random sampling previously studied in [21]. This is corrected by including a small selection of samples at higher $h^{1,1}$.
We set up the experiment in two parts. In the first part, we train with configuration matrices with $h^{1,1} \le x$, and test with configuration matrices with $h^{1,1} > x$. In the second part, we repeat the experiment by augmenting the training set above with 10% of the configuration matrices with $h^{1,1} > x$, randomly sampled, and test using the remaining configuration matrices. We denote these two training sets by $T_x$ and $\tilde{T}_x$ respectively. The integer bound $x$ is a tuneable parameter. In our experiments we choose $2 \le x \le 10$. With reference to Table 1, the size of the training set $T_x$ is $N(x)$, the number of configuration matrices with $h^{1,1} \le x$, and the size of the test set is $7890 - N(x)$. Using the training set $T_x$ ($\tilde{T}_x$), at $x = 7$, we train with $\sim 53\%$ (58%) of the dataset while at $x = 9$, we train with $\sim 83\%$ (85%) of the dataset.
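The two-part setup can be sketched as a single splitting routine: all samples with $h^{1,1} \le x$ go into the training set, and an optional fraction of the remaining samples is moved from the test set into the training set. The function name, the `(matrix, h11)` pair layout, and the toy data below are assumptions for illustration. The real inputs would be the 7890 configuration matrices.

```python
import random


def split_by_h11(samples, x, seed_fraction=0.0, rng=None):
    """Split (matrix, h11) pairs into training and test sets at threshold x.

    seed_fraction = 0.0 gives the plain low-h11 training set; a positive
    value additionally moves that fraction of the high-h11 samples, chosen
    at random, into the training set.
    """
    rng = rng or random.Random(0)
    low = [s for s in samples if s[1] <= x]
    high = [s for s in samples if s[1] > x]
    num_seed = round(seed_fraction * len(high))
    seed_indices = set(rng.sample(range(len(high)), num_seed))
    train = low + [s for i, s in enumerate(high) if i in seed_indices]
    test = [s for i, s in enumerate(high) if i not in seed_indices]
    return train, test


# Toy stand-in for the dataset: 10 placeholder matrices at each h11 = 1..12.
toy_samples = [("matrix_%d" % i, 1 + i % 12) for i in range(120)]

train_plain, test_plain = split_by_h11(toy_samples, x=7)
train_seeded, test_seeded = split_by_h11(toy_samples, x=7, seed_fraction=0.1)
```

In both cases the test set contains only samples with $h^{1,1} > x$, so the evaluation always probes the region that is under-represented in training.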
The true distribution of CICY threefolds peaks at the value $h^{1,1} = 7$. Figure 2 shows neural network and SVM predictions of this distribution. Figure 3 shows the accuracy, root-mean-squared (rms) errors and Matthews correlation coefficient ($\phi$) for the predictions. The left and right panels of these figures correspond to the use of the two training sets $T_x$ and $\tilde{T}_x$ respectively, which were defined above. The neural network classifier performs rather poorly when trained using the set $T_x$, and we exclude its predictions from Figures 2 and 3.
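For reference, the three figures of merit can be sketched as follows. Accuracy counts exactly correct predictions and the rms error measures the typical size of a miss. The Matthews coefficient is shown in its textbook binary form. How the multi-valued $h^{1,1}$ predictions are reduced to a single $\phi$ is not specified here, so the binarization used in the example (whether $h^{1,1}$ exceeds a threshold) is purely an assumption, as are the toy numbers.

```python
import math


def accuracy(predicted, true):
    """Fraction of predictions that are exactly correct."""
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


def rms_error(predicted, true):
    """Root-mean-squared difference between predictions and true values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))


def matthews_phi(predicted, true):
    """Matthews correlation coefficient for boolean labels."""
    tp = sum(p and t for p, t in zip(predicted, true))
    tn = sum((not p) and (not t) for p, t in zip(predicted, true))
    fp = sum(p and (not t) for p, t in zip(predicted, true))
    fn = sum((not p) and t for p, t in zip(predicted, true))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


true_h11 = [7, 8, 6, 9, 7, 10]
predicted_h11 = [7, 7, 6, 9, 8, 12]

acc = accuracy(predicted_h11, true_h11)
rms = rms_error(predicted_h11, true_h11)
# Assumed binarization: is h11 larger than 7?
phi = matthews_phi([p > 7 for p in predicted_h11], [t > 7 for t in true_h11])
```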
Focusing first on the experiment using the training set $T_x$, wherein we use the neural network and SVM regressors, we note that the algorithms predict a peak in the $h^{1,1}$ distribution for each value of $x$, though the position of the peak is slightly incorrect. Both the algorithms consistently overpredict the number of manifolds with low $h^{1,1}$, regardless of the parameter $x$. This is not surprising since the only data the machine has seen for training are those geometries with $h^{1,1} \le x$. This stagnates the neural network, with it eventually predicting most of the manifolds with $h^{1,1} > x$ to have $h^{1,1} \le x$, causing the growth in the rms error after the initial dip (Figure 3). The dip itself corresponds to the better predictions as seen in the neural network plot (Figure 2). From the accuracy and rms error plots (left panels in Figure 3), we note that the SVM performs significantly better than the neural network, though the overall predictive powers of both the algorithms are limited. This analysis shows that the regressors are capable of predicting trends in the distribution of Hodge numbers from the limited data.
We now compare the results above with those from the experiment using the modified training set $\tilde{T}_x$. The right panels in Figure 2 show the level of agreement of the predictions with the true $h^{1,1}$ distribution, demonstrating a marked improvement in the machines' predictive ability compared with the results above. This is further evidenced by the higher accuracies and Matthews coefficient, and lower rms errors (in the right panels of Figure 3). This significant enhancement of predictive ability is seemingly disproportionate to the expected gain of these algorithms (especially the neural networks) from the use of an increased number of training examples. This indicates that adding a small fraction of randomly sampled data from the list of manifolds with $h^{1,1} > x$ to the training set results in significantly improved predictions. Finally, we note that the neural networks perform better than the SVM in the domain of low $x$, and the SVM performs marginally better in the domain of high $x$. The accuracy, which is lower than what we report in [21], corresponds to an exactly correct identification of a manifold's $h^{1,1}$ based on an imbalanced training set. The misidentifications follow a Gaussian profile: a prediction is more likely to be off by a little than by a lot. Even with a simple Mathematica implementation, the algorithm is much better at distinguishing large from larger $h^{1,1}$.
As we have noted in Section 2, the Euler character is cubic in the elements of the configuration matrix. It is also proportional to the difference between $h^{1,1}$ and $h^{2,1}$. Instead of training with the elements $m_{ij}$ of the CICY configuration matrix, suppose we use $m_{ij}^2$ or $m_{ij}^3$ as inputs.$^1$ We can nudge the performance slightly. The square and cubic inputs both yield nearly the same results (Figure 3). The neural networks respond more favorably to the alternative input than the SVM.
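The alternative inputs amount to an elementwise power applied to the configuration matrix before it is handed to the learner. The sketch below also zero-pads every matrix to the maximal $12 \times 15$ shape and flattens it into a fixed-length vector. The padding scheme and the function name are assumptions made for the example, since the preprocessing is not spelled out here.

```python
MAX_ROWS, MAX_COLS = 12, 15  # largest configuration matrix in the dataset


def power_features(config_matrix, power, rows=MAX_ROWS, cols=MAX_COLS):
    """Raise each entry to `power`, zero-pad to rows x cols, and flatten."""
    if len(config_matrix) > rows or any(len(r) > cols for r in config_matrix):
        raise ValueError("configuration matrix exceeds the padded shape")
    padded = [[0] * cols for _ in range(rows)]
    for i, row in enumerate(config_matrix):
        for j, entry in enumerate(row):
            padded[i][j] = entry ** power
    return [value for row in padded for value in row]


tian_yau = [[3, 0, 1], [0, 3, 1]]
squared_inputs = power_features(tian_yau, 2)
cubed_inputs = power_features(tian_yau, 3)
```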

Discussion
The difficulty of exploring the string landscape and characterizing the vacuum space of solutions is technical. We cannot perform detailed calculations, for instance, in Standard Model building, when the Hodge numbers are large. Indeed, even finding all triangulations of a reflexive polytope at $h^{1,1} \ge 7$ to determine the full set of toric Calabi-Yau threefolds that are candidate geometries for superstring compactification has not been accomplished [16]. A similar systematic effort for fourfolds in F-theory has not even been attempted. As a result, we do not know how many string vacua there are and what fraction of these resemble the real world.
Supervised machine learning provides a structure to attack this class of problems in the face of incomplete data. Studying CICY geometries, this letter suggests that the strategy to employ is to compute simple examples and a representative smattering of the harder cases.
This supplies the information that the machine requires to predict trends in the data and achieve results roughly comparable to sampling from the entire dataset. Something similar happens when neural networks learn the hyperbolic volume of knot complements from Jones polynomials [51]. The answers we obtain offer a starting point by flagging geometries that a string phenomenologist or a string theorist might find interesting. Because the answers are not always error-free, we view this as an example of probably approximately correct learning [52].
The topological invariants of CICY geometries are by now extremely well studied. We have therefore not learned anything new about these manifolds as a result of this investigation. The work of [18-21] and what we report here nevertheless teach us something profound. The traditional methods for computing topological features of Calabi-Yau geometries (sequence chasing, doubly exponential Gröbner basis algorithms, etc.) may not be the most efficient way to proceed. Machine learning responds to these queries in polynomial time. We therefore conclude that there are better ways to calculate.

$^1$ We thank Andre Lukas for suggesting this experiment.
How does a machine learn? At the most basic level, the problems we confront in computational algebraic geometry reduce to finding the (co-)kernels of integer matrices. We have a black box that applies this process to land on useful semantics without knowing any syntax.
The central open question is to dissect the black box and translate these algorithms into something a human can understand and implement. We aim to report progress in this endeavor in future work.