The World in a Grain of Sand: Condensing the String Vacuum Degeneracy

We propose a novel approach to the vacuum degeneracy problem of the string landscape, based on an efficient measure of similarity amongst compactification scenarios. Using a class of some one million Calabi-Yau manifolds as concrete examples, the paradigm of few-shot machine learning and Siamese Neural Networks represents them as points in R^3, where the similarity score between two manifolds is the Euclidean distance between their R^3 representatives. Using these methods, we can compress the search space for exceedingly rare manifolds to within one percent of the original data by training on only a few hundred data points. We also demonstrate how these methods may be applied to characterize 'typicality' for vacuum representatives.


I. INTRODUCTION AND SUMMARY
The biggest theoretical challenge to string/M-theory being "the theory of everything" is the proliferation of possible low-energy, 4-dimensional solutions akin to our universe. The plethora of possibilities in reducing the higher space-time dimensions, where gravity and quantum field theory unify, gives rise to such astronomical numbers as the often-quoted 10^500 compactification scenarios [1], as well as more recent and much larger estimates [2,3]. While constraints such as the exact Standard Model particle spectrum severely reduce the allowed landscape of vacua [4-9], such reductions are typically on the order of 1 in 10^10 and are but a drop in the ocean.
Confronted with this vastness, a key resolution would be the identification of a measure on the landscape, so that oases of phenomenologically viable universes are favoured, while deserts of inconsistent realities are slighted. Such statistical approaches were undertaken in [10], even before exact string standard models were found. Nevertheless, finding such a measure, which quantifies how close one compactification scenario is to another, and thereby offers hope of vastly reducing the degeneracy of the string landscape, remains a conceptual and computational puzzle.
The typical approach adopted in these studies is to feed landscape data into an ML framework and pose questions either as classifications (e.g., [11,16,21-23]) or regressions (e.g., [24]), or to use reinforcement learning [25-28] to formulate optimal strategies for arriving at both top-down and bottom-up models of particle phenomenology.
This letter shows, using concrete examples, that ML provides a natural and direct incursion into the degeneracy problem via the powerful paradigm of similarity, using so-called Siamese neural networks (SNNs) [29,30], an architecture precisely designed to assess the similarity of elements of a dataset. In addition, SNNs possess two powerful properties of great help in our analyses of landscape exploration: (1) few-shot learning, which evades the need for a significant amount of training data, and (2) supervised clustering [32], which explicitly realizes the similarity measure. In particular, the NN learns a representation of input data in an embedding space where "similar" points are close together with respect to the usual Euclidean distance. This similarity principle is our saving grace from the lack of a vacuum selection principle: given two compactification scenarios, a similarity score gives a measure on the landscape. Thus, whether a vacuum solution is phenomenologically viable can be decided by fiat and the "des res" [7] of our universe can be selected, thereby compressing the vastness of the landscape into typical representatives.
This setting is reminiscent of semi-supervised learning, where one has only a small amount of labeled data and the bulk of the data is unlabeled. The methods outlined in this paper, especially in Section IV B, may also be regarded as a natural starting point for such analyses. We provide here a concrete proof-of-concept of this idea using explicit data, viz., what was referred to as "landscape data" in [11,12]. We will focus on the so-called complete intersection Calabi-Yau (CICY) manifolds in complex dimension three [33-36] and four [37,38], which are chosen for their deep relevance to foundational problems in algebraic geometry and string theory [23,39-41], as well as for the proven effectiveness of AI/ML methods in analyzing these datasets [11,12,21,22,42-44]. These two datasets will be our representative "landscape". By clustering with similarity, one may hope to identify subsets of data which are very likely to yield a given topology, and conversely exclude those subsets which are not. In short, we are attempting to few-shot learn the string landscape.
This letter is organized as follows. We begin by introducing the landscape data and their representations in §II, and how SNNs address them in §III. The results are presented in §IV, and conclusions with outlook in §V.

II. REPRESENTING THE LANDSCAPE
The key point of our methodology is that a CICY can be realized as an integer matrix, which encodes the (multi-)degrees of the defining polynomials in some appropriate ambient space; see, e.g., the recent textbook [45]. While we relegate details to Appendix A, the upshot is that for the purposes of computing topological quantities, a CICY is the matrix

M = ( q_i^j ),   i = 1, ..., m,   j = 1, ..., K,

where q_i^j ∈ Z_{≥0} and D is the (complex) dimension of the manifold M, which for us will be 3 or 4, to which we will refer as CICY3 and CICY4 respectively.

Indeed, apart from their direct relevance to string phenomenology (the Hodge numbers determine the spectrum of massless fermions in the string compactification), these datasets also explicitly realize the general spirit of the landscape problem. The computation of the Hodge numbers is a complicated problem whose difficulty is further exacerbated by the sheer number of cases for which this must be done. This, in a nutshell, is the string landscape problem.
One of the key problems in algebraic geometry, and in parallel in string theory, is to compute topological quantities from M. Such topological quantities govern important properties such as the fundamental standard-model particles. Indeed, the paradigm of string theory is that the geometry of compactification manifolds such as M determines the physics of the macroscopic (3+1) dimensions of spacetime. The most famous such topological quantities are the so-called Hodge numbers, a complex generalization of the Betti numbers which count the number of "holes" of various dimension in M. There are multiple Hodge numbers for Calabi-Yau manifolds of dimension D, and since the early days of string phenomenology, these quantities have been interpreted as dictating the particle content of the compactification of string theory to the low-energy standard model [47]. We will largely focus on the positive integer h^{1,1} throughout this letter.

Thus, the model for our "landscape" will be labelled data of the form

M^{(A)} ∼ M^{(B)}  ⟺  h^{1,1}(M^{(A)}) = h^{1,1}(M^{(B)}),

where the criterion for similarity ∼ is between two CICY matrices labeled by (A) and (B). We emphasize that our methodology is general, and one could choose any other quantity of geometrical and phenomenological interest to label the data.

The demographics of the CICY datasets are outlined in the Supplementary Materials (Appendix B). We note here that the CICY3 dataset consists of 7890 matrices corresponding to 18 distinct values of h, while the CICY four-fold dataset is much larger, with 905684 non-trivial entries and 23 distinct values of h. An important hindrance in the study of either dataset is its extreme skewness, with the tails of allowed h values sparsely populated, while the middle is densely populated. Indeed, this skewness characterizes every known landscape dataset [20].
A possible way of addressing this skewness is to construct synthetic data for the sparsely populated classes à la [43]. Few-shot learning enables us to go in a complementary direction, where we aggressively reduce the number of elements we need for training, even for densely populated classes. Indeed, to learn the 7890 elements of CICY3 and the 905684 elements of CICY4, we draw on merely 2.67% and 0.62% of the full datasets, respectively.
Finally, since the CICY matrices have variable shape, with sizes ranging from 1×1 (the famous quintic threefold) to 15×18, we uniformize the input data by resizing each matrix to a uniform size n×n, as described in Appendix B 2. This uniformization also removes explicit information about the number of rows and columns in a given CICY matrix. Since a large portion of the datasets is favorable, i.e. the Hodge number h^{1,1} equals the number of rows of the matrix, this step guards against the SNN learning spurious correlations between the matrix size and the Hodge number. As a further check, we find similar results when uniformizing via bi-linear interpolation [48], which is a completely different approach from padding and washes out any favourability information.

III. METHODOLOGY
As mentioned in the Introduction, there is tremendous difficulty but utmost importance in defining an appropriate distance on the landscape, even theoretically, so as to identify the "typical" vacuum or the "similarity" between vacua. The key element with which an SNN solves this problem is a so-called Features Network (FN). This implements a map φ_w from elements of a dataset D to R^d. Here w denotes the weights and biases of the FN, and we set d = 3 below. The desired property of the map φ_w, visualized in Figure 2, is that similar elements of D are mapped close together, and dissimilar elements are mapped far apart. The w are determined by extremizing a loss function dependent on the squared Euclidean distance d_w(x_1, x_2) = ||φ_w(x_1) − φ_w(x_2)||^2 between the representative points of data elements x_1 and x_2. There are multiple options for the loss function, starting with the original approach of [29,30], and we adopt the triplet loss function given by [49,50]

L(x_a, x_p, x_n) = max( d_w(x_a, x_p) − d_w(x_a, x_n) + α, 0 ),   (III.1)

where x_a is a reference 'anchor' CICY matrix, x_p is a 'positive' CICY matrix with the same h as x_a, x_n is a 'negative' CICY matrix with a different h, and α is a margin hyperparameter. Minimizing this loss in w leads to learning an embedding φ_w such that similar data cluster together and dissimilar data are pushed apart. The d_w in (III.1) is then interpreted as our desired similarity score.
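The mechanics of the triplet loss can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the toy "features network" is a single linear map standing in for the ConvNet, and the input vectors are illustrative stand-ins for flattened CICY matrices.

```python
import numpy as np

def embed(x, W):
    """Toy 'features network': a linear map phi_w(x) = W @ x into R^3,
    standing in for the convolutional FN described in the text."""
    return W @ x

def triplet_loss(W, x_a, x_p, x_n, margin=1.0):
    """Triplet loss (III.1): pull the anchor/positive pair together and
    push the anchor/negative pair apart, up to a margin alpha."""
    d_ap = np.sum((embed(x_a, W) - embed(x_p, W)) ** 2)  # squared distance d_w
    d_an = np.sum((embed(x_a, W) - embed(x_n, W)) ** 2)
    return max(d_ap - d_an + margin, 0.0)

# Hypothetical flattened inputs (illustrative only, not real CICY data)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))            # embedding into R^3
x_a = np.array([1.0, 0.0, 2.0, 1.0])   # anchor
x_p = x_a + 0.01 * rng.normal(size=4)  # positive: same class, nearby
x_n = np.array([5.0, 3.0, 0.0, 4.0])   # negative: different class

loss = triplet_loss(W, x_a, x_p, x_n)
```

In practice the weights W would be tuned by gradient descent so that the loss vanishes over all training triplets; here they are random, so the single evaluation merely shows the bookkeeping.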
The input data of n×n real-valued matrices have a natural interpretation as pixelated images, where the (i,j)-th pixel is coloured according to q_i^j in grey-scale. Our features network, Figure 1, exploits this "image" representation by incorporating elements from computer-vision architectures, principally convolutional layers. The resulting network is called a convolutional neural network, or ConvNet, and is briefly reviewed in Appendix C.
We may also, inspired by [21,22], design the features network using simplified versions of the Inception [51,52] and Residual [53] blocks. This yields comparable results, and we focus on the ConvNet for simplicity.

IV. RESULTS & DISCUSSION
We now evaluate the trained SNN by computing similarity scores across the test set, which, we recall, constitutes 97.82% and 99.38% of the entire CICY3 and CICY4 data respectively. Mean similarity scores for each pair of Hodge numbers h^{1,1} from these test sets are displayed in Figures 3 and 4. We see that the similarity scores along the diagonal (i.e. for matrices belonging to the same h^{1,1}) are concentrated close to 0, while scores for dissimilar matrices are concentrated away from 0, which was indeed our criterion for the putative similarity measure. Put together, these figures explicitly demonstrate few-shot learning: the SNNs have been trained on extremely sparse data, sometimes just 3 elements from each class (see Tables I and II). Despite this extreme paucity of training data, the SNNs learn a similarity score that is representative of the full datasets. The representation of the dataset learned by the Siamese net in the embedding space R^3 is shown in Figures 5 and 6.

FIG. 6: CICY four-folds as visualized by the SNN. The colour scheme corresponds to h^{1,1}.

A. Clustering CICY Manifolds
We now apply these results to identify subsets of CICYs that are likely to contain given h^{1,1} values. Such computations provide a paradigm to isolate regions of the string landscape by their likelihood to contain standard-model-like vacua. The precise choice of these regions is somewhat subtle, and depends on the confidence with which we would like to select/reject manifolds in the landscape. For illustration, we train a Nearest Neighbors Classifier on the embedding-space representations learnt by the FN for CICY3 and CICY4 respectively.
Our results in Tables I and II demonstrate that the SNN hones in on relatively tiny regions of the landscape where these manifolds are most likely to be found. In all cases except h^{1,1} = 21 for CICY4, the clustering is significantly better than random choice. As an example, the h^{1,1} = 4 subclass of the CICY4 test set corresponds to 3829 manifolds out of the total 901099. Identifying even one manifold from this subset correctly by random guessing is extremely unlikely. In contrast, the above classifier predicts 4247 manifolds as corresponding to h^{1,1} = 4, and is correct for 2668 of these. Thus, the learned similarity score dramatically reduces the search space of h^{1,1} = 4 CICY4s from 901099 to 4247 'most likely' manifolds.
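The search-space reduction above rests on nearest-neighbour classification in the learnt embedding space. A minimal sketch of that step, on synthetic well-separated clusters standing in for the R^3 embeddings (the cluster positions and h^{1,1} labels are illustrative assumptions, not the paper's data):

```python
import numpy as np

def nearest_neighbour_predict(train_pts, train_labels, test_pts):
    """Assign each test point the label of its nearest training point
    (Euclidean distance in the embedding space)."""
    preds = []
    for p in test_pts:
        d = np.sum((train_pts - p) ** 2, axis=1)  # squared distances to train set
        preds.append(train_labels[int(np.argmin(d))])
    return np.array(preds)

# Toy stand-in for the learnt R^3 embedding: two well-separated clusters
# labelled by hypothetical h^{1,1} values 4 and 7.
rng = np.random.default_rng(1)
train_pts = np.vstack([rng.normal(0, 0.1, (5, 3)), rng.normal(3, 0.1, (5, 3))])
train_labels = np.array([4] * 5 + [7] * 5)
test_pts = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(3, 0.1, (20, 3))])
true_labels = np.array([4] * 20 + [7] * 20)

preds = nearest_neighbour_predict(train_pts, train_labels, test_pts)
accuracy = np.mean(preds == true_labels)
```

When the embedding separates the classes as cleanly as in Figures 5 and 6, even this 1-nearest-neighbour rule trained on a handful of points classifies the test set correctly, which is the mechanism behind the search-space compression in Tables I and II.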
B. Typical h^{2,1}s for CICY3s

The SNN has been trained above on a particular criterion of similarity, namely the matching of h^{1,1}. We now examine how the clustering learned by the SNN may be interpreted to reflect a still broader notion of similarity in the dataset. To define this question, we start from the general expectation that manifolds close to the centroid of a cluster may be thought of as being the 'most similar' to all other manifolds in the cluster.
For concreteness, we next examine whether h^{2,1} (another important Hodge number, with a much wider spread in value) of these manifolds is sufficiently generic within the h^{1,1} subset, or an outlier. As an example, consider h^{1,1} = 7, which has 1463 elements. The 23 distinct h^{2,1} values in this subset lie in the range 28.45 ± 4.41 at one confidence level. The complete distribution is visualized in Figure 7. Analyzing this clustering using the k-means algorithm [54,55] picks out h^{2,1} = 26, 28, 29, which may indeed be regarded as 'typical'. We show further details and corresponding results for other h^{1,1} values in Appendix E.

V. CONCLUSIONS AND OUTLOOK
In sum, AI/ML formulations of similarity condense the vast string landscape into comprehensibility, most notably by providing a concrete representation of the landscape in which similar vacua cluster together. Central to this is few-shot learning: the ability to draw inferences across the full landscape from very limited input data. This allows us to narrow the search space for 'desirable' vacua by drawing on a very limited training set, and opens up the exciting possibility of using these methods to obtain strong priors for regions likely to contain standard-model-like vacua, thereby aiding in the search for and classification of such solutions.
We further expect this line of exploration to yield still deeper insights into the underlying structures of the string landscape. Firstly, since any compactification scenario can be expressed as some numerical tensor input, as shown here for the CICYs, our framework is very generally applicable to the string landscape. Secondly, identifying criteria for similarity is itself an important advance on classification: rather than grouping objects into different categories, we identify the principal features which allow this grouping to take place. This step is the gateway leading from empiricism to understanding.
Additionally, while we are unaware of a mathematically rigorous framework by which manifolds with different Hodge numbers may be compared for similarity, our results enable us to do precisely this. Indeed, Figures 3 and 4 indicate that the SNN learns to regard manifolds with closer values of h^{1,1} as being 'more similar' to each other. This was not an input to the SNN, which is never shown the actual h^{1,1} values, and must be regarded as a nontrivial output. Interestingly, this notion of similarity is very intuitive for a human looking at Riemann surfaces embedded in three dimensions: one would indeed regard surfaces with 3 holes as more similar to ones with 4 holes than to ones with 10 holes.
Finally, our work also explicitly realizes a general paradigm for conjecture formulation using AI/ML [18], namely the ability to generalize significantly beyond the dataset on which the algorithm is trained. By training on a vanishingly small subset of the landscape and extracting meaningful results on the full dataset, we demonstrate that the ability to extract precise conjectures about the string landscape is well within grasp, even though the full landscape may not yet be. Hence we expect the analysis here to be only the precursor to an exhaustive exploration of the string landscape and, more generally, the physical and mathematical properties of string theory.
Acknowledgements: YHH would like to thank STFC for grant ST/J00037X/1. SL acknowledges support from the Simons Collaboration on the Non-perturbative Bootstrap and the Faculty of Sciences, University of Porto, while the work was conceived and partially carried out.

APPENDICES

Appendix A: Calabi-Yau Manifolds
We here give the reader a rapid initiation to Calabi-Yau manifolds; for a recent pedagogical introduction, q.v. [45]. The typical student of physics is introduced to the differential-geometric definition of a manifold M before, if at all, being exposed to the algebro-geometric one. This is, in some sense, the reverse order of complexity: one has been familiar with (real, affine) algebraic geometry since first encountering Cartesian coordinates in school. For instance, the intersection of a linear and a quadratic polynomial in real coordinates (x, y) ∈ R^2 prescribes the algebraic variety obtained from the intersection of a line and a conic section, which is generically 2 distinct points in the real plane.
The purpose of (complex) algebraic geometry is to realize manifolds M as the vanishing loci of polynomials in an ambient space A, typically a complex projective space P^n. Because P^n is Kähler and compact, M is also Kähler and compact, which is perfect for string compactification. In addition, when M has vanishing first Chern class, M is Calabi-Yau. The point is that statements such as the vanishing of the first Chern class, which in the language of differential geometry would involve curvature and tensors, in that of algebraic geometry involve no more than properties such as the degrees of polynomials.
The algebro-geometric set-up gives a completely algebraic (i.e., polynomial and combinatorial) way of constructing a Calabi-Yau manifold, which is precisely why it is perhaps more amenable to machine-learning. The archetypal example is to take a single polynomial (hypersurface) of homogeneous degree 5 in P^4, with homogeneous coordinates [z_0 : z_1 : z_2 : z_3 : z_4], as

Σ_α C_{α_0,α_1,α_2,α_3,α_4} z_0^{α_0} z_1^{α_1} z_2^{α_2} z_3^{α_3} z_4^{α_4} = 0,   (A.1)

where Σ_{i=0}^{4} α_i = 5 and α_i ∈ Z_{≥0}, so that each monomial is of degree 5, and the coefficients C_{α_0,α_1,α_2,α_3,α_4} dictate the complex structure (shape) of M. This is the quintic Calabi-Yau 3-fold. In general, a degree n+1 polynomial in P^n is a compact Calabi-Yau (n−1)-fold. Now, topological quantities do not depend on C_{α_0,α_1,α_2,α_3,α_4} (and thus do not depend on the detailed monomials, so long as one chooses a generic enough set of monomial terms with generic enough coefficients), so for our purposes the single number 5 suffices to characterize this Calabi-Yau manifold.

We can generalize by considering M, of complex dimension D, as an intersection of complex-valued polynomials in the homogeneous coordinates of a product A = P^{n_1} × ... × P^{n_m} of complex projective spaces P^{n_i}. When the number of polynomials K is equal to the co-dimension n_1 + ... + n_m − D, so that each new polynomial slices out one complex dimension, we call M a complete intersection Calabi-Yau manifold (CICY).

Complete Intersection Calabi-Yau: CICY
In brief, a CICY is the matrix

M = ( n_1 | q_1^1 ... q_1^K )
    ( ...  ...          ... )
    ( n_m | q_m^1 ... q_m^K ),   D = n_1 + ... + n_m − K.   (A.2)

We can thus see (A.2) as the definition of a Kähler manifold of complex dimension D, as a complete intersection of K polynomials in A = P^{n_1} × ... × P^{n_m}. Indeed, q_i^j specifies the degree of homogeneity of the j-th defining polynomial in the homogeneous coordinates of the i-th projective ambient space factor. The Calabi-Yau condition of vanishing first Chern class translates to Σ_{j=1}^{K} q_i^j = n_i + 1 for all i = 1, ..., m, so the first column of n_i is redundant information. Clearly, independent row and column permutations of M define the same manifold. The coefficients of the defining polynomials are the complex-structure parameters, and the computation of any topological quantity is independent of these; thus the degree information given by the matrix M suffices. The quintic example above is then the 1×1 matrix [5].
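The combinatorial conditions above can be checked mechanically. A minimal sketch (the function names are ours, not from any CICY software package), representing a configuration by the list n of ambient dimensions and the list q of rows of multi-degrees:

```python
def is_calabi_yau(n, q):
    """Check the Calabi-Yau condition for a CICY configuration:
    for each ambient factor P^{n_i}, the multi-degrees of the K defining
    polynomials must sum to n_i + 1 (vanishing first Chern class)."""
    return all(sum(row) == n_i + 1 for n_i, row in zip(n, q))

def complex_dimension(n, q):
    """Complex dimension D = sum(n_i) - K of the complete intersection,
    with K the number of defining polynomials."""
    return sum(n) - len(q[0])

# The quintic: a single degree-5 hypersurface in P^4, i.e. the 1x1 matrix [5]
quintic_ok = is_calabi_yau([4], [[5]])
quintic_dim = complex_dimension([4], [[5]])

# The bicubic: a hypersurface of bi-degree (3,3) in P^2 x P^2
bicubic_ok = is_calabi_yau([2, 2], [[3], [3]])
bicubic_dim = complex_dimension([2, 2], [[3], [3]])
```

Both examples pass the Calabi-Yau condition and have complex dimension 3, while, say, a degree-4 hypersurface in P^4 would fail it.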

Appendix B: CICY Data for the SNN
We now turn to an overview of the demographics of the CICY datasets, along with a discussion of how the data is prepared for training the SNN. As remarked in the main text, the population distribution of these datasets is heavily skewed, with densely populated middles and sparsely populated tails. For example, there is only one CICY3 manifold corresponding to h^{1,1} = 16 and 5 manifolds corresponding to h^{1,1} = 1, while there are 1463 manifolds corresponding to h^{1,1} = 8. In the same vein, there are 7 CICY4 manifolds with h^{1,1} = 1 and 151447 manifolds with h^{1,1} = 10. This is shown explicitly in Figures 8 and 9, containing the histograms of the number of CICY3s and CICY4s for each h^{1,1}, along with the train-test splits mentioned below in Appendix B 1. We see that the distribution is strongly peaked around the middle (h^{1,1} = 5-10) and dies off sharply at the tails.

FIG. 9: The CICY4 population stratified by h^{1,1} values.

Splitting into Train and Test Sets
A crucial part of the machine-learning methodology is to partition the data into training, validation and test splits. We describe briefly the raison d'être for each of these and our procedure for partitioning.
First, the train set (T) is the subset of the data that the SNN is given to learn from. By using this data, the SNN tunes the weights of the features network to make optimal decisions about the similarity of a given pair. Next, the validation set (V) is the data which is used to periodically evaluate the SNN while it trains, but is not given to the network to train on. Ideally, the performance of the network on the train and validation sets should be comparable. Finally, the test set (Te) is the complement of T and V in the dataset. This is not shown to the SNN until after training is completed, and is used to evaluate the performance of the network and explicate the extent to which the SNN has solved the given problem. A typical choice for the partitioning of the data could be (0.6, 0.2, 0.2), i.e., 60% of the data is used for training, 20% for validation, and 20% for testing; often this partitioning is done by random sampling, i.e. we pick random subsets of the dataset in these fixed proportions.
However, completely random sampling is not always possible in imbalanced datasets; one may end up with zero elements of sparsely populated classes in one or more of the T, V, Te subsets. We therefore carry out stratified random sampling, i.e. we partition each h^{1,1} class in the CICY datasets into T, V, Te subsets by random sampling, and concatenate these to arrive at the full training, validation and testing data. This also enables us to drive down the size of the T + V subset while ensuring elements are drawn from each h^{1,1} class. In this work we have split the CICY3 dataset as (0.0225, 0.0025, 0.975), subject to a minimum of 3 elements in T + V. As an illustration, consider the h^{1,1} = 4 class in CICY3, which has 425 elements. The above split yields 9 elements in T, 1 in V and 415 in Te. The CICY4 dataset, on the other hand, is split as (0.0045, 0.0005, 0.995), subject to a minimum of 10 elements in T + V. Consider as an example the h^{1,1} = 2 class, which has 103 elements. Splitting the data as above without a minimum would lead to zero elements in T + V from this class, hence the minimum cap.
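The stratified procedure with a per-class minimum can be sketched as follows. This is a simplified illustration of the idea, not the paper's code, and the rounding conventions are our own assumptions; the class sizes below merely echo the skewed CICY3 demographics.

```python
import random

def stratified_split(items_by_class, train_frac, val_frac, min_tv, seed=0):
    """Split each class separately into (train, val, test), enforcing a
    minimum number of train+val elements per class, then concatenate."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, items in items_by_class.items():
        items = items[:]
        rng.shuffle(items)
        # enforce the minimum cap on the train+val portion of this class
        n_tv = max(round(len(items) * (train_frac + val_frac)), min_tv)
        n_v = max(round(len(items) * val_frac), 1)
        n_t = n_tv - n_v
        train += [(label, x) for x in items[:n_t]]
        val += [(label, x) for x in items[n_t:n_t + n_v]]
        test += [(label, x) for x in items[n_t + n_v:]]
    return train, val, test

# Hypothetical class sizes echoing the skewed CICY3 demographics
data = {1: list(range(5)), 4: list(range(425)), 8: list(range(1463))}
train, val, test = stratified_split(data, 0.0225, 0.0025, min_tv=3)
```

Without the `min_tv` floor, the rarest class (5 elements) would contribute nothing to training; with it, every class is represented in T + V.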
The complete partitioning for the CICY3 dataset is shown in the second and third columns of Table I and for CICY4 in the corresponding columns of Table II, as well as in the histograms of Figures 8 and 9.

Feature Engineering
The CICY data is in the form of matrices of variable shape, with integer entries ranging from 0 to 5 in CICY3 and 0 to 6 in CICY4. To make the data more amenable to machine learning, we first resize the matrices to a uniform n × n by padding. In practice, we find good results by padding the CICY3 matrices to size 18 × 18 by appending constant values −1 on each side until the desired size is reached. A toy example of a 2 × 2 matrix padded to 4 × 4 in this manner is

( a b )      ( −1 −1 −1 −1 )
( c d )  →   ( −1  a  b −1 )
             ( −1  c  d −1 )
             ( −1 −1 −1 −1 ).

A CICY4 matrix before and after padding in this manner is shown in Figure 11. This uniformization of the configuration matrix also prevents the neural network from learning spurious correlations in the dataset. Namely, a significant fraction (∼ 50%) of both the CICY3 and CICY4 datasets is favourable, in that the Hodge number h^{1,1} equals the number of rows of the configuration matrix. Geometrically, this means that the ambient projective-space Kähler classes descend completely to the CICY. Since the matrices are now uniformized, this correlation is removed from the CICY data. Next, we rescale the matrix entries via

x_{ij} → x_{ij} / max({x}),

where x_{ij} is the (i, j)-th entry of a matrix x belonging to a CICY dataset and max({x}) is the maximum value among all matrix entries in that dataset. Of course, this scaling does not mean anything in the algebraic geometry, because the matrix entries are the multi-degrees of the defining polynomials. However, the scaled data is an equivalent representation, and the normalized data is more amenable to ML.
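The two preprocessing steps above (constant padding and global rescaling) can be sketched directly with NumPy. This is an illustrative reimplementation under our own conventions, not the paper's code; the target size 4 is a toy stand-in for the 18 × 18 used for CICY3.

```python
import numpy as np

def uniformize(mat, size, pad_value=-1.0):
    """Pad a configuration matrix with a constant value on each side
    until it reaches size x size (any odd leftover goes to bottom/right)."""
    mat = np.asarray(mat, dtype=float)
    pr, pc = size - mat.shape[0], size - mat.shape[1]
    return np.pad(mat, ((pr // 2, pr - pr // 2), (pc // 2, pc - pc // 2)),
                  constant_values=pad_value)

def rescale(mats):
    """Divide every entry by the global maximum across the whole dataset."""
    m = max(float(np.max(x)) for x in mats)
    return [x / m for x in mats]

quintic = [[5]]                     # the 1x1 quintic configuration matrix
padded = uniformize(quintic, 4)     # toy target size 4 instead of 18
scaled = rescale([padded])[0]       # global maximum here is 5
```

After padding, the single degree-5 entry sits in the interior of a field of −1s; after rescaling, it becomes 1.0, and the size of the original matrix is no longer visible to the network.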

Appendix C: Features Network
As mentioned in Section III, the design of the features network FN relies crucially on the incorporation of convolutional layers. These are responsible for efficiently extracting local patterns in the image, which are then processed further by the neural network. For example, a face-recognition algorithm would typically extract and compare patterns associated with common landmarks on the face, such as eyes, noses, lips etc. This is usually accomplished by means of filters, which are matrices that are scanned across the image to extract particular features, where the entries of the matrices are determined by the kind of feature one aims to extract.

As a simple example, consider a 5 × 5 image with a horizontal edge across the centre, represented by a matrix Img whose upper rows are filled with 1s and lower rows with 0s. A filter such as

F = (  1  1  1 )
    (  0  0  0 )
    ( −1 −1 −1 )

is used to extract the horizontal edge from the above image by means of convolutions. The convolution operation involves sliding the filter over the image (in steps of 1 for simplicity), computing * (the element-wise/Hadamard product) and summing all elements of the resulting matrix. As an example, consider the 3 × 3 submatrix of the image with the (1, 0) element in the upper left corner. The above operation yields a large value precisely when the submatrix straddles the edge; i.e. the horizontal edge has been extracted successfully. This is clearly visible from the images corresponding to Img and its convolution shown in Figure 12.

Extracting a single feature is typically insufficient to characterize an image; multiple features are needed. This requires passing the image through multiple filters. In the above example, since the form of the feature was simple, an appropriate filter could easily be constructed. In general, the identification of appropriate filters corresponding to complicated features is a notoriously difficult problem. Indeed, even identifying the optimal set of features with which to characterize the images of a dataset is a far from obvious task. A central insight of deep learning for computer vision is that rather than using predetermined filters, we should instead treat the matrix entries of filters as tunable parameters to be optimized on the given dataset [56]. This is accomplished by incorporating a convolutional layer in the neural network, which is essentially a stack of tunable filters. This allows the neural net to determine in one go both the optimal features to classify along and the appropriate filters for doing this classification. A neural network built from convolutional layers is called a convolutional neural network, or ConvNet.

FIG. 14: The loss curve for the Siamese Net trained on similarity with respect to h^{1,1} values of four-folds.
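The sliding Hadamard-product-and-sum operation described above is easy to verify numerically. A minimal sketch with the edge image and a Prewitt-style filter (the specific matrices are our illustrative choices):

```python
import numpy as np

def convolve2d(img, filt):
    """Slide filt over img in steps of 1, taking the element-wise
    (Hadamard) product and summing: a 'valid' convolution as in the text."""
    fh, fw = filt.shape
    oh, ow = img.shape[0] - fh + 1, img.shape[1] - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * filt)
    return out

# 5x5 image with a horizontal edge across the centre
img = np.array([[1, 1, 1, 1, 1],
                [1, 1, 1, 1, 1],
                [0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0]], dtype=float)

# Prewitt-style horizontal-edge filter
filt = np.array([[ 1,  1,  1],
                 [ 0,  0,  0],
                 [-1, -1, -1]], dtype=float)

edge_map = convolve2d(img, filt)  # large entries where the window straddles the edge
```

The output is 3 × 3: its first two rows (windows straddling the edge) take the value 3, while the last row (windows entirely in the zero region) vanishes, so the edge location is extracted. A convolutional layer simply makes the entries of `filt` trainable.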
The k-means algorithm partitions a point cloud into k clusters by determining the location of the centroid of each cluster and assigning each point in the point cloud to the cluster with the closest centroid. Here k is an external parameter which must be supplied to the algorithm. A convenient rule of thumb for selecting the optimal k is based on the inertia, the sum of squared distances of the points to their nearest centroid, which decreases as k is increased. Typically, the inertia falls rapidly until an optimal value of k is reached, after which its rate of decrease is much smaller; that is, there is little apparent benefit to adding more centroids. This is visualized as an 'elbow' in the inertia-vs-k graph, and the optimal value of k is the location of the elbow. This graph is plotted on the left in Figure 16 for h^{1,1} = 7, and we see clearly that the elbow is located at k = 3. This optimal value may be further cross-checked by computing the silhouette score, which is the mean silhouette coefficient, defined in (E.1).

We may repeat this analysis with h^{1,1} ranging from 4 to 11. There are more than 350 manifolds for each case, a number large enough that the notion of 'typical' is meaningful. Our results are summarized in Table III. Note that the number of typical CICYs is different for different h^{1,1} values. This is due to the variance in the number of clusters detected for each h^{1,1} point cloud.
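The elbow heuristic can be sketched end-to-end with a bare-bones Lloyd's algorithm. This is an illustration on synthetic data, not the analysis of the CICY embeddings; we use a deterministic initialization (centroids spread along the first coordinate) purely for reproducibility.

```python
import numpy as np

def kmeans_inertia(points, k, n_iter=50):
    """Bare-bones Lloyd's algorithm returning the inertia: the sum of
    squared distances of points to their nearest centroid."""
    # deterministic init: k points spread along the first coordinate
    order = np.argsort(points[:, 0])
    idx = order[np.linspace(0, len(points) - 1, k).astype(int)]
    centroids = points[idx].astype(float)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)          # assign each point to nearest centroid
        for c in range(k):
            if (labels == c).any():         # guard against empty clusters
                centroids[c] = points[labels == c].mean(axis=0)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Toy point cloud in R^3 with three well-separated clusters (an illustrative
# stand-in for the learnt embeddings, not the actual CICY data)
rng = np.random.default_rng(2)
points = np.vstack([rng.normal(c, 0.05, size=(30, 3)) for c in (0.0, 2.0, 4.0)])
inertias = [kmeans_inertia(points, k) for k in (1, 2, 3, 4)]
```

The inertia drops sharply up to k = 3 and then plateaus, so the elbow correctly identifies the three underlying clusters.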

FIG. 1: The 'features' network. The convolutional layers enable the extraction of local features from CICY images.

FIG. 2: The mapping φ_w : D → R^d learnt by the features network. Similar images A, P map close together, while the dissimilar image N maps far away.

FIG. 16: The inertia (left) and silhouette score (right) for the h^{1,1} = 7 CICY3 cluster. The x and y axes are the k values and the corresponding scores respectively.
The silhouette coefficient of a point is defined as

s = (b − a) / max(a, b),   (E.1)

where a is the mean intra-cluster distance of the given point, while b is the mean distance to points in the nearest other cluster. When the point is near the centre of its cluster, b ≫ a and the coefficient is nearly 1. In contrast, when the point is near the edge, b ≈ a and the score is nearly 0. For a point assigned to the wrong cluster, b ≪ a and the score is nearly −1. These observations suggest that the silhouette score should ideally reach a maximum at the optimal value of k. This analysis again yields k = 3.

It is then straightforward to fit a k-means clusterer with k = 3 on the train set, apply it to the test set, identify the manifold closest to the centroid of each point cloud, and read off the corresponding h^{2,1} values. This yields 26, 28 and 29, as mentioned in Section IV B. The clusters are shown in Figure 15 after projection to two dimensions using principal component analysis. The orange crosses indicate the centroids and the red triangles the typical CICY3 manifolds detected. The tails of the red arrows pointing at the typical CICYs carry the corresponding h^{2,1} values. The larger points denote the manifolds in the training set, while the smaller points denote the manifolds in the test set. The colours of the points correspond to their h^{2,1} values.
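The behaviour of (E.1) in the three regimes described above can be checked directly. A minimal sketch on two tight, well-separated toy clusters in the plane (the data and function name are our illustrative choices):

```python
import numpy as np

def silhouette_coefficient(point, own_cluster, nearest_cluster):
    """s = (b - a) / max(a, b), with a the mean distance from `point` to
    its own cluster and b the mean distance to the nearest other cluster."""
    a = np.mean(np.linalg.norm(own_cluster - point, axis=1))
    b = np.mean(np.linalg.norm(nearest_cluster - point, axis=1))
    return (b - a) / max(a, b)

# Two tight, well-separated toy clusters in the plane
cluster_A = np.array([[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0]])
cluster_B = cluster_A + np.array([5.0, 0.0])

# A point well inside its own cluster scores near +1 ...
s_good = silhouette_coefficient(np.array([0.0, 0.0]), cluster_A, cluster_B)
# ... while a point assigned to the wrong cluster scores near -1
s_bad = silhouette_coefficient(np.array([5.0, 0.0]), cluster_A, cluster_B)
```

Averaging this coefficient over all points gives the silhouette score used to cross-check the elbow in Figure 16.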

TABLE I: Clustering CICY3 manifolds. (T, V, Te) denote the training, validation and test sets respectively. For each h^{1,1}, the clusterer predicts a number of CICY3s (and we indicate the % of the total Te), of which the true positives are recorded in the third column. Columns: h^{1,1} | T, V | Total (Te) | True Positive | Predicted (% Σ Te).

TABLE II: Clustering CICY4; notation as in Table I.