Contrast data mining for the MSSM from strings

We apply techniques from data mining to the heterotic orbifold landscape in order to identify new MSSM-like string models. To do so, so-called contrast patterns are uncovered that help to distinguish between areas in the landscape that contain MSSM-like models and the rest of the landscape. First, we develop these patterns in the well-known Z6-II orbifold geometry and then we generalize them to all other ZN orbifold geometries. Our contrast patterns have a clear physical interpretation and are easy to check for a given string model. Hence, they can be used to scale down the potentially interesting area in the landscape, which significantly enhances the search for MSSM-like models. Thus, by deploying the knowledge gain from contrast mining into a new search algorithm we create many novel MSSM-like models, especially in corners of the landscape that were hardly accessible by the conventional search algorithm, for example, MSSM-like Z6-II models with ∆(54) flavor symmetry.

arXiv:1910.13473v1 [hep-th] 29 Oct 2019


Introduction
String theory is a promising candidate for a UV-complete theory of quantum gravity. However, to prove its usefulness it has to incorporate the standard model (SM) or its supersymmetric extension (the MSSM) as a low-energy limit. Thus, one of the main tasks of string phenomenology is to find a string model that is consistent with all experimental and observational data or, at least, that comes close to the MSSM, i.e. a model that is MSSM-like. This task is very difficult, mainly for two reasons: (i) String theory predicts extra spatial dimensions. Hence, the connection between string theory and the MSSM must be related to the way in which the extra dimensions are compactified. However, the number of different compactifications is huge [1,2], giving rise to the so-called string landscape of effective 4D theories originating from string theory. (ii) String theory is very predictive because, after specifying the compactification, the effective 4D theory, including all symmetries, particles and couplings, is completely fixed.
In the case of the ten-dimensional $E_8 \times E_8$ heterotic string we have to compactify six dimensions. To do so, we choose six-dimensional toroidal orbifolds [3,4] as they allow for a world-sheet formulation of string theory instead of a supergravity approximation, see e.g. [5-15] for other schemes. Then, the conventional approach to search for MSSM-like string models from heterotic orbifolds is given by a random scan in the landscape [16-18] or in fertile islands that can be identified by certain patterns, like local GUTs [19-21]. In this paper, we show that the approach of defining fertile islands can be generalized by applying machine learning techniques to the string landscape [22-25], see also [26-33]. A first hint towards such a generalization was observed in ref. [34]: a neural network was able to identify additional patterns and to cluster the models of the heterotic orbifold landscape according to them. Surprisingly, MSSM-like models turned out to be localized close to each other on fertile islands, even though the neural network did not know whether a given model is MSSM-like or not (the latter case being denoted by $\overline{\text{MSSM-like}}$).

In this paper we propose and demonstrate a new search strategy for MSSM-like models based on additional patterns that is superior by orders of magnitude. This search is developed from the knowledge gained by analyzing the heterotic orbifold landscape with tools from data mining. Data mining has been developed for the purpose of preparing, visualizing and exploring huge databases, which makes it a natural tool for the string landscape. In particular, we apply a special setup called contrast data mining to the heterotic orbifold landscape. The basic idea of contrast data mining is to identify so-called contrast patterns that allow us to focus our search on those areas in the landscape where the MSSM-like models reside.
Our contrast patterns have a clear physical interpretation: they are given by the number of unbroken roots in the hidden E 8 factor [9,35] and the numbers of various bulk matter fields. This paper is structured as follows: In section 2 we review the conventional search algorithm for MSSM-like models in the heterotic orbifold landscape. Then, we improve this algorithm using traditional methods: first, we take the Weyl symmetry into account and, secondly, in section 3 we include phenomenological constraints. These improvements already reduce the number of models that have to be scanned by the search algorithm in the test case of the Z 6 -II orbifold geometry. Afterwards, in section 4 we apply data mining techniques to our Z 6 -II dataset. Doing so, we can identify contrast patterns that help to distinguish between areas in the landscape that contain MSSM-like models and others. We implement these contrast patterns into our search algorithm and show that we can easily construct many new MSSM-like Z 6 -II models -a fact that might be surprising given the extensive searches done especially in this orbifold geometry. Remarkably, we can identify two MSSM-like Z 6 -II models with ∆(54) flavor symmetry (related to a vanishing Wilson line of order three), see section 5.
Thereafter, our contrast patterns are successfully extended to all $\mathbb{Z}_N$ orbifold geometries in section 6, where table 8 summarizes our results, see also [36]. Finally, in section 7 we conclude.

Table 1: Geometrical constraints on the shift vectors and Wilson lines for the $\mathbb{Z}_6$-II orbifold geometry (columns: vector, order $N_k$, additional constraint; e.g. order 1: not present, i.e. $V_2 = (0^{16})$).

Searching the heterotic orbifold landscape
An orbifold compactification [3,4] is specified by a six-dimensional orbifold geometry $O$ and a gauge embedding that acts on the sixteen gauge degrees of freedom of the heterotic string.
While there exist only 138 orbifold geometries with Abelian point group (i.e. $\mathbb{Z}_{N_1}$ or $\mathbb{Z}_{N_1} \times \mathbb{Z}_{N_2}$) and $\mathcal{N} = 1$ supersymmetry [37], the true size of the heterotic orbifold landscape unfolds if we take the number of inequivalent gauge embeddings into account. For a general $\mathbb{Z}_{N_1} \times \mathbb{Z}_{N_2}$ orbifold, a gauge embedding is given by two shift vectors, $V_1$ and $V_2$, and six Wilson lines, $W_1$ to $W_6$, corresponding to the six compactified directions. In this paper we concentrate on the $E_8 \times E_8$ heterotic string, which implies that each vector can be split into two eight-dimensional vectors. For example, the sixteen-dimensional shift vector $V_1$ consists of the eight-dimensional vectors $V_1^{(1)}$ and $V_1^{(2)}$, which act on the first and second $E_8$ factor, respectively. Altogether a gauge embedding is determined by a gauge embedding matrix

$$ M = \begin{pmatrix} V_1 \\ V_2 \\ W_1 \\ \vdots \\ W_6 \end{pmatrix} = \begin{pmatrix} V_1^{(1)} & V_1^{(2)} \\ V_2^{(1)} & V_2^{(2)} \\ W_1^{(1)} & W_1^{(2)} \\ \vdots & \vdots \\ W_6^{(1)} & W_6^{(2)} \end{pmatrix} , $$

where we denote the vector in the $k$-th line by $M_k$ for $k = 1, \ldots, 8$ and split it into two parts $M_k^{(\alpha)}$ for $\alpha = 1, 2$ corresponding to the two $E_8$ factors, for example, $M_3 = (W_1^{(1)}, W_1^{(2)})$ for the first Wilson line $W_1$. Depending on the orbifold geometry, shift vectors and Wilson lines are subject to geometrical constraints that, for example, fix the order of the shift vector $V_1$ to $N$ for a $\mathbb{Z}_N$ orbifold geometry, see e.g. ref. [38]. Since we will mainly work with the so-called $\mathbb{Z}_6$-II (1, 1) orbifold geometry (using the nomenclature from ref. [37], abbreviated as $\mathbb{Z}_6$-II in the following), we summarize the geometrical constraints on the shift vectors and Wilson lines for the $\mathbb{Z}_6$-II orbifold geometry in table 1.
In general, there are two ways to expand a sixteen-dimensional shift vector or Wilson line naturally: either in terms of the simple roots $\alpha_I$ of $E_8 \times E_8$ or in terms of the dual simple roots $\alpha_I^*$, $I = 1, \ldots, 16$, where $\alpha_I^* \cdot \alpha_J = \delta_{IJ}$. Both choices give a basis of the (self-dual) root lattice $\Lambda_{E_8 \times E_8}$ of $E_8 \times E_8$. For later convenience (see section 2.3) we decide to expand the vectors in terms of the dual basis, i.e. we parameterize the vectors $M_k$ in the gauge embedding matrix $M$ as

$$ M_k = \frac{1}{N_k} \sum_{I=1}^{16} d^k_I \, \alpha_I^* \qquad \text{with} \quad d^k_I \in \mathbb{Z} , \tag{2} $$

where $N_k$ denotes the order of the vector $M_k$.

Conditions from modular invariance
In order to obtain consistent string compactifications we have to impose conditions from modular invariance of the one-loop string partition function on the gauge embedding matrix M . These conditions read [39] for i, j = 1, . . . , 6 and where v 1 and v 2 denote the so-called twist vectors. They encode the geometrical rotation angles of a general Z N 1 × Z N 2 orbifold geometry, while for Z N 1 orbifolds the twist v 2 = (0 4 ) is not present. In addition, the gcd in eq. (4) denotes the greatest common divisor. These conditions are very restrictive and already forbid a huge fraction of points in the space Z 128 corresponding to eq. (2). The only reasonable way to create a consistent gauge embedding matrix M is by successively creating shift vectors and Wilson lines step-by-step and checking each time if the relevant conditions from eqs. (4) are fulfilled for those combinations of shift vectors and Wilson lines that have been chosen so far, see figure 1.
In this paper we work out an extension of this logic, i.e. we create a successive search that only considers those areas in the heterotic orbifold landscape that can fulfill certain properties: first, in section 3 we will introduce phenomenological properties of the MSSM and then, in the main part of the paper in section 4, we define so-called contrast patterns that also can be checked at each step during the construction of a gauge embedding matrix M . By doing so, we will neglect those areas in the heterotic orbifold landscape that have no chance or an extremely low probability to host a valid MSSM-like orbifold model.

The Orbifolder
Given an orbifold geometry $O$ and a consistent gauge embedding matrix $M$, it is in principle possible to compute the complete 4D orbifold model at low energies, denoted by model($M$). In practice, some computations are too complicated, e.g. the strengths of Yukawa couplings. However, for a given $M$ one can use the orbifolder [40] in order to get the massless string spectrum, denoted by spectrum($M$), with all gauge charges. Moreover, the orbifolder can identify MSSM-like models, i.e. models with $\text{SU}(3)_C \times \text{SU}(2)_L \times \text{U}(1)_Y$ gauge symmetry and the exact chiral spectrum of the MSSM plus at least one Higgs pair and exotics that have to be vector-like with respect to the SM. Also discrete symmetries and a list of all allowed couplings up to a certain order in fields can be analyzed. In addition, the orbifolder can be used to identify inequivalent orbifold models by taking two orbifold models, model($M$) and model($M'$), to be equivalent if their massless spectra coincide (on the level of the non-Abelian representations and, in case of MSSM-like models, the $\text{U}(1)_Y$ hypercharge), i.e.

$$ \text{model}(M) \sim \text{model}(M') \quad \Leftrightarrow \quad \text{spectrum}(M) = \text{spectrum}(M') . $$

Searching in the Weyl chambers
A Weyl reflection of the gauge embedding vector $M_k$ at the hyperplane orthogonal to the simple root $\alpha_I$ of $E_8 \times E_8$ is defined as

$$ w_I(M_k) = M_k - (\alpha_I \cdot M_k)\, \alpha_I , $$

using $(\alpha_I)^2 = 2$ for $I = 1, \ldots, 16$. Then, it is easy to show that

$$ w_I(M_k) \cdot w_I(M_l) = M_k \cdot M_l . $$

Hence, Weyl reflections leave the modular invariance conditions (4) invariant. Furthermore, one can show that they are symmetries of the full string theory. We can reduce the search space and therefore find more physically inequivalent models when we divide out this symmetry. For this task we propose a fundamental Weyl chamber search. The proposed technique is based on the algorithm of ref. [42], by which any vector in the root space can be rotated to the fundamental Weyl chamber, which is defined as the subspace where all Dynkin labels $M_k \cdot \alpha_I$ are non-negative.
Starting from a gauge embedding matrix $M$ we can apply the algorithm of ref. [42] such that the shift vector $V_1$ is rotated to the fundamental Weyl chamber, i.e. $V_1 \cdot \alpha_I \geq 0$ for all simple roots $\alpha_I$, $I = 1, \ldots, 16$. Since we do not want to change the orbifold model by this transformation, we have to act simultaneously on all other vectors with the same Weyl reflections that mapped $V_1$ to the fundamental Weyl chamber. After this, we might still have some Weyl symmetries left: the unbroken Weyl reflections are those that leave $V_1$ invariant, i.e. $w_I(V_1) = V_1$ if and only if $V_1 \cdot \alpha_I = 0$, i.e. $d^1_I = 0$. These residual Weyl reflections can now be used to bring the next vector, in our case the Wilson line $W_1$, to an enlarged Weyl chamber, which we define in analogy to the fundamental Weyl chamber but using only those Weyl reflections that leave $V_1$ invariant. Consequently, after the transformation of the Wilson line $W_1$ to the enlarged Weyl chamber, those Dynkin labels $W_1 \cdot \alpha_I$ are constrained to be non-negative that correspond to the Weyl reflections $w_I$ that leave the shift vector $V_1$ invariant. This procedure can be reapplied to the next vectors until no Weyl symmetry is left.
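The rotation into the fundamental Weyl chamber described above can be sketched in a few lines. The following is a minimal illustration and not the implementation of ref. [42]: it assumes the rank-two root system $A_2$ as a toy stand-in for $E_8 \times E_8$, and the names `SIMPLE_ROOTS`, `reflect` and `to_fundamental_chamber` are hypothetical. The first vector is reflected at simple roots with negative Dynkin label until all labels are non-negative, and every reflection is applied to all other vectors simultaneously.

```python
import math

# Toy assumption: simple roots of A2, normalized to alpha^2 = 2
# (the paper works with the 16 simple roots of E8 x E8 instead).
SIMPLE_ROOTS = [
    (math.sqrt(2.0), 0.0),
    (-math.sqrt(0.5), math.sqrt(1.5)),
]


def dot(u, v):
    """Euclidean scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def reflect(v, alpha):
    """Weyl reflection w(v) = v - (alpha . v) alpha, valid for alpha^2 = 2."""
    c = dot(alpha, v)
    return tuple(x - c * a for x, a in zip(v, alpha))


def to_fundamental_chamber(vectors, roots=SIMPLE_ROOTS, tol=1e-12, max_steps=1000):
    """Rotate vectors[0] into the fundamental Weyl chamber.

    Each reflection is applied to all vectors at once, so that scalar
    products between the vectors stay unchanged.
    """
    vectors = [tuple(v) for v in vectors]
    for _ in range(max_steps):
        negative = [a for a in roots if dot(vectors[0], a) < -tol]
        if not negative:
            return vectors
        vectors = [reflect(v, negative[0]) for v in vectors]
    raise RuntimeError("no convergence within max_steps")


if __name__ == "__main__":
    V1, W1 = (-1.0, -0.7), (0.3, 0.2)
    V1_rot, W1_rot = to_fundamental_chamber([V1, W1])
    print([dot(V1_rot, a) for a in SIMPLE_ROOTS])  # Dynkin labels, all >= 0
```

Acting on all vectors with the same reflections is what keeps the scalar products, and hence conditions built from them, unchanged.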
This reasoning can be used to choose directly only those gauge embedding matrices whose first vector lies in the fundamental Weyl chamber and whose subsequent vectors lie in the correspondingly enlarged Weyl chambers, as illustrated in figure 2. To achieve this we expand our vectors not in the basis of the simple roots $\alpha_I$ but in the dual basis $\alpha_I^*$, see eq. (2), where we can apply the constraints on the Dynkin labels directly via the coefficients $d^k_I$ of the gauge embedding matrix. For the first vector, i.e. the shift vector $V_1$, we have the full freedom of the Weyl group and can therefore choose this vector directly from the fundamental Weyl chamber,

$$ d^1_I \in \mathbb{N}_0 $$

for $I = 1, \ldots, 16$. Thereafter, we have to compute the unbroken Weyl symmetry that can be exploited to restrict the second vector. This unbroken Weyl symmetry is generated by those Weyl reflections that leave $V_1$ invariant, i.e. $V_1$ has to be a fixed point of a Weyl reflection such that this Weyl reflection remains unbroken. Since we have chosen $V_1$ from the fundamental Weyl chamber, it can only be a fixed point under a Weyl reflection if $V_1$ lies on the boundary of the fundamental Weyl chamber. This boundary is given by the union of the hyperplanes perpendicular to the simple roots, i.e. the hyperplanes at which the Weyl reflections $w_I$ act [42]. Consequently, only those Weyl reflections $w_I$ which leave all previously chosen vectors $M_k$ invariant can still restrict the search space of the shift vector and Wilson lines that have to be chosen next. Therefore, at step $n$ in figure 1 we can constrain the coefficients $d^n_I$ of the vector $M_n$ in eq. (2) as

$$ d^n_I \in \begin{cases} \mathbb{N}_0 & \text{if } d^k_I = 0 \text{ for all } k < n , \\ \mathbb{Z} & \text{otherwise} . \end{cases} $$

Figure 2: Example with the shift vector $V_1$ chosen along the direction of $\alpha_2^*$. Consequently, the vector $V_1$ is invariant under the Weyl reflection $w_1$, i.e. $w_1(V_1) = V_1$, which corresponds to a vanishing Dynkin label $V_1 \cdot \alpha_1 = 0$. Hence, this choice for $V_1$ has not broken the whole Weyl symmetry and we can use the unbroken Weyl reflection $w_1$ to restrict the search space for $W_1$ to the enlarged Weyl chamber (shaded area), which is defined by $W_1 \cdot \alpha_1 \geq 0$. Hence, the coefficients of $W_1$ can be constrained as $d^3_1 \in \mathbb{N}_0$ and $d^3_2 \in \mathbb{Z}$. This procedure is continued for the next vectors and takes at each step all previously chosen vectors into account for computing the respective unbroken Weyl symmetry.
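The restriction on the coefficients at step $n$ can be illustrated with a short sketch. It assumes that the coefficients $d^k_I$ of the previously chosen vectors are given as integer tuples; the function name `coefficient_domains` and the labels `"N0"` and `"Z"` are hypothetical and only serve the illustration. A coefficient is restricted to be non-negative exactly if the corresponding Weyl reflection $w_I$ leaves all previously chosen vectors invariant.

```python
def coefficient_domains(previous_coeffs, rank=16):
    """Return the allowed domain of each coefficient d^n_I of the next vector.

    previous_coeffs: list of integer tuples (d^k_1, ..., d^k_rank) for k < n.
    "N0": w_I is unbroken (d^k_I = 0 for all k < n), so d^n_I >= 0.
    "Z":  w_I is broken by a previous vector, so d^n_I is unrestricted.
    """
    return [
        "N0" if all(d[i] == 0 for d in previous_coeffs) else "Z"
        for i in range(rank)
    ]


if __name__ == "__main__":
    # Rank-two toy example of figure 2: V_1 points along alpha*_2.
    print(coefficient_domains([], rank=2))        # first vector: full chamber
    print(coefficient_domains([(0, 1)], rank=2))  # next vector after V_1
```

For the first vector the list of previous vectors is empty, so all coefficients are restricted to non-negative integers, which is the fundamental Weyl chamber.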

Phenomenological constraints
Obviously, we have to discard any orbifold model specified by a gauge embedding matrix $M$ that does not obey the stringy consistency conditions on $M$: the geometrical constraints and the modular invariance conditions. Similarly, we can add phenomenologically inspired constraints on $M$: any orbifold model whose four-dimensional gauge symmetry $G_{4D}(M)$ does not contain the one of the SM does not provide a valid model to describe nature. Importantly, if we search in the heterotic orbifold landscape taking only the stringy consistency conditions into account, these phenomenologically uninteresting models make up by far the largest part of the heterotic orbifold landscape. We have verified this by constructing $10^7$ random models in the $\mathbb{Z}_6$-II orbifold geometry that satisfy all stringy consistency conditions using the fundamental Weyl chamber search algorithm. These $10^7$ models give rise to approximately $3.5 \cdot 10^6$ inequivalent massless spectra. Then, the inequivalent spectra are sorted according to their frequency of occurrence inside the full list of $10^7$ random models. The result is shown in figure 3 and details on some of the inequivalent spectra are highlighted in table 2. Consequently, from a phenomenological point of view the most uninteresting models turn out to have the highest repetition values, and the most interesting models are the rarest. As a remark, we cannot explain this imbalance, for example, by $E_8 \times E_8$ lattice translations and Weyl reflections. Since the models are compared on the level of their massless spectra, it is likely that a lot of these models actually differ by some additional model parameters, for instance by their Yukawa couplings. Nevertheless, we want to avoid these uninteresting models in our search for MSSM-like orbifold models. Therefore, we will describe in the upcoming sections how we can constrain our search to those areas of the heterotic orbifold landscape that can potentially host the SM.
Figure 3: Frequency of occurrence of the inequivalent massless spectra. On the horizontal axis, the inequivalent models are enumerated from 1 to 3 690 513, while on the vertical axis we see the corresponding frequency of occurrence, i.e. model # 1 has a frequency of 8 008, see also table 2 for more details on some of these models.
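The sorting of inequivalent spectra by their frequency of occurrence can be sketched as follows. This is a minimal illustration under the assumption that each massless spectrum has already been reduced to a hashable key (here plain strings); the function name `rank_spectra` is hypothetical and the toy keys are not taken from the actual dataset.

```python
from collections import Counter


def rank_spectra(spectra):
    """Return (spectrum, frequency) pairs sorted by decreasing frequency."""
    return Counter(spectra).most_common()


if __name__ == "__main__":
    # Toy list of spectrum keys; equal keys stand for equivalent models.
    sample = ["A", "B", "A", "C", "A", "B"]
    for index, (spectrum, frequency) in enumerate(rank_spectra(sample), start=1):
        print(index, spectrum, frequency)
```

The length of the returned list is the number of inequivalent spectra, and its first entry corresponds to model # 1 in the enumeration used for the frequency plot.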

The pseudo-GUT constraint G n (M ) ≥ G SM
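We want to avoid producing phenomenologically uninteresting models that have a gauge group smaller than the SM gauge group factors $G_{\text{SM}} = \text{SU}(3) \times \text{SU}(2)$. To this end, at each step $n$ of the search algorithm we count the number of unbroken roots,

$$ N_{sr}^{(\alpha)} = \sum_{p \in \Phi_{E_8}} \prod_{k=1}^{n} \delta\big(p \cdot M_k^{(\alpha)}\big) , \tag{11} $$

for $\alpha = 1, 2$ corresponding to both $E_8$ factors, where $\Phi_{E_8}$ is defined as the root system of $E_8$ with 240 roots $p$. Furthermore, $\delta(x) = 1$ if $x$ is an integer and $\delta(x) = 0$ otherwise. In the case of the SM we have six unbroken roots for SU(3) and two unbroken roots for the SU(2) factor, i.e. $N_{sr}(\text{SM}) = 8$. This is our lower bound at each step $n$ in the production of our model $M$ for the first $E_8$ factor, i.e.

$$ N_{sr}^{(1)} \geq N_{sr}(\text{SM}) = 8 . \tag{12} $$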
On the other hand, the second $E_8$ factor is free to produce any additional hidden gauge groups. Note that $\text{SU}(2) \times \text{SU}(2) \times \text{SU}(2) \times \text{SU}(2)$ could fulfill the constraint (12) but is already broken too far. Due to this we demand in addition that one gauge factor of $G_n(M)$ has a root system that allows for SU(3), i.e. with six or more unbroken roots. If a newly chosen vector $M_k$ results in a gauge group breaking below these lower bounds, we discard this vector and choose it anew until it fulfills this constraint. In the following we call this constraint the pseudo-GUT constraint.
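Counting unbroken roots for one $E_8$ factor can be sketched as follows. This is a minimal illustration, not the code used for the search: the function names are hypothetical, the vectors are assumed to be given in the standard orthonormal basis of the $E_8$ root lattice, and exact rational arithmetic is used to test integrality. The sketch only checks the bound on the number of unbroken roots and does not implement the additional requirement of a gauge factor that allows for SU(3).

```python
from fractions import Fraction
from itertools import combinations, product


def e8_roots():
    """Return the 240 roots of E8 as tuples of Fractions."""
    roots = []
    # 112 roots with two entries +-1 and six zeros.
    for i, j in combinations(range(8), 2):
        for s_i, s_j in product((1, -1), repeat=2):
            p = [Fraction(0)] * 8
            p[i], p[j] = Fraction(s_i), Fraction(s_j)
            roots.append(tuple(p))
    # 128 roots with all entries +-1/2 and an even number of minus signs.
    for signs in product((1, -1), repeat=8):
        if signs.count(-1) % 2 == 0:
            roots.append(tuple(Fraction(s, 2) for s in signs))
    return roots


def n_unbroken_roots(vectors, roots=None):
    """Count the roots p with integer p . M_k for all given vectors M_k."""
    roots = e8_roots() if roots is None else roots

    def is_integer(x):
        return Fraction(x).denominator == 1

    return sum(
        all(is_integer(sum(a * b for a, b in zip(p, v))) for v in vectors)
        for p in roots
    )


def passes_root_bound(vectors, bound=8):
    """Check only the lower bound on the number of unbroken roots."""
    return n_unbroken_roots(vectors) >= bound


if __name__ == "__main__":
    third = Fraction(1, 3)
    shift = (third, third, -2 * third, 0, 0, 0, 0, 0)
    print(n_unbroken_roots([shift]), passes_root_bound([shift]))
```

With no vectors chosen yet, all 240 roots are unbroken, and each additional shift vector or Wilson line can only lower this number, which is why the bound can be checked after every step of the construction.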

The Standard
We denote this constraint by SU(3) × SU(2) ⊆ G 4D (M ). Due to the geometrical constraints, we first have to identify the last shift vector or Wilson line that can be chosen independently, i.e. which is not of order one and not equal to a previous vector. For example, for the Z 6 -II orbifold geometry this results in the Wilson line W 6 , as can be seen in table 2. However, note that for some other orbifold geometries like Z 3 ×Z 6 (2, 2) all Wilson lines are fixed by the geometry, W i = (0 16 ) for i = 1, . . . , 6, and the second shift vector V 2 has to enable the G 4D (M ) constraint. The constraint is checked by computing the unbroken roots from the first E 8 factor and the sizes of their orthogonal root systems. This means that in order to contain SU(3) × SU(2) at least two root systems, one of size six and another of size two, have to be present, where we allow for additional gauge group factors, also from the first E 8 factor.
We implement the phenomenological constraints from section 3.1 and section 3.2 into our search algorithm and apply it to the test case of $\mathbb{Z}_6$-II orbifold models. It turns out that the probability of finding an MSSM-like model increases by a factor of 10, from $1/10\,000\,000 = 10^{-7}$ in the case without the phenomenological constraints to $3/2\,665\,463 \approx 10^{-6}$ in the case where the phenomenological constraints are applied. In addition, we use the physical intuition that MSSM-like models are often related to a vanishing Wilson line [20] and perform a second search where we set $W_5 = (0^{16})$ [...]

Table 3: [...] $N_{sr}^{(2)} \geq X$ for $X \in \{8, 10, 12, \ldots, 86\}$ and obtain the dynamic hidden $E_8$ dataset. Here, the case $X = 6$ was disregarded since it was already sampled in the hidden $E_8$ dataset. Finally, the U-sector dataset was created using the U-sector contrast pattern from section 4.2.3 in addition. Note that we also made use of the additional conditions $W_5 = (0^{16})$ or $W_3 = W_4 = (0^{16})$, where $W_5 = (0^{16})$ is known to be beneficial for finding MSSM-like models, see ref. [20].

Contrast patterns for Z 6 -II orbifolds
In the previous section, we discussed phenomenological constraints that can be checked easily during the search for MSSM-like orbifold models. Importantly, these conditions are absolutely necessary for a model to be MSSM-like (but not sufficient). Now, we want to extend this procedure to include additional constraints (so-called contrast patterns) for MSSM-like models by exploiting techniques from data mining. These new constraints will be determined by a statistical approach. Hence, demanding them can potentially rule out a few MSSM-like models. In other words, the new constraints are not necessarily satisfied for all MSSM-like models but they significantly enhance the probability for a given model to be MSSM-like. In this way, we will constrain the heterotic orbifold landscape further to the areas of MSSM-like models. Some of these areas are hardly accessible by the conventional search algorithm but easy to access due to the significantly enhanced probability given the additional constraints from the contrast patterns.
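A contrast pattern $c$ can be defined as a pattern whose supports differ significantly among the datasets under contrast [43]. Here, the support is defined as

$$ \text{supp}(c, D) = \frac{|\{M \in D \,:\, M \text{ satisfies } c\}|}{|D|} , $$

where $D$ is a set of data points, i.e. orbifold models, and $c$ is a set of certain constraints that have to be fulfilled. In our case, we have two datasets that are under contrast: $D_{\text{MSSM-like}}$ and $D_{\overline{\text{MSSM-like}}}$, which are the set of MSSM-like models and the complementary set, respectively. In other words, we are searching for constraints $c$ that are satisfied for nearly all MSSM-like models while they are violated by a huge fraction of $\overline{\text{MSSM-like}}$ models. In the ideal case, we can identify contrast patterns $c$ with $\text{supp}(c, D_{\text{MSSM-like}}) = 1$ and $\text{supp}(c, D_{\overline{\text{MSSM-like}}}) = 0$. This can be formalized by defining the growth rate

$$ \text{gr}(c, D_{\text{MSSM-like}}, D_{\overline{\text{MSSM-like}}}) = \frac{\text{supp}(c, D_{\text{MSSM-like}})}{\text{supp}(c, D_{\overline{\text{MSSM-like}}})} , $$

which has to be maximized. In the following, we will often just write $\text{gr}(c)$ if the datasets $D_{\text{MSSM-like}}$ and $D_{\overline{\text{MSSM-like}}}$ are clear from the context. To understand the growth rate better and get some intuition for its value, we rewrite it in terms of the probability $\hat{p}$. Here, the hat indicates that we estimate the probability by the sample proportion $\hat{p}(Y) = N_Y / N$ for $Y \in \{\text{MSSM-like}, \overline{\text{MSSM-like}}\}$, and the total sample size is given by $N = N_{\text{MSSM-like}} + N_{\overline{\text{MSSM-like}}}$. Then, one finds

$$ \hat{p}(\text{MSSM-like} \,|\, c) = \frac{\text{gr}(c)\, \hat{p}(\text{MSSM-like})}{1 + (\text{gr}(c) - 1)\, \hat{p}(\text{MSSM-like})} . $$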
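Here, one can observe several cases. In particular, for small $\hat{p}(\text{MSSM-like})$ one finds

$$ \hat{p}(\text{MSSM-like} \,|\, c) = \text{gr}(c)\, \hat{p}(\text{MSSM-like}) + \mathcal{O}\big(\hat{p}(\text{MSSM-like})^2\big) , \tag{18a} $$

where the Taylor expansion in eq. (18a) converges for $\hat{p}(\text{MSSM-like}) < \frac{1}{|\text{gr}(c) - 1|}$. Now, eq. (18a) can be interpreted easily: For $\text{gr}(c) < 1$ we have a negative effect on our favored class of MSSM-like models. For $\text{gr}(c) = 1$ the effects on both classes cancel each other, and for $\text{gr}(c) > 1$ we have a positive effect, i.e. a higher probability to find MSSM-like models in the subspace defined by the contrast pattern $c$.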
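Support and growth rate are simple enough to be spelled out in code. The following sketch assumes that a dataset is a list of models and that a pattern is a boolean function on models; the function names and the feature key `"n_sr_hidden"` are hypothetical. The function `conditional_probability` evaluates the relation between the growth rate and the probability of finding an MSSM-like model once the pattern is imposed, as it follows from Bayes' theorem applied to the two supports.

```python
def support(pattern, dataset):
    """Fraction of models in the dataset that satisfy the pattern."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    return sum(1 for model in dataset if pattern(model)) / len(dataset)


def growth_rate(pattern, d_mssm, d_rest):
    """Ratio of the supports in the MSSM-like set and in its complement."""
    s_mssm = support(pattern, d_mssm)
    s_rest = support(pattern, d_rest)
    if s_rest == 0:
        return float("inf") if s_mssm > 0 else float("nan")
    return s_mssm / s_rest


def conditional_probability(gr, p_mssm):
    """Probability of MSSM-like models after imposing a pattern with growth rate gr."""
    return gr * p_mssm / (1 + (gr - 1) * p_mssm)


if __name__ == "__main__":
    # Toy data, not taken from the actual datasets.
    d_mssm = [{"n_sr_hidden": n} for n in (8, 6, 46)]
    d_rest = [{"n_sr_hidden": n} for n in (2, 0, 6, 4)]
    pattern = lambda model: model["n_sr_hidden"] >= 6
    gr = growth_rate(pattern, d_mssm, d_rest)
    p = len(d_mssm) / (len(d_mssm) + len(d_rest))
    print(gr, conditional_probability(gr, p))
```

In the toy example three MSSM-like models and one other model survive the pattern, so the probability rises from 3/7 to 3/4, in agreement with the value returned by `conditional_probability`.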
However, before we can start to search for contrast patterns c, we have to define some (physical) quantities that possibly can lead to such patterns. This is known as feature engineering as explained in the next section.

Feature engineering
In this section, we will define (physical) quantities for a given orbifold model M . In the context of data mining, we will call such quantities features and their construction is called feature engineering. In general terms, feature engineering denotes the process of computing useful quantities from the raw data. For example, neural networks generate features in each hidden layer on their own during training. However, this is one of the biggest open problems of neural networks: it is in general very difficult to extract any meaning of the features that a neural network has learned on its own -these features hardly yield any knowledge gain. Alternatively, in this paper we will use physical intuition and knowledge of the system to create features. Then, after visualizing some of these features, we can use machine learning techniques (i.e. a decision tree) to quantify if our educated guess for a certain feature, or combinations of multiple features, leads to a correlation between these features and the property of M being MSSM-like or not. If such a correlation exists (i.e. if gr(c) > 1), we have identified a promising contrast pattern c. Moreover, if we can check this pattern c easily in the search algorithm displayed in figure 1, we can utilize it to reduce the search space in the heterotic orbifold landscape to areas where MSSM-like orbifold models accumulate. The advantage of this approach is obvious: by construction we have a straightforward physical interpretation of our features.
However, in general it is very difficult to identify useful features. In our case, we have tried many concepts, for instance, local GUTs, the breaking patterns of each shift vector and each Wilson line, the breaking patterns of certain combinations thereof, the number of non-Abelian gauge group factors in 4D, and many more. We know from section 2 that the 4D gauge group has a great impact and it can be checked easily at every step of the production of a model. Therefore, there is hope that additional features can be found from the 4D gauge group. When attempting to identify a promising feature, data visualization can be very useful: In figure 4, we plot the numbers of unbroken roots $N_{sr}^{(\alpha)}$ from the two $E_8$ factors against each other in a scatter plot, using the phenomenology dataset created in section 3. In addition, the respective histograms for the number of orbifold models with a certain number of unbroken roots are displayed in figure 4 for each axis (i.e. for each $E_8$ factor). To see if the feature $N_{sr}^{(\alpha)}$ has the potential to be useful as a contrast pattern, we have to pay attention to two aspects of this plot. First, we have to identify areas in this plot where no MSSM-like orbifold model is present. Second, it is important that such an area is not only qualitatively separated from MSSM-like orbifold models but also quantitatively interesting, i.e. highly populated with $\overline{\text{MSSM-like}}$ orbifold models. This aspect can be read off from the respective histogram in figure 4. By doing so, we identify the area $N_{sr}^{(2)} < 6$ that does not contain any MSSM-like $\mathbb{Z}_6$-II orbifold model but is fairly highly populated with $\overline{\text{MSSM-like}}$ models in our phenomenology dataset. Hence, the condition $N_{sr}^{(2)} < 6$ is our first promising candidate for a contrast pattern and we expect to get a strong reduction of the $\mathbb{Z}_6$-II orbifold landscape by excluding this area from the search.
In contrast, an example of an area that consists of $\overline{\text{MSSM-like}}$ $\mathbb{Z}_6$-II models only but is irrelevant due to the small number of models is given by $N_{sr}^{(2)} > 42$ or $N_{sr}^{(1)} > 20$: the histograms (see figure 4, the vertical histogram on the right-hand side for $N_{sr}^{(2)} > 42$ and the horizontal histogram above the scatter plot for $N_{sr}^{(1)} > 20$) show that the number of $\mathbb{Z}_6$-II orbifold models in these regions is very small.
In addition to the above features, we will also use the numbers of orbifold-invariant bulk matter fields as additional features. They are computed similar to the number of unbroken roots of the gauge group in eq. (11), with the difference of an additional displacement from the geometrical twist vector v, i.e. at each step n of our search algorithm displayed in figure 1 we compute for α = 1, 2 and a = 1, 2, 3. Note that the term q (a) · v (k) in eq.

Decision tree and false negatives
In this section, we will use a decision tree [45] in order to identify those features from table 4 that correlate with the property of a model being MSSM-like or not. If such a correlation exists, the corresponding feature can be used as a contrast pattern in our search algorithm for MSSM-like orbifold models. A decision tree belongs to the class of supervised machine learning algorithms and can be used for the purpose of classification or regression. In our case, we want to classify whether a given orbifold model is MSSM-like or not using simple true-or-false decisions on the features listed in table 4. Then, by analyzing the decisions made inside of the decision tree, we can identify those features that lead to successful contrast patterns. In more detail, our decision tree is a function from some of the features listed in table 4 to the classification value denoted by $Y$, i.e. it is of the form

$$ f : \ \text{features}(M) \ \mapsto \ Y_{\text{predicted}}(M) , \tag{21} $$

where $Y_{\text{predicted}}(M) \in \{\text{MSSM-like}, \overline{\text{MSSM-like}}\}$, and it can be applied to all orbifold models $M$. Note that we can compute for each orbifold model $M$ both the features and the correct classification value $Y_{\text{correct}}(M) \in \{\text{MSSM-like}, \overline{\text{MSSM-like}}\}$ using the orbifolder. Yet, the benefit of using a decision tree is given by the possibility to uncover unknown correlations between our features and the property $Y_{\text{correct}}(M)$ of an orbifold model $M$ to be MSSM-like or not. Furthermore, it is by no means guaranteed that a function like eq. (21) exists. We will only know about its existence after we have trained and tested our decision tree.
In a first step, the decision tree has to be trained, i.e. the algorithm tries to learn the function (21) from a training set of orbifold models $M_i$ for which both the features and the correct values $Y_{\text{correct}}(M_i)$ are known. In a second step, the decision tree is tested on a validation set containing the features and the correct values $Y_{\text{correct}}(P_i)$ of some other orbifold models $P_i$. Then, we can compare the results $Y_{\text{predicted}}(P_i)$ of the decision tree (21) to the correct values $Y_{\text{correct}}(P_i)$ obtained from the orbifolder. In this way, we can check whether our decision tree was able to identify the function eq. (21) between our features and the property of a model being MSSM-like or not.

In practice, a decision tree will not be trained perfectly. First of all, it is possible that there is no exact functional dependency of the form eq. (21). Furthermore, even in the case when such a functional dependency would exist in principle, the decision tree might be unable to learn it, possibly because the training set was too small or imbalanced. In general, we can distinguish between two types of errors, i.e. cases where $Y_{\text{correct}}(M) \neq Y_{\text{predicted}}(M)$. They are called (i) false positives, where $Y_{\text{correct}}(M) = \overline{\text{MSSM-like}}$ but $Y_{\text{predicted}}(M) = \text{MSSM-like}$, and (ii) false negatives, where $Y_{\text{correct}}(M) = \text{MSSM-like}$ but $Y_{\text{predicted}}(M) = \overline{\text{MSSM-like}}$.

Every classification process tries to minimize the number of false predictions. However, at a certain level it always comes to a trade-off between false positives and false negatives and we have to decide whether we want to suppress one of them at the cost of raising the other one. In our case, a false positive classification by the decision tree is not a big problem since we can simply check each of these orbifold models afterwards explicitly using the orbifolder. However, in the case of a false negative classification the consequence is that we will lose an MSSM-like orbifold model. Since MSSM-like orbifold models are far too valuable to us, we want to minimize the number of false negative cases by all means, while we want to keep the number of false positives as low as possible. Therefore, we introduce a loss matrix $L$, which informs the machine learning algorithm about the different importance of certain models [46].
We choose a loss matrix

$$ L = \begin{pmatrix} 0 & 10^6 \\ 1 & 0 \end{pmatrix} , $$

where the two rows correspond to the correct value $Y_{\text{correct}}(M)$ being either MSSM-like or not, and the two columns correspond to the predicted values $Y_{\text{predicted}}(M)$ being either MSSM-like or not. Then, $L_{12}$ corresponds to the false negative cases of an MSSM-like orbifold model $M$ that has been classified by the decision tree to be $Y_{\text{predicted}}(M) = \overline{\text{MSSM-like}}$. As this is very undesirable, the system is punished with a large loss value $L_{12} = 10^6$. As discussed before, the other possible error of a false positive classification is not so severe. Hence, we set $L_{21} = 1$. This will guide the decision tree algorithm towards suppressing the false negative cases such that we do not miss any MSSM-like orbifold models.
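To get some intuition for the effect of such a loss matrix, the following sketch shows how a node of a classification tree can be labeled by minimizing the total loss. This is an assumption about a common way in which loss matrices enter tree-based classifiers and not a description of the specific implementation of ref. [46]; the names `LOSS` and `leaf_prediction` are hypothetical and the vanishing diagonal entries (no loss for correct predictions) are assumed.

```python
# Loss values indexed by (correct class, predicted class).
# Assumption: correct predictions cost nothing.
LOSS = {
    ("MSSM-like", "MSSM-like"): 0,
    ("MSSM-like", "not MSSM-like"): 10**6,  # false negative, L_12
    ("not MSSM-like", "MSSM-like"): 1,      # false positive, L_21
    ("not MSSM-like", "not MSSM-like"): 0,
}


def leaf_prediction(n_mssm, n_rest, loss=LOSS):
    """Label a node with the class that minimizes the total loss.

    n_mssm, n_rest: numbers of MSSM-like and other models in the node.
    """
    cost = {}
    for predicted in ("MSSM-like", "not MSSM-like"):
        cost[predicted] = (
            n_mssm * loss[("MSSM-like", predicted)]
            + n_rest * loss[("not MSSM-like", predicted)]
        )
    # Ties are resolved in favor of MSSM-like in order not to lose models.
    if cost["MSSM-like"] <= cost["not MSSM-like"]:
        return "MSSM-like"
    return "not MSSM-like"


if __name__ == "__main__":
    print(leaf_prediction(1, 999_999))  # a single MSSM-like model dominates
    print(leaf_prediction(0, 5))
```

With these values a node is labeled as not MSSM-like only if it contains more than $10^6$ other models per MSSM-like model, which is the intended suppression of false negatives.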
For later convenience, we will quantify the quality of the predictions by the recall. It is defined as the number of correct predictions of the MSSM-like class divided by the total number of MSSM-like orbifold models. Hence, if the number of false negatives among all MSSM-like orbifold models $P_i$ is zero, the recall is 1.00 on the validation set and all MSSM-like orbifold models are assigned the correct value $Y =$ MSSM-like.
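Written as a formula, with TP denoting the number of MSSM-like models that are predicted correctly and FN the number of false negatives (both symbols are introduced here for illustration and do not appear in the original text), the definition above reads:

```latex
\mathrm{recall} \;=\; \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}\,,
\qquad
\mathrm{FN} = 0 \;\Rightarrow\; \mathrm{recall} = 1\,.
```

Note that the recall does not depend on the number of false positives, which is why it is a suitable measure when false negatives are the costly error.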
In the following, we apply decision trees to our features in order to extract promising contrast patterns.

The hidden E 8 contrast pattern
As a first step, we have to define our datasets. The training and validation sets are created by a random split (using a validation size of 33%) of the phenomenology dataset from table 3 (based on the phenomenological constraints from section 3.1 and section 3.2). However, we add a small modification to the dataset to avoid data leakage. Data leakage refers to the mistake of exposing the machine learning algorithm to data from the validation set during training. In this case, the algorithm might overfit on some of the data from the validation set even though the data was divided into training and validation sets. As the performance of a machine learning model on the validation set is a measure of its ability to generalize to unseen data, this mistake has to be avoided. In our case this could happen if, for example, there exists an MSSM-like model that, together with all of its equivalent copies, completely dominates the set of MSSM-like models. Then, this model would most likely appear in both the training and the validation set. Hence, the machine learning algorithm would see this model during training. Moreover, the same model would dominate the results on the validation set and falsely suggest that the learned predictions generalize to generic MSSM-like models. Therefore, to avoid data leakage we only use inequivalent MSSM-like models. Nevertheless, for the non-MSSM-like models we keep the equivalent models, since the frequency of occurrence gives us a notion of the size of the area that a certain split in the decision tree excludes.

[Figure caption fragment (decision tree): [...] MSSM-like models in this node, respectively, and, finally, (iv) class = MSSM-like is the prediction for all models in this intermediate node. The final prediction for the models is given by the leaf nodes.]

In more detail, we have to perform our search for MSSM-like orbifold models in the space $\mathbb{Z}^{128}$ of gauge embedding matrices $\{M_i\}$, see eq. (2),
even though we are interested in the space of physically inequivalent models $\{\mathrm{model}(M_j)\}$, or more precisely, in the space of inequivalent massless particle spectra $\{\mathrm{spectrum}(M_k)\}$. Now, we have defined features that depend directly on $\mathrm{spectrum}(M)$, not on $d \in \mathbb{Z}^{128}$, and our decision tree performs its splits based on these features. Consequently, a certain split in the decision tree will exclude all points $d \in \mathbb{Z}^{128}$ that give rise to the same excluded features. In this way, a small restriction in feature space gives rise to a huge effect in the space $\mathbb{Z}^{128}$ of gauge embedding matrices. Moreover, by not restricting the feature space too much, we leave enough room to discover new MSSM-like models, also in unexpected areas of the landscape. In contrast, if we had used only inequivalent non-MSSM-like models for the training of our decision tree, the decision tree would have no incentive to exclude a single model even though it might actually correspond to a huge area in $\mathbb{Z}^{128}$. In the end, we want to enhance the search algorithm. Hence, it is better to exclude a few models with an extraordinarily high frequency of occurrence than multiple models with a very low one.

Now, we train the decision tree on the training set. Here, we tune the hyperparameters such that we get a recall value for MSSM-like models of 1.00 on the validation set, because we want to find contrast patterns that are satisfied by all MSSM-like models contained in the validation set. During training, the decision tree identifies areas in feature space and assigns one of the two classes {MSSM-like, non-MSSM-like} to each of them, using the data from the training set. However, we want the decision tree to assign the class non-MSSM-like only to those areas that are also highly populated with non-MSSM-like models. In this way, we can be sure that the probability for an MSSM-like model is extremely small in these non-MSSM-like areas. This can be achieved using a technique called pruning.
Consequently, the complexity of the decision tree is reduced to a minimum and we keep the possibility to find MSSM-like models in rather unexpected areas of the landscape.
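The dataset construction described above (equivalent copies removed for MSSM-like models only, followed by a random split with a validation size of 33%) can be sketched as follows. This is a minimal illustration and not the authors' code: the tuple layout, the `spectrum_key` used to identify equivalent models, and the function name are assumptions.

```python
import random


def build_split(models, validation_size=0.33, seed=0):
    """Split models into (training set, validation set).

    models: list of (spectrum_key, features, is_mssm_like) tuples; equivalent
    models share the same spectrum_key and may appear many times.
    """
    seen_mssm = set()
    kept = []
    for key, features, is_mssm in models:
        if is_mssm:
            if key in seen_mssm:
                continue  # drop equivalent copies of MSSM-like models (avoids leakage)
            seen_mssm.add(key)
        # non-MSSM-like copies are all kept: their frequency measures excluded area
        kept.append((key, features, is_mssm))
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_val = round(validation_size * len(kept))
    return kept[n_val:], kept[:n_val]
```

Because each MSSM-like spectrum survives exactly once, no MSSM-like model can appear in both the training and the validation set.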
The resulting decision tree is displayed in figure 5. We see that the intuition given by the scatter plot in figure 4 manifests itself in a lower bound on the number of unbroken roots from the hidden $E_8$ factor: From the second node in the second line of the decision tree we can extract the condition $N^{(2)}_{\mathrm{sr}}(M) \geq 6$, eq. (25), for a model $M$ to have an increased probability to be MSSM-like. The corresponding growth rate, eq. (26), is estimated using eq. (15) with $\mathrm{supp}(c_{\text{hidden } E_8}, D_{\text{MSSM-like}}) = 1$ for our phenomenology dataset, while the numbers for $\mathrm{supp}(c_{\text{hidden } E_8}, D_{\text{non-MSSM-like}})$ can be read off from figure 5. In other words, we can modify our search algorithm such that we would have avoided 2 047 379 non-MSSM-like models in the training set. Hence, the contrast pattern (25) allows us to exclude a huge area in the $\mathbb{Z}_6$-II orbifold landscape which statistically does not lead to MSSM-like models. Instead, we can invest the gained computing time to search in areas where the probability of a model to be MSSM-like is significantly increased.
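The support and growth rate of a contrast pattern can be sketched in a few lines. Since eq. (15) is not reproduced in this section, the sketch assumes the standard contrast-mining definition, i.e. the growth rate is the ratio of the supports in the target and the background dataset. The feature name `n_sr_hidden` and the toy numbers are hypothetical.

```python
def support(pattern, dataset):
    """Fraction of the models in `dataset` that satisfy `pattern`."""
    return sum(1 for model in dataset if pattern(model)) / len(dataset)


def growth_rate(pattern, target, background):
    """Ratio of supports (assumed definition of the growth rate)."""
    s_background = support(pattern, background)
    if s_background == 0:
        return float("inf")
    return support(pattern, target) / s_background


def hidden_e8(model):
    """Contrast pattern of the text: at least 6 unbroken roots in the hidden E8."""
    return model["n_sr_hidden"] >= 6


# Toy data, for illustration only.
mssm_like = [{"n_sr_hidden": n} for n in (6, 8, 12, 30)]
background = [{"n_sr_hidden": n} for n in (0, 2, 4, 6, 10, 0, 2, 4)]
```

In this toy example every MSSM-like model satisfies the pattern (support 1), while only a quarter of the background does, so restricting the search to the pattern would raise the fraction of MSSM-like models by a factor of four.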
We implement the hidden $E_8$ contrast pattern in our search algorithm displayed in figure 1 and perform an intensive search, additionally using different constraints on the Wilson lines: First, we allow all Wilson lines to be non-trivial and then, motivated by ref. [20], we turn off $W_5$ by hand. The resulting dataset is called hidden $E_8$ and is summarized in table 3. One observes an increase of the probability to find an MSSM-like model from [...], which is consistent with the estimated growth rate in eq. (26). However, these are the probabilities to find any MSSM-like model, which need not be inequivalent to the already known ones. Unfortunately, the total number of inequivalent MSSM-like models did not meet our expectations: it increased only from 130 inequivalent MSSM-like models in the phenomenology dataset to 136 in the hidden $E_8$ dataset. In the next section, we will investigate the reasons for this and present a solution that leads to many new inequivalent MSSM-like models.

The dynamic hidden E 8 contrast pattern
Next, we analyze the effect of the hidden $E_8$ contrast pattern in more detail in order to identify a way to improve this constraint further. To do so, we take the (equivalent) MSSM-like $\mathbb{Z}_6$-II models from the hidden $E_8$ dataset and visualize how many MSSM-like models $M$ appear for various values of $N^{(2)}_{\mathrm{sr}}(M)$, see figure 6. From this chart we see that the models with small numbers of unbroken roots $N^{(2)}_{\mathrm{sr}}$ [...] models have larger hidden sector gauge groups. Furthermore, we investigate the change of the growth rate for higher threshold values $X$ of our contrast pattern $N^{(2)}_{\mathrm{sr}}(M) \geq X$ and obtain table 5. Therefore, it seems very promising to change the threshold value $X = 6$ of the contrast pattern $N^{(2)}_{\mathrm{sr}}(M) \geq X$ into a dynamic variable $X$. We call this new constraint dynamic hidden $E_8$. By applying this dynamic contrast pattern, we hope that the sampling among the various sizes of the hidden sector becomes more balanced. Furthermore, we expect a boost in the number of MSSM-like models due to the increasing growth rate for higher values of $N^{(2)}_{\mathrm{sr}}$ [...] various values of the threshold $X$ and different constraints on the Wilson lines: First, we allow all Wilson lines to be non-trivial, then we turn off either $W_3 = W_4$ or $W_5$. As a result, we obtain a new dataset (which we also call dynamic hidden $E_8$), see table 3. Compared to the hidden $E_8$ dataset with 136 inequivalent MSSM-like $\mathbb{Z}_6$-II models we now have in total 415 MSSM-like models. This is already more than in any existing $\mathbb{Z}_6$-II search [17][18][19][20]. Hence, we were able to significantly improve the search for inequivalent MSSM-like models in the $\mathbb{Z}_6$-II orbifold landscape.

[Figure 6 (bar chart; caption fragment): [...] These inequivalent models have 6, 6 and 7 copies in the whole dataset, respectively. Note that in this chart only those MSSM-like models appear that have $G_{\mathrm{SM}}$ only in one $E_8$, since the notion of hidden $E_8$ is ambiguous otherwise.]
Moreover, this search solves the puzzle of the absence of MSSM-like models in the case $W_3 = W_4 = (0^{16})$: So far, it was not possible to find any MSSM-like model if the order 3 Wilson line is turned off, even though there is no theoretical obstruction for such a model to exist. Now, we have identified two MSSM-like $\mathbb{Z}_6$-II models with $W_3 = W_4 = (0^{16})$, as can be seen in table 3. These models are equipped with a phenomenologically appealing $\Delta(54)$ flavor symmetry. Thus, we present these models in some detail in section 5.
A few remarks are in order. It is clear that previous searches based on the traditional approach as well as those presented in this paper are in general not exhaustive. During any random search process the number of inequivalent MSSM-like models will follow a saturation curve [47]. Consequently, the effort for creating a new inequivalent MSSM-like model grows exponentially during sampling. Thus, we believe that any attempt to reach our result using a basic random search would take an unrealizable amount of computing time and should be considered only a theoretical possibility rather than an alternative approach. So, why is our new search strategy so successful? Astonishingly, it turns out that a huge fraction of the diversity of MSSM-like models lies in areas of the heterotic orbifold landscape where the hidden sector gauge group is large, see figure 7. In more detail, using the dynamic hidden $E_8$ contrast pattern we could (i) obtain many new MSSM-like models with $N^{(2)}_{\mathrm{sr}}(M) = X$ for $X \in \{34, 36, 44, 46, 56, 60, 62, 72, 74, 84\}$ and (ii) resolve the richness of MSSM-like models for higher $X$ values, e.g. with $X \in \{30, 40, 42\}$. These large hidden sector gauge groups can have direct physical implications related to supersymmetry breaking via gaugino condensation at rather high energies [35] and have to be studied in more detail.
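The saturation behavior mentioned above can be illustrated with a toy simulation: if a few models occur with very high frequency, a random search keeps rediscovering them, and the number of distinct models grows ever more slowly. The sketch below is a generic illustration and not the analysis of ref. [47]; the weights are hypothetical.

```python
import random


def saturation_curve(weights, n_draws, seed=0):
    """Number of distinct models seen after each draw.

    weights[i] is the relative frequency with which model i is produced
    by the random search (toy input, not taken from the paper).
    """
    rng = random.Random(seed)
    population = range(len(weights))
    seen = set()
    curve = []
    for model in rng.choices(population, weights=weights, k=n_draws):
        seen.add(model)
        curve.append(len(seen))
    return curve


# Three dominant "towers" of equivalent copies and fifty rare models.
toy_weights = [1000, 500, 200] + [1] * 50
curve = saturation_curve(toy_weights, 2000)
```

Restricting the sampling to a region that excludes the dominant towers, as the dynamic threshold does, corresponds to removing the large weights, so that the rare models are reached with far fewer draws.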

The U-sector contrast pattern
On the basis of our hidden $E_8$ and dynamic hidden $E_8$ datasets we want to search for further contrast patterns. To do so, we follow the same logic as in section 4.2.1 and apply a decision tree on the remaining features $N^{(\alpha)}_{U_a}(M)$ to our new, combined dataset. For computational reasons we downsample our background of non-MSSM-like models. This means we only work with a fraction of ∼ 50% of the total dataset. This is a valid approach since we have so much data that the statistics for the decision tree would not change in a relevant way even if the whole dataset had been used. Furthermore, for the rare and important MSSM-like models we keep all inequivalent MSSM-like models, as described in section 4.2.1.
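A minimal sketch of such a downsampling step is given below, assuming each model is stored as a `(features, is_mssm_like)` pair; this layout and the function name are illustrative choices, not taken from the paper.

```python
import random


def downsample_background(models, fraction=0.5, seed=0):
    """Keep every MSSM-like model and a random `fraction` of the background.

    models: list of (features, is_mssm_like) pairs.
    """
    rng = random.Random(seed)
    mssm_like = [m for m in models if m[1]]
    background = [m for m in models if not m[1]]
    n_keep = round(fraction * len(background))
    # sample without replacement so that no background model is duplicated
    return mssm_like + rng.sample(background, n_keep)
```

Only the majority class is thinned out, so the rare MSSM-like class is never reduced by the sampling.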
Then, the decision tree is trained with the same aim as before, i.e. to classify all MSSM-like models correctly in both the training and the validation set. However, it turns out that this is rather difficult due to two MSSM-like models: One of these models is misclassified during training, see figure 8, the other one during validation. It turns out that these two MSSM-like models are the special $\Delta(54)$ models, where $W_3 = W_4 = (0^{16})$. Since these models lie in a very specific area within the $\mathbb{Z}_6$-II orbifold landscape, we decided to accept their misclassification, with the benefit of obtaining a new contrast pattern that yields a further, significant reduction of the $\mathbb{Z}_6$-II orbifold landscape. Doing so, we identify a new contrast pattern [...] for a model $M$ to have an increased probability to be MSSM-like. We call this contrast pattern U-sector as it gives bounds on the number of certain bulk matter fields, charged under the first or second $E_8$ factor, depending on $\alpha = 1, 2$, respectively. Using this new constraint on top [...] where $D^{N_{\mathrm{sr}}}$ is obtained by combining the datasets hidden $E_8$ and dynamic hidden $E_8$ from table 3. Some remarks on the subtleties of the U-sector constraints are in order:

gr(c): Contrary to the hidden $E_8$ contrast pattern, the U-sector contrast pattern can possibly exclude models which have $G_{\mathrm{SM}}$ in both $E_8$ factors: in the case of $N^{(2)}_{\mathrm{sr}}$ [...] This cannot be guaranteed for the U-sector constraint. Therefore, the growth rate in eq. (29) is estimated using only those models where $G_{\mathrm{SM}}$ is exclusively in the first $E_8$ factor.

Figure 7: Bar chart of MSSM-like $\mathbb{Z}_6$-II models $M$ found using the dynamic hidden $E_8$ contrast pattern, c.f. figure 6. Note that increasing the threshold value $X$ of the contrast pattern $N^{(2)}_{\mathrm{sr}}(M) \geq X$ leads to a deeper search in those areas of the $\mathbb{Z}_6$-II orbifold landscape that were insufficiently sampled by the static search using $X = 6$.
[...] $N^{(1)}_{U_2}(M) \geq 2$ on the number of bulk matter from the $U_2$ sector and charged under the first $E_8$. Even though the decision tree decided strictly correctly (by taking the statistics into account for optimization) and misclassified these two MSSM-like models, it is still appealing that all MSSM-like models (with $G_{\mathrm{SM}} \in E_8^{(1)}$) obey this constraint. This observation might be worth further investigation.

Footnote 5: Note that the estimated growth rate is computed based on the numbers of equivalent models. To do so, the same random split as for the training data of the decision tree has to be applied to the equivalent MSSM-like models from $D^{N_{\mathrm{sr}}}$, yielding 8 466 models. These numbers reduce to 8 087 for models having $G_{\mathrm{SM}}$ only in the first $E_8$ factor and, finally, to 8 082 models fulfilling $c_{\text{U-sector}}$. Consequently, $\mathrm{supp}(c_{\text{U-sector}}, D^{N_{\mathrm{sr}}}_{\text{MSSM-like}}) = 8082/8087 \approx 0.999$.

[Figure caption fragment: [...] $N^{(2)}_{\mathrm{sr}} \geq X$ and (ii) a combination of case (i) and the U-sector constraint, respectively.]
Implementing the U-sector contrast pattern in our search algorithm displayed in figure 1 and performing an intensive search (using all Wilson lines or turning off $W_5$ by hand), we obtain our final dataset, called U-sector, see table 3. The results show once more the strength of the contrast data mining technique applied to the heterotic orbifold landscape: The probability to find MSSM-like models has increased further, as shown in figure 9, and the U-sector contrast pattern generalizes to the $\mathbb{Z}_6$-II landscape such that we obtained many new inequivalent MSSM-like models. Starting from 395 inequivalent MSSM-like models in the dynamic hidden $E_8$ dataset we now obtain 459 models. Finally, combining all datasets yields in total 468 inequivalent MSSM-like $\mathbb{Z}_6$-II models.

To summarize, we were able to significantly exceed all previous searches for MSSM-like $\mathbb{Z}_6$-II models [17][18][19][20] by excluding those regions in the $\mathbb{Z}_6$-II orbifold landscape where most likely no MSSM-like model exists. It is tempting to speculate that some of our contrast patterns might even be necessary conditions for all MSSM-like models. Moreover, we learned some general features of MSSM-like models that can be produced in the $\mathbb{Z}_6$-II orbifold landscape: We were able to identify constraints on physical quantities that can be interpreted and analyzed directly. Later, in section 6, we will show that the hidden $E_8$ contrast pattern can be transferred to other orbifold geometries, while the lower bound of this constraint is sensitive to the orbifold geometry under consideration.
[Table caption fragment: [...] $N^{(2)}_{\mathrm{sr}}$ of unbroken roots from the hidden $E_8$ factor for MSSM-like orbifold models from refs. [17,18] for various $\mathbb{Z}_N$ orbifold geometries $O$ (where "all" refers to the various lattices for a given $\mathbb{Z}_N$ orbifold geometry, as given in the first column of table 8).]

Geometry-dependent contrast patterns
The results from the previous sections were developed in the Z 6 -II (1, 1) orbifold geometry.
However, the basic insights from this analysis can be transferred easily to other orbifold geometries. Foremost, the concept of a hidden $E_8$ contrast pattern can be applied directly to other orbifold geometries: The number of unbroken roots from the hidden $E_8$ is computed identically for all orbifold geometries and does not depend on some unknown sorting. Unfortunately, this is not the case for the U-sector contrast pattern: The number of bulk matter fields $N^{(\alpha)}_{U_a}$ for $a = 1, 2, 3$ depends on the twist vectors of a given orbifold geometry, see eq. (19). Then, the sorting of $N^{(\alpha)}_{U_3}$ is determined by the sorting of the entries in the twist vectors, which are typically sorted from small to large rotation angles. However, it is not clear why a particular U-sector might be special among the different sectors, i.e. whether a special status of a U-sector is related to the sorting or to some nontrivial relation between all sectors. Therefore, we begin with the dynamic hidden $E_8$ search and analyze the U-sector later.
In order to apply the dynamic hidden $E_8$ contrast pattern to all $\mathbb{Z}_N$ orbifold geometries, we first have to identify the lower bound of $N^{(2)}_{\mathrm{sr}}$ [...] To compute these bounds, we use the traditional searches of refs. [17,18] as a background search and split the combined dataset into datasets $D$(O from [17,18]) corresponding to the different $\mathbb{Z}_N$ orbifold geometries $O$. Then, the conventional search in refs. [17,18] can be seen as a search with $N^{(2)}_{\mathrm{sr}} \geq 0$ in our approach. Thus, we can analyze the results of the traditional search [17,18] to obtain the lower bounds $X_{\min}(O)$ for all $\mathbb{Z}_N$ orbifold geometries. The results are stated in table 7. Moreover, the background datasets $D$(O from [17,18]) allow us to focus the dynamic hidden $E_8$ contrast pattern on thresholds greater than $X_{\min}(O)$, i.e. $N^{(2)}_{\mathrm{sr}} > X_{\min}(O)$, in order to save computational resources in our search.
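The extraction of the lower bounds from a background dataset amounts to a grouped minimum. A minimal sketch, assuming the background is available as `(geometry, n_sr)` pairs for the known MSSM-like models (a hypothetical data layout), could look like this:

```python
def lower_bounds(background):
    """Smallest number of hidden-sector unbroken roots per orbifold geometry.

    background: iterable of (geometry, n_sr) pairs, one per known
    MSSM-like model; returns {geometry: minimal n_sr}.
    """
    bounds = {}
    for geometry, n_sr in background:
        bounds[geometry] = min(n_sr, bounds.get(geometry, n_sr))
    return bounds


# Toy input with placeholder geometry labels, not values from table 7.
toy_background = [("A", 12), ("A", 30), ("B", 6), ("B", 8), ("A", 14)]
```

A geometry that is absent from the background dataset gets no entry, in which case no threshold can be inferred for it from this data alone.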
First, let us state the results of our search in table 8: For each $\mathbb{Z}_N$ orbifold geometry, we give the numbers of inequivalent MSSM-like orbifold models that we found using our dynamic hidden $E_8$ contrast pattern and compare these numbers to the literature. Several remarks are in order. One can observe that the dynamic hidden $E_8$ search was able to find many new inequivalent MSSM-like orbifold models in almost all orbifold geometries. Foremost, the results for the different $\mathbb{Z}_6$-II orbifold geometries as well as for the $\mathbb{Z}_{12}$-I case have improved strongly using our contrast patterns. Note that for $\mathbb{Z}_6$-II (1, 1), the 13 additional MSSM-like models from refs. [17,18] fulfill all our constraints derived in section 4. Consequently, even though these models were missed in our search, they are part of our search area. Hence, these models would have been found in an extended search.

[Table 8: inequivalent MSSM-like orbifold models (table body lost). Columns: orbifold geometry | # MSSM-like from [17] | # MSSM-like from [18] | # MSSM-like using contrast patterns | # MSSM-like 'merged'. Caption fragment: [...] [36]. Note that the numbers of MSSM-like orbifold models listed in the third column differ from those in ref. [18]. This is due to an improvement of the orbifolder, which has led to a better comparison of models and identified some duplicates in these sets. The last column gives our final results: the numbers of inequivalent MSSM-like orbifold models obtained by merging the three datasets of the previous columns.]

A great success of our contrast patterns is also given by
the appearance of the MSSM-like $\mathbb{Z}_7$ model. So far, this model was found only in refs. [18,38] using an orbifold-specific search strategy, as described in appendix A of ref. [38]. The $\mathbb{Z}_6$-I orbifold geometry is also remarkable: In this case, we find a huge number of equivalent MSSM-like models, but only 31 inequivalent ones remain. These 31 inequivalent models were found very easily by searching in areas of the $\mathbb{Z}_6$-I orbifold landscape with large hidden sector gauge groups, c.f. the lower bound $X_{\min}(\mathbb{Z}_6\text{-I}) = 12$ given in [...] We analyze these problematic models separately and find a lower bound $X_{\min}(\mathbb{Z}_6\text{-I}) = 10$ for these cases (because $(N^{(1)}_{\mathrm{sr}}, N^{(2)}_{\mathrm{sr}})$ takes only the values (8,12) and (10,10) in the background dataset). This means that MSSM-like $\mathbb{Z}_6$-I models which have $G_{\mathrm{SM}}$ in both $E_8$ factors are contained in our search $N^{(2)}_{\mathrm{sr}} \geq 10$ for both $\mathbb{Z}_6$-I orbifold geometries. Furthermore, it seems that for some orbifold geometries like $\mathbb{Z}_8$-II (1, 1) the conventional approach has some advantages. However, a comparison is difficult since it is not known how much computational power was invested to obtain these numbers. Moreover, for $\mathbb{Z}_8$-II (1, 1) our contrast patterns seem to be less efficient since there is no lower bound for $N^{(2)}_{\mathrm{sr}}$ [...] as well. Hence, at first sight it seems that our search algorithm is too complex for such geometries and the additional effort in computing constraints is not rewarded. However, this conclusion is premature: The merged datasets in table 8 show that our contrast patterns could still significantly improve the numbers of inequivalent MSSM-like orbifold models in these geometries. Thus, our search algorithm was able to find new MSSM-like orbifold models in corners of the landscape that were missed by the conventional approach. On the basis of the 'merged' datasets we can now use decision trees to derive the U-sector constraints as in the case of the $\mathbb{Z}_6$-II orbifold geometry. The results are given in table 9.
Let us analyze the resulting U-sector contrast patterns in some detail: In nearly all orbifold geometries it is possible to get a recall of 1.00 on the validation set while no MSSM-like model is missed in the training set. Only for the $\mathbb{Z}_6$-II orbifold geometries (1, 1), (2, 1) and (3, 1) were very few false negative predictions made, either in the validation set or in the training set.
Another special case is the Z 8 -II (2, 1) orbifold geometry: In order to get a growth rate larger than one the decision tree had to split the set of MSSM-like models into two sets at the first node with a constraint on N U 3 ≤ 25 that has a recall value of 1.00 and gr(c U−sector ) = 2.18, however, with the trade-off of missing one MSSM-like model from the training data.
Interestingly, it seems that the U-sector constraints in table 9 show some patterns on their own. Foremost, it is remarkable that for a given twist vector the exact orbifold geometry (i.e. the choice of the six-torus which is enumerated by the number l = 1, 2, . . . in the label (l, 1)) does not have any significant effect. This could be used to extrapolate from one orbifold geometry to another. If the constraints are different within one twist vector one should be careful and probably take the weakest constraint, e.g. N U 1 ≥ 4 for Z 8 -I. This can be seen as regularizing the machine learning model. One can use this insight to avoid overfitting and to use more statistics from other orbifold geometries. Additionally, one can observe that the hidden sector is completely dominated by a '≤' constraint in the U 3 -sectors, while the visible sector favors '≥' (except for Z 12 -I and Z 6 -I). This shows that there is still more structure to explore in the heterotic orbifold landscape and, more importantly, that we are on the right track to obtain necessary conditions on both the observable E 8 factor, containing the MSSM, and the hidden E 8 factor.
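The suggestion to take the weakest constraint among geometries that share a twist vector can be made explicit. In the sketch below a constraint is represented as an `(operator, threshold)` pair; this representation and the function name are hypothetical. For a lower bound the weakest constraint has the smallest threshold, for an upper bound the largest.

```python
def weakest_constraint(constraints):
    """Return the least restrictive of several bounds on the same feature.

    constraints: non-empty list of (op, threshold) pairs that all share
    the same operator, either ">=" or "<=".
    """
    ops = {op for op, _ in constraints}
    if len(ops) != 1 or not ops <= {">=", "<="}:
        raise ValueError("constraints must share one operator, '>=' or '<='")
    op = ops.pop()
    thresholds = [threshold for _, threshold in constraints]
    # a smaller lower bound or a larger upper bound admits more models
    return (op, min(thresholds)) if op == ">=" else (op, max(thresholds))
```

Any model that satisfies one of the individual constraints also satisfies the merged one, so no MSSM-like model accepted by a single geometry's pattern is excluded by the extrapolation.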

Conclusion
In this paper, we have developed an advanced search strategy for MSSM-like orbifold models using the $\mathbb{Z}_6$-II (1, 1) orbifold geometry as a test case. We obtained a significant improvement from 363 inequivalent MSSM-like models [18] to 481, see table 8. To do so, we used a technique called contrast data mining, where one identifies so-called contrast patterns that help to distinguish between MSSM-like models and others. In principle, this technique is easy to generalize to all orbifold geometries and, presumably, to other string compactifications. As a first step in this direction, we analyzed all $\mathbb{Z}_N$ orbifold geometries in section 6 and showed that in all cases our contrast patterns significantly enhance the known datasets of MSSM-like orbifold models, see table 8.

Let us stress that this new search strategy is superior by orders of magnitude with respect to computing time. Theoretically, the conventional search algorithm can find all MSSM-like orbifold models. However, this would require an unfeasible amount of computing time because the effort for finding a new MSSM-like model grows exponentially with the number of already constructed models. This fact was studied in detail in ref. [47] and can also be inferred from figures 3, 6 and 7. These figures show that the towers of already known orbifold models dominate the search and statistically keep growing before new orbifold models are expected to appear. Hence, with increasing search time the probability to find a new orbifold model is suppressed further and further. Consequently, we believe that contrast patterns can be of great importance when studying the string landscape. In addition, contrast patterns are particularly useful as they have a clear physical interpretation. In our setup of heterotic orbifolds, we identified the following contrast patterns: the number of unbroken roots in the hidden $E_8$ factor and the numbers of various bulk matter fields, charged under the first or second $E_8$ factor.
Hence, our contrast patterns are related to bulk fields that originate from the compactification of the ten-dimensional E 8 × E 8 gauge bosons. Moreover, our contrast patterns have direct phenomenological implications, as they are important for supersymmetry breaking via hidden sector gaugino condensation [9,35], gauge-Higgs unification [52,53] and gauge-top unification [54]. Further studies along these lines have to follow.
Moreover, using the approach with contrast patterns it was possible to solve some long-standing issues in the heterotic orbifold landscape, namely:
• We found many new MSSM-like orbifold models, especially in corners of the heterotic orbifold landscape that were hardly accessible by the conventional search algorithms, see table 8.
• As stated in table 3, it was possible to prove the existence of MSSM-like $\mathbb{Z}_6$-II models with vanishing Wilson line of order three, i.e. $W_3 = W_4 = (0^{16})$. This is the first time that such models are described in the literature. They might be phenomenologically interesting as they are equipped with a $\Delta(54)$ flavor symmetry, see section 5.
• Furthermore, using the new technique we were able to reproduce the only known MSSM-like model in the $\mathbb{Z}_7$ orbifold geometry [18,38]. This model could not be found by any random search so far. Instead, it was found using a method (described in appendix A of ref. [38]) that is not feasible for most other orbifold geometries.
• Moreover, even though the main aim of our search algorithm is to find inequivalent MSSM-like orbifold models, we obtain in addition an important byproduct: our contrast patterns significantly increase the probability to find MSSM-like orbifold models in certain regions of the heterotic orbifold landscape. In future applications, the models that are classified as equivalent by the orbifolder may differ in some other aspects, e.g. in their Yukawa couplings, see section 2.2. As soon as a preferred MSSM-like orbifold model is identified, our search algorithm allows us to explore a specific part of the landscape in order to find models that have similar spectra but are not necessarily equivalent with respect to the full model.
• This work can also be seen as a fundamental step towards applying further machine learning techniques to the heterotic orbifold landscape. In this paper, we are fighting the imbalance of the string theory datasets, i.e. we are trying to get enough data of the minority class, which is built up by the MSSM-like models. Deep learning techniques in particular are very sensitive to unbalanced data and tend to perform worse than traditional machine learning methods.
Finally, we want to state some preferred properties of contrast patterns that should be kept in mind when constructing new features in the future. These properties are useful for the implementation as well as for the impact of a new contrast pattern. The most important property is that a new feature can be checked quickly and easily, since it will be computed several times during the successive search algorithm. In addition, it is an advantage if a contrast pattern is testable at each step of the successive construction, see figure 1. Therefore, a new feature must be a monotonically decreasing (increasing) function with respect to the successive creation of shifts and Wilson lines. Moreover, in combination with the monotonic behavior the constraint has to be a lower (upper) bound on the model. For example, the minimal number of unbroken roots is a good contrast pattern since this number can only decrease at each step in figure 1 and, therefore, it is a monotonically decreasing function with a lower bound. On the other hand, the maximal number of bulk fields in a certain U-sector can only be checked at the last step in figure 1 since this number is also a monotonically decreasing function and any subsequently chosen Wilson line can decrease this value further. Hence, even though the number of bulk fields is a monotonically decreasing function with respect to the successive creation of shifts and Wilson lines, the U-sector contrast pattern is given by an upper bound, which weakens this contrast pattern.
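The benefit of a monotonically decreasing feature with a lower bound can be sketched as an early-exit rule in a stepwise construction. The sketch below is a strong abstraction of the search in figure 1: each step is modeled as a choice among non-negative decrements of the feature, whereas in the actual construction the number of unbroken roots follows from the chosen shifts and Wilson lines. All names and numbers are illustrative.

```python
def successive_search(steps, start_value, threshold):
    """Enumerate complete constructions whose feature stays >= threshold.

    steps[i] lists the possible non-negative decrements of the feature at
    step i (toy stand-in for choosing a shift or a Wilson line).
    Returns a list of (choices, final_value) pairs.
    """
    results = []

    def extend(i, value, chosen):
        if value < threshold:
            return  # feature can only decrease: this branch can never recover
        if i == len(steps):
            results.append((tuple(chosen), value))
            return
        for decrement in steps[i]:
            extend(i + 1, value - decrement, chosen + [decrement])

    extend(0, start_value, [])
    return results


# E8 has 240 roots; the decrements below are toy numbers.
toy_results = successive_search([[0, 100], [10, 200], [4]], 240, 6)
```

An upper bound on a decreasing feature offers no such early exit, because a partial construction that violates the bound may still satisfy it after later steps; this is why such a bound can only be checked at the last step.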
In conclusion, this paper shows that techniques from data mining and machine learning can be applied successfully to the heterotic orbifold landscape and produce practical results, i.e. novel MSSM-like models that were out of reach using all traditional approaches so far. Further investigations in this direction have to be done in order to complete the set of contrast patterns. One might speculate that contrast patterns could ultimately help to identify an analytic formula for the construction of MSSM-like models in the heterotic orbifold landscape.

Acknowledgements
This work was supported by the Deutsche Forschungsgemeinschaft (SFB1258). We would like to thank Saúl Ramos-Sánchez for useful discussions.