Unsupervised Machine Learning to Classify the Confinement of Waves in Periodic Superstructures

We employ unsupervised machine learning to enhance the accuracy of our recently presented scaling method for wave confinement analysis [1]. We employ the standard k-means++ algorithm as well as our own model-based algorithm. We investigate cluster validity indices as a means to find the correct number of confinement dimensionalities to be used as an input to the clustering algorithms. Subsequently, we analyze the performance of the two clustering algorithms when compared to the direct application of the scaling method without clustering. We find that the clustering approach provides more physically meaningful results, but may struggle with identifying the correct set of confinement dimensionalities. We conclude that the most accurate outcome is obtained by first applying the direct scaling to find the correct set of confinement dimensionalities and subsequently employing clustering to refine the results. Moreover, our model-based algorithm outperforms the standard k-means++ clustering.

Recently, we have described a rigorous method to characterize the confinement of waves in periodic media with defects and in superlattices in general [1].We first introduced a so-called confinement dimensionality, that quantifies the intuitive term of "confinement" and then developed a scaling theory to determine the confinement dimensionality of every band in a given system.This scaling theory is valid for any type of physical wave -acoustic, electromagnetic, electron, spin, etc. -in both quantum and classical setting, and for systems in any dimension, and is readily usable in computer algorithms, allowing for automated classification of the bands.
Nevertheless, the theory of Ref. [1] requires for every investigated superlattice a smaller reference superlattice, so that one can observe the scaling behavior of the key quantities when changing the supercell size.Generally, obtaining the data for the reference supercell is significantly less computationally demanding than performing the calculations for the supercell of interest.On the other hand, this requirement for a reference supercell makes the scaling approach of Ref. [1] inapplicable in case the reference supercell is not available.Furthermore, the scaling theory is exact only in the case of infinitely large supercells and thus for experimentally relevant small supercells inaccuracies may occur, as described in Ref. [1].Since the reference supercell is smaller than the investigated one, it is clear that eliminating the reference supercell from the confinement determination may provide an advantage in terms of accuracy of the scaling method.
Assigning the confinement dimensionality to each band in a spectrum is a typical classification task, which is among the archetypal problems solved by unsupervised machine learning [41].A prominent approach within unsupervised learning are clustering algorithms, which aim to partition the dataset into several clusters in such a way that "similar" data points are grouped together in the same cluster [42,43].In the context of wave characterization, machine learning has been employed, e.g., by Refs.[44,45] for topological classification of band structures pertaining to various types of periodic lattices.
In this paper, we investigate the application of clustering algorithms to improve the precision of confinement identification for waves in small supercells.We employ two clustering algorithms to partition the bands of interest in the two-dimensional parameter space of mode volume and confinement energy, so that each resulting cluster corresponds to a specific confinement dimensionality.The first one is the well-known k-means algorithm [46] with improved initialization [47], which measures the "similarity" between the bands by their distance in a parameter space.This is a standard algorithm, which, however, does not take into account any specific physical properties of our system and, as we show, this results in less accuracy in some cases.Secondly, we implement our own model-based clustering algorithm (MBC), which uniquely merges clustering with the physical model behind the wave confinement, measuring the "similarity" between the bands by their distance from the scaling curves predicted by Ref. [1].The MBC algorithm shows improved accuracy in the cases where k-means fails.
We conclude that even though the clustering approach brings specific complications on its own, it can be used to enhance the precision of confinement identification for small supercells, especially if one first employs the direct scaling method of Ref. [1] to determine the correct set of confinement dimensionalities.In terms of computational complexity, the additional clustering step is negligible, taking mere seconds for the cases presented in this paper.
The remainder of this paper is organized as follows: Section 2 of this paper introduces notation and a brief review of scaling theory of wave confinement.In the third section, we reformulate the scaling equation for wave confinement into a more machine-learning friendly form.The fourth section deals with the evaluation of the accuracy of the clustering algorithms, while in the fifth section we describe our model-based clustering algorithm.In the sixth section, we compare the performance of the k-means clustering algorithm, our MBC algorithm, and the standard scaling approach using a reference supercell.

Superlattices and scaling
Superimposing a periodic lattice of defects on another underlying crystal lattice gives rise to a so-called superlattice [1,[48][49][50].A unit cell of such a superlattice is called a supercell.Fig. 1 depicts supercells containing various types of defects.This is a traditional model for, e.g., defects in periodic materials.A supercell of linear size  in a -dimensional system contains   unit cells.A defect with a dimensionality  can confine waves in  =  −  dimensions, as illustrated in Fig. 1.The number  is referred to as confinement dimensionality.
In Ref. [1], we have developed a general scaling theory of wave confinement.This method relies on the confinement quantity  (r) as an input.This quantity depends on the investigated system and represents what is intuitively understood as confinement; it can be, e.g., the energy density in photonic systems, or the charge density in electronic systems.From  (r), two other quantities are calculated; the relative confinement energy defined as where the quantity  S ∫  S  (r)dr represents the total energy in the supercell, and the relative mode volume Here,  S is the supercell volume and  C a specific integration volume, usually chosen so that it contains a part of the defect.The quantities defined by Eqs. ( 1) and ( 2) have the following scaling behavior in the limit  → ∞: with ,  constants that are independent of .
We found that by observing the scaling behavior of the combined quantity with respect to the supercell size , for judiciously chosen auxiliary powers , one can determine the confinement dimensionality  for each band of states of a given system [1].Here, the constant  =   / is a priori unknown and may depend on various system parameters and the specific investigated band.Although this scaling approach is rigorously valid only for sufficiently large , it was found that it works for several smaller systems as well, which is unusual for scaling theories.However, the method sometimes also returned unphysical results for certain bands in these small supercells, necessitating a critical evaluation of the results.
In this paper, we present an alternative scaling approach and employ machine learning algorithms to enhance the precision of the method of Ref. [1].Although this approach introduces an additional step in the confinement analysis, this step is computationally cheap and improves the accuracy of the results for small supercells.Moreover the clustering analysis still maintains its wide range of applicability to different physical systems.
Throughout this paper, we use as reference system a 3D inverse woodpile photonic band gap crystal with two proximate defects [51], described in detail in Appendix A. Specifically, we employ the well-researched structure with regular pores of radius  = 0.24, with  being the lattice parameter, and defect pores of radius  = 0.5 [23,24,52].This system represents a suitable model due to its relatively complex arrangement of linear defects, creating a cavity at their crossover.We know that, as a result, the system exhibits extended ( = 0), line-confined ( = 2), and point-confined ( = 3) bands that need to be distinguished, corresponding to the total number of  = 3 clusters.On the other hand, we know from physics that the system does not support plane-confined ( = 1) bands and those should therefore not appear in the correct clustering result.We employ the range of supercell sizes  = 2, 3, 4, exhausting what is feasible in a reasonable computing time.For every supercell size , we used the plane-wave expansion method implemented via the well-known MPB code [53] to obtain the mode volume and energy density for each band.

Machine-learning approach to scaling
Instead of combining the quantities in Eqs.(1) and (2) into the ratio (4), one can also express the mode volume as a function of confinement energy, in the limit  → ∞: where Eq. ( 5) represents a different form of scaling than Eq. ( 4), this time with respect to .Of course, from Eq. (3), it is clear that  also changes with the system size .The band behavior based on Eq. ( 5) is graphically illustrated in Fig. 2. For small supercell sizes , the bands are accumulated Figure 2. Schematic of the scaling behavior of the bands according to Eq. ( 5).Mode volume versus the confinement energy.For small supercell size  all the bands are grouped together, but, for larger N, they spread over the ( , ) space, forming clusters in accordance with their confinement dimensionality .
in the upper right corner with relatively high  and .As the supercell size increases, the bands start to follow a path corresponding to their confinement dimensionality .Eventually, bands with the same  form clusters as depicted in Fig. 2. The bands in a  dimensional system will form at most  + 1 clusters, corresponding to all possible values of the confinement dimensionality .
The immediate advantage of this approach compared to the one presented in Ref. [1] is that this method does not require a smaller reference supercell, but one can directly analyze which bands belong to which cluster for the supercell size of interest.Nevertheless, since Eq. ( 5) strictly holds only for  → ∞, the clusters will be clearly distinguishable only for sufficiently large supercells.In both computations and experiments, one is realistically constrained to relatively low  and it thus may become difficult to distinguish the clusters from each other.To improve the accuracy of confinement determination, we propose here the employment of a clustering algorithm, which divides the data into clusters quantitatively and automatically.

Data clustering
Data clustering is one of the staple problems of machine learning techniques [41] and it is thus reasonable to expect that such techniques enhance the precision of confinement identification.In order to formulate the band confinement identification as a clustering problem, we consider the logarithm of mode volumes log 10   and the confinement energies   for each band  = 1, . . .,   , where   is the total number of bands for the corresponding supercell.To ensure that both log 10   and   are treated with equal importance and thus neither dominates the clustering, we renormalize the data once again, so that both variables have the same range of values prior to clustering.The normalization is performed for each dataset separately as follows: Note that we choose the semi-logarithmic variable space because this provides more uniformly spread data than both linear and log-log view.The normalized dataset used for the clustering is thus The goal of a clustering algorithm is to subdivide the data in X  among  clusters, such that "similar" data points are assigned to the same cluster   = {x   |  = 1, . . .,   }, where  = 1, . . ., .In total, each cluster will contain   ≥ 1 data points, such that  =1   =   .We refer to the specific distribution of the data points within the clusters as a partition, denoting it as C  = { 1 , . . .,   }.The centroid c  of the cluster   is defined as the mean of its constituent points, i.e., c  = (   =1 x   )/  .We denote the set of all possible partitions of the dataset X  into  clusters as P  .
We explore two different ways of clustering the data for the purpose of confinement identification: A standard k-means++ algorithm implemented in the Python library scikit-learn [54] and our own MBC algorithm utilizing physical insight to perform the clustering.Each of them utilizes a slightly different notion of what it means for two data points to be "similar", and therefore employs a different approach to obtain the resulting partition, as described below.

The k-means++ algorithm
A well-known known clustering algorithm is the k-means algorithm [46], which considers two data points to be "similar" if their Euclidean distance in the data space is small.Therefore, it aims to group such points together in one cluster.Within a given partition C  , this notion of similarity can be measured via the cost function Ψ : P  → R representing the sum of distances of each data point from the centroid of its cluster: where • denotes the Euclidean norm.The k-means algorithm aims to minimize the cost function Ψ over all possible partitions.The partition C , for which Ψ attains its minimum is considered the best partition of the dataset, i.e., C is such that The k-means algorithm is summarized in Algorithm 1. Algorithm 1 will always converge to a local minimum of the cost function Ψ, see Ref. [55].There is, however, a likelihood that this local minimum does not coincide with the global minimum.As a result, the output of the algorithm may depend on the specific initialization choice.

Algorithm 1 k-means algorithm
Input: Data array X  ⊂ R 2 , number of clusters  ∈ N.
Output: Division of the data into  clusters such that the cost function Ψ is minimized.

2.
Assign each data point x  ∈ X  to its nearest centroid using the Euclidean metric.In case of equality of the nearest centroids, the data point is assigned to either cluster.

3.
Recalculate the centroid of each cluster as the mean of all its constituent data points.

4.
Repeat the steps 2 and 3 with the new centroids until a termination criterion is reached.
One can improve the probability of reaching the global minimum by applying the algorithm multiple times with different random initializations and choosing the result with the smallest cost function Ψ.Alternatively, one can choose a more sophisticated method of the cluster centroid initialization, as described by the k-means++ algorithm [47].Here, instead of choosing the initial centroids in Step 1 of Algorithm 1 at random, we choose the first centroid randomly from the dataset X  , with a uniform probability distribution.Then, we continue to choose each subsequent centroid c  , 1 <  ≤  randomly from the dataset X  with the adjusted probability distribution where is the distance of the point x to the closest centroid that has already been determined.The remaining part of the k-means++ algorithm is the same as for the standard k-means algorithm, described in Algorithm 1.

Clustering accuracy
Clustering of the bands based on their confinement dimensionality is complicated by the fact that there is no known ground truth, i.e., there is no reference solution against which we can assess the accuracy of the clustering process and evaluate its validity.This is an important hurdle for the clustering approach, since the k-means++ algorithm requires the number of clusters  as an input, which is, however, not known a priori.One can, of course, run the algorithms for all possible values of , but how can we determine that we have found the correct number of clusters?Since the cost function Ψ defined by Eq. ( 9) is, by definition, a decreasing function of cluster numbers, it cannot be used to determine the correct .We therefore need another type of clustering accuracy measure and this is where the concept of cluster validity indices (CVIs) comes into play.
Cluster validity indices are quantitative measures to determine the validity of the clustering results.A plethora of CVIs has been proposed in literature, employing various approaches to measuring the clustering accuracy [56].Based on the definition of each CVI, better clustering results correspond to either lower or higher values of the index and thus the most accurate clustering out of a set of results is the one achieving the optimum of the CVI, i.e., either the minimum or the maximum.CVIs can be divided into two categories in a rather straightforward manner: external indices, comparing the clustering result against the ground truth, and internal indices, analyzing only the partitioned data without the need for any external information [57].It is clear that for our purposes we are interested in the CVIs of the internal type.Most CVIs measure cluster cohesion and cluster separation of the data, in some sense.Cluster cohesion describes to what extent the entities inside a cluster are alike, whereas cluster separation evaluates how well different clusters are separated.The behaviour of the CVIs largely depends on the context and the setting [56].This is why numerical tests must be conducted to convincingly select the CVI that provides the correct measure of accuracy in the given context [56].This obviously also applies to clustering for wave-confinement analysis.Therefore, in this section, we analyze select CVIs, in order to find the ultimate clustering accuracy measure for our specific problem.

Methodology
To limit the number of CVIs to be tested, we pick five of the best-performing CVIs from Ref. [56].These are the Silhouette coefficient (Sil), the Calinski-Harabasz (CH), the Davies-Bouldin (DB) including its slight variation (DB*), the COP and the S_Dbw indices.Sil, CH, and DB were implemented via the package library scikit-learn, S_Dbw was implemented via the package library S_Dbw, and DB* and COP were implemented directly by ourselves.For the overview of the CVIs and their definitions, see Appendix B.
The testing procedure was executed in an identical fashion for each CVI, utilizing the reference inverse woodpile structure with  = 0.24 and  = 0.5, specified in Section 2 and Appendix A. To find the CVI that provides the best accuracy measure, we run the clustering algorithm for various total number of clusters  and subsequently compute the value of each CVI for that partition.The global optimum (minimum or maximum, depending on the CVI definition) should then be attained for the correct number of clusters  = 3.For this analysis, we choose the range of  = 2, . . ., 25.This range exceeds the number of clusters expected in the context of wave confinement, but such a broad range of values may provide additional information on the accuracy of the CVIs.Note that we do not use  = 1 since the value of the CVIs is not defined for only one cluster, see their definitions in Appendix B. It is important to note that the choice of clustering algorithm should not affect the CVI performance, since CVIs only evaluate the intrinsic clustering outcome [58,59].This is why, for the purpose of the CVI analysis, we stick to only the k-means++ algorithm.
According to the supercell scaling method the distance between the clusters of different levels of confinement should increase as the supercell size  grows.Taking also into account that we always normalize the dataset to a unit domain, we expect the well-behaved CVIs to display better result, i.e., lower minimum or higher maximum, every time  is incremented.In other words, the CVI must be a monotonically increasing or decreasing function of , depending on the CVI definition.We refer to this as the convergence property of the CVI.
Thus, in this section, we are looking for the best performing CVI for our specific setting of wave confinement analysis.The suitable CVI has to satisfy the following criteria: 1.For the reference structure, the maximum or minimum of the CVI should be obtained at  = 3 clusters.
2. The CVI should be a monotonically increasing or decreasing function of the supercell size , exhibiting convergence.

Results and discussion
Fig. 3 shows obtained CVI values as a function of the input number of cluster , for the reference structure.For the supercell size  = 2, none of the indices achieve their respective optimum for  = 3.This is not surprising, as this is an extremely small supercell and the scaling properties are not yet well developed for this size.In the case of  = 3, the only CVI achieving correctly its  apparent that all the indices obey the required monotonic convergence.
Based on our analysis, we conclude that, out of our tested sample of CVIs, DB* is the only CVI to satisfy the requirements for a good clustering accuracy measure in our setting for both  = 3 and  = 4 supercell sizes.Therefore, we will focus on this CVI in the remainder of this paper.We also note that for the larger supercell size  = 4, DB also seems to be well-behaved and thus might be suitable as a CVI for larger supercell sizes.

Model-based clustering
The k-means clustering algorithm is a standard and easy-to-implement process.However, it simply clusters the data without any account for the underlying physics.Therefore, we present here our model-based clustering (MBC) algorithm, a unique mix of clustering and a model-based regression method.
As discussed in Section 3, bands in a superlattice analysis follow certain trajectories in the ( , log 10 ) space based on their confinement dimensionalities , forming clusters as depicted in Fig. 2. Mathematically, upon transforming Eq. ( 5) from the ( , ) to ( , log 10 ) space, the trajectory of a band with confinement dimensionality  will be given by Note that the second normalization given by Eqs. ( 7) and ( 8) only adds a constant to the right-hand side of Eq. ( 13), which can be absorbed by the unknown constant C. We can thus directly write In the MBC method, instead of evaluating the distance of a data point to the centroid of its cluster, we evaluate the distance of a point to the curve given by Eq. ( 14).For 0 ≤  ≤ , we define the distance of a point x 0 = (E 0 , V 0 ) to a point x = (E, V) on the curve (14) as where Note that   formalizes the curves in Fig. 2: In a  = 3 system,  0 corresponds to the black,  1 to the red,  2 to the blue and  3 to the green curve, for specific choices of the scaling constants C. For 0 ≤  < , the norm on the right hand side of Eq. ( 15) can be explicitly written as and, for  = , it is simply given by the difference in the mode volumes |V − V 0 |.
In the MBC algorithm, we calculate the distance of each data point to each curve corresponding to different values of  and assign the data point to the cluster corresponding to the smallest distance.Based on this clustering, we then calculate the centroids of each cluster and adjust the curves   to pass through these new centroids by changing C in (14).Then the algorithm is iterated again, until a termination criterion is reached.Since, for sufficiently large supercells, curves   are strictly separated from each other, i.e., they do not cross, as illustrated in Fig. 2, there is no need for repeated random initialization.As initialization values, we simply set all the centroids at the point (max   =1 E, max   =1 V), corresponding to the top-right corner of the clustering dataset in Fig. 2.
The MBC algorithm is summarized in Algorithm 2. We note that, while the k-means algorithm only performs the clustering and one needs to manually assign the corresponding confinement dimensionality  to each cluster, MBC does this inherently and automatically, which is an additional advantage of this approach.
Again, we use DB* as a CVI to determine the correct number of clusters.We stress here that for MBC, the input is not only the number of clusters, but also their specific confinement dimensionalities.The set of  clusters can thus include different combination of  values, yielding different partitions.In such a case, DB* can sometimes prefer a physically impossible partition, such as plane-confined bands in a structure with no plane defects.Nevertheless, if we restrict ourselves to only sets of physically meaningful confinement dimensionalities, the performance of DB* is comparable to the case of the k-means algorithm.This is inherently a shortcoming of DB* applied to our problem, as it does not measure the validity with respect to the theory, but only based on clustering cohesion and separation, as discussed in more detail in Appendix B. There would thus be a clear benefit in devising a model-based CVI along with our model-based clustering algorithm.This is, however, beyond the scope of this paper.Moreover, as we discuss

Algorithm 2 Model-based clustering algorithm
Input: Data array X  ⊂ R 2 , values of  confinement dimensionalities, corresponding to different clusters.
Output: Division of the data into clusters such that the distance from each point to its corresponding curve is minimized.

2.
Assign each data point x  = (E  , V  ) ∈ X  to its nearest curve   using the distance measure   .In case of equality of the nearest centroids, the data point is arbitrarily assigned to one of the corresponding clusters.

3.
Recalculate the centroid of each cluster as the mean of all its constituent data points and adjust the corresponding curve to pass through the new centroid.

4.
Repeat the steps 2 and 3 with the new curves until a termination criterion is reached.
below, our computations suggest that the best accuracy of the confinement identification can be achieved by finding the correct set of confinement dimensionalities via the scaling method of Ref. [1] combined with physical insight and use that as an input for the clustering algorithm.The use of a CVI can thus be skipped if one is able to perform the scaling and has some information about the physics of the investigated structure.

Performance analysis in confinement classification
In this section, we compare the results obtained by our MBC algorithm and by the k-means++ algorithm with the results obtained by direct application of the scaling method described in Ref. [1] with reference supercell size  0 = 2.We do this by applying the methods to several inverse woodpile structures with different properties.The absence of the ground truth information for these problems complicates the analysis and therefore we attempt to somewhat alleviate this issue by visually inspecting the energy-density distribution  (r) of selected bands.Note, however, that this visual inspection is a qualitative tool and it may be difficult to precisely assign the confinement dimensionality of certain bands, this way.Fig. 5 shows the results of the classification of confinement dimensionality on an  = 3 supercell of a silicon inverse woodpile photonic crystal with the most widely studied parameter set, namely an unperturbed pore radius  = 0.24, and a defect pore radius  = 0.5, see Appendix A and Refs.[23,24,52].In this case, both clustering algorithms yield the same solution, which almost perfectly coincides with the direct scaling approach of [1].It is remarkable that all three methods independently agree on the presence of at least 9 bands with  = 3, between ω ≈ 0.52 and ω ≈ 0.57, instead of the 5 previously identified by qualitative guesswork by Refs.[23,24,52].Some of these bands have been shown to demonstrate "Cartesian light" or a 3D coupled cavity behavior [24] as also seen in recent experiments [60].The direct scaling approach of Ref. [1], however, identified several bands, mostly near ω ≈ 0.65, as  = 1 plane-confined modes, which is unphysical, since the structure does not contain any planar defects, but only line and point ones.Thanks to the fact that the clustering algorithms do not require a smaller reference supercell but work only with the larger investigated supercell, the clustering approach successfully avoids these unphysical results.In this case, we therefore conclude that the clustering approaches are superior to the direct scaling.It is important to stress that the scaling theory beyond the approach of Ref. [1] still remains valid and the clustering merely improves its accuracy for small supercells.
Fig. 6 shows the results of our confinement identification on a larger  = 4 supercell of the inverse woodpile photonic crystal with the same unperturbed pore size  = 0.24, and defect pore size  = 0.5 as in the previous paragraph.Note that all approaches agree that a higher number of  = 3 bands is present here than for the  = 3 supercell with the same pore sizes.This clearly illustrates the problem with small supercells: the localization length of some confined bands is too large to be identifiable in the small supercells and hence these confined bands are only apparent with larger supercells.In this case, the clustering approach again avoids the unphysical plane-confined bands ( = 1) obtained by the direct scaling application near ω ≈ 0.65.The k-means++ algorithm classifies the  = 2 and the  = 3 bands in a very similar way to the direct scaling approach.The direct scaling approach, however, contains a unique red ( = 3) band at ω ≈ 0.61 and several green ( = 2) bands around ω ≈ 0.58, within the area filled with blue  = 2 bands.The red  = 3 band seems to be degenerate with a blue  = 2 band, which seems physically implausible and has been already observed by Ref. [1] for a crystal structure with different pore sizes.The more robustly grouped result of the clustering approach thus seem to be more physically plausible also in this case.
The MBC approach offers another remarkable insight, namely, it assigns the green  = 1 bands from the scaling approach to be  = 2 bands.Even by the visual inspection of energy density distribution  (r) it is hard to judge if the bands in question are  = 2 or  = 0 as concluded by the MBC and the k-means++ algorithm, respectively.There does not seem to be a significant qualitative difference between the energy density distributions for the bands around ω ≈ 0.65 and the bands around ω ≈ 0.67, when visually compared.All of them have  (r) extended throughout the whole supercell, but also with certain peaks at the defect positions.Those peaks are, nevertheless, noticeably lower than the bands between ω ≈ 0.52 and ω ≈ 0.60.The correct classification of the bands above ω ≈ 0.65 thus remains an open question at this time.Fig. 7 shows the results of confinement identification on an  = 3 supercell of the inverse woodpile photonic crystal with different structural parameters, namely the same unperturbed pore size  = 0.24, but increased defect pore size  = 0.8.The results of the clustering algorithms are almost identical, here, again avoiding unphysical cases.The direct scaling approach finds two degenerate unphysical  = 1 bands below the band gap at ω ≈ 0.47.However, visual inspection of the energy density indicates that these two bands are unconfined ( = 0), which was concluded by the clustering algorithms.There are several other bands, between ω ≈ 0.62 and ω ≈ 0.67, that the scaling approach identifies as having  = 1.Some of these bands are assigned to have  = 2 (among the other blue bands), while others  = 0 (above the other black bands) by the clustering algorithms, which seems a lot more consistent.Thus, also in this case, the clustering performs better than the direct scaling approach.
Fig. 8 shows the results of confinement identification on an  = 3 supercell of the inverse woodpile photonic crystal with unperturbed pore size  = 0.15, and defect pore size  = 0.5.In this parameter set, the main pores are smaller than optimal, a situation typical of experiments [16,61,62].In this case, the results obtained by both clustering algorithms almost coincide again.The direct scaling approach again shows several unphysical green  = 1 bands.Some of these bands are classified by the clustering algorithms as  = 2 and some as  = 0. Visual inspection of the energy density indicates that the  = 1 bands around ω ≈ 0.41, resulting from the direct scaling approach, are in fact  = 0.This has been also identified by the clustering algorithms.We thus again conclude that the clustering algorithms outperform the direct scaling approach.The black  = 0 band at ω = 0.39, resulting from the scaling and the MBC algorithm has similar energy-density distribution properties to its neighboring bands and therefore it looks like it should have been also labelled as  = 1, which has been done only by the k-means++ algorithm.
Fig. 9 shows the results of confinement identification on an  = 3 supercell of the inverse woodpile photonic crystal with large unperturbed pores with  = 0.29, and defect pores with  = 0.5.This is a case where the direct scaling application overcomes the clustering.Visual investigation of the energy-density distribution  (r) clearly shows that there are  = 3 bands present in this system, such as those around ω ≈ 0.77.The clustering algorithms with the help of DB* as a CVI thus clearly failed to identify the correct number of clusters here.
To exclude possible errors in CVI performance, we now eliminate the need for finding the correct number of clusters and confinement dimensionalities by explicitly imposing these values based on the results of the direct scaling.Fig. 10 shows the clustering classification where we have explicitly imposed the number of clusters  = 3 for the k-means++ algorithm and the possible confinement dimensionalities  = 0, 2, 3 for the MBC.Now, both clustering algorithms identify  = 3 bands similarly to the scaling algorithm.Here, we see much better agreement regarding the  = 3 bands among all three approaches.The clustering algorithms obviously avoid the unphysical  = 1 bands, thanks to the imposing of the number of clusters and the confinement dimensionalities before the analysis.In this case, there is a remarkable difference between the line-confined  = 2 bands identified by the k-means++ and the MBC algorithms.The k-means++ algorithm finds a large number of these bands, especially also below the band gap of an unperturbed structure ω < 0.72, which seems physically highly implausible.Indeed, visual inspection of the energy density of some of these bands indicates that they are  = 0. Note that MBC also classifies a few bands below the gap as  = 2.To our surprise, the visual inspection of the energy density for some of these bands shows that, despite not being truly spatially confined, these bands do exhibit some confinement patterns at the defect cross-sections.Therefore, MBC outperforms the other approaches here.
In this section, we investigated the confinement-classification performance of MBC, k-means++ and the direct scaling application methods on small supercells of various inverse woodpile structures.The clustering algorithms clearly outperform the direct scaling approach if the implemented CVI identifies the correct number of clusters to use as the input to the algorithms.The performance of the two clustering algorithms is in most cases very similar.Nevertheless, the MBC algorithm takes into account the underlying physics behind the clustering, which may prove beneficial in some cases, such as the one illustrated in Fig. 10.Moreover, MBC has an additional advantage of immediately assigning the clusters to their corresponding confinement dimensionalities, offering an additional advantage over the k-means++ algorithm where one has to assign the dimensionalities to the clusters manually.Finally, to avoid errors in cases when the CVI fails, we find it valuable to first directly employ the scaling analysis of Ref. [1] to identify the correct set of physically plausible confinement dimensionalities and then use those values as input for refinement by the MBC algorithm.

Conclusion
In this paper we investigate the application of clustering algorithms, which are particular techniques of unsupervised machine learning, to classify the confinement dimensionality of bands of confined states in photonic band gap superlattices.We find that the use of clustering algorithms increases the accuracy of band confinement classification for small supercells.To overcome the lack of a ground truth, we employ cluster validity indices as a measure of partition correctness and as a way to find the correct number of clusters.We analyze several CVIs and find that the most suitable validity measure for our application is the DB* variant of the Davies-Bouldin index.
We further propose two different algorithms to be used for band confinement identification: the k-means++ and our own model-based clustering (MBC).We analyze the performance of these two algorithms in comparison to the direct application of scaling without clustering and find that the addition of clustering improves the accuracy of identification if the number of clusters is correctly found.Nevertheless, we also discover that even with the use of the CVI, the clustering algorithms are not always able to find the correct number of clusters.To overcome this issue, we propose to first directly apply scaling to find the number of physically valid clusters and then use that number as an input for the clustering algorithms to find the best band confinement classification.
Even though, as we have shown, both the k-means++ and our MBC algorithm usually perform well, in some cases the MBC algorithm is able to provide better results than the k-means++ algorithm.Combined with the fact that MBC assigns confinement dimensionalities to the clusters inherently and automatically without any need for external input, we suggest the use of this algorithm over the k-means++ algorithm In the future, it would be beneficial to devise a model-based CVI, measuring the cluster validity not only in terms of cohesion and separation but also with respect to the adherence to the scaling theory of confinement, in a similar manner as our model-based clustering algorithm works.Furthermore, it would be interesting to explore the application of fuzzy clustering in order to include an error measure of assigning each point to its cluster [43].

Funding
This research is supported by the NWO-CSER program, project "Understanding the absorption of interfering light for improved solar cell efficiency" under the Poject No. 680.93.14CSER035; the NWO-JCER program, project "Accurate and Efficient Computation of the Optical Properties of Nanostructures for Improved Photovoltaics" under the Project No. 680-91-084; the NWO-GROOT program, project "Self-Assembled Icosahedral Photonic Quasicrystals with a Band Gap for Visible Light" under the Project No. OCENW.GROOT.2019.071; the NWOTTW program P15-36 "Free-Form Scattering Optics" (FFSO); and the MESA+ Institute for Nanotechnology, section Applied Nanophotonics (ANP).

Disclosures
The authors declare no conflicts of interest.

A. The inverse woodpile structure
The 3D inverse woodpile photonic band gap crystal consists of bulk silicon ( = 12.1, see Ref. [63]), in which nanopores of radius  filled with air ( = 1) are etched [51,61].We model the unperturbed crystal using a tetragonal unit cell with lattice parameters  in the  direction and  = / √ 2 in the  and  directions.This unperturbed structure is depicted in Fig. 11(a).A defect can be incorporated in the structure by altering the radius  of two proximate nanopores, as illustrated in Fig. 11(b) [23].This results in the creation of a region with excess of either air ( > ) or silicon ( < ) at the intersection of the two defect pores, serving as a point defect.We include one defect of this type per supercell.

B. Overview of the studied CVIs
Here, the mathematical definition for each CVI is provided along with some intuitive explanation of each formula, mainly in terms of cluster cohesion and cluster separation.For a comprehensive review, see Ref. [56].We denote the centroid of the dataset X containing  data points as X = 1  x  ∈X   .

B.A. Silhouette
The Silhouette index [58] is defined as and is aimed to be maximized for the best clustering outcome.Here, represents the cohesion of the cluster   given as a distance between a chosen point x  ∈   and all other points in the given cluster.Clearly, the smaller  is, the better the cluster cohesion.Furthermore, the function represents the separation of the cluster in terms of nearest neighbor distance from the other clusters   ,  ≠  to the point x  ∈   .The larger  is, the further apart different clusters are from each other and thus the better the cluster separation is.

B.B. Calinski-Harabasz
The Calinski-Harabasz index [64] is defined as A high value of CH is assumed to correspond to better data clustering.Intuitively, the distance between the global centroid X and the cluster centroids c  in the numerator acts as a cluster separation measure, with larger number corresponding to better separation.The cluster cohesion is represented by the denominator as the sum of distances between the cluster centroids and their constituent data points.Smaller distances between the cluster centroid and the cluster data obviously correspond to a more cohesive cluster.For a good clustering, one thus aims for maximizing the numerator and minimizing the denominator of CH, thus maximizing its overall value.

B.C. Davies-Bouldin
The Davies-Bouldin index [59] where DB is expected to decrease with better partioning.
Here, the cohesion of the cluster   is gauged by the function (  ) in the numerator, as the sum of the intracluster distances, i.e., the sum of the distance of each data point assigned to the cluster   to the cluster centroid.The separation is measured in the denominator simply as the distance between the cluster centroids.A good clustering result will have a small numerator and a large denominator, thus resulting in a small value of DB.

B.D. Davies-Bouldin*
This variant on the Davies-Bouldin algorithm, described in Ref. [65], is defined as max   ∈ C\  (  ) + (  ) and is expected to attain low values for good clustering outcomes.The fact that the standard DB minimizes the maximum of the ratio of cluster cohesion and cluster separation can lead to pathological cases, as described by Ref. [65].To remedy this, DB* instead maximizes the cluster cohesion and minimizes the cluster separation independently.

B.E. COP
The COP index [66] is defined as  (25) and is expected to be small for good clustering results.Here, the cluster cohesion is measured in the numerator as the average of intracluster distances and the cluster separation is measured by minimizing the furthest-neighbor distance to a given cluster in the denominator.One aims to minimize the numerator and maximize the denominator, thus maximizing the value of COP.We note here that, due to the furthest-neighbor distance measure, long and thin cluster can possibly skew this measure by overestimating the actual separation between the clusters.

B.F. S_Dbw index
The S_Dbw index [67] is defined as and is expected to attain small values for good partitions.Here, (X) is the variance of each component of the data set X. Its -th component is defined as Furthermore, (C) is the standard deviation of the partition C, defined as We further define the terms (  ) ∑︁ and (  ,   ) ∑︁ where  (x  ,   ) The term (  ) evaluates the number of points in a cluster   within the standard deviation of the partition and the term (  ,   ) evaluates the number of points at the midpoint between the centers of   and   .A good clustering corresponds to large (  ), (  ) and small (  ,   ) for each ,  ≤ ,  ≠ , which translates into a small value of the second term in (26).The first term in (26) is termed intracluster variance and is a measure of the cluster cohesion in terms of a mean cluster variance.The first term in Eq. ( 26) thus corresponds to the cluster cohesion and the second term represents the cluster separation [67].Unlike all the other CVIs described in this paper, S_Dbw relates these two terms by addition, instead of a ratio.Overall S_Dbw aims to minimize both of these terms for a good clustering result.

Figure 1 .
Figure 1.Illustration of 3D supercells containing various defects that contain waves.Blue spheres represent unperturbed unit cells, red spheres correspond to defect unit cells and confined waves are depicted in yellow.(a) Supercell containing a point defect,  = 3.(b) Supercell containing a line defect,  = 2. (c) Supercell containing a plane defect,  = 1.(d) Supercell with no defects, where all unit cells are identical, does not support confined waves,  = 0.

Figure 3 .
Figure 3. CVI values as a function of the number of clusters .The arrows in the legend indicate if the optimum of a specific CVI is a minimum (down arrow) or a maximum (up arrow).The suitable CVI should attain its respective optimum at the correct number of clusters  = 3, highlighted with the dotted vertical line.Note that for  = 1 the CVIs are undefined.(a)  = 2 supercell.(b)  = 3 supercell.(c)  = 4 supercell.

Figure 4 .
Figure 4. CVI values as a function of the supercell size  for the correct number of clusters  = 3.The arrows in the legend indicate if the optimum of a specific CVI is a minimum (down arrow) or a maximum (up arrow).The suitable CVI is expected to decrease (increase) with growing , if its optimum is a minimum (maximum).

Figure 5 .
Figure 5. Band structure of an  = 3 supercell of an inverse woodpile photonic crystal (see Appendix A) with parameters  = 0.24,  = 0.5, with bands color-coded to indicate the confinement dimensionalities  of each band.Red color corresponds to point-confined  = 3, blue to line-confined  = 2, green to plane-confined  = 1 and black to extended  = 0.The vertical axis is given in the reduced frequency ω = /(2), with  being the lattice constant and  the speed of light.The left panel shows the confinement classification for the MBC clustering, the central panel for the k-means++ clustering, and the right panel for the direct application of the scaling approach from Ref. [1].

Figure 6 .
Figure 6.Band structure of an  = 4 supercell of an inverse woodpile photonic crystal (see Appendix A) with parameters  = 0.24,  = 0.5, with bands color-coded to indicate the confinement dimensionalities  of each band.Red color corresponds to point-confined  = 3, blue to line-confined  = 2, green to plane-confined  = 1 and black to extended  = 0.The left panel shows the confinement classification for the MBC clustering, the central panel for the k-means++ clustering, and the right panel for the direct application of the scaling approach from Ref. [1].

Figure 7 .
Figure 7. Band structure of an  = 3 supercell of an inverse woodpile photonic crystal (see Appendix A) with parameters  = 0.24,  = 0.8, with bands color-coded to indicate the confinement dimensionalities  of each band.Red color corresponds to point-confined  = 3, blue to line-confined  = 2, green to plane-confined  = 1 and black to extended  = 0.The left panel shows the confinement classification for the MBC clustering, the central panel for the k-means++ clustering, and the right panel for the direct application of the scaling approach from Ref. [1].

Figure 8 .
Figure 8. Band structure of an  = 3 supercell of an inverse woodpile photonic crystal (see Appendix A) with parameters  = 0.15,  = 0.5, with bands color-coded to indicate the confinement dimensionalities  of each band.Red color corresponds to point-confined  = 3, blue to line-confined  = 2, green to plane-confined  = 1 and black to extended  = 0.The left panel shows the confinement classification for the MBC clustering, the central panel for the k-means++ clustering, and the right panel for the direct application of the scaling approach from Ref. [1].

Figure 9 .
Figure 9. Band structure of an  = 3 supercell of an inverse woodpile photonic crystal (see Appendix A) with parameters  = 0.29,  = 0.5, with bands color-coded to indicate the confinement dimensionalities  of each band.Red color corresponds to point-confined  = 3, blue to line-confined  = 2, green to plane-confined  = 1 and black to extended  = 0.The left panel shows the confinement classification for the MBC clustering, the central panel for the k-means++ clustering, and the right panel for the direct application of the scaling approach from Ref. [1].

Figure 10 .
Figure 10.Band structure of an  = 3 supercell of an inverse woodpile photonic crystal (see Appendix A) with parameters  = 0.29,  = 0.5, with bands color-coded to indicate the confinement dimensionalities  of each band.Red color corresponds to point-confined  = 3, blue to line-confined  = 2, green to plane-confined  = 1 and black to extended  = 0.The left panel shows the confinement classification for the MBC clustering, the central panel for the k-means++ clustering, and the right panel for the direct application of the scaling approach from Ref. [1].Here, we have explicitly chosen the number of cluster for the k-means++ algorithm and the clustering curves for the MBC, based on the direct scaling results.

Figure 11 .
Figure 11.(a) Structure of an unperturbed inverse woodpile photonic crystal.We use a tetragonal unit cell with lattice parameters  in the  direction and  = / √ 2 in the  and  directions.The pore radius is denoted by .The figure shows an  = 2 supercell, with one unit cell designated by the dashed border.(b) Design of the defect.The radius  of two proximate defect pores (shown in green) is altered.At their intersection a region with excess of one material is created that serves as a point defect (red glow).