Similarity Statistics for Clusterability Analysis with the Application of Cell Formation Problem

This paper proposes the use of the statistics of similarity values to evaluate the clusterability or structuredness associated with a cell formation (CF) problem. Typically, the structuredness of a CF solution cannot be known until the CF problem is solved. In this context, this paper investigates the similarity statistics of machine pairs to estimate the potential structuredness of a given CF problem without solving it. One key observation is that a well-structured CF solution matrix has a relatively high percentage of high-similarity machine pairs. Then, histograms are used as a statistical tool to study the statistical distributions of similarity values. This study leads to the development of the U-shape criteria and the criterion based on the Kolmogorov-Smirnov test. Accordingly, a procedure is developed to classify whether an input CF problem can potentially lead to a well-structured or ill-structured CF matrix. In the numerical study, 20 matrices were initially used to determine the threshold values of the criteria, and 40 additional matrices were used to verify the results. Further, these matrix examples show that a genetic algorithm cannot effectively improve the well-structured CF solutions (of high grouping efficacy values) that are obtained by hierarchical clustering (as one type of heuristic). This result supports the relevance of similarity statistics to pre-examine an input CF problem instance and suggest a proper solution approach for problem solving.


Introduction
The research in this paper lies at the crossroads of manufacturing systems and computer science. Based on our disciplinary background, we initially studied the cell formation (CF) problem, which seeks to cluster similar machines and parts to support mass customization [1]. In other words, a CF problem is a two-mode clustering problem [2]. Due to the NP-hard nature of the CF problem [3], many algorithms, including exact, metaheuristic, and heuristic approaches, have been proposed (to be discussed in Section 2.2.3). In our study of hierarchical clustering (abbreviated as HC, classified as a greedy-based heuristic approach), we observed that although HC is not the most powerful approach in searching for near-optimal solutions, it can yield satisfactory results comparable to some powerful metaheuristic approaches (e.g., genetic algorithms) for "well-structured" solutions. In this context, this research investigates the conditions based on the statistics of similarity values to estimate the potential structuredness of a given CF problem without solving it.
In the domain of computer science, the notion of structuredness somehow corresponds to the clusterability concept [4]. Intuitively, clusterability can be interpreted as a measure of an "intrinsic structure" of a dataset to be clustered [5]. Computer scientists have observed that a dataset of good clusterability can be clustered quite effectively (i.e., less impact from the NP-hard nature of the clustering problem). This observation has been summarized in a statement that "clustering is difficult only when it does not matter" (abbreviated as the CDNM thesis) [4,6].
Though developed independently, we want to acknowledge that our approach of evaluating the structuredness criteria is similar to the statistical approach by Ackerman et al. [7]. The difference lies in our application's focus on the CF problem, while Ackerman et al. [7] have focused on the relatively high-level development for clustering tasks. This difference explains our use of similarity measures (instead of distances) in statistical analysis since they are common for the CF problem and allow for some normalization in setting the structuredness criteria. Further, our work numerically checks the relations between structuredness criteria and the solution quality by two different clustering approaches (i.e., HC and GA).
Notably, this paper was extended from our conference paper [8] with the improvement of the techniques (e.g., the threshold setting and the normalization approach). Also, additional numerical examples have been used in the evaluation.
The rest of this paper is organized as follows. Section 2 will overview the CF problem and discuss the three properties of a well-structured CF solution in order to clarify the logical relation of similarity statistics. Section 3 will introduce the histogram analysis of similarity values and develop the U-shape criteria. Section 4 will introduce the Kolmogorov-Smirnov (K-S) test, which is used to develop another criterion to inform the matrix's structuredness. Section 5 will discuss the procedure that applies the developed criteria to classify well-structured and ill-structured matrices. Section 6 will examine the structuredness criteria via numerical examples, which are also used to check the effectiveness of metaheuristics via a two-stage solution process. Section 7 will conclude this paper.

Problem Introduction.
In the design of a cellular manufacturing system, one early and important decision is the formation of machine groups and part families, which is often referred to as the cell formation (CF) problem. A simple CF problem can be compactly captured by a machine-part incidence matrix. Let $M = \{m_i\}$ (for $i = 1$ to $m$) be the set of machines and $P = \{p_j\}$ (for $j = 1$ to $n$) be the set of parts. Then, an incidence matrix, denoted as $B = [b_{ij}]$, indicates whether machine $m_i$ is required to produce part $p_j$ (if so, $b_{ij} = 1$; otherwise, $b_{ij} = 0$). After solving the CF problem, the matrix's rows and columns can be reordered to reveal which subset of machines (i.e., a machine group) is highly related to which subset of parts (i.e., a part family).
When incidence matrices are used to represent CF solutions (i.e., block-diagonal matrices), they can be roughly classified into two types: well-structured and ill-structured matrices [2,9]. As illustrated in Figure 1, a well-structured matrix has few nonzero matrix entries outside the blocks (defined as exceptional elements) and few zero matrix entries inside the blocks (defined as voids). Precisely, exceptional elements are the matrix entries of $b_{ij} = 1$ with $m_i$ and $p_j$ in different cells, and voids are the matrix entries of $b_{ij} = 0$ with $m_i$ and $p_j$ in the same cell. The opposite conditions apply for an ill-structured matrix (i.e., a matrix solution with many exceptional elements and voids). A well-structured matrix implies that part families can be produced quite exclusively by some machine groups so that changes to a few part families will not adversely impact the production of other parts. This is one desirable feature of cellular manufacturing systems [1].

To quantify the structuredness of a CF matrix solution, we use the traditional grouping efficacy (denoted as $\mu$), which is formulated as follows [10].
$$\mu = \frac{n_1 - n_e}{n_1 + n_v} \tag{1}$$
where $n_1$, $n_e$, and $n_v$ are the total numbers of nonzero matrix entries, exceptional elements, and voids, respectively. In a perfect CF solution where $n_e = n_v = 0$, the grouping efficacy is equal to its maximum value, i.e., one. When there are more exceptional elements ($n_e$) and voids ($n_v$), the grouping efficacy value will become smaller.
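The grouping efficacy calculation can be illustrated with a minimal sketch. The function name, the cell-label lists, and the 4×4 toy matrix below are hypothetical and serve only as an illustration; they are not taken from the paper's benchmark data.

```python
def grouping_efficacy(matrix, machine_cells, part_cells):
    """Return (n1 - n_exceptional) / (n1 + n_voids) for a given cell assignment."""
    n_ones = 0         # total nonzero entries
    n_exceptional = 0  # ones whose machine and part lie in different cells
    n_voids = 0        # zeros whose machine and part lie in the same cell
    for i, row in enumerate(matrix):
        for j, entry in enumerate(row):
            same_cell = machine_cells[i] == part_cells[j]
            if entry == 1:
                n_ones += 1
                if not same_cell:
                    n_exceptional += 1
            elif same_cell:
                n_voids += 1
    return (n_ones - n_exceptional) / (n_ones + n_voids)


# Hypothetical toy solution: two cells, one exceptional element, one void.
B = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
machine_cells = [0, 0, 1, 1]  # cell label of each machine (row)
part_cells = [0, 0, 1, 1]     # cell label of each part (column)
efficacy = grouping_efficacy(B, machine_cells, part_cells)
```

In this toy case there are 8 nonzero entries, 1 exceptional element, and 1 void, so the efficacy is (8 - 1)/(8 + 1) = 7/9.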
Yet, not all incidence matrices can be converted to a well-structured matrix due to the original complex interdependency of the production requirements among machines and parts. This situation cannot be resolved by advanced optimization techniques as the root cause stems from the original inputs of the CF problem. However, we cannot practically know whether a given CF problem is going to have a well-structured matrix or not until we actually solve this problem. In this context, the purpose of this paper is to assess the structuredness of a given CF problem by analyzing the similarity of machines without actually solving it. In the traditional CF notion, two machines can be said to be similar if they are required mainly to produce a subset of common parts. In this work, the Jaccard similarity coefficient is applied [11,12]. Let $s_{ik}$ be the similarity value between machines $m_i$ and $m_k$. The formulation of the Jaccard similarity coefficient is provided below.
$$s_{ik} = \frac{a_{ik}}{a_{ik} + b_{ik} + c_{ik}} \tag{2}$$
where $a_{ik}$ is the number of parts that need both machines $m_i$ and $m_k$; $b_{ik}$ is the number of parts that need machine $m_i$ but not machine $m_k$; $c_{ik}$ is the number of parts that need machine $m_k$ but not machine $m_i$. Conceptually, the Jaccard similarity coefficient focuses on the number of common features (e.g., $a_{ik}$) that is normalized by the total number of relevant features (e.g., $a_{ik}$, $b_{ik}$, and $c_{ik}$). Notably, similarity is only evaluated for any two machines (i.e., a machine pair).
After specifying the notion of machine similarity, let us revisit the two examples in Figure 1. Each example has 30 machines, leading to 30×(30-1)/2 = 435 machine pairs. By examining the similarity of any two machines (or machine pairs), we find that the well-structured matrix has a higher number of machine pairs with high-similarity values. In the examples of Figure 1, we can get the following two statements concerning the statistics of the machine similarity values.
(i) Well-structured matrix: 81 (out of 435) machine pairs have similarity values higher than or equal to 0.80.
(ii) Ill-structured matrix: 4 (out of 435) machine pairs have similarity values higher than or equal to 0.50.
In this illustration, it is roughly identified that a well-structured matrix can have quite a different statistical distribution of machine similarity values as compared to an ill-structured matrix. This observation leads to an investigation question on the statistical conditions under which a well-structured matrix can be classified. This investigation is the focus of this paper. By knowing such statistical conditions, engineers designing cellular manufacturing systems can initially assess their production requirements via the statistics of machine similarity. If the statistical data shows unfavorable results (i.e., the chance of getting a well-structured matrix is low), they can either modify the production requirements (e.g., buy more machines) or seek other manufacturing systems. Such an initial assessment can save the effort of solving the CF problem. Also, this paper will show that a well-structured matrix can be satisfactorily obtained by some less time-consuming heuristics (where complex optimization methods may not bring additional benefits).
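The pairwise Jaccard similarity described above can be computed directly from the rows of an incidence matrix, without solving the CF problem. The following is a minimal sketch; the function names and the toy matrix are hypothetical, and returning 0 for two machines that process no parts at all is an assumption not stated in the paper.

```python
from itertools import combinations


def jaccard_similarity(row_i, row_k):
    """Jaccard coefficient a / (a + b + c) for two machine rows of 0/1 entries."""
    a = sum(1 for x, y in zip(row_i, row_k) if x == 1 and y == 1)  # parts needing both
    b = sum(1 for x, y in zip(row_i, row_k) if x == 1 and y == 0)  # first machine only
    c = sum(1 for x, y in zip(row_i, row_k) if x == 0 and y == 1)  # second machine only
    total = a + b + c
    return a / total if total else 0.0  # assumption: two idle machines get 0


def machine_pair_similarities(matrix):
    """Similarity values of all m*(m-1)/2 machine pairs, in row order."""
    return [jaccard_similarity(r1, r2) for r1, r2 in combinations(matrix, 2)]


# Hypothetical toy incidence matrix with 4 machines (rows) and 4 parts (columns).
B = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
similarities = machine_pair_similarities(B)
```

With 4 machines there are 6 machine pairs; the last pair (machines 3 and 4) has identical rows and therefore a similarity of 1, while machines 1 and 3 share no parts and have a similarity of 0.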

Properties of a Well-Structured CF Solution.
To investigate the statistical conditions of the structuredness of a CF solution, this section will discuss the three properties of a well-structured matrix. These three properties include (1) high grouping efficacy, (2) high percentage of high-similarity machine pairs, and (3) relative ease of obtaining satisfactory CF solutions. Afterward, a research plan will be discussed.

Property I: High Grouping Efficacy.
The original formulation of the grouping efficacy (GE) in (1) can be found in Kumar and Chandrasekharan [10], and it is intended to replace a weighted sum function with a simple ratio to assess the goodness of a CF solution (in a block-diagonal form). Since then, the GE measure has become popular in CF research (e.g., [9,13]). Despite its popularity, some researchers have criticized its "built-in weights" [14], where a lower number of voids (i.e., $n_v$) tends to give a better GE measure (as compared to exceptional elements (i.e., $n_e$)). Brusco [15] has commented that the nonlinearity of the GE measure poses a challenge for finding exact solutions for CF problems. As commented by Sarker and Mondal [16] in their survey paper, it is not easy to develop a standard measure that fits all CF problems. It is generally recognized that the GE measure is good at discerning the structuredness of matrix-based CF solutions [2]. Thus, we choose the GE measure in this study. Based on its definition, a well-structured matrix should have few exceptional elements and voids, leading to a high value of GE. While GE is effective in indicating the structuredness of a CF solution (high value → well-structured matrix), this value cannot be known until the CF problem is solved. Thus, in this research, GE is used as a verification measure to examine how well machine similarity can be related to the structuredness of a CF solution.

Property II: High Percentage of High-Similarity Machine Pairs.
Compared to the property of high grouping efficacy, it is less obvious that a well-structured matrix has a high percentage of high-similarity machine pairs. In view of the Jaccard similarity coefficient in (2), there are two types of factors used to assess the machine similarity. While $a_{ik}$ (i.e., the number of common parts) is taken as a commonality factor, both $b_{ik}$ and $c_{ik}$ (i.e., the number of parts processed in one machine but not the other) serve as differentiating factors to normalize the similarity measure. In turn, if the similarity value of two machines is high, $a_{ik}$ cannot be zero and the values of $b_{ik}$ and $c_{ik}$ should be small, implying not only commonality but also exclusiveness of these two machines to process their common parts. This feature can potentially lead to smaller numbers of voids and exceptional elements, leading to a well-structured matrix.
In the literature, the notion of similarity has been applied for many years to address the CF problem, and the Jaccard similarity coefficient is one of the early applications [11]. Since then, many similarity coefficients have been proposed, and comparison studies of similarity coefficients can be found in Sarker [17], Mosier et al. [18], and Yin and Yasuda [19]. Notably, similarity is a context-dependent concept: how similar two objects are judged to be depends on the application and the relevant information. In our investigation, we choose the Jaccard similarity coefficient because its notion of commonality and differentiating factors maps straightforwardly onto the simple CF application.
While similarity coefficients have been studied extensively for CF problems, the statistical distribution of similarity values of a CF problem has not, to our knowledge, been adequately investigated. Notably, these similarity values can be found without solving the CF problems. Then, if we know the relation between the statistical distribution of similarity values and the GE measure, we can use the statistical distribution of similarity values to assess the potential of yielding a well-structured matrix for a CF problem. This is the major aim of this paper.

Property III: Relative Ease of Obtaining Satisfactory CF Solutions.
At this point, we may wonder why it is important to know the potential of yielding a well-structured matrix before solving the CF problems. First of all, it has been recognized that a CF problem is an NP-hard problem [3], so it is unlikely that a practical algorithm can guarantee an exact solution for a moderate-size problem. As a result, the effort required to solve a CF problem is not trivial. In the literature, many metaheuristic algorithms have been proposed to solve the CF problems such as genetic algorithms [20,21] and simulated annealing [22,23]. Related comprehensive reviews can be found in Papaioannou and Wilson [24] and Renzi et al. [25]. While metaheuristic algorithms have the capacity to yield high-quality solutions, they generally require users to have good mathematical skills to understand these algorithms [26] and good experience to make some "implementation decisions" [15, p. 293] (e.g., terminating conditions in genetic algorithms).
In contrast to metaheuristic algorithms, heuristic algorithms are easier to implement, but the quality of their solutions is often questioned [27, p. 159]; [24]. In a nutshell, a common feature of heuristic algorithms is their greedy or hill-climbing approach, which commits to the best solution at each stage without backtracking to other solution possibilities. This feature allows them to converge to some feasible solutions quickly with the trade-off of checking a smaller solution space (thus, potentially weaker solution quality). Hierarchical clustering (HC), which was one early approach for CF problems [11], is one example of heuristic algorithms since HC always groups the object pairs with the highest similarity values progressively without backtracking.
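The greedy, no-backtracking character of HC can be illustrated with a minimal sketch that repeatedly merges the two most similar machine groups. The paper does not specify the linkage rule or stopping condition at this point, so the average linkage and the fixed target number of cells used here are assumptions, and all names and the toy matrix are hypothetical.

```python
from itertools import combinations


def jaccard(row_i, row_k):
    """Jaccard similarity of two 0/1 machine rows (0 if both rows are empty)."""
    both = sum(1 for x, y in zip(row_i, row_k) if x and y)
    either = sum(1 for x, y in zip(row_i, row_k) if x or y)
    return both / either if either else 0.0


def hierarchical_machine_groups(matrix, n_cells):
    """Greedy agglomerative grouping of machines (average linkage assumed)."""
    sim = {
        (i, k): jaccard(matrix[i], matrix[k])
        for i, k in combinations(range(len(matrix)), 2)
    }
    clusters = [[i] for i in range(len(matrix))]

    def linkage(c1, c2):
        # Average similarity over all cross-cluster machine pairs.
        pairs = [(min(i, k), max(i, k)) for i in c1 for k in c2]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(clusters) > n_cells:
        # Greedy step: merge the most similar pair of clusters; never undone later.
        p, q = max(
            combinations(range(len(clusters)), 2),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy incidence matrix: machines 0-1 and machines 2-3 share parts.
B = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
groups = hierarchical_machine_groups(B, n_cells=2)
```

Because each merge is final, the procedure examines only a small part of the solution space, which is the trade-off described above.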
As its third property, it is observed that a well-structured matrix can be obtained relatively easily by a heuristic approach (referred to HC specifically in this paper), where the metaheuristic approach does not necessarily have an advantage for getting higher-quality solutions. Alternately, the advantage of the metaheuristic approach is observed more often in the case of ill-structured matrices. As discussed before, a well-structured matrix demonstrates sharp differences between similar and dissimilar machine pairs. This feature supports the "greedy" nature of the heuristic approach, which can easily distinguish high-similarity pairs in the progressive grouping process. In contrast, an ill-structured matrix has more machine pairs with middle-similarity values so that some borderline cases can potentially lead to solutions of lower quality. While this third property may not be obvious, more verifying examples will be reported later in Section 6.3 as part of the investigation effort of this paper.
Given this third property of a well-structured matrix, the statistical analysis of similarity values can then lead to another application, i.e., supporting the choice of the algorithmic approach for solving CF problems. If the statistical analysis shows a high potential to obtain a well-structured matrix, we can choose a heuristic approach to solve the CF problems. Alternately, if it indicates a high chance of getting an ill-structured matrix, we may consider revising the input incidence matrix (e.g., adding more machines or changing some part requirements). Also, we can prepare to use the metaheuristic approach to seek for high-quality solutions. In sum, the statistical analysis can preliminarily probe the structure of a given CF problem in order to determine the next problem solving step.

Research Plan.
In view of the three properties of a well-structured matrix discussed above, the research and development questions are set as follows.
(i) What are the criteria related to the statistics of similarity values to assess the potential of getting a well-structured matrix?
(ii) How do we decide whether to use a metaheuristic or a heuristic approach for solving a CF problem?
To address the first question, this paper will utilize two statistical tools: the histogram and the Kolmogorov-Smirnov (K-S) test. Histograms will be used to analyze the distribution of machine similarity values of a given CF problem, and twenty CF solutions will be set to investigate the threshold values for informing the potential structuredness of a matrix. The K-S test will be used to assess the normality of the distribution of machine similarity values. That is, if the set of similarity values roughly follows the normal distribution, it means that many machine pairs have similarity values near the average, implying a low proportion of high-similarity values (i.e., an ill-structured matrix). Based on the investigation using the histogram and the K-S test, we will develop a procedure to probe the structure of a given CF matrix and suggest whether to use a metaheuristic or a heuristic for problem solving (i.e., to address the second question). In this paper, we have implemented a genetic algorithm (GA) and hierarchical clustering (HC) as the metaheuristic and heuristic approaches, respectively, for solving the CF problems. To verify the procedure, forty additional CF matrices will be set. These CF matrices will be solved by HC and then by the genetic algorithm to observe the relation between the matrix's structuredness and the utility of the metaheuristic approach for better CF solutions.

Histogram and the U-Shape.
In this study, histograms are used to report the frequency distribution of machine similarity values with an increment of 0.1. Figure 2 shows two histograms for the well-structured and ill-structured matrices of Figure 1, respectively. In these histograms, the horizontal axis stands for the machine similarity values ranging from 0 to 1, and the vertical axis stands for the number of machine pairs within those ranges of similarity values. Notably, these histograms are independent of the orders of a matrix's rows and columns. That is, we can get these histograms of similarity values without solving the CF problem.
From these two histograms, it is observed that a well-structured matrix tends to yield a U-shape histogram, i.e., relatively high numbers of extreme similarity values. The right peak of the U-shape can be explained by the property of a high percentage of high-similarity machine pairs discussed in Section 2.2.2. While the numbers of low-similarity machine pairs are high in both the well-structured and ill-structured matrices, a well-structured matrix has a low number of machine pairs with similarity values between 0.2 and 0.4. In contrast, an ill-structured matrix has a good number of those middle-similarity machine pairs, which makes clear grouping in cell formation challenging. Given this general U-shape observation, the next subsections will discuss the criteria that classify the structuredness of a matrix (i.e., well-structured or ill-structured) based on the histogram data.
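The histogram of similarity values with an increment of 0.1 can be tallied as in the following minimal sketch. The function name and the sample values are hypothetical, and placing a similarity of exactly 1 in the last bin is an assumed convention.

```python
def similarity_histogram(similarities, n_bins=10):
    """Count similarity values in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for s in similarities:
        # A value of exactly 1.0 is placed in the last bin (assumed convention).
        index = min(int(s * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical similarity values with mass at both extremes (a small "U-shape").
sample = [0.0, 0.0, 0.05, 0.35, 0.95, 1.0]
counts = similarity_histogram(sample)
```

Since the counts depend only on the set of pairwise similarity values, reordering the rows and columns of the incidence matrix leaves them unchanged.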

Setup of 20 Benchmark Matrices.
Since the frequency distribution of a histogram will not be altered by the orders of a matrix's rows and columns, we can set the CF solution matrices with known structuredness and then observe their histograms to develop the structuredness criteria. In this investigation, twenty 30×40 solution matrices (i.e., 30 machines and 40 parts) with three cells (or blocks) are set. The resulting 20 matrices are shown in Figure 3. On general inspection, the matrices in Cases I and II have clear boundaries of three cells. The matrices in Case III have more exceptional elements and voids, but their structures are still quite discernible. In contrast, the structure of the matrices in Case IV is messier, with higher numbers of exceptional elements and voids. Based on these matrices, the next subsection will investigate their histograms and develop the U-shape criteria to classify the matrix's structuredness.

Histogram-Based U-Shape Criteria.
To inform the matrix's structuredness, two conditions are set as the U-shape criteria for the low-similarity and high-similarity values. Let $F_l(x)$ be the fraction of similarity values that are lower than or equal to $x$ and $F_h(y)$ be the fraction of similarity values that are higher than or equal to $y$. Then, the general U-shape criteria can be expressed as follows.
$$F_l(x) \geq a \tag{3}$$
$$F_h(y) \geq b \tag{4}$$
where $a$ and $b$ are the thresholds of the minimum fractions of low-similarity and high-similarity values, respectively, to characterize the U-shape of a well-structured matrix. The setup of these parametric values (i.e., $x$, $y$, $a$, and $b$) will be based on the above 20 benchmark matrices. Figure 4 shows the histograms of the 20 benchmark matrices. As preliminary observations, the frequency distributions of these histograms appear quite different between the well-structured (i.e., Cases I, II, and III) and ill-structured matrices (i.e., Case IV). Yet, some U-shapes are not plainly obvious (e.g., Cases A-III and C-II), and the peaks of high-similarity values of the well-structured matrices are not always located at the rightmost region (e.g., Cases C-I and D-I). The U-shape criteria will then be set based on these observations.
Journal of Probability and Statistics
Concerning the region of low-similarity values (i.e., the left side of the U-shape), it is found that both well-structured (i.e., Cases I, II, and III) and ill-structured (i.e., Case IV) matrices have high proportions because many machines, as long as they are not in the same cell, have few common parts to work with in both cases. As a result, the proportions of low-similarity values from a well-structured matrix can become less discernible statistically. Thus, we choose to investigate the extreme case in which the similarity values are equal to zero, i.e., $F_l(x=0)$. Table 1 records the number of machine pairs with similarity values equal to zero. As observed, while the matrices of Cases II and III have low right-side peaks, they have high proportions of such zero-similarity machine pairs. As the U-shape criteria will be used for early screening, we set this criterion rather strictly as follows.
$$F_l(0) \geq 0.5 \tag{5}$$
This criterion requires at least 50% of machine pairs to have zero-similarity values for a matrix to qualify as well-structured. For the benchmark matrices with 30 machines (i.e., 435 machine pairs), the threshold is 218 machine pairs, and the matrices in Case II pass this criterion.
Concerning the region of high-similarity values (i.e., the right side of the U-shape), as discussed earlier, not all well-structured matrices have high proportions of high-similarity values at the rightmost region. By inspecting the histograms in Figure 4, we identify that a reasonable cut-off for high-similarity values is 0.5, i.e., $F_h(y=0.5)$. Table 2 records the number of machine pairs with similarity values greater than or equal to 0.5. As observed, the proportions of high-similarity values ($s \geq 0.5$) in Case IV (i.e., ill-structured matrices) are relatively low. In contrast, Case C-II is the well-structured matrix with the lowest number of high-similarity values (i.e., 91), and the corresponding fraction is 91/435 ≈ 0.21. As a result, another U-shape criterion for the right-hand side is set as follows.
$$F_h(0.5) \geq b \tag{6}$$
where the threshold $b$ is set from the lowest well-structured fraction observed above (about 0.21).
In sum, if an input incidence matrix satisfies one of the two U-shape criteria formulated in (5) and (6), this matrix has a good chance to yield a well-structured CF solution.
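The two U-shape checks can be sketched as a small screening function. The zero-similarity threshold of 0.5 and the high-similarity cut-off of 0.5 follow the text; the default right-side threshold of 0.2 is an assumption (a rounded-down value of the reported fraction 91/435 ≈ 0.21), and the function name and sample data are hypothetical.

```python
def u_shape_check(similarities, high_cutoff=0.5, zero_fraction_min=0.5,
                  high_fraction_min=0.2):
    """Screen a set of similarity values against the two U-shape criteria.

    high_fraction_min=0.2 is an assumed default, not a value confirmed by the text.
    """
    total = len(similarities)
    zero_fraction = sum(1 for s in similarities if s == 0) / total
    high_fraction = sum(1 for s in similarities if s >= high_cutoff) / total
    passes_left = zero_fraction >= zero_fraction_min
    passes_right = high_fraction >= high_fraction_min
    return {
        "zero_fraction": zero_fraction,
        "high_fraction": high_fraction,
        "passes_left": passes_left,
        "passes_right": passes_right,
        # Satisfying either criterion is enough to pass the preliminary filter.
        "passes_any": passes_left or passes_right,
    }


# Hypothetical set of 435 similarity values (30 machines): 218 zeros, 91 high values.
sample = [0.0] * 218 + [0.3] * 126 + [0.8] * 91
result = u_shape_check(sample)
```

For 435 machine pairs, 218 zero-similarity pairs give a fraction just above 0.5, which matches the threshold count mentioned in the text.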
Notably, we treat the histogram-based U-shape criteria as a preliminary filter in this work. That is, if a matrix does not satisfy these criteria, it does not immediately imply that this matrix is ill-structured. In fact, other parameters of an input incidence matrix, such as the number of machines and the density of nonzero matrix entries, can impact the frequency distribution of a histogram. Thus, the next section will develop another criterion based on the K-S test.

4.1. Background.
The Kolmogorov-Smirnov (K-S) test is one type of hypothesis testing in statistics (Corder and Foreman [28]). As one of its applications, the K-S test is used in this paper to evaluate how well a dataset represents a normal distribution (i.e., the normality of the dataset). The use of the K-S test in this study is mainly motivated by the observation of the histograms in Figure 2 that a well-structured matrix will tend to give a U-shape. As the U-shape will generally exhibit two peaks in the histogram representation, the normality of the associated data (i.e., similarity values) will be weak in comparison to that of an ill-structured matrix.
Figure 5 illustrates the concept of the normality of similarity values with two cases: a single-peak histogram and a U-shape histogram. The K-S test essentially compares the curves of two cumulative distribution functions (CDFs) [29,30]. While one CDF represents the empirical data points (i.e., empirical CDF, solid line), the other CDF is based on the normal distribution curve fitted to the empirical data (i.e., hypothesized normal CDF, dashed line). As seen in Figures 5(c) and 5(d), the single-peak histogram has higher normality than the U-shape histogram since the single-peak histogram yields a closer match between the empirical and hypothesized normal CDFs. In contrast, the U-shape histogram yields its empirical CDF in Figure 5(d) with rapid increases at the beginning and the end, along with a relatively flat region in the middle, and this CDF curve significantly deviates from normality [31].
The P value is a common concept in hypothesis testing [32]. It can be interpreted as the smallest probability value associated with a given dataset to reject the null hypothesis (i.e., smaller P value → more likely to reject the null hypothesis). In this work, we treat the P value of a K-S test as a proxy measure of the normality of a set of similarity values. That is, if the P value is smaller, the dataset tends to be less normal [33]. Interpreted in our context, a less-normal condition implies a U-shape and thus a well-structured matrix. For example, the P value of the single-peak histogram in Figure 5(c) is $7.44 \times 10^{-4}$, and the P value of the U-shape histogram in Figure 5(d) is $9.27 \times 10^{-22}$.
Notably, the purpose of using the K-S test in this work is not hypothesis testing itself; its P value is used only as a proxy measure to assess the normality of a set of similarity values and thereby inform the structuredness of a CF matrix. Yet, the P values in our applications tend to be very small. To handle this proxy measure conveniently, let $P$ be the P value of a set of similarity values based on the K-S test, and an alternative proxy measure (denoted as $L$) is defined as follows:
$$L = -\log_{10} P \tag{7}$$
As $L$ is the negative logarithm of the P value, a higher value of $L$ implies a higher tendency of the dataset to have a U-shape. For example, the values of $L$ for the single-peak histogram (i.e., Figure 5(c)) and the U-shape histogram (i.e., Figure 5(d)) are 3.13 and 21.03, respectively. In other words, if a CF matrix yields a higher value of $L$, it has a better chance to be solved as a well-structured CF solution.
Knowing this trend associated with $L$ leads to the next investigation question: how to set the threshold value of $L$ to classify ill-structured and well-structured matrices. To do so, it is recognized that the values of $L$ can be sensitive to the number of machines and the density of nonzero entries of a given matrix. Thus, the next subsection will investigate the upper bound of $L$ of a given matrix to normalize the value of $L$. Then, we will apply the 20 benchmark matrices in Figure 3 to determine the threshold.
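The proxy measure can be illustrated with a minimal, self-contained sketch of a K-S normality check. The paper does not state which K-S implementation was used, so this sketch makes its own assumptions: the normal distribution is fitted with the sample mean and population standard deviation, and the P value comes from the asymptotic Kolmogorov series. All function names and sample data are hypothetical, and a library routine such as SciPy's `scipy.stats.kstest` may give somewhat different P values.

```python
import math


def normal_cdf(x, mean, std):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_normality_p_value(values):
    """Approximate K-S P value against a normal CDF fitted to the data."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return 0.0  # degenerate sample: treated as maximally non-normal (assumption)
    d_stat = 0.0
    for rank, v in enumerate(sorted(values), start=1):
        cdf = normal_cdf(v, mean, std)
        # Largest gap between the empirical step function and the fitted CDF.
        d_stat = max(d_stat, rank / n - cdf, cdf - (rank - 1) / n)
    lam = math.sqrt(n) * d_stat
    # Asymptotic Kolmogorov series: p = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2).
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def negative_log_p(p_value):
    """Proxy measure: negative base-10 logarithm of the P value."""
    return math.inf if p_value == 0 else -math.log10(p_value)


# Hypothetical samples: a two-peak ("U-shape") set and a single-peak set.
u_shape = [0.0] * 50 + [1.0] * 50
single_peak = [0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.5, 0.5, 0.6]
l_u_shape = negative_log_p(ks_normality_p_value(u_shape))
l_single_peak = negative_log_p(ks_normality_p_value(single_peak))
```

Applying `negative_log_p` to the two P values reported in the text, 7.44×10^-4 and 9.27×10^-22, reproduces the stated proxy values of 3.13 and 21.03 after rounding to two decimals.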

Estimate the Upper Bound of L for Normalization.
The upper bound of $L$ can be estimated by a perfect block-diagonal matrix, where the numbers of exceptional elements ($n_e$) and voids ($n_v$) are zero (i.e., the grouping efficacy = 1). In this case, the machine pairs have similarity values equal to either one (when two machines belong to the same block) or zero (when two machines are in different blocks). This kind of "bipolar" distribution can be viewed as a far extreme from the normal distribution, and the value of $L$ obtained from the corresponding P value can be taken as the upper bound of $L$.
In the normalization process, we can first identify the size and the number of nonzero entries of a given matrix. Let $m$ and $n$ be the numbers of machines and parts, respectively, as the size of the matrix. The number of nonzero matrix entries has been denoted as $n_1$. Then, the density of nonzero entries of a matrix (denoted as $D$) can be determined as follows.
$$D = \frac{n_1}{m \times n} \tag{8}$$
Given an incidence matrix, its upper bound of $L$ can be considered as the case in which its nonzero entries can be freely moved to form a nearly perfect block-diagonal matrix. By fixing the values of $m$, $n$, and $D$, there can be a corresponding theoretical upper bound of $L$. Let $L_{ub}$ denote such an upper bound of $L$ of a given matrix. Then, for any given matrix, we can determine its $L$ and $L_{ub}$, where $L_{ub}$ is treated as a normalizing factor. Since this paper focuses on machine similarity, we drop the consideration of $n$ to simplify the investigation. Then, the next step is to determine the following function.
$$L_{ub} = f(m, D) \tag{9}$$
To estimate the function for $L_{ub}$, our strategy is to systematically generate a good number of perfect block-diagonal matrices by varying the numbers of machines, parts, and even-size cells (note: the number of even-size cells will determine the number of nonzero entries). The ranges of these varying parameters in this work are listed as follows.
(i) Number of machines: from 10 to 50 machines
(ii) Number of parts: from 10 to 110 parts (with an increment of 10)
(iii) Number of even-size cells: from 2 to 14 cells (also restricted by the matrix's size to avoid extremely large and small cells)
Further details of the setup of these perfect matrices can be found in Zhu [34]. As a result, this work has generated 2519 perfect matrices. Then, the P value and the value of L are determined for each of these matrices, giving 2519 points to approximate the function formulated in (9); the resulting approximation is referred to as (10).
In practice, for a given matrix, we can determine the value of L via (7) and its upper bound via (10). Then, we can check the ratio of L to its upper bound to examine the U-shapeness and, in turn, the possible structuredness of the matrix. The next subsection will discuss the criterion based on this ratio.
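To make the setup of the perfect matrices concrete, the following is a minimal sketch that builds one perfect block-diagonal matrix with near-even cells, computes its density of nonzero entries, and checks that the machine-pair similarities are bipolar. It is not the generator documented in Zhu [34]. The Jaccard coefficient is assumed here as the machine similarity measure, and the helper names (`perfect_block_diagonal`, `density`, `jaccard`, `pair_similarities`) are hypothetical.

```python
from itertools import combinations


def perfect_block_diagonal(m, n, cells):
    """Build an m x n 0/1 matrix with `cells` diagonal blocks of near-even size."""
    def boundaries(total):
        # Cut points that spread `total` items over the cells as evenly as possible.
        return [k * total // cells for k in range(cells + 1)]

    rows, cols = boundaries(m), boundaries(n)
    matrix = [[0] * n for _ in range(m)]
    for c in range(cells):
        for i in range(rows[c], rows[c + 1]):
            for j in range(cols[c], cols[c + 1]):
                matrix[i][j] = 1
    return matrix


def density(matrix):
    """Number of nonzero entries divided by the total number of entries."""
    m, n = len(matrix), len(matrix[0])
    return sum(map(sum, matrix)) / (m * n)


def jaccard(row_a, row_b):
    """Assumed similarity: shared parts divided by parts visited by either machine."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return both / either if either else 0.0


def pair_similarities(matrix):
    """Similarity values of all m(m-1)/2 machine pairs (rows are machines)."""
    return [jaccard(a, b) for a, b in combinations(matrix, 2)]


example = perfect_block_diagonal(10, 20, 2)
similarities = pair_similarities(example)
```

For this 10×20 example with two cells, the density is 0.5 and every machine pair has a similarity of either 0 or 1, which is the bipolar distribution used to estimate the upper bound of L.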

Ratio Criterion Based on L and Its Upper Bound.
The ratio threshold between L and its upper bound is set based on the 20 benchmark matrices in Figure 3. The values of L, its upper bound, and their ratios are recorded in Table 3. Recall that Cases I, II, and III are set to represent the well-structured matrices, and Case IV represents ill-structured matrices. As an initial assessment, the average of the ratios of Cases I, II, and III (i.e., well-structured matrices) is 0.48, while the ratio average of Case IV is 0.07. This observation indicates that, on average, the ratio of L to its upper bound distinguishes well-structured and ill-structured matrices quite effectively. Yet, when we examine the extreme situations, the lowest ratio of the well-structured cases is 0.17 (i.e., Case D-I, bold in Table 3), and the highest ratio of the ill-structured cases is 0.15 (i.e., Case A-IV, also bold in Table 3). As observed, the gap between the two is small, and we intend to impose a tight criterion to classify well-structured matrices. As a result, we set the threshold value at 0.2, formulated as follows.

$$\frac{L}{\text{upper bound of } L} \geq 0.2$$

... the earlier U-shape criteria. Thus, our next step is to combine the U-shape criteria and the ratio criterion in a procedure to examine the potential structuredness of an incidence matrix. That is, if a given matrix satisfies one of these criteria, this matrix is indicated to have a high potential to yield a well-structured CF solution. The next section will discuss this procedure to apply these criteria to inform the potential structuredness of a given matrix.

Procedure
This section provides a four-step procedure below to assess the potential structuredness of an incidence matrix using the histogram-based U-shape criteria and the criterion based on the P value of the K-S test. Figure 6 illustrates the decision branches of this procedure.
Step 1 (construct histogram). By receiving an incidence matrix as an input, the similarity values of machine pairs are first determined based on (2). If there are m machines, there will be m×(m-1)/2 machine pairs with their similarity values, forming the dataset of the statistical analysis. A histogram is then constructed to analyze these similarity values.
Step 2 (apply the histogram-based U-shape criteria). This represents the preliminary check based on the frequencies of high-similarity and low-similarity values. If either of the criteria F (0) ≥ 0.5 or F_h(0.5) ≥ 0.2 is satisfied, the incidence matrix is considered to have a good potential to yield a well-structured CF solution. If neither criterion is satisfied, we move on to the analysis based on the P value of the K-S test.
Step 3 (compute L and its upper bound). The dataset of similarity values is treated as the input to determine the P value of the K-S test in view of assessing the normality of the dataset. This calculation can be performed with statistics software tools. In this work, we have used the statistics functions from Matlab to compute the P value. Then, the value of L can be evaluated using (7). With the incidence matrix, the upper bound of L can be evaluated using (10) by identifying the number of machines (i.e., m) and the density of nonzero entries (i.e., D).
Step 4 (apply the ratio criterion). With the values of L and its upper bound, we can check whether the ratio of L to its upper bound is at least 0.2. If this criterion is satisfied, the input matrix should have a good potential to yield a well-structured CF solution. If not, the input matrix is likely to result in an ill-structured CF solution. In that case, practitioners may consider modifying the input matrix by adding machines or revising the production requirements.
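The decision branches of the four steps can be sketched as a single function. This is a minimal illustration under stated assumptions, not the authors' implementation: the low-similarity frequency is taken as the fraction of machine pairs with similarity equal to 0, the high-similarity frequency as the fraction with similarity of at least 0.5, and the values of L and its upper bound are assumed to be computed beforehand. The names `assess_structuredness`, `l_value`, and `l_upper` are hypothetical.

```python
def assess_structuredness(similarities, l_value, l_upper,
                          low_threshold=0.5, high_threshold=0.2,
                          ratio_threshold=0.2):
    """Return (is_well_structured, reason) following the four-step procedure."""
    total = len(similarities)
    # Step 2: histogram-based U-shape criteria (frequency definitions are assumed).
    f_zero = sum(1 for s in similarities if s == 0) / total
    f_high = sum(1 for s in similarities if s >= 0.5) / total
    if f_zero >= low_threshold:
        return True, "low-similarity U-shape criterion"
    if f_high >= high_threshold:
        return True, "high-similarity U-shape criterion"
    # Steps 3 and 4: ratio of L to its upper bound, both supplied by the caller.
    if l_upper > 0 and l_value / l_upper >= ratio_threshold:
        return True, "ratio criterion"
    return False, "no criterion satisfied"
```

The U-shape checks come first because they need only the histogram; the K-S based ratio is consulted only when both of them fail, mirroring the order of Steps 2 to 4.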

Application and Verification
To examine the statistical analysis of similarity values for CF problems in this paper, another 40 matrices (in addition to the earlier 20 benchmark matrices, making up a total of 60 matrices) are generated and applied in this section. These 60 matrices are used to examine the following two issues specifically.
(i) Given the three criteria for assessing the potential structuredness of a matrix, we are going to use these 60 matrices to examine their effectiveness to distinguish well-structured and ill-structured matrices.
(ii) While Property III (i.e., relative ease of obtaining satisfactory CF solutions) of a well-structured matrix has been discussed in Section 2.3, it will be verified via these 60 matrices by two stages of CF problem solving.
Setup of the 60 Incidence Matrices.
The strategy to generate 60 matrices is based on the extension of getting the 20 benchmark matrices in Section 3.2. The additional varying factors include the following.
(i) In addition to the size of 30×40 matrix, another size of 40×100 matrix is set.
(ii) We add cases with more cells (from 3 to 6, 8, and 12 cells).
(iii) The evenness of cell sizes is also varied for each case.
Table 4 shows the setup of 60 matrices, where Cases A and E are repeated from Section 3.2 for comparison. Notably, the classification of matrix structuredness into Cases I, II, III, and IV in Section 3.2 is also applied, leading to the study of 15×4 = 60 incidence matrices. By design, the matrices of Cases I and II have no voids and no exceptional elements, respectively, so they should be classified as well-structured matrices. The matrices of Case III have only a few exceptional elements and voids, and they should also be classified as well-structured matrices.
In contrast, the matrices of Case IV have more exceptional elements and voids, and they should be classified as ill-structured matrices. The images and histograms of these 60 matrices are provided as supplementary materials.

Examination of the Criteria.
To evaluate the effectiveness of the criteria to assess the structuredness of the matrices, we have evaluated the criteria values for the 60 matrices. The results are provided in Table 5, where the values satisfying the criteria of well-structured matrices are shown in bold. As observed in these results, the structuredness criteria can discern the well-structured matrices of Cases I, II, and III, where each matrix satisfies at least one criterion. In contrast, no matrices of Case IV satisfy any criteria of well-structured matrices.
In view of the effectiveness of individual criteria, it is observed that F (0) is effective in filtering the matrices of Case II (i.e., few voids and no exceptional elements). Due to the absence of exceptional elements in this case, any two machines of different blocks will have similarity values equal to zero. This explains the high values of F (0) observed in Case II. In contrast, F_h(0.5) is less effective when the matrices have more cells (e.g., Cases H and I) and large sizes (e.g., Cases J to O). Notably, the values of F_h(0.5) for Case IV are quite low (ranging from 0.00 to 0.09). In this view, the criterion of F_h(0.5) is quite tight.
By comparison, the ratio criterion (i.e., the ratio of L to its upper bound) seems effective in distinguishing well-structured matrices, where Case D-I is the only well-structured case that this criterion alone does not identify. Notably, the gap between the well-structured matrices (lowest ratio of 0.17 in Case D-I) and the ill-structured matrices (highest ratio of 0.16 in Case L-IV) is small. This explains the need of having F (0) and F_h(0.5), along with the ratio criterion, in the assessment of the structuredness of the matrices.

Examination of Property III via Optimization.
Recall from Section 2.2.3 that Property III states that a satisfactory solution of a well-structured matrix can be readily obtained via a heuristic approach, where more complex metaheuristics may not bring additional benefits. To verify this property, the 60 matrices were tested with a two-stage solution process. First, each matrix was solved by a hierarchical clustering (HC) method, as one heuristic, to yield a CF solution. Then, we examined whether the obtained CF solution could be further improved by the genetic algorithm (GA), representing a metaheuristic method. In this way, we can check the correlation between grouping efficacy and the percentage of improvement of solution quality by GA. The algorithmic details of the HC method and the implementation details of GA applied in this study can be found in Zhu [34]. Table 6 lists the grouping efficacy results for the 60 matrices after running hierarchical clustering (HC) and then genetic algorithm (HC+GA). Also, the percentages of improvement in grouping efficacy by GA are reported for comparison. As observed, the matrix solutions in Cases I and II cannot be further improved by GA, while three matrix solutions in Case III can be improved by GA by small percentages (between 0.20% and 0.25%). In contrast, the ill-structured matrix solutions in Case IV can be improved by GA, with percentages of improvement between 0.63% and 22.69%. Overall, we consider that the numerical results generally follow Property III, given that the matrices in Case III are close to the boundary between well-structured and ill-structured matrices. Figure 7 plots the percentages of solution improvement versus the values of grouping efficacy based on HC+GA. Based on the 60 matrices studied in this paper, GA did not improve the quality of matrix solutions that have 0.60 or higher grouping efficacy.
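For readers who wish to reproduce quantities of the kind reported in Table 6, the following minimal sketch computes the grouping efficacy of a CF solution and the percentage of improvement between two solutions. It assumes the commonly used definition of grouping efficacy, namely the number of nonzero entries minus the exceptional elements, divided by the number of nonzero entries plus the voids. It also assumes that the improvement is expressed relative to the HC result. The function names and the small example matrix are hypothetical and are not taken from this study.

```python
def grouping_efficacy(matrix, machine_cells, part_cells):
    """Grouping efficacy of a solution; rows are machines and columns are parts."""
    ones = exceptional = voids = 0
    for i, row in enumerate(matrix):
        for j, entry in enumerate(row):
            inside = machine_cells[i] == part_cells[j]
            if entry:
                ones += 1
                if not inside:
                    exceptional += 1  # nonzero entry outside the diagonal blocks
            elif inside:
                voids += 1  # zero entry inside a diagonal block
    return (ones - exceptional) / (ones + voids)


def improvement_percent(efficacy_hc, efficacy_hc_ga):
    """Relative gain of HC+GA over HC, in percent (assumed definition)."""
    return 100.0 * (efficacy_hc_ga - efficacy_hc) / efficacy_hc


# Hypothetical 3 x 4 example with one exceptional element and one void.
example = [[1, 1, 0, 0],
           [1, 0, 0, 1],
           [0, 0, 1, 1]]
efficacy = grouping_efficacy(example, [0, 0, 1], [0, 0, 1, 1])
```

In this example, there are six nonzero entries, one exceptional element, and one void, so the grouping efficacy is 5/7.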
For the data points with grouping efficacy values less than 0.60, we find that the percentage of improvement and the grouping efficacy are negatively correlated, where the correlation value [32, p. 173] is -0.62. In the statistical interpretation, we can state that a lower value of grouping efficacy tends to leave more room for improvement by GA, but the linearity is not strong. Notably, the capabilities of HC and GA to yield high-quality solutions can depend on other factors (e.g., density of nonzero entries in a matrix). Thus, it is not easy to observe a linear correlation just between the percentage of improvement and the grouping efficacy. More control factors and samples would be required for an in-depth investigation.
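The correlation analysis above can be sketched as follows, assuming that the reported correlation value is the Pearson correlation coefficient and that only the data points below the 0.60 efficacy level are included. The numbers in the example are hypothetical placeholders and are not the values of Table 6; the function names are also hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_below(efficacies, improvements, limit=0.60):
    """Correlate improvement with efficacy using only points below `limit`."""
    kept = [(e, g) for e, g in zip(efficacies, improvements) if e < limit]
    return pearson_r([e for e, _ in kept], [g for _, g in kept])


# Hypothetical placeholder data: grouping efficacy and improvement (%) by GA.
efficacies = [0.30, 0.40, 0.50, 0.55, 0.70, 0.90]
improvements = [20.0, 9.0, 12.0, 2.0, 0.0, 0.0]
r = correlation_below(efficacies, improvements)
```

A value of r between -1 and 0 indicates that the improvement tends to shrink as the grouping efficacy rises; values far from -1 indicate that the relation is only loosely linear.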

Conclusions
This paper has explored the statistics of similarity values to investigate the structuredness of cell formation (CF) matrix solutions. Using grouping efficacy as one recognized index to inform the quality of a CF matrix, it is found that a well-structured matrix has a high percentage of high-similarity machine pairs (i.e., Property II). Accordingly, this paper sets up 20 benchmark matrices, with varying structuredness, to develop the U-shape criteria and the criterion based on the Kolmogorov-Smirnov test. Then, a procedure is developed to assess the potential structuredness of a CF matrix without solving the CF problem. The criteria for assessing structuredness of matrices are examined via 40 additional matrices, and agreeable results are observed. The genetic algorithm (GA) is used to see if it can improve the CF solutions obtained by hierarchical clustering (as one type of heuristics). The results show that the matrix solutions with high grouping efficacy values (i.e., well-structured matrices) cannot be effectively improved by GA.
While the worst-case computational complexity of clustering problems (e.g., NP hardness) is well recognized, the CDNM thesis (discussed in Section 1) has implied that not all clustering problems in practice are difficult to solve. This research corresponds to the "clustering pipeline" proposed by Ackerman et al. [7], where clusterability (or structuredness in our context) can be evaluated to inform the selection of effective clustering algorithms. In this view, one intended contribution of this work is to implement this idea in the context of the CF problem. In future work, we will explore more applications in manufacturing systems that require grouping and combinatorial decisions (e.g., product and systems modularity). Also, we can explore more statistical and machine learning techniques, such as multimodality tests and random forest, to replace the K-S test for better prediction performance.

Data Availability
The matrix data used to support the findings of this study are included within the supplementary information file (pictorial illustrations). Other data formats (e.g., Excel file) are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.