Fast deterministic algorithm for EEE components classification

The authors consider the problem of automatic classification of electronic, electrical and electromechanical (EEE) components based on the results of test control. Electronic components of the same type used in a high-quality unit must be produced as a single production batch from a single batch of raw materials. Test control data are used for splitting a shipped lot of components into several classes representing the production batches. Methods such as k-means++ clustering or evolutionary algorithms combine local search and random search heuristics. The proposed fast algorithm returns a unique result for each data set, and this result is comparatively precise. If the data processing is performed by the customer of the EEE components, this feature of the algorithm allows easy checking of the results by a producer or supplier.


Introduction
Supplying the electronic units of complex technical systems with EEE components of proper quality is one of the most important problems in increasing whole-system reliability. Moreover, to reach the highest reliability of an electronic unit, the EEE components of the same type must have equal characteristics which assure their coherent operation. The highest homogeneity of the characteristics is reached if the EEE components are produced as a single production batch from a single batch of raw materials. The critically important units are assembled from EEE components manufactured as special production lots with special quality requirements [1,2].
The characteristics of a lot of the components are checked via destructive and nondestructive tests [1,3]. Data of such tests can be used for analyzing the lot homogeneity [3]. For splitting the EEE components into several assumed production batches, the k-means method can be used [4,5,6].
American and European EEE component manufacturers produce components of special quality classes, Military and Space [7,8]. Russian manufacturers do not form a special class of components for use in space systems [1,2].
The k-means problem can be classified as a continuous problem of location theory [9,10,11]. The aim is to find k points (centers, centroids, medoids) X_1,…,X_k in a d-dimensional space such that the total distance from each of the data vectors A_1,…,A_N (known points, measurement result vectors) to the nearest of the k chosen centers reaches its minimum:

\min_{X_1,\dots,X_k} \sum_{i=1}^{N} \min_{j=1,\dots,k} \| A_i - X_j \|^2.   (1)

Quality requirements of the EEE components in space systems are so high that the range of characteristics measured via quality check tests is very narrow, and the quality class and assumed production batch of each component in the lot must be determined via analysis of differences (distances) between test result data vectors which only slightly exceed the precision of the measurement tools. Thus, the results of each measurement form a finite (discrete) set of possible values defined by the measurement tool precision.
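For concreteness, objective (1) can be evaluated for a candidate set of centers as in the following Python sketch (the function and variable names are ours, chosen for illustration only):

```python
def kmeans_objective(data, centers):
    """Total distance (1): sum over data vectors of the squared
    Euclidean distance to the nearest of the k centers."""
    total = 0.0
    for a in data:
        total += min(
            sum((ai - xi) ** 2 for ai, xi in zip(a, x))
            for x in centers
        )
    return total

# Two obvious clusters around (0,0) and (10,10):
data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1)]
print(kmeans_objective(data, [(0.05, 0.0), (10.0, 10.05)]))
```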
The squared Euclidean norm is the most popular distance metric used for calculating differences (distances) in a normalized space of characteristics [9]. Using the rectilinear (Manhattan) norm [12] as the distance metric in the k-means model allows reaching results of the same precision as the precision of the data vectors: in this case, the value of each coordinate of the result coincides with the value of the same coordinate of one of the data vectors [9,13]. Moreover, the result of the k-means problem with the rectilinear metric is more stable under the influence of outliers in the data, which exist due to measurement errors and defective components in the lot. Another approach which allows achieving results of the same precision as the data vectors is solving the k-medoid problem [14,15]. In this problem, the cluster centers which minimize the total distance are searched among the data vectors only.
The k-means method uses the ALA procedure (Alternating Location-Allocation) which includes two simple steps.

Algorithm 1. ALA procedure.
Required: data vectors A_1…A_N, k initial cluster centers X_1…X_k.
1. For each center X_i, determine its cluster C_i as the subset of the data vectors for which this center X_i is the closest one.
2. For each cluster C_i, recalculate its center X_i.
3. Repeat from Step 1 unless Steps 1 and 2 made no change.

Traditional usage of the k-means method with the squared Euclidean metric (l2) has one important advantage: in this case, calculating a center of a cluster is a simple problem solved in one iteration as calculating the mean value of each coordinate over all data vectors in the cluster. These mean values are the coordinates of the new cluster center [9]. If the i-th cluster center also has d dimensions, then the new cluster center is calculated as [9]:

X_i = \frac{1}{|C_i|} \sum_{A_j \in C_i} A_j.

In the ALA procedure with the rectilinear metric, each coordinate of the cluster center is calculated as the median value of this coordinate among all data vectors which belong to the cluster. This process can be described as follows.
Algorithm 2. Calculating the i-th cluster center (median) in the case of the l1 metric: for each of the d coordinates, collect the values of this coordinate over all data vectors in cluster C_i, sort them, and take the median.
This algorithm returns a new center whose value in each coordinate coincides with a coordinate of some data vector.
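A minimal Python sketch combining the ALA procedure (Algorithm 1) with the coordinate-wise median of Algorithm 2; the helper names and the lower-median convention (which guarantees the returned value is one of the inputs) are our assumptions:

```python
def l1_dist(a, b):
    """Rectilinear (Manhattan) distance."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def median(values):
    """Lower median: always coincides with one of the input values."""
    s = sorted(values)
    return s[(len(s) - 1) // 2]

def ala_l1(data, centers):
    """Alternate allocation (Step 1) and location (Step 2) until stable."""
    centers = [tuple(c) for c in centers]
    while True:
        # Step 1: assign each data vector to its closest center.
        clusters = [[] for _ in centers]
        for a in data:
            i = min(range(len(centers)), key=lambda j: l1_dist(a, centers[j]))
            clusters[i].append(a)
        # Step 2: recalculate each center as the coordinate-wise median.
        new = [
            tuple(median([a[d] for a in cl]) for d in range(len(data[0])))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centers:
            return centers
        centers = new

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 9.0), (9.0, 10.0)]
print(ala_l1(data, [(0.0, 0.0), (10.0, 10.0)]))  # → [(0.0, 0.0), (9.0, 9.0)]
```

Note that each coordinate of the returned centers coincides with a coordinate of some data vector, as stated above.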
In the case of the k-medoid problem, the procedure for determining each cluster center requires an exhaustive search among all data vectors in the cluster. However, many researchers propose faster analogous local search procedures [17,18,19] which do not guarantee an exact solution.
Except for special cases, the k-means and k-medoid problems are NP-hard and require global search [20].
The result of the ALA procedure depends on the choice of the initial cluster centers. The well-known k-means++ algorithm [21] has an advantage in comparison with a chaotic choice of the initial centers and guarantees a statistical preciseness of the result of O(log k). However, such preciseness is insufficient for many practically important problems. For such cases, researchers propose various recombination techniques for initial center sets [9].
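For reference, k-means++ seeding can be sketched as follows; the draws depend on the state of the random generator, which is exactly why different starts may give different results (all names here are illustrative):

```python
import random

def kmeans_pp_seed(data, k, rng):
    """k-means++ seeding: each next center is drawn with probability
    proportional to the squared distance to the nearest chosen center."""
    centers = [rng.choice(data)]
    while len(centers) < k:
        d2 = [min(sum((ai - xi) ** 2 for ai, xi in zip(a, x))
                  for x in centers) for a in data]
        centers.append(rng.choices(data, weights=d2)[0])
    return centers

data = [(0.0, 0.0), (0.1, 0.1), (10.0, 10.0), (10.1, 10.1)]
# Different seeds may give different initial centers, hence different runs:
print(kmeans_pp_seed(data, 2, random.Random(1)))
print(kmeans_pp_seed(data, 2, random.Random(2)))
```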
The ALA procedure can be optimized with the use of many techniques. For example, sampling procedures [22] solve the k-means problem for a randomly selected subset of the data vectors and use the achieved result as an initial set of centers for solving the original problem. Researchers also propose various algorithms for streaming data processing [4] applicable to big data analysis, and many other methods.
The dependence of the results of the ALA procedure on the initial center seeding is a serious problem for the reproducibility of the classification results: depending on the initial center seeding, different starts of the algorithm classify the same data vectors as elements of different clusters. For the EEE component production batch classification problem, this means that particular EEE components are attributed to the same or different production batches depending on the initial seeding. Thus, an algorithm for solving the k-means problem which returns a stable result is preferable.
The Information Bottleneck Clustering (IBC) method is a deterministic method for solving cluster analysis and classification problems able to achieve very precise results in many cases [23]. This algorithm starts by considering each data vector as a separate cluster. Then, clusters are removed one by one until the desired number of clusters remains. Each time, the algorithm eliminates the cluster whose elimination gives the smallest increment of the objective function value. For the k-means problem, this algorithm eliminates the cluster center which gives the smallest total distance from the data vectors to the closest remaining centers. Such algorithms are extremely slow [23]. The genetic algorithms with greedy heuristic initially designed for the discrete k-median problem on a network [24] are compromise variants. However, they are not deterministic algorithms. In [25], the author proposes an approach for adapting these algorithms to continuous location problems. The idea of this approach can be described as follows [10].
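The successive-elimination principle of the IBC can be illustrated with a short Python sketch (a simplified k-medoid-style version that re-evaluates the objective directly and omits the per-step local search, so it shows only the principle; all names are ours):

```python
def total_dist(data, centers):
    """Total squared Euclidean distance to the nearest center."""
    return sum(min(sum((ai - xi) ** 2 for ai, xi in zip(a, x))
                   for x in centers) for a in data)

def ibc_eliminate(data, k):
    """Start with every data vector as a center; repeatedly remove the
    center whose removal gives the smallest objective increment."""
    centers = list(data)
    while len(centers) > k:
        best = min(range(len(centers)),
                   key=lambda i: total_dist(data, centers[:i] + centers[i + 1:]))
        centers.pop(best)
    return centers

data = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 5.2)]
print(ibc_eliminate(data, 2))
```

Every run on the same data returns the same centers, since no random values are used.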
Algorithm 3. Genetic algorithm with greedy heuristic for k-median problems.
Required: data vectors A_1…A_N, population size N_p.
1. Generate an initial population of N_p solutions (sets of k centers) and calculate their fitness.
2. If the stop condition is met, STOP and return the best solution.
3. Choose randomly two indexes of "parent" solutions.
4. Form an interim "child" solution as the union of the two parent center sets; let χ_c be its cardinality.
5. If χ_c = k, then go to Step 7.
6. Eliminate from the child solution the center whose elimination gives the smallest increment of the objective function value; go to Step 5.
7. Calculate the fitness of the child solution; if it is better than that of the worst solution in the population, replace the worst solution. Go to Step 2.

The parent solutions in Step 3 are chosen randomly, while Steps 5-6 of this algorithm realize the greedy heuristic, the successive elimination of the centers from an interim solution.
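The union-and-trim child generation (the greedy heuristic of Steps 5-6) can be sketched in Python; the function names and toy data are ours:

```python
def total_dist(data, centers):
    """Total squared Euclidean distance to the nearest center."""
    return sum(min(sum((ai - xi) ** 2 for ai, xi in zip(a, x))
                   for x in centers) for a in data)

def greedy_child(data, parent1, parent2, k):
    """Unite two parent center sets, then greedily drop centers one by
    one until exactly k remain (the greedy heuristic of Steps 5-6)."""
    child = list(dict.fromkeys(parent1 + parent2))  # union, order kept
    while len(child) > k:
        i = min(range(len(child)),
                key=lambda j: total_dist(data, child[:j] + child[j + 1:]))
        child.pop(i)
    return child

data = [(0.0, 0.0), (0.5, 0.0), (8.0, 8.0), (8.5, 8.0)]
p1 = [(0.0, 0.0), (8.5, 8.0)]
p2 = [(0.5, 0.0), (8.0, 8.0)]
print(greedy_child(data, p1, p2, 2))
```

Only the choice of parents in Step 3 is random; the trimming itself is deterministic given the parents.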
An analogous heuristic was proposed by Kuehn and Hamburger in 1963 [16]. The IBC method is based on the same principle of successive elimination of clusters from an interim solution [23]. Both the method of Kuehn and Hamburger and the IBC choose an infeasible solution coinciding with the whole set of data vectors as the initial solution.
The fitness function for the k-means problem can be calculated for an initial or interim solution by running the ALA procedure started from this solution and taking the achieved objective function value (Algorithm 4). This is a computationally intensive procedure. Another approach [25] is based on the immediate usage of the total distance from the data vectors to the closest center in the interim solution as the fitness function value for Algorithm 3. In this case, Step 1 of Algorithm 4 is omitted. In fact, this approach solves a k-medoid problem and adjusts the solution with the ALA procedure at the final iteration of the greedy heuristic. Such an approach is much faster; however, it reduces the preciseness.
An advantage of the IBC method is its determinism. This method does not use any random values, and each start of the algorithm results in the same set of cluster centers. The quality check of EEE component lots is a process which involves two parties: a manufacturer or supplier, and a customer or a specialized testing center. The calculation performed by the customer must be reproducible. Exact reproduction of the results is impossible when using algorithms which include random search elements. However, the IBC method slows down the calculation.
In [11], the authors propose a modification of the greedy heuristic used in Step 6 of Algorithm 3. This method uses points in the d-dimensional space instead of data vector indexes as the alphabet of the genetic algorithm. Such points represent the cluster centers. The authors entitle this modification the Algorithm with Floating Point Alphabet. Results of the genetic algorithm with this heuristic are much more precise than results of the modification described in [25] (in [25], the author proposes solving a k-medoid problem). At the same time, the iterations of Algorithm 3 with this heuristic are performed much faster than with Algorithm 4 as the greedy heuristic. Moreover, the decrease of the computational complexity of Step 6 of Algorithm 3 makes solving large-scale problems possible. In [11], the authors report results of solving problems with up to 560000 data vectors. This heuristic can be described as follows.
Algorithm 5. Greedy heuristic with floating point alphabet.
Required: data vectors A_1…A_N, an interim solution of χ_c > k centers, parameter ε ∈ (0, 1].
1. Assign each data vector to its closest center (the allocation step of the ALA procedure).
2. Calculate the distances from each data vector to the second closest center in the interim solution.
3. Recalculate each center for its cluster (the location step of the ALA procedure).
4. For each center, estimate the increment of the total distance caused by its elimination: each data vector of its cluster would be reassigned to its second closest center.
5. Eliminate ⌈ε(χ_c − k)⌉ centers with the smallest estimated increments.
6. If χ_c > k, go to Step 2; otherwise, STOP.
Value  is an important parameter of this heuristic. This value determines the percentage of the superfluous cluster centers eliminated in a single iteration. Authors propose value 0.2. Bigger values make the algorithm run faster and reduce its preciseness. Small values  make it work as Algorithm 4 and eliminate a single center at each iteration. We used  =0. 25.
This heuristic combines the greedy heuristic [24,25,3] with elements of the modified ALA procedure performed at each iteration and allows eliminating up to 20-25% of the superfluous cluster centers until the required quantity of centers remains. Algorithm 4 requires performing p(k_0 − k) runs of the ALA procedure (here, k_0 is the initial quantity of centers). Algorithm 5 reduces the quantity of iterations down to O(log(k_0 − k)). Moreover, each iteration does not require performing the whole ALA procedure. Instead, its separate optimized elements (location, allocation) are performed.
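A simplified sketch of the ε-batch elimination idea: each center's elimination cost is estimated from the gap between the closest and second closest centers, and ⌈ε(χ_c − k)⌉ of the cheapest centers are dropped at once. This is an illustration of the principle under our own simplifications (the interleaved location step is omitted), not the authors' exact Algorithm 5:

```python
import math

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def batch_eliminate(data, centers, k, eps=0.25):
    """Repeatedly drop ceil(eps * (|S| - k)) centers with the smallest
    estimated elimination cost until k centers remain."""
    centers = list(centers)
    while len(centers) > k:
        cost = [0.0] * len(centers)
        for a in data:
            order = sorted(range(len(centers)),
                           key=lambda j: sq_dist(a, centers[j]))
            nearest, second = order[0], order[1]
            # Removing the nearest center would push a to the second one:
            cost[nearest] += (sq_dist(a, centers[second])
                              - sq_dist(a, centers[nearest]))
        n_drop = max(1, math.ceil(eps * (len(centers) - k)))
        drop = set(sorted(range(len(centers)), key=lambda j: cost[j])[:n_drop])
        centers = [c for j, c in enumerate(centers) if j not in drop]
    return centers

data = [(0.0, 0.0), (0.1, 0.1), (6.0, 6.0), (6.1, 6.1)]
print(batch_eliminate(data, list(data), 2))
```

Since ⌈ε(|S| − k)⌉ ≤ |S| − k, the batch never drops below k centers, and a larger ε removes more centers per iteration at the price of coarser cost estimates.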
It can be easily seen that if k_0 = N, then both Algorithm 5 and the IBC method start their initial iterations analogously: the number of cluster centers coincides with the number of data vectors. Moreover, if k_0 = N, then choosing the initial centers is not random: all data vectors are chosen as the initial centers. Thus, the following deterministic algorithm can be proposed.

Algorithm 6. Run Algorithm 5 with the initial set of centers coinciding with the whole set of data vectors (k_0 = N).

Computational results are summarized in Table 1. We used example problems from the UCI library [26] and problems with real data of EEE components examination. For small-scale problems, the results are shown in comparison with the IBC method, the genetic algorithm with greedy heuristic and the genetic algorithm with recombination of fixed-length subsets. We used various distance metrics. In addition, k-medoid problems [14,15] were solved. For several problems, the results of the new algorithm are insignificantly worse than the results of the IBC. At the same time, the time needed for the problem solution is reduced many times. Table 1 also includes results with ε = 0.001. This value makes the algorithm work as an IBC procedure which eliminates exactly one center at a time. At the same time, the approach to the inclusion of elements of the ALA procedure in this greedy heuristic remains unchanged, and the new algorithm with ε = 0.001 works slower than the one with ε = 0.25 and still much faster than the IBC algorithm which uses Algorithm 4 for fitness function evaluation at each iteration. In addition, Table 1 shows results of the simplified IBC method which evaluates the fitness function value without running the ALA or other local search procedures. In fact, this simplified IBC procedure solves a k-medoid problem and then adjusts the result with the ALA procedure. Such an algorithm is entitled "IBC w/o local search". It can be constructed by excluding the local search steps from Algorithm 5 with ε = 0.001. For all inspected problems except those with the Jaccard metric, the new algorithm shows the best results among all considered deterministic algorithms except the IBC with local search, which exceeds the new algorithm in preciseness for several problems. However, the new algorithm takes much less time. Results of the new algorithm are less precise than the results of evolutionary algorithms. However, except for problems with the Jaccard metric, the difference does not exceed 2.3% for problems with real data vectors and 3.8% for problems with Boolean data vectors.

Note that the new algorithm has one important feature for solving the problem of automatic classification of the EEE components. Such a problem is solved [3] as a series of k-means problems with k ∈ [k_min, k_max], where k_min = 1 (a single production batch without clusters is assumed) and k_max is chosen by a decision maker equal to some reasonable number. Algorithm 6 can be used for the k-means problem with k = k_max. Then, starting from Step 2, Algorithm 5 can run again for solving the succeeding problems until k = k_min. Thus, results for all values k ∈ [k_min, k_max] can be calculated in a single run.
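Under the assumption that the deterministic algorithm proceeds by successive elimination starting from k_0 = N, the single-run feature can be illustrated with a simplified sketch: one elimination pass records a solution for every intermediate k in [k_min, k_max] (one center eliminated at a time, ALA adjustment omitted; all names are ours):

```python
def total_dist(data, centers):
    """Total squared Euclidean distance to the nearest center."""
    return sum(min(sum((ai - xi) ** 2 for ai, xi in zip(a, x))
                   for x in centers) for a in data)

def solutions_for_all_k(data, k_max, k_min=1):
    """One elimination pass from |data| centers down to k_min, recording
    a center set for each k in [k_min, k_max]. No random values are
    used, so every run on the same data gives the same answer."""
    centers = list(data)
    results = {}
    while len(centers) >= k_min:
        if len(centers) <= k_max:
            results[len(centers)] = list(centers)
        if len(centers) == k_min:
            break
        i = min(range(len(centers)),
                key=lambda j: total_dist(data, centers[:j] + centers[j + 1:]))
        centers.pop(i)
    return results

data = [(0.0, 0.0), (0.3, 0.0), (7.0, 7.0), (7.3, 7.0)]
res = solutions_for_all_k(data, k_max=3)
print(sorted(res))  # → [1, 2, 3]
```

A decision maker can then inspect the recorded solutions for every assumed number of production batches without re-running the algorithm.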

Conclusion
The proposed algorithm allows solving k-means and k-medoid problems in appropriate time. The achieved results are insignificantly less precise than the results of evolutionary algorithms. However, the new algorithm is deterministic, and this fact makes its results easy to check and interpret by all concerned parties. The preciseness and efficiency of the software implementation of the new algorithm allow solving the problem of classifying EEE components by production batches based on quality examination test data.