Explainable Fuzzy Cluster-based Regression Algorithm with Gradient Descent Learning

We propose an algorithm for 𝑛 -dimensional regression problems with continuous variables. Its main property is explainability, which we identify as the ability to understand the algorithm’s decisions from a human perspective. This has been achieved thanks to the simplicity of the architecture, the lack of hidden layers (as opposed to deep neural networks used for this same task) and the linguistic nature of its fuzzy inference system. First, the algorithm divides the joint input-output space into clusters that are subsequently approximated using linear functions. Then, we fit a Cauchy membership function to each cluster, therefore identifying them as fuzzy sets. The prediction of each linear regression is merged using a Takagi-Sugeno-Kang approach to generate the prediction of the model. Finally, the parameters of the algorithm (those from the linear functions and Cauchy membership functions) are fine-tuned using Gradient Descent optimization. In order to validate this algorithm, we considered three different scenarios: The first two are simple one-input and two-input problems with artificial data, which allow visual inspection of the results. In the third scenario we use real data for the prediction of the power generated by a Combined Cycle Power Plant. The results obtained in this last problem (3.513 RMSE and 2.649 MAE) outperform the state of the art (3.787 RMSE and 2.818 MAE).


INTRODUCTION
Over the last few years, Artificial Intelligence (AI) and Machine Learning (ML), have become ubiquitous in human society. Riding primarily on the empirical success of the Deep Neural Network (DNN), AI and ML have expanded to sectors where safety and accountability must be guaranteed (e.g., medicine and national security) with an increasing emphasis on data privacy and protection. Decisions derived from AI powered systems in such sectors directly affect people's lives, motivating a need for transparency: explainability and interpretability. Models should make sense to the human observer, and decisions must be justified and legitimate, with detailed explanations that promote trust. Human observers should understand how these decisions are made (the actions or procedures taken by the model), why they work when they do (or fail when they don't).
The excellent performance of DNNs, however, relies on opaque abstractions in the (often hundreds) of hidden layers and millions of parameters that obscure their decision making process, leading to the development of black box models which lack a clear understanding of how the model works. This performance -transparency trade-off was historically acceptable since AI powered systems were deployed primarily for scientific and limited commercial work. The widespread expansion of AI and DNNs outside of academia thus motivates the need for modified machine learning techniques that learn explainable features while maintaining performance, as well as more interpretable, structured causal models.
The previous mechanisms could be understood as a top-down addition of mathematical tools to increase the interpretability. A less explored approach is the bottom-up redefinition of model architectures such that the algorithms are inherently transparent. In fact, most of the methods that seek to provide the desired transparency in neural networks elaborate on top of the fundamentals and rarely provide a novel algorithmic architecture. For example, [26] introduced an enhanced explainable neural network (ExNN) that decomposes a complex relationship into additive components, where the explainability is obtained by the addition of orthogonality constraints and the sparsity of the generated subnetworks. Similarly, [27] used variational autoencoders to generate interpretable features, and [28] suggested the visualization of intermediate models to improve the trustworthiness of the system. While these techniques provide more clarity on the decision making process, they are not easy to generalize, and in fact are less explainable than a decision tree or a linear regression.
Additionally, minor changes to a DNN's input can negatively impact performance, causing misclassifications or false predictions, leading to the use of fragile models easily fooled by noise. The expansion of AI to mission critical fields thus also motivates the need for robust solutions resilient to noise. [29] introduced a novel evolutionary algorithm that leverages competitive learning strategies to obtain the most optimal network that can handle noisy data. Autoencoders have been very successfully employed for denoising tasks, especially in medical procession [30] , and speech enhancement [31] . Nevertheless, all the aforementioned methods hinge on the use of uncorrupted data points in the learning process. Both the noisy and clean instances are shown to the system until it learns to identify noisy patterns and develops its noise suppression strategy -information that is unavailable in most real-world applications. "Truth" often refers to the output values in the data, which in fact may have a certain percentage of noise from the data acquisition process. Thus, the ground truth remains unknown. This paper proposes a novel noise-resilient, explainable learning algorithm for -dimensional regression problems with continuous variables. The salient features of the algorithm include initialization by (1) clustering the input data using a hierarchical approach which is (2) subsequently approximated using linear functions and (3) fitted to a Cauchy membership function, identifying the clusters as fuzzy sets. A Takagi-Sugeno-Kang (TSK) fuzzy system [32][33][34][35][36] merges the prediction of each linear regression to generate the prediction of the model. Parameters are fined tuned using Gradient Descent (GD) optimization. We use a GD-based optimization because it is computationally efficient, it produces a stable error gradient and a stable convergence. We demonstrate that our approach yields superior performance while providing explainability and interpretability as demonstrated for a meaningful industrial AI benchmark and comparing it with other state of the art techniques.
Section II is a description of the algorithm's phases and the corresponding mathematical formulation. Section III describes the data sets used for empirical evaluation, including both synthetic and real-world data. Section IV discusses out empirical results. Section V concludes this paper and offers ideas for future work.

PROPOSED ALGORITHM
First, a clustering algorithm divides the joint input-output space into groups of data points ( is user-defined). Despite referring to them as clusters, they do not necessarily represent isolated and well defined bulks of data, but rather convex regions of a tessellated work-space with observations inside. Therefore, we use an agglomerative MAX-linkage/complete-linkage hierarchical clustering algorithm [37][38][39] to get non-overlapping and non-intersecting globular rounded clusters (with convex boundaries). Hierarchical clustering algorithms have been widely used in combination with fuzzy logic [40][41][42] .
Then, in each cluster, the dependence of the output on the input x is approximated by a linear function, similar to how an ensemble of systems works [43] , where each model is an expert in a certain region of the space. We use the notation x i to denote the ith input, = 1, . . . , . Each x i is a row vector of dimension N, x i ( ) = , = 1, . . . , . So, the data matrix can be expressed as For a generic vector x, the linear function of cluster is, where m c denotes the row vector of the slopes of the function and denotes the intercept.
The output of the model is generated by merging the linear functions of the clusters through a Takagi-Sugeno-Kang approach,ˆ( where (x) is the membership function of each cluster. Appendix A provides an overview of fuzzy learning and the importance of the membership function's choice. We decided to use Cauchy membership functions for this algorithm and we provide three explanations for this choice. The first two are explained in [44] and [45] . The third is covered in Appendix B, where we proof that this membership function is the fastest to compute. In the multidimensional space, the definition of the Cauchy membership function is defined as where a c and b c are row vectors that determine the location of the center, and the amplitude, of the function, respectively, in each of the dimensions of the input space, a c ( ) = , a c ( ) = , = 1, . . . , . As usual, ∥a c ∥ is the Euclidean norm √︁ Once the parameters of both the membership functions and the linear functions have been initialized, the model is trained using GD learning. We define the loss function for the data vector as and the loss over the whole the training set X as where the factor 1 2 is added to ease the differentiation of (6).
The matrix expression of the formulation for the update of the different parameters is where denotes the learning rate in each epoch of the training, Δ denotes the increment required in each parameter, (9) Equation (7) has been developed minimizing (6), as shown in the Appendix C.
For the update of the parameters, one can think of two different approaches, synchronously and asynchronously. The synchronous approach requires visiting all the observations of the training sample before training the parameters. Thus, in this configuration, there is only one update, performed at the end of each epoch. In the asynchronous version, we update the parameters of every cluster after studying an instance of the sample. In this case, we would have as many updates as observations per epoch. The latter requires a higher computational cost, but the evolution of the parameters is more stable and smooth than the synchronous approach. In the results section of this paper, we considered the asynchronous update. These two optimization versions can be visualized in Figure 1. Synchronous and asynchronous approaches for the update of the system's parameters. Each observation of the dataset, has a different set of matrices that together constitute the gradient of the loss function (transparent matrices of the back refer to different clusters). The entries of the blue matrices represent the derivatives with respect to the different parameters of the system; the rows determine the type of the parameter, , , , or , organized in a top-down fashion, and the columns identify the input feature (denoted with letter f, and 1 for the independent term). The green matrices are obtained after weighting the derivatives with the learning rates, which do not necessarily need to be identical for all the parameters. Finally, the resulting matrix could be added to those obtained for the other observations, creating the purple matrices, or it can be used before moving to the next instance of the training sample. The first approach is labeled as synchronous update, and considers a single update of the parameters in every epoch, after all the observations have been visited. We identify the second approach as an asynchronous update, which requires, within the same epoch, as many updates of the parameters as observations are in the sample.

RESULTS
We tested this algorithm in three different problems; 1 input 1 output, 2 inputs 1 output, and 4 inputs 1 output. In the first two cases, we created the data artificially, and in the third application we used real data obtained from the University of California Irvine (UCI) Machine Learning repository [46] . In this section, we provide a brief revision of the lessons learned from the first two, which were covered in [47,48] , and a detailed explanation of the results obtained in the last scenario.

Single-input problem
We studied 20 different functions. In order to measure the noise resilience of the method, we injected random noise in the output variable of the training dataset, (from a uniform distribution), but the testing data had no noise (the algorithm was never exposed to the ground truth). In each case, we used a ranging number of clusters to evaluate the differences in the approximations obtained. As a benchmark, we utilized a selection of neural networks of one-hidden layer with varying number of hidden neurons. The proposed algorithm obtained significantly better results than all the neural network configurations studied [47] . In Figure 2-3, we show the results obtained with both approaches for the function = |sin( 1 ) | Our algorithm's predictions, noisy data, and truth NN's predictions, noisy data, and truth For the training data, we considered the same 100 noisy observations for both approaches. All the systems were trained using a learning rate of 0.01, the learning stopped when the loss converged. In all the cases studied our algorithm converged in 1,000 epochs, whereas the neural networks required 10,000.

Double-input problem
The purpose of this second problem was to prove the applicability of the algorithm to -dimensional inputs. Similarly to what it was done for the previous section, we trained the system with noisy data (artificially created from different functions and noise injected randomly from uniform distributions) but we tested it with the ground truth (non-noisy). In [48] , we discussed the results obtained for a variety of functions. Figure 4 shows the comparison for the function = sin( 1 ) · sin( 2 ) · 2 1 + 2 2 when a system of 7 clusters is trained for 400 epochs, with 0.01 learning rate in all the parameters, and with 160 training points.

Four-input problem
In order to prove the applicability of the algorithm with real multidimensional problem, we considered the regression data of a combined cycle power plant (CCPP) [49] from the UCI machine learning repository. A CCPP consists of two types of turbines that generate electricity (gas and steam turbines). Gas turbines take the air and natural gas fuel inlets for the production of electricity at the cost of generating an outlet of exhaust gases with high temperature. The heat of these gases is recycled with a secondary circuit of water that is evaporated for further extraction of energy in the water turbine. Figure 5 shows the schematic layout of a typical CCPP system, simplified for clarity purposes.
The popularity of such power plants has increased over the last years, particularly in areas with natural gas resources [50] . The CCPP considered for this study has a nominal generating capacity of 480 MW, two 160 MW ABB 13E2 Gas Turbines, two dual pressure Heat Recovery Steam Generators and one 160 MW ABB Steam Turbine.
The task chosen focuses on the prediction of the electric power produced by the plant when it operates at full load. The ambient temperature, atmospheric pressure, and relative humidity have an impact on the efficiency of the gas turbine, and so does the exhaust steam pressure have an effect on the steam turbine performance. These four variables are considered inputs of the problem, and the power output of the plant, the value we want to predict, is the target. All the observations in the data correspond to the average hourly measurements of the sensors installed in the plant. Table 1 contains some basic statistics of these measures for the dataset selected.
The dataset consists of 10,000 observations. We considered 1,600 randomly chosen instances for training and 800 for testing (0.8 split). We chose a model of 6 clusters and we trained for 4,000 epochs with 0.0001 learning rate (for all the parameters). In order to avoid overtraining, we selected a random sample of 960 training instances randomly chosen from the training set in every epoch (samples of 60% from the training points). We used the Root Mean     In [50] , a variety of different methods were studied for this particular CCPP problem. We provide a sum up of the RMSE values obtained in all those approaches together with our results in Table 2.  [51] 5.426 Linear Regression LR [52] 4.561 Least Median Square LMS [53] 4.572 Multilayer Perceptron MLP [54] 5.399 Radial Basis Function NN RBF [55] 8.487 Pace Regression PR [51] 4.561 Support Vector Poly Kernel SVR [55] 4.563 Lazy-learning IBk Linear NN Search IBk [56] 4.656 KStar K* [57] 3.861 Locally Weighted Learning LWL [56] 8.221 Meta-learning Additive Regression AR [58] 5.556 Bagging REP Tree BREP [59] 3.787 Rule-based Model Trees Rules M5R [51] 4.128 Tree-based Model Trees Regression M5P [60] 4.087 REP Trees REP [61] 4.211 Table 3 shows the figures of merit for the top four methods studied in [50] and the proposed algorithm.

DISCUSSION
Using the Gradient Descent optimization provides a significant advantage in computational efficiency when it is compared to other classical bio-inspired evolutionary algorithms. The latter has been widely used to fine tune fuzzy inference systems in a variety of applications. In the aerospace sector for example, genetic fuzzy systems have a demonstrated success. We can see their usage in aerial vehicle controls for combat missions [62] , multi-agent UAV routing [63,64] , or autonomous collaborative operations [65][66][67] . On the contrary, GD has not been used with fuzzy logic as much as nature-inspired meta-heuristics. In part because GD requires a tailored learning formulation for the parameters of the model, although it provides a faster learning.
In terms of performance, the results obtained for the scenarios we tested are excellent. For the case of the single-input problem, it can be seen in Figure 2 that all the output curves are smooth and very similar despite the noise of the data. Conversely, the curves of the Neural Network benchmark shown in Figure 3 are significantly different, and sharp, with abrupt changes that do not capture the essence of the ground truth. The smoothness property of our method is obtained thanks to the information merging of every cluster. Indeed, the joint contribution and weighting of each linear regression provides robustness and resilience to noise. We perceive that same smoothness in the double-input problem of Figure 4. As it occurred in the single-input case, our model is able to learn the pattern and not the noisy data, capturing the essence of the ground truth.
The parameters of the model we have introduced have a displayable mathematical meaning which ease their visualization. Each cluster can be plotted as linear regression and its associated membership function. Actually, a membership function identifies the region where the influence of each regression is greater, similarly to how an ensemble of expert systems would perform. This allows for easier interpretability and comprehension of the predictions made, which ultimately grants an explainable nature to our algorithm. On the other hand, the neurons of the Neural Network architecture considered, cannot be uniquely associated to a certain region of the joint input-output space. Thus, makes it harder to understand the meaning of each weight, which results in greater opacity and misunderstanding.
For the CCPP dataset, the evolution of both the RMSE and MAE of Figure 6-7 have a significant drop in the first 200 epochs. This initial phase of the learning process is crucial for an accurate reorganization of the membership functions and the model's parameters. Although from epoch 2,000 on there is no significant improvement in the figures of merit of the test set, we decided to continue the training to proof the robustness to overtraining of our algorithm. This can be perceived from the fact that we do not see any increasing trends in the curves for the test set, despite the prolonged training.
Compared to other meta-heuristic approaches, such as evolutionary algorithms which is often combined with Fuzzy Systems [66,67] , the proposed algorithm is significantly faster, and so it provides a big advantage in the training stages.
The superiority of the proposed algorithm is clearly demonstrated in Table 3, where we show its performance in comparison to the mentioned state of the art techniques.

CONCLUSIONS
We have introduced a cluster-based algorithm for regression tasks with three key components: 1. A hierarchical clustering for initialization of its parameters. 2. A Takagi-Sugeno-Kang fuzzy inference system with Cauchy membership functions for prediction. 3. A Gradient Descent optimization for learning and fine tuning.
We provide a theoretical development of the formulation for the parameter update by deriving the total loss of the training set. We also tested the algorithm in three different scenarios, two of them with synthetic data and a third with real data from a Combined Cycle Power Plant. The results we obtained outperformed the benchmarks. In the case of the CCPP, we were able to reduce the value of the RMSE from 3.787 (best score obtained from a wide variety of methods, [50] ) to 3.513, and the MAE from 2.818 to 2.649. Additionally, this algorithm did not only show performance but also: • Robustness to overtraining and noise resilience due to the merged contribution of the clusters.
• Transparency as a direct consequence of the interpretable nature of its parameters, the dominance of a cluster in every region of the input-output space, the lack of complexity, and the linguistic nature of its fuzzy if-then rules.

DECLARATIONS
techniques use differentiation.
From this viewpoint, it is desirable to have a differentiable (smooth) membership function. So, we are looking for easiest-to-compute smooth functions.
In a computer, the only hardware supported operations with numbers are arithmetic operations: addition, subtraction (which, for the computer, is, in effect, the same as addition), multiplication, and taking an inverse (division is implemented as / = · (1/ )). There are also operations min, max, and absolute value, but they are not everywhere differentiable, so we will not consider them.
We need the inverse operation in order to obtain a bounded function. If we do not use the inverse, then we get functions which are compositions of additions, subtractions, and multiplications -and thus, polynomials, since a polynomial can be defined as any function that can be obtained from variables and constants by using addition, subtraction, and multiplication. But a polynomial is not bounded -unless it is a constant. So, we need to use the inverse.
The inverse 1/ is not bounded -and for = 0 it is not even defined, so we need at least one other operation.
We can have the additional operation before the inversion or after the inversion. If we perform the additional operation before the inverse, we have the following options: 1/( + ) for some constant , 1/( + ), 1/( · ), 1/( · ), and 1/(1/ ). The last option just gives , the other ones lead to unbounded functions which are not even everywhere defined.
If we have an additional operation after the inversion, we get + 1/ , + 1/ , · 1/ , and · (1/ ). The last option is simply a constant 1, the others lead to unbounded functions. So, one additional operation is not sufficient, we need at least two additional arithmetic operations.
One can show that if we have an additional operation after the inversion, we will still not get a bounded everywhere defined function. So, the only remaining option is to have two operations followed by inversion. These operations can be addition or multiplication, and they can operate on the variable and on some constant .
So, one of the two operations must be addition, another one must be multiplication. We can have two subcases: when addition is performed first, and when multiplication is performed first. Let us consider them one by one.
In all these cases, we have unbounded functions. • When multiplication is performed first: As a result of multiplication, we can have 1 · or · . We can add to each such result either a constant 2 or a variable . So, we get the following options: 1/( 1 · + 2 ), 1/( 1 · + ), 1/( · + 2 ), and 1/( · + ). In the first, second, and fourth cases, the function is unbounded and not everywhere defined. The only case when the function is bounded and everywhere defined is the third case 1/( 2 + const), which is exactly what we called a Cauchy membership function.
Let us obtain the resulting membership functions. A membership function is usually defined in such a way that its largest value is 1. For the function 1/( 2 + ), the largest possible value is 1/ , so we should take = 1 and consider the membership function We also need to take into account that the numerical value of a physical quantity depends on the choice of the measuring unit and on the choice of the starting point. If we change a measuring unit and/or a starting point, then we get new numerical values which can be obtained from previous values by a linear transformation = · + , where is the ratio of the measuring units and is the difference in starting points. A classical example is the relation between temperature in Celsius and temperature in Fahrenheit: When the original values are described by the membership function (10), then, to get the membership function for the new numerical values , we need to substitute, into the formula (10), the expression = − that describes (the old value) in terms of (the new value). As a result, for the new values, we get the following membership function Thus, the membership functions for which the computation is the simplest are Cauchy membership functions (11).

C. PROOF OF THE LEARNING FORMULAS
Let us consider a generic cluster , such that a k is a row vector of dimension N, a k ( ) = , = 1, . . ., . The same applies to b k and m k , but not to the independent term . We can build the following matrix that considers the gradient of the loss function for each of the different parameters, of this cluster.
Then, the gradient of J is ∇J( Let us calculate each of the derivatives of R k , to do so, we use as one of the four possible parameters, ∈ , , , . Then, We study the last derivative expression separately, and substitute the definition ofˆ(x i ), The term is null because no parameter has an influence over the observation. Thus, Again, we resort to the definition ofˆ(x i ), to simplify the second term, +ˆ( The parameter has only influence over (x i ) or (x i ) when = . Thus, Solving, What follows is replacing for each one of the four parameters it represents. For the case of , the term Similarly, the derivative with respect to is For the case of , the term (x i ) is null, so Let us calculate (x i ) , Thus, Similarly, the derivative with respect to is Substituting the expressions obtained of [ −ˆ(x i ) ] for each parameter in (15), we obtain the derivatives