An ℓ0-Norm-Based Centers Selection for Failure Tolerant RBF Networks

I. INTRODUCTION
Constructing a radial basis function (RBF) network [1]-[3] involves two key issues. The first is to select suitable RBF centers. The second is to determine the RBF weights. There are many ways to perform RBF center selection. We can select all the input vectors from the training samples as RBF centers [4], but this method may create a network with an unnecessarily complicated structure. Another way is to randomly select a subset of the input vectors as the RBF centers, but this simple method cannot ensure that the constructed RBFs cover the input space well [5]. Other advanced methods include clustering algorithms [6], the orthogonal least squares (OLS) approach [7], [8], and support vector regression (SVR) [9], [10]. However, many algorithms in this area do not consider fault tolerance.
The associate editor coordinating the review of this manuscript and approving it for publication was Guiwu Wei.
Biological neural networks have the capability to tolerate fault or noise conditions [11]. For example, humans can recognize an object from a noisy image or video. In addition, when a few neural cells or synapses malfunction, the human brain can still work properly. Since the concept of neural networks comes from biological neural networks, researchers expect a trained neural network to have some capability to tolerate weight or neuron failure. However, many studies have reported that when fault/noise tolerant procedures are not introduced during training, the faulty version of a well-trained network may have poor performance [12]-[15].
In realizing a neural network, we face some practical issues [16]. When we use analog technology to realize a dot-product operation, the accuracy is affected by the offset voltage of operational amplifiers [17]. In addition, for analog components, we usually define their accuracy in terms of percentage error [18]. In this case, we can use multiplicative noise to model the error. For digital realization, when a weight is represented in the floating point format, round-off error occurs, and it can also be described by the multiplicative noise model [19]. Apart from multiplicative noise, physical faults may happen [20]; such a fault blocks the data/signal transmission between two connected neurons. Besides, nowadays, very large scale integration (VLSI) implementations can be at nano-scale, where transient noise/failures may happen [21].
In the last twenty years, several fault tolerant algorithms were proposed [13], [14], [22]-[24]. However, most of them assume that a trained network is affected by one kind of fault condition only. For instance, in [13], [14], only the open node fault model was considered. Recently, [25] first described the concurrent fault situation, in which a trained network is affected by multiplicative weight noise and open weight fault concurrently. However, the weight decay term in [25] is not optimal for the fault tolerance situation. Later, another approach was proposed in [19]. It is based on regularization and OLS center selection, and its performance is better than that of many existing methods. However, due to the two-stage OLS approach, it cannot select centers and train an RBF network at the same time. To improve the performance of the network and complete the two steps simultaneously, an $\ell_1$-norm based fault tolerant RBF center selection method [26] was proposed recently.
As the $\ell_0$-norm is, as seen in compressive sensing, a much more direct way to control the number of nonzero elements, this paper investigates using the $\ell_0$-norm to replace the $\ell_1$-norm for center selection. Since the $\ell_0$-norm is a noncontinuous function, it is difficult to design an algorithm to minimize an $\ell_0$-norm based objective function. This paper introduces the minimax concave penalty (MCP) function [27], [28], which is an approximation of the $\ell_0$-norm, into the objective function. Since the MCP function is able to limit the number of RBF nodes used, minimizing the modified objective function can remove some unimportant RBF nodes during training. However, because the MCP function is still nondifferentiable and nonconvex, traditional gradient-descent-like algorithms are unable to minimize the modified objective function. Based on the alternating direction method of multipliers (ADMM) framework [29], we develop an algorithm, called ADMM-MCP, to minimize the modified objective function. The ADMM framework breaks the minimization problem into three parts, each of which can be solved in a much easier way, even though some of them contain nonconvex and nondifferentiable terms. A theoretical analysis of the convergence of the proposed ADMM-MCP algorithm is then provided. Simulation results show that the proposed ADMM-MCP algorithm is superior to many existing center selection algorithms under the concurrent fault situation.
The contributions of this paper are as follows.
• Based on the MCP concept (an approximation of the $\ell_0$-norm), we derive a fault tolerant objective function for training RBF networks and selecting RBF nodes simultaneously.
• Based on the ADMM framework, we derive the updating equations to minimize the proposed objective function.
• We show that the training weight vector converges to a limit point. Besides, the limit point is a stationary point of the Lagrangian function of the objective function.
• Simulations show that the performance of the proposed algorithm is better than that of many existing training algorithms. Besides, the paired t-test results indicate that the improvement of the proposed algorithm over the other algorithms is statistically significant.
The rest of this paper is organized as follows. The backgrounds of the ADMM framework and of RBF neural networks under the concurrent fault situation are presented in Section II. In Section III, the proposed algorithm is developed. Its convergence properties are analyzed in Section IV. Simulation results are provided in Section V. Finally, concluding remarks are drawn in Section VI.

A. NOTATION
We use a lower-case or upper-case letter to represent a scalar, while vectors and matrices are denoted by bold lower-case and upper-case letters, respectively. The transpose operator is denoted as $(\cdot)^{\mathrm T}$, and $\boldsymbol{I}$ represents the identity matrix with appropriate dimensions. Other mathematical symbols are defined at their first appearance.

B. RBF NETWORKS UNDER CONCURRENT FAULT SITUATION
In this paper, the training set is expressed as $\mathcal{D}_t = \{(\boldsymbol{x}_i, y_i): i = 1, \ldots, N\}$, where $\boldsymbol{x}_i$ is the input vector of the $i$-th sample with dimension $K$, and $y_i$ is the corresponding output. Similarly, the test set is expressed as $\mathcal{D}_f = \{(\boldsymbol{x}'_i, y'_i): i = 1, \ldots, N'\}$. In the RBF approach, the input-output relationship of the data set is approximated by a weighted sum of the outputs of $M$ RBFs, given by

$f(\boldsymbol{x}) = \sum_{j=1}^{M} w_j\, a_j(\boldsymbol{x})$,

where $a_j(\boldsymbol{x}) = \exp\!\big(-\|\boldsymbol{x} - \boldsymbol{c}_j\|_2^2 / s\big)$ is the output of the $j$-th RBF node, $w_j$ is the corresponding weight, $\boldsymbol{c}_j$ is the center of the $j$-th RBF node, and $s$ is a parameter which controls the RBF width. Usually, the RBF centers are selected from the training input vectors $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\}$. If we use all the training input vectors as centers, the resultant network suffers from a serious overfitting problem in the faultless situation. In addition, using all the training input vectors wastes resources. Therefore, center selection is a key step in the training of an RBF network.

VOLUME 7, 2019

For a fault-free network, the training set mean square error (MSE) is given by

$\mathrm{MSE} = \frac{1}{N}\,\|\boldsymbol{y} - \boldsymbol{A}\boldsymbol{w}\|_2^2$,

where $\boldsymbol{w} = [w_1, \cdots, w_M]^{\mathrm T}$, $\boldsymbol{y} = [y_1, \cdots, y_N]^{\mathrm T}$, $\boldsymbol{A}$ is an $N \times M$ matrix, and the $(i,j)$ entry of $\boldsymbol{A}$ is given by $A_{i,j} = a_j(\boldsymbol{x}_i)$.

In the implementation of an RBF network, weight failure may happen. Multiplicative weight noise and open weight fault are two common faults in the RBF network [18], [22]-[24], [30]-[33]. When they occur concurrently [19], [25], the implemented weights can be described by

$\tilde{w}_j = (w_j + b_j w_j)\,\beta_j, \quad j = 1, \ldots, M$. (6)

In (6), the $\beta_j$'s are variables that describe whether the weights are opened or not. When the $j$-th weight is opened, $\beta_j$ is equal to 0; otherwise, $\beta_j$ is equal to 1. This paper assumes that the $\beta_j$'s are independently and identically distributed (i.i.d.) binary random variables. The probability mass function of the $\beta_j$'s is given by

$\operatorname{Prob}(\beta_j) = P_\beta$, for $\beta_j = 0$; and $\operatorname{Prob}(\beta_j) = 1 - P_\beta$, for $\beta_j = 1$. (7)
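To make the fault model concrete, the sketch below (our illustration, with hypothetical helper names) builds the RBF design matrix and checks the first two moments of the implemented weights in (6) by Monte Carlo. It assumes the $b_j$'s are Gaussian, although the analysis only requires them to be zero-mean with variance $\sigma_b^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_design_matrix(X, C, s):
    """A[i, j] = exp(-||x_i - c_j||^2 / s) for inputs X (N x K), centers C (M x K)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / s)

# Monte Carlo check of the implemented-weight model (6):
# w~_j = (w_j + b_j * w_j) * beta_j, with Prob(beta_j = 0) = P_beta, <b_j^2> = sigma_b^2.
P_beta, sigma_b2, w = 0.1, 0.04, 2.0
T = 200_000
beta = (rng.random(T) >= P_beta).astype(float)   # open-fault indicators
b = rng.normal(0.0, np.sqrt(sigma_b2), T)        # multiplicative weight noise
w_tilde = (w + b * w) * beta

print(w_tilde.mean())         # ~ (1 - P_beta) * w = 1.8
print((w_tilde ** 2).mean())  # ~ (1 - P_beta) * (1 + sigma_b2) * w^2 = 3.744
```

These first and second moments are exactly the quantities that enter the average MSE over fault patterns below.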
The statistics of the $\beta_j$'s are then given by

$\langle \beta_j \rangle = \langle \beta_j^2 \rangle = 1 - P_\beta$, and $\langle \beta_j \beta_{j'} \rangle = (1-P_\beta)^2$ for $j \neq j'$. (8)

The term $b_j w_j$ in (6) is the multiplicative noise. It can be seen that the magnitude of the noise is proportional to that of the weight. This paper assumes that the $b_j$'s are i.i.d. zero-mean random variables with variance $\sigma_b^2$. With this assumption, the statistics of the $b_j$'s are summarized as

$\langle b_j \rangle = 0$, $\langle b_j b_{j'} \rangle = 0$ for $j \neq j'$, and $\langle b_j^2 \rangle = \sigma_b^2$, (9)

where $\langle \cdot \rangle$ is the expectation operator. Given a particular fault pattern $\{\beta_j, b_j\}$, the training set MSE is

$\mathrm{MSE}(\boldsymbol{\beta}, \boldsymbol{b}) = \frac{1}{N} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{M} \tilde{w}_j\, a_j(\boldsymbol{x}_i) \Big)^2$. (10)

From (10) and (9), the average training set MSE [19] over all possible fault patterns is given by

$\overline{\mathrm{MSE}} = \frac{1}{N} \Big[ \boldsymbol{y}^{\mathrm T}\boldsymbol{y} - 2(1-P_\beta)\, \boldsymbol{y}^{\mathrm T} \boldsymbol{A}\boldsymbol{w} + (1-P_\beta)^2\, \boldsymbol{w}^{\mathrm T} \boldsymbol{A}^{\mathrm T}\boldsymbol{A}\, \boldsymbol{w} + (1-P_\beta)(P_\beta+\sigma_b^2)\, \boldsymbol{w}^{\mathrm T} \operatorname{diag}(\boldsymbol{A}^{\mathrm T}\boldsymbol{A})\, \boldsymbol{w} \Big]$. (12)

In (12), the term $\boldsymbol{y}^{\mathrm T}\boldsymbol{y}$ is independent of the weight vector $\boldsymbol{w}$. Hence, after dropping the constant terms and scaling by $1/(1-P_\beta)$, the training objective can be defined as

$\psi(\boldsymbol{w}) = \frac{1}{N}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 + \boldsymbol{w}^{\mathrm T}\boldsymbol{R}\boldsymbol{w}$, (13)

where

$\boldsymbol{R} = \frac{1}{N}\Big[(P_\beta+\sigma_b^2)\operatorname{diag}(\boldsymbol{A}^{\mathrm T}\boldsymbol{A}) - P_\beta\, \boldsymbol{A}^{\mathrm T}\boldsymbol{A}\Big]$. (14)

C. THE ADMM FRAMEWORK

The ADMM framework solves optimization problems by breaking them into smaller pieces [29]. Suppose we have the following objective function:

$Q(\boldsymbol{z}) = \psi(\boldsymbol{z}) + g(\boldsymbol{z})$,

where $\boldsymbol{z} \in \mathbb{R}^n$. The objective function is separated into two terms: $\psi(\cdot)$ and $g(\cdot)$. If the term $g(\boldsymbol{z})$ is nonconvex and nondifferentiable, then it is difficult to minimize $Q(\boldsymbol{z})$ directly. The ADMM framework introduces a dummy vector $\boldsymbol{y} \in \mathbb{R}^n$ and reformulates the minimization problem as a constrained optimization problem, given by

$\min_{\boldsymbol{z},\boldsymbol{y}}\;\psi(\boldsymbol{z}) + g(\boldsymbol{y})$ (15a)
subject to $\boldsymbol{z} = \boldsymbol{y}$. (15b)

We then construct an augmented Lagrangian function:

$L(\boldsymbol{z},\boldsymbol{y},\boldsymbol{\alpha}) = \psi(\boldsymbol{z}) + g(\boldsymbol{y}) + \boldsymbol{\alpha}^{\mathrm T}(\boldsymbol{z}-\boldsymbol{y}) + \frac{\rho}{2}\|\boldsymbol{z}-\boldsymbol{y}\|_2^2$, (16)

where $\boldsymbol{\alpha} \in \mathbb{R}^n$ is the Lagrange multiplier vector, and $\rho > 0$ is a parameter that affects the convergence speed. The algorithm consists of three steps per iteration, given by

$\boldsymbol{y}^{k+1} = \arg\min_{\boldsymbol{y}}\; L(\boldsymbol{z}^{k}, \boldsymbol{y}, \boldsymbol{\alpha}^{k})$, (17a)
$\boldsymbol{z}^{k+1} = \arg\min_{\boldsymbol{z}}\; L(\boldsymbol{z}, \boldsymbol{y}^{k+1}, \boldsymbol{\alpha}^{k})$, (17b)
$\boldsymbol{\alpha}^{k+1} = \boldsymbol{\alpha}^{k} + \rho\,(\boldsymbol{z}^{k+1} - \boldsymbol{y}^{k+1})$. (17c)

It should be noticed that for many forms of $g(\boldsymbol{y})$, we have closed-form solutions for (17a), even though $g(\boldsymbol{y})$ is nonconvex and nondifferentiable.
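To make the three-step iteration concrete, here is a minimal, self-contained sketch of (15)-(17) on a toy problem of our own choosing (not from the paper): $\psi(\boldsymbol{z}) = \frac{1}{2}\|\boldsymbol{z}-\boldsymbol{c}\|_2^2$ and $g(\boldsymbol{y}) = \lambda\|\boldsymbol{y}\|_1$. Both subproblems have closed forms, and the exact minimizer is the soft-threshold of $\boldsymbol{c}$, so the iteration can be verified.

```python
import numpy as np

def soft(t, kappa):
    """Soft-threshold operator: sign(t) * max(|t| - kappa, 0)."""
    return np.sign(t) * np.maximum(np.abs(t) - kappa, 0.0)

# Toy instance of (15): psi(z) = 0.5 * ||z - c||^2_2, g(y) = lam * ||y||_1.
# Its exact minimizer is soft(c, lam), so the iteration can be checked against it.
c = np.array([3.0, 0.2, -1.5])
lam, rho = 1.0, 1.0
z = np.zeros(3); y = np.zeros(3); alpha = np.zeros(3)

for _ in range(200):
    y = soft(z + alpha / rho, lam / rho)      # (17a): closed-form y-update
    z = (c + rho * y - alpha) / (1.0 + rho)   # (17b): closed-form z-update
    alpha = alpha + rho * (z - y)             # (17c): dual update

print(z)  # ~ soft(c, lam) = [2, 0, -0.5]
```

Note that even though $g$ is nonsmooth, its step (17a) is an exact proximal update; this is the structure the ADMM-MCP algorithm exploits in Section III.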

III. DEVELOPMENT OF ADMM-MCP
In (13), we discussed using $M$ RBF centers, $\{\boldsymbol{c}_1, \cdots, \boldsymbol{c}_M\}$, to construct a network. However, we have not yet discussed how to create them. Suppose that we use all the training input vectors, $\{\boldsymbol{x}_1, \cdots, \boldsymbol{x}_N\}$, as the RBF centers. The expression of the training objective does not change and is still given by

$\psi(\boldsymbol{w}) = \frac{1}{N}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 + \boldsymbol{w}^{\mathrm T}\boldsymbol{R}\boldsymbol{w}$. (18)

However, the definitions of $\boldsymbol{w}$ and $\boldsymbol{R}$ change: now $\boldsymbol{w} = [w_1,\cdots,w_N]^{\mathrm T}$, $\boldsymbol{A}$ is an $N\times N$ matrix with $(i,j)$ entry $a_j(\boldsymbol{x}_i) = \exp(-\|\boldsymbol{x}_i - \boldsymbol{x}_j\|_2^2/s)$, and $\boldsymbol{R}$ is computed from this $\boldsymbol{A}$. (19)

In the rest of this section, we develop the ADMM-MCP algorithm, in which we pack an approximated $\ell_0$-norm term, namely the MCP term, into the objective function stated in (18). The packed MCP term has the ability to force some RBF weights to zero. Hence, during training, the center selection process is achieved automatically.

A. OBJECTIVE FUNCTION AND ADMM FORMULATION
We introduce an additional $\ell_0$ penalty term into (13), given by

$Q_0(\boldsymbol{w},\lambda) = \frac{1}{N}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 + \boldsymbol{w}^{\mathrm T}\boldsymbol{R}\boldsymbol{w} + \lambda\|\boldsymbol{w}\|_0$, (20)

where $\|\boldsymbol{w}\|_0$ is the $\ell_0$-norm of the weight vector, i.e., the number of nonzero entries in $\boldsymbol{w}$. The parameter $\lambda$ is a regularization parameter that controls the number of RBF nodes in the resultant network. Strictly speaking, the $\ell_0$-norm is not a norm. Due to the nature of the $\ell_0$-norm term, the problem stated in (20) is NP-hard [34]. Inspired by [27], [28], we use the MCP function, which is a very attractive approximation of the $\ell_0$-norm. Hence, we modify the objective function stated in (20) as

$Q_{\mathrm{mcp}}(\boldsymbol{w},\lambda) = \frac{1}{N}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 + \boldsymbol{w}^{\mathrm T}\boldsymbol{R}\boldsymbol{w} + P_{\lambda,\gamma}(\boldsymbol{w})$, (21)

where $P_{\lambda,\gamma}(\boldsymbol{w}) = \sum_{i=1}^{N} P_{\lambda,\gamma}(w_i)$, and the scalar MCP penalty is $P_{\lambda,\gamma}(t) = \lambda|t| - \frac{t^2}{2\gamma}$ for $|t| \le \gamma\lambda$, and $P_{\lambda,\gamma}(t) = \frac{\gamma\lambda^2}{2}$ otherwise. The shape of the MCP penalty function with various settings is shown in Figure 1. Although the form of $Q_{\mathrm{mcp}}$ has better properties, we do not have a closed-form solution for minimizing $Q_{\mathrm{mcp}}(\boldsymbol{w},\lambda)$ directly. We therefore use the ADMM framework to minimize the objective function $Q_{\mathrm{mcp}}(\boldsymbol{w},\lambda)$. Firstly, we introduce a dummy variable vector $\boldsymbol{u} = [u_1, \ldots, u_N]^{\mathrm T}$ and transform the unconstrained problem stated in (21) into the standard ADMM form, given by

$\min_{\boldsymbol{w},\boldsymbol{u}}\; \psi(\boldsymbol{w}) + P_{\lambda,\gamma}(\boldsymbol{u})$ subject to $\boldsymbol{w} = \boldsymbol{u}$, (22)

where $\psi(\boldsymbol{w}) = \frac{1}{N}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 + \boldsymbol{w}^{\mathrm T}\boldsymbol{R}\boldsymbol{w}$. We then construct the augmented Lagrangian as

$L(\boldsymbol{w},\boldsymbol{u},\boldsymbol{\upsilon}) = \psi(\boldsymbol{w}) + P_{\lambda,\gamma}(\boldsymbol{u}) + \boldsymbol{\upsilon}^{\mathrm T}(\boldsymbol{w}-\boldsymbol{u}) + \frac{\rho}{2}\|\boldsymbol{w}-\boldsymbol{u}\|_2^2$. (25)

• According to (17a), the $\boldsymbol{u}$-update is

$\boldsymbol{u}^{k+1} = \arg\min_{\boldsymbol{u}}\; P_{\lambda,\gamma}(\boldsymbol{u}) + \frac{\rho}{2}\|\boldsymbol{v}^{k} - \boldsymbol{u}\|_2^2$, where $\boldsymbol{v}^{k} = \boldsymbol{w}^{k} + \boldsymbol{\upsilon}^{k}/\rho$. (26)

This problem separates over the coordinates.
- When $\rho\gamma > 1$, the closed-form solution [27], [28] is given by

$u_i^{k+1} = \dfrac{S(v_i^{k}, \lambda/\rho)}{1 - 1/(\rho\gamma)}$ if $|v_i^{k}| \le \gamma\lambda$, and $u_i^{k+1} = v_i^{k}$ otherwise, (27)

for $i = 1, \cdots, N$, where $S$ denotes the soft-threshold operator [35], $S(t,\kappa) = \operatorname{sign}(t)\max(|t|-\kappa, 0)$. It is worth noting that when $\gamma \to \infty$, the update reduces to soft-thresholding, while as $\rho\gamma \to 1^{+}$, it approaches hard-thresholding.
- When $\rho\gamma = 1$, the closed-form solution of (26) is given by

$u_i^{k+1} = v_i^{k}$ if $|v_i^{k}| > \gamma\lambda$, and $u_i^{k+1} = 0$ otherwise. (28)

- When $\rho\gamma < 1$, the closed-form solution of (26) is given by

$u_i^{k+1} = v_i^{k}$ if $|v_i^{k}| > \sqrt{\gamma/\rho}\,\lambda$, and $u_i^{k+1} = 0$ otherwise. (29)

• According to (17b), the ADMM iteration of $\boldsymbol{w}^{k+1}$ is given by

$\boldsymbol{w}^{k+1} = \Big[\tfrac{2}{N}\boldsymbol{A}^{\mathrm T}\boldsymbol{A} + 2\boldsymbol{R} + \rho\boldsymbol{I}\Big]^{-1} \Big[\tfrac{2}{N}\boldsymbol{A}^{\mathrm T}\boldsymbol{y} + \rho\,\boldsymbol{u}^{k+1} - \boldsymbol{\upsilon}^{k}\Big]$. (30)

• From (17c), $\boldsymbol{\upsilon}^{k+1}$ is updated as

$\boldsymbol{\upsilon}^{k+1} = \boldsymbol{\upsilon}^{k} + \rho\,(\boldsymbol{w}^{k+1} - \boldsymbol{u}^{k+1})$. (31)
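Putting the three updates together, the whole loop can be sketched as follows. This is our illustration rather than the authors' code: the toy data, the choice of $\rho$ and $\gamma$ with $\rho\gamma > 1$ (so the u-update uses the soft-threshold-like branch), and the matrix-inverse form of the w-update are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcp_prox(v, lam, gamma, rho):
    """Closed-form u-update for the case rho * gamma > 1."""
    s = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)   # soft threshold
    return np.where(np.abs(v) <= gamma * lam, s / (1.0 - 1.0 / (rho * gamma)), v)

def admm_mcp(A, y, R, lam, gamma=2.0, rho=8.0, iters=300):
    """Sketch of ADMM-MCP: minimize (1/N)||y - A w||^2 + w'Rw + P_{lam,gamma}(w)."""
    N, M = A.shape
    # The w-update is an unconstrained quadratic; its normal equations read
    # [(2/N) A'A + 2R + rho I] w = (2/N) A'y + rho u - upsilon.
    H_inv = np.linalg.inv((2.0 / N) * A.T @ A + 2.0 * R + rho * np.eye(M))
    g = (2.0 / N) * A.T @ y
    w = np.zeros(M); u = np.zeros(M); ups = np.zeros(M)
    for _ in range(iters):
        u = mcp_prox(w + ups / rho, lam, gamma, rho)   # MCP proximal step
        w = H_inv @ (g + rho * u - ups)                # quadratic solve
        ups = ups + rho * (w - u)                      # dual update
    return u  # sparse copy of w; zero entries mark pruned RBF nodes

# Toy demo: 1-D regression with all training inputs used as candidate centers.
X = np.linspace(-3.0, 3.0, 40)
y = np.sin(X) + 0.05 * rng.normal(size=40)
A = np.exp(-(X[:, None] - X[None, :]) ** 2 / 1.0)      # N x N design matrix
P_beta, sigma_b2 = 0.01, 0.01
R = ((P_beta + sigma_b2) * np.diag(np.diag(A.T @ A)) - P_beta * (A.T @ A)) / len(y)
u = admm_mcp(A, y, R, lam=0.01)
print(np.count_nonzero(u), "of", len(u), "nodes kept")
```

Caching the inverse of the w-update matrix outside the loop is possible because the matrix does not depend on the iterates; for large $N$, a Cholesky factorization would be the more economical choice.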

IV. ANALYSIS OF CONVERGENCE
This section presents the convergence properties of the ADMM-MCP algorithm. We use the general convergence result for nonconvex ADMM given in [36], which requires showing that our algorithm satisfies three conditions. The general convergence result is given by Theorem 1 [36]. Theorem 1: If an ADMM based algorithm satisfies the three conditions stated below, then the sequence $\{\boldsymbol{w}^{k}, \boldsymbol{u}^{k}, \boldsymbol{\upsilon}^{k}\}$ has at least one limit point $\{\boldsymbol{w}^{*}, \boldsymbol{u}^{*}, \boldsymbol{\upsilon}^{*}\}$, and any limit point $\{\boldsymbol{w}^{*}, \boldsymbol{u}^{*}, \boldsymbol{\upsilon}^{*}\}$ is a stationary point of the Lagrangian function.
C1 (Sufficient decrease condition): For each $k$, there exists a $\tau_1 > 0$ such that
$L(\boldsymbol{w}^{k+1}, \boldsymbol{u}^{k+1}, \boldsymbol{\upsilon}^{k+1}) \le L(\boldsymbol{w}^{k}, \boldsymbol{u}^{k}, \boldsymbol{\upsilon}^{k}) - \tau_1 \|\boldsymbol{w}^{k+1} - \boldsymbol{w}^{k}\|_2^2$.
C2 (Boundedness condition): The sequence $\{\boldsymbol{w}^{k}, \boldsymbol{u}^{k}, \boldsymbol{\upsilon}^{k}\}$ is bounded, and its Lagrangian function is lower bounded.
C3 (Subgradient bound condition): For each $k$, there exist a $\tau_2 > 0$ and a $\boldsymbol{d}^{k+1} \in \partial L(\boldsymbol{w}^{k+1}, \boldsymbol{u}^{k+1}, \boldsymbol{\upsilon}^{k+1})$ such that $\|\boldsymbol{d}^{k+1}\|_2 \le \tau_2 \|\boldsymbol{w}^{k+1} - \boldsymbol{w}^{k}\|_2$.
In the rest of this section, we will show that the proposed algorithm satisfies the three conditions of Theorem 1.
Proposition 1: If ρ is greater than a certain value, the ADMM-MCP algorithm satisfies the sufficient decrease condition in C1.
Proof: The Lagrangian function can be rewritten as

$L(\boldsymbol{w},\boldsymbol{u},\boldsymbol{\upsilon}) = \psi(\boldsymbol{w}) + P_{\lambda,\gamma}(\boldsymbol{u}) + \boldsymbol{\upsilon}^{\mathrm T}(\boldsymbol{w}-\boldsymbol{u}) + \frac{\rho}{2}\|\boldsymbol{w}-\boldsymbol{u}\|_2^2$. (35)

Note that $\psi(\boldsymbol{w}) = \frac{1}{N}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 + \boldsymbol{w}^{\mathrm T}\boldsymbol{R}\boldsymbol{w}$. Since $\psi(\boldsymbol{w})$ is strongly convex, we can deduce that (35) is also strongly convex with respect to $\boldsymbol{w}$. Hence, based on the definition of a strongly convex function and the optimality of $\boldsymbol{w}^{k+1}$, we have the relationship between $L(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k})$ and $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k})$, given by

$L(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k}) \le L(\boldsymbol{w}^{k},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k}) - \frac{a}{2}\|\boldsymbol{w}^{k+1}-\boldsymbol{w}^{k}\|_2^2$, (36)

where $a > 0$ is the strong convexity modulus.
Since $L(\boldsymbol{w},\boldsymbol{u},\boldsymbol{\upsilon})$ is a strictly convex function with respect to $\boldsymbol{w}$, the $\boldsymbol{w}$-update satisfies the optimality condition

$\nabla\psi(\boldsymbol{w}^{k+1}) + \boldsymbol{\upsilon}^{k} + \rho(\boldsymbol{w}^{k+1}-\boldsymbol{u}^{k+1}) = \boldsymbol{0}$. (37)

From (37) and the $\boldsymbol{\upsilon}$-update, we can deduce that

$\boldsymbol{\upsilon}^{k+1} = -\nabla\psi(\boldsymbol{w}^{k+1})$. (38)

Thus we have the relationship between $L(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k+1})$ and $L(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k})$, given by

$L(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k+1}) = L(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k}) + \frac{1}{\rho}\|\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\|_2^2 \le L(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k}) + \frac{l_\psi^2}{\rho}\|\boldsymbol{w}^{k+1}-\boldsymbol{w}^{k}\|_2^2$, (39)

where $l_\psi$ is a Lipschitz constant of the gradient of $\psi(\boldsymbol{w})$, and the last inequality follows from the fact that $\psi(\boldsymbol{w})$ has a Lipschitz continuous gradient, so that $\|\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\|_2 \le l_\psi\|\boldsymbol{w}^{k+1}-\boldsymbol{w}^{k}\|_2$. Combining the decrease from the $\boldsymbol{w}$-update, the decrease from the $\boldsymbol{u}$-update (which minimizes $L(\boldsymbol{w}^{k},\boldsymbol{u},\boldsymbol{\upsilon}^{k})$), and (39), the sufficient decrease condition C1 holds with $\tau_1 = a/2 - l_\psi^2/\rho > 0$, provided that $\rho > 2 l_\psi^2/a$.
Proposition 2: If $\rho \ge l_\psi$, then $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k})$ is bounded for all $k$, and $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k})$ converges as $k \to \infty$. In addition, the sequence $\{\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}\}$ is bounded. Proof: The proof consists of two parts: first, that $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k})$ is bounded; second, that the sequence $\{\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}\}$ is bounded.
The proof for $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k})$ being bounded: First, we prove that $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k})$ is lower bounded for all $k$. From the optimality condition of the $\boldsymbol{w}$-update, $\boldsymbol{\upsilon}^{k} = -\nabla\psi(\boldsymbol{w}^{k})$. Thus

$L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}) = \psi(\boldsymbol{w}^{k}) + P_{\lambda,\gamma}(\boldsymbol{u}^{k}) + \nabla\psi(\boldsymbol{w}^{k})^{\mathrm T}(\boldsymbol{u}^{k}-\boldsymbol{w}^{k}) + \frac{\rho}{2}\|\boldsymbol{w}^{k}-\boldsymbol{u}^{k}\|_2^2$.

From Lemma 3.1 in [37] (the descent lemma) and the Lipschitz continuous gradient of $\psi(\boldsymbol{w})$,

$\psi(\boldsymbol{u}^{k}) \le \psi(\boldsymbol{w}^{k}) + \nabla\psi(\boldsymbol{w}^{k})^{\mathrm T}(\boldsymbol{u}^{k}-\boldsymbol{w}^{k}) + \frac{l_\psi}{2}\|\boldsymbol{u}^{k}-\boldsymbol{w}^{k}\|_2^2$.

Hence, we have

$L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}) \ge \psi(\boldsymbol{u}^{k}) + P_{\lambda,\gamma}(\boldsymbol{u}^{k}) + \frac{\rho - l_\psi}{2}\|\boldsymbol{w}^{k}-\boldsymbol{u}^{k}\|_2^2$. (45)

Obviously, if $\rho \ge l_\psi$, then the right-hand side of (45) is greater than $-\infty$. Hence $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k})$ is lower bounded. According to the proof of Proposition 1, $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k})$ is sufficiently decreasing; hence it is upper bounded by $L(\boldsymbol{w}^{0},\boldsymbol{u}^{0},\boldsymbol{\upsilon}^{0})$.
The proof for $\{\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}\}$ being bounded: Next, we prove that the sequence $\{\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}\}$ is bounded. From the sufficient decrease property established in Proposition 1, summing over $k$ gives

$\sum_{k=1}^{l} \|\boldsymbol{w}^{k+1}-\boldsymbol{w}^{k}\|_2^2 \le \frac{1}{\tau_1}\Big[L(\boldsymbol{w}^{1},\boldsymbol{u}^{1},\boldsymbol{\upsilon}^{1}) - L(\boldsymbol{w}^{l+1},\boldsymbol{u}^{l+1},\boldsymbol{\upsilon}^{l+1})\Big]$.

Even as $l \to \infty$, we still have $\sum_{k=1}^{\infty} \|\boldsymbol{w}^{k+1}-\boldsymbol{w}^{k}\|_2^2 < \infty$, because the Lagrangian is lower bounded. Moreover, from (45) and the upper bound $L(\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}) \le L(\boldsymbol{w}^{0},\boldsymbol{u}^{0},\boldsymbol{\upsilon}^{0})$, the quantities $\psi(\boldsymbol{u}^{k}) + P_{\lambda,\gamma}(\boldsymbol{u}^{k})$ and $\|\boldsymbol{w}^{k}-\boldsymbol{u}^{k}\|_2^2$ are bounded. Since $\psi$ is strongly convex and hence coercive, $\{\boldsymbol{u}^{k}\}$ is bounded, and thus $\{\boldsymbol{w}^{k}\}$ is bounded. Since $\boldsymbol{\upsilon}^{k} = -\nabla\psi(\boldsymbol{w}^{k})$ and $\nabla\psi$ is continuous, $\{\boldsymbol{\upsilon}^{k}\}$ is also bounded. To sum up, the sequence $\{\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}\}$ is bounded. The proof is completed.

Proposition 3: The proposed ADMM-MCP algorithm satisfies the subgradient bound condition given by C3.
Proof: The proof involves the derivations of three gradients: the gradient $\partial L/\partial\boldsymbol{w}$ of $L$ with respect to $\boldsymbol{w}$, the limiting subgradient $\partial_{\boldsymbol{u}} L$ of $L$ with respect to $\boldsymbol{u}$, and the gradient $\partial L/\partial\boldsymbol{\upsilon}$ of $L$ with respect to $\boldsymbol{\upsilon}$.
For $\partial L/\partial\boldsymbol{w}$, we have

$\frac{\partial L}{\partial \boldsymbol{w}}\Big|_{(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k+1})} = \nabla\psi(\boldsymbol{w}^{k+1}) + \boldsymbol{\upsilon}^{k+1} + \rho(\boldsymbol{w}^{k+1}-\boldsymbol{u}^{k+1})$. (52)

Since $\boldsymbol{w}^{k+1}$ is the optimal solution of $L(\boldsymbol{w},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k})$, we have

$\nabla\psi(\boldsymbol{w}^{k+1}) + \boldsymbol{\upsilon}^{k} + \rho(\boldsymbol{w}^{k+1}-\boldsymbol{u}^{k+1}) = \boldsymbol{0}$. (53)

From (52) and (53), we have

$\frac{\partial L}{\partial \boldsymbol{w}}\Big|_{(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k+1})} = \boldsymbol{\upsilon}^{k+1} - \boldsymbol{\upsilon}^{k}$. (54)

For the limiting subgradient $\partial_{\boldsymbol{u}} L$, we have

$\partial_{\boldsymbol{u}} L(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k+1}) = \partial P_{\lambda,\gamma}(\boldsymbol{u}^{k+1}) - \boldsymbol{\upsilon}^{k+1} - \rho(\boldsymbol{w}^{k+1}-\boldsymbol{u}^{k+1})$. (55)

Since $\boldsymbol{u}^{k+1}$ is the optimal solution of $L(\boldsymbol{w}^{k},\boldsymbol{u},\boldsymbol{\upsilon}^{k})$, we have

$\boldsymbol{0} \in \partial P_{\lambda,\gamma}(\boldsymbol{u}^{k+1}) - \boldsymbol{\upsilon}^{k} - \rho(\boldsymbol{w}^{k}-\boldsymbol{u}^{k+1})$. (56)

From (55) and (56), there exists an element of $\partial_{\boldsymbol{u}} L$ equal to

$\boldsymbol{\upsilon}^{k} - \boldsymbol{\upsilon}^{k+1} + \rho(\boldsymbol{w}^{k}-\boldsymbol{w}^{k+1})$. (57)

For $\partial L/\partial\boldsymbol{\upsilon}$, from (25) and the $\boldsymbol{\upsilon}$-update, we have

$\frac{\partial L}{\partial \boldsymbol{\upsilon}} = \boldsymbol{w}^{k+1} - \boldsymbol{u}^{k+1} = \frac{1}{\rho}(\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k})$. (58)

Hence, from (54), (57), and (58), the norm of the subgradient of $L$ at $(\boldsymbol{w}^{k+1},\boldsymbol{u}^{k+1},\boldsymbol{\upsilon}^{k+1})$ is upper bounded by

$\Big(2 + \frac{1}{\rho}\Big)\|\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\|_2 + \rho\,\|\boldsymbol{w}^{k+1}-\boldsymbol{w}^{k}\|_2$. (59)

From the inequality $\|\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\|_2 \le l_\psi\|\boldsymbol{w}^{k+1}-\boldsymbol{w}^{k}\|_2$ stated in (40), we can deduce that this bound is at most $\tau_2\|\boldsymbol{w}^{k+1}-\boldsymbol{w}^{k}\|_2$, where $\tau_2 = (2 + \frac{1}{\rho})\, l_\psi + \rho > 0$. The proof is completed.
Since the ADMM-MCP algorithm satisfies the three conditions of Theorem 1, the sequence $\{\boldsymbol{w}^{k},\boldsymbol{u}^{k},\boldsymbol{\upsilon}^{k}\}$ has at least one limit point $\{\boldsymbol{w}^{*},\boldsymbol{u}^{*},\boldsymbol{\upsilon}^{*}\}$, and any limit point is a stationary point. In other words, the ADMM-MCP algorithm possesses at least a local convergence property.
V. SIMULATION RESULTS

A. SIMULATION SETTINGS

The performances of the resultant networks are evaluated by the average test set MSE over fault patterns, obtained by replacing the training quantities in (12) with their test set counterparts:

$\overline{\mathrm{MSE}}_{\mathrm{test}} = \frac{1}{N'} \Big[ \boldsymbol{y}'^{\mathrm T}\boldsymbol{y}' - 2(1-P_\beta)\, \boldsymbol{y}'^{\mathrm T} \boldsymbol{A}'\boldsymbol{w} + (1-P_\beta)^2\, \boldsymbol{w}^{\mathrm T} \boldsymbol{A}'^{\mathrm T}\boldsymbol{A}'\, \boldsymbol{w} + (1-P_\beta)(P_\beta+\sigma_b^2)\, \boldsymbol{w}^{\mathrm T} \operatorname{diag}(\boldsymbol{A}'^{\mathrm T}\boldsymbol{A}')\, \boldsymbol{w} \Big]$,

where $\{(\boldsymbol{x}'_i, y'_i): i = 1, \cdots, N'\}$ is the test set, $N'$ is the number of samples in the test set, $\boldsymbol{y}' = [y'_1, \ldots, y'_{N'}]^{\mathrm T}$, and $\boldsymbol{A}'$ is an $N' \times M$ matrix whose $(i,j)$ entry is $a_j(\boldsymbol{x}'_i)$. The two parameters, $P_\beta$ and $\sigma_b^2$, describe the failure levels of the open weight fault and the multiplicative weight noise, respectively. We consider three fault scenarios with different settings of $P_\beta$ and $\sigma_b^2$. For the ADMM-MCP algorithm, we set $\gamma = 1.001$ and $\rho = 0.1$. The parameter $\lambda$ is used to control the number of nodes.
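The average test set MSE can also be estimated by direct Monte Carlo simulation of fault patterns, which is a useful cross-check of the analytic expression. A sketch (hypothetical function name; Gaussian $b_j$ assumed):

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_faulty_mse(A, w, y, P_beta, sigma_b2, trials=2000):
    """Monte Carlo estimate of the average test set MSE over fault patterns:
    each trial draws open faults beta_j and multiplicative noise b_j, forms
    w~_j = (w_j + b_j w_j) beta_j, and measures the resulting MSE."""
    N = len(y)
    total = 0.0
    for _ in range(trials):
        beta = (rng.random(len(w)) >= P_beta).astype(float)
        b = rng.normal(0.0, np.sqrt(sigma_b2), len(w))
        r = y - A @ ((w + b * w) * beta)
        total += (r @ r) / N
    return total / trials

# Example: a network that fits its test set perfectly when fault-free.
A = rng.normal(size=(50, 10))
w = rng.normal(size=10)
y = A @ w                                   # fault-free test MSE is exactly 0
print(avg_faulty_mse(A, w, y, 0.01, 0.01))  # strictly positive under faults
```

This makes explicit why an algorithm trained only for the fault-free objective can look good in Table 3 yet degrade badly in Table 4.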

B. MSE VERSUS THE NUMBER OF HIDDEN NODES
We use the ASN data set to illustrate how the number of nodes is controlled: we can vary the value of $\lambda$ to control the number of RBF nodes. Figures 2(a)-(b) show the test set MSE versus $\lambda$ and the number of nodes versus $\lambda$. Combining Figures 2(a) and (b), we obtain the MSE versus the number of hidden nodes, as shown in Figure 2(c). Unlike common algorithms that produce U-shaped test set MSE curves, the proposed algorithm produces test set MSE curves for faulty networks that are nearly monotonically decreasing: increasing the number of RBF nodes leads to a decrease of the test set MSE of the faulty networks. This is because the additional regularization term not only makes the trained network tolerate the multiplicative noise and node failure, but also has the ability to avoid overfitting. For other datasets and settings, the MSE curves have similar behaviour.
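The pruning mechanism behind this $\lambda$-control can be seen directly from the u-update. With the setting $\gamma = 1.001$ and $\rho = 0.1$ (so $\rho\gamma < 1$), the update reduces to a hard-threshold rule, keeping only entries whose magnitude exceeds a threshold that grows with $\lambda$. A small sketch with stand-in Gaussian weights (not trained weights) shows the monotone effect:

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(0.0, 1.0, 200)   # stand-in weight vector, not trained weights
gamma, rho = 1.001, 0.1         # the paper's setting, so rho * gamma < 1

# Under the hard-threshold form of the u-update, an entry survives only if
# |v| exceeds sqrt(gamma / rho) * lam, so a larger lam prunes more nodes.
thresh = lambda lam: np.sqrt(gamma / rho) * lam
counts = [int(np.count_nonzero(np.abs(w) > thresh(lam)))
          for lam in (0.05, 0.1, 0.2, 0.4)]
print(counts)  # non-increasing in lam
```

The node count is thus a monotone function of $\lambda$, which is what makes the sweep in Figure 2(b) well behaved.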

C. CONVERGENCE
Here we use the ASN dataset with $P_\beta = \sigma_b^2 = 0.01$ as an example to intuitively demonstrate the convergence. Figure 3 shows the objective value $\psi(\boldsymbol{w}) = \frac{1}{N}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 + \boldsymbol{w}^{\mathrm T}\boldsymbol{R}\boldsymbol{w}$ versus the number of iterations. We can see that within 100 to 200 iterations, the training objective value nearly settles down. If we increase the value of $\lambda$, then the algorithm converges to a larger objective value. This is because increasing $\lambda$ restricts the approximation ability of the resultant network, so the resultant network has a larger objective value $\psi(\boldsymbol{w})$. All other datasets and settings show similar convergence behaviour.

D. COMPARISON ALGORITHMS
We compare our proposed algorithm with six other algorithms: the fault tolerant OLS algorithm (OLS) [19], the fault tolerant $\ell_1$-norm approach (ADMM-$\ell_1$) [26], the $\ell_1$-norm regularization approach ($\ell_1$-reg.) [39], the support vector regression algorithm (SVR) [39], the orthogonal forward regression algorithm (OFR) [47], and the Homotopy method (HOM) [48]. Our aim is to show that our proposed algorithm has better RBF center selection capability. We will show that when we do not use all the training input vectors as the RBF centers, the performance of our proposed algorithm is better than that of the six comparison algorithms.
Among the comparison algorithms, the fault tolerant OLS algorithm and the fault tolerant ADMM-$\ell_1$ algorithm have fault tolerant ability. The fault tolerant OLS algorithm consists of two stages: in the first stage, it uses the OLS method to generate a sorted list of RBF nodes; in the second stage, it constructs a fault tolerant RBF network with the desired number of nodes. The fault tolerant ADMM-$\ell_1$ approach is our previous work based on an $\ell_1$-norm regularizer. The $\ell_1$-norm regularization approach [39] considers the original MSE training objective and uses an $\ell_1$-norm regularizer to control the number of RBF nodes. Its fault tolerant ability is inadequate, especially when the fault level is high. The SVR algorithm [39] is able to train the RBF network and select the centers simultaneously. It uses two parameters, $C$ and $\epsilon$, to control the training process. Table 2 shows the parameter settings for different datasets. The SVR algorithm has a certain fault tolerant ability, because the parameter $C$ is capable of limiting the magnitudes of the trained weights, while the parameter $\epsilon$ is used to control its approximation ability. However, the main drawback of the SVR algorithm is that there is no simple way to find an appropriate pair of $C$ and $\epsilon$. In our experiments, we use a trial-and-error method to determine them.
The Homotopy method [48] is an incremental learning method. It has an $\ell_1$-norm regularization term, and it can tune its regularization parameter automatically. The OFR algorithm [47] is an incremental learning method too. It chooses one RBF center at a time with the orthogonal forward regression procedure. For OFR, an $\ell_2$-norm regularization term is used, and the regularization parameter can also be tuned automatically during the training process.
In the following two experiments, the simulation was run 20 times. In each trial, the samples of the dataset were randomly split into a training set and a test set. Table 3 shows the average test set error over the 20 trials. From the table and Figure 4, it can be observed that, under the fault-free environment, the performances of OLS and HOM are better than those of the other algorithms. Under the concurrent fault situation, a comparison algorithm attains a test set MSE of 0.016419 using 169 nodes, while the proposed ADMM-MCP algorithm needs only 139 nodes to lower the test set MSE to 0.015281. Clearly, the performance of the proposed ADMM-MCP is better than that of the comparison algorithms. For each fault level and each dataset, we repeated the experiment for 20 trials; in each trial, the samples of the dataset were randomly split into a training set and a test set. The results are summarized in Table 4. From the table, it can be seen that under the concurrent fault situation, even if we select the best MSE values of SVR, $\ell_1$-reg., HOM, and OFR, their performances are still unacceptable; especially when the fault level is high, their test MSE values are much higher than those of the remaining algorithms. The ADMM-MCP, ADMM-$\ell_1$, and OLS algorithms can effectively reduce the influence of the concurrent fault. Among them, ADMM-MCP is the best, achieving smaller average MSE values while using fewer nodes.

G. PAIRED T-TEST
This subsection uses the paired t-test to show that the improvement of our proposed algorithm is statistically significant. From Table 4, the ADMM-$\ell_1$ and OLS algorithms are the second best and the third best. Hence we perform paired t-tests between ADMM-MCP and ADMM-$\ell_1$, and between ADMM-MCP and OLS. The paired t-test results are summarized in Tables 5 and 6. For a one-tailed test with a 95% level of confidence and 20 trials (19 degrees of freedom), the critical t-value is 1.729.
From the tables, we can see that all the test t-values are greater than 1.729 and all p-values are smaller than 0.05. In other words, we have enough confidence to say that, on average, the proposed ADMM-MCP is better than the ADMM-$\ell_1$ and OLS algorithms. Besides, none of the confidence intervals in the two tables includes zero. Therefore, we can further confirm that the improvement of the proposed ADMM-MCP is statistically significant.
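For reference, the paired t statistic used here is simple to reproduce. The per-trial MSEs below are synthetic stand-ins (the actual per-trial values behind Table 4 are not listed here); the statistic is compared against the one-tailed critical value 1.729 for 19 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic per-trial test MSEs for 20 paired trials (stand-ins, not Table 4 data).
mse_proposed = rng.normal(0.0150, 0.0010, 20)                 # e.g. ADMM-MCP
mse_baseline = mse_proposed + rng.normal(0.0008, 0.0004, 20)  # e.g. a competitor

d = mse_baseline - mse_proposed            # paired differences
n = len(d)
t = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# One-tailed paired t-test at the 95% level with df = n - 1 = 19.
print(t, t > 1.729)
```

With real data, `scipy.stats.ttest_rel` yields the same statistic together with the p-value.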

VI. CONCLUSION
In this paper, the fault tolerant RBF neural network and its center selection problem are studied. Based on the ADMM framework and the $\ell_0$-norm, this paper proposes the ADMM-MCP algorithm. First, we introduce an $\ell_0$-norm term, which has the ability to remove some unimportant RBF nodes during training, into the fault tolerant objective function. Since the $\ell_0$-norm is noncontinuous, we cannot use traditional gradient-descent-like algorithms to minimize the modified objective function. We therefore approximate the $\ell_0$-norm term with the MCP function. However, because the MCP-based objective function is still nonconvex and nonsmooth, traditional gradient-descent-like algorithms still cannot handle it. This paper then applies the ADMM framework to construct an algorithm, namely ADMM-MCP, to train an RBF network and select RBF nodes simultaneously. The ADMM framework breaks down the update into three parts, each of which can be solved effectively, even though some parts contain nonconvex and nondifferentiable terms. In addition, we prove that the algorithm converges. From the experimental results, our ADMM-MCP algorithm is superior to many other existing algorithms.