BD-ELM: A Regularized Extreme Learning Machine Using Biased DropConnect and Biased Dropout

In order to prevent overfitting and improve the generalization performance of the Extreme Learning Machine (ELM), this paper proposes a new regularization method, Biased DropConnect, and a new regularized ELM using Biased DropConnect and Biased Dropout (BD-ELM). Like Biased Dropout applied to hidden nodes, Biased DropConnect exploits the differences among connection weights to retain more of the network's information after dropping. The regular Dropout and DropConnect set the connection weights and the output of the hidden layer to 0 with a single fixed probability. In contrast, Biased DropConnect and Biased Dropout divide the connection weights and hidden nodes into high and low groups by a threshold, and set the different groups to 0 with different probabilities. Connection weights with high values and hidden nodes with high activation values, which contribute more to network performance, are kept with a lower drop probability, while weights and hidden nodes with low values are given a higher drop probability so that the drop probability of the whole network remains a fixed constant. With Biased DropConnect and Biased Dropout regularization, the sparsity of the parameters in BD-ELM is enhanced and the structural complexity is reduced. Experiments on various benchmark datasets show that Biased DropConnect and Biased Dropout can effectively address overfitting, and that BD-ELM achieves higher classification accuracy than ELM, R-ELM, and Drop-ELM.


Introduction
Extreme Learning Machine (ELM) [1], as one of the latest research achievements on Single-hidden Layer Feedforward Neural Networks (SLFNs), has attracted much attention due to its good generalization performance and fast training speed. Compared with traditional SLFNs, in ELM the input weights and biases of the hidden layer are generated randomly without fine-tuning, and the output weights of the hidden layer are the globally optimal solution obtained by the least-squares method, which avoids falling into local optima [2,3]. Although ELM has the universal approximation capability [2,4], it requires a considerable number of hidden nodes to ensure good generalization performance, which makes it prone to overfitting. This makes the prevention of overfitting an urgent problem for ELM.
Regularization is one of the common methods to overcome overfitting [5], and various regularization methods have been applied to the ELM algorithm for this purpose [6][7][8][9]. Wang et al. [6] introduced the $L_1$-norm into ELM to make the network parameters sparse and improve the generalization performance of ELM. A Regularized ELM (R-ELM) based on the $L_2$-norm loss was proposed by Deng et al. [7], which avoids overfitting and improves the robustness of the algorithm. By incorporating the $L_1$-norm loss into OP-ELM (Optimally Pruned ELM), Miche et al. [8] proposed TROP-ELM (Tikhonov-regularized OP-ELM). TROP-ELM adopts both $L_1$-norm and $L_2$-norm regularization, and its generalization performance is significantly improved compared with ELM and OP-ELM.
Dropout [10] and DropConnect [11] are newer regularization methods for fully connected networks. In Dropout and DropConnect, the output and the connection weights of the hidden layer, respectively, are set to 0 with probability p and kept with probability (1 − p). Dropout and DropConnect are both very efficient ways of performing model averaging with neural networks [11]: each iteration trains a smaller network, and the final network is the average of a large ensemble of networks that share parameters. Following the success of Dropout and DropConnect, improved algorithms have been proposed [12][13][14][15]. Based on the differing importance of hidden nodes, Poernomo and Kang [12] proposed Biased Dropout, in which hidden nodes are divided into different groups according to their activation values and the groups are dropped with different probabilities. Cao et al. [13], combining DropConnect with the Adaptive Local Restricted Boltzmann Machine (ALRBM), proposed Adaptive Local Drop, in which the drop probability of the connection weights changes with a distance function.
Inspired by Biased Dropout, this paper proposes Biased DropConnect, which divides the connection weights into high and low groups by setting a threshold and assigns them different drop probabilities. Then, applying Biased DropConnect and Biased Dropout to ELM, this paper proposes a Biased DropConnect- and Biased Dropout-based ELM (BD-ELM).
The empirical studies show that Biased DropConnect and Biased Dropout can effectively overcome the overfitting of ELM, and that, compared with ELM, R-ELM, and Drop-ELM, BD-ELM achieves higher classification accuracy on various benchmark datasets.

2.1. Extreme Learning Machine. ELM is an efficient single-hidden layer feedforward neural network. Suppose the numbers of nodes in the input layer, hidden layer, and output layer are d, L, and m, respectively. For a given training set $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$ with $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^T \in \mathbb{R}^d$, the matrix expression of the network is as follows:

$$H\beta = T, \quad (1)$$

where

$$H = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L \cdot \mathbf{x}_N + b_L) \end{bmatrix}_{N \times L}, \quad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m}, \quad T = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m}, \quad (2)$$

where $\mathbf{w}_i = [w_{i1}, w_{i2}, \ldots, w_{id}]^T$ is the input weight vector of the $i$th hidden node, $b_i$ is its bias, $\mathbf{w}_i$ and $b_i$ are both generated randomly, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the output weight vector, and $g(\cdot)$ is the activation function.
Training ELM amounts to finding the least-squares solution for the output weights of the hidden layer. Without considering regularization, $\beta$ can be calculated as

$$\beta = H^{\dagger} T, \quad (3)$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$. When $HH^T$ is nonsingular, $H^{\dagger} = H^T (HH^T)^{-1}$.

2.2. Dropout and Biased Dropout. As a regularization method for fully connected networks, Dropout can effectively prevent overfitting [16]. During training, the output of each hidden node is set to zero with probability p and kept with probability (1 − p). The probability p is an important constant and needs to be set in advance. When Dropout is applied to a fully connected hidden layer, the output is given as follows:

$$\mathbf{o} = \mathbf{m} \ast g\left(\mathbf{x}^T W\right), \quad (4)$$

where $\mathbf{x} \in \mathbb{R}^d$ is the input of the hidden layer, $W$ (of size $d \times n$) is the input weight matrix (the bias is set to a fixed value of 1 and is included in $W$ for simplicity), $g(\cdot)$ is the activation function, $\ast$ denotes element-wise multiplication, and $\mathbf{m} = [m_1, m_2, \ldots, m_n]^T$ is the binary mask vector with each element $m_j \sim \mathrm{Bernoulli}(1 - p)$. Commonly used activation functions such as relu and tanh have the property that $g(0) = 0$. Thus, (4) can be rewritten as

$$\mathbf{o} = g\left(\mathbf{m} \ast \left(\mathbf{x}^T W\right)\right), \quad (5)$$

in which case Dropout is applied at the input of the activation function.
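To make the preceding definitions concrete, the following minimal NumPy sketch trains a plain ELM via (1)-(3) and applies a standard Dropout mask as in (4). The function and variable names (elm_train, dropout_output, and so on) are our own illustrative choices, not from the original paper.

import numpy as np

def elm_train(X, T, L, g=np.tanh, seed=0):
    """Plain ELM: random hidden layer, least-squares output weights, eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))  # random input weights, never tuned
    b = rng.uniform(-1.0, 1.0, size=L)                # random hidden biases
    H = g(X @ W + b)                                  # hidden-layer output matrix (N x L)
    beta = np.linalg.pinv(H) @ T                      # beta = H^dagger T, eq. (3)
    return W, b, beta

def dropout_output(H, p, rng):
    """Standard Dropout, eq. (4): keep each hidden output with probability 1 - p."""
    m = (rng.random(H.shape) >= p).astype(H.dtype)    # mask ~ Bernoulli(1 - p)
    return m * H

Prediction then follows as g(X @ W + b) @ beta; the Dropout mask is used only during training.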
Since each hidden node has different properties and contributes differently to network performance, using a universal drop probability for all nodes can undermine the effectiveness of Dropout [15]. Biased Dropout, in contrast, takes the differences among hidden nodes into account. Nodes with high activation values are more important, and their deletion has a greater impact on network performance. Therefore, nodes with high activation values should have a lower drop probability. Viewed another way, limiting the drop probability of important nodes makes the network learn and converge faster than regular Dropout [15].
In Biased Dropout, the hidden nodes are divided into high and low groups by a threshold on the activation value. The group with high-activation nodes is assigned a low drop probability to retain the important information within it, while the group with low-activation nodes is given a high drop probability to keep the drop probability of the whole network at a fixed constant. Suppose the drop probability of the nodes with high activation values is $p_D^h$ and the drop probability of the nodes with low activation values is $p_D^l$. The drop probability of the network can be calculated as

$$p_D = \frac{n_h p_D^h + n_l p_D^l}{n}, \quad (6)$$

where $n_h$ and $n_l$ are the numbers of hidden nodes with high and low activation values, respectively, and $n$ is the total number of hidden nodes. When the median of the activation values is chosen as the threshold, $n_h \approx n_l$ and $p_D = (p_D^h + p_D^l)/2$. The output of the hidden layer is calculated in the same way as (4); only the way the mask is generated differs.
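As a sketch of how such a biased mask might be generated, the following function (continuing the NumPy sketch above) uses the median activation as the threshold and draws independent keep decisions per entry; the name biased_dropout_mask and the element-wise treatment of the hidden output matrix are our own illustrative assumptions.

def biased_dropout_mask(H, p_high, p_low, rng):
    """Biased Dropout mask, eq. (6): entries of H at or above the median activation
    are dropped with probability p_high, the rest with the larger p_low."""
    threshold = np.median(H)                     # median activation as the threshold
    p = np.where(H >= threshold, p_high, p_low)  # per-entry drop probability
    return (rng.random(H.shape) >= p).astype(H.dtype)

For example, H = biased_dropout_mask(H, 0.1, 0.3, rng) * H drops high-activation entries with probability 0.1 and low-activation entries with probability 0.3, for an overall drop probability of about (0.1 + 0.3)/2 = 0.2, consistent with (6).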

2.3. DropConnect. DropConnect, as a generalization of Dropout, sets the connection weights to 0 with drop probability p instead of the outputs of the hidden nodes. Similar to Dropout, DropConnect introduces dynamic sparsity into the network, but DropConnect acts on the connection weights, while Dropout acts on the hidden output. When DropConnect is applied to a fully connected hidden layer, the output is given as

$$\mathbf{o} = g\left(\mathbf{x}^T (M \ast W)\right), \quad (7)$$

where $M$ (of size $d \times n$) is the mask matrix and each element $M_{ij} \sim \mathrm{Bernoulli}(1 - p)$. Comparing (7) with (4), because the two methods act on different objects, the ways of generating the mask in Dropout and DropConnect also differ.
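Continuing the sketch, a DropConnect forward pass under these definitions might look as follows; the function name and signature are illustrative.

def dropconnect_forward(x, W, p, g, rng):
    """DropConnect, eq. (7): zero each connection weight independently with probability p."""
    M = (rng.random(W.shape) >= p).astype(W.dtype)  # M_ij ~ Bernoulli(1 - p)
    return g(x @ (M * W))                           # mask the weights, then activate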

BD-ELM
3.1. Biased DropConnect. Inspired by Biased Dropout, we propose Biased DropConnect. The basis of Biased Dropout is the difference among hidden nodes, while the basis of Biased DropConnect is the difference among connection weights. Different connection weights contribute differently to the network: connection weights with high values have a greater impact on the network than those with low values. The difference among connection weights has also been exploited in the amplitude-based pruning technique of Han et al. [17], which obtains a more compact network structure by setting a threshold and pruning the low weights.
In Biased DropConnect, the connection weights are divided into two groups by a threshold. The high-weight group is assigned a low drop probability to retain the more important information, while the low-weight group is given a high drop probability to keep the drop probability of the whole network at a fixed constant. Assume that the drop probability of the high-weight group is $p_{DC}^h$ and the drop probability of the low-weight group is $p_{DC}^l$. Then, the drop probability of the network can be calculated as

$$p_{DC} = \frac{m_h p_{DC}^h + m_l p_{DC}^l}{m}, \quad (8)$$

where $m_h$ and $m_l$ are the numbers of connection weights with high and low values, respectively, and $m$ is the total number of connection weights. When the median weight is used as the threshold, $m_h \approx m_l$ and the drop probability of the network is $p_{DC} = (p_{DC}^h + p_{DC}^l)/2$. When Biased DropConnect is used, the hidden output is calculated in the same way as (7); only the way the mask matrix is generated differs.
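A biased mask over the weights can be sketched analogously to the Biased Dropout mask above. Here we threshold on the weight magnitude, in the spirit of the amplitude-based pruning of Han et al. cited above; reading "high value" as high magnitude is our own assumption.

def biased_dropconnect_mask(W, p_high, p_low, rng):
    """Biased DropConnect mask, eq. (8): weights at or above the median magnitude
    are dropped with probability p_high, the rest with the larger p_low."""
    threshold = np.median(np.abs(W))                     # median |weight| as the threshold
    p = np.where(np.abs(W) >= threshold, p_high, p_low)  # per-weight drop probability
    return (rng.random(W.shape) >= p).astype(W.dtype)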

3.2. Biased DropConnect- and Biased Dropout-Based Extreme Learning Machine. In order to prevent the overfitting of ELM and improve its generalization performance, we incorporate Biased DropConnect and Biased Dropout regularization into ELM and propose BD-ELM. On the basis of ELM, BD-ELM regularizes the input weights of the hidden layer by Biased DropConnect and the output of the hidden layer by Biased Dropout. Then, it solves for the output weights of the hidden layer, completing the training of the network. Compared with ELM, the input weights and the hidden-layer output in BD-ELM are sparser, which reduces the network complexity and helps alleviate overfitting.
It should be noted that applying Biased DropConnect and Biased Dropout to ELM requires setting four additional parameters, $p_{DC}^h$, $p_{DC}^l$, $p_D^h$, and $p_D^l$, which are difficult to optimize jointly. To reduce the number of parameters, in BD-ELM we set the drop probabilities of the connection weights with high values and of the hidden nodes with high activation values to 0, i.e., $p_{DC}^h = p_D^h = 0$. In other words, connection weights with high values and hidden nodes with high activation values are always kept, and only the low groups are dropped. Thus, the drop probabilities of Biased DropConnect, $p_{DC}$, and Biased Dropout, $p_D$, depend only on the drop probabilities of the low-value connection weights, $p_{DC}^l$, and of the low-activation hidden nodes, $p_D^l$, respectively. When the medians of the input weights and of the hidden-node activation values are chosen as thresholds, the drop probabilities of Biased DropConnect and Biased Dropout are $p_{DC} = p_{DC}^l / 2$ and $p_D = p_D^l / 2$, respectively. For example, to obtain an overall drop probability of $p_{DC} = 0.2$, the low-value weights are dropped with $p_{DC}^l = 0.4$. For simplicity of comparison with other works, we still use $p_{DC}$ and $p_D$ as the measurement scales and input parameters.
Thus, the training of BD-ELM is as follows:

Input: the training set, the number of hidden nodes L, the activation function $g(\cdot)$, and the drop probabilities $p_{DC}$ and $p_D$.
Output: the output weights of the hidden layer $\beta$.
Step 1: randomly generate the input weights $W$ and biases $b$.
Step 2: generate the mask matrix $M_{\text{weights}}$ according to $W$ and the drop probability $p_{DC}$.
Step 3: apply Biased DropConnect regularization to the input weights, $W = M_{\text{weights}} \ast W$.
Step 4: calculate the output of the hidden layer $H$.
Step 5: generate the mask matrix $M_{\text{output}}$ according to $H$ and the drop probability $p_D$.
Step 6: apply Biased Dropout regularization to the output of the hidden layer, $H = M_{\text{output}} \ast H$.
Step 7: calculate the output weights of the hidden layer $\beta$.
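Putting Steps 1-7 together, a minimal sketch of BD-ELM training under the simplification $p_{DC}^h = p_D^h = 0$ might look as follows; it reuses the helper functions sketched above, and the doubling of the input probabilities reflects the median-threshold relations $p_{DC} = p_{DC}^l/2$ and $p_D = p_D^l/2$ (so it assumes $p_{DC}, p_D \leq 0.5$). All names are illustrative, not from the original paper.

def bd_elm_train(X, T, L, p_dc, p_d, g=np.tanh, seed=0):
    """BD-ELM sketch: Biased DropConnect on the input weights, Biased Dropout on
    the hidden output, then least-squares output weights (Steps 1-7)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly generate input weights and biases
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))
    b = rng.uniform(-1.0, 1.0, size=L)
    # Steps 2-3: keep high-magnitude weights, drop low ones with p_l = 2 * p_dc
    W = biased_dropconnect_mask(W, 0.0, 2.0 * p_dc, rng) * W
    # Step 4: output of the hidden layer
    H = g(X @ W + b)
    # Steps 5-6: keep high-activation outputs, drop low ones with p_l = 2 * p_d
    H = biased_dropout_mask(H, 0.0, 2.0 * p_d, rng) * H
    # Step 7: least-squares output weights, beta = H^dagger T
    beta = np.linalg.pinv(H) @ T
    return W, b, beta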

Performance Evaluation of BD-ELM
All evaluations were carried out in MATLAB R2017b, running on a desktop with a 4.2 GHz CPU, 16 GB of RAM, and a 1 TB hard disk. The following experiments were designed to evaluate the performance of BD-ELM; note that all results in this paper are averages over 30 repeated independent experiments.

4.1. The Influence of the Number of Hidden Nodes. The number of hidden-layer nodes is an important parameter of a neural network. Too few hidden nodes result in underfitting, while too many lead to overfitting. ELM needs a large number of hidden nodes to ensure its generalization performance, but this easily leads to overfitting. To verify that Biased DropConnect and Biased Dropout can effectively prevent the overfitting of ELM, we vary the number of hidden nodes and observe the performance of ELM and BD-ELM on typical datasets. The results are shown in Figures 1(a)-1(d).
As can be seen from Figure 1, BD-ELM is less prone to overfitting on the four datasets. As the number of hidden nodes increases, serious overfitting appears in ELM. Compared with ELM, when overfitting appears, BD-ELM tolerates more hidden nodes with a smaller reduction in accuracy. This shows that Biased DropConnect and Biased Dropout can effectively mitigate the overfitting of ELM. The reason is that Biased DropConnect and Biased Dropout set part of the input weights and hidden outputs to 0, which is equivalent to removing them from the network. When the number of hidden nodes is too large, setting part of the input weights and outputs to 0 removes redundant information from the network, and the sparser parameters reduce the network complexity, so that BD-ELM can still maintain better accuracy than ELM.
The generalization performance of BD-ELM under different drop probabilities shows that a larger drop probability p is better for suppressing overfitting. But when p is too large, many input weights and outputs are eliminated and much important information is lost, which degrades the generalization performance of BD-ELM. This is especially evident when the hidden nodes are few. Therefore, it is very important to choose a suitable p for BD-ELM according to the number of hidden nodes.

4.2. The Influence of the Drop Probability. The drop probability p, as the key parameter of Biased DropConnect and Biased Dropout, affects the effectiveness of the regularization and the generalization performance of BD-ELM. To clarify the influence of the drop probability on BD-ELM, in this evaluation a variable drop probability is adopted and the change in classification accuracy is observed.
The experiments use the Vehicle and Image Segmentation datasets; the number of hidden nodes is L = 200, the medians of the input weights and of the hidden outputs are used as thresholds, the activation function is the sigmoid function, and the drop probabilities of Biased DropConnect and Biased Dropout range over $p_{DC} = p_1 \in \{0, 0.05, 0.1, \ldots, 0.5\}$ and $p_D = p_2 \in \{0, 0.05, 0.1, \ldots, 0.5\}$, respectively. The influence of the drop probability on the accuracy of BD-ELM is shown in Figures 2(a) and 2(b).
As the drop probabilities $p_{DC}$ and $p_D$ increase, the accuracy of BD-ELM first rises and then decreases.
This is because, as $p_{DC}$ and $p_D$ increase, Biased DropConnect and Biased Dropout remove more redundant information and enhance the positive influence of the input weights with high values and of the hidden nodes with high activation values. However, when the drop probability exceeds a critical point, a large number of connection weights or hidden nodes are dropped, which means a great loss of important information and causes the performance of BD-ELM to fall rather than rise. Therefore, a suitable drop probability should be chosen after concrete analysis; it lies roughly in (0.1, 0.4).

4.3. Performance Comparison. To test the comprehensive performance of BD-ELM, this evaluation uses all the datasets mentioned above, summarized in Table 1, to compare BD-ELM with ELM, R-ELM, and Drop-ELM [19]. The classification accuracy and training time are reported in Table 2 (boldface indicates the highest accuracy) and Table 3, respectively.

Table 1: The datasets used in the experiments.

Datasets              Attributes   Classes   Training data   Testing data
Ionosphere            34           2         180             171
Diabetes              8            2         390             378
Vehicle               18           4         446             400
Image segmentation    19           7         1210            1100
Satellite image       36           6         3217            3218
Letter recognition    16           26        16000           4000
MNIST                 784          10        60000           10000
Rectangles            784          2         1200            50000

As can be seen from Tables 2 and 3, on most datasets BD-ELM achieves higher classification accuracy but requires more training time. (1) The performance of BD-ELM is better than that of ELM and R-ELM: by dropping the unimportant weights and hidden nodes, BD-ELM has sparser parameters and less redundant information, which is conducive to improving the generalization performance, while the $L_2$-norm in R-ELM does not have this ability [20]. (2) The performance of BD-ELM is better than that of Drop-ELM because BD-ELM takes the differences among connection weights and hidden nodes into account and preserves more of the important information contained in the weights with high values and the hidden nodes with high activation values, whereas Drop-ELM gives the same drop probability to all input weights and hidden outputs.
At the same time, BD-ELM needs to generate mask matrices, a process more complicated than that in Drop-ELM, so its training time is longer than that of the other algorithms. These results show that it is necessary and effective to incorporate Biased DropConnect and Biased Dropout into ELM to prevent overfitting and improve generalization performance.

Conclusions
To address overfitting and improve the generalization performance of ELM, this paper proposes the Biased DropConnect regularization method and applies Biased DropConnect and Biased Dropout to ELM to construct BD-ELM. BD-ELM divides the input weights and the hidden outputs into different groups by setting thresholds and gives the different groups different drop probabilities, which enhances the sparsity and reduces the complexity of the network. The empirical studies show that Biased DropConnect and Biased Dropout can effectively prevent overfitting, and that BD-ELM achieves better generalization performance on various benchmark datasets. However, BD-ELM requires more parameters, whose optimal values are difficult to find; these problems need further research. Inspired by other advanced machine learning methods [21,22], the performance of ELM could be further improved.

Data Availability
The data used to support the findings of this study have been deposited in the GitHub repository (https://github.com/LynnW0W/BD-ELM).

Conflicts of Interest
The authors declare that they have no conflicts of interest.