RSigELU: A nonlinear activation function for deep neural networks
Introduction
Deep learning models are widely used in application domains such as image processing, natural language processing, object recognition, translation, and financial forecasting, and the importance of these models increases every day (Schmidhuber, 2015). Deep learning algorithms build models from training data and use them to estimate unknown outcomes for new inputs. Deep learning architectures consist of several layers, including convolution, pooling, normalization, fully-connected, and activation layers.
Activation functions are essential to the proper functioning of neural network architectures, affecting training speed, the ability to escape local minima, and model accuracy. They form the basis on which neural networks model and learn the complex relationships between variables and layers, so a well-designed activation function has a strong effect on the performance of deep learning models. The activation layers process the inputs coming into the network and, using the activation functions, produce the outputs corresponding to these inputs. Activation functions should be appropriate for the characteristics of the data, and there are numerous studies on their design in the literature. Activation functions used in deep learning are expected to be non-linear and differentiable (Bircanoğlu & Arıca, 2018). Deep learning studies frequently use activation functions such as ReLU and LReLU (Glorot et al., 2011, Krizhevsky et al., 2012, Maas et al., 2013), while neural network studies frequently use activation functions such as the sigmoid and hyperbolic tangent (Costarelli and Spigler, 2018, Costarelli, 2019, Costarelli and Vinti, 2017, Costarelli and Sambucini, 2018).
When an activation function is linear, the network behaves like linear regression, as in a one-dimensional artificial neural network. However, since deep learning architectures are commonly applied to complex real-world problems, linear activation functions fail to produce successful results. Therefore, nonlinear activation functions are preferred in multi-layered deep neural networks for learning significant features from the data (Bircanoğlu & Arıca, 2018). In a deep neural network, the parameters are updated using the backpropagation algorithm. Since derivatives are propagated backwards during the update, the activation function used in the architecture must be differentiable. In addition, in the deep architectures used recently, updating the neurons in the lower layers becomes difficult, which results in the vanishing gradient problem. This problem poses a major obstacle to learning in deep neural networks (Ebrahimi & Abadi, 2018) and raises the possibility of being stuck in local minima, because as the network deepens the gradients vanish, as frequently observed with the sigmoid and hyperbolic tangent activation functions. Overcoming this problem removes a key difficulty in the training phase of deep neural networks. For this purpose, ReLU (Nair & Hinton, 2010), LReLU (Maas et al., 2013), PReLU (He et al., 2015), ELU (Clevert et al., 2015), SELU (Klambauer et al., 2017), Hexpo (Kong & Takatsuka, 2017), and LISA (Bawa & Kumar, 2019) were proposed in the literature. However, these activation functions face the negative region problem. The ReLU activation function (Nair & Hinton, 2010), introduced to overcome the vanishing gradient problem, passes positive inputs through unchanged but sets negative inputs to zero. Thus, no learning takes place for negative inputs, since their derivatives are zero.
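The vanishing-gradient effect described above can be illustrated numerically. A minimal sketch, assuming the standard sigmoid: its derivative never exceeds 0.25, so the chain-rule product accumulated across many layers shrinks geometrically toward zero:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

# Backpropagating through 10 layers multiplies 10 such derivatives.
# Even at x = 0, where the derivative is largest, the product collapses.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_grad(0.0)  # 0.25 at x = 0
print(grad)  # 0.25**10 ≈ 9.54e-07
```

This is why lower-layer updates become negligible as sigmoid-style networks deepen.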
Values in the negative region are therefore neglected during the learning process. To overcome this problem, LReLU (Maas et al., 2013), PReLU (He et al., 2015), and ELU (Clevert et al., 2015) were proposed.
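The negative-region behavior that distinguishes these functions can be sketched with their standard textbook definitions (the slope `alpha=0.01` for LReLU and the scale `alpha=1.0` for ELU are conventional defaults assumed here, not values from this paper):

```python
import math

def relu(x):
    # Zero output (and zero gradient) for all negative inputs.
    return max(0.0, x)

def lrelu(x, alpha=0.01):
    # Leaky ReLU keeps a small linear response for negative inputs.
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # ELU saturates smoothly toward -alpha for large negative inputs.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# For a negative input, ReLU outputs 0 (a dead neuron), while LReLU
# and ELU keep a nonzero response so learning can continue.
x = -2.0
print(relu(x), lrelu(x), elu(x))
```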
In this study, novel RSigELU activation functions, namely the single-parameter RSigELUS and the double-parameter RSigELUD, which combine the ReLU, sigmoid, and ELU activation functions, are proposed. The proposed RSigELUS and RSigELUD activation functions overcome the vanishing gradient and negative region problems to improve the learning process. They exhibit the behavior of a combination of the ReLU and sigmoid functions in the positive activation region, the behavior of the ELU function in the negative activation region, and the behavior of a linear function in the linear activation region (Bawa & Kumar, 2019). Experimental evaluations were conducted using the MNIST, Fashion MNIST, IMDb Movie, and CIFAR-10 benchmark datasets. The contributions of the study can be listed as follows.
1. Novel RSigELU activation functions, RSigELUS and RSigELUD, which combine the ReLU, sigmoid, and ELU activation functions, were proposed.
2. The proposed RSigELU activation functions operate in the positive, negative, and linear activation regions, and overcome both the vanishing gradient problem and the negative region problem. They can work with a single parameter or with double parameters, called RSigELUS and RSigELUD, respectively.
3. Performance evaluation of the proposed activation functions was carried out using a VGG architecture on the MNIST, Fashion MNIST, IMDb Movie, and CIFAR-10 benchmark datasets.
The second section of this study presents a literature review. The problem definition is given in the third section, and the proposed activation functions are introduced in the fourth section. The approaches and datasets used to evaluate the activation functions are discussed in the fifth section. Experimental results are presented in the sixth section, and the results obtained are discussed in the conclusions section.
Literature review
Studies on the design of activation functions for deep learning structures have focused on the vanishing gradient and negative region problems. This study proposes activation functions that address both problems.
The ReLU (Nair & Hinton, 2010) activation function is able to deal with the vanishing gradient problem: it passes a positive input through unchanged, so its derivative in the positive region is one and gradients do not vanish (Eq. (1)).
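The standard ReLU definition referenced by Eq. (1) can be sketched as follows; `relu_grad` is an added helper (not from the paper) showing why gradients survive in the positive region and die in the negative one:

```python
def relu(x):
    # Eq. (1): f(x) = max(0, x)
    return max(0.0, x)

def relu_grad(x):
    # The derivative is 1 for positive inputs, so backpropagated
    # gradients pass through unchanged; it is 0 for negative inputs,
    # which is the source of the negative region problem.
    return 1.0 if x > 0 else 0.0
```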
Problem statement
In this study, novel RSigELU activation functions are proposed to overcome the vanishing gradient and negative region problems, which have become a focus of attention in deep learning models. Because of these problems, deep neural networks are barely updated or become useless; this prevents the network from deepening and learning, and may even cause excessive computational power to be spent on training. It may also cause the optimization to fall into local minima in deep neural networks.
Proposed activation functions
This study proposes new activation functions, RSigELUS and RSigELUD, which operate on the data in three activation regions (positive, linear, and negative) in order to overcome the vanishing gradient and negative region problems. The proposed RSigELUS activation function requires a single parameter, while the RSigELUD activation function requires double parameters. Single- and double-parameter structures were used in order to capture the characteristics of both regions.
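The three-region behavior described above can be illustrated with a hedged sketch. The piecewise structure below is an assumption for illustration only: the region boundaries, the sigmoid-weighted positive branch, and the default values of `alpha` and `beta` are hypothetical choices consistent with the description in the text, not the authors' exact equations, which are given in the paper itself:

```python
import math

def rsigelu_d_sketch(x, alpha=0.5, beta=0.5):
    """Hypothetical sketch of a double-parameter three-region activation."""
    if x > 1.0:
        # Positive region: ReLU-like growth modulated by a sigmoid term,
        # scaled by the first parameter.
        return x * (1.0 / (1.0 + math.exp(-x))) * alpha + x
    elif x >= 0.0:
        # Linear region: identity, as in ReLU.
        return x
    else:
        # Negative region: ELU-like exponential saturation, scaled by
        # the second parameter, so negative inputs still carry gradient.
        return beta * (math.exp(x) - 1.0)
```

A single-parameter variant in the spirit of RSigELUS would simply tie `beta` to `alpha`. The point of the sketch is the region split: identity in the linear region, amplified growth in the positive region, and a nonzero smooth response in the negative region.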
Material and method
The study was carried out using the TensorFlow backend, the Keras library, and the Google Colab platform to evaluate the classification performance of deep learning algorithms with the activation functions proposed in this study and in the literature. Experiments were run on a system with an Intel Core i5 7200U 2.5 GHz processor and 12 GB of DDR3 memory.
The remainder of this section introduces the benchmark datasets used in the study and gives information about the convolutional neural network architecture.
Experimental results and discussion
The study presents the experimental evaluations of the proposed RSigELU activation functions on the MNIST, Fashion MNIST, CIFAR-10, and IMDb Movie datasets. For the double-parameter (RSigELUD) case, the best results of the experiments performed on each dataset to determine the scale values are shown in Table 2, Table 3, Table 4, and Table 5. The best success rates of the proposed RSigELUS and RSigELUD activation functions and of other known activation functions on the MNIST dataset are presented in Table 6.
Conclusions
The activation layers process the inputs coming into the network and, using the activation functions, produce the outputs corresponding to these inputs. Since deep learning architectures commonly deal with complex problems, linear activation functions are not appropriate for such architectures, because activation functions form the basis on which neural networks learn and predict the complex and continuous relationships between variables. Therefore, nonlinear activation functions are preferred.
CRediT authorship contribution statement
Serhat Kiliçarslan: Conceptualization, Software, Visualization, Supervision, Writing - review & editing. Mete Celik: Conceptualization, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (40)
- Linearized sigmoidal activation: A novel activation function with tractable non-linear characteristics to boost representation capability. Expert Systems with Applications (2019).
- Dual Rectified Linear Units (DReLUs): A replacement for tanh activation functions in Quasi-Recurrent Neural Networks. Pattern Recognition Letters (2018).
- Diagnosis and classification of cancer using hybrid model based on ReliefF and convolutional neural network. Medical Hypotheses (2020).
- Deep learning in neural networks: An overview. Neural Networks (2015).
- ReLTanh: An activation function with vanishing gradient resistance for SAE-based DNNs and its application to rotating machinery fault diagnosis. Neurocomputing (2019).
- Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. (1994).
- CoABCMiner: An algorithm for cooperative rule classification system based on Artificial Bee Colony. Int. J. Artif. Intell. Tools (2016).
- Clevert, D. A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear...
- Approximate solutions of Volterra integral equations by an interpolation method based on ramp functions. Comp. Appl. Math. (2019).
- Approximation results in Orlicz spaces for sequences of Kantorovich max-product neural network operators. Results Math.
- Solving numerically nonlinear systems of balance laws by multivariate sigmoidal functions approximation. Comp. Appl. Math.
- Saturation classes for max-product neural network operators activated by sigmoidal functions. Results Math.
- A precise and stable machine learning algorithm: eigenvalue classification (EigenClass). Neural Computing and Applications.
- Deep sparse rectifier neural networks.
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
- Bridging nonlinearities and stochastic regularizers with Gaussian error linear units.
Cited by (49)
- A new multifractal-based deep learning model for text mining. Information Processing and Management (2024).
- Multiscale leapfrog structure: An efficient object detector architecture designed for unmanned aerial vehicles. Engineering Applications of Artificial Intelligence (2024).
- Fractional ordering of activation functions for neural networks: A case study on Texas wind turbine. Engineering Applications of Artificial Intelligence (2024).
- A 218 GOPS neural network accelerator based on a novel cost-efficient surrogate gradient scheme for pattern classification. Microprocessors and Microsystems (2023).
- Detection and classification of pneumonia using novel Superior Exponential (SupEx) activation function in convolutional neural networks. Expert Systems with Applications (2023).
- Empirical study of the modulus as activation function in computer vision applications. Engineering Applications of Artificial Intelligence (2023).