RSigELU: A nonlinear activation function for deep neural networks

https://doi.org/10.1016/j.eswa.2021.114805

Highlights

  • Novel RSigELU activation functions, RSigELUS and RSigELUD, which combine the ReLU, sigmoid, and ELU activation functions, were proposed.

  • The proposed RSigELU activation functions are capable of working in the positive, negative, and linear activation regions, and overcome the vanishing gradient and negative region problems.

  • Performance evaluation of the proposed activation functions was carried out using a VGG architecture on the MNIST, Fashion MNIST, IMDb Movie, and CIFAR-10 benchmark datasets.

Abstract

In deep learning models, the inputs to the network are processed by activation functions to generate the outputs corresponding to these inputs. Deep learning models are of particular importance in analyzing big data with numerous parameters and in forecasting, and are useful for image processing, natural language processing, object recognition, and financial forecasting. The traditional sigmoid and tangent activation functions are widely used in deep learning models. However, the sigmoid and tangent activation functions suffer from the vanishing gradient problem. To overcome this problem, the ReLU activation function and its derivatives were proposed in the literature. However, these activation functions suffer from the negative region problem. In this study, novel RSigELU activation functions, the single-parameter RSigELU (RSigELUS) and the double-parameter RSigELU (RSigELUD), which combine the ReLU, sigmoid, and ELU activation functions, were proposed. The proposed RSigELUS and RSigELUD activation functions overcome the vanishing gradient and negative region problems and are effective in the positive, negative, and linear activation regions. Performance evaluation of the proposed RSigELU activation functions was performed on the MNIST, Fashion MNIST, CIFAR-10, and IMDb Movie benchmark datasets. Experimental evaluations showed that the proposed activation functions perform better than other activation functions.

Introduction

Deep learning models are widely used in application domains such as image processing, natural language processing, object recognition, translation, and financial forecasting, and their importance is increasing every day (Schmidhuber, 2015). Deep learning algorithms build models from training data and use them to estimate unknown outcomes from input data. Deep learning architectures consist of several layers, including convolution, pooling, normalization, fully-connected, and activation layers.

Activation functions are essential for the healthy functioning of neural network architectures, affecting the speed of the models during the training phase, the risk of becoming trapped in local minima, and the accuracy of the models. Activation functions form the basis for neural networks to model and learn the complex relationships between variables and layers. A well-designed activation function has a strong effect on the performance of deep learning models. The activation layers process the inputs coming into the network and, using the activation functions, produce the outputs corresponding to these inputs. Activation functions should be appropriate for the characteristics of the data, and there are numerous studies on their design in the literature. Activation functions used in deep learning are expected to be non-linear and differentiable (Bircanoğlu & Arıca, 2018). In deep learning studies, activation functions such as ReLU and LReLU are frequently used (Glorot et al., 2011, Krizhevsky et al., 2012, Maas et al., 2013), and in neural network studies, activation functions such as sigmoid and tangent are frequently used (Costarelli and Spigler, 2018, Costarelli, 2019, Costarelli and Vinti, 2017, Costarelli and Sambucini, 2018).

When an activation function is linear, the network behaves like a linear regression model, regardless of its depth. Since deep learning architectures are commonly applied to complex real-world problems, linear activation functions fail to produce successful results. Therefore, nonlinear activation functions are preferred in multi-layered deep neural networks for learning significant features from the data (Bircanoğlu & Arıca, 2018). In a deep neural network, the parameters are updated using the backpropagation algorithm. Because derivatives are propagated backwards during the update process, the activation function used in the architecture must be differentiable. In addition, in the deep architectures used recently, updating the neurons in the lower layers becomes difficult, which gives rise to the vanishing gradient problem. This problem poses a major obstacle to learning in deep neural networks (Ebrahimi & Abadi, 2018) and increases the possibility of becoming stuck in local minima, because, as the network deepens, the vanishing gradient problem arises, as frequently observed with the sigmoid and hyperbolic tangent activation functions. The difficulty in the training phase of deep neural networks can be eliminated by overcoming this problem. For this purpose, ReLU (Nair & Hinton, 2010), LReLU (Maas et al., 2013), PReLU (He et al., 2015), ELU (Clevert et al., 2015), SELU (Klambauer et al., 2017), Hexpo (Kong & Takatsuka, 2017), and LISA (Bawa & Kumar, 2019) were proposed in the literature. However, some of these activation functions face the negative region problem. The ReLU activation function (Nair & Hinton, 2010), introduced to overcome the vanishing gradient problem, passes positive inputs through unchanged but sets negative inputs to zero. As a result, no learning takes place for negative inputs, since their derivatives are zero, and the values in the negative region are neglected during the learning process. To overcome this problem, studies such as LReLU (Maas et al., 2013), PReLU (He et al., 2015), and ELU (Clevert et al., 2015) were conducted.
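To make the two failure modes above concrete, the short NumPy sketch below (illustrative only, not taken from the paper) evaluates the derivatives of sigmoid, tanh, ReLU, LReLU, and ELU at a strongly negative input: the saturating functions return near-zero gradients (vanishing gradient), plain ReLU returns exactly zero (negative region problem), while LReLU and ELU keep a small non-zero gradient signal.

```python
import numpy as np

# Derivatives of common activation functions, evaluated pointwise.
def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                 # -> 0 as |x| grows (saturation)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2         # also saturates towards 0

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)     # exactly 0 for x < 0 (negative region problem)

def d_lrelu(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)   # small but non-zero gradient for x < 0

def d_elu(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = -6.0
for name, d in [("sigmoid", d_sigmoid), ("tanh", d_tanh),
                ("ReLU", d_relu), ("LReLU", d_lrelu), ("ELU", d_elu)]:
    print(f"{name:8s} derivative at x = {x}: {float(d(x)):.6f}")
```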

In this study, novel RSigELU activation functions, the single-parameter RSigELU (RSigELUS) and the double-parameter RSigELU (RSigELUD), which combine the ReLU, sigmoid, and ELU activation functions, were proposed. The proposed RSigELUS and RSigELUD activation functions overcome the vanishing gradient and negative region problems and thereby improve the learning process. The proposed RSigELU activation functions exhibit the behavior of a combination of the ReLU and sigmoid functions in the positive activation region, the behavior of the ELU function in the negative activation region, and the behavior of a linear function in the linear activation region (Bawa & Kumar, 2019). Experimental evaluations were conducted using the MNIST, Fashion MNIST, IMDb Movie, and CIFAR-10 benchmark datasets. The contributions of the study can be listed as follows.

1. Novel RSigELU activation functions, RSigELUS and RSigELUD, which combine the ReLU, sigmoid, and ELU activation functions, were proposed.

2. The proposed RSigELU activation functions are capable of working in positive, negative, and linear activation regions, and overcome the vanishing gradient problem and negative region problem. The proposed RSigELU activation functions can work with single and double parameters, called RSigELUS and RSigELUD, respectively.

3. Performance evaluation of the proposed activation functions was carried out using a VGG architecture on the MNIST, Fashion MNIST, IMDb Movie, and CIFAR-10 benchmark datasets.

The second section of this study presents a literature review. The problem definition is given in the third section, and the proposed activation functions are introduced in the fourth section. The approaches and datasets used to evaluate the activation functions are discussed in the fifth section. Experimental results are presented in the sixth section, and the results obtained are discussed in the conclusions section.

Section snippets

Literature review

Studies on the design of activation functions for deep learning structures have focused on the vanishing gradient and negative region problems. In this study, activation functions are proposed to address both problems.

The ReLU (Nair & Hinton, 2010) activation function is able to deal with the vanishing gradient problem: it passes positive inputs through unchanged and outputs zero for negative inputs (Eq. (1)).
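Eq. (1) itself is not reproduced in this snippet; for reference, the standard ReLU definition and its derivative, which the equation presumably denotes, are:

```latex
f(x) = \max(0, x), \qquad
f'(x) =
\begin{cases}
1, & x > 0, \\
0, & x \le 0.
\end{cases}
```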

Problem statement

In this study, novel RSigELU activation functions were proposed to overcome the vanishing gradient and negative region problems, which have become a focus of attention in deep learning models. In deep neural networks, because of these problems, the weights are barely updated or the networks become useless. This prevents the network from deepening and learning, and may even require excessive computational power for training. Thus, these problems may cause a fall into local minima in deep neural

Proposed activation functions

This study proposes new activation functions, RSigELUS and RSigELUD, which can operate in accordance with the data in three activation regions (i.e., positive, linear, and negative) in order to overcome the vanishing gradient and negative region problems. The proposed RSigELUS activation function requires a single parameter, and the RSigELUD activation function requires two parameters. In order to capture the characteristics of both regions, single- and double-parameter structures were used in the
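The closed-form definitions of RSigELUS and RSigELUD are not included in this snippet. As a minimal, hedged sketch of the described behavior only, the TensorFlow code below combines a sigmoid-weighted term with the identity in the positive region (x > 1), keeps the identity in the linear region (0 ≤ x ≤ 1), and uses an ELU-like curve in the negative region; the parameter names `alpha` and `beta` and the exact functional form are assumptions, and the paper's actual equations may differ.

```python
import tensorflow as tf

def rsigelu_d(x, alpha=0.5, beta=0.5):
    """Sketch of a double-parameter RSigELU-style activation (assumed form).

    Positive region (x > 1): identity plus a sigmoid-weighted term scaled by alpha.
    Linear region (0 <= x <= 1): identity.
    Negative region (x < 0): ELU-like curve scaled by beta.
    """
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    positive = x + alpha * x * tf.sigmoid(x)   # ReLU + sigmoid behaviour for x > 1
    linear = x                                 # identity for 0 <= x <= 1
    negative = beta * (tf.exp(x) - 1.0)        # ELU-like behaviour for x < 0
    return tf.where(x > 1.0, positive, tf.where(x >= 0.0, linear, negative))

def rsigelu_s(x, alpha=0.5):
    """Single-parameter variant: one alpha governs both non-linear regions."""
    return rsigelu_d(x, alpha=alpha, beta=alpha)
```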

Material and method

The study was carried out using the TensorFlow backend, the Keras library, and the Google Colab platform to evaluate the classification performance of deep learning algorithms using the activation functions proposed in this study and in the literature. The experiments were conducted on a system with an Intel Core i5 7200U 2.5 GHz processor and 12 GB of DDR3 memory.
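As an illustration of how such a setup might look (the exact VGG configuration and hyperparameters used in the study are not given in this snippet, so the layer sizes below are placeholders), a custom activation callable can be passed directly to Keras layers:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Activation under test; the RSigELU sketch from the previous section could be
# substituted here in place of the built-in ReLU.
activation_fn = tf.nn.relu

# Minimal VGG-style model with placeholder layer sizes (28x28x1 inputs, e.g. MNIST).
model = models.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, padding="same", activation=activation_fn),
    layers.Conv2D(32, 3, padding="same", activation=activation_fn),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation=activation_fn),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```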

In the following part of this section, the benchmark datasets used in the study are introduced and information is given about the convolutional

Experimental results and discussion

This section presents the experimental evaluations of the proposed RSigELU activation functions on the MNIST, Fashion MNIST, CIFAR-10, and IMDb Movie datasets. For the double-parameter (RSigELUD) case, the best results of the experiments performed on each dataset to determine the scale values are shown in Table 2, Table 3, Table 4, and Table 5. The best success rates of the proposed RSigELUS and RSigELUD activation functions and other known activation functions for the MNIST dataset are presented in Table 6

Conclusions

The activation layers process the inputs coming into the network and, using the activation functions, produce the outputs corresponding to these inputs. Since deep learning architectures commonly deal with complex problems, linear activation functions are not appropriate for such architectures. This is because activation functions form the basis for neural networks to learn and predict the complex and continuous relationships between variables. Therefore, nonlinear activation functions

CRediT authorship contribution statement

Serhat Kiliçarslan: Conceptualization, Software, Visualization, Supervision, Writing - review & editing. Mete Celik: Conceptualization, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (40)

  • Costarelli, D., et al. (2018). Approximation results in Orlicz spaces for sequences of Kantorovich max-product neural network operators. Results Math.
  • Costarelli, D., et al. (2018). Solving numerically nonlinear systems of balance laws by multivariate sigmoidal functions approximation. Comp. Appl. Math.
  • Costarelli, D., et al. (2017). Saturation classes for max-product neural network operators activated by sigmoidal functions. Results Math.
  • Ebrahimi, M. S., & Abadi, H. K. (2018). Study of residual networks for image recognition. arXiv preprint...
  • Erkan (2020). A precise and stable machine learning algorithm: eigenvalue classification (EigenClass). Neural Computing and Applications.
  • Glorot, X., & Bengio, Y. (2010, March). Understanding the difficulty of training deep feedforward neural networks. In...
  • Glorot, X., et al. (2011, June). Deep sparse rectifier neural networks.
  • He, K., et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
  • Hendrycks, D., et al. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units.
  • Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint...