Soft ++ , a multi-parametric non-saturating non-linearity that improves convergence in deep neural architectures

A key strategy to enable training of deep neural networks is to use non-saturating activation functions to reduce the vanishing gradient problem. Popular choices that saturate only in the negative domain are the rectiﬁed linear unit (ReLU), its smooth, non-linear variant, Softplus


Introduction
Deep learning has become extremely popular in recent years [30 , 43] . Deep neural networks, trained on vast amounts of data, have reached performance levels that apparently match or surpass human expertise on difficult tasks, such as object recognition [25] or complex games, like Go [45] . Such astonishing progress has been enabled by three key factors: the availability of very large amounts of training data [42] , the vigorous increase in computing power due to massively parallel cores of modern GPUs [43] , and by transiting from shallow to deeper neural architectures [28 , 2] . The present study will focus on this latter breakthrough.
The enthusiastic development of artificial neural networks in the 70 s and 80 s has hit a hard wall when it was recognized that it is very difficult to train networks with more than one or two hidden layers [2] . This limitation arises due to the so-called vanishing gradient problem in the error backpropagation algorithm [17 , 3] , whereby the gradient decreases exponentially fast as it is * Corresponding author.
E-mail address: muresan@tins.ro (R.C. Mure ¸s an). backpropagated through a multi-layered neural network. The problem is exacerbated when classic activation functions are used, like the sigmoid or hyperbolic tangent (tanh), whose gradients vanish asymptotically as the input departs from origin [38] .
More recently, the vanishing gradient problem has been overcome by using two strategies. First, adaptive training algorithms emerged that do not use the value of the gradient but only its sign to update weights in the network -with the most successful algorithms being the RPROP [39] and its improved variant, IRPROP [19 , 20] . A second improvement was to use non-saturating activation functions like the Rectified Linear Unit (ReLU) [14 , 34] . The latter is the most popular activation function in deep learning architectures [30 , 6] because its gradient remains constant throughout the positive domain. These two strategies enable training of networks with unprecedented depth.
It is important to be able to train deep networks because their representational power surpasses that of shallow ones [2] . Indeed, by studying the complexity of the function implemented by networks of various depths, it has been shown that deeper networks are able to implement more complex functions than shallow ones [4] . It was further demonstrated that shallow networks are incapable of providing a good approximation of almost any function unless a sufficient number of neurons is used [26] . Thus, robust training of deep networks is desirable and, as we will show, the properties of the activation function are very important for the training process.
The popular activation functions, like the ReLU and its continuous counterpart, Softplus [9] , are non-saturating in the positive domain, but their value and gradient is, or approaches zero in the negative domain -i.e. they are only partially non-saturating ( Fig. 1 ). The recovery of saturated units, although possible, is very slow, creating large plateaus during learning [10] . In addition, reduced gradients further contribute to the vanishing gradient problem, especially when classical backpropagation is used. A solution to this issue has been introduced with the leaky ReLU (LReLU), which adds a mild linear slope in the negative domain, and its parametrized version, the parametric ReLU (PReLU) [15] , both yielding fully non-saturating activation functions. Another recent activation function, the exponential-linear unit (ELU) is saturating in the negative domain but it takes both positive and negative values, constraining average activations around zero and reducing the bias shift effect [8] . ELU networks can greatly accelerate learning [8] . An additional variant of ELU, the scaled exponential-linear unit (SELU) further provides automatic normalization within the network by preserving zero mean and unit variance [23] .
Here, we show that convergence speed in deep neural architectures with various activation functions is strongly problem-dependent. To this end, we use extensive testing on both large and small, difficult datasets, with both fully-connected (MLP) and convolutional networks, combined with gradient-based or adaptive learning strategies. We argue that further improvement is possible over existing activation functions by introducing a biparametric, fully customizable activation function called Soft ++ . The latter is based on the Softplus function and allows one to parametrize the slope in both the negative and positive domains. Soft ++ shares the advantages of ELU but it can be adapted better to a particular learning problem, thus outperforming all activation functions in terms of convergence speed and, often, test accuracy.

Activation functions
We compared the performance of deep networks having six activation functions: ReLU, Softplus, PReLU (LReLU), ELU, SELU, and Soft ++ . The ReLU function is the most popular activation function [30] . It is linear in the positive domain and zero in the origin and over the negative domain:

ARTICLE IN PRESS
JID: NEUCOM [m5G; December 27, 2019;23:20 ] A "soft" variant of ReLU is Softplus [9] : Both ReLU and Softplus are saturating in the negative domain ( Fig. 1 ). A non-saturating version of ReLU was recently introduced [15] : where, a is a coefficient determining the slope of the negative part. In PReLU, the parameter a is learned together with the weights in the network. If the parameter is fixed, PReLU becomes equivalent to LReLU. Here, we fixed the parameter in each simulation. ELU falls in the category of saturating activation functions, but takes both positive and negative values and it is locally nonsaturated around the origin [8] : where, α is a parameter determining the saturating value in the negative domain, usually set to 1 [8] . The gradient of ELU in the origin is α. Here, we fixed the value of α to 1 in all experiments.
SELU is the scaled variant of ELU, as it introduces a second, scaling parameter, λ [23] : where, α is set to ~1.67326 and λ is set to ~1.05070 for standard scaled inputs.
Here, we propose a new, fully non-saturating function called Soft ++ , extending both Softplus and PReLU, and defined as follows: where, k is an exponent coefficient and c is a slope coefficient which sets the slope in the negative domain. The constant ln (2) term removes the offset of the Soft ++ function in the origin. The derivative of Soft ++ is: For positive k and c , the gradient of Soft ++ in the negative domain is inversely proportional to the slope coefficient, c, i.e. it converges to 1/ c . In the positive domain, the gradient of Soft ++ converges to ( ck + 1)/ c ( Fig. 1 ). When k and/or c is negative, exotic shapes of the activation function can be obtained.
By choosing the parameters k and c appropriately, the gradient of Soft ++ in the origin can be matched to that of ELU ( Fig. 1 c  and d). For later tests we found that matching the initial gradients (given that weights are narrowly distributed around 0) creates better conditions to compare activation functions because learning starts under similar conditions (see below for CIFAR-100). Unlike most other activation functions (except Softplus), Soft ++ is differentiable and has a smooth gradient across the entire real domain (see Fig. 1 b and d).

Network architecture
To test the performance of the various activation functions, we used: i) fully-connected multi-layered neural networks (MLP) with 6-10 hidden layers, a similar depth as used by Cire ¸s an [7] ; ii) deep convolutional neural networks with 7-11 layers (CNN). In all cases we maintained the architecture of the network fixed and created different copies with the different activation functions.
For fully-connected MLP networks, hidden layers contracted from the input towards the output, set as a percentage of the size of the input layer. These were set as follows, from input to output: for MNIST dataset (6 layers: 60%, 50%, …, 20%, 10%), for the visual cortex dataset (10 layers: 90%, 70%, 50%, 30%, 20%, 20%, …, 20%), for the synthetic dataset (10 layers: 110%, 100%, 90%, …, 30%, 20%). In all cases, an additional Softmax layer was added at the output in order to minimize cross-entropy [11] . The networks thus acted as probabilistic classifiers: Each output was constrained between 0 and 1, the value of output i representing the probability that a sample from class i was presented as input. A sample was labeled as belonging to the class whose corresponding output was largest ( argmax ).
Initial bias weights were drawn from a Gaussian distribution as follows: 0.5 ± 0.1 for the MNIST and synthetic dataset, and 0.0 ± 0.1 for the visual cortex dataset. In MLP networks, neuronneuron weights were initialized as (modified LeCun uniform): Fan − in ( i ) (8) where, w i,j is the weight from neuron j to neuron i, s is a scaling factor, U denotes the uniform distribution, and Fan-in(i) is the number of inputs neuron i receives from its afferents. The scaling parameter, s, was set to 0.1 for the MNIST and visual cortex datasets, and to 0.5 for the synthetic dataset. For deep CNNs, initialization of weights was performed as described in Glorot and Bengio [10] . In addition, when the SELU function was used in MLP networks, weights were initialized using the LeCun normal initializer [32] . In CNN networks with SELU activation function we implemented both the initializer used for the other functions (to have an exact matching of the testing conditions) and the LeCun normal initializer.
For tests on deep CNNs, we used two different network architectures, tested on CIFAR-10 and CIFAR-100 datasets. First, we implemented a version of the well-known network Alexnet [25] , modified in order to work on the CIFAR-10 dataset ( https://github. com/Natsu6767/Modified-AlexNet-Tensorflow ). This network is a deep convolutional neural network that consists of four convolutional layers (C [kernel size × number of kernels, stride] and same padding) with local response normalization (Lrn with radius = 5; alpha = 10 −4 ; beta = 0.75, bias = 2) and pooling (P with height = 3; width = 3; stride = 2 and valid padding), followed by three fully connected layers (Fc [number of nodes, dropout]), the last one having a Softmax activation. The network we used has the following: C1 [ Second, for tests on both CIFAR-10 and CIFAR-100 we developed a stacked network similar to that in [8] , but simpler, consisting of stacks of two convolutional layers followed by a pooling and dropout layer. The convolutional layers in each stack had the same number of kernels (kernel height × kernel width × number of kernels): [2 × 2 × 1024, 1 × 1 × 1024] followed by one fully connected output layer with Softmax activation. The dropout applied after the pooling was different for each stack and was greater as the kernel count increased: 0.1, 0.15, 0.2, 0.25, and 0.3. All weights were initialized using the uniform initialization method described in Glorot and Bengio [10] , and all biases were set to 0.001. For the SELU activation function we also used a version of the network where weights were initialized using the LeCun normal technique [32] , coupled with α-dropout [23] , applied only to the last two stacks (when α-dropout was applied to the same stacks like for Soft ++ , SELU networks did not converge).

Benchmark datasets 2.3.1. MNIST
For benchmarking, we used multiple datasets. First, we used the well-known MNIST database [31] with 70,0 0 0 samples of 10 hand-written digits, which is a standard benchmark for many machine learning techniques. The database contains 60,0 0 0 training and 10,0 0 0 test images, the sets being disjoint [29] . During training, only samples in the training set were presented to the networks, in each training step. The test set was split evenly into 5,0 0 0 validation and 5,0 0 0 test images.

CIFAR-10 and CIFAR-100
For convolutional networks, we used other well-known datasets, CIFAR-10 and CIFAR-100 [8 , 24] . These two databases contain a total of 60,0 0 0 32 × 32 × 3 images split evenly into 10 and 100 classes, respectively. Datasets are pre-divided into a training set of 50,0 0 0 images and a test set of 10,0 0 0 images, each having all classes equally represented. As with the MNIST dataset, only the training images were used for training networks while the test set was split evenly into 5,0 0 0 validation and 5,0 0 0 test images.

VISUAL cortex data
In many fields of science, where machine learning is used to discover structure in the data, having many examples per class is simply impossible, due to practical limitations and, often, the number of features can exceed the number of examples. Here, we experimented on such a dataset generated from in vivo recordings performed in visual cortex of mice. Multi-electrode extracellular electrophysiology signals were collected using Neuronexus silicon probes (A-type, 4 × 2 tetrode configuration) inserted into the visual cortex of isoflurane anesthetized mice (1.2% mixed with O 2 ). Animal manipulation and recordings were done in accordance with Romanian law, European Communities Council Directive 2010/63/EU regarding the care and use of animals for experimental procedures, and Society for Neuroscience guidelines and approved by the Local Ethics Committee (3/CE/02.11.2018) and the National Veterinary Authority (ANSVSA; 147/04.12.2018). The 32 signals were sampled at 32 kSamples/s, amplified 10,0 0 0 × and digitized. Visual stimuli consisted of full-field drifting gratings (0.11 cycles/deg; 1.75 cycles/s; contrast: 25%, 50%, and 100%; 8 directions in steps of 45 °; 10 trials per contrast-direction combination, 240 trials in total; stimulus presentation duration: 2600 ms). These visual stimuli drive cortical neurons vigorously [18] . Deep CNNs trained on images usually develop similar, grating-like receptive fields [25] . Data was filtered offline (band-pass 50 0 Hz-350 0 Hz) and spike-sorting [33] was performed to isolate the firing of individual neurons (single units).
After preprocessing and spike sorting the data, we obtained the spike (firing) times of 30 isolated visual cortex neurons. For each trial, we segmented the 2600 ms long recording into twenty 130-ms long windows and counted how many spikes each neuron fired within each window. A vector of 20 × 30 = 600 features was then compiled by concatenating all windows for all neurons. A number of 4 output classes were assigned to the trials by considering the 4 orientations (0-180 °, 45-225 °, 90-270 °, 135-315 °) of the visual drifting grating [21] presented to the animal. Thus, the final dataset used for experimenting with networks consisted of 240 samples with 600 features each and distributed evenly across 4 classes (60 samples per class). The dataset was then randomly split into 120 training examples (50%), 60 validation (25%) and 60 (25%) test samples. The split always balanced across the 4 classes. A final normalization was performed on the dataset by mapping all values to the interval [ −0.1, 0.1] (min-max normalization). The visual cortex dataset is freely available here: http://raulmuresan.ro/sources/softplusplus/VisualCortexData.zip

Synthetic dataset
For a final batch of tests we created a synthetic dataset with a very low signal-to-noise ratio, where a majority of features was irrelevant. The rationale behind this choice was that most machine learning techniques suffer from the presence of irrelevant features [37 , 27 , 1 , 13] , while deep learning was proposed to alleviate the problem [30] , at least in some cases [5 , 40] . In addition, this dataset resembles well the typical case found in many fields of science: large feature set, few classes and few examples. We chose an extreme case, whereby the dataset contained 100-dimensional samples but only 1 dimension contained information about the actual class. The 99 non-informative dimensions were random numbers drawn from a uniform distribution: The informative dimension was a function of the class and some added white noise, as follows: where, i is the class to which the sample belongs. We chose 10 classes and for each class generated 100 samples, yielding a dataset with 10 0 0 samples in total. This synthetic benchmark dataset is available for download here: http: //raulmuresan.ro/sources/softplusplus/ReferenceData.zip . For each training/testing, the synthetic dataset was randomly split in three disjoint subsets: training (60%), validation (20%), and testing (20%).

Tests setup
For each test, we created multiple identical networks as described in Section 2.2 , such that only the activation function differed across networks. Only for the case of SELU, we also created identical architectures but used a different weight initializer and dropout technique to favor self-normalization [23] . The parameters a , for PReLU, and k and c , for Soft ++ , were not adjusted during learning but kept fixed for each instantiation.

MLP networks
MLPs were trained for 30 0, 50 0, and 10 0 0 steps, for the MNIST, visual cortex, and synthetic dataset, respectively, using the batch IRPROP + algorithm [19] . For the visual cortex and synthetic dataset we applied full batch training (entire training set learned in each epoch). For MNIST, to speed up training, only 10% of the training set was randomly selected for each training step (like in minibatch) but every 10 training steps the full training set was presented to the network. This strategy allowed for a fast monitoring of convergence and also yielded good training results. To increase the robustness of training and the generalization ability, we used 5% dropout probability [16] for the synthetic dataset and 40% for the visual cortex dataset. No dropout was used for MNIST. Because performance can vary as a function of the random splitting of the dataset and of the initial weights of the network, we created between 20 and 50 networks for each test. To match the exact initial conditions across networks, instantiations were in some cases performed by fixing the random seed for obtaining the same network but with different activation functions -this however did not seem to yield different results from those using individual randomizations. The visual cortex and synthetic datasets were also randomly split into training/validation/test sets for each network. For the MNIST dataset random splitting was not applied because this was already split into disjoint training and test sets [ chose a fixed splitting into 60,0 0 0/5,0 0 0/5,0 0 0 samples for training/validation/test.

Convolutional neural networks
For the Alexnet on CIFAR-10 we used for training the Adam optimization [22] with a mini-batch size of 100 samples. We initially screened the networks for 200 epochs on the validation set and determined that performance already saturated by the 100th epoch. As such, all tests presented here were run for 100 epochs. In the present study, we were not interested in maximizing performance on CIFAR-10 but rather to compare how identical networks with different activation functions fared on this hard problem. Networks saturated at classification performances close to those reported in [24] , except networks with Soft ++ , which significantly surpassed them.
In an additional set of tests on CIFAR-10, we used the more expressive deep CNN (stacked network modified after [8] ) to further boost performance to values exceeding 90%. In that case, we only tested networks endowed with ELU, SELU and Soft ++ activation functions. For SELU, we used the required adaptations (LeCun initialization and α dropout applied to the last two stacks) [23] .
For the deep CNN networks on CIFAR-100 we wanted to replicate the previous study on ELU networks [8] , as well as to test the different activation functions in conditions as similar as possible in order for the comparison to be fair. As such, we preprocessed the data using global contrast normalization [12] , and instead of the ZCA, we used a difference of Gaussians filter in order to extract contrast information from the image [35] . We also imposed a constant learning rate of 0.005 and used momentum optimization with a parameter of 0.9. Images were rescaled to the interval −0.1 -0.1 for all networks (min-max). All functions were tested under identical training conditions, with a minibatch size of 100 and were run for 500 or 1500 epochs of training. We measured performance on the training and validation sets every epoch, and after the training was complete, we measured performance on the test set with the frozen network. Due to the high computational demand, we only ran three trials for each activation function. For SELU networks, we ran two different simulations: (i) preserving all parameters identical to those used with the other activation functions (SELU unadapted); (ii) adapting the initialization and the dropout technique to favor self-normalization, as described in [23] .
For different datasets and network architectures, we performed a hyperparameter search for determining the parameters of Soft ++ yielding the best performance. This search process was always done by considering the performance on the validation set only.
Network parameters, training algorithms and information about datasets are given in Table 1 .

Performance and learning curves on MNIST dataset
We first evaluated how networks with identical architecture but different activation functions fared on the standard MNIST dataset, and focused on the convergence speed. Initially, we tested ReLU, PReLU, Softplus, ELU, and three Soft ++ networks ( c = 10, 30, and 50, with k = 1) using the custom weight initialization (see Eq. (8) ) and raw inputs. Among these networks, Soft ++ and ELU performed the best, clearly surpassing ReLU and Softplus, and exceeding 98% accuracy on the test set ( Fig. 2 a). We also quantified the number of training steps required to reach a performance of 50% on the training and validation sets ( Fig. 2 b, left and right, respectively). In this first network group, Soft ++ reached a performance of 50% in the fewest steps ( Fig. 2 b-d).
Next, we tested the performance of SELU networks on MNIST. We kept the same architecture but changed the initialization (LeCun normal) and standardized the inputs, to favor selfnormalization [23] . Indeed, SELU networks surpassed the previous network group on the test set ( Fig. 2 a, right), but tended to overfit more, reaching close to 100% on the training set ( Fig. 2 a, left). To have a fair comparison to Soft ++ , we also tested these latter networks with c = 100 and k = 1.2 using the same weight initialization and standardized input like we used for SELU. We found that this new Soft ++ network easily reached a performance similar to SELU, being slightly better on the validation set ( Fig. 2 a, middle), a bit below on the test set ( Fig. 2 a, right), and overfitting less than SELU ( Fig. 2 a, left).
All networks, except the saturating ReLU and Softplus, exceeded 98% accuracy on the test set ( Fig. 2 a). Results on MNIST indicate that Soft ++ can be flexibly adapted to different weight initialization schemes and different input normalizations to boost performance.

Performance and learning curves on CIFAR-10 dataset
We next tested the deep Alexnet CNN trained on the CIFAR-10 dataset ( Fig. 3 a-e). Similarly as before, the saturating ReLU and Softplus networks performed worst -they reached a maximum performance of 63.95 ±1.75% and 63.93 ±1.33%, respectively ( Fig. 3 a  and 3e). Indeed, their performance increased significantly slower than for other networks ( Fig. 3 d; p < 0.01; heteroscedastic, double tailed t -test, Bonferroni corrected). PReLU networks ( Fig. 3 b) reached a maximum performance of 65.89 ±0.25%, converged significantly faster than both but slower than Soft ++ (see Fig. 3

d).
For both PReLU and Soft ++ we tested convergence as a function of their parameter ( a and c , respectively) and found that the steeper these functions were in the negative domain (larger a and smaller c ), the faster was the convergence ( Fig. 3 b and c). For Soft ++ , we always fixed k to 1. ELU networks converged faster than ReLU and Softplus but similarly to PReLU ( Fig. 3 a and d), however they surpassed the former three on the test set (67.17 ±0.81%) after training ( Fig. 3 e). Finally, Soft ++ networks with c = 0.5 converged the fastest of all network types and reached a maximum performance of 68.28 ±0.1%, surpassing all other networks ( p < 0.05).
When trained for only 100 epochs, the AlexNet reached a rather modest performance on the CIFAR-10 dataset. However, for the sake of diversity, we wanted to estimate learning dynamics with various activation functions on a well-known dataset (CIFAR-10) and with a well-known network architecture (AlexNet). Indeed, AlexNet is not optimized to learn on CIFAR, therefore we next resorted to the deep stacked CNN, inspired by a network that was optimized for ELU [8] . This network performed much better than Alexnet on CIFAR-10 ( Fig. 3 f and g): ELU and Soft ++ ( c = 1; k = 2) reached a performance on test set > 91.3%, but there was no significant difference between the two. With this network we also tested SELU (using LeCun normal weight initialization and α-dropout) and found that it performed worse than the other two activation functions on test set (89.12%). Indeed, while the dynamics of learning on the train set was similar across the three functions ( Fig. 3 f), on the validation set SELU networks performed significantly worse ( Fig. 3 g). Next, we used the same network architecture on a much harder problem: the CIFAR-100 dataset.

Performance and learning curves on CIFAR-100 dataset
CIFAR-100 contains 10 × more classes than CIFAR-10 but a similar number of examples. Therefore, learning on this dataset was expected to be much more difficult. On CIFAR-100 ELU networks were shown to reach a performance of 70-75% using a very deep CNN network, global contrast normalization and ZCA, dynamic learning rate schedule and adjusted dropout [8] . To be able to isolate the effect of the learning function, we resorted to a lighter,   stacked CNN architecture with only 11 layers, modified after the network in [8] and using fixed learning schedule and fixed dropout (see Methods). In addition, to have similar starting conditions we chose the parameters of Soft ++ such that its gradient in the origin amounted to 1, thus matching the gradient of ELU. Given the weight initialization procedure, ELU and Soft ++ networks started under the same conditions and diverged only as soon as the neurons' total input deviated from 0. The major advantage of Soft ++ over ELU is that by varying the relationship between c and k , the gradient in the origin can be flexibly set.

ARTICLE IN PRESS
In addition, we also created two more network types based on SELU. First, we kept the same exact settings as for the other networks (ELU & Soft ++ ), using the same weight initializer and same dropout scheme in order to have a one-to-one comparison with the other activation functions (SELU unadapted). Second, we adapted the SELU networks to operate in their preferred regime, using LeCun normal initialization and α-dropout (SELU), which was applied only to the last two stacks.
When trained for 500 epochs, Softplus networks did not surpass chance level, a result consistent with their slow convergence observed on the previous tests. ReLU and PReLU networks did converge, but to a significantly lower performance than ELU and Soft ++ ( Fig. 4 a). Indeed, ELU and Soft ++ networks seemed to be in a different class than ReLU and PReLU. We therefore focused on these two networks, adding the two SELU variants as well, and performed the training for 1500 epochs ( Fig. 4 b). The unadapted SELU networks lost convergence after about 700 training epochs ( Fig. 4 b, left & middle), indicating that keeping the same exact con-ditions as for ELU and Soft ++ is detrimental to their training. The adapted SELU, operating in its favorite conditions, performed better, finally reaching a similar performance to ELU. Soft ++ networks significantly surpassed both ELU and SELU networks on CIFAR-100 ( Fig. 4 b, right; p < 0.05; heteroscedastic, double tailed t -test) and they also had a tendency to overfit less ( Fig. 4 b, left).
We tested multiple parameter instantiations for Soft ++ and found that increased performance is obtained with smaller values of c (i.e. c = 2, k = 1, in which case the gradient in the origin matches that of ELU), the best networks exceeding a performance of 70% on the test set (data not shown). Thus, the flexible adjustment of Soft ++ parameters allows optimization of learning for very hard problems, like CIFAR-100.

Performance and learning curves on the visual cortex dataset
The MNIST, CIFAR-10 and CIFAR-100 datasets are well-known in the machine learning community and provide robust test cases for various machine learning frameworks because each class is relatively well represented, with hundreds / thousands of examples. However, in certain fields of science where machine learning is increasingly used to explore structure in the data, obtaining a large number of examples per class is difficult, if not impossible. One such case is neuroscience, where data is recorded from live animals and the measurement window is typically limited in time [36 , 41] . This leads to datasets with many features (huge amount of signals recorded) but with few classes (e.g., visual stimulus type) and few examples per class (tens of trials at best). To make things worse, the scientist does not know which of the measured features is relevant and therefore includes a large array of features, many of which are not relevant. Examples may also be unbalanced across classes, introducing further complications [41] . Such datasets are found in many fields of science where measurement time is limited, they are typically small and noisy, and pose tremendous challenges to machine learning techniques. Learning from few examples on a large array of (possibly irrelevant) features is a very hard problem. We next tested how deep MLPs performed on a dataset assembled from the firing statistics of 30 visual cortex neurons. Their responses were recorded using extra-cellular in vivo electrophysiology during visual stimulation with drifting gratings spanning 8 directions / 4 orientations (opposing directions merged: 0-180 °, 45-225 °, 90-270 °, 135-315 °; see Methods). These cortical neurons typically respond with elevated firing rates that are modulated by the passage of the grating bars through their receptive fields ( Fig. 5 a, top) and fire selectively to preferred directions/orientations ( Fig. 5 a, bottom). For a particular stimulus direction, these neurons not only fire selectively in their average firing rate but also exhibit a temporal modulation along the trial ( Fig. 5 b). Information about the particular visual stimulus presented to the animal can therefore be recovered by considering the relative time-resolved modulation of the firing of different neurons. By segmenting each trial into 10 bins and computing the number of spikes per each bin for each neuron, we assembled feature vec- The visual cortex dataset was hard to learn for several reasons. First, cortical neurons fire with high trial-to-trial variability [44] . Second, the sampling of cortical activity with current in vivo technology is relatively random and minuscule, sampling just a few tens of units out of tens/hundreds of millions of neurons of the visual cortex. As a result, datasets extracted from such data are small and noisy. Indeed, ReLU and PReLU networks were not even able to learn on this dataset, remaining at chance level ( Fig. 5 c, 5 e, and 5 f). The same was the case for SELU networks (adapted) with dropout, which performed poorly on this dataset (data not shown).
PReLU networks could learn but poorer Fig. 5 c, e, and f) and slower ( Fig. 5 d) than ELU and Soft ++ , which, again, seemed to be in a different league than the other activation functions. Overall, Soft ++ performed robustly for different parameters but it had the best performance for a small c (2) , where it clearly surpassed ELU. Finally, Soft ++ networks learned significantly faster than ELU networks ( Fig. 5 d; p << 0.001; heteroscedastic, double tailed t -test, Bonferroni corrected).

Performance and learning curves on the synthetic dataset
In a final set of tests we generated a synthetic dataset that represents an extreme case of a small, noisy dataset, with a majority of features irrelevant and only one relevant feature. This represents well many datasets found in neuroscience and other fields of science, whereby the scientist does not know a priori what feature is relevant and therefore includes all the empirically defined features. As expected, learning in fully-connected MLP networks was more difficult than on MNIST ( Fig. 6 ). Softplus had the slowest initial convergence but reached a higher final performance than ReLU and SELU. The latter performed the worst  on all sets ( Fig. 6 a). PReLU learned more aggressively than Softplus ( Fig. 6 b) but generalized worse than the latter ( Fig. 6 a, c, and d).
Of all networks, ELU learned the most aggressively ( Fig. 6 b-d) but its validation and test set performances saturated below Soft ++ networks. The latter generalized significantly better, surpassing all other activation functions on the validation and test sets ( Fig. 6

Conclusions
We have investigated the importance of the shape of various activation functions in deep networks. Classical MLPs, with saturating activation functions (sigmoid, tanh), used to be very difficult to train due to the vanishing gradient problem [3] . Consequently, their structure was shallow (up to 1-3 hidden layers), to enable efficient training. By contrast, here we investigated networks with 6-11 hidden layers, which are considered deep [6 , 7] . Our results suggest that on such networks, both the non-saturating property and the actual shape of the activation function play a role in training. In addition, we also demonstrated that these conclusions hold in deep convolutional neural networks as well.
The most popular activation function for deep learning, ReLU, is non-saturating only in the positive domain. Our results indicate that it performed the worst on all tested datasets. Its fully nonsaturating version, PReLU [15] , performed better than ReLU but it was sometimes outperformed by Softplus, a saturating function in the negative domain, but smooth and non-linear around the origin. Of all activation functions, Softplus provided the slowest convergence. By contrast, ELU, another saturating function but with both positive and negative values around the origin, and with a significant gradient around it, is a remarkable activation function, providing in general fast and robust convergence, as reported in [8] .
Its scaled variant, SELU [23] can also provide good results on some datasets but it seems less robust on very difficult or noisy datasets, and on datasets with many irrelevant features.
Here, we introduced Soft ++ , which is a novel activation function that extends both Softplus and PReLU, combining the advantages of both: it is smooth and non-linear around the origin and it is non-saturating, thus favoring training of deep networks. In addition, Soft ++ has two parameters, which enable the flexible adjustment of the initial gradient for learning and of the slopes in the positive and negative domain. In general, networks endowed with this activation function outperformed those endowed with ReLU, PReLU, Softplus, SELU and ELU activation functions, converging significantly faster and/or reaching a higher classification performance on the validation and test sets, under the same conditions.
Smoothness and non-linearity around the origin appear to be even more important for learning datasets with a majority of irrelevant features. Networks having activation functions with purely linear slopes in the negative and/or positive domains (ReLU, PReLU) have a tendency to strongly overfit on such datasets. The smooth and nonlinear nature of Softplus and Soft ++ around the origin avoids such overfitting. In addition, Soft ++ networks behave surprisingly well on tasks with a high number of irrelevant features, discovering quickly the relevant ones. In such cases, Soft ++ facilitates learning that also provides the conditions for robust generalization over the validation and test sets, minimizing cross-entropy.
Parametrization is an additional important property of PReLU, SELU and Soft ++ . PReLU and Soft ++ activation functions can be

ARTICLE IN PRESS
JID: NEUCOM [m5G; December 27, 2019;23:20 ] adjusted to enable more or less aggressive learning: a smaller c (larger a ) corresponds to a steeper slope in the negative domain, while a larger k determines a steeper slope in the positive domain. The double parametrization of Soft ++ comes at a cost: the space of hyperparameters to be explored when training a Soft ++ network is larger than the one of an identical network with a different activation function (e.g., RELU or ELU). Given a grid search on M hyperparameters with N values each, N M experiments are required to find the best configuration. In the case of Soft ++ , the number of hyperparameters is increased by two, hence N M + 2 experiments will be required. Determining the best Soft ++ parameters for a certain dataset is not always an easy task. On the other hand, following the approach in [15] , the parameters of these functions can be learned and, even better, adjusted individually for each neuron. In a future study, it will be very interesting to address how Soft ++ networks with learned c and k per each neuron fair on very difficult datasets, like CIFAR-100 or, e.g., the noisy and small datasets extracted from neuroscience data.
Soft ++ outperformed the other activation functions on all the tested datasets. The function was initially developed to cope with the small and noisy datasets, with many irrelevant features, which are found in neuroscience, but its parametrization lends it a clear advantage also for the case of larger image datasets. By far, the strongest competitor of Soft ++ is ELU, which performs remarkably well on all the tested problems. Soft ++ , ELU, and sometimes SELU appear to belong to a different league than all the other activation functions.
The most important message of the present study is that, for a given activation function, convergence speed and accuracy depend critically on the properties of the dataset and the architecture of the network. Therefore, one would ideally need a parametrized activation function that can be best adapted to the problem at hand. We have shown that Soft ++ is such a versatile activation function, allowing one to obtain significantly faster and/or more accurate convergence than other activation functions. In particular, the flexibility of Soft ++ and its smooth gradient over the entire real domain enables it to gain a critical advantage over ELU and SELU.
To conclude, combining the fully non-saturating character with a smooth and non-linear shape around the origin for the activation function is beneficial for training deep neural networks. Such a function is Soft ++ , whose dual parametrization enables the flexible choice of the slopes/gradients in the positive and negative domains, as well as the shape of the function around the origin. As a result, Soft ++ can significantly accelerate learning and improve generalization on complex datasets, including those with many irrelevant features. It will be interesting to study how this activation function fares on difficult datasets once the slope coefficient, c , and exponent coefficient, k , are trained, individually for each neuron and along with the weights in the network.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.