Adaptive Morphing Activation Function for Neural Networks

Abstract: A novel morphing activation function is proposed, motivated by wavelet theory and the use of wavelets as activation functions. Morphing refers to the gradual change of shape to mimic several apparently unrelated activation functions. The shape is controlled by the fractional-order derivative, which is a trainable parameter to be optimized in the neural network learning process. Given the morphing activation function, and taking only integer-order derivatives, efficient piecewise polynomial versions of several existing activation functions are obtained. Experiments show that the performance of the polynomial versions PolySigmoid, PolySoftplus, PolyGeLU, PolySwish, and PolyMish is similar to or better than that of their counterparts Sigmoid, Softplus, GeLU, Swish, and Mish. Furthermore, it is possible to learn the best shape from the data by optimizing the fractional-order derivative with gradient descent algorithms, leading to the study of a more general formula based on fractional calculus to build and adapt activation functions with properties useful in machine learning.


Introduction
Since the seminal work of McCulloch and Pitts on the neural model [1] and Rosenblatt on perceptrons [2], activation functions have been a topic of interest in the context of artificial neural networks. In particular, given the singularity of the Heaviside function, differentiable functions such as Sigmoid or tanh were introduced as activation functions [3]. Later, for deep learning, more activation functions were proposed to solve new challenges such as the vanishing gradient problem [4,5], where ReLU [6] plays an important role, to the point of being considered the state of the art. Given the success of ReLU, more variants have been proposed, including Leaky ReLU [7], ELU [8], GeLU [9], Swish [10], and Mish [11], among others.
Two relevant concepts that have been considered to improve the results of neural networks are (i) activation functions with adaptive parameters and (ii) activation functions with fractional derivatives, or even a combination of both, as in this paper. Certainly, this work is not a pioneer in introducing these concepts, and the papers [12][13][14][15][16][17] are evidence of this. In addition to the previously cited works, as justification and motivation to continue this research on applying fractional derivatives to activation functions, the research of [18,19] was carried out, where fractional gradients were applied successfully to the backpropagation algorithm to improve the performance of the gradient descent algorithms available in TensorFlow and PyTorch. In such a case, the formula for updating the synaptic weights requires the derivative of the activation function, and it involves the factor $D^{\nu}_{x}x = \frac{x^{1-\nu}}{\Gamma(2-\nu)}$, which reduces to 1 in the case of ν = 1, i.e., the first derivative of x with respect to x is 1, and makes it possible to extend the update rule from integer to fractional orders. Furthermore, the experiments support the idea that fractional optimizers improve on their integer-order versions.
Regarding activation functions with adaptive parameters, they can be trained through the learning process and are therefore called trainable or learnable, since the adaptation is obtained from the training data. Indeed, to improve the accuracy of neural networks, adaptive activation functions such as SPLASH or APTx have been studied. On the one hand, SPLASH is a class of learnable piecewise linear activation functions with a parameterization to approximate non-linear functions [20]. On the other hand, APTx is similar to Mish but is intended to be more efficient to compute [21]. Other related works study ReLU variants, such as FELU [22], SELU, MPELU, and DELU [23], with parameters that consider non-zero values for the negative plane to solve the "Dying ReLU" phenomenon that stops learning [24]. Specifically, DELU considers three parameters, α, β, z_c ≥ 0, to be determined, and special cases are ELU (α = β = 1, z_c = 0) and ReLU (z_c = 0, β → ∞). The Shape Autotuning Activation Function (SAAF) was proposed to simultaneously overcome three weaknesses of ReLU related to non-zero mean, negative missing (zero values), and unbounded output [25]. SAAF is a smooth function like Sigmoid and tanh, and piecewise like ReLU, but avoids some of their deficiencies by adjusting a pair of independent trainable parameters, α and β, to capture negative information and provide a near-zero mean output, which leads to better generalization and learning speed, as well as a bounded output. In [26], three variants of tanh, tanhSoft1, tanhSoft2, and tanhSoft3, are presented as trainable variants depending on the parameters α, β, and δ, which are used to control the slope of the curve on both the positive and negative axes. Also, TanhLU is proposed in [26] as a combination of tanh and a linear unit. All these works are examples of the effort to build and generalize some activation functions that, in addition to inheriting good properties from the original, provide flexibility with parameters to acquire properties useful for machine learning, such as controlling overfitting or avoiding stalled learning.
The other relevant topic considered in activation functions is the fractional derivative. The theoretical basis to extend the activation function derivative from integer to fractional order is the fractional calculus theory [27,28] that generalizes the integer operators: integration and differentiation. These two operators are combined in a single concept of fractional derivative once the integration is conceived as an antiderivative; thus, the orders are positive for derivatives and negative for antiderivatives so that, when added, the result is a real number, positive, negative, or zero. Historically, instead of "real", it is referred to as "fractional", perhaps because, in a letter, L'Hôpital asked Leibniz about the order 1/2 [29]. An intuitive approach comes from the interpolation between two functions. Let f(x) = x and its first derivative f^1(x) = 1; then, it is possible to build a "fractional" version f_α(x) with both of them for α ∈ [0, 1]. If α = 0, there is no derivative and f_α(x) is equal to x. However, if α = 1, then f_α(x) is equal to 1, meaning the first derivative of x. For "fractional" values, i.e., 0 < α < 1, it corresponds to a line with gradient (slope) 1 − α. So, it offers a mechanism to control the gradient, which could be useful in backpropagation algorithms [18,19] by considering that f(x) could represent a more sophisticated activation function. It is illustrated in Figure 1, where there is a line with unitary slope for α = 0 and a horizontal line for α = 1. This 3D graph represents all lines with slopes in the interval [0, 1]. Effectively, fractional gradient optimizers [18,19] and fractional activation functions have been proposed based on fractional calculus concepts, as a generalization of the classic integer-order operators. For example, ref. [13] combines fractional derivatives of the activation functions and a fractional gradient descent backpropagation algorithm for perceptrons where, instead of choosing the first derivative, the fractional derivative of the activation function is applied. The experiments support the idea that the models that involve fractional derivatives are more accurate than the integer-order models. The fractional approach allows us to obtain the memory and hereditary properties of the processes described by the data, since this is a property of the fractional calculus operators [27,28,30]. This is possible because the derivative focuses on a point, but the integral operator covers (observes) a neighborhood (integration interval) around a point of interest. A more extensive study of the interpretation of the fractional derivative can be reviewed in [31], where one can find an analogy that was previously described from a physical point of view, specifically concerning the concept of divergence, which measures how much the vector field behaves like a source (positive divergence) or a sink (negative divergence).
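One simple way to realize this interpolation (a minimal illustration of the idea; the exact construction behind Figure 1 may differ) is the convex combination of f and its first derivative,

$f_{\alpha}(x) = (1-\alpha)\,x + \alpha \cdot 1, \qquad \alpha \in [0, 1],$

which is the identity line for α = 0, the horizontal line at height 1 for α = 1, and a line with slope 1 − α in between.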
In [17], fractional sigmoid activation functions and Fractional Rectified Linear Units (FReLUs) are proposed, and the order of the derivative of linear fractional activation functions is optimized in [15], giving rise to Fractional Adaptive Linear Units (FALUs).
Another possibility to apply fractional derivatives beyond the activation functions is to modify the standard loss functions, such as the Mean Squared Error (MSE), whose power of two can be replaced by a fractional exponent a to obtain a fractional FMSE [12]. When the fractional order a = 2, FMSE is equivalent to MSE, but when a is not an integer, FMSE identifies more complex relationships between the input and the output. A similar approach can be used with the Cross-Entropy Loss function to obtain a Fractional Cross-Entropy Loss function. However, the authors of [12] report that the fractional order a must be adjusted by trial and error.
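As a rough illustration of this idea (a minimal sketch, not the formulation of [12]; the exponent handling and the reduction over the batch are assumptions), a fractional-power MSE could be written in PyTorch as:

```python
import torch

def fractional_mse(pred, target, a=1.8):
    # Hedged sketch of the FMSE idea: the power of two in the MSE is replaced by
    # a fractional exponent a; a = 2 recovers the ordinary MSE. The exact form
    # used in [12] may differ, and a is tuned by trial and error.
    return torch.mean(torch.abs(pred - target) ** a)
```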
Wavelets are also on the list of activation functions [32] and, due to their oscillatory property, they allow us to define sophisticated classification regions [33]. In this paper, a morphing activation function is presented based on fractional calculus. Morphing refers to the idea of changing the shape gradually to mimic different activation functions. Morphing is possible by applying the Caputo derivative [34] and varying the fractional ν-order to obtain shapes very similar to other activation functions, including Heaviside, Sigmoid, ReLU, Softplus, GeLU, Swish, and Haar. Related works present relationships between a few similar activation functions, for example, ReLU and Leaky ReLU, or ReLU and Heaviside, or Sigmoid and Softplus, or the hyperbolic tangent and the squared hyperbolic secant [14]. All the reviewed papers on adaptive activation functions focus on families of closely related activation functions. Adaptive or fixed ReLU variants such as SAAF, SELU, MPELU, DELU, Leaky ReLU, ELU, GELU, Swish, and Mish focus on solving the drawbacks of the fundamental ReLU, but they follow a similar shape pattern. In contrast, SPLASH aims to approximate any function, including activation functions, but requires a significant number of piecewise linear components to reach a smoothness similar to Sigmoid, tanh, Swish, or Mish, and it lacks differentiability at the hinges. However, at the time of writing this paper, the morphing function is unprecedented in the sense of linking several seemingly unrelated activation functions, essentially by means of the fractional-order derivative. The kind of shapes that the proposed morphing function can emulate is broad, since it evolves from wavelet to triangular, ReLU variants, sigmoid, and polynomial shapes depending only on the fractional-order derivative. So, given a single parameter, it is able to reproduce an infinite set of activation functions with different behaviors, and when necessary, the shapes can be smooth. Indeed, the morphing activation function considers the fractional ν-order as a parameter, and it is possible to obtain the optimal value from the data in the training process, so it is trainable. Additionally, it facilitates obtaining mathematical expressions of piecewise activation functions with a polynomial approach, which could lead to improved computational efficiency. It is worth noting that these points are relevant because, compared to other adaptive activation functions, the morphing function has the fewest parameters and explores different shapes. Thus, rather than focusing on a family or subset of adaptive or fixed activation functions, the morphing function aims to encompass as many activation functions as possible. Therefore, this research has a twofold purpose: the first is to obtain a single equation that mimics a large list of activation functions, and the second is to obtain the optimal activation function shape by calculating the appropriate parameters from the training data using gradient descent algorithms; this approach differs from other related works.
The approach that SAAF and MPELU follow shares some goals with this research by attempting to combine the advantages of several activation functions. In the case of SAAF, it has parameters to mimic Mish or Swish in the negative plane, whereas MPELU is limited to ReLU and ELU variants. However, Morph goes further and manages to unify more activation functions by introducing fractional derivatives and wavelet concepts.
Experimental results show a competitive performance of the morphing activation function, which learns the best shape from the data and adaptively mimics other existing and successful activation functions. How is this possible? The fractional-order derivative is declared as a hyperparameter in PyTorch, as part of the learning process. Given an initial value, it changes towards the optimal value guided by the gradient descent algorithm used to optimize the rest of the parameters. But why does the Morph activation function work so well? Because it is able to mimic other successful activation functions.
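As a minimal sketch of how such a trainable order can be declared (our own illustration, not the paper's code; the module and argument names are hypothetical), the fractional order can be registered as an nn.Parameter so that the optimizer updates it together with the synaptic weights:

```python
import torch
import torch.nn as nn

class TrainableNu(nn.Module):
    """Sketch: wrap any activation parameterized by a fractional order nu so that
    the same gradient-descent step that updates the weights also updates nu."""
    def __init__(self, activation, nu_init=0.75):
        super().__init__()
        self.nu = nn.Parameter(torch.tensor(float(nu_init)))  # trainable fractional order
        self.activation = activation                           # callable taking (x, nu)

    def forward(self, x):
        return self.activation(x, self.nu)
```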
Finally, this paper aims to progress towards the construction of a more general formula to unify as many activation functions as possible.

Materials
The following topics are briefly reviewed in this section: fractional derivatives, fractional and parametric activation functions, and a new morphing activation function Morph() based on fractional calculus. This provides the necessary material to develop the experiments which support the conclusions.

Fractional Derivatives
There is no single definition of fractional derivatives. In this paper, only two are described: the Riemann–Liouville-type conformable (RL-type) fractional derivative and the Caputo fractional derivative [34]. The former allows us to obtain some generalized activation functions by replacing the exponential term [17]. The second is the Caputo derivative used to obtain Morph(), a novel fractional activation function that mimics various activation functions.

Definition 1. Let 0 ≤ a < x < +∞. The Riemann–Liouville-type conformable fractional derivative of f of order ν is given by Equation (1). In particular, it can be evaluated for f(x) = e^{−x}. The exponential function f(x) = e^{−x} is very important since it appears as the cornerstone of several activation functions. In this sense, Equation (1) was applied in [17] to obtain fractional versions of classical activation functions that involve exponentials.

Activation Functions
Activation functions introduce the non-linearity required for neural networks to efficiently map inputs to outputs. In this section, several activation functions are reviewed and some of their properties are briefly discussed. Later on, a new family of activation functions is proposed.
Heaviside. This is a fundamental discriminator and activation function, defined as a unit step that is zero for negative inputs and one for positive inputs. The discontinuity at x = 0 is a drawback of the Heaviside function, as well as the zero derivative of its piecewise parts. However, its behavior is consistent with the Perceptron Convergence Theorem, which states that for a binary classification problem, where the data are labeled as classes 0 or 1 and are linearly separable, the perceptron will define a separation hyperplane for the two classes in a finite number of iterations [3].
Sigmoid and tanh. To deal with the discontinuity of the Heaviside, a Sigmoid activation function can be used, since it defines a smooth curve given by: The tanh function also defines a smooth curve expressed as: and there is a relationship between Sigmoid and tanh: Sigmoid and tanh satisfy the conditions of the Universal Approximation Theorem [37], which states that a linear combination of bounded, monotone-increasing continuous functions is enough to approximate, with the desired precision, any arbitrary continuous function on compact sets of R^n. In fact, this has been a justification to build Multilayer Perceptron networks starting with a single hidden layer [38].
Both functions, Sigmoid and tanh, suffer from the vanishing gradient problem, which means that the backpropagation of the adjustment of the free parameters decays through the layers [4,5].
Sigmoid and tanh functions involve exponentials, and the series $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$ reveals the need to use many terms to reach the non-linearity of the function, with the consequent increase in computational complexity.
ReLU. The Rectified Linear Unit activation function was mainly proposed to deal with the vanishing gradient problem of the Sigmoid and tanh functions [39]. ReLU is simple and elegant: The ReLU does not satisfy the bounded condition of the Universal Approximation Theorem [37]. However, it has shown a great improvement in speed and accuracy in deep learning models over other activation functions.
LReLU. Leaky ReLU is a variant of ReLU intended to solve the "Dying ReLU" problem, where a zero gradient for x ≤ 0 causes "neuron deaths" because ReLUs cannot update the synaptic weights during training [7]. So, LReLU considers a small gradient α > 0 for x ≤ 0, and its mathematical expression is:

ELU. The Exponential Linear Unit activation function considers negative values for x ≤ 0, using an idea similar to Leaky ReLU. Given α > 0, the ELU formula is [8]:

Swish. It was proposed by Google, and many experiments report that Swish enhances the ReLU performance [10,40]. The expression for Swish is:

GeLU. Let Φ(x) = P(X ≤ x), X ∼ N(0, 1), be the cumulative distribution function of the standard normal distribution. The Gaussian Error Linear Unit (GeLU) multiplies x by Φ(x). It can be approximated by [9]:

Softplus. This activation function uses logarithmic and exponential functions as follows: For $e^x \gg 1$, the logarithm and the exponential cancel out, and then it becomes similar to ReLU, but it follows a smooth curve when x is close to zero.

Mish. It is obtained by composing tanh with Softplus and then multiplying by x. Mish is expressed as:

SAAF. The Shape Autotuning Activation Function is defined as: where α and β are trainable parameters [25]. In the negative plane, SAAF follows a behavior similar to Mish or Swish, whereas in the positive plane, SAAF reaches a maximum value and the asymptotic saturation rate is controlled by the two parameters.
FELU. The acronym refers to the Fast Exponentially Linear Unit activation. It uses properties of bit shifting together with integer algebra operations to achieve a fast exponential approximation based on the IEEE-754 floating-point representation. So, it reduces to a base-2 exponential with logarithms in the exponent: and the curve that it represents is similar to ELU.

SELU. The Scaled Exponential Linear Unit is an activation function that addresses the covariate shift problem, which means that the distribution of the network activations changes at every training step, which in turn may slow down training [41]. The SELU properties are useful to train networks with many layers. For example, the self-normalizing property implies that node activations remain centered around zero and have unit variance. The SELU formula is:

DELU. It stands for "extendeD Exponential Linear Unit" and is an extended version of ELU. Given α ≥ 0, β ≥ 0, and x_c ≥ 0, the DELU function is defined as: A drawback of DELU is the possible discontinuity at x_c. The condition for continuity is βx_c = e^{αx_c} − 1 with x_c ≥ 0. ELU is a special case of DELU when α = β = 1 and x_c = 0. Also, a ReLU approximation is given by x_c = 0 and β → ∞.

MPELU. Multiple Parametric Exponential Linear Units is an activation function that aims to generalize and unify ReLU and ELU units. The main idea is to lead to better classification performance and convergence. MPELU is able to adaptively switch between the rectified and exponential linear units, and it makes α a learnable hyperparameter to further improve its representational ability and tune the function shape [42]. MPELU is expressed as: where β > 0.
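For reference, the classical activation functions reviewed above can be written compactly in PyTorch using their standard textbook definitions (the GeLU line uses the common sigmoid approximation x·σ(1.702x)):

```python
import torch

# Standard definitions; parameter defaults are the usual textbook choices.
def heaviside(x):            return (x > 0).float()
def sigmoid(x):              return 1.0 / (1.0 + torch.exp(-x))
def relu(x):                 return torch.clamp(x, min=0.0)
def leaky_relu(x, a=0.01):   return torch.where(x > 0, x, a * x)
def elu(x, a=1.0):           return torch.where(x > 0, x, a * (torch.exp(x) - 1.0))
def softplus(x):             return torch.log(1.0 + torch.exp(x))
def swish(x):                return x * sigmoid(x)
def gelu_approx(x):          return x * sigmoid(1.702 * x)   # sigmoid approximation of GeLU
def mish(x):                 return x * torch.tanh(softplus(x))
```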
Wavelets. They are sets of non-linear bases of vector spaces that oscillate and then vanish away from the origin [43]. Typically, wavelets are used via the wavelet transform that maps a temporal signal to the time-scale space, but wavelets can also be used as activation functions [32].
There are infinitely many wavelet functions, and just a few of them are mentioned here. The Haar wavelet is defined as: The Mexican Hat wavelet belongs to a family of continuous functions and is the second derivative of the Gaussian function $-e^{-x^2/2}$: It is noteworthy that all the derivatives of these Gaussian wavelets are also wavelets [44].
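Assuming the standard conventions for these two wavelets (the paper may use shifted or rescaled variants), they can be sketched as:

```python
import torch

def haar(x):
    # Standard Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere.
    return torch.where((x >= 0) & (x < 0.5), torch.ones_like(x),
                       torch.where((x >= 0.5) & (x < 1.0), -torch.ones_like(x),
                                   torch.zeros_like(x)))

def mexican_hat(x):
    # Second derivative of -exp(-x^2 / 2), up to the usual normalization constant.
    return (1.0 - x ** 2) * torch.exp(-x ** 2 / 2.0)
```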
Wavelet bases are associated with multiresolution analysis, where vector spaces are spanned by the dilation and translation of a (mother) wavelet f ∈ L²(R). Let a ≠ 0 and b ∈ R be the scaling and translation parameters of f. The continuous wavelet transform W_f g of a function g ∈ L²(R) with respect to f is [45]: Given the admissibility condition $\int \frac{|\hat{f}(\omega)|^2}{|\omega|}\,d\omega < \infty$, where $\hat{f}$ is the Fourier transform of f, the reconstruction formula for g(x) is [43,46]: which means that a function g ∈ L²(R) can be represented through its decomposition at different resolutions (scale or frequency analysis).
It is worth mentioning that for Q > 0, if g is defined as a sigmoid function restricted to the interval [−Q, Q], then the wavelet transform of g with respect to the Haar wavelet is [47]: where $g^{(-1)}(x) = \mathrm{Softplus}(x)$.
As already mentioned, wavelets can be used as activation functions, and this paper addresses this approach. In fact, the main contribution of this work is to propose a morphing activation function inspired by the Haar wavelet expressed in terms of Heaviside functions.
Triangular. A triangular activation function [16] can be expressed as a piecewise linear and continuous function, for example:

APTx. The activation function APTx [21] refers to "Alpha Plus Tanh Times" and has three parameters, α, β, and γ: For α = β = 1 and γ = 1/2, APTx has the same shape as Mish, described in Equation (16), but with the difference that Equation (29) does not use Softplus, which could improve the computational efficiency.
SPLASH. This is the acronym of Simple Piecewise Linear and Adaptive with Symmetric Hinges. These functions are continuous and piecewise linear, with hinges placed at fixed locations, while the slopes of the line segments are learned from the data [20]. SPLASH functions are expressed as: Essentially, a SPLASH connects ReLU functions with different slopes and aims to approximate concave and convex activation functions.

Fractional Activation Functions
Considering the fractional derivatives of Section 2.1, it is possible to define fractional activation functions.
A first fractional activation function is based on Equation (1), which provides a generalized fractional exponential function of ν-order, ν ∈ (0, 1]. Thus, as described in [34], activation functions that involve Sigmoid functions can be generalized to fractional versions. For example, given the Sigmoid of Equation (6), the RL-type Fractional Sigmoid can be defined as: On the other hand, the RL-type fractional tanh function is: We mention once again that this approach of applying the RL-type fractional derivative is not the one followed in this paper to obtain the Morph() function; a different one, based on the Caputo fractional derivative, will be shown later.
To improve accuracy, reduce overfitting, and create smaller and more efficient neural network models, activation functions with parameters can be adjusted from the data according to the learning process. In other words, the parameters become hyperparameters, and gradient-based methods can be used to optimize them. This is the case of PReLU, which is described below.
At this point, it is convenient to introduce a fractional derivative for ReLU based on the definition of Caputo [15,34,35].
For x ∈ R, the fractional derivative of ν-order of the ReLU function is given by Equation (33). There are two notable cases: (i) if ν = 0, Equation (33) reduces to the ReLU of Equation (9); (ii) if ν = 1, Equation (33) becomes the Heaviside function, given that for x > 0 the result is constant and equal to one. This gradual change of shape (morphing) is illustrated in Figure 2, which shows ReLU^ν_x(x) for ν ∈ [0, 1]. The ν-value is displayed on the y-axis. Since ν is real, there are infinitely many activation function shapes.
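The displayed formula corresponds to Equation (33); a minimal PyTorch sketch, assuming it follows the factor $x^{1-\nu}/\Gamma(2-\nu)$ quoted in the Introduction for x > 0 and zero elsewhere, is:

```python
import torch

def frac_relu(x, nu):
    """Sketch of the fractional-order ReLU implied by D_x^nu x = x^(1-nu)/Gamma(2-nu):
    x^(1-nu) / Gamma(2-nu) for x > 0 and 0 elsewhere (well defined for nu < 2).
    nu = 0 recovers ReLU and nu = 1 recovers the Heaviside step."""
    nu = torch.as_tensor(nu, dtype=x.dtype)
    gamma = torch.exp(torch.lgamma(2.0 - nu))              # Gamma(2 - nu)
    pos = torch.clamp(x, min=1e-10) ** (1.0 - nu) / gamma  # positive branch
    return torch.where(x > 0, pos, torch.zeros_like(x))
```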
The parameters of activation functions, such as α of Equation (34), can be considered hyperparameters to calculate their optimal values in the learning process. In this way, several parametric activation functions can be obtained, and some examples are listed below:
• LReLU(x, α) is a Parametric Leaky ReLU (PLReLU) depending on α according to Equation (10).
• ELU(x, α) is the Parametric ELU (PELU) that depends on α according to Equation (11).
In fractional activation functions, the ν-order derivative may also be considered a learning parameter. Consequently, more activation functions can be obtained, such as a parametric fractional Leaky ReLU, LReLU^ν_x(x, α), that depends on the optimization of α and ν according to Equation (34).
For a ≠ 0 and b ∈ R, the fractional derivative of ν-order of the ReLU function, considering dilation a and translation b, is: The same dilation and translation operations can be applied to other parametric fractional activation functions, like LReLU^ν_x(x) of Equation (34).

Morph(): Adaptive Fractional Activation Function
A Haar wavelet, Equation (22), can be expressed in terms of Heaviside functions H(x) as follows [46]: This expression inspired us to propose a parameterized activation function (morphing) depending on ν ∈ (−∞, 1], given by: where the translations of the ReLU^ν_x terms are chosen to center the shape along the x-axis. In this paper, the nomenclature Morph^ν_x(x) is the same as Morph(x, ν); the first form is intended to emphasize the derivative order (not to be confused with an exponent), and the second emphasizes that the change of shape depends on the parameter ν, the fractional order. Figure 3 shows Morph(x, ν) for −5 < ν < 1. When ν changes, the shape mimics a Haar wavelet, Triangular, Sigmoid, ReLU, and polynomial functions.
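A minimal sketch of Morph(x, ν), reusing the frac_relu function above and assuming unit translations ±1 as the reading of "centered along the x-axis" (to be checked against the paper's Equation (37)), is:

```python
def morph(x, nu):
    # Haar-like combination of translated fractional ReLUs; the unit translations
    # are an assumption consistent with the special cases discussed in the text.
    return frac_relu(x + 1.0, nu) - 2.0 * frac_relu(x, nu) + frac_relu(x - 1.0, nu)
```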
In the following sections, special cases of Morph(x, ν) are studied for some relevant integer values of ν, which allow us to obtain piecewise polynomial versions of several activation functions, including Sigmoid, Softplus, Swish, Mish, GeLU, and ELU.

Morphing from Morph(x, ν) to Triangular
When ν = 0, the piecewise function Morph(x, ν = 0) has a triangular shape since: which corresponds to a triangular activation function: Figure 5 illustrates the plot of Morph(x, ν = 0), where the triangular shape is symmetric with respect to the y-axis.

For ν = −1, applying Equation (37) yields the piecewise function called PolySigmoid: The name PolySigmoid refers to a Sigmoid shape written as a piecewise polynomial. Figure 6 shows Morph(x, ν = −1) (left) and the piecewise approximation with two parabolas, the first for −1 ≤ x ≤ 0 and the second for 0 ≤ x ≤ 1 (right). PolySigmoid does not use exponential functions, but rather a quadratic polynomial, which may be more efficient in terms of computational complexity. A parametric version is obtained from Equation (41) by replacing x with αx: If α → +∞, the shape changes from Sigmoid to Heaviside. If α = 1/4, then PolySigmoid(α, x) approximates the Sigmoid σ(x) well enough, as is shown in Figure 7.
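Expanding this construction at ν = −1 gives a piecewise-quadratic sigmoid. The following sketch is our own derivation under the assumptions above, so the constants should be checked against Equation (41):

```python
import torch

def poly_sigmoid(x, alpha=1.0):
    """Piecewise-quadratic sigmoid obtained from the Morph sketch at nu = -1.
    alpha rescales the input; alpha = 1/4 approximates the classical sigmoid."""
    z = alpha * x
    rising  = torch.clamp(z + 1.0, min=0.0) ** 2 / 2.0        # parabola on -1 < z <= 0
    falling = 1.0 - torch.clamp(1.0 - z, min=0.0) ** 2 / 2.0  # parabola on 0 < z <= 1
    return torch.where(z <= 0, rising, falling)               # 0 for z <= -1, 1 for z > 1
```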

Adapting Morph(x, ν) to PolySwish
Given the good similarity between PolySigmoid(α = 1/4, x) and Sigmoid, and following the definition of Swish, another activation function can be proposed by multiplying PolySigmoid by x. In this way, a polynomial version of Swish is: Of course, Equation (43) can be simplified, but it has been kept as such for clarity. Figure 8 compares PolySwish and Swish. For x > 0, it seems an excellent approximation, though not for x ≤ 0. However, it is relevant to consider that this is a polynomial-based version obtained through an integer order of Morph(x, ν). Later, the performance of this approximation will be explored experimentally.
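Under the same assumptions, a PolySwish sketch simply gates x with the piecewise-quadratic sigmoid defined above:

```python
def poly_swish(x):
    # Swish-style gating with the piecewise-quadratic sigmoid:
    # x * PolySigmoid(alpha = 1/4, x), following the construction described in the text.
    return x * poly_sigmoid(x, alpha=0.25)
```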

• Softplus uses an exponential function (infinite series).
A better approximation between PolySoftplus and Softplus can be reached by introducing an α > 0 parameter to the former, so that: Figure 11 shows the case α = 4, where PolySoftplus(α = 4, x) effectively gives a better and more acceptable approximation to Softplus than Equation (45). Furthermore, it is possible to approximate ReLU by considering Equation (42) with α → +∞, which is a Heaviside approximation. Thus:

Morphing from Morph(x, ν) to Piecewise Polynomial
For ν < −2, the function Morph(x, ν < −2) is piecewise polynomial. It is denoted as Poly(x, ν) = Morph(x, ν < −2), and then: Note that this polynomial has a lower order for 1 < x than for the intervals −1 < x ≤ 0 and 0 < x ≤ 1. For example, if ν = −3: This is a quadratic polynomial for 1 < x. Effectively, in spite of the fourth degree on the other intervals, if 1 < x, then Poly(x, ν = −3) is quadratic, and the same applies for more negative ν-values. Why? It comes from Formulas (38) and (39) where, given the translations, the two positive terms of the highest degree cancel against the negative monomial with the double factor. This concept was illustrated in Figure 3 of Section 2.4, where, essentially, Morph(x, ν) represents a smooth surface for ν < 0.
Following the same theme, Figure 12 shows the case ν = −3, corresponding to Equation (49), where Poly(x, ν = −3) reaches the highest polynomial degree in x ∈ (−1, 1), but for 1 < x, the degree is reduced to quadratic.

The experiments were developed in PyTorch running on an NVIDIA GeForce RTX 3070 GPU. The optimizer was SGD with learning rate lr = 0.01. The number of epochs is 30, and the metric for comparison purposes is the test accuracy.

• The version (1/α)PolySoftplus(α, x) with α = 1/4 is a good approximation of Softplus; therefore, it is not surprising that they produce essentially the same results.

Experiment 2: PolySwish, Swish, and ReLU
In Experiment 2, PolySwish is compared with Swish and ReLU. Figure 14 shows that PolySwish is superior to Swish and ReLU, and Swish is slightly better than ReLU.

Experiment 3: PolySigmoid, Sigmoid, and ReLU
In Experiment 3, the Sigmoid performance was too low, starting with an accuracy of 11.35% and reaching a maximum of 38.29%.For this reason, instead of the original sigmoid σ(x), σ(3x) was considered, and the performance was improved from 89.47% up to a maximum of 93.51%.
Figure 15 shows the accuracies for PolySigmoid, Sigmoid σ(x), Sigmoid σ(3x), and ReLU. It is possible to appreciate that PolySigmoid is superior to σ(x) and σ(3x). However, ReLU is better than PolySigmoid and therefore better than all the activation functions in this experiment.

Experiment 5: ELU Approximation from PolySigmoid
In order to obtain a piecewise polynomial version of the ELU function, the following approximation is made. Given Equation (42), which approximates the Sigmoid via PolySigmoid, and focusing on the interval −4 < x < 0, it follows that: and, solving for e^x: With this approximation, it is possible to write a polynomial version of ELU, named PolyELU, so: Figure 17 has the plots for α = 1 of PolyELU and ELU on the left side, whereas the corresponding accuracies are shown on the right side. Basically, they overlap and have a correlation of 0.999. This confirms that there is a good approximation between these two activation functions.

Experiment 6 gathers some activation functions that achieve low accuracy: Haar, Heaviside, and Sigmoid, and it is shown in Figure 18. Haar's accuracy is low, but higher than that of Heaviside and Sigmoid. Heaviside varies from 49.99% up to 58.22%, whereas Sigmoid is below 40%. It is striking how a Sigmoid function that approaches Heaviside as α tends to infinity can increase its accuracy. It has been experimentally found that the accuracy improves as α increases up to 19.459. In fact, the accuracy of Sigmoid with a high α-value outperformed that of Heaviside.
Having obtained the maximum accuracy for Sigmoid(19.459x), it was compared with Triangular and ReLU. The results are in Figure 19, where Sigmoid(19.459x) outperforms Triangular, followed by ReLU. Sigmoid(3x) was also plotted as a reference. This experiment leads us to consider the importance of having a non-zero gradient, a challenge for Haar and Heaviside that also extends to Sigmoid with high scaling factors.

Experiment 7: Mish and PolyMish
Given the definition of PolySoftplus(x) in Equation (45) and some advantages over Softplus described in Section 3.1, it is natural to propose a first Mish approximation named PolyMish, as follows: Figure 20 shows the plots of Mish vs. PolyMish. Mish(x) follows a smooth curve for x < 0 and asymptotic behavior towards zero for x < −1.2, whereas PolyMish vanishes faster than Mish and is exactly zero for x < −1 (see Section 3.7). In other words, this first version is not a good approximation. Moreover, the bottom side of Figure 20 illustrates how Mish outperforms PolyMish. The approximation to Mish can be improved by considering an α factor, as in Section 3.1, so that: The top side of Figure 21 shows the approximation of Mish using PolyMish(α = 4, x). The accuracy results are shown on the bottom side of Figure 21, where it is noticeable that this second version, PolyMish(α = 4, x), outperforms Mish.

In this experiment, given PReLU(x, ν), the parameter ν is adapted during the learning (see Equation (33)). The initial value for ν is zero. The network architecture involves three PReLU(x, ν) functions, and each ν-parameter is adjusted with SGD as a hyperparameter in PyTorch. The results are shown in Figure 22. From the top side, it is evident that the PReLU version outperforms the network built only with ReLUs. The adaptive ν-values for the three PReLUs are shown at the bottom of Figure 22, where the fractional derivatives have negative values, i.e., the power of x is greater than 1.0.

The importance of having a non-zero gradient for x ≤ 0 is a motivation to use a fractional derivative of |x| based on Equations (33) and (34), as follows: Beyond this, different fractional-order derivatives can be used, ν_0 for x ≤ 0 and ν_1 for x > 0. Thus, for α ∈ R: If ν_0 > 1, then Equation (56) produces a division by zero at x = 0, but to avoid this situation, it is possible to add an ϵ > 0 [18]. In this experiment, ϵ = 1 × 10^{-10}.
All ν-parameters, as well as α, are initialized to zero. For Case 3, α is optimized in three activation functions, AF1, AF2, and AF3. From Equation (30), only seven terms are considered for SPLASH, so it is written as: where b_s = {1, 2, 2.5}, a_0 is initialized to one, and the rest to zero (see [20]). These initial conditions mean initializing with a ReLU shape. SPLASH does not involve fractional derivatives, but it is used in this experiment as a reference for comparison purposes.
The neural network architecture has three activation functions, and for SPLASH, each unit needs seven a-parameters. So, the total number of hyperparameters is 21 (which consumes a considerable amount of GPU memory).
For the fractional derivative |x|^ν of Equation (55), the total number of hyperparameters is 3, and the initialization was ν = 0, which means initializing with a piecewise linear shape.
The accuracy results are shown in Figure 24. SPLASH adjusts all of the 7 × 3 = 21 parameters, but its performance is lower than that of |x|^ν, which only uses three parameters.

Finally, in the last experiment, a minimal neural network is used to illustrate how the fractional derivative ν of Morph() is updated using Adam and a fractional SGD (FSGD) [19] with the fractional gradient set to 1.7, which was determined experimentally for MNIST [18,19]. Note that the fractional gradient uses the fractional derivative of the activation function to update the learning parameters, but this approach is different from (and complementary to) adjusting the shape of the fractional derivative of Morph() as an activation function.
The number of epochs was 100, and the accuracy results are shown in Figure 28. The parameter initialization is ν = 0.75. In Figure 29, the shapes for epochs 1, 50, and 100 are plotted for both Adam and FSGD. This demonstrates that a fractional optimizer such as FSGD can be sufficiently competitive with more sophisticated optimizers such as Adam. In fact, FSGD outperforms Adam in this experiment.
The source code of all the experiments is available at http://ia.azc.uam.mx (accessed on 11 July 2024).

Discussion
One of the branches of artificial intelligence research is the architecture of neural networks, focused on the number of layers, types of modules, and their connections.However, another branch that has gained traction is proposing new activation functions that provide the non-linearity required by neural networks to efficiently map inputs to outputs of complex data.
Fractional calculus emerges as a mathematical tool to generalize to non-integer derivative orders, useful for building activation functions and, in this paper, a novel adaptive activation function Morph was proposed based on fractional derivatives.
The search for the best activation function has led to the proposal of adaptive functions that learn from data.The adjustment of hyperparameters allows us to control stability, overfitting, and to enhance the performance of neural networks, among other challenges in machine learning.
Although we are not working on function approximation with Morph, but rather using it as an activation function, it is important to emphasize the approximation property of Morph and, consequently, to remark on how it provides the nonlinearity needed in a neural network to efficiently map inputs to outputs with good generalization capacity.
The proposed Morph function approximates the Sigmoid function sufficiently to satisfy the conditions of the Universal Approximation Theorem [37].Thus, there is a theoretical justification for Morph to approximate functions in compact subspaces.
Also, Morph is able to mimic shapes such as the Haar wavelet, which in turn are basis functions of vector spaces, and it gives an idea of the approximation capacity of this proposed function [43].In fact, the inspiring idea for Morph relies on the Haar decomposition as a linear combination of translated Heaviside functions.
Indeed, two fundamental operations on functions are translation and dilation [43]. In this way, for a ≠ 0 and b ∈ R, the translated and dilated version of Morph(x, ν) is: A successful application of this was the approximation of Sigmoid via PolySigmoid and Swish via PolySwish in Section 2.7, and GeLU via PolyGeLU in Section 2.9.
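Using the morph sketch from Section 2 (with the same caveats about the assumed translations), the translated and dilated version described here and in Experiment 10, Morph(x, ν, a, b) = Morph((x − b)/a, ν), can be written as:

```python
def morph_ab(x, nu, a=1.0, b=0.0):
    # Translated and dilated Morph (requires a != 0); a and b can also be
    # registered as trainable parameters, as in Experiment 10.
    return morph((x - b) / a, nu)
```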

Figure 1. Interpolation of two functions to illustrate a "fractional" operator.
This section describes several experiments that support the conclusions. They include:
• Experiments 1 to 7. Accuracy comparison between existing activation functions and the polynomial versions obtained from the proposed morphing activation function.
• Experiments 8 to 11. Adaptation of hyperparameters of activation functions (including Morph) during training by using gradient descent algorithms with the MNIST dataset.
• Experiment 12. Adaptation of hyperparameters of Morph by using gradient descent algorithms with the CIFAR10 dataset.

3.1. Experiment 1: PolySoftplus, Softplus, and ReLU
A first experiment compares PolySoftplus, PolySoftplus(α, x), Softplus, and ReLU. The results are shown in the plots of Figure 13. Note that:
• ReLU is superior to all the other activation functions.
• PolySoftplus outperforms Softplus. It is a bit intuitive after reading Section 2.10, since:

Figure 20. Plot for Mish and PolyMish (top) and the test accuracies for the MNIST dataset (bottom).

3.10. Experiment 10: Trainable Activation Functions with Fractional Order: Morph and |x|^ν vs. SPLASH
In this experiment, the adaptive ν-order of the fractional functions is optimized with SGD in PyTorch. Three activation functions are compared: SPLASH, |x|^ν, and Morph.
In the case of Morph, dilation a and translation b are used, so Morph(x, ν, a, b) = Morph((x − b)/a, ν), and the initialization was ν = −2.0, a = 1, and b = 0. This corresponds to the PolySoftplus of Equation (45) illustrated in Figure 10 (a shape similar to ReLU or Softplus; see Section 2.10). In this case, Morph(x, ν, a, b) needs to adapt 3 × 3 = 9 hyperparameters.
Special attention is given to Morph(x, ν, a, b), which achieves the best accuracy compared to |x|^ν and SPLASH.

Figure 25 shows the change of the fractional orders of the three Morph() activation functions used in the network architecture: ν_0, ν_1, and ν_2, initialized to zero. Also, the parameters a and b of these three Morph() functions are optimized, and they are plotted in the same figure.

Figure 25. Adaptation of ν_0, ν_1, ν_2 corresponding to the fractional derivatives of Morph(x, ν, a, b) in a neural network with three adaptive units. Also, a_0, a_1, a_2 and b_0, b_1, b_2 are plotted; the a's are initialized to 1 and the b's to 0.

Figure 26 shows the shape of Morph(x, ν, a, b) when the parameters ν = −2, a = 1, b = 0 are updated to ν = −2.0932, a = 0.3014, b = −0.2269, and then to ν = −2.0959, a = 0.2829, b = −0.2295, at epochs 0, 2, and 30, respectively. These parameters are optimized with the gradient descent algorithm SGD in PyTorch. Only the values of one of the three Morph functions are presented, for reasons of space and because they are similar to those of the other two units.

3.11. Experiment 11: Comparison of Morph() with 20 Other Activation Functions
Experiment 11 compares several activation functions: piecewise polynomial versions obtained as special cases of Morph() and other existing and well-known activation functions. Based on the accuracy results, they are shown in Figure 27, sorted from left to right, worst to best, and enumerated in Table 1. Highlighted in bold are cases where a polynomial version is better than its counterpart; for example, PolySoftplus is better than Softplus. Note that the highest accuracies are achieved by Morph with optimized parameters:
Case 21. Morph(x, ν) initializes ν = −1, a = 1, b = 0, and ν is optimized during training.
Case 22. Morph(x, ν) initializes ν = −2, a = 1, b = 0, and ν is optimized during training.
Case 23. Morph(x, ν, a, b) initializes ν = −2, a = 1, b = 0, and ν, a, b are optimized during training.

Figure 27. Boxplots of the accuracies of the activation functions, sorted from worst to best, left to right. The best case corresponds to Morph() with the fractional derivative optimized during 30 epochs.