Self-Organized Operational Neural Networks with Generative Neurons

Operational Neural Networks (ONNs) have recently been proposed to address the well-known limitations and drawbacks of conventional Convolutional Neural Networks (CNNs), such as network homogeneity relying on the sole linear neuron model. ONNs are heterogeneous networks with a generalized neuron model that can encapsulate any set of non-linear operators to boost diversity and to learn highly complex and multi-modal functions or spaces with minimal network complexity and training data. However, the Greedy Iterative Search (GIS) method used to find the optimal operators in ONNs requires many training sessions to find a single operator set per layer. This is not only computationally demanding, but the network heterogeneity is also limited since the same set of operators is then used for all neurons in each layer. Moreover, the performance of ONNs directly depends on the operator set library used, which introduces a certain risk of performance degradation, especially when the optimal operator set required for a particular task is missing from the library. In order to address these issues and achieve the utmost heterogeneity level to boost network diversity along with computational efficiency, in this study we propose Self-organized ONNs (Self-ONNs) with generative neurons that have the ability to adapt (optimize) the nodal operator of each connection during the training process. Therefore, Self-ONNs can attain the utmost heterogeneity level required by the learning problem at hand. Moreover, this ability removes the need for a fixed operator set library and for the prior operator search within the library to find the best possible set of operators. We further formulate the training method to back-propagate the error through the operational layers of Self-ONNs.


I. INTRODUCTION
Multi-Layer Perceptrons (MLPs) and their derivatives, Convolutional Neural Networks (CNNs), have a common drawback: they employ a homogeneous network structure with an identical "linear" neuron model. This naturally makes them only a crude model of the biological neurons or mammalian neural systems, which are heterogeneous and composed of highly diverse neuron types with distinct biochemical and electrophysiological properties [13]-[18]. With such crude models, conventional homogeneous networks can learn problems with a monotonous, relatively simple and linearly separable solution space sufficiently well, but they fail whenever the solution space is highly nonlinear and complex [8]-[10], [32], [33]. Despite many attempts to address this deficiency by searching for good network architectures [4], [5], by following extremely laborious search strategies [6]-[10], by hybrid network models [19]-[21], or by new parameter update approaches [22], [23], no attempts have been made to address the core problem, i.e., the network homogeneity with only linear neurons inherited from the decades-old McCulloch-Pitts model [11].
To address this drawback, a heterogeneous and dense network model, Generalized Operational Perceptrons (GOPs), has recently been proposed [32]-[36]. GOPs aim to model biological neurons with distinct synaptic connections. GOPs have demonstrated a superior diversity, as encountered in biological neural networks, which resulted in an elegant performance level on numerous challenging problems where conventional MLPs entirely failed [32]-[36] (e.g., the Two-Spirals or N-bit parity problems). Following in the footsteps of GOPs, a heterogeneous and non-linear network model, called the Operational Neural Network (ONN), has recently been proposed [37] as a superset of CNNs. ONNs, like their predecessor GOPs, boost the diversity to learn highly complex and multi-modal functions or spaces with minimal network complexity and training data. More specifically, the diverse set of neurochemical operations in biological neurons (the non-linear synaptic connections plus the integration process occurring in the soma of a biological neuron model) has been modelled by the corresponding "Nodal" (synaptic connection) and "Pool" (integration in soma) operators, whilst the "Activation" operator has directly been adopted. A particular combination of nodal, pool and activation operators forms an "operator set", and all potential operator sets are stored in an operator set library. Using the so-called Greedy Iterative Search (GIS) method, an optimal operator set per layer can iteratively be searched during several short Back-Propagation (BP) training sessions. The final ONN can then be configured by using the best operator sets found, each of which is assigned to all neurons of the corresponding hidden layer.
The results over challenging learning problems demonstrate that 1) with the right operator set, ONNs can perform the required linear or non-linear transformation in each layer/neuron so as to maximize the learning performance, and 2) ONNs not only outperform CNNs significantly, they are even able to learn those problems where CNNs entirely fail. However, the ONNs proposed in [37], too, exhibit certain drawbacks. First and foremost is the limited heterogeneity due to the usage of a single operator set for all neurons in a hidden layer. This enforces the sole usage of a single nodal operator for all kernel connections of each neuron to the neurons in the previous layer. A major limitation is that the learning performance of the ONN directly depends on the operators (particularly the nodal operators) in the operator set library, which is fixed in advance. In other words, if the right operator set for a proper learning is missing, the learning performance will deteriorate. Obviously, it is not feasible to cover all possible nodal operators, since they are infinitely many. Furthermore, many operators cannot even be formulated with standard non-linear functions, yet they can be approximated. Finally, GIS is a computationally demanding local search process that requires many BP runs. The best operator sets found may not be optimal and, especially for deep networks that are trained over large-scale datasets, GIS becomes a serious computational bottleneck.
In order to address these drawbacks and limitations, in this study we propose Self-organized ONNs (Self-ONNs) with generative neurons. Self-ONNs, as the name implies, have the ability to self-organize the network operators during training. Therefore, they neither need any operator set library in advance, nor require any prior search process to find the optimal nodal operator. In fact, the limitation of using a single nodal operator for all kernel connections of each neuron is addressed by the "generative neurons", where each neuron can create any combination of nodal operators, which need not be a well-defined function such as a linear, sinusoidal, hyperbolic, exponential or some other standard function. It is true that the kernel parameters (weights) change the nodal operator output; e.g., for a "Sinusoid" nodal operator of a particular neuron, the kernel parameters are distinct frequencies. This allows the creation of "any" harmonic function; however, the final nodal operator function after training cannot take any pattern or form other than a pure sine wave, even though a "composite operator" (e.g., a linear combination of harmonic, hyperbolic and polynomial terms) or an arbitrary nodal operator function would perhaps be a better choice for this neuron than a pure sinusoid. This is in fact the case for biological neurons, where the synaptic connections can exhibit any arbitrary form or pattern. In brief, a generative neuron is a neuron with a composite nodal operator that can be generated during training without any restrictions. As a result, with such generative neurons, a Self-ONN can self-organize its nodal operators during training and thus its nodal operator functions are "optimized" by the training process to maximize the learning performance.
For instance, in the sample illustration shown in Figure 1, the CNN and ONN neurons have static nodal operators (linear and harmonic, respectively) for their 3x3 kernels, while the generative neuron can have any arbitrary nodal function (possibly including standard types such as linear and harmonic functions) for each kernel element of each connection. This is a great flexibility that permits the formation of any nodal operator function. Finally, the training method that back-propagates the error through the operational layers of Self-ONNs is formulated in order to generate the right nodal functions of its neurons. Over the same set of challenging problems as in [37], with the same severe restrictions, we shall show that Self-ONNs can achieve comparable and usually better performance levels than the parameter-equivalent ONNs with a superior computational efficiency. The performance gap against the equivalent CNNs widens further, even for Self-ONNs with significantly fewer neurons and a short training. The rest of the paper is organized as follows: Section II briefly presents conventional ONNs, while their BP training is summarized in Appendix A. Section III presents Self-ONNs and generative neurons in detail and formulates the forward-propagation (FP) and back-propagation (BP) training. It further discusses major features of Self-ONNs on a toy problem. Section IV presents detailed comparative evaluations among Self-ONNs, ONNs and CNNs over four challenging problems. The computational complexity analysis of these networks for both FP and BP is also presented in this section. Finally, Section V concludes the paper and suggests topics for future research.
II. OPERATIONAL NEURAL NETWORKS

Similar to MLPs, conventional CNNs make use of the classical "linear" neuron model; however, they further apply two restrictions: kernel-wise limited connections and weight sharing. These restrictions turn the linear weighted sum of MLPs into the convolution formula used in CNNs. This is illustrated in Figure 2 (left), where three consecutive convolutional layers without the sub-sampling (pooling) layers are shown. ONNs borrow the essential idea of GOPs and thus extend the sole usage of linear convolutions in the convolutional neurons with the nodal and pool operators. This constitutes the operational layers and neurons, while the two fundamental restrictions, weight sharing and limited (kernel-wise) connectivity, are directly inherited from conventional CNNs. This is also illustrated in Figure 2 (right), where three operational layers and the k-th neuron with 3x3 kernels of an ONN are shown. As illustrated, the input map of the k-th neuron at the current layer, $x_k^l$, is obtained by pooling the final output maps, $y_i^{l-1}$, of the previous-layer neurons operated with the corresponding kernels, $w_{ik}^l$, as follows:

$$x_k^l = b_k^l + \sum_{i=1}^{N_{l-1}} P_k^l\Big[\Psi_k^l\big(w_{ik}^l(r,t),\, y_i^{l-1}(m,n)\big)\Big] \quad (1)$$

A close look at Eq. (1) reveals that when the pool operator is the summation, $P_k^l = \Sigma$, and the nodal operator is linear, $\Psi_k^l\big(w_{ik}^l(r,t), y_i^{l-1}(m,n)\big) = w_{ik}^l(r,t)\, y_i^{l-1}(m,n)$, for all neurons, then the resulting homogeneous ONN is identical to a CNN. Hence, ONNs are indeed a superset of CNNs, just as GOPs are a superset of MLPs.
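To make Eq. (1) concrete, the following minimal NumPy sketch implements one operational neuron's forward pass with pluggable nodal and pool operators. All function and variable names are illustrative, not from [37]; valid-mode sliding windows with stride 1 and a tanh activation operator are assumed.

```python
import numpy as np

def operational_forward(y_prev, kernels, bias, nodal, pool):
    """Sketch of one ONN neuron's forward pass (cf. Eq. (1)).

    y_prev:  list of 2D output maps from the previous layer's neurons
    kernels: list of 2D kernels (one per previous-layer neuron)
    nodal:   elementwise function psi(w, y) applied per kernel element
    pool:    reduction over the nodal outputs inside each receptive field
    """
    H, W = y_prev[0].shape
    Kx, Ky = kernels[0].shape
    out = np.zeros((H - Kx + 1, W - Ky + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            contributions = []
            for y, w in zip(y_prev, kernels):
                patch = y[m:m + Kx, n:n + Ky]
                contributions.append(pool(nodal(w, patch)))
            out[m, n] = sum(contributions)
    return np.tanh(out + bias)  # activation operator applied to x + bias

# With pool = summation and a linear nodal operator, the ONN neuron
# reduces to a plain convolutional neuron (ONNs are a superset of CNNs).
linear = lambda w, y: w * y
harmonic = lambda w, y: np.sin(w * y)
```

Swapping `nodal`/`pool` callables is exactly the degree of freedom that an operator set in the library provides.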
For Back-Propagation (BP) training of an ONN, the following four consecutive stages should be iteratively performed: 1) computation of the delta error, Δ, at the output layer; 2) inter-BP between two consecutive operational layers; 3) intra-BP within an operational neuron; and 4) computation of the weight (operator kernel) and bias sensitivities in order to update them at each BP iteration. Stage 3 also takes care of the sub-sampling (pooling) operations whenever they are applied in the neuron. BP training is briefly formulated in Appendix A, while further details can be found in [37].

Figure 2: Convolutional layers of CNNs (left) and operational layers of ONNs (right).

III. SELF-ORGANIZED OPERATIONAL NEURAL NETWORKS
In this section, we first present the model of generative neurons, which constitute the main difference between conventional ONNs and Self-ONNs. Then we formulate the forward- and back-propagation for Self-ONNs and finally, for the sake of clarity, we discuss their major characteristics and computational efficiency over a toy problem.

A. Generative Neurons
As discussed earlier, a generative neuron is a neuron with a "composite nodal-operator" that is iteratively created during BP training without any restrictions. In this way, each generative neuron in a Self-ONN can have its nodal operators self-optimized by the BP training for each kernel element and for each connection (to each previous-layer neuron) so as to maximize the learning performance. In order to generate a composite nodal operator, a straightforward choice would be the weighted sum of standard functions. For example, a composite nodal function may have the following expression:

$$\Psi(y, \theta) = \alpha_1 \sin(\omega y) + \alpha_2 y^{K} + \beta y \quad (2)$$

where $\theta$ is a Q-dimensional array of parameters composed of the weights (e.g., $\alpha_1$ and $\alpha_2$ in Eq. (2)) and the internal parameters of the individual functions (e.g., $\omega$ (frequency), $K$ (power factor) and $\beta$ (slope) in Eq. (2)). However, such a straightforward formation of composite nodal functions would obviously have severe stability issues due to the different dynamic ranges of the individual non-linear functions composed. Moreover, it requires too many parameters to be tuned, especially when the operator set library contains many individual nodal operator functions. It is, as well, redundant, because one can form any arbitrary function by other conventional means such as Taylor polynomials or Fourier series. The former is the better choice due to its lower computational complexity. The Taylor approximation of a function, $f$, near a point, $a$, can be expressed as

$$f(y) = f(a) + \frac{f'(a)}{1!}(y-a) + \frac{f''(a)}{2!}(y-a)^2 + \frac{f'''(a)}{3!}(y-a)^3 + \cdots \quad (3)$$

where $f'$, $f''$ and $f'''$ are the first, second and third derivatives, respectively. Hence, one can form the composite nodal operator function using the Q-th order truncated Taylor approximation as follows:

$$\Psi(w, y) = w_0 + w_1 (y-a) + w_2 (y-a)^2 + \cdots + w_Q (y-a)^Q \quad (4)$$

where $w_q$ is the q-th parameter of the Q-th order polynomial. The training process optimizes these parameters to form (approximate) the best-fitting nodal operator for each kernel element of each individual inter-neuron connection. An immediate issue arises: this approximation is only valid near the point $y = a$.
The farther the points are from $a$, the coarser the approximation becomes. However, this does not affect Self-ONNs, since the nodal operators operate over the neuron outputs of the previous layer, each of which is bounded by the range of the activation operator function. If, for instance, the activation function is a sigmoid, then the outputs, $y$, lie within the range [0, 1]. In this case, the nodal operator function can be approximated around $a = 0.5$ (the mid-point), and a sufficiently high-degree polynomial can approximate any arbitrary function sufficiently well in the close vicinity of this point, i.e., in the range [0, 1]. In this study, we use the hyperbolic tangent (tanh) activation function, which is bounded in the range [-1, 1]. In this case, naturally, $a = 0$, and the Q-th order Taylor approximation in Eq. (4) simplifies to the Maclaurin series,

$$\Psi(w, y) = w_0 + w_1 y + w_2 y^2 + \cdots + w_Q y^Q \quad (5)$$

Finally, the bias coefficient, $w_0$, can be omitted since the overall DC bias will anyway be compensated by each neuron's bias term, $b_k^l$.
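As a sanity check on this design choice, the sketch below (illustrative names, not the authors' code) implements the Maclaurin nodal operator of Eq. (5) and verifies that, over the tanh output range [-1, 1], a 7th-order polynomial with the analytic Maclaurin coefficients closely tracks a non-trivial target such as sin(3y).

```python
import numpy as np

def maclaurin_nodal(w, y, Q):
    """Composite nodal operator of a generative neuron (cf. Eq. (5)):
    a Q-th order Maclaurin polynomial in y with learnable weights w[0..Q-1].
    The DC term w_0 is omitted; it is absorbed by the neuron's bias."""
    return sum(w[q - 1] * y ** q for q in range(1, Q + 1))

# Example: approximate sin(3y) on [-1, 1] using the analytic Maclaurin
# coefficients (-1)^((q-1)/2) * 3^q / q! for odd q, zero for even q.
y = np.linspace(-1, 1, 201)
w = [3, 0, -27 / 6, 0, 243 / 120, 0, -2187 / 5040]
approx = maclaurin_nodal(w, y, Q=7)
max_err = np.max(np.abs(approx - np.sin(3 * y)))  # small over [-1, 1]
```

In a Self-ONN the weights `w` are, of course, learned by BP rather than set analytically; the point is that the bounded activation range makes such low-order polynomials expressive enough.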

B. Forward Propagation in Self-ONNs
The forward propagation (FP) formula for Self-ONNs differs from the FP for ONNs in Eq. (1) in two respects: 1) each nodal operator applied with a single kernel element, $\Psi\big(w_{ik}^l(r,t), y\big)$, is now approximated by the composite nodal operator, $\tilde{\Psi}\big(w_{ik}^l(r,t), y\big)$, expressed by the Maclaurin series in Eq. (5); 2) the scalar kernel parameter, $w_{ik}^l(r,t)$, of an ONN neuron's kernel is replaced by a Q-dimensional array, and the Maclaurin series expression in Eq. (5) is the only composite nodal operator function for all neurons in the network. Thus, individual nodal operators, e.g., $\Psi$, can now be expressed simply by the composite nodal operator, $\tilde{\Psi}$. So, the composite nodal function for the kernel element $(r,t)$ can be expressed as follows:

$$\tilde{\Psi}\big(w_{ik}^l(r,t), y\big) = \sum_{q=1}^{Q} w_{ik}^l(r,t,q)\, y^q \quad (6)$$

where the DC bias term, $w_{ik}^l(r,t,0)$, is omitted due to the reasoning mentioned earlier. Therefore, a generative neuron of a Self-ONN has a 3D kernel matrix in which the q-th weight of the kernel element $(r,t)$ is represented by $w_{ik}^l(r,t,q)$. As illustrated in Figure 1, for each neuron in a Self-ONN, any nodal function can be generated (approximated) for each kernel element of each kernel connection. This results in an enhanced flexibility and diversity even over an ONN neuron, where a single standard nodal operator function has to be used for all kernels connected to the previous-layer neurons. Finally, the generative neurons of a Self-ONN can still have different pool and activation operators; however, in this study we keep these choices fixed to "summation" for the pool and "tanh" for the activation.
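Under sum-pooling, the forward pass of a single generative neuron amounts to Q plain 2D convolutions over the powers of the previous-layer outputs, one per 2D sub-kernel of order q. The implementation below is an illustrative NumPy sketch (assumed names, valid-mode, stride 1), not the authors' code.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2D valid-mode cross-correlation."""
    H, W = x.shape
    Kx, Ky = k.shape
    out = np.zeros((H - Kx + 1, W - Ky + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(x[m:m + Kx, n:n + Ky] * k)
    return out

def self_onn_neuron_forward(y_prev, W, bias):
    """Sketch of a generative neuron's forward pass with sum-pooling:
    x_k = b_k + sum_i sum_{q=1..Q} conv2d((y_i)^q, W_i[q-1]).

    W: list (one entry per previous-layer neuron) of 3D kernels of shape
       (Q, Kx, Ky), i.e. one 2D sub-kernel per polynomial order q.
    """
    Q = W[0].shape[0]
    acc = None
    for y, w in zip(y_prev, W):
        for q in range(1, Q + 1):
            term = conv2d_valid(y ** q, w[q - 1])  # powers y^q are reusable
            acc = term if acc is None else acc + term
    return np.tanh(acc + bias)

# With Q = 1 the generative neuron degenerates to an ordinary
# convolutional neuron, mirroring the CNN special case of ONNs.
```

The Q = 1 degenerate case provides a convenient correctness check against a standard convolution.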

C. Back-Propagation in Self-ONNs
For Self-ONNs, the contribution of each pixel of the output map, $y_k^l$, to the next-layer input maps, $x_i^{l+1}$, can now be expressed as in Eq. (7). Using the chain rule, the delta error of the output pixel can therefore be expressed as in Eq. (8).
Let $\nabla_\Psi P$ and $\nabla_y \Psi$ denote the derivatives of the pool operator with respect to the nodal-operator output and of the nodal operator with respect to the previous-layer output, respectively. Eq. (9) is similar to the corresponding expression for ONNs, Eq. (26) in the Appendix, except that there is no need to register a 4D matrix for $\nabla_\Psi P$ since it can directly be computed by Eq. (10). Moreover, when the pool operator is the sum, $P = \Sigma$, then $\nabla_\Psi P = 1$, and thus $\nabla_y P = \nabla_y \Psi$, as expressed in Eq. (10). Once $\Delta_k^l$ is computed, one can use the chain rule to express the back-propagated error, as in Eq. (11). When there is down-sampling by factors ssx and ssy, the delta error back-propagated by Eq. (26) should first be up-sampled to compute the delta error of the neuron. Let the zero-order up-sampled map be $\mathrm{up}(\Delta)$. Then Eq. (11) can be modified accordingly, as in Eqs. (12)-(14). Note that the q-th element of the array, $w_{ik}^l(r,t)$, contributes to all the pixels of $x_k^l$, as expressed in Eq. (6). By using the chain rule of partial derivatives, one can express the weight sensitivities as in Eq. (15). A close look at Eq. (6) reveals that $\partial \tilde{\Psi} / \partial w_{ik}^l(r,t,q) = y^q$, which then simplifies Eq. (15) to Eq. (16). Note that in this equation the first term, $\Delta_k^l$, is independent of the kernel indices, r and t. It is element-wise multiplied by the two latter terms, each with the same dimension, $(M-K_x+1) \times (N-K_y+1)$, created by the derivative functions of the nodal and pool operators applied over the shifted pixels of $y_i^{l-1}$ and the corresponding weight values.
If $P = \Sigma$, then Eq. (16) is similar to Eq. (33), the corresponding expression for conventional ONNs, except that there is no need to register a 4D matrix for $\nabla_w \Psi$ since it can directly be computed from the outputs of the neurons. Moreover, when the pool operator is the sum, then $\nabla_\Psi P = 1$ and Eq. (16) simplifies to Eq. (17), where $\langle w \rangle_q$ is the q-th 2D sensitivity kernel, which contains the updates (SGD sensitivities) for the weights of the q-th order outputs of the Maclaurin polynomial. Finally, the bias sensitivity expressed in Eq. (18) is the same for ONNs and CNNs, since the bias is the common additive term for all. Let $\langle w \rangle_q$ be the q-th 2D sub-kernel, where q = 1..Q, composed of the kernel elements $w(r,t,q)$. During each BP iteration, t, the kernel parameters (weights), $\langle w \rangle_q$, and biases, b, of each neuron in the Self-ONN are updated until a stopping criterion is met. Let $\varepsilon(t)$ be the learning factor at iteration t.

Algorithm 1: Back-Propagation algorithm for Self-ONNs
Input: Self-ONN, train dataset
Output: Self-ONN* = BP(Self-ONN)
1) Initialize all network parameters randomly (e.g., ~U(-a, a)).
2) UNTIL a stopping criterion is reached, ITERATE:
   a. For each mini-batch in the train dataset, DO:
      i. FP: Forward-propagate from the input layer to the output layer to find the q-th order outputs and the required derivatives and sensitivities for BP, such as $\nabla_y \Psi$, $\nabla_y P$, $\nabla_\Psi P$ and $\nabla_w \Psi$, of each neuron, k, at each layer, l.
      ii. BP: Using Eq. (23), compute the delta error at the output layer and then, using Eqs. (10) and (12), back-propagate the error to the first hidden layer to compute the delta errors, $\Delta_k^l$, of each neuron, k, at each layer, l.
      iii. PP: Find the weight and bias sensitivities using Eqs. (17) and (18), respectively.
      iv. Update: Update the weights and biases with the (accumulation of) sensitivities found in the previous step, scaled by the learning factor, $\varepsilon$, as in Eq. (19).
3) Return Self-ONN*
One can express the update for the weight kernel and bias of each neuron, i, at layer l as follows:

$$w_i^l(t+1) = w_i^l(t) - \varepsilon(t)\,\frac{\partial E}{\partial w_i^l}, \qquad b_i^l(t+1) = b_i^l(t) - \varepsilon(t)\,\frac{\partial E}{\partial b_i^l} \quad (19)$$

The pseudo-code for BP is presented in Algorithm 1.
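For the sum-pool case, the weight sensitivities of Eq. (17) reduce to correlating the neuron's delta error with the shifted q-th powers of the previous-layer output, since the derivative of $w_q y^q$ with respect to $w_q$ is simply $y^q$; Eq. (19) is then a plain SGD step. The sketch below is a minimal illustration under these assumptions (hypothetical names, single connection, no pooling or striding):

```python
import numpy as np

def weight_sensitivities(delta, y_prev, Q, Kx, Ky):
    """Sketch of the sum-pool weight sensitivities (cf. Eq. (17)):
    the q-th 2D sensitivity kernel is the valid cross-correlation of the
    neuron's delta error with the q-th power of the previous-layer output."""
    sens = np.zeros((Q, Kx, Ky))
    for q in range(1, Q + 1):
        yq = y_prev ** q
        for r in range(Kx):
            for t in range(Ky):
                # shifted pixels of y^q aligned with every delta position
                sens[q - 1, r, t] = np.sum(
                    delta * yq[r:r + delta.shape[0], t:t + delta.shape[1]])
    return sens

def sgd_update(w, sens, eps):
    """Plain SGD step (cf. Eq. (19)): w <- w - eps * dE/dw."""
    return w - eps * sens
```

Note that the powers `y_prev ** q` are exactly the maps already produced during FP, which is why Self-ONN BP can reuse them instead of storing 4D derivative matrices.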

E. Discussions
Recall that the main difference between ONNs and Self-ONNs is the presence of generative neurons with the composite nodal operator, which is a Q-th order Maclaurin polynomial. As a result, each kernel element is a Q-dimensional array and therefore the weight kernels, $w_{ik}^l$, are 3D matrices, equivalent to an array of Q 2D matrices, $\langle w \rangle_q$, q = 1,..,Q.
Naturally, the weight sensitivities are 3D matrices too.
In order to speed up both FP and BP, the q-th powers of the neuron outputs can be computed only once (during FP) and stored in individual 3D matrices to be used repeatedly during BP. This is a memory overhead of Self-ONNs compared to ONNs. On the other hand, Self-ONNs do not need the 4D matrices, $\nabla_y \Psi$ and $\nabla_w \Psi$, both of which can be computed directly. For visualization, a Self-ONN with a single hidden layer and a single neuron is trained by BP over the toy problem shown in Figure 3. The input and output are both 3x3 images, and the sample Self-ONN has a single input, hidden and output neuron with 2x2 kernels. The toy problem is to learn (regress) to rotate the input image by 180°. The final 13th-order nodal operators generated during BP are shown in Figure 3, plotted for each kernel element. It is interesting to see optimized nodal operators resembling a sinusoid, an exponential and a logarithm with certain variations. This simple Self-ONN can achieve ~30 times lower MSE (or ~14 dB higher SNR) than the equivalent CNN trained under the same BP hyperparameters.
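The power-caching speed-up described above can be sketched as follows; the incremental multiplication is an implementation detail we assume, not one prescribed by the text.

```python
import numpy as np

def cache_powers(y, Q):
    """Compute the q-th powers of a neuron's output map once during FP
    (incrementally, Q-1 elementwise multiplications in total) and stack
    them so that both the remaining FP convolutions and the BP weight
    sensitivities can reuse them. The returned (Q, H, W) stack is the
    memory overhead of Self-ONNs relative to ONNs."""
    powers = np.empty((Q,) + y.shape)
    powers[0] = y
    for q in range(1, Q):
        powers[q] = powers[q - 1] * y  # y^(q+1)
    return powers
```

Trading this modest per-neuron storage for the 4D derivative matrices that ONNs must register is the net memory argument made above.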

IV. EXPERIMENTAL RESULTS
The comparative evaluations are performed with the same experimental setup and over the same challenging problems as in [37]: 1) image synthesis, 2) denoising, 3) face segmentation, and 4) image transformation, with the same training constraints: i) low resolution: 60x60 pixels; ii) compact/shallow models: Inx16x32xOut (for the CNN and ONN) and Inx6x10xOut for Self-ONNs; iii) scarce train data: 10% of the dataset; iv) multiple regressions per network; v) shallow training: 240 iterations. For a fair evaluation, we have used a Self-ONN configuration, Inx6x10xOut, with Q = 7 in all layers. In this way, all networks have approximately the same number of network parameters. Note that this equivalence results in Self-ONNs having three times fewer hidden neurons than the CNNs and ONNs, i.e., 16 vs. 48. Moreover, as in [37], the first hidden layer applies sub-sampling by 2 and the second one applies up-sampling by 2. Self-ONNs are trained using Stochastic Gradient Descent (SGD) without momentum and with a fixed learning parameter, whereas an adaptive learning rate was applied for the CNNs and ONNs in [37]. Finally, three BP runs have also been performed for Self-ONNs, and the Self-ONN model that achieved the minimum loss (MSE) during these runs is used for each problem.
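The claimed parameter equivalence can be checked with simple arithmetic, assuming 3x3 kernels throughout (as used in [37]) and one bias per hidden/output neuron; the helper below is illustrative, not from the paper.

```python
def conv_params(widths, k=3 * 3, Q=1):
    """Trainable parameters of a chain of (operational) conv layers:
    each inter-layer connection carries a k-element kernel per polynomial
    order (Q of them for a Self-ONN), plus one bias per non-input neuron."""
    weights = sum(a * b for a, b in zip(widths[:-1], widths[1:])) * k * Q
    biases = sum(widths[1:])
    return weights + biases

cnn = conv_params([1, 16, 32, 1])            # In x 16 x 32 x Out, Q = 1
self_onn = conv_params([1, 6, 10, 1], Q=7)   # In x 6 x 10 x Out,  Q = 7
# cnn = 5089, self_onn = 4805 -- roughly parameter-equivalent, as claimed.
```

Under these assumptions the two configurations differ by under 6% in trainable parameters, which supports the "approximately the same number of network parameters" statement.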

A. Learning Performance Evaluations
For each problem, the results obtained by Self-ONNs are compared against the best results obtained by the CNN and ONN.
In order to evaluate the learning performance for the regression problems (image denoising, synthesis and transformation), we used the Signal-to-Noise Ratio (SNR) metric, defined as the ratio of the signal power to the noise power, i.e., SNR = 10 log₁₀(σ²_signal / σ²_noise). The ground-truth image is the original signal, and its difference from the actual output yields the "noise" image. For the (face) segmentation problem, we used conventional evaluation metrics such as classification error (CE) and F1-score. For image synthesis and denoising, the benchmark datasets are partitioned into train (10%) and test (90%) sets for 10-fold cross validation. So, for each fold, all network types are trained 10 times by BP over the train partition and tested over the rest. The following sub-sections present the results and comparative evaluations of the proposed Self-ONNs, ONNs and CNNs on each problem.
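The SNR metric as defined above can be computed directly; below is a minimal sketch with assumed names.

```python
import numpy as np

def snr_db(target, output):
    """SNR used for the regression tasks: 10*log10(signal power / noise power),
    where the ground-truth image is the signal and (target - output) is the
    noise image."""
    noise = target - output
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))
```

For example, an output that is uniformly 90% of the target has a noise power 1% of the signal power, i.e., an SNR of 20 dB.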

1) Image Denoising
As in [37], 1500 gray-scale images from the Pascal VOC database are down-sampled and used as the target outputs, while the images corrupted by Gaussian White Noise (GWN) are the inputs, with SNR = 0 dB. Compared to earlier denoising works using deep CNNs [37]-[41], this task is far more challenging due to the severity of the applied noise level (0 dB), while in all other studies the "noisy" images have SNR levels higher than 15 dB. Moreover, the aforementioned restrictions enforce severe training constraints, making the problem even more challenging for any machine learning approach. Figure 4 shows the SNR plots of the best denoising results of the three networks over the 10 folds and both partitions. On both the train and test partitions, Self-ONNs achieve significantly higher performance than CNNs and ONNs, despite having three times fewer neurons. The average SNR levels of the CNN, ONN and Self-ONN denoising for the (train) and (test) partitions are (5.67 dB, 5.68 dB and 7.05 dB) and (5.61 dB, 5.46 dB and 6.15 dB), respectively. Therefore, Self-ONNs achieve more than a 0.5 dB higher average SNR level on the test partition. Figure 5 presents the SNR vs. iteration plots of all networks for the 1st fold. The convergence speed of the Self-ONN is easily distinguished here: on both the train and test partitions, within only 11 iterations it reaches the maximum SNR levels of both the CNN and ONN. This demonstrates the crucial role of the optimized nodal operators of its generative neurons. In other words, those "custom-made" nodal operators can quickly be "tuned" within a few BP iterations to achieve a superior generalization ability. In this problem, the Self-ONN already achieved an SNR above 6 dB on the test set in fewer than 50 iterations.

Figure 5: The SNR vs. iteration plots for the CNN (blue), the ONN (red) and the Self-ONN (black) trained in the 1st fold. The red circle shows the maximum SNR level achieved by the competing networks.

Figure 6: Some random original (target) and noisy (input) images and the corresponding outputs of the CNN, ONN and Self-ONN from the test partition.
For a visual evaluation, Figure 6 shows randomly selected original (target) and noisy (input) images and the corresponding outputs of CNNs, ONNs and Self-ONNs from the test partition. The superior denoising performance of Self-ONNs is clear when compared with both traditional networks.

2) Image Synthesis
Image synthesis is a typical regression problem in which a single network learns to synthesize a set of images from individual White Gaussian Noise (WGN) images. As in [37], we have trained a Self-ONN to (learn to) synthesize 8 (target) images from 8 WGN (input) images, as illustrated in Figure 7. We repeat the experiment 10 times (folds), so 8x10 = 80 images are randomly selected from the Pascal VOC dataset. The gray-scaled and down-sampled original images are the target outputs, while the WGN images are the inputs.

Figure 7: Input (WGN) and target images of the image synthesis problem along with the Self-ONN (1x6x10x1) outputs.

Figure 8 shows the SNR plots of the CNNs, ONNs and Self-ONNs among the 10 BP runs for each synthesis experiment (fold). In this problem, Self-ONNs surpassed ONNs on only two folds out of ten. The average SNR levels of the CNN, ONN and Self-ONN syntheses are 5.02 dB, 9.91 dB and 8.73 dB, respectively. The superiority of ONNs over Self-ONNs has two reasons: 1) the conventional nodal operators (exponential and chirp for the 1st and 2nd hidden layers and convolution for the output layer) are near-optimal choices, so their Maclaurin approximations in Self-ONNs do not in general improve, but rather deteriorate, the learning performance; 2) the conventional ONNs have the advantage of three times more learning units (neurons) than the Self-ONNs. Under the equivalent configuration, 1x16x32x1, Self-ONNs still surpass ONNs, achieving an average SNR level of 10.27 dB. Against CNNs, Self-ONNs demonstrate a superior performance with a significant average SNR gap of over 3.5 dB. Finally, for a visual comparative evaluation, Figure 9 shows a random set of 14 synthesis outputs of all networks along with the target images. The performance gap is also clear here; in particular, some of the CNN synthesis outputs suffer from severe blurring and/or textural artefacts.

3) Face Segmentation
Deep CNNs have often been used in face and object segmentation tasks [43]-[52]. As in [37], we used the benchmark FDDB face detection dataset [53], which contains 1000 images, each with one or more human faces. Figure 10 shows the F1 plots of the best CNNs, ONNs and Self-ONNs at each fold over both partitions. ONN-3 is the ONN model that achieved the highest test F1 scores in [37]. The average F1 scores of the CNN, ONN-3 and Self-ONN segmentation for the (train) and (test) partitions are (58.58%, 79.86% and 96.6%) and (56.74%, 59.61% and 62%), respectively. The first and foremost interesting observation is that Self-ONNs achieve a significantly higher F1 level on the train set, despite the fact that both the ONNs and CNNs have three times more learning units. In fact, such a train performance hints at a certain amount of "over-fitting", which will be discussed next. Self-ONNs achieve the highest average F1 score on the test set too; however, the performance gap diminishes. Figure 11 presents the loss (MSE) vs. iteration curves of all networks for the 1st fold. As in the denoising problem, the Self-ONN shows a staggering convergence speed: on both the train and test sets, the Self-ONN reaches the minimum loss levels of both the CNN and ONN within only 10 iterations, whereas both competing networks achieve their minimum loss almost at the end of training. The Self-ONN then reaches its minimum loss (MSE = 0.324) at iteration 21, and thereafter the loss gradually increases on the test set, which indicates over-fitting. This is not surprising considering the scarcity of the train data and the smaller number of learning units in Self-ONNs. In practice, such over-fitting can be avoided with a standard "early-stopping" technique over a validation set, which in turn allows a very brief BP training (e.g., < 50 iterations) to achieve an elegant learning performance on the test set.

4) Image Transformation
In this task, a set of images is transformed into another set by a single network. In all earlier image-transformation applications of deep CNNs [54], [55], the input and output images are strongly correlated, e.g., edge-to-image, gray-scale-to-color, and day-to-night (or vice versa) photo translation. In [37], this problem was made more challenging: each image is transformed into an entirely different image. Moreover, a single network is trained to (learn to) transform 4 input images into 4 (target) images, as illustrated in Figure 12 (left). Note that two pairs of distinct images are used as both the input and output of each other; therefore, the capability of the networks to learn both the "forward" and "backward" problems at the same time, and for two image pairs, is tested.

Figure 12: Input and target images of the image transformation problem along with the corresponding ONN and CNN outputs.

Figure 13 presents the best SNR levels for each image-transformation fold for all networks. The average SNR levels achieved by the CNNs, ONNs and Self-ONNs are 0.5 dB, 9.5 dB and 10 dB, respectively. As this is the hardest learning problem in this study, it is not surprising to observe the largest performance gap between the CNN and both ONN models (higher than 9 dB on average). The performances of the ONNs and Self-ONNs lie within a narrow margin, while the ONNs surpassed the Self-ONNs on 3 out of 10 transformations. A close look at the plots in Figure 13 reveals that in fold 9 the ONN significantly surpassed the Self-ONN (as in fold 10 of the image synthesis problem). This happens when the nodal operator of each neuron already fits very well, so that its Q-th order approximation cannot reach the same performance. Moreover, the Maclaurin approximation also costs Q times more parameters, and thus the Self-ONN ends up with significantly fewer neurons, which can potentially deteriorate the learning performance. However, more often Self-ONNs surpass ONNs, namely when their nodal operators are not properly assigned or, more likely, when no "best-fitting" nodal operator is available in the operator set library for the problem at hand. Obviously, in this case, the "custom-made" nodal operators of the generative neurons can boost the learning performance, as is visible in the majority of the experiments performed in this study.

B. Computational Complexity Analysis
In this section, the computational complexity of the proposed Self-ONNs is analyzed with respect to the parameter-equivalent CNNs and ONNs. We begin with the complexity analysis of the forward propagation (FP) and then focus on BP. For the sake of simplicity, we ignore the up- and down-sampling and assume the same input map sizes among the layers. As assumed in this study, when the pool operator is "sum", $P = \Sigma$, the FP of Eq. (1) in a Self-ONN can be expressed as follows:

$$x_k^l = b_k^l + \sum_{i=1}^{N_{l-1}} \sum_{(r,t)} \Psi\!\left(w_{ik}^{l}(r,t),\; y_i^{l-1}(m+r,\, n+t)\right), \quad \text{with} \quad \Psi(w, y) = \sum_{q=1}^{Q} w(q)\, y^{q} \tag{20}$$

where $\Psi$ is the (Maclaurin) composite nodal operator and $w_{ik}^{l}(r,t)$ is the Q-dimensional weight array for the kernel element $(r,t)$.
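The Maclaurin-form composite nodal operator of a generative neuron can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function and variable names are our own:

```python
import numpy as np

def nodal_operator(w_q, y):
    """Composite nodal operator Psi(w, y) = sum_{q=1..Q} w[q] * y**q.

    w_q : sequence of length Q holding the Q learnable weights of one
          kernel element; y is the input value (scalar or array) taken
          from the previous-layer neuron's output map.
    """
    return sum(w_q[q - 1] * y ** q for q in range(1, len(w_q) + 1))

# With Q = 1 the operator reduces to the linear (convolutional) neuron:
w = [0.5]                      # a single first-order weight
print(nodal_operator(w, 2.0))  # 0.5 * 2.0 = 1.0
```

Because the Q weights are learned per kernel element, each connection effectively synthesizes its own nodal function during BP, which is the heterogeneity mechanism the section analyzes.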
Defining the qth-order 2D kernel, $w_{ik}^{l}\langle q\rangle$ (q = 1..Q), which is composed of the qth weights of the kernel elements, $w_{ik}^{l}(r,t)(q)$, Eq. (20) can be simplified as,

$$x_k^l = b_k^l + \sum_{i=1}^{N_{l-1}} \sum_{q=1}^{Q} \operatorname{conv2D}\!\left(w_{ik}^{l}\langle q\rangle,\; \bigl(y_i^{l-1}\bigr)^{q}\right) \tag{21}$$

This special-case Self-ONN configuration is illustrated in Figure 14, where it actually resembles a multi-output and multi-kernel CNN. Once the power outputs, $(y_i^{l-1})^q$ for q = 1..Q, are computed for all hidden neurons in the network, Eq. (21) is simply $Q$ independent 2D convolutions, which can be parallelized and hence will take the same time as a single convolution. Therefore, in a parallelized implementation, a Self-ONN and a CNN with the same configuration have approximately the same computational complexity for FP. In this study, we compute the total number of multiply-accumulate operations (MACs) for the CNNs and Self-ONNs used. The number of MACs for layer $l$ of the network is calculated using the following formula:

$$\mathrm{MAC}^{l} = \left|x^{l}\right|\left(N_{l-1}\, K_x\, K_y\, Q\, N_l + N_l\right)$$

where $|x^l|$ is the output (map) size of the current layer, $N_{l-1}$ is the number of neurons in the previous layer, $K_x$ and $K_y$ are the kernel dimensions for the current layer, $Q$ is the order of approximation and, finally, $N_l$ is the number of neurons in the current layer. The last term can be omitted for the special case where bias is not used. Table 1 provides the comparison of the number of trainable parameters and the total number of MACs of the networks used in this study. The number of neurons in the input and output layers is fixed to 1 for both networks. For the computational complexity comparison in BP training, recall that when the pool operator is "sum", $\nabla_y P = 1$. Recall further that the power outputs, $(y_i^{l-1})^q$, are already computed for each hidden neuron of the network during the prior FP of each BP iteration. So, once the 4D matrix, $\nabla_y \Psi$, is computed by Eq. (11), the error back-propagation computation in Eq. (10) can be parallelized and will take the same time as in an equivalent CNN. The computation of the delta errors, $\Delta_k^l$, is also common with the conventional BP in CNNs.
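The forward pass of Eq. (21) and the MAC count can be sketched concretely. The following is a minimal NumPy illustration for a single neuron and a single input map (the correlation helper, the function names, and the `bias` flag are our own assumptions, not the paper's code):

```python
import numpy as np

def conv2d_valid(kernel, x):
    """Plain 'valid'-mode 2-D cross-correlation (no kernel flip)."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(kernel * x[m:m + kh, n:n + kw])
    return out

def self_onn_neuron(kernels, y_prev, bias=0.0):
    """Eq. (21) for one neuron and one input map: the output is the sum
    over q of conv2D(w<q>, y_prev**q).  `kernels` has shape (Q, Kx, Ky),
    one 2-D kernel per Maclaurin order q; the Q convolutions are
    independent and hence parallelizable."""
    Q = kernels.shape[0]
    return bias + sum(conv2d_valid(kernels[q - 1], y_prev ** q)
                      for q in range(1, Q + 1))

def macs_per_layer(out_size, n_prev, kx, ky, Q, n_cur, bias=True):
    """MAC count of one Self-ONN layer per the formula above; the bias
    term is dropped when bias is not used."""
    return out_size * (n_prev * kx * ky * Q * n_cur
                       + (n_cur if bias else 0))
```

With `Q = 1` (and `bias=False`) both functions reduce exactly to their CNN counterparts, which is the parameter and MAC equivalence the comparison in Table 1 relies on.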
This makes the computational complexity for the bias sensitivities in Eq. (19) identical. Finally, for the weight sensitivities, note that Eq. (18) is simply $Q$ independent convolutions of the delta error, $\Delta_k^l$, and the power outputs, $(y_i^{l-1})^q$, all of which can also be parallelized to take the same time as a single convolution. As a result, there is no significant difference between the BP computational complexities of CNNs and Self-ONNs with the same configuration. The time for computing the power outputs (only once in each BP iteration) and the 4D matrices are the only overheads, and they are insignificant. In this study, the earlier analogy is also valid for BP: since the network configuration for the Self-ONN has three times fewer neurons than the ONNs and CNNs, BP for Self-ONNs will take significantly less time than BP for CNNs. The gap widens further when compared to ONNs, by about 1.5 to 4.7 times in practice [37].
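The weight-sensitivity computation described above, Q independent correlations of the delta error with the power outputs, can be sketched as follows. This is our own minimal NumPy illustration of that structure (the helper is restated so the snippet is self-contained; all names are assumptions):

```python
import numpy as np

def conv2d_valid(kernel, x):
    """Plain 'valid'-mode 2-D cross-correlation (no kernel flip)."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(kernel * x[m:m + kh, n:n + kw])
    return out

def weight_sensitivities(delta, y_prev, Q):
    """Gradients w.r.t. the Q kernels of one connection: for each order
    q, the delta-error map is correlated with the qth power of the
    previous layer's output.  The Q correlations are independent of one
    another, hence trivially parallelizable."""
    return [conv2d_valid(delta, y_prev ** q) for q in range(1, Q + 1)]
```

Since the power outputs are cached from the forward pass, each additional order q costs only one extra correlation, which is the basis of the "no significant difference" claim above.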

V. CONCLUSIONS
In this study, Self-Organized ONNs (Self-ONNs) are proposed with the generative neuron model, which allows customized (self-optimized) nodal operator functions, not only for each neuron but for each kernel connection to the previous-layer neurons. This is an ultimate heterogeneity level that allows the creation of (self-)optimized nodal operators during BP training. It not only voids the need for prior operator search runs, but also optimizes the nodal operators of the output layer neuron(s), which are the most crucial neurons in the network since the loss (fitness) is computed on their outputs. Like ONNs, Self-ONNs are also a superset of CNNs: when the order of a Self-ONN is set to Q=1 for each layer, the Self-ONN becomes a CNN. Even when Q > 1, if a linear (convolutional) neuron is the optimal choice for a particular problem, the ongoing BP can still converge all higher-order (q > 1) weights to zero, turning the Self-ONN into a conventional CNN. Overall, generative neurons have the ability to form customized nodal operators per kernel connection for the problem at hand. In this way, the traditional "weight optimization" of conventional CNNs is turned into an "operator optimization" process.
The results on the four challenging problems proposed in [37] show that Self-ONNs with the same number of parameters (but with far fewer neurons) can achieve a superior learning performance, whilst the performance gap over CNNs widens further. Self-ONNs usually obtain comparable or better results than ONNs; however, some results have highlighted a crucial fact: when a conventional nodal operator of an ONN is the "right choice" for a particular problem, the parameter-equivalent Self-ONN cannot surpass its performance with the Qth-order Maclaurin approximation of the "near-optimal" nodal operators and, of course, with fewer neurons. However, this appears to be the minority case over the problems tackled in this study. Above all, Self-ONNs, in a parallelized implementation, have a superior computational efficiency, especially compared to ONNs.
This study has proposed a "baseline" version of Self-ONNs, and a further performance boost can be expected with the following improvements:
- optimizing the order of the Maclaurin approximation, Q, per layer and even per neuron, instead of fixing it to some practical value (e.g., Q=7 in this study),
- adapting a better optimization scheme for training, e.g., SGD with momentum [56], AdaGrad [57], RMSProp [58], Adam [59] and its variants [60], all of which should be modified for Self-ONNs for proper functioning,
- optimizing also the pool and activation operators during BP training,
- and finally, performing non-localized kernel operations for each kernel connection of each neuron, enabling operation capability within a larger area without increasing the size of the kernels.
These will be the topics for our future research.