Backpropagation Neural Tree

We propose a novel algorithm called Backpropagation Neural Tree (BNeuralT), which is a stochastic computational dendritic tree. BNeuralT takes random repeated inputs through its leaves and imposes dendritic nonlinearities through its internal connections, like a biological dendritic tree would do. Considering these plausible dendritic-tree-like biological properties, BNeuralT is a single-neuron neural tree model with its internal sub-trees resembling dendritic nonlinearities. The BNeuralT algorithm produces an ad hoc neural tree that is trained using a stochastic gradient descent optimizer such as gradient descent (GD), momentum GD, Nesterov accelerated GD, Adagrad, RMSprop, or Adam. BNeuralT training has two phases, each computed in a depth-first search manner: the forward pass computes the neural tree's output in a post-order traversal, while the error backpropagation during the backward pass is performed recursively in a pre-order traversal. A BNeuralT model can be considered a minimal subset of a neural network (NN), i.e., a "thinned" NN whose complexity is lower than that of an ordinary NN. Our algorithm produces high-performing and parsimonious models, balancing complexity with descriptive ability, on a wide variety of machine learning problems: classification, regression, and pattern recognition.


Introduction
Data-driven learning is a search for a hypothesis (trained model) from a hypothesis space that fits input data to its target output as well as possible (i.e., with a low error on test data). A learning algorithm such as neural network (NN) parameter optimization via backpropagation is an effort to find such a hypothesis (Rumelhart et al., 1986). We propose a new study of ad hoc neural tree generation and optimization via our recursive backpropagation algorithm to find such a hypothesis.

Hence, we propose a new algorithm called Backpropagation Neural Tree (BNeuralT).
A tree of BNeuralT is like a biological dendritic tree (Travis et al., 2005; Mel, 2016) that processes repeated inputs connected to a single neuron (Beniaguev et al., 2020; Jones and Kording, 2021) through dendritic nonlinearities (London and Häusser, 2005). Structurally, a BNeuralT model is a stochastic computational dendritic tree that takes random repeated inputs through its leaves and imposes dendritic nonlinearities through its internal nodes, like a biological dendritic tree would do (Travis et al., 2005; Jones and Kording, 2021). Hence, considering these plausible dendritic-tree-like biological properties, BNeuralT is a single-neuron neural tree model with its internal nodes resembling dendritic nonlinearities.
Structurally, BNeuralT, being a tree, is a minimal subset of a (highly sparse) NN whose complexity is comparatively low (Poirazi et al., 2003b). This means that a NN with a very high dropout [a network regularization technique (Srivastava et al., 2014)] applied prior to its training can be similar to BNeuralT, except that BNeuralT has dedicated paths from input to output, as opposed to a sparse NN, which has shared connections between nodes. Hence, we aim to gauge the performance of ad hoc neural trees trained using stochastic gradient descent (SGD) optimizers like gradient descent (GD), momentum gradient descent (MGD) (Qian, 1999), Nesterov accelerated gradient descent (NAG) (Bengio et al., 2013), adaptive gradient (Adagrad) (Dean et al., 2012), root-mean-square gradient propagation (RMSprop) (Tieleman and Hinton, 2012), and adaptive moment estimation (Adam) (Kingma and Ba, 2015).
Operationally, an expression tree whose operators (nodes) are neural nodes (i.e., each operator is an activation function), whose edges are neural weights, and whose leaves are inputs makes a neural tree architecture, where the tree's architecture itself can be optimized (Chen et al., 2005; Schmidt and Lipson, 2009). Optimizing the tree's edges (parameters) is straightforward using a gradient-free method (Rios and Sahinidis, 2013; Kennedy and Eberhart, 1995), where the tree is treated as a target function (Ojha et al., 2017). However, its gradient-based optimization is non-trivial, especially because error backpropagation must traverse the tree data structure recursively. Our proposed BNeuralT algorithm performs a two-phase computation of a neural tree in a depth-first search manner: the forward pass computes the neural tree's output in a post-order traversal, while the error backpropagation during the backward pass is performed recursively in a pre-order traversal.
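The two traversals can be sketched for a sigmoid tree as follows. This is a minimal, illustrative Python sketch with our own class and function names; it shows standard backpropagation on a tree data structure, not the exact BNeuralT implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class Node:
    def __init__(self, children=None, weights=None, bias=0.0, leaf_input=None):
        self.children = children or []   # empty for a leaf node
        self.weights = weights or []     # edge weights to the child nodes
        self.bias = bias
        self.leaf_input = leaf_input     # index into the input vector x (leaf only)
        self.h = 0.0                     # activation, set by the forward pass
        self.delta = 0.0                 # gradient, set by the backward pass

def forward(node, x):
    # Post-order traversal: child activations are computed before the node's own.
    if not node.children:                # a leaf propagates its designated input
        node.h = x[node.leaf_input]
        return node.h
    z = sum(w * forward(c, x) for w, c in zip(node.weights, node.children)) + node.bias
    node.h = sigmoid(z)
    return node.h

def backward(node, delta, grads):
    # Pre-order traversal: the node's delta is set before recursing into children.
    # At the root, `delta` is dL/dy_hat (e.g., y_hat - y for a squared-error loss).
    node.delta = delta * node.h * (1.0 - node.h)        # sigmoid derivative
    grads[id(node)] = [node.delta * c.h for c in node.children]  # per-edge gradients
    # (the bias gradient of this node equals node.delta)
    for w, c in zip(node.weights, node.children):
        if c.children:                   # leaves hold no trainable parameters here
            backward(c, node.delta * w, grads)
```

A usage example: build a root with two leaf children, call `forward(root, x)` to get the prediction, then `backward(root, loss_gradient, {})` to collect the per-edge gradients.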
We trained ad hoc neural trees in an online (example-by-example) and a mini-batch mode on a variety of learning problems: classification, regression, and pattern recognition. For classification and pattern recognition problems, BNeuralT has its root node's children (nodes at tree depth one) strictly dedicated to each target class, and the root node decides the winner class on receiving an input. For a regression problem, BNeuralT dedicates its root as the output node.
We evaluated BNeuralT's convergence process with six SGD optimizers and analyzed BNeuralT's complexity against its convergence accuracy. Each training version was compared with a similar training version of a multi-layer perceptron (MLP) algorithm (i.e., an input-hidden-output NN architecture) and with classification and regression algorithms such as decision tree (DT) (Breiman et al., 1984), random forest (RF) (Breiman, 2001), single- and multi-objective versions of a heterogeneous flexible neural tree (HFNT_S and HFNT_M) (Ojha et al., 2017), multi-output neural tree (MONT) (Ojha and Nicosia, 2020), Gaussian process (GP) (Rasmussen and Williams, 2006), naïve Bayes classifier (NBC) (Mitchell, 1997), and support vector machine (SVM) (Cortes and Vapnik, 1995; Chang and Lin, 2011; Fan et al., 2008). The results on all problems indicate the success of our BNeuralT algorithm, which produces high-performing and parsimonious models, balancing complexity and descriptive ability with a minimal training hyperparameter setup.
Our contribution is an innovative Recursive Backpropagation Neural Tree algorithm that

• takes inspiration from biological dendritic trees to solve a wide class of machine learning problems through a single-neuron tree-like model that performs dendritic nonlinearities through its internal nodes and resembles a highly sparse neural network;

• generates low-complexity and high-accuracy models. We have thus designed a learning system capable of producing minimal and sustainable neural trees: models with fewer parameters that are more compact and able to reduce CPU time and, consequently, CO2 emissions for machine learning applications;

• shows that the sigmoidal dendritic nonlinearity of any stochastic ad hoc neural tree structure can solve machine learning problems with high accuracy, and that any such structure outperforms genetically optimized neural tree structures, NNs, and other learning algorithms.

This paper presents relevant related work in Sec. 2. The BNeuralT model's architecture and properties are described in Sec. 3. Secs. 4.1 and 4.2 outline the hyperparameter settings and experiment versions. The performance of BNeuralT on machine learning problems is summarized in Sec. 5 and discussed in Sec. 6, followed by conclusions in Sec. 7. The source code of the BNeuralT algorithm and pre-trained models are available at https://github.com/vojha-code/BNeuralT.

Related works
We review works defining neural tree architectures and training processes. The early definitions of neural trees appeared in (Sakar and Mammone, 1993; Sirat and Nadal, 1990), where the tree's "root-to-leaf" path is represented as a neural network (NN). Such a tree makes its decision through leaf nodes, and its internal nodes are NNs (or neural nodes). Jordan and Jacobs (1994) proposed a hierarchical mixture of experts model that constructs a binary tree structure in which the model hierarchically combines the outputs of expert networks (feed-forward NNs at the terminal nodes) through gating networks (feed-forward NNs at the non-terminal nodes) and propagates the computation from "leaf to root," with each NN using the whole input feature set.
In contrast, our model is purely a single network (tree) structure, whereas a hierarchical mixture of experts model is a hierarchical combination of several (preferably small) networks. Therefore, unlike the hierarchical mixture of experts model, our model is a subset of a NN in which "leaf to root" defines a specific information processing path. In fact, considering its plausible inspiration from the biological computational dendritic tree (Travis et al., 2005; Mel, 2016; Poirazi et al., 2003b), our model behaves as a single neuron model (Jones and Kording, 2021).
Our proposed BNeuralT algorithm generates an m-ary tree structure stochastically and assigns edge weights randomly. Each leaf node (terminal node) of BNeuralT takes a single input variable from the set of all available variables (data features). Therefore, in a generated tree, some features may remain unused by the model, leaving only selected features responsible for the prediction.
Moreover, each neural node (non-terminal node) of the tree takes a weighted summation of its children's outputs. Hence, a BNeuralT model potentially performs input dimension reduction and propagates the computation from leaf to root.
A recent work by Tanno et al. (2019) presents a neural tree as an arrangement of convolution layers and linear classifiers resembling a decision-tree-like classifier, where the incoming inputs at the nodes are inferred through so-called routers, processed through tree edges (transformers), and classified through leaf (solver) nodes. In contrast, our model takes image pixels directly as its inputs. A "leaf-to-root" neural tree definition appeared in (Zhang et al., 1997; Chen et al., 2005), where the tree's leaf nodes are designated inputs, internal nodes are neural nodes, and edges are weights. Such neural trees have been subjected to structure optimization (Chen et al., 2005; Ojha et al., 2017) and to parameter optimization via gradient-free optimization techniques like particle swarm optimization (Chen et al., 2007) and differential evolution (Ojha et al., 2017). Zhang et al. (1997) demonstrated that a neural tree could be evolved as a subset of an MLP.
Their effort was to evolve a neural tree using genetic programming and to optimize its parameters using a genetic algorithm. Lee et al. (2016) focused on implementing the pooling layers within a convolutional NN as a tree structure. Our approach, by contrast, is to generate and train ad hoc neural trees using our proposed recursive backpropagation algorithm. To the best of our knowledge, this is the first attempt to do so. Our motivation is to avoid any prior assumptions on the network architecture and complicated hyperparameter settings. Srivastava et al. (2014) proposed the dropout technique, which randomly drops neurons from a large NN; this creates "thinned" NN instances during training and prevents the NN from overfitting. Our proposed BNeuralT randomly generates a tree architecture, which can be considered a sparse NN in a similar sense, with a rather higher dropout. Also, the branching and pruning of tree branches in BNeuralT are performed at the tree generation stage, where a branch is probabilistically pruned by generating a leaf node at a depth shallower than the maximum depth.

Problem statement
Let X ⊆ R^d be an instance space and Y = {c_1, ..., c_r} be a set of r labels such that a label y ∈ Y is assigned to an instance x ∈ X. Therefore, for a training set of N instance-label pairs S = {(x_i, y_i)}_{i=1}^N, we induce a classifier G(X, w) that reduces the classification cost L_Error(G) = (1/N) Σ_{i=1}^N 1(ŷ_i ≠ y_i), where ŷ_i is the predicted class for an input instance labeled with the target class y_i ∈ {c_1, ..., c_r}. Additionally, when an instance x ∈ X is associated with a continuous variable y ∈ R rather than a set of discrete class labels, G(X, w) is a predictor that, for a training set of instance-output pairs S, reduces a prediction cost such as the mean squared error L_MSE(G).
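Written out explicitly, the two costs are (standard renderings consistent with the notation above; 1(·) denotes the indicator function):

```latex
\mathcal{L}_{\mathrm{Error}}(G) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left(\hat{y}_i \neq y_i\right),
\qquad
\mathcal{L}_{\mathrm{MSE}}(G) = \frac{1}{N}\sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^{2}.
```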

Backpropagation neural tree algorithm
Backpropagation neural tree (BNeuralT) takes a tree-like architecture whose root node is a decision node and whose leaf nodes are inputs. For classification problems, BNeuralT has strictly dedicated nodes at level 1 of the tree (child nodes of the root) to represent classes. BNeuralT, denoted G, is an m-ary rooted tree with one node designated as the root; each internal node takes between 2 and m (m ≥ 2) child nodes, and a leaf node takes no child node. Hence, for a tree depth p, BNeuralT takes n ≤ (m^{p+1} − 1)/(m − 1) nodes (counting the internal nodes |V| and the leaf nodes |T|). Thus, BNeuralT can be defined as a union of internal and leaf nodes: each internal node v ∈ V receives 2 ≤ j ≤ m inputs from its child nodes, and the k-th leaf node t_k ∈ T has no child and takes a designated input variable. Fig. 1 shows an example of a classification (left) and a regression (right) tree. All internal nodes (shaded in gray) of the tree are neural nodes and may behave like the nodes of a NN. That is, a neural node computes a weighted summation z of its inputs and squashes it using an activation function ϕ(z), e.g., sigmoid: ϕ(z) = 1/(1 + e^{−z}) or ReLU: ϕ(z) = max(0, z). We installed sigmoid or ReLU functions as BNeuralT's neural nodes, which could be replaced by any other activation function, such as tanh. The trainable parameters w are the edge weights and the bias weights of the nodes. The number of nodes n in a tree grows as O(m^p). The number of edges (n − 1) is proportional to n, and so is the number of the tree's trainable parameters w.
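As an illustration, a neural node's computation and the node-count bound can be sketched as follows (a minimal Python sketch; the function names are ours, not part of the BNeuralT implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neural_node(child_outputs, weights, bias, phi=sigmoid):
    # A neural node: weighted summation of child outputs plus bias,
    # squashed by an activation function phi (sigmoid by default).
    z = sum(w * h for w, h in zip(weights, child_outputs)) + bias
    return phi(z)

def max_nodes(m, p):
    # Upper bound on the node count of an m-ary tree of depth p:
    # n <= (m^(p+1) - 1) / (m - 1)
    return (m ** (p + 1) - 1) // (m - 1)
```

For instance, with the paper's settings m = 5 and p = 5, `max_nodes(5, 5)` gives 3906 as the largest possible tree.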
Biologically plausible neural computation of BNeuralT. A typical NN uses McCulloch and Pitts (1943) neurons. Such a neuron operates on a weighted sum of inputs and processes the sum via a nonlinear threshold function. This neural computation assumes that the dendrites (synaptic inputs) of a neuron are summed at the "soma," thereby exciting the neuron, i.e., providing it a firing strength (McCulloch and Pitts, 1943; Hodgkin and Huxley, 1952; Poirazi et al., 2003a). However, the biological behavior of dendrites shows that dendrites themselves impose nonlinearity on their synaptic inputs before the summation at the "soma" (London and Häusser, 2005; Hay et al., 2011). This dendritic nonlinearity is possibly a sigmoidal nonlinearity (Poirazi et al., 2003b). Additionally, the synaptic connections in a fully connected NN are symmetric, whereas biological dendritic connections are asymmetric (Mel, 2016; Travis et al., 2005; Farhoodi and Kording, 2018) [cf. Fig. 2]. Jones and Kording (2021) considered the biologically asymmetric morphology of the "dendritic tree" and its repeated synaptic inputs to a neuron to show the computational capability of a single neuron for solving machine learning problems.

Stochastic gradient descent (SGD) training. BNeuralT's trainable parameters w are iteratively optimized by a stochastic gradient descent (SGD) method (cf. Algorithm 1) that, at an iteration j, requires a gradient computation ∇w_j ← Gradient ∇_w L(x_j, G_{w_j}) (cf. Algorithm 2) and a weight update w_j ← w_{j−1} + η∇w_j. This weight update in Algorithm 1 is a simple GD method; other similar optimizers like MGD, NAG, Adagrad, RMSprop, or Adam can also be used. Table 1 details the weight-update expressions for these optimizers.
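The named optimizers admit the following standard textbook update rules, sketched here in Python with our own variable names; the exact expressions used by BNeuralT are those of Table 1, and the paper's convention writes the update as w ← w + η∇w with ∇w pointing in the descent direction, whereas the sketch below uses the usual minus-gradient form:

```python
import math

def gd(w, g, eta=0.1):
    # Plain gradient descent on a single scalar weight.
    return w - eta * g

def mgd(w, g, v, eta=0.1, gamma=0.9):
    # Momentum gradient descent (Qian, 1999): v accumulates past gradients.
    v = gamma * v + eta * g
    return w - v, v

def rmsprop(w, g, s, eta=0.1, rho=0.9, eps=1e-8):
    # RMSprop: s is a running mean of squared gradients.
    s = rho * s + (1 - rho) * g * g
    return w - eta * g / (math.sqrt(s) + eps), s

def adam(w, g, m, v, t, eta=0.1, b1=0.9, b2=0.9, eps=1e-8):
    # Adam: first (m) and second (v) moment estimates with bias correction.
    # b2 = 0.9 mirrors the paper's stated beta_2 setting.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - eta * m_hat / (math.sqrt(v_hat) + eps), m, v
```

Each stateful optimizer carries its accumulator(s) from one weight update to the next; in the tree, one such state is kept per trainable parameter.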
Error-backpropagation in BNeuralT. Our proposed recursive error-backpropagation in the BNeuralT algorithm has two computation phases: a forward pass and a backward pass (cf. Fig. 3). Both work in a depth-first search manner. Since a tree data structure is traversed recursively, both the forward pass and the backward (error-backpropagation) pass take place recursively. The forward pass computes the output of the tree in a post-order traversal (cf. Fig. 3 (left)). That is, each leaf node propagates its input through a dendrite (edge) to its parent node, and subsequently, each internal node, after processing the inputs received from its child nodes, propagates its activation to its respective parent node. Finally, the root node computes the tree's output.

Algorithm 1 SGD training of a neural tree G
procedure SGD(S, G)
  for each training epoch do
    for an instance x_j ∈ S do
      ∇w_j ← Gradient ∇_w L(x_j, G_{w_j})    ▷ gradient computation (Algorithm 2)
      w_j ← w_{j−1} + η∇w_j    ▷ GD weight update on an input instance x_j
    end for
  end for
end procedure

Table 1 lists the weight-update expressions of the other optimizers: MGD (Qian, 1999), NAG (Bengio et al., 2013), Adagrad (Dean et al., 2012), RMSprop (Tieleman and Hinton, 2012), and Adam (Kingma and Ba, 2015).
The backward pass computes the gradient of the error with respect to the edge weights. It computes a gradient δ for each internal node and propagates it back along each edge depth by depth. Hence, the backward pass is a pre-order traversal of the tree (cf. Fig. 3 (right)).
That is, the gradient δ computed at the root node flows backward to its child nodes until it reaches the leaf nodes.

Algorithm 2 Backpropagation computation of a neural tree G
procedure Gradient ∇_w L(x, G)    ▷ x and w are the inputs and trainable parameters
  G_δ ← Compute_δ(y, G(x), N_0 ← G)    ▷ compute δ using Algorithm 3 for the target y and the prediction G(x) at the tree's root N_0
  for each node N of G in pre-order do
    if N is a neural node then
      ∇w_jk ← δ_k h_j    ▷ w_jk is the weight between node N_j and its parent node N_k
      ∇w_bj ← δ_j    ▷ the bias-weight gradient is the current node's δ
    end if    ▷ a leaf node holds no trainable parameters
  end for
  return ∇w
end procedure

Algorithm 3 computes δ for each neural node, where y is the target, ŷ is the prediction, and N is the current (entry) node of the tree G: the gradient δ_k is first computed at the output node from y and ŷ; at an internal (hidden) neural node, δ is computed from the node's activation h_j = N → h_j and the edge weight w_jk = N → w_jk between the current node N_j and its parent node N_k.

Experiments
These datasets are available at (Bache and Lichman, 2013; Keel, 2011). These problems differ significantly not only in the number of classes and examples but also in their attribute types and ranges. This differing nature poses significant variations in difficulty for one algorithm to excel on all problems (Wolpert, 1996).
Both classification and regression datasets were normalized using min-max normalization between 0 and 1. Each dataset was randomly shuffled and partitioned into training (80%) and test (20%) sets for each instance of the experiment. For the pattern recognition problem, we selected the MNIST dataset (LeCun et al., 2020).

BNeuralT hyperparameters. We repeated experiments 30 times (independently) for each classification and regression problem. In each run, we generated ad hoc BNeuralTs (stochastically generated tree structures) for each dataset with a maximum tree depth p = 5, a maximum of m = 5 children per node, and a branch pruning factor P[leaf_p < p] ∈ {0.4, 0.5}, which is the probability of a leaf node being generated at a depth lower than the tree height p. A higher leaf generation probability (e.g., 0.5) at internal nodes means that tree growth terminates earlier than the predefined depth, i.e., a tree is generated with fewer parameters. A lower leaf generation probability (e.g., 0.4) yields a deeper tree structure with more parameters.
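The stochastic tree generation described above can be sketched as follows. This is illustrative Python with our own names; the per-class constraint on the root's children for classification trees is omitted for brevity.

```python
import random

def generate_tree(n_features, depth=5, max_child=5, p_leaf=0.4, rng=random):
    """Stochastically grow an m-ary tree. p_leaf is the branch pruning factor:
    the probability that a leaf appears at a depth shallower than `depth`."""
    def grow(d):
        # A branch is pruned early with probability p_leaf (never at the root).
        if d == depth or (d > 0 and rng.random() < p_leaf):
            return {"leaf": rng.randrange(n_features)}   # random (repeated) input
        n_child = rng.randint(2, max_child)              # 2 <= children <= m
        return {"children": [grow(d + 1) for _ in range(n_child)],
                "weights": [rng.uniform(-1, 1) for _ in range(n_child)],
                "bias": rng.uniform(-1, 1)}
    return grow(0)
```

With `p_leaf = 0.5` the trees terminate earlier and carry fewer parameters; with `p_leaf = 0.4` they grow deeper, matching the trade-off described above.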
Other algorithms' hyperparameters. The MLP architecture was a fixed three-layer architecture [inputs-hidden (100 nodes)-targets] for each classification and regression dataset.
A SoftMax layer acted as the MLP classifier's output nodes, and the MLP regressor had a sigmoid activation as its output node. The internal neural nodes of an MLP were sigmoid (or ReLU) functions. The other algorithms HFNT_S, HFNT_M, MONT_3, DT, RF, GP, NBC, SVM, and CARTs used their default setups as given in their libraries (Pedregosa et al., 2011) or in the literature (Ojha et al., 2017; Ojha and Nicosia, 2020; Zharmagambetov et al., 2019). A detailed list of the hyperparameters of all algorithms is provided in Supplementary Table A1.

SGD hyperparameters. The BNeuralT and MLP algorithms take optimizers like GD, MGD, NAG, Adagrad, RMSprop, or Adam. The training parameters were learning rate η = 0.1, momentum rate γ = 0.9, β_1 = 0.9, β_2 = 0.9, ε = 1e−8; the training mode was stochastic (online), and the number of training epochs was 500. Since the gradient computation was stochastic, both BNeuralT and MLP perform the same number of forward-pass (function) evaluations, i.e., the number of training examples × epochs. All six optimizers were used for training BNeuralT and MLP with an early-stopping restore-best strategy (or without early stopping for some trial experiments), whereas the other algorithms use their own default optimizers (Pedregosa et al., 2011). Classification training that minimized a categorical cross-entropy loss (Bishop, 2006) was found best. The training of the other algorithms used the default setups recommended in their libraries (Pedregosa et al., 2011). For regression problems, all algorithms were trained by reducing L_MSE(·). The test metric for classification problems was the misclassification rate L_Error(·) for all algorithms; for regression problems, it was L_r² = 1 − Σ_{i=1}^N (y_i − ŷ_i)² / Σ_{i=1}^N (y_i − ȳ)² (the Nash-Sutcliffe model efficiency coefficient), which gives a value in (−∞, 1], where ȳ is the mean of the target y.
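The two test metrics can be computed as follows (a small Python sketch with our own function names):

```python
def misclassification_rate(y_true, y_pred):
    # Fraction of instances whose predicted class differs from the target class.
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def nse(y_true, y_pred):
    # Nash-Sutcliffe model efficiency: 1 - SS_res / SS_tot, in (-inf, 1].
    # 1.0 is a perfect fit; 0.0 means no better than predicting the mean.
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```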

BNeuralT and MLP experiment versions
We experimented with multiple versions of BNeuralT and MLP settings to bring out the best of both.We tried sigmoid and ReLU as the internal activation functions.We tried BNeuralT's branch pruning factor P [leaf p < p] with 0.5 and 0.4.
The learning rates of the optimizers had two sets: (i) a flat learning rate η = 0.1 for all optimizers; (ii) the learning rates recommended in the Keras library for the respective optimizers, i.e., RMSprop, Adam, and Adagrad had η = 0.001, and MGD, NAG, and GD had η = 0.01. We call the library's recommended η value the default learning rate. In addition, SGD learning was tried "with" and "without" early-stopping (ES) strategies. Across each algorithm (BNeuralT, MLP, HFNT_S, HFNT_M, MONT_3, DT, RF, GP, NBC, and SVM), each optimizer (GD, MGD, NAG, Adagrad, RMSprop, and Adam), and each variation of the hyperparameter settings, there were 110 experiments (cf. Tables A2 and A3 in the Supplementary).
We repeated each experiment for each dataset for 30 independent runs, and their average performance on test sets was evaluated.

Selection of the best performing setting
We selected the best-performing settings based on the average test accuracy computed over 30 independent runs of BNeuralT, MLP, HFNT_S, HFNT_M, MONT_3, DT, RF, GP, NBC, and SVM, and we report them in detail in this section. The best-performing BNeuralT setting was "ES training of BNeuralT having sigmoid nodes, 0.1 learning rate, and 0.4 leaf generation rate." The best MLP setting was "ES training of MLP having sigmoid nodes and default learning rate." We found that HFNT_S, HFNT_M, DT, RF, GP, NBC, and SVM worked best with their recommended settings.
We found that, collectively over all classification and regression datasets, BNeuralT with sigmoid nodes, 0.1 learning rate, and 0.4 leaf generation rate trained using RMSprop performed the best among all experiment versions of all algorithms. This setting produced an average accuracy of 83.2% across all datasets with an average of 222 trainable parameters. The same setting also performed the best across all classification datasets among all algorithms, i.e., it produced an average accuracy of 89.1% with an average of 261 trainable parameters. In fact, the top six best results over the classification datasets were all from BNeuralT settings. The GP algorithm came 7th with an average classification accuracy of 86.79%. MLP with sigmoid nodes and ES training using the MGD optimizer with the default learning rate came 8th with an average accuracy of 86.78% and an average of 1970 trainable parameters. MLP, however, performed slightly better on regression problems than the other algorithms. MLP with sigmoid nodes and ES training using the NAG optimizer with the default learning rate produced an average regression fit value of 0.775, whereas BNeuralT with sigmoid nodes, 0.1 learning rate, and 0.4 leaf generation rate trained using the RMSprop optimizer produced an average regression fit value of 0.727. It is important to note that this performance of BNeuralT comes with a much lower average number of trainable parameters: BNeuralT used only 152 trainable parameters compared to MLP's 1041. This means BNeuralT's performance comes with an order of magnitude fewer parameters than MLP on both classification and regression tasks.
Table 2 reports the details of each algorithm's best-performing settings.However, exhaustive lists of 110 experiments, from which we selected these best performing settings, are provided in Supplementary Tables A2 and A3.

BNeuralT models summary
BNeuralT classification models summary. Table 2 suggests that BNeuralT's performance on both classification and regression problems is highly competitive with MLP and the other algorithms. For example, the average performance of BNeuralT's RMSprop on all classification problems is 2.65% (average accuracy: 89.1%) higher than the nearest best-performing non-BNeuralT algorithm. The best MLP model offered an average accuracy of 86.8%, and MLP with a 0.4 dropout rate using Adam produced an 85.9% accuracy. The other algorithms were as follows: HFNT_S, 78.9%; HFNT_M, 72.4%; MONT_3, 83.1%; DT, 81.3%; RF, 86.4%; GP, 86.8%; NBC, 78.2%; and SVM, 84.1%. For this performance, BNeuralT uses only 13.25% (w = 261) of the trainable parameters relative to MLP's 1969 parameters. The structures of some selected best-performing BNeuralT classification models are shown in Fig. 5, where black edges indicate dendrites, and green, blue, red, and black nodes, respectively, indicate inputs, dendritic nonlinearities, roots, and class nodes.
The average tree sizes of the HFNT_M, MONT_3, and HFNT_S algorithms were 29 (72.4% accuracy), 36 (83.1% accuracy), and 92 (78.9% accuracy), respectively. Since tree construction and forward pass computation are similar for the BNeuralT, HFNT, and MONT algorithms, there is a trade-off between a model's compactness and its accuracy. In fact, this produces a set of trade-off (Pareto) solutions between accuracy and complexity, from which one can choose the best candidate solution for the machine learning problem under examination: more accurate but less sustainable, or a little less accurate but more robust and sustainable.
The forward pass computation time on a single example (in multiples of 10^−6 seconds) of BNeuralT was 11.2, whereas MLP took 1288; DT, 3.1; RF, 485.4; GP, 455.6; NBC, 31.9; and SVM, 16.1. DT was the fastest, and BNeuralT was the second fastest. However, DT has a much lower accuracy (81.3%) than BNeuralT (89.1%). The computation times are difficult to compare exactly, as the algorithms were implemented in different programming languages (Pereira et al., 2017): BNeuralT was implemented in Java 11, and all the other algorithms were implemented in Python 3.5. Nevertheless, BNeuralT's performance on classification problems was clearly the best among all algorithms. This is further evident from BNeuralT's collective average accuracy over all optimizers on all classification datasets, which was 86.1%, whereas MLP's average accuracy over all optimizers was 83.8%, tree algorithms had 80.4%, and the other algorithms had 83.6%.
We selected BNeuralT's best optimizer, RMSprop, for the statistical significance test. This test was designed to examine whether the performance of BNeuralT's RMSprop is statistically significantly better than that of the other algorithms. Table 3 presents two-sample Kolmogorov-Smirnov (KS) test results examining the null hypothesis that there is no difference between the performance distributions of BNeuralT's RMSprop and the other algorithms. The results show that, for most datasets and most algorithms, the null hypothesis of no difference is rejected, i.e., the classification results of BNeuralT's RMSprop are statistically significantly better than the other algorithms' performance. This was the case despite using a restrictive Bonferroni correction to adjust the p-values. The Wilcoxon signed-rank test and the independent t-test in Supplementary Tables A4 and A5 also favor BNeuralT's RMSprop.
BNeuralT regression models summary. MLP's Adam performed best for regression problems: it produced an average regression fit of 0.772 without dropout and 0.754 with dropout on all datasets. BNeuralT's RMSprop offered an average regression fit of 0.727, which differs by only 5.8% from the best MLP result. This performance of BNeuralT comes with the use of only 14.6% of the trainable parameters used by the MLP (w = 1014). (Note that an MLP dropout model during its test phase uses all weights, since dropout only regularizes weights by averaging gradients over epochs during the training phase (Srivastava et al., 2014).) This suggests that BNeuralT is highly capable of learning data with very low complexity and a faster forward pass computation time. The structures of some selected best-performing BNeuralT regression models are shown in Fig. 6.
The average tree size of BNeuralT with P[leaf_p < p] = 0.5 and RMSprop for regression problems was 64 (L_r² = 0.675). The average tree sizes of the HFNT_S and HFNT_M algorithms were 127 (L_r² = 0.562) and 90 (L_r² = 0.567), respectively. Here, BNeuralT performed more accurately than the genetically optimized HFNT algorithms while using less complex models.

BNeuralT pattern recognition (MNIST) models summary. The RMSprop optimizer was found to be robust and to converge fastest for classification models (cf. Sec. 5.3). Hence, we trained BNeuralT on the MNIST character classification dataset (LeCun et al., 2020) using RMSprop.
We fed BNeuralT with the pixels of MNIST character images, since we do not use convolutions in BNeuralT. We aimed at generating varied BNeuralT models with varied trainable parameter lengths by varying the tree size. We hoped that a low-complexity (few parameters) BNeuralT model would perform competitively with some reported state-of-the-art models. Therefore, we compared BNeuralT's performance to gauge its robustness not only on small-scale learning problems but also on this large-scale pattern recognition problem. In our few trials, BNeuralT does perform competitively with many state-of-the-art models (cf. Table 4).
We compare BNeuralT with the biologically plausible models of Jones and Kording (2021), which performed binary classification on two MNIST classes (class 3 and class 5). This is, however, a loose comparison, as BNeuralT works on all classes and uses sigmoidal dendritic nonlinearities, whereas Jones and Kording (2021)'s models work on binary classes and use Leaky ReLU as the dendritic nonlinearity. They obtained error rates of 7.8%, 3.65%, and 8.89%, respectively, with 1-tree (w = 2,047), 32-tree (w = 65,504), and A-32-tree (w = 65,504) models. In contrast, BNeuralT performs classification on all ten classes of MNIST pixels. Obviously, some classes are easier to learn than others (see Fig. 7), and training a binary classifier presents an entirely different difficulty level than multi-class classification.
Although a one-to-one comparison is not possible in such a scenario, it may be worth noting that BNeuralT obtained error rates of 7.74% (w = 11,987) and 6.08% (w = 23,835) on all ten classes. Therefore, the sparse stochastic structure of BNeuralT (e.g., Fig. 8) stands competitive with the models of Jones and Kording (2021).
Moreover, BNeuralT models show a linear relation between their trainable parameters and their accuracy (cf. Table 4). Hence, BNeuralT models with relatively more parameters and exhaustive hyperparameter tuning are able to produce efficient results. Fig. 8 shows an example BNeuralT model trained on MNIST. This model has 3,664 function nodes (blue nodes), 16,507 leaf nodes (green nodes), ten class nodes (red nodes in the inner circle), and the root node (in black) in the center. The model has 6,738 edges (gray lines connecting nodes); these lines also represent neural weights. Each blue node also has a bias. The edge weights and biases together make up the tree's 23,835 trainable parameters. This model has a test accuracy of 94% (an error rate of 6.08%).

BNeuralT convergence analysis
We evaluated the average asymptotic convergence profiles of all six SGD optimizers for optimizing BNeuralT on classification and regression problems (cf. Figs. 9, 10, and 11). For this analysis, we recorded the training and test accuracies of each training epoch. Since we ran the algorithms for 30 independent instances, we analyzed the average trajectory of all 30 runs. In each run, an ad hoc BNeuralT architecture was generated, whose tree size could vary between a minimum of "outputs × 2" nodes and a maximum of (m^(p+1) − 1)/(m − 1) nodes. Hence, the BNeuralT architecture and trainable parameters varied stochastically at each instance of the experiment.
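As a quick sanity check, the node-count bounds above can be computed directly. This is a minimal sketch, assuming (following the text) that m is the maximum number of children per node and p is the maximum tree depth; the function names are illustrative:

```python
def max_tree_nodes(m: int, p: int) -> int:
    """Maximum nodes of a full m-ary tree of depth p:
    the geometric series 1 + m + ... + m**p = (m**(p+1) - 1)/(m - 1)."""
    return (m ** (p + 1) - 1) // (m - 1)

def min_tree_nodes(outputs: int) -> int:
    """Minimum tree size stated in the text: 'outputs x 2' nodes."""
    return outputs * 2
```

For example, a full binary tree (m = 2) of depth p = 10 has at most 2,047 nodes, and a ten-output tree has at least 20 nodes.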
Such high entropy network architectures pose difficulties for SGDs to perform well consistently.
BNeuralT classification models convergence. Adagrad showed the most interesting convergence profile: initially it had the worst convergence among all optimizers, but at higher epochs it started rapidly improving. Thus, asymptotically, Adagrad converged to an accuracy similar to RMSprop's. The optimizers NAG and MGD behaved equivalently. Adam and GD were found to be sensitive to the BNeuralT architecture (and trainable parameters).
BNeuralT regression models convergence. BNeuralT and MLP settings convergence. In Fig. 11, we compare the convergence of the six optimizers for optimizing both BNeuralT and MLP under various settings. We show this comparison on the "glass" and "miles per gallon" datasets as examples. (The Supplementary shows convergence on all other datasets under various settings.) In Fig. 11, we observe that the learning rate 0.1 produces stable convergence for all optimizers except Adam, which does not converge as well as the others at this rate. In fact, when using the ReLU activation function for regression problems, BNeuralT suffered from exploding-gradient issues with optimizers like GD, MGD, NAG, and Adam during some runs on some datasets. Adagrad, however, remained unaffected by the exploding-gradient issue, due to its decreasing convergence speed. BNeuralT's performance with ReLU, due to its high sparsity, was affected by the exploding-gradient effect more than MLP's, which showed more tolerance due to its large number of parameters (cf. Supplementary Fig. A3).
Convergence of accuracy against trainable parameters. BNeuralT's tree size (proportional to trainable parameters) and test accuracy in Fig. 12 suggest that RMSprop can optimize the ad hoc structure better than the other optimizers. We observed that the accuracy of BNeuralT increases with increasing tree size. However, accuracy dropped for some outliers in the connected scatter plot in Fig. 12, because many points fell within a narrow range. For classification problems, NAG was the next best optimizer after RMSprop. For regression problems, along with RMSprop, Adagrad was another well-performing optimizer.
BNeuralT's RMSprop optimizer showed more stable performance for stochastically varying architectures than the other optimizers. For the pattern recognition MNIST dataset, the RMSprop optimizer was used, and it showed a linear increase in accuracy with increasing tree size.

Discussion
We designed and investigated a learning system called BNeuralT capable of solving three classes of machine learning problems: classification, regression, and pattern recognition. We assessed the capability of this neural tree algorithm as a single-neuron model approximating computational dendritic-tree-like behavior (cf. Figs. 2 and 4). This algorithm can also be considered a highly sparse NN trained using SGD optimizers. To train BNeuralT using SGDs, we designed a recursive backpropagation algorithm. Therefore, we broadly assessed three aspects of a learning system, i.e., its performance on (i) stochastically generated highly sparse models, (ii) sigmoid and ReLU functions and their dendritic interactions with internal nodes, and (iii) the optimizers' asymptotic convergence behavior. We had a diverse range of classification and regression problems and algorithms against which to compare BNeuralT's capabilities over these dimensions. For a few cases, convergence is linear in tree size; for a few others, high accuracy is achieved with smaller trees. For MNIST, RMSprop with 10 epochs of stochastic online training shows a linear relation between accuracy and tree size. In all tasks, the convergence of RMSprop is the best, followed by NAG and MGD; Adagrad shows competitive performance with RMSprop for regression problems.
Since BNeuralT resembles a highly sparse NN, its performance was assessed against MLP (and MLP with a dropout rate similar to the probability of keeping nodes in BNeuralT) for similar versions of SGD training. Six classification trees of BNeuralT, among all other algorithms and experiments, were top-performing models with a very low number of parameters. In fact, BNeuralT performed better than MLP with dropout on classification problems, and it had statistically similar performance on regression problems. BNeuralT's performance against MLP's dropout regularization technique confirms that stochastic gradient descent training of any a priori arbitrarily "thinned" network has the potential to solve machine learning tasks with an equivalent or better degree of accuracy than a fully connected, symmetric, and systematic NN architecture.
We used six different SGD optimizers for optimizing BNeuralT and MLP. Each optimizer behaved differently in terms of its asymptotic convergence depending on the problem, the behavior of its learning rate over the training epochs, and the activation function used (cf. Figs. 9, 10, and 11). ReLU proved efficient with Adagrad, which may be related to Adagrad's slow convergence speed preventing weights from exploding too quickly. This phenomenon of Adagrad may be confirmed since GD, being the slowest-converging SGD, was also found efficient when ReLU was used (cf. Fig. 11(c, e, and f)). Additionally, Adagrad converged better with a learning rate of 0.1 than 0.001 (e.g., Fig. 11(a-b)). This is because, with a learning rate of 0.001, Adagrad was too slow at earlier epochs, which prevented it from converging within a fixed number of training epochs.
BNeuralT is operationally similar to the HFNT and MONT algorithms; however, the structures of HFNT and MONT models were genetically optimized, as opposed to BNeuralT's structure. The better performance of BNeuralT compared to HFNT and MONT shows that the stochastic structure of BNeuralT has a high potential to solve machine learning problems (cf. Table 2).
However, this performance comparison also shows that BNeuralT models can be compacted further, because both HFNT and MONT had smaller average tree sizes than BNeuralT on classification problems. This confirms that structure optimization made HFNT and MONT more compact, although their accuracies were slightly compromised. On regression problems, however, BNeuralT performed better than HFNT both in tree size and regression fit.
BNeuralT's performance compared to MLP models (with and without dropout) and to the genetically optimized HFNT and MONT models confirms Occam's razor principle of parsimony for machine learning model selection: simple models possess better generalization capability than complex models (Blumer et al., 1987). Indeed, this parallels the sparsity of the biological brain, where a sparse network generalizes better than or as well as a dense network (Friston, 2008; Herculano-Houzel et al., 2010; Hoefler et al., 2021). Moreover, it has been argued that a dense network is often overparameterized, and only a minute fraction of it is required for generalization (Denil et al., 2013). Our result is along similar lines: BNeuralT, with an average of only 222 parameters, only 13.5% of MLP's average of 1,638 parameters, is able to generalize machine learning problems better than or with similar accuracy to MLP.
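The parameter ratio quoted above is a simple arithmetic check; the two averages come from the text:

```python
bneuralt_avg_params = 222   # average BNeuralT trainable parameters (from the text)
mlp_avg_params = 1638       # average MLP trainable parameters (from the text)

# Roughly 0.135, i.e., about 13.5% of MLP's parameter count.
ratio = bneuralt_avg_params / mlp_avg_params
```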
Additionally, the sparsity and compactness of BNeuralT models reduce memory usage and CO2 footprint, as they require less memory and fewer computational resources than dense networks.
The decision tree algorithms DT and RF (an ensemble of DTs) computationally have dedicated paths from the root to the leaves (Breiman et al., 1984; Breiman, 2001). BNeuralT also has dedicated information-processing paths, but from the leaves to the root. Although these algorithms differ in how nodes propagate information, a performance comparison suggests that BNeuralT has superior or competitive performance compared with DT and RF (cf. Table 2).
This performance is noticeable since RF is an ensemble algorithm that, using bootstrapping, combines 100 DTs to construct a predictor (Breiman, 2001). Hence, the better performance of a standalone, randomly generated BNeuralT model shows its high capability, especially since RF, being an ensemble of many trees, is more complex than a small and compact BNeuralT tree.
Moreover, DTs are symbolic machine learning algorithms whose models offer inference ability, as opposed to the black-box nature of NNs, because of their ability to induce data using dedicated paths from the root to the leaves. Likewise, as shown in Figs. 5 and 6, BNeuralT has dedicated information-processing paths from the leaves to the root, and such paths, related to particular subsets of inputs, may be analyzed, which potentially offers inference ability to BNeuralT.
Thus, BNeuralT models are potentially inferable, as opposed to NNs. However, this is a challenging task, since BNeuralT's nodes combine inputs and perform a nonlinear or linear transformation.
We assessed BNeuralT's performance against GP, NBC, and SVM. These three algorithms rely on Gaussian assumptions or kernels, which give them a powerful approach to prediction. GP and NBC are robust and powerful algorithms if the input data follow a normal distribution. Similarly, SVM uses Gaussian kernels to project inputs into a high-dimensional space, increasing the separability of data points to help classify them (Cortes and Vapnik, 1995). The better performance of BNeuralT compared to these algorithms on classification and regression problems (cf. Table 2) suggests that BNeuralT offers an efficient alternative, as it makes no assumptions about the data when generating a hypothesis (model) to fit or classify it.
The biologically plausible design of BNeuralT comes from its structural arrangement, which takes random repeated inputs and has a computational dendritic-tree-like organization with sigmoidal nonlinearities or ReLU linearity through its internal nodes (London and Häusser, 2005). The biologically plausible computational dendritic-tree-like models 1-tree and 32-tree have a regular structural arrangement where repeated inputs are fed to a neuron systematically to form a tree structure (Jones and Kording, 2021). In contrast, BNeuralT takes randomly generated inputs and follows a non-systematic, stochastic approach to tree construction (cf. Fig. 2). Moreover, BNeuralT works on multi-class classification, whereas the 1-tree, 32-tree, and A-32-tree models work on binary classification (Jones and Kording, 2021).
BNeuralT's comparison with the 1-tree, 32-tree, and A-32-tree models, although limited, presents a noticeable performance. The error rate of BNeuralT on all-ten-class classification of the MNIST dataset was 6.08% with 23,835 parameters. The error rates of the 1-tree, 32-tree, and A-32-tree models on the binary classification of classes 3 and 5 of MNIST were reported as 7.8%, 3.65%, and 8.89%, respectively, with 2,047, 65,504, and 65,504 parameters, respectively.
This result confirms BNeuralT's potential to produce capable learning systems, especially since BNeuralT's structural randomness (cf. Fig. 2) is closer to the randomness (if any) of a biological computational dendritic tree (Travis et al., 2005).

Conclusions
We propose a new algorithm, Backpropagation Neural Tree (BNeuralT). Our BNeuralT algorithm plausibly has biological dendritic-tree-like modeling capability. It is a single-neuron-like model with sigmoidal dendritic nonlinearities or rectified linear unit (ReLU) based dendritic linearity.
It uses random repeated inputs at the leaves of subtrees attached to a single neuron, which is the root of a tree. BNeuralT uses stochastic gradient descent (SGD) optimizers to optimize stochastically generated sparse tree structures that are potentially minimal subsets of neural networks (NNs). We propose a recursive error backpropagation algorithm to apply SGDs to train trees, using post-order and pre-order traversals in a depth-first-search manner for the forward pass and backward pass computations, respectively.
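The recursive scheme can be sketched as follows. This is an illustrative toy, not the paper's Algorithms 1 and 2: it assumes sigmoid internal nodes and a plain GD weight update, and all names are hypothetical:

```python
import math

class Node:
    """A minimal neural-tree node: leaves hold input values; internal
    nodes apply a sigmoid over weighted children plus a bias."""
    def __init__(self, children=None, weights=None, bias=0.0, value=0.0):
        self.children = children or []   # empty for leaf nodes
        self.weights = weights or []     # one weight per child edge
        self.bias = bias
        self.value = value               # leaf input or cached activation

    def forward(self):
        # Post-order traversal: children are evaluated before the parent.
        if not self.children:
            return self.value
        s = self.bias + sum(w * c.forward()
                            for w, c in zip(self.weights, self.children))
        self.value = 1.0 / (1.0 + math.exp(-s))  # sigmoid nonlinearity
        return self.value

    def backward(self, delta, lr=0.1):
        # Pre-order traversal: the parent's error is processed first,
        # then propagated down each subtree scaled by the edge weight.
        if not self.children:
            return
        local = delta * self.value * (1.0 - self.value)  # sigmoid derivative
        for i, child in enumerate(self.children):
            grad = local * child.value
            child.backward(local * self.weights[i], lr)  # recurse with old weight
            self.weights[i] -= lr * grad                 # plain GD step
        self.bias -= lr * local
```

A tiny usage: build two leaves and a root, call `forward()`, then `backward(prediction - target)` to update the edge weights in place.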
The results showed that our stochastically generated, biologically plausible tree structure and recursive error backpropagation algorithm have the capacity to learn a wide variety of machine learning problems. Moreover, we show that stochastically generated tree structures can learn machine learning problems with high accuracy, and structure optimization may only be required to make models more compact. However, there is a trade-off in compacting models: we found that making models more compact means compromising on accuracy. Additionally, BNeuralT's strong performance compared to MLP's dropout regularization technique confirms that SGD training of any a priori arbitrarily "thinned" network (a sparse tree structure) has the potential to solve machine learning tasks with an equivalent or better degree of accuracy.
The sigmoidal dendritic nonlinearities (sigmoid function used at the tree's root and internal nodes) performed clearly better than a linear dendritic tree (sigmoid function used at the tree's root and ReLU at internal nodes). However, the linear dendritic tree differed from the best-performing nonlinear dendritic tree by only about 10% accuracy. Moreover, it was comparable with a few nonlinear dendritic tree models, especially those trained with gradient descent (GD), momentum GD, and Adam. This shows that BNeuralT models with effectively a single nonlinear node might solve machine learning problems efficiently.
On the MNIST (pixels) character classification dataset, BNeuralT, when loosely compared with the 1-tree and 32-tree biologically plausible dendritic tree algorithms, was found competitive. Moreover, BNeuralT performed best among selected tree-based classifiers for the classification of MNIST characters. On classification and regression problems, the overall performance of BNeuralT was better than that of several varied, well-known algorithms: decision tree, random forest, Gaussian process, naïve Bayes classifier, and support vector machine. This performance of BNeuralT came from a minimal hyperparameter setup. Therefore, this work shows that our newly designed learning algorithm generates high-performing and parsimonious (therefore sustainable) models, balancing complexity with descriptive ability.

A.2.1 BNeuralT Convergence Trajectories
Figs. A1, A2, and A3 show BNeuralT models with their leaf generation rate at lower tree depth set to 0.5, trained with an early-stopping strategy. However, they vary in the following ways: Fig. A1 has sigmoid functions at its internal nodes, and all six SGDs are trained with a learning rate of 0.1. Fig. A2 has sigmoid functions at its internal nodes; in this setting, the optimizers RMSprop, Adam, and Adagrad had a learning rate of 0.001, and the optimizers MGD, NAG, and GD had a learning rate of 0.01. Fig. A3 has ReLU functions at its internal nodes, and all six SGDs are trained with a learning rate of 0.1.

Figs. A1, A2, and A3 offer convergence profiles for both classification and regression problems for the BNeuralT setting with the higher leaf generation rate, i.e., 0.5. For the higher leaf generation rate (smaller model size), the convergence of BNeuralT with sigmoid nodes and a 0.1 learning rate is similar to that with the lower leaf generation rate (larger models). However, for the default learning rate (lower learning rate), BNeuralT convergence shows a different profile (cf. Fig. A2). For a lower learning rate, RMSprop (and Adam) is much slower at the beginning of training but converges very fast over the remainder of the training epochs. NAG and MGD show a monotonically increasing and stable learning profile. Interestingly, Adam (as evident from the literature (Kingma and Ba, 2015)) performed best with a learning rate of 0.001, which, as shown in Fig. 9, was not the case with a learning rate of 0.1. Adam has a profile similar to RMSprop's, but RMSprop produced better accuracy than Adam. Adagrad, with a lower learning rate, had the worst performance among all six SGDs.
BNeuralT's performance with ReLU, due to its high sparsity and loss of nonlinearity, shows a decline (cf. Fig. A3). In fact, it seems to suffer from exploding-gradient issues with some optimizers, such as GD, MGD, NAG, and Adam. Adagrad, however, remains unaffected by this issue when the ReLU function is used. The downward curve of the other algorithms, except Adagrad, indicates that, when ReLU was used, not all structures could be trained because of the exploding-gradient issue. Hence, the average trajectories show a downward curve in performance after certain epochs; for regression, this phenomenon appears as a sudden upward curve in plots (b).
However, they varied in their early-stopping and learning-rate usage strategies: Fig. A4 was trained with early stopping and a flat 0.1 learning rate for all optimizers. Fig. A5 was trained with early stopping, but RMSprop, Adam, and Adagrad had a learning rate of 0.001, and MGD, NAG, and GD had a learning rate of 0.01. Fig. A6 was trained without early stopping and with a flat 0.1 learning rate for all optimizers. Fig. A7 was trained without early stopping and with the default learning rates.
Considering the convergence profiles of the different optimizers on MLP training with early stopping and a learning rate of 0.1, we observed the following: Adagrad with early stopping and a higher initial learning rate offered better accuracy on both classification and regression problems. This is because Adagrad's small steps at higher epochs allowed networks to find a better early-stopping point than the other algorithms, whose larger step sizes made networks converge to a premature early-stopping point (cf. Figs. A4 and A6).
With an initially small learning rate of 0.001, Adagrad is too slow to converge within the predefined number of epochs (500). In this setting, Adam seems to have an appropriately small step size to converge to a proper early-stopping point. The performances of RMSprop, MGD, NAG, and GD come next after Adam's (cf. Figs. A5 and A7).

Fig. 1
Fig. 1(a) is an example of a classification neural tree, where each immediate child of the root is a subtree dedicated to a class, and the root node only decides a winner class ŷ = argmax{c_1, . . ., c_r} for an instance-label pair (x, y). For regression learning problems, BNeuralT is a regression neural tree whose root node decides the tree's predicted output ŷ = ϕ(∑_{i=1}^{children} w_i v_i + b_{v_0}), where ϕ(·) is an activation function yielding a value in [0, 1], w_i is an edge weight, v_i is the activation of the i-th child, and b_{v_0} is the root's bias (cf. Fig. 1(b)).

Fig. 1 :
Fig. 1: Neural Trees. (a) A neural tree example for a three-class classification learning problem. The root node v_0^3 takes three immediate children, v_1, v_3, and v_4, each respectively designated to a class c_1, c_2, and c_3. The internal nodes (shaded in gray) are neural nodes and take an activation function ϕ(·); leaf nodes are inputs. Each designated output class has its own subtree. This tree takes its input from the set {x_1, x_2, . . ., x_5}. The links w_i^{v_j} between nodes are neural weights. (b) A neural tree example for a regression problem, which has one output node v_0.

Fig. 2 :
Fig. 2: Biologically plausible neural computation using dendritic trees. The red circle represents a neuron (soma), the black lines are dendrites, and the numbers indicate inputs.

Fig. 2
Fig. 2(b) shows Jones and Kording (2021)'s single-neuron computational model with repeated inputs arranged as a binary tree structure. Unlike the work of Poirazi et al. (2003b), BNeuralT has asymmetric dendritic connections to a "single" neuron (Beniaguev et al., 2020; Jones and Kording, 2021). Jones and Kording (2021)'s dendritic tree has a systematic and regular binary-tree-like structure and solves a binary classification problem, whereas BNeuralT's neuron is like the neuron of Travis et al. (2005) (cf. Fig. 2(a)) and has a stochastic m-ary rooted tree-like structure (cf. Fig. 2(c)). Thus, through BNeuralT we investigate the ability of a single neuron with sigmoidal nonlinearity (or linearity, when using ReLU) in its dendritic connections on three machine-learning problems: multi-class classification, regression, and pattern recognition.

Fig. 3 :
Fig. 3: Forward pass (left) and backward pass (right) computation.The arrows show the direction of computation.

Fig. 4 :
Fig. 4: Left. Backpropagation Neural Tree. The output node v_k yields output ŷ in the forward pass upon receiving inputs x_i from leaf nodes. Each node is linked with an edge weight w_ij. The backward pass propagates the error e = (y − ŷ) back to the input nodes to compute the weight change ∆w. Right. Backpropagation of error from an output node, k; to a hidden node, j; to an input node, i; and to the bias inputs b_k and b_j. Dashed lines represent error backpropagation and the computation of δ and the gradient ∇w (cf. Algorithm 2) to find the weight change ∆w used by stochastic gradient descent (cf. Algorithm 1).
While BNeuralT and MLP were trained in online mode (example-by-example training), the other algorithms take only offline-mode (epoch-by-epoch) training. For the pattern recognition problem (MNIST), we set mini-batch training with a batch size of 128 examples. RMSprop was used as the optimizer, and BNeuralT was trained by varying the learning rate η ∈ {0.1, 0.01} and the number of epochs ∈ {10, 25, 50, 70}. The results of other algorithms on MNIST were collected from the literature to compare performances.
Loss functions. The loss function for BNeuralT training on classification and pattern recognition problems was a misclassification rate L_Error(G).
These variations produced five BNeuralT settings: (i) ES training of BNeuralT having sigmoid nodes, 0.1 learning rate, and 0.5 leaf generation rate; (ii) ES training of BNeuralT having sigmoid nodes, default learning rate, and 0.5 leaf generation rate; (iii) ES training of BNeuralT having ReLU nodes, 0.1 learning rate, and 0.5 leaf generation rate; (iv) ES training of BNeuralT having sigmoid nodes, 0.1 learning rate, and 0.4 leaf generation rate; and (v) training without ES of BNeuralT having sigmoid nodes, 0.1 learning rate, and 0.5 leaf generation rate. Multiple MLP settings were tried, of which some of the best performing were: (i) ES training of MLP having sigmoid nodes and 0.1 learning rate; (ii) ES training of MLP having sigmoid nodes and default learning rate; (iii) ES training of MLP having sigmoid nodes, 0.1 learning rate, and L2-norm regularization; (iv) ES training of MLP having sigmoid nodes, default learning rate, and L2-norm regularization; (v) the same setting as (i) but without ES; and (vi) the same setting as (ii) but without ES. Other trials used dropout with and without early stopping.
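The misclassification-rate loss named above (and the MSE loss used for regression elsewhere in the paper) can be sketched as follows; the function names are illustrative, not the paper's code:

```python
def l_error(predictions, targets):
    """Misclassification rate L_Error: the fraction of wrongly
    predicted class labels."""
    wrong = sum(1 for p, t in zip(predictions, targets) if p != t)
    return wrong / len(targets)

def l_mse(predictions, targets):
    """Mean squared error L_MSE, used for regression problems."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```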

Fig. 5 :
Fig. 5: Classification Trees. (a)-(i) show the test accuracy and tree size of selected high-performing trees for the datasets. The black node in a tree is its root node, class output nodes are red, function nodes are blue, and leaf nodes are green. The links connecting nodes are neural weights.

Fig. 6 :
Fig. 6: Regression Trees. (a)-(e) show the best-performing tree structure of each respective dataset, with accuracy and tree size shown in brackets. The red node in a tree is the root node (output node), function nodes are blue, and leaf nodes are green. The links connecting nodes are neural weights.

Fig. 7 :
Fig. 7: BNeuralT-20K (23,835 trainable parameters w) model's RMSprop training and test error over 70 epochs on the MNIST dataset. Zoom in for BNeuralT's performance on the receiver operating characteristic curve plots for the training (inner top) and test (inner bottom) sets.

Fig. 8 :
Fig. 8: BNeuralT-20K (pixels) MNIST model (tree structure). This model has 3,664 function nodes (blue nodes), 16,507 leaf nodes (green nodes), ten class nodes (red nodes in the inner circle), and the root node (in black) at the center. It has 6,738 edges (gray lines connecting nodes), which also represent neural weights. Each blue node also has its own bias. Edge weights and biases together make up the tree's 23,835 trainable parameters. This model has a test accuracy of 94% (an error rate of 6.08%).
Fig. 9 shows the convergence (training and test error) profiles of ES training of BNeuralT with sigmoid nodes, 0.1 learning rate, and 0.4 leaf generation rate. With this BNeuralT setting, we observe that RMSprop converges the fastest among all SGDs and outperformed all other optimizers. NAG and MGD were asymptotically closer to RMSprop. Like RMSprop, NAG and MGD showed monotonically increasing training convergence. However, on the test sets, we observe that the models started overfitting. This motivated us to use early stopping with restoring the best model.
Fig. 10 shows the convergence (training and test error) profiles of ES training of BNeuralT with sigmoid nodes, 0.1 learning rate, and 0.4 leaf generation rate. The convergence profiles of the optimizers on five regression problems suggest that RMSprop and Adagrad were the better-converging optimizers. Similar to its profile on classification problems, RMSprop converged faster than the other optimizers for regression problems. Adagrad showed slower convergence than RMSprop but, unlike its performance on classification problems, a more stable convergence profile for regression. Contrary to classification problems, overfitting occurred only occasionally for regression problems when comparing training and test convergence profiles.

Fig. 9 :Fig. 10 :
Fig. 9: Early-stopping training of BNeuralT with sigmoid nodes, 0.1 learning rate, and 0.4 leaf generation rate. BNeuralT average convergence trajectory performance computed over 30 independent runs for six optimizers over nine classification problems. The x-axis, log10(epochs), has the range [0.0, 2.7] and covers training epochs 1 to 500. The y-axis, −log10(L_Error(G)), has the range [0, 10] and is the log scale of the training and test accuracies. An error of 0.01 (accuracy 99%) on the −log10(L_Error(G)) scale has a value of 2.0, and an accuracy of 90% has a value of 1.0. Thus, a higher value on the y-axis is better. The error bar is the standard deviation of −log10(L_Error(G)), and it indicates the stochasticity of the convergence that helps an optimizer escape local minima; thus, a larger length is better. For each dataset, a training and test convergence pair is plotted for 500 epochs (2.7 on the log scale). RMSprop, MGD, NAG, Adagrad, GD, and Adam are indicated in blue, orange, green, red, purple, and brown, respectively, with symbols diamond, triangle, circle, downward triangle, and star.
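The y-axis transform described in the caption above can be sketched as a small helper (illustrative, not the paper's plotting code):

```python
import math

def log_accuracy_scale(error_rate: float) -> float:
    """The plot's y-axis transform: -log10(L_Error). An error of 0.01
    (99% accuracy) maps to 2.0, and an error of 0.1 (90%) maps to 1.0."""
    return -math.log10(error_rate)
```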

Fig. 11 :Fig. 12 :
Fig. 11: Comparison of the convergence of optimizers for optimizing BNeuralT and MLP, and of optimizer convergence over varied learning-rate settings and activation-function usage. (a)-(f) Classification problems, where the y-axis is the error rate. (g)-(l) Regression problems, where the y-axis is the MSE. The x-axis, log10(epochs), has the range [0.0, 2.7] and covers training epochs 1 to 500. For MLP, the early-stopping method shows values for optimizers only up to the epoch where training stopped. The learning rate η value "default" indicates that RMSprop, Adam, and Adagrad have a learning rate of 0.001 and that GD, NAG, and MGD have a learning rate of 0.01.
The performance of BNeuralT and MLP on classification problems is shown in Figs. A1(a), A2(a), A3(a), A4(a), A5(a), A6(a), and A7(a). The y-axis of each plot is −log10(L_Error(·)), which has the range [0, 10] and is the log scale of the training and test accuracies. An accuracy of 99% (an error of 0.01) on the −log10(L_Error(·)) scale has a value of 2.0, and an accuracy of 90% (an error of 0.1) has a value of 1.0. Thus, a higher value on the y-axis is better. The error bar is the standard deviation of the −log10(L_Error(·)) value. The performances on regression problems are shown in Figs. A1(b), A2(b), A3(b), A4(b), A5(b), A6(b), and A7(b). The y-axis of each plot is log10(L_MSE(·)), the training and test set MSE on the log scale. An MSE of 0.01 on the log scale has a value of −2. Thus, a lower value is better. The error bar is the standard deviation of log10(L_MSE(·)).
Fig. A1: BNeuralT (a) classification and (b) regression models with sigmoid nodes, 0.5 leaf generation rate, 0.1 learning rate, and the early-stopping strategy. (a) RMSprop, in blue, is the fastest-converging optimizer, as it appears as the top line in the plots. Adagrad, in red, shows slow convergence at the beginning and improves rapidly only at higher epochs. Adam and RMSprop have similar trajectories at earlier epochs, but Adam stays at local minima. (b) Adagrad and RMSprop both appear as the bottom lines in the graphs, showing better convergence than the others.
Fig. A4: MLP (a) classification and (b) regression with sigmoid nodes trained with 0.1 learning rate and the early-stopping strategy. (a) Adagrad, in red, seems to take more iterations in this setting, as the red lines in the test-set plots show more fluctuation and longer epochs and lie on the upper side of the plots. (b) NAG, in green, shows better performance, as its lines are lower in the plots.
Fig. A5: MLP (a) classification and (b) regression with sigmoid nodes trained with the default learning rate and the early-stopping strategy. (a) NAG and MGD, in green and orange, show better performance in this setting, as they lie on the upper side of the test-set plots. Adam is competitively close to NAG and MGD. (b) Adam, GD, and RMSprop, in that order, show better performance, as their lines are lower in the plots; GD takes longer epochs to converge under early stopping.
Fig. A7: MLP (a) classification and (b) regression with sigmoid nodes trained with the default learning rate and without an early-stopping strategy. (a) Adam and NAG, in brown and green, show more fluctuation in this setting, as their lines lie on the upper side of the plots. (b) Adam, RMSprop, and NAG, in brown, blue, and green, show better performance, as their lines are lower in the plots.

Table 1 :
Gradient descent versions to replace line number 7 in Algorithm 1. Symbols η, γ, β, β1, and β2 are constants (hyperparameters) of the respective algorithms. Symbols vj and wj denote the previous momentum and weights, respectively, and ∇wL(xj, Gwj) denotes the gradient of the loss L over input xj and the weights w of tree G.
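As a minimal scalar (per-weight) sketch of these update rules in Python — the hyperparameter defaults below are common choices, not necessarily the constants used in the paper:

```python
import math

def gd(w, g, eta=0.1):
    """Plain gradient descent: w <- w - eta * gradient."""
    return w - eta * g

def momentum_gd(w, v, g, eta=0.1, gamma=0.9):
    """Momentum GD: accumulate a velocity v, then step by it."""
    v = gamma * v + eta * g
    return w - v, v

def nag(w, v, g_lookahead, eta=0.1, gamma=0.9):
    """Nesterov accelerated GD: g_lookahead is the gradient at w - gamma*v."""
    v = gamma * v + eta * g_lookahead
    return w - v, v

def adagrad(w, s, g, eta=0.1, eps=1e-8):
    """Adagrad: scale the step by the accumulated squared gradient s."""
    s += g * g
    return w - eta * g / (math.sqrt(s) + eps), s

def rmsprop(w, s, g, eta=0.1, beta=0.9, eps=1e-8):
    """RMSprop: exponentially decaying average of squared gradients."""
    s = beta * s + (1.0 - beta) * g * g
    return w - eta * g / (math.sqrt(s) + eps), s

def adam(w, m, v, g, t, eta=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected first and second moment estimates (step t >= 1)."""
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    return w - eta * m_hat / (math.sqrt(v_hat) + eps), m, v
```

Each call returns the updated weight together with the optimizer state (velocity or moment accumulators) carried between iterations.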
[Table fragments: … xd} is an input attribute; δk wjk is the gradient at an internal sigmoid node.]
Table 3 also suggests that the difference between BNeuralT's RMSprop performance distribution and MLP's Adam on regression problems is statistically insignificant only on the Friedman dataset. On all other regression datasets, BNeuralT's RMSprop performs on par with the other algorithms.

Table 2 :
Algorithms' performance as per average (avg.) accuracy (1 − LError(·)) and avg. regression fit (Lr2(·)) on the test sets of 30 independent runs for nine classification and five regression learning problems. Both accuracy and regression fit take 1.0 as the best value. Trainable parameters (w) of BNeuralT and MLP are neural weights. The best accuracy among all algorithms is marked in bold. Both BNeuralT and MLP are trained in online mode, whereas the other algorithms take offline-mode training. The average forward-pass (single-example prediction) wall-clock time is τ × 10−6 seconds (a lower value is better), where τJ and τP indicate the time for Java 11 and Python 3.7, respectively. Symbol M0.4 indicates MLP with a dropout rate of 0.4, (Avg.) indicates the average performance of an optimizer over datasets, and [Avg.] indicates the average performance of optimizers on a dataset. The classifier results of models MONT3 and HFNTM are from Ojha and Nicosia (2020). RF† indicates random forest, which is an ensemble model.

Table 3 :
Kolmogorov–Smirnov (KS) test on two samples: BNeuralT's RMSprop against all other algorithms for each dataset. The stat, pval, and post respectively indicate the KS statistic, the two-tailed p-value, and the Bonferroni-correction post-hoc adjusted p-value. Values are marked in bold where the null hypothesis, that BNeuralT's RMSprop and the other algorithm's results come from the same distribution, is rejected as per the Bonferroni correction.
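The two quantities reported in these tables can be illustrated with a short pure-Python sketch (the function names are ours, and the paper presumably used a library implementation such as SciPy's):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    stat = 0.0
    for x in sorted(set(a) | set(b)):
        f_a = sum(v <= x for v in a) / len(a)  # ECDF of sample A at x
        f_b = sum(v <= x for v in b) / len(b)  # ECDF of sample B at x
        stat = max(stat, abs(f_a - f_b))
    return stat

def bonferroni(pval, num_tests):
    """Post-hoc adjusted p-value under the Bonferroni correction."""
    return min(1.0, pval * num_tests)
```

Disjoint samples give the maximum statistic of 1.0, identical samples give 0.0, and the adjusted p-value simply scales the raw p-value by the number of simultaneous tests.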

Table 2, but on large-scale learning problems like MNIST. Table 4 summarizes BNeuralT models compared with the performance of tree-based state-of-the-art classification algorithms.

Table 4 reports BNeuralT's MNIST (LeCun et al., 2020) results compared with classification trees. BNeuralT performs the best among the reported trees that work on MNIST (pixels) for character classification. However, convolution of images has proven efficient for image classification problems. For example, CapsNet (Sabour et al., 2017), a state-of-the-art algorithm on MNIST (convolution), has an error rate of 0.25, but it uses 8 million parameters. In comparison, one BNeuralT model on MNIST (pixels) used 23,835 trainable parameters for an error rate of 6.08, and another used 241,999 trainable parameters for an error rate of 5.19. There is, obviously, a trade-off between a model's parameter count and its accuracy. The performance of a wide range of other algorithms on the MNIST dataset is available at LeCun et al. (2020). Our goal is to use as compact a model as we can for high accuracy.

Table 4 :
Test error rate (%) of ad hoc BNeuralT (G) models with a varied number of trainable parameters on the MNIST dataset. All models are trained for 70 epochs, except those denoted †, which are trained for 25 epochs. The decision tree models are reported in Zharmagambetov et al. (2019).
was trained without early stopping, but the optimizers RMSprop, Adam, and Adagrad had a learning rate of 0.001, and the optimizers MGD, NAG, and GD had a learning rate of 0.01. The convergence profiles of the optimizers for MLP were not consistent, with no single optimizer outperforming all others across all datasets [cf. Figs. A4 (only up to ES epochs), A5 (only up to ES epochs), A6 (full epochs), and A7 (full epochs)]. RMSprop with a 0.1 learning rate performed better on four datasets. Adagrad performed relatively consistently among all optimizers. Adam performed worse on regression problems with a learning rate of 0.1; it does, however, perform well with the default (0.001) learning rate. Adagrad shows poor convergence with a slower learning rate, as it does in the case of BNeuralT. NAG and MGD show a stable convergence profile on both classification and regression problems.
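The early-stopping (ES) strategy referred to throughout can be sketched generically as patience-based stopping on validation error (an illustration under our own assumptions; the paper's exact stopping criterion may differ):

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=1000, patience=10):
    """Run `train_epoch()` once per epoch; stop when the validation error
    returned by `validate()` has not improved for `patience` epochs."""
    best_error = float("inf")
    epochs_since_best = 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        error = validate()
        if error < best_error:
            best_error = error
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break  # validation error stopped improving
    return epoch, best_error
```

The "only up to ES epochs" plots above correspond to stopping at the break point, while the "full epochs" plots correspond to running to `max_epochs`.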

Table A2 :
Summary of the top (based on overall performance) 55 of 110 experiments across all datasets, algorithms, and settings. The values are an average of 30 independent runs of each setting. Experiment names starting with B indicate BNeuralT and with M indicate MLP. The next letter, S or R, indicates the sigmoid or ReLU activation function; ESy indicates early stopping and ESn no early stopping; Rn or Ry indicates no regularization or elastic-net regularization; L indicates the default learning rate, and its absence a learning rate of 0.1; Dr indicates dropout; p4 or p5 indicates a leaf generation rate of 0.4 or 0.5; and the optimizers GD, MGD, NAG, Adagrad, RMSprop, and Adam are indicated by G, M, N, A, R, and D, respectively.

Table A3 :
Summary of the bottom (based on overall performance) 55 of 110 experiments across all datasets, algorithms, and settings. The values are an average of 30 independent runs of each setting. Experiment names starting with B indicate BNeuralT and with M indicate MLP. The next letter, S or R, indicates the sigmoid or ReLU activation function; ESy indicates early stopping and ESn no early stopping; Rn or Ry indicates no regularization or elastic-net regularization; L indicates the default learning rate, and its absence a learning rate of 0.1; Dr indicates dropout; p4 or p5 indicates a leaf generation rate of 0.4 or 0.5; and the optimizers GD, MGD, NAG, Adagrad, RMSprop, and Adam are indicated by G, M, N, A, R, and D, respectively.

Table A4 :
Wilcoxon signed-rank test on two samples: BNeuralT's RMSprop against all other algorithms for each dataset. The 'stat,' 'pval,' and 'post' respectively indicate the Wilcoxon signed-rank statistic, the two-tailed p-value, and the Bonferroni-correction post-hoc adjusted p-value. Values are marked in bold where the null hypothesis, that BNeuralT's RMSprop and the other algorithm show no difference, is rejected.

Table A5 :
T-test on two independent samples: BNeuralT's RMSprop against all other algorithms for each dataset. The 'stat,' 'pval,' and 'post' respectively indicate the T-test statistic, the two-tailed p-value, and the Bonferroni-correction post-hoc adjusted p-value. Values are marked in bold where the null hypothesis, that BNeuralT's RMSprop and the other algorithm have the same expected (average) value, is rejected.