Introduction

Accurate classification of massive samples, as a landmark accomplishment of deep learning, has achieved performance that matches or surpasses that of humans. However, this performance is premised on a huge amount of data with a balanced distribution, which makes deep learning highly susceptible to overfitting on small1,2,3,4 and imbalanced5,6,7 data. Some approaches try to mitigate the overfitting problem by data augmentation8,9,10. However, the supplemental data obtained by such methods are only predictions of the distribution of the existing data, and the accuracy of that prediction is itself highly dependent on the amount of data. Therefore, data-scarce fields still rely on traditional neural networks or non-neural machine learning methods.

Among these traditional methods, recurrent neural networks (RNNs)11,12,13,14 show promising performance, and the synergetic neural network (SNN)11 among them natively supports data distribution adaptation, making it particularly suitable for classification tasks with small amounts of imbalanced data. RNNs have the advantages of few parameters, fast training, a controllable working process, low data dependence, and high robustness15,16, which supports their wide application in small-data classification tasks17. Although SNN is a relatively old RNN model, it retains theoretical advantages over the latest models. All attractors of the SNN correspond to valid memories; its saddle points and local extrema are limited in number and easily escaped, which ensures the network's robustness18,19,20. In addition, its unique synergetics-based dynamics lead to zero error in convergence results and native adaptation to imbalanced data16,21,22. The overfitting problem can be effectively reduced by repeatedly applying varied attention parameters to different classes of data during the recurrences of the network.

However, these advantages are masked by the network's frequent association errors, and the optimization space of the parameters is consumed in overcoming this problem, preventing the network's adaptation capability from being fully exploited. Researchers have applied SNN to tasks including image retrieval23, face recognition24, and semantic role labeling25,26, yet found that increased task difficulty significantly raises the frequency of association errors. Hu first addressed the problem by adding more adjustable parameters27. Other studies optimize parameters with the immune clonal strategy28, the fuzzy integral29, and improved particle swarm optimization30. Applying the immune clonal strategy to enhance the orthogonality of memories also helps improve accuracy31. However, all of these approaches use learning methods to patch a defect in SNN's working procedure. From the dynamical-system perspective, the association error stems from the initial state being placed in the wrong basin of attraction during the working procedure. Because this cause was not revealed, parameter tuning became the dominant route, and due to the lack of explicit objectives, the optimization methods are often based on genetic algorithms. Such a research route substantially complicates SNN's application process and occupies the optimization space of the parameters, so task-oriented parameter training is difficult to introduce at the same time. These problems have brought related research to a standstill.

In this paper, we propose a modern synergetic neural network (MSNN) model to properly apply the advantages of SNN to practical problems. We first address the association error and release the parameter tuning space by defining and remodeling the state initialization method of SNN. Although SNN's first study and some subsequent studies suggest that its initial state characterizes the similarity between samples and memories11,23,32,33,34, we prove that the initial state does not conform to the principles of a similarity metric. Therefore, we distill the network's method of calculating the initial state and remodel it into a definitive solution. Since the new solution is decoupled from the parameter tuning process, the whole optimization space can be reserved for the task properties. We design an Error BackPropagation (EBP) based attention parameter training method that allows MSNN to be co-trained with other network layers for automatic adaptation to the data distribution. Experimental results on 75 imbalanced small UC Irvine Machine Learning (UCI) datasets show that these improvements make MSNN outperform 187 neural and non-neural methods.

Contribution of this work

(1) Revealing that the root of SNN's association error lies in its wrong state initialization method. (2) Updating SNN's working process to eliminate the association error and release the parameter tuning space. (3) Proposing an EBP-based training method that enables the built-in attention parameters to adapt to the data distribution.

Related work

In general, classification methods for imbalanced data include data preprocessing, training-target modification, and purpose-built methods35. Since most classification networks are not designed with data imbalance in mind, related research focuses on data preprocessing. Depending on the distributional characteristics of the data, preprocessing methods can be categorized into oversampling36,37, undersampling38,39, and hybrids of the two40,41. In recent years, with the growing data demands of classification networks and the advent of pattern generation methods based on generative adversarial networks, oversampling of minority classes has gradually become the mainstream approach42,43. Our solution, however, belongs to the category of purpose-built methods, and our network natively supports training on imbalanced data, which distinguishes it from the above studies.

SNN overview

SNN’s working procedure

SNN is a 3-layer RNN whose structure is shown in Fig. 1. The update formulas of SNN11 for its input, hidden, and output layers are

$${\varvec{\xi}}={V}^{+}{\varvec{x}}$$
(1)
$${{\varvec{\xi}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}=Syn\left({\varvec{\xi}}\right)=\gamma \left(\frac{b{{\varvec{\xi}}}^{3}+{\varvec{\lambda}}{\varvec{\xi}}}{(b+c){\Vert {\varvec{\xi}}\Vert }_{2}^{2}}+\left(\frac{1}{\gamma }-1\right){\varvec{\xi}}\right)$$
(2)
$${{\varvec{x}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}=V{{\varvec{\xi}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}$$
(3)

\({\varvec{x}}\) is the normalized query pattern. \(V=[{{\varvec{v}}}_{1},\dots ,{{\varvec{v}}}_{{\varvec{M}}}]\) is the matrix of normalized static prototypes representing memories. \({{\varvec{x}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}\) is the new input transmitted to Eq. (1). \({\varvec{\xi}}\) is the vector of order parameters. \({V}^{+}\) is \(V\)’s Moore–Penrose inverse44,45. \(Syn\) is the synergetic activation function. \(\gamma\) is the learning rate. The network parameters are \({\varvec{\lambda}}\), \(b\), and \(c\). \({\varvec{\lambda}}\) holds the attention parameters for the prototypes, with default value 1; higher attention brings a greater chance of association. \(b\) and \(c\) control the convergence speed, with default value 1. SNN requires that all prototypes are mutually independent and that their total number is less than the dimension, so that the product of \({V}^{+}\) and \(V\) is the identity matrix. Substituting Eq. (3) into Eq. (1) gives \({\varvec{\xi}}={V}^{+}{{\varvec{x}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}={{\varvec{\xi}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}\). Thus, the update formulas can be interpreted as a dynamical system over \({\varvec{\xi}}\): \({\varvec{\xi}}\) is the dynamical state, its initial value is the initial state, and its variation is reflected in \({\varvec{x}}\) through \(V\).
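For concreteness, the following minimal sketch implements the update of Eqs. (1)–(3), iterating directly on \({\varvec{\xi}}\) as justified above. It is an illustrative reconstruction under default parameters rather than the authors' reference implementation; the function name, step count, and learning rate are our assumptions.

```python
# A minimal sketch of the SNN update (Eqs. 1-3), assuming normalized prototypes
# stacked as columns of V and default parameters lambda = b = c = 1.
import numpy as np

def snn_iterate(x, V, lam=None, b=1.0, c=1.0, gamma=0.1, steps=200):
    """Run the SNN dynamics and return the final order parameters and output."""
    M = V.shape[1]
    lam = np.ones(M) if lam is None else lam
    V_plus = np.linalg.pinv(V)            # Moore-Penrose inverse V^+
    xi = V_plus @ x                       # Eq. (1): initial order parameters
    for _ in range(steps):
        norm2 = np.sum(xi ** 2)
        if norm2 == 0:                    # local maximum point: division by zero
            break
        xi = gamma * ((b * xi ** 3 + lam * xi) / ((b + c) * norm2)
                      + (1.0 / gamma - 1.0) * xi)   # Eq. (2): Syn activation
    return xi, V @ xi                     # Eq. (3): associated pattern x_new
```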

Figure 1
figure 1

SNN structure. The activation function of the hidden layer is \(Syn\). The input and output layers are linear mappings with weights of the adjoint matrix \({V}^{+}\) and the prototype matrix \(V\), respectively.

SNN converges to three kinds of stationary points: the target stable point, the saddle point, and the local maximum point. The convergence behaviors are shown in Fig. 2. Generally, SNN reaches the target stable point, which occurs when \({\varvec{\xi}}\) is a positive or negative one-hot encoding. The single nonzero order parameter is called the winner parameter. The network outputs \(\pm {\varvec{v}}\) at this point, reflecting the association from \({\varvec{x}}\) to \({\varvec{v}}\). The saddle point is reached when \({\varvec{\xi}}\) has more than one identical nonzero value, which stems from multiple identical extremes in the initial state. The local maximum point is reached when all elements of \({\varvec{\xi}}\) are 0; the resulting division-by-0 error in Eq. (2) blocks the network from working.

Figure 2
figure 2

The convergence to stationary points of SNN. Different curve colors distinguish order parameters. (a,b) show convergence to target stable points (positive or negative one-hot), i.e., one order parameter converges to ± 1 while the others converge to 0; (c) shows convergence to a saddle point (multiple identical values), which stems from multiple identical extreme values in the initial value of \({\varvec{\xi}}\); (d) shows convergence to the local maximum point (all values are zero), which stems from a zero-vector initialization of \({\varvec{\xi}}\). Note that chart (d) is only for illustration, as the divide-by-0 error terminates the network’s further iterations.

SNN’s basin of attraction

In describing the convergence to the target stable point, SNN proposes the “winner-takes-all” property, i.e., the order parameter with the biggest absolute initial value is the winner parameter, but without a detailed proof. We therefore prove this property by showing that \(\left|{\xi }_{m}^{new}\right|\) is the largest when \(\left|{\xi }_{m}\right|\) is the largest; the detailed proof is given in SI 1A. From the perspective of dynamical systems, the “winner-takes-all” property can be interpreted as extreme-based basin partitioning: the basin of an SNN attractor is the set of all initial states with the same sign and extreme-value index as the attractor itself. The attractors, basins, and trajectories of random initial states of SNN in 2D and 3D are shown in Fig. 3. Such a division allows the order parameter with the biggest absolute value to retain its winner position throughout the convergence.
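The property can also be checked numerically. The snippet below is a self-contained illustration; the orthonormal random prototypes, the random seed, and the default parameter values are our assumptions.

```python
# A small numerical check of the "winner-takes-all" property: the index of the
# largest |xi| in the initial state is kept while the dynamics converge.
import numpy as np

rng = np.random.default_rng(0)
V = np.linalg.qr(rng.normal(size=(8, 3)))[0]   # 3 orthonormal prototypes in R^8
x = rng.normal(size=8)
x /= np.linalg.norm(x)                         # normalized query

xi = np.linalg.pinv(V) @ x                     # Eq. (1): initial order parameters
winner_at_start = np.argmax(np.abs(xi))
for _ in range(500):                           # Eq. (2) with lambda = b = c = 1, gamma = 0.1
    xi = 0.1 * ((xi ** 3 + xi) / (2 * np.sum(xi ** 2)) + 9.0 * xi)
print(winner_at_start == np.argmax(np.abs(xi)))   # expected: True
print(np.round(xi, 3))                            # should be close to a +/- one-hot vector
```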

Figure 3
figure 3

SNN’s attractors, basins, and trajectories of random initial states in 2D and 3D space. One coordinate of each attractor is ± 1 and the others are 0. The trajectories and the basin corresponding to the same attractor are marked in the same color. The basin contains all initial states with the same sign and maximum-absolute-value index as its attractor.

MSNN

A nonlinear dynamical system is sensitive to its initial state, so the initialization should be designed carefully. However, SNN’s initialization method was proposed without in-depth analysis. Although the associative memory task requires the correct association to be the memory most similar to the input, we prove that the existing initialization method can designate an order parameter corresponding to a less similar memory as the winner. Due to the “winner-takes-all” property of SNN’s convergence, the selected winner converges to ± 1, so the network outputs the less similar memory as the association result, causing the association error. To address this problem, we redesign the state initialization method to correct the winner designation. The new approach ensures consistency between the selected winner and the association target, fundamentally solving SNN’s association error problem. In addition, the new initialization method makes EBP-based parameter learning feasible.

SNN’s erroneous state initialization method

The working target of SNN is to converge to the most similar memory. The initial state controls the convergence, so the initialization method should be based on a similarity metric. However, the similarity between the sample \({\varvec{x}}\) and the memory \({{\varvec{v}}}_{{\varvec{m}}}\) cannot be characterized by the metric used in SNN’s state initialization method

$$S\left({{\varvec{v}}}_{{\varvec{m}}},{\varvec{x}}\right)={{\varvec{v}}}_{{\varvec{m}}}^{+}\cdot {\varvec{x}}$$
(4)

Although there are at least 67 different metrics applied in various fields46, all similarity metrics shall satisfy the following three principles47:

  1. Commonality related. The more commonality two objects share, the more similar they are.

  2. Difference related. The more differences they have, the less similar they are.

  3. The maximum is reached when they are identical.

However, \(S\) actually characterizes a scaled cosine similarity between \({{\varvec{v}}}_{{\varvec{m}}}^{+}\) and \({\varvec{x}}\) and conforms to none of the above principles. For Principle (3), \(S=1\) when \({\varvec{x}}={{\varvec{v}}}_{{\varvec{m}}}\), while \(S={\Vert {{\varvec{v}}}_{{\varvec{m}}}^{+}\Vert }_{2}\) when \({\varvec{x}}={{\varvec{v}}}_{{\varvec{m}}}^{+}/{\Vert {{\varvec{v}}}_{{\varvec{m}}}^{+}\Vert }_{2}\) (i.e., when \({\varvec{x}}\) is the normalized adjoint vector). From "SNN’s working procedure", \({V}^{+}V\) is the identity matrix, so

$${{\varvec{v}}}_{{\varvec{m}}}^{+}\cdot {{\varvec{v}}}_{{\varvec{n}}}=\left\{\begin{array}{l}1,m=n\\ 0,otherwise\end{array}\right.$$
(5)

which means that \({{\varvec{v}}}_{{\varvec{m}}}^{+}\) is perpendicular to the hyperplane spanned by all prototypes except \({{\varvec{v}}}_{{\varvec{m}}}\). Since the inner product of \({{\varvec{v}}}_{{\varvec{m}}}^{+}\) and \({{\varvec{v}}}_{{\varvec{m}}}\) is 1, the angle between \({{\varvec{v}}}_{{\varvec{m}}}^{+}\) and \({{\varvec{v}}}_{{\varvec{m}}}\) lies in the range [0, 0.5π). Because SNN requires \({{\varvec{v}}}_{{\varvec{m}}}\) to be normalized, this implies \({\Vert {{\varvec{v}}}_{{\varvec{m}}}^{+}\Vert }_{2}\ge 1\). Hence \(S\) may reach a larger value when \({\varvec{x}}\) is not equal to \({{\varvec{v}}}_{{\varvec{m}}}\), so \(S\) does not satisfy Principle (3). For Principles (1) and (2), as \({\varvec{x}}\) gradually moves from \({{\varvec{v}}}_{{\varvec{m}}}\) toward \({{\varvec{v}}}_{{\varvec{m}}}^{+}\), its commonality with \({{\varvec{v}}}_{{\varvec{m}}}\) decreases and the difference increases, but \(S\) increases rather than decreases. Therefore, \(S\) does not satisfy Principles (1) and (2).

This conflict between \(S\) and a genuine similarity metric causes the association error. From the previous section, the order parameter with the largest absolute value in the initial state is the winner parameter. SNN picks the wrong winner whenever the largest order parameter under \(S\) corresponds to a less similar \({\varvec{v}}\), which leads to an association error.
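The following constructed counterexample (illustrative prototypes and query chosen by us, not taken from the paper) makes the conflict concrete: the cosine ranking and the ranking induced by \(S\) disagree, so the winner-takes-all convergence would output the wrong memory.

```python
# With non-orthogonal prototypes, the initialization S(v_m, x) = v_m^+ . x can
# rank a memory above another that is more similar to the query in the cosine
# sense, so SNN converges to the wrong prototype.
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

V = np.stack([unit(np.array([1.0, 0.0, 0.0])),
              unit(np.array([0.0, 1.0, 0.0])),
              unit(np.array([1.0, 1.0, 0.1]))], axis=1)   # three memories v1, v2, v3
x = unit(np.array([1.0, 0.0, -0.3]))                      # query, closest to v1

cosine = V.T @ x                    # true similarity ranking
xi0 = np.linalg.pinv(V) @ x         # SNN initial state S(v_m, x) = v_m^+ . x
print(np.argmax(cosine))            # 0 -> v1 is the most similar memory
print(np.argmax(np.abs(xi0)))       # 2 -> SNN's winner is v3: association error
```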

MSNN’s remodeling of the state initialization method

The association error originates from the wrong initial state, so MSNN redesigns the initialization method. Since SNN’s basin of attraction is determined by the parameters’ absolute values, simply using a similarity measure as the state initialization may allow the most negative order parameter to become the winner, making the network associate the least similar memory. To avoid this problem, we propose the new initialization method

$${\xi }_{m}=ReLU\left(S\left({\varvec{x}},{{\varvec{v}}}_{{\varvec{m}}}\right)\right)$$
(6)

\(S\) is the similarity measure between the query and the memory. \(ReLU\) sets negative values to zero, eliminating the possibility of a negative order parameter becoming the winner. In summary, the working process of MSNN is

$${\varvec{\xi}}=\left\{\begin{array}{ll}ReLU\left({\left(S\left({\varvec{x}},{{\varvec{v}}}_{1}\right),\dots ,S\left({\varvec{x}},{{\varvec{v}}}_{{\varvec{M}}}\right)\right)}^{\mathrm{T}}\right),&{\varvec{\xi}}\ \mathrm{not\ initialized}\\ {V}^{+}{\varvec{x}},&\mathrm{otherwise}\end{array}\right.$$
(7)
$${{\varvec{\xi}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}=Syn\left({\varvec{\xi}}\right)$$
(8)
$${{\varvec{x}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}=V{{\varvec{\xi}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}$$
(9)

MSNN’s network structure is shown in Fig. 4.

Figure 4
figure 4

The working process of MSNN after correction of the state initialization method. The network input \({\varvec{x}}\) is mapped to \({\varvec{\xi}}\left(0\right)\) through the corrected state initialization, \({\varvec{\xi}}\left(0\right)\) is activated by the ReLU function and fed to the hidden layer through the initial order parameter feedforward layer, and the network then works in iterations.

The new initialization method ensures the correct association while improving the running speed. Because only positive values are admitted as initial order parameters, the most similar memory must correspond to the largest order parameter. By the “winner-takes-all” property, this largest order parameter becomes the winner, and the most similar memory becomes the association result. The new initialization also sparsifies \({\varvec{\xi}}\) by setting negative order parameters to zero, which speeds up the hardware computation.
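A minimal sketch of Eqs. (6)–(9) is given below, assuming cosine similarity as \(S\) (the dot product of the normalized query and prototypes); the function name and iteration settings are our assumptions. Applied to the counterexample constructed earlier, this initialization starts the winner at the cosine-most-similar memory, so the dynamics converge toward the correct prototype.

```python
# A minimal sketch of MSNN's working process: ReLU-clipped similarities as the
# initial order parameters, followed by the synergetic dynamics of Eq. (8).
import numpy as np

def msnn_associate(x, V, lam=None, b=1.0, c=1.0, gamma=0.1, steps=200):
    """Initialize xi with ReLU(S(x, v_m)) and run the synergetic dynamics."""
    M = V.shape[1]
    lam = np.ones(M) if lam is None else lam
    xi = np.maximum(V.T @ x, 0.0)          # Eqs. (6)/(7): ReLU of similarities
    for _ in range(steps):
        norm2 = np.sum(xi ** 2)
        if norm2 == 0:
            break
        xi = gamma * ((b * xi ** 3 + lam * xi) / ((b + c) * norm2)
                      + (1.0 / gamma - 1.0) * xi)   # Eq. (8)
    return V @ xi                          # Eq. (9): associated memory
```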

MSNN’s attention parameter self-learning

SNN’s genetic algorithm-based parameter learning is hard to co-train with other modern network layers, so we design an EBP-based learning method. The new learning method adjusts the attention parameter \({\varvec{\lambda}}\) to assign greater attention to classes with smaller sizes, providing self-adaptation to imbalanced data. Before applying EBP, note that \(Syn\) repeatedly imposes a polynomial function on its input, which may cause exploding or vanishing gradients. The gradient problem is severe enough that conventional means such as gradient clipping can barely avoid non-convergence. To solve this problem, we first normalize \({\varvec{\xi}}\) and split \(Syn\) into two terms,

$$Syn\left({\varvec{\xi}}\right)=\frac{\gamma{\varvec{\lambda}}{\varvec{\xi}}}{b+c}+\left(\frac{\gamma b{{\varvec{\xi}}}^{3}}{b+c}+\left(1-\gamma \right){\varvec{\xi}}\right)$$
(10)

EBP is performed normally for the former term, while the latter term uses the gradient bypass technique48,49. This technique passes the gradient of certain network layer outputs directly to their inputs during backpropagation; it is used to circumvent exploding or vanishing gradients caused by unsuitable activation functions, and even the blocked gradient flow caused by discontinuities.
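As an illustration of the term-splitting in Eq. (10), the sketch below uses the common PyTorch detach idiom for the gradient bypass. This is our interpretation of the technique cited in refs. 48,49, not the authors' implementation; placing the normalization of \({\varvec{\xi}}\) inside the function is likewise an assumption.

```python
# A sketch of Eq. (10) with gradient bypass: the lambda term is backpropagated
# normally, while the gradient of the cubic term is replaced by the identity.
import torch

def syn_with_bypass(xi, lam, b=1.0, c=1.0, gamma=0.1):
    xi = xi / xi.norm().clamp_min(1e-12)            # normalize xi first
    linear_term = gamma * lam * xi / (b + c)        # EBP flows through lambda
    cubic_term = gamma * b * xi ** 3 / (b + c) + (1.0 - gamma) * xi
    bypassed = xi + (cubic_term - xi).detach()      # forward: cubic_term,
    return linear_term + bypassed                   # backward: identity in xi
```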

The parameter learning requirement can be satisfied by applying EBP directly to \({\varvec{\lambda}}\). Let \({\delta }_{i}\) be the error backpropagated to \({\xi }_{i}^{new}\). Since \({\xi }_{i}\ge 0\),

$$\frac{\partial {\xi }_{i}^{new}}{\partial {\lambda }_{i}}=\frac{\gamma {\xi }_{i}}{b+c}\ge 0$$
(11)

so the adjustment \(\Delta {\lambda }_{i}\propto -{\delta }_{i}\,\partial {\xi }_{i}^{new}/\partial {\lambda }_{i}\) has the opposite sign to \({\delta }_{i}\). \({\delta }_{i}>0\) means \({\xi }_{i}^{new}\) is too large, and \(\Delta {\lambda }_{i}\le 0\) means the network will not increase its attention to \({\xi }_{i}\), giving it a higher chance of converging to 0. \({\delta }_{i}<0\) means \({\xi }_{i}^{new}\) is too small, and \(\Delta {\lambda }_{i}\ge 0\) gives \({\xi }_{i}\) a higher chance of converging to 1. Therefore, EBP satisfies the parameter learning requirement of \({\varvec{\lambda}}\).

Experiments

Dataset and network configuration

We test MSNN on the small datasets of the UCI collection, a benchmark of 121 pattern classification datasets for both neural network and non-neural network machine learning algorithms. These datasets are divided into 75 small and 46 large datasets by a threshold of 1000 samples50. All of these datasets are imbalanced after the train–test split. We compare our network against 187 neural and non-neural machine learning algorithms; their configurations and performances are detailed in the literature16,17,50. See SI 1B.2 for dataset configuration details.

As for the network architecture, we use embedding layers consisting of {0, 1, 7} fully connected layers with ReLU activation functions and {32, 128, 1024} hidden units per layer. The embedding layers are followed by the SNN with {0} to {9} iterations and a mapping to an output vector whose dimension is the number of classes. The prototype matrix is obtained by intra-class K-means clustering, and the adjoint matrix is the M–P inverse of the prototype matrix. The network structure used for the experiments is shown in Fig. 5. On each dataset, we use EBP to train SNN’s attention parameter \({\varvec{\lambda}}\) and perform a grid search to determine the best hyperparameter setting for the embedding layers, the number of memories, and the number of SNN iterations. The hyperparameter search space of the grid search is listed in Table 1. All models are trained for 100 epochs with a mini-batch size of 4 samples using the softmax cross-entropy loss and the AdamW optimizer51. After each epoch, the model accuracy is computed on a separate validation set. Using the gradient direct transmission technique48,49, the gradient at MSNN’s output layer is passed directly to the state initialization layer during backpropagation, bypassing the polynomial-shaped activation function of the SNN to circumvent gradient explosion or vanishing. With early stopping, the model with the best validation accuracy averaged over 16 consecutive epochs is selected as the final model and then evaluated on the test set.
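As a concrete illustration of the prototype construction step, the sketch below clusters each class with K-means, normalizes the centroids into columns of \(V\), and takes the Moore–Penrose inverse as the adjoint matrix. The number of clusters per class and other details are assumptions, since only the overall procedure is stated above.

```python
# A sketch of the prototype construction: intra-class K-means centroids are
# normalized, stacked as columns of V, and the adjoint matrix is pinv(V).
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(features, labels, k_per_class=2, seed=0):
    prototypes = []
    for cls in np.unique(labels):
        X_cls = features[labels == cls]
        k = min(k_per_class, len(X_cls))                 # guard tiny classes
        centers = KMeans(n_clusters=k, n_init=10,
                         random_state=seed).fit(X_cls).cluster_centers_
        prototypes.extend(c / np.linalg.norm(c) for c in centers)
    V = np.stack(prototypes, axis=1)       # one normalized prototype per column
    return V, np.linalg.pinv(V)            # prototype matrix and adjoint V^+
```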

Figure 5
figure 5

Network structure for experiments. The network input \({\varvec{x}}\) is preprocessed by the embedding layer and transformed into the initial state of the SNN through the initialization layer, i.e., the initial value of the order parameter \({{\varvec{\xi}}}_{0}\). \({{\varvec{\xi}}}_{0}\) is passed into the hidden layer of the SNN, and the order parameters are updated and memory association is realized as the network iterates. The prototype pattern obtained by association is passed through the SNN’s output layer into the classifier to obtain the label \({\varvec{y}}\). The parameters are trained using error backpropagation with the loss function \(CrossEntropy\left({\varvec{y}},{\varvec{l}}{\varvec{a}}{\varvec{b}}{\varvec{e}}{\varvec{l}}\right)\), where \({\varvec{l}}{\varvec{a}}{\varvec{b}}{\varvec{e}}{\varvec{l}}\) is the true label of the data. Using the gradient direct transmission technique, the gradient \({\nabla }_{{{\varvec{\xi}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}}L\) of \({{\varvec{\xi}}}^{{\varvec{n}}{\varvec{e}}{\varvec{w}}}\) in the error backpropagation stage (red line in the figure) is passed directly to \({{\varvec{\xi}}}_{0}\), bypassing the polynomial-shaped activation function of the SNN to circumvent gradient explosion or vanishing.

Table 1 Hyperparameter search space for grid search on small UCI datasets.

Classification performances validation

The Friedman rankings of these methods across datasets are presented in Table 2. MSNN outperforms all other methods on small datasets, setting a new state of the art on 12 datasets (balance-scale, breast-cancer, congressional-voting, heart-cleveland, ionosphere, low-res-spect, monks-2, monks-3, planning, post-operative, soybean, and spect). See SI 1B for more details.

Table 2 Friedman ranking and average accuracy (%) for each classifier, ordered by increasing Friedman ranking.

Imbalanced data adaptation performance

We analyze the performance of MSNN on datasets with different levels of imbalance in terms of the percentage of the majority class, %Maj17. %Maj reflects the level of imbalance in the dataset: the higher the %Maj, the more imbalanced the dataset. A classifier applied to imbalanced data is prone to focus on the majority class, i.e., to label all samples as the majority class, which constitutes overfitting. The more severe the overfitting, the closer the classifier’s accuracy is to %Maj. Thus, the accuracy over %Maj, denoted by \(\sigma\), reflects the extent to which minority-class samples are correctly classified. See SI 1B for the %Maj of each UCI dataset. We rank the datasets in ascending order of %Maj and mark the accuracies of the top three methods by Friedman ranking in Fig. 6a. For better visualization, we merge adjacent datasets into groups of five and calculate their average \(\sigma\); the results are shown in Fig. 6b. MSNN outperforms the other methods in most cases, and the improvement in average \(\sigma\) is most pronounced in groups 2–12 (except group 10), whose %Maj interval is (30.93, 67.83), indicating that MSNN adapts well to both mildly and moderately imbalanced datasets. In groups 13–15, with %Maj greater than 73.53, the average \(\sigma\) of MSNN decreases compared to the other methods, which suggests that the linear classifier and the standard associative memory network are more stable on heavily imbalanced datasets.

Figure 6
figure 6

(a) Majority class percentage %Maj of the small UCI datasets and the accuracy of the top three methods by Friedman ranking. Datasets are ranked in ascending order of %Maj. (b) The average accuracy over %Maj (average \(\sigma\)) of the three methods, with adjacent datasets merged into groups of five. MSNN has the highest average \(\sigma\) in the first 12 groups except group 10, with a decrease in groups 13–15, indicating its better performance on mildly and moderately imbalanced datasets.

Order parameter initialization validation

We verify the effectiveness of MSNN’s order parameter initialization method for correcting association errors by comparing its accuracy with SNN’s. We use the balanced parameter configuration (all parameters at their default value of 1), so the association target is the memory most similar to the query. The performance of SNN and MSNN is shown in Fig. 7. MSNN achieves 100% accuracy on all datasets, while SNN achieves 100% accuracy on only 5 datasets (acute-inflammation, acute-nephritis, horse-colic, monks-3, and trains), with an average accuracy of 66.47%.

Figure 7
figure 7

The associative accuracy of SNN and MSNN. The datasets are ordered by name. MSNN reaches 100% accuracy in all datasets, while SNN’s performance fluctuates significantly and reaches 100% accuracy in only 5 datasets.

Attention parameter learning performance

MSNN mitigates overfitting through self-learning of the attention parameter \({\varvec{\lambda}}\), which provides greater attention to classes with small sample sizes. Ideally, the elements of \({\varvec{\lambda}}\) should correlate negatively with the class sample sizes. Due to the diversity of data sources, MSNN cannot guarantee good fits on all datasets, and the attention parameter’s learning performance can hardly be reflected on the underperforming datasets. In addition, the correlation between \({\varvec{\lambda}}\) and the class sample sizes is difficult to model in configurations where multiple attention parameters correspond to one class. Since the objective is to verify the ideal cases rather than all cases, we drop results from the 28 datasets that do not meet two criteria: (1) \({\varvec{\lambda}}\) self-learning has a positive effect on performance; (2) the number of elements of \({\varvec{\lambda}}\) equals the number of classes. We use Spearman correlation analysis to verify the correlation between \({\varvec{\lambda}}\) and the sample sizes. Spearman analysis requires at least four samples, yet a significant proportion of the UCI datasets are 2- or 3-class tasks. Therefore, we normalize each dataset’s \({\varvec{\lambda}}\) by its 1-norm and pool them. The pooled \({\varvec{\lambda}}\) contains 192 samples, with a correlation coefficient of about − 0.170 and a p value of about 0.019. Thus, \({\varvec{\lambda}}\) has a significant negative correlation with the sample sizes, indicating that EBP is applicable to the learning of \({\varvec{\lambda}}\).
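A sketch of the pooled correlation test is given below. Only the 1-norm scaling of \({\varvec{\lambda}}\) and the Spearman test follow the description above; the per-dataset normalization of class sizes, the function name, and the data layout are our assumptions.

```python
# Pool per-dataset attention parameters and class sizes, then test for a
# monotone (negative) relationship with Spearman's rank correlation.
import numpy as np
from scipy.stats import spearmanr

def pooled_spearman(lambdas, class_counts):
    """lambdas, class_counts: lists of per-dataset 1-D arrays of equal length."""
    lam_all, cnt_all = [], []
    for lam, cnt in zip(lambdas, class_counts):
        lam = np.asarray(lam, dtype=float)
        cnt = np.asarray(cnt, dtype=float)
        lam_all.append(lam / np.abs(lam).sum())   # 1-norm scaling per dataset
        cnt_all.append(cnt / cnt.sum())           # assumed: relative class sizes
    rho, p = spearmanr(np.concatenate(lam_all), np.concatenate(cnt_all))
    return rho, p                                 # expect rho < 0, small p
```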

Conclusion

In this paper, we propose the MSNN model to further improve RNN classification performance on imbalanced small data. MSNN first corrects the state initialization method in SNN’s working process, addressing the association errors that existing studies misattribute to under-optimized parameters and releasing the whole parameter optimization space for task requirements. MSNN then adjusts SNN’s built-in attention parameters through an EBP- and gradient bypass-based learning method, enabling self-adaptation to imbalanced data during joint training with other network layers. Experimental results on 75 small UCI datasets show that MSNN attains error-free associations on all datasets and that the attention parameters spontaneously establish a significant negative correlation with the imbalanced class sizes. These improvements make MSNN outperform 187 methods and achieve a new state of the art.

Our study allows the theoretical advantages of Synergetics to be successfully applied in artificial neural networks, and we plan to extend these advantages to other areas in future work, including optimization methods for attention mechanisms and self-learning methods for representative prototypes.