1 Introduction

Classification is widely used in image recognition, personal credit evaluation, user profiling, and similar applications. These scenarios rely on streaming data to keep latency low. At big-data scale, efficient incremental learning algorithms are needed to process the streams, and the applications demand the highest possible accuracy. For these reasons, improving the accuracy of classification algorithms is an important problem [10].

At present, there are two approaches to this problem: batch learning [8] and online learning [19]. In recent years, data volumes have outgrown the size of main memory, and online learning algorithms have therefore attracted attention in the field of machine learning. Many online incremental learning [18] algorithms have been proposed, such as the Perceptron algorithm [14], the Passive Aggressive algorithm (PA) [1, 9], the Online Gradient Descent algorithm (OGD) [3], the Stochastic Gradient Descent algorithm (SGD) [2], the Truncated Gradient algorithm (TG) [3], the Weight Adaptive Regularization algorithm [5, 13], and the Confidence-Weighted algorithm (CW) [4, 6, 7].

However, some of these, such as the Perceptron and OGD algorithms, effectively add noise to the model during each update, so that the model either fails to converge or converges slowly and unstably. Other algorithms cannot produce a sparse solution while maintaining convergence speed. These problems greatly increase the complexity of prediction and reduce the classification accuracy of the algorithm.

To deal with these problems, this paper proposes an integrated classification model for incremental learning. The method consists of two parts: a pre-trained (Pt) model and a novel Truncated Gradient Confidence-weighted online classification model (TGCW). The pre-trained model is a neural network trained on the existing data samples. Following transfer learning theory, it is used to extract feature vectors from all data, which eliminates noise from the data itself. The extracted feature vectors are then sent to the novel online classification model for incremental learning, which enables superior classification results. Compared with the baseline algorithm, our method combines existing deep learning and incremental learning methods, making better use of the strength of neural networks in feature extraction and of the continuous learning ability provided by incremental learning. Through the combination of these two methods, the performance of the original algorithm is substantially improved. Experiments show that our method has higher classification accuracy and faster convergence speed.

The remainder of this paper is organized as follows. Section 2 reviews other online learning algorithms related to our work. Section 3 presents the integrated classification learning framework and describes its steps in detail. Section 4 reports extensive experiments comparing our proposed algorithm with other state-of-the-art algorithms. Section 5 concludes this work.

2 Related work

With the development of machine learning theory, many algorithms for classification problems have been developed. Most of them are based on batch learning, which relies on the assumption that all data samples can be obtained before training. The disadvantage of this kind of algorithm is that learning stops after training on the existing data samples. For example, a deep neural network (DNN) [15, 25] uses a multilayer artificial neural network between input and output and minimizes the loss function by adjusting the weights of the network. A DNN can capture complex nonlinear behavior; however, it overfits easily and typically requires a significant amount of training data. In addition, the training dataset must contain sufficient information for the trained model to be applied to previously unseen data and still obtain reasonable results.

To cope with the continuous growth of massive data, various online incremental learning algorithms have been proposed, such as active learning [26]. Online learning is a continuous training process in which input values are fed into the model in each round, and the model outputs a prediction based on its current parameters [16]. If the predicted class agrees with the true class of the input, the model is kept unchanged for the next round. If the two differ, the model is updated in an attempt to make better predictions on future data. Incremental learning means that a model can acquire new knowledge from new samples while retaining most of the previously learned knowledge. This is similar to the human learning process: people are exposed to new information every day, step by step, and ideally extract the key lessons and combine them with their prior understanding, thus improving their overall knowledge.

The Perceptron is a linear classification algorithm based on a threshold function. Its basic idea is to find a hyperplane ωTx + b = 0 in the sample space that divides the dataset into two categories. A suitable function for determining the class label may therefore be formulated as:

$$ \hat{y}_t=\operatorname{sign}\left({\boldsymbol{\omega}}^T\boldsymbol{x}+b\right)=\left\{\begin{array}{ll}+1,& {\boldsymbol{\omega}}^T\boldsymbol{x}+b\ge 0\\ -1,& {\boldsymbol{\omega}}^T\boldsymbol{x}+b<0\end{array}\right. $$
(1)

where ω is a column vector of weight parameters and b is the bias. If b is held fixed and only ω is updated, the weight adjustment of the Perceptron learning algorithm proceeds as follows:

$$ \begin{array}{c}\varDelta \boldsymbol{\omega} =\eta \left({y}_t-\hat{y}_t\right)\boldsymbol{x}\\ \boldsymbol{\omega}'=\boldsymbol{\omega} +\varDelta \boldsymbol{\omega} \end{array} $$
(2)

where η represents the learning rate, usually between 0 and 1. This is the mechanism by which the Perceptron algorithm updates the model based on the prediction error.

The bias b shifts the decision boundary away from the origin and does not depend on any input value. If the training set is linearly separable, convergence of the Perceptron is guaranteed, and in that case there is an upper bound on the number of times the Perceptron adjusts its weights during training.
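The prediction rule of Eq. (1) and the update rule of Eq. (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the toy stream, the initial weights and the learning rate η = 0.5 are assumptions chosen for the example and do not come from our experiments, and the bias b is held fixed as described above.

```python
def sign(z):
    """Threshold function of Eq. (1): +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


def perceptron_step(w, b, x, y, eta=0.5):
    """One online round: predict with Eq. (1), then apply the update of Eq. (2)."""
    y_hat = sign(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # (y - y_hat) is 0 on a correct prediction, so w only changes on mistakes.
    w_new = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x)]
    return w_new, y_hat


# Toy linearly separable stream (illustrative data, not from the paper).
stream = [([2.0, 1.0], 1), ([-1.0, -2.0], -1), ([1.5, 0.5], 1), ([-2.0, -1.0], -1)]
w, b = [-1.0, -1.0], 0.0  # hypothetical starting point; b stays fixed
mistakes = 0
for _ in range(5):  # a few passes over the toy stream
    for x, y in stream:
        w, y_hat = perceptron_step(w, b, x, y)
        mistakes += int(y_hat != y)
```

On this separable toy stream the weights stop changing once every sample is classified correctly, which illustrates the bounded number of updates mentioned above.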

In addition, a wide range of algorithms based on convex optimization have been proposed, including the PA and OGD algorithms. The PA algorithm builds on the margin idea of the support vector machine: when it encounters an incorrectly classified sample, the classifier is adjusted so that this sample is handled correctly, while the distance between the new classifier and the previous one is kept as small as possible in order to preserve what was learned from earlier samples. The online gradient descent algorithm applies offline gradient descent to the online learning setting and is the basis of many online learning algorithms. In each round of training, it updates the parameters along the descent direction of the loss function, computed from the current model and parameters.
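To make the difference between the two update rules concrete, the following sketch implements one PA step and one OGD step for a single sample. It assumes the hinge loss for both algorithms and a non-zero input vector x; the cap C corresponds to the PA-I variant, and the function names and default values are illustrative only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge(w, x, y):
    """Hinge loss of the linear classifier w on the sample (x, y)."""
    return max(0.0, 1.0 - y * dot(w, x))


def pa_step(w, x, y, C=float("inf")):
    """Passive Aggressive: smallest change of w that repairs the margin on (x, y)."""
    loss = hinge(w, x, y)
    if loss == 0.0:  # passive: the margin is already satisfied
        return list(w)
    tau = min(C, loss / dot(x, x))  # aggressive step size (x assumed non-zero)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


def ogd_step(w, x, y, eta=0.1):
    """Online gradient descent on the hinge loss of the current sample."""
    if hinge(w, x, y) == 0.0:
        return list(w)
    # The subgradient of the hinge loss w.r.t. w is -y * x when the loss is positive.
    return [wi + eta * y * xi for wi, xi in zip(w, x)]
```

With the default C, the PA step moves w exactly far enough that the current sample reaches a margin of 1, whereas the OGD step moves a fixed fraction η along the negative subgradient regardless of how large the loss is.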

In 2008, the Confidence-Weighted algorithm (CW) [6, 20], based on the hypothesis of a probability distribution over the weights, showed positive results in natural language processing. The CW algorithm maintains a Gaussian distribution over the parameter vector and updates it with each new training instance, so that the probability of correctly classifying the instance under the updated distribution meets the specified confidence. Subsequent to the CW algorithm, in order to generate a sparse solution, the Truncated Gradient algorithm (TG) [12] was proposed on the basis of convex optimization. Its core idea is to control the size of the coefficients via an imposed threshold.

In summary, batch learning algorithms cannot solve the problem of incremental data learning, and existing online learning algorithms are faced with problems involving sample noise and low classification accuracy. Ultimately, previous researchers have not addressed these issues in a comprehensive manner. In this paper, therefore, we propose several methods in order to solve these important problems.

3 Proposed methods

Online learning algorithms may be applied in a wide range of fields. However, because of their optimization strategy, their error rate on classification problems is relatively high, and parameter updates during training often suffer from slow convergence and high noise [3]. To solve these problems, this paper proposes an integrated classifier model for incremental learning, as shown in Fig. 1. The integrated classification model consists of a pre-trained model and a TGCW classification model. After the data samples are input, a neural network is trained to obtain the pre-trained model, which is then used to process the subsequent input data and produce feature vectors. Next, the feature vectors are input into the TGCW classifier to obtain the final classification result.

The pre-trained model includes a DNN training module and a CNN training module [21]. The two modules are completely independent and process vector data and matrix data, respectively; the type of the data samples determines which module is selected for training. If the given samples are vectors, they are input into the DNN module for training. If the given samples are images, the image matrices are input into the CNN module for training. After the pre-trained model is obtained, the remaining data is input into it to generate a set of feature vectors, from which the classification model generates predicted labels. If a predicted label differs from the true label, the classification model parameters are updated; otherwise, they remain unchanged.

Fig. 1 Integrated classifier architecture

It should be pointed out that automatic data type detection is not implemented in the input layer of our model. In the experiments of Section 4, the datasets are classified manually and entered into the model separately. However, such an automatic detection feature could be realized easily. One possible approach is as follows: a certain amount of data is randomly selected from the dataset to be processed and input into the vector pre-training module and the matrix pre-training module, respectively. Cross validation is then used to compare the performance of the two models, and the better-performing model is chosen for processing the complete dataset. This approach is feasible because the better performance indicates whether the vector or the matrix representation better describes the internal structure of the dataset. Cross validation can also be replaced by other validation methods that are more suitable for the data at hand.
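The selection idea described above could look roughly like the sketch below. It is not part of our implementation: `select_module` and the scoring callables are hypothetical, and each callable is assumed to train one pre-training module on the training folds and return its validation accuracy.

```python
import random


def select_module(samples, modules, k=3, seed=0):
    """Pick the pre-training module with the best mean k-fold validation accuracy.

    `modules` maps a name to a hypothetical callable
    score(train_samples, val_samples) -> accuracy in [0, 1].
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint folds
    mean_acc = {}
    for name, score in modules.items():
        accs = []
        for i in range(k):
            val = folds[i]
            train = [s for j, fold in enumerate(folds) if j != i for s in fold]
            accs.append(score(train, val))
        mean_acc[name] = sum(accs) / k
    best = max(mean_acc, key=mean_acc.get)
    return best, mean_acc
```

In practice the two callables would wrap the vector and matrix pre-training modules, and only a randomly selected subset of the dataset would be passed as `samples`.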

3.1 Vector pre-trained model

For vector data, the pre-trained model uses a DNN without convolution layers, as shown in Fig. 2. The network consists of three fully-connected layers, with 64 nodes in each of the first two. In the pre-training step, these layers learn the characteristics of the data, and the parameters of the neurons in each layer are updated through back propagation so that the information in the input vectors is retained. The first two fully-connected layers use the ReLU activation function, and the third uses the Softmax function. A Dropout layer is placed between each pair of adjacent fully-connected layers; in every training pass, it discards each node with a certain probability.

Fig. 2 Vector pre-trained model
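The forward pass of such a network can be sketched in plain Python as follows. Several details are assumptions made only for this illustration: the input dimension, the two output classes, the dropout probability of 0.5, the random initialization, and the choice of the second hidden layer as the extracted feature vector. Training by back propagation is omitted.

```python
import math
import random


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def dense(v, W, b):
    """Fully-connected layer: W holds one row of weights per output node."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi for row, bi in zip(W, b)]


def dropout(v, p, rng):
    """Inverted dropout: zero each node with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in v]


def init_layer(n_in, n_out, rng):
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def forward(x, layers, rng=None, p=0.5):
    """Return (class probabilities, 64-dimensional feature vector) for one input."""
    (W1, b1), (W2, b2), (W3, b3) = layers
    h1 = relu(dense(x, W1, b1))
    if rng is not None:  # dropout is applied only while training
        h1 = dropout(h1, p, rng)
    h2 = relu(dense(h1, W2, b2))
    features = h2  # assumed feature vector passed on to the online classifier
    if rng is not None:
        h2 = dropout(h2, p, rng)
    return softmax(dense(h2, W3, b3)), features
```

Calling `forward` without `rng` corresponds to feature extraction with the trained network, where no nodes are discarded.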

3.2 Matrix pre-trained model

For matrix data, the pre-trained model uses a CNN [17], as shown in Fig. 3. There are five convolution layers, each with a 3 × 3 convolution kernel and 32 nodes. A batch normalization layer normalizes the data after each convolution to speed up the search for the optimum by gradient descent. After the first, third and fifth convolution layers, a MaxPooling layer and a Dropout layer are added. The pool size of the MaxPooling layer is 2 × 2, and the stride is 2. Two fully-connected layers follow the convolution layers. The first fully-connected layer has 1024 nodes and uses the ReLU activation function. After this layer, a Batch Normalization layer normalizes the data, and a Dropout layer is then used to avoid overfitting. For image input, convolution and max pooling thus extract and represent the image data in the convolutional part of the pre-trained network. After the repeated convolutions, the resulting feature information is passed to the fully-connected layer with 1024 nodes to obtain the image feature vector.

Fig. 3 Matrix pre-trained model
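As a rough check of the layer sizes, the sketch below tracks the spatial size of the feature maps through the five convolution layers and three pooling layers. The 64 × 64 input size, the "same" padding of the convolutions, and the reading of "32 nodes" as 32 filters per layer are assumptions made for this example only.

```python
def conv_same(size):
    """A 3x3 convolution with 'same' padding keeps the spatial size (assumption)."""
    return size


def max_pool(size, pool=2, stride=2):
    """Output size of a max-pooling layer without padding."""
    return (size - pool) // stride + 1


def feature_map_size(size, n_conv=5, pooled_after=(1, 3, 5)):
    """Spatial size after the convolution stack, pooling after layers 1, 3 and 5."""
    for layer in range(1, n_conv + 1):
        size = conv_same(size)
        if layer in pooled_after:
            size = max_pool(size)
    return size


side = feature_map_size(64)  # 64 -> 32 -> 16 -> 8
flat = side * side * 32      # flattened input of the first fully-connected layer
fc1_weights = flat * 1024    # weights between the flattened maps and the 1024 nodes
```

Under these assumptions, each pooling layer halves the side length, so the 1024-node layer receives 8 × 8 × 32 = 2048 inputs.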

3.3 TGCW classification model

In this section, we propose a new online learning algorithm suitable for binary classification of streamed data, named TGCW, which aims to further improve the prediction accuracy and feature selection capability of the model. The CW algorithm itself has some disadvantages. (1) Its updating strategy is very aggressive. When there is noise in the streaming data, the CW algorithm will greatly modify the parameters, resulting in a decrease in accuracy. When used for binary classification, it will not obtain satisfactory performance. (2) The sparsity of the model parameters is poor, which will lead to low interpretability and performance degradation, at least to some extent.

To overcome these disadvantages, we make two adjustments: (1) similar to the PA algorithm and the SCW learning process, we introduce a parameter C to control the trade-off between aggressiveness and passiveness [11] of the learning algorithm; (2) since streaming data makes feature selection difficult, we introduce the TG algorithm to truncate coefficients whose magnitude is smaller than the threshold θ, which reduces the dimension of the streaming data.

Below we elaborate on our method. As in the CW method, we assume that the coefficient vector follows a Gaussian distribution with mean vector μt and covariance matrix Σt, and allow an immediate update to μt+1 and Σt+1. To deal with the incremental streaming data, we apply a cost-sensitive strategy, i.e.,

$$ \left({\mu}_{t+1},{\varSigma}_{t+1}\right)=\underset{\mu, \varSigma }{\mathrm{argmin}}{D}_{KL}\left(\mathrm{N}\left(\mu, \varSigma \right)\Big\Vert \mathrm{N}\left({\mu}_t,{\varSigma}_t\right)\right)+{C}_{cs}{l}^{\varphi}\left(\mathrm{N}\left(\mu, \varSigma \right);\left({x}_t,{y}_t\right)\right) $$
(3)

Here, Ccs is the cost assigned to the suffered loss, and lϕ is the loss that measures the probability of misclassification under the calculated Gaussian distribution. The term

$$ {D}_{KL}\left(\mathrm{N}\left(\mu, \varSigma \right)\Big\Vert \mathrm{N}\left({\mu}_t,{\varSigma}_t\right)\right)=\frac{1}{2}\log \left(\frac{\det {\varSigma}_t}{\det \varSigma}\right)+\frac{1}{2}\mathrm{Tr}\left({\varSigma}_t^{-1}\varSigma \right)+\frac{1}{2}{\left({\mu}_t-\mu \right)}^T{\varSigma}_t^{-1}\left({\mu}_t-\mu \right)-\frac{d}{2} $$
(4)

is the KL divergence, measuring the difference between the two distributions.
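For intuition, the KL term of Eq. (4) can be evaluated directly when both covariance matrices are diagonal. The restriction to diagonal covariances is an assumption made here to keep the sketch free of matrix inversion; the function name is illustrative.

```python
import math


def kl_diag_gaussians(mu, var, mu_t, var_t):
    """KL( N(mu, diag(var)) || N(mu_t, diag(var_t)) ), Eq. (4) for diagonal covariances."""
    d = len(mu)
    # log(det Sigma_t / det Sigma) is a sum of per-coordinate log ratios.
    log_det_ratio = sum(math.log(vt / v) for v, vt in zip(var, var_t))
    # Tr(Sigma_t^{-1} Sigma) is a sum of per-coordinate variance ratios.
    trace = sum(v / vt for v, vt in zip(var, var_t))
    # (mu_t - mu)^T Sigma_t^{-1} (mu_t - mu)
    maha = sum((mt - m) ** 2 / vt for m, mt, vt in zip(mu, mu_t, var_t))
    return 0.5 * (log_det_ratio + trace + maha - d)
```

The divergence is zero when the candidate distribution equals the current one and grows as the mean moves or the variances change, which is what keeps each update in Eq. (3) close to the previous distribution.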

Online learning requires that the streaming data arrive sequentially and cannot be reused, which makes feature selection difficult. We therefore use the truncated gradient (TG) method to select the discriminative features. After the cost-sensitive CW update at each iteration, we apply TG to select the most active features. Specifically, we define:

$$ w= TG(w)=T\left(w-\gamma \nabla l\left(w;\left({x}_t,{y}_t\right)\right),\gamma g,\theta \right) $$
(5)
$$ T\left({w}_j,a,\theta \right)=\begin{cases}\max \left(0,{w}_j-a\right), & \text{if } {w}_j\in \left(0,\theta \right)\\ \min \left(0,{w}_j+a\right), & \text{if } {w}_j\in \left(-\theta, 0\right)\\ {w}_j, & \text{otherwise.}\end{cases} $$
(6)

where

$$ l\left(w;\left({x}_t,{y}_t\right)\right)=\max \left(0,1-{y}_t\left({w}^T{x}_t\right)\right) $$
(7)

Here, γ is the learning rate, g is the gravity parameter, and θ is the threshold; xt and yt represent the sample and its label in the current iteration round, respectively. The TG operation can be performed once every K online steps.
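Equations (5) to (7) translate almost directly into code. The sketch below follows the open intervals of Eq. (6) and uses a subgradient of the hinge loss of Eq. (7); the default values of γ, g and θ are placeholders for illustration and are not the values used in our experiments.

```python
def truncate(wj, a, theta):
    """Truncation operator T of Eq. (6) for a single coefficient."""
    if 0.0 < wj < theta:
        return max(0.0, wj - a)
    if -theta < wj < 0.0:
        return min(0.0, wj + a)
    return wj


def hinge_grad(w, x, y):
    """Subgradient of the hinge loss of Eq. (7) with respect to w."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return [-y * xi if margin < 1.0 else 0.0 for xi in x]


def tg_step(w, x, y, gamma=0.1, g=0.5, theta=1.0):
    """Eq. (5): gradient step on the current sample, then truncation by gamma * g."""
    grad = hinge_grad(w, x, y)
    stepped = [wi - gamma * gi for wi, gi in zip(w, grad)]
    return [truncate(wj, gamma * g, theta) for wj in stepped]
```

Coefficients whose magnitude is below θ are pulled toward zero by γg and set exactly to zero once they would cross it, while larger coefficients are left untouched. To apply the operation only every K steps, `tg_step` would be called when the round index is a multiple of K.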

3.4 Robustness analysis

Because of the feature extraction in the pre-trained part, the performance of the whole model is greatly improved. We also find that the robustness of the model is significantly improved, for the following reasons. (1) The convolution and pooling layers in the CNN can filter out at least some of the noise. (2) The local connectivity and shared-weights strategy of the CNN can also weaken the influence of noise; if the classification accuracy needs to be improved further, a convolutional attention network can be used for preprocessing [22]. (3) For vector samples, the densely connected structure of the DNN improves the robustness of the whole model, which is more reliable than a model without a pre-trained layer. (4) The DNN extracts features from the dataset before the feature vectors are passed to the TGCW model, which weakens the direct impact of noise on the model parameters. Based on these four points, we believe that the pre-trained model has significant robustness, which is also confirmed by the experiments detailed in Section 4.

4 Experiments and analysis

4.1 Datasets

The main task of online classification is to minimize cumulative errors and thereby obtain higher accuracy. To test the performance of our proposed integrated classification model, we selected five datasets from different fields from the UC Irvine Machine Learning Repository and the KEEL dataset repository. Glass1, Glass2, Glass5, Ecoli3 and Haberman are all well-known, commonly used benchmark datasets. The Glass datasets contain attributes of several glass types that can be used in criminological investigations.

The Ecoli data contains protein localization sites, and the Haberman dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer. In addition, the masked_faces dataset, collected by Yin Xiangzhi and others, contains 2858 pictures of faces wearing masks and 1221 pictures of faces not wearing masks; the faces have different orientations and degrees of occlusion. The Dogs vs. Cats dataset comes from the Kaggle Dogs vs. Cats competition dataset and includes 12,500 pictures of cats and 12,500 pictures of dogs, which differ in color, age, and position. The Animals10 dataset comes from the Kaggle Animals-10 competition dataset and contains images of 10 kinds of animals: dog, cat, horse, spider, butterfly, chicken, sheep, cow, squirrel, and elephant. From it we chose the horses and elephants for the experiment, giving 2623 images of horses and 1436 images of elephants. The MNIST database is a large database of handwritten digits that is commonly used for training image processing systems; we selected 4132 images of “0” and 4684 images of “1” from it. The CIFAR-10 dataset contains 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. We chose 5000 images of airplanes and 5000 images of trucks for the experiment. These datasets are used to test the accuracy of the integrated classification model in image classification applications. In addition, given the current COVID-19 pandemic, the algorithm has practical value for detecting mask wearing. Table 1 shows the specifications of the 10 datasets (Fig. 4).

Table 1 Dataset summary
Fig. 4 Masked_faces dataset and Dogs vs. Cats dataset

4.2 Performance for online incremental classifier

Accuracy is the most important metric for evaluating model performance in classification problems. To show the efficiency of our online learning algorithm, we compare our method with five state-of-the-art online learning algorithms: Perceptron, OGD, PA, CW and IELLIP [23, 24].

Perceptron is a classifier that uses a hyperplane for binary classification; for each new data instance it predicts, compares the prediction with the label, and updates the location of the hyperplane. The learning process of PA is basically the same as that of Perceptron, except that a step-size parameter τt is added when the weight is revised. When the prediction is correct, the weight is left unchanged; when the prediction is wrong, the weight is adjusted aggressively. In the CW method, each learning parameter has a degree of confidence, expressed by the Gaussian distribution of the parameter vector. Parameters with low confidence need more training and are therefore given more opportunity for modification. OGD performs a gradient descent step on each newly obtained sample and updates the parameters according to the result. The key idea of IELLIP is to approximate the classification hypotheses by an ellipsoid; in addition to the classical ellipsoid method, an improved version for online learning is also presented.

We carried out the experiments by first obtaining the optimal parameters of each algorithm on each dataset, and then applying each algorithm 10 times with these optimal parameters on each dataset, each time with a randomly permuted sequence. All results are reported as averages over these 10 runs. Three performance metrics are used to evaluate the online learning algorithms: (1) online cumulative mistake rate, (2) number of updates (which is closely related to the potential number of support vectors in a kernel extension), and (3) running time.
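The first two metrics and the averaging over random permutations can be sketched as follows. The `predict` and `update` callables are hypothetical placeholders for any of the compared algorithms, and an update is counted whenever the returned weight vector differs from the previous one; running time is left out of this sketch.

```python
import random


def run_online(stream, predict, update, w0):
    """Return (cumulative mistake rate, number of updates) for one pass over the stream."""
    w, mistakes, updates = list(w0), 0, 0
    for x, y in stream:
        mistakes += int(predict(w, x) != y)  # predict first, then learn
        w_new = update(w, x, y)
        updates += int(w_new != w)           # count rounds that changed the model
        w = w_new
    return mistakes / len(stream), updates


def averaged_runs(data, predict, update, w0, n_runs=10, seed=0):
    """Average both metrics over n_runs random permutations of the data."""
    rng = random.Random(seed)
    rates, counts = [], []
    for _ in range(n_runs):
        perm = data[:]
        rng.shuffle(perm)
        rate, n_updates = run_online(perm, predict, update, w0)
        rates.append(rate)
        counts.append(n_updates)
    return sum(rates) / n_runs, sum(counts) / n_runs
```

Counting updates separately from mistakes matters for margin-based algorithms such as PA and CW, which may change the model even when the prediction is correct.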

Table 2 lists the results of our empirical evaluation of the cumulative performance of the proposed TGCW and the other algorithms on the five datasets, covering six online learning algorithms in total. Bold elements indicate the best performance on each dataset. We can draw several observations. First, in terms of the overall mistake rate, our proposed TGCW outperforms the other online learning algorithms on all five datasets. Specifically, compared to the strongest baselines, TGCW decreases the mistake rate by 9.1%, 26.0%, 2.7%, 40.6% and 20.3% on Glass1, Haberman, Ecoli3, Glass2 and Glass5, respectively. Second, in terms of the number of updates, the TGCW online learning algorithm outperforms the other methods. The CW algorithm is characterized by a small update rate when the number of samples is large, and since TGCW improves on the CW algorithm, it inherits this property. Moreover, among all the compared algorithms, TGCW often achieves the best or close to the best performance in terms of accuracy and number of updates. Finally, Fig. 5 shows the online results of the six algorithms on the five datasets during the online learning process. The results again validate the advantages of TGCW in both efficacy and efficiency over the compared algorithms.

Table 2 Evaluation of cumulative performance of the proposed TGCW and other algorithms
Fig. 5 Evaluation of cumulative performance of the proposed TGCW and other algorithms. a Glass1 b Haberman c Ecoli3 d Glass2 e Glass5

4.3 Performance for integrated classification model

To verify the performance of the integrated classifier, we designed the experimental steps and flow shown in Fig. 6. First, 10% of the dataset is randomly selected and fed into the neural network for training to obtain a pre-trained model, which is then used to process the remaining data and extract its feature vectors. Next, the feature vectors are input into the TGCW online classifier for online incremental learning; 20 training rounds are performed in the classifier, and the average of the results from these 20 rounds is taken as the final result. In addition, the remaining data is entered directly into the original TGCW classifier for online learning, again averaging over 20 experimental results. Finally, the performance of the integrated classifier and the original classifier is compared.

Fig. 6 Experiment steps and process
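The data split at the start of this protocol can be sketched as below. The function name, the fixed seed and the rounding of the 10% share are illustrative assumptions rather than details of our implementation.

```python
import random


def split_pretrain_online(samples, frac=0.10, seed=0):
    """Randomly reserve a fraction of the data for pre-training; the rest is streamed online."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_pre = max(1, round(frac * len(samples)))  # at least one pre-training sample
    pretrain = [samples[i] for i in idx[:n_pre]]
    online = [samples[i] for i in idx[n_pre:]]
    return pretrain, online
```

The `pretrain` part would be used to train the neural network, while the `online` part is passed through the trained network and then to the TGCW classifier, or directly to the original TGCW classifier for the baseline.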

The experimental results, shown in the blue part of Fig. 7, indicate that the performance of the integrated classifier is greatly improved compared to the original classifier. Using the data features extracted by the pre-trained model reduces data noise and speeds up the parameter update process. On all ten datasets, the classification error rate decreased significantly. The improvement is largest on the Dogs vs. Cats dataset, which suggests that the pre-trained model is especially beneficial for relatively complex data.

Fig. 7 Performance comparison between the Pt-TGCW and the TGCW classifier on different datasets when we randomly added Gaussian noise in the original dataset. a Masked_faces b Dogs vs Cats c MNIST d Animals10 e Cifar-10 f Ecoli3 g Glass1 h Haberman i Glass2 j Glass5

Next, we randomly added Gaussian noise with a mean of 0, 0.3 or 0.5 and a variance of 0.1, 0.3 or 0.5 to 10% of the data in the original dataset, and then classified the noisy data with both the TGCW classifier and the integrated classifier. The results are shown in the red part of Fig. 7. The integrated classifier still has good classification performance after noise is added, which shows that it has advantages over the traditional classifier in dealing with noisy data. The change in each model before and after adding noise, and the comparison between the models, can also be seen clearly in Fig. 7.
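A noise injection of this kind can be sketched as follows. The sketch assumes that the noise is added to every feature of the selected samples and that labels are left untouched; these choices, the function name and the seed are assumptions made for illustration.

```python
import random


def add_gaussian_noise(samples, mean=0.0, variance=0.1, frac=0.10, seed=0):
    """Perturb a random fraction of the feature vectors with N(mean, variance) noise."""
    rng = random.Random(seed)
    std = variance ** 0.5  # random.gauss expects a standard deviation
    n_noisy = round(frac * len(samples))
    noisy_idx = set(rng.sample(range(len(samples)), n_noisy))
    out = []
    for i, x in enumerate(samples):
        if i in noisy_idx:
            x = [xi + rng.gauss(mean, std) for xi in x]
        out.append(list(x))
    return out, noisy_idx
```

Note that the settings above are given as variances, so the square root is taken before sampling. Running the function once per (mean, variance) pair yields the different noisy copies of a dataset.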

5 Conclusion

In this paper, a new integrated classifier for incremental learning is proposed, which uses pre-trained deep neural networks to extract feature vectors from the data and then uses the TGCW classifier for incremental learning. The experimental results show that the integrated classifier has clear advantages in incremental learning. The results on ten different types of datasets indicate that the integrated classifier significantly improves on the performance of traditional classifiers. The results on the noise-added datasets show that, after the introduction of noise, the average increase in the mistake rate of the integrated classifier is about 1.5%, which is much lower than that of the original classifier (approx. 6%). Hence, the integrated classifier has advantages in processing noisy data and can achieve superior robustness and generalization while maintaining convergence speed. The dense connection structure and the convolution and pooling layers of the neural network improve the robustness of the model. Our integrated classifier still has many shortcomings. At present, its ability to process complex image data is still poor: the accuracy on some image data is less than 50%, and the training overhead is relatively large. It is also still limited to binary classification problems, which restricts its application in many practical situations. Future work will focus on extending our integrated classifier to multi-class problems. In addition, we will look for improved pre-trained models or use more classifiers for integrated learning to improve the classification accuracy on complex data.