Support Vector Machines in Big Data Classification: A Systematic Literature Review

Classification is one of the most important and widely used tasks in machine learning; its purpose is to create a rule for grouping data into pre-existing categories based on a training set. Employed successfully in many scientific and engineering areas, the Support Vector Machine (SVM) is among the most promising classification methods in machine learning. With the advent of big data, many machine learning methods have been challenged by big data characteristics. The standard SVM was proposed for batch learning, in which all data are available at the same time. The SVM has a high time complexity, i.e., increasing the number of training samples intensifies the need for computational resources and memory. Hence, many attempts have been made to adapt the SVM to online learning conditions and large-scale data. This paper focuses on the analysis, identification, and classification of existing methods for SVM compatibility with online conditions and large-scale data. These methods might be employed to classify big data, and they suggest research areas for future studies. Considering its advantages, the SVM can be among the first options for compatibility with big data and classification of big data. For this purpose, appropriate techniques should be developed for data preprocessing in order to convert data into an appropriate form for learning. The existing frameworks for parallel and distributed processing should also be employed so that SVMs can be made scalable and properly online to handle big data.


Introduction
The rapid growth of digital data production and the rapid development of scientific computing, networking, data storage, and data collectors have enabled us to generate large datasets known as big data. The concept of big data is usually characterized by three properties: volume, velocity, and variety. Big data is expanding rapidly in all areas of science and engineering. Beyond the characteristics of this type of data, the aspects of analyzing it and extracting new insights are of special importance [1]-[3].
SVMs have notable advantages such as high generalizability, simple representation through only a few parameters, and a strong theoretical foundation [4]. Nevertheless, the standard SVM algorithm was proposed for batch learning and is not suitable for online learning. Unlike batch methods, which make all training samples available at once, online learning is a classic learning scenario in which training is done by providing one sample at a time [5]. An important advantage of online algorithms is that they allow for additional training whenever data are available, without restarting the training process [5]. The time complexity of an SVM ranges between $O(n^2)$ and $O(n^3)$ in the number of training samples $n$ [6].
Increasing the number of training samples intensifies the need for computational resources and memory. Given the properties of big data, which are generated rapidly and in large amounts, it is necessary to develop algorithms that can handle these properties. This paper reviews previous studies of SVM compatibility with online learning and of SVM scalability, both of which can potentially be used for big data classification. To the best of our knowledge, no systematic literature review (SLR) on this topic has been conducted so far. An SLR identifies, classifies, and synthesizes a comparative overview of state-of-the-art research and transfers knowledge within the research community [7], [8]. This study aimed to systematically identify and classify the existing methods for SVM compatibility with big data properties (i.e., massive volume and rapid generation) in order to present future research areas. The study addresses the following questions:
RQ1) What is the motivation for SVM compatibility with big data properties?
RQ2) What are the existing SVM-based methods and techniques that support large-scale data?
RQ3) What are the existing SVM-based methods and techniques that support online learning?
RQ4) What are the necessary parameters for comparing big data classifications in volume and time complexity?
RQ5) What are the shortcomings and problems of the existing methods and techniques? What are the future research areas?
This paper is an SLR of the latest developments in SVMs for online learning and large-scale data.
For this purpose, the existing methods are analyzed, identified, classified, and reviewed. The shortcomings and problems of these methods are then identified to describe a research environment for future works. This SLR provides researchers with a description of machine learning (Table 1 contains some of the most widely used concepts in machine learning) and data analysis. The rest of this paper consists of the following sections: the research methodology is discussed in Section 2, the SVM is reviewed in Section 3, Section 4 addresses the research questions, and Section 5 draws the conclusion.

Research methodology
This study comprises a three-phase process: planning the review, conducting the review, and documenting the review. The process was based on the SLR guidelines [7], [8].
Figure 1 presents the phases in detail. As part of planning, a search for related reviews was performed; however, no SLR of SVM compatibility with big data was found.

Research questions
This subsection presents the research questions and their motivations (Table 2), whereas the scope and goals of the research are defined through the PICOC criteria (population, intervention, comparison, outcomes, and context) according to Table 3.

Conducting Review
In this phase, studies are extracted in accordance with inclusion/exclusion criteria, and the resultant information is synthesized.

Selection of studies
The primary studies were selected by searching digital databases and using the snowball methodology, which acts as a supplementary technique. The digital databases were ScienceDirect, ACM, Springer, and IEEE, and the search terms were selected based on the instructions proposed by [8] and the research motivations. The search terms are presented below. Based on the requirements of each digital database, different query styles were used. Several search attempts were made in the digital databases to reach the best compromise between recall and precision. The search was conducted on 11/05/2020. In addition to searching the digital databases, the snowball methodology was applied through forward and backward passes in accordance with [9], [10] to supplement the search process. For the snowball process, 11 primary studies (Table 4) meeting the inclusion/exclusion criteria were used as the initial group selected through the database search strategy. The selection process includes two tasks: the first is to completely define the inclusion/exclusion criteria, and the second is to apply these criteria to select the primary studies [7]. Table 5 presents the inclusion/exclusion criteria; in short, we include quality peer-reviewed scientific papers that contain significant content on the compatibility of the support vector machine with large-scale data and online learning.

Support Vector Machine
The classification problem involves a rule for grouping data into a set of predetermined categories based on a training set. Vapnik introduced the SVM as a kernel-based machine learning model for classification and regression problems. The SVM is among the most popular and promising classification algorithms [11], [12]; it is based on the VC dimension and was developed through statistical learning theory and the structural risk minimization principle [4], [11]. The success of an SVM lies in its good generalizability and convergence [13]. In addition to SVMs, there are many other good classification techniques, such as the KNN algorithm [14], [15], Bayesian networks [16], [17], artificial neural networks [18], [19], and decision trees [20]. The KNN algorithm is very simple to implement but is slow in handling big data and is very sensitive to irrelevant parameters [14], [15]. The decision tree is a widely used classifier. It is faster than other techniques in the training phase but is inflexible in parameter modeling [21], [22].
Neural networks have extensively been used in many applications; however, many factors such as learning algorithms, number of neurons in every layer, number of layers, and data representation should be considered in the development of neural networks [23], [24].
Using a training dataset, the SVM creates an optimal hyperplane that separates two classes with the maximum margin. This process is based on the structural risk minimization principle, which reduces the generalization error of the model, in contrast to the philosophy of empirical risk minimization, which minimizes the mean error on the training set [25]. The maximum margin between two classes corresponds to the minimum VC dimension [4], [26]. If the hyperplane fits the training data too closely, the model starts memorizing the training data instead of learning to generalize, which reduces the generalizability of the classifier. The SVM aims mainly to separate the classes in a training set through a hyperplane that has the maximum margin between the classes; in other words, the SVM maximizes generalizability. The SVM solution is obtained by minimizing the following objective function:

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i^p \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \qquad (1)$$

where $w \in \mathbb{R}^n$ is the normal vector of the hyperplane, and $b \in \mathbb{R}$ is the offset. Moreover, $\xi_i \geq 0$ is the slack variable that measures the classification error, whereas $C \in \mathbb{R}$ is an error penalty coefficient.
Finally, $p$ is usually either 1 or 2. The problem is handled by introducing the coefficients $\alpha_i$ and $\mu_i$ in a Lagrangian formulation. The following Lagrangian is then minimized, subject to $\mu_i, \alpha_i \geq 0$:

$$L_P = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_i \mu_i \xi_i. \qquad (2)$$

Under the Karush-Kuhn-Tucker (KKT) conditions, the following is obtained:

$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0, \qquad C - \alpha_i - \mu_i = 0. \qquad (3)$$

Therefore, the estimation function can be defined as

$$f(x) = \operatorname{sign}\left( \sum_i \alpha_i y_i (x_i \cdot x) + b \right). \qquad (4)$$

Usually, $x_i$ is mapped onto a higher-dimensional space (a feature space) through a nonlinear mapping $\Phi(\cdot)$ to improve the discriminative power of the SVM. The core of the SVM is the kernel function, defined as $K(\mathbf{a}, \mathbf{b}) = \Phi(\mathbf{a}) \cdot \Phi(\mathbf{b})$, in which the boldfaced parameters denote a matrix or a vector, and the kernel dimensionality refers to the feature-space dimensionality. Well-known kernels include the polynomial kernel (with a finite-dimensional feature space) and the Gaussian kernel (with an infinite-dimensional feature space). Equation (4) is rewritten as

$$f(x) = \operatorname{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right). \qquad (5)$$

After $L_P$ is minimized, some of the $\alpha_i$ parameters (in fact, most of them in practical applications) are equal to zero. The samples with nonzero $\alpha_i$ are called the support vectors (SVs), on which the SVM solution depends [27].
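To make the formulation concrete, the following minimal sketch (our illustration, not taken from the reviewed papers) trains a soft-margin SVM with scikit-learn and counts the training samples that survive as support vectors; the synthetic dataset and the parameter values are arbitrary assumptions.

```python
# Minimal sketch: training a soft-margin SVM and counting support vectors.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# C is the error penalty coefficient of Eq. (1); the RBF kernel plays the
# role of K(a, b) = Phi(a) . Phi(b) in Eq. (5).
clf = SVC(C=1.0, kernel="rbf").fit(X, y)

# Only samples with nonzero alpha_i remain in the solution as SVs.
print("support vectors:", clf.n_support_.sum(), "of", len(X), "samples")
```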

RQ1) What is the motivation for SVM compatibility with big data properties?
Despite their many advantages, SVMs have particular weaknesses, including high complexity, parameter selection, difficulty with online and multiclass classification, and poor efficiency on unbalanced datasets [25], [28]-[32]. The advent of big data has challenged many machine learning methods [33]. The most important challenge is coping with the massive volume and high generation speed of data, which many learning algorithms cannot handle. The main flaw of an SVM is arguably its high computational cost in large-scale data classification. Since training an SVM requires solving a quadratic programming problem, the training process becomes slow and requires a massive amount of memory. The traditional SVM acts as a closed system: its parameters are frozen after the training process ends. This strategy makes the SVM incompatible with online learning environments [34]. In some real-world problems, data might be made available sequentially rather than completely all at once. Online learning is an important machine learning problem with appealing theoretical characteristics and an interesting model. With the advent of big data and its recent applications (e.g., bioinformatics and personalized medicine), online learning has drawn a great deal of attention. Although online learning has been successful in many applications, limitations such as the low efficiency of online learning, the curse of dimensionality, the instability of test accuracy, and the effects of noisy data still need to be addressed [35]. Given its considerable advantages in conventional learning, the SVM is expected to be a primary option for online learning in order to achieve the best performance [36].

RQ2) What are the existing SVM-based methods and techniques that support large-scale data?

 Reduce samples
This approach aims to select a subset of training samples to scale down the training data before the SVM training process [37]. The training dataset size is decreased by eliminating the samples that play no role in defining the separating hyperplane. Simple random sampling (SRS) is among the first techniques for scaling down the training set by selecting a subset of training samples to train the SVM. The Reduced SVM algorithm was proposed in [37]; it uses SRS and employs a subset containing 1-10% of the training samples. SRS has a low computational cost and is considered an appropriate selection scheme in terms of several statistical criteria; however, the standard deviation of the classification accuracy is often large with this technique [38].
SRS was employed in [38]-[40] to select a subset of training samples. In [41]-[43], sampling was performed by assigning a selection probability to every sample and training the SVM with the selected samples; the probabilities were then updated so that falsely classified samples had a higher chance of selection, and this process was iterated several times. The Systematic Sampling Reduced SVM algorithm was proposed in [44], [45] to select informative data points for creating the reduced set.
Unlike the reduced SVM [37], which employs a random selection scheme, this method starts with a small set. Some of the falsely classified points are then added iteratively to the reduced set based on the linear classifier. This process continues until the validation set accuracy is high enough.
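As a concrete illustration of the SRS idea, the following sketch trains the SVM on a small random fraction of the data; it is a simplification under our own assumptions, not the exact algorithm of [37].

```python
# Sketch of simple random sampling (SRS): train on a 1-10% random subset.
import numpy as np
from sklearn.svm import SVC

def srs_reduced_svm(X, y, fraction=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = max(2, int(fraction * len(X)))
    idx = rng.choice(len(X), size=n, replace=False)  # uniform sample
    # Assumes the subset still contains both classes; in practice a
    # stratified draw would guarantee this.
    return SVC(kernel="rbf").fit(X[idx], y[idx])
```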
Other methods of dataset reduction use the distance between samples and the optimal hyperplane through metrics such as the Euclidean distance [46], the Hausdorff distance [47], [48], and the Mahalanobis distance [49]. These methods try to select the samples that are closer to the opposite class and thus have a higher chance of being selected as support vectors.
Another class includes data size reduction techniques that use active learning [50]-[52]. A set of training samples was employed in [53] to increase the classifier scalability; the samples were selected heuristically.
Other techniques include clustering-based sample reduction. In [54], the core vector machine (CVM) was proposed to select a core set by solving a minimum enclosing ball (MEB) problem. Nevertheless, the CVM provides appropriate solutions only when the number of support vectors is small relative to the training set; otherwise, training will be very time-consuming. In [55], an extension of the CVM [54] was proposed that keeps the ball radius constant, so that it does not have to be minimized. The cluster-based SVM (CB-SVM) was proposed in [56] to manage massive datasets. The CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire dataset only once to train an SVM on high-quality samples carrying the statistical summaries of the data. The cluster-SVM algorithm was introduced in [57]; it first divides the training data into several clusters and then uses the cluster representatives to train the SVM in order to approximately identify SVs and non-SVs. The support cluster machine (SCM) algorithm was proposed in [58]; it adopts a compatible kernel, the probability product kernel, which can measure similarity not only between clusters in the training phase but also between a cluster and a vector in the test phase. The clustering reduced SVM algorithm was introduced in [59] by developing the SVM model through the RBF kernel; in this method, a clustering algorithm is employed to create the cluster centroids of every class in order to build the working set of the reduced SVM. Like most clustering-based methods, the method reviewed in [60] tries to find the boundary points between two classes, which are the most qualified samples for training; in this method, the dynamically growing self-organizing tree algorithm is used for clustering. The MEB was used in [61] to propose a classification method that divides the training data into clusters and uses the cluster centroids for classification; classification is then performed using either the clusters whose centroids are SVs or clusters containing different classes, and most of the data are eliminated in the second step. A combination of the SVM and K-mode clustering (KMO-SVM) was proposed in [62] to classify big datasets. The C²LSVM algorithm was proposed in [63] by developing the local SVM algorithm based on cooperative clustering. Other algorithms can be found in [64], [65].
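The following hedged sketch shows the common pattern behind these methods in its simplest form (our simplification, not a specific algorithm from [54]-[65]): cluster each class and train on the centroids, so the SVM sees a drastically reduced training set.

```python
# Sketch: per-class k-means centroids replace the raw training samples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def centroid_svm(X, y, clusters_per_class=20, seed=0):
    Xc, yc = [], []
    for label in np.unique(y):
        km = KMeans(n_clusters=clusters_per_class, n_init=10,
                    random_state=seed).fit(X[y == label])
        Xc.append(km.cluster_centers_)        # class summarized by centroids
        yc.append(np.full(clusters_per_class, label))
    return SVC(kernel="rbf").fit(np.vstack(Xc), np.concatenate(yc))
```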
Some techniques use the neighborhood characteristics of SVs to reduce the training dataset size.
For instance, the neighborhood entropy was employed in [66], whereas only the samples existing around the decision boundary were used in [67]. A similar procedure with fuzzy C-means clustering was employed in [68] to select samples based on the class distribution. Clustering-based SVM training was used in [69] to eliminate the data points that were farther from the SVs: the data points in the inner layer of a cluster were considered non-SV points and were eliminated, whereas the data points scattered in the external layer were regarded as SV points and were retained.
In this method, the Fisher ratio and the distribution of cluster data points were employed to determine the boundary between the clustered data points and the data points scattered in a cluster.
Basis functions were used in [70] to develop a classifier. Linearly independent vectors were used in [31] for SVM training; moreover, this method was extended in [71] to use the twin SVM instead of the standard SVM.
In [72], an algorithm was proposed to identify and remove the unnecessary SVs that have no effect on the solution. A method was proposed in [73] to select a subset of vectors that operate directly by creating a vocabulary of vectors; however, this formulation is not convex.

 Decomposition

The decomposition methods are based on the fact that the training time can be reduced if only the active constraints of the quadratic programming (QP) problem are considered [66]. A similar idea to active-set methods is employed for optimization in decomposition techniques; however, two sets (a working set and a set of fixed variables) are used, and optimization is performed only on the working set. In the SVM problem, the working set usually consists of the items that violate the KKT conditions. An advantage of the decomposition methods is that their memory requirement grows only linearly with the number of training samples [25]. These methods are sensitive to the selection of the working set because only a fraction of the variables is considered in every iteration; if these elements are not selected accurately, the process will be time-consuming [29], [74]. The convergence of these methods has also been proven [75]. Chunking is among the earliest decomposition methods [76]. It obtains the maximum margin from a number of samples and then creates a new chunk with the SVs of the previous solution and some of the new samples. Sequential minimal optimization (SMO) was proposed in [77] to convert the main QP problem into a series of smaller QP problems, each of which optimizes only a subset of size 2; this algorithm is faster than the chunking algorithm. Using the boost SMO algorithm [78], Platt's SMO algorithm can be made even faster. LIBSVM [79] is an SMO-based algorithm with advanced improvements in the working set selection mechanism through the second-order information method previously proposed in [80]. SVM Light is another advanced decomposition method, proposed in [81]. A parallel optimization phase was introduced in [82], which uses a block-diagonal approximation of the original kernel matrix to divide the SVM classification into hundreds of sub-problems. A superior recursive and computational mechanism was proposed in [83] as an adaptive partitioning technique operating on a piecewise linear decision function; this method uses a non-Gaussian criterion for extracting the subclasses of the data and a new formulation of the optimization problem obtained from the subclass information. In [84], the Cluster SVM (CSVM) algorithm was introduced, which handles the data in a divide-and-conquer fashion: it groups the data into several clusters, and a linear SVM is trained in each of them. A local classification method was proposed in [85] to achieve efficiency based on coding the boundary anchor points; the LLBAP (Locally Linear SVM model based on Boundary Anchor Points) divides the linearly inseparable data into nearly separable segments by scanning the boundary points and local coding, and a linear SVM is then used in each segment of the data to solve the problem. The subclass reduced SVM (SRS-SVM) algorithm was proposed in [86] to exploit the subclass structure of the data to effectively estimate the set of candidate SVs; since the SVs account for only a fraction of the training set cardinality, the training set shrinks with no change in the decision boundary. This method depends on domain knowledge of the input data, i.e., the number of subclasses. The hierarchical SRS-SVM algorithm, a hierarchical and improved variant of SRS-SVM, was proposed to decrease the dependency on domain knowledge and the sensitivity to the subclass parameter. Since both methods divide the original optimization problem into several optimization sub-problems, both can be parallelized. The SVMTorch algorithm [87] used the working set and shrinking to improve the training time in regression problems. The Pegasos algorithm [88], basically a stochastic sub-gradient descent optimization algorithm, employed the decomposition technique to reduce the training time.

 Parallelization

Some other techniques use parallelization or distributed computation to handle large-scale data. It is difficult to implement a QP problem in a parallel and distributed manner, for there is great dependency between the data [30]. Most parallel SVM training methods divide the training set into independent subsets for training the SVM on different processors. A method was proposed in [82] to estimate the SVM kernel matrix through a block of diagonal matrices and convert the original optimization problem into hundreds of sub-problems, which can easily be solved through parallelization methods. In [89], the training set was divided into m random subsets, which were then trained separately. In [6], the data were divided into several subsets on which the optimization operations were performed; the results of each subset were then combined and filtered in a cascade of SVMs. This method can be distributed over multiple processors with minimal communication overhead because the matrices are small and require little memory. A distributed incremental algorithm was proposed in [90] for nonlinear kernels; in this method, the LS-SVM [91] was employed to develop the algorithm. Moreover, a distributed (as opposed to parallelized) SVM algorithm was proposed in [92]. It was assumed that the training data followed the same distribution and were stored locally in different places that could be processed. For this purpose, two approaches were employed: the first benefited from a distributed naïve chunking technique in which SVs were exchanged, whereas the second used a distributed semi-parametric SVM to reduce the interactions between machines for privacy retention. In [93], the distributed parallel SVM (DPSVM) algorithm was proposed in a configurable network environment for distributed data mining; the main idea is that SVs should be exchanged in a strongly connected network so that several servers can work simultaneously on distributed datasets at a limited communication cost but a high speed. A parallel SVM algorithm was proposed in [94], and the authors claimed that it increased the SVM speed unprecedentedly. This algorithm decreases the SVM time complexity to O(np/m), in which p indicates the dimensionality of the reduced matrix after factorization (p is much smaller than n), whereas m refers to the number of machines used. The distributed parallel SVM algorithm was proposed in [90] to distribute data among different nodes that could not communicate with a central processing unit (due to communication complexity, scalability, or privacy). The MapReduce SMO algorithm was proposed in [95]; it divides the training set into m random subsets of the same size, and each partition is assigned to a Map task. Each Map function is trained on one partition, and its output is a partial weight vector and a value of b; finally, the algorithm computes the total weight vector and the mean of b. The resource-aware SMO (RASMO) algorithm was introduced in [96] to optimize the SVM in a parallel framework based on the MapReduce model by dividing the training set into smaller partitions and using a cluster. In this algorithm, the load balancing scheme is based on a genetic algorithm designed to optimize the algorithm performance in heterogeneous environments. The paper [97] adopted a procedure similar to [96]. A parallel SVM algorithm was proposed in [98] to divide the dataset into K clusters using the K-means algorithm and then build a nonlinear SVM model on the local data; after that, it attaches the SVM decision functions to the terminal nodes of a decision tree. A combination of the SVM and a decision tree was proposed as the semi-supervised SVM (S³VM) in [99] for labeled and unlabeled data in a network of interconnected agents, in which the data were distributed across machines.
In this method, communications are limited to the neighboring agents, and there is no central coordination point. The distributed gradient descent algorithm and the NEXT framework were employed to find a solution based on successive convex approximations of the main problem. A distributed online semi-supervised algorithm was proposed in [100] that uses a series of anchor points selected adaptively through an online strategy. This method benefits from a random sparse mapping to approximate the kernel feature map; the mapping can estimate the model parameters without transferring the original data between neighbors (which addresses privacy protection concerns). This algorithm is efficient in cases where data have limited labels. Other parallel implementations of the SVM can be found in [101]-[104].
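To make the divide-and-combine pattern concrete, here is a rough single-machine sketch in the spirit of the cascade approach of [6]; it shows one filtering layer only, and the subset count and kernel choice are our assumptions, not those of the original paper.

```python
# Sketch of one cascade layer: solve sub-problems, keep only their SVs,
# and retrain a final SVM on the union of the surviving samples.
import numpy as np
from sklearn.svm import SVC

def cascade_svm(X, y, n_subsets=4, seed=0):
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_subsets)
    sv_idx = []
    for p in parts:                        # independent sub-problems
        svm = SVC(kernel="rbf").fit(X[p], y[p])
        sv_idx.extend(p[svm.support_])     # keep each subset's SVs
    keep = np.array(sv_idx)                # combine and filter
    return SVC(kernel="rbf").fit(X[keep], y[keep])
```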
 Improving Solvers

Some of the SVM methods aim to accelerate the training process by improving the solver. A few of these techniques improve the SVM training time at the cost of a level of accuracy [74]. These methods make some changes to the original QP formulation. An instance of these methods is the least squares SVM (LS-SVM) [91]; this classifier transforms the objective function into a system of linear equations by changing the initial formulation. The proximal SVM, introduced in [105], acts like the LS-SVM.
In [59], a method was proposed to reduce the number of necessary SVs; the reduction process repeatedly selects and merges the two nearest SVs belonging to the same class. In [106], the classification sparsity is controlled by adding a new constraint to the SVM, and a method for solving the formulated problem is proposed. Basically, the proposed approach finds a subspace that can be spanned by a small number of vectors and separates the different classes of data linearly.
In [107], a new method was developed to solve the linear SVM with the L2 loss function using the modified finite Newton method, which suits large-scale data mining tasks such as text classification. In [89], a cutting-plane algorithm was proposed to train a linear SVM; the proposed algorithm can perform classification in O(sn) time, in which s indicates the number of nonzero features. According to the authors, this method operates many times faster than SVM Light on big data. In [108], a new method called dual coordinate descent was proposed for the linear SVM with L1 and L2 loss functions; the authors claimed that the proposed method was faster than other solvers such as Pegasos, TRON, SVM-Perf, and a primal coordinate descent implementation.
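For intuition about such linear solvers, the following is a minimal Pegasos-style stochastic sub-gradient sketch for the L1 (hinge) loss. It is a simplification (the optional projection step is omitted) and assumes labels in {-1, +1}; it is not the exact algorithm of any single paper cited above.

```python
# Sketch of a Pegasos-style stochastic sub-gradient solver for a linear SVM.
import numpy as np

def pegasos(X, y, lam=0.01, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            if y[i] * X[i].dot(w) < 1:            # hinge loss is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                                 # only shrink (regularize)
                w = (1 - eta * lam) * w
    return w
```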
In [72], an algorithm was proposed to identify and eliminate unnecessary SVs without changing the solution. In [109], a locally linear SVM was proposed with a smooth boundary and bounded curvature, whose classifier solution is obtained through local coding; in addition, stochastic gradient descent can be adopted to optimize the model online with the same convergence guarantee. In [110], the H-SVM algorithm was proposed based on an oblique decision tree, in which the nodes are split based on a linear SVM to reduce both the training and test time (reducing the test time is necessary for purposes such as fraud detection and intrusion detection); this method can also be parallelized. In [61], the PWL-SVM algorithm was proposed to implement a piecewise-linear SVM method through piecewise feature mapping. In [111], a method was developed to learn a piecewise-linear structure in a multiclass scenario based on the latent SVM formulation. In [112], a new version of the Frank-Wolfe (FW) algorithm was proposed to accelerate the convergence of the basic FW procedure, in which the formulation focuses on maximizing a concave objective. The HMLSVM (Hierarchical Mixing Linear Support Vector Machines) algorithm, which has a hierarchical structure with a linear SVM combination in every node, was proposed in [113].
The geometric methods of SVM are based on computing the optimal separating hyperplane from the nearest points of the convex hulls [114]. In addition to heuristic methods, alpha seeding was proposed in [115], [116] to estimate the initial values of the α coefficients for starting the QP problem. In [117], a decision tree was employed to propose an SVM decision boundary approximation method; in addition, the SVM was integrated with decision trees to develop novel methods in [118], [119]. Laying the foundation for online learning, the incremental learning methods can also provide learning scalability. Since the incremental learning techniques are employed to apply online incremental learning to the model, they are reviewed in the next subsection.

RQ3) What are the existing SVM-based methods and techniques that support online learning?
Many attempts have so far been made to apply online and incremental learning to the SVM. They can be categorized into three groups [71]:
1) Unbounded: these methods make no attempt to reduce the solution, and the solution size grows with the number of input samples.
2) Amender: these methods try to prevent the growth of the problem solution by decreasing the current problem dimensions.
3) Preventive: these methods seek to prevent a new sample from being added to the problem solution if the new sample has no effect on the solution.

 Unbounded methods for online and incremental learning

In [120]-[122], a method was proposed for SVM compatibility with incremental learning by performing the learning process through batches of data based on the batch SVM. In [123], online learning was considered with reproducing kernels in a Hilbert space; the classic SGD was used in a feature space along with certain techniques to develop a simple but efficient algorithm for classification and regression problems. In [124], a similar algorithm was proposed using a novel implicit updating technique. An incremental algorithm was also proposed in [5], where the authors claimed that it accelerates the incremental SVM by a factor of 5 to 20; the acceleration is based on reusing stored numerical computations. In [90], a distributed parallel incremental algorithm was developed based on the LS-SVM for big data classification. In [125], the Lagrangian SVM (LSVM) algorithm was introduced as an improved version of the linear SVM algorithm; the solution was obtained from an iterative scheme with linear convergence by reusing previous calculations. The LSVM looks for a hyperplane in an (n+1)-dimensional space instead of an n-dimensional space. The SMO was employed in [126] to propose an algorithm that comes closer to the exact SVM solution and also provides online learning; in this method, the working set is replaced in the SMO phases to improve speed and accuracy significantly, and a second-order greedy working set selection strategy is applied in every step to increase progress. In [127], an online incremental SVM algorithm was proposed for large-scale data using LPs (learning prototypes) and LSVs (learning support vectors): the LPs learn the prototypes, whereas the LSVs integrate the LPs with previous SVs to create a new SVM. An online semi-supervised algorithm was proposed in [100] using a series of anchor points selected adaptively through an online strategy; this algorithm is efficient in cases with a limited number of labeled data. Based on the LSVM, an online robust algorithm was proposed in [128] that makes a simple change to the kernel matrix. This algorithm is useful when the new data points might be contaminated (i.e., the label is changed by an attacker); since an online learning model is more sensitive to contamination, it is affected to a greater extent when exposed to a contaminated sample, and the algorithm is therefore made robust to deal with such cases.
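As a baseline illustration of the one-sample-at-a-time setting, the following sketch uses a linear hinge-loss model via scikit-learn; it is our simplified stand-in, not one of the kernel algorithms reviewed above.

```python
# Sketch: online learning loop with a hinge-loss (linear SVM) objective.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = SGDClassifier(loss="hinge")          # hinge loss = linear SVM
for i in range(len(X)):                    # samples arrive one at a time
    clf.partial_fit(X[i:i + 1], y[i:i + 1], classes=[0, 1])
```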

 Amender Methods
In [129], a recursive online algorithm was proposed that presents one sample at a time. The learning method is reversible; therefore, unlearning is also possible. A multiple-sample version of this method was proposed in [130] to reduce the training time and to learn or unlearn multiple samples. In [131], an incremental SVM was proposed that can learn and unlearn single or multiple samples and adapt the current SVM by changing the configuration and kernel parameters. In [132], an incremental algorithm called the I-SVM was proposed to accelerate the training process and reduce the need for storage by discarding some historical samples. In [133], an online SVM was proposed that reduces the number of previous samples used for prediction to minimize the need for memory. The previous algorithm was extended in [134]; the extension is also robust to the presence of noisy samples and can be used in batch settings. In [135], the LASVM algorithm was introduced to accelerate the training process by selecting active samples. This method benefits from the pairwise optimization principle for online learning. It defines two operations called PROCESS and REPROCESS: PROCESS tries to enter the new sample into the current set of SVs, whereas REPROCESS aims to reduce the number of SVs. In online iterations, the algorithm alternates between single executions of these two operations. In [136], an incremental algorithm was proposed for the SVM by dividing the dataset into several segments, each of which was used to select training samples through the K-means algorithm.
After that, every sample was given a weight in the active query based on the sample coefficient and its distance from the hyperplane, and a criterion was developed to eliminate non-informative training samples incrementally. In [137], a TS-type fuzzy classifier called the ISVM-FC was proposed.
This classifier can be employed when data arrive sequentially. Initially, there is no fuzzy rule for learning the structure through the ISVM-FC; the rules are generated with respect to the distribution of the training data. The ISVM is employed to tune the rule parameters to improve the classifier generalizability. Incremental learning can also be utilized to exclude previous training data based on their distance from the hyperplane in order to improve the classifier efficiency. In [138], a single-class incremental classifier was proposed. Single-class classification is among the most challenging areas of machine learning in specific domains such as medical analysis. The proposed algorithm, called the ICOSVM (Incremental Covariance-guided One-Class Support Vector Machine), improves the classifier efficiency by relying on the low-variance directions. When a new sample arrives, the previous SVs are checked; this method can be implemented on large-scale stream data.

 Preventive Methods
In [139], the OSVC (online support vector classifier) algorithm was proposed for sequential data.
In this method, the SVM is first trained with the existing data to determine the decision boundary.
If a new sample is classified correctly when it arrives, it is not an SV and is therefore ignored; if the new sample violates the KKT conditions, it is considered an SV, and the decision boundary is updated. In [140], the concept of the span of SVs, proposed by Vapnik, was used to develop a classifier; in addition to meeting spatial and temporal constraints, it yields acceptable performance. The idea of using the span is that it directly affects the generalization error bound. In this method, there is a constraint on the number of SVs: every sample that is classified wrongly is verified for inclusion, after which a previous SV is excluded. The memory is free of SVs at first, and in the SV set, points are replaced if they yield the maximum span reduction. In [31], an online algorithm called OISVM (Online Independent Support Vector Machines) was proposed; it converges approximately to the ideal SVM solution. Like other kernel-based algorithms, OISVM generates a hypothesis from the samples observed so far; these samples are called the base. New samples are added to the base only if they are linearly independent of the current base set in the feature space. The method introduced in [71] is an extended version of OISVM that uses the twin SVM for training and improves the time complexity of the algorithm. In most online SVMs, the classifier is retrained with the previous SVs and the newly arrived samples; the maintained system usually loses performance due to the loss of much information about the data. To solve this problem, an online algorithm was proposed in [34]. The proposed algorithm includes a representative prototype area (RPA) that represents the previous historical data; in the RPA, every class is retained by an online incremental feature mapping that automatically includes a sample set of the stream data. In [35], an accelerated model was proposed for online SVM learning based on a window over the KKT conditions. It is not an independent online algorithm but can be considered a facilitator for other online algorithms. This algorithm develops an SVM working set with a fixed window including the samples that violate the KKT conditions. As a result, not only does the model create training samples of the same size, but it is also guaranteed that the samples are useful for updating the hyperplane and can mitigate the noise effect by adjusting the KKT window and penalty coefficients.
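A simplified sketch of the preventive rule follows, in the spirit of the OSVC check described above; full retraining on each update is our simplification, and labels are assumed to be in {-1, +1}.

```python
# Sketch: ignore new samples that satisfy the margin; retrain only on
# samples that violate the KKT conditions (margin < 1).
import numpy as np
from sklearn.svm import SVC

class PreventiveSVM:
    def fit(self, X, y):
        self.X, self.y = X, y
        self.svm = SVC(kernel="linear").fit(X, y)
        return self

    def update(self, x, label):            # one new sample arrives
        margin = label * self.svm.decision_function(x.reshape(1, -1))[0]
        if margin >= 1:                    # correctly classified, outside
            return                         # the margin: cannot be an SV
        self.X = np.vstack([self.X, x])    # KKT violator: keep it and
        self.y = np.append(self.y, label)  # update the decision boundary
        self.svm = SVC(kernel="linear").fit(self.X, self.y)
```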

RQ4) What are the necessary parameters for comparing big data classifications in volume and time complexity?

According to the literature review, the common performance evaluation criteria include accuracy, CPU time, and the number of support vectors, described below.

Classification Accuracy:
This criterion is the proportion of correctly classified samples and is employed to evaluate the performance of the classification algorithm (Table 8).

Number of Support Vectors:
This criterion is the mean number of support vectors employed to develop the hyperplane.

CPU Time:
This criterion is the mean time required to create the hyperplane.
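A short sketch of how these three criteria can be collected in practice is given below; the dataset, model, and parameters are placeholders of our own choosing.

```python
# Sketch: measuring accuracy, support vector count, and CPU training time.
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

start = time.process_time()                # CPU time, not wall-clock
clf = SVC(kernel="rbf").fit(Xtr, ytr)
cpu_time = time.process_time() - start

print("accuracy:", clf.score(Xte, yte))
print("support vectors:", clf.n_support_.sum())
print("CPU time (s):", cpu_time)
```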

RQ5) What are the shortcomings and problems of the existing methods and techniques?
What are the future research areas?
Big data are characterized mainly by volume, velocity, and variety. A characteristic of big data is the massive size of data produced heterogeneously and in high dimensions. Information collectors use specific schemes and structures to record data, whereas different applications lead to the presentation of different data. In such conditions, heterogeneous features and data of different dimensions result in different representations. This problem is a serious challenge to the collection and combination of data from different sources [141], [142]. The next problem is the presence of independent sources with decentralized distribution and control, which is a major challenge to the applications of big data. Every source is independently able to generate and collect information without relying on a central control framework. In addition, the complexity of data relationships intensifies as big data expand; therefore, it is necessary to consider the complicated (nonlinear and many-to-many) relationships of data along with their ongoing changes in order to extract useful models from big datasets [141]. Machine learning provides great potential and constitutes a necessary component for big data analysis [143]. The advent of big data has created many opportunities for machine learning; at the same time, big data features have challenged machine learning. The challenges posed to machine learning by big data include high dimensionality, model scalability, distributed computation, data streams, and adaptability.
In such conditions, classification algorithms must change effectively to encounter and classify big data. The first problem is the variety of data. Most classification algorithms can work with a specific type of data, whereas big data might include structured, semi-structured, and unstructured data. Hence, data should be transformed into an appropriate form for learning in an initial preprocessing step. For this purpose, it is necessary to perform the right preprocessing operations with respect to the input data; these operations might include feature selection, sample reduction, noise elimination, and data annotation. Therefore, the existing methods of data preprocessing should be evaluated with respect to their applications and then adapted to meet the need for desirable changes. For instance, the annotation of unstructured data can be employed in the classification process.
The next problem includes the massive size of data and the high speed of data generation; in other words, data are continuously generated at a high rate. Therefore, the proposed learning algorithms should be both scalable and compatible with online learning conditions. Processing such massive data requires the use of parallel or distributed computation, for which the MapReduce [144] and Spark [145] frameworks can be used. Google MapReduce provides an efficient and effective framework for the parallel processing of big data. Moreover, the Google File System (GFS) [146] can be used along with this framework to provide distributed, reliable, and efficient data storage for big datasets. Furthermore, MapReduce provides an abstraction over problems such as distributed programming and supports effective and efficient parallelization. MapReduce implementations handle concerns such as load balancing, network throughput, and fault tolerance. The Apache Hadoop [147] project is among the most popular and widely used open-source implementations of MapReduce, programmed in Java. The MapReduce model includes two separate steps called Map and Reduce: Map is an initial transformation step in which the input records are processed in parallel, whereas Reduce is a summarization step in which all of the related records are processed together [97]. Spark is a cluster computing technology implemented for rapid computation; based on MapReduce, Spark extends this model to support further types of computation. Spark is mainly characterized by in-memory cluster processing, which accelerates the processing speed.
In fact, Spark supports a wide range of applications, including batch applications, iterative algorithms, interactive queries, and data streams. Moreover, Spark mitigates the difficulty of managing different maintenance tools. MLlib [148] is a component of Spark; it is an open-source distributed library for machine learning. This library provides functions for a wide range of learning applications and includes basic statistical primitives, optimization, and linear algebra. MLlib, together with Spark, supports several languages, including Java, Scala, Python, and R, and provides high-level APIs.
An important feature of Spark is the reuse of a working set of data among several parallel operations, which is necessary in many machine learning algorithms and interactive data analysis tools. In addition to supporting such applications, Spark maintains the scalability and fault tolerance of MapReduce. For this purpose, Spark introduces an abstraction called resilient distributed datasets (RDDs): a read-only collection of objects partitioned among a set of machines that can be rebuilt if a partition is lost. In machine learning applications, Spark can operate 10 to 100 times faster than Hadoop.
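As an illustration of this stack, the following hedged sketch trains a distributed linear SVM with MLlib's LinearSVC; the HDFS path is a placeholder, and the LIBSVM input format is our assumption.

```python
# Sketch: distributed linear SVM training on Spark with MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.appName("svm-big-data").getOrCreate()

# LIBSVM-format input yields a DataFrame with "label"/"features" columns.
data = spark.read.format("libsvm").load("hdfs:///path/to/train.libsvm")

svm = LinearSVC(maxIter=50, regParam=0.01)
model = svm.fit(data)                      # training runs on the cluster
model.transform(data).select("label", "prediction").show(5)
```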
The next problem is the large number of samples, not all of which might be necessary for the learning process; it might also be impossible to process all samples. Hence, dataset reduction methods, such as sample selection, and preventive online learning methods can be useful. Given the high speed of data generation, using unbounded and amender online learning methods will result in high costs in time and space; therefore, it appears more appropriate to use preventive online learning methods.

Conclusion
This study is an SLR of support vector machines in the era of big data. It includes three phases: planning, conducting, and documenting the review. An SLR identifies, classifies, and synthesizes a comparative overview of state-of-the-art research and transfers knowledge within the research community. This study aimed to systematically identify and classify the existing SVM methods by considering two features of big data, i.e., massive size and high speed of generation, and to present future research areas.
Considering its advantages, the SVM can be among the first options for compatibility with big data and classification of big data. For this purpose, appropriate techniques should be developed for data preprocessing in order to convert data into an appropriate form for learning. The existing frameworks for parallel and distributed processing should also be employed so that SVMs can be made scalable and properly online to handle big data. Moreover, data reduction methods, such as sample selection, as well as preventive online learning methods can be useful. Due to the high speed of data production, the use of unbounded and amender online learning methods incurs high time and space costs, so the use of preventive online learning methods seems more appropriate for this purpose.

Declaration
Funding: No funding was granted for this research.
Conflicts of interest/Competing interests: No conflict of interests.


Figure 2: Search summary and selection process of preliminary studies

Figure 4: Number of papers submitted for SVM scalability

Figure 5: Type of papers submitted for SVM scalability

Figure 6: Applying online learning to the SVM

Figure 7: Number of articles submitted for online and incremental learning

Figure 8: Type of articles for incremental and online learning

Figure 9: Type of articles

Table 1: Special terms in machine learning

Table 2: Research questions

Table 3: Scope and goals of the SLR criteria (PICOC)

Table 6: Number of methods for scaling

Table 7: Number of articles for incremental and online learning