1 Introduction

With the rapid improvement in industrial operations and technology, the internet of things, smart gadgets, and social media, digital data has grown in volume and complexity at a rapid rate [1,2,3,4]. Big data is a collection of records with a large volume and a rate of exponential growth over time [5]. For example, according to published reports, the New York Stock Exchange generates over 1 TB of new data per day, social media platforms create more than 500 TB, and a jet engine produces around 10 TB of data in only 30 minutes. The International Data Corporation tracks how much data is generated throughout the world; as seen in Fig. 1 [6], the volume of data is rising exponentially year after year. If current trends continue, data science agencies project that there will be around 78 zettabytes of digital data by 2030 [7].

Fig. 1 Worldwide growth of data

Due to the quantity and nature of big data, traditional approaches can no longer cope. In addition, conventional data-analysis techniques cannot handle vast amounts of data in real time. As the amount of dynamic data being processed grows every day, researchers are attempting to create new algorithms that surpass traditional approaches entirely. The objective is to build a computing system that processes and analyzes large-scale data rapidly and economically. Such architectures need sophisticated computing algorithms capable of processing and monitoring data concurrently. Apache Spark is one of the most robust frameworks for managing and successfully handling real-time streaming of large data. The framework addresses the reliability, speed, and volume problems of big data analytics and delivers exceptional quality and resource-pooling power [8], as shown in Figs. 2 and 3.

Apache Spark is a cluster computing framework that supports Java, R, Scala, and Python [9]. It is built on top of the Hadoop platform and operates in the background to deliver trustworthy, consistent, and high-speed computing on huge data [10,11,12]. PySpark ML-lib is developed on top of Spark to exploit Spark's advantages, such as its in-memory columnar data format, which enables data transfer between the Java Virtual Machine (JVM) and Python processes at a low cost. Python users who work with pandas and NumPy data will benefit from it, and PySpark becomes even more useful in conjunction with other Python libraries with only minor setup adjustments. In addition, the machine learning functions and syntax are comparable to those of a plain Python script. This advances in-memory programming approaches and top-level data processing frameworks [13, 14]. PySpark frameworks have been applied successfully in stochastic modeling [15], healthcare informatics [16], text mining [17], genomics data analysis [18, 19], etc.
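As a brief illustration of this low-cost JVM-to-Python exchange (our own sketch, assuming Spark 3.x with PyArrow installed; the configuration key below is an assumption, not a detail reported here), a pandas DataFrame can be moved to and from a Spark DataFrame as follows:

```python
# Enable Arrow-backed columnar transfers and round-trip a pandas DataFrame.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-interop-demo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Spark 3.x key

pdf = pd.DataFrame({"feature": [0.1, 0.5, 0.9], "label": [0, 1, 1]})
sdf = spark.createDataFrame(pdf)   # pandas -> Spark via Arrow
back = sdf.toPandas()              # Spark -> pandas via Arrow
print(back.head())
```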

Sci-kit-learn also offers an easy-to-use and reliable Python library for large data analysis. It builds on Python computational and scientific libraries such as NumPy and SciPy, and supports techniques such as support vector machines, random forests, and k-nearest neighbors [20]. Rapid Miner likewise provides advanced data processing, machine learning, deep learning, text mining, and predictive analytics applications. It covers all facets of the machine learning process, including data preparation, visualization of outcomes, model evaluation, and optimization [21]. It is used in business, industry, science, and education.

This study investigates the impact of data volume on computational capability. Based on the findings, the study identifies potential big data machine learning-oriented prototype layouts and their benefits. This paper aims to show PySpark ML-lib and its capability for applying machine learning to big data from the computational aspect.

We highlight the advantages of such a highly scalable machine learning architecture by rigorous testing and comparison to other frameworks such as SK-learn and Rapid Miner. The following is a quick summary of our major contributions:

  • Although machine learning and its scientific and business applications have been studied over the last two decades, research on big data machine learning has been limited. We evaluate Apache Spark ML-lib for big data learning analysis over millions of data records. The main aim of the paper is to present the machine learning community with a fascinating yet complex topic that allows the big data community to move along a new and exciting path in the rapidly expanding industrial application of machine learning.

  • We perform several real-world big data trials to examine a number of qualitative and quantitative facets of PySpark ML-lib. A comparative analysis is also carried out with the Rapid Miner API, MATLAB, and the SK-learn library, which is widely used in science. We evaluate varied traditional models for big data machine learning, covering classification and regression, including logistic/linear regression, gradient-boosted trees, random forest, and decision trees. Furthermore, we evaluate unsupervised machine learning approaches such as K-means and the Gaussian Mixture Model (GMM).

The rest of the article is organized as follows: Sect. 2 presents the literature review on ML-lib and Apache Spark. Section 3 starts with an overview of big data technologies and a detailed explanation of Apache Spark, Rapid Miner, and SK-learn as powerful big data platforms. Section 4 deals with the methods used in the study, broadly classified as supervised and unsupervised learning. Section 5 presents an explanation of the data sets, the pre-processing involved in preparing them, and the analysis results. Finally, future work and the conclusion are presented in Sect. 6.

2 Related Works

Godson Michael [22] describes an Internet of Things (IoT) setting in which packets are only transmitted over the network for correspondence if both the sender and the recipient are connected to the internet. To accomplish these goals, they propose an architecture that employs open-source technology such as Apache Kafka for real and virtual message processing and Apache Spark for streaming, processing, and structuring real-time and historical data. The stored data is presented in a more appealing and understandable way using the Dashing framework. In [23], the authors focus on Apache Spark's benefits compared with Hadoop MapReduce and on the use of time-series analyses in real-time analysis. The MapReduce approach of reading and writing data on the hard drive and HDFS takes plenty of time and storage, so they use Apache Spark, an open-source project, for faster queries and real-time processing. The paper also studies time series on Hadoop and Spark, which compute and analyze data in real time. In [24], the authors' survey aims to provide a comprehensive overview of numerous optimization strategies for improving Spark's generality and efficiency. They present the Spark programming paradigm and computation method, address the benefits and drawbacks of Spark, and provide a literature review and classification of different solving strategies. Furthermore, they cover numerous data storage and analysis methods, machine learning algorithms, and Spark-supported applications. V. Srinivas [25] explores Spark as a general data analytics tool for distributed clusters based on Hadoop. In addition to map-reduce, it provides in-memory computation to increase storage efficiency and processing speed. Structured data can be processed on Hive and streamed from HDFS, Flume, Twitter, and Kafka; Spark runs over an existing Hadoop cluster and is accessible through the Hadoop file system (HDFS). They also note that the modular nature of the API (based on distributed object sets) makes it easy to customize, and jobs may be grouped and reviewed locally as portable repositories. Mehdi A. [26] analysed how to improve open-source big data machine learning with Spark and drew a comparison between Spark and Weka. In terms of efficiency, speed, and data handling capacity, Spark clearly outperforms Weka, while Weka offers a user-friendly GUI and a large number of built-in algorithms. A. Shoro [27] explored Apache Spark for big data analysis and its capabilities on Twitter's diverse, giant streaming records. After using Twitter data sets for sentiment analysis, they considered Apache Spark to be very successful, but also found that Spark is difficult to use and that its data visualizations should be improved. S. Salloum [28] evaluated various classification data mining strategies based on data from the University of California Machine Learning Repository (UCI). Several data mining strategies were examined and the time complexity was computed for each classification.

3 Big Data Technologies

In recent years, creating Big Data applications has become increasingly important. In fact, a growing number of companies in many sectors depend on knowledge extracted from huge data collections. Traditional data techniques and platforms, by contrast, are less cost-efficient for Big Data: they suffer from slow response times and a lack of scalability, performance, and accuracy. A great deal of work has been undertaken to solve the complex problems of big data, which has led to the emergence of various distribution networks and innovations. Hadoop, Spark, NO-SQL, the Sklearn and Weka libraries, Hive, Cloud, and Rapid Miner technologies are gaining popularity. These technologies are software tools for extracting, managing, and analyzing data from massively complex and large data collections that traditional management tools could never handle [29,30,31,32,33,34]. However, in such a setting, selecting among a variety of technologies may be time-consuming and difficult. Many factors should be taken into account, including technology compatibility, deployment complexity, cost, efficiency, performance, dependability, support, and security issues [35]. We go through the technologies employed in this research study in more detail in the following subsections.

3.1 PySpark ML-lib

Apache Spark is a scalable, fast, in-memory large data processing platform built at the University of California, allowing developers of distributed applications to program in Java, Python, Scala, and R. Apache Spark Streaming, Apache Spark SQL, Apache Spark GraphX, and Apache Spark ML-lib are its four most used libraries [19, 26, 36]. Apache Spark Streaming, built on Spark's scheduling module, performs stream processing in a highly fault-tolerant environment suited to heavy analysis. Apache Spark SQL supports relational queries for mining various database structures through its DataFrame-based query principle [37]. Apache GraphX is a graph processing tool for Apache Spark; it offers distributed programming models for treating two basic data structures, graphs and arrays. Apache Spark ML-lib provides a library of machine-learning algorithms (around 55) that enable parallel processing for a vast range of data analyses. The library covers machine learning tasks such as classification, regression, clustering, and dimensionality reduction [38].
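As a brief, hypothetical illustration of the Spark SQL component mentioned above (our own sketch, not taken from the paper), a DataFrame can be registered as a temporary view and queried relationally:

```python
# Minimal Spark SQL sketch: build a small DataFrame, expose it as a SQL view,
# and run a relational query against it. Table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "signal", 0.92), (2, "background", 0.13), (3, "signal", 0.78)],
    ["id", "event_type", "score"],
)
df.createOrReplaceTempView("events")

high_score = spark.sql("SELECT id, event_type, score FROM events WHERE score > 0.5")
high_score.show()
```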

Many engineering and machine-learning researchers have built on Spark ML-lib to support the big data analysis ecosystem. We look at some of the latest advancements in Apache Spark ML-lib applications in this area. In [39], CVST, a modular open-source framework, was proposed; the framework is intended for producing smart transport applications, and its data-analysis part uses ML-lib to process and feed data to the front end. In [40], an infrastructure of institutional information systems was built to provide resources for evaluating patterns in student data; the recommended framework forecasts and recommends courses for the next half year using Apache Spark ML-lib [41]. bigNN, which can handle extremely large-scale biomedical sentence classification, is another fascinating big data analytics component built with Apache Spark [42].

3.2 Rapid Miner

The first version of Rapid Miner, known as YALE, was developed in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer. Applications that can be carried out in Rapid Miner include statistical modelling, data pre-processing, market analysis, optimisation, and predictive analysis. It provides eight groups of operators: (a) the process control operator offers a collection of settings that are of global importance to the process, such as logging and random number generator initialization; (b) the utility operator is used to insert a process into another process; (c) the repository access operator is used to access the repositories; (d) CSV files are read using the import operator; (e) almost any type of Input Output Object may be written to a file using the export operator; (f) an example set's attributes can be renamed using the data transformation operator; (g) the modeling operator is used for prediction; and (h) the evaluation operator separates the data set into three categories: training, testing, and estimating performance. Rapid Miner Studio is a code-free platform for designing complex analytic procedures that speed up Hadoop cluster computations [43,44,45]. We list the most recent advancements in the big data community using Rapid Miner. The authors of [46] gave a review of existing data mining algorithms for predicting diabetes; according to their analysis, Rapid Miner is mostly used to diagnose diabetes. In [45], Rapid Miner data mining technology was used to classify massive data and analyze people's attitudes and opinions on the Facebook social networking site. The authors of [47] tested decision trees, regression, SVM, k-NN, and Naive Bayes on breast cancer detection using Rapid Miner.

3.3 Sci-Kit Learn

Sci-kit learn is a Python library that incorporates a wide variety of cutting-edge machine learning methods for medium-sized supervised and unsupervised problems. The package focuses on bringing machine learning to non-specialists through a high-level, general-purpose language. Priority is given to ease of use, consistency, documentation, and API reliability. Sci-kit learn works well on big data and supports a variety of algorithms [48]. It has few limitations and is licensed under the BSD license, so it can be used in both academic and commercial settings [49,50,51]. The study in [52] examines hospital health-care systems using machine learning and presents a thorough management strategy, and [53] used SK-learn to create comprehensive graphs of Covid-19 data and forecast machine learning models that categorize gene sequences so that mutations can be detected quickly. A broader study used big data sets to perform textual analysis of hundreds of thousands of legislative speech acts [54].
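As a small sketch (ours, not from the original study) of the Sci-kit learn usage style described above, a random forest classifier can be fitted and evaluated on a toy split as follows:

```python
# Toy Sci-kit learn example: synthetic data, a 70/30 split, a random forest fit,
# and an accuracy score, mirroring the ease-of-use described in the text.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```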

3.4 MATLAB

MATLAB is a programming language for numerical computation that is mostly used by engineers and data analysts. It is a unified, strong platform for dealing with big data [55, 56]. We may choose from a number of toolboxes to expand on the fundamental capabilities that come with it. MATLAB is available on Unix, Macintosh, and Windows systems, as well as on personal computers for students. The fact that MATLAB makes data visualization so simple for the user is its most significant benefit: it enables users to obtain a more natural read on their data instead of depending on a deep understanding of coding or computer science principles. Users may retain strict control over the way their data is displayed through the easy-to-read style in which data is presented both before and after analysis. MATLAB has a significant number of dedicated users, including several institutions and businesses with the financial means to purchase a license [57]. Despite its widespread usage in colleges, MATLAB is simple to learn for novices who are just getting started with programming languages, since the package contains everything needed at purchase time, whereas Python requires additional packages to be installed. Simulink, a fundamental element of the MATLAB package for which there is currently no suitable equivalent in other programming languages, is another distinctive component [58].

In the following sections, we go through the various machine learning algorithms, as well as the datasets and data pre-processing methodologies.

Fig. 2 Big data processing pipeline

4 Methods

Supervised learning is a form of machine learning that trains models using labeled data. In supervised learning, the model finds a mapping function from the input variable (X) to the output variable (Y). Supervised training requires supervision to train the model, much as a pupil learns in the presence of a teacher. Two kinds of problems can be addressed: (a) classification and (b) regression. In practice, supervised learning can be used to evaluate risk, classify images, and detect fraud or spam. Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines are examples of classification algorithms, while Regression Trees, Non-Linear Regression, Bayesian Linear Regression, and Polynomial Regression are well-known regression algorithms.

Unsupervised learning is another approach, in which patterns are derived from unlabeled input. The objective of unsupervised learning is to discover the structure and patterns of the input data. Unsupervised learning requires no supervision; rather, it searches for patterns in the data on its own. Two types of problems fall under unsupervised learning: clustering and association. Unsupervised training cannot be applied directly to regression or classification problems, since we have input data but no corresponding output data. The objective is to identify the fundamental layout of a data set and represent the data in a compact format based on similarities. Some well-known algorithms are K-means clustering, hierarchical clustering, neural networks, the Apriori algorithm, principal component analysis, and KNN (k-nearest neighbors) [59].

Fig. 3 Supervised learning versus unsupervised learning

4.1 Supervised Learning

Logistic Classifier: Logistic regression is a conventional way of forecasting the outcome of a binomial or multivariate classification problem. The model passes a linear combination of the input features through the logit function, and the resulting loss is minimized using gradient descent. Logistic regression was introduced in the 1950s as a traditional statistical classification method [60,61,62].

Consider the model p=P(Y=1), which has two predictors, \({x_{1}}\) and \({x_{2}}\), and one binary response variable, Y. We assume that the predictor variables and the log-odds of the event Y=1 have a linear relationship. This linear relationship may be expressed mathematically as follows:

$$\begin{aligned} l =\log _{e}{\frac{p}{1-p}}=C_{0}+C_{1}x_{1}+C_{2}x_{2} \end{aligned}$$
(1)

where \({C_{i}}\) are parameters/coefficients of the model.
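As a hedged sketch (toy data; the label and features column names are assumptions consistent with the pre-processing described later in Sect. 5.3), the model of Eq. (1) can be fitted with Spark ML-lib's LogisticRegression:

```python
# Fit a logistic regression on a tiny DataFrame with "label" and "features" columns;
# the learned intercept and coefficients correspond to C0 and C1, C2 in Eq. (1).
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logit-demo").getOrCreate()
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.1, 1.2])),
     (0.0, Vectors.dense([0.3, 0.9])),
     (1.0, Vectors.dense([2.3, 0.4])),
     (1.0, Vectors.dense([1.9, 0.1]))],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=50, labelCol="label", featuresCol="features")
model = lr.fit(train)
print("intercept C0:", model.intercept)
print("coefficients C1, C2:", model.coefficients)
```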

Decision Tree Decision trees are one of the mathematical modelling approaches used for analysis, processing, and learning. To draw conclusions about the target value from observations of an object, the method uses a tree of decisions. Classification trees are tree structures in which the target variable may take a discrete set of values; the leaves represent classes, and the branches represent conjunctions of features that lead to those class labels. Given their interpretability and simplicity, decision trees are among the most popular machine learning algorithms [63, 64].

The majority of decision tree algorithms work top-down, selecting at each step the variable that best separates the collection of objects. Different algorithms employ various metrics to determine what is "best." These metrics assess the homogeneity of the target variable across subgroups; they are applied to each potential subgroup, and the resulting values are averaged to provide a measure of the split's quality.
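For illustration only (a helper of ours, not part of ML-lib), the weighted Gini impurity, one common split-quality metric of this kind, can be computed for a candidate split as follows:

```python
# Gini impurity of a set of class labels and the weighted impurity of a split;
# a lower weighted impurity means a "better" (more homogeneous) split.
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_quality(left, right):
    """Weighted average impurity of the two subgroups produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

print(split_quality([0, 0, 1], [1, 1, 1]))  # a fairly pure split gives a low value
```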

Random Forest Breiman developed the Random Forest ensemble learning algorithm. An ensemble learner creates a large number of independent learners and then combines their outputs. Random Forest employs a variation of the bagging method: in bagging, each classifier is built independently using a bootstrap sample of the input data. In a standard decision tree classifier, the decision at a node split is based on all feature attributes; in Random Forest, by contrast, the best split at each node is chosen from a randomly selected subset of features. This random selection of features allows Random Forest to scale well both when there are many features per feature vector and when there are few [65,66,67]. Random Forest trains B classification or regression trees \({R_{b}}\) on a given training set X with responses Y, where the number of trees B is a hyper-parameter. After training, predictions for an unseen sample \({x^{'}}\) are made by averaging the predictions of all the individual regression trees on \({x^{'}}\):

$$\begin{aligned} {{\hat{R}}}={\frac{1}{B}}\sum _{b=1}^{B}R_{b}(x') \end{aligned}$$
(2)
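As a brief sketch (toy data assumed), Eq. (2) corresponds in Spark ML-lib to a RandomForestRegressor whose prediction averages B = numTrees trees, with featureSubsetStrategy controlling the random subset of features considered at each split:

```python
# Random forest regressor on a tiny DataFrame; each of the B = 100 trees is grown
# on a bootstrap sample, and transform() adds the averaged "prediction" column.
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(1.5, Vectors.dense([0.1, 1.2, 3.0])),
     (3.2, Vectors.dense([2.3, 0.4, 0.2])),
     (2.8, Vectors.dense([1.9, 0.1, 0.5])),
     (1.1, Vectors.dense([0.2, 1.0, 2.7]))],
    ["label", "features"],
)

rf = RandomForestRegressor(numTrees=100, featureSubsetStrategy="sqrt",
                           labelCol="label", featuresCol="features")
rf_model = rf.fit(train)
rf_model.transform(train).select("label", "prediction").show()
```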

Gradient Boosted Tree Gradient Boosted Trees have a lot in common with Random Forests. They build a series of decision trees and then make a forecast based on the weighted scores of the trees. The main distinction of Gradient Boosted Trees is that the first tree makes a prediction, and an additional tree is then added to mitigate its error. Trees are introduced one at a time, each mitigating the loss of the ones before it, until a stable model is obtained. Because each tree depends on the calculations of the previous trees, trees are added sequentially rather than in parallel, which reduces the performance advantage of distributed computation. Gradient-boosted tree regressors and classifiers were tested on limited data to measure computational time and accuracy for our objectives [68].

Gradient boosting fits a decision tree \({D_{n}(x)}\) at the nth step. Let \({L_{n}}\) be the number of leaves. For input x, the outcome of \({D_{n}(x)}\) may be represented as the sum:

$$\begin{aligned} {D_{n}(x)=\sum _{l=1}^{L_{n}}P_{ln} {1}_{R_{ln}}(x)} \end{aligned}$$
(3)

where \(P_{ln}\) is the predicted value in the region \({R_{ln}}\).
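To make the sequential construction concrete, the following simplified stage-wise loop (our illustration with squared-error loss and a constant learning rate, not the exact ML-lib procedure) fits each new tree to the residual error left by the ensemble built so far:

```python
# Stage-wise gradient boosting on a toy regression problem: at every stage a shallow
# tree is fitted to the current residuals and its (scaled) prediction is added.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate, n_stages = 0.1, 100
prediction = np.zeros_like(y)
for _ in range(n_stages):
    residual = y - prediction                      # error left by the previous trees
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # add the new tree's correction

print("training RMSE:", np.sqrt(np.mean((y - prediction) ** 2)))
```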

Linear Regression Linear regression is a linear technique for modelling the relationship between a response and one or more explanatory variables. Simple linear regression is used when there is only one explanatory variable, whereas multiple linear regression is used when there are more. Let the response variable be y and the independent features be \({x_0, x_1,...,x_n}\). The model accepts the feature vector in the form

$$\begin{aligned} y_i = w_0 + w_1 x_{i1} + \dots + w_n x_{in} + \epsilon _i \end{aligned}$$
(4)

where \(w_0, \dots , w_n\) are the weights that multiply the features linearly and \(\epsilon _i\) is a random variable representing the noise in the linear equation. To evaluate regression models, several metrics have been proposed in the literature; one of these is the Root Mean Squared Error (RMSE).


Root Mean Squared Error measures how near a regression line is to a set of points. The Root Mean Square Error (RMSE) is the standard deviation of the prediction errors (residuals). The prediction error is a measure of how distant the data points are from the regression line, i.e., how spread out they are around it. In other words, a smaller RMSE indicates that the data are clustered tightly around the line of best fit. In climatology, forecasting, and regression analysis, root mean square error is widely used to verify experimental results.

$$\begin{aligned} RMSE = \sqrt{\frac{1}{n} \sum _{i=1}^n y_{residual}^2} \end{aligned}$$
(5)
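As a short sketch (toy data and column names assumed), Eq. (4) can be fitted and the RMSE of Eq. (5) computed with Spark ML-lib's LinearRegression and RegressionEvaluator:

```python
# Fit a linear regression and evaluate it with the built-in RMSE metric.
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(1.0, Vectors.dense([1.0])), (2.1, Vectors.dense([2.0])),
     (2.9, Vectors.dense([3.0])), (4.2, Vectors.dense([4.0]))],
    ["label", "features"],
)

lr_model = LinearRegression(labelCol="label", featuresCol="features").fit(data)
preds = lr_model.transform(data)                      # adds a "prediction" column

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="rmse")
print("RMSE:", evaluator.evaluate(preds))
```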

4.2 Unsupervised Learning

K-means clustering The general principle of clustering is to identify some underlying structure in the data, also called classes of related objects. For clustering, a similarity metric in the form of the Euclidean distance is used, regardless of the platform. The basic concept of K-means, also referred to as divisive or partitional clustering, is to divide a larger cluster into smaller groups based on a user-supplied K (the number of clusters). Every cluster has a core called the centroid, and the total number of centroids is always equal to K. The algorithm assigns each data point to the nearest cluster; once all data points are allocated to a centroid (each centroid representing a cluster), the centroid values are recalculated and the procedure repeats until the clusters meet a convergence criterion. For each cluster, the centroid is the new mean, and the convergence criterion is a consistency or compactness measure of the cluster. Let the data be \({x_{0},...,x_{m}}\); the K-means clustering algorithm divides the m observations into k \((\le m)\) sets. The aim is therefore:

$$\begin{aligned} {arg \min }_{s} {\sum _{i=1}^{k}}{\sum _{{x}\in S_{i}}} {||{x}-{\mu }_{i}||^{2}} ={arg \min }_{s} {\sum _{i=1}^{k}} {|S_{i}|} {Var} {S_{i}} \end{aligned}$$
(6)

where \(\mu _i\) is the mean of the points in \(S_i\), and \(S_i\) is the \(i^{th}\) of the k sets.
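The assignment and update steps that minimize the objective in Eq. (6) can be illustrated with a small NumPy sketch (ours, for exposition only; Spark ML-lib's KMeans performs the same iteration in a distributed fashion):

```python
# Lloyd's iteration for K-means: assign each point to its nearest centroid, then
# recompute each centroid as the mean of its assigned points, and repeat.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
k = 2
centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialize from the data

for _ in range(20):
    # assignment step: index of the nearest centroid (Euclidean distance)
    labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    # update step: each centroid becomes the mean of the points assigned to it
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("estimated centroids:\n", centroids)
```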

Gaussian mixture model These are probabilistic models that represent a population made up of normally distributed sub-populations. Mixture models usually do not need to know which sub-population a data point belongs to; the model learns the sub-populations automatically. Because the sub-population assignment is unknown, this is a form of unsupervised learning. GMMs have been widely used in object tracking, where the number of mixture components and their means are predicted at each frame of a video sequence, and in extracting features from speech data. The Gaussian component weights are defined as \(\phi _k\) with the constraint \(\sum _{i=1}^K \phi _i =1\), so that the total probability distribution is normalized to 1. The total probability distribution is given as:

$$\begin{aligned} P(x) = {\sum _{i=1}^K \phi _i N(x | \mu _i, \sigma _i)} \end{aligned}$$
(7)

where \(\sigma _i\) is the standard deviation of component i and N can be written as:

$$\begin{aligned} N(x | \mu _i, \sigma _i) = \frac{1}{\sigma _i \sqrt{2 \pi }} exp (-\frac{(x-\mu _i)^2}{2 \sigma _i^2}) \end{aligned}$$
(8)
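A corresponding Spark ML-lib sketch (toy data assumed) fits a Gaussian mixture and returns the learned weights \(\phi _k\) of Eq. (7) together with the per-component means and covariances:

```python
# Fit a two-component Gaussian mixture on a small, well-separated point cloud.
from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
points = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.1]),), (Vectors.dense([0.2, 0.0]),),
     (Vectors.dense([0.1, 0.3]),), (Vectors.dense([0.3, 0.2]),),
     (Vectors.dense([9.0, 9.1]),), (Vectors.dense([9.2, 8.9]),),
     (Vectors.dense([8.9, 9.3]),), (Vectors.dense([9.1, 9.0]),)],
    ["features"],
)

gmm_model = GaussianMixture(k=2, seed=1, featuresCol="features").fit(points)
print("mixture weights (phi_k):", gmm_model.weights)
gmm_model.gaussiansDF.show(truncate=False)   # per-component mean and covariance
```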

Silhouette Score The term "silhouette" refers to a method of interpreting and assessing consistency within data clusters. The technique provides a concise graphical representation of how well each object has been classified. The silhouette ranges from -1 to +1, where a high value indicates that the item fits well within its own cluster and poorly with the clusters around it. If most objects have a high value, the clustering configuration is appropriate; if many points have low or negative values, the configuration may have too many or too few clusters. The Silhouette score is calculated as follows:

$$\begin{aligned} S_i = \frac{q(i)-p(i)}{\max (p(i),q(i))} \end{aligned}$$
(9)

For the ith data point, \(S_i\) is the Silhouette coefficient, where \(p(i)\) represents the average distance between i and all other data points in the cluster to which i belongs, and \(q(i)\) represents the lowest average distance between i and the points of any cluster to which i does not belong.
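In PySpark, this measure is available through the ClusteringEvaluator, as in the following sketch (toy data assumed; the evaluator's silhouette uses the squared Euclidean distance by default):

```python
# Cluster a small point cloud with K-means and compute the silhouette score.
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
points = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.1]),), (Vectors.dense([0.2, 0.0]),),
     (Vectors.dense([9.0, 9.1]),), (Vectors.dense([9.2, 8.9]),)],
    ["features"],
)

predictions = KMeans(k=2, seed=1, featuresCol="features").fit(points).transform(points)
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction")
print("silhouette score:", evaluator.evaluate(predictions))
```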

Table 1 Data set attribute

5 Results and Discussion

5.1 System Environment

We performed our experiments on a Windows 10 64-bit operating system. The hardware configuration is an Intel Core i7-9700K CPU with 48 GB of DDR4 memory.

Fig. 4 Classification algorithms: comparison of SK-learn, Rapid Miner, MATLAB, and Apache Spark ML-lib with respect to the ROC parameter

5.2 Datasets

Four broad data sets, namely Hepmass, SUSY, Higgs, and Bank Marketing, have been used in our analysis. The raw data sets were downloaded from the UCI Machine Learning Repository. The "HEPMASS" data come from high-energy physics experiments that search for exotic particles and pose a binary classification task. The "SUSY" data concern distinguishing between a signal process that creates supersymmetric particles and a background process that does not. The "HIGGS" data are a collection of signal samples used to determine whether a particular signal emits Higgs bosons or not. The Bank data set relates to customers in marketing campaigns and whether a person will subscribe to a bank service or not [69]. Table 1 shows the detailed characteristics of each data set. Each data set has two classes and real-valued attributes.

5.3 Data Pre-processing

We used four data sets for our experiments and split each in a 7:3 ratio into training and testing sets, respectively. Except for the bank marketing data, all attributes are numeric. We therefore used one-hot encoding in PySpark, because many machine learning algorithms do not operate on strings and other non-numeric data types. In addition, because the classes were imbalanced, we oversampled the bank data to avoid over-fitting [70]. The raw data sets were in text format, and we converted them to CSV format. The categorical encoding method of the SK-learn library was used to convert the labels into a numerical vector, while Rapid Miner and MATLAB use automatic data-preparation modeling to handle the data. In the case of PySpark, a dense vector was created from the data frame; it has two columns, namely label and features. The label column contains the original class from the data frame, whereas the features column composes all the data set features into a single vector [71].
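The following sketch outlines these steps in PySpark (assuming Spark 3.x; the column names are illustrative placeholders, not the actual bank-marketing schema): one-hot encoding of a categorical column, assembly of the features vector, and the 7:3 split.

```python
# Index and one-hot encode a categorical column, assemble all inputs into a single
# "features" vector alongside the "label" column, and split 70/30 for train/test.
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("married", 35.0, 1200.0, 0.0),
     ("single", 29.0, 300.0, 1.0),
     ("married", 41.0, 800.0, 1.0)],
    ["marital", "age", "balance", "label"],
)

df = StringIndexer(inputCol="marital", outputCol="marital_idx").fit(df).transform(df)
df = OneHotEncoder(inputCols=["marital_idx"], outputCols=["marital_vec"]).fit(df).transform(df)
df = VectorAssembler(inputCols=["marital_vec", "age", "balance"],
                     outputCol="features").transform(df).select("label", "features")

train, test = df.randomSplit([0.7, 0.3], seed=42)   # the 7:3 ratio described above
```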

5.4 Results

Fig. 5 Classification algorithms: comparison of SK-learn, Rapid Miner, MATLAB, and Apache Spark ML-lib with respect to accuracy

We have concentrated on the supervised classification (Fig. 4) and regression approaches (Fig. 6), namely Logistic/Linear models, Decision Tree, Gradient-boosted Tree, and Random Forest, as well as the unsupervised methods K-means and Gaussian Mixture Model. All machine learning classification and clustering algorithms are applied in four different environments: PySpark ML-lib, Rapid Miner, MATLAB, and Sci-Kit learn. The results of the classification comparison are shown in Fig. 4; the graph depicts the ROC for each data set as the classification algorithms are applied in the various environments. The accuracy for each data set in the different environments is depicted in Fig. 5.

Fig. 6 Regression algorithms: comparison of SK-learn, Rapid Miner, MATLAB, and Apache Spark ML-lib with respect to RMSE

Fig. 7 Regression algorithms: time comparison of SK-learn, Rapid Miner, MATLAB, and Apache Spark ML-lib

In the case of PySpark, the Gradient-boosted tree classifier reaches the maximum ROC value of 0.91 and the highest accuracy, about 92 percent, for the bank data set. Rapid Miner likewise suggests the Gradient-boosted tree for achieving maximum accuracy on the small data set. In contrast, the random forest classifier, at about 90 percent, was more accurate than the other classifiers in the SK-learn environment. MATLAB performs better than Sklearn and Rapid Miner in all cases, with performance similar to PySpark. Overall, one should consider PySpark to achieve the highest accuracy.

Considering the medium-sized data sets, PySpark achieved the highest accuracy for the SUSY data set, about 80 percent with Gradient-boosted trees (ROC = 0.829), and for the Hepmass data set, about 89 percent with the Random Forest method (ROC = 0.832). Rapid Miner reports that the Gradient-boosted tree is better on both data sets, with accuracies of 64 percent (SUSY) and 86 percent (Hepmass) and ROC values of 0.819 and 0.944. Similarly, Sklearn also favours the Gradient-boosted Tree; however, its ROC value is lower than those of PySpark and Rapid Miner. MATLAB would be the second option for obtaining adequate accuracies compared with the other ML frameworks: its highest ROC value is 0.916 for the gradient-boosted tree on big data, and 0.929 for the small data set.

Fig. 8 Running time comparison of K-means and Gaussian clustering over the SUSY and Hepmass data sets

Fig. 9 Silhouette score comparison of K-means and Gaussian clustering over the SUSY and Hepmass data sets

For the largest data set in our study, HIGGS, the random forest on the PySpark platform achieved a maximum accuracy of 89.4 percent with a ROC of approximately 0.70 (the maximum). The Gradient-boosted tree on Rapid Miner and Sklearn gives 62.4 percent and 71.4 percent accuracy with ROC values of 0.68 and 0.71, respectively. MATLAB performs better than Sklearn and Rapid Miner, with the highest accuracy of 90.78 percent for the bank (small) data set and a maximum average of 84.66 percent for the big data sets (Higgs, Susy, and Hepmass).

Moreover, we used ML regression approaches to investigate the impact of large data on the platforms. To determine the strength of the various platforms, we used only the Susy and Higgs data sets. Linear regression and random forest achieve an RMSE of 0.20 for the Susy data, whereas MATLAB outperforms the SK-learn package, which has a minimum error of 0.26, as shown in Fig. 6. In the context of linear regression on huge data such as Higgs, PySpark and MATLAB are better than Rapid Miner, with an error difference of 0.05; PySpark and Rapid Miner report the lowest errors of 0.175 and 0.18, respectively. Figure 7 displays the time spent by each method on Sklearn, MATLAB, Rapid Miner, and PySpark; PySpark outperformed the other platforms in all of the techniques.

In addition to the classification and regression models, we evaluated the unsupervised machine learning methods K-means and Gaussian Mixture Model on the SUSY and Hepmass data sets to check the robustness of PySpark against Sklearn and MATLAB. We analyse the clustering approaches in terms of time complexity and Silhouette score. Figure 8 compares the time required to cluster the SUSY and Hepmass data sets. In both data sets, the Gaussian Mixture Model (GMM) takes less time than K-means. PySpark takes 38.5 percent less time than Sklearn and MATLAB with the K-means method, and 27 percent less with GMM. The graphs show that, in unsupervised machine learning, PySpark ML-lib outperforms the SK-learn and MATLAB packages. Figure 9 shows the Silhouette scores of K-means and GMM on the SUSY and Hepmass data sets. PySpark surpassed Sklearn and MATLAB with a Silhouette score of 86.57 percent using GMM, while Sklearn performs marginally better than PySpark for the Hepmass data.

The findings of our research indicate that Apache Spark ML-lib is a powerful platform for big data analytics. Rapid Miner, MATLAB, and Sklearn, on the other hand, are slower than Apache Spark ML-lib when dealing with large amounts of data. This comparison might not be entirely fair due to the different file systems and configurations used by Apache Spark ML-lib, Rapid Miner, and the Sci-kit learn library.

However, our aim is to demonstrate how Spark performs with massive data sets, and Rapid Miner, MATLAB, and Sklearn are used as valid baselines that are widely accepted in the research community. Rapid Miner, MATLAB, and Sklearn have a number of advantages over Spark: (1) users have access to a large amount of documentation and resources; (2) they are very simple and convenient to use for non-expert users; and (3) Rapid Miner and MATLAB provide an excellent user interface and GPU computing support. Sklearn, MATLAB, and Rapid Miner also offer a large selection of machine learning techniques.

As predicted, the running-time measurements show that Apache Spark ML-lib is faster than the Sci-Kit Learn library: whether the test involved the classification algorithms or the clustering methods, Apache PySpark ML-lib completed the operations in less time.

6 Future Work and Conclusion

Digital data is being collected on an ever-growing number of occasions, and data processing has to evolve in order to turn this wide range of structured and unstructured data into highly qualified information and facts. This paper analyzes and contrasts the latest technologies in big data. The aim is to help consumers pick and apply the right mix of big data technologies for their particular application. It not only presents a high-level overview of various big data technologies, but also compares them using several large data sets and well-known machine learning methods and tools such as Rapid Miner and the Sci-Kit learn ML library. It classifies and evaluates the key features, advantages, drawbacks, and applications of these technologies.

In the future, we want to incorporate more realistic tests for most of the Apache Spark ML-lib components using a number of larger data sets. We are also aiming for an experimental evaluation of Apache Spark ML-lib on a series of broad data sets with diverse characteristics, across a selection of programming languages and tools (e.g. Python, Scala, Weka, and R), clusters, and various hardware and device setups. In addition, we will measure the hardware resources (memory, CPU, GPU, etc.) used by the different platforms during the experimental evaluation.