A REVIEW ON THE SIGNIFICANCE OF MACHINE LEARNING FOR DATA ANALYSIS IN BIG DATA

The big data revolution is changing how people work and think by improving insight discovery and decision-making. However, the technical dilemma of big data science is that no established approach can manage and analyze large, continuously growing volumes of data and extract valuable information from them. As data around the world grows rapidly and is distributed and processed in real time, traditional machine learning tools have become inadequate. Conventional machine learning (ML) approaches have been extended to meet the needs of other applications, but with larger data and knowledge bases, ML algorithms face significant challenges in big data analysis. This paper aims to facilitate understanding of the importance of ML in the analysis of large data. It contributes to understanding the implications and challenges of big data in terms of computational complexity, classification imperfection and data heterogeneity. It discusses the capability to mine value from large-scale data for decision-making and predictive analysis through data transformation and knowledge extraction. It considers the impact of big data on real-time data analysis and discusses the extent to which ML can be used to analyze large data. It also outlines the significance and opportunities for encouraging future research and development in the field of ML using big data.


INTRODUCTION
In today's information world, the volume of data is growing at an extraordinary velocity with advances in "web technologies," "social media," "mobile devices" and "sensors". Due to the diversity of Big Data (BD), the implementation of machine learning algorithms has had to be rethought, along with the data processing framework. Choosing the right tool for a given situation is often difficult, because different types of solutions may be needed as the complexity of the data increases, and the requirements of each machine learning project may differ. BD has tremendous potential for commercial significance in a diversity of areas, such as "healthcare," "transportation," "e-business," "power supervision" and "economic services" [1]-[3]. However, when faced with this huge amount of data, traditional approaches struggle to perform data analysis. Research performed by "ABI (Advance Business Intelligence) Research" [4] estimates that there will be over 30 billion interconnected devices serving information needs. These real systems can generate enormous quantities of data from numerous sources, making data management, processing and analysis complex and difficult. This is a difficult problem for many industries and organizations, including today's "healthcare companies," "IT departments," "government agencies" and "research institutions". To address this kind of problem, BD science emerged as a separate field, and new directions in research and education are needed [5] for rapid and successful development.
The use and performance of Machine Learning (ML) in BD analysis depend on the algorithms as well as on their configuration for the applied dataset, which requires many time-consuming operations. In fact, some systems cannot guarantee good performance without tuning. BD solutions can deliver high performance in a short time through new scientific innovations that can be integrated with ML systems for decision making. In various studies, ML is regarded as an influential tool for handling BD. As presented in [3], the relationship between BD and ML is similar to the relationship between sources of knowledge and individual learning. From this perspective, individuals are able to learn from such sources to deal with new problems. Similarly, ML systems are able to solve new problems by learning from BD. More information on BD processing using ML can be found in [6]-[7].
Most of the past research works described in [8]-[9] suggest that it is difficult to classify BD, as it is distributed among diverse categories of data, and extracting useful knowledge from large and composite datasets is not an easy task. BD classification demands a technique that is able to manage the difficulties caused by the BD attributes of "volume," "velocity" and "variety" [5]. It also needs computational models and procedures to efficiently categorize data using suitable ML algorithms, as discussed in various proposals [5]-[9].
Current technology development includes the latest distributed file systems and ML approaches. One such technology is "Hadoop" [10], which facilitates ML deployment using external libraries, such as the "scikit-learn" library, to handle BD. Many of the ML techniques in the library rely mostly on classification algorithms which might not be appropriate for BD processing. Nevertheless, several techniques, such as "decision tree learning" and "deep learning," are appropriate for BD classification and can help develop better supervised learning capabilities over the coming development periods.
The rest of the paper is organized in the following sections. Section 2 discusses big data implications. Section 3 presents data transformation and knowledge extraction. Section 4 discusses machine learning in big data analysis. Section 5 shows the importance of ML's advantage in big data. Finally, Section 6 presents the conclusion of the paper.

BIG DATA IMPLICATIONS
The concept of BD was initially defined in terms of high "volume," "velocity" and "variety," but later "veracity" [11] and "value" [12]-[13] were added. The definition calls for novel processing models to facilitate insight discovery, advanced decision-making and data processing. However, "value" is characterized as the desired outcome of handling BD [14] and not as one of the defining BD properties. The potential of BD is highlighted by its definition; however, realizing that potential depends on the improvement of traditional approaches or the development of new methods capable of handling this data.

Challenges
Managing and using large volumes of data, and designing algorithms that handle such data actively and efficiently, create distinctive challenges. The challenges and modern techniques currently included in BD analysis were reviewed by Chen and Zhang [15]. Jin et al. [16] addressed the importance and opportunities of the BD concept. They also presented the challenges encountered in terms of data, order and computational complexity and suggested possible solutions to these challenges.

Computational Complexity
One of the major challenges in BD is computational complexity, which arises directly from the increase in data volume. As the data grows large, even trivial operations become expensive, and current ML algorithms show significant time complexity as a function of the number of samples and features. For example, training a "support vector machine (SVM)" has a time complexity of O(m^3) and a space complexity of O(m^2) [17], where m is the number of training samples. Thus, the size of m significantly influences the time and memory required for training on BD and can render the process impractical.
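To make the scaling concrete, the following minimal Python sketch estimates how the cubic time and quadratic space bounds grow with the number of training samples m. The function name `svm_training_cost` and the assumption of 8 bytes per kernel-matrix entry are illustrative choices and do not come from [17]; the time figure is in arbitrary units because the bound hides constant factors.

```python
def svm_training_cost(m, bytes_per_value=8):
    """Rough growth of SVM training cost for m training samples.

    Returns (time_units, kernel_bytes). Both are order-of-magnitude
    illustrations of the O(m^3) and O(m^2) bounds, not measurements.
    """
    time_units = m ** 3                      # O(m^3) time, up to a constant factor
    kernel_bytes = bytes_per_value * m ** 2  # O(m^2) space for a dense m x m matrix
    return time_units, kernel_bytes


for m in (1_000, 10_000, 100_000):
    time_units, kernel_bytes = svm_training_cost(m)
    print(f"m={m:>7}: ~{time_units:.1e} time units, "
          f"~{kernel_bytes / 1e9:.2f} GB for the kernel matrix")
```

Under these assumptions, a tenfold increase in m multiplies the time estimate by a thousand and the memory estimate by a hundred, which is why the approach becomes impractical for very large m.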
Causes of challenges are mostly classified as "classification," "scalability" and "analysis," based on the task to perform. In terms of technological challenges, these are classified as "computation," "communication" and "storage". Also, with increased data size, the performance of algorithms becomes more reliant on the structure used to store and transfer data. As a result, the data size not only affects performance, but also leads to the need to revise the general architecture used to implement and develop these algorithms. Thus, for all these algorithms, as the data size increases, the time required to perform the calculations can increase dramatically and the algorithm can become unusable for very large datasets.

Classification Imperfection
The classification process implements methods to collect input data, understand data, transform data and understand the BD environment based on hardware requirements and acceptance criteria. Ultimately, the success of BD classification requires an understanding of modeling and algorithms. However, certain parameters affecting the classification of BD cause problems in the development of learning models and lead to classification imperfection.
Classification imperfections are not limited to BD and have been the subject of research for more than 10 years [18]. According to experiments performed by Japkowicz and Stephen [18], the difficulty of the imbalance problem depends on the complexity of the task, the degree of imbalance between the classes and the total size of the training data. They suggested that, in a large dataset, each class is likely to be represented by a reasonable number of samples. However, an evaluation on actual BD sets is required to confirm these observations. In such a case, the complexity of BD operations is expected to increase, which can have serious consequences due to class imbalance.
The larger the dataset, the more often the assumption that the data is evenly distributed among all classes is broken [19]. This leads to imperfect classification. The performance of an ML algorithm is adversely affected when the dataset contains classes with widely differing numbers of occurrences. This problem is particularly noticeable when some classes are represented by many samples and a few are represented by extremely small numbers. As a result, in the BD context, the probability of class imbalance is high due to the size of the data. Also, because of the complexity of the data, the potential impact of class imbalance on the ML approach is significant.
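As a small illustration of why imbalance matters, the sketch below computes an imbalance ratio and the accuracy of a trivial classifier that always predicts the majority class. The function names and the toy "normal"/"fraud" labels are assumptions made for this example only.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio between the largest and the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical dataset: 9,900 majority samples and 100 minority samples.
labels = ["normal"] * 9900 + ["fraud"] * 100
print(imbalance_ratio(labels))
print(majority_baseline_accuracy(labels))
```

With such a distribution, a model can reach a high accuracy while never detecting the minority class, so accuracy alone is a misleading measure on imbalanced data.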

Data Heterogeneity
BD analysis involves incorporating various data from multiple sources. Such data can vary in type, format, model and meaning. In practice, most real data analysis problems are caused by heterogeneous data [1], [20], which differs in type, structure and distribution because massive quantities of data are assembled from various sources with no class label information. For instance, in an emotion analysis task, the data can include "text," "images" and "videos" collected from different social media sources. To extract knowledge from such large and unlabeled data, advanced autonomous learning technologies need models that are able to integrate and learn efficiently with minimum time and process complexity.
In statistics, heterogeneity refers to differences between the statistical properties of different datasets. Such differences can exist even in small datasets, but they are more typical of BD, where datasets usually contain parts from several sources. This statistical heterogeneity breaks the common ML assumption that statistical properties are the same across an entire dataset.
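A simple way to see statistical heterogeneity is to compute basic statistics per source before pooling the data. The sketch below is only illustrative; the source names and values are hypothetical.

```python
from statistics import mean, pvariance


def per_source_stats(sources):
    """Return {source_name: (mean, population variance)} for each source."""
    return {name: (mean(values), pvariance(values))
            for name, values in sources.items()}


# Hypothetical readings of the same quantity from two independent sources.
sources = {
    "sensor_a": [1.0, 2.0, 3.0],
    "sensor_b": [10.0, 20.0, 30.0],
}
stats = per_source_stats(sources)
for name, (mu, var) in stats.items():
    print(f"{name}: mean={mu:.2f}, variance={var:.2f}")
```

When the per-source means and variances differ as strongly as in this toy example, a model that assumes one common distribution over the pooled data is likely to be misled.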
In real-time applications, learning from heterogeneous sources is associated with significant challenges due to data dimensionality, complex relationships, multiple structures containing various objects and diverse distributions. In most cases, labels for supervised learning on heterogeneous data are not available or are time-consuming to obtain. In this case, guidelines for integrating heterogeneous information are missing and most learning methods fail to perform accurately. So, identifying an unsupervised function that will be beneficial to the overall analysis is still an important and crucial research problem.

DATA TRANSFORMATION AND KNOWLEDGE EXTRACTION
ML often requires data pre-processing and cleaning steps to prepare the data for a particular model. However, in the case of data from different sources, the formats of the data may be different. In the context of data analysis, "data," "information" and "knowledge" are the three principal concepts to be exploited. Data analysis can be viewed as a process that transforms and integrates data into information, which can then be used for visualization or decision making, as shown in Figure 1.
To optimize BD for data extraction and transformation, the data must be modified so that it can be analyzed by ML. This modification takes place in the pre-processing phase. It also addresses the challenge of removing dirty and noisy data through the cleaning process. In this area, there has been no significant development with respect to BD so far, and it remains an active research focus in various domains.
The three essential aspects of data influencing ML are large quantity, dimensionality and variety of samples. Hence, data reduction for learning with BD is approached from two perspectives: reducing dimensionality and selecting instances. Dimensionality reduction aims to map a high-dimensional space onto a space of fewer dimensions without much information loss. It mainly addresses the curse of dimensionality and improves processing. Instance selection refers to methods that select a subset that resembles the entire dataset. It is intended to reduce the size of large-scale datasets through data reduction and, more specifically, to select only the required instances. The subset is then used to draw conclusions regarding the entire dataset. Instance selection methods include "random selection," "selection on genetic algorithms," "progressive sampling using domain knowledge" and "cluster sampling" [21], [46].
Data integration and management are critical issues in BD distribution. They are among the first activities used to improve the quality of distributed data in independent data sources. A traditional data integration system integrates a limited number of resources and usually has complex and time-consuming functions. As discussed in [22], data integration systems need to address uncertainties about the semantic mappings between data sources and the intermediate schema in order to effectively index the keywords of data access queries. This means that mappings are detected by understanding the meaning behind the labels of the schema elements, but it is challenging to determine which of these features are reliable [23], and this difficulty carries over to BD integration [24].
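Of the instance selection methods listed above, random selection is the simplest to sketch. The example below draws a fixed fraction of instances without replacement; the function name, the 10% fraction and the fixed seed are illustrative assumptions, not settings recommended in [21] or [46].

```python
import random


def random_instance_selection(data, fraction, seed=0):
    """Select a random subset of instances without replacement.

    `fraction` is the share of instances to keep; `seed` makes the
    selection reproducible.
    """
    rng = random.Random(seed)
    subset_size = max(1, round(len(data) * fraction))
    return rng.sample(data, subset_size)


data = list(range(1000))          # hypothetical dataset of 1,000 instances
subset = random_instance_selection(data, fraction=0.1)

# Conclusions about the whole dataset are then drawn from the subset,
# for example by comparing a summary statistic of both.
print(len(subset), sum(subset) / len(subset), sum(data) / len(data))
```

Random selection is cheap but ignores structure in the data; the other listed methods trade additional computation for a more representative subset.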

Data Transformation
Data transformation converts data or information from one structure into another, generally from the structure of a source system to the structure required by a target system. Mostly, standard procedures are used to convert text data files. Sometimes, however, a program itself is converted from one computer platform to another so that it can be executed in a different environment. The common motivation for such data migration is the introduction of a new system that is fundamentally different from the earlier systems.
In practice, data transformation involves a dedicated program that reads the data in its original format, determines the format into which the data must be converted so that the new program or system can use it and then carries out the transformation.

Data transformation involves two basic stages:
• Data mapping: Assigning elements of the source system to elements of the destination system so as to capture all transformations that must occur. This becomes more complicated when there are complex transformations, such as many-to-one or one-to-many transformation rules.
• Code generation: Creating the actual transformation program. The resulting data map specification is used to produce an executable program that runs on computer systems.
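The two stages can be illustrated with a minimal Python sketch in which a mapping specification (stage one) is turned into an executable transformation function (stage two). All field names and converters here are hypothetical, and real code generators typically emit source code or SQL rather than an in-memory function.

```python
def build_transformer(mapping):
    """Turn a data-map specification into a transformation function.

    `mapping` has the form {target_field: (source_field, converter)}.
    """
    def transform(record):
        return {target: convert(record[source])
                for target, (source, convert) in mapping.items()}
    return transform


# Stage 1: data mapping (hypothetical source and target fields).
mapping = {
    "full_name": ("name", str.strip),
    "age_years": ("age", int),
    "height_m": ("height_cm", lambda cm: float(cm) / 100),
}

# Stage 2: "code generation", here simply building the function.
transform = build_transformer(mapping)
result = transform({"name": " Ada ", "age": "36", "height_cm": "170"})
print(result)
```

Keeping the mapping as data, separate from the code that applies it, is what allows the transformation program to be regenerated when the source or target structure changes.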

Data Analysis
Data analysis and data mining form a subset of "Business Intelligence (BI)," which also includes "data warehousing," "database management systems" and "online analytical processing (OLAP)". Data mining is a specific data analysis technique that targets statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while BI covers data analysis that focuses mainly on business information.
Deeper data analysis is able to reveal many of the most important features of data, which helps predict future data features. This makes it possible to explore how patterns develop as a dataset grows into a BD set. Statistical and engineering features are key analytical bases that help us recognize the development of patterns. One area of focus in BD classification is technology development, where the fundamental elements of analysis should be clearly understood. Some numerical measures contributing to these goals are "counting," "mean," "variance," "covariance" and "correlation" [25].
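The numerical measures listed above can be computed directly, as in the following sketch. It uses population (divide-by-n) formulas, which is an assumption; [25] may use sample (divide-by-n-1) estimators instead. The function name and data are illustrative.

```python
from math import sqrt


def summary(x, y):
    """Count, means, variances, covariance and correlation of two variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    corr = cov / sqrt(var_x * var_y)   # Pearson correlation coefficient
    return {"count": n, "mean_x": mean_x, "mean_y": mean_y,
            "var_x": var_x, "var_y": var_y, "cov": cov, "corr": corr}


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # y is exactly 2 * x
stats = summary(x, y)
print(stats)
```

Because y is a linear function of x in this toy data, the correlation is 1; on real data, tracking how these measures change as the dataset grows is one way to observe the development of patterns.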
Methodologies for processing data for analysis need to follow these steps:
• Data requirements: Data is required as the input of analytics, and the requirements depend on the needs of the analysis or of the clients who will use it. The general type of entity on which data is collected is referred to as the experimental unit. Specific variables regarding a population (e.g., height and weight) may be specified and obtained. Data can be numerical or categorical.
• Data collection: Data is collected from various sources, as required by the data analysts of an organization. Data can also be gathered through sensors in the environment, such as "traffic cameras," "satellites" and "recording devices". It can also be gathered through "interviews," "downloads from web sources" or "interpreting documents".
• Data processing: The data initially acquired must be processed or organized for analysis. For example, this might include placing data into rows and columns in a table format, such as in a spreadsheet or statistical software, for further analysis.
• Data cleaning: Once processed and organized, the data might be incomplete, contain duplicates or contain errors. The need for data cleaning arises from problems in the way data is entered and stored. Data cleaning is the process of preventing and correcting such errors. In general, it consists of "record matching," "recognizing data incompleteness," "assessing the quality of existing data," "transcription" and "column segmentation".
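The cleaning step in the list above can be sketched as follows for two of the problems it names, incomplete records and duplicates. The record layout, the `required` fields and the normalization rule (trimming and lower-casing) are assumptions chosen for the example.

```python
def clean_records(records, required):
    """Drop incomplete records and duplicates.

    A record is incomplete if any required field is missing or empty.
    Two records are duplicates if their required fields match after
    trimming whitespace and lower-casing.
    """
    seen = set()
    cleaned = []
    for record in records:
        if any(record.get(field) in (None, "") for field in required):
            continue                      # incomplete record
        key = tuple(str(record[field]).strip().lower() for field in required)
        if key in seen:
            continue                      # duplicate record
        seen.add(key)
        cleaned.append(record)
    return cleaned


records = [
    {"id": "1", "name": "Ada"},
    {"id": "1", "name": "ada "},   # duplicate after normalization
    {"id": "2", "name": ""},       # incomplete: empty name
    {"id": "3", "name": "Grace"},
]
cleaned = clean_records(records, required=("id", "name"))
print(cleaned)
```

At BD scale the set of seen keys may not fit in memory on one machine, which is one reason the cleaning step itself becomes a distributed processing problem.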
Moreover, as already noted in the context of BD, the challenges of classifying and cleaning data are becoming more common and more difficult. It is therefore difficult to identify such problems and to separate them so that the remaining data represents a complete group. When there are large inconsistencies between data rows, the data selection process cannot guarantee accurate class selection.

MACHINE LEARNING IN BIG DATA ANALYSIS
ML is a branch of artificial intelligence and consists of two phases: "training" and "testing" [3]. The first phase builds a learning mechanism based on some of the known characteristics of datasets. The second phase aims to make predictions about unknown characteristics using the knowledge gained in the first phase. In this view, "training" and "testing" are also called "learning" and "prediction". In fact, the task of ML is to use a learning algorithm to build a model that is then applied to make predictions. Therefore, this activity is generally called predictive modeling. The phases of ML from data acquisition to constructing a predictive model are shown in Figure 2.
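The two phases can be illustrated with a deliberately simple model. The sketch below uses a nearest-centroid classifier, chosen here only because it is short; it is not a method proposed in the paper, and the data points are hypothetical.

```python
def train(samples):
    """Training phase: learn one centroid (mean vector) per class label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {label: tuple(sum(col) / len(col) for col in zip(*rows))
            for label, rows in grouped.items()}


def predict(model, features):
    """Testing phase: assign the label of the closest learned centroid."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


training_data = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
                 ((5.0, 5.0), "b"), ((4.8, 5.2), "b")]
test_data = [((0.9, 1.1), "a"), ((5.1, 4.9), "b")]

model = train(training_data)
correct = sum(predict(model, x) == label for x, label in test_data)
accuracy = correct / len(test_data)
print(model, accuracy)
```

The model is built only from the training data, and the test data is used solely to check how well the learned knowledge predicts labels it has not seen.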
In recent literature, several researchers illustrated ML challenges with BD [26]-[28], while others examined them in terms of a particular methodology [26]. According to [28], ML algorithms can be grouped into numerous kinds of learning, such as "Decision Tree Learning," "Rule Learning," "Instance-based Learning," "Bayesian Learning," "Perceptron Learning" and "Collective Learning". All these learning algorithms reflect the nature of the promotion.
In ML, there are several algorithms for constructing the model, where the word algorithm refers to the learning algorithm. In this scenario, the model is treated as information derived from training data. The testing phase aims to transform information into knowledge. The learning algorithm uses a given set of data to learn, validate and test the model. It finds the best values for the parameters and then validates and evaluates the resulting performance.

Supervised and Unsupervised Learning
Supervised Learning (SL) means learning with a teacher, since in all cases the training instances are labeled so that outcomes can be predicted accurately. In other words, this kind of learning is inspired by students learning under the supervision of teachers. The purpose of this kind of learning is to build a model by learning from labeled data and to make predictions on other, unseen cases in terms of the expected attribute value. Therefore, SL can be used for both "classification" and "regression" tasks, for categorical prediction and numerical prediction, respectively [47]-[48].
In SL, classes are known and class boundaries are well defined in a given set of learning data, and learning is carried out using these classes. Classification problems can be formulated precisely in terms of the knowledge transformations described above. A flowchart of the supervised ML approach is shown in Figure 3. Assume that a dataset is given and that its data domain D is R^c, which implies that the instances in the dataset are described by c features and form a "c-dimensional vector space". If it is supposed that there are "n classes," the knowledge function can be given using Equation (1).
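The displayed form of Equation (1) does not appear in this text. A form consistent with the definitions given here (a data domain D in a c-dimensional space, mapped to a set of class labels {0, 1, 2, ..., n}) would be the following; the symbol f is an assumed name for the knowledge function.

```latex
f : D \subseteq \mathbb{R}^{c} \longrightarrow \{0, 1, 2, \ldots, n\}
\tag{1}
```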
In Equation (1), the set "{0, 1, 2, ..., n}" is the set of knowledge, which assigns the distinct label values "0, 1, 2, ..., n" to different classes. This mathematical function helps to describe the classification criteria that are appropriate for data classification. A number of classification procedures have been proposed in the ML literature, and a few of the well-recognized methods are "SVMs" [29], [52], "decision trees" [30], "random forests" [31] and "deep learning" [32].
Unsupervised Learning (USL), on the other hand, means learning without a teacher. This is because the training instances are not labeled. In other words, learning without supervision is naturally inspired by self-study. In fact, the purpose of this type of learning is to discover previously unknown patterns in datasets through association and clustering. The former aims to identify the relationships between the objects and attributes and the latter aims to group the items based on their similarity.
In USL, class boundaries are assumed to be unknown, so the class labels themselves have to be learned as well and classes are defined accordingly. Thus, the class boundaries are statistical and not sharply defined; this is known as "clustering". In the clustering problem [33], it is assumed that the data can be collected, but not labeled. As a result, it is only possible to generate approximate rules that help categorize new data that does not contain labels. Clustering forms a guideline that facilitates labelling the selected data points and assigning labels to new data points. Therefore, clustering problems are expressed using estimation rules [49].
Clustering problems can also be formulated mathematically based on the knowledge of data transformation, as discussed previously. Suppose that a domain "D" of data records with c features can be represented as R^c, so that the records form feature vectors in a c-dimensional space. To construct clusters for k classes, a knowledge-based function can be derived as given in Equation (2).
The set of knowledge for k labels is "{0, 1, 2, ..., k}," with each label having different features. Based on the features most strongly associated with each label, a suitable class is assigned so as to obtain accurate clusters. A few clustering algorithms generally used in ML are "k-Means clustering," "Gaussian mixture clustering" and "hierarchical clustering" [34].
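Of the algorithms named above, k-means is the easiest to sketch: it maps each feature vector to one of k cluster labels by repeatedly assigning points to the nearest centroid and recomputing the centroids. The version below is a minimal illustration, assuming squared Euclidean distance, a fixed number of iterations and initialization from the first k points; practical implementations use better initialization and a convergence test.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Plain k-means; returns (centroids, labels)."""
    centroids = [tuple(p) for p in points[:k]]   # assumption: simple initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(col) / len(col) for col in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (4.8, 5.2)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The returned labels play the role of the learned class labels, and `nearest` can be reused to assign a label to a new, unlabeled data point.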

Big Data Analysis
Business Intelligence is an application that can benefit from BD techniques. BD analyses also have systematic consequences in today's applications; hence, it is appropriate to characterize them using the features of the classes, the characteristics of the parameters and the characteristics of the observations, three important ideas of BD. A full understanding of these three aspects can help in addressing these problems.
Assuncao et al. [35] reviewed the development methodology and environment for performing BD analysis on the cloud platform. They categorized BD analytics solutions into "descriptive models" based on past customer activity, "predictive models" that forecast from available data and "prescriptive models" that support decision-making processes.
Personalization of acceptance and non-cooperative attempts can lead to difficulties in the BD area. Every acceptance will contribute to the BD and influence the uniqueness of the other orthogonal acceptances, thus determining acceptance problems using a three-dimensional space. This suggests that the classification of categories becomes very complex and unpredictable as BD develops. An increase in the class forms then depends on the scheme, irrespective of user knowledge and experience. As a result, BD classification becomes unpredictable and it is difficult to apply ML models and algorithms efficiently.
Similarly, the acceptance of the features contributes to BD complexity. It builds a classification utilizing the patterns to reduce complexity with growing data dimensions. These are considered the main factors in the scalability problem of the BD paradigm, and they contribute to the complications in data management, processing and analysis. Their expansion will increase data size and make processing difficult with current technologies in the near future.

ML Modeling and Algorithms Approaches in Big Data
ML has different learning paradigms; however, not all of these types of learning are appropriate in every case. Modeling and algorithms are defined based on the characteristics of "domain distribution," "batch learning" and "online learning," depending on the availability of labeled data and on whether learning is supervised or unsupervised. The two foremost elements that help accomplish ML goals are learning models and learning algorithms, which use different pattern recognition tools. Some of the tools utilized in BD for data processing are described in Table 1. Depending on the characteristics of the domain division, "regression," "classification" and "clustering" determine the modeling features of ML under supervised and unsupervised algorithms [36]-[37]. Domain division can also be essential in determining the learning algorithms. Suppose that the domain is divided and class labels are assigned, so that a classification model can be set up and the search for optimal parameters can be supervised. This is referred to as SL, and classification is defined under the SL model. If the domain is divisible but class labels are not assigned, the learning is referred to as USL and is assigned to the USL model.

Supervised Learning Models
SL models provide a parameterized mapping from the data domain to a response set, thus helping to extract knowledge from the data. These learning models are generally grouped into predictive models and classification models. The "regression model" is a predictive model that is appropriate for systems that generate continuous responses. There are various regression models, including "standard regression," "ridge regression," "lasso regression" and "elastic-net regression" [38]. In these models, the slope and regularization parameters play an important role in reducing the error.
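The role of the slope and the regularization factor can be seen in the single-feature case, where ridge regression has a closed-form solution. The sketch below assumes one input feature, an unpenalized intercept and a penalty strength `lam`; with `lam = 0` it reduces to standard least-squares regression. The function name and data are illustrative.

```python
def ridge_fit_1d(x, y, lam=0.0):
    """Fit y ~ slope * x + intercept, minimizing squared error + lam * slope**2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / (s_xx + lam)          # the penalty shrinks the slope toward 0
    intercept = mean_y - slope * mean_x
    return slope, intercept


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]                 # exactly y = 2x + 1

print(ridge_fit_1d(x, y))                # standard regression
print(ridge_fit_1d(x, y, lam=5.0))       # ridge regression with a strong penalty
```

Lasso and elastic-net regression differ in the form of the penalty term and generally have no such simple closed form, so they are fitted iteratively.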
The classification model is suitable for scenarios where discrete responses are generated. There are many classification models, which can be grouped under "mathematically intensive" models and "hierarchical models". Hierarchical models help to classify separated groups of points associated with base classes using a tree-like structure [50]. This kind of model is well suited for modern requirements, including BD and distributed ML. It supports both regression analysis and classification using trees that are constructed from a series of decisions, called decision trees.

ML Supervised Learning Algorithms
SL algorithms help train models effectively to provide high accuracy. In general, SL algorithms support the use of large datasets to find optimal values for model parameters without overfitting the model. Therefore, it is important to carefully design the learning algorithm using a systematic approach. The ML field proposes three phases for designing an SL algorithm: a "training phase," a "validation phase" and a "testing phase".
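In practice, the three phases are usually supported by splitting the labeled data into three disjoint parts. The following sketch shows one way to do this; the 60/20/20 proportions, the function name and the fixed seed are assumptions made for illustration.

```python
import random


def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the data and split it into training, validation and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)       # reproducible shuffle
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]       # whatever remains
    return train_set, val_set, test_set


train_set, val_set, test_set = three_way_split(range(10))
print(len(train_set), len(val_set), len(test_set))
```

The training set is used to fit the parameters, the validation set to choose between candidate settings and detect overfitting, and the test set only for the final evaluation.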
Training algorithms mainly help adjust and optimize model parameters using labeled datasets. The training algorithm needs "quantitative measures" to train the learning model effectively on the labeled dataset. In general, it includes several sub-processes, such as extracting the data domain and creating the associated response set, standardization and modeling. Model testing is a procedure to evaluate the performance of a model that has been trained with a training algorithm. A few such algorithms are described below.
• Support Vector Machine: This method is a classic ML technique that helps resolve BD classification problems. Specifically, it can help in multiple application domains in the BD environment, but it is computationally complex. It is used in BD frameworks, like "RHadoop," which provide SVM implementations in the R programming language for analyzing data in distributed file systems. SVM-related algorithms have also been deployed on "MapReduce" in the Hadoop framework [53] to improve performance.
• Decision Tree Learning: Decision trees use rule-based approaches to divide domains into several linear spaces and predict responses. If the predicted response is continuous, the decision tree is a "regression tree" and if the predicted response is discrete, the tree is a "classification tree". In fact, "decision tree-based learning" is described as a "rule-based binary tree" construction procedure, but it is easier to understand if it is interpreted as a hierarchical domain partitioning technique. The data domain is recursively partitioned into two sub-domains so that more information gain is obtained than at the node being partitioned. Decision trees can be "trained," "validated" and "tested" using SL algorithms, so it is clear that they form an SL model and satisfy this definition.
• Random Forest Learning: This learning method builds on the decision tree modeling approach [3]. It uses decision tree models for parameterization together with "sampling techniques," "sub-space techniques" and "ensemble techniques" to optimize modeling. The sampling technique is generally called bootstrapping and uses random sampling with replacement. On this basis, decision trees are constructed and selected to form the random forest. Each decision tree can be either a "classification tree" or a "regression tree". Hence, the method is useful for both classification and regression problems.
• Deep Learning Models: Deep learning models in ML try to capture the relations embedded in learned representations. This is mostly expressed by the frequently used term "feature learning" [40]. This kind of algorithm takes its name from the fact that it uses data representations rather than explicit data features to perform tasks. It transforms data into abstract representations that facilitate learning. In a deep-learning architecture, these representations are later used to perform ML tasks. Since the features are discovered directly from the data, they do not need to be engineered by hand. In the BD context, the ability to avoid feature engineering is an immense benefit because of the challenges associated with this process.
Deep-learning algorithms are able to capture different levels of abstraction. This type of learning is, therefore, well suited to "image classification" and "recognition" problems. "Boltzmann machines" [41] are related models, except that they use a stochastic rather than a deterministic process. Another example of these algorithms is "deep-belief networks" [42]. Because of the features described, deep learning appears to be well suited to handling several of the previously identified challenges, such as "geometry features," "data heterogeneity," "nonlinearity" and "noisy data". However, these algorithms are not designed primarily for learning from changing, high-volume data [43] and are therefore prone to the data velocity problem. While they are well suited to handling large amounts of data with complex problems, they are not computationally efficient [45].
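The tree-based learners described above (decision trees split by information gain, and random forests built from bootstrap samples, random sub-spaces and ensemble voting) can be illustrated with a minimal sketch. It is not taken from the cited works: it assumes a classification task, uses one-level trees (stumps) instead of fully grown trees to stay short, and all function names (`best_split`, `fit_stump`, `fit_forest`, `predict_forest`) are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(labels, left, right):
    """Entropy reduction obtained by partitioning `labels` into `left` and `right`."""
    total = len(labels)
    children = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(labels) - children


def best_split(rows, labels, features):
    """Return (gain, feature, threshold) of the binary split with the highest gain."""
    best = (0.0, None, None)
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if left and right:
                gain = information_gain(labels, left, right)
                if gain > best[0]:
                    best = (gain, f, threshold)
    return best


def fit_stump(rows, labels, features):
    """Fit a one-level classification tree: one partition into two sub-domains."""
    _, f, threshold = best_split(rows, labels, features)
    majority = Counter(labels).most_common(1)[0][0]
    if f is None:  # no split improves purity: always predict the majority class
        return lambda row: majority
    left = Counter(y for row, y in zip(rows, labels) if row[f] <= threshold)
    right = Counter(y for row, y in zip(rows, labels) if row[f] > threshold)
    left_label = left.most_common(1)[0][0]
    right_label = right.most_common(1)[0][0]
    return lambda row: left_label if row[f] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Build an ensemble of stumps from bootstrap samples and random sub-spaces."""
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw len(rows) indices at random, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Sub-space: each tree only sees a random subset of the features.
        feats = rng.sample(all_features, n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Ensemble step: majority vote over the predictions of all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]
```

A full decision tree would apply `best_split` recursively to each sub-domain until a stopping criterion is met, and a regression tree would replace entropy with a variance-based criterion; both are omitted here for brevity.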
Najafabadi et al. [26] focused on deep learning, but pointed out general obstacles facing ML with BD: "unstructured data formats," "fast data streaming," "multi-source data input," "noisy and poor-quality data," "high dimensionality," "scalability of algorithms," "imbalanced input data" and "limited labeled data". Similarly, Sukumar [27] identified three main requirements: designing flexible and highly scalable architectures, understanding the statistical properties of the data before applying algorithms, and ultimately developing the capacity to work with large datasets. Both Najafabadi et al. [26] and Sukumar [27] reviewed aspects of ML with BD, but neither attempted to link each identified challenge to a corresponding solution.
Qiu et al. [28] discussed various learning methods and presented a range of work on BD. Although they did considerable work in identifying current issues and possible solutions, their study lacks a categorization of the challenges and a clear relationship between each approach and the challenges it addresses. This makes it hard to reach an informed decision about which learning approach is most suitable for a particular task or scenario. Thus, the focus of our work is to establish a link between solutions and challenges. A comparative analysis of these proposals and their limitations is presented in Table 2.

Limitations of Big Data Analytics
BD brings great promise. However, it is not a tool with unlimited capabilities, and making the most of the analysis means not underestimating the limitations of what data can do [54]. The following are some of the major limitations faced by both experienced users and first-time data explorers.
 Data Misinterpretation: Data can reveal how users behave. However, it cannot explain why users think or behave the way they do. Misinterpreting data can mislead businesses that try to capture the market using evolving information. In addition, relying exclusively on data to form predictions may lead companies to act on spurious correlations. In practice, identifying the expected correlation and answering the right question with the support of the data is a different task from gathering and interpreting the data.
 Security Limitations: BD also faces limitations due to security issues. Companies that collect data have a significant responsibility to protect it. The consequences of data breaches may include litigation, fines and loss of reputation. Security issues can greatly inhibit the ability to process data. For example, analyzing data held by other organizations can be complicated, since the data might be hidden behind a firewall or on a private cloud server. This makes it difficult to share and transfer data so that it can be analyzed and worked on reliably.
 Outlier Effect: The third major constraint with BD is that outliers are common. Once the data is processed and analyzed, a user error or a new update to a popular search engine can produce biased results. The reality is that technology is not yet able to collect data with complete accuracy. For example, changes to Google's own algorithms, together with the inability to correctly predict search behavior, made one such project one of the company's most notable failures to date.
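The outlier effect can be made concrete with a small example. The sketch below is illustrative only: the daily counts are invented, and the 1.5 × IQR rule (Tukey's fences) is one common convention for flagging outliers rather than a method prescribed by this review. It shows how a single faulty record distorts the mean while leaving the median almost untouched.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical daily query counts; the last value simulates a logging glitch.
daily_queries = [102, 98, 101, 97, 103, 99, 100, 2500]

print("mean:    ", statistics.mean(daily_queries))    # pulled far above typical values
print("median:  ", statistics.median(daily_queries))  # stays close to typical values
print("outliers:", iqr_outliers(daily_queries))
```

Flagging a point as an outlier does not by itself justify discarding it; whether it reflects an error or a genuine change in behavior remains a question of interpretation.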

FEATURE SIGNIFICANCE OF ML IN BIG DATA
ML uses algorithms to discover hidden knowledge without being explicitly programmed. The iterative aspect of ML is important: models adapt independently as they are exposed to BD. With the advent of new computing technologies, ML has advanced significantly. Only recently have ML algorithms been able to repeatedly apply complex computations to integrate and analyze BD, a capability that was long unavailable. A few well-known examples are illustrated below.
 The essence of ML with BD can be seen in "Google's self-driving car". ML applications using BD drive various "recommendation" and "online business systems", such as Amazon, Netflix, etc. ML is applied to text data processing on various social media platforms, such as Facebook, Twitter, etc. ML can process BD to detect fraud in various financial and security systems.
 SL algorithms are used where the desired result is known. The algorithm is trained on a set of inputs and the corresponding set of correct outputs. It compares its actual output with the correct output to identify errors and adjusts the model accordingly.
 USL is used on data with no historical labels. The algorithm is not told the correct result; it must work out what it is being shown without knowing the data labels.
 Semi-SL is used with both labeled and unlabeled data, for tasks such as "classification," "regression" and "prediction".
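The SL principle described above, in which the algorithm compares its actual output with the correct output, can be sketched with a perceptron. The perceptron is chosen here only because it is the simplest error-driven learner; it is an illustrative assumption and not a method proposed in this review, and the names `train_perceptron` and `predict_perceptron` are hypothetical.

```python
def predict_perceptron(weights, bias, x):
    """Actual output of the model: 1 if the weighted sum is positive, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(inputs, targets, epochs=20, lr=1.0):
    """Learn weights from labeled examples (inputs paired with correct outputs)."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            actual = predict_perceptron(weights, bias, x)
            error = target - actual  # compare actual output with correct output
            # Update only when the prediction was wrong (error is +1 or -1).
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Labeled training set for the logical AND function.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]

weights, bias = train_perceptron(and_inputs, and_targets)
print([predict_perceptron(weights, bias, x) for x in and_inputs])
```

In USL no `targets` list exists, so no such error signal can be computed; this is the practical difference between the two settings.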

Future Research Directions of Big Data Analysis
Today, BD analysis is getting more and more attention, but there are still many research problems to be solved in various domains.
 Storage and Retrieval: Multidimensional data has to be integrated with BD analysis, so array-based in-memory representation models can be explored [55]. Integrating multidimensional data representations with BD involves adding multidimensional extensions to the query language HiveQL. With the rapid development of smartphones, images, audio and video are being produced at an alarming rate. The storage, retrieval and processing of this unstructured data require extensive research in various dimensions [56].
 BD Computations: In addition to current BD paradigms such as "MapReduce" [51], other relevant paradigms, such as "YarcData (BD Graph Analytics)" and "high-performance computing cluster (HPCC)" systems, need to be investigated [57].
 Visualization of High-Dimensional Data: Visualization supports decision analysis at each step of data analysis. Visualization issues remain an open part of "data warehousing" and "OLAP" research. For high-dimensional data, a wide range of visualization tools is being developed [58].
 Real-time Processing Algorithms: Because of the rate at which data is generated and forecasts are expected, many real-time algorithms may be unable to meet the required processing-time complexity and latency.
 Social Perspective's Dimension: It is essential to recognize that various technologies can produce quicker results, but decision makers must use them wisely [59]. These results may have social and cultural impacts. There is no doubt that large-scale search data will help create better tools and services, but it may also lead to privacy intrusion and invasive marketing. Data analysis also helps in understanding online behavior, local communities and political movements [60]-[61].
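Among the computation paradigms listed above, "MapReduce" [51] is the easiest to illustrate in a few lines. The following single-process sketch mimics its three phases (map, shuffle, reduce) on the classic word-count task. It is a conceptual illustration under simplifying assumptions, not the API of Hadoop or any other framework; the function names are hypothetical, and a real system would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data analysis", "Big data learning"]))
```

Because each call to `map_phase` depends on one document only and each call to `reduce_phase` on one key only, both phases can run in parallel, which is what makes the paradigm attractive for BD workloads.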

CONCLUSION
BD analysis is the process of examining large and diverse datasets. Learning from large, unstructured data offers significant opportunities for many sectors. However, most existing learning routines are not sufficiently computationally efficient, practical or scalable. This paper reviews the need for research aimed at proposing new techniques that can be used to analyze BD. ML is increasingly adopted in current and emerging BD implementations. The paper also presents the challenges that various ML tools face in providing an adaptable framework suited to BD analysis. Analytical units can be combined with an ML engine to overcome data processing constraints. BD analytics and ML implementations support each other and can be powerful tools for understanding and predicting business behavior based on customer input information. With the increasing use of ML in research and business, new methods to support learning tasks are becoming essential, and future research is needed to achieve significant improvements in ML approaches for BD analysis.

Figure 1. Big data transformation and analysis process.

Figure 3. A flowchart of supervised ML.

Table 1. Comparison of various BD tools.

Table 2. Comparison of proposal enhancements and limitations.