A survey of machine learning for big data processing

There is no doubt that big data are now rapidly expanding in all science and engineering domains. While the potential of these massive data is undoubtedly significant, fully making sense of them requires new ways of thinking and novel learning techniques to address the various challenges. In this paper, we present a literature survey of the latest advances in research on machine learning for big data processing. First, we review the machine learning techniques and highlight some promising learning methods in recent studies, such as representation learning, deep learning, distributed and parallel learning, transfer learning, active learning, and kernel-based learning. Next, we analyze and discuss the challenges and possible solutions of machine learning for big data. Following that, we investigate the close connections of machine learning with signal processing techniques for big data processing. Finally, we outline several open issues and research trends.

INTRODUCTION
ML thrives on powerful computing environments, efficient learning approaches (algorithms), and rich and/or large data. As a result, machine learning has great potential and is an important part of big data analytics. Here, machine learning techniques are examined in the context of big data and modern computing environments. We investigate the opportunities and challenges of machine learning on big data (MLBiD). Big data opens up new possibilities for machine learning. The MLBiD framework is centered on machine learning, which is divided into three phases: preprocessing, learning, and evaluation. In addition, the framework includes four other components: big data, user, domain, and system, all of which influence and are influenced by ML. The components of MLBiD and the phases of ML point the way to identifying opportunities and challenges, as well as future work in a variety of unexplored or underexplored research areas.

A Framework of Machine Learning on Big Data
Machine learning is at the heart of MLBiD, and it interacts with four other components: big data, user, domain, and system. Figure 1 depicts the framework for machine learning on big data (MLBiD). Interactions take place in both directions. Big data, for example, provides inputs to the learning component, which creates outputs that in turn become part of the big data. The user may interact with the learning component by providing domain knowledge, personal preferences, and usability feedback, as well as by using learning results to improve decision making. System architecture influences how learning algorithms should run and how efficiently they run, while meeting the learning needs may in turn lead to a co-design of the system architecture. In the following, we go through each of the components individually.

1. Machine Learning
Data preprocessing, learning, and evaluation are the typical phases of machine learning (see Fig. 1). Data preprocessing transforms raw data into the "right form" for the subsequent learning steps. The raw data is likely to be unstructured, noisy, incomplete, and inconsistent. Through data cleaning, extraction, transformation, and fusion, the preprocessing phase converts such data into a form that can be used as input to learning. Using the preprocessed input data, the learning phase selects learning methods and tunes model parameters to produce the desired outputs. Some learning methods, notably representation learning, can also be used for data preprocessing.
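The three phases can be illustrated with a minimal sketch. It assumes a toy one-feature regression task; the function names `preprocess`, `learn`, and `evaluate` and the sample data are hypothetical and chosen only to mirror the phases named above, not taken from any particular system.

```python
from statistics import mean

def preprocess(raw_rows):
    """Data cleaning: drop incomplete records and cast strings to floats."""
    cleaned = []
    for x, y in raw_rows:
        if x is None or y is None:
            continue  # skip records with missing values
        cleaned.append((float(x), float(y)))
    return cleaned

def learn(rows):
    """Fit y = a*x + b by ordinary least squares (closed form, one feature)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b

def evaluate(model, rows):
    """Mean squared error of the fitted line on the given rows."""
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in rows)

# Hypothetical raw input: strings with some missing entries.
raw = [("1", "3"), ("2", "5"), (None, "7"), ("3", "7"), ("4", None), ("5", "11")]
data = preprocess(raw)
model = learn(data)
error = evaluate(model, data)
```

For brevity the sketch evaluates on the training rows; a real evaluation phase would normally use held-out data.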

Fig. 2. A multi-dimensional taxonomy of machine learning
Machine learning can be characterized by the form of learning feedback, the aim of learning activities, and the timeliness of data availability. Accordingly, as illustrated in Fig. 2, we propose a multi-dimensional taxonomy of ML.

2. Big Data
Volume (quantity/amount of data), velocity (speed of data creation), variety (kind, nature, and format of data), veracity (trustworthiness/quality of collected data), and value (insights and impact) are the five dimensions of big data. Starting at the bottom, we organized the five dimensions into a stack of big, data, and value layers (see Figure 3).

3. Other Components
1. Users
Domain experts, end-users, and ML researchers and practitioners are all stakeholders in machine learning systems. Traditionally, ML practitioners have made the majority of decisions in using ML, from data collection to performance evaluation. End-user engagement has been restricted to supplying data labels, answering domain-related questions, and providing feedback on the learned results, which is generally mediated by practitioners, resulting in lengthy and asynchronous iterations.

2. Domain
Domain knowledge helps machine learning detect interesting patterns that might otherwise be missed by looking at datasets alone. The training datasets may not be large enough or representative enough to reveal all the patterns. Obtaining sufficient and representative data is also costly, if not impossible, due to wide domain variance and application-specific needs.

3. System
The system architecture, often known as the platform, is a combination of software and hardware that creates an environment in which machine learning algorithms can execute. A multi-core machine with a distributed architecture, for example, is expected to increase ML efficiency compared with simpler alternatives. To address the challenges of big data, new frameworks and system architectures such as Hadoop and Spark have been developed.

4. Data Preprocessing Opportunities and Challenges
The design of preprocessing pipelines and data transformations that result in a data representation able to support successful ML takes up a large portion of the actual effort in implementing an ML system. Data preprocessing tries to handle difficulties such as data redundancy, inconsistency, noise, heterogeneity, transformation, labeling, data imbalance, and feature representation/selection, among others. Because it requires human effort and involves a vast number of alternatives to choose from, data preparation and preprocessing is generally expensive.

Data Redundancy
Duplication occurs when two or more data samples represent the same entity. Data duplication or inconsistency can have a significant influence on machine learning. Although a number of approaches for detecting duplicates have been developed over the last two decades, classic methods such as pairwise similarity comparison are no longer viable for big data, since the number of comparisons grows quadratically with the number of samples.
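To illustrate why avoiding all-pairs comparison matters, the sketch below groups records by a normalized key in a single pass instead of comparing every pair. This is a swapped-in, simplified technique (key-based grouping), not a method from the surveyed literature; it catches only records that become identical after normalization, and the `normalize` rule and sample records are assumptions.

```python
from collections import defaultdict

def normalize(record):
    """Canonical key: lowercase each field and collapse surrounding/extra whitespace."""
    return tuple(" ".join(str(field).lower().split()) for field in record)

def find_duplicates(records):
    """Group record indices by normalized key in one pass over the data."""
    groups = defaultdict(list)
    for idx, record in enumerate(records):
        groups[normalize(record)].append(idx)
    # Keep only keys shared by more than one record.
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Hypothetical records with formatting differences.
records = [
    ("Alice Smith", "NYC"),
    ("alice  smith ", "nyc"),
    ("Bob Jones", "LA"),
    ("BOB JONES", "la"),
    ("Carol White", "SF"),
]
duplicate_groups = find_duplicates(records)
```

Each record is hashed once, so the expected cost is linear in the number of records rather than quadratic. Near-duplicates that differ by typos would still need a similarity-based method within each candidate group.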

Data Noise
Data sparsity, missing and erroneous values, and outliers can all introduce noise into machine learning. Traditional solutions to noisy data face obstacles when applied to big data. Manual techniques, for example, are no longer viable owing to their lack of scalability, while replacing missing values with a mean would sacrifice the richness and fine granularity of big data.
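The loss of granularity from mean replacement can be seen in a small sketch. It contrasts a single global mean with a per-group mean; the per-group variant is only an illustrative alternative assumed here, and the sensor names and readings are made up.

```python
from statistics import mean

# Hypothetical readings from two sensors operating on very different scales.
readings = [
    ("sensor_a", 10.0), ("sensor_a", None), ("sensor_a", 12.0),
    ("sensor_b", 100.0), ("sensor_b", 104.0), ("sensor_b", None),
]

def impute_global(rows):
    """Replace every missing value with the mean of all observed values."""
    fill = mean(v for _, v in rows if v is not None)
    return [(g, fill if v is None else v) for g, v in rows]

def impute_by_group(rows):
    """Replace a missing value with the mean of its own group's observed values."""
    observed = {}
    for g, v in rows:
        if v is not None:
            observed.setdefault(g, []).append(v)
    fills = {g: mean(vs) for g, vs in observed.items()}
    return [(g, fills[g] if v is None else v) for g, v in rows]

global_filled = impute_global(readings)
group_filled = impute_by_group(readings)
```

The global mean assigns the same value to both sensors, although neither ever reports anything close to it, while the per-group mean keeps the imputed values within each sensor's own range.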

Data Heterogeneity
Big data promises to provide multi-view data from a variety of sources, in a variety of formats, and from a variety of population samples, and hence is highly heterogeneous. The relevance of each view of this heterogeneous data (e.g., unstructured text, audio, and video) to a learning task may vary.

Data Discretization
Some ML techniques, such as decision trees and Naive Bayes in their basic forms, can only deal with discrete attributes. Discretization converts quantitative data into qualitative data by splitting the value domain into non-overlapping intervals.
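As one concrete instance, equal-width binning splits the value range into intervals of the same length. The sketch below is a minimal version under that assumption; the function name, the age values, and the interval labels are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in the range 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # all values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 35, 41, 58, 63, 78]
labels = ["young", "middle", "senior"]
bins = equal_width_bins(ages, n_bins=3)
discrete = [labels[b] for b in bins]
```

Every value falls into exactly one interval, so the resulting categories do not overlap. Equal-frequency or supervised discretization would choose the cut points differently.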

Data Labeling
Traditional data annotation methods require a lot of time and effort. Several alternative approaches have been proposed to cope with big data. For example, online crowdsourced repositories can provide free annotated training data covering a large number of classes with wide intra-class variation. Furthermore, probabilistic program induction may be used to achieve human-level concept learning.

Imbalanced Data
The problem of imbalanced data has traditionally been addressed with stratified random sampling. However, if repeated iterations of subsample generation and error metric computation are required, the procedure can take a long time. Furthermore, standard sampling algorithms do not support sampling over a user-defined subset of the data, which includes value-based sampling.
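A minimal sketch of stratified random sampling is given below, assuming each row carries its class label. The function name, the `label_of` accessor, and the 90/10 class split are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw the same fraction from every class so the class ratio is preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sample = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))  # keep at least one row per class
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced dataset: 90 majority rows and 10 minority rows.
rows = [(i, "majority") for i in range(90)] + [(i, "minority") for i in range(90, 100)]
subset = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.2)
minority_count = sum(1 for r in subset if r[1] == "minority")
```

Sampling within each class guarantees that the minority class is represented in the subset, which plain random sampling does not.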

Learning Opportunities and Challenges
Even before the emergence of the "big data" era, developing scalable machine learning algorithms capable of handling enormous datasets was a long-standing research subject in the ML community. We categorize the research in the taxonomy according to whether or not parallelism is addressed in the algorithms/platforms. Non-parallelism approaches strive for significantly faster optimization methods that can cope with large amounts of data without using parallelism. Traditionally, ML scalability has focused primarily on building new algorithms that run considerably more efficiently (e.g., with greatly reduced time and/or space complexity).
Table : A taxonomy of machine learning algorithms/platforms for big data

Non-parallelism
Most machine learning techniques are built on optimization. Traditional optimization methods fall into two types: combinatorial optimization (e.g., greedy search, beam search, branch-and-bound) and continuous optimization. Continuous optimization in turn divides into unconstrained optimization (e.g., gradient descent, conjugate gradient, quasi-Newton techniques) and constrained optimization (e.g., linear programming, quadratic programming).
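Gradient descent, the simplest of the unconstrained methods listed, can be sketched in a few lines. The objective f(x) = (x - 3)^2, the step size, and the iteration count are assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, step=0.1, iters=200):
    """Repeatedly move against the gradient with a fixed step size."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With this step size the distance to the minimizer shrinks by a constant factor at every iteration, so `x_min` ends up very close to 3.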

Data Parallelism
To gain scalability, existing machine learning models can make use of big data techniques. These efforts can be divided into two groups. One approach is to provide a generic middleware layer that reimplements existing learning tasks so that they can be executed on a big data platform such as Hadoop or Spark. Such a middleware layer typically offers general primitives/operations that are useful for a variety of learning tasks.
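The kind of general primitive such a layer might offer can be sketched as a map step over data partitions followed by a reduce step. The code below is a sequential simulation in plain Python and does not use the Hadoop or Spark APIs; the function names, the one-parameter model y ≈ w*x, and the data are assumptions.

```python
from functools import reduce

def partition(rows, n_parts):
    """Split rows into n_parts disjoint partitions (round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]

def partial_gradient(w, part):
    """Map step: summed squared-error gradient of y ~ w*x over one partition."""
    return sum(2.0 * (w * x - y) * x for x, y in part)

def full_gradient(w, parts):
    """Reduce step: add up the per-partition gradients."""
    return reduce(lambda a, b: a + b, (partial_gradient(w, p) for p in parts), 0.0)

# Hypothetical data generated from y = 3x.
rows = [(float(x), 3.0 * x) for x in range(1, 9)]
parts = partition(rows, n_parts=4)

w = 0.0
for _ in range(100):
    w -= 0.001 * full_gradient(w, parts)  # one gradient step per pass over the data
```

Because the gradient is a sum over samples, each partition can be processed independently and only one number per partition has to be combined, which is what makes this pattern fit data-parallel platforms.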

Models/parameter Parallelism
A lot of work has gone into figuring out how to parallelize machine learning algorithms and how to provide performance guarantees for various parallelized methods. Many machine learning algorithms are at best trivially parallel [66-68], so these efforts are justified. Furthermore, machine learning on big data is not merely a scaled-up version of machine learning on small data. It calls for new formulations and algorithms to handle the accompanying technical issues. The roots of parallelizing learning algorithms can be found in distributed and large-scale machine learning.

Distributed machine learning
In large-scale ML, distributed ML can naturally address the problems of algorithm complexity and memory limitation. When a learning algorithm cannot use all of the data to learn within an acceptable amount of time, distributed ML scales it up by spreading the learning process across numerous computers or processors and solving a distributed optimization problem. Distributed machine learning can provide not only efficiency but also fault tolerance by replicating data across machines.
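The fault-tolerance argument can be made concrete with a small sketch of replicated data placement. The round-robin placement rule, the function names, and the cluster size are assumptions for illustration and do not describe any specific system.

```python
def assign_replicas(n_shards, n_workers, replication=2):
    """Place each shard on `replication` distinct workers, round-robin.

    Assumes replication <= n_workers so the chosen workers are distinct.
    """
    return {s: [(s + r) % n_workers for r in range(replication)]
            for s in range(n_shards)}

def recoverable(placement, failed):
    """True if every shard still has at least one replica on a live worker."""
    return all(any(w not in failed for w in workers)
               for workers in placement.values())

placement = assign_replicas(n_shards=4, n_workers=4)
survives_one_failure = recoverable(placement, failed={1})
survives_two_failures = recoverable(placement, failed={1, 2})
```

With two replicas per shard, learning can continue after any single worker fails, but losing both workers that hold the same shard makes that shard unavailable.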

Deep Learning
Deep neural network-based learning has recently emerged as one of the most rapidly growing and exciting areas of big data learning. Neural networks are a class of models inspired by biological neural networks, made up of interconnected neurons whose connections can be tuned and adapted to inputs. Deep neural networks are essentially neural networks with a large number of hidden layers, that is, a deep-layered architecture, with each layer performing a nonlinear transformation from its input to its output.
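The layered structure can be sketched as a forward pass through a stack of layers. The tanh activation, the linear output layer, and the weights below are assumptions chosen for illustration; training is not shown.

```python
import math

def dense(inputs, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Apply tanh(Wx + b) at each hidden layer; the final layer is kept linear."""
    for i, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]  # nonlinear transformation
    return x

# Hypothetical weights for a network with two hidden layers.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer 1: 2 inputs -> 2 units
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),  # hidden layer 2: 2 units -> 2 units
    ([[1.0, 1.0]], [0.5]),                    # output layer: 2 units -> 1 output
]
output = forward([0.0, 0.0], layers)
```

Adding more entries to `layers` deepens the network without changing `forward`, which reflects the idea of stacking many nonlinear transformations.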

Hybrid Approaches
In addition to the two approaches above for ML on big data, hybrid approaches integrate model and data parallelism by partitioning both the data and the model variables at the same time. This not only allows distributed clusters to train faster, but also allows ML applications to run efficiently when both the data and the model are too large to fit in a single machine's memory.
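Partitioning both the data and the model variables can be sketched with a linear model whose predictions are computed block by block. The loops below run sequentially and only simulate what separate workers would do; the function names, the shard counts, and the data are assumptions.

```python
def split(seq, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(seq), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks

def hybrid_predict(X, w, n_data_parts, n_model_parts):
    """Compute X @ w by combining row shards (data) with column blocks (model)."""
    col_blocks = split(list(range(len(w))), n_model_parts)
    predictions = []
    for row_shard in split(X, n_data_parts):      # data partition
        partial = [0.0] * len(row_shard)
        for cols in col_blocks:                   # model partition
            # Each (row shard, column block) pair is an independent unit of work.
            for r, row in enumerate(row_shard):
                partial[r] += sum(row[c] * w[c] for c in cols)
        predictions.extend(partial)
    return predictions

X = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
w = [1.0, 0.5, -1.0, 2.0]
predictions = hybrid_predict(X, w, n_data_parts=2, n_model_parts=2)
```

No unit of work needs the full data or the full parameter vector: it touches only one row shard and one block of weights, which is what lets both exceed a single machine's memory.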

Conclusion
This study discusses the opportunities and challenges of machine learning (ML) on big data. Big data offers new opportunities for inspiring transformative and novel ML solutions that address many of the associated technical difficulties and create real-world impact, while also posing several challenges for classical ML in terms of scalability, adaptability, and usability. These opportunities and challenges can be used to guide future research in this field. The majority of existing work on machine learning for big data has concentrated on volume, velocity, and variety, but little has been done to address the other two dimensions of big data: veracity and value. To deal with data veracity, one promising direction is to develop algorithms that can assess the trustworthiness or credibility of data or data sources, allowing untrustworthy data to be filtered out during preprocessing; another is to develop new machine learning models that can perform inference with unreliable or even contradictory data.
In summary, machine learning is required to handle the problems posed by big data and to uncover hidden patterns, knowledge, and insights, so that the promise of big data can be turned into practical value for business decision-making and scientific exploration. The combination of machine learning and big data points to a promising future on a new frontier.