A Survey on Machine Learning-Based Mobile Big Data Analysis: Challenges and Applications

This paper identifies the requirements and the development of machine learning-based mobile big data (MBD) analysis by discussing the challenges of mobile big data, and it reviews state-of-the-art applications of data analysis in the area of MBD. First, we introduce the development of MBD. Second, the frequently applied data analysis methods are reviewed. Three typical applications of MBD analysis, namely, wireless channel modeling, human online and offline behavior analysis, and speech recognition in the Internet of Vehicles, are then introduced. Finally, we summarize the main challenges and future development directions of mobile big data analysis.


Introduction
With the success of wireless local area network (WLAN) technology (a.k.a. Wi-Fi) and the second/third/fourth generation (2G/3G/4G) mobile networks, the number of mobile phones is rising dramatically, reaching 7.74 billion, or 103.5 per 100 inhabitants worldwide, in 2017 [1]. Nowadays, a mobile phone can not only send voice and text messages but also easily and conveniently access the Internet, which has been recognized as the most revolutionary development of the mobile Internet (M-Internet). Meanwhile, worldwide active mobile-broadband subscriptions increased to 4.22 billion in 2017, which is 9.21% higher than in 2016 [1]. Figure 1 shows the numbers of mobile-cellular telephone and active mobile-broadband subscriptions for the world and the main districts from 2010 to 2017. The numbers above the bars are the worldwide mobile-cellular telephone or active mobile-broadband subscriptions (in millions) for each year, which increase year by year. Under the M-Internet, various kinds of content (image, voice, video, etc.) can be sent and received everywhere, and related applications emerge to satisfy people's requirements, including work, study, daily life, entertainment, education, and healthcare. In China, the mobile application giants, i.e., Baidu, Alibaba, and Tencent, held 78% of the M-Internet online time per day spent in apps, which was about 2,412 minutes in 2017 [2]. This figure indicates that the M-Internet has entered a rapid growth stage.
Nowadays, more than 1 billion smartphones are in use and produce a great quantity of data every day. This situation has far-reaching impacts on society and social interaction and creates great opportunities for business. Meanwhile, with the rapid development of the Internet of Things (IoT), much more data is automatically generated by millions of machine nodes with growing mobility, for example, sensors carried by moving objects or vehicles. The volume, velocity, and variety of these data are increasing extremely fast, and soon they will become the new criterion for data analytics of enterprises and researchers. Therefore, mobile big data (MBD) is already part of our lives and is being enriched rapidly. The explosive increase of data volume, together with the increasing bandwidth and data rate in the M-Internet, has followed the same exponential trend as Moore's Law for semiconductors [3]. According to the prediction in [2], the global data volume will grow to 47 zettabytes (1 zettabyte = $1 \times 10^{21}$ bytes) by 2020 and 163 zettabytes by 2025. For the M-Internet, mobile data traffic generated 3.7 exabytes (1 exabyte = $1 \times 10^{18}$ bytes) of data per month in 2015 [4] and 7.2 exabytes in 2016 [5], and it is forecast to reach 24 exabytes by 2019 [5] and 49 exabytes by 2021 [5]. Based on these statistics and predictions, the concept of MBD has emerged.
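To make the comparison with Moore's Law concrete, the following minimal sketch computes the compound annual growth rate implied by the monthly traffic figures quoted above (3.7 exabytes in 2015 and a forecast of 49 exabytes in 2021). The function name and the choice of these two end points are illustrative assumptions, not part of the cited reports.

```python
import math


def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Monthly mobile data traffic in exabytes, as quoted in the text.
traffic_2015 = 3.7
traffic_2021 = 49.0

rate = annual_growth_rate(traffic_2015, traffic_2021, years=2021 - 2015)
# Time needed for the traffic to double at this constant rate.
doubling_time = math.log(2) / math.log(1 + rate)

print(f"implied annual growth: {rate:.1%}")
print(f"implied doubling time: {doubling_time:.2f} years")
```

Under these assumptions the traffic grows by roughly 54% per year, which corresponds to a doubling time of about 1.6 years. This is of the same order as the two-year doubling period commonly quoted for Moore's Law.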
The MBD can be considered as a huge quantity of mobile data which are generated from a massive number of mobile devices and cannot be processed and analyzed by a single machine [6,7]. With the popularization of mobile devices, including smartphones and IoT gadgets, MBD is playing and will continue to play a more important role than ever before, especially in the era of 4G and the forthcoming fifth generation (5G) [4,8].
With the rapid development of information technologies, data generated in different technical fields are showing explosive growth [9]. Big data has broad application prospects in many fields and has become an important national strategic resource [10]. In the era of big data, many data analysis systems face big challenges as the volume of data increases. Therefore, analysis for MBD is currently a highly focused topic. The importance of MBD analysis is determined by its role in developing complex mobile systems which support a variety of intelligent interactive services, for example, healthcare, intelligent energy networks, smart buildings, and online entertainment [4]. MBD analysis can be defined as mining terabyte-level or petabyte-level data collected from mobile users and wireless devices at the network level or the app level to discover unknown, latent, and meaningful patterns and knowledge with large-scale machine learning methods [11].
Present requirements of MBD are based on software-defined approaches in order to be more scalable and flexible. The M-Internet environment in the future will be even more complex and interconnected [12]. For this purpose, data centers of MBD need to collect statistics on millions of users and obtain meaningful results by proper MBD analysis methods. Owing to the decreasing price of data storage and widely accessible high-performance computers, machine learning has expanded from theoretical research into various application areas of big data. Even so, there is still a long way to go for machine learning-based MBD analysis.
Machine learning technology has been used by many Internet companies in their services, from web search [13,14] to content filtering [15] and recommendation [16,17] on online social communities, shopping websites, or content distribution platforms. Furthermore, it frequently appears in products like smartphones, laptop computers, and smart furniture. Machine learning systems are used to detect and classify objects, return the most relevant search results, understand voice commands, and analyze usage habits. In recent years, big data machine learning has become a hot spot [18]. Some conventional machine learning methods based on the Bayesian framework [19][20][21][22], distributed optimization [23][24][25][26], and matrix factorization [27] can be applied to the aforementioned applications and have obtained good performance on small data sets. On this foundation, researchers have always been trying to feed their machine learning models with more and more data [28]. Furthermore, the data obtained are not only big but also multisource, dynamic, and sparse in value; these features make it harder to analyze MBD with conventional machine learning methods. Therefore, the aforementioned applications implemented with conventional machine learning methods have hit a bottleneck of low accuracy and poor generalization. Recently, a class of novel techniques, called deep learning, has been applied to these problems and has obtained good performance [29]. Machine learning, especially deep learning, has become an essential technique for using big data effectively.
Most conventional machine learning methods are shallow learning structures with one or no hidden layers. These methods perform well in practical use and have been precisely analyzed theoretically. But when dealing with high-dimensional or complicated data, shallow machine learning methods show their weakness. Deep learning methods are developed to learn better representations automatically with deep structures by using supervised or unsupervised strategies [30,31]. The features extracted by deep hidden layers are used for regression, classification, or visualization. Deep learning uses more hidden layers and parameters to fit functions which can extract high-level features from complex data; the parameters are set automatically using a large amount of unlabeled data [32,33]. The hidden layers of deep learning algorithms help the model learn better representations of data; the higher layers learn specific and abstract features from the global features learned by lower layers. Many surveys show that nonlinear feature extractors linked up as stacks, such as deep learning methods, often perform better in machine learning tasks, for example, yielding more accurate classification [34], better learning of probabilistic data models [35], and the extraction of robust features [36]. Deep learning methods have proved useful in data mining, natural language processing, and computer vision applications. A more detailed introduction of deep learning is presented in Section 3.1.4.
Artificial Intelligence (AI) is a technology that develops theories, methods, techniques, and applications that simulate or extend human brain abilities. Research on the observing, learning, and decision-making processes in the human brain motivated the development of deep learning, which was first designed to emulate the human brain's neural structures. Further observation of neural signal processing and its effect on brain mechanisms [37][38][39] inspired the architecture design of deep learning networks, which use layers and neuron connections to generalize globally. Conventional methods such as support vector machines, decision trees, and case-based reasoning, which are based on statistics or human logic knowledge, may fall short when facing complex structures or relationships in data. Deep learning methods can learn patterns and relationships through hidden layers and, with neural network visualization methods, may in turn benefit the study of signal processing in the human brain. Deep learning has attracted much attention from AI researchers recently because of its state-of-the-art performance in machine learning domains including not only the aforementioned natural language processing (NLP), but also speech recognition [40,41], collaborative filtering [42], and computer vision [43,44].
Deep learning has been successfully used in industry products which have access to big data from users. Companies in the United States such as Google, Apple, and Facebook and Chinese companies like Baidu, Alibaba, and Tencent have been collecting and analyzing data from millions of users and pushing forward deep learning-based applications. For example, Tencent YouTu Lab has developed identification (ID) card and bank card recognition systems. These systems can read information from card images to check user information during registration and bank information during purchases. The recognition systems are based on deep learning models and the large volume of user data provided by Tencent. Apple develops Siri, a virtual intelligent assistant in iPhones, which answers questions about weather, location, and news according to voice commands and can dial numbers or send text messages. Siri also utilizes deep learning methods and uses data from Apple services [45]. Google uses deep learning in its Google Translate service with massive data collected by the Google search engine.
MBD contains a large variety of information, including offline data and online real-time data streams generated from smart mobile terminals, sensors, and services, and it hastens various applications based on the advancement of data analysis technologies, such as collaborative filtering-based recommendation [46,47], user social behavior characteristics analysis [48][49][50][51], vehicle communications in the Internet of Vehicles (IoV) [52], online smart healthcare [53], and city residents' activity analysis [6]. Although machine learning-based methods are widely applied in the MBD fields and obtain good performance in real data tests, the present methods still need to be further developed. Five main challenges facing machine learning-based MBD analysis should therefore be considered: large-scale and high-speed M-Internet, overfitting and underfitting problems, the generalization problem, cross-modal learning, and extended channel dimensions.
This paper attempts to identify the requirement and the development of machine learning-based mobile big data analysis through discussing the insights of challenges in the MBD and reviewing state-of-the-art applications of data analysis in the area of MBD. The remainder of the paper is organized as follows. Section 2 introduces the development of data collection and properties of MBD. The frequently adopted methods of data analysis and typical applications are reviewed in Section 3. Section 4 summarizes the future challenges of MBD analysis and provides suggestions.

Data Collection.
Data collection is the foundation of a data processing and analysis system. Data are collected from mobile smart terminals and Internet services, generally called mobile Internet devices (MIDs), which are multimedia-capable mobile devices providing wireless Internet access and include smartphones, wearable computers, laptop computers, wireless sensors, etc. [54]. MBD can be divided into two hierarchical data forms, from bottom to top: transmission data and application data. The transmission data focus on solving channel modeling [55,56] and user access problems corresponding to the physical transmission system of the M-Internet. On this foundation, application data focus on the applications based on the MBD, including social network analysis [57][58][59], user behavior analysis [48,50,60], speech analysis and decision in IoV [61][62][63][64][65][66], smart grid [67,68], networked healthcare [53,69,70], finance services [46,71], etc.
Due to the heterogeneity of the M-Internet and the variety of the access devices, the collected data are unstructured and usually come in many categories and formats, which makes data preprocessing an essential part of a data processing and analysis system in order to ensure that the input data are complete and reliable [72]. Data preprocessing can be divided into three steps: data cleaning, generation of implicit ratings, and data integration [46].
(1) Data Cleaning. Due to possible equipment failures, transmission errors, or human factors, raw data are generally "dirty data" which cannot be used directly [46]. Therefore, data cleaning methods, including outlier detection and denoising, are applied in data preprocessing to obtain data that meet the required quality. Manual removal of erroneous data is impractical in MBD due to the massive volume. Common data cleaning methods can alleviate the dirty data problem to some extent by training support vector regression (SVR) models [73], multiple linear regression models [74], autoencoders [75], Bayesian methods [76][77][78], unsupervised methods [79], or information-theoretic models [79].
(2) Generation of Implicit Ratings. Generation of implicit ratings is mainly applied in recommender systems. To alleviate the data sparsity problem, implicit ratings are inferred from specific user behaviors with machine learning algorithms, for example, neural networks and decision trees, which rapidly increases the volume of rating data [46].
(3) Data Integration. Data integration is the step that combines data from different sources with different formats and categories and handles missing data fields [7]. Figure 2 represents the procedures of data collection and preprocessing.
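As a small illustration of steps (1) and (3) above, the following sketch removes outliers with a median-based rule and merges records from two sources while filling missing fields. It is a minimal sketch under stated assumptions: the modified z-score rule, its threshold of 3.5, the field names, and the sample records are all hypothetical choices, not methods taken from the cited works.

```python
from statistics import median


def remove_outliers(values, threshold=3.5):
    """Drop points whose modified z-score exceeds `threshold`.

    The modified z-score uses the median absolute deviation (MAD), which is
    less affected by the outliers themselves than the mean and standard
    deviation. The threshold value is an illustrative assumption.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # no spread to measure against; keep everything
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


def integrate(records_a, records_b, key, defaults):
    """Merge two lists of dicts on `key`, filling missing fields from `defaults`."""
    merged = {}
    for record in list(records_a) + list(records_b):
        merged.setdefault(record[key], dict(defaults)).update(record)
    return list(merged.values())


# Hypothetical signal-strength readings with one corrupted value.
readings = [-70, -71, -69, -72, -70, 15]
clean = remove_outliers(readings)

# Hypothetical records from a network-level and an app-level source.
network_records = [{"user": "u1", "rssi": -70}]
app_records = [{"user": "u1", "app": "chat"}, {"user": "u2", "app": "video"}]
merged = integrate(network_records, app_records, key="user",
                   defaults={"rssi": None, "app": None})

print(clean)
print(merged)
```

In this example the corrupted reading is dropped, and the record for the second user keeps an explicit `None` for the field that only the network-level source could have supplied.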

Properties of Mobile Big Data.
The MBD brings a large number of new challenges to conventional data analysis methods because of its high dimensionality, heterogeneity, and other complex features arising from applications such as planning, operation and maintenance, optimization, and marketing [57]. This section discusses how the five Vs (short for volume, velocity, variety, value, and veracity) features [80] of big data carry over to the MBD. The five Vs features have become more pronounced in the M-Internet, since it lets users access the Internet anytime and anywhere [81].
(1) Volume: Large Number of MIDs, Exabyte-Level Data, and High-Dimensional Data Space. Volume is the most obvious feature of MBD. In the forthcoming 5G network and the era of MBD, conventional storage and analysis methods are incapable of processing the 1000x or more wireless traffic volume [7,82]. It is of great urgency to improve present MBD analysis methods and propose new ones. The methods should be simple and cost-effective to implement for MBD processing and analysis. Moreover, they should be effective without requiring a massive amount of data for model training. Finally, they should be precise enough to be applied in various fields [81].
(2) Velocity: Real-Time Data Streams and Efficiency Requirement. Velocity can be considered as the speed at which data are transmitted and analyzed [83]. The data is now continuously streaming into the servers in real-time and makes the original batch process break down [84]. Due to the high generating rate of MBD, velocity is the efficiency requirement of MBD analysis since real-time data processing and analysis are extremely important in order to maximize the value of MBD streams [7].
(3) Variety: Heterogeneous and Nonstructured Mobile Multimedia Contents. Due to the heterogeneity of MBD which means that mobile data traffic comes from spatially distributed data resources (i.e., MIDs), the variety of MBD arises and makes the MBD more complex [4]. Meanwhile, the nonstructured MBD also causes the variety. The MBD can be divided into structured data, semistructured data, and unstructured data. Here, unstructured data are usually collected in new applications and have random data fields and contents [7]; therefore, they are difficult to analyze before data cleaning and integration.

(4) Value: Mining Hidden Knowledge and Patterns from Low Density Value Data. Value, or more precisely the low value density of MBD, is caused by a large amount of useless or repeated information in the MBD. Therefore, we need to mine the value hidden in the data through MBD analysis, that is, through the extraction of hidden knowledge and patterns. The purified data can provide comprehensive information that leads to more effective analysis results about user demands, user behaviors, and user habits [85] and helps achieve better system management and more accurate demand prediction and decision-making [86].
(5) Veracity: Consistency, Trustworthiness, and Security of MBD. The veracity of MBD includes two parts: data consistency and trustworthiness [80]. It can also be summarized as data quality. MBD quality is not guaranteed due to the noise of transmission channel, the equipment malfunctioning, and the uncalibrated sensors of MIDs or the human factor (for instance, malicious invasion) resulting in low-quality data points [4]. Veracity of MBD ensures that the data used in analysis process are authentic and protected from unauthorized access and modification [80].

Development of Data Analysis Methods.
In this section, we present some recent achievements in data analysis from four different perspectives.

Divide-and-Conquer Strategy and Sampling of Big Data.
The divide-and-conquer strategy is a computing paradigm for dealing with big data problems. The development of distributed and parallel computing makes the divide-and-conquer strategy particularly important. Generally speaking, whether the diversity of samples in the learning data benefits the training results varies from case to case. Redundant and noisy data can cause a large storage cost, reduce the efficiency of the learning algorithm, and affect the learning accuracy. Therefore, it is preferable to select representative samples to form a subset of the original sample space according to a certain performance standard, such as maintaining the distribution of samples, preserving the topological structure, or keeping the classification accuracy. The learning method is then constructed on this subset to finish the learning task. In this way, we can maintain or even improve the performance of a big data analysis algorithm with minimum computing and storage resources. Learning with big data therefore places high demands on sample selection methods. But most sample selection methods are only suitable for smaller data sets, such as the traditional condensed nearest neighbor [93], the reduced nearest neighbor [94], and the edited nearest neighbor [95]; the core concept of these methods is to find the minimum consistent subset. To find the minimum consistent subset, we need to test every sample, and the result is very sensitive to the initialization of the subset and the order in which the samples are presented. Li et al. [96] proposed a method to select the classification and edge boundary samples based on local geometry and probability distribution. The method keeps the spatial information of the original data but needs to compute k-means for each sample. Angiulli et al. [97,98] proposed a fast condensation nearest neighbor (FCNN) algorithm based on condensed nearest neighbor, which tends to choose the classification boundary samples.
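The idea of a consistent subset can be illustrated with a minimal sketch of the classic condensed nearest neighbor rule. The code below is a simplified illustration rather than the exact procedure of [93]: it returns a consistent subset that is not guaranteed to be minimal, it assumes that no two identical points carry different labels, and the one-dimensional data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(point, store):
    """Label of the stored sample closest to `point` (1-NN rule)."""
    return min(store, key=lambda item: squared_distance(point, item[0]))[1]


def condensed_nearest_neighbor(samples):
    """Return a subset that classifies every sample correctly with 1-NN.

    `samples` is a list of (feature_tuple, label) pairs. The result depends
    on the first sample and on the order of `samples`.
    """
    store = [samples[0]]
    changed = True
    while changed:
        changed = False
        for sample in samples:
            features, label = sample
            # Add a sample only if the current subset misclassifies it.
            if nearest_label(features, store) != label and sample not in store:
                store.append(sample)
                changed = True
    return store


# Hypothetical one-dimensional data: two well-separated classes.
samples = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"), ((0.3,), "a"),
           ((1.0,), "b"), ((1.1,), "b"), ((1.2,), "b"), ((1.3,), "b")]
subset = condensed_nearest_neighbor(samples)
print(subset)
```

Here eight samples are condensed to two, one per class. Every pass scans the full data set, which is why such methods become expensive on large data, and reordering `samples` can change which points are kept.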
Jordan [99] discussed statistical inference methods for big data. When performing statistical inference with a divide-and-conquer algorithm, we need to obtain confidence intervals from huge data sets. The bootstrap aims to quantify the fluctuation of an estimate by resampling the data and then calculating a confidence interval, but it does not scale to big data: working on an incomplete sample of the data can lead to erroneous ranges of fluctuation, so the sampling must be corrected in order to provide calibrated statistical inference. An algorithm named Bag of Little Bootstraps was proposed, which not only avoids this problem but also has many computational advantages. Another problem discussed in [99] is massive matrix calculation. The divide-and-conquer strategy is heuristic and works well in practical applications. However, new theoretical problems arise when trying to describe the statistical properties of the partition algorithm. To this end, the support concentration theorem based on the theory of random matrices has been proposed.
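The resampling idea behind the Bag of Little Bootstraps can be sketched as follows for the simplest case, the standard error of a mean. This is a minimal illustration under stated assumptions, not the implementation of [99]: the parameter names, the default values, and the synthetic Gaussian data are hypothetical. Each small subset of size b is resampled up to the full size n, so the fluctuation is measured at the correct scale while only b distinct points are touched at a time.

```python
import random
from collections import Counter
from statistics import mean, stdev


def bag_of_little_bootstraps(data, num_subsets=5, num_resamples=50,
                             gamma=0.7, seed=0):
    """Estimate the standard error of the mean of `data` (a list of floats)."""
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))  # size of each small subset
    subset_errors = []
    for _ in range(num_subsets):
        subset = rng.sample(data, b)  # drawn without replacement
        estimates = []
        for _ in range(num_resamples):
            # Resample n points from the b subset points; only the
            # multiplicity of each subset point is needed.
            counts = Counter(rng.choices(range(b), k=n))
            estimates.append(sum(subset[i] * c for i, c in counts.items()) / n)
        subset_errors.append(stdev(estimates))
    # Average the per-subset error estimates.
    return mean(subset_errors)


# Synthetic data: 2000 draws from a standard normal distribution.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
standard_error = bag_of_little_bootstraps(data)
print(f"estimated standard error of the mean: {standard_error:.4f}")
```

For this synthetic data the theoretical standard error is 1/sqrt(2000), about 0.022, and the estimate should land close to that value. Because each subset is processed independently, the outer loop is a natural candidate for parallel execution.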
In conclusion, data partition and parallel processing strategy is the basic strategy to deal with big data. But the current partition and parallel processing strategy uses little data distribution knowledge, which has influence on the load balancing and the calculation efficiency of big data processing. Hence, there exists an urgent requirement to solve the problem about how to learn the distribution of big data for the optimization of load balancing.

Feature Selection of Big Data.
In the field of data mining, for example in document classification and indexing, the data sets are usually large and contain a large number of records and features, which leads to low algorithm efficiency. By feature selection, we can eliminate the irrelevant features and speed up the analysis task. Thus, we can obtain a better-performing model with less running time.
Big data processing faces a huge challenge in dealing with high-dimensional and sparse data. Traffic networks, smartphone communication records, and information shared on the Internet provide a large amount of high-dimensional data, for which a tensor (such as a multidimensional array) is the natural representation. Tensor decomposition, in this condition, becomes an important tool for summary and analysis. Kolda [100] proposed a memory-efficient Tucker (MET) decomposition, which reduces the time and space costs that traditional tensor decomposition algorithms cannot avoid. MET adaptively selects an execution strategy based on the available memory during the decomposition and maximizes the speed of computation within that memory. MET avoids dealing with the large number of sporadic intermediate results produced during the calculation process. The adaptive selection of the operation sequence not only eliminates the intermediate overflow problem but also saves memory without reducing the precision. On the other hand, Wahba [101] proposed two approaches to statistical machine learning models which involve discrete, noisy, and incomplete data. These two methods are regularized kernel estimation (RKE) and robust manifold unfolding (RMU). These methods use dissimilarity information between training samples to obtain a low-rank nonnegative definite matrix. The matrix is then embedded into a low-dimensional Euclidean space, whose coordinates can be used as features for various learning modes. Similarly, most online learning research needs to access all features of the training instances. Such a classic scenario is not always suitable for practical applications when facing high-dimensional data instances or expensive feature sets. In order to break through this limit, Hoi et al. [102] propose an efficient algorithm for the online feature selection problem that makes predictions using only a small number of active features, based on their study of sparse regularization and truncation techniques. They also test the proposed algorithm on some public data sets for feature selection performance.
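The truncation idea in online feature selection can be sketched as a mistake-driven linear learner that keeps only a fixed budget of nonzero weights after every update. This is a simplified illustration, not the exact algorithm of [102]: the perceptron-style update, the learning rate, the omission of any regularization step, and the toy data stream are all assumptions made for brevity.

```python
def truncate(weights, budget):
    """Zero all but the `budget` largest-magnitude weights."""
    if budget >= len(weights):
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]),
                   reverse=True)
    keep = set(order[:budget])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def online_feature_selection(stream, num_features, budget, learning_rate=0.2):
    """Learn a sparse linear classifier from (features, label) pairs.

    Labels are +1 or -1. At most `budget` weights are nonzero at any time,
    so prediction only needs the corresponding active features.
    """
    weights = [0.0] * num_features
    for features, label in stream:
        score = sum(w * x for w, x in zip(weights, features))
        if label * score <= 0:  # update only when the prediction is wrong
            weights = [w + learning_rate * label * x
                       for w, x in zip(weights, features)]
            weights = truncate(weights, budget)
    return weights


# Hypothetical stream: feature 0 carries the label, the others are noise.
stream = [((1.0, 0.1, -0.1), 1), ((-1.0, 0.1, 0.1), -1),
          ((1.0, -0.1, 0.1), 1), ((-1.0, -0.1, -0.1), -1)]
weights = online_feature_selection(stream, num_features=3, budget=1)
print(weights)
```

With a budget of one, only the informative feature keeps a nonzero weight in this toy stream. On real high-dimensional data the budget trades accuracy against the cost of acquiring and storing features.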
The traditional self-organizing map (SOM) can be used for feature extraction, but the low speed of SOM limits its usage on large data sets. Sagheer [103] proposed a fast self-organizing map (FSOM) to solve this problem. The goal of this method is to find the region of the feature space in which the data are mainly distributed. If such a region exists, features can be extracted from it instead of from the overall feature space. In this way, the extraction time can be greatly reduced.
Anaraki [104] proposed a threshold method of fuzzy rough set feature selection based on fuzzy lower approximation. This method adds a threshold to limit the QuickReduct feature selection. The results of the experiment prove that this method can also help the accuracy of feature extraction with lower running time.
Gheyas et al. [105] proposed a hybrid algorithm of simulated annealing and genetic algorithm (SAGA), combining the advantages of the simulated annealing algorithm, genetic algorithm, greedy algorithm, and neural network algorithm, to solve the NP-hard problem of selecting the optimal feature subset. The experiments show that this algorithm can find better feature subsets while sharply reducing the time cost. Gheyas et al. pointed out in their conclusion that there is seldom a single algorithm which can solve all problems and that a combination of algorithms can effectively raise the overall effect.
To sum up, because of the complexity, high dimensionality, and uncertain characteristics of big data, it is an urgent problem to solve how to reduce the difficulty of big data processing by using dimension reduction and feature selection technology.

Big Data Classification.
Supervised learning (classification) faces a new challenge of how to deal with big data. Currently, classification problems involving large-scale data are ubiquitous, but the traditional classification algorithms do not fit big data processing properly.
(1) Support Vector Machine (SVM). Traditional statistical machine learning methods have two main problems when facing big data: (i) they usually involve intensive computation, which makes them hard to apply to big data sets, and (ii) the prediction of a model that fits the robust and nonparametric confidence interval is unknown. Lau et al. [106] proposed an online support vector machine (SVM) learning algorithm to deal with the classification problem for sequentially provided input data. The classification algorithm is faster, uses fewer support vectors, and has better generalization ability. Laskov et al. [107] proposed a rapid, stable, and robust numerical incremental support vector machine learning method. Chang et al. [108] developed an open source package called LIBSVM as a library for SVM code implementation.
In addition, Huang et al. [109] presented a large margin classifier, the maxi-min margin machine ($M^4$). Unlike other large margin classifiers, which construct the separating hyperplane either locally or globally, this model can learn both local and global decision boundaries. The model is closely connected with the SVM and the minimax probability machine (MPM). It has important theoretical significance, and its optimization problem can be solved in polynomial time.
(2) Decision Tree (DT). The traditional decision tree (DT), as a classic classification learning algorithm, has a large memory requirement when processing big data. Franco-Arcega et al. [110] put forward a method of constructing DTs from big data, which overcomes some weaknesses of the algorithms in use. Furthermore, it can use all the training data without saving them in memory. Experimental results showed that this method is faster than current decision tree algorithms on large-scale problems. Yang et al. [111] proposed a fast incremental optimization decision tree algorithm for processing large data with noise. Compared with former decision tree data mining algorithms, this method has a major advantage in real-time mining speed, which makes it well suited to continuous data from mobile devices. The most valuable feature of this model is that it can prevent the explosive growth of the decision tree size and the decrease of prediction accuracy when the data packets contain noise. The model can generate compact decision trees and predict accurately even with highly noisy data. Ben-Haim et al. [112] proposed an algorithm for building parallel decision tree classifiers. The algorithm runs in a distributed environment and is suitable for large amounts of data and for streaming data. Compared with a serial decision tree, the algorithm improves efficiency while approximately preserving accuracy.

(3) Neural Network and Extreme Learning Machine (ELM).
Traditional feedforward neural networks usually use the gradient descent algorithm to tune the weight parameters. Generally speaking, slow learning speed and poor generalization performance are the bottlenecks that restrict the application of feedforward neural networks. Huang et al. [113] discarded the iterative adjustment strategy of the gradient descent algorithm and proposed the extreme learning machine (ELM). This method randomly assigns the input weights and the biases of a single-hidden-layer neural network and then determines the output weights of the network analytically in a one-step calculation. Compared with traditional feedforward neural network training algorithms, in which the network weights are determined by multiple iterations, the training speed of ELM is significantly improved.
However, due to the limitation of computing resources and computational complexity, it is difficult to train a single ELM on big data. There are usually two ways to solve this problem: (1) training ELMs [114] with the divide-and-conquer strategy; (2) introducing a parallel mechanism [115] to train a single ELM. It is shown in [116,117] that a single ELM has strong function approximation ability. Whether this approximation capability can be extended to ELMs based on the divide-and-conquer strategy is a key index for evaluating whether ELM can be applied to big data.
Some of the related studies also include effective learning to solve such problem [118].
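The one-step training of an ELM described above can be sketched in a few lines. The following is a minimal sketch under stated assumptions rather than the formulation of [113]: it uses a tanh activation, draws the hidden parameters uniformly from [-1, 1], and obtains the output weights from ridge-regularized normal equations instead of a pseudoinverse. All function names, parameter values, and the toy regression target are hypothetical.

```python
import math
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def hidden_layer(inputs, in_weights, biases):
    """Hidden activations for every input vector."""
    return [[math.tanh(sum(w * v for w, v in zip(ws, x)) + b)
             for ws, b in zip(in_weights, biases)] for x in inputs]


def train_elm(inputs, targets, num_hidden=20, ridge=1e-3, seed=0):
    """Fix random hidden parameters, then solve for the output weights once."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    in_weights = [[rng.uniform(-1, 1) for _ in range(dim)]
                  for _ in range(num_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(num_hidden)]
    H = hidden_layer(inputs, in_weights, biases)
    # Normal equations with a ridge term: (H^T H + ridge * I) beta = H^T y
    gram = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
             for j in range(num_hidden)] for i in range(num_hidden)]
    moment = [sum(h[i] * t for h, t in zip(H, targets))
              for i in range(num_hidden)]
    beta = solve_linear_system(gram, moment)
    return in_weights, biases, beta


def predict_elm(model, inputs):
    in_weights, biases, beta = model
    return [sum(b * h for b, h in zip(beta, row))
            for row in hidden_layer(inputs, in_weights, biases)]


# Toy regression task: fit y = x^2 on 41 points in [-1, 1].
inputs = [[-1.0 + 0.05 * k] for k in range(41)]
targets = [x[0] ** 2 for x in inputs]
model = train_elm(inputs, targets)
predictions = predict_elm(model, inputs)
mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
print(f"training MSE: {mse:.5f}")
```

No gradient iterations are involved: the only trained parameters are the output weights, and they come from a single linear solve. The cost of that solve grows with the number of hidden units and samples, which is the reason divide-and-conquer or parallel variants are needed on big data.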
In summary, the traditional classification method of machine learning is difficult to apply to the analysis of big data directly. The study of parallel or improved strategies of different classification algorithms has become the new direction.

Big Data Deep Learning.
With the unprecedentedly large and rapidly growing volumes of data, it is hard to extract hidden information from big data with ordinary machine learning methods. The shallow-structured learning architectures of most conventional learning methods are not fit for the complex structures and relationships in these input data. Big data deep learning algorithms, with their deep architectures and global feature extraction ability, can learn the complex patterns and hidden connections behind big data [37,119]. Deep learning has achieved state-of-the-art performance on many benchmarks and has also been applied in industry products. In this section, we introduce some deep learning methods in big data analytics.
Big data deep learning has some problems: (1) the hidden layers of a deep network make it difficult to learn from a given data vector, (2) the gradient descent method for parameter learning makes the initialization time increase sharply as the number of parameters rises, and (3) the approximations at the deepest hidden layer may be poor. Hinton et al. [32] proposed a deep architecture, the deep belief network (DBN), which can learn from both labeled and unlabeled data by using an unsupervised pretraining method to learn the distribution of unlabeled data and a supervised fine-tuning method to construct the model; this solved part of the aforementioned problems. Subsequent research, for example, [120], improved the DBN in an attempt to solve the remaining problems.
The convolutional neural network (CNN) [121] is another popular deep learning network structure for big data analysis. A CNN has three common features, namely, local receptive fields, shared weights, and spatial or temporal subsampling, and two typical types of layers [122,123]. Convolutional layers are the key parts of the CNN structure and aim to extract features from images. Subsampling layers, which are also called pooling layers, adjust the outputs of convolutional layers to obtain translation invariance. For big data, CNN is mainly applied in the computer vision field, for example, in image classification [124,125] and image segmentation [126].
Document (or textual) representation, which is part of NLP, is the basic method for information retrieval and is important for understanding natural language. Document representation finds specific or important information in documents by analyzing document structure and content. This information could be the document topic or a set of labels highly related to the document. Shallow models for document representation focus on only a small part of the text and capture simple connections between words and sentences. Deep learning can obtain a global representation of the document because its large receptive field and hidden layers can extract more meaningful information. Deep learning methods for document representation make it possible to obtain features from high-dimensional textual data. Hinton et al. [127] proposed a deep generative model to learn binary codes for documents, which makes documents easy to store. Socher et al. [128] proposed a recursive neural network for analyzing natural language and contexts, achieving state-of-the-art results on segmentation and understanding in natural language processing. Kumer et al. [129] proposed a recurrent neural network (RNN) that constructs a search space from a large amount of textual data.
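The three CNN features named earlier in this subsection (local receptive fields, shared weights, and subsampling) can be made concrete with a small NumPy sketch. The image, the kernel, and the function names are hypothetical, and the "convolution" is implemented as cross-correlation, as is common in deep learning practice.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Local receptive field: only a kh x kw patch feeds each output value.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling (spatial subsampling)."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # vertical edge between columns 2 and 3
edge_kernel = np.array([[-1.0, 1.0]])    # horizontal difference filter (shared weights)
fmap = conv2d_valid(image, edge_kernel)  # feature map of shape (6, 5)
pooled = max_pool(fmap)                  # subsampled map of shape (3, 2)
```

Because the same two kernel weights are reused at every position, the layer has far fewer parameters than a fully connected layer of the same input size, and pooling keeps the edge response even if the edge shifts by one pixel within a pooling window.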
With the rapid growth and increasing complexity of academic and industry data sets, how to train deep learning models with large numbers of parameters has become a major problem. The works in [40,41,43,130–133] proposed effective and stable parameter updating methods for training deep models. Researchers focus on large-scale deep learning that can be implemented in parallel, including improved optimizers [131] and new structures [121,133–135].
In conclusion, big data deep learning methods are key methods for data mining. They use complex structures to learn patterns from big data sets and multimodal data. The development of data storage and computing technology promotes the development of deep learning methods and makes them easier to use in practical situations.

Wireless Channel Modeling.
As is well known, wireless communication transmits information through electromagnetic waves between a transmitting antenna and a receiving antenna; the propagation path between them is regarded as the wireless channel. In the past few decades, the channel dimension has been extended to space, time, and frequency, which means that channel properties have been explored comprehensively. Another development is that channel characteristics can be accurately described by different methods, such as channel modeling [136].
Liang et al. [137] used machine learning to predict channel state information so as to decrease the pilot overhead. Especially for 5G, wireless big data is emerging, and its related technologies are being applied to traditional communication research to meet the demands of 5G. However, the wireless channel is essentially a physical electromagnetic wave, and current 5G channel model research follows the traditional way. Zhang [138] proposed an interdisciplinary study of big data and wireless channels, which is a cluster-nuclei based channel model. In the cluster-nuclei based channel model, the multipath components (MPCs) are aggregated into a traditional stochastic channel model. At the same time, the scene is recognized by the computer and the environment is rebuilt by machine learning methods. Then, by matching the real propagation objects with the clusters, the cluster-nuclei, which are the key factors linking the deterministic environment and the stochastic clusters, can be easily found. There are two main steps that employ machine learning methods in the cluster-nuclei based channel model. The recent progress is described as follows.

A Gaussian Mixture Model (GMM) Based Channel MPCs Clustering Method.
The MPCs are clustered with the Gaussian mixture model (GMM) [87,139]. Using sufficient statistic characteristics of channel multipath, the GMM can obtain clusters corresponding to the multipath propagation characteristics. The GMM assumes that all the MPCs consist of several Gaussian distributions in varying proportions. Given a set of channel multipaths $X = \{x_1, \ldots, x_N\}$, the log-likelihood of the Gaussian mixture model is
$$\ln p(X \mid \Theta) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),$$
where $\Theta = \{\pi_k, \mu_k, \Sigma_k,\ k = 1, \cdots, K\}$ is the set of all the parameters and $\pi_k \in [0, 1]$ is the prior probability satisfying the constraint $\sum_{k=1}^{K} \pi_k = 1$. To estimate the GMM parameters, the expectation maximization (EM) algorithm is employed to maximize the log-likelihood function of the GMM [87]. Figure 3 illustrates the simulation result of the GMM clustering algorithm.
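As a rough illustration of how EM fits a GMM to multipath parameters, the following NumPy sketch alternates the E-step (cluster responsibilities) and the M-step (weights, means, covariances) and records the log-likelihood at each iteration. It is a generic textbook EM, not the exact procedure of [87]; the two-dimensional synthetic "multipath" features, the initialization, and the small covariance regularizer are assumptions.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, np.linalg.inv(Sigma), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def fit_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                          # mixing proportions
    mu = X[rng.choice(N, K, replace=False)]           # means start at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    log_liks = []
    for _ in range(n_iter):
        # E-step: weighted densities and responsibilities.
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        log_liks.append(float(np.sum(np.log(dens.sum(axis=1)))))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        Nk = resp.sum(axis=0)
        pi = Nk / N
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, log_liks

# Hypothetical MPC features (e.g., normalized delay and angle) forming two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(150, 2)),
               rng.normal([5.0, 5.0], 0.5, size=(150, 2))])
pi, mu, Sigma, log_liks = fit_gmm(X, K=2)
```

The recorded `log_liks` sequence should be non-decreasing up to the small regularization term, which is the practical check that EM is climbing the log-likelihood.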
As seen in Figure 3, GMM clustering yields clearly compact clusters. Because the scattering property of the channel multipath obeys a Gaussian distribution, the compact clusters accord with the multipath scattering property. Moreover, corresponding to the clustering mechanism of GMM, paper [87] proposed a compact index (CI) to evaluate the clustering results. The CI is built from the variance of the kth cluster, two trace terms, and the number of multipaths corresponding to the kth cluster, so both the means and variances of the clusters are considered. By taking sufficient statistics into account, CI can uncover the inherent information of multipath parameters, provide an appropriate explanation of the clustering result, and evaluate the clustering results more reasonably.

Identifying the Scatterers with the Simultaneous Localization and Mapping Algorithm (SLAM).
In order to reconstruct the three-dimensional (3D) propagation environment and to find the main deterministic objects, the simultaneous localization and mapping (SLAM) algorithm is used to identify the texture from pictures of the measurement scenario [140,141]. Figure 4 illustrates our indoor reconstruction result with the SLAM algorithm.
The texture of the propagation environment can be used to search for the main scatterers in the propagation environment. Then, the three-dimensional propagation environment can be reconstructed with the deep learning method.
Then the mechanism that forms the cluster-nuclei is clear. With a limited number of cluster-nuclei, the channel impulse response can be produced by machine learning methods, e.g., decision tree [142], neural network [143], and mixture model [144]. Based on a database covering various scenarios, antenna configurations, and frequencies, channel changing rules can be explored and then input into the cluster-nuclei based modeling. Finally, the prediction of the channel impulse response in various scenarios and configurations can be realized [138].

Analyses of Human Online and Offline Behavior Based on Mobile Big Data.
The advances of wireless networks and the increasing number of mobile applications bring about an explosion of mobile traffic data. Such data are a good source of knowledge for obtaining individuals' movement regularity and acquiring the mobility dynamics of populations of millions [145]. Previous research has described how individuals visit geographical locations and has employed mobile traffic data to analyze human offline mobility patterns. Representative works like [146,147] explore the mobility of users in terms of the number of base stations they visited, which turned out to follow a heavy-tailed distribution. Authors in [146,148,149] also reveal that a few important locations are frequently visited by users. In particular, these preferred locations are usually related to home and work places. Moreover, by defining a measure of entropy, Song et al. [150] found that 93% of individual movements are potentially predictable. Thus, various models have been applied to describe human offline mobility behavior [151]. Passively collecting human mobile traffic data while users are accessing the mobile Internet has many advantages, such as low energy consumption. In general, mobile big data covers a wide range and a large population with fine time granularity, which gives us an opportunity to study human mobility at a scale that other data sources can hardly reach [152]. Novel offline user mobility models developed from mobile big data are expected to benefit many fields, including urban planning, road traffic engineering, telecommunication network construction, and human sociology [145].
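The link between entropy and predictability can be sketched as follows. The example computes the Shannon entropy of a user's visit frequencies and then solves Fano's relation S = H(p) + (1 - p) log2(N - 1) for the upper bound p on predictability. This is a simplified illustration: the frequency-based entropy ignores the temporal order of visits, which the analysis in [150] takes into account, and the visit sequence and function names are hypothetical.

```python
import math
from collections import Counter

def location_entropy(visits):
    """Shannon entropy (bits) of the visit-frequency distribution."""
    counts = Counter(visits)
    total = len(visits)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def max_predictability(entropy, n_locations, tol=1e-9):
    """Solve S = H(p) + (1 - p) * log2(N - 1) for p by bisection."""
    if n_locations < 2:
        return 1.0
    def fano(p):
        h = 0.0
        for q in (p, 1.0 - p):
            if q > 0.0:
                h -= q * math.log2(q)
        return h + (1.0 - p) * math.log2(n_locations - 1)
    lo, hi = 1.0 / n_locations, 1.0   # fano() decreases from log2(N) to 0 on this range
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fano(mid) > entropy:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

visits = ["home"] * 6 + ["work"] * 3 + ["gym"]  # hypothetical base-station labels
S = location_entropy(visits)
bound = max_predictability(S, n_locations=3)
```

A user who spreads visits evenly over N locations gets the lowest bound, 1/N, while a user concentrated on a few preferred locations such as home and work gets a bound much closer to 1.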
Online browsing behavior is another important facet of user behavior when it comes to network resource consumption. A variety of applications are now available on smart devices, covering all aspects of our daily life and providing convenience. For example, we can order taxis, shop, and book hotels using mobile phones. Yang et al. [49] provide a comprehensive study of user behavior in using the mobile Internet. It has been found that many factors, such as data usage and mobility pattern, may impact people's online behavior on mobile devices. It is discovered that the more distinct cells a user visits, the more diverse the applications the user accesses. Zheng et al. [153] analyze the longitudinal impact of proximity density, personality, and location on smartphone traffic consumption. In particular, location has been proven to have a strong influence on what kinds of apps users prefer to use [149,153]. The aforementioned observations indicate that there is a close relationship between online browsing behavior and offline mobility behavior. Figure 5(a) is an example of how browsed applications and current location are related to each other from the view of temporal and spatial regularity. It has been found that mobility behaviors have a strong influence on online browsing behavior [149,153,154]. Similar trends can also be observed for crowds at crowd gathering places, as shown in Figure 5(b); i.e., certain apps are favored at places that bring people together and provide some specific functions. The authors in [50] tried to measure the relationship between human mobility and app usage behavior. In particular, they proposed a rating framework which can forecast online app usage behavior for individuals and crowds. Building the bridge between human offline mobility and online mobile Internet behavior can tell us what people really need in daily life. Content providers can leverage this knowledge to appropriately recommend content for mobile users. At the same time, Internet service providers (ISPs) can use this knowledge to optimize networks for better end-user experiences.
In order to make full use of users' online and offline information, some researchers have begun to quantify the interplay between online and offline social networks and to investigate network dynamics from the view of mobile traffic data [155–158]. Specifically, the online and offline social networks are constructed from the online interest based and location based social networks among mobile users, respectively. The two networks are grouped into layers of a multilayer social network $\mathcal{M} = \{G^{\mathrm{off}}, G^{\mathrm{on}}\}$, as shown in Figure 6, where $G^{\mathrm{off}}$ and $G^{\mathrm{on}}$ denote the offline and online social networks, respectively. In each layer, the graph is described as $G = \langle V, E \rangle$, where $V$ and $E$ represent the node set and the edge set, respectively. Nodes, such as $v_1, \ldots, v_4$, represent users. Edges exist among users when users share similar object-based interests [88]. Combining information from manifold networks in a multilayer structure provides a new insight into user interactions between the virtual and physical worlds. It sheds light on the link generation process from multiple views, which will improve social bootstrapping and friend recommendations in various valuable applications by a large margin [158].
So far, we have summarized some representative works related to human online and offline behaviors. It is worth noting that, owing to the highly spatial-temporal and nonhomogeneous nature of mobile traffic data, a pervasive framework is challenging yet indispensable for realizing the collection, processing, and analysis of massive data, reducing resource consumption, and improving Quality of Experience (QoE). The seminal work by Qiao et al. [60] proposes a framework for MBD (FMBD). It provides comprehensive functions for data collection, storage, processing, analysis, and management to monitor and analyze the massive data. Figure 7(a) displays the architecture of FMBD, while Figure 7(b) shows the considered mobile network framework. Through the interaction between user equipment and the 2G/3G/4G network, real massive mobile data can be collected by traffic monitoring equipment (TME). The modules are implemented on Apache software [159]. FMBD builds a secure environment and an easy-to-use platform for both operators and data analysts, showing good performance in energy efficiency, portability, extensibility, usability, security, and stability. In order to meet the increasing demands on traffic monitoring and analysis, the framework provides a solution for dealing with large-scale mobile big data.
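The multilayer social network described above can be held in a very small data structure. The sketch below stores one edge set per layer over the same users and measures how strongly the offline and online layers overlap. The user identifiers, the example edges, and the choice of the Jaccard index as the overlap measure are assumptions for illustration, not the method of [155–158].

```python
def jaccard(a, b):
    """Jaccard index of two edge sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def neighbors(layer, user):
    """Users connected to `user` in one layer."""
    return {u for edge in layer for u in edge if user in edge and u != user}

# Two layers over the same users; each undirected edge is a frozenset of two user ids.
offline = {frozenset(e) for e in [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]}
online = {frozenset(e) for e in [("v1", "v2"), ("v1", "v3"), ("v3", "v4")]}
multilayer = {"offline": offline, "online": online}

# Share of edges present in both layers.
overlap = jaccard(multilayer["offline"], multilayer["online"])

# Candidate recommendations for v1: online neighbors not yet met offline.
candidates = neighbors(online, "v1") - neighbors(offline, "v1")
```

Comparing the neighborhoods of the same user across layers is one simple way to turn the multilayer view into friend recommendations.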
In conclusion, the proliferation of mobile applications and users' increasing demand for Internet access bring challenges for current and future mobile networks. This section surveys the literature on analyses of human online and offline behavior based on mobile traffic data. Moreover, a framework has been reviewed that meets the higher requirements of dealing with dramatically increasing mobile traffic data. Analyses based on big data will provide valuable information to ISPs on network deployment, resource management, and the design of future mobile network architectures.

Speech Recognition and Verification for the Internet of Vehicles.
With the significant development of smart vehicle products, intelligent vehicle based Internet of Vehicles (IoV) technologies have received widespread attention from many giant Internet businesses [160–162]. IoV technologies include communication between different vehicles and between vehicles and sensors, roads, and humans. These communications help the IoV system share and gather information on vehicles and their surroundings.
One of the challenges in real-life applications of smart vehicles and IoV systems is how to design a robust interaction method between drivers and the IoV system [163]. How well the driver concentrates on driving directly affects the safety of the driver and passengers; hence, drivers should keep their attention on the complex road situation in order to avoid accidents during intense driving. Using voice to transfer information to the IoV system is therefore an effective solution for assisted and cooperative driving. With a speech recognition interactive system, the driver can check traffic jams near the destination or order lunch at a restaurant near the rest stop through voice-based interaction with the IoV system. Because drivers do not need to touch the control panel or any buttons, such a system can reduce the risk of vehicle accidents. A useful speech recognition system in IoV can simplify life for drivers and passengers [164]. In the IoV system, drivers want to use their own voice commands to control the vehicle, and the IoV system must distinguish between authorized and unauthorized users. Therefore, an automatic speaker verification system is necessary in IoV to protect the vehicle from impostors.
Recently, many deep learning methods have been applied in speech recognition and speaker verification systems [41,165–167], and published results show that speech processing methods driven by MBD and deep learning can clearly improve the performance of existing speech recognition and speaker verification systems [40,168,169]. In IoV systems, millions of sensors collect abundant vehicle and environmental data, and noises from engines and streets significantly reduce the accuracy of speech processing systems. Traditional speech enhancement methods, for example, Wiener filtering [170] and minimum mean-square error (MMSE) estimation [171], which focus on improving the signal-to-noise ratio (SNR), do not take full advantage of the a priori distribution of noises around vehicles. With the help of machine learning and deep learning methods, we can use a priori knowledge of the noises to improve the robustness of speech processing systems.
For the speech recognition task, a deep neural network (DNN) can be applied to train an effective monophone classifier, instead of the traditional GMM based classifier. Moreover, the deep-neural-network hidden Markov model (DNN-HMM) speech recognition model can significantly outperform Gaussian mixture model hidden Markov model (GMM-HMM) models [172–174]. As shown in Figure 8, making full use of the adaptive power of DNN, we can use multitraining methods to improve the robustness of the DNN monophone classifier by adding noise into the training data [89]. The experimental results in [89,175] show that the multitraining method can build matched training and testing conditions, which improves the accuracy of noisy speech recognition, especially because prior knowledge of the noise types is easy to obtain in vehicles. As shown in Figure 9, a DNN can also be used to train a feature mapping network (FMN) which uses noisy features as input and the corresponding clean features as training targets. Enhanced features extracted by the FMN can improve the performance of speech recognition systems. Han et al. [176] used an FMN to extract one enhanced Mel-frequency cepstral coefficient (MFCC) frame from 15 noisy MFCC frames. Xu et al. [90] built an FMN which learned the mapping from a log spectrogram to a log Mel filter bank. The enhanced features can remarkably reduce the word error rate in speech recognition.
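The multitraining idea of adding noise into the training data can be sketched as mixing a noise recording into clean speech at several target SNRs. The sketch below is an assumption about how such data might be prepared, not the procedure of [89]; the sine wave and white noise are stand-ins for real speech and in-vehicle noise, and `mix_at_snr` is a hypothetical helper.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db, then add it."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220 * t)   # stand-in for a speech waveform
noise = rng.normal(size=16000)        # stand-in for engine or road noise

# One noisy copy of the utterance per target SNR (in dB).
training_set = {snr: mix_at_snr(clean, noise, snr) for snr in (0, 10, 20)}
```

Training the classifier on all of these copies, rather than on clean speech only, is what brings the training condition closer to the noisy testing condition.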
Besides obtaining the mapped features directly, the DNN can also be trained to estimate an ideal binary mask (IBM), which can be used to separate clean speech from background noise, as shown in Figure 10 [91,177,178]. With a priori knowledge of noise types and SNR, we can generate IBMs as training targets and use noisy power spectra as training data. In the test phase, we can use the learned IBMs to obtain enhanced features which improve the robustness of speech recognition.
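How an IBM training target might be generated can be sketched as follows, using the common definition in which a time-frequency unit is kept when its local SNR exceeds a threshold. The 0 dB threshold, the tiny example spectra, and the function name are assumptions; the cited works may use different settings.

```python
import numpy as np

def ideal_binary_mask(clean_power, noise_power, lc_db=0.0):
    """1 where the local SNR of a time-frequency unit exceeds lc_db, else 0."""
    eps = 1e-12  # avoids division by zero and log of zero
    local_snr_db = 10 * np.log10((clean_power + eps) / (noise_power + eps))
    return (local_snr_db > lc_db).astype(np.float64)

# Hypothetical power spectra: rows are frames, columns are frequency bins.
clean_power = np.array([[4.0, 0.1], [0.5, 9.0]])
noise_power = np.array([[1.0, 1.0], [1.0, 1.0]])

mask = ideal_binary_mask(clean_power, noise_power)
# Crude stand-in for the noisy power spectrum (assumes uncorrelated speech and noise).
noisy_power = clean_power + noise_power
enhanced = mask * noisy_power  # speech-dominated units are kept, the rest are zeroed
```

Computing the mask requires the clean and noise signals separately, which is why it serves as a training target; at test time the network has to estimate it from the noisy input alone.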
In speaker verification tasks, the classical GMM based methods, for example, the Gaussian mixture model universal background model (GMM-UBM) [179] and i-vector systems [180], first need to build a background GMM using a large quantity of speaker-independent speech. Then, by computing the statistics on each GMM component for the enrollment speakers, we can obtain speaker models or speaker i-vectors. However, a trained monophone classification DNN can replace the function of the GMM by computing the statistics on each monophone instead of on GMM components. Many published papers [181–184] show that DNN-i-vector based speaker verification systems work better than the GMM-i-vector method in detection accuracy and robustness.
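The statistics mentioned above are usually the zeroth- and first-order statistics of the acoustic features under per-frame class posteriors. The sketch below shows that the same computation applies whether the posteriors come from GMM components or from the output units of a monophone DNN; the tiny feature and posterior matrices are hypothetical.

```python
import numpy as np

def sufficient_statistics(features, posteriors):
    """Zeroth- and first-order statistics per class (GMM component or DNN output unit)."""
    N = posteriors.sum(axis=0)    # (C,) soft frame counts
    F = posteriors.T @ features   # (C, D) posterior-weighted feature sums
    return N, F

# Hypothetical data: 3 frames of 2-dimensional features, 2 classes.
features = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
posteriors = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # each row sums to 1
N, F = sufficient_statistics(features, posteriors)
```

Swapping the GMM for a DNN therefore changes only where `posteriors` comes from, while the downstream i-vector extraction can stay the same.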
Unlike in speech recognition tasks, where DNNs are used to obtain enhanced features from noisy features, in speaker verification tasks researchers prefer to use a DNN or convolutional neural network (CNN) to generate noise-robust bottleneck features directly [185–187]. As shown in Figure 11, acoustic features or feature maps are used to train a DNN/CNN with a bottleneck layer which has fewer nodes and is close to the output layer. Speaker IDs, noise types, monophone labels, or combinations of these labels are used as training targets. The outputs of the bottleneck layer contain abundant discriminative information and can be used as speaker verification features which improve the performance of classical speaker verification methods such as the aforementioned GMM-UBM and i-vector. Similar to the multitraining method, adding noisy speech into the training data can also improve the robustness of the extracted bottleneck features [65,92].
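Where the bottleneck features are read out can be shown with a short forward-pass sketch. The layer sizes, the ReLU activation, and the random (untrained) weights are assumptions for illustration; in practice the network would first be trained on targets such as speaker IDs or monophone labels.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bottleneck_features(x, layers, bottleneck_index):
    """Forward through `layers` and return the activations of the bottleneck layer."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = relu(h @ W + b)
        if i == bottleneck_index:
            return h  # stop here: the layers after the bottleneck are not needed
    raise ValueError("bottleneck_index is beyond the last layer")

rng = np.random.default_rng(0)
# Hypothetical sizes: 40-dim input, two wide hidden layers, 32-node bottleneck, 100 targets.
sizes = [40, 256, 256, 32, 100]
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

frames = rng.normal(size=(5, 40))  # 5 frames of acoustic features
feats = bottleneck_features(frames, layers, bottleneck_index=2)
```

Each frame is compressed to a 32-dimensional vector, which would then be passed to a back end such as GMM-UBM or i-vector in place of the raw acoustic features.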
Recently, some adversarial training methods have been introduced to extract noise-invariant bottleneck features [64,188]. As shown in Figure 12, the adversarial network includes two parts, i.e., an encoding network (EN) which extracts noise-invariant features and a discriminative network (DN) which judges the noise type of the features generated by the EN. By training these two parts adversarially in turn, we can obtain robust noise-invariant features from the EN, which improve the performance of the speaker verification system [64,188].
In conclusion, DNN and machine learning methods can make full use of the MBD collected from IoV systems and thereby improve the performance of the speech recognition and speaker verification methods applied in voice interactive systems.

Conclusions and Future Challenges
Although the machine learning-based methods introduced in Section 3 are widely applied in MBD fields and perform well in tests on real data, they still need to be developed further. Five main challenges facing machine learning-based MBD analysis should therefore be considered, as follows.
(1) … on the development of machine learning-based MBD analysis methods towards high efficiency and precision.
(2) Overfitting and Underfitting Problems. A benefit of MBD to machine learning and deep learning lies in the fact that the risk of overfitting becomes smaller as more and more data become available for training [28]. However, underfitting is another problem when the data volume is very large. In this case, a larger model might be a better choice, since it can express more of the hidden information in the data. Nevertheless, a larger model, which generally implies a deeper structure, increases the runtime of the model and thus affects real-time performance. Therefore, the model size in machine learning and deep learning, which is represented by the number of parameters, should be balanced between model performance and runtime.
(3) Generalization Problem. Because of the massive scale of MBD, it is impossible to obtain the entire data set even within a specific field. Therefore, the generalization ability of a trained machine learning or deep learning model, which can be defined as its suitability for different data subspaces and is also called scalability, is of great importance for evaluating performance.
(4) Cross-Modal Learning. The variety of MBD leads to multiple modalities of data (for example, images, audio, personal location, web documents, and temperature) generated by multiple sensors (correspondingly, cameras, sound recorders, position sensors, and temperature sensors). Multimodal learning should learn from multimodal and heterogeneous input data with machine learning and deep learning [4,189] to obtain hidden knowledge and meaningful patterns; however, such knowledge and patterns are quite difficult to discover.
(5) Extended Channel Dimensions. The channel dimensions have been extended to three domains, i.e., space, time, and frequency, which means that channel properties have been explored comprehensively. Meanwhile, the increasing number of antennas, high bandwidth, and various application scenarios generate big data from channel measurements and estimations, especially for 5G. The channel characteristics found need to be precisely described by more advanced channel modeling methodologies.
In this paper, the applications and challenges of machine learning-based MBD analysis in the M-Internet have been reviewed and discussed. The development of MBD in various application scenarios requires more advanced data analysis technologies, especially machine learning-based methods. Three typical applications of MBD analysis have been presented, namely, wireless channel modeling, human online and offline behavior analysis, and speech recognition and verification in the Internet of Vehicles; the machine learning-based methods they use are also widely applied in many other fields. In order to meet the aforementioned future challenges, three main study aims, i.e., accuracy, feasibility, and scalability [28], are highlighted for present and future MBD analysis research. In future work, improving accuracy will remain the primary task, on the basis of a feasible architecture for MBD analysis. In addition, as noted in the discussion of the generalization problem, scalability has attracted more and more attention, especially in classification or recognition problems, where scalability also includes an increase in the number of inferred classes. It is of great importance to improve the scalability of the methods while maintaining high accuracy and feasibility in order to meet the analysis requirements of MBD.

Conflicts of Interest
The authors declare that they have no conflicts of interest.