Big Data Meet Cyber-Physical Systems: A Panoramic Survey

The world is witnessing an unprecedented growth of cyber-physical systems (CPS), which are foreseen to revolutionize our world by creating new services and applications in a variety of sectors such as environmental monitoring, mobile-health systems, intelligent transportation systems and so on. The information and communication technology (ICT) sector is experiencing a significant growth in data traffic, driven by the widespread usage of smartphones, tablets and video streaming, along with the significant growth of sensor deployments that are anticipated in the near future. This is expected to sharply increase the growth rate of raw sensed data. In this paper, we present the CPS taxonomy by providing a broad overview of data collection, storage, access, processing and analysis. Compared with other survey papers, this is the first panoramic survey on big data for CPS, where our objective is to provide a panoramic summary of different CPS aspects. Furthermore, CPS require cybersecurity to protect them against malicious attacks and unauthorized intrusion, which becomes a challenge with the enormous amount of data that is continuously being generated in the network. Thus, we also provide an overview of the different security solutions proposed for CPS big data storage, access and analytics. We also discuss big data meeting green challenges in the context of CPS.

2020 [2]. This has allowed the rapid development of a myriad of applications in health-care, public safety, environmental management, vehicular networks, industrial automation, and so on. A CPS mainly consists of physical components and a cyber twin interconnected together, where the cyber twin is a simulation model, such as a computer program, that represents the physical things [3]. The Internet of Things (IoT), on the other hand, allows different CPS to be connected together for information transfer. This means that IoT acts as a bridge that networks different cyber-physical things. The global expansion of interconnected CPS is facilitated by standardization efforts. For instance, the standardization activities of IoT are being led by the industry (AllSeen Alliance, Open Interconnect Consortium, Industrial Interconnect Consortium) and by the IEEE P2413 project on standard specifications for an IoT architectural framework [4]. CPS have resulted in a tsunami of new information, also known as big data, which can help boost the revenues of many businesses by identifying customer needs and providing customers with superior services. However, this enormous amount of data is so large in volume and so complex to handle in real time that it exceeds the processing capacities of conventional systems. For this reason, cloud computing techniques, along with machine learning tools, data mining, artificial intelligence, and fog computing, can help the sensed data to be easily stored, processed, and analyzed to uncover hidden patterns, unknown correlations and other useful information [5]. That is why big data are referred to as "the 21st century new oil". The characteristics of big data were well summarized in the Introduction section of [6]. The big data era and CPS are also highly relevant to the global sustainability development goals recently discussed in [7].
Before any processing or analysis, data need to be acquired. Technological advancements have led to smarter, more efficient and lower-cost sensors, which has facilitated their wide deployment. There are two main sources of sensed data: i) context-aware computing and communications [8], and ii) social computing. With context-aware computing and communications [8], data are sensed from physical sensors, virtual sensors which retrieve data using web services technology, logical sensors which combine both virtual and physical sensors (for example, to gather weather information), global sensors which collect data from middleware infrastructure, and remote sensors for earth sciences applications [9], [10]. As for social computing, participatory sensing and mobile crowdsensing have shaped the structure of social networks, in which users collect and share sensed data using their own smartphones rather than relying on deployed sensors [11]-[13].
Cloud computing facilitates big data storage, processing and management in CPS, by breaking them down into workflows, which are then distributed over multiple dedicated servers. This allows CPS to provide pervasive sensing services beyond the capacities of individual things, in addition to lower latency and power consumption and larger scalability.
Once the data have been collected, making sense of them becomes one of the most important aspects of CPS. However, it is important first to eliminate redundant information and reduce data complexity so that useful information can be extracted efficiently. In this survey, we discuss different tools that assist with the data mining process, mainly feature selection, dimensionality reduction, knowledge discovery in databases, information visualization, computer vision, classification/clustering techniques, and real-time analysis.
The ubiquitous cyber-physical world is highly susceptible to security threats. These vulnerabilities are aggravated by the inability to effectively handle the large amount of data that constantly flows through the network, and by the lack of qualified security experts. Sensitive data stored in the cloud can be accessed or altered by unauthorized users. Cyber-security attacks on the computations, such as false data injection, can affect the integrity and accuracy of extracted results. For all these reasons, research efforts have been shifting towards proposing robust security solutions for big data CPS. In this survey, we provide an overview of these proposed solutions.
In recent years, there has been growing interest in green information and communications technologies. Addressing green issues for CPS allows for more sustainable and energy-efficient systems. Green solutions have been proposed for many aspects of CPS, mainly for i) data collection/storage, such as minimizing the number of relay transmissions, removing redundant transmission links, and the use of data compression techniques; ii) CPS computing, such as dynamic voltage and frequency scaling and traffic engineering techniques; iii) CPS processing, such as designing energy-efficient orchestrators, checkpointing aided parallel execution (CAPE), reducing the amount of data exchanged between clouds, and the use of cloudlets, which are closer to users than distant clouds, among other solutions.
We summarize the contributions of our paper as follows:
• First, to the best of our knowledge, this is the first panoramic survey on big data for cyber-physical systems. Unlike previous relevant literature surveys [5], [9], [21]-[26], this paper provides a broader viewpoint on CPS from different aspects, mainly data collection and storage, processing, analytics, cybersecurity, and green solutions, rather than focusing on specific aspects of CPS.
• Second, we provide a broader summary of security solutions proposed for big data CPS in data collection, storage, and access, as well as in data processing and analytics, unlike previous survey papers such as [15], [27], [28] that only focused on the security of a particular component or element of big data.
• Third, we provide a summary of big data, storage, and communications for CPS.
• Fourth, we provide a broader view of big data meeting green challenges for CPS, unlike [6], [29]-[32].
The sections of this paper are organized as follows. Section II compares this paper with other related surveys in the literature. Section III describes the different sources, generations, and collections of big data for CPS, mainly context-aware computing, communications, and social computing, while Section IX presents different CPS-related big data applications. Then, in Section IV, we discuss big data caching and routing for better CPS data management. In Section V, we present the different big data processing tools, from cloud data processing and multi-cloud processing to big data clustering techniques, NoSQL and fog computing, that help facilitate the analysis of large volumes of data. In Section VI, an overview of big data analytics techniques and tools is provided. Section VII provides a summary of the different security solutions proposed for CPS, mainly in storage, access and analysis. Section VIII addresses the relevance of big data and green challenges for CPS. Finally, Section X identifies CPS big data challenges and open issues. Concluding remarks are presented in Section XI. A list of acronyms used in this paper is provided in Table I.

II. RELEVANT WORKS
Several survey papers have focused on specific aspects of big data, IoT, and CPS. However, to the best of our knowledge, none of them has investigated the interconnections of these concepts together in an extensive survey like this one.
In [21], the authors provided an overview of IoT from definitions to technologies to applications and standardizations, while presenting the different solutions proposed in the literature for deterring security threats and preserving data integrity in RFID systems. In [9], the authors provided a brief overview of IoT, with emphasis on sensor networks and their relationships to IoT. Then, context-aware computing definitions, categories and characteristics from an IoT perspective were thoroughly presented. In [23], context data were surveyed for mobile ubiquitous environments, with the main focus being on providing efficient context data distribution for real network systems by highlighting the layers of the context distribution architecture, network deployments and taxonomy. In [24], different mobile crowd sensing incentives and types were reviewed to motivate normal users to participate in and contribute to different sensing applications. In [25], big data caching techniques that optimize computational time and reduce storage overhead, as well as improve system performance, scalability and efficiency, were presented. In [5], different cloud computing schedulers using Hadoop-MapReduce were presented with their pros and cons, for the purpose of improving scheduling in cloud environments. In [22], big data analytics, from challenges to solutions to open research issues, were explored for efficient information extraction and decision making.
In [15], different security and privacy protocols for MapReduce were surveyed for integrity, correctness and confidentiality of cloud data computations. As for fog computing, which extends cloud computing to the edge of the networks, [27] reviewed the different security and privacy challenges, solutions and open issues. For specific CPS applications, [28] provided a thorough survey on cybersecurity issues for smart grids, with emphasis on security requirements, network vulnerabilities, and secure communication protocols and architectures. Green challenges facing big data cloud computing and processing were surveyed in [6], [29], [30], [32], with different green IoT applications in [31].
CPS has also been surveyed in literature. For instance, in [33], CPS features, energy control, transmission, resource allocation and software designs were presented. The work [34] discussed the integration of cloud computing with CPS by categorizing them in three different areas: remote brain, big data manipulation, and virtualization. The paper [35] surveyed the general concepts, challenges and applications for CPS.
Most of the previous surveys have not discussed big data in the contexts of CPS, which is the main focus of this paper. Furthermore, the green and security challenges and solutions are explored specifically for big data in CPS. Therefore, this survey paper provides a big umbrella addressing promising research frontiers and insights in many challenges and open issues facing big data meeting CPS.

Survey       Scope
[21]         IoT definitions, technologies and applications
[9]          IoT context-aware definitions, categories and characteristics
[23]         Context data distribution for mobile ubiquitous environments
[33]         CPS features, energy, transmission, resource allocation and software designs
[34]         Cloud computing in CPS
[35]         CPS general concepts, challenges and applications
[24]         Mobile crowd-sensing incentives
[25]         Caching big data
[5]          Cloud computing schedulers
[22]         Big data analytics challenges and solutions
[15]         Security and privacy protocols for MapReduce
[27]         Fog computing security challenges, solutions and open issues
[28]         Cybersecurity for smart grids
[29]         Energy efficient cloud data centers and computing
[30]         Green big data processing for smart grids
[31]         Green IoT applications
[6], [32]    Big data meet green challenges

accessible design boards (Raspberry Pi, Onion, Arduino) [36], and the new efficient hardware architectures and components, which made sensors more robust to hardware wearing from harsh environments. Moreover, many of these sensors incorporate accelerometers that are 1000x more powerful in terms of sensitivity than those used in a Nintendo Wii [37]. In this section, we discuss the different sensed big data sources and types by grouping them into social computing and context-aware computing. An illustration is provided in Fig. 1.

A. Context Aware Computing and Communications
The general definition of context-aware communications and networking (CACN) was provided in [8]. Context-aware computing could be considered as CACN operating in the higher networking layers. The data that sensors collect for their specific application purposes are considered raw data, which have been directly collected from the environment without further processing. Raw data alone are challenging to analyze and interpret, let alone the big data generated by the large-scale deployment of sensors. To provide relevant information that is meaningful and easily interpretable, sensors need to engage in context-aware computing; that is, sensors need to store processed, meaningful information that is easily understandable, also known as "context information" [8], [38]. An example that highlights the difference between raw sensor data and context information is the blood sugar readings collected by bio-medical sensors on the bodies of patients, which are considered raw data. When these readings are processed and represented as a patient's average glucose level in the blood, they are referred to as context information. The quality of context (QoC) metric is used to assess the quality, validity, precision and freshness of context information [23]. Context-aware computing, from acquisition to processing, storing and reasoning, can be performed by the applications themselves, by using libraries and toolkits, or even by using a middleware platform [9].
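The glucose example above can be illustrated with a minimal Python sketch. The `GlucoseReading` type, the `to_context` function, the field names and the one-hour freshness threshold are all hypothetical names and values chosen for illustration; the freshness flag is only a crude stand-in for one facet of QoC and is not the metric defined in [23].

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GlucoseReading:
    timestamp_s: float  # seconds since monitoring started (assumed unit)
    mg_per_dl: float    # raw value reported by the sensor


def to_context(readings, now_s, max_age_s=3600.0):
    """Summarize raw readings into context information with a freshness flag."""
    if not readings:
        raise ValueError("no readings to summarize")
    newest = max(r.timestamp_s for r in readings)
    return {
        "average_glucose_mg_per_dl": mean(r.mg_per_dl for r in readings),
        "sample_count": len(readings),
        # Hypothetical freshness check: the newest sample must be recent enough.
        "up_to_date": (now_s - newest) <= max_age_s,
    }


readings = [
    GlucoseReading(0, 90.0),
    GlucoseReading(600, 110.0),
    GlucoseReading(1200, 100.0),
]
context = to_context(readings, now_s=1500)
print(context)
```

The raw list of readings is what the sensor produces; the returned dictionary is the kind of processed, directly interpretable summary the text calls context information.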
As for sources of context information, they can be retrieved directly from physical sensors, from virtual sensors (which collect data from different sources by using web services technology and represent them as sensor data), from logical sensors (a combination of physical and virtual sensors), from a middleware infrastructure (global sensor networks), from context servers (databases, web services), or even provided manually, such as by retrieving users' preferences. The context information can be further classified into primary context and secondary context, a distinction that reflects how the data were obtained. For example, reading RFID tags directly from different production parts in industrial plants is considered the primary context, while obtaining the same information from the plant's database is referred to as the secondary context [9], [21].

B. Remote Sensing
Collecting sensor data of objects from a distance is referred to as remote sensing (RS) [39]. RS is an integral part of earth sciences. For instance, space-borne and airborne sensors collect multi-spatial and multi-temporal RS big data from the global atmosphere for purposes of earth observation and climate monitoring [40]. Other remote sensing applications include Google Earth that provides pictures of the earth's surface, weather reporting, traffic monitoring, hydrology and oceanography [41].
In [42], the authors proposed a big data analytical architecture for real-time RS data processing using an earth observatory system. The real-time processing includes filtration, load balancing and parallel processing of the useful RS data. The RS datasets are normally geographically distributed across several data centers, leading to difficulties in loading, scheduling and transmission of data. Moreover, the high dimensionality of the RS data makes their storage and access rather complicated [40]. That is why, in [43], Wang et al. proposed a wavelet transform to represent RS big data by decomposing the datasets into multiscale detail coefficients, which are estimated using expectation-maximization likelihood. In [44], the authors evaluated the quality of RS data using statistical inference, using prior knowledge of the dataset to obtain an unbiased estimator of the quality.
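To make the idea of multiscale detail coefficients concrete, the following is a minimal sketch of a one-dimensional Haar wavelet decomposition. The Haar wavelet is chosen here only for simplicity and is an assumption; the specific wavelet and the expectation-maximization estimation used in [43] are not reproduced. The function names are illustrative.

```python
def haar_step(signal):
    """One level of the unnormalized Haar transform: pairwise averages and differences."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return the coarsest approximation and the detail coefficients, finest scale first."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose by walking from the coarsest scale back to the finest."""
    signal = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for a, d in zip(signal, detail):
            rebuilt.extend([a + d, a - d])
        signal = rebuilt
    return signal


samples = [4, 6, 10, 12, 8, 6, 5, 5]  # made-up one-dimensional sample values
coarse, details = haar_decompose(samples, levels=2)
print(coarse, details)
```

Each entry of `details` holds the coefficients of one scale, so small coefficients can be dropped or quantized to obtain a compact multiscale representation while `haar_reconstruct` recovers the signal from what is kept.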

C. Social Computing
With the explosive increase in smartphone usage, mobile data have experienced an unprecedented growth, carrying an enormous amount of information on user applications, network performance data, service characteristics, geographic information, subscribers' profiles, and so on [45]. This has led to shaping the notion of "mobile big data", which, unlike traditional big data in computer networks, have their own unique characteristics. One of these characteristics is the ability to partition mobile data in the time and space domains, such as by minutes, hours, days, location, and so on. Furthermore, due to the features of smartphone usage, the same traffic, on the one hand, is highly likely to be requested by a group of subscribers at a certain time and location; on the other hand, subscribers in close proximity may exhibit similar behavior and mobility patterns, all of which can help optimize network performance [46].
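The time and space partitioning of mobile data described above can be sketched as follows. The record fields (`t_s`, `lat`, `lon`, `bytes`), the hour-of-day granularity and the 0.01-degree grid cell are assumptions made for illustration only.

```python
from collections import defaultdict


def partition_records(records, cell_deg=0.01):
    """Group records by (hour of day, (latitude cell, longitude cell))."""
    buckets = defaultdict(list)
    for rec in records:
        hour = int(rec["t_s"] // 3600) % 24
        cell = (int(rec["lat"] // cell_deg), int(rec["lon"] // cell_deg))
        buckets[(hour, cell)].append(rec)
    return buckets


records = [
    {"t_s": 8 * 3600 + 10, "lat": 41.005, "lon": 29.005, "bytes": 100},
    {"t_s": 8 * 3600 + 900, "lat": 41.006, "lon": 29.004, "bytes": 300},
    {"t_s": 20 * 3600, "lat": 41.025, "lon": 29.005, "bytes": 50},
]
buckets = partition_records(records)
for key, group in sorted(buckets.items()):
    print(key, sum(r["bytes"] for r in group))
```

Records that fall in the same hour and grid cell end up in one bucket, which is the starting point for spotting traffic that is requested together in time and location.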
Social computing allows the integration of these social behaviors and contexts into web technologies to assist with predicting social dynamics, which can render the operation, planning and maintenance of social wireless networks easier than ever [47], [48]. For instance, due to high social correlations and relationships among subscribers, a user social network can be formed, in which the habits, interests, mobility, and sharing patterns can be used to construct social community structures and analyze communication behaviors. One example of a user social application is the popular Pokemon Go game, where users in close proximity share real-time maps to hunt for Pokemon characters [46]. Another example where social computing can be beneficial is in emergency situations, such as the spread of infectious diseases, where adopting appropriate policies by analyzing human interactions and predicting the emergency's evolution can help protect public health [47]. The article [49] introduced Cybermatics as a broader vision of the IoT (called hyper IoT) to address science and technology issues in the heterogeneous cyber-physical-social-thinking (CPST) hyperspace. Next, we describe two different social computing tools for data collection, namely participatory sensing and crowd-sensing.
1) Participatory sensing: Participatory sensing or community sensing allows users to collect and share information either within social groups (social sensing) or with everyone (public sensing) using their own smart devices [11], [12]. This means that sensors can be substituted by users for the purpose of data collection, which can significantly reduce the monetary costs of deploying physical sensors. However, with participatory sensing come several challenges, such as the quality and trustworthiness of collected data, the willingness of participants to engage in the sensing tasks, and the protection of participants' personal information.
In [11], the authors proposed a participant coordination architecture that selects the most efficient participants without exposing participants' personal information to the application server. To protect participants' privacy, in [50], Chang et al. proposed a secure scheme called PURE, which allows participants to estimate the global model by peer reviewing the local regression models. This enables participants to report only intermediate results back to the server without the need to share local private data with the server. In [51], Messaoud et al. proposed a mobile sensing scheme that reduces the sensing time required by participants and increases the fairness of sensing task assignment to ensure participants' commitment to sensing, while maintaining the same data quality as in non-fair schemes. In an attempt to maximize the overall data quality, in [52], Wang et al. proposed a multitask allocation framework (MTPS) which pays participants a compensation from a shared budget for each sensing task, with additional compensation if a participant is assigned more than one task. This greedy framework allows the allocation of multiple tasks to participants. In [53], participatory sensing was used for environmental data collection, where the urban resolution metric was used to measure the quality of urban sensing. In [54], participatory sensing was applied to vehicular networks, where the location, speed, and fuel consumption of vehicles can be communicated to the server through phones aboard via a WiFi interface to reduce data transfer delay.
2) Mobile crowd-sensing: Mobile crowd-sensing (MCS) can be considered as an extension of participatory sensing. In addition to collecting data from mobile devices (mobile sensing), MCS uses social sensing by integrating and fusing the data contributed by mobile devices with those of the mobile social network services in order to provide solutions to more complex queries [13]. In vehicular networks, with participatory sensing, we can collect warning messages from vehicles to determine the traffic status. However, if, in addition, a driver needs to know whether the route is safe to drive on based on authorities' recommendations, residents' preferences, and so on, then MCS can be useful (see Fig. 1).
In [55], Xiang et al. used MCS to construct accurate outdoor received signal strength (RSS) maps using error-prone smartphones. In [56], Wang et al. proposed an energy-efficient and cost-effective data uploading scheme for MCS that provides incentives to participants to use the appropriate timing and network to upload the data. Data were offloaded to Bluetooth/WiFi gateways using predictions of users' calls and mobility. To maintain relatively good performance of MCS applications, a sufficient number of participants need to contribute to sensing. In [24], the authors discussed the different incentives for MCS, namely entertainment, service and monetary incentives, through which participants can be recruited for multiple sensing tasks.

IV. BIG DATA STORAGE AND COMMUNICATIONS
Caching CPS big data can reduce the amount of traffic exchanged, which contributes to better data management, lower latency and lower energy consumption. In this section, we discuss some of the solutions proposed for CPS big data caching and routing.

A. Big Data Caching and Storage for CPS
The high amount of traffic driven by the popular use of mobile video and online social media applications, along with the scarcity of backhaul resources, has pushed researchers and mobile operators to find solutions. Content caching in CPS is of high interest, especially since a big proportion of the traffic load originates from fetching data from different sources such as databases, cache servers and network gateways [57]. Rather than caching data in the cloud, performing the caching at the edge of mobile wireless networks, such as at base stations and user equipment, offers the advantage of better data management [58], [59]. For instance, in [57], Zeydan et al. proposed a big data proactive caching architecture that predicts popular content from users' behavior and network characteristics to perform caching at the base stations. This has the advantage of offloading the backhaul to the edge, so that data get closer to users, thereby enhancing users' quality of experience and reducing latency. The proposed architecture was validated using a dataset from a Turkish mobile operator, where it was shown that proactive caching can yield 100% user satisfaction by offloading 98% of the backhaul to the edge. Proactive caching at the edge is envisioned to solve big data management in future 5G networks, especially since base station densification and acquiring new spectrum do not seem quite effective in terms of cost, scalability and flexibility [25], [57]. A new approach to reducing network traffic in telehealth systems was suggested in [60], where a filter inside the sensors associates a scale with each record to determine the data fields that need to be sent to the server.
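The principle of proactive caching at a base station can be sketched in a few lines of Python. In this sketch, simple request counting stands in for the popularity prediction, which is an assumption; the prediction model and the operator dataset of [57] are not reproduced, and the function names and request traces are made up for illustration.

```python
from collections import Counter


def choose_cache(history, capacity):
    """Pick the `capacity` most requested items in the observed request history."""
    return {item for item, _ in Counter(history).most_common(capacity)}


def backhaul_offload(cache, requests):
    """Fraction of requests served from the edge cache instead of the backhaul."""
    if not requests:
        return 0.0
    hits = sum(1 for item in requests if item in cache)
    return hits / len(requests)


history = ["a", "b", "a", "c", "a", "b", "d"]  # past requests seen at the base station
cache = choose_cache(history, capacity=2)
future = ["a", "b", "e", "a"]                  # requests arriving after the cache is filled
print(backhaul_offload(cache, future))
```

The offload ratio improves when the predicted popular set matches future demand and when the cache capacity grows, which is the trade-off a proactive caching design has to balance.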
Different works have been proposed to deal with challenges facing Hadoop for handling big data storage and processing [61]- [63]. For instance, in [62], the authors proposed a novel cache to store the intermediate data, which helps eliminate redundancy in storage and processing of the big data set, in addition to speeding up the performance of the system by fetching the data from the cache rather than running mapper functions. Dache, a data-aware cache framework for big-data applications, was suggested in [63] to accelerate the execution of MapReduce tasks. The transmitters of the tasks send their intermediate data to a cache manager, which is then used to fetch potential processing results.
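The idea of reusing cached intermediate data instead of rerunning mapper functions can be illustrated with the following sketch. It is a generic memoization example under stated assumptions, not the cache design of [62] or the Dache protocol of [63]; the class name, the hashing of the input split, and the word-count mapper are all hypothetical.

```python
import hashlib


class IntermediateCache:
    """Stores mapper outputs keyed by a digest of the mapper name and input split."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def map_with_cache(self, split, mapper):
        key = hashlib.sha256(f"{mapper.__name__}:{split}".encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1  # reuse the intermediate result, skipping the mapper
        else:
            self.misses += 1
            self._store[key] = mapper(split)
        return self._store[key]


def word_count_mapper(split):
    """Toy mapper: count word occurrences in one input split."""
    counts = {}
    for word in split.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


cache_manager = IntermediateCache()
first = cache_manager.map_with_cache("a b a", word_count_mapper)
second = cache_manager.map_with_cache("a b a", word_count_mapper)
print(first, cache_manager.hits, cache_manager.misses)
```

The second call returns the stored result without invoking the mapper, which is where the savings in redundant storage and processing come from when the same split is processed repeatedly.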
Cache memory can help speed up CPS communications by increasing the execution speed and decreasing the time spent on memory access to fetch the required data. For instance, medical CPS require a timely response from medical sensors to servers' requests, and caching can be very useful for this purpose. However, as mentioned in [64], cache misses are highly likely in CPS, especially when a task is interrupted by a higher priority task, triggering a cache interference cost. A potential solution to this problem is cache partitioning, where different tasks with shared resources are isolated and assigned reserved partitions of a cache memory. This can be very useful for real-time CPS applications such as mobile-health, environmental monitoring, traffic surveillance, aerospace and so on, where reliability and predictability are of high importance. Moreover, in large-scale CPS networks, temporal interference from a large number of devices contending for shared resources, such as processor cores, buses and I/O devices, can significantly degrade the runtime performance of CPS and lead to unpredictable systems. Several algorithms and component-based approaches have been proposed to help reduce the temporal interference in CPS [65]-[67].
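The effect of cache partitioning on interference between tasks can be illustrated with a small simulation. This is a toy model under stated assumptions: a fully associative cache with least recently used (LRU) replacement, a hypothetical control task with a two-line working set, and a hypothetical streaming task that never reuses data. It does not reproduce any specific scheme from [64]-[67].

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, address):
        """Return True on a hit, False on a miss (the line is loaded on a miss)."""
        if address in self.lines:
            self.lines.move_to_end(address)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[address] = True
        return False


def count_misses(trace, caches):
    """trace: list of (task, address); caches: dict mapping each task to its cache."""
    misses = {task: 0 for task in caches}
    for task, address in trace:
        if not caches[task].access((task, address)):
            misses[task] += 1
    return misses


# Five rounds: the control task reuses two lines, the streaming task touches four new ones.
trace = []
for rnd in range(5):
    trace += [("control", 0), ("control", 1)]
    trace += [("stream", rnd * 4 + i) for i in range(4)]

shared_cache = LRUCache(4)
shared = count_misses(trace, {"control": shared_cache, "stream": shared_cache})
partitioned = count_misses(trace, {"control": LRUCache(2), "stream": LRUCache(2)})
print(shared, partitioned)
```

With a shared cache the streaming task evicts the control task's lines in every round, while a reserved partition limits the control task to its initial cold misses, making its timing easier to predict.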

B. Big Data Communications for CPS
To accommodate the big data environment for CPS, building a resilient network infrastructure for data-intensive applications and services has become essential [68]. Data collection, the distribution of data chunks to data centers, and data delivery to the intended users necessitate fast and reliable networking to bridge these stages together. For instance, machine-type communications (MTC), a class of technologies related to IoT and defined by the 3rd Generation Partnership Project (3GPP), are becoming more popular as we move towards smart cities. In an attempt to make homes smarter, Apple introduced HomeKit, which uses Siri to control different things inside the home remotely using the iPhone, iPad and even AppleTV [69]. Samsung, on the other hand, introduced SmartThings, which allows the automation of many home tasks using WiFi, Zigbee, Bluetooth low energy (BLE), and Z-Wave [70]. These new technologies allow all the devices to be interconnected and to communicate with each other. Many MTC communications are envisioned to run over current cellular networks [71], providing MTC devices with ubiquitous coverage, global connectivity, reliability and security. However, this creates a set of challenges for cellular network providers [72], such as the inability to handle MTC traffic on networks optimized for human communications, excessive congestion due to signaling overhead, packet scheduling problems due to MTC traffic requiring fewer radio resources than the minimum allocated to a cellular device, and the large interference generated by MTC devices [73]. Different relevant works, such as [74]-[77], suggested offloading the MTC traffic onto device-to-device (D2D) communication links. In fact, the Google OnHub router can support direct D2D communications, as well as WiFi, BTL and 802.15.4, using Brillo, a stripped-down version of the Android OS.
The D2D technology provides an ideal solution to support the massive communications of MTC devices, especially that these devices are anticipated to be located close to each other [78]. The paper [79] discussed the integration of wireless sensor networks (WSNs) and mobile cloud computing (MCC) and proposed a sensory data processing framework to transmit desirable sensory data to the mobile users in a fast, reliable, and secure manner.
In [80], the authors proposed two different sustainable routing designs called SustainMe. The first model uses dedicated backup protection, in which network components are turned off after data is fully delivered to help save energy. The second model uses shared backup protection to achieve a tradeoff between energy efficiency and capacity usage efficiency. The latter is shown to consume less capacity than the first model, but at the expense of an increase in energy expenditure. In [81], the authors proposed software defined network (SDN) routing for Hadoop to accelerate big data delivery by speeding up the MapReduce data shuffling. This is useful for time-sensitive big data applications. As a matter of fact, SDN can significantly improve big data applications, especially since it separates the control and data planes, allowing the control plane to be logically centralized so that decisions and network optimization are performed efficiently with a global view of the network. All these SDN features can make big data acquisition, processing, transmission and storage much easier and faster. On the other hand, big data can also assist in efficiently designing and optimizing SDN functions through traffic engineering, cross-layer design, security, and inter- and intra-data center communications [82].
The works [68], [83] showed that routing, navigation services, and data delivery performance can all be improved by taking advantage of social and geographical community structures. In [6], different big data routing algorithms for energy efficiency were reviewed, such as selecting routing paths for data centers that achieve the lowest energy costs, parallelizing data transfers by using multiple cores in data routing, and balancing big data traffic through a distributed adaptive routing algorithm that minimizes packet delivery.
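Selecting a routing path with the lowest energy cost can be sketched as a shortest-path search in which each link is weighted by its energy cost. The sketch below uses Dijkstra's algorithm as an illustrative choice; it is not one of the specific algorithms reviewed in [6], and the topology, node names and energy values are made up.

```python
import heapq


def min_energy_path(graph, source, target):
    """graph: {node: {neighbor: energy_cost}}; returns (total_cost, path) or (inf, [])."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route to this node was found
        for neighbor, energy in graph.get(node, {}).items():
            new_cost = cost + energy
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    return float("inf"), []


# Hypothetical topology: two data centers connected through two routers.
graph = {
    "dc1": {"r1": 4.0, "r2": 1.0},
    "r2": {"r1": 1.0, "dc2": 6.0},
    "r1": {"dc2": 2.0},
    "dc2": {},
}
cost, path = min_energy_path(graph, "dc1", "dc2")
print(cost, path)
```

Link energy costs must be non-negative for this search to be correct; in practice they would be derived from measurements or a power model of the network equipment.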
In massive high-density environments, data collection and transmission can be challenging tasks, especially given the different data flow information, quantities and characteristics. In [84], cyclists' data collection and transmission were studied. Autonomous data collection using automated video analysis can be performed in order to avoid costly manual data collection. For instance, a computer vision system can group users' features based on spatial proximity data, speed, trajectory and others. The obtained information can be input to an algorithm that extracts valuable information such as traffic stream, number of cyclists, lane density and so on.
In high-density environments, it is important for data to be efficiently collected in a short period of time to ensure up-to-date information. For this purpose, concurrent data collection trees can be valuable, where multiple data collection streams can be initiated by different users on the same set of devices [85]. This can be extremely helpful for delay-sensitive applications. In [86], the authors proposed concurrent massive data transmission for a remote real-time health monitoring system. An input/output (I/O) Completion Port (IOCP) main server dynamically binds Internet protocol (IP) addresses to massive sensors before they can send their data. The main server assigns one of the IOCP servers with the least number of concurrent connections to micro sensors using Oracle Real Application Cluster (RAC). Finally, data storage is separated from Oracle server instances using Network Attached Storage (NAS) technology to allow for greater I/O performance for massive concurrent connections. The paper [87] first analyzed the objective of massive access control for machine-type communications (MTC) or machine-to-machine (M2M) communications, another term for CPS used in 3GPP standardization for wireless cellular networks, and proposed distribution reshaping, which shapes the data arrival distribution in a highly coordinated manner to notably reduce multiple access congestion.
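The least-connections assignment described above for the IOCP main server can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation in [86]: the server names and the `assign_sensor` helper are hypothetical, and the IP binding and Oracle RAC steps are omitted.

```python
def assign_sensor(connections, sensor_id, assignments):
    """Bind a sensor to the server that currently has the fewest connections.

    `connections` maps a (hypothetical) server name to its number of
    concurrent connections; `assignments` records sensor -> server.
    """
    # min() scans the server names keyed by their current load;
    # ties are broken by dictionary insertion order.
    server = min(connections, key=connections.get)
    connections[server] += 1
    assignments[sensor_id] = server
    return server


# Hypothetical starting load on three IOCP servers.
connections = {"iocp-1": 2, "iocp-2": 0, "iocp-3": 1}
assignments = {}
for sensor in ["s1", "s2", "s3", "s4"]:
    assign_sensor(connections, sensor, assignments)
```

After the four sensors are assigned, the load is spread so that no server differs from another by more than one connection.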
Real-time data fusion is important for heterogeneous IoT/CPS data streams and unreliable networks with increasing data size. To support locally available computational values that help real-time analytics, the work [88] proposed the fusion of three different data models with relational, semantic, and big data based data and metadata. The paper [89] proposed a two-layer architecture for IoT data analysis. The first layer uses a generic interface based on a service oriented gateway to obtain data from multiple interfaces and IoT systems, store them scalably, and run relevant real-time analysis to extract high-level activities, while the second layer performs the probabilistic fusion of these high-level activities.

C. Summary and Insights
To summarize, different strategies can be employed to enhance CPS performance via speeding up data collection, processing and distribution. At the CPS traffic level, data collection and processing can be accelerated through caching at the edge, filtering the data to make it more manageable in size, caching execution tasks, and using cloudlets whenever possible. At the CPS devices level, the use of on-fly D2D, SDN, backup protection, and social/geographical community structures along with navigation services can help accelerate data delivery and reduce latency. The use of these technologies and solutions can significantly speed up data handling while at the same time extending the communication range of CPS traffic.

V. BIG DATA PROCESSING FOR CPS
Cloud computing along with data clustering facilitates the parallel processing and execution of tasks and queries. Mapping and scheduling workflows in a multi-cloud environment speeds up the processing and allows for a better big data management. In this section, we discuss cloud computing, big data clustering, NoSQL and fog computing for big data workflows processing.

A. Cloud Data Processing
With data volumes on the order of exabytes, it becomes almost impractical to process the data on individual machines, no matter how powerful they are. Parallel processing of the data chunks on dedicated servers, for example with the MapReduce tool proposed by Google, offers advantages over conventional processing methods; however, it is still not very effective at handling a large amount of data, mainly due to scalability, latency, availability, and inefficient programming techniques, including but not limited to database management systems [90], [91]. One attractive alternative to dedicated servers is processing in cloud centers, which offers users the ability to rent computing and storage resources in a pay-as-you-go manner [92]. In addition, even though users share common hardware, the shared resources appear exclusive to them through machine virtualization, which hides the platform details [93]. However, this approach can create problems in the pay-as-you-go environment due to untruthfulness, unfairness and inefficiency of resources and workload transactions [94].
The main difference between parallelizing tools such as MapReduce and cloud computing is that MapReduce uses mappers and reducers to produce intermediate results and final results, respectively, whereas public clouds offer users virtual machines (VMs) with a highly elastic resource allocation [92]. The Map function is in charge of processing input key-value pairs and generating intermediate key-value pairs, while the Reduce function is used to further compress the value set into a smaller set based on the intermediate values with the same keys [95]. To maximize the query rate of remotely located data in an attempt to maximize system performance, the authors in [96] designed a dynamic resource allocation algorithm that takes into account the computations of query streams across the nodes and the limited number of resources available.
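The roles of the Map and Reduce functions can be illustrated with a small single-machine sketch. It only mimics the programming model (map, group by key, reduce) and is not Hadoop or Google's implementation; the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and word counting is chosen as an assumed example workload.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: collapse all values that share one key into a single value."""
    return word, sum(counts)


def run_mapreduce(records):
    """Apply map_fn to every input pair, group by key, then apply reduce_fn."""
    # Shuffle step: gather the intermediate values of each key together.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


result = run_mapreduce([(0, "big data for CPS"), (1, "big data analytics")])
```

In a real deployment the map calls and the per-key reduce calls would run in parallel on different nodes; here they run sequentially to keep the data flow visible.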
Due to the volume and velocity characteristics of big data, streaming data processing and storage might require different compression techniques to ensure efficiency and scalability. Yang et al. proposed a novel low data accuracy loss compression technique for cloud data processing and storage. A similarity check was performed on partitioned data chunks and a compression is conducted over the data chunks rather than the basic data units [97]. Another similarity check-based compression technique was proposed in [98] using weighted fast compression distance.
Integrating cloud computing in IoT can take the processing of sensing data streams to the next level to provide ubiquitous sensing services beyond the capacities of individual things. When combined with artificial intelligence, machine learning, and neuromorphic computing techniques, it is envisioned that new applications will be developed with automated decision making, which would revolutionize the field of smart cities, industrial plants, environmental monitoring and others (see Fig. 2). Cloud computing will enable IoT applications to have reduced latency and power consumption and enhanced scalability. Examples of such applications include, but are not limited to, healthcare, where patients' information can be accessed using a cloudlets-based infrastructure [99]. Cloudlets are clouds that are closer to users to help overcome the high latency and power consumption of distant clouds [100]. Other cloud application examples include vehicle traffic control systems [101], genome analysis [102], earth surface analysis [103] and many others.

B. Multi-Cloud Data Processing
In many IoT scientific applications, the data collection and generation, computation, processing and analysis are broken down into workflows, consisting of interdependent computing entities. Due to the data-intensive nature of IoT applications, the large-scale workflows need to be distributed across multiple cloud centers [104]. To allow the support of multiple applications and to overcome the limitations of current frameworks that are dedicated to a unique type of applications, the authors in [105] proposed a distributed application management framework in multi-cloud environment by using a domain specific language (DSL) to describe applications in a hierarchical manner. However, the inter-cloud communications constitute a large share of the financial costs of processing workflows due to their large volume. In [104], to optimize system performance, Wu et al. proposed a budget-constrained workflow mapping in multi-cloud environment. An efficient pay-on-demand pricing strategy for streaming big data processing in multi-cloud environment was proposed in [106], to offer a low price for data load processing, while maximizing the revenues of cloud service providers. As for building trust across multiple cloud centers that collaborate together on data storage and processing, Li et al. in [107] proposed a trust-aware monitoring architecture between users and cloud centers with a hierarchical feedback mechanism to enhance the robustness and reliability of the quality of service of cloud providers, which helps provide them with a rating based on their trust reputation. This allows users to select services from different cloud providers based on feedback and past service records.
In [108], Wang et al. optimized virtual machine (VM) placement in national cloud data centers to help minimize the energy consumption of data-intensive services, such as planet analysis. In these cloud centers, VMs are assigned to physical servers, which helps provide users with a high quality of service but at the expense of greater energy consumption. The trade-off between energy consumption and quality of service is taken into consideration in the optimization problem. The cost of data access and storage limitations of national cloud centers was considered in [109], where the authors used Lagrangian relaxation based heuristics algorithm to obtain the optimal data centers placement that can reduce data access costs.
The distribution of users' tasks on geographically distributed cloud centers was addressed in [110], where the authors proposed a big data management solution to maximize system throughput such that fairness of limited resources usage by users is guaranteed and the operational cost of service providers is reduced. To allocate resources of multi-cloud centers to users, a multi-round combinational double auction based mechanism was proposed in [111], where auctions on different VMs were performed by both users and data centers in multiple rounds to maintain a high quality of service level.

C. Big Data Clustering
Data clustering refers to partitioning a set of objects, each described by a set of attributes, into different groups of similar objects and features [112]. Data clustering becomes very useful in big data applications, where there is a high need to process and analyze large volumes of data. Estimating the number of clusters becomes important as clustering facilitates the distribution of data storage, task execution, parallel computing, and query requests [113]. Fig. 3 shows two groups of clustering methods researched in depth in the literature: i) hierarchical clustering, and ii) centroid-based clustering. In hierarchical clustering, nearby objects have a higher probability of being grouped together than far away objects. On the other hand, in centroid-based clustering, objects closer to the cluster center are grouped together. The paper [114] discussed a flexible multiple clustering analytic and service framework, and a novel tensor-based multiple clusterings (TMC) approach.
One of the most popular centroid-based clustering methods is k-means, due to its computational efficiency and low-complexity implementation. However, as the number of clusters increases, k-means clustering suffers from the empty clustering problem and from an increase in the number of iterations needed for convergence. This means that traditional k-means is not suitable for big data applications. Different works have suggested enhanced versions of k-means clustering for purposes of improving clustering quality, execution time, and accuracy. For instance, in [115], the authors used an enhanced version of k-means clustering, where the initial centroids of the clusters are not selected randomly but based on averaging the data points. This achieves higher accuracy than conventional k-means. Another enhanced version of k-means clustering was suggested in [116], to help eliminate the empty clustering problem of traditional k-means. The clustering approach was based on a combination of Fireworks and Cuckoo-search algorithms with representative points being selected as the centroids. In [117], the authors also used a centroid calculation heuristic to help enhance the clustering performance as the number of clusters increases.
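The two weaknesses mentioned above (random initial centroids and empty clusters) can be made concrete with a minimal k-means sketch. The initialization and the empty-cluster repair below are assumptions chosen for illustration; they are not the specific procedures of [115], [116] or [117]. The function names are hypothetical, and the sketch assumes `k` does not exceed the number of points.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def init_centroids(points, k):
    # Assumed deterministic initialization: sort points by distance from the
    # global mean, cut the ordering into k slices, and average each slice.
    center = mean(points)
    ordered = sorted(points, key=lambda p: math.dist(p, center))
    size = len(ordered) // k
    slices = [ordered[i * size:(i + 1) * size] for i in range(k - 1)]
    slices.append(ordered[(k - 1) * size:])
    return [mean(s) for s in slices]


def kmeans(points, k, max_iter=100):
    """Return (centroids, clusters); requires 1 <= k <= len(points)."""
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Assumed empty-cluster repair: move the point of the largest cluster
        # that lies farthest from that cluster's mean into the empty cluster.
        for j in range(k):
            if not clusters[j]:
                donor = max((c for c in clusters if len(c) > 1), key=len)
                far = max(donor, key=lambda p: math.dist(p, mean(donor)))
                donor.remove(far)
                clusters[j].append(far)
        # Update step: stop once the centroids no longer change.
        updated = [mean(c) for c in clusters]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` separates the three points near the origin from the three points near (10, 10).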
In [112], for instance, the authors used hierarchical clustering in their proposed algorithm, where the number of clusters was estimated visually based on a reordered distance matrix. The data samples were clustered using single linkage (SL), which cuts large edges of the clustering tree in a minimum spanning tree (MST). The algorithm was shown to be superior in terms of clustering speed and accuracy when compared to other clustering algorithms like k-means. Another hierarchical clustering was suggested in [120], where users themselves define the number of clusters based on different similarity measures such as homogeneity and the relative population of each cluster. This helps improve user satisfaction with the clustering algorithm. A hierarchical k-means clustering algorithm was proposed in [121] to find high quality initial centroids. The centroids were obtained based on a hierarchical structure of k-means that consists of several levels, where the first level is the original dataset. Subsequent levels are composed of smaller datasets that exhibit patterns similar to those of the original dataset. Fuzzy clustering, another clustering technique, is similar to k-means; however, an object can be associated with more than a single cluster depending on its degrees of membership, which are usually calculated based on Euclidean distances between the object and the cluster centers [122].
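The degrees of membership used in fuzzy clustering can be sketched with the standard fuzzy c-means membership rule. This is a minimal illustration that assumes the usual fuzzifier parameter `m` (with `m > 1`); it is not necessarily the exact formulation of [122], and the function name is hypothetical.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Degrees of membership of one point in each cluster (they sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs only to them.
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    # Standard rule: u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)).
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
```

With centers at (0, 0) and (3, 0), the point (1, 0) is twice as close to the first center, so it receives the larger share of membership there while still belonging partly to the second cluster.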
Another big data clustering technique is clustering features (CF), where a CF-tree includes a summary of data patterns in a cluster. It uses a threshold preset by users to set the size of micro clusters in a CF-tree. However, this method can suffer from time and scalability issues, especially with the large amount of data that needs to be scanned and assessed to construct the tree. Clustering based on summary statistics allows the threshold to be different for different micro clusters by dynamically adjusting it based on regions' densities [123]. Graph clustering attempts to cluster data points based on the network's structure and nodes' connections, where similarities between nodes allow them to be connected to build a community. An example would be building a Tweet graph for the Twitter online social application, where Tweets (nodes) with similarities (similar URL count, similar hashtag count, or similar username count) are connected together with edges [124]. Finally, incremental mining helps deal with the increasing network size and data growth challenges by incrementally updating the clusters without the need for reconstructing them from scratch.
In the context of CPS applications, [125] attempted to perform structuralized clustering for sensor-based CPS for purposes of reducing the energy consumption of the sensor network. The sensor network is basically structured into small equal units, each consisting of a set of clusters, whose number is carefully selected based on the cluster head workload and its distance to the base station to achieve energy efficiency. In [126], the authors tried to solve the cascaded subnetworks' failures problem in CPS by obtaining an upper bound on the small clusters' size to mitigate networks' failure, especially when they are tightly coupled. In [127], the authors proposed a secure clustering method for vehicular CPS, where a trust metric is calculated for each vehicle based on their transmission characteristics in order to create secure clusters. Another secure clustering for vehicular CPS was proposed in [128], where the clustering problem is formulated as a coalition game taking into consideration the relative velocity, position and bandwidth of vehicles. Furthermore, incentives and penalty mechanisms are suggested to prevent selfish nodes from degrading the communication quality performance. A density-based stream data clustering for real-time monitoring CPS applications was suggested in [129], where the authors used FlockStream algorithm to group similar data streams. Each data point is associated with an agent, and similar agents within a visibility range of each other in the virtual space, form a flock allowing for real-time stream clustering. A summary of the different big data clustering techniques for CPS is provided in Table III.

D. NoSQL
Conventional relational database management systems are not suitable for heterogeneous big data processing, as they rely on a strict data model of pre-defined data structures and constraints with a fixed schema [140]. NoSQL (Not Only Structured Query Language) relaxes many of the relational databases' properties, such as ACID transactional properties, to allow for greater querying flexibility, operational scalability and simplicity, higher availability and faster read/write operations of unstructured big data through replicating and partitioning the data across several nodes [141], [142].
NoSQL databases can store data in three different forms: key-value stores, document databases, and column-oriented databases [143]. In the document databases form, the data is stored in a complex structure form such as XML documents. Column-oriented databases store columns of data in data tables, allowing greater ease of adding and deleting columns compared to row-oriented databases.
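The ease of adding and deleting columns in a column-oriented layout can be seen in a small in-memory sketch. It is only an illustration of the storage idea, not the design of any particular NoSQL product; the helper names and the sensor fields are hypothetical.

```python
def to_columns(rows):
    """Convert a list of row dictionaries into a column-oriented dictionary."""
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def add_column(columns, name, values):
    """Adding a column is a single insertion; existing columns are untouched."""
    n_rows = len(next(iter(columns.values())))
    if len(values) != n_rows:
        raise ValueError("column length must match the number of rows")
    columns[name] = list(values)


def drop_column(columns, name):
    """Deleting a column removes one entry instead of rewriting every row."""
    del columns[name]


# Row-oriented form: one dictionary per record (hypothetical sensor readings).
rows = [{"sensor": "s1", "temp": 21.5}, {"sensor": "s2", "temp": 19.0}]
columns = to_columns(rows)
add_column(columns, "humidity", [40, 55])
drop_column(columns, "temp")
```

In the row-oriented form the same two changes would require visiting and modifying every record.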

E. Fog Data Computing
Cloud computing and services can be extended to the edge of the network via the fog computing paradigm. Though both cloud and fog provide data computation, storage and application services to end-users, the distinguishing features of fog include its proximity to end-users, its dense geographical distribution and its mobility support [145]. These features make fog computing well suited to latency-sensitive applications. Consider an example [146] of smart traffic lights and connected vehicles, where an ambulance's flashing lights sensed by a video camera could automatically trigger street lights to open lanes for the ambulance to pass through the traffic. In a smart grid scenario [145], the fog devices at the edge collect the data generated by grid sensors, process the data in real time, and issue control commands to the actuators. A fog-based intelligent decision support system was proposed in [147] for driver safety and traffic violation monitoring based on the IoT. A smart city speeding traffic surveillance scheme using the fog computing paradigm was provided in [148], and its effectiveness was validated by intensive experiments conducted using real-world traffic surveillance video streams. Fog computing could make possible the early prediction of biomarkers to enable automated decision making in a connected health scenario [149]. A fog computing system for e-Health applications was implemented in [150], where fog nodes were installed in homes to achieve the smallest processing time. Disaster decision support systems, deployed with fog nodes, could process acquired real data and trigger alarms in case of an emergency [151]. An augmented brain computer interaction game based on fog computing and linked data was developed in [152]. Other exemplary scenarios [146] for fog computing could be wireless sensor and actuator networks, decentralized smart building control, software defined networks, and so on.
Beyond the latency-sensitive applications, fog computing could also offload the core network traffic and keep the sensitive data inside the network [2]. In [153], fog computing at smart gateways was shown to enhance the health monitoring system through the usage of advanced techniques such as embedded data mining, distributed storage and notification service at the edge of network. In [154], the authors proposed a high level programming model, called mobile fog, for future Internet applications that are geospatially distributed, large-scale and latency-sensitive. The placement and migration method for infrastructure providers was proposed in [155] to incorporate cloud and fog resources. Specifically, the network intensive operators were placed on distributed fog devices while computationally intensive operators were placed in the cloud. Some design goals and challenges in fog computing platform were described in [156], which also suggested several important components of a fog computing platform, i.e., authentication and authorization, offloading management, location services, system monitor, resource management, virtual machine scheduling.
The security and privacy issues were examined in the context of smart grids [28], machine-to-machine communications [157], and so on. These security solutions for cloud computing may not be directly applicable to fog computing, as fog devices are placed at the edge of networks. One security issue of fog computing is the authentication at different levels of gateways. The compromise of gateways serving as fog devices could lead to a Man-in-the-Middle attack [158]. Since a number of fog networks are connected through wireless links, typical wireless attacks (e.g., jamming attacks, sniffer attacks, and so on) could be possible threats for fog computing [27]. Furthermore, the leakage of private information, such as data, location or usage, could be the main concern of end users.

F. Summary and Insights
From Table IV, we can observe that users' workloads can be more efficiently processed in a multi-cloud environment, especially given that most CPS applications deal with a large volume of data. However, such a solution requires additional optimization techniques to optimize VM placement for energy, security, fairness, and costs. We have discussed relevant issues in fog computing, while there are further big data processing issues in cloud computing and edge computing, such as relevant architectures and applications, distributed data analytical frameworks, cyber defense and cyber intelligence, as well as convergence and complexity issues.

VI. BIG DATA ANALYTICS
Big data analytics constitutes one of the most important arenas in big data systems, as it makes it possible to uncover hidden patterns, unknown correlations and other useful information, which in turn assist in boosting the revenues of many businesses. In this section, we present an overview of relevant big data analytics techniques and tools. Readers may find an introduction to big data analytics in [159].

A. Data Mining
One of the interesting features of CPS is the automated decision making. This means that CPS objects are supposed to be smart in sensing, identifying events and interacting with others [160]. The massive data collected by CPS needs to be converted into useful knowledge to uncover hidden patterns to find solutions, enhance system performance and quality of services. The process of extracting this useful information is referred to as data mining. One solution to facilitate the data mining process is to reduce data complexity via allowing objects to capture only the interesting data rather than all of it. Before data mining can be applied to the data, some processing steps need to be completed such as key features selection, preprocessing and transformation of data. Dimensionality reduction is one potential method to reduce the number of features of the data [161]. For instance, in [162], Chen et al. used neural network with k-means clustering via principal component analysis (PCA) to reduce the complexity and the number of dimensions of gene expression data to extract disease-related information from gene expression profiles. Knowledge discovery in databases (KDD) is also used in different CPS scenarios to find hidden patterns and unknown correlations in data so that useful information can be converted into knowledge [163]. One such use of KDD is in smart infrastructures systems, where these systems need to answer queries and make recommendations about the system operation to the facility manager [164].
Tsai et al. [5] broke down the core operations of data mining into three main operations: data scanning, rules construction and rules update. Data scanning is selecting the needed data by the operator. Rules construction includes creating candidate rules by using selection, construction and perturbation. Finally, candidate rules are checked by the operator, then evaluated to determine which ones will be kept for the next iteration. The process of scanning, construction and update operations is repeated until the termination criterion is met. This data mining framework works for deterministic mining algorithms such as k-means, and for metaheuristic algorithms such as simulated annealing and genetic algorithms.
Clustering, classification and frequent pattern mining are different mining techniques that can be used to make CPS smarter. Clustering methods have already been discussed in Section V-C. Tsai et al. [5] discussed two different purposes for clustering: i) clustering for infrastructure of IoT, and ii) clustering for services of IoT. Clustering for infrastructure of IoT helps enhance system performance in terms of identification, sensing and actuation, such as in [165], where nodes can exchange information with each other to identify whether they can be grouped together depending on the needs of the IoT applications. As for services of IoT, clustering can help provide higher quality services such as in smart homes [166]. Clustering does not require prior knowledge to complete the partitioning of objects into clusters, which is also known as unsupervised learning. Classification, on the other hand, assigns objects to predefined classes using labeled training data, which is also known as supervised learning. Classification tools include decision trees, k-nearest neighbor, naive Bayesian classification, AdaBoost and support vector machines. Classification can also be done to improve infrastructure as well as services of IoT. Finally, frequent pattern mining is about uncovering interesting patterns, such as which items will be purchased together with previously purchased items, or suggesting items for customers to purchase based on their characteristics, behavior, purchase history, and so on. Fig. 4 illustrates the CPS big data mining process for useful information extraction. The paper [167] proposed the NextCell algorithm to improve location prediction by utilizing the social interplay mined from cellular call records.
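Frequent pattern mining of the "purchased together" kind can be sketched by counting how often item pairs co-occur across transactions. This is a minimal illustration restricted to pairs, with a hypothetical function name and an assumed absolute support threshold; full algorithms such as Apriori extend the same idea to larger itemsets.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that co-occur in at least `min_support` transactions."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical purchase histories.
baskets = [["a", "b"], ["a", "b", "c"], ["b", "c"], ["a", "b"]]
pairs = frequent_pairs(baskets, min_support=2)
```

Here items `a` and `b` appear together in three baskets and `b` and `c` in two, so both pairs pass the threshold, while the pair `a` and `c` (one basket) is discarded.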

B. Real-time Analytics
Real-time analysis is another approach to produce useful information from massive raw data. Real-time data streams are first converted to a structured form before being analyzed by big data analysis tools such as Hadoop. Many application domains such as healthcare, transportation systems, environmental monitoring, and smart cities will require real-time decision making and control [12]. For example, Twitter data can be analyzed in real time to enhance the prediction process and to provide useful recommendations to users [168]; terrorist incident data can be analyzed in real time to predict future incidents [169]; big data streams in healthcare can be analyzed to help medical staff make decisions in real time, which can help save patients' lives and improve the healthcare services provided, while reducing medical costs [170]. A near real-time big data analysis architecture for vehicular networks was proposed in [171], which consists of a centralized data storage for data processing and a distributed data storage for streaming processed data in real-time analysis.
In [172], a real-time hybrid-stream big data analytics model was proposed for big data video analysis. The paper [173] treated online network analysis as a stream analysis issue and proposed to utilize Spark Streaming to monitor and analyze high-speed Internet traffic data in real time. The work [174] proposed mobile edge computing nodes deployed on a transit bus with descriptive analytics to explore meaningful patterns of real-time data streams for transit. The paper [175] discussed the term and related concepts of Real Time Analytics (RTA) for industry big data analytical solutions. The paper [176] provided a framework to efficiently leverage big data technology and allow deep analysis of large and complex datasets for real-time big data warehousing.
Arranging the data in a representative form can provide information visualization, which makes information extraction and the understanding of complex large-scale systems much easier [177]. Geographical information systems (GIS) are one important visualization tool [178], as they can support real-time analysis in many applications such as healthcare, urban and regional planning, transportation systems, emergency situations, public safety, and so on. In [178], the authors proposed a large-scale system data visualization architecture called X-SimViz, which allows users to perform real-time dynamic data analytics and visualization. Visualization can also be a useful tool in predicting real-time cyber attacks, and computer vision is another approach to detecting security anomalies. For instance, in [130], the authors used computer vision to transform the network traffic data into images using a multivariate correlation analysis approach based on a dissimilarity measure called Earth Mover's Distance to help detect denial-of-service attacks. A computer vision deep learning algorithm for human activity recognition was proposed in [179]. The model is capable of recognizing twelve types of human activities with high accuracy and without the need for prior knowledge, which is useful for security monitoring applications.
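The Earth Mover's Distance mentioned above measures how much "mass" must be moved, and how far, to turn one distribution into another. The sketch below covers only the simplest case, two one-dimensional histograms over the same unit-spaced bins, and is an assumed simplification rather than the multivariate image-based approach of [130]; the function name is hypothetical.

```python
def emd_1d(hist_p, hist_q):
    """Earth Mover's Distance between two 1-D histograms on the same bins.

    Both histograms are normalized to unit mass; bins are assumed to be
    one unit apart, so the result is expressed in bin widths.
    """
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(hist_p), sum(hist_q)
    distance = 0.0
    carried = 0.0  # surplus mass that still has to be moved to later bins
    for p, q in zip(hist_p, hist_q):
        carried += p / total_p - q / total_q
        distance += abs(carried)
    return distance
```

Identical histograms give a distance of zero, while moving all the mass from the first of three bins to the last costs two bin widths. In an anomaly detection setting, a traffic profile whose distance from a normal profile exceeds a chosen threshold would be flagged.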

C. Cloud-Based Big Data Analytics
Cloud-based analysis in CPS constitutes a scalable and reliable architecture to perform analytics operations on big data streams, such as extracting, aggregating and analyzing data of different granularities [180]. A massive amount of data is usually stored in spreadsheets or other applications, and a cloud-based analytics service, using statistical analysis and machine learning, helps reduce the big data to a manageable size so that information can be extracted, hypotheses can be tested, and conclusions can be drawn from non-numerical data such as photos. Data can be imported from the cloud and users are able to run cloud data analytics algorithms on big datasets, after which data can be stored back to the cloud [181]. For instance, in [182], the authors used cloud computing with the MapReduce algorithm to analyze crime rates in the city of Austin using different attributes like crime type and location, to help build a design that prevents future crimes for public safety.
Even though cloud computing is an attractive analytics tool for big data applications, it comes with several challenges, mainly concerning security, privacy and data ownership, which will be discussed further in Section VII. In [99], the authors extended the use of clouds to mobile cloud computing to help overcome the challenge of resources limitations such as memory, battery life and CPU power. A mobile cloud computing architecture was suggested for healthcare applications with discussion on various big data analytic tools available. In [183], the authors suggested using a hybrid cloud computing consisting of public and private clouds to accelerate the analysis of massive data workloads on MapReduce framework without requiring significant modifications to the framework. In a private cloud, cloud services delivered over the physical infrastructure are exclusively dedicated to the tenant. The hybrid cloud uses a set of virtual machines running on the private cloud, which take advantage of data locality, and another set of virtual machines run on a public cloud to run the analysis at a faster rate.
To optimize the utilization of cloud computing resources, predicting the expected workload and the amount of resources needed becomes important to reduce waste. In [184], the authors developed a system that predicts the resource requirements of a MapReduce application to optimize bandwidth allocation to the application, while, in [185], the authors used linear networks along with linear regression to predict the future need for new resources and VMs. When the system falls short in predicting the right amount of resources needed, it becomes incapable of accommodating a high workload demand, leading to anomalies. Anomaly detection is an essential part of big data analytics, as it helps improve the quality of service by checking whether the observed workload measurements and the baseline workloads diverge by a specific margin, where the baseline workloads provide a measure of how the demand changes during a period of time based on historical records [186].
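The baseline comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of [186]: the baseline is taken to be the per-slot average over historical periods, the margin is a relative one, and the function names and the default value of `margin` are hypothetical.

```python
def baseline_from_history(history):
    """Average workload per time slot over several historical periods."""
    n_periods = len(history)
    # zip(*history) groups the measurements that belong to the same slot.
    return [sum(slot) / n_periods for slot in zip(*history)]


def detect_anomalies(observed, baseline, margin=0.3):
    """Indices of slots where |observed - baseline| exceeds margin * baseline."""
    return [
        i
        for i, (obs, base) in enumerate(zip(observed, baseline))
        if abs(obs - base) > margin * base
    ]


# Two hypothetical historical periods with three time slots each.
history = [[100, 200, 300], [120, 180, 300]]
baseline = baseline_from_history(history)
anomalies = detect_anomalies([115, 400, 290], baseline)
```

In this example only the second slot is reported, since its observed workload is roughly twice the baseline while the other two slots stay within the margin.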

D. Spatial-Temporal Analytics
Massive data obtained from widely deployed spatio-temporal sensors have posed grand challenges in data storage, processing scalability, and retrieval efficiency. The paper [187] proposed VegaIndexer, a distributed composite spatio-temporal index approach for efficiently processing large amounts of spatio-temporal sensor data. The paper [188] investigated the big data issues in Internet of Vehicles (IoV) applications and proposed to use cloud-based big data space-time analytics to enhance the analysis efficiency. The paper [189] proposed STAnD to determine anomaly patterns for potential malicious events within such spatial-temporal data sets. The spatially distributed CPS nodes can also be used to analyze location information. The paper [190] proposed an efficient indoor positioning approach based on a new empirical propagation model using fingerprinting sensors, called the regional propagation model (RPM), which is based on cluster-based propagation model theory, and the paper [191] then used particle swarm optimization (PSO) to estimate the location, with a Kalman filter updating the initial estimate.

E. Big Data Analytical Tools
In this section, some typical tools are briefly introduced for the three aforementioned methods of big data analytics, namely data mining, real-time big data analytics and cloud-based big data analytics.

1) Tools for Data Mining:
Hadoop [22] is an open-source framework managed by the Apache Software Foundation. Hadoop has two main components, namely HDFS [14] and MapReduce [15]. HDFS was inspired by GFS [19]; it is a scalable and distributed storage system that is well suited to data-intensive applications at the gigabyte and terabyte scales. Rather than just being the storage layer of Hadoop, HDFS also improves the throughput of the system and supplies efficient fault detection and automatic recovery. MapReduce is a framework used to analyze massive data sets in a distributed fashion across numerous machines [192]. The MapReduce programming model consists of two functions, Map and Reduce, both of which are defined by the programmer. R [16] is also an open-source software environment for data mining. R is an implementation of the S language, which was developed at AT&T Bell Labs, and it is used to explore data, perform statistical analysis and draw plots. Compared with S, R is more popular and is supported by a large number of database vendors, such as Teradata and Oracle.
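To make the roles of the Map and Reduce functions concrete, the following Python sketch simulates a word count in a single process. It is only an illustration of the programming model: it does not use the Hadoop API, and the function names and the explicit shuffle step are assumptions chosen for readability.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data meets CPS", "big data for CPS security"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle step moves the intermediate pairs across the network to the reducers.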
2) Tools for Real-Time Big Data Analytics: Storm [16], [17] is a distributed real-time computing system for big data analysis. Compared with Hadoop, Storm is easier to operate and more scalable, which allows it to provide competitive and efficient services. Storm uses distinct topologies for different Storm tasks, which run on Storm clusters composed of master nodes and worker nodes. The master nodes and worker nodes play two kinds of roles in big data analysis, namely nimbus and supervisor, respectively. The functions of these two roles are analogous to those of the jobtracker and tasktracker of the MapReduce framework. Nimbus takes charge of distributing code across the Storm cluster, scheduling and assigning tasks to worker nodes, and monitoring the whole system. The supervisor carries out the tasks assigned by nimbus. Splunk [18] is also a real-time platform designed for big data analytics. Through its web interface, Splunk can search, monitor and analyze machine-generated big data, and the results are presented in various forms including graphs, reports, alerts and so on. Compared with other real-time analytical tools, Splunk provides various smart services for commercial operations, system problem diagnosis, and so on.

3) Tools for Cloud-Based Big Data Analytics:
As the most popular tool for cloud-based big data analytics, Google's cloud computing platform [193] consists of GFS [26] (big data storage), BigTable [20] (big data management) and MapReduce (cloud computing), which was discussed in the previous section. GFS is a distributed file system that was enhanced to meet the big data storage and usage demands of Google Inc. In order to deal with the failure of commodity components, GFS provides continuous monitoring, error detection and fault tolerance. GFS adopts a clustered approach that divides data chunks into 64-KB blocks and stores a 32-bit checksum for each block. BigTable supplies highly adaptable, reliable, applicable and dynamic control and management of big data placement, representation, indexing and clustering across a very large number of distributed commodity servers, and its data are organized by row, column, tablet and timestamp.
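The per-block checksum idea can be sketched in a few lines of Python. The text above only states that a 32-bit checksum is kept for each 64-KB block, so the use of CRC-32 via `zlib.crc32` and the helper names below are assumptions made for illustration.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64-KB blocks, as described in the text


def block_checksums(chunk: bytes):
    """Compute one 32-bit checksum per 64-KB block of a chunk (CRC-32 assumed)."""
    return [
        zlib.crc32(chunk[offset:offset + BLOCK_SIZE])
        for offset in range(0, len(chunk), BLOCK_SIZE)
    ]


def corrupted_blocks(chunk: bytes, expected):
    """Return the indices of blocks whose checksum no longer matches."""
    actual = block_checksums(chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# A hypothetical 200-KB chunk spans four blocks (64 + 64 + 64 + 8 KB).
chunk = bytes(200 * 1024)
checksums = block_checksums(chunk)

# Flip one byte inside the second block and verify that only that block is flagged.
damaged = bytearray(chunk)
damaged[70 * 1024] ^= 0xFF
bad = corrupted_blocks(bytes(damaged), checksums)
print(bad)
```

Keeping checksums at block granularity means that a reader only has to verify, and a replica only has to repair, the blocks that overlap the requested range.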

F. Summary and Insights
To better extract information from big data, it is of utmost importance to enhance the cloud's analysis performance. A combination of the different techniques discussed in this section can be used to optimize cloud computing resources. If the requirements for VMs and cloud resources can be predicted beforehand, then workloads can be efficiently processed and analyzed by taking advantage of the cloud analytical tools previously discussed. Furthermore, using a hybrid cloud can further speed up the analysis of workloads, leading to reduced latency and efficient data mining.

VII. BIG DATA CYBERSECURITY AND PRIVACY
In the realm of cyber-physical systems, the tight interaction among physical objects that collect and transmit a large volume of data places security threats in the spotlight. With this enormous amount of data constantly flowing through the network, it becomes essential to protect the system from cyber attacks [194]. In this section, we provide an overview of the different security solutions proposed for big data storage, access and analytics (see Table V for a summary).

A. Security in Big Data Storage and Access
While storing data in the cloud offers several advantages in terms of storage, availability, scalability and processing, it increases the chance of malicious attacks, in addition to potential privacy invasion by cloud operators who can access sensitive data. All this puts a question mark on whether cloud data storage is feasible, especially for governmental agencies and financial industries. Several works have attempted to solve the security challenges of cloud storage. For instance, Gai et al. proposed a method that splits files into encrypted parts and stores them in distributed cloud servers without users' data being directly reachable by cloud service operators [195]. In [196], the authors optimized the data placement on cloud servers to minimize the retrieval time of data files while guaranteeing their security based on the distance between the nodes that store the data chunks, such that a malicious attacker cannot guess the locations of all the data chunks. In [197], the authors suggested that data should be encrypted before being sent to clouds and decrypted after retrieval. The paper [198] used iOS devices as a case study and highlighted how the pairing mode in iOS devices, which supports a trusted relationship between an iOS device and a personal computer, could be exploited for covert data exfiltration.
When data need to be transferred from one cloud to another, data privacy and integrity become important. In [199], Ni et al. proposed a secure data transfer scheme where users encrypt the data blocks before uploading them to the cloud. For transfers from one cloud to another, a security protocol based on secret keys was described, and signature checking with polynomial-based authentication was performed without retrieving data from the source cloud. While the mentioned works considered encryption as a way to protect the data from privacy violations, encryption introduces a new challenge: cloud data deduplication, especially when data is shared among many users. Deduplication can save up to 95% of the storage needed for backup applications [200] and 68% for standard file systems [201], whereas storing duplicated encrypted data wastes resources, consumes energy, and makes data management very complicated. In [202], Yan et al. attempted to solve the deduplication problem by proposing a scheme in which users upload the encrypted data to the cloud along with a token for the data duplication check, which is then used by the cloud service provider to check whether the data has already been stored. A scheme to verify data ownership was also presented to ensure secure data management.
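The token-based duplication check can be illustrated with a simplified Python sketch. This is not the scheme of [202]: the token here is simply a SHA-256 digest of the plaintext, the ownership verification step is omitted, and the class and function names are hypothetical.

```python
import hashlib


def make_token(plaintext: bytes) -> str:
    """Derive a duplication-check token from the data (plain SHA-256, an assumption)."""
    return hashlib.sha256(plaintext).hexdigest()


class DedupStore:
    """Toy cloud store that keeps one ciphertext per token."""

    def __init__(self):
        self._blobs = {}   # token -> stored ciphertext
        self._owners = {}  # token -> set of user ids allowed to access it

    def upload(self, user: str, token: str, ciphertext: bytes) -> bool:
        """Store the ciphertext unless the token is already known.

        Returns True if new data was stored, False if it was a duplicate.
        """
        if token in self._blobs:
            self._owners[token].add(user)  # record the additional owner only
            return False
        self._blobs[token] = ciphertext
        self._owners[token] = {user}
        return True

    def stored_objects(self) -> int:
        return len(self._blobs)


store = DedupStore()
report = b"meter readings 2020-01"
token = make_token(report)
first = store.upload("alice", token, b"<ciphertext under Alice's key>")
second = store.upload("bob", token, b"<ciphertext under Bob's key>")
print(first, second, store.stored_objects())
```

The second upload is recognized as a duplicate even though the two ciphertexts differ, which is the point of sending a separate token. A production scheme would also have to prevent a user from claiming ownership with a token alone.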

B. Security in Big Data Analytics
Enabling the security and privacy aspects of big data analytics has attracted great attention from the scientific community for several reasons. First, the data are likely to be stored, processed and analyzed in several cloud centers, leading to security issues due to the random locations of the data. Second, big data analytics treats sensitive data in a similar way to other data without taking security measures such as encryption or blind processing into consideration [203]. Third, big data computations need to be protected from malicious attacks in order to preserve the integrity of the extracted results. In the realm of CPS, the enormous amount of data makes the surveillance of security-related information for anomaly detection a challenging task for analysts. In healthcare, for instance, the security of information extraction from massive amounts of data and the accuracy of analytics are of high importance. Sensitive data recorded in databases need to be protected by monitoring which applications and users access the data [204]. In order to guarantee strongly secure big data analytics, the following tasks can be performed [205]:
• Surveillance and monitoring of real-time data streams,
• Implementation of advanced security controls such as additional authentication and blocking suspicious transactions,
• Anomaly detection in behavior, usage, access and network traffic,
• Defending the system against malicious attacks in real time,
• Adoption of visualization techniques that give a full overview of network problems and progress in real time.
In [206], the authors tackled big data analytics in mobile cellular networks based on random matrix theory, where big data is represented in matrix form of size n × N, where n is the number of data samples of a random vector x, and N is the number of independent realizations of x. For cellular networks, big data manifests as 1) big signaling data consisting of a large number of control messages to ensure reliability, security and efficiency of communications, 2) big traffic data which require traffic monitoring and analysis to balance network load and optimize system performance, 3) big location data generated by GPS sensors, Bluetooth, WiFi and so on, to assist in different areas such as transportation systems, public safety, crime hot spot analysis and so on, 4) big radio waveform data emanating from 5G massive MIMO systems to estimate users' moving speed for the purpose of finding correlations among transmitted signals as well as to assist in channel estimation, 5) big heterogeneous data such as data rate, packet drop, mobility and so on that can be analyzed to ensure cybersecurity. The work [207] proposed a secure high-order clustering algorithm via fast search of density peaks on a hybrid cloud for the industrial Internet of Things.
Machine learning, among other tools, offers a promising solution to automate many of the above-mentioned security-related tasks, especially with the continuous growth of the flowing data in terms of scale and complexity. By training on datasets, machine learning makes it possible to detect future security anomalies by detecting unusual activities in the network traffic. To achieve a higher accuracy, a large volume of training data is needed, but this comes at the cost of added overhead and storage constraints. The training process can be supervised, unsupervised or semi-supervised, depending on whether the outcome of a particular dataset is already known. In particular, the system starts by grouping similar data into clusters to determine their anomaly. A human analyst can then explore and identify any unusual data. The outcome found by the analyst can then be fed back to the training system in order to make it more "supervised" [208]. This has the potential to enable the training system to adapt to new forms of threats without human intervention, so that actions can be taken immediately before actual damage occurs.
Different approaches for anomaly detection exist in the literature. One approach is to discretize the continuous domain, as in the surveillance system in [209], where the author partitioned the surveillance area into a square grid in which the positions and velocities of the moving objects falling in each cell are modeled by a Poisson point process. Another approach is multivariate Gaussian analysis, in which data are flagged as abnormal when they lie a number of standard deviations away from the mean. For instance, in [210], the authors used multivariate Gaussian analysis to detect Internet attacks and intrusions by analyzing the statistical properties of the captured IP traffic. In clustering methods such as k-means clustering, data points are grouped into clusters based on their distance to the center of the cluster. Then, if a data point lies outside of the clusters, it is considered an anomaly. The authors in [211] used kernel k-means clustering with local-neighborhood information to detect a change in an image by optimally computing the kernel weights of image features such as intensity and texture. As for the artificial neural network approach, one implementation of such a model is the autoencoder, also known as a replicator neural network, which flags anomalies based on the difference between the test data and the reconstructed data. This means that if the error between the test and reconstructed data exceeds a specified threshold, the test data are considered far away from the distribution of a healthy system [212]. An example of such an approach is given in [213], where the authors used the autoencoder as a high-accuracy and low-latency model to detect anomalies in the energy consumption and operation of smart meters.
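The Gaussian rule described above, flagging points that lie several standard deviations from the mean, can be sketched as follows. To keep the example dependency-free, the features are assumed to be independent (a diagonal covariance matrix), which is a simplification of full multivariate Gaussian analysis and is not the method of [210]. The feature names, sample values and the threshold of three standard deviations are illustrative assumptions.

```python
from statistics import mean, pstdev


def fit(samples):
    """Estimate the per-feature mean and standard deviation from normal traffic."""
    columns = list(zip(*samples))
    return [mean(col) for col in columns], [pstdev(col) for col in columns]


def is_anomaly(point, means, stdevs, k=3.0):
    """Flag a point if any feature lies more than k standard deviations from its mean."""
    return any(abs(x - m) > k * s for x, m, s in zip(point, means, stdevs))


# Hypothetical training data: (packets per second, mean packet size in bytes).
normal_traffic = [(100, 500), (110, 520), (90, 480), (105, 510), (95, 490)]
means, stdevs = fit(normal_traffic)

typical = is_anomaly((102, 505), means, stdevs)
burst = is_anomaly((400, 500), means, stdevs)
print(typical, burst)
```

A full multivariate treatment would replace the per-feature test with a distance that accounts for correlations between features, such as the Mahalanobis distance.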
Privacy-preserving data analytics and mining can be quite challenging tasks, since analyzing encrypted data is inefficient, costly and not straightforward. Homomorphic encryption is one of the solutions proposed to enable analytical operations to be performed on ciphertexts using multiple mathematical operations [214], [215]. For instance, in [216], the authors used Efficient Privacy-preserving Outsourced calculation with Multiple keys (EPOM) homomorphic encryption to encrypt data before sending it to the cloud. The cloud then uses the ID3 algorithm to perform data mining on the encrypted data. The algorithm builds a hierarchical decision tree to determine which attributes from a set of samples provide the best prediction or information gain. Another approach for performing mining on encrypted data in the cloud using homomorphic encryption was suggested in [217]. A cloud service provider (CSP) collects and stores encrypted data, while a server, referred to as the Evaluator, collaborates with the CSP to perform mining over the encrypted data. A miner submits encrypted mining queries to the CSP, which in turn computes the inner product between vectors to determine the frequency of the mined itemset without the CSP or the Evaluator having access to the sensitive data. However, homomorphic encryption can be computationally expensive and impractical for big datasets [218].
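To illustrate what computing on ciphertexts means, the following Python sketch implements a toy version of the Paillier cryptosystem, a well-known additively homomorphic scheme. It is neither EPOM [216] nor the protocol of [217], and the default primes are far too small to be secure; they are chosen only so that the example runs instantly. The sketch requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets


def keygen(p=10007, q=10009):
    """Toy Paillier key generation with generator g = n + 1 (insecure key size)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam; valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under the public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def scale_encrypted(n, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(c, k, n * n)


def encrypted_inner_product(n, encrypted_vec, plain_vec):
    """Inner product of an encrypted vector with a plaintext vector."""
    acc = encrypt(n, 0)
    for c, w in zip(encrypted_vec, plain_vec):
        acc = add_encrypted(n, acc, scale_encrypted(n, c, w))
    return acc


n, private_key = keygen()
total = decrypt(n, private_key, add_encrypted(n, encrypt(n, 20), encrypt(n, 22)))

# A hypothetical encrypted transaction (item present = 1) matched against a query itemset.
transaction = [encrypt(n, bit) for bit in [1, 0, 1, 1]]
matches = decrypt(n, private_key, encrypted_inner_product(n, transaction, [1, 0, 1, 0]))
print(total, matches)
```

The party that evaluates the inner product never sees the transaction bits; only the holder of the private key can decrypt the result.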
One potential approach to protecting private data during the analytic processes is the k-anonymization proposed in [219]. First, users who access the data need to be authenticated and authorized based on the privacy level of the shared results. Then, a list of hashed and primary identifiers is generated to act as a data filter for the information that can be accessed by the authorized user. K-anonymization is then applied to the personally identifiable columns in the dataset in order to generalize or suppress values in the output dataset. The result is a k-anonymized list of the authorized accessible data, on which analysis and mining can be performed.
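The generalize-or-suppress step can be sketched as follows. The example covers only that step and omits the authentication and identifier-filtering stages of [219]. The record layout (age, zip code, diagnosis), the ten-year age bands and the three-digit zip prefix are hypothetical choices.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age by a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_anonymize(records, k):
    """Generalize quasi-identifiers, then suppress groups smaller than k.

    records: list of (age, zip_code, sensitive_value) tuples.
    """
    generalized = [
        (generalize_age(age), zip_code[:3] + "**", value)
        for age, zip_code, value in records
    ]
    group_sizes = Counter((age_band, zip_prefix) for age_band, zip_prefix, _ in generalized)
    return [rec for rec in generalized if group_sizes[(rec[0], rec[1])] >= k]


records = [
    (34, "10115", "flu"),
    (37, "10117", "asthma"),
    (52, "20095", "flu"),
    (31, "10119", "diabetes"),
]
released = k_anonymize(records, k=2)
for row in released:
    print(row)
```

With k = 2 the single record in the 50-59 band is suppressed, because releasing it would make that individual unique within the output.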
Although k-anonymization appears to be a promising and adaptable approach to privacy-preserving data analytics that is independent of the underlying processes, its feasibility in big data contexts has not yet been thoroughly evaluated with respect to data types and computational time. Furthermore, k-anonymization can be problematic if different datasets contain the same sensitive values [220]. One approach that attempts to solve this problem is the cosine similarity computation protocol suggested in [218], [221]. The proposed approach scales to larger datasets for both binary and numerical data types in a time-efficient manner. The idea is to allow data to be shared without disclosing the sensitive information to unauthorized users. This can be done by computing the scalar product between different vectors of numerical values and, from it, the cosine of the angle between them. A result closer to 1 indicates that the vectors are more similar to each other.
On the other hand, many big data applications are hierarchical in nature, and thus require hierarchical privacy-preserving solutions. For instance, in [222], hierarchical cloud and community access control can be implemented to strengthen privacy preservation in smart homes and smart meters. The home controller, which protects household personal data, is connected to a cloud platform through a community network, which provides privacy-preserving solutions to homes through data separation, aggregation and fusion. The cloud combines the access control schemes for homes and the community into a more complex and stronger privacy protection process.
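For reference, the similarity measure itself is straightforward to compute in the clear, as the short Python sketch below shows. The privacy-preserving protocol of [218], [221] obtains the same quantity without the parties revealing their vectors to each other, and that part is not reproduced here. The function name is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long numerical vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction give a value of 1 (up to floating-point rounding), while orthogonal vectors give 0.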

C. Summary and Insights
From Table V, it is clear that a single security solution is not sufficient to ensure a robust system against attackers. It is necessary to incorporate different strategies to address the security flaws that stem from poor system design. For instance, it makes sense to combine advanced security controls with cryptography to guarantee that only legitimate users have access to data. However, such a solution might not work well for delay-sensitive CPS applications such as e-health systems, especially when the small sensors have limited computational capabilities. For such applications, it is more efficient to employ visualization detection along with machine learning-based anomaly detection techniques to protect users' sensitive data.

VIII. BIG DATA MEET GREEN CHALLENGES FOR CPS
There are two perspectives on big data meeting green challenges for CPS. Firstly, we address greening big data systems for CPS, including CPS big data collection/storage, computing and processing. Secondly, we discuss big data with CPS toward green applications.

A. Greening big data for CPS

1) Green Data Collections and Communications:
With a massive number of interacting objects, gathering sensed data poses a challenge in terms of energy consumption, mainly due to the limited communication range between subnetworks, which requires objects to act as relays for surrounding objects in order to extend their communications. This affects the lifetime of objects, since each object needs to relay the large volume of data generated by its neighbors [233]. In this context, different solutions have been proposed for different big data applications. In [233], Takaishi et al. proposed energy-efficient solutions for data collection in densely distributed sensor networks. In an attempt to reduce the number of relay transmissions needed, sensor nodes transmit their data to a data collector node, the sink node, when they are in close proximity to it. Therefore, it becomes important to determine the trajectory that the sink node needs to follow based on the nodes' information, such as location and residual energy, as well as the cluster formation, in order to reduce the energy consumption of data collection. Data compression is another solution that helps deal with the challenges of data storage, collection, transmission, processing, and analysis. For instance, in [262], the authors proposed a highly efficient lossy data compression method based on smart meters' load features, states and events with a small reconstruction error. The ZIP-IO compression technique was proposed in [263], using FPGA as a potential implementation framework. Video compression is another important tool in big data for surveillance applications. The authors in [264] proposed a background-based coding optimization algorithm that uses the residual gradient and the block edge differences to improve picture quality while achieving a high level of compression.
To achieve higher power savings for machine-to-machine (M2M) or machine type communications (MTC), the paper [265] proposed improved approaches for M2M devices, radio access networks and core networks with simplified activities under an optimized signaling flow, without introducing negative impacts on legacy human-to-human (H2H) terminals. In wireless cellular networks, when mobile terminals can communicate directly with each other without the aid of central stations or base stations, the communications are called device-to-device (D2D) communications [266], which can be regarded as another form of CPS in those scenarios. The paper [267] studied energy-efficient power control for D2D communications underlaying cellular networks with multiple D2D pairs and co-channel interference caused by resource sharing. The work [268] proposed an architecture using D2D multicast for energy-efficient content delivery in cellular networks. Removing redundant transmission links can also reduce energy consumption while increasing network capacity. Topology control algorithms such as local minimum spanning tree (LMST) [239] and Local Tree-based Reliable Topology (LTRT) [240] have low computational complexity and can help obtain the best logical topology, which can be beneficial for energy-efficient data collection. In addition to removing redundant transmission links, removing redundant data can help save energy, such as in [242], where a compressive-sensing-based collection framework was proposed for reducing data redundancy and saving energy. To implement this solution, an online learning module predicts the amount of data (principal data) that needs to be collected for compressive sensing. This means that the principal data are supposed to represent the whole big data set using the compressive sensing technique. Then, each node can locally and dynamically adjust its collection strategy depending on the status of its neighbors, residual energy, and link quality.
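The effect of removing redundant links through topology control can be illustrated with a minimum spanning tree over node positions. The Python sketch below uses Prim's algorithm with full knowledge of all node positions, which is a simplification: LMST [239] has each node build a spanning tree over its local neighborhood only, and LTRT [240] additionally addresses reliability. The coordinates and function name are assumptions, and the code requires Python 3.8 or later for `math.dist`.

```python
import math


def minimum_spanning_tree(positions):
    """Return the links (u, v) of a minimum spanning tree using Prim's algorithm.

    positions: list of (x, y) node coordinates; link cost is Euclidean distance.
    """
    n = len(positions)
    if n == 0:
        return []
    in_tree = {0}
    links = []
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                if v in in_tree:
                    continue
                d = math.dist(positions[u], positions[v])
                if best is None or d < best[0]:
                    best = (d, u, v)
        _, u, v = best
        in_tree.add(v)
        links.append((u, v))
    return links


# Four hypothetical sensor nodes; a fully connected topology would use 6 links.
positions = [(0, 0), (1, 0), (1, 1), (5, 5)]
links = minimum_spanning_tree(positions)
print(links)
```

The resulting topology keeps every node reachable with three links instead of six, so fewer radios need to stay active for relaying.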
Nodes' clustering can also contribute to energy saving via reducing the number of data collection and transmissions, such as fan-shaped clustering proposed in [241] for large-scale networks with energy efficient selection of a cluster head and relay node. Other clustering algorithms have existed in literature, such as the well-known Low-Energy Adaptive Clustering Hierarchy (LEACH) [269], which can be useful for big data wireless sensor networks. The work [270] adopted an energy-efficient architecture for Industrial Internet of Things (IIoT) with a sense entities domain, RESTful service hosted networks, a cloud server, and user applications to balance the traffic load and support a longer lifetime of the whole system. The paper [271] mainly studied two important issues: 1) performing cyber-physical systems (CPS) communications over cellular networks with ubiquitous coverage, global connectivity, reliability and security, 2) offloading a proportion of CPS traffic to small cells and freeing more network resources to other users.

Table V (recovered fragment): security solutions, their advantages and disadvantages.

Solution: Secure data placement on cloud. References: [196].
Advantages: i) ensures users' data protection on clouds; ii) does not require complex signatures and keys; iii) provides data privacy at users' side.
Disadvantages: i) data might need to be encrypted/decrypted before transferring to clouds; ii) can have a long data retrieval time; iii) not suitable for making real-time network decisions; iv) not very robust as attackers can guess data's locations; v) service providers can have access to users' data; vi) complicated data management and access.

Solution: Cryptographic solutions. References: [195], [197], [199], [202], [223], [215].
Advantages: i) ensures information protection, data privacy and integrity; ii) works in conjunction with other security solutions for more robust security.
Disadvantages: i) costly in terms of computational overhead; ii) can have a large power consumption; iii) increases latency; iv) inefficient for many CPS devices with limited computational capabilities; v) works only for fixed malicious behaviors.

Category: Data Access Permission Restrictions. Solution: Advanced security controls. References: [50], [204], [224], [225].
Advantages: i) selective access control; ii) can be used in conjunction with cryptography for a more robust security; iii) good for real-time data streams; iv) provides additional layer of protection.
Disadvantages: i) difficult to access for legitimate users; ii) needs to be combined with other solutions; iii) not sufficient by itself to protect users' data; iv) might increase latency from accessing the data.

Table (recovered fragment): green applications, challenges and solutions for big data in CPS; entries are listed in the extracted order.

Solutions (start of entry missing): [24], [56], [128], [145], [233]
- Minimize the number of relay transmissions [233]
- Data reduction techniques (aggregation, compression, network coding) [57], [58], [59], [60], [61]-[63], [184], [234], [235], [236], [237], [238]

Green application: Power Management
- Smart grids
- Green buildings
- Green infrastructure
- Renewable energy sources
Challenges:
- Redundant transmission links
- Deduplication
- Redundant information in data collection
Solutions:
- Remove redundant transmission links using energy-efficient topology control algorithm and clustering [125], [126], [127], [128], [129], [239], [240], [241]
- Employ data duplication checks [202]
- Employ compressive sensing-based data collection [242], [243]

Green application: Green Transportation
- Fuel management
- Electric vehicles
- Sustainable transport
- Transport planning
Challenges:
- Increased energy expenditure in cloud data centers
- Increased power consumption of networked devices
- Large volume of exchanged data between cloud centers
Solutions:
- Efficient utilization of resources [99], [184], [185]
- Integrate energy harvesting from renewable sources [244], [245]
- Sleep scheduling for unused data centers [30], [245]
- Employ energy-efficient hardware (DVFS) [29], [246], [247], [248]

Green application: Green Energy
- Energy harvesting
- Energy conservations
Challenges:
- Cloud data centers distant from users
- VM placement in cloud centers
- Increased energy consumption in case of hardware/software failures
Solutions:
- Design energy-efficient cloud data centers [30]
- Activity scheduling (switch devices to sleep mode) [245]
- Power control and adjustment of networked devices [247], [248]
- Design architectures for power saving [30], [246]
- Energy-efficient routing and traffic engineering techniques [56], [80], [249], [250], [251], [252], [253], [254], [255], [256]
- Employ advanced communication technologies (MIMO)
- Employ cloudlets [99], [100]
- Energy-efficient VM techniques (VM migration, VM placement, VM allocation) [108], [257]
- Reduce unnecessary executions using backups [258], [259], [260], [261]
2) Greening big data computing: While cloud centers are becoming an important aspect of big data computing to process data chunks in parallel, they contribute to a high energy expenditure, leading to increased costs and maintenance [272]. Therefore, research efforts are shifting towards creating sustainable big data computing techniques. Shojafar et al. proposed a job scheduler between servers called MMGreen to reduce the energy consumption of computing in cloud data centers [246]. The MMGreen architecture is composed of physical servers hosting VMs connected to a front-end component that manages the incoming workload. Job schedulers that use the dynamic voltage and frequency scaling (DVFS) technique for energy-efficient servers include static schedulers, whose power consumption is independent of clock rates and usage, and sequential schedulers, which attempt to minimize reconfiguration costs by performing offline resource provisioning based on predicted future workload information [29]. The paper [273] proposed 1) a thermal-aware and power-aware hybrid energy consumption model that jointly includes the computing, cooling, and migration energy consumption, 2) a tensor-based task allocation and frequency assignment model to reflect the relationship among different tasks, nodes, time slots, and frequencies, 3) a thermal-aware and DVFS-enabled big data task scheduling algorithm for reducing the energy consumption of data centers.
In [246], the authors described different techniques to reduce energy consumption such as DVFS to reduce VMs' frequencies and real-time adjustment of VMs frequencies processing and switching while maintaining quality of service to users. In [247], [248], M2M power savings were optimized via reducing the execution frequency of some activities without negatively impacting the human-to-human communications.
Besides targeting the energy efficiency of cloud servers, network devices also need energy efficient solutions, since they also contribute to the total energy expenditure of the cloud data center. Traffic engineering is one solution for this problem, which takes advantage of the traffic prediction to turn off network devices such as switches during idle periods in order to reduce power consumption [249]- [252]. Traffic engineering techniques, such as the software defined networking (SDN)-based traffic engineering [253], allow the network devices to dynamically adapt to current workload. One problem with traffic engineering techniques is that the predicted traffic pattern might not be accurate due to the variability of big data applications running in the data center. This makes the network configuration suffer from frequent oscillations, since the network configuration needs to be updated frequently leading to performance degradation [254]. To take into consideration this time-varying aggregate traffic load, one proposed solution is to include flow deadlines to measure the speed at which requests' responses are delivered to users. This allows the design of energy-efficient scheduling and routing for data center networks [254]- [256]. Another solution to enhance the inaccurate traffic engineering techniques is to take into consideration the unique features of data centers such as regularity of the topology, VM assignment, application characteristics. Such a framework was proposed in [257], where the energy-saving problem was solved in two steps: i) a VM assignment algorithm that integrates application characteristics and network topology to better understand traffic patterns for energy efficient routing, and ii) an algorithm that minimizes the number of switches and balances traffic flow among them. Experimental results showed a 50% energy savings using the proposed framework.
3) Green Processing: An energy-efficient orchestrator for smart grid applications was proposed in [30]. The green orchestrator coordinates sustainability between smart grids and big data enterprises, from green infrastructure (data centers) to running green frameworks such as Hadoop MapReduce. The orchestrator's main components are: i) a green lesser that establishes a per-job service-level agreement (SLA) taking into account the available power, the power consumption statistics of jobs, and the network and server states; ii) a pre-execution analyzer that executes jobs based on their power consumption statistics; iii) network and server state predictors; iv) a network traffic analyzer that helps eliminate redundant traffic using traffic engineering techniques; v) a VMizer that intelligently places VMs such that some nodes can be put to sleep; vi) a pizer that schedules and places processes on a subset of clusters such that system resources are efficiently utilized; and vii) a post-execution analyzer that analyzes the energy profile of completed jobs.
Another green big data processing architecture is checkpointing aided parallel execution (CAPE), which uses checkpoints that save the states of processes to avoid restarting executions from the beginning in case of hardware or software failures [258]-[260]. CAPE also allows the threads of a shared-memory program to be executed in parallel on a distributed-memory architecture rather than a shared-memory architecture. The CAPE architecture can lead to energy savings: if the execution period of the processors is short, they can go into idle mode rather than staying active for the whole execution of the program. This makes it beneficial for processing big data in CPS applications [260], [261].
Another approach to green big data processing is the efficient utilization of network resources by reducing the volume of communications that need to be exchanged in cloud data center networks. For instance, Asad et al. proposed spate coding to reduce the amount of exchanged data without compromising the rate of information exchange [234]. Spate coding incorporates both index coding and network coding, and uses side information originating from several processes sharing a physical node to encode packets. This coding technique was shown to reduce the volume of communications by 62%, along with other advantages such as improved utilization of system resources in terms of disk utilization, queue size, and the number of bits transmitted during the shuffle phase of Hadoop (200%) [234]. Other approaches to reduce the burden of information exchange on data centers include traffic flow prediction to reduce network transfers [184], [235], and redundancy elimination schemes to remove redundant data exchange, among others [236]-[238].
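The benefit of saving process state can be seen in a minimal checkpoint/restart sketch. This is a generic single-process illustration of the checkpointing idea only, not the CAPE architecture itself; the function name `run_with_checkpoints`, the JSON checkpoint format, and the simulated failure are assumptions made for the example.

```python
import json
import os
import tempfile


def run_with_checkpoints(n_steps, ckpt_path, fail_at=None):
    """Sum the integers 0..n_steps-1, saving the state after every step.

    Returns (total, number of steps executed in this call).
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(ckpt_path):
        # Resume from the last saved state instead of starting over.
        with open(ckpt_path) as f:
            state = json.load(f)
    executed = 0
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += step
        state["next_step"] = step + 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)
        executed += 1
    return state["total"], executed


ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
try:
    run_with_checkpoints(10, ckpt, fail_at=6)  # crashes before step 6
except RuntimeError:
    pass
total, executed = run_with_checkpoints(10, ckpt)  # resumes at step 6
print(total, executed)
```

In this toy run, the restarted job only re-executes the four steps that were not yet checkpointed, which is the kind of avoided work that translates into energy savings.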
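How side information can reduce the number of transmissions is easiest to see in the classic two-receiver XOR example of index coding. The sketch below is not the spate coding scheme of [234]; the packet contents and the helper `xor_bytes` are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length packets."""
    if len(a) != len(b):
        raise ValueError("packets must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Receiver 1 already holds packet_a and wants packet_b;
# receiver 2 already holds packet_b and wants packet_a.
packet_a = b"block-A1"
packet_b = b"block-B2"

# One coded broadcast replaces two separate transmissions.
coded = xor_bytes(packet_a, packet_b)

recovered_b = xor_bytes(coded, packet_a)  # decoding at receiver 1
recovered_a = xor_bytes(coded, packet_b)  # decoding at receiver 2

uncoded_bytes = len(packet_a) + len(packet_b)
coded_bytes = len(coded)
print(uncoded_bytes, coded_bytes)
```

Here the sender transmits half as many bytes because each receiver uses what it already stores to decode. The savings in a real shuffle phase depend on how much side information the co-located processes actually share.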

B. CPS-based Big Data toward Green Applications
CPS-based big data applications can contribute to greening different sectors by addressing environmental, economic, and social/technical issues [6], [31], [32]. As for the environment, efforts are being made to reduce air and water pollution as well as the impact of climate change. For instance, sensors can be deployed to monitor air and water quality. Using the MapReduce or Spark programming frameworks, the concentration of pollutants can be monitored and studied [274], the air quality in areas not covered by monitoring stations can be estimated [275], and the causalities of air pollutants can be identified using urban big data dynamics [276]. Pollution can also originate from oil spills. Predicting such catastrophes very early can help save beaches, coastlines and waters [6], [32]. Marine oil spills can be detected using a large archive of remote sensing big data [10], and a real-time warning can be generated from a quantitative data analysis using a supervised oil system [277]. Concerning water pollution, underwater sensors can be deployed to monitor the water environment, including water level, water flow, temperature, and pressure. The sensor network can be connected to a cloud platform via a wireless transceiver for analysis and visualization [278]. Noise pollution in cities is another contributing factor to environmental pollution, which can have a negative impact on health, especially with the increasing number of circulating vehicles [6], [32]. Noise pollution levels can be predicted by collecting four data sources: complaint data, social media, points of interest, and road network data [279]. These data can be gathered by deploying static municipal sensors, together with participatory sensing with smart phones, in order to provide more accurate noise maps [280].
As for green economics, CPS applications can be targeted at optimizing energy use. One example is FirstFuel, which monitors temperature and lighting inside buildings by checking the running status of equipment such as fans and heating and cooling units [6]. Power management can be realized using sensors that monitor the whole building, as well as smart meters for electricity consumption measurements. A solution for energy-efficient smart meters was proposed in [281], using a coalition game that maximizes the pay-off values of smart meters. An approach to predict daily electricity consumption inside buildings using data analysis was proposed in [282]. The authors used canonical variate analysis to group electricity consumption profiles into clusters in order to identify abnormal energy usage. Energy efficiency can also be pursued in transportation systems to reduce fuel wastage. One example is [283], where the authors used an electric vehicle (EV) battery model to estimate the driver's behavior and driving range in order to improve energy efficiency.
Social media and participatory sensing can also be used toward greener environments. For instance, in [284], the authors used social media to optimize smart grid management. In [285], participatory sensing was used to implement a navigation service called GreenGPS to allow drivers to obtain customized routes that are the most fuel-efficient.

IX. CPS BIG DATA APPLICATIONS
We now present some of the main CPS big data applications in different fields, such as energy utilization, city management, and disaster events, along with a public safety case study model. Fig. 6 shows different CPS applications and the big data they generate. For instance, an intelligent transportation system would generate big data consisting of drivers' behavior, passenger information, vehicles' locations, traffic signal management, accident reporting, automated fare calculations, and so on. Each of the CPS applications generates a large amount of data that needs to be stored, processed and analyzed in order to improve the performance of services and applications.

A. Smart Grids
Smart grids constitute an important aspect of sustainable energy utilization and are becoming more popular, especially with the advances in sensing and signal processing technologies. Automated smart decisions based on millions of data and control points play an important role in managing energy usage patterns, understanding users' behaviors, reducing the need to build power plants, and addressing supply fluctuations by using renewable resources [244]. The large number of embedded power generator sensors and their communications with different home sensors and appliances are expected to generate a large amount of data. The advanced sensing and control technologies used in smart grids are often limited to a small region such as a city; however, they are envisioned to be deployed on a much larger scale, such as a whole country. This will introduce several challenges, among which information management, processing and analysis are the main ones [292]. These big data tasks get even more complicated with the increasing number of transactions that need to be processed for millions of customers. For instance, one smart grid utility is expected to handle two million customers with 22 gigabytes of data per day [292]. That is why big data tools from cloud computing [244], mining and analytics [293], [294], performance optimization [295] and others have been applied to smart grid applications.
Moreover, to deliver a reliable power grid, smart grids depend heavily on cyber infrastructure. This poses several challenges, such as exposing the physical operations of smart grid systems to cyber security attacks [296]. Furthermore, the collection of users' energy usage information, such as the types of appliances they use, their eating/sleeping patterns, and so on, can be very beneficial in optimizing smart grids' performance; however, users' privacy can also be affected. In [297], Yassine et al. proposed a game theoretic mechanism to balance users' private information against the beneficial uses of data. In [298], a big data architecture for smart grids based on random matrix theory was proposed to conduct high dimensional analysis, identify data correlations, and manage data and energy flows among utilities. The proposed architecture not only allows large-scale big data analysis but can also be used as an anomaly detection tool to detect security flaws in smart grids.
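To put the figure quoted from [292] in perspective, the following back-of-the-envelope sketch converts it into a per-customer and a per-year volume. It assumes decimal gigabytes (10^9 bytes) and a constant daily volume, neither of which is stated in the source.

```python
customers = 2_000_000
bytes_per_day = 22 * 10**9  # assumption: decimal gigabytes

# Average daily volume attributable to one customer, in bytes.
per_customer_per_day = bytes_per_day / customers

# Volume accumulated by the utility over one year, in terabytes.
per_year_tb = bytes_per_day * 365 / 10**12

print(per_customer_per_day, per_year_tb)
```

Under these assumptions each customer accounts for about 11 kB per day, while the utility accumulates roughly 8 TB per year. The per-customer load is small; it is the aggregate storage and analysis that becomes the big data problem.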

B. Military Applications
Big data can also be exploited to improve military experiences, services, and training. Real-time authentication of command and control messages in cyber-physical infrastructures is of high importance for military services to ensure security. In [302], the authors developed a novel broadcast authentication scheme using special digital signatures for faster signature generation and verification, and packet loss tolerance. This can be useful to efficiently and rapidly secure military communications. In [303], the authors used a Markov decision process to propose an approach that identifies and reduces the cost of attacks in military operations, protecting important information by obtaining attack policies. Military satellite communications must be resilient to ensure the success of missions. This can be achieved using a matrix-based protection assessment approach based on traditional risk analysis, where an attack can be assessed in terms of both ease of attack and impact of attack [304]. Mitigating the following five core threats frees satellite communications from vulnerabilities that can be easily exploited by attackers: waveform, RF access to enemy, foreign presence, physical access, and traffic concentration [304].

C. City Management
Big data can facilitate daily activities through smart infrastructures and services. With the increasing number of sensors deployed in urban environments, either indoor or outdoor, in smart phones, smart cards, on-board vehicle sensors and so on, the city is faced with a large amount of information that needs to be exploited to detect different urban dynamic patterns [305]. For instance, traffic patterns can be analyzed and routes can be computed to allow people to reach their destinations faster. In [306], the authors proposed deploying road sensors to obtain information on the overall traffic, such as the speed and location of individual vehicles. This information is then processed using graph algorithms by taking advantage of big data tools such as Giraph, Spark and Hadoop. This helps provide real-time intelligent decisions for smart and efficient transportation. Since wireless sensor networks (WSN) are the main infrastructures for smart cities to monitor and gather information from the environment, several works have focused on extending the network's lifetime. For instance, in [307], the authors proposed using a software-defined networking (SDN) controller to reduce WSN traffic and improve decision making. In [243], compressive sensing and a consensus filter were used to reconstruct signals from fewer sensor nodes, leading to power savings. Sleep scheduling with renewable energy resources such as solar harvesting was also suggested in [?] to prolong WSN lifetime. CPS-based big data may also support localization applications [190], [191].
An IoT-based general architecture for smart cities was presented in [305], which consists of four different layers: 1) a technologies layer consisting of self-configured and remotely controlled sensors and actuators, 2) a middleware layer where data from different sensors are collected to provide context information, 3) a management layer where different data analytic tools are used to extract information, test hypotheses and draw conclusions, and 4) a services layer consisting of services provided by smart cities based on the previous layers, such as environmental monitoring, energy efficiency in buildings, intelligent transportation systems and so on.
Safety systems, such as surveillance systems, are another city management big data application. A computer vision deep learning algorithm for human activity recognition was proposed in [179]. The model is capable of recognizing twelve types of human activities with high accuracy and without the need for prior knowledge, which makes it useful for security monitoring applications. Crowd detection and surveillance is another safety-related big data application, in which a target individual needs to be easily inferred from visual information. In [231], the authors proposed such a framework, which can easily detect the target location and update the motion information to improve the detection.
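As a minimal illustration of how routes can be computed from sensed traffic data, the sketch below runs Dijkstra's algorithm on a small road graph whose edge weights are travel times. It is a single-machine example, not the Giraph/Spark/Hadoop pipeline of [306]; the road network `roads`, its travel times, and the function name `fastest_route` are hypothetical.

```python
import heapq


def fastest_route(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph of travel times."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, minutes in graph.get(node, {}).items():
            nd = d + minutes
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical travel times (minutes) estimated from road sensors.
roads = {
    "A": {"B": 4.0, "C": 2.0},
    "B": {"D": 5.0},
    "C": {"B": 1.0, "D": 8.0},
    "D": {},
}
route, minutes = fastest_route(roads, "A", "D")
print(route, minutes)
```

In a deployment, the edge weights would be refreshed from the sensed speeds, and the graph would typically be too large for one machine, which is where distributed graph-processing tools come in.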

D. Medical Applications
CPS health systems are foreseen to shape the future of tele-medicine in different areas such as cardiology, surgery, and patients' health monitoring. They will significantly enhance the healthcare system by providing timely, efficient and effective medical decisions for a myriad of health applications such as diabetes management, blood pressure and heart rhythm monitoring, elderly support and so on [308]. With 774 million connected health-related devices expected by 2020 [309], a large volume of data from small-scale networks, such as e-health systems or mobile-health systems, needs to be stored, processed and analyzed to enable timely intervention and better management of patients' health.
With e-health systems becoming widely deployed in hospitals and health centers, research has focused on efficiently deploying medical body area networks (MBANs) to reduce interference on medical bands from other devices [310]. In MBANs, biomedical sensors are placed on or near the patient's body, or even inside it, to sense health-related vital signals using short-range wireless technologies. The collected data are then relayed over multiple hops to remote stations, so that medical staff can efficiently monitor patients' physiological conditions and disease progression [310]. For instance, in [311], a health tracking application for elderly patients was proposed, where a mixed positioning algorithm allows for 24-hour monitoring of patients' activities and transmits an alarm to medical staff through SMS, e-mail or telephone in case of an abnormal event or emergency. However, transmitting this health information in a timely and energy-efficient manner is of utmost importance for e-health systems. In [312], a micro Subscription Management System (µSMS) middleware for e-health systems was presented. The µSMS platform allows sensor nodes to exchange information to provide event-driven services with dynamic memory and variable payload, such as GPS coordinates, Home Context and so on. The designed architecture achieves lower memory overhead, lower software component load time and lower event propagation time than other similar proposals, all of which are critical requirements for the energy efficiency, reliability and scalability of e-health systems. The paper [313] proposed locality sensitive hashing to learn sensor patterns for monitoring the health conditions of dispersed users.
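To give a feel for locality sensitive hashing, the sketch below implements the common random-hyperplane variant, in which readings pointing in similar directions tend to receive the same bit signature. Whether [313] uses this particular hash family is not stated here, so this is an assumption; the feature vector (heart rate, systolic pressure, temperature) and the function names are hypothetical as well.

```python
import random


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random Gaussian hyperplanes (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


planes = make_hyperplanes(n_planes=8, dim=3)

# Hypothetical readings: heart rate, systolic pressure, temperature.
reading = [72.0, 120.0, 36.6]
same_direction = [2.0 * x for x in reading]

sig_a = lsh_signature(reading, planes)
sig_b = lsh_signature(same_direction, planes)
print(sig_a, sig_b)
```

Vectors with the same direction always share a signature under this hash family, and the chance of a mismatch grows with the angle between two vectors. Readings can therefore be bucketed by signature and only compared in detail within a bucket.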

E. Disaster Events Applications
Network resilience and survivability are the utmost requirements for public safety networks. In case of a disaster or emergency event, the people who are first on scene are referred to as first responders, and they include law enforcement, firefighters, medical personnel and others [314]. Some of the major public safety requirements relate to the need for first responders to exchange information (voice and/or data) in a timely manner [314]. Big data can be used to support disaster response, for example by analyzing high resolution maps, floor plans and on-field video transmissions to transmit warning messages to authorities [315]. Remote sensing big data can be analyzed using a scalable hybrid parallelism approach to reduce the analytics execution time [316]. The large amount of data collected from previous earthquakes can be used to predict future service availability areas, which can improve preparation for and response to such events [317]. A disaster domain-specific search engine can be constructed using big data to make understanding of and preparation for disasters easier and faster for authorities [318]. Fig. 7 depicts a public safety (PS) model, where the base station is unavailable due to network failure, indoor or underground communications, or cell-edge locations. This necessitates that users form a device-to-device (D2D) network autonomously, without the assistance of any network infrastructure. In this model, each D2D cluster sends its PS-related data findings, such as maps, videos, pictures and so on, to a command center for data analysis. This can be done through satellites or by raising a balloon with an attached 4G eNodeB (AeNB) to temporarily restore connectivity (Project ABSOLUTE [319]). LTE is suggested as a technology enabler for PS communications [320]. LTE provides high-rate and very low latency IP connectivity, and can complement existing Private Mobile Radio (PMR) networks like TETRA/TETRAPOL/Project25.
In terms of end-to-end latency requirement, messages are expected to be delivered within 5-10 milliseconds (ms) with a reliability of 99.99% of packets successfully delivered within a time window [321], [322]. Moreover, the downlink peak data rates are expected to be over 50 Mbps with an uplink data rate of over 25 Mbps [323]. The command center should be able to analyze different data types and then transmit timely, useful and accurate dynamic messages to each D2D cluster head. The command center can transmit information such as assessment of the level of assistance needed, emergency notification messages, emergency service providers coordination, and so on. Each D2D cluster head in turn multicasts this information to its members for decision making. This poses a challenge: some users might be far away from the multicast transmitter or their link quality suffers, which would decrease the coverage probability of the receiver and might require several retransmissions to send the message successfully. Obviously, there is a trade-off between throughput (or end-to-end delay) and reliability due to the random transmission errors caused by the unpredictable behavior of the wireless channel. It is evident that we need to minimize delay, while maintaining some good degree of reliability. In what follows, we briefly derive the end-to-end delay and reliability for public safety networks.
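The trade-off between retransmissions and delay noted above can be made concrete with a short sketch. It assumes that transmission attempts succeed independently with the same probability, which real wireless channels need not satisfy; the per-attempt success probability of 0.8 and the helper name `attempts_needed` are hypothetical.

```python
def attempts_needed(p_success, target_reliability):
    """Smallest number of attempts k with 1 - (1 - p)^k >= target."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    k = 1
    while 1.0 - (1.0 - p_success) ** k < target_reliability:
        k += 1
    return k


p_success = 0.8          # assumption: per-attempt success probability
target = 0.9999          # reliability requirement quoted in the text
latency_budget_ms = 10.0  # upper end of the 5-10 ms requirement

k = attempts_needed(p_success, target)
per_attempt_budget_ms = latency_budget_ms / k
print(k, per_attempt_budget_ms)
```

Under these assumptions, six attempts are needed to reach 99.99% reliability, leaving less than 2 ms per attempt within a 10 ms budget. Every additional retransmission that a poor link requires therefore eats directly into the end-to-end delay.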

F. A Public Safety Case Study Model
Let $\delta_i$ be the indicator variable defined as: The set of nodes that have received successfully in the $j$th round of multicasting can be defined as: In order to minimize the total delay, $D$, for distributing the data to all $L - 1$ nodes ($L$ is the total number of nodes), we need to select node $k_j^{\star}$ to multicast in the $j$th round of multicasting such that: where $R_{i,l}$ is the transmission rate between nodes $i$ and $l$.
Define the set of unvisited nodes as $\Pi_u$ and the set of visited nodes as $\Pi_v$. Initially, we have $|\Pi_v| = 1$, $|\Pi_u| = L - 1$, and $D = 0$. In the $j$th round of multicasting by node $k_j^{\star}$, $D$ gets updated by adding to it the time of the last node receiving the packet successfully as: where $N_j = \{ i \in \Pi_u \mid \delta_i = 0 \}$, and $t_i$ is the minimum access time of node $i$. The multicasting rounds continue until either i) all unvisited nodes become visited, i.e., $\Pi_u = \emptyset$, or ii) $D$ exceeds a threshold $\theta_D$, in which case the packet is dropped. Reliability can be defined as the ability of the network to perform its required functions for a certain period of time without service interruption. A link breakage can occur due to the following reasons:
• The receiver is outside of the transmission range of the transmitter.
• The packet is lost due to channel errors.
• The packet is lost due to the total delay $D$ exceeding the threshold $\theta_D$.
Assuming users are not moving, as in [324], each D2D receiver is randomly and independently located around its corresponding transmitter with isotropic direction and Rayleigh distributed distance with probability density function (PDF):
$$f_D(r) = 2\pi\lambda r\, e^{-\pi\lambda r^2}, \quad r \geq 0.$$
The probability that a user is outside the transmission range $R$ of the transmitter can be calculated as
$$p_R = 1 - \int_0^R f_D(r)\,dr = e^{-\pi\lambda R^2}.$$
Considering log-normal shadow fading, the signal fades: (1) deterministically with a path-loss exponent $\alpha$, and (2) stochastically, represented by a random variable with zero mean and variance $\sigma$. Therefore, the probability $p_e$ that a packet is not lost due to channel errors is given in [325]: where $\beta_{th}$ is the minimum threshold required to deliver a packet between nodes. Finally, we combine what we have discussed to get the link reliability as: where $p_{drop} = P(D > \theta_D)$.
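The closed-form out-of-range probability can be checked numerically. The sketch below assumes the Rayleigh distance PDF takes the form $f_D(r) = 2\pi\lambda r e^{-\pi\lambda r^2}$, which is the form consistent with the stated result $p_R = e^{-\pi\lambda R^2}$, and treats $\lambda$ as a density parameter; the numerical values of $\lambda$ and $R$ are hypothetical.

```python
import math


def rayleigh_pdf(r, lam):
    """Assumed distance PDF: 2*pi*lam*r*exp(-pi*lam*r^2)."""
    return 2.0 * math.pi * lam * r * math.exp(-math.pi * lam * r * r)


def prob_outside_closed_form(R, lam):
    """p_R = exp(-pi*lam*R^2), as stated in the text."""
    return math.exp(-math.pi * lam * R * R)


def prob_outside_numeric(R, lam, steps=10_000):
    """p_R = 1 - integral of the PDF over [0, R], using the midpoint rule."""
    h = R / steps
    inside = sum(rayleigh_pdf((k + 0.5) * h, lam) for k in range(steps)) * h
    return 1.0 - inside


lam = 1e-4  # hypothetical density parameter (per square metre)
R = 50.0    # hypothetical transmission range (metres)

closed = prob_outside_closed_form(R, lam)
numeric = prob_outside_numeric(R, lam)
print(closed, numeric)
```

With these placeholder values, roughly 46% of receivers would fall outside the transmission range, and the numerical integral agrees with the closed form. Increasing either $\lambda$ or $R$ drives $p_R$ towards zero.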

X. BIG DATA CHALLENGES AND OPEN ISSUES FOR CPS
While ongoing research is focusing on CPS enterprise development and applications, effective solutions to combat security flaws have not received enough attention, which raises questions about the foreseen integration of CPS in critical infrastructures. This matter is made worse by CPS devices facing i) computational challenges in ensuring data confidentiality and privacy protection; ii) semi- or fully autonomous security management [326]; and iii) the high computational costs of employing cryptography [327]. Low-complexity and lightweight ciphers such as PRESENT [328] have been developed; however, research efforts should go beyond cryptography, especially since such solutions can be costly in terms of i) latency, ii) power consumption, and iii) key management complexity [329], [330]. Furthermore, different CPS applications might have different security perimeters with a multitude of interactions among devices. This further complicates access control decisions, the trustworthiness of entities, and their authentication management. Therefore, there is a strong need to define secure interoperation protocols and strategies for dynamic CPS, where devices from different applications interact to complete specific tasks [331].
Privacy preservation is an important aspect of big data analytics and mining. Homomorphic encryption techniques, as mentioned in Section VII-B, have been suggested to perform analysis on encrypted data in the cloud. However, they have turned out to be impractical and inefficient in terms of computational time and overhead for large datasets. Although anonymization techniques constitute a more efficient solution for privacy-preserving big data analytics, further research efforts are needed to determine their applicability to different big data types, and to evaluate their performance and endurance under the different mining techniques outlined in Section VI-A. Moreover, future solutions should be based on the hierarchy property of many CPS big data applications [222]. This means that future research should be directed towards proposing integrated privacy-preserving protocols in the modules of the CPS hierarchical structure that can perform analytical processes on a variety of encrypted big data types in a time- and cost-efficient manner.
Correctness of CPS is another research area in the spotlight. Due to the dynamic nature of the physical environment, CPS need to constantly adapt to new situations while operating and functioning properly with little or no human supervision [332]. The use of models can allow the early detection of failures by simulating different components of complex designs to verify the integration of the whole system [333], [334]. Verifying that CPS components are working properly or that they meet execution time requirements is another approach to ensure correctness [335]-[337]. Future research should focus on creating robust real-time anomaly detection and correctness techniques that have low overhead and implementation costs.
With CPS being employed on a large scale, data access, routing, transmission, and processing might consume a significant amount of time, creating real obstacles to realizing ultra-reliable and low latency communications (URLLC), a key feature of emerging 5G technologies. In URLLC, stringent requirements are placed on latency, throughput, and availability. Future research should be directed towards faster data access from the cloud with fewer data retransmissions in order to free up resources and reduce latency. Obviously, there is a trade-off between throughput (or end-to-end delay) and reliability due to transmission errors that might be caused by link failures, security policies, cache misses, scarcity of resources and so on. For instance, cross-layer design protocols between different CPS layers (sensing, middleware, transmission, management and services layers) can be used, in a similar way to wireless communication protocols, to help optimize different parameters from latency to reliability and security. An example would be the transmission layer communicating with the sensing layer about integrating data filtering aimed at prioritizing the sensed data in the case of too many retransmissions.
For faster data collection, research efforts need to shift towards devices that connect through dynamic on-the-fly D2D links without being preconfigured to a network, and without the need for controllers or infrastructure deployment. Furthermore, to extend the communication range among devices, devices incentivized with tokens will replace dedicated relays [338]. In terms of CPS computing, research directions should shift towards faster data processing by moving data processing closer to the sources and speeding up big data handling. For faster and more efficient CPS analysis, further research is needed on information fusion techniques, especially since information originates from heterogeneous devices with varying capabilities. By perfecting the fusion algorithms, related information can be aggregated for analysis, leading to higher quality information [339].
As mentioned in this survey, big data analytics is one of the most important aspects of CPS, as it helps pave the way for new opportunities and services, in addition to enhancing or even optimizing systems' performance. However, as discussed in Section III, the sensed data can originate from a variety of sources with unstructured formats. Merging these different data types into single homogeneous structures is vital to ease and speed up the analysis process, so that meaningful conclusions can be drawn and automated decisions can be made [340]. How to homogenize different data sources remains an open research issue for CPS. Moreover, different CPS applications might need to collaborate to achieve a specific mission. For instance, the mobile-health application might need to collaborate with the transportation system to get an ambulance as fast as possible when a patient's biomedical sensors reveal an emergency situation. In such scenarios, data analytics becomes a challenging task, as it requires combining the analysis fragments from different CPS applications to provide broader conclusions and decision making.

XI. CONCLUSION
The emerging CPS technologies and the technological advancements in big data processing and analysis mutually benefit each other. When combined with artificial intelligence, machine learning, and neuromorphic computing techniques, CPS will bring about new applications, services and opportunities, all envisioned to be automated with little or no human intervention. This will help revolutionize the "smart planet" concept, where smarter water management, smarter healthcare, smarter transportation, smarter energy, and smarter food will create a radical shift in our lives. In this survey paper, we have provided a broad overview of CPS big data collection, storage, access, caching, routing, processing, and analysis to support understanding and discovering the challenges facing CPS, the existing proposed solutions, and the open issues that are yet to be addressed. We have then discussed the security vulnerabilities of CPS and the different security solutions, as well as how big data meet green challenges for CPS to address sustainability and environmental concerns.

He has been working for InterDigital Communications, Inc. as a 3GPP RAN1 standards delegate for new radios (5G) since 2015. Chunxuan Ye has more than 30 book chapters, conference and journal papers. He also has more than 60 U.S. and international patent applications with 25 U.S. patents granted. His specialties include wireless communications, wireless standards and technologies, information theory, software and firmware development.

Yang Yi (M'09-SM'16) is an assistant professor in the Bradley Department of Electrical and Computer Engineering at Virginia Tech. She is an IEEE senior member. She received the B.S. and M.S. degrees in electronic engineering from Shanghai Jiao Tong University, and the Ph.D. degree in electrical and computer engineering from Texas A&M University. Her research interests include very large scale integrated (VLSI) circuits and systems, computer-aided design (CAD), neuromorphic architecture for brain-inspired computing systems, and low-power circuit design with advanced nano-technologies for high-speed wireless systems.