Hadoop-Based Distributed Sensor Node Management System

We present a new architecture of HDSM (Hadoop-based distributed sensor node management system) for distributed sensor node management using Hadoop mapreduce framework and distributed file system (DFS). It offers various efficient ways for collecting sensor data and managing multiple sensor nodes by launching specific mapreduce applications on sensor nodes which upload data of sensor nodes to DFS and retrieve sensor data periodically from DFS. Additionally, it provides a flexible management scheme for sensor node by reconfiguring firmware or updating configurations and data formats of sensor nodes based on mapreduce framework. Also, it achieves a transparent and fault-tolerant architecture by exploiting various crucial features of Hadoop. Our experimental results show that it has efficient and stable performance.


Introduction
In ubiquitous computing environment with wired/wireless sensor networks, efficient sensor data collection, monitoring, and management are very significant for various useful data analyses [1].Various sensor nodes such as CCTVs, temperature sensors, or Wi-Fi hotspots can be deployed with additional processor and storage devices for saving sensing data locally and connected to each other through networks such as Ethernet.These sensor networks need sophisticated approaches to achieve massive sensor data collection and cost-effective management of distributed sensor nodes.
Mapreduce is a useful framework in a wide range of applications including distributed parallel searching, indexing, clustering, and sorting for various data types and replaces old ad hoc software that performs the various analyses for large data sets [2].Moreover, mapreduce programming model offers an easier way to build distributed parallel application running on multiple computing nodes for processing big data.Therefore, it has been adapted to several computing environments such as multiprocessor systems and cloud environments.So far, there exist various implementations for mapreduce algorithm and its execution framework such as Hadoop.They may be integrated with several distributed file systems (DFS) and offer overall management of the whole process for launching mapreduce application, integrating multiple servers, and managing communications and data transfers between components.They provide redundant and fault-tolerant architecture via their distributed architecture, data replication, and heartbeat mechanism.
However, the previous approaches are focused on accumulating log data of applications generated from other servers connected through wired network onto repository based on DFS and processing them by using mapreduce.In our research, we regard sensor network as a large scale mapreduce cluster and present a new sensor node management platform, called HDSM (Hadoop-based distributed sensor node management system) for distributed sensor node management by exploiting mapreduce framework and DFS and launching various mapreduce applications for executing sensor data uploading, retrieving, flushing, monitoring, and updating configurations.It offers various efficient ways for collecting sensor data and managing multiple sensor nodes by launching specific mapreduce applications on sensor nodes which upload data of sensor nodes to DFS and retrieve sensor data periodically from DFS.Additionally, it provides a flexible management scheme for sensor node by reconfiguring a firmware or updating configurations and data formats of sensor nodes based on mapreduce framework.Also, it achieves a transparent and fault-tolerant architecture by exploiting various crucial features of Hadoop.Our experimental results show that it has efficient and stable performance.
The rest of the paper is organized as follows.First, in Section 2, we describe related works.Then, in Section 3, we describe an architecture of HDSM.Section 4 shows its experimental results, and Section 5 summarizes the conclusions of our research.

Related Works
Mapreduce application is comprised of two stages: mapper for processing fine-grained unit tasks such as filtering, counting, or sorting and reducer for summarizing operations [2].There are various implementations for mapreduce algorithm and its execution framework such as Google mapreduce [3], Apache Hadoop [4], and Twister [5], and inputs and outputs of mapreduce are usually stored in a distributed file system such as GFS [6] and HDFS [4].They offer orchestration of the whole management processes for launching mapreduce application, monitoring multiple nodes, data transferring, and managing communications between components.
Apache Hadoop is an open-source implementation of mapreduce framework that supports the running of applications on large clusters of commodity hardware and distributed file system named Hadoop distributed file system (HDFS) that stores large data on the computing nodes [2,4].It is written in the Java programming language.A Hadoop cluster includes a single master node and multiple slave nodes as described in Figure 1.For HDFS, Namenode is a master component which manages file locations and metadata of all files on HDFS.Namenode checks liveness of all slave nodes by heartbeat mechanism based on periodical notification with simple messages and offers command line interface and API layer for clients.Real data files on HDFS are divided into multiple chunks and managed by slave component named Datanode.It stores chunks into local disk of each worker node.Mapreduce engine for executing mapreduce application is comprised of a single Jobtracker on master node as a job coordinator process and multiple Tasktrackers on slave nodes as worker demons.After receiving job request for launching a mapreduce application, Jobtracker schedules and allocates map and reduces tasks to multiple Tasktrackers with awareness of the data location and makes them executed in parallel way.Mostly results of mapreduce jobs are saved into the HDFS.Both mapreduce engine and the HDFS are designed so that node failures are automatically handled by the framework.They provide redundant and fault-tolerant architecture via their expandable architecture and unique monitoring scheme named heartbeat mechanism.Therefore, their architecture is useful to construct distributed sensor node management system.
There exist a number of researches for processing collected sensor data using distributed parallel computing [7][8][9][10][11][12][13][14].Also, there are various approaches for constructing large data collection systems on grid and cloud environments [15][16][17].In particular, there are some approaches which are integrated with Hadoop mapreduce framework.Apache Flume [18] is a distributed service for efficiently collecting, aggregating, and transferring large amounts of log data from many different sources to HDFS.It offers a flexible architecture based on streaming data flows and uses a simple extensible data model that allows for online analytic application.Honu [19] is a large scale streaming data collection and processing pipeline built using Hadoop.It collects and processes all the unstructured and structured log events generated from various applications running on Amazon EC2 [20] or EMR [21] and makes them available to user through Hive [22].
However, these approaches are focused on accumulating log data of applications generated from other application servers connected through wired network onto repository based on DFS and processing them by using mapreduce.In our research, we regard sensor network as a large scale mapreduce cluster and present a sensor node management platform by exploiting mapreduce framework and DFS and launching various mapreduce applications for executing sensor data uploading, retrieving, flushing, monitoring, and updating configurations.

Distributed Sensor Node Management System Based on Mapreduce
In this section, we present an architecture of HDSM (Hadoop-based distributed sensor node management system) for managing distributed sensor nodes based on Hadoop mapreduce and distributed file system.
(1) HDSM offers an easier way to build distributed sensor node management system.HDSM uses basic architecture and components of Hadoop framework without correcting its functions or developing specific component for manipulating data transfers or communications.By integrating multiple sensor nodes with limited hardware resources via network and Hadoop technology, HDSM builds a huge logical sensor node cluster on distributed sensor network easily.
(2) HDSM offers automated management of the whole process for integrating multiple sensor nodes, launching mapreduce applications, and managing communications and data transfers between DFS and other components via Hadoop mapreduce framework.So it can save costs for building and maintaining management system for distributed sensor nodes.
(3) HDSM can focus on analyzing collected sensing data without being concerned about fault failure, since Hadoop has internal mechanism for handling system failure.
(4) HDSM can easily update software of sensor nodes such as sensor device drivers, firmware, data format, and node configurations by launching exclusive mapreduce application without much effort for manipulating each of sensor nodes in complex progress.a single map task is allocated on each sensor node, respectively, by assigning a number of map tasks as number of total sensor nodes, since Hadoop scheduler tends to allocate tasks to idle worker nodes which can afford to be assigned new tasks [2].Therefore, all map tasks process their role independently on sensor nodes in parallel.Figure 2 shows an overall architecture of HDSM.The overall operations of HDSM are described below as shown Figure 3.
(1) NCM requests Hadoop master to launch "Uploader" mapreduce application to make all sensor nodes upload their local sensing data to HDFS.Uploader application has only mapper class without reducer and creates a number of map tasks identical to the number of sensor nodes.Each map task is allocated and executed on each node and copies its local sensing data to HDFS with distinguishable file naming.Each local sensing data is divided into a single or multiple data chunk and saved by Datanode in HDFS.In this way, each sensor node uploads its local data to its Datanode, and location of all data is identified by Namenode.After uploading to HDFS, sensing data can be ready for retrieval by Hadoop master and its clients.
(2) NCM requests Hadoop master to retrieve uploaded sensing data from HDFS and store them to local disk, other repository servers, or data consumer services.All accumulated sensing data is read from multiple Datanode at a time periodically by reading multiple chunks from multiple Datanodes using Hadoop command.
(3) NCM requests Hadoop master for flushing HDFS storage to retain available disk space of sensor nodes due to the limited storage size of sensor node.
(4) NCM requests Hadoop master to update advanced sensor device driver or sensor node configurations and launch "Updater" mapreduce application.
Updater application has no reduce stage, and the number of map tasks is identical to that of sensor nodes.NCM provides multiple commands lines to map task, which in turn executes them as child tasks such as updating sensor driver or firmware codes to newer version, changing data formats, editing sensor node configurations, or even executing commands such as reboot, calibrating, and initializing of sensor devices.
HDSM performs the above operations periodically according to the NCM configuration.Additionally, it offers a flexible way for users to add more operations by developing their corresponding mapreduce applications such as simple data analyzers or data converters.If computing power of sensor nodes is not enough, user can execute data migration to other repositories for back up, or data processing systems for high performance data analysis.

Experiments and Results
We have implemented a prototype of HDSM and several mapreduce applications for key operations using Java programming language and Hadoop API [4].HDSM is deployed on several sensor nodes, each of which has wired webcam, Core2Duo 2.33 GHz processor, 2 GB memory, and 320 GB disk.Each sensor node which imitates DVR (digital video recorder) for CCTV records video data (sensing data) captured by camera and saves it to its local disk storage periodically.Video files have various sizes from 50 MB to 800 MB depending on recording time.Unique IP address is assigned to each sensor node, and all sensor nodes are connected to HDSM manager via gigabit Ethernet.
We evaluate the I/O performance of HDSM for collecting and retrieving sensor data by measuring its processing time and throughput in Figure 4.It shows the average time of uploading sensing data with various sizes to HDFS on several sensor nodes.Block size of data chunk is 64 MB, and there is no replication in this experiment.Increase of uploading time depends on the number of sensor nodes, since amount of collectable sensor data in a period is proportional to the number of sensor nodes.However, average throughputs are relatively stabilized, regardless of number of sensor nodes or data size as shown in Figure 5. Mean of throughputs in all cases is 26.230MB/sec, and standard deviation is 3.80.Also, we measure the performance of uploading operation with various replication levels in Figure 6.It shows average throughput of uploading data on 8 nodes with various replication levels.The throughput of uploading data is decreased with increasing replication level, since, although data replication of Hadoop is performed to provide fault tolerance and availability, it may cause more data traffic on network and lower I/O performance of HDFS.
We evaluate retrieving performance of sensor data accumulated on HDFS in Figures 7 and 8. Block size of data chunk is 64 MB, and there is no replication in this experiment.Figure 7 shows that average retrieving time decreases as the number of sensor nodes increases.Figure 8 shows that the average throughput of retrieving data is higher than that of uploading operation and related to the amount of data size.Also, it shows that mean of throughput in all cases is 37.79 MB/sec, but throughput decreases with increasing data size.
In real environment, sensor node may have lower performance than the one in our experimental environment.So, we evaluate uploading time on low performance sensor nodes by using virtual machines to compare with high performance ones additionally as shown in Figure 9. Virtual machine has single core 2.13 GHz processor, 512 MB memory, and 10 GB disk and simulates cheaper low performance DVR  devices lower than $200 dollars.Experimental conditions are identical to the ones in Figure 5. Figure 9 shows that the mean of throughputs is 22.804 MB/sec on low performance nodes, while 27.362 MB/sec on high performance nodes with 16.65% performance degradation.

Conclusion
In this paper, we have presented HDSM, the Hadoopbased distributed sensor node management system for distributed sensor node management scheme, by exploiting Hadoop mapreduce framework and distributed file system and launching various mapreduce applications for executing sensor data uploading, retrieving, flushing, monitoring, and updating configurations.It offers various efficient ways for collecting sensor data and managing multiple sensor nodes by launching specific mapreduce applications on sensor nodes which upload data of sensor nodes to DFS and retrieve sensor data periodically from DFS.Additionally, it provides a flexible management scheme for sensor node by reconfiguring a firmware or updating configurations and data formats of sensor nodes based on mapreduce framework.Also, it achieves a transparent and fault-tolerant architecture by exploiting various crucial features of Hadoop.Our experimental results show that it has efficient and stable performance.As a future work, we are planning more researches for additional improvements and functionality with fault tolerance and high throughput.

Figure 1 :
Figure 1: Architecture of Hadoop deployed on multiple nodes; HDFS and mapreduce and interaction between components.

Figure 5 :
Figure 5: Average throughput of uploading sensing data with respect to the number of sensor nodes with various data sizes.

Figure 6 :Figure 7 :
Figure 6: Average throughput of uploading data on 8 nodes with respect to different data size with various replication levels.

Figure 9 :
Figure 9: Average throughput of uploading sensor data with respect to number of sensor nodes and various data sizes on two types of node with low and high performance.
3.2.Architecture.HDSM consists of node control manager (NCM), Hadoop master, and slave.NCM and Hadoop master function as a HDSM manager which controls the overall operations of HDSM.A sensor network is considered as a large scale mapreduce cluster.Hadoop slave is deployed on each of sensor nodes.It consists of Tasktracker and Datanode and interacts with the Hadoop master which is composed of Jobtracker and Namenode.NCM manages and controls Hadoop master by using Hadoop commands and mediates interaction between HDFS and external systems such as repository or data consumer services.Retrieving accumulated sensing data in HDFS and flushing HDFS are done by executing Hadoop commands.Particularly, it offers effective way for retrieving of big data accumulated on sensor node cluster since data traffic is spread across all sensor nodes (Datanodes) in reading operation of HDFS.
NCM requests Hadoop master to execute various operations including uploading sensing data to HDFS, flushing HDFS, and patching sensor node firmware or configurations by launching proper mapreduce applications.In each operation,