Forensic Investigation through Data Remnants on Hadoop Big Data Storage System

Forensic examiners are in a continuous battle with criminals who exploit Big Data technology. The underlying storage system is the main scene in which to trace criminal activities, and the Big Data Storage System has been identified as an emerging challenge to digital forensics. A sound methodology for investigating Big Data Storage Systems is therefore required. Because the use of Hadoop as a Big Data Storage System continues to grow rapidly, an investigation process model for forensic analysis of Hadoop storage and attached client devices is needed. Moreover, forensic analysis of a Hadoop Big Data Storage System may take additional time if the examiner does not know where data remnants can reside. In this paper, a new forensic investigation process model for Hadoop Big Data Storage System is proposed and the discovered data remnants are presented. The data remnants resulting from this forensic research on Hadoop Big Data Storage System assist forensic examiners and practitioners in generating evidence.


INTRODUCTION
The current era has witnessed massive increases in data due to growing human dependency on computers and automated systems. This huge volume of generated data is known as "Big Data": large datasets characterized by the velocity, volume, and variety of the data. Big Data is considered one of the greatest technologies of the digital revolution of the past few decades [18]. To benefit fully from Big Data, processing power and raw storage must be considered along with strong analytics abilities and services. A wide variety of applications rely on Distributed File System (DFS) based storage systems to store, process and analyze Big Data. These systems provide efficient, easy-to-use and consistent storage by sharing multiple files and establishing a hierarchical and unified view of these files. There are many kinds of distributed file systems, such as the Network File System (NFS) from Sun, the Google File System (GFS) from Google, the Hadoop Distributed File System (HDFS) from Apache, and GLORY-FS from ETRI. HDFS is open source, and many Big Data Storage Systems adopt it.
Hadoop version 0.1.0 was published in April 2006, and new versions have continued to appear [14]. At the time of writing, the latest release, Apache Hadoop 2.7, became available in June 2016 [3]. Hadoop evolves rapidly and new software packages are continually added to it. Recently, parts of the original Apache Hadoop project have been built out into software such as Avro, HBase, Pig, HCatalog, Hive, Flume, Oozie, Sqoop, and Zookeeper [8]. According to a Statista report [21], the Hadoop market was valued at 6 billion U.S. dollars worldwide in 2015. Hadoop Big Data Storage System can be identified as a challenge to digital forensic researchers. A number of companies have begun to bundle Hadoop and related technologies into their own Hadoop distributions. The three prominent Hadoop distribution companies are MapR, Cloudera, and Hortonworks [13].
This paper focuses on discovering data remnants not only on the Hadoop Big Data Storage server but also on the client machines that access the server, with the aim of assisting forensic examiners in generating effective evidence. An overview is provided in the context of forensic process models and Hadoop.
This paper is organized as follows. Section 2 examines the current literature on digital forensic process models and digital forensics on Hadoop Big Data Systems. Section 3 describes the architecture of Hadoop, MapReduce and YARN. An overview of Hadoop and the architecture of the Hadoop Hortonworks Data Platform on Red Hat Linux hosted on Amazon EC2 are also presented. Section 4 presents the forensic issues and the research methodology for Hadoop Big Data Storage System, and introduces the proposed forensic investigation process model for Big Data Storage System. The implementation and investigation of Hadoop servers and client machines are presented in Sections 5 and 6, respectively. Section 7 summarizes the paper and describes the method used to answer the research questions. Areas for future work are then highlighted.

LITERATURE REVIEWS
This section discusses related work on various aspects of Hadoop forensic investigation. The following literature review explores the procedures and approaches used by other researchers in this field.

Digital Forensic Process Models
Digital forensics is the practice of collecting, analyzing and reporting on digital data in a way that is legally admissible. Throughout the history of digital forensics, several process models have been proposed for forensic investigation. In 2001, the forensic academic community held a large-scale consortium and defined a general standard digital investigation process model [20]. This model contains six stages: planning, incident response, data collection, data analysis, presentation of findings and incident closure. This process model covers not only computer forensics but also network forensics. The National Institute of Standards and Technology (NIST) described the original forensic process model [15]. This model includes the phases of collection, examination, analysis and reporting. The relevant data are identified, labeled and recorded in the collection phase, and the collected data are assessed and extracted in the examination phase. The results of the examination are then analyzed to derive useful information.
Quick [24] noted that there are numerous types of cloud services, each of which could hypothetically be used differently in criminal activity. The work highlighted the need for a sound digital forensic framework for the forensic analysis of client devices in order to identify probable data holdings. This research focused on discovering whether cloud storage data remnants exist on prevalent client devices. The proposed forensic framework was applied to widespread cloud storage services (Google Drive and Microsoft SkyDrive) to find the data remnants on client devices (Windows 7 and an Apple iPhone). The author pointed out that the cloud storage username and password can be identified from log files and browser information. The use of anti-forensic software did not eliminate the data remnants, although a full erase process can remove all data. The proposed framework was also useful for guiding the research and is applicable to digital forensic investigation.
Cho et al. [9] highlighted that the preceding forensic procedures are not suitable for HDFS based cloud systems because of their characteristics: a gigantic volume of distributed data, multiple users, and multi-layered data structures. These characteristics cause two problems in the evidence gathering phase. One problem is that file blocks are replicated on different nodes; the other is the excessive time and storage required to copy the original data. They proposed a general forensic procedure and guideline for Hadoop based cloud systems. In this procedure, the authors added live analysis and live collection to the original forensic procedure to avoid suspending the system. By conducting static and live collection simultaneously, Hadoop forensic analysis can reduce the time needed for evidence collection. However, they did not present a case study or specific scenario to illustrate their proposals.

Discussion
The forensic process models presented in [15,10] are standard and common procedures. The model in [24] is a specific model focusing on cloud storage and digital forensic investigation. The paper [9] proposed a forensic procedure for Hadoop based cloud systems. The evolution of Hadoop Big Data Storage System brings challenges to forensic investigation, as it does in other research and technical areas. Therefore, today's forensic process models, which were designed for traditional systems, have limitations in supporting forensic investigation of this environment. To address the active nature of this environment, the forensic investigation process model should have the following characteristics:
• an iterative nature, so that the investigation can move easily between phases
• forensic data collection and analysis without system suspension
In [29], the authors discussed how the Hadoop Big Data system could pose new difficulties and challenges for forensic investigators. The paper highlighted that understanding the internal structure of Hadoop is an important point for forensic investigators. The authors pointed out that big data forensics can be carried out using different tools and technologies. They then demonstrated that an automated tool (Autopsy) can help find evidence in big data efficiently.

Discussion
The paper [2] investigates and protects against data spillage from a Hadoop cluster. The paper [29] highlighted that automated tools can perform big data forensics efficiently. This paper focuses on the investigation of Hadoop Big Data Storage System by analyzing data remnants on the Hadoop storage server and client machines.

Hadoop Big Data System
HDFS and MapReduce are the main Hadoop modules. HDFS distributes files across the cluster to offer fault-tolerant, high-throughput access. MapReduce is considered an efficient programming model for distributed data processing. The HDFS architecture is modeled on the Unix file system and stores files as blocks. Each block stored in a Datanode holds 64 MB or 128 MB of data, as defined by the system administrator [3]. The metadata describing each group of blocks is stored by the Namenode. The Namenode manages the storage of file locations and monitors the availability of Datanodes in the system.
Hadoop offers a MapReduce framework for writing applications that process large amounts of structured and semi-structured data across large clusters of machines in a consistent and fault-tolerant way. It uses a MapReduce execution engine for fault-tolerant distributed computing over the large datasets stored in the cluster's DFS. The MapReduce technique was popularized by Google's use of it on its own clusters, and it was licensed to Apache. Program execution is divided into a Map phase and a Reduce phase, separated by data transfer between nodes in the cluster. Each phase is performed in parallel and operates on sets of key-value pairs. In the first step, a node runs the Map function on a section of the input data. The Map output is a set of records in the form of key-value pairs, stored on that node. The records for a given key are then aggregated at the node that runs the Reducer for that key, which involves data transfer between machines. The Reduce step is blocked until the Map output has been transferred to the appropriate machine. The Reduce step generates another set of key-value pairs as the final output. This programming model is restricted to the use of key-value pairs; however, a surprising number of tasks fit this framework.
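The Map, transfer and Reduce steps described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the list of strings stands in for input splits that a real cluster would process on separate nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) key-value pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Transfer step: gather all values that share a key in one place."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into one output key-value pair."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    # On a cluster each split would be mapped on a different node in parallel;
    # here the splits are simply processed one after another.
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data storage", "big data forensics"]))
```

The Reduce step cannot start before `shuffle` has collected every Map output for a key, which mirrors the blocking behavior described in the text.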
The Hadoop architecture changed from Hadoop 1.x to Hadoop 2.x. YARN (Yet Another Resource Negotiator) is a new module added in Hadoop 2.x and is used for cluster resource management. Figure 1 shows the layers of the Hadoop 2.x architecture: the storage layer (HDFS) and the processing layer (YARN). MapReduce 2 is a distributed application that runs the MapReduce framework on top of YARN. The Resource Manager manages resources and allocates them to applications. It has two components, the Scheduler and the Application Manager. The Scheduler performs its scheduling function based on the resource requirements of the client applications. The Application Manager accepts job submissions, negotiates the first container for executing the application-specific Application Master, and provides the service for restarting the Application Master container on failure. The Application Master is responsible for negotiating suitable resource containers from the Scheduler, tracking their status and monitoring progress. The Node Manager is responsible for launching containers, each of which can house a map or reduce task.

Hadoop HDP 2.3 on Red Hat Linux on Amazon EC2
The Hortonworks distribution provides a Hadoop system based on Apache Hadoop to analyze, sort and manage Big Data. Hortonworks is the only commercial vendor that distributes complete open source Apache Hadoop without additional proprietary software. Hortonworks also offers an easier learning curve by providing IT-friendly tools for users. According to the Gartner 2014 IaaS Magic Quadrant [16], Amazon Web Services is the overwhelming market share leader, with more than five times the compute capacity in use than the combined total of the other 14 providers. Amazon Web Services provides cloud computing services to build, secure, and organize Big Data applications. Companies' requirements for Red Hat Linux 7 servers hosted on Amazon EC2 motivate us to investigate this environment. Figure 2 presents the HDP 2.3 Red Hat Linux 7 server hosted on Amazon EC2. Each instance on Amazon EC2 is a virtual server in the cloud. An Amazon Machine Image (AMI) provides the information required to launch an instance. The instances can be monitored using Amazon CloudWatch, which collects and processes raw data from Amazon EC2 into readable, near real-time metrics. A DB instance is an isolated database environment running in the cloud. In this system, a Red Hat Linux 7 server is deployed as an EC2 instance and Hadoop HDP is installed on it.

HADOOP BIG DATA STORAGE FORENSIC INVESTIGATION RESEARCH QUESTIONS AND METHODOLOGY
In this section, the research methodology for this paper is discussed, and the proposed forensic investigation process model is then outlined. The model is applied to the investigation of the Hadoop Big Data Storage server and attached client machines.

Proposed Forensic Investigation Process Model for Hadoop Big Data Storage System
Over the past few years, a number of forensic process models have been proposed. However, these existing models may not be fit for purpose in the Big Data Storage System environment. A sound forensic process model for investigation in this environment is required. This section describes a new investigation process model that is adaptable to Hadoop Big Data Storage System. The proposed process model is based on the NIST forensic process model. Figure 3 illustrates the proposed investigation process model for Hadoop Big Data Storage System. The contribution of this process model is a cycle over the steps of requirements preparation, collection, and analysis. If forensically sound data cannot be collected in the collection phase, the investigation can go back to the requirements preparation phase to arrange usable tools and techniques for efficient collection. Likewise, if there is a difficulty in the analysis phase, the requirements preparation phase is repeated. Throughout the process, detailed documentation of every step should be retained. These documents are used to reconstruct the event when generating the investigation report, which can be used by investigators. The investigator can also prepare for the next investigation by consulting the previous documentation.
Scope and Identification: This is the first important phase of the investigation. The investigator needs to survey the physical area of the system to set the boundaries of the investigation. Figure 4 shows the boundaries of the forensic investigation: the targeted system, the purpose of the investigation, what methods should be applied, when it is carried out, how long it may take, and who will conduct the investigation. During identification, the following steps are taken into consideration:
• recognizing the possible data sources
• locating the data sources
• identifying the physical sources.

Requirements Preparation:
This is a proactive measure that maximizes capability and minimizes the effort and unexpected risk associated with the investigation. The investigators therefore prepare a set of requirements for the subsequent phases. This phase draws on prior experience or on a study of the documentation of previous investigations. Figure 4 depicts the materials that need to be prepared for the next phases and compares the tasks with their required materials. The resources necessary for collecting data are a Forensic Server, backup devices, or blank media. On a multi-user storage server, suspending the system causes serious problems for users and can change the original data files. To preserve the integrity of the investigation, the data are collected remotely. The Forensic Server is a facility machine that supports remote collection and forensic analysis tasks. The investigator should set up a system environment similar to the identified system in order to study the infrastructure of the targeted system. Becoming familiar with this environment is one of the responsibilities of this phase.
Figure 5 describes the material requirements for undertaking the investigation. A Forensic Server needs to be set up to collect and analyze the forensic data. To study the background of the infrastructure, a system with a configuration similar to that of the targeted system is prepared. The function of the Forensic Server is depicted in Figure 6. The Forensic Server requires high access rights to collect forensic data from the targeted machines. The collected data are duplicated on other backup media for emergency use. The forensic analysis is also done on this server, and forensic imaging and analysis tools are set up on it.
The responsibilities of the Forensic Server are:
• to perform remote collection
• to store and back up the forensic copies of disk images, memory dumps and registry files collected from each machine
• to mount these files and explore them in read-only mode
• to conduct investigation and analysis.
Analysis: After the data are collected, the relevant pieces of information are assessed and extracted from the collected data. The important task is to attach a copy of the collected data to the environment in a read-only manner. Forensic analysis tools and techniques are then applied. The available analysis methodologies include data mining, data correlation, anomaly detection, profiling, timeframe analysis, data hiding analysis, application and file analysis, and ownership and possession analysis. The methods suitable for this environment are described as follows.
Keyword Searching: Big Data investigations can contain both structured and unstructured data sources. These data may contain keywords related to the information sought. The simplest method is matching against keywords. The data are gathered and mined from the file system's metadata layer and then parsed and sorted for further analysis.
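As an illustration of keyword matching, the following minimal Python sketch walks a directory of collected text files and reports each line that contains a keyword. The function name `keyword_search`, the sample file `sample.log` and its contents are hypothetical; real log files and metadata dumps may need format-specific parsing first.

```python
import os
import tempfile


def keyword_search(root, keywords):
    """Return (path, line_number, keyword) for each case-insensitive match under root."""
    needles = [keyword.lower() for keyword in keywords]
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Files are opened read-only; undecodable bytes are replaced, not fatal.
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    lowered = line.lower()
                    for needle in needles:
                        if needle in lowered:
                            hits.append((path, number, needle))
    return hits


def demo():
    """Search a small, made-up log file for a file name and an operation name."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "sample.log"), "w", encoding="utf-8") as handle:
            handle.write("op=WRITE file=report.txt\nop=READ file=notes.txt\n")
        found = keyword_search(root, ["report.txt", "write"])
        return [(os.path.basename(path), number, keyword) for path, number, keyword in found]


if __name__ == "__main__":
    for hit in demo():
        print(hit)
```

Running the search against a read-only mounted copy, as the Analysis phase requires, keeps the original evidence untouched.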
Timeline Analysis: The end goal is to represent the incident activity performed in the system, comprising its date, the artifact involved, the action and the source. The investigator should examine points such as which files were downloaded, what programs were executed, which directories were opened, which files were clicked on, which files were deleted, where the user browsed to, and so on. In this approach, each file is examined to trace the footprints of criminals or illegal usage. Knowledge of file systems, configuration remnants and registry remnants is therefore required to take advantage of this procedure, which reduces the amount of data to be analyzed. At the end of this phase, the output is handed over to the next phase for event reconstruction and reporting.
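A timeline of this kind can be sketched as a list of events, each carrying the date, artifact, action and source named above, sorted by time. The `Event` class, the `build_timeline` function and the sample events below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    timestamp: datetime  # when the activity happened
    artifact: str        # file or object involved
    action: str          # e.g. upload, download, open, delete
    source: str          # where the remnant was found


def build_timeline(events, start=None, end=None):
    """Return the events inside the optional time window, oldest first."""
    selected = [
        event for event in events
        if (start is None or event.timestamp >= start)
        and (end is None or event.timestamp <= end)
    ]
    return sorted(selected, key=lambda event: event.timestamp)


def sample_events():
    """Made-up remnants from a server log and a client browser history."""
    utc = timezone.utc
    return [
        Event(datetime(2016, 4, 25, 3, 20, tzinfo=utc), "report.txt", "download", "server log"),
        Event(datetime(2016, 4, 25, 3, 15, tzinfo=utc), "report.txt", "upload", "server log"),
        Event(datetime(2016, 4, 25, 3, 18, tzinfo=utc), "report.txt", "open", "client browser history"),
    ]


if __name__ == "__main__":
    for event in build_timeline(sample_events()):
        print(event.timestamp.isoformat(), event.action, event.artifact, event.source)
```

Restricting the window with `start` and `end` is one way to reduce the amount of data that has to be examined.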

Media and Artifact Analysis
Reporting: This phase presents the findings as the outcome of the investigation. The results obtained from the preceding phases are organized to draw a conclusion. Taken together, the associations between individual results may provide an overall picture. Reporting is the strategy for presenting and exposing the incident (case); the presentation of the findings must be clear, complete, and accurate. This means the findings have to be presented in a comprehensible way that is accessible to a non-technical audience. The report structure typically includes one or more sections detailing the evidence considered and the steps the investigator took to arrive at the findings. This is typically done by identifying the name, type, and characteristics of the evidence. There are many report formats relating to specific case types. A standard approach is to describe the process in chronological order, from identification through analysis.
Closing: This phase retains all related documentation recorded at each phase of the investigation. Each phase is reviewed so that lessons can be learnt and applied to future investigations. Figure 7 describes the tasks needed to accomplish the closing phase. In this phase, the conclusion is drawn by deciding upon the result of the reporting phase. All data collected throughout the process and the resulting data remnants are stored and archived. The document produced in this phase is the finalized document, which summarizes the activities and occurrences of the whole process. The resulting document is stored together with the previous ones. The documentation of the whole process allows investigators to prepare the required materials and methodologies for future investigations.

FORENSIC INVESTIGATION OF HADOOP STORAGE SERVER
The characteristics of Hadoop (low cost, computing power, scalability and storage flexibility) lead organizations to deploy it as a storage server. As the use of Hadoop storage servers continues to grow rapidly, they become a target of, or a facility for committing, crime. In this work, the proposed process model is applied to investigate two Hadoop storage system infrastructures with different installation methods.

Forensic Investigation of Hadoop Infrastructure I
In this section, the forensic investigation is carried out on Hadoop storage server infrastructure I by applying the proposed forensic investigation process model.

Scope and Identification Phase
The scope of this investigation covers the primary operation services (file uploading, downloading and MapReduce processing) that are operated on Hadoop infrastructure I.

Requirements Preparation Phase
In order to support the investigation, it is desirable to prepare tools, software and methods that are compatible with this system infrastructure. First, the investigator should prepare a system environment with the same infrastructure as the targeted environment. This similar system allows the investigator to study the nature of the targeted system and to test the tools and techniques. The Hadoop Storage Server is installed and configured step by step. Tools [12] are prepared to collect data and convert it into human-readable form.

Collection Phase
Before forensic analysis is conducted, the data should be forensically collected. For effective collection, the data sources should be prioritized by likely value and volatility. Among the important files, edits_* is a log listing each file system change (file creation, deletion or modification) made after the most recent fsimage.
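The prioritization used in this phase (important files first, then volatile data, then non-volatile data) can be sketched as a simple sort. The `DataSource` type, the `collection_order` function and the numeric scores below are hypothetical; in practice the scores would come from the investigator's judgment of each source.

```python
from typing import NamedTuple


class DataSource(NamedTuple):
    name: str
    likely_value: int  # higher = more likely to hold evidence (assumed score)
    volatility: int    # higher = lost sooner if not collected (assumed score)


def collection_order(sources):
    """Order sources by likely value first, then by volatility, highest first."""
    return sorted(
        sources,
        key=lambda source: (-source.likely_value, -source.volatility, source.name),
    )


def sample_sources():
    """Illustrative scores only; they are not taken from the investigation."""
    return [
        DataSource("hard drive image", likely_value=2, volatility=1),
        DataSource("memory image", likely_value=2, volatility=3),
        DataSource("fsimage and edits files", likely_value=3, volatility=2),
    ]


if __name__ == "__main__":
    for source in collection_order(sample_sources()):
        print(source.name)
```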
After the important files are collected, the collection of volatile data should take precedence over non-volatile data. To get the memory image, the memory is dumped with the "dd" command:
dd if=/dev/mem of=media/usb/memory.image
The non-volatile data are collected by imaging the hard drive of the machine.

Analysis Phase
By analyzing the collected data, this investigation can track footprints on the Hadoop storage server to identify its usage. The usage includes the primary operation services: uploading, downloading and the MapReduce function. Tables 2 through 4 report the data remnants related to each operation service. In the tables, the IP address is expressed as xxx. As shown in Table 2, the uploaded file name, source IP, destination IP and file operation name (upload = WRITE) are left on the Hadoop server when the uploading service is operated.
In addition, Table 4 presents the remnants related to the MapReduce task. When a MapReduce task is run on a data set, a new folder appears in the metadata level log file. The name of this folder is composed of the program name (e.g. WordCount) and the processing time of the task. The output file name is recorded in Syslog.

Reporting Phase
The investigators arrange the evidence found to reconstruct the event. They draw the event line according to specific features: time and sequence. The output is documented for presentation in a court of law. This phase relates to the legal presentation of the collected evidence and the investigation. The findings can be presented in many documentation formats. Regardless of the format, the important point is to present the findings and the collected evidence clearly.

Closing Phase
From the resulting remnants, the conclusion can be drawn that the usage of the Hadoop 2.7.1 server can be identified. The investigator checks the documentation of each phase to extract the factors that should be noted for the next investigation. The finalized documentation is created and all documents are organized. The collected data are stored in archived format.

Forensic Investigation of Hadoop Infrastructure II
In this section, the forensic investigation is carried out on Hadoop storage server infrastructure II by applying the proposed forensic investigation process model.

Scope and Identification Phase
In this investigation, the target system is Hadoop HDP 2.3 on a Red Hat Linux 7 server hosted on Amazon Web Services EC2. The objective is to trace the operations of file uploading, downloading, opening a file on the server and uninstalling HDP 2.3. This section focuses on discovering whether any data remnants are left on this storage server.

Requirements Preparation Phase
Initially, an environment with the same infrastructure as the targeted system is set up in order to study the targeted system. HDP 2.3 can be downloaded directly from the Hortonworks website for installation. EC2 storage space is rented to install Red Hat Linux 7, and HDP 2.3 is deployed on top of Red Hat. To set up Hadoop via Ambari, the installation steps are:
1. Launching an EC2 instance
2. Pre-requisites for setting up Hadoop in Amazon Web Services
3. Hadoop cluster installation (via Ambari)
Afterward, Hadoop HDP is accessed at the address 'http://ec2-16……:8080/' via a web browser. The default sign-in name is 'admin' and the password is also 'admin'. The infrastructure study and the testing of tools are done in this similar environment. In addition, the Forensic Server is implemented. Forensic tools, methods and other facility software for collection and analysis are also prepared as in section 5.1.3.

Collection Phase
For data collection, the Forensic Server connects to the EC2 instance via PuTTY 0.67 [23] and WinSCP 5.7.7. The forensic data are duplicated on other media. The data sources are prioritized by likely value and volatility. The forensically important files and volatile data are collected first.

Analysis Phase
The exported VM files are opened on the Forensic Server. The collected VM is analyzed to identify the usage and discover the remnants. Tables 5 through 7 list the data remnants found by tracking the upload, download and read operations. The remnants include a job record of the following form:
JOBNAME="word count" USER="xxx" SUBMIT_TIME="1461554117485"
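A job record such as the JOBNAME, USER and SUBMIT_TIME remnant above can be split into fields and its timestamp made readable with a short Python sketch. The helper names `parse_job_record` and `submit_time_utc` are hypothetical, and the sketch assumes that SUBMIT_TIME is a Unix epoch value in milliseconds, which the 13-digit value suggests but the text does not state.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches KEY="value" pairs such as JOBNAME="word count".
PAIR = re.compile(r'(\w+)="([^"]*)"')

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_job_record(line):
    """Split a KEY="value" record into a dictionary of fields."""
    return dict(PAIR.findall(line))


def submit_time_utc(record):
    """Convert SUBMIT_TIME to a UTC datetime, assuming epoch milliseconds."""
    millis = int(record["SUBMIT_TIME"])
    return EPOCH + timedelta(milliseconds=millis)


if __name__ == "__main__":
    line = 'JOBNAME="word count" USER="xxx" SUBMIT_TIME="1461554117485"'
    record = parse_job_record(line)
    print(record["JOBNAME"], record["USER"], submit_time_utc(record).isoformat())
```

Under the stated assumption, the value in the record corresponds to 25 April 2016, 03:15:17 UTC, which can then be placed on the investigation timeline.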

Reporting Phase
To present the forensic report in full, the investigator observes the data remnants and reconstructs the event for presentation in court. For both Hadoop Storage Server infrastructures investigated, the remaining data remnants are the same; however, the locations and files that contain these remnants are different.

Closing Phase
The investigator needs to review every step of the process to note which factors are important for the next investigation.

FORENSIC INVESTIGATION OF CLIENT MACHINES
The storage and processing services are used by connecting to the storage server from client machines. In this section, the investigation is conducted on client machines that access the Hadoop server.

Scope and Identification Phase
The objective of the investigation in this section is to discover what data remnants are left on client machines when they access the server. While identifying the client machines, we found that there are two types of access method: web access and SSH access. In web access, the client machine connects to the Hadoop server with the link "http://ec2-50-112-211-185.us-west2.compute.amazonaws.com:8080/".

Requirements Preparation Phase
Tools that are compatible with the targeted machines are prepared for both static and live analysis. Afterward, for studying the infrastructure, three types of client machine are prepared as stated in Table 8. A Forensic Server is also prepared to collect and analyze forensic data.

Collection Phase
In this phase, the investigator collects the disk image, memory dump and protected registry files of the current machine by using forensic tools and file viewer software. To create the forensic image of the hard disk, a write blocker is used to ensure that no data is written back to the hard drive. We use AccessData FTK Imager 3.0.0.143 [1] for imaging in write-blocked mode. After the hard drive is imaged, the image file is collected on a Forensic Server. The memory dump files of each client machine are also collected for live analysis.

Analysis Phase
The data collected from each VM are analyzed. The acquired image files are mounted on the Forensic Server and opened in read-only mode to discover the data remnants.

(i) Testing Environment I, Analysis on Client Machine 1
For testing environment 1, Client VM 1 is investigated. Client machine 1 runs Windows 7 64-bit and accesses the server via Mozilla Firefox 49.0. The disk image file of Client VM 1 is mounted on the Forensic Server. The mounted drive is explored and the data remnants are discovered.
Data remnants such as the web address, access date, uploaded file name and upload date are found, as shown in Table 9.

(ii) Testing Environment II, Analysis on Client Machine 2
For testing environment 2, Client machine 2 is investigated. It runs the Windows 7 64-bit OS and accesses the server via IE 8.0.7601. The data remnants are discovered as shown in Table 10.

Reporting Phase
The investigator arranges the evidence found to reconstruct the event.

Closing Phase
From the resulting remnants, the conclusion can be drawn that data remnants can be discovered on client machines to trace the usage of the Hadoop server. All documentation is organized for later use. The collected data are stored in archived format. The investigator reviews the tasks of each phase to extract the factors that should be noted for the next investigation.