Distributed File Sharing and Retrieval Model for Cloud Virtual Environment

Cloud-based storage services are multiplying and are being adopted mostly for data storage. At the same time, many potential problems pertaining to data storage and security are being addressed. This research provides an architecture for splitting user data and a solution for retrieving the resulting data chunks, which are stored on different cloud storages. By doing this, not only is the load on a single server reduced, but security is improved and storage is used efficiently. Data is stored and retrieved in slices, hence the chances of data forgery are diminished. File processing is faster because the split parts are fetched from different clouds, enabling parallel processing.
Keywords-cloud computing; data security; file sharing; distributed system


I. INTRODUCTION
Cloud computing is being adopted by many organizations due to its dynamic scalability and virtualized resources [1]. Cloud computing, particularly cloud storage, has been shown in multiple real-world cases to simplify information technology (IT) operations, helping companies to save millions of dollars. The era of big data would not be possible without the emergence of highly scalable cloud storage [2]. Cloud storage benefits users in a variety of ways, such as faster accessibility, rapid deployment, data security, and backup and recovery. Due to the extensive use of the cloud, problems pertaining to it are becoming noticeable [3]. The load on servers is getting heavier, and systems need better methods to minimize it. For that purpose, distributed file systems and cloud storage come in handy and address many of the problems discussed above. In these, a dataset is divided into small chunks, and these chunks are then stored over different servers to minimize the load on any single server, hence increasing efficiency and security [4]. Goals that have been achieved by data distribution include fault tolerance, the ability to store larger data and manage it efficiently, support for append operations on files, and reliable communication among different machines. However, attention is still needed on the retrieval of distributed data in its complete original form.
This project focuses on the distribution of data across different servers and on its retrieval. This can be achieved with the help of an information retrieval method that maintains the order of the data slices belonging to the same file, so that, when fetched, they can be arranged into the complete original file. Several schemes have been proposed in the context of file splitting and sharing, but the proposed scheme focuses on the following points.
• To design a hybrid multi-cloud model to improve cloud efficiency and provide a secure environment.
• To use a data distribution technique and splitter to encrypt data and split it into chunks.
• To develop an efficient and intelligent method for retrieving split file chunks and restoring the file to its original form.

II. RELATED WORK
Security as a service (SaaS) provides the opportunity to divide data into chunks for security enhancement. Once the data is divided, each chunk is encrypted and stored in a different database [5]. However, this technique has a serious issue, namely the linked-list-based distribution of the data. The tail of each node of the linked list contains information about the next node, so if one node is compromised, it can lead an attacker to the other nodes. Security is provided by delegating encryption and decryption to a third party, so the data is encrypted using a secure co-processor [5]. Cloud systems have provided ease to users in many ways, such as storing data, running applications, data recovery, and flexibility. The cloud also carries risks concerning data integrity, network dependency, and centralization. This is the main reason why many big companies refrain from using clouds as primary storage. CSA, ENISA, and NIST have published general security guidance and recommendations for cloud usage to provide some level of protection, ranging from physical security to network/system/application security [6]. The idea of distributed computing has been gaining popularity over the last few years. Data storage is a critical and important research field in distributed computing. In [4], the authors presented the idea of distributed computing and distributed storage, along with the design of distributed storage from the outset. The authors in [8] developed a dynamic load balancing algorithm to balance the load across storage nodes during the expansion of private cloud storage. The authors in [9] addressed encryption challenges through the use of ranked searchable symmetric encryption. The authors in [7] implemented data integrity protocols to detect data corruption.
The authors in [10] discussed cloud computing and its service models, cloud security issues, and challenges, analyzed various solutions with TTPA, and studied their benefits in terms of data integrity, access control mechanisms, and data confidentiality. The authors in [11] provided a KNN classification method, a privacy-preserving protocol that accesses data in a database using encryption and solves the input record query problem in data mining.

A. Data Splitting and Encryption Techniques
Cryptography is a technique for secure communication in the presence of malicious third parties, known as adversaries. Encryption, a main module of cryptography, uses an algorithm and a key to transform an input, known as plaintext, into an encrypted output called ciphertext. The algorithm will always decrypt the ciphertext (cipher block) back into the plaintext if the same key is provided. The encryption algorithm is considered secure because, if any of the encrypted data (cipher block) is lost, the information in that block cannot be recovered without the encryption key.
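The key-dependent round trip described above can be illustrated with a small sketch. It is not the encryption used in this work: the Python standard library ships no AES implementation, so a toy stream cipher built from SHA-256 is swapped in purely to show the behaviour. The names `keystream` and `xor_cipher` and the sample key are hypothetical, and the construction must not be used to protect real data.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` by hashing key + counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR the data with the key-derived stream.

    The same call both encrypts and decrypts. Illustration only: this is
    not AES and is not safe for real data.
    """
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))


plaintext = b"confidential report"
ciphertext = xor_cipher(plaintext, b"secret key")

recovered = xor_cipher(ciphertext, b"secret key")  # same key: plaintext comes back
garbage = xor_cipher(ciphertext, b"wrong key")     # different key: unreadable bytes
```

With the correct key the plaintext is restored exactly, while any other key yields bytes unrelated to the original message.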

1) Data Encryption
• Symmetric encryption: This is the simplest kind of encryption and involves only one secret key to cipher and decipher information. Symmetric encryption is an old and well-known technique. The secret key can be a number, a word, or a string of random letters. It is blended with the plaintext of a message to change the content in a particular way. Both the sender and the recipient must know the secret key in order to encrypt and decrypt messages.
• Asymmetric encryption: Asymmetric algorithms use two keys, one to encrypt the data and the other to decrypt it. These interdependent keys are generated together. One is labeled the public key and is distributed freely. The other is labeled the private key and must be kept hidden.
• Hash: A hash function is a mathematical function that takes input data of arbitrary length and generates a fixed-length hash based on that input. It is easily calculated, but it is very difficult to recover the original data from the hash value alone.
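The fixed-length property of a hash can be checked with a short sketch. SHA-256 from Python's standard `hashlib` module is an assumed choice here, since the text does not name a specific hash function, and using the digest to detect a modified chunk is shown only as one plausible application.

```python
import hashlib

chunk = b"encrypted chunk bytes"
tampered = b"encrypted chunk bytez"

digest = hashlib.sha256(chunk).hexdigest()

# SHA-256 always yields 256 bits (64 hex characters), whatever the input size.
short_len = len(hashlib.sha256(b"a").hexdigest())
long_len = len(hashlib.sha256(b"a" * 1_000_000).hexdigest())

# A one-byte change produces a different digest, so a stored digest
# can reveal that a chunk was modified.
is_intact = hashlib.sha256(chunk).hexdigest() == digest
tampering_detected = hashlib.sha256(tampered).hexdigest() != digest
```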

2) Data Splitting
Cryptographic splitting is a technique that splits the input data into a number of chunks. Splitting is done at the bit level and in a random manner, controlled by a secure key.
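A minimal sketch of key-controlled, bit-level splitting follows. It assumes that the key seeds a pseudo-random generator which decides the chunk each bit goes to; the text does not specify the actual mechanism. The function names are hypothetical, and Python's `random` module is used only for readability, as it is not cryptographically secure.

```python
import random


def split_bits(data: bytes, key: int, n_chunks: int) -> list[list[int]]:
    """Scatter the bits of `data` over `n_chunks` lists, steered by `key`."""
    rng = random.Random(key)            # keyed PRNG; NOT cryptographically secure
    chunks = [[] for _ in range(n_chunks)]
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            chunks[rng.randrange(n_chunks)].append((byte >> shift) & 1)
    return chunks


def join_bits(chunks: list[list[int]], key: int) -> bytes:
    """Replay the same keyed choices to put every bit back in place."""
    rng = random.Random(key)
    positions = [0] * len(chunks)       # next unread bit in each chunk
    total_bits = sum(len(c) for c in chunks)
    out = bytearray()
    for _ in range(total_bits // 8):
        byte = 0
        for _ in range(8):
            i = rng.randrange(len(chunks))
            byte = (byte << 1) | chunks[i][positions[i]]
            positions[i] += 1
        out.append(byte)
    return bytes(out)


parts = split_bits(b"cloud", key=2024, n_chunks=3)
restored = join_bits(parts, key=2024)
```

Reassembly works because the same key replays the same sequence of choices; without the key, the order in which the bits were dealt out is not known.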

3) Data Security Algorithms
Algorithms such as Blowfish, AES, RC4, DES, RC5, and RC6 are used for encryption to enhance security. The most widely used are the AES variants AES-128, AES-192, and AES-256. The technique used here is AES (Advanced Encryption Standard)-256, which has the advantages of low memory cost, high speed, a single key for both encryption and decryption, and cipher and plain blocks of the same size. The uploaded file is encrypted first and then broken down into a number of chunks according to the number of public clouds available.
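The last step, dividing the encrypted file according to the number of public clouds, might look like the sketch below. It assumes contiguous chunks of near-equal size, which the text does not state explicitly; `split_into_chunks` is a hypothetical helper, and the sample bytes merely stand in for AES-256 output.

```python
def split_into_chunks(data: bytes, n_clouds: int) -> list[bytes]:
    """Cut `data` into `n_clouds` contiguous chunks of near-equal size."""
    if n_clouds < 1:
        raise ValueError("at least one cloud is required")
    base, extra = divmod(len(data), n_clouds)
    chunks, start = [], 0
    for i in range(n_clouds):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more byte
        chunks.append(data[start:start + size])
        start += size
    return chunks


encrypted_file = bytes(range(10))              # stand-in for AES-256 output
chunks = split_into_chunks(encrypted_file, 3)
sizes = [len(c) for c in chunks]               # [4, 3, 3]
```

Concatenating the chunks in order reproduces the encrypted file, so the chunk size follows from the file size and the number of clouds rather than from a fixed constant.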

7) Placement of Chunks on Different Clouds
These encrypted chunks will be stored on different public clouds, and the ID of the server on which each chunk is stored is fetched.
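The placement step can be sketched as follows, assuming a simple round-robin assignment of chunks to clouds. The public clouds are modelled as in-memory dictionaries, and the server IDs (`cloud-A` and so on) and the function `place_chunks` are hypothetical; a real deployment would call each provider's storage interface instead.

```python
def place_chunks(chunks: list[bytes], clouds: dict[str, dict]) -> list[str]:
    """Store chunk i on the i-th cloud (wrapping around).

    Returns the server IDs in chunk order so they can be recorded.
    """
    server_ids = list(clouds)
    placement = []
    for index, chunk in enumerate(chunks):
        server_id = server_ids[index % len(server_ids)]
        clouds[server_id][index] = chunk   # keyed by chunk number
        placement.append(server_id)
    return placement


public_clouds = {"cloud-A": {}, "cloud-B": {}, "cloud-C": {}}
placement = place_chunks([b"part0", b"part1", b"part2"], public_clouds)
```

The returned list pairs each chunk number with the server that holds it, which is the information needed to locate the chunks again.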

8) Maintaining Log File
A log file will be maintained with information about the number of registered users and uploaded files, along with the number of chunks of each file and the cloud on which each chunk is stored. Figure 1 shows the system flow of the proposed technique. Figure 2 shows the architecture of the proposed technique.
In this section, the proposed model is compared with the plain file system. The comparison is based on time. The total time taken by the proposed model is the sum of the receiving, encryption, splitting, and transferring times. Table I shows the total time taken by the different modules (receiving, encryption, splitting, and transferring) when users upload files of different sizes. The smallest file (3.73 MB) takes a total processing time of 17499 ms, whereas the largest file (25.3 MB) takes 52638 ms. However, total time also depends on bandwidth, CPU, and memory. Table II shows the total time taken by the plain file distribution system (PFS) when users upload files of different sizes. Total time is inversely proportional to the bandwidth. A detailed comparison between the proposed system and the PFS is shown in Figure 3. The proposed system is somewhat more complex and took more processing time than the PFS. It is understood that some computational overhead can be accepted for the sake of data security. The proposed system is more secure than the PFS.
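The additive timing model described above can be written down directly. The sketch uses only the two file sizes and total times quoted in the text; the per-module values from Table I are not reproduced, and the function names are hypothetical.

```python
def total_time_ms(receiving: float, encryption: float,
                  splitting: float, transferring: float) -> float:
    """Total upload time as the sum of the four module times (all in ms)."""
    return receiving + encryption + splitting + transferring


def throughput_mb_per_s(size_mb: float, total_ms: float) -> float:
    """Effective end-to-end throughput for one upload."""
    return size_mb / (total_ms / 1000)


# File sizes and total times quoted in the text for the proposed model.
small = throughput_mb_per_s(3.73, 17499)   # smallest file
large = throughput_mb_per_s(25.3, 52638)   # largest file
```

On these figures the effective throughput is roughly 0.21 MB/s for the smallest file and 0.48 MB/s for the largest. This would be consistent with a fixed per-upload overhead, although the quoted totals alone cannot confirm that.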

C. Result Analysis
From the above results, it is concluded that the total time taken by the proposed distributed file system increases by only a few seconds. This time is spent on security measures (i.e., encryption and splitting) and on efficient storage utilization. To achieve high security, one must compromise on time. Table III shows the difference in time consumption between the proposed system and the plain file system.
The proposed distributed file system is technically compared with the Google file system (GFS) and the SaaS model.
Comparison with GFS:
• Client interaction. Proposed model: a client interacts with the private server only. A client is not required to communicate directly with the public servers because the private server is responsible for carrying out the client's request.
• Chunk size. GFS: divides files into fixed-size chunks of 64 MB each. Proposed model: there is no restriction on the chunk size; the file division is based on the number of public clouds available.
• Replication. GFS: each chunk is replicated on three public chunk servers. Proposed model: there is no replication of chunks, ensuring data integrity.
Comparison with the SaaS model:
• Chunk metadata. SaaS model: a header contains all the important details, including the encryption technique, total number of chunks, total size of chunks, chunk number, and a unique user ID. Proposed model: the log file (database) maintains all important details.
• Chunk linkage. SaaS model: the tail contains a reference to the next chunk. Proposed model: the next chunk can be identified from the log file.
• Encryption. SaaS model: a third party does encryption and decryption, and the user can select between multiple encryption techniques. Proposed model: the private cloud is responsible for encryption and decryption, and the user has no choice over the encryption technique.
• Log file contents. SaaS model: the log file contains the record of the user data and helps to identify the first chunk. Proposed model: the log file contains the total number of chunks along with the server IDs.
• Log file location. SaaS model: the log file is maintained in the master database and all the data is dumped into random databases. Proposed model: the log file is maintained in the master cloud and all the data chunks are saved into slave clouds.
• Architecture. SaaS model: enhances the security of the encrypted data by distributing the data within the cloud. Proposed model: master and slave clouds; the user can directly interact with the master cloud only, and communication takes place between the master and slave clouds.
The proposed model has some characteristics which differentiate it from the previous models/techniques. These are:
• The user cannot directly communicate with the slave clouds, which ensures data security.
• The proposed model does not let the user choose the encryption technique, which ensures reliability.
• The proposed model keeps all information about the data chunks and their order in secured log files.
• The proposed model does not involve any third party for encryption, which makes it more trustworthy, whereas the SaaS model depends upon third parties for encryption.
• The proposed model is a hybrid model consisting of multiple clouds, in which one is the master (private) and the others are slaves (public).
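How a master-side log can drive ordered retrieval from the slave clouds is sketched below. The log layout (`user_id`, `file_name`, and a list of chunk references) is an assumption based on the fields the text says the log holds; the server IDs, the in-memory clouds, and the functions `fetch` and `retrieve` are hypothetical. Decryption after reassembly is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical slave clouds: server ID -> {chunk number: chunk bytes}.
slave_clouds = {
    "cloud-A": {0: b"Distri"},
    "cloud-B": {1: b"buted "},
    "cloud-C": {2: b"file"},
}

# Hypothetical log entry kept on the master cloud for one uploaded file.
log_entry = {
    "user_id": "u-001",
    "file_name": "report.txt",
    "chunks": [                       # deliberately listed out of order
        {"chunk_no": 2, "server_id": "cloud-C"},
        {"chunk_no": 0, "server_id": "cloud-A"},
        {"chunk_no": 1, "server_id": "cloud-B"},
    ],
}


def fetch(ref: dict) -> bytes:
    """Read one chunk from the slave cloud named in the log reference."""
    return slave_clouds[ref["server_id"]][ref["chunk_no"]]


def retrieve(entry: dict) -> bytes:
    """Fetch all chunks in parallel and join them in chunk-number order."""
    ordered = sorted(entry["chunks"], key=lambda ref: ref["chunk_no"])
    with ThreadPoolExecutor(max_workers=len(ordered)) as pool:
        parts = list(pool.map(fetch, ordered))  # map preserves the input order
    return b"".join(parts)


restored_file = retrieve(log_entry)
```

Because the order lives in the log rather than inside the chunks, a single chunk on a slave cloud says nothing about where the others are stored, and the fetches can run concurrently without losing the original sequence.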

V. CONCLUSION
The distributed file system is a very efficient way of partitioning data and storing it across multiple clouds. In that way, the load on a single cloud is reduced and performance is increased. It also helps in creating a secure way for file retrieval and processing, hence managing and protecting users' data. The main concern of the project is the efficient distribution of data and its retrieval in the correct order.