A Maturity Analysis of Big Data Technologies

In recent years, Big Data technologies have been developed at a faster pace due to the increasing demand from applications that generate and process vast amounts of data. Cloud Computing and the Internet of Things are the main drivers for developing enterprise solutions that support Business Intelligence, which in turn creates new opportunities and new business models. An enterprise can now collect data about its internal processes, process this data to gain new insights and business value, and make better decisions. This is the reason why Big Data is now seen as a vital component of any enterprise architecture. In this article, the maturity of several Big Data technologies is analyzed. For each technology, several aspects are considered, such as development status, market usage, licensing policies, availability of certifications, adoption, and support for cloud computing and the enterprise.


Big Data Overview
Driven by the need to generate business value, enterprises have started to adopt Big Data as a solution, migrating from classical databases and data stores, which lack flexibility and are not sufficiently optimized [1]. The changes in the environment make big data analytics attractive to all types of organizations, while the market conditions make it practical. The combination of simplified models for development, commoditization, a wider palette of data management tools, and low-cost utility computing has effectively lowered the barrier to entry [2]. The concept addresses large volumes of complex data and rapidly growing data sets that may come from different autonomous sources. In recent approaches, Big Data is characterized by principles known as the 4Vs: Volume, Variety, Velocity and Veracity [3].
There are opinions about accepting other principles as Big Data characteristics, such as Value. Each day, more businesses realize that Big Data is relevant, as their applications automatically generate large volumes of data from different data sources, centralized or autonomous. As traditional databases hit limitations when this data needs to be analyzed, dedicated solutions must be considered.

Important Big Data Solutions:
Apache HBase/Hadoop is based on Google's BigTable distributed storage system and runs on top of Hadoop as a distributed and scalable big data store. This means that HBase can leverage the distributed processing paradigm of the Hadoop Distributed File System (HDFS) and benefit from Hadoop's MapReduce programming model. It combines the scalability of Hadoop with real-time data access as a key/value store and the deep analytic capabilities of MapReduce [4]. HBase allows querying for individual records as well as deriving aggregate analytic reports across a massive amount of data. It can host large tables with billions of rows and millions of columns, and run across a cluster of commodity hardware. HBase is composed of three types of servers in a master-slave type of architecture. Region servers are responsible for serving data for reads and writes. When accessing data, clients communicate with HBase Region Servers directly. Region assignment and DDL operations (create, delete tables) are handled by the HBase Master process.

Apache Cassandra is a distributed database used for the administration and management of large amounts of structured data across multiple servers, while providing a highly available service with no single point of failure. It provides features such as continuous availability, linear scale performance, data distribution across multiple data centers, and cloud availability zones. Cassandra inherits its data architecture from Google's BigTable and borrows its distribution mechanisms from Amazon's Dynamo. The nodes in a Cassandra cluster are completely symmetrical, all having identical responsibilities. Cassandra also employs consistent hashing to partition and replicate data. It has the capability of handling large amounts of data and thousands of concurrent users or operations per second across multiple data centers. Cassandra has a hierarchy of caching mechanisms and carefully orchestrated disk I/O which ensures speed and data safety. Write operations are sent first to a persistent commit log (ensuring a durable write), then to a write-back cache called a memtable; when the memtable fills, it is flushed to a sorted string table (SSTable) on disk. A Cassandra cluster is organized as a ring, and it uses a partitioning strategy to distribute data evenly.
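As a minimal, hedged sketch of HBase's key/value access model, the example below uses the third-party happybase Python library, which talks to HBase through its Thrift gateway; the host, port, table name and column family are assumptions made for the illustration:

```python
import happybase  # third-party Python client for HBase's Thrift interface

# Connect to a local HBase Thrift server (host and port are assumptions).
connection = happybase.Connection('localhost', port=9090)

# 'metrics' and the column family 'cf' are hypothetical names.
table = connection.table('metrics')

# Write a single record: row key -> {column: value}, all as bytes.
table.put(b'sensor-001', {b'cf:temperature': b'21.5'})

# Read back an individual record by its row key.
row = table.row(b'sensor-001')
print(row[b'cf:temperature'])  # b'21.5'
```

Similarly, a short sketch of the client-side view of Cassandra, using the DataStax Python driver; the keyspace and table are hypothetical and a locally running cluster is assumed:

```python
from cassandra.cluster import Cluster  # DataStax Python driver

# Contact points are assumptions; any node can act as coordinator,
# since all nodes in the ring have identical responsibilities.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo_keyspace')  # hypothetical keyspace

# Writes land in the commit log and memtable before being flushed to SSTables.
session.execute(
    "INSERT INTO sensor_readings (sensor_id, reading) VALUES (%s, %s)",
    ('sensor-001', 21.5)
)

row = session.execute(
    "SELECT reading FROM sensor_readings WHERE sensor_id = %s",
    ('sensor-001',)
).one()
print(row.reading)
```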
Redis is an in-memory data structure store used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, HyperLogLogs and geospatial indexes with radius queries. Redis stores all data in RAM, allowing lightning-fast reads and writes. It runs extremely efficiently in memory and handles high-velocity data, needing only simple standard servers to deliver millions of operations per second with sub-millisecond latency. Redis is schema-less, but when one of its data structures (like a Hash or a Sorted Set) is used, users can take advantage of the in-memory operations to accelerate the way data is processed. The Sorted Set is a structure that combines the features of a hash table with those of a sorted tree. Each entry in a Sorted Set is a combination of a string "member" and a double "score". The member acts as a key in the hash, while the score acts as the sorted value in the tree. With this combination, entries can be accessed directly either by member or by score value.
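A hedged sketch of the two Sorted Set access patterns described above, using the redis-py client (version 3.x or later, where zadd takes a mapping); the key and member names are invented for the example and a local Redis server is assumed:

```python
import redis  # redis-py client

r = redis.Redis(host='localhost', port=6379)  # connection details are assumptions

# 'leaderboard' is a hypothetical key; each member carries a double score.
r.zadd('leaderboard', {'alice': 120.0, 'bob': 95.5, 'carol': 150.25})

# Access by member: direct lookup of the score, like a hash table.
print(r.zscore('leaderboard', 'bob'))            # 95.5

# Access by score: range query over the sorted side of the structure.
print(r.zrangebyscore('leaderboard', 100, 200))  # [b'alice', b'carol']
```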
VoltDB is a distributed SQL database intended for high-throughput transactional workloads on datasets that fit entirely in memory, where the data is automatically partitioned based on a column specified by the application developer. All data is stored in RAM. Periodic disk snapshots are used to back up data, and an on-disk recovery log provides crash durability. Data is replicated to at least n+1 nodes to tolerate n failures. Tables may be replicated to every node for fast local reads, or sharded for linear storage scalability. VoltDB determines where each record goes without requiring the user to specify it.
All data operations in VoltDB are single-threaded, running each operation to completion before starting the next. VoltDB partitions both the data and the work. For best performance, the database tables and associated queries need to be partitioned so that the most common transactions can run in parallel. Since each site operates independently, each transaction can be performed without the overhead of locking individual records, which usually consumes processing time in traditional databases. VoltDB balances the requirements of maximum performance with the flexibility to accommodate less intense but equally important queries that cross partitions [5].
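To make the partitioning idea concrete, the following is a purely illustrative Python sketch (not VoltDB code or its client API) of how records can be routed by hashing a partitioning column, with each partition then draining its work serially, one operation at a time:

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative; real deployments size this per cluster

# One ordered work queue per partition; a single worker drains each queue,
# so operations within a partition never need record-level locks.
partitions = {i: [] for i in range(NUM_PARTITIONS)}

def partition_for(key: str) -> int:
    """Map a partitioning-column value to a partition deterministically."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], 'big') % NUM_PARTITIONS

def submit(op_name: str, partition_key: str):
    """Route a single-partition transaction to exactly one partition."""
    partitions[partition_for(partition_key)].append(op_name)

submit('insert order 17', partition_key='customer-42')
submit('update balance', partition_key='customer-42')  # same partition, runs in order
submit('insert order 18', partition_key='customer-7')  # may run in parallel elsewhere

for pid, queue in partitions.items():
    for op in queue:  # each partition runs one operation to completion at a time
        print(f'partition {pid}: {op}')
```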
MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. In order to provide these features, MongoDB has a document-oriented data model that permits it to split data across multiple servers, balance the load across a cluster, redistribute documents automatically, and route user requests. If there is a need for more capacity, MongoDB allows the addition of new machines and can automatically move data to them. MongoDB doesn't require stored procedures, and the model and the stored data have the same structure, BSON, which is similar to JSON. MongoDB lacks a series of features that are common among traditional relational databases. The most notable are joins and complex multi-row transactions; however, the decision not to implement these features has led to greater scalability for MongoDB. The technology is widely used at this moment and has proved in recent years to be a fast and easy-to-use solution for handling Big Data.
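A small hedged sketch of the document model, using the official pymongo driver; the database, collection and field names are invented for the example, and a local mongod instance is assumed:

```python
from pymongo import MongoClient  # official MongoDB Python driver

client = MongoClient('localhost', 27017)  # connection details are assumptions
db = client['demo_db']                    # hypothetical database name

# Documents are stored as BSON; the Python dict below maps to it directly,
# so no separate schema or stored procedure is required.
db.orders.insert_one({
    'customer': 'acme',
    'items': [{'sku': 'A-1', 'qty': 2}],  # nested structure instead of a join
    'total': 34.50,
})

print(db.orders.find_one({'customer': 'acme'}))
```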

The Maturity Model for Big Data Solutions
Big Data management systems receive data from various sources, gathering a huge volume of data often represented by heterogeneous and diverse dimensionalities [3]. Different information collectors prefer their own schemata or protocols for data recording, and the nature of different applications also results in diverse data representations. Autonomous data sources with distributed and decentralized controls are a main characteristic of Big Data applications [6]. As a result, this analysis of some of the most popular Big Data technologies aims at supporting companies in selecting the proper solution for their needs. In an adaptive organization, measurement and analysis can be valuable tools for understanding the present environment and evaluating the effectiveness of our actions. Advances in Internet and mobile technologies have dramatically expanded the scope and rate at which certain types of information can be collected [7]. This paper proposes to analyze the maturity of some of the most used Big Data technologies. In order to achieve this goal, the analysis is performed considering the following model: development status (releases, solving time of issues, number of contributors to technology development), support libraries for building client interfaces, documentation available to users, and the offer of trainings and certifications.

Support Libraries
This chapter describes the support provided for developing client interfaces used to access the solutions. A client interface allows programs written in various programming languages to interact with the database by using native functions of the language. The client interface handles the requests and translates them into a standardized communication protocol with the database. MongoDB uses the term driver for a client library that manages the interaction with the database in a language appropriate to the respective application.
Besides the officially supported libraries, MongoDB also has two community-supported drivers, for Go and Erlang.
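As a concrete illustration of the translation step a client library performs, the hedged sketch below sends by hand the bytes that a Redis client would build from a native function call such as set('k', 'v'); Redis is used here only because its RESP wire protocol is simple and text-based, and a local server is assumed:

```python
import socket

# A driver call like r.set('k', 'v') is translated into the database's
# wire protocol. In RESP, the SET command is serialized as an array of
# bulk strings; this sketch performs that translation by hand.
request = b"*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$1\r\nv\r\n"

with socket.create_connection(('localhost', 6379)) as sock:  # assumed local server
    sock.sendall(request)
    print(sock.recv(64))  # b'+OK\r\n' on success
```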

Guidelines and Documentation
An analysis was performed on the documentation provided by each solution regarding installation, administration and other features.

MongoDB provides manuals for both the installation and the administration of the solution. The installation guidelines are given independently for Windows, Linux and OS X, both for MongoDB Community Edition and Enterprise. There are resources available on installation using MongoDB Cloud Manager. The administration documentation addresses the ongoing operation and maintenance of MongoDB instances and deployments. It includes both high-level overviews and tutorials that cover specific procedures and processes for operating MongoDB. Developers can find detailed instructions on configuration, maintenance, upgrading, monitoring or backup. A guide providing instructions on how to get started with MongoDB quickly is available in editions for the mongo Shell, Python, Java, Ruby, NodeJS, C++ and C#.

VoltDB comes as either prebuilt distributions or as source code. The installation documentation specifies the system requirements for running VoltDB, installation and upgrading steps, and a description of the resources included in the distribution kit. The administration guidelines are detailed, providing information on how to properly design, develop and run an application on VoltDB. The manual addresses the vast majority of instructions required to properly administer the database. There is documentation available regarding best practices in the area, a getting-started tutorial, performance and customization, the Java API and the client wire protocol.

Hadoop/HBase provides documentation about the latest release notes, native libraries and security mode. It also provides documentation for the architecture, commands and tools of the main components (HDFS, MapReduce, YARN), as well as detailed guidelines and documentation for integration with other systems, authentication methods and other useful tools.

Cloud computing is another paradigm, which promises theoretically unlimited on-demand services to its users. The virtualization of resources allows abstracting hardware, requiring little interaction with cloud service providers and enabling users to access terabytes of storage, high processing power, and high availability in a pay-as-you-go model [16]. While all the technologies analyzed in this paper are offered by the major Cloud providers, the deployment methods differ.

Licensing Policy
The technologies presented in this paper are distributed under the following licenses: The Apache License (ASL) is a permissive free software license written by the Apache Software Foundation (ASF). The license allows the user of the software the freedom to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software, under the terms of the license. Apache License 2.0 was released in January 2004 and brought easier usage for non-ASF projects, improved compatibility with GPL-based software, and allows the license to be included by reference instead of being listed in every file.

Table 1. Development status of Big Data solutions

Table 2. Support libraries for Big Data solutions

Redis: there are many client libraries supported by Redis; for certain languages, there are clients that Redis recommends for use. New clients can be developed and added to the Redis repository. The Redis documentation provides instructions on how to create a client.

Table 4. Certifications for Big Data solutions

Table 5. Cloud support for Big Data solutions

Table 6. Licensing policy for Big Data solutions

Redis is open source software released under the terms of the three-clause BSD license. The licensing model is subscription based. Redis Cloud is priced according to data capacity (both fixed and pay-as-you-go plans are available), whereas RLEC is licensed according to the number of shards (Redis processes) in a cluster. Redis Cloud offers a free-for-life tier that is limited to a single database of up to 30MB. RLEC is available as a free download that is only soft-limited by the number of shards.

A number of companies provide products which include Apache Hadoop, commercial support, and/or tools and utilities related to Hadoop:
- Amazon offers a version of Apache Hadoop on their EC2 infrastructure, sold as Amazon Elastic MapReduce;
- Apache Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem;
- Cloudera offers its enterprise customers a group of products and services that complement the open-source Apache Hadoop platform;
- DataStax provides a product which fully integrates Apache Hadoop with Apache Cassandra and Apache Solr in its DataStax Enterprise platform;
- IBM InfoSphere BigInsights brings the power of Apache Hadoop to the enterprise;
- Emblocsoft delivers an enterprise Hadoop edition based on Apache Hadoop to meet the demands of enterprise data processing: Big Data visualization and advanced analytics, real-time stream processing, machine learning at scale, and enterprise integration;
- DataTorrent is certified on Apache Hadoop and all leading distributions. The DataTorrent platform includes built-in fault tolerance and auto-scaling and can process billions of events per second [18].

VoltDB is available in both an open source and an enterprise edition. The open source, or community, edition provides basic database functionality with all the transactional performance benefits of VoltDB. The enterprise edition provides additional features needed to support production environments, such as high availability, durability, export integrations, and dynamic scaling. The scaling is done automatically, as opposed to the Community edition, where scaling must be performed manually. The Enterprise edition also gives the benefit of unlimited customer support [19].