1 Introduction

Big data has become a frontier topic for researchers, as it refers to rapidly increasing amounts of data gathered from heterogeneous devices (Chen and Zhang, 2014). Sensor networks, scientific experiments, websites, and many other applications produce data in various formats (Abouzeid et al., 2009). The tendency to shift from structured to unstructured data (Subramaniyaswamy et al., 2015) makes traditional relational databases unsuitable for storage. This inadequacy of relational databases motivates the development of efficient distributed storage mechanisms. Providing highly scalable, reliable, and efficient storage for dynamically growing data is the main objective in deploying a tool for big data storage (Oliveira et al., 2012). Thus, innovative storage systems with improved access performance and fault tolerance are required.

Big data has influenced research, management, and business perspectives and has drawn the attention of data solution providers toward the deployment of satisfactory technologies for big data storage (Sakr et al., 2011). Relational databases have been very efficient for large amounts of data in terms of storage and retrieval for many decades (Vicknair et al., 2010). However, with the advent of the Internet and the accessibility of technology to the public, the structure of data has turned towards being schema-less, interconnected, and rapidly growing. Apart from that, the complexity of data generated by web resources does not allow the use of relational database technologies for analyzing image data (Xiao and Liu, 2011). The exponential growth, lack of structure, and variety in types bring data storage and analysis challenges for traditional data management systems (Deka, 2014). Transformation of big data structures to relational data models, strictly defined relational schemas, and complex procedures for simple tasks are rigid features of relational databases (Hecht and Jablonski, 2011) that make them unsuitable for big data.

NoSQL technologies introduce flexible, schema-less data models and horizontal scalability (Gorton and Klein, 2015). These databases aim to ease the scalability and management of large-volume data (Padhye and Tripathi, 2015). NoSQL databases offer a certain level of transaction handling, making them adequate for social networking, e-mail, and other web-based applications. To improve the accessibility of data to users, data are distributed and replicated across more than one site. Replication not only supports data recovery in case of damage but also contributes to high availability when replicas are created at different geographic locations (Tanenbaum and van Steen, 2007; Turk et al., 2014). Consistency is another concern in distributed storage systems: when data have multiple copies, keeping every site up to date becomes more challenging. Brewer (2012) pointed out that preferring either availability or consistency is a common design objective for distributed databases, given that network partitions are rare.

To date, NoSQL technologies have been widely deployed and reported as surveys in the literature, yet the state of the art does not provide an in-depth investigation into the features and performance of NoSQL technologies. For instance, Sakr et al. (2011) presented a survey highlighting features and challenges of a few NoSQL databases to deploy on the cloud. Deka (2014) surveyed 15 cloud-based NoSQL databases to analyze read/write optimization, durability, and reliability. Han et al. (2011) described seven NoSQL databases under key-value, column-oriented, and document categories at an abstract level and classified them with the CAP theorem. Similarly, Chen et al. (2014) surveyed nine databases under the same three categories as described by Han et al. (2011) in the storage section of their survey. Another significant contribution in reviewing big data storage systems was made by Chen et al. (2014), who explained the issues related to massive storage, distributed storage, and big data storage. Their review also covers some well-known database technologies under key-value, column-oriented, and graph data models and categorizes them with Brewer's CAP theorem. However, these three studies did not cover a large number of NoSQL databases in the key-value, column-oriented, and document categories. Moreover, graph databases are not considered in these studies. In contrast, Vicknair et al. (2010) and Batra and Tyagi (2012) studied Neo4j, a graph database, in comparison with a relational database to observe full-text character searches, security, data scalability, and other data provenance operations. Zhang and Xu (2013) highlighted the challenges (i.e., volume, variety, velocity, value, and complexity) related to storage of big data over distributed file systems from a different perspective. However, that survey did not aim to report the performance of existing NoSQL databases for the described challenges. Many other comparative studies in the literature analyze the performance of some specific category or a limited set of features of NoSQL databases. However, the state of the art does not report any detailed investigation of a vast set of performance metrics covering a large number of storage technologies for big data.

Therefore, in this paper we highlight the features of distributed database technologies available for big data and present a more comprehensive review. We thoroughly study 26 NoSQL databases and investigate their performance. Moreover, we describe a number of recent, widely used storage technologies for big data under each data model (key-value, column-oriented, document, and graph), along with their licensing. In addition, we strengthen our analysis with a discussion of existing and recent trends in Brewer's theorem. In accordance with Brewer's recent explanation of distributed system characterization, we highlight each NoSQL database as either consistent or highly available. Thus, a key contribution of our work is to help big data analysts choose a storage option, from a wide variety of databases, with a better tradeoff between consistency and availability. Furthermore, this study helps researchers understand and leverage an optimum storage solution for their future research work.

Following are the key objectives of this paper: (1) to investigate storage structures of a wide range of technologies in a big data environment; (2) to highlight distinctive properties of each storage technology; (3) to develop the taxonomy and evaluate big data storage technologies according to the well-known Brewer theorem presented for distributed systems; (4) to identify the challenges and research directions for coping with big data storage in the future.

The rest of the paper is organized as follows: Section 2 describes the evolution of big data storage technologies and their distinctive features over relational databases. Contemporary storage technologies for big data are also detailed in Section 2. Section 3 presents the taxonomy and categorization based on adopted data model and licensing. Section 4 describes Brewer’s CAP theorem for distributed systems along with its new explanation. Storage technologies are investigated and analyzed to suggest a type based on Brewer’s categorization. Section 5 summarizes the discussion and highlights future research challenges. Section 6 concludes the discussion.

2 Evolution of big data storage technologies

In this section we discuss the technological shift from relational, well-structured databases to non-relational, schema-less storage technologies. The drivers and challenges arising from big data are summarized. Moreover, the prominent features of storage technologies designed for big data are highlighted, and the most commonly used big data storage technologies are elaborated upon.

Over the past few decades, relational databases have been used as well-structured data management technologies. They have been recommended for performing data management operations on structured data (Deagustini et al., 2013). Datasets such as the Internet Movie Database (IMDB, 2015) and MovieLens (MovieLens, 2015) can be manipulated using relational databases. Big data and emerging technologies like cloud computing allow data to be captured from interactive and portable devices in various formats (Kaisler et al., 2013; Chen and Zhang, 2014). These data come with new challenges of fast retrieval, real-time processing, and interpretation over a large volume (Kumar, 2014). Unfortunately, relational databases could not evolve as fast as big data. Moreover, their support for fault tolerance and complex data structures is not satisfactory for heterogeneous data (Skoulis et al., 2015). Furthermore, the schema of relational databases does not support frequent changes (Neo4j, 2015). Google, Amazon, and Facebook are some of the well-known web data repositories. Their processing and dynamic scalability requirements are beyond the capabilities of relational databases (Pokorny, 2013). Thus, continuously growing data with heterogeneous structures need a better solution. We summarize the comparison between relational databases and big data storage systems in Table 1 using a SWOT analysis.

Table 1 SWOT analysis of relational databases and big data storage systems

Big data is essential for enterprises to predict valuable business outcomes. To meet the challenges of big data, NoSQL (not only SQL) databases have emerged as enterprise solutions. NoSQL databases overcome the problems of relational databases and offer horizontally scalable, flexible, highly available, accessible, and relatively inexpensive storage solutions (MacFadden, 2013). Thus, NoSQL databases have become the most widely adopted technologies for storing big data. Unlike relational databases, these technologies support a large number of users interacting with big data simultaneously. NoSQL databases perform well in terms of consistency, fault tolerance, availability, and query support (Cattell, 2010). They also guarantee some distinctive features over relational databases: scalability, availability, fault tolerance, consistency, and secondary indexing.

2.1 Distinctions of big data storage technologies

It is very common to consider scalability, reliability, and availability as the design goals for a big data storage technology. However, it is also observed that consistency and availability influence each other in a distributed system, and one of them is compromised (Diack et al., 2013). Given the nature of big data, a single server is not a wise choice for storage; it is better to configure a cluster of multiple hardware elements as a distributed storage system. To discuss storage technologies for big data, the descriptions of features of distributed NoSQL systems provided in the literature are also significant. For this reason, we explain each of these features in turn. Among them, consistency, availability, and partition resilience are further considered to examine the applicability of Brewer's theorem to current big data storage technologies in Section 4.

Scalability refers to support for growing volumes of data in such a manner that a significant increase or optimization in storage resources is possible (Putnik et al., 2013). The paradigm shift from batch processing to streaming data processing shows that the data volume is continuously increasing. Referring to our previous work (Gani et al., 2015), the volume of future big data will be at the zettabyte scale, and storage requirements will increase accordingly. As far as the availability of a system is concerned, it suggests quick access to data storage resources (Bohlouli et al., 2013). For this purpose, data are replicated among different servers, which may be placed at the same location or at distant locations, to make the data highly available to users at their nearby sites, thus increasing big data retrieval efficiency (Azeem and Khan, 2012; Wang et al., 2015). In other words, minimum downtime and promptness of a system for ad hoc access requests define its availability (Oracle, 2015a).

Node failure is very common in distributed storage systems. To make such systems fault-tolerant, multiple copies of data are created and placed on the same node and/or different nodes of the storage cluster. Replication not only makes the system highly available but is also useful for fault tolerance (Hilker, 2012). Furthermore, NoSQL databases offer very flexible schemas and data relationships and are not as complex as relational databases (Kumar, 2014). Nowadays, data do not comprise only tuples; documents and objects are also part of big data. Therefore, a predefined schema cannot deal with varying data structures (Cattell, 2010).

Regardless of the distribution, storage systems for big data ensure that the data are complete and correct. Changes made by users are committed under defined rules (Oracle, 2015a). Eventual consistency has become a widely adopted mechanism to implement consistency in NoSQL databases: changes are propagated eventually, and the system becomes consistent after the changes have been propagated. Instantaneous propagation, in contrast, yields a strongly consistent system, yet it results in frequent access locks. In addition to fault tolerance and consistency, indexing is worthwhile for big data storage. Indexing is a method that improves the performance of data retrieval. In relational databases, primary keys are sufficient to perform search operations (Oracle Secondary, 2015). However, big data comes with new challenges of heterogeneity in data structures, so primary key indexing alone is not the solution. Secondary indexes are usually created automatically by the system on keys other than the primary key.
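The propagation behavior described above can be pictured with a toy model. The following Python sketch is a simplified illustration only, not tied to any particular database: replicas accept a write locally and push it to peers later, so reads on other replicas may return stale values until propagation completes.

```python
class Replica:
    """A toy replica that applies writes locally and propagates them later."""
    def __init__(self, name):
        self.name = name
        self.store = {}       # key -> value visible to local reads
        self.pending = []     # writes not yet pushed to peers

    def write(self, key, value):
        self.store[key] = value            # commit locally right away
        self.pending.append((key, value))  # defer propagation (eventual consistency)

    def read(self, key):
        return self.store.get(key)         # may be stale on other replicas

    def propagate(self, peers):
        for key, value in self.pending:
            for peer in peers:
                peer.store[key] = value    # peers converge once changes arrive
        self.pending.clear()

r1, r2 = Replica("r1"), Replica("r2")
r1.write("user:42", "Alice")
print(r2.read("user:42"))   # None -> r2 has not yet seen the write
r1.propagate([r2])
print(r2.read("user:42"))   # 'Alice' -> replicas are now consistent
```

Strong consistency would correspond to propagating inside write() while blocking concurrent access, which is exactly the access-lock cost mentioned above.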

2.2 Contemporary big data storage technologies

Research on storage technologies for big data addresses different aspects of storage-mechanism design. Reliable and highly available mechanisms contribute to improving data access performance, and improved data access performance drives better quality of data analysis. These technologies offer scalable storage solutions for growing big data with enhanced data structures and support for fault tolerance. This section provides a brief explanation of storage systems for big data in each category. Current storage technologies are described to review how they accomplish these design goals and how feasible they are for big data. Their storage structures and outstanding features supporting scalable resource provisioning for big data are also described here.

The Google File System (GFS) is a proprietary system developed by Google Inc. (Ghemawat et al., 2003) to manage data-intensive applications in a distributed manner. Its principal objective is to satisfy the storage needs of steadily growing data while retaining the features provided by contemporary techniques. Current and expected future workloads were analyzed to develop such a distributed file system. To deal with commodity component failures, GFS provides continuous monitoring, error detection, fault tolerance, and automatic recovery. GFS adopts a clustered approach that divides data chunks into 64-KB blocks and stores a 32-bit checksum for each block. As shown in Fig. 1, these checksums are stored on servers as part of the metadata to ensure integrity. Moreover, the chunks are replicated to avoid chunk server faults and to ensure availability and reliability (Dean and Ghemawat, 2008). Such systems are designed to handle large-volume data, so large numbers of kilobyte-sized files become a challenge for them. However, GFS also manages these small files and supports concurrent appends to large files, even under read/write-intensive workloads.

Fig. 1 Google File System (GFS) architecture (Ghemawat et al., 2003)
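As a rough illustration of the per-block checksumming idea described above (a sketch only, not Google's implementation), a chunk can be split into 64 KB blocks with a 32-bit CRC kept per block so that corruption is detected when the data are read back:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as in the GFS description above

def block_checksums(chunk: bytes):
    """Return a 32-bit CRC for every 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify(chunk: bytes, checksums):
    """Recompute checksums on read and compare against the stored ones."""
    return block_checksums(chunk) == checksums

chunk = b"x" * (3 * BLOCK_SIZE)                         # a toy 192 KB chunk
stored = block_checksums(chunk)                         # kept with the metadata
corrupted = chunk[:BLOCK_SIZE] + b"y" + chunk[BLOCK_SIZE + 1:]
print(verify(chunk, stored))       # True
print(verify(corrupted, stored))   # False -> corruption detected
```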

The Hadoop Distributed File System (HDFS) was developed taking inspiration from GFS. HDFS is a distributed, scalable storage system designed as the core of Apache Hadoop to run on inexpensive commodity hardware; it was initially designed as the infrastructure of Apache Nutch. HDFS is a suitable solution for data-intensive applications, typically at gigabyte to terabyte scales, which require high throughput. Because HDFS comprises a large number of components, there is a non-negligible probability of block failure or non-functioning components; HDFS therefore provides quick fault detection and automatic recovery (Borthakur, 2008). Block replication is offered to avoid node failure and unavailability or loss of data (Shvachko, 2010). Replication ensures not only the availability but also the reliability of the system, and it is handled automatically by the HDFS NameNode. Rather than being just the storage layer of Hadoop, HDFS is a standalone distributed file system that helps improve the throughput of the system. HDFS has a namespace-separated architecture. Metadata are stored on the master node, called the NameNode, whereas block-split files are stored on a number of DataNodes. The NameNode maps data onto DataNodes and performs namespace operations such as opening, closing, and renaming files. DataNodes fulfill read/write requests and create blocks and replicas. The architecture of HDFS is shown in Fig. 2.

Fig. 2 Hadoop Distributed File System (HDFS) architecture (Borthakur, 2008)
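The namespace-separated architecture can be pictured with a small conceptual model (an illustrative sketch only; real HDFS block placement and sizes are far more involved): the NameNode keeps the file-to-block mapping and the block-to-DataNode replica map, while DataNodes hold the actual block bytes.

```python
import itertools

BLOCK_SIZE = 128          # toy block size; real HDFS defaults are 64/128 MB
REPLICATION = 3           # default HDFS replication factor

datanodes = ["dn1", "dn2", "dn3", "dn4"]
namespace = {}            # NameNode metadata: file path -> list of block ids
block_locations = {}      # NameNode metadata: block id -> replica DataNodes
block_ids = itertools.count()

def put(path: str, data: bytes):
    """Split a file into blocks and record replica placement (the NameNode's job)."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = next(block_ids)
        blocks.append(block_id)
        # round-robin placement of REPLICATION copies across DataNodes
        block_locations[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                                     for r in range(REPLICATION)]
    namespace[path] = blocks

put("/logs/app.log", b"a" * 300)
print(namespace["/logs/app.log"])   # e.g. [0, 1, 2]
print(block_locations[0])           # e.g. ['dn1', 'dn2', 'dn3']
```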

BigTable is another product developed by Google Inc. (Chang et al., 2008) to provide flexible, high-performance storage for large-scale structured data spread over a large number of commodity servers. BigTable is a highly adaptable, reliable, and applicable storage system for managing petabyte-scale data on thousands of machines, and it has proven to be a high-performance, highly available data store. Therefore, many Google applications and projects with differing throughput or latency requirements use BigTable under different workloads. Moreover, BigTable provides dynamic control over data placement, representation, indexing, and clustering.

HBase is a column-oriented data store developed by Apache (Apache Software Foundation, 2015). HBase implements the BigTable storage model and is designed to handle the storage needs of big data in the Apache project (George, 2011). HBase provides scalable, distributed, fault-tolerant, and random read/write-oriented access to big data on top of Hadoop and HDFS (Fig. 3) (Taylor, 2010). HBase uses the underlying HDFS from Hadoop to store table data. Its master component assigns regions to RegionServers and keeps track of these servers. RegionServers receive region information and manage read/write requests from clients on different regions (Khetrapal and Ganesh, 2006). As far as concurrency is concerned, HBase has remarkable support for read-intensive transactions (Ruflin et al., 2011). HBase is more concerned with distribution and uses a clustered approach for data management, which is why it is a potential solution for a remarkably large number of rows of data.

Fig. 3 Two implementations of BigTable (George, 2011; MongoDB, 2015)
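To make the row/column-family model concrete, the following hedged sketch uses the third-party happybase Python client against a local HBase Thrift gateway; the table name, column families, and connection details are illustrative assumptions, not taken from the cited sources.

```python
import happybase

# Assumes an HBase Thrift server is reachable on localhost:9090 (hypothetical setup).
connection = happybase.Connection('localhost', port=9090)
connection.create_table('web_pages', {'content': dict(), 'meta': dict(max_versions=3)})

table = connection.table('web_pages')
# Rows are keyed byte strings; cells are addressed as 'family:qualifier'.
table.put(b'com.example/index', {b'content:html': b'<html>...</html>',
                                 b'meta:fetched': b'2015-06-01'})
row = table.row(b'com.example/index')
print(row[b'meta:fetched'])
```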

Hypertable is another implementation of BigTable (Hypertable, 2015). Hypertable is a distributed database that provides very good support for the consistency of stored data (Han et al., 2011). It is designed to run on, and is compatible with, many distributed file systems such as GFS, HDFS, and CloudStore. Hypertable stores data in the form of tables and splits the tables to achieve distribution and scalability. Fig. 3 describes the relationship between the components of the Hypertable system. Hyperspace works as a lock manager and ensures high availability and consistency through the distributed configuration of replicas using a distributed consensus protocol. Client data transfer is unaffected if the Master becomes unresponsive for a short period. It is therefore preferable to have multiple Masters to meet high-availability requirements (Khetrapal and Ganesh, 2006).

MongoDB is an open-source NoSQL database designed by MongoDB, Inc. as a highly available, scalable, and fault-tolerant document-oriented solution (MongoDB, 2015). MongoDB combines characteristics of MySQL with the JSON data model (Ruflin et al., 2011). Therefore, MongoDB offers the same horizontal scalability, ease of agile development with dynamic schema support for all kinds of document data, and efficient manageability as MySQL. Furthermore, indexing, dynamic and ad hoc queries, aggregation, and dynamic updates are some of the MySQL facilities adopted by MongoDB with slight modification. MongoDB stores documents in a binary representation called BSON, which makes it easy to map data from applications to the database (Kristina and Michael, 2010). Moreover, robustness and availability of data are achieved with a replica set, and multiple replicas are available when the primary server is not responding. MongoDB supports storing and combining data of multiple structures while indexing remains feasible (Banker, 2011).
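As an illustration of the document model, dynamic schema, and secondary indexing described above, a minimal pymongo session might look as follows; the connection string, database, and field names are assumptions made for the example.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local deployment
db = client["shop"]

# Documents with different shapes can live in the same collection (dynamic schema).
db.users.insert_one({"name": "Alice", "email": "alice@example.com", "age": 30})
db.users.insert_one({"name": "Bob", "interests": ["hiking", "nosql"]})

# Secondary index on a non-key field, plus an ad hoc query and a dynamic update.
db.users.create_index("email")
print(db.users.find_one({"name": "Bob"}))
db.users.update_one({"name": "Alice"}, {"$set": {"age": 31}})
```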

Terrastore is another document-based, open-source, distributed storage system, developed by Terracotta, Inc. (Terrastore, 2015). Terrastore offers high scalability as well as consistency and dynamic clustering during execution. Terrastore is built on top of the in-memory storage Terracotta. To provide scalability, Terrastore supports not only the deployment of a single cluster but also multiple clusters. Furthermore, load balancing of data storage over nodes is performed automatically when nodes connect to or disconnect from the clusters (Bossa, 2009). Terrastore is optimized for consistency, as it performs consistent hashing for partitioning and does not create replicas. However, data are redistributed when nodes join or leave the cluster. Like DynamoDB, Voldemort, and Cassandra, the Terrastore node nearest to a client forwards the client's requests to the serving node, so the client does not need this routing knowledge. Although the built-in support for data distribution and partitioning makes Terrastore better than Memcached, Terrastore has more distribution overhead than Redis, which lowers its performance (Alex and Ana-Maria, 2009). Consequently, Terrastore is not considered a good approach for large amounts of document data (Bossa, 2009).

HyperGraphDB is an implementation of the hypergraph model developed as an open-source graph database for artificial intelligence and semantic web projects (Iordanov, 2010). This graph database uses Berkeley DB for storage and not only stores arbitrary data but also supports data mapping between the host language and storage. In addition, HyperGraphDB provides an essential customizable indexing feature so that efficient data retrieval and graph traversal are achieved. HyperGraphDB uses a key-value mechanism to store graph information, with nodes and edges serving as keys (Dominguez-Sal et al., 2010). In contrast to HBase, Hypertable, Redis, and other storage systems with master-slave access architectures, HyperGraphDB implements a peer-to-peer data distribution mechanism. Each peer works independently, and updates are performed asynchronously and eventually (HyperGraphDB, 2010).

InfiniteGraph is a distributed object database designed for graph data that offers high scalability as well as efficient complex search and traversal (InfiniteGraph, 2014). In InfiniteGraph, data and computation load are distributed among storage locations and runtime-configurable zones. To achieve scalability, InfiniteGraph implements a model-based technique. Moreover, an InfiniteGraph database facilitates the creation of a custom model for frequently accessed elements so that query performance is improved. These frequently accessed elements can further be separated from other elements to handle locks. Along with custom models, the indexing framework of InfiniteGraph also contributes to better query performance (Objectivity, Inc., 2012). InfiniteGraph database users know that their growing graph data are stored with some schema, and they can easily perform normalization and other presentation operations on their data (Fulton, 2011). Furthermore, because its model is deployed on the cloud, development and testing of databases on InfiniteGraph is free of charge.

Rocket U2 is a document-oriented, proprietary database management suite that offers two associated multivalue database platforms: UniData and UniVerse (RocketSoftware, 2015). Rocket U2 ensures availability of the system after any disconnection, which makes it reliable and robust, with immediate recovery from any downtime (RocketSoftware, 2014a). According to a case study (RocketSoftware, 2014b), Rocket U2 provides a scalable and flexible system where changes can be made at any time according to requirements. This flexibility allows easy updating of applications, and the database does not need to be rebuilt when changes occur. Multivalue products integrate many pieces of code, which reduces developers' coding effort. Furthermore, Rocket U2 databases sustain their response-time performance even as the data size grows. However, support for data partitioning is not available with Rocket U2.

Scalaris is a scalable, distributed, and highly available key-value store designed to fulfill the intensive read/write requirements of enterprises (Schütt et al., 2008). Under concurrent access and intensive transactions, node failures are common. However, Scalaris manages to maintain consistency for critical write operations by implementing a distributed transaction protocol and a structured overlay network in Erlang. The structured overlay network is implemented as the first layer of the Scalaris architecture and provides support for range queries. The architecture of Scalaris also applies replication to make the system highly available and fault-tolerant. Regardless of the challenges of write-intensive transactions, the Paxos consensus protocol helps ensure consistency with reduced overhead (Scalaris, 2015).

To achieve high performance, custom data management, and scalability for applications, Berkeley DB provides embedded key-value database support via libraries. Although Berkeley DB offers an open-source license, this license is restricted to the development of open-source applications; for third-party distribution, Berkeley DB offers commercial licensing. Berkeley DB was initially licensed by Sleepycat Software and is now owned by Oracle. Products developed using Berkeley DB have simple data access and management procedures. Furthermore, due to customized configuration with no overhead, these products fulfill varying development needs: local or distributed storage, storage capacity configuration, and many other options are available when Berkeley DB is used (Oracle, 2015b). As far as other features related to data distribution are concerned, Berkeley DB ensures high data access throughput, non-blocking consistency, reliability, fault tolerance, and availability. Like relational databases, Berkeley DB has good support for efficient data access, transactions, and recovery mechanisms (Seltzer and Bostic, 2015). However, these relational-database-like features are packaged in libraries, and developers are given configuration choices according to their requirements. Regardless of the features provided by Berkeley DB, it is notable that Berkeley DB does not support synchronization with other databases. Moreover, data partitioning is not facilitated by Berkeley DB (Oracle, 2015c).

DynamoDB is a widely used, fully distributed, schema-free NoSQL database tool by Amazon (Sivasubramanian, 2012). DynamoDB is predominantly used for storing unstructured, variable, and scalable data (Baron and Kotecha, 2013). DynamoDB offers infinite scaling capacity for data storage and access rate. It is applicable when frequent updates on distributed sites, efficient indexing, and adaptive scalability are required. Availability and durability can be achieved through automatic replication over multiple zones in a region (Niranjanamurthy et al., 2014). Regardless of the request size, DynamoDB ensures stable performance and visible resource utilization. Administrative tasks, such as hardware provisioning, setup and configuration, replication, software patching, and scaling, can be offloaded to Amazon Web Service (AWS). DynamoDB is an efficiently managed database; that is why it is suitable for scalable applications with exponentially growing data (Vyas and Kuppusamy, 2014).
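A hedged sketch of how an application might interact with DynamoDB through the AWS SDK for Python (boto3); the table name, key schema, attribute names, and region are assumptions made for illustration, and provisioning details vary by deployment.

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # assumed region

# Provisioning, replication, and scaling are handled by AWS behind this table.
table = dynamodb.create_table(
    TableName="Sessions",
    KeySchema=[{"AttributeName": "session_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "session_id", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Schema-free items: only the key attribute is fixed, other attributes can vary.
table.put_item(Item={"session_id": "abc123", "user": "alice", "cart_items": 3})
response = table.get_item(Key={"session_id": "abc123"})
print(response["Item"])
```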

Qualcomm has developed a document-oriented database named Qizx for large-volume, high-performance, and transaction-intensive XML data (Qualcomm, 2014a). Qizx is well suited for enterprises with fast data access and retrieval requirements. For this purpose, Qizx has automatic full-text and customizable indexing capabilities, which make it efficient in query responses. In addition, support for terabyte-scale data and other distribution-related features were taken into consideration in designing this database. Qizx's 'shared-nothing' clustering architecture allows each node or host in the cluster to fully replicate the database to achieve high availability (Qualcomm, 2014b). These shared-nothing hosts serve client read requests, while one dedicated server is responsible for update operations. Consequently, a consistent, fault-tolerant, and up-to-date replicating database system is achieved.

Neo4j is an industry-leading, broadly used, schema-less graph database that is effectively replacing relational databases (Neo4j, 2015). By analyzing complex data relationships with Neo4j, enterprises gain valuable benefits. In terms of scalability (Montag, 2013), high transaction load, concurrency, and performance for read-request workloads, Neo4j not only competes with other graph databases but also improves on its older versions (Zicari, 2015). Support for write-intensive transactions without blocking is achieved with the help of buffering. In spite of its benefits, setting up a production-ready, reliable database with Neo4j is time consuming.

For real-time web applications, RethinkDB is the first open-source document database that supports complex queries efficiently (RethinkDB, 2015). RethinkDB departs from traditional storage systems and implements an append-only, nested data storage structure, which makes it more consistent, robust, and easily replicable (Walsh et al., 2009). RethinkDB offers easy scalability and fast responses to clients' real-time queries. Although RethinkDB also uses the JSON data model, its query processing is slower than that of MongoDB. However, RethinkDB has a real-time architecture and, due to its append-only storage structure, incurs less scaling cost than MongoDB. Furthermore, its advanced query language and administration tools are better than those of MongoDB. Despite these benefits, RethinkDB does not perform very well for compute- and write-intensive applications.

Aerospike is the first flash-optimized, open-source, key-value data store for real-time data, offering scalability and reliability at very low cost (Aerospike, 2015). Aerospike has a 'shared-nothing' architecture, which supports petabyte-scale volumes of data with reliability and linear scalability (Aerospike, 2012). Similar to Cassandra, a column-oriented database, Aerospike is designed to provide scalability, replication, automatic recovery from node partitions, and high availability. However, Aerospike is better than Cassandra in terms of read and write workloads, consistency under jitter, and the cost of read and write operations.

OrientDB is the first multi-model, open-source, highly scalable database for document data, with an extended, transparently managed graph layer that provides connections between records as relationships (OrientDB, 2015). The embedded graph layer makes OrientDB not only persistent and compact but also fast in data traversal and in managing data relationships without increasing cost. The multi-model platform of OrientDB makes it categorizable as a document-oriented graph database. Regardless of the volume of data, OrientDB offers distinct speed in data storage as well as in read and write transactions. To ensure performance and scalability, clustering and replication over heterogeneous servers are significant benefits achieved by OrientDB (SD Times Newswire, 2013). OrientDB has many features similar to the Couchbase document store, which is derived from CouchDB: both have a schema-free model, support for secondary indexes, sharding for data partitioning, and a master-master replication model. However, OrientDB has additional features, such as a multi-model approach with support for all operating systems, which are not provided by Couchbase.

AllegroGraph is a graph database designed for RDF semantic web applications. AllegroGraph ensures persistency, efficient memory utilization, and high scalability over millions of quads (AllegroGraph, 2015). Furthermore, automatic management of compute and storage resources in AllegroGraph is a key performance fundamental. AllegroGraph offers not only enterprise and commercial versions but also an open-source product; however, the open-source version is limited in scalability. A hybrid data representation, covering both the relational model and the triple store, optimizes range query performance in AllegroGraph (Aasman, 2008).

Redis is an alternative open-source cache and key-value store for big data, which provides efficient data structures for indexing to speed up query operations and responses (Carlson, 2013). The functionality of Redis is somewhat similar to that of Scalaris. Redis has considerable support for replication in a master-slave environment. Its support for multiple data structures makes it a predominant choice for frequent data access situations. Linux is the recommended operating system for deployment, as Redis is developed and tested on the Linux platform (Excoffier and Lischer, 2010). When scalability requirements grow and systems run out of RAM, the Aerospike in-memory store is preferable to Redis. Redis is much closer to MemcacheDB in fast data access, as the whole dataset resides in memory. However, Redis has other powerful features such as built-in persistency and support for more data types. Unlike MemcacheDB, Redis is persistent as a real database, and data will not be lost on restart. Furthermore, support for more data types is a distinctive property of Redis. Thus, Redis becomes the best choice when high scalability, heterogeneity in platforms, servers, and applications, and in-memory data are required.
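A brief sketch with the redis-py client illustrates the plain key-value access and the richer built-in data structures mentioned above; the host, key names, and values are illustrative assumptions.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)   # assumed local instance

# Plain key-value access with an expiry, typical for session caching.
r.set("session:42", "alice", ex=3600)
print(r.get("session:42"))                 # b'alice'

# Richer built-in data structures: hashes, lists, sorted sets.
r.hset("user:42", mapping={"name": "Alice", "plan": "premium"})
r.rpush("recent:42", "item-1", "item-2")
r.zadd("leaderboard", {"alice": 120, "bob": 95})
print(r.hgetall("user:42"))
print(r.zrange("leaderboard", 0, -1, withscores=True))
```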

Voldemort is a general-purpose, distributed data storage solution for large-scale data (Sumbaly et al., 2012; Voldemort, 2015). The routing module of Voldemort is responsible for performing data clustering and replication and follows the same procedure as DynamoDB. Voldemort provides eventual consistency: read/write operations can be performed on any node, and the view of the data may be inconsistent only for a very short period (Bunch et al., 2010). Like MemcacheDB, Voldemort achieves data persistence using Berkeley DB, and the replication factor for distributed hash table entries can be set by the developer.

KAI is a distributed, key-value-based data store that is developed as an implementation of Amazon’s Dynamo. KAI provides high scalability and availability to websites with support to a variety of data. The name KAI is derived to elaborate its good organization of clusters where data are stored and processed in parallel in a distributed, replicated manner (SourceForge, 2015). Furthermore, load balancing, configurable scalability of nodes, fast configuration, and reliability through replication are the basic performance gains of KAI. Although KAI offers eventual consistency, the implementation of the quorum protocol helps achieve strong consistency.

Cassandra is a decentralized and highly available key-value storage system for structured data that spans a large number of underlying data centers (Hewitt, 2010; Lakshman and Malik, 2010). It provides scalability, instance storage, and improved performance for frequent read/write requests. Data consistency is achieved in Cassandra through periodic updates on replicating sites (Baron and Kotecha, 2013). In contrast to HBase, Cassandra provides significant support for write-intensive transactions (Ruflin et al., 2011). Most significantly, Cassandra persistently deals with component failures in such large-scale systems and achieves overall reliability and scalability. The schema can be built at any time, as Cassandra provides flexibility. Clustering, partitioning, and replication, together with fault tolerance, reduced latency, and scalability, are the prominent features of Cassandra (Lakshman and Malik, 2010; Abramova and Bernardino, 2013). It is a reasonable choice for continuously growing systems such as Facebook, which has deployed Cassandra where scalability depends linearly upon the number of users and a write-intensive system is needed (Armbrust et al., 2009).
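The following sketch uses the DataStax Python driver to show the write-oriented, flexible-schema style of access described above; the keyspace, table, replication settings, and contact point are assumptions for the example, not prescriptions.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumed single-node test cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        user_id text, ts timestamp, action text,
        PRIMARY KEY (user_id, ts))
""")

# Writes are cheap; columns can be added later with ALTER TABLE as the schema evolves.
session.execute(
    "INSERT INTO demo.events (user_id, ts, action) VALUES (%s, toTimestamp(now()), %s)",
    ("user-1", "login"),
)
for row in session.execute("SELECT * FROM demo.events WHERE user_id = %s", ("user-1",)):
    print(row.action)
```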

SimpleDB is an open-source, document-oriented database that is available as an Amazon service (Sciore, 2007). Without database administration load, SimpleDB ensures high availability and durability of data with automatic geographic replication. Furthermore, the data model is flexible, and automatic indexing is performed on the data. Thus, automatic provisioning of database administration makes application development simple via SimpleDB. Despite the ease of data management provided by SimpleDB, scalability is limited to 10 GB (Habeeb, 2010).

MemcacheDB is an open-source key-value store (Tudorica and Bucur, 2011; MemcacheDB, 2015) developed for fast and reliable storage and access of objects and data. It is built on Berkeley DB as a general-purpose database, deriving many Berkeley DB features such as transactions and replication. MemcacheDB is a persistent database for dynamic, database-driven web applications (Coburn et al., 2011). Memcached users can access MemcacheDB through any API, as it uses the Memcached protocol. Because it implements the Memcached API, a caching system for fast in-memory access, it is easy to provide data persistency, as caching is already performed by Memcached (Helmke, 2012). MemcacheDB applies a master-slave approach to achieve data persistency, where data read requests can be fulfilled by any site but write operations are performed only at the master node. It operates with a single master and multiple replicating nodes. It is flexible enough to allow any node to become a replicating node, but this flexibility poses a security challenge (Bunch et al., 2010).

CouchDB is an open-source, distributed, scalable document database developed for performing data operations and administration on the web (Anderson et al., 2010; Apache, 2015). In the case of intensive access requests, CouchDB adopts a sophisticated concurrency mechanism that does not let the system fail; however, such intensive workloads cause delays in overall responses. CouchDB supports dynamic data structures, which makes it possible to define the schema as required. Regardless of the wide adoption of CouchDB in web application development, developers are advised to check its feasibility against their requirements (Wenk and Slater, 2014). Like MongoDB, CouchDB also implements the JSON data model, which helps CouchDB support semistructured data. Furthermore, CouchDB allows the storage of any kind of data as documents (Ruflin et al., 2011).

Riak is a key-value database that provides high availability and low-cost scalability (Sheehy, 2010). Riak provides simpler data models and data conflict resolution. Riak is an advanced form of key-value database and is sometimes categorized as a document-based database. Although Riak has more functionality than other key-value databases, it lacks the features of document-based databases (Cattell, 2010); therefore, it is better to categorize Riak as a key-value database. Compared with Redis, Riak has a less complex key construct, which is a feature of document-based databases (Ruflin et al., 2011). This simpler key construct makes the data model of Riak flexible, and changes in the data model do not affect the indexes.

From an analytics perspective, tensor-based models are widely adopted to represent arbitrary relationships in data entities and to traverse graphs (Cichocki, 2014). Big data has high dimensionality rather than two-tuple relationships and brings the challenge of analyzing such multi-relational datasets. Therefore, tensor-based models are being used for big data representation. SciDB (Cudré-Mauroux et al., 2009), an implementation of the tensor model, is designed to manage multidimensional data as an array database, which creates chunks of arrays for parallel processing. SciDB supports many data manipulation operations such as linear algebra operators for chunk-based N-dimensional arrays. Attributes and dimensions of data are flexible and can be added or removed any time. SciDB is an append-only array database for scientific data, and the changes or corrections made in data are stored as new versions (Stonebraker et al., 2013).

In this section, we have provided a comprehensive account of current storage technologies for big data. Table 2 provides a brief summary of the above-discussed storage technologies, highlighting their design objectives and characteristics; its columns give the name of each storage technology, the vendor, and the design goals. Development of distributed storage mechanisms, which partition the data in a NoSQL database format, provides scalability and efficient, fault-tolerant support for read/write operations on the data.

Table 2 Summary of big data storage technologies

Table 2 provides a summary of big data storage technologies. It can be seen that most of the storage technologies are designed to ensure scalability and consistency in general. However, each storage system also emphasizes specific features such as intensive workloads, high availability, data access performance, or complex query operations. We provide a comparative table identifying the application areas of each storage technology in Table 3. The following section highlights these features to categorize the above-discussed storage systems. A taxonomy of big data storage technologies based on the adopted data model is presented and detailed in the next section.

Table 3 Comparison and application areas of storage technologies

3 Taxonomy of big data storage technologies

Big data storage technologies described in the previous section are categorized in this section according to their data model and licensing. There are four types of data model (key-value, column-oriented, document-oriented, and graph), whereas licensing has three categories (open source, proprietary, and commercial). We classify the storage technologies and develop a taxonomy based on their data models and licensing. Additionally, we elaborate on tensor decompositions and networks as representation and analysis models for multi-relational big data stored using graph-based technologies. Scalability and fast data retrieval are the main expectations of a storage system in this regard.

The different data models, however, offer different levels of support for heterogeneity. The data models are described below, and example cases for each data model are presented in Fig. 4.

Fig. 4 Examples of data models: (a) key-value; (b) column-oriented; (c) document-oriented; (d) graph

Key-value: Key-value databases are developed for big data storage and store not only structured but especially unstructured data in the form of a key and the corresponding value for each data record (Pokorny, 2013). These databases have a storage structure suitable for the continuously growing, inconsistent values of big data for which fast query responses are required. Key-value databases support large-volume data storage and concurrent query operations. Unlike block storage systems, the key-value data structure stores data in small, easily configurable objects instead of blocks (DeCandia et al., 2007). In key-value databases, the values in records may differ or have different representations. In this way, this structure offers lower memory consumption and the flexibility of easily adding more records. The appropriate scenario for a key-value database is when searching for more than one feature from records is needed. The most common applications of key-value storage are session information management for online games, online shopping, and other web applications where an enormous number of small-sized records have to be managed.

Column-oriented: This category of NoSQL databases is suitable for vertically partitioned, contiguously stored, and compressed storage systems. Column-oriented databases store columns of data separately, unlike traditional storage where data are stored in the form of complete records. Data reads and attribute retrieval in such systems are quite fast and comparatively inexpensive, as only the relevant column is accessed and processes execute concurrently for each column (Abadi et al., 2009; Hu and Dessloch, 2014). Column-oriented databases are highly scalable and eventually consistent. They provide applications with reliable and highly available storage (Dharavath and Kumar, 2015). The application areas of column-oriented databases are customer record analysis, data warehousing, patient data management, library systems, and wherever analysis is required to aggregate similar data items. Furthermore, with a column-oriented structure it is convenient to add new attributes to all rows. For example, an online shopping website can aggregate the most viewed or most ordered items in a specific time span, the regions where online shopping is trending, and the success rate of online trading in a year.

Document-oriented: The document-oriented data model is similar to the key-value structure and stores data in the form of a key and a value that references the document. However, document databases support more complex queries and hierarchical relationships. This data model usually implements the JSON format and offers a very flexible schema (Kristina and Michael, 2010). Although the storage architecture is schema-less for structured data, indexes are well defined in document-oriented databases; SimpleDB is the only database that does not offer explicitly defined indexes (Cattell, 2010). Document-oriented databases extract metadata to be used for further optimization and store it as documents. Use cases for document-oriented databases include user profiles on social networks, analysis of websites, and complex transactional data applications. On Facebook, for example, users may provide little or extensive information on their profiles, and each user profile is stored as a document. The performance of the heuristic schema offered by a document-oriented data storage model depends upon the user inputs and the nature of the queries.

Graph: Graph databases are the best choice for storing data along with relationships. Graph databases offer persistent storage of objects and relationships and support simple, understandable queries with their own syntax (Zicari, 2015). Modern enterprises are expected to implement graph databases for their complex business processes and interconnected data, as this relational data structure offers easy data traversal (Neo4j, 2015). High-frequency trading systems and recommendation systems prefer graph databases to achieve low latency. For instance, getting recommendations from customer feedback data on a commerce website requires self-joined, multilevel queries in traditional databases, which is a very complex operation. In contrast, for a graph database this data manipulation is as simple as two lines of query code and does not affect the structure of the data, as sketched below.
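The two-line flavour of graph queries mentioned above can be sketched with the Neo4j Python driver and Cypher; the connection details, labels, and relationship types are assumptions made for illustration, not drawn from the cited sources.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# "Products bought by customers who also bought what this customer bought":
# a multi-level self-join in SQL, but a single pattern match in Cypher.
query = (
    "MATCH (c:Customer {id: $cid})-[:BOUGHT]->(:Product)<-[:BOUGHT]-(:Customer)"
    "-[:BOUGHT]->(rec:Product) "
    "RETURN rec.name AS recommendation, count(*) AS score ORDER BY score DESC LIMIT 5"
)

with driver.session() as session:
    for record in session.run(query, cid="c-42"):
        print(record["recommendation"], record["score"])
driver.close()
```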

Tensor-based models: An interesting representation of graph-oriented big data is given by tensor-based models, which are widely adopted to represent arbitrary relationships among data entities and to traverse graphs (Cichocki, 2014). Dealing with the high-order relationships of big data is significant in many areas of information processing, such as topic modeling, social network analysis, and recommender systems. Graph-oriented systems are well suited for storing hetero-dimensional data; however, for analytical purposes, tensors are gaining attention and open-source libraries are being built around them. Lorica (2015) stated that it is useful to capture and analyze the complex relationships in data to extract more accurate information. Although some efficient methods exist to analyze hidden variables and highly relational neural networks, the performance of tensors stands out; scalability, accuracy, and timeliness are the reasons to prefer tensors over other methods.
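As a small illustration of the tensor representation discussed here (a generic NumPy sketch, not tied to SciDB or TensorDB), a user x item x context rating tensor can be built and matricized (unfolded) along a chosen mode before a decomposition such as CP or Tucker is applied.

```python
import numpy as np

# Toy user x item x context tensor of ratings (4 users, 3 items, 2 contexts).
ratings = np.zeros((4, 3, 2))
ratings[0, 1, 0] = 5.0      # user 0 rated item 1 in context 0
ratings[2, 0, 1] = 3.0      # user 2 rated item 0 in context 1

def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n matricization: rows indexed by `mode`, all other modes flattened."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

user_matrix = unfold(ratings, 0)    # shape (4, 6): one row per user
item_matrix = unfold(ratings, 1)    # shape (3, 8): one row per item
print(user_matrix.shape, item_matrix.shape)
# A CP/Tucker decomposition of these unfoldings would expose latent factors.
```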

Based on the specifications of each data model described above, we present their relationship with the volume and varying structure of data in Fig. 5. Key-value databases are more inclined to store each record regardless of the completeness of the data, assigning a key identifier to each record. Graph databases, in contrast, are useful for storing well-structured, persistent objects with complex relationships; on the other hand, scalability, clustering, data partitioning among nodes, and adding or removing nodes at runtime are difficult in graph databases. Data models that support volume and scalability also have to tolerate inconsistencies and allow storage of unstructured data over a schema-less architecture. In contrast, the graph data model supports well-structured and complex data, but its scalability is not as great as that of the other data models.

Fig. 5 Data models of NoSQL databases

Besides the data models of NoSQL databases designed as big data storage systems, we consider the type of licensing to categorize the storage mechanisms. The licensing categories are described as follows:

Open source: An open-source system is freely available for both academia and business so that it can be used and integrated with their own piece of code or application (Milne and Witten, 2013). These systems are cost effective in terms of development and provide better quality and flexible access to even small businesses (Nagy et al., 2010).

Proprietary: Unlike open-source systems, proprietary systems may be owned after paying a reasonable fee for their use. These systems are provided under legal terms and conditions, for the buyer's own use only, and may not be modified or redistributed (Nagy et al., 2010). The source code of such systems is usually not given to the buyer.

Commercial: These systems are developed for sale. A trial version may be available free of cost, but to obtain full privileges for research and development, a user or enterprise must purchase the system.

Fig. 6 elaborates the categorization of big data storage technologies that are considered and further analyzed based on Brewer's theorem in a later section. The figure lists the storage technologies available for big data and presents the data model implemented by each technology. It also shows the classification based on type of licensing: open source, proprietary, and commercial. Key-value systems are developed for big data storage as NoSQL databases with different storage structures. Applications in which whole records need to be accessed and analyzed are suited to the key-value storage structure. For instance, DynamoDB (Sivasubramanian, 2012) adopts a key-value data storage structure and gains better performance for storing and accessing unstructured, inconsistent data. Stable performance and visible resource utilization are ensured even when the querying workload varies. Scalaris (Schütt et al., 2008), Berkeley DB (Oracle, 2015b), Aerospike (Aerospike, 2015), Redis (Carlson, 2013), Voldemort (Voldemort, 2015), KAI (SourceForge, 2015), and MemcacheDB (MemcacheDB, 2015) are also key-value storage systems, as detailed in the previous section.

Fig. 6 Taxonomy of big data storage technologies

In column-oriented storage systems, data are vertically partitioned and stored as separate columns. Each vertical partition may have a different or the same organization of data. In Fig. 6, HBase (Apache Software Foundation, 2015) is shown as a column-oriented data store, which ensures a consistent view of the data for every read/write request. HBase is an implementation of BigTable (Chang et al., 2008) and was developed as a NoSQL database to provide scalable storage and fault-tolerant access to its data. It is deployed on top of Hadoop and HDFS and facilitates efficient random read/write operations. Besides HBase and BigTable, Hypertable (Hypertable, 2015) and Cassandra (Hewitt, 2010; Lakshman and Malik, 2010) also belong to the column-oriented data model category. MongoDB (MongoDB, 2015), Terrastore (Terrastore, 2015), Rocket U2 (RocketSoftware, 2015), Qizx (Qualcomm, 2014b), RethinkDB (RethinkDB, 2015), OrientDB (OrientDB, 2015), SimpleDB (Sciore, 2007), CouchDB (Apache, 2015), and Riak (Sheehy, 2010) are document-oriented databases. HyperGraphDB (Iordanov, 2010), InfiniteGraph (InfiniteGraph, 2014), Neo4j (Neo4j, 2015), and AllegroGraph (AllegroGraph, 2015) are graph databases. We extend graph-oriented data models and elaborate on tensors as analytical models for processing and extracting information from high-dimensional big data. Tensors provide scalability when implemented on distributed systems. Kim and Candan (2014a) used SciDB (Cudré-Mauroux et al., 2009) to implement tensor decomposition on disk rather than in memory, which minimizes I/O and processing costs. This tensor decomposition project is named TensorDB (Kim and Candan, 2014b) and supports both static and dynamic decompositions. Tensors have also been used in Neo4j.

From the licensing perspective, Fig. 6 shows that most of the available big data storage technologies are open-source and freely available, either as a storage solution or as a research platform. BigTable is a proprietary storage solution by Google built over many commodity servers. Although it is not an open-source project, its implementations, Hypertable and HBase, are available to the research and consumer community without any cost. DynamoDB is a commercial storage technology. Most of the other storage systems discussed in this paper are open-source technologies.

Table 4 explains the taxonomy and describes the type of licensing and data model of each storage system. It also highlights the important features of these storage technologies. The technologies are distributed and scalable, and they assure fault tolerance and efficient responses to read/write operations on the stored data. Furthermore, the table illustrates the applications of big data storage technologies.

Table 4 Taxonomy of big data storage technologies

This section has provided an illustration of the taxonomy of big data storage technologies based on the data model and licensing. We have categorized these technologies according to the data model into four NoSQL systems: key-value, column-oriented, document-oriented, and graph databases. The details and applications of each data model are provided. According to the licensing method, storage technologies are categorized as open source, proprietary, and commercial. Fig. 6 shows that most of the storage systems are open-source. However, BigTable (Chang et al., 2008), InfiniteGraph, and Rocket U2 (RocketSoftware, 2015) are proprietary systems, whereas DynamoDB, Qizx (Qualcomm, 2014a), and AllegroGraph are commercial systems. Berkeley DB (Oracle, 2015b) is the only key-value database that is available in both open-source and commercial licenses. In a later section, we analyze NoSQL distributed storage systems according to Brewer’s CAP theorem. We discuss the background and importance of Brewer’s theorem in analyzing distributed systems. We also list the methods adopted by these storage systems to implement consistency, replication, and partition of data.

4 Analysis of big data storage technologies

In this section we analyze the discussed big data storage technologies. Brewer's CAP theorem, which gives widely used criteria for classifying distributed systems, states that it is difficult for NoSQL technologies to fulfill the ACID or BASE criteria, as consistency, availability, and partition resilience are the factors to weigh in designing these large-scale, distributed storage systems. ACID provides a set of characteristics for each transaction performed on relational database management systems and declares that these transactions are fully consistent (strong consistency), whereas BASE allows systems to have a weaker level of consistency. However, it is not always possible for distributed systems to remain consistent; sometimes these systems make a compromise between consistency and availability. Thus, the CAP theorem becomes useful for categorizing modern NoSQL storage technologies. According to the new explanation of Brewer's CAP theorem, a distributed system may be either consistent or highly available. Therefore, we explore the characteristics of existing NoSQL technologies and analyze them against Brewer's proposed criteria. Later, we investigate each storage technology in depth to suggest the Brewer category in which it lies. Furthermore, we summarize the mechanisms adopted by modern big data storage technologies to achieve consistency, data partitioning, replication, and indexing.

4.1 Brewer’s CAP theorem

We now describe the significance of Brewer's CAP theorem in the evaluation of distributed systems. For this purpose, we first describe other criteria and the evolution of the CAP theorem. With the emergence of distributed storage systems, the need to reconsider the ACID constraints (atomicity, consistency, isolation, and durability) (Gray, 1981) became pressing, as ACID does not ensure availability. Although availability may not be a requirement for some network services, it is usually an important consideration. Many researchers have presented their viewpoints on the basic requirements of a distributed storage system for big data. Fox et al. (1997), for example, identified them as incremental scalability, provisioning for overflow growth, high availability, and cost effectiveness for scalable network services. The authors showed that it is very difficult for networked clusters to fulfill all these requirements and presented the BASE (basic availability, soft state, and eventual consistency) model (Fox et al., 1997). This model is preferable when ACID is not applicable, and it provides better support for faults and partial failures in clusters at very little cost.

When discussing consistency and availability, a strongly consistent system seemingly prohibits more than one copy of the data; thus, high availability is compromised, because replication, which would increase users' access to the required service, is discouraged. Fox and Brewer (1999) introduced a tradeoff for designing distributed systems and presented the CAP principle (consistency, availability, and partition resilience). Before explaining the CAP principle and its tradeoff for categorizing distributed systems, we briefly define its three basic concepts, namely consistency, availability, and partition resilience. Consistency refers to having the same up-to-date data at each node (Oracle, 2015a), availability suggests quick accessibility to data storage resources with minimum downtime (Oracle, 2015a), and partition resilience is related to fault tolerance in case of unresponsive nodes or a sub-network (Bohlouli et al., 2013).

Initially, the authors claimed that a distributed system cannot provide all three of these features simultaneously and that only two out of three can be guaranteed (Fox and Brewer, 1999). For instance, a distributed system may compromise partition resilience and satisfy the other two; any two of the three features can thus characterize a distributed system. This principle invited criticism from experts, and Brewer later clarified the misleading concept of 'choosing two out of three', explaining that most of the time there is a choice between consistency and availability, depending on the user access pattern or the nature of the data (Brewer, 2012). As far as partition resilience is concerned, Brewer showed that it is rare for a system to face the partitioning problem and that many options exist for partition handling and recovery. This explanation is depicted in Fig. 7. Thus, the theorem suggests that the design of a distributed system lean towards either consistency or availability, since partition resilience cannot be forfeited.

Fig. 7 Brewer's CAP theorem (Brewer, 2012)

4.2 Analysis of big data storage technologies based on Brewer’s CAP theorem

We provided a brief illustration of Brewer's CAP theorem in Section 4.1. According to the new explanation (Brewer, 2012), it is clear that a storage system will be of either CP type or AP type. Here we present a systematic explanation of each storage system available for big data and categorize it as a CP or AP type, since the CAP theorem declares that it is impossible for a database to provide both ACID consistency and full data availability. Consequently, our discussion focuses on two pairs of combinations from this theorem: consistency-partition resilience and availability-partition resilience. Table 5 elaborates the features of big data storage technologies. These features are then considered in analyzing and categorizing these storage systems according to Brewer's theorem. Table 5 also describes the methods used to implement consistency, replication, data partitioning, and indexing for these technologies. Furthermore, Brewer's category is suggested in Table 5 for each storage system.

Table 5 Analysis of big data storage technologies

Based on the investigation of contemporary big data storage technologies presented in Table 5, we suggest the type of each technology as depicted by Brewer, showing that each technology falls into either the CP or the AP category. In Fig. 8 we present the categorization based on the two pairs given by Brewer for modern distributed systems. The intersection of two features represents one of the criteria, i.e., consistency-partition resilience (CP) or availability-partition resilience (AP). As Fig. 8 shows, recent distributed systems for storing big data satisfy one of the two pairs. Consistency and partition resilience are presented in Fig. 8 as CP, and the intersection lists the technologies from each data model that possess these two features. The pair availability-partition resilience is also shown in Fig. 8, and at its intersection lie Cassandra, Voldemort, and DynamoDB. We previously explained that, according to Brewer, it is not possible for a system to fully support consistency and availability at the same time. Therefore, a distributed system must satisfy one choice or the other, whereas partition resilience is mandatory.

Fig. 8 Big data storage technologies and the CAP theorem

The typical case for CP systems is that they become unresponsive to requests until the system recovers from partitioning and the data on all sites are up to date. This results in a locking state, during which the system may be unavailable. CP-type systems aim to ensure strong consistency regardless of data access performance, and they usually rely on synchronous write operations. Scalaris, Redis, MemcacheDB, and Berkeley DB are CP-type key-value databases.
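
To make the synchronous-write behaviour of CP systems concrete, the following minimal Python sketch shows a coordinator that acknowledges a write only after every replica has applied it; during a partition it rejects the request rather than return inconsistent data. The class and method names are hypothetical and are not tied to any of the databases above.

```python
# Minimal sketch of a CP-style synchronous write: the coordinator refuses to
# acknowledge a write until every replica has applied it, so during a
# partition the operation aborts instead of leaving replicas inconsistent.
# Replica objects and their apply() method are hypothetical.

class ReplicaUnreachable(Exception):
    pass

class CPCoordinator:
    def __init__(self, replicas):
        self.replicas = replicas  # objects exposing apply(key, value)

    def write(self, key, value):
        acks = 0
        for replica in self.replicas:
            try:
                replica.apply(key, value)   # synchronous update
                acks += 1
            except ReplicaUnreachable:
                # A CP system sacrifices availability: abort rather than
                # acknowledge a write that not all replicas have seen.
                raise RuntimeError("write rejected: replica unreachable")
        return acks  # equals len(self.replicas) when the write succeeds
```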

Scalaris (Schütt et al., 2008) implements a non-blocking Paxos commit protocol to ensure strong consistency. MemcacheDB enforces consistency through its master-slave approach: clients can read from any replica on the slave nodes, whereas write operations are performed only at the master node.
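
The master-slave scheme described for MemcacheDB can be sketched as a simple routing rule: writes always go to the master, while reads may be served by any slave. The node interface (`set`/`get`) below is an assumption made purely for illustration.

```python
import random

# Hedged sketch of master-slave routing: all writes go to the master,
# reads may be served by any slave replica once replication has caught up.

class MasterSlaveStore:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        # Only the master accepts writes; it later propagates them to slaves.
        self.master.set(key, value)

    def read(self, key):
        # Pick any slave for the read; fall back to the master if none exist.
        replica = random.choice(self.slaves) if self.slaves else self.master
        return replica.get(key)
```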

From the list of column-oriented databases, HBase, Hypertable, and BigTable are of CP type. To provide up-to-date and consistent data on all replicas, HBase performs synchronous writes to all replicas, which may increase latency or cause unavailability, but these are not design goals for HBase (Wang et al., 2014). HBase is designed to provide read/write consistency and holds clients' requests until the updating operation is complete. Write operations are first recorded in a write-ahead log (WAL) on the hard drive; the in-memory data are then updated across all replicating nodes so that improved write throughput can be achieved. Similarly, BigTable implements transactional consistency on rows using the Chubby service (Burrows, 2006) and the Paxos algorithm (Chandra et al., 2007) for persistent replicas and consistency lock management. A client acquires the lock and maintains the session during which it performs atomic reads and writes.
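
The write-ahead-log idea used by HBase can be illustrated with a small sketch: each mutation is appended to a durable log before the in-memory store is changed, and recovery replays the log. The file layout and JSON record format are assumptions and do not reflect HBase's actual on-disk format.

```python
import json

# Illustrative write-ahead log: persist every mutation to a log file first,
# then update the in-memory store; after a crash, replay the log to recover.

class WALStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.memstore = {}

    def put(self, key, value):
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()                 # persist before acknowledging
        self.memstore[key] = value      # then update the in-memory store

    def recover(self):
        # Rebuild in-memory state by replaying the log after a crash.
        self.memstore.clear()
        try:
            with open(self.log_path) as log:
                for line in log:
                    record = json.loads(line)
                    self.memstore[record["key"]] = record["value"]
        except FileNotFoundError:
            pass  # no log yet, nothing to replay
```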

Document databases such as MongoDB, Terrastore, and RethinkDB are CP-type databases; they prefer consistency over availability. MongoDB and RethinkDB are authoritative systems from the design perspective. Authoritative systems do not face the data inconsistency problem, which makes it simple to build applications on top of MongoDB and RethinkDB, although unavailability occurs more often for them (RethinkDB, 2015). Terrastore ensures that each single document is consistent and up to date. Because Terrastore does not allow inter-document transactions or multiple simultaneous operations on the same document, it scales more easily with a simpler consistency model. Terrastore tolerates node failures as long as at least one server node is working and that node maintains a connection to the master node; otherwise, Terrastore does not ensure partition resilience.

All the graph databases discussed in this paper are of CP type; none of them is more inclined to support availability than consistency. AllegroGraph is the only graph database that implements transactional consistency; the rest aim to provide eventual consistency. HyperGraphDB is built on top of Berkeley DB and offers non-blocking concurrency for transactions; furthermore, it has a flexible schema, and its updates are eventually consistent. Similarly, Neo4j supports non-blocking concurrency and eventual consistency, and it handles even write-intensive transactions efficiently. InfiniteGraph also supports eventual consistency for write operations.

High availability in the fullest sense is achieved only when every request receives a response and the system never enters a complete locking state; however, this does not guarantee that the results of requests are consistent. Key-value databases such as Aerospike, Voldemort, KAI, Riak, and DynamoDB are AP-type databases. Voldemort is designed as a highly available database where eventual consistency keeps replicas almost up to date. DynamoDB, when receiving a data update request, does not propagate it to all replicas, so as not to lock them out of serving read requests. DynamoDB performs the update on a quorum of servers, while the other servers respond to clients' requests and provide access to the data that are consistent at that time (Ramakrishnan, 2012). The quorum of servers needs to confirm a successful write update before any further request is entertained (Wang et al., 2014). Riak has complex conflict resolution strategies, and its quorum implementation also results in latency problems; however, Riak is a strongly available, partition-resilient key-value system.
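
The quorum mechanism described for DynamoDB and Riak is commonly summarized by the rule W + R > N: with N replicas, a write is acknowledged after W confirmations and a read consults R replicas, so every read overlaps at least one replica holding the latest write. The sketch below uses hypothetical replica methods (`try_store`, `try_fetch`) and illustrative values rather than the actual configuration of either system.

```python
# Sketch of quorum-based replication: acknowledge a write after W of N
# replicas confirm it, answer a read from R replicas, and require W + R > N
# so that the read set always overlaps the most recent write.

N, W, R = 3, 2, 2
assert W + R > N, "quorum overlap required for read-your-writes behaviour"

def quorum_write(replicas, key, value, version):
    acks = 0
    for replica in replicas:
        if replica.try_store(key, value, version):  # assumed replica API
            acks += 1
        if acks >= W:
            return True      # acknowledged before all replicas respond
    return False             # not enough replicas reachable

def quorum_read(replicas, key):
    responses = [replica.try_fetch(key) for replica in replicas[:R]]
    # Return the value with the highest version among the R responses.
    return max((r for r in responses if r is not None),
               key=lambda r: r.version, default=None)
```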

Cassandra is the only column-oriented AP-type database. It adopts the same mechanism as DynamoDB to ensure availability and propagate consistency, and, like Riak, it has complex procedures for conflict resolution and quorum implementation. SimpleDB, CouchDB, OrientDB, Rocket U2, and Qizx are AP-type document databases. OrientDB offers versioning to apply the effect of write operations, so that changes to the data can be viewed across different versions. None of the graph databases discussed in this paper is of AP type. In the next section we summarize the discussion and identify future research challenges.
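
OrientDB's versioning of write operations can be approximated by an optimistic, compare-and-set scheme in which every record carries a version number and an update succeeds only if the caller still holds the current version. This is a hedged sketch of the general technique, not OrientDB's actual API.

```python
# Sketch of record versioning: each record stores (version, value), and an
# update is applied only if the caller's expected version matches.

class VersionConflict(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self.records = {}   # key -> (version, value)

    def read(self, key):
        return self.records.get(key, (0, None))

    def update(self, key, expected_version, new_value):
        version, _ = self.records.get(key, (0, None))
        if version != expected_version:
            # Someone else changed the record; caller must re-read and retry.
            raise VersionConflict(f"expected v{expected_version}, found v{version}")
        self.records[key] = (version + 1, new_value)
        return version + 1
```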

5 Discussion and future research challenges

This paper has presented the evolution and taxonomy of big data storage technologies. These technologies are commonly designed to provide a storage solution with high scalability for growing data volumes and heterogeneous data structures. The databases are deployed over distributed systems to achieve high availability, improved data access performance, and fault tolerance. However, the extent to which each database provides these services differs, and this extent distinguishes the databases from each other and helps identify them as of CP or AP type. In Table 5, contemporary key-value, column-oriented, document-oriented, and graph databases are presented and analyzed on the basis of their adopted procedures for consistency, data partitioning, replication, and indexing. Moreover, their correlation with Brewer's CAP categorization is included in the analysis to suggest their type.

Key-value databases partition data record by record, regardless of whether the records share the same attributes. A unique key is assigned to each record, and the value contains the data of that record. Although they are mostly suitable for unstructured data, structured data can also be represented in these databases when record-based retrieval and analysis need to be performed. As far as licensing is concerned, most of the column-oriented and key-value databases are open source: BigTable is the only column-oriented database with a proprietary license, and DynamoDB is the only key-value database that is available commercially. Column-oriented databases such as HBase, Hypertable, Cassandra, and BigTable are mostly suitable for structured data with sufficient support for unstructured data. These databases apply vertical partitioning to the data and store each column as a separate chunk, which makes attribute-based queries and analysis easier with this data model.
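
The difference between the record-oriented layout of key-value stores and the vertical partitioning of column-oriented stores can be shown with a toy example: the same two records are kept once as whole values under their keys and once as per-attribute chunks. The sample data are invented.

```python
# Toy illustration of the two partitioning styles: record-oriented
# (key-value) versus column-oriented (vertical) storage layouts.

records = [
    {"id": "u1", "name": "Alice", "city": "Oslo"},
    {"id": "u2", "name": "Bob",   "city": "Lima"},
]

# Key-value / record-oriented layout: one opaque value per key.
kv_layout = {rec["id"]: rec for rec in records}

# Column-oriented layout: one chunk per attribute, indexed by record id.
column_layout = {
    col: {rec["id"]: rec[col] for rec in records}
    for col in ("name", "city")
}

print(kv_layout["u1"])        # fetch a whole record by key
print(column_layout["city"])  # scan a single attribute across records
```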

As far as document-oriented databases are concerned, they also follow a key-value data structure; however, the value identifies a document instead of a record. Documents are usually XML or JSON files with a flexible schema. Compared with the key-value data model, document databases have less support for scalability and unstructured data. Furthermore, document databases are roughly evenly split between preferring consistency and preferring availability: MongoDB, Terrastore, and RethinkDB are consistent databases, whereas SimpleDB, CouchDB, OrientDB, Rocket U2, and Qizx are highly available. Graph databases are well-structured databases, where analyzing data as well as their relationships is significant. Although graph databases do not have good support for scalability and clustering, they can represent complex data structures. According to our analysis, all graph databases in this survey are CP-type systems.

With the proliferation of big data, industry and academia are more interested in data management than in computational management. Technology has evolved considerably in provisioning vast storage and processing resources; however, in big data management, efficient techniques for data acquisition, preprocessing, processing, and storage are still desirable, and ongoing development focuses on efficient solutions that support big data management. The Hadoop framework (Lam et al., 2010) has become the de facto standard for big data processing with the MapReduce programming model, which offers batch processing of extensive volumes of files residing on commodity hardware, whereas for real-time big data processing, the simple scalable streaming system (S4) (Neumeyer et al., 2010) is a widely adopted tool. The Apache Software Foundation has contributed a list of big data solutions such as Mahout, Lucene, Hive, Pig, and Spark.

Besides processing, storage optimization is also important. Methods for data clustering, replication, and indexing for efficient storage utilization and data retrieval are of main concern. Storage-optimizing hierarchical agglomerative clustering (Buza et al., 2014), the K-means algorithm (Zhao et al., 2009), and the artificial bee colony (ABC) algorithm (Karaboga and Ozturk, 2011) are clustering approaches used in recent research. The storage technologies presented in this paper have built-in support for replication, which in turn ensures data availability, fault tolerance, and lower data access delay. For efficient replication, the ABC algorithm (Taheri et al., 2013), D2RS (Sun et al., 2012), and the JXTA-overlay P2P platform (Spaho et al., 2013) are well-known techniques. HAIL (Dittrich et al., 2012) provides an indexing solution for Hadoop that improves the data search and retrieval process.

Summarizing the storage technologies presented in this paper, all of them are partition-resilient: network partitioning and disconnections in distributed systems are rare, and many options exist to handle and recover from a partitioning situation. Thus, the choice is only between consistency and availability. As illustrated in Section 4, full support for both is beyond the capability of a distributed system, so each of the discussed distributed storage systems is placed in one of the two categories; storage systems that provide more support for consistency are CP-type systems, and those that favor availability are AP-type systems. Apart from the efficiency of available big data storage technologies, the challenges of storing future big data still need to be addressed. Finding and adopting the right tradeoff between consistency and availability, so that more throughput can be gained from distributed storage systems, leads to a number of research challenges, some of which are listed below:

  1. Frequent data update and schema change: The update rate is mostly very high, and the volume of data grows very rapidly. In the case of unstructured data, changes in schema are also very common. Although the available storage technologies are scalable, efficiency in handling data updates and schema changes is still an open issue. Some technologies, such as HDFS, do not offer data updates and support only append operations.

  2. Partitioning method: Data models suggest two methods of partitioning big data so that it can be distributed across storage nodes. Horizontal or vertical partitioning is applied to the data based on access patterns: data may need to be analyzed by feature or by record, which corresponds to the choice between column-oriented and key-value NoSQL databases. However, the predicted access pattern might be wrong or might change during execution, which poses a critical research challenge for existing data models specified for big data storage solutions.

  3. Replication factor: Data are replicated over multiple sites to achieve fault tolerance and high availability for users. Although this concept improves access performance, it compromises data consistency and is not well suited to conditions with frequent data changes and up-to-date access requirements; access performance degrades if frequent consistency locks are experienced. Furthermore, storage space consumption grows as a multiple of the number of replicas (see the back-of-the-envelope sketch after this list), and it is sometimes wiser to access data from a remote site than to consume local storage space. Therefore, preconfigured or application- and user-customized replication is a challenge.

  4. User expertise: Data are becoming more complex. At the same time, enterprises have broadened the user base, so that users from different domains execute queries on data according to their own problems. This reveals the requirement for storage that is simple to deploy and easy to use, yet offers higher performance than relational database solutions. To achieve improved performance, these databases sometimes integrate with DBMS platforms, which meets the performance expectations but makes the implementation and configuration process complex for non-expert users.
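
As a rough quantitative illustration of the replication-factor challenge (item 3 above), the sketch below computes the physical storage footprint as a multiple of the replication factor; the figures are purely illustrative.

```python
# Back-of-the-envelope sketch for the replication-factor tradeoff: physical
# storage grows as a multiple of the replication factor, so capacity planning
# must weigh availability against space. Numbers are illustrative only.

def physical_footprint(raw_data_tb: float, replication_factor: int) -> float:
    """Total storage consumed when every block is stored replication_factor times."""
    return raw_data_tb * replication_factor

for rf in (1, 3, 5):
    print(f"replication factor {rf}: "
          f"{physical_footprint(2.0, rf):.1f} TB for 2.0 TB of raw data")
# factor 1 -> 2.0 TB, factor 3 -> 6.0 TB, factor 5 -> 10.0 TB
```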

6 Conclusions

A categorization of contemporary storage technologies for big data has been presented in this paper. The main objective is to investigate and analyze state-of-the-art big data storage technologies and to categorize them based on Brewer's CAP theorem for distributed systems, so that researchers and big data analysts can explore enhanced storage solutions in specific domains where availability, consistency, and fault tolerance are important requirements. Moreover, presenting the limitations of existing storage technologies with respect to adequate scalability is a major concern of this work. The paper focuses mainly on categorization by data model and classifies modern storage technologies for big data as key-value, column-oriented, document-oriented, and graph systems. These technologies are further studied and analyzed through a critical review using Brewer's CAP theorem. A discussion surveys and highlights the efficiency of each storage system with respect to the scalability, consistency, and availability requirements of big data, and key future challenges in storing big data are also emphasized. In conclusion, systems are inclined towards provisioning either consistency or availability, as required by applications or users. This tendency is used to suggest a categorization of available big data storage systems so that selection under either preference becomes straightforward.