Saving Large Semantic Data in Cloud: A Survey of the Main DBaaS Solutions

In recent decades, the evolution of ICT has been spectacular, with a major impact on all other sectors of activity. New technologies have emerged, offering solutions to existing problems and opening up new opportunities. This article discusses solutions that combine big data, semantic web and cloud computing technologies. The authors analyze various possibilities for storing large volumes of data in triplestore databases, which are currently the solution of choice for storing semantic web data. The paper first presents the existing solutions for installing triplestores on premises and then focuses on triplestores as DBaaS (in the cloud). Comparative analyses are made between the identified solutions. This paper provides useful means for choosing the most appropriate database solution for semantic web data representation, either on premises or as DBaaS.


Introduction
One of the main limitations of the World Wide Web (abbreviated WWW) is that it was designed to be human-understandable, not machine-readable. In 1994, at the very first International WWW Conference, five years after he invented the World Wide Web, Tim Berners-Lee introduced the idea of a semantic web that can be understood by machines [1]. The WWW Consortium (abbreviated W3C) is an international organization which aims to develop standards for the WWW. W3C was founded by Tim Berners-Lee at the same conference where he announced the need for a semantic web [1]. Ever since, one of the objectives of W3C has been to improve the WWW by upgrading it from a web of documents to a web of data [2].
Until the end of the 20th century, the work on the semantic web was mostly theoretical; no practical application emerged as an international standard. The first important practical approach consisted of microformats. Instead of using HTML tags for their usual purpose, microformats can be sent as metadata, annexing information understandable by machines. In the meantime, new standards have been developed (e.g. RDF, RDFS, OWL, SPARQL and JSON-LD), as well as new approaches [3].
In this paper, the authors describe the current state of the semantic web, present the main approaches used for it and discuss the most important semantic web solutions for cloud computing. The aim of this article is to present and compare the main solutions for saving semantic data in the cloud.
The main semantic web representations are currently based on the semantic triple. A triple is a statement that links two objects and follows the pattern Subject-Predicate-Object. Every part of the triple has a uniform resource identifier (URI) associated with it. Semantic data can be stored as a large graph, where the subject and object are represented as nodes connected by a predicate, in the form of an edge. Any information can be represented by this simple model. The maximum potential of this approach can be reached if all the resources on the WWW have an associated URI and are connected with as many other resources as possible by triples. In this scenario, the WWW becomes a large unified database where information from multiple websites can be automatically extracted and correlated by simple queries. Table 1 shows the format of a URI and describes its components.
All important programming languages support the JSON format. JSON-LD (abbreviation of JSON for Linked Data) is designed to facilitate the representation of RDF relationships [4].
Table 2. The components of an ontology

Classes | Collections, sets, types of objects representing an entity in an ontology
Attributes | Properties or parameters of a class or individual
Restrictions | Dependencies between classes, restricting the set of valid arguments
Rules | "If-then" statements
Events | Changes of values of attributes or relationships

RDF triples are implemented with associated models named ontologies. Ontologies are designed with sets of rules, terms and vocabularies. These sets provide definitions of the entities found in reality. Ontologies are used to develop a large number of applications in different areas, such as knowledge management, intelligent information integration, information retrieval, natural language processing, database design and integration, e-commerce, bioinformatics and education [5]. The main components of an ontology are illustrated in Table 2. RDFS (RDF Schema) and OWL (abbreviation of Web Ontology Language) are semantic tools used to represent ontologies. They are also called description languages, and define classes and attributes of URIs and their relationships.
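The Subject-Predicate-Object model and its graph view, as described in this introduction, can be illustrated with a minimal Python sketch. The `http://example.org/` namespace, the predicate names and the helper functions `match` and `as_graph` are assumptions made for illustration only; they are not part of any standard or of the surveyed triplestores.

```python
# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

# Each triple is a (subject, predicate, object) tuple of URIs.
triples = [
    (EX + "TimBernersLee", EX + "invented", EX + "WorldWideWeb"),
    (EX + "TimBernersLee", EX + "founded", EX + "W3C"),
    (EX + "W3C", EX + "develops", EX + "RDF"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def as_graph(triples):
    """Map each subject node to its outgoing (predicate, object) edges."""
    graph = {}
    for s, p, o in triples:
        graph.setdefault(s, []).append((p, o))
    return graph


# Pattern query: what did Tim Berners-Lee found?
founded = match(triples, s=EX + "TimBernersLee", p=EX + "founded")
graph = as_graph(triples)
```

The wildcard pattern in `match` mirrors, in a very simplified way, how a query over triples leaves some positions unbound; a real triplestore would add indexing and a query language on top of this idea.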

Semantic Web Databases - Triplestores
A triplestore is a database built to store semantic web data in the form of semantic triples. As RDF is the standard used for the Semantic Web, triplestores are also called RDF stores. This section presents the main issues that should be considered when choosing a database for representing RDF triples. The authors describe the main triplestores currently available and present a comparative analysis of the existing solutions.
Survey [6] presents the state of the art of triplestores in 2014 and performs a qualitative analysis of the main RDF stores available at the time of writing. A similar approach is followed in this paper, with updated criteria and the inclusion of the approaches and solutions developed in the meantime.
To be taken into account for the analysis performed in this paper, an RDF store should include the following components: repository and middleware. A repository is a database constructed or adapted for storing and managing RDF triples. As RDF triples are simple data structures and their XML and JSON serializations are compatible with many programming languages, almost any database can be remodelled to store RDF triples. Therefore, to be regarded as a triplestore and to be able to communicate easily with the repository, the RDF database has to include middleware components (e.g. query engine, RDF parser, APIs for connecting with other applications and storage provider).
Triplestores can be implemented in either traditional (SQL) or NoSQL databases. The NoSQL paradigm seems more suitable for storing and managing RDF triples, since it offers more flexibility. The main NoSQL model types are: document, key-value, column and graph. RDF stores can be implemented in all these models, but the most reliable one is the graph [6].
The main technical characteristics to be considered when choosing a triplestore are:
• Semantic web standards supported (e.g. RDF, RDFS, OWL, SPARQL);
• Programming languages supported for connecting the triplestore with other applications;
• Support for reasoning: the ability to make logical inferences based on a set of facts and axioms, thus acquiring new knowledge;
• Licenses: both commercial and open source solutions are considered;
• Last release date, which is very important, as an old triplestore may not be compatible with more recent applications;
• Operating systems which are compatible with the triplestore.
Table 3 illustrates the main triplestores currently available, together with their technical characteristics.
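The reasoning criterion listed above (deriving new knowledge from facts and axioms) can be illustrated with a minimal forward-chaining sketch. It applies only two RDFS-style rules, and the `ex:` names and the `infer` function are hypothetical; the reasoners shipped with real triplestores are far more complete.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer(facts):
    """Apply two RDFS-style rules repeatedly until no new triple appears."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for (a, p1, b) in inferred:
            for (c, p2, d) in inferred:
                if b != c:
                    continue
                # Rule 1: subClassOf is transitive.
                if p1 == SUBCLASS_OF and p2 == SUBCLASS_OF:
                    new.add((a, SUBCLASS_OF, d))
                # Rule 2: an instance of a class is an instance of its superclass.
                if p1 == RDF_TYPE and p2 == SUBCLASS_OF:
                    new.add((a, RDF_TYPE, d))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Hypothetical facts (one instance assertion) and axioms (two class relations).
facts = {
    ("ex:Neptune", RDF_TYPE, "ex:GraphDatabase"),
    ("ex:GraphDatabase", SUBCLASS_OF, "ex:NoSQLDatabase"),
    ("ex:NoSQLDatabase", SUBCLASS_OF, "ex:Database"),
}
closure = infer(facts)
new_knowledge = closure - facts
```

Here `new_knowledge` holds the triples that were never stated explicitly, for example that `ex:Neptune` is an `ex:Database`.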

Choosing the best triplestore solutions
The authors compared the data gathered for each triplestore in order to decide which solution is currently the best for storing semantic web data. Each criterion selected above has to be considered, but not all of them are equally important. Taking into account that IT technologies evolve very fast, the authors agree that the last release date can be the starting point for filtering the data. If the release date is not recent, there are two issues to consider:
• The possibility that some solutions will not work together due to version incompatibilities.
• It is less likely that the solution is still supported.
The authors consider that the minimum requirement should be a release date of 2017 or more recent for a medium to big project, and of 2015 or more recent for a small to medium project. Given these assumptions, there are 10 triplestores to be considered for a medium to big project and 14 for a small to medium one. The type of license is also important when choosing a solution. While triplestores with open source licenses offer the advantages of flexibility and the possibility of lower costs (or none), the ones with commercial licenses are more likely to have good support.
To offer good flexibility in implementation, the authors consider that the minimum requirements regarding semantic web standards supported by the triplestores should be: RDF, OWL and SPARQL. The reasoner is an essential component in the semantic web. As can be observed in Table 3, most of the triplestores include this type of tool.
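The filtering steps described so far (release-date threshold per project size, then the minimum set of standards) can be sketched as follows. The candidate records are placeholders and are not taken from Table 3; the `Triplestore` class, the `shortlist` function and the optional `require_reasoner` flag are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplestore:
    name: str
    last_release_year: int
    standards: frozenset
    has_reasoner: bool


# Thresholds stated in the text above.
REQUIRED_STANDARDS = frozenset({"RDF", "OWL", "SPARQL"})
MIN_RELEASE_YEAR = {"medium-to-big": 2017, "small-to-medium": 2015}


def shortlist(candidates, project_size, require_reasoner=False):
    """Keep candidates that are recent enough and support the required standards."""
    min_year = MIN_RELEASE_YEAR[project_size]
    return [
        c.name for c in candidates
        if c.last_release_year >= min_year
        and REQUIRED_STANDARDS <= c.standards
        and (c.has_reasoner or not require_reasoner)
    ]


# Placeholder records for illustration only (not real products or real data).
candidates = [
    Triplestore("Store A", 2018, frozenset({"RDF", "RDFS", "OWL", "SPARQL"}), True),
    Triplestore("Store B", 2016, frozenset({"RDF", "OWL", "SPARQL"}), False),
    Triplestore("Store C", 2017, frozenset({"RDF", "SPARQL"}), False),
    Triplestore("Store D", 2012, frozenset({"RDF", "OWL", "SPARQL"}), False),
]

big_project = shortlist(candidates, "medium-to-big")
small_project = shortlist(candidates, "small-to-medium")
```

Treating the reasoner as an optional filter rather than a hard requirement is a design choice of this sketch; the text calls the reasoner essential but does not list it among the minimum requirements.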
Having more programming languages available can be useful when accessing and managing store contents from local or semantic client applications. On the other hand, all modern triplestores are compatible with at least one major programming language, which can be sufficient if proper specialists are engaged in the project. The authors regard the operating system as the least important of the chosen criteria. All triplestores are Linux compatible, which should be good enough considering that most servers run on Linux distributions. Considering the above, the authors find the top triplestores currently available to be: AllegroGraph RDF Store, GraphDB (formerly OWLIM), MarkLogic, Mulgara, Profium Sense, RDF4J (formerly Sesame), Stardog, Apache Jena-TDB and Oracle Database 12c, in no specific order. Other studies, such as [6] or [7], came to similar conclusions. Future tests evaluating reasoning capabilities, security and loading speed can further reduce the list of best triplestores. Such performance comparison tests have already been carried out on some of the identified solutions [8] [9].
DOI: 10.12948/issn14531305/22.1.2018.01

Short presentation of the main DBaaS
All the major cloud computing providers (the big four: Amazon, Microsoft, Google and IBM; see Figure 1 for details) have in their portfolios different DBMSs available in the cloud as Database as a Service (DBaaS). The advantages of these systems are that they can be kept much safer than a small or medium company could keep them and that they are scalable. This means that customers pay only for what they use and, as the business or the stored data grows, they pay more without worrying about data migration, switching technologies or buying new servers. All the major providers offer both SQL and NoSQL databases, but because this paper focuses mainly on the storage of semantic data, which is known to be unstructured, the authors present in detail only the NoSQL databases available in the cloud for each of the providers.

Amazon
The cloud computing platform from Amazon, called Amazon Web Services (AWS), gives the user five different DBMSs to choose from: Amazon RDS, Amazon Redshift, Amazon DynamoDB, Amazon ElastiCache and Amazon Neptune. Of these, only DynamoDB, ElastiCache and Neptune are non-relational database systems (see Figure 2).
Based on the DB-Engines rankings (see Figure 3) and on the supported technologies, Amazon Neptune could be considered the overall winner. However, in the absence of a benchmark test and of a monthly average price survey covering all the analyzed databases, this remains a subjective choice.

Conclusions and future work
In this article, the authors aimed to identify which DBaaS solutions from the large cloud computing providers (Amazon, Microsoft, Google and IBM) can be considered when it comes to saving large amounts of semantic data.
After presenting the semantic web technologies and domain standards, the authors made a comparative analysis of the existing triplestores (databases specially created for saving semantic data) available on premises. Unfortunately, none of them is present in the portfolios of the large cloud providers and many of them are not up to date. For this reason, the NoSQL databases available as DBaaS were analysed, and the authors tried to determine which of them could be the best match for saving large quantities of semantic data. Even though Amazon Neptune seems to be the overall winner, the authors will focus their future work on benchmark testing of all the identified databases, together with a cost analysis.

Fig. 2. Databases available in Amazon Web Services (source: https://aws.amazon.com/products/databases/?nc2=h_l3_db)
DynamoDB is a fast document-based or key-value store (the user can switch between the two) that uses SSD technologies, auto-scaling and auto-management. It is used by big names in the industry like Airbnb, Lyft and Netflix and is accessible directly from code, like a local database [10]. ElastiCache is Amazon's in-memory data store service. It is fully compatible with Redis and Memcached and provides real-time data processing. It is used by companies like Airbnb and McDonald's [11]. Neptune, on the other hand, is a graph database at its core and the only DBaaS that natively supports W3C standards like RDF or SPARQL. Like the other Amazon services, it is auto-scalable, fully managed and ACID compliant [12].
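Since native SPARQL support is the feature that sets Neptune apart, it may help to see what a SPARQL request looks like from client code. The sketch below only builds an HTTP GET request in the style of the W3C SPARQL 1.1 Protocol (the query is passed in a URL-encoded `query` parameter) and does not send it. The endpoint URL is a placeholder, not a real Neptune address, and `build_sparql_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; a real deployment would expose its own address.
ENDPOINT = "https://triplestore.example.org/sparql"

# Generic query returning up to ten arbitrary triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
# Sending it would be done with urllib.request.urlopen(request),
# which is intentionally left out of this sketch.
```

Because the request relies only on a W3C protocol, the same client code should in principle work against any SPARQL-compliant endpoint, which is part of the portability argument for standards-based stores.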

Fig. 3. DB-Engines Ranking for Graph Databases in March 2018

IANCU graduated from the Faculty of Cybernetics, Statistics and Economic Informatics of the Bucharest University of Economic Studies in 2010. He has a master's degree in Economic Informatics (2012) and, since 2015, a PhD in Economic Informatics in the field of Ontologies and eLearning. He is an Assistant Lecturer in the Department of Economic Informatics of the Bucharest University of Economic Studies. His current research focuses on semantic technologies and ontology innovations. Other fields of interest include machine learning, multimedia, mobile devices and IoT.
Tiberiu-Marian GEORGESCU graduated from the Faculty of Cybernetics, Statistics and Economic Informatics in 2012. In 2015 he graduated from the Informatics Systems for the Management of Economic Resources Master program. He is currently pursuing PhD research in Economic Informatics at the Bucharest University of Economic Studies, guided by professor Ion SMEUREANU, PhD. He works as a Teaching Assistant in the Department of Economic Informatics and Cybernetics. His main interests in the Informatics field are cybersecurity, artificial intelligence and semantic web.

"://[user "@"]host[:port][/path][ "?"query][ "#fragment"]
Table 1. The description of a URI's components
Components | Description
(abbreviated RDF). Through RDF, a triple can be represented as a succession of three URIs. There are several ways of representing triples with RDF, called serialization formats. The description language varies from one serialization format to another. The main serialization formats are RDF/XML and JSON-LD. However, Turtle (abbreviation of Terse RDF Triple Language) and N-Triples are worth mentioning, as they can be easily understood by humans. Both RDF/XML and JSON-LD serializations are based on popular standards used to represent and transfer data between applications. XML (abbreviation of Extensible Markup Language) has the advantage of being easy for applications to interpret, but it is generally considered difficult to write. JSON (abbreviation of JavaScript Object Notation) is used to represent data structures and transmit them between applications.
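To make the difference between serialization formats concrete, the sketch below writes the same triple once as an N-Triples line and once as a JSON-LD node object. The `http://example.org/` URIs and the helper names are hypothetical, and the helpers only cover the case where all three parts of the triple are URIs (no literals or blank nodes).

```python
import json

# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"
triple = (EX + "TimBernersLee", EX + "invented", EX + "WorldWideWeb")


def to_ntriples(s, p, o):
    """One N-Triples line: three URIs in angle brackets, terminated by ' .'."""
    return f"<{s}> <{p}> <{o}> ."


def to_jsonld(s, p, o):
    """A JSON-LD node object: subject as @id, predicate as key, object as @id reference."""
    return {"@id": s, p: {"@id": o}}


nt_line = to_ntriples(*triple)
jsonld_text = json.dumps(to_jsonld(*triple), indent=2)
```

The N-Triples form is a single flat line per statement, while the JSON-LD form groups statements under their subject, which is closer to how applications already exchange JSON documents.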

Cosmos DB is a NoSQL database dedicated to low-latency systems. It can be used as a key-value, graph, column-family or document store, all in one. It is also compatible with APIs like SQL, JavaScript, Gremlin, MongoDB, Apache Cassandra and Azure TableStorage [13]. TableStorage is a key-value store that uses semi-structured datasets to store large amounts of data. Client libraries for .NET, Java, Android, C++, Node.js, PHP, Ruby and Python are provided by Microsoft, but any other language that can issue HTTP requests is compatible, due to the RESTful API [14].
Cloud Datastore is a highly scalable NoSQL database for applications that do not necessarily store very large amounts of data [16]. It is very easy to use and also provides a RESTful API for data access. As with Bigtable, Google is in charge of the auto-scaling and management of the database. Another important feature of Datastore is that it is ACID compliant.
Table 4. DBaaS systems comparison [16]