Semantic Smart World Framework

This paper presents a general Semantic Smart World Framework (SSWF) to cover the paths of migratory birds. The framework combines semantic and big data technologies to give meaning to big data. To build the proposed smart world framework, technologies such as cloud computing, semantic technology, big data, data visualization, and the Internet of Things are combined. We demonstrate the proposed framework through a case study of automatic prediction of the air quality index and different weather phenomena in different locations in the world. We discover associations between air pollution and weather conditions. The experimental results indicate that the framework's performance is suitable for heterogeneous big data.


Introduction
Migratory birds move from one place to another without regard to borders between countries, so we need the concept of a "Smart World". Big data can help address "Smart World" challenges, most of which are related to data management. The most cited problems are privacy issues and dealing with the heterogeneity of world data. An important issue is how to build a generic smart world framework that supports all dimensions of any city regardless of its size and characteristics. The rapid evolution of Information and Communication Technologies (ICT) and the Internet of Things (IoT) has impacted cities in their physical infrastructure, buildings, transportation systems, governance, environmental monitoring, healthcare, etc. The integration of devices, platforms, and applications using ICT is of great significance to smart cities [1]. The expression "Smart City" has many different definitions. Some authors define a Smart City as the integration of social, physical, and IT infrastructure to improve the quality of city services. Other authors focus on a set of ICT tools to integrate the Smart City environment [2].
City computing is a process of acquisition, integration, and analysis of a huge amount of heterogeneous data generated by diverse sources in city spaces, such as sensors, devices, vehicles, buildings, and humans. The aim is to address the major issues cities face (e.g., air pollution, increased energy consumption, and traffic congestion) [3].
There are three main challenges in city computing: city sensing and data acquisition, computing with heterogeneous data, and hybrid systems combining the physical and virtual worlds.
Recently, many frameworks have been proposed in different dimensions of smart cities, including transportation [4], environment [5], energy [6], social [7], economy [8], and public safety and security [9]. Most of these frameworks focused on specific domains and did not include semantic interpretation of the results. In general, the data generated by smart cities are not easy for humans to understand because they pose the challenges of big data. The concept of Big Data is clarified by considering five Vs [10]: (1) Volume: refers to the size of the data generated by all the different sources. (2) Velocity: refers to the speed at which data change. (3) Variety: refers to the different types of data being generated. (4) Veracity: refers to the quality of the data. (5) Value: refers to the value of the data. Therefore, a general framework is needed to overcome the challenges posed by city data.
To achieve the research objectives, this paper is structured as follows: Section 2 reviews the background and related work. Section 3 presents the proposed framework architecture. Section 4 describes the implementation of the proposed framework, SSWF. Section 5 provides a case study of the SSWF that analyzes air pollution and weather on the migratory birds' path. Section 6 explains the results of applying the proposed framework. Section 7 discusses the significant contributions and limitations of this research and concludes the paper.

Related Work
Different cities have already built IoT infrastructures and deployed various sensor devices to collect the data they need. A huge number of research projects concentrate on the collection and economy of IoT data generated by smart cities.
Smart City frameworks can be classified into three different classes [11]: (1) Models: abstract frameworks for the Smart City. (2) Specific-purpose models: frameworks and applications related to one domain of the Smart City. (3) Multidomain models: frameworks and applications that describe the Smart City as a complex system and consider more than one domain.
Work on Smart City frameworks focuses mainly on existing Smart City platforms. The existing works fall mainly into four key areas: (1) data acquisition, (2) semantic interoperability, (3) data analysis, and (4) Smart City application development support [12]. We divide Smart City frameworks into four categories according to the technologies used. Almost all of the frameworks use one or more of the following technologies: Big Data, Cloud Computing, the Internet of Things, and Semantic Technology. Figure 1 presents the technologies and their functions. Table 1 presents a comparative study of Smart City frameworks and the technologies they use; it also covers security and API services. SCDAP (Smart City Data Analytics Panel) is a big data analytics framework for Smart City applications; this architecture is limited to the Apache Hadoop suite as its underlying data storage and management layer [13]. The "CityPulse" framework supports Smart City service creation by means of a distributed system for semantic discovery, data analytics, and interpretation of large-scale, near-real-time Internet of Things data and social media data streams [12]. Zhang et al. [14] presented a semantic framework that integrates the IoT with machine learning for smart cities. This framework retrieves and models urban data for certain kinds of IoT applications based on semantic and machine-learning technologies. It is used to detect pollution from vehicles and to detect traffic patterns. Spitfire and iCore are frameworks that use semantic technologies for IoT data collection [15]. CITIESData is a Smart City data management framework that includes data collection, cleansing, and publishing [16]. It divides Smart City data into sensitive, quasi-sensitive, and open/public levels. Then it suggests different strategies to process and publish the data within these categories.
Applied Computational Intelligence and Soft Computing
Mohbey [17] presented a Smart City framework using different big data and Internet of Things technologies. It focuses on problems related to real-time decisions for a smart city. Bibri [18] proposed a framework for a Smart City based on big data and sensor data.
Several points remain to be covered. On the one hand, hybrid technologies such as big data, semantic technology, cloud computing, the Internet of Things, and data visualization have not been integrated to support a more efficient smart city framework. On the other hand, big data frameworks do not support meaning that adds value to the data, semantic frameworks are slow at data retrieval and processing, and security issues still prevail in previous frameworks.

The Proposed Semantic Smart World Framework (SSWF) Architecture
The world contains a set of data centres for Smart Cities that specialize in collecting and measuring big data on natural phenomena such as bird migration, environmental pollution, and climate change. These data centres have several problems: they are not connected to each other, they serve separate Smart Cities around the world, and there is no open access to the data.
SSWF is a general semantic big data framework that combines semantic web and big data technologies to connect, predict, and discover the knowledge in world big data, without boundaries between cities. Big data technologies are required for most data-related activities, such as storing, processing, analyzing, and sharing, while semantic technologies are required for meaning-related activities, such as event detection, reasoning, and decision support. Thus, this research aims to build and develop a general framework for smart cities that combines data-related and meaning-related activities.
As shown in Figure 3, the Smart Cities' infrastructure generates heterogeneous big data. The architecture of SSWF consists of the following phases over Smart Cities.
The general dynamic SPARQL query can include modifiers (GROUP BY, HAVING, ORDER BY, LIMIT, OFFSET, and VALUES). We configure the general dynamic SPARQL query in the Data Visualization phase.

Data Analytics.
The processed data are further analyzed in this phase to make use of events and to help decision-makers take the correct decision. This phase supports semantic filtering, semantic monitoring, event detection, and knowledge discovery. Semantic filtering is filtering based on expressions in the form of a conjunction of description logic atoms enriched with OWL data types and SWRL (Semantic Web Rule Language) built-ins. We build the filter expression in the universal semantic data model; the filter expression is then translated into a SPARQL query. Semantic monitoring allows the taxonomy of event types and the event parameters to be defined in an ontology; we define them in the universal semantic data model. Event detection translates into finding the set of event types for which a given event occurrence belongs to their domains.
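As an illustration of the event detection step described above, the following minimal Python sketch treats each event type as a set of parameter ranges (its domain) and returns every type whose domain contains a given occurrence. The event type names, parameter names, and thresholds are hypothetical and are not taken from the universal semantic data model.

```python
from typing import Dict, List, Tuple

# Hypothetical event types: each maps a parameter name to an inclusive
# (low, high) range. Names and thresholds are illustrative assumptions.
EventDomain = Dict[str, Tuple[float, float]]

EVENT_TYPES: Dict[str, EventDomain] = {
    "LowPM25": {"pm25": (10.0, 20.0)},
    "LowHumidity": {"humidity": (0.0, 70.0)},
    "LowPM25AndLowHumidity": {"pm25": (10.0, 20.0), "humidity": (0.0, 70.0)},
}


def in_domain(occurrence: Dict[str, float], domain: EventDomain) -> bool:
    """Return True if every constrained parameter is present and in range."""
    for name, (low, high) in domain.items():
        value = occurrence.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True


def detect_events(occurrence: Dict[str, float]) -> List[str]:
    """Return the names of all event types whose domain contains the occurrence."""
    return [
        name for name, domain in EVENT_TYPES.items() if in_domain(occurrence, domain)
    ]


if __name__ == "__main__":
    print(detect_events({"pm25": 15.0, "humidity": 65.0}))
```

In an ontology-backed setting, the dictionary of ranges would be replaced by the event taxonomy and parameters defined in the semantic data model; the containment test is the part this sketch is meant to show.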
Knowledge discovery (or data mining) techniques must be adapted to be suitable for Big Data analysis. In this phase, we adapted the Gamma association coefficient for this purpose. The Gamma association coefficient (also called the gamma statistic) shows how closely sets of items in a series of data or transactions "match". Gamma can be calculated for ordinal (ordered) variables, whether continuous (like temperature or humidity) or discrete (like "hot" or "cold"). The gamma estimator is based on the number of concordant and discordant pairs of observations. It ignores tied pairs (i.e., pairs of observations that share the same value on at least one variable). The Gamma coefficient ranges between −1 and 1. A value of 1 means perfect positive association, while a value of −1 means perfect inverse association. If there is no association between the variables, the value is zero [19].
We assume that the cross tabulation, or 2 × 2 cross table, is as shown in Equation (1), where a is the frequency of variable − against −, b is the frequency of variable + against −, c is the frequency of variable − against +, and d is the frequency of variable + against +.
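The following Python sketch illustrates the gamma computation in two ways: by counting concordant and discordant pairs directly, and by a closed form for a 2 × 2 table that compares the diagonal product with the off-diagonal product. The cell names a, b, c, d and the closed form (ad − bc)/(ad + bc) are assumptions inferred from the surrounding description, not a reproduction of the paper's equations.

```python
from itertools import combinations
from typing import Sequence


def gamma_from_pairs(x: Sequence[float], y: Sequence[float]) -> float:
    """Gamma from concordant and discordant pairs; tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0: the pair is tied on x or on y and is skipped
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when all pairs are tied")
    return (concordant - discordant) / (concordant + discordant)


def gamma_2x2(a: int, b: int, c: int, d: int) -> float:
    """Gamma for a 2 x 2 table, assumed here to be (ad - bc) / (ad + bc)."""
    denominator = a * d + b * c
    if denominator == 0:
        raise ValueError("gamma is undefined when ad + bc is zero")
    return (a * d - b * c) / denominator


if __name__ == "__main__":
    # Hypothetical table: a=3 (-,-), b=1 (+,-), c=2 (-,+), d=4 (+,+)
    x = [0] * 3 + [1] * 1 + [0] * 2 + [1] * 4
    y = [0] * 3 + [0] * 1 + [1] * 2 + [1] * 4
    print(gamma_from_pairs(x, y), gamma_2x2(3, 1, 2, 4))
```

The pairwise version is quadratic in the number of observations, whereas the 2 × 2 form needs only four counts, which is why a count-based form is attractive for large datasets.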
In this section, we describe each component of the SSWF architecture. The Smart Cities' infrastructure generates heterogeneous big data. The main challenge is to collect and push timely data about city events from a huge number of heterogeneous sources, such as sensors, servers, devices, vehicles, buildings, and human activities, and to deal with both historical and real-time big data.

Big Data Storage.
This phase is responsible for storing the data collected from the Smart Cities. It uses big data storage systems such as the Hadoop Distributed File System (HDFS) and NoSQL databases. Moreover, this component should be capable of performing useful preprocessing tasks, such as data filtering, normalization, and transformation.

Universal Knowledge Base.
In this phase, we build an ontology-based universal semantic data model. This model should automatically classify the data, associate relationships, and find new relationships. This is done using the Web Ontology Language (OWL) and ontology re-engineering methods such as merging.

Parallel and Distributed Big Data Processing.
This phase is responsible for processing smart city data on distributed cluster nodes. There are two types of data processing: stream processing, which handles real-time data flows, and batch processing, which handles large historical datasets. A suitable distributed big data processing framework should be chosen for each.

Semantic Big Data Dynamic Query.
This phase integrates semantic dynamic queries with distributed big data processing. We connect NoSQL with the universal knowledge base. Semantic dynamic queries can run directly on data stored in HDFS/NoSQL without requiring any data movement or transformation. There are two main steps to run a query: (1) The RDF Loader converts an RDF dataset into the data layout using MapReduce. (2) The Query Compiler rewrites a given semantic dynamic query into SQL on the big data ecosystem, based on the algebraic representation of SPARQL expressions.
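To make step (1) more concrete, the sketch below pivots RDF triples into a property-table layout, with one row per subject and one column per predicate. This layout is an assumption chosen for illustration, since the text does not specify the data layout produced by the RDF Loader, and the plain-Python grouping only stands in for the map and reduce phases of a real job. The example triples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]
KeyedPair = Tuple[str, Tuple[str, str]]


def map_phase(triples: Iterable[Triple]) -> Iterator[KeyedPair]:
    """Emit (subject, (predicate, object)) pairs, keyed by subject."""
    for subject, predicate, obj in triples:
        yield subject, (predicate, obj)


def reduce_phase(pairs: Iterable[KeyedPair]) -> Dict[str, Dict[str, List[str]]]:
    """Group pairs by subject into rows mapping predicate -> list of objects."""
    table: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for subject, (predicate, obj) in pairs:
        table[subject][predicate].append(obj)
    return {subject: dict(row) for subject, row in table.items()}


if __name__ == "__main__":
    # Hypothetical triples; the ex: names are not from the paper's ontology.
    triples = [
        ("ex:obs1", "ex:city", "London"),
        ("ex:obs1", "ex:pm10", "23.5"),
        ("ex:obs2", "ex:city", "Paris"),
    ]
    print(reduce_phase(map_phase(triples)))
```

A row-per-subject layout lets a query that touches several properties of the same subject be answered without joins, which is one reason such layouts are used for SPARQL-on-SQL engines.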
The general dynamic SPARQL query is shown in Figure 4. The query consists of three parts: the SELECT clause identifies the variables (indexed 1, 2, ..., i) to appear in the query results; the WHERE clause provides the basic graph pattern (subject, predicate, object) to match against the data graph; and the filter contains an association rule or condition. The query can also include modifiers.
Equation (5) then shows what the general 2 × 2 cross table will be, and Equation (6) shows the simplest form of the Gamma coefficient, suitable for big data. This simplification requires only counts, which must in any case be computed in any association rule discovery to calculate support and confidence.
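The three-part query structure described above (SELECT, WHERE, and filter, plus optional modifiers) can be sketched as a small query builder. This is a minimal illustration, not the framework's implementation: the function name `build_sparql`, its parameters, and the `ex:` prefix and property names are all hypothetical, and no input sanitization is attempted.

```python
from typing import Dict, Optional, Sequence, Tuple

TriplePattern = Tuple[str, str, str]


def build_sparql(
    variables: Sequence[str],
    patterns: Sequence[TriplePattern],
    filter_expr: Optional[str] = None,
    prefixes: Optional[Dict[str, str]] = None,
    order_by: Optional[str] = None,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> str:
    """Assemble a SELECT query from variables, triple patterns, and a filter."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in (prefixes or {}).items()]
    lines.append("SELECT " + " ".join(variables))
    lines.append("WHERE {")
    for subject, predicate, obj in patterns:
        lines.append(f"  {subject} {predicate} {obj} .")
    if filter_expr:
        lines.append(f"  FILTER ({filter_expr})")
    lines.append("}")
    # Solution modifiers are appended only when requested.
    if order_by:
        lines.append(f"ORDER BY {order_by}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    if offset is not None:
        lines.append(f"OFFSET {offset}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_sparql(
            variables=["?station", "?pm25"],
            patterns=[
                ("?obs", "ex:station", "?station"),
                ("?obs", "ex:pm25", "?pm25"),
            ],
            filter_expr="?pm25 >= 10 && ?pm25 <= 20",
            prefixes={"ex": "http://example.org/sceo#"},
            order_by="?pm25",
            limit=10,
        )
    )
```

Keeping the query as structured parts rather than a fixed string is what allows dashboard controls to change the variables, the filter range, or the number of results without rewriting the query by hand.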

Data Visualization.
The previous component produces output as a series of values. To represent these values, it is necessary to use visualization techniques. In this type of technique, we focus on how to make the representation of the mined knowledge more understandable. Some representation forms may be better suited than others for particular kinds of knowledge. The technique can be chosen by users or be data-driven. Visualization techniques can be classified as graphical, tabular, or color-only. Users can access the proposed framework through a set of data visualization components such as a dashboard, a mobile application, and APIs.
Notice that Equation (4) compares the product of the diagonal cells (ad) to the product of the off-diagonal cells (bc). The denominator is an adjustment that ensures that the coefficient is always between +1 and −1. We generalized 2 × 2 tables to any categorical attributes in datasets in Equation (3).

Implementation of the Proposed Framework SSWF
In this section, we present the hardware and software packages used in the proposed framework SSWF and explain its data flow.
To implement SSWF, we need suitable software in each layer. We use the HPC system of Bibliotheca Alexandrina, which has a SUN cluster with a peak performance of 11.8 TFLOPS, 130 eight-core compute nodes, 2 quad-core sockets per node (each an Intel Quad Xeon E5440 @ 2.83 GHz), 8 GB of memory per node, 1.05 TB of total memory, 36 TB of shared scratch, an Ethernet and 4x SDR InfiniBand node-to-node interconnect for MPI, and a 4x SDR InfiniBand network for I/O to the global Lustre file systems.
We develop a semantic dashboard using the Java JDK and build a universal knowledge base using Protégé. The data are pushed from different data sources to NoSQL storage by the Big Queue tool. Depending on the size of the data, the SSWF stores data in HDFS or HBase. The SSWF processes data as batch processing in the case of historical data or stream processing in the case of real-time data. The SSWF uses Sempala Alexander [20] for interactive SPARQL query processing on SQL-on-Hadoop. The SSWF generates a dynamic SPARQL query over the universal semantic data model of the city.
A complete example of how a dynamic SPARQL query is translated to Spark SQL is illustrated in Figure 5: (1) the SPARQL query asks for an average of Q3 in "London" during 2010, the corresponding algebra tree is illustrated in (2), and the Spark SQL query is given in (3).
For the data processing phase, Alkatheri [21] conducted a comparative study of big data frameworks. Comparing Spark, Apache Storm, Flink, and Apache Hadoop, the study found Spark the best for non-real-time data across various key performance indicators (KPIs), while Flink was the best for stream processing. These KPIs are processing time, CPU consumption, latency, execution time, task performance, and scalability. We compared the Spark and Flink frameworks on the high-performance computing (HPC) system and found that Spark performed best on these KPIs.
Spark is fast and makes it easy to process huge amounts of data. Apache Spark is a distributed processing framework that works in memory. It is known for its high performance, is easy to use, and handles huge data flexibly and efficiently. It also supports application development in languages such as Python and Java on Hadoop-based storage systems. Table 2 presents the software packages used in the proposed framework SSWF and their functions.
A corresponding sequence diagram illustrating this data flow process is shown in Figure 6.

Table 2: The software packages used in the framework and their functions.
Software                             Function
Big Queue [22]                       Data transformation
HDFS [23], HBase [24]                Data storage
Spark [25] and Hadoop [23]           Data processing
Spark [25]                           Stream processing
Hadoop YARN [23]                     Clustering management
REST APIs                            Data access
Spark MLlib [25]                     Machine learning
SPARQL [26]                          Semantics logic
OWL [27] and RDF [28], Protégé       Knowledge base
Sempala Alexander [20]               Interactive SPARQL query processing on Hadoop
Java [29]                            Dashboard

Case Study of the SSWF to Analyze Air Pollution and Weather on Migratory Birds' Path
The World Health Organization (WHO) reports that 9 out of 10 people breathe air containing high levels of pollutants. It estimates that around 7 million people die every year from exposure to polluted air [30]. Most organizations deny external researchers access to their data due to privacy issues.
We study air quality [30] and weather forecasting [31] monitoring data for 40 European countries from 1969 to 2012. The size of the data per year is 1.5 GB. Multiple weather factors (temperature, wind speed, humidity, rainfall, etc.) are taken into consideration based on hourly monitoring. The air pollutants included are particulate matter with an aerodynamic diameter ≤10 µm (PM10), PM2.5, nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), and ozone (O3).
The European Environment Agency (EEA) launched the European Air Quality Index (AQI) to check the current air quality across Europe's cities and regions. We compare our results with the European AQI (http://airindex.eea.europa.eu/) to check our predictions. The location and time components use the standard GEO W3C Ontology (http://www.w3.org/2003/01/geo/). We define a set of association rules for the air quality index and weather. Table 4 shows the air quality index with the air pollution parameter values. Figure 7 shows a general picture of the ontology of the Smart City environmental structure. The Smart City Environmental Ontology (SCEO) contains a set of classes such as County, City, Station, Device, Sensor, Air pollution, Weather, Water, Air Quality, and Climate Change. It also contains a set of subclasses: air pollution has CO, O3, NO2, SO2, PM2.5, and PM10, and weather has temperature, wind, humidity, rainfall, etc.

Data Processing.
In this phase, a dynamic SPARQL query can be processed efficiently in memory over large distributed datasets on top of the existing HPC cluster platform, without data preparation overhead. The dynamic SPARQL query runs over Sempala, which converts the SPARQL query to an algebraic representation and then to Spark SQL.

Data Visualization.
A geographical dashboard has been implemented and is hosted on an Apache Tomcat server. Figure 8 shows the dashboard for the proposed framework, which makes it easy to configure a dynamic SPARQL query using dashboard controls. The result can be filtered by a range or by the number of output values. The dashboard provides animated marker clustering.

Analysis of Results
For evaluation purposes, we measure the time taken to process a dynamic query that retrieves all air quality and weather parameters, filtered over different periods of time, as shown in Figure 9. Table 5 compares normal RDF and the proposed framework in the processing time of query code 2. Figure 10 shows a bar chart of the processing time (seconds) for the original data and the proposed framework over different periods of time.
We measure the monthly average air quality index for PM10 over London from 2008 to 2012 using the real data (https://data.london.gov.uk/) and our proposed framework, as shown in Figure 11. Figure 12 shows the comparison between the monthly average air quality index for PM10 over London from 2008 to 2012 in the real data and in our proposed framework. The matching ratio in the air quality index between the framework calculation and the real data is 98%. Now, we can discover knowledge by applying the Gamma association coefficient to any two classes.
We discover knowledge by applying the Gamma association coefficient to certain classes in the universal knowledge base. We therefore build association rule discovery between PM2.5, temperature, and humidity during 2011 over London.
Here, the rule count is the number of PM2.5 observations in each rule, i.e., the observations that satisfy the two classes at the same time.
We can predict the value of any attribute in the universal knowledge base. The main challenge is missing data for some places along the migratory birds' path. The SSWF therefore searches for the nearest area that has data and predicts the missing data. In this case, we predict the annual average of NO2 over Egypt in 2011. Now, we apply the SSWF in the following sections. Table 3.

Universal Knowledge Base.
We create a common knowledge base, the Smart City Environmental Ontology (SCEO), that can deal with static, semi-static, and real-time data. SCEO can be used to make queries for predictions, suggestions, and deductions. SCEO is a universal ontology that merges more than one ontology. This merge extends the knowledge base and is necessary for knowledge transfer among different knowledge bases. The ontologies have the location and time components in common.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.
In the rule measures, the class total is the total number of PM2.5 observations in each class of the second variable, and the sample total is the total number of observations in the sample. For example, to count the number of PM2.5 events in the low range (10-20 µg m⁻³) with low humidity (less than 70), Figure 13 shows the corresponding Semantic Web Rule Language (SWRL) rule. Table 6 presents the outcome of the Gamma association coefficient and relational association rule discovery for the daily averages of humidity and PM2.5 during 2011 over London.
There is a strong relation between the low PM2.5 class (10-20 µg m⁻³) and the low humidity class (less than 70), and a weak relation between low PM2.5 and not-low humidity. Figure 14 shows the relation curve among humidity, temperature, and PM2.5 during 2011 over London, which confirms the above knowledge discovery rule.
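The class-based rule above can be illustrated with a short Python sketch that discretizes daily readings into the two classes and counts how often they occur together. The support and confidence formulas are the standard association rule definitions, which we assume correspond to the quantities used here; the class boundaries come from the text, while the daily readings are invented for illustration and are not taken from the London dataset.

```python
from typing import Iterable, Tuple


def is_low_pm25(pm25: float) -> bool:
    """PM2.5 class 'low': 10-20 micrograms per cubic metre (range from the text)."""
    return 10.0 <= pm25 <= 20.0


def is_low_humidity(humidity: float) -> bool:
    """Humidity class 'low': less than 70 (threshold from the text)."""
    return humidity < 70.0


def rule_measures(days: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Support and confidence of the rule 'low humidity -> low PM2.5'.

    Each item in `days` is a (pm25, humidity) pair of daily averages.
    """
    total = n_low_humidity = n_rule = 0
    for pm25, humidity in days:
        total += 1
        if is_low_humidity(humidity):
            n_low_humidity += 1
            if is_low_pm25(pm25):
                n_rule += 1  # the day satisfies both classes at once
    if total == 0 or n_low_humidity == 0:
        raise ValueError("not enough data to evaluate the rule")
    support = n_rule / total
    confidence = n_rule / n_low_humidity
    return support, confidence


if __name__ == "__main__":
    # Invented daily averages: (PM2.5, humidity)
    sample = [(12.0, 60.0), (18.0, 65.0), (25.0, 80.0), (15.0, 75.0), (30.0, 50.0)]
    print(rule_measures(sample))
```

The same pass over the data yields every count needed for a 2 × 2 contingency table, so an association coefficient can be derived from it without a second scan.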
Finally, we can apply association rule discovery functions to predict the value of any attribute in the universal knowledge base. The main challenge is that data are not available for some countries or cities along the migratory birds' path. The SSWF therefore searches for the nearest area that has available data and predicts the value for the new area. In this case, we predict the annual average of NO2 over Egypt in 2011. Our case study data do not contain any information about Egypt. The nearest area to Egypt is Cyprus, which is 956 km away. According to the Egyptian Environmental Affairs Agency (EEAA) (http://www.eeaa.gov.eg), the annual average of NO2 over Egypt in 2011 is 58 µg m⁻³. The framework predicts an annual average of NO2 of 56.25 µg m⁻³, an accuracy of 96.9%.
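As a quick check of the reported figure, the sketch below computes accuracy as one minus the relative error between the predicted and reference values. This definition is an assumption; the text does not state the formula used.

```python
def prediction_accuracy(predicted: float, reference: float) -> float:
    """Accuracy in percent, assumed to be (1 - |predicted - reference| / |reference|) * 100."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return (1.0 - abs(predicted - reference) / abs(reference)) * 100.0


if __name__ == "__main__":
    # Values from the text: predicted 56.25 and reference 58 (annual average NO2).
    print(f"{prediction_accuracy(56.25, 58.0):.2f}%")
```

With the values from the text this gives about 96.98%, which is in line with the reported 96.9%.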

Conclusions
The smart world is a dream, but one we can achieve, like the migratory birds that travel around the world without a visa. This paper presents a general semantic big data framework for a smart world. The SSWF provides a universal knowledge base for data generated by different data sources. SSWF provides not only data but also an understanding of the meaning of data readings, context, and relationships among data, facts, and events. SSWF has added millions of records as RDF triples using air quality and weather monitoring data for 40 European countries. The advantages of this framework are that it (1) increases the ability of end-users to self-manage data from different data sources, (2) is independent of the domain of services and environments, and (3) manages concepts and relationships from different data sources. The main characteristics of this framework are (1) a universal semantic data model, (2) a set of association rule discovery functions for predictions, suggestions, and deductions, (3) a Service-Oriented Architecture (SOA), (4) the use of semantic RDF standards to make the data "self-describing", and (5) management of big data. The matching ratio between the framework calculation and real data for the air quality index is 98%. The matching ratio between the framework calculation and real data when predicting new values is up to 96.9%. In future work, we will add more analysis techniques and merge Smart City ontologies from different domains to increase knowledge and pattern detection.