Future Location Prediction for Emergency Vehicles Using Big Data: A Case Study of Healthcare Engineering

The number of devices equipped with GPS sensors has increased enormously, generating a massive amount of data. Analysing this data for various applications remains challenging. One such application is predicting the future location of an ambulance in the healthcare system based on its previous locations. Similarly, many smart city applications, such as SnapTrends and Geofeedia, rely on user movement and location prediction. Many models and algorithms help predict the future location with high probability. However, the efficiency and accuracy of the existing algorithms can still be improved. In this study, a novel algorithm, NextSTMove, is proposed for the available dataset, which results in lower latency and higher prediction probability. Apache Spark, a big data platform, was used to reduce the processing time and manage computing resources efficiently. The algorithm achieved 75% to 85% accuracy, and in some cases 100% accuracy where users do not change their daily routine frequently. After comparing the prediction results of our algorithm, it was experimentally found that it processes predictions up to 300% faster than traditional algorithms. NextSTMove was also compared with and without Apache Spark, and it can help in finding useful knowledge for healthcare medical information systems and other data analytics solutions, especially in healthcare engineering.


Introduction
Analysing movement patterns has always been a keen area of interest, be it for automobiles, humans, or any other moving object. These movement patterns can help analysts make decisions related to the behaviour patterns of an object. For example, the idea of geo-marketing can evolve if the patterns of people who are shopping are observed.
Similarly, different location-aware applications can help urban planning by observing traffic patterns. Approximately 3.5 billion mobile phone users were predicted worldwide for 2020 [1]. A mobile user's location is better estimated these days by the techniques that are being developed and used by telecommunication providers. Therefore, mobile users' patterns and activities are sensed by using different mobility data records that are saved by telecommunication companies [2]. The objective of observing mobility data is to see why and when objects move. To accomplish this, various data sources such as the Global System for Mobile communications (GSM) or the Global Positioning System (GPS) are analysed and later transformed into meaningful predictive patterns. The process of deriving such patterns is known as Knowledge Discovery (KD), i.e., meaningful knowledge produced from raw data [3]. A hybrid system for location recognition and prediction, which addressed key issues of location-based services, was proposed by [4]. The system used a hybrid method combining k-Nearest Neighbour (kNN) and a decision tree to effectively recognize locations in both outdoor and indoor environments. NextPlace, presented by [5], is an approach for spatio-temporal user location prediction based on nonlinear analysis of the time series of start times and duration times of visits to significant locations. This approach allows forecasting not only the next location of a user but also his/her arrival and residence time, i.e., the interval of time spent in that location. To make an informed decision about which approaches to adopt, information was gathered from several previous studies.
The main motivation of this research is the availability of massive data within the industry that has never been utilized for business intelligence in the context of Pakistan. The data has been gathered over the years and used only for real-time monitoring of vehicles. There is huge potential to explore and perform geovisual analytics on the available data using big data platforms such as Apache Spark. The main purpose of this research is to analyse the spatio-temporal mobility patterns of Global Positioning System (GPS) data using new big data technologies; i.e., Apache Spark is used to reduce the time taken per job for discovering useful information, which can assist decision-making in real-world scenarios. The future location of vehicles is predicted from a large pool of data with more than 100 million records using a novel algorithm, Next Spatio-Temporal Move (NextSTMove), developed on Apache Spark to reduce the time taken; the predicted locations are then verified against the real data. The main contribution of our research is the newly proposed NextSTMove algorithm, which is more efficient and accurate than existing algorithms. Moreover, we have used the real data of a local tracker company. The results of our algorithm can be very useful for long-term strategic and business advantages in healthcare engineering. The remainder of this paper is structured as follows. Section 2 describes the related literature. Section 3 presents the detailed methodology and the proposed algorithm. Results and discussion are presented in Section 4. Finally, conclusions are outlined in Section 5.

Related Work
There are many trajectory prediction algorithms in the literature. Over the years, various researchers have proposed novel algorithms that cater to their needs. Broadly, trajectory prediction algorithms are derived from machine learning approaches such as Bayesian networks [6], hidden Markov models [7], decision trees [8], neural networks [9], and state predictor methods [10]. This section describes existing work in this field while commenting on the above-mentioned parameters.
Research on mobility data is not new. However, in the last few years, it has gained popularity in data mining, artificial intelligence, and health engineering [11][12][13]. Advances in GPS and telecommunication technologies produce substantial amounts of information. In the survey paper [14], five algorithms are applied to four users that had different patterns. Innovations and advancements point toward pervasive computing for mobility data, which helps improve prediction accuracy. The trajectories that are stored for the semantics of mobility data aid in finding useful information about the movements of objects [15].
Likewise, paper [16] presents visual techniques to generate trajectories (spatio-temporal sequences) using GPS data to assist in efficient trajectory projection of emergency vehicles in highly urbanized cities. Furthermore, papers [17,18] use visual analysis to implement intelligent transportation enabling efficient utilization of new knowledge and complex data.
A spatial-temporal prediction method called Spatial-Temporal Recurrent Neural Networks (STRNN) was proposed by [19]. The experimental results on real datasets showed that STRNN outperformed the state-of-the-art methods and can model the spatial and temporal contexts well. In [20,21], the authors discussed that the growing data volume creates a need for processing spatio-temporal queries efficiently. For this, they used parallel processing in Secondo for geospatial big data analysis, while in [22] the context of time and space in a massive geospatial big data database is analysed using High-Performance Computing (HPC). A classification approach based on decision trees was presented by [23] to predict the next place of mobile users. The authors implemented an optimizer to find the best parameter combination for each user, since users had widely varying behaviour. The performance of the approach was demonstrated by experiments on a real-life dataset of 80 mobile users provided by Nokia. The existing solutions for geolocation prediction (GP) were reviewed by [24], who divided geolocation prediction into two primary parts. The first step proposed for building a geolocation prediction model is Mining Popular Geolocation Region (MPGR), and the second is Mining Personal Trajectory (MPT). The results described the basic concepts of GP and the characteristics of MPGR and MPT. They also discussed the limitations, opportunities, and future analytical trends of geolocation prediction for mobility big data. Similarly, paper [25] proposed a methodology for predicting a user's outdoor location from contextual data (current location, day of the week, time, and speed) collected with a GPS device and a smartphone. This methodology was based on spatial clustering of data and on time segmentation to find points of interest that the user visits every day and every hour.
A 2013 study [26] examined aspects of data collection and the handling of trajectories that feed databases with appropriate data. Trajectory reconstruction for producing meaningful trajectories includes procedures for gathering movement data, cleansing the gathered data, compressing the data, and mapping coordinates to deliver noise-free trajectories. To produce semantically compliant trajectories, raw spatial data from the common repository need to be recovered using different reconstruction tasks along with semantic trajectories. Moreover, paper [27] defined the concept behind the management of trajectories and their representations. The focus of the research was large-scale analysis of mobility-related phenomena, with more emphasis on the semantic behaviour of the data. The main goal of analysing the behaviour is to indicate which behaviour characterizes which moving object.
An unprecedented amount of geospatial data gathered from moving objects defies human capability to analyse it. A study by [28] found new methods for processing and mining moving objects. For modelling and representing trajectories, paper [29] discusses the problem in the context of database systems. Moving objects databases represent a set of moving objects using abstract data types and maintain complete histories of movement.
Secondo, an open-source software, provides a framework for big trajectory data whose data model is not fixed. This Database Management System (DBMS) prototype can be used for different data models. "WhereNext" [30] predicts the next location using previously extracted trajectory patterns. Most studies discuss only a few aspects of trajectory prediction. For example, some studies focus on indoor and outdoor navigation. Similarly, other studies highlight public and private datasets. In many studies, the authors have validated the accuracy of these algorithms on given datasets. In our research, we propose a novel algorithm, NextSTMove, using Apache Spark to minimize the query and processing time for GPS big data. The reason is that Apache Spark is becoming the de facto standard for processing big data in the computing world. We used it to predict future locations from vehicle GPS data.
Bayes-based predictors were used by [31] to improve prediction performance by leveraging big data. They studied a large Call Detail Record (CDR) dataset. At first, they explored the dataset and found that call activity can be used to generate prior probabilities for a Bayes predictor. With this reasoning, they developed an enhanced Bayes predictor that uses a distance threshold and the users' regular locations to improve the generation of prior probabilities. Experimental results show that the proposed enhancements increase the accuracy of the Bayes-based predictor by 17 percentage points. They concluded that it is feasible to leverage big cellular data to enhance location predictors without relying on external data.
Apache Spark, developed in 2009 at UC Berkeley's AMPLab [32], is said to have achieved the lowest latency in comparison with Secondo and Parallel Secondo. It is freely available for several operating systems such as Windows, Linux, and macOS. Apache Spark is a unified analytics engine for big data processing and management that supports streaming data, batched data, SQL, graph, and machine learning processes. Apache Spark was used for point cloud spatial data management to achieve lower latency [33]. The authors found that, in comparison to the traditional methods for point cloud management (file system storage and a single processing server), a distributed approach based on Apache Spark achieved higher speed, greater robustness, and fault tolerance support. A comparison was made between the processing time taken by a Relational Database Management System (PostgreSQL) and the time taken using Apache Spark. They reported a latency reduction of up to 300%, which shows that Apache Spark is faster than such DBMS. As the number of nodes in the cluster was increased, the processing capabilities of the system increased. Increasing the number of points did not affect the query execution time of Apache Spark much, whereas queries run over PostgreSQL slowed almost immediately. A new platform for geospatial big data, the GeoSpark SQL framework, was developed inside Apache Spark by [34]; it can carry out geospatial SQL queries over an Apache Spark system. The research showed that Apache Spark performs better than traditional Relational Database Management Systems (RDBMS) for a large number of geospatial query types. The methods for inserting the point data into the Apache Spark data structure are presented in [33]. The data were sliced into rectangular areas and each area was ingested as a separate document. Rectangular areas were numbered by a Geohash system and stored in MongoDB. These structures allowed MapReduce operations to be executed on point cloud data, either within MongoDB or from an external framework such as Apache Hadoop [35]. Similarly, paper [36] compared quadtree and R-tree indexes on Spark to find the difference in query efficiency.
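To make the Geohash numbering of rectangular areas mentioned above more concrete, the following is a minimal sketch of the publicly documented Geohash encoding in plain Python. It is not the code used in [33]; the function name and the sample coordinates are chosen here for illustration only.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a Geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)

# Illustrative coordinates (not taken from the dataset)
print(geohash_encode(57.64911, 10.40744, 3))
```

Each additional character subdivides the rectangle further, so points that fall in the same rectangular area share the same code and can be stored under one key.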
The purpose of this research is to analyse the spatio-temporal mobility patterns of GPS data using Apache Spark to reduce the time taken per job for discovering useful information, which can support the decision-making process for real-world problems.

Overview.
In general, Apache Spark is used on clusters of machines to obtain very fast query responses. Spark Core, the kernel of the platform, provides the execution environment for all Spark applications. The main advantage of Apache Spark over other technologies such as Hadoop MapReduce, which relies on disk for intermediate data, is that Spark processes data in memory and can also make use of the disk. Unlike the Hadoop ecosystem, Apache Spark does not have its own distributed file system, but it can make use of the Hadoop Distributed File System (HDFS).
Apache Spark can run in standalone mode without an external resource manager. However, for a setup with more than one node, Yet Another Resource Negotiator (YARN) or Apache Mesos can be used for resource management, along with a distributed file system such as HDFS or Amazon Simple Storage Service (S3).

Spark SQL.
Spark has a built-in library for processing structured data. This can be used for complicated SQL database queries and algorithm-based analytics. Spark SQL supports Hive, the SQL-like HiveQL query language, Java Database Connectivity (JDBC), and Open Database Connectivity (ODBC). This also enables some degree of connection with existing databases, warehouses, and business intelligence environments.

Deployment of Apache Spark.
NextSTMove, an algorithm for predicting the future location of vehicles, was designed and implemented in Python. The PySpark package was installed locally on Windows 7 using pip.

while i ⟵ unique locations do
    total count ⟵ 0
    while j ⟵ all locations do
        IF (j = i)
            total count ⟵ total count + 1
            current probability ⟵ (total count / count of all locations) × 100
        END IF
    end
    IF (current probability > first probability)
        third probability ⟵ second probability
        second probability ⟵ first probability
        first probability ⟵ current probability
        first location ⟵ i
    ELSEIF (current probability > second probability)
        third probability ⟵ second probability
        second probability ⟵ current probability
        second location ⟵ i
    ELSEIF (current probability > third probability)
        third probability ⟵ current probability
        third location ⟵ i
    END IF
    IF (third probability = −sys.maxsize)
        third location ⟵ i
    END IF
end
ALGORITHM 1: NextSTMove: algorithm for predicting the top three locations.
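The frequency-ranking idea behind Algorithm 1 can be sketched compactly in Python. This is an illustrative reading rather than the authors' implementation: it uses `collections.Counter` instead of nested loops, keeps each location paired with its probability, and the function name and location names are hypothetical.

```python
from collections import Counter

def top_three_locations(visited):
    """Return up to three (location, probability in %) pairs, most frequent first."""
    if not visited:
        return []
    counts = Counter(visited)
    total = len(visited)
    # most_common(3) sorts by count in descending order
    return [(loc, 100.0 * n / total) for loc, n in counts.most_common(3)]

# Hypothetical location names observed for one vehicle in one time window
history = [
    "Depot", "Hospital A", "Depot", "Clinic B", "Depot",
    "Hospital A", "Station C", "Depot", "Hospital A", "Clinic B",
]
for loc, prob in top_three_locations(history):
    print(f"{loc}: {prob:.1f}%")
```

Counting every location once and then ranking avoids rescanning the full list for each unique location, which matters when a window contains many records.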

Approach.
In this section, we describe the approach used for this research. The approach is divided into data collection, data pre-processing, creation of the spatio-temporal database in Apache Spark, creation of locations from users' data, and spatio-temporal queries. A flowchart of the methodology is shown in Figure 1.

Raw GPS Data Collection.
Real-time data from a vehicle GPS tracker company is used for this research. The GPS data was received in an MS-SQL database. The data spanned from 1st January 2016 to 31st December 2016, with a total of 105,096,953 records across twelve tables, one for each month of 2016. A total of 2261 vehicles contributed to the data. Table 1 shows the data of an anonymized vehicle.
Data Pre-Processing.
Data pre-processing involved cleaning the data, removing unwanted data fields, removing records with missing fields to avoid null values in columns, and adding new columns. Unlike [29], where the authors used the kNN algorithm to identify latitudes and longitudes that are associated with each other, the data received already had location names assigned to clusters of latitudes and longitudes, and these were quite accurate. The GPS data received had thirty-three columns, most of which were not useful for the algorithm. A batch command query was run in the Windows command-line environment to extract only the five useful columns from the twelve monthly tables in the MS-SQL database into a single Comma Separated Values (CSV) file. A Spark session was then created for data cleaning:
    spark = SparkSession.builder.appName("Data cleaning").getOrCreate()
We used the DataFrame Application Programming Interface (API) of Apache Spark version 2.0 to manage our data. The final pre-processed CSV file, covering all twelve months of the year, was loaded into a DataFrame using the following lines of code:
    SparkDataFrame = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("E: TwelveMonthsTablesData.csv")
    SparkDataFrame.createOrReplaceTempView("TwelveMonthGPSData")
Here, SparkDataFrame is the DataFrame that holds the CSV data read through Spark's API. A spatio-temporal database was created by using the createOrReplaceTempView method of Apache Spark, which registers SparkDataFrame as the Spark SQL table TwelveMonthGPSData used by the NextSTMove algorithm.
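The cleaning step described in this section (keeping a handful of columns and discarding malformed or incomplete rows) can be illustrated without Spark using only the Python standard library. The sketch below is a loose analogue of that step, not the authors' pipeline; the five column names are hypothetical, since the retained columns are not listed in the text, and the sample rows are invented.

```python
import csv
import io

# Hypothetical names for the five retained columns
KEEP = ["vehicle_id", "timestamp", "latitude", "longitude", "location_name"]

def clean_rows(csv_text, keep=KEEP):
    """Yield dicts holding only the `keep` columns, skipping bad rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    idx = [header.index(name) for name in keep]
    for row in reader:
        if len(row) != len(header):        # wrong field count: malformed row
            continue
        values = [row[i].strip() for i in idx]
        if any(v == "" for v in values):   # missing value in a retained column
            continue
        yield dict(zip(keep, values))

# Invented sample: one extra column (speed), one row with a missing latitude,
# and one truncated row
raw = """vehicle_id,timestamp,latitude,longitude,location_name,speed
V1,2016-01-04 09:05:00,24.86,67.01,Depot,12
V1,2016-01-04 09:20:00,,67.02,Hospital A,30
V2,2016-01-04 09:10:00,24.90,67.05
V2,2016-01-04 09:15:00,24.91,67.06,Clinic B,0
"""
cleaned = list(clean_rows(raw))
print(len(cleaned), "rows kept")
```

On a dataset of the size described here, the same filtering would be delegated to Spark; the sketch only shows which rows such a step is meant to discard.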

Design of Spatial Queries.
As the focus of this work is future location prediction using Apache Spark, the design of the query included the spatio-temporal aspect of the data, i.e., where and when. We queried where a user's vehicle will be at a given time on a given day, for example, the location of "User A" on "Monday" between "9:00 AM" and "9:30 AM." The user is asked to input a valid vehicle number, the day they want to inquire about, and the time interval for which they want to predict the vehicle location.
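A query of this kind (one vehicle, one weekday, one time window) can be sketched as a simple filter in Python. The record layout, the function name, and the sample data below are assumptions made for illustration; the text does not specify the actual query.

```python
from datetime import datetime, time

def in_window(records, vehicle, weekday, start, end):
    """Return location names seen for `vehicle` on `weekday` between start and end.

    records: iterable of (vehicle_id, datetime, location_name) tuples.
    weekday: 0 = Monday ... 6 = Sunday, as in datetime.weekday().
    """
    return [
        loc
        for vid, ts, loc in records
        if vid == vehicle and ts.weekday() == weekday and start <= ts.time() <= end
    ]

# Invented sample records
records = [
    ("V1", datetime(2016, 1, 4, 9, 5), "Depot"),         # Monday, inside window
    ("V1", datetime(2016, 1, 4, 9, 45), "Hospital A"),   # Monday, outside window
    ("V1", datetime(2016, 1, 5, 9, 10), "Clinic B"),     # Tuesday
    ("V2", datetime(2016, 1, 4, 9, 15), "Depot"),        # different vehicle
    ("V1", datetime(2016, 1, 11, 9, 30), "Depot"),       # Monday, window edge
]
hits = in_window(records, "V1", 0, time(9, 0), time(9, 30))
print(hits)
```

The list of matching location names is what a frequency-based predictor would then rank to obtain the most probable locations for that window.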

Creation of Location from Users Data.
The results from the algorithm were visualized on the web using geospatial visualization libraries of Python. The results were extracted for web maps at run time to avoid any delays in data generation. The findings are discussed in Section 4. Algorithm 2 explains the logical steps involved. Once the coordinates of the top three locations, along with their names and probabilities, are ready to be mapped, the Folium library in Python is used to generate a web map. Figure 2 illustrates all vehicle visits in one month. The visualization appears cluttered. Therefore, a further zoomed-in view of the city for a single user in the first month can also be seen.

Top Predicted Locations.
The queries were applied to the data to generate the top three probable locations between two points in time. Table 2 shows the queries while Karachi and interior province to provide ambulance services to the larger city from underdeveloped areas.

Latency for Apache Spark.
A comparison is carried out between the algorithm with and without Apache Spark. Initially, the NextSTMove algorithm was developed using plain Python and queries, and the latency of the process was measured. The algorithm was then developed on Apache Spark using PySpark. Up to 200 queries were applied, and a sample of five random queries is shown in Table 3. We achieved a substantial decrease in the time taken by the queries. The job took more than two thousand seconds without Apache Spark, as illustrated in Figure 6. With Apache Spark, the queries took less than 300 seconds. Figure 7 illustrates the time taken in detail.
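The relationship between the two running times can be expressed either as a speed-up ratio or as a percentage of time saved. The short sketch below works through both, using the rounded bounds quoted above (2000 s and 300 s) purely as illustrative inputs; the function names are chosen here and the exact measured values are those in Figures 6 and 7.

```python
def speedup(t_before, t_after):
    """How many times faster the second run is than the first."""
    return t_before / t_after

def percent_reduction(t_before, t_after):
    """Share of the original running time that was saved, in percent."""
    return 100.0 * (t_before - t_after) / t_before

# Rounded bounds from the text, for illustration only
t_plain, t_spark = 2000.0, 300.0
print(f"speed-up: {speedup(t_plain, t_spark):.2f}x")
print(f"time saved: {percent_reduction(t_plain, t_spark):.1f}%")
```

The two measures answer different questions: the ratio is unbounded, while the share of time saved cannot exceed 100%, which is worth keeping in mind when comparing percentage figures for speed improvements.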

Accuracy of Predicted Locations.
Data from the first six months were used to predict the future locations of users, and the data from the next six months were then used to measure the accuracy of the predicted locations. We compared the actual locations from the next six months' data with the predicted output for the queries. Table 3 shows the queries whose accuracy percentages are illustrated in Figure 8. The three bars for each query in Figure 8 show the accuracy of the top three locations that were queried. As shown in Figure 8, query 4 achieved 100% accuracy because this user did not change his pattern for that time of the day. Therefore, the algorithm predicted it accurately. Similarly, in query 3 the algorithm achieves an accuracy of 90% for the second and third predicted locations, which indicates that this user was mostly using the same route or was mostly present at the same location. The first predicted location for query 3 has a prediction rate of 85%, which means the user showed varied behaviour.
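The hold-out evaluation described above can be sketched as follows. The exact accuracy definition is not spelled out in the text, so this example assumes one common reading, top-k accuracy: the share of held-out visits whose location is among the first k predicted locations. The function names, the split date handling, and the sample visits are all illustrative assumptions.

```python
from collections import Counter
from datetime import date

def top_locations(visits, k=3):
    """Most frequent k locations in a list of (date, location) visits."""
    counts = Counter(loc for _, loc in visits)
    return [loc for loc, _ in counts.most_common(k)]

def top_k_accuracy(predicted, test_visits, k):
    """Percentage of held-out visits whose location is in predicted[:k]."""
    if not test_visits:
        return 0.0
    allowed = set(predicted[:k])
    hits = sum(1 for _, loc in test_visits if loc in allowed)
    return 100.0 * hits / len(test_visits)

# Invented visits of one vehicle in one weekday/time window over 2016
visits = [
    (date(2016, 1, 4), "Depot"), (date(2016, 2, 1), "Depot"),
    (date(2016, 3, 7), "Hospital A"), (date(2016, 4, 4), "Depot"),
    (date(2016, 5, 2), "Hospital A"), (date(2016, 6, 6), "Clinic B"),
    (date(2016, 7, 4), "Depot"), (date(2016, 8, 1), "Hospital A"),
    (date(2016, 9, 5), "Depot"), (date(2016, 10, 3), "Station C"),
]
split = date(2016, 7, 1)  # first six months train, last six months test
train = [v for v in visits if v[0] < split]
test = [v for v in visits if v[0] >= split]

ranked = top_locations(train)
for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, test, k):.1f}%")
```

Under this reading, accuracy can only stay the same or grow as k increases, and a vehicle with a fixed routine reaches 100% already at k = 1.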

Conclusions
Our work presents a novel algorithm, NextSTMove, with which future vehicle movements are successfully predicted with and without Apache Spark. The algorithm achieved 75% to 85% accuracy, and in some cases 100% accuracy where the users follow a repetitive pattern. The main aim of this research was to improve the latency and efficiency compared with existing algorithms such as NextPlace [5]. Apache Spark, a big data platform, was fully utilized to achieve this. The algorithm reduced the processing time by up to 300%. This processing was done on a total of 2261 users with approximately 100 million data points.
This study is significant for predicting future locations of emergency vehicles. It can help users perform spatial tasks while improving the analytical knowledge gained from understanding their behaviours. The emergency vehicle tracker data reveal their spatio-temporal patterns. This research work can also help in solving many geospatial big data applications from both a commercial and a security viewpoint.
As part of a future road map, we plan to expand our work by including real-time streaming of big data instead of processing with batched data only. Furthermore, we plan to introduce more nodes to the distributed processing to enhance the efficiency of the system running the queries. More data attributes can be introduced to analyse additional information which can reveal meaningful information and patterns for real-time applications.
Further analysis can be carried out to answer the question of how and why a user visited a particular location. This can help find the semantics of trajectories and support their analysis. Similarly, another future direction is predicting the next location of a vehicle from its previous history, i.e., where a user will be next after a specific location. This could support a system that predicts which routes will be congested at a specific time and asks emergency vehicle drivers whether they want to avoid those roads.
This study opens up further avenues for research. The main concern with using Apache Spark for NextSTMove is that, when the queries are loaded, the first query takes more time to process than the rest of the queries. Also, in this work Apache Spark processes batched data, while other platforms such as Apache Flink can work with streaming data as well. Therefore, to increase processing capabilities, streaming data processing can be embedded along with it as part of our future work.

Data Availability
Some sample data of a few vehicles might be provided on request.

Conflicts of Interest
The authors declare no conflicts of interest.