SparkGIS: Resource Aware Efficient In-Memory Spatial Query Processing

Authors:
Furqan Baig, Stony Brook University
Hoang Vo, Stony Brook University
Tahsin Kurc, Stony Brook University
Joel Saltz, Stony Brook University
Fusheng Wang, Stony Brook University

SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
November 2017, Article No.: 28, Pages 1–10
https://doi.org/10.1145/3139958.3140019

Published: 07 November 2017

ABSTRACT

Much effort has been devoted to supporting high-performance spatial queries on large volumes of spatial data in distributed spatial computing systems, especially in the MapReduce paradigm. Recent work has focused on extending spatial MapReduce frameworks to leverage the high-performance in-memory distributed processing capabilities of systems such as Spark. However, this performance advantage requires sufficient memory and comprehensive configuration. When these requirements are not met, processing falls back to disk I/O, defeating the purpose of such systems, or, in the worst case, the job runs out of memory and fails. The problem is aggravated for spatial processing because the underlying in-memory systems are oblivious to spatial data features and characteristics. In this paper we present SparkGIS, an in-memory oriented spatial data querying system for high-throughput, low-latency spatial query handling that adapts Apache Spark's distributed processing capabilities. It supports basic spatial queries, including containment, spatial join, and k-nearest neighbor, and allows these to be extended into complex query pipelines. SparkGIS mitigates skew in distributed processing by supporting several dynamic partitioning algorithms suitable for a rich set of contemporary application scenarios. Multilevel indexes (global and local, pre-generated and on-demand, held in memory) allow SparkGIS to prune input data and apply compute-intensive operations only to the subset of relevant spatial objects. Finally, SparkGIS employs dynamic query rewriting to gracefully manage large spatial query workflows that exceed available distributed resources. Our comparative evaluation shows that SparkGIS performs on par with contemporary Spark-based platforms for relatively small queries and outperforms them for larger data and memory-intensive workflows, owing to dynamic query rewriting and efficient spatial data management.
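
To make the idea of partitioning and index-based pruning in a spatial join more concrete, the following is a minimal single-machine sketch in plain Python. It is not SparkGIS code and does not use Spark: the fixed uniform grid, the function names (`intersects`, `cells_for`, `partition`, `spatial_join`), and the `cell_size` parameter are all assumptions made for illustration, and axis-aligned rectangles stand in for real geometries. Each rectangle is assigned to the grid cells it overlaps, only cells populated in both inputs are examined, and the exact intersection test runs on the surviving candidates.

```python
from collections import defaultdict
from itertools import product

# Hypothetical illustration only: rectangles are (xmin, ymin, xmax, ymax) tuples.


def intersects(a, b):
    """True if two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cells_for(rect, cell_size):
    """Return the (col, row) grid cells that a rectangle overlaps."""
    x0, y0 = int(rect[0] // cell_size), int(rect[1] // cell_size)
    x1, y1 = int(rect[2] // cell_size), int(rect[3] // cell_size)
    return product(range(x0, x1 + 1), range(y0, y1 + 1))


def partition(rects, cell_size):
    """Map each grid cell to the ids of the rectangles overlapping it."""
    cells = defaultdict(list)
    for rid, rect in enumerate(rects):
        for cell in cells_for(rect, cell_size):
            cells[cell].append(rid)
    return cells


def spatial_join(left, right, cell_size=10.0):
    """Return sorted (left_id, right_id) pairs whose rectangles intersect."""
    left_cells = partition(left, cell_size)
    right_cells = partition(right, cell_size)
    results = set()  # a set drops duplicates from objects spanning several cells
    # Pruning: only cells populated in both datasets are examined at all.
    for cell in left_cells.keys() & right_cells.keys():
        for i in left_cells[cell]:
            for j in right_cells[cell]:
                if intersects(left[i], right[j]):
                    results.add((i, j))
    return sorted(results)


if __name__ == "__main__":
    left = [(0, 0, 5, 5), (12, 12, 18, 18), (100, 100, 105, 105)]
    right = [(4, 4, 13, 13), (50, 50, 55, 55)]
    print(spatial_join(left, right))
```

A fixed grid is the simplest possible choice and copes poorly with skewed data, because dense regions produce overloaded cells. That weakness is the kind of problem the dynamic partitioning algorithms mentioned in the abstract are meant to address.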


Published in
SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
November 2017
677 pages
ISBN: 9781450354905
DOI: 10.1145/3139958
Editors: Erik Hoel, Shawn Newsam, Siva Ravada, Roberto Tamassia, Goce Trajcevski
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
- In-Memory processing
- MapReduce
- Spark
- Spatial processing
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
SIGSPATIAL '17 Paper Acceptance Rate: 39 of 193 submissions, 20%
Overall Acceptance Rate: 220 of 1,116 submissions, 20%
Article Metrics
- Total Citations: 48
- Total Downloads: 440
- Downloads (Last 12 months): 18
- Downloads (Last 6 weeks): 4