ABSTRACT
Much effort has been devoted to supporting high-performance spatial queries on large volumes of spatial data in distributed spatial computing systems, especially in the MapReduce paradigm. Recent work has focused on extending spatial MapReduce frameworks to leverage the high-performance in-memory distributed processing capabilities of systems such as Spark. However, this performance advantage requires sufficient memory and comprehensive configuration. When these requirements are not met, processing falls back to disk I/O, defeating the purpose of such systems, or in the worst case the job runs out of memory and fails. The problem is aggravated for spatial processing because the underlying in-memory systems are oblivious to spatial data features and characteristics. In this paper we present SparkGIS, an in-memory oriented spatial data querying system that provides high-throughput, low-latency spatial query handling by adapting Apache Spark's distributed processing capabilities. It supports basic spatial queries, including containment, spatial join, and k-nearest neighbor, and allows them to be extended into complex query pipelines. SparkGIS mitigates skew in distributed processing by supporting several dynamic partitioning algorithms suited to a wide range of contemporary application scenarios. Multilevel global and local indexes, both pre-generated and built on demand in memory, allow SparkGIS to prune input data and apply compute-intensive operations only to the relevant subset of spatial objects. Finally, SparkGIS employs dynamic query rewriting to gracefully manage large spatial query workflows that exceed available distributed resources. Our comparative evaluation shows that SparkGIS performs on par with contemporary Spark-based platforms for relatively small queries and outperforms them for larger data and memory-intensive workflows, owing to dynamic query rewriting and efficient spatial data management.
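To make the partition-then-prune idea in the abstract concrete, the following is a minimal single-machine sketch of a spatial join over axis-aligned bounding boxes. It is not SparkGIS code and does not use Spark: the fixed uniform grid stands in for the dynamic partitioning algorithms, the per-cell lists stand in for the in-memory indexes, and every name here (`intersects`, `cells_for`, `partition`, `spatial_join`, `cell_size`) is a hypothetical choice made for illustration.

```python
from collections import defaultdict
from itertools import product

# A rectangle is (xmin, ymin, xmax, ymax). Real geometries would be
# polygons, with these rectangles acting as their bounding boxes.
# Assumption: a fixed uniform grid; the paper's partitioners are dynamic.


def intersects(a, b):
    """True when two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cells_for(rect, cell_size):
    """Return every grid cell (col, row) that the rectangle overlaps."""
    x0, y0 = int(rect[0] // cell_size), int(rect[1] // cell_size)
    x1, y1 = int(rect[2] // cell_size), int(rect[3] // cell_size)
    return product(range(x0, x1 + 1), range(y0, y1 + 1))


def partition(rects, cell_size):
    """Map each grid cell to the indexes of the rectangles overlapping it."""
    cells = defaultdict(list)
    for idx, rect in enumerate(rects):
        for cell in cells_for(rect, cell_size):
            cells[cell].append(idx)
    return cells


def spatial_join(left, right, cell_size=10.0):
    """Return sorted (i, j) pairs where left[i] intersects right[j]."""
    left_cells = partition(left, cell_size)
    right_cells = partition(right, cell_size)
    # A set drops duplicate pairs produced by objects spanning several cells.
    result = set()
    # Pruning: only cells populated on both sides can produce a match.
    for cell in left_cells.keys() & right_cells.keys():
        for i in left_cells[cell]:
            for j in right_cells[cell]:
                if intersects(left[i], right[j]):
                    result.add((i, j))
    return sorted(result)
```

In a distributed setting each grid cell would correspond to one task, which is why a skewed cell becomes a straggler and why the choice of partitioning matters. For example, `spatial_join([(0, 0, 5, 5), (20, 20, 25, 25)], [(4, 4, 6, 6), (11, 11, 21, 21)])` compares only rectangles that share a cell rather than all four pairs.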
Index Terms
- SparkGIS: Resource Aware Efficient In-Memory Spatial Query Processing