Computer Science ›› 2022, Vol. 49 ›› Issue (1): 95-100. doi: 10.11896/jsjkx.210100060

• Database & Big Data & Data Science •

Distributed Distance Join Algorithm for Massive Spatial Data

WANG Ru-bin1,3, LI Rui-yuan2,3, HE Hua-jun1,3, LIU Tong4, LI Tian-rui1

  1 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
  2 College of Computer Science, Chongqing University, Chongqing 400044, China
  3 JD Intelligent Cities Research, Beijing 100176, China
  4 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  • Received: 2021-01-07 Revised: 2021-05-12 Online: 2022-01-15 Published: 2022-01-18
  • Corresponding author: LI Rui-yuan (liruiyuan@whu.edu.cn)
  • Author e-mail: wrb@my.swjtu.edu.cn
  • About author: WANG Ru-bin, born in 1994, postgraduate. His main research interests include spatio-temporal data management.
    LI Rui-yuan, born in 1990, Ph.D. His main research interests include spatio-temporal data management and mining.
  • Supported by:
    National Key Research and Development Program of China (2019YFB2101801).

Abstract: Spatial distance join is one of the most fundamental operations in spatial data analysis and has a wide range of applications. Existing distributed methods suffer from three problems: the spatial domain they select is too large, the data are skewed, and self-joins are slow. To address these problems, this paper proposes JUST-Join, a novel distributed distance join algorithm for massive spatial data. First, JUST-Join takes only the necessary spatial region as the global domain, so data can be filtered out early, which reduces needless data transmission and computation. Second, it considers the spatial distributions of both datasets that participate in the join, which alleviates data skew. Third, to avoid redundant computation in the self-join case, it adopts a plane sweep algorithm to further improve efficiency. JUST-Join is implemented on Spark and evaluated in extensive experiments on real datasets. The experimental results show that JUST-Join outperforms state-of-the-art distributed spatial analysis systems in both efficiency and scalability.

Key words: Distributed computing, Spatial distance join, Spatial indexing, Spatial partition, Spatio-temporal data
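
Two of the ideas named in the abstract can be illustrated on a single machine. The abstract does not give the exact definition of the global domain or the details of the sweep, so the following is only a minimal sketch under stated assumptions: the "necessary region" is taken to be the intersection of the two datasets' bounding rectangles, each grown by the join distance d, and the self-join sweep runs along the x axis. All function names are hypothetical and none of this is the authors' Spark implementation.

```python
from math import hypot
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def mbr(points: List[Point]) -> Box:
    """Minimum bounding rectangle of a non-empty point list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def necessary_domain(r: List[Point], s: List[Point], d: float) -> Optional[Box]:
    """Intersection of both MBRs, each grown by d (assumed domain definition).

    A point that has a partner within distance d in the other dataset
    always lies inside this box, so points outside it can be dropped early.
    Returns None when the grown rectangles do not overlap (empty join).
    """
    a, b = mbr(r), mbr(s)
    box = (max(a[0], b[0]) - d, max(a[1], b[1]) - d,
           min(a[2], b[2]) + d, min(a[3], b[3]) + d)
    return box if box[0] <= box[2] and box[1] <= box[3] else None


def inside(p: Point, box: Box) -> bool:
    """True if p lies in the closed rectangle box."""
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def plane_sweep_self_join(points: List[Point], d: float) -> List[Tuple[Point, Point]]:
    """Report each unordered pair within Euclidean distance d exactly once."""
    pts = sorted(points)  # sweep from left to right along x
    result = []
    start = 0  # index of the leftmost point still within d of the sweep line
    for i, p in enumerate(pts):
        # Discard points that are more than d behind the sweep line in x.
        while pts[start][0] < p[0] - d:
            start += 1
        # Only earlier points in the window are compared, so no pair repeats.
        for q in pts[start:i]:
            if abs(q[1] - p[1]) <= d and hypot(p[0] - q[0], p[1] - q[1]) <= d:
                result.append((q, p))
    return result
```

A typical use would be to compute `necessary_domain(r, s, d)` once, keep only the points for which `inside(p, box)` holds, and then partition the survivors. In a distributed setting the sweep would presumably run inside each partition; that wiring is omitted here.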

CLC Number: TP338