Computer Science, 2022, Vol. 49, Issue (1): 95-100. doi: 10.11896/jsjkx.210100060
WANG Ru-bin (王如斌)1,3, LI Rui-yuan (李瑞远)2,3, HE Hua-jun (何华均)1,3, LIU Tong (刘通)4, LI Tian-rui (李天瑞)1
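Abstract: Spatial distance join is one of the most fundamental operations in spatial data analysis and has a wide range of applications. Existing distributed methods suffer from three problems: the spatial domain they select is too large, the data is skewed, and self-joins are slow. To address these problems, this paper proposes JUST-Join, a new distributed distance join algorithm for massive spatial data. First, JUST-Join selects only the necessary spatial region as the global domain, which filters data early and reduces useless data transfer and unnecessary computation. Second, it takes into account the distributions of both datasets participating in the join, which alleviates data skew. Finally, to avoid redundant computation in the self-join case, it uses a plane-sweep algorithm to further improve efficiency. JUST-Join is implemented on Spark and evaluated in extensive experiments on real datasets. The experimental results show that JUST-Join outperforms existing state-of-the-art distributed spatial analysis systems in both efficiency and scalability.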
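The abstract does not spell out how the global domain is chosen or how the plane sweep is organized, so the following is only a minimal single-machine sketch under stated assumptions, not the JUST-Join implementation. It assumes 2D points, Euclidean distance, and that the "necessary region" can be approximated by the overlap of the two datasets' bounding boxes, each grown by the distance threshold `d`. All function names (`expanded_mbr`, `global_domain`, `prefilter`, `self_join_plane_sweep`) are hypothetical and introduced here for illustration.

```python
from math import hypot


def expanded_mbr(points, d):
    """Bounding box of `points`, grown by d on every side."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs) - d, min(ys) - d, max(xs) + d, max(ys) + d)


def global_domain(r_points, s_points, d):
    """Assumed 'necessary region': overlap of the two expanded boxes.

    Any pair (r, s) within distance d has both points inside this
    overlap, so points outside it can be dropped before the join.
    Returns None when the boxes do not overlap (empty join result).
    """
    ax0, ay0, ax1, ay1 = expanded_mbr(r_points, d)
    bx0, by0, bx1, by1 = expanded_mbr(s_points, d)
    x0, y0 = max(ax0, bx0), max(ay0, by0)
    x1, y1 = min(ax1, bx1), min(ay1, by1)
    if x0 > x1 or y0 > y1:
        return None
    return (x0, y0, x1, y1)


def prefilter(points, domain):
    """Keep only the points that fall inside the global domain."""
    if domain is None:
        return []
    x0, y0, x1, y1 = domain
    return [p for p in points if x0 <= p[0] <= x1 and y0 <= p[1] <= y1]


def self_join_plane_sweep(points, d):
    """Distance self-join: index pairs (i, j), i < j, within distance d.

    Points are visited in x order; each point is compared only with
    later points whose x offset is at most d, so every unordered pair
    is examined at most once and mirrored pairs are never produced.
    """
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    pairs = []
    for a in range(len(order)):
        i = order[a]
        xi, yi = points[i]
        for b in range(a + 1, len(order)):
            j = order[b]
            xj, yj = points[j]
            if xj - xi > d:
                break  # every later point is even farther away in x
            if hypot(xj - xi, yj - yi) <= d:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)
```

In a distributed setting, a filter like `prefilter` would run before any shuffle and a sweep like `self_join_plane_sweep` would run inside each partition. How JUST-Join partitions the data and balances skew is not described in the abstract and is not modeled here.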