Abstract
Recent advances in high-performance network interconnects have significantly narrowed the performance gap between intra-node and inter-node communication, opening up opportunities for distributed memory platforms to enforce cache coherence across nodes. To this end, we propose GAM, an efficient distributed in-memory platform that provides a directory-based cache coherence protocol over remote direct memory access (RDMA). GAM manages the free memory distributed among multiple nodes to provide a unified memory model, and supports a set of user-friendly APIs for memory operations. To remove writes from critical execution paths, GAM allows a write to be reordered with subsequent reads and writes, and hence enforces partial store order (PSO) memory consistency. A lightweight logging scheme provides fault tolerance in GAM. We further build a transaction engine and a distributed hash table (DHT) atop GAM to demonstrate the ease of use and applicability of the provided APIs. Finally, we conduct extensive micro-benchmarks to evaluate the read/write/lock performance of GAM under various workloads, and macro-benchmarks on the transaction engine and DHT. The results show that GAM outperforms existing distributed memory platforms.
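To make the idea of a directory-based cache coherence protocol concrete, the following is a minimal single-process sketch of the general technique for one cache line. It is an illustration under stated assumptions, not GAM's actual protocol: the class `DirectoryEntry` and its `read`/`write` methods are hypothetical names, there is no RDMA or networking, and writes complete synchronously, so the PSO reordering described above is not modeled.

```python
class DirectoryEntry:
    """Toy home-node directory for a single cache line (hypothetical model)."""

    def __init__(self, value=0):
        self.home_value = value   # copy stored in the home node's memory
        self.sharers = set()      # nodes holding a read-only cached copy
        self.owner = None         # node holding the only writable (dirty) copy
        self.caches = {}          # node id -> value cached at that node

    def read(self, node):
        if self.owner == node:
            # The owner already has the latest value in its cache.
            return self.caches[node]
        if self.owner is not None:
            # Another node holds a dirty copy: write it back to home
            # memory and downgrade that node to a read-only sharer.
            self.home_value = self.caches[self.owner]
            self.sharers.add(self.owner)
            self.owner = None
        # Register the reader as a sharer and give it a copy.
        self.sharers.add(node)
        self.caches[node] = self.home_value
        return self.caches[node]

    def write(self, node, value):
        # Invalidate every copy held by other nodes before writing.
        for other in list(self.caches):
            if other != node:
                del self.caches[other]
        self.sharers.clear()
        # The writer becomes the exclusive owner of the line.
        self.owner = node
        self.caches[node] = value


line = DirectoryEntry(7)
line.read(1)          # node 1 becomes a sharer
line.read(2)          # node 2 becomes a sharer
line.write(1, 42)     # node 2's copy is invalidated; node 1 owns the line
latest = line.read(2) # forces a write-back, then node 2 sees the new value
```

In this sketch the directory is the single point that knows who caches the line, so a write only has to contact the recorded sharers rather than broadcast to every node.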
Index Terms
- Efficient distributed memory management with RDMA and caching