DOI: 10.1145/3555050.3569128
research-article
Best Paper

An ultra-low latency and compatible PCIe interconnect for rack-scale communication

Published: 30 November 2022

ABSTRACT

Emerging network-attached resource disaggregation architectures require ultra-low latency rack-scale communication. However, current hardware-offloading (e.g., RDMA) and user-space (e.g., mTCP) communication schemes still rely on heavily layered protocol stacks, which require translation between the PCIe bus and the network protocol, or complex connection/memory resource management within RNICs, inevitably adding latency overhead.

We argue that PCIe Non-Transparent Bridge (NTB) is a superior high-speed in-rack network technology for interconnecting PCIe-attached machines or devices within the same PCIe fabric, since no translation is needed between PCIe and a network protocol. We present NTSocks, the first user-space in-rack interconnect over PCIe fabric, which virtualizes native NTB into high-level network functionalities for rack-scale systems through software-hardware co-design. NTSocks provides (1) compatibility through a fast socket-like abstraction, (2) multi-thread scalability using a core-driven data-plane model, and (3) fair and efficient resource sharing through a multi-tenant isolation mechanism. Even though PCIe NTB was originally designed for device communication across PCIe domains, NTSocks demonstrates a flexible user-level indirection with performance close to bare-metal NTB while providing common network stack features. In evaluations with a latency-sensitive key-value store, NTSocks achieves up to 24.5× and 1.58× lower latency than kernel and RDMA sockets, respectively.

    • Published in

      CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies
      November 2022
      431 pages
      ISBN:9781450395083
      DOI:10.1145/3555050

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      CoNEXT '22 Paper Acceptance Rate: 28 of 151 submissions, 19%. Overall Acceptance Rate: 198 of 789 submissions, 25%.