ABSTRACT
Emerging network-attached resource disaggregation architectures require ultra-low-latency rack-scale communication. However, current hardware-offloading (e.g., RDMA) and user-space (e.g., mTCP) communication schemes still rely on heavily layered protocol stacks, which require either translation between the PCIe bus and the network protocol or complex connection and memory resource management within RNICs, inevitably adding latency overhead.
We argue that the PCIe Non-Transparent Bridge (NTB) is a superior high-speed in-rack network technology for interconnecting PCIe-attached machines or devices over the same PCIe fabric, since no translation is needed between PCIe and a network protocol. We present NTSocks, the first user-space in-rack interconnect over PCIe fabric, which virtualizes native NTB into high-level network functionality for rack-scale systems through software-hardware co-design. NTSocks provides (1) compatibility through a fast socket-like abstraction, (2) multi-thread scalability using a core-driven data-plane model, and (3) fair and efficient resource sharing with a multi-tenant isolation mechanism. Even though PCIe NTB was originally designed for device communication across PCIe domains, NTSocks provides a flexible user-level indirection with performance close to bare-metal NTB while offering common network stack features. In evaluations with a latency-sensitive key-value store, NTSocks achieves up to 24.5× and 1.58× lower latency than kernel and RDMA sockets, respectively.