research-article

An ultra-low latency and compatible PCIe interconnect for rack-scale communication

Authors:
Yibo Huang, Fudan University, China
Yukai Huang, Fudan University, China
Ming Yan, Fudan University, China
Jiayu Hu, Intel, China
Cunming Liang, Intel, China
Yang Xu, Fudan University, China
Wenxiong Zou, Fudan University, China
Yiming Zhang, Fudan University, China
Rui Zhang, Fudan University, China
Chunpu Huang, Fudan University, China
Jie Wu, Fudan University, China

CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies, November 2022, Pages 232–244, https://doi.org/10.1145/3555050.3569128

Published: 30 November 2022

ABSTRACT

Emerging network-attached resource disaggregation architectures require ultra-low latency rack-scale communication. However, current hardware-offloading (e.g., RDMA) and user-space (e.g., mTCP) communication schemes still rely on heavily layered protocol stacks that require translation between the PCIe bus and the network protocol, or complex connection and memory resource management within RNICs, which inevitably adds latency overhead.

We argue that PCIe Non-Transparent Bridge (NTB) is a superior high-speed in-rack network technology for interconnecting PCIe-attached machines or devices on the same PCIe fabric, since no translation is needed between PCIe and a network protocol. We present NTSocks, the first user-space in-rack interconnect over PCIe fabric, which virtualizes native NTB into high-level network functionalities for rack-scale systems through software-hardware co-design. NTSocks provides (1) compatibility through a fast socket-like abstraction, (2) multi-thread scalability using a core-driven data-plane model, and (3) fair and efficient resource sharing with a multi-tenant isolation mechanism. Even though PCIe NTB was originally designed for device communication across PCIe domains, NTSocks provides a flexible user-level indirection with performance close to bare-metal NTB while offering common network stack features. In evaluations with a latency-sensitive key-value store, NTSocks achieves up to 24.5× and 1.58× lower latency than kernel sockets and RDMA sockets, respectively.

Index Terms

An ultra-low latency and compatible PCIe interconnect for rack-scale communication
1. Networks
  1. Network types
    1. Data center networks

Published in
CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies
November 2022
431 pages
ISBN: 9781450395083
DOI: 10.1145/3555050
General Chairs:
Giuseppe Bianchi, University of Rome Tor Vergata, Italy
Alessandro Mei, Sapienza University of Rome, Italy
Copyright © 2022 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Best Paper
Author Tags
PCIe interconnect
PCIe non-transparent bridging
disaggregation
high-speed networks
rack-scale communication
Qualifiers
- research-article
Acceptance Rates
CoNEXT '22 Paper Acceptance Rate: 28 of 151 submissions, 19%
Overall Acceptance Rate: 198 of 789 submissions, 25%