
Future Generation Computer Systems

Volume 53, December 2015, Pages 109-118

ASIP acceleration for virtual-to-physical address translation on RDMA-enabled FPGA-based network interfaces

https://doi.org/10.1016/j.future.2014.12.012

Highlights

  • RDMA needs independent memory management by the NIC in host/GPU virtual address space.

  • This requires fast lookup of buffers and Virtual-to-Physical address translation.

  • We developed an ASIP for the FPGA to accelerate these operations with good results.

  • ASIP design has been effective thanks to an architecture exploration toolsuite.

Abstract

We developed a point-to-point, low-latency, 3D torus Network Controller, integrated in an FPGA-based PCIe board, which implements a Remote Direct Memory Access (RDMA) communication protocol. RDMA requires the ability to directly access the remote node's application memory with minimal OS or CPU intervention. To this purpose, a key element is the design of a direct memory writing mechanism to address the destination buffers; on OSes supporting Virtual Memory this corresponds to a number of page-segmented DMAs. To minimally affect overall performance, mechanisms with the lowest possible latency are needed both for Virtual-to-Physical address translation and for scanning the list of registered buffers. In a first implementation these tasks were assigned to a soft-core μC on the FPGA, leading to a 1.6 μs latency to process a single packet and limiting the peak bandwidth. As a second trial, we present an accelerated version of these time-critical network functions exploiting an application-specific processor (ASIP) designed using a retargetable ASIP development toolsuite that allows architectural exploration. Benchmark results for the Buffer Search and Virtual-to-Physical tasks on the ASIP show latency improvements, with up to ten times lower cycle cost than on the soft-core μC.

Introduction

In the context of the EURETILE FP7 project  [1], aimed at developing a many-tile computing platform and its programming paradigm, we recently developed a custom Network Interface Controller (NIC)  [2] enabling the assembly of a computing cluster with a 3D toroidal network mesh made of COTS components. This NIC, named APEnet+, is a PCI-Express device based on a high-end FPGA, exposing 6 fully bidirectional links for communication with remote nodes by means of Remote Direct Memory Access (RDMA) semantics; it aims at High Performance Computing for scientific applications, i.e. low latency and high bandwidth. Moreover, leveraging the peer-to-peer (P2P) capabilities of Fermi- and Kepler-class NVIDIA GPUs  [3], APEnet+ can perform real zero-copy, low-latency GPU-to-GPU transfers  [4].

These advanced capabilities require APEnet+ to interface with the memory system of both the host and the GPU; specifically, APEnet+ is able to perform unmediated access to the virtual memory space of the remote node. Virtual memory is a key concept in modern operating systems and computer architectures, allowing for concealment of the fragmentation of physical memory and tricking each process running on a CPU into seeing its own addressable space as one contiguous chunk. For the x86_64 architecture APEnet+ is targeted to, this translates into splitting the memory address space into variable-sized pages (usually 4 KiB in size) and maintaining a list of matchings between virtual and physical pages known as the Page Table. Retrieval from the Page Table, the page walk, is a task charged to a specialized CPU-integrated component, the Memory Management Unit (MMU), which translates every access to virtual addresses into physical ones in a way transparent to applications.
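As a concrete illustration of what the page walk has to do, the following minimal C sketch (not the APEnet+ firmware; the flat table layout and its contents are invented for the example) splits a virtual address into a virtual page number and an in-page offset, looks the page number up in a toy Page Table, and recombines the resulting physical frame number with the offset:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_SHIFT 12u                      /* 4 KiB pages                 */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  ((uint64_t)PAGE_SIZE - 1)

/* One entry of a flat, single-level toy page table. */
typedef struct {
    uint64_t vpn;   /* virtual page number   */
    uint64_t pfn;   /* physical frame number */
} pte_t;

/* Walk the toy table: find the entry whose vpn matches, then
 * recombine the physical frame with the in-page offset. */
static int translate(const pte_t *table, size_t n,
                     uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    for (size_t i = 0; i < n; i++) {
        if (table[i].vpn == vpn) {
            *paddr = (table[i].pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
            return 0;
        }
    }
    return -1;  /* no mapping: page fault */
}

int main(void)
{
    pte_t table[] = { { 0x7f123, 0x00a10 }, { 0x7f124, 0x03c22 } };
    uint64_t pa;
    if (translate(table, 2, 0x7f123ab0ULL, &pa) == 0)
        printf("phys = 0x%llx\n", (unsigned long long)pa);
    return 0;
}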

Because of this, when the APEnet+ card is at the receiving end of a network transfer, it can comply with RDMA semantics only if it is able to autonomously translate the virtual addresses carried in the header field of incoming data packets into the physical addresses that are the real write targets of its DMA engine; this must work whether the in-flight buffer is bound to host or GPU memory of the remote node.
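The consequence is that a single incoming RDMA write generally has to be fragmented into several page-bounded DMA transactions. The sketch below (with hypothetical names; the actual descriptor format of the APEnet+ DMA engine is not shown) illustrates this fragmentation, with the virtual-to-physical step left as a placeholder:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

typedef struct {
    uint64_t phys_addr;   /* DMA target for this fragment           */
    uint32_t len;         /* fragment length, never crosses a page  */
} dma_desc_t;

/* Placeholder for the NIC-side page walk (Nios II, TLB or ASIP in the
 * real design); the identity mapping is only for the example. */
static uint64_t v2p(uint64_t vaddr) { return vaddr; }

/* Split one virtually-contiguous buffer into page-bounded descriptors.
 * Returns how many descriptors were written into out[]. */
static size_t build_dma_list(uint64_t vaddr, size_t len,
                             dma_desc_t *out, size_t max_desc)
{
    size_t n = 0;
    while (len > 0 && n < max_desc) {
        size_t in_page = PAGE_SIZE - (size_t)(vaddr & (PAGE_SIZE - 1));
        size_t chunk   = len < in_page ? len : in_page;
        out[n].phys_addr = v2p(vaddr);
        out[n].len       = (uint32_t)chunk;
        vaddr += chunk;
        len   -= chunk;
        n++;
    }
    return n;
}

A buffer of N bytes whose start is not page-aligned can thus produce up to one more than N/4096 (rounded up) descriptors, each needing its own translation; this is why the translation latency weighs directly on the achievable bandwidth.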

APEnet+ employs an FPGA (an Altera Stratix® IV  [5]) that allows a straightforward implementation of an embedded processor. For this reason, virtual memory management duties were first accomplished by means of firmware running on such an embedded soft-processor, the Nios II; it was however soon found that letting it walk the page table was a severe bottleneck, and the theoretical speed of the board transceivers was not satisfactorily saturated.

Just as on x86_64 an associative cache called the Translation Lookaside Buffer (TLB) assists the MMU in its duties, we followed suit and implemented a TLB for APEnet+ by means of a Content Addressable Memory (CAM)  [6]. The CAM implementation allows the lowest possible latency for the TLB, which achieves actual saturation over the links. The drawback is that its resource cost is so high that, at least for the target FPGA, the TLB size cannot grow beyond about a hundred entries. This means that page walking of a buffer spanning more pages than that will sooner or later have to resort to the slow Nios II to retrieve a missing entry, and saturation is lost.
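The division of labour between the CAM-based TLB and the firmware can be pictured as a classic fast/slow path split. The following C model is only illustrative (the entry count, replacement policy and interfaces are assumptions, not the RTL described in [6]):

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 128   /* order of magnitude affordable on the target FPGA */

typedef struct { uint64_t vpn; uint64_t pfn; bool valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

/* Placeholder for the slow path (Nios II firmware, later the ASIP). */
static uint64_t slow_page_walk(uint64_t vpn) { return vpn; }

static uint64_t lookup_pfn(uint64_t vpn)
{
    /* In hardware the CAM matches every entry in parallel; this loop only
     * models the associative comparison, not its single-cycle latency. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].pfn;                        /* hit: fast path  */

    uint64_t pfn = slow_page_walk(vpn);               /* miss: slow path */
    int victim = (int)(vpn % TLB_ENTRIES);            /* naive replacement */
    tlb[victim] = (tlb_entry_t){ .vpn = vpn, .pfn = pfn, .valid = true };
    return pfn;
}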

This shortcoming pushed us to seek improvements for the page walk slow path; this meant looking at application-specific processors (ASIPs). ASIPs offer forms of architectural specialization that, when combined with instruction-level and data-level parallelism, can significantly increase performance and reduce energy consumption compared to general-purpose processors. Thanks to their software programmability, ASIPs offer the flexibility to cope with multiple algorithmic standards and evolving specifications.

In the following we will describe in detail the iterative optimization process leading to the design of a highly efficient page-walking ASIP for APEnet+ where, as we will see, the architectural specialization consists of dedicated register structures in which look-up tables are stored, and of custom instructions such as table look-up, insert and remove.
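To give an idea of this specialization, the plain-C reference model below mirrors the intended behaviour of those structures (the entry count and the operation names are ours, not the toolsuite-generated intrinsics). On the ASIP, each of the three loops collapses into a single custom instruction operating on the dedicated register structure:

#include <stdint.h>
#include <stdbool.h>

#define LUT_DEPTH 64   /* entries of the dedicated register structure */

typedef struct { uint64_t key, value; bool valid; } lut_entry_t;
static lut_entry_t lut[LUT_DEPTH];

/* On the ASIP: one "lookup" instruction comparing all entries in parallel. */
static bool lut_lookup(uint64_t key, uint64_t *value)
{
    for (int i = 0; i < LUT_DEPTH; i++)
        if (lut[i].valid && lut[i].key == key) { *value = lut[i].value; return true; }
    return false;
}

/* On the ASIP: one "insert" instruction writing the first free entry. */
static bool lut_insert(uint64_t key, uint64_t value)
{
    for (int i = 0; i < LUT_DEPTH; i++)
        if (!lut[i].valid) { lut[i] = (lut_entry_t){ key, value, true }; return true; }
    return false;
}

/* On the ASIP: one "remove" instruction invalidating the matching entry. */
static void lut_remove(uint64_t key)
{
    for (int i = 0; i < LUT_DEPTH; i++)
        if (lut[i].valid && lut[i].key == key) lut[i].valid = false;
}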

Section 2 reviews related work on network communication involving ASIPs or interaction between FPGAs and GPUs. In Section 3 we describe the APEnet+ architecture, with details of the relevant subsystems for the Nios II in 3.1 and of our ASIP in 3.2. In Section 4 the RDMA semantics of a network transfer is described. Section 5 contains the specifics of the different page walking implementations: the Nios II firmware in 5.1, the TLB sketched in 5.2 (its in-depth examination can be found in [6]) and the ASIP in 5.3; a quantitative comparison between firmware and ASIP is in Section 6. Plans for complete integration of the page walking ASIP solution into APEnet+ and the roadmap for its future development are presented in Section 7; conclusions are in Section 8.

Section snippets

Related work

RDMA protocols are a well-established solution, in many contexts such as HPC, databases or storage, to reduce CPU utilization and avoid memory bandwidth bottlenecks. RDMA-enabled NICs have demonstrated outstanding performance, as pointed out in [7], [8].

A first approach to GPUDirect technology usage can be found in  [9], where benefits for well-known software suites are demonstrated, with up to 33% performance improvement for the described early model of GPU communications, avoiding memory staging but

A brief APEnet+ overview

APEnet+ is a point-to-point, low-latency network controller developed by INFN for a 3D-torus topology network, integrated in a PCIe Gen2 board based on an Altera Stratix IV FPGA. It is the building block for a hybrid CPU/GPU HPC cluster inside INFN  [15] and the basis for a GPU-enabling data acquisition interface in the low-level trigger of a High Energy Physics experiment  [16]. The board provides 6 QSFP+ modules which are directly connected to its embedded transceivers; 4 out of 6 modules are

RDMA communication paradigm

Traditional hardware and software architectures impose a significant load on a server's CPU and memory because data must be copied between the kernel and the application. Memory bottlenecks become more severe as connection speeds (10 GbE and InfiniBand) exceed the processing power and memory bandwidth of servers. Remote Direct Memory Access (RDMA) allows information to be read/written directly in another computer's main memory with minimal demands on memory bus bandwidth and CPU processing overhead. This
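A hedged sketch of the usual one-sided usage pattern may help fix ideas (the types and function names below are illustrative, not the APEnet+ API): the target buffer is registered once, its description is shared with the peer out-of-band, and the peer then writes into it directly, without involving the target CPU on the data path.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t vaddr;   /* virtual address valid in the target process */
    size_t   len;     /* registered length                           */
    int      node;    /* owning node in the 3D torus                 */
} rdma_buf_t;

/* Registration pins the pages and lets the NIC build its page list. */
static rdma_buf_t rdma_register(int node, void *buf, size_t len)
{
    rdma_buf_t r = { (uint64_t)(uintptr_t)buf, len, node };
    /* ... driver call that pins pages and informs the NIC ... */
    return r;
}

/* One-sided put: the sender names the remote virtual address directly;
 * the receiving NIC resolves it to physical pages on arrival. */
static int rdma_put(const rdma_buf_t *dst, const void *src, size_t len)
{
    if (len > dst->len) return -1;
    /* ... post a descriptor to the local NIC; no remote CPU in the data path ... */
    (void)src;
    return 0;
}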

RDMA task implementation and acceleration

As seen in Section  4.1, a NIC is required to offload RDMA buffer handling from the application (with all the added complexities of virtual memory management for x86/x86_64 on a GNU/Linux OS) to achieve a zero-copy, off-loaded RDMA implementation.

We describe below the different versions of this management infrastructure, as it evolved from a software-only implementation on the Nios II, then to Nios II software assisted by a custom-logic TLB, and finally to a dedicated ASIP.
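For reference, the Buffer Search task amounts to finding which registered buffer, if any, contains an incoming destination address. A simple model is a binary search over the buffer list kept sorted by start address (the data structure here is an assumption made for illustration, not necessarily the layout used by the firmware):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t start;     /* virtual start address of the registered buffer */
    uint64_t len;       /* buffer length in bytes                          */
    int      is_gpu;    /* bound to host or GPU memory                     */
} reg_buf_t;

/* Returns the index of the containing buffer, or -1 if none matches.
 * bufs[] must be sorted by start address and non-overlapping. */
static int buffer_search(const reg_buf_t *bufs, int n, uint64_t vaddr)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (vaddr < bufs[mid].start)
            hi = mid - 1;
        else if (vaddr >= bufs[mid].start + bufs[mid].len)
            lo = mid + 1;
        else
            return mid;   /* start <= vaddr < start + len */
    }
    return -1;
}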

Hardware costs

In Table 1 we show an outline of the FPGA logic usage measured with the synthesis software. The Nios II and the TLB hardware block are integrated in the current APEnet+ board implementation, whereas the results for the DLX ASIP were obtained in a stand-alone project on an Altera Stratix IV EP4SGX290.

In the current implementation of the APEnet+ board, the Nios II μC achieves an operating frequency of 200 MHz, not far from the limit declared by Altera for the Nios II/f  [22]. The preliminary synthesis trials

Future work

Synthetic tests in 6.2 show that the ASIP implementation holds promise of solving the shortcomings of the TLB and of the Nios II as they currently stand in APEnet+.

Work is under way for trial integration of the ASIP into the APEnet+ architecture, which would allow more realistic benchmarking and a definitive ruling on whether a complete Nios II takeover by the DLX would be advantageous for APEnet+.

The FPGA synthesis trials for the DLX have not yet been tuned in any way; its running frequency of 165 MHz in

Conclusions

When choosing to implement RDMA semantics, an FPGA-based, GPU-aware network card like APEnet+ requires a complex interaction with the virtual memory system of its host and its GPUs; such interaction mostly consists of look-ups of buffer ranges and virtual-to-physical address translations. For our target platform, the firmware running on the general-purpose, FPGA-provided Nios II μC that managed this interaction was unable to saturate the advertised speed of the FPGA physical links. This could be fixed

Acknowledgments

This work was partially supported by EU Framework Programme 7 EURETILE project, grant number 247846; author R. Ammendola was supported by MIUR (Italy) through INFN SUMA project.


References (23)

  • P.S. Paolucci, I. Bacivarov, G. Goossens, R. Leupers, F. Rousseau, C. Schumacher, L. Thiele, P. Vicini, EURETILE...
  • R. Ammendola, APEnet+: a 3D Torus network optimized for GPU-based HPC systems, J. Phys. Conf. Ser. (2012).
  • NVIDIA GPUDirect technology....
  • R. Ammendola, GPU peer-to-peer techniques applied to a cluster interconnect.
  • Altera Stratix IV Handbook....
  • R. Ammendola, A. Biagioni, O. Frezza, F. Lo Cicero, A. Lonardo, P.S. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P....
  • J. Liu et al., High performance RDMA-based MPI implementation over InfiniBand, Int. J. Parallel Program. (2004).
  • T.S. Woodall et al., High performance RDMA protocols in HPC.
  • G. Shainer et al., The development of Mellanox/NVIDIA GPUDirect over InfiniBand—a new model for GPU to GPU communications, Comput. Sci.-Res. Dev. (2011).
  • S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, D. Panda, Efficient inter-node MPI communication using GPUDirect...
  • R. Bittner et al., Direct GPU/FPGA communication via PCI express, Cluster Comput. (2013).

Roberto Ammendola received his Master's Degree in Physics in 2002 from "Tor Vergata" University in Rome, Italy. He has been with INFN since 2003, but has also worked for Centro Fermi and "Tor Vergata" University. His main research interests are FPGAs and high speed network interconnects, HPC systems and infrastructure design and deployment.

Andrea Biagioni received his Master's Degree in Physics in 2006 from University Sapienza in Rome, Italy, with a thesis on the interconnection network implementation for the massively parallel computers designed by the APE Group at Istituto Nazionale di Fisica Nucleare (INFN). Since 2007 he has worked in this group as a hardware developer, collaborating on the SHAPES and EURETILE projects and contributing VHDL code for the development of a network processor, its data transmission control logic and the fast I/O mechanism with GPUs. His main research interests are High Performance Computing, NIC design, communication protocols, FPGAs, fault tolerance and routing algorithms.

Ottorino Frezza received his Master's Degree in Physics in 2005 from University Sapienza in Rome, Italy, with a thesis on the VHDL FPGA-based design of a read-out board for a high energy physics experiment (NEMO). He worked from 2007 to 2009 for Eurotech s.p.a. as a technical employee. Since 2009 he has worked for INFN as a hardware developer. His main research interest is the development of network architectures on FPGA for massively parallel processing systems, focusing on high speed interfaces.

Werner Geurts holds the position of VP Applications at Target Compiler Technologies. Before co-founding Target he was a researcher at IMEC, where he worked on behavioral synthesis of data-path structures and on retargetable compilation. He has co-authored several papers in electronic design automation. He holds Master's degrees in Electrical Engineering from the Hogeschool Antwerpen and K.U. Leuven, and a Ph.D. degree from K.U. Leuven, obtained in 1985, 1988 and 1995, respectively.

Gert Goossens is CEO of Target Compiler Technologies. Before founding Target, he was affiliated with the IMEC research centre, where he headed research groups on behavioral synthesis and software compilation. He has authored or co-authored around 40 papers in electronic design automation. He received a Master's and a Ph.D. degree in Electrical Engineering from K.U. Leuven, in 1984 and 1989 respectively.

Francesca Lo Cicero received her Master's Degree in Electronics Engineering in 2005 from University Tor Vergata in Rome, Italy, with a thesis on the VHDL FPGA-based design of a calibration system for a Digital Beam Forming Network. Since 2006 she has worked for INFN as a hardware developer. Her main research interest is the development of network architectures on FPGA for massively parallel processing systems, focusing on hardware acceleration and optimization.

Alessandro Lonardo received his Master's Degree in Physics in 1997 from University "Sapienza" in Rome, Italy. His thesis work involved the development of a DSL optimizing compiler for the SIMD APEmille supercomputer. He contributed to the design of the apeNEXT SPMD parallel computer, developed its multi-node functional simulator and ported the gcc compiler to the new architecture. He was one of the designers of the Distributed Network Processor, an IP enabling 2D/3D internode communications in embedded multi-tile systems, and developed its TLM-SystemC model. Currently he works on the design and development of the APEnet+ 3D-torus network and the NaNet real-time NIC.

Pier Stanislao Paolucci received his Physics M.Sc. degree from University Sapienza (Rome, Italy). He coordinates the European EURETILE project. Previously, he coordinated the European SHAPES project. He is an INFN researcher, and has been a member of the INFN APE group since its foundation (1984). The APE group designed several generations of massively parallel/distributed numerical computers. He also served as CTO of ATMEL Roma, leading the design of the DIOPSIS MPSoCs and mAgic VLIW numerical processors. He holds patents on MPSoC and VLIW technologies. He also invented the 'Cubed-Sphere' gridding and co-invented 'Evolving Grammars'. He is a member of ACM and IEEE.

Davide Rossetti has a degree in Theoretical Physics from Sapienza Rome University and is currently a senior engineer at NVIDIA Corp. His main research activities are in the fields of design and development of parallel computing and high-speed networking architectures optimized for numerical simulations, while his interests span different areas such as HPC, computer graphics, operating systems, I/O technologies, GPGPUs, embedded systems, digital design, and real-time systems.

Francesco Simula received his Master's Degree in Theoretical Physics in 2006 from University Sapienza in Rome, Italy, with a thesis on spin glass simulations performed on the massively parallel computers designed by the APE Group, the supercomputing initiative internal to Istituto Nazionale di Fisica Nucleare (INFN). Since 2006, he has worked at the Department of Physics of University Sapienza as a temporary research fellow and then for INFN as a developer of high performance numerical simulations of scientifically interesting codes. Parallel computing is his main research interest, focusing on high performance networking and GPU acceleration on both HPC and embedded systems.

Laura Tosoratto received a Master's Degree in Physics in 2005 from University Sapienza in Rome, Italy, with a thesis on the porting of the GNU C Compiler to the apeNEXT supercomputer architecture, developed by the APE group at Istituto Nazionale di Fisica Nucleare (INFN). Since then she has worked as a researcher in this group, contributing as a software developer to several EU-funded projects. Her main research interests are High Performance Computing and Networking, Message Passing libraries, Fault Tolerance and Distributed Platform functional simulators.

Piero Vicini received his Master's Degree in Physics from University Sapienza in Rome, Italy, and joined Istituto Nazionale di Fisica Nucleare (INFN) in 1993, where he is currently a senior research associate. From 1993 he was one of the principal investigators of the APE Group, the INFN supercomputing initiative, responsible for hardware development, VLSI design and APE supercomputer production. Since 2005, he has been the research group's spokesman and coordinator. His current research interests are the development of massively parallel processing systems optimized for scientific numerical simulation, in particular floating point processor architectures, dedicated network architectures on FPGA, computational accelerators and high performance system integration.

1. Present address: NVIDIA Corp., Santa Clara, CA, United States.
