Trends in computing technologies and markets: The HEPiX TechWatch WG

Driven by the need to carefully plan and optimise the resources for the next data taking periods of Big Science projects, such as CERN’s Large Hadron Collider and others, sites started a common activity, the HEPiX Technology Watch Working Group, tasked with tracking the evolution of technologies and markets of concern to the data centres. The talk will give an overview of general and semiconductor markets, server markets, CPUs and accelerators, memories, storage and networks; it will highlight important areas of uncertainty and risk.


Introduction
High energy physics (HEP) and other big science projects are highly dependent on significant advances in compute, storage, and network performance to store and analyze the expected near order of magnitude increases in collected data in the next few years. Any deviation from expected improvements in performance (or cost/performance ratios) will have an impact on future experiments. However, with a few exceptions, technical and economic factors have slowed the rate of improvement in recent years relative to historical rates.

Semiconductors
The scientific community is dependent on the global semiconductor market to provide the advances in electronics that drive improvements in computing, detectors, and data acquisition. In FY2018, global demand for semiconductors exceeded 1 trillion units [3]. In the first half (1H) of 2019, sales dropped 14% relative to 1H2018 due to a drop in memory and flash sales; however, the market is expected to recover in 2020 [3,4]. From a technology perspective, leading edge semiconductor foundries are moving to smaller (7 nm) device sizes, requiring new and costly lithographic equipment. Future migration to 5 nm and beyond is expected to require another round of costly lithography changes. This reality has caused GlobalFoundries to stop 7 nm process development, leaving TSMC, Samsung, and Intel at the leading edge [1]. These process costs, together with higher chip design costs, are making it harder to justify development of new chips on leading edge processes, as only chips with high enough sales volume can recoup the upfront development costs [2,6].

Servers
Compute servers are critical to HEP. Global revenue for the server market was $20 billion in 2Q2019, down 11.6% year over year, with 2.7 million servers shipped [7]. Average selling price (ASP) remained stable, as DRAM, flash and CPU costs dropped [8]. In FY2019, the hyperscalers (e.g. Amazon, Facebook, Google, Microsoft) purchased 35% of all Intel's server processors, and the Original Design Manufacturers (ODMs), the suppliers of choice to the hyperscalers, captured a larger share of the server market than the traditional "Tier 1" server brands [8,9]. At present, the impact of the hyperscalers on HEP procurements is unclear.

Central Processing Unit (CPU)
The x86 instruction set architecture (ISA), the one implemented by Intel and AMD, continues to be the dominant CPU ISA in the server market, with IBM Power, ARM, and RISC-V attempting to grow market share or enter the market [16,25].
x86 Architecture (Intel/AMD)
In the x86 market, AMD is starting to challenge Intel's hegemony in server CPUs. AMD's second-generation EPYC "Rome" server CPUs, based on the Zen 2 microarchitecture, promise competitive CPU cores, higher core counts (64 vs 28 cores), higher memory bandwidth (8 vs 6 DDR4 memory channels) and greater I/O connectivity (128 PCI-e Gen 4 lanes vs 48 Gen 3 lanes) compared to Intel CPUs [11]. Intel has responded with support for Optane persistent memory and Vector Neural Network Instructions (VNNI) in its latest CPUs [10]. However, Intel has been hampered by problems with its 10 nm chip manufacturing process [12]. Meanwhile, AMD has moved to chiplets (small silicon dies implementing specific functions) and multi-chip modules (MCMs, integrated circuit packages containing multiple chiplets) to mitigate the cost of 7 nm processes. Chiplets and MCMs enable reduced die size and varied CPU configurations, by mixing and matching different chiplets, at the cost of more complex fabrication and potentially higher power consumption and latency [13,14].
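To make the link between memory channel count and memory bandwidth concrete, the following minimal sketch computes peak theoretical DRAM bandwidth for 8 and 6 channels. It assumes a 64-bit (8-byte) data bus per DDR4 channel and, purely for illustration, the same hypothetical DDR4-3200 transfer rate on both platforms so that only the channel count differs. The function and constant names are illustrative, and sustained bandwidth in real systems will be lower than these peak figures.

```python
# Peak theoretical DRAM bandwidth as a function of channel count.
# Assumptions (not from the source text): a 64-bit data bus per DDR4
# channel and an identical, hypothetical transfer rate on both CPUs.

BYTES_PER_TRANSFER = 8  # 64-bit DDR4 data bus per channel


def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float) -> float:
    """Return peak theoretical bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_second = channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER
    return bytes_per_second / 1e9


ASSUMED_RATE_MT_S = 3200  # hypothetical DDR4-3200, used for both platforms

eight_channel = peak_bandwidth_gb_s(8, ASSUMED_RATE_MT_S)
six_channel = peak_bandwidth_gb_s(6, ASSUMED_RATE_MT_S)

print(f"8 channels: {eight_channel:.1f} GB/s")
print(f"6 channels: {six_channel:.1f} GB/s")
print(f"ratio: {eight_channel / six_channel:.2f}x")
```

At equal transfer rates the advantage is simply the ratio of channel counts (8/6); any difference in supported memory speed between the two platforms would change the result.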

Non-x86
IBM Power is the only non-x86 ISA that the HEP community is likely to encounter at this time. The top two supercomputers in the November 2019 Top 500 list both use IBM Power 9 processors [28]. Unique features of Power 9 include 4 threads per core (SMT4), buffered memory (for large memory systems), native NVLink, and CAPI (coherent accelerator processor interface) for CPU connectivity to accelerators like GPUs. Power 9 was also an early adopter of PCI-e Gen 4 [15].
New entrants in the data center market are CPUs from several vendors based on the ARM ISA, an ISA from ARM Holdings [16]. Marvell's ThunderX2 is available in systems from Cray, HPE, and Gigabyte [18,19]. Fujitsu's A64FX is the CPU in the Fugaku (Post-K) supercomputer to be installed at Riken in Japan [20]. Other ARM based CPUs, including Ampere's eMAG and Amazon's Graviton, are also targeting the data center [21,22]. ARM Holdings has also announced its Neoverse N1 intellectual property blocks, which licensees can use to quickly build custom CPUs designed for the data center [23]. The ARM ISA is also a core component of the European Union's European Processor Initiative (EPI) [24], which aims to build a CPU for high performance computing (HPC). Given the breadth of industry activity with ARM, the HEP community is likely to encounter ARM in the data center in the near future.
RISC-V is an open source ISA that is attempting to follow the same path as ARM into the data center marketplace by starting in embedded/system on chip applications [25]. RISC-V has been targeted for inclusion in the European Processor Initiative's accelerator roadmap [26]. The EPI is part of the European Union's effort to build an HPC ecosystem [27].

Accelerators
The slowdown in the advancement of CPU performance in the past few years has led computer designers to look to domain specific accelerators (DSAs) for dramatic increases in performance and performance per watt. However, DSAs only work for specific types of applications, and typically require substantial modifications to software.

Graphics Processing Units (GPUs)
NVIDIA, leveraging development efforts in PC and console gaming, has made GPGPUs (General Purpose Graphics Processing Units) the DSA of choice in computing. Its flagship GPU is the V100 [29]. NVIDIA GPUs are present in a large number of Top 500 HPC systems and provide a substantial fraction of the FLOPS (floating point operations per second) in these systems [28]. Both AMD, with Radeon Instinct, and Intel, with Xe, are attempting to capture some of the GPU market [30,31]. Software libraries and tools remain key to the adoption of GPUs, but the most mature packages tend to be platform specific. Recently, GPUs have added features to accommodate new applications (e.g., bfloat16 support for artificial intelligence) to thwart competition from dedicated AI accelerators.

AI Accelerator
Artificial Intelligence (AI) processors are another class of DSAs gaining traction, driven by the use of AI applications by the hyperscalers and other industries. Some AI processors are only available in the cloud, like Google's Tensor Processing Unit (TPU), while others are available as hardware that can be purchased on the open market (Intel Nervana, Habana Gaudi) [32][33][34]. The influence of the hyperscalers is evident in their ability to finance the development of Application Specific Integrated Circuits (ASICs) like Google's TPU, and also to generate enough interest in AI accelerators in the broader market to compel the development of commercial AI processors.

FPGA
DSAs can boost processing capability, but the cost of ASIC design is a significant barrier to their development. To avoid these costs, the electronics industry has turned to Field Programmable Gate Arrays (FPGAs) to implement hardware designs. FPGAs consist of logic blocks and interconnect that can be reconfigured in the field by the end user. Higher end FPGAs can also contain hardware implementations (hard IP blocks) of more common functions like DRAM controllers, network ports, and even CPUs. With sufficient skill and effort, it is possible to create DSAs with FPGAs at relatively low material cost. The open question is whether the benefits are worth the development effort.

Memory
Dynamic RAM (DRAM) is the working memory of choice for CPUs, but DRAM performance has not kept pace with CPU performance for decades. Complex cache systems and DSAs have been developed to mitigate this gap, but cannot eliminate it. Advances in DRAM have been limited to bandwidth and capacity, as access latencies have remained relatively unchanged. In addition, capacity and bandwidth trade off against each other: High Bandwidth Memory 2 (HBM2) provides higher bandwidth at lower capacity, DDR5 provides higher capacity at lower bandwidth, and graphics memory (GDDR5) sits somewhere in between [35]. From a market perspective, DRAM sales are expected to decline by 38% in 2019 compared to 2018 [36].

Storage
With the increasing amounts of data being collected by HEP experiments, advancements in storage are critical moving forward. The falling cost and increasing capacity of solid state memory (3D NAND flash) have resulted in a veritable renaissance in high performance storage. However, more traditional magnetic disks and tape continue to evolve slowly and are encountering technological and economic issues.
Flash
3D NAND flash is the technology of choice for non-volatile, high performance block storage. Storage capacity continues to increase with more cell layers (currently 128) and more bits per cell (3 to 4 bits per cell in production, 5 bits in development). The cost of flash memory dropped in 2019 due to production oversupply and relatively weak demand [36,37,40]. Flash revenue is expected to shrink by 32% in 2019, but recover in 2020 [37]. Flash has replaced magnetic disks for low latency and high bandwidth applications. To access these capabilities, legacy I/O buses like SCSI and block storage software stacks in operating systems have been replaced with NVMe (non-volatile memory express) and its associated NVMe storage stack. NAND flash is triggering the development of new form factors (e.g. M.2, EDSFF), while NVMe is enabling the development of network attached flash via NVMe over Fabrics (NVMe-oF) [38,39].
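The trade-off behind adding bits per cell can be illustrated with a short sketch. Storing n bits in one cell requires distinguishing 2^n charge levels, so capacity per cell grows linearly while the number of levels to be resolved grows exponentially. The function names below are illustrative and not taken from any vendor documentation.

```python
# Bits per cell versus the number of charge levels a cell must resolve.
# Illustrative helper names; the bit counts (3, 4, 5) come from the text.


def voltage_levels(bits_per_cell: int) -> int:
    """Number of distinct charge levels needed to store the given bits."""
    return 2 ** bits_per_cell


def capacity_gain(old_bits: int, new_bits: int) -> float:
    """Fractional capacity increase per cell when moving to more bits."""
    return new_bits / old_bits - 1


for old_bits, new_bits in [(3, 4), (4, 5)]:
    print(
        f"{old_bits} -> {new_bits} bits/cell: "
        f"{voltage_levels(old_bits)} -> {voltage_levels(new_bits)} levels, "
        f"capacity +{capacity_gain(old_bits, new_bits):.0%}"
    )
```

Each additional bit doubles the number of levels while yielding a progressively smaller relative capacity gain, which is one reason why increases in bits per cell become harder with each step.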

Persistent Memory
The development of low latency flash (with latency lower than that of "normal" capacity flash) and Intel's Optane non-volatile memory has spurred interest in Persistent or Storage Class Memory (PM or SCM). PM is non-volatile memory on the CPU memory bus; it is less expensive than DRAM, but has higher latency. There are three different behavioral models for PM, defined in the Storage Networking Industry Association (SNIA) Non-Volatile Memory Programming Model [42]. These are DRAM-like but larger capacity (JEDEC NVDIMM-P), DRAM-like but persistent (JEDEC NVDIMM-N), and disk-like but faster [41]. When used in a computer system, the different modes require modifications to different parts of the system. These may include CPU support, BIOS/OS support, memory bus extensions (e.g. modifications to DDR4 for Optane), application support, or a combination of these changes.

Hard Disk Drives (HDD)
Magnetic hard disks host the preponderance of online data for HEP. Historically, HDD capacity doubled every 13 months and cost per TB decreased at an average rate of 40% per year, but more recently the rate of cost decrease has dropped to 20% per year [48]. The maximum areal bit density of Perpendicular Magnetic Recording (PMR) technology, used in current HDDs, and the limit on the number of platters in the standard 3.5" form factor have been reached [43]. Near term solutions to increase drive capacities, notably shingled magnetic recording (SMR), significantly alter drive behavior, with some requiring explicit application support. Longer term solutions, like energy assisted recording using heat (HAMR) or microwaves (MAMR), continue to be pushed back, with new estimates of drive availability in 2022 [43]. Performance, in the form of input/output operations per second (IOPS), has also plateaued. To rectify this situation, multi-actuator drives that will double the IOPS per 3.5" drive have been promised for 2020 [45]. In addition to these technical hurdles, the HDD market is also experiencing problems.
Over the past decade, the HDD market has been shrinking as SSDs continue to take market share from HDDs. Near-line HDDs, the type used by HEP, are the only growth segment for HDDs [44]. Also, hyperscalers purchase almost 50% of all HDDs, giving them significant influence in the market [46]. The contraction of the HDD market has led to the consolidation of vendors, of which there are now three, and a reduction in the number of factories, with four plants between the top two vendors.
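The practical effect of the slowdown in HDD cost improvement quoted in this section (from 40% to 20% per year) becomes clearer when the rates are compounded. The sketch below is a simple compounding calculation; the five-year planning horizon and the function names are assumptions chosen for illustration and are not from the cited sources.

```python
# Compound effect of annual HDD cost declines and capacity doubling time.
# The 40%, 20% and 13-month figures come from the text; the 5-year
# horizon is an arbitrary, illustrative planning window.


def annual_growth_from_doubling(months_to_double: float) -> float:
    """Equivalent annual growth rate for a given doubling time in months."""
    return 2 ** (12 / months_to_double) - 1


def remaining_cost_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the initial cost per TB left after compounding declines."""
    return (1 - annual_decline) ** years


YEARS = 5  # assumed horizon

print(f"capacity growth per year: {annual_growth_from_doubling(13):.1%}")
for decline in (0.40, 0.20):
    fraction = remaining_cost_fraction(decline, YEARS)
    print(f"{decline:.0%}/year over {YEARS} years: {fraction:.1%} of initial cost per TB")
```

Over five years, a 40% annual decline leaves about 8% of the original cost per TB, while a 20% annual decline leaves about 33%, so the same budget buys roughly four times less capacity than the historical trend would have suggested.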

Tape
Historically, the archival medium of choice for HEP has been magnetic tape. Its role has expanded over the years to near-line storage, with the deployment of robotic tape libraries. The technical road map for tape is clear, with 300-400 MB/s LTO-8 and IBM TS1160 tape drives on the market with uncompressed tape capacities of 12 TB and 20 TB respectively, LTO-9 technology on the horizon (2020?), and multi-100 TB tapes demonstrated in the lab [47]. However, the economics of tape are less clear. Tape media revenue is currently estimated at $0.7 billion per year and has been flat or falling over the past few years [48,49]. Use of large capacity disks as a backup medium and high fixed costs for tape compared to disk have been factors in the decline. IBM is the only leading edge tape drive manufacturer and the dominant driver of tape R&D. On the media side, only two manufacturers are left, and they only recently settled a legal dispute that disrupted sales of LTO-8 media [50]. Finally, the hyperscalers represent a substantial consumer of tape technology. These factors can lead to a distorted market that can have detrimental consequences.
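As a rough illustration of what the quoted drive rates and cartridge capacities mean operationally, the sketch below estimates the time needed to stream a full cartridge. It assumes decimal units (1 TB = 10^12 bytes), uncompressed data, and sustained streaming at the quoted rate with no mount, seek, or repositioning overhead. Because the text quotes 300-400 MB/s as a range, both ends are evaluated for both capacities. The function name is illustrative.

```python
# Time to stream one full tape cartridge at a sustained drive rate.
# Assumptions: decimal units, uncompressed data, no mount/seek overhead.


def hours_to_fill(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours needed to write capacity_tb terabytes at rate_mb_s megabytes/s."""
    seconds = capacity_tb * 1e12 / (rate_mb_s * 1e6)
    return seconds / 3600


for capacity_tb in (12, 20):       # uncompressed capacities from the text
    for rate_mb_s in (300, 400):   # both ends of the quoted rate range
        hours = hours_to_fill(capacity_tb, rate_mb_s)
        print(f"{capacity_tb} TB at {rate_mb_s} MB/s: {hours:.1f} hours")
```

Under these assumptions a single cartridge occupies a drive for many hours, so capacity growth without matching drive speed growth increases the number of drives needed to sustain a given archive throughput.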

Network
High performance networking is a fundamental requirement for HEP, as large volumes of data need to be moved from the detector to compute and storage systems located around the world. This is one area where HEP has benefited from the interest of hyperscalers, as they have driven the rapid transition to higher single lane bit rates, from 25 Gigabits per second (Gbps) to 50 Gbps and now 100 Gbps, with concurrent reductions in cost [51,53]. However, this rapid change is not without consequences. Shorter technology life cycles have increased the pressure to upgrade equipment earlier and reduce investments in "older" technology, e.g. 10/40 gigabit Ethernet (GbE). The pace of change has also exceeded the speed of the IEEE standardization process [52]. A side effect of this change is the proliferation of different optical interconnect options at 100, 200, and 400 GbE, with different transceiver types, distance limitations, single lane bit rates, and "break out" capabilities [54,55].
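The relationship between single lane bit rate and port speed, which underlies the variety of interconnect and break out options, can be sketched as a simple division. The example below uses the nominal lane rates and port speeds named in the text and ignores encoding overhead. It only performs the arithmetic; not every combination it prints necessarily corresponds to a standardized or commercially available interface. The names are illustrative.

```python
# Number of parallel lanes needed to reach a port speed at a given
# nominal single lane bit rate (encoding overhead ignored).

LANE_RATES_GBPS = (25, 50, 100)   # single lane bit rates from the text
PORT_SPEEDS_GBE = (100, 200, 400)  # Ethernet port speeds from the text


def lanes_required(port_speed_gbps: int, lane_rate_gbps: int) -> int:
    """Return the lane count, or raise if the speed is not a whole multiple."""
    if port_speed_gbps % lane_rate_gbps != 0:
        raise ValueError("port speed is not a whole multiple of the lane rate")
    return port_speed_gbps // lane_rate_gbps


for port_speed in PORT_SPEEDS_GBE:
    options = ", ".join(
        f"{lanes_required(port_speed, rate)} x {rate} Gbps"
        for rate in LANE_RATES_GBPS
    )
    print(f"{port_speed} GbE: {options}")
```

Higher lane rates reduce the number of lanes, and therefore the optical components, per port, while a multi-lane port can in principle be broken out into several lower speed ports, for example one 400 GbE port into four 100 GbE links.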

System Architecture
New technologies like flash and domain specific accelerators, as well as limitations in existing technologies, have resulted in a rethinking of system architectures [59]. Chiplets (small, function specific integrated circuits), expanded use of multi-chip modules (e.g., AMD Zen, HBM2 memory), cache coherent interconnects (CCIX, CXL, NVLink, Gen-Z) for connectivity to CPUs, NVMe over Fabrics, and persistent memory are examples of new technologies that have been developed to allow radically different compute system architectures [56][57][58].
Which new system architectures will develop, and which will make lasting changes to computing, is unclear at this time.

Summary
Next generation Big Science experiments like the High Luminosity LHC (HL-LHC), the Deep Underground Neutrino Experiment (DUNE), and the Square Kilometer Array (SKA) are highly dependent on the timely advancement of compute, storage, and network technology [60][61][62]. Cost, capacity, and performance need to improve at a reasonable pace so that network, compute and storage costs remain within reach. HEP directly benefits from a robust electronics industry and market, as it finances R&D into more advanced (faster, lower power, smaller) electronics, both analog and digital, and enables cost reduction from economies of scale and technology improvements. Healthy competition in the CPU business drives lower compute costs and higher performance, which benefits HEP greatly as data processing requirements increase due to more complex data extraction problems (like greater pile up) and larger volumes of data. The drive towards domain specific accelerators holds the potential to speed up data processing both at the counting house, enabling things like software triggers, and in the data center. Domain specific processors may also change HEP data analysis methods, as machine learning techniques, for which AI processors are specifically designed, are being investigated by HEP experiments. The proliferation of these devices outside of HEP has fostered the development of software libraries that the HEP community has leveraged at minimal or no cost. Substantial increases in network performance, driven by the hyperscalers, will enable the HEP community to move data at the counting house, in the data center, and around the world at rates commensurate with the volume of data being generated. The migration towards chiplets and multi-chip modules, in combination with inter-chip interconnects like CXL, CCIX, and Gen-Z, may have a positive impact on HEP detectors and data acquisition systems by making it possible to compose customized hardware from standardized building blocks.
Advances are occurring, but there are definite technical and economic risks in the roadmaps of several key technologies. In addition, utilization of some of these advances entails a substantial amount of software development that must be started several years in advance of their availability (or use), as software development is a time consuming endeavor. Without knowledge of the direction in which technology is advancing, HEP will not be prepared to take advantage of these advances.