Hardware and Software Solutions for Energy-Efficient Computing in Scientific Programming

Energy consumption is one of the major issues in today’s computer science, and an increasing number of scientific communities are interested in evaluating the tradeoff between time-to-solution and energy-to-solution. Although computing has revolved around centralized infrastructures, such as supercomputers and data centers, for the last two decades, the wide adoption of the Internet of Things (IoT) paradigm is currently inverting this trend: the huge amount of data it generates is pushing computing power back to the places where the data are generated, the so-called fog/edge computing. This shift towards a decentralized model requires an equivalent change in the software engineering paradigms, development environments, hardware tools, languages, and computation models for scientific programming because local computational capabilities are typically limited and require a careful evaluation of power consumption. This paper aims to show how these concepts can actually be implemented in scientific software by presenting the state of the art of powerful, less power-hungry processors on one side and energy-aware tools and techniques on the other.


Introduction
Information and communication technologies (ICT) play a fundamental role in supporting human activities for the global economic, social, and environmentally sustainable developments [1]. However, energy consumption is one of the most relevant issues for present computing platforms, and this trend is expected to continue in the foreseeable future.
This implies that the electricity bill increasingly dominates costs related to the running of applications and the consequent environmental pollution [2].
This situation is evident for high-performance computing (HPC) infrastructures, where the sum of the energy bills over a supercomputer's lifetime is comparable to the acquisition cost and represents one of the most relevant elements of the total cost of ownership [3]. This is because energy is used not only for computation but also for cooling, communication, storage, and display [4]. The focus on performance-at-any-cost computer operations has led to the emergence of supercomputers that consume vast amounts of electrical power and produce so much heat that extensive cooling facilities must be constructed to ensure proper performance. The consequence is that, in the context of deploying an exascale system, the simple scaling of current technologies would result in a supercomputer with a power consumption of 100 MW, while 20 MW has been estimated as the maximum acceptable limit [5]. The attention to flop-per-watt performance has been demonstrated by the introduction, in 2007, of the Green500 List [6], which ranks the top 500 supercomputers by energy efficiency [7]. The same problem also arises in general-purpose data centers: in the US, such infrastructures consumed about 70 billion kWh in 2014, representing 1.8% of total US electricity consumption, as reported in [8]. Some projections for 2020 estimate an electricity demand that varies by about 135 billion kWh, depending on the adoption rate of efficiency measures [9].
This scenario must be combined with the fact that, in the past two decades, computing has been focused on centralized (and possibly complex [10]) infrastructures, but the wider diffusion of cyber-physical systems (CPSs) is currently inverting this trend, pushing computing power back to where data are generated. In both cases, the energy consumption of telecommunication networks is very relevant [11]. A striking example of the trend is the Internet of Things (IoT) paradigm, by which millions of devices generate a huge amount of data that are preprocessed locally before being integrated remotely in a data analytics context. Science is affected as well: the diffusion of powerful data acquisition devices has boosted the diffusion of computational architectures for local preprocessing, such as in bioinformatics [12,13].
While HPC is a well-defined market sector, the so-called "embedded HPC" is an emerging topic [14] that aims to develop and employ microservers and highly parallel embedded computing systems in CPSs. Therefore, the adoption of energy-efficient systems represents a crucial aspect considering the characteristics of fog/edge computing environments [15].
We can formulate the problem as the need to assess a satisfactory tradeoff between time-to-solution and energy-to-solution.
This problem has been tackled with different approaches, which can be summarised as follows: vendors work on power-efficient processor architectures, and software developers work on how to use them. However, to reach exascale computing, an effective solution is possible only by properly managing all layers of the system, from the software stack to the cooling system [16], passing through less power-hungry CPUs. This can be achieved by reducing the energy consumed by the whole system via integrated power-efficient software and hardware solutions [17,18].
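The tradeoff between time-to-solution and energy-to-solution formulated above can be made concrete with a few lines of Python. The following is only a minimal sketch, not a method taken from the cited works: it assumes that energy-to-solution can be approximated as average power multiplied by run time, uses the energy-delay product as one common way to combine the two metrics, and keeps the configurations that no other configuration beats on both time and energy. The `Run` class, the `pareto_front` helper, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One measured execution of the same workload (hypothetical values)."""
    label: str
    time_s: float        # time-to-solution, in seconds
    avg_power_w: float   # average power drawn during the run, in watts

    @property
    def energy_j(self) -> float:
        # Energy-to-solution approximated as average power x run time.
        return self.avg_power_w * self.time_s

    @property
    def edp(self) -> float:
        # Energy-delay product: penalises slow runs as well as hungry ones.
        return self.energy_j * self.time_s


def pareto_front(runs):
    """Return the runs that no other run beats on both time and energy."""
    def dominated(a, b):
        # b dominates a if it is no worse on both metrics and better on one.
        return (b.time_s <= a.time_s and b.energy_j <= a.energy_j
                and (b.time_s < a.time_s or b.energy_j < a.energy_j))
    return [a for a in runs if not any(dominated(a, b) for b in runs)]


if __name__ == "__main__":
    # Made-up measurements, used only to exercise the helpers.
    runs = [
        Run("server-class CPU", time_s=10.0, avg_power_w=200.0),
        Run("low-power SoC", time_s=40.0, avg_power_w=15.0),
        Run("misconfigured node", time_s=50.0, avg_power_w=50.0),
    ]
    for run in pareto_front(runs):
        print(f"{run.label}: {run.time_s:.0f} s, "
              f"{run.energy_j:.0f} J, EDP {run.edp:.0f} J*s")
```

With these invented figures, the fast but power-hungry configuration and the slow but frugal one both survive, which is exactly the situation in which a user has to decide how much extra time is acceptable for a given energy saving.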
Energy efficiency has been a key design challenge for modern computing systems for many years. Even more so now, the Big Data paradigm requires addressing both the efficient processing of an enormous amount of data and how to achieve this goal in a green way, i.e., considering issues related to sustainability and environmental concerns [19]. Therefore, many papers proposing novel techniques for managing power aspects and presenting real-world experiences, together with surveys and overviews, have been published. A critical analysis of how to green the whole life cycle of big data systems is presented in [20]. From a more technical perspective, Czarnul et al. [21] focused on the available methods and tools allowing proper configuration, management, and simulation of HPC systems for energy-aware processing. An overview of application performance analysis tools, including the energetic profiling of an application and auto-tuning tools for energy saving, has been presented in [22]. The usage of low-power System-on-Chip (SoC) architectures for scientific (and industrial) applications is discussed in [23], with the aim of assessing the tradeoff among time-to-solution, energy-to-solution, and economic aspects that they can achieve, for both scientific and commercial purposes, in comparison with the traditional server-grade architectures adopted in present infrastructures.
However, nearly all the existing surveys focus on only one of the two main strategies, i.e.,
(i) The development and usage of new energy-efficient CPUs and SoCs
(ii) The use of software tools and frameworks for reducing the power consumption of software using an existing CPU
Moreover, as recognized by most of these papers, this is a rapidly evolving research field where new results are continuously presented. For example, at the time of writing, the following five European research projects and initiatives are ongoing:
(i) Mont-Blanc 2020, European scalable, modular, and power-efficient HPC processor
(ii) HiPEAC, High Performance and Embedded Architecture and Compilation
(iii) LEGaTO, Low-Energy Toolset for Heterogeneous Computing
(iv) SDK4ED, Software Development toolKit for Energy optimization and technical Debt elimination
(v) TeamPlay, Time, Energy and security Analysis for Multi/Many-core heterogeneous PLAtforms
This is because the European Commission has been aware since at least 2010 that the ICT sector is responsible for rapidly growing carbon emissions, which should be kept to a minimum, and is therefore supporting the development of more energy-efficient computing technologies. Therefore, the main goal of this work is to present the most relevant available solutions for users interested in improving the energy consumption of scientific software, focusing on computation.
This is achieved by investigating the availability and performance of current hardware devices and software tools for scientific applications.
This means that the aspects related to energy efficiency in communications are not considered here. Interested readers can refer to [24,25]. The structure of the paper is as follows: Section 2 presents hardware techniques and solutions for achieving energy-savvy processing, Section 3 discusses tools and methodologies for supporting developers in producing energy-aware software, while the last section concludes the paper.

Dynamic frequency scaling (DFS) or dynamic voltage scaling (DVS) makes it possible to modulate the power consumption of the processor and memory [26], scaling the clock frequency of one or both subsystems according to the execution of memory- or compute-bound application kernels [27].
For example, voltage reduction has to be considered for the heterogeneous accelerators equipping current systems, also because an efficient reduction of the total power can be achieved by applying a different voltage reduction level to each available chip [28].
Very often, voltage and frequency ranges are fully interdependent, i.e., a change in clock frequency implies a change in the supply voltage, and vice versa: in these cases, the technique is called dynamic voltage and frequency scaling (DVFS) [29]. DVFS can be implemented by specific hardware mechanisms with minimal software and operating system involvement or through enabling software.
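Why scaling voltage together with frequency matters can be illustrated with the usual first-order CMOS model, in which dynamic power is proportional to the square of the supply voltage times the clock frequency. The sketch below is an assumption-laden illustration, not a model taken from the cited references: the function name `dvfs_estimate`, the `compute_fraction` parameter (the share of run time that scales with the clock), and the `static_fraction` parameter (the share of nominal power that does not scale at all) are all hypothetical simplifications.

```python
def dvfs_estimate(freq_ratio, volt_ratio, compute_fraction=1.0, static_fraction=0.0):
    """Estimate (time_ratio, energy_ratio) relative to the nominal setting.

    Assumptions (first-order model, for illustration only):
      * dynamic power scales as V^2 * f;
      * only `compute_fraction` of the run time scales as 1/f, the rest
        (e.g., time spent waiting for memory) is unaffected by the clock;
      * `static_fraction` of the nominal power is constant.
    """
    if freq_ratio <= 0.0 or volt_ratio <= 0.0:
        raise ValueError("ratios must be positive")
    if not (0.0 <= compute_fraction <= 1.0 and 0.0 <= static_fraction <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")

    time_ratio = compute_fraction / freq_ratio + (1.0 - compute_fraction)
    power_ratio = ((1.0 - static_fraction) * volt_ratio ** 2 * freq_ratio
                   + static_fraction)
    return time_ratio, power_ratio * time_ratio


if __name__ == "__main__":
    # Illustrative operating points: 80% of nominal frequency,
    # with and without a 10% voltage reduction.
    for label, kwargs in [
        ("compute-bound, frequency only", dict(freq_ratio=0.8, volt_ratio=1.0)),
        ("compute-bound, frequency + voltage", dict(freq_ratio=0.8, volt_ratio=0.9)),
        ("memory-bound, frequency + voltage",
         dict(freq_ratio=0.8, volt_ratio=0.9, compute_fraction=0.0)),
    ]:
        t, e = dvfs_estimate(**kwargs)
        print(f"{label}: time x{t:.2f}, energy x{e:.2f}")
```

Under these assumptions, lowering only the frequency of a compute-bound kernel stretches the run time without reducing the dynamic energy, whereas lowering the voltage as well does reduce it; for a memory-bound kernel the clock can be lowered with little time penalty, which is the rationale for choosing the operating point according to the kind of kernel being executed.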
For example, DVFS is implemented in the Linux kernel with the CPUfreq subsystem [30,31]. The original implementation in kernel 2.6 was designed to be used when no real-time tasks are executed. However, it is possible to relax this constraint [32].
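On a Linux machine, the current CPUfreq policy can be inspected from user space through the sysfs files that the subsystem exposes under `/sys/devices/system/cpu/cpuN/cpufreq/`. The following is a minimal sketch under the assumption that this directory exists on the target system (it is absent on some virtual machines and containers); the helper name `read_cpufreq` and its `root` parameter are hypothetical and exist only to keep the example testable.

```python
from pathlib import Path

# Standard sysfs location of the per-CPU CPUfreq policy files on Linux.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")

# Attributes read by this sketch; frequency values are reported in kHz.
ATTRIBUTES = (
    "scaling_governor",
    "scaling_available_governors",
    "scaling_cur_freq",
    "scaling_min_freq",
    "scaling_max_freq",
)


def read_cpufreq(cpu=0, root=CPUFREQ_ROOT):
    """Return the CPUfreq attributes that are readable for one CPU."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in ATTRIBUTES:
        try:
            info[name] = (base / name).read_text().strip()
        except OSError:
            # File missing or not readable: skip it rather than fail.
            continue
    return info


if __name__ == "__main__":
    for key, value in read_cpufreq(0).items():
        print(f"{key}: {value}")
```

Changing the governor or the frequency limits is done by writing to the same files, which normally requires administrator privileges and is therefore left out of the sketch.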
More recently, other projects focused on near-threshold voltage (NTV) computing [33], making the processors work at even lower voltages. Since this may lead to computation errors, appropriate checks and recomputation have to be added to algorithms in this case.
Conversely, the Intel Turbo Boost technology opportunistically allows the processor to run faster than the nominal frequency, in order to speed up compute-intensive applications, if the CPU is operating below the defined power and temperature limits [34]. In detail, as explained in [35], "the thermal design power (TDP) represents the maximum amount of power the cooling system in a computer requires to dissipate.
This is the power budget under which the system needs to operate. Nevertheless, this is not the same as the maximum power the processor can consume. The processor can consume more than the TDP for a short time without it being thermally significant." More details on this and the hardware power controller called Running Average Power Limit (RAPL) introduced with the Sandy Bridge architecture are provided in [36]. A similar solution, the NVIDIA Management Library (NVML), has been provided for NVIDIA GPUs [37,38].
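On Linux, the RAPL energy counters mentioned above are commonly exposed through the powercap sysfs interface, for example under `/sys/class/powercap/intel-rapl:0/`. The sketch below shows one way to estimate the energy consumed while a function runs by reading the `energy_uj` counter before and after the call. It is a rough illustration under several assumptions: the domain path may differ or be missing on a given machine, reading the counter often requires elevated privileges, the counter covers the whole package rather than a single process, and at most one counter wraparound is handled. The helper names are hypothetical.

```python
import time
from pathlib import Path

# Typical location of the first RAPL package domain (an assumption:
# the path may differ, be absent, or be readable only by root).
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(domain, name="energy_uj"):
    """Read an integer value, in microjoules, from a powercap file."""
    return int((Path(domain) / name).read_text())


def energy_delta_uj(start, end, max_range):
    """Difference between two counter readings, allowing one wraparound."""
    if end >= start:
        return end - start
    return (max_range - start) + end


def measure(fn, domain=RAPL_DOMAIN):
    """Run fn() and return (result, seconds, joules) for the whole domain."""
    max_range = read_uj(domain, "max_energy_range_uj")
    e0 = read_uj(domain)
    t0 = time.perf_counter()
    result = fn()
    seconds = time.perf_counter() - t0
    e1 = read_uj(domain)
    joules = energy_delta_uj(e0, e1, max_range) / 1e6
    return result, seconds, joules


if __name__ == "__main__":
    try:
        _, seconds, joules = measure(lambda: sum(i * i for i in range(5_000_000)))
        print(f"{seconds:.3f} s, {joules:.3f} J (package-wide)")
    except OSError as exc:
        print(f"RAPL counters not available here: {exc}")
```

Because the reading is package-wide, any other activity on the machine is included in the result, so measurements of this kind are meaningful only on an otherwise idle system and after several repetitions.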
The Advanced Configuration and Power Interface (ACPI) specification has been developed since 1996 to make it possible to manage these aspects via software, e.g., at the operating system level. For example, ACPI defines up to 16 active states, named P0-P15, each associated with a set of power/performance/latency characteristics [39]. In P0, the processor runs at the maximum power and frequency level, while these values decrease from P1 up to the maximum supported state Pi [40].

Commercial-Off-the-Shelf Low-Power Devices.
Energy-efficient architectures range from many-core architectures, such as Graphics Processing Units (GPUs), to Systems-on-Chip (SoCs). GPUs feature a high performance-per-watt ratio. At the time of writing this paper, the most powerful GPU devices, AMD MI100 and NVIDIA A100, presented, respectively, a peak performance of 38.33 gigaflops per watt (GFlops/W) and 24.25 GFlops/W considering 64-bit floating-point operations, with a power consumption of, respectively, 300 and 260 W. It is, therefore, clear that GPUs are, on the one hand, energy efficient but, on the other hand, require careful programming and optimization to provide high computing performance. The increasingly adopted class of low-power processors, often called Systems-on-Chip (SoCs), originally designed for the embedded and mobile market, represents an attractive solution for scientific and industrial applications given their increasing computing performance coupled with relatively low cost and low electrical power demand.
SoC hardware platforms typically embed in the same die low-power multicore processors possibly combined with a GPU and all the circuitry needed for several I/O devices. For the case of off-the-shelf SoCs, various limitations may arise, such as 32 bit-only architectures, small CPU caches, small RAM sizes, high latency interconnections, and unavailability of ECC memory.
However, some solutions are progressively reducing the performance gap with high-end processors, with the added value of keeping a competitive edge on costs, reducing their carbon footprint, and preserving the environment. For these reasons, in this paper, we disregard devices such as Arduino or Raspberry Pi that, even if considered for compute-intensive applications [41], are mainly used for equipping IoT systems [42,43] without significant local preprocessing of data.
Fugaku represents the most important example of the adoption of SoCs for HPC: it is the first supercomputer in the TOP500 list of November 2020 (the most recent at the time of writing this paper) and is equipped with Fujitsu's 48-core A64FX SoC, providing a performance-per-watt value comparable with that of GPU-based systems [44].
In the corresponding Green500 List, Fugaku appears in position 10 with a value of 15.418 GFlops/W, while NVIDIA DGX SuperPOD, the most energy-savvy system, which is equipped with NVIDIA A100 GPUs, provides 26.195 GFlops/W but is ranked only at position 170 in the TOP500. A more interesting comparison is between Fugaku and Selene, another supercomputer equipped with A100 GPUs: the latter appears in position 5 in both lists, with a value of 23.983 GFlops/W, but provides only 63,460 TFlops compared with the 442,010 TFlops provided by Fugaku.
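The figures quoted above for Fugaku and Selene can be combined to see what they imply in terms of absolute power draw. The sketch below simply divides the sustained performance by the efficiency, assuming that both values refer to the same benchmark run; the function name `implied_power_mw` is hypothetical, and the result is a back-of-the-envelope estimate rather than an officially reported power figure.

```python
def implied_power_mw(performance_tflops, efficiency_gflops_per_w):
    """Power (in MW) implied by a performance and an efficiency figure."""
    gflops = performance_tflops * 1e3          # TFlops -> GFlops
    watts = gflops / efficiency_gflops_per_w   # GFlops / (GFlops/W) = W
    return watts / 1e6                         # W -> MW


if __name__ == "__main__":
    # Values taken from the TOP500/Green500 comparison in the text.
    systems = {
        "Fugaku": (442_010, 15.418),
        "Selene": (63_460, 23.983),
    }
    for name, (tflops, gflops_per_w) in systems.items():
        print(f"{name}: about {implied_power_mw(tflops, gflops_per_w):.2f} MW")
```

Read this way, the two systems differ by roughly a factor of seven in performance and by an order of magnitude in implied power, which shows why the efficiency ranking alone does not tell the whole story when comparing machines of very different sizes.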
As for most HPC architectures, the question remains [45]: do the raw numbers related to performance per second and per watt correspond to achievable performance figures for most scientific applications and, in particular, for the application I am interested in?
This was the goal of the Computing On SOC Architecture (COSA) project [46,47], an initiative funded by the Italian Institute for Nuclear Physics (INFN) between 2015 and 2018. In particular, the COSA project focused on assessing the energy consumption behavior of a wide set of state-of-the-art architectures using benchmarks and software widely used in many scientific applications.
In particular, an in-depth comparison of the performance of x86-based SoCs (i.e., Pentium N3700 and J4205, Avoton C2750, Xeon D1540, and Atom C3958) and low-power GPUs (i.e., Jetson TK1 and TX1) against state-of-the-art high-end solutions (i.e., Xeon E5-2683 and Tesla K20) is discussed in [23] with two benchmarks, namely the widely used, computationally intensive N-body algorithm and a deep learning approach applied to a classification problem, together with a real-world application taken from the field of molecular biology.
Although comparing high-end commercial/HPC servers with motherboards based on low-power SoCs taken from the mobile and embedded world can be considered unfair, the results show that the use of low-power architectures represents a feasible choice in terms of the tradeoff among time-to-solution, energy-to-solution, and economic aspects. The authors also discuss the economic aspects in [15,48] by showing how a proper placement of the computational services across infrastructures composed of edge, fog, and cloud resources is the key factor for achieving the best tradeoff between costs, performance, and power consumption.
Regarding the usage of SoCs based on ARM instruction set architectures (ISAs) or FPGAs, a quantitative evaluation is presented, for example, in [49], again using the N-body algorithm. Both these devices have been exploited in the ExaNoDe project to build a prototype of computing element for exascale [50].
However, it should be noted that porting code to these architectures is somewhat more complex because the development and tuning tools have not yet reached the maturity level and ease of use of those provided for free by Intel or NVIDIA, nor do they provide an equally wide set of functionalities [51].

HPC Low-Power Devices.
If we move from off-the-shelf products to the design of new solutions for joining high performance and energy efficiency, one of the most important references is represented by the Mont-Blanc project, started in 2011. Its goal is to foster the development of a low-power European processor for Exascale, with a target of 50 GFlops/W at the processor level. This project is part of the European Processor Initiative, a Framework Partnership Agreement to develop the European skills in the design and exploitation of such processors.
Also, this project, together with ExaNoDe [52], is part of a wider group of EU-funded projects (e.g., ExaNeSt [53] focused on interconnection and storage and Ecoscale [54] focused on the heterogeneous architecture and, in particular, on the use of FPGAs), pursuing a strategic vision for economical, low-power approaches.
Also, the Mont-Blanc projects consider the use of ARM instruction set architectures (ISAs), such as the ThunderX processor family [55], and quantitative evaluations about different energy-performance tradeoffs achievable when designing an architecture based on mobile market technologies have been presented [56].
Heterogeneity seems to represent the most promising way forward, e.g., by integrating CPUs (x86 or ARM), GPUs, and FPGAs in a single platform [57]. The great efforts in developing unified programming models and APIs supporting all these heterogeneous hardware architectures, such as OpenCL, SYCL, and oneAPI [58], also demonstrate this trend.

Tools for Energy-Efficient Computing
In the previous section, we saw that power and energy consumption have become the driving metrics for computing hardware design, and we presented the most interesting CPUs. However, the advances in hardware efficiency must be followed by energy-aware algorithms, appropriate choice and allocation of specific hardware to applications, and adequate management techniques.
One of the most complete and interesting introductions to the problem was presented by Prof. Gallaghers [59] at the summer school "ICT-Energy: Energy consumption in future ICT device" organized in 2016 within the context of the ICT-Energy European project [60].
The key concept is that energy is consumed by hardware, but this occurs under the control of software. Common high-level languages (e.g., C++ and Java) hide the hardware characteristics, but the key aspect is that the same high-level code (e.g., in C++) can be translated into many different machine instruction programs with different energy consumption figures. In this respect, an interesting tool is Compiler Explorer [61], an open-source web application based on Node.js [62] for interactively observing compiler code generation. It shows the assembly output of the compiled code for different compilers and compiler versions, from which valuable information can be extracted, for example, for evaluating the power consumption. Therefore, energy saving has to start at the software level to be propagated to the hardware level. Techniques for saving energy with power-aware hardware management or power capping [63], described in the previous section, can represent a valuable complement. However, a key aspect, neglected by nearly all programmers, is their active engagement in inspecting where a program wastes energy and, therefore, in experimenting with different designs.
This is obviously coupled with the fact that results have to be produced within an acceptable deadline [64], an aspect often disregarded when approaching the energy efficiency problem.

Profiling Tools.
The first step for achieving energy-efficient behavior is to investigate software behavior using information gathered as a program executes (i.e., profiling it) or simulating this through a performance model.
One of the most used tools for profiling is the Performance API (PAPI) analysis library [65]. PAPI is platform independent and provides developers with an interface and methodology for gathering performance-related data made available by hardware. The basic principle is to allow developers to see the relation between the software performance and processor events. As regards the power consumption, PAPI has been extended to measure and report energy and power values also on complex architectures [66].
Also, the PowerPack framework [67] provides a set of tools for analyzing the energetic performance. Unlike PAPI, the measurements are gathered on a separate machine in order to limit probe effects. The Scalable Performance Measurement Infrastructure for Parallel Codes (Score-P) [68] has been extended for collecting information from technologies such as the aforementioned Intel RAPL.
Extrae is a tool relying on PAPI that allows collecting its counter metrics (including power and thermal data) for parallel programs [37]. The analysis of such information is effectively supported by Paraver, a visual data browser developed, like Extrae, at the Barcelona Supercomputing Center [69].
The Energy-Aware COmputing Framework (EACOF) has been designed to allow developers to profile their code for energy consumption [70]. In particular, it allows profiling codes in order to know exactly where energy is being used. Moreover, it allows applications to adapt at runtime based on current energy consumption. As an example application, the authors proposed a video player that may intelligently adapt based on energy consumption readings to ensure a video will complete before the battery runs out. The framework is available on GitHub [71], but no updates have been published since 2015.
In general, many tools such as these have been presented in the literature. It is worth citing EProf [72], whose main feature is the support for fine-grained attribution of energy consumption to a particular function/software segment. However, in most cases, such tools are no longer actively maintained after the end of the projects in which they were developed, and the software becomes difficult, if not impossible, to find and run.
A similar fate occurred for the Multiple Metrics Modeling Infrastructure (MuMMI) [73] project, focused on integrating existing tools such as PAPI and PowerPack for facilitating measurement, modeling, and prediction of software for multicore systems.

Dynamic Tuning.
Some tools aim to achieve energy savings automatically. Many of them have been proposed, e.g., [74,75], but, as stated before, they are not actively maintained. Here, we present just two of them because they are not part of the wider, integrated solutions that are discussed below.
The Global Extensible Open Power Manager (GEOPM) is a framework for exploring power and energy optimizations targeting high-performance computing [76]. One of the most interesting features is the possibility to dynamically coordinate hardware settings across all compute nodes used by an application in response to the application's behavior and requests from the resource manager. For example, it is possible to optimize MPI applications to improve energy efficiency or reduce the effects of work imbalance, system jitter, and manufacturing variation through built-in or user-defined control algorithms. The framework is available on GitHub [77].
The COUNTDOWN Slack library [78] identifies and automatically reduces power consumption during communication and synchronization primitives [79]. The library tackles the problem of power wasted in communication and synchronization operations because of the adopted blocking mechanisms [80]: for example, nearly all MPI implementations use a busy-waiting mechanism. This library, on the contrary, is able to run a processor in a low-power mode, resulting in lower power consumption with limited or no impact on the execution time [81].
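The cost of busy-waiting can be observed even in a small, self-contained experiment. The sketch below is not how COUNTDOWN works internally and does not involve MPI; it only compares, as an analogy, the CPU time consumed by a thread that polls a flag in a tight loop with that of a thread that blocks until it is notified. All function names are hypothetical.

```python
import threading
import time


def busy_wait(event):
    """Poll the flag in a tight loop, keeping the core fully busy."""
    while not event.is_set():
        pass


def blocking_wait(event):
    """Sleep until notified, leaving the core free to idle."""
    event.wait()


def cpu_time_while_waiting(wait_fn, delay_s=0.2):
    """CPU seconds used by this process while waiting delay_s for a signal."""
    event = threading.Event()
    timer = threading.Timer(delay_s, event.set)  # fires the signal later
    start = time.process_time()
    timer.start()
    wait_fn(event)
    used = time.process_time() - start
    timer.join()
    return used


if __name__ == "__main__":
    print(f"busy-waiting: {cpu_time_while_waiting(busy_wait):.3f} s of CPU time")
    print(f"blocking:     {cpu_time_while_waiting(blocking_wait):.3f} s of CPU time")
```

Both variants wait for the same wall-clock interval, but the polling loop keeps a core active for the whole duration, which is the kind of waste that motivates lowering the processor's power state during communication and synchronization phases.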

Integrated Solutions.
The Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing (READEX) project has been funded by the European Union's Horizon 2020 research program between 2015 and 2018 to develop a tool-aided methodology for dynamic autotuning for performance and energy efficiency [82]. The tool suite was released in 2018, and it is available via GitHub [83]. The methodology is based on instrumenting an application with Score-P. This can be performed automatically by compiling the application with Score-P. Then, the dynamism of the application is detected and analyzed in order to identify the significant regions that will be managed with the project tuning methodology at runtime.
The key advantage of this suite is that it can be exploited by any developer, even one who is unaware of the READEX methodology, with the result of increasing the energy efficiency of the application. It has been estimated that the application of the READEX tool suite to a fairly complex application can take several days [84], mainly for compiling the application with Score-P.
The Low-Energy Toolset for Heterogeneous Computing (LEGaTO) project has been funded by the European Union's Horizon 2020 research program between 2017 and 2020 to design and develop a software toolchain for energy-efficient computing on heterogeneous hardware, i.e., a system equipped with CPUs, GPUs, and FPGAs [57,85].
The toolchain was released in 2020, and it is available via GitHub [86]. It is composed of several software components integrated to achieve a consistent programming environment across heterogeneous hardware platforms. The heart of the toolchain is represented by OmpSs [87], an extension to OpenMP developed at the Barcelona Supercomputing Center for supporting asynchronous parallelism on heterogeneous resources such as multicore CPUs, GPUs, and FPGAs.
An application in the OmpSs programming model is composed of one or more tasks with possible data dependency flow among some of them. The runtime environment analyses the resulting graph and produces a correct and possibly concurrent order of task execution. Several compiler and runtime systems (e.g., Nanos6, XiTAO [88], and Mercurium) support the process and manage all the energy efficiency, security, and fault-tolerance aspects [89]. Three use cases have been defined in healthcare, IoT for Smart Homes and Cities, and machine learning because they have different requirements in terms of energy efficiency, fault tolerance, and security. Results have been published in Deliverable 5.4 [90].
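The task-based execution model described above for OmpSs, in which the runtime derives a valid and possibly concurrent execution order from the data dependencies among tasks, can be illustrated independently of OmpSs itself. The sketch below is not OmpSs code and does not use its API: it relies on the `graphlib` module of the Python standard library (Python 3.9 or later) and on a hypothetical description of each task as a pair of input and output data names, assuming that every datum is produced by a single task.

```python
from graphlib import TopologicalSorter


def schedule(tasks):
    """Group tasks into waves that respect their data dependencies.

    tasks maps a task name to (inputs, outputs), two lists of data names.
    Tasks in the same wave have no mutual dependency and could run
    concurrently; waves must be executed in order.
    """
    # Which task produces each datum (assumed unique per datum).
    producer = {}
    for name, (_, outputs) in tasks.items():
        for datum in outputs:
            producer[datum] = name

    # A task depends on the producers of the data it reads.
    graph = {
        name: {producer[d] for d in inputs
               if d in producer and producer[d] != name}
        for name, (inputs, _) in tasks.items()
    }

    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on cyclic dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Hypothetical application: one loader, two independent filters,
    # and a final task that combines their results.
    tasks = {
        "load": ([], ["a", "b"]),
        "filter_a": (["a"], ["x"]),
        "filter_b": (["b"], ["y"]),
        "combine": (["x", "y"], ["result"]),
    }
    for i, wave in enumerate(schedule(tasks)):
        print(f"wave {i}: {', '.join(wave)}")
```

A real runtime would additionally decide on which device each ready task should run, which is where energy-related policies of the kind discussed here come into play.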
The Software Development toolKit for Energy optimization and technical Debt elimination (SDK4ED) project has been funded by the European Union's Horizon 2020 research program between 2018 and 2020 to minimize the cost, the development time, and the complexity of low-energy software development processes by designing a methodological approach and a software toolchain [91]. The SDK4ED platform [92] consists of five toolboxes: Technical Debt Management, Energy Optimization, Dependability Optimization, Forecaster, and Decision Support. They are implemented following the microservice paradigm as Docker images containing the specific web service.
Focusing on the Energy toolbox, it analyses projects available in an online repository (e.g., GitHub) on the machine running the Docker container with regard to their energy efficiency. This means it finds the energy hotspots, estimates the energy consumption through static or dynamic analysis [93,94], and inspects possible solutions by suggesting specific code refactoring. This is a valuable approach, in particular, for software reuse [95]. The project ended at the end of 2020. Therefore, at the time of writing, not all the details and the code are available.
The Time, Energy, and security Analysis for Multi/Many-core heterogeneous PLAtforms (TeamPlay) project has been funded by the European Union's Horizon 2020 research program since 2018 to design and develop new techniques for producing highly parallel software for low-energy systems, such as IoT devices and CPSs [96]. The idea is to develop a set of tools for allowing programmers to reason about time, energy, and security at the program source level. To this end, new language constructs are designed to manage these extrafunctional properties as first-class citizens of the source code and to express contracts in the source code that are machine-checkable by an underlying proof system. The project is ongoing; therefore, at the time of writing, little information and few software components were available.

Conclusions
Energy consumption is increasingly becoming one of the most relevant issues concerning the computing platforms for scientific applications and workloads.
As stated in [97], the huge level of energy consumption of ICT systems is probably due to the fact that nobody really cared for a long time, but today, things are changing because of economic reasons and also because our way of thinking has changed.
In this paper, we presented state-of-the-art solutions, both hardware and software, and methodological approaches for pursuing energy efficiency in scientific software to provide interested readers with an updated introduction to the topic. The conclusion we can derive is that there are an increasing number of projects focusing on these topics, and some interesting SoC-based solutions are available. From the software side, instead, the situation is not satisfactory because tools are sometimes difficult to find, not integrated, and, very often, disappear after the end of the project that developed them. What is actually needed is the definition of a common methodology and a coordination effort of groups acting in this field comparable with that of the Virtual Institute-High-Productivity Supercomputing (VI-HPS) [98], having in mind the tradeoff among time-to-solution, energy-to-solution, and usability of the proposed tools.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.