Improving the accuracy of energy predictive models for multicore CPUs by combining utilization and performance events model variables

Energy predictive modeling is the leading method for determining the energy consumption of an application. Performance monitoring counters (PMCs) and resource utilizations have been the principal sources of model variables, primarily due to their high positive correlation with energy consumption. Performance events, however, have come to dominate the landscape due to their better prediction accuracy compared to utilization variables. Recently, the theory of energy of computing has been proposed, whose practical implications for constructing accurate and reliable linear energy predictive models are unified in a consistency test that includes a selection criterion of additivity for model variables. In this work, we analyze the prediction accuracy of models employing utilization variables only, PMCs only, and a combination of both utilization variables and PMCs, through the lens of this theory, for modern multicore CPU platforms. We discover that employing utilization variables only in linear energy predictive models does not capture all the energy-consuming activities during an application execution. However, combining utilization variables with PMCs that are highly additive and highly correlated with energy consumption gives the most accurate linear energy predictive model. Our experimental results show that application-specific and platform-level models using both utilization variables and PMCs exhibit up to 3.6× and 2.6× better average prediction accuracy, respectively, when compared with models employing utilization variables only and highly additive PMCs only.


Introduction
Accurate measurement of energy consumption during an application execution is key to energy minimization at the application level. There are three mainstream measurement approaches: (a) system-level physical power measurements using external power meters, (b) on-chip power sensors, and (c) energy predictive models. System-level physical measurements using external power meters are considered the ground truth. Fahad, Shahid, Reddy, and Lastovetsky [23] present a comparative study of on-chip sensors and energy predictive models against the ground truth. Briefly, they show that the profiles obtained for the dynamic energy consumption of applications using on-chip sensors deviate significantly from the ground truth, suggesting that on-chip power sensors do not holistically capture the dynamic energy consumption during an application run. We now present a synopsis of the development of the energy predictive modeling landscape. Energy predictive models emerged as a dominant energy measurement approach because of their ability to provide a fine-grained component-level breakdown of energy consumption. Resource utilizations and performance monitoring counters (PMCs) have been the principal sources of model variables, primarily due to their high positive correlation with energy consumption. There are three prominent kinds of models based on them. The first kind [25,27,32,36,58,64] is based on utilizations of resources (CPU, memory, disk, and network). The second kind [7-10,31,34,39,40,51] employs PMCs. PMCs are special-purpose hardware registers provided in modern processor architectures to record the counts of software events, which represent kernel-level activities such as page-faults, context-switches, etc., and hardware events arising from the micro-architecture core and the performance monitoring unit, such as CPU-cycles, branch-instructions, cache-misses, etc. While utilizations are high-level metrics, PMCs are pure counters that contain activity or access counts.
The third kind of models is based on both utilizations and PMCs [22,37,49]. All the proposed models (with the exception of [37,51]) predict power consumption. The energy consumption during an application execution is then determined through a creative application of the power model. One approach is to obtain the area under the discrete function of the power measurements provided by the model. Utilization models were shown to exhibit poorer prediction accuracy than PMC-based models [48,49] on multicore CPU platforms. Research works [23,30,42,45] demonstrate the poor prediction accuracy of PMC-based models and report that linear regression models yield prediction errors as high as 150%. This can be explained as follows. First, modern multicore CPU platforms have several inherent complexities: (a) severe resource contention due to the tight integration of tens of cores organized in multiple sockets with a multi-level cache hierarchy and contending for shared on-chip resources such as the last-level cache, interconnect, and DRAM controllers; (b) non-uniform memory access (NUMA), where the time for memory access between a core and main memory is not uniform and where main memory is distributed between locality domains or groups called NUMA nodes; and (c) dynamic power management of multiple power domains (CPU sockets, DRAM). Due to these complexities, the energy consumption of a computing resource demonstrates a non-linear and non-smooth functional relationship with the utilization of the resource. Second, a sound theoretical framework to understand the fundamental significance of the model variables with respect to energy consumption, and the causes of the inaccuracy or the reported wide variance in accuracy of the models, has been lacking.
The theory of energy of computing has progressively matured over the past three years, starting with a proposal of a criterion for the selection of PMCs in the research work [52], followed by a formal description of the theory and its practical implications in [53]. Shahid et al. [53] propose a novel theory of energy of computing and unify its practical implications to increase the prediction accuracy of linear energy predictive models in a consistency test, which contains a suite of properties that include determinism, reproducibility, and additivity to select model variables and constraints for model coefficients. The authors show that the average prediction error of linear regression models can be reduced from 31% to 18% by selecting PMCs that pass the consistency test of the theory of energy of computing [51].
In this work, we aim to improve the prediction accuracy of linear energy predictive models further by combining utilization variables and PMCs. We perform a comparative study of the prediction accuracy of models employing utilization variables only, PMCs only, and combining both utilization variables and PMCs using the consistency test of the theory of energy of computing on modern multicore CPU platforms. We first check the reliability of CPU and memory utilizations as model variables in energy predictive models using the consistency test. We study the additivity of average CPU and memory utilization for the execution of applications on two modern multicore servers, HCLServer1 and HCLServer2 (specifications are shown in Table 1). Our results show that utilization variables are highly additive (satisfying the input tolerance of 5%) for our application suite and pass the consistency test.
We then employ the utilization variables in linear energy predictive models at two levels, platform and application, on both the servers. We demonstrate that the models exhibit poor accuracy with an average prediction error of up to 50%. Models that employ only PMCs have an average prediction error of up to 36%. Combining utilization variables along with PMCs that are highly additive and highly correlated with energy consumption, however, yields the most accurate linear energy predictive model. Application-specific and platform-level models using both utilization variables and PMCs perform up to 3.6× and 2.6× better in terms of average prediction accuracy when compared with models employing utilization variables only and PMCs only.
We illustrate our findings using results from one of our experiments. Fig. 1 shows the predictions of application-specific models constructed for dense matrix multiplication (DGEMM) employing utilization variables only (UPT), PMCs only (PMC), and both utilization variables and PMCs (UPMC). HCLWattsUp represents the ground truth, which is system-level physical power measurements using power meters. The ground truth profile exhibits drastic variations due to the inherent complexities in modern multicore CPU platforms. Due to these complexities, the profiles of dynamic energy against workload size for real-life data-parallel applications have a complex and non-smooth functional relationship [38,46,47]. Utilization variables capture the overall energy consumption trend (or the average energy consumption) of the profiles of the application executions. However, they do not capture the variations in the profiles. Models based on highly additive and highly energy-correlated PMCs accurately capture these variations, which account for most of the energy-consuming activities during the execution of an application. They do not, however, account for some energy-consuming activities that are captured by high-level utilization variables. Models that employ both utilization variables and highly additive and highly energy-correlated PMCs are able to account for all the energy-consuming activities during the execution of an application and hence are found to provide the best accuracy.
The original contributions of this work are summarized as follows:
• A first experimental study analyzing the prediction accuracy of linear energy predictive models employing utilization variables only, PMCs only, and both utilization variables and PMCs, with model variables selected based on the theory of energy of computing, on modern multicore CPU platforms.
• We show that models employing both utilization variables and highly additive and energy correlated PMCs are able to better account for the energy-consuming activities during the execution of an application and hence are found to provide significant improvements in prediction accuracy compared to models that are based on utilization variables only and PMCs only.
We organize the rest of this paper as follows. Section 2 presents the terminology. Section 3 contains the literature review. Section 4 overviews the theory of energy of computing and contains the expressions for various energy predictive models. Section 5 contains the experimental setup and details the selection and measurement of model variables. Section 6 presents the experimental results and analysis. In Section 7, we present the discussions containing the learned lessons for improving the prediction accuracies of energy predictive models and future work. Finally, Section 8 concludes the paper.

Terminology: Energy consumption and energy predictive models
There are two types of energy consumption: static energy and dynamic energy. The total energy consumption is the sum of the dynamic and static energy consumptions. The static energy consumption is calculated by multiplying the idle power of the platform (without application execution) by the execution time of the application. The dynamic energy consumption is calculated by subtracting this static energy consumption from the total energy consumed by the platform during the application execution. That is, if P S is the static power consumption of the platform and E T is the total energy consumption of the platform during the execution of an application that takes T E seconds, then the dynamic energy E D is equal to E D = E T − (P S × T E ). We present the rationale behind using dynamic energy consumption instead of total energy consumption in section 2 of the supplemental [54]. In this work, we consider only the dynamic energy consumption.
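As a concrete illustration, the decomposition above reduces to simple arithmetic. The following sketch (with made-up measurements) computes the dynamic energy from the total energy, idle power, and execution time:

```python
def dynamic_energy(total_energy_j, static_power_w, exec_time_s):
    """Dynamic energy E_D = E_T - (P_S * T_E)."""
    return total_energy_j - static_power_w * exec_time_s

# Hypothetical measurements: 5000 J total, 40 W idle power, 60 s run
e_d = dynamic_energy(5000.0, 40.0, 60.0)  # 5000 - 2400 = 2600 J
```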
The dynamic energy predictive models are built using specialized linear regression. The mathematical form of a model is shown below:

E D = β 1 × x 1 + · · · + β n × x n (1)

where E D is the dynamic energy consumption (the dependent variable), {x 1 , . . . , x n } are the independent variables, and {β 1 , . . . , β n } are the regression coefficients or the model parameters. In real life, there is usually stochastic noise (measurement errors). Therefore, the measured energy E is typically expressed as

E = β 1 × x 1 + · · · + β n × x n + ϵ (2)

where the error term or noise ϵ is a Gaussian random variable with expectation zero and variance σ 2 , written ϵ ∼ N (0, σ 2 ).
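A minimal sketch of fitting such a model with ordinary least squares on synthetic data (all counts and coefficients below are made up; the paper's actual models are fitted to measured energy, PMC, and utilization data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 200, 3
X = rng.uniform(1.0, 100.0, size=(n_obs, n_vars))  # independent variables x_1..x_n
beta_true = np.array([0.5, 2.0, 1.2])              # hypothetical coefficients
# Measured energy: deterministic linear part plus Gaussian noise eps ~ N(0, 1)
E = X @ beta_true + rng.normal(0.0, 1.0, size=n_obs)

# Least-squares fit with no intercept, matching the model form above
beta_hat, *_ = np.linalg.lstsq(X, E, rcond=None)
```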

Related work
Our literature survey is organized as follows: (a) Survey of the mainstream methods for power and energy measurement, (b) Survey of power and energy predictive models based on utilization variables and PMCs, (c) Review of notable literature surveys on energy predictive models, and (d) Finally, recent advancements in this field.

Mainstream methods for energy measurements
There are three mainstream methods for energy measurement: (1) System-level power measurements using physical power meters, (2) On-chip power sensors, and (3) Energy predictive models. The first method is considered the ground truth. Fahad et al. [24] present the first methodology to measure the component-level energy consumption of a hybrid application on a heterogeneous computing platform using this method.
The second method is now supported by popular processor vendors, who provide vendor-specific software libraries to acquire the power data from the on-chip sensors. Intel CPUs offer Running Average Power Limit (RAPL) [50] to monitor power and control frequency (and voltage). AMD, starting from the Bulldozer micro-architecture, equips its processors with an estimate of average power over a certain interval through the Application Power Management (APM) capability [20]. Hackenberg et al. [30] report that APM provides highly inaccurate data, particularly during the processor sleep states. Intel Xeon Phi co-processors are equipped with an on-board Intel System Management Controller chip (SMC) [17] providing energy consumption data that can be programmatically obtained using the Intel manycore platform software stack (Intel MPSS) [16]. The accuracy of Intel MPSS is not available. The Nvidia Management Library (NVML) [4] provides programmatic interfaces to obtain the energy consumption of an Nvidia GPU from its on-chip power sensors. There are, however, some issues with the energy measurements provided by Nvidia on-chip sensors [13]. Fahad et al. [23] present the first detailed study on the accuracy of on-chip power sensors and show that the deviations of the energy measurements provided by on-chip sensors, including Intel RAPL, from the ground truth do not motivate their use in the optimization of applications for dynamic energy.
The third method using energy predictive models emerged as a popular alternative to determine the energy consumption of an application. Performance monitoring counters (PMCs) and resource utilizations have been the principal source of model variables primarily due to their high positive correlation with energy consumption. PMCs, however, have come to dominate the landscape due to their better prediction accuracy compared to utilization variables.

Utilization based models
The early power models using resource utilization parameters (such as CPU, memory, network, and I/O utilization statistics) as predictor variables include [25,27,32,36,58,64]. The general utilization-based model for total power consumption can be described as follows:

P = c base + c 1 × U CPU + c 2 × U Mem + c 3 × U Disk + c 4 × U Net (3)

where c base is the base power consumption of a processor and {c 1 , c 2 , c 3 , c 4 } are the regression coefficients or the model parameters. U CPU , U Mem , U Disk , and U Net are the CPU, memory, disk, and network utilizations, respectively.
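A direct evaluation of this utilization-based form can be sketched as follows, with hypothetical coefficients (in watts) and utilizations in [0, 1]:

```python
def utilization_power(c_base, coeffs, utils):
    """P = c_base + c1*U_CPU + c2*U_Mem + c3*U_Disk + c4*U_Net."""
    return c_base + sum(c * u for c, u in zip(coeffs, utils))

# Hypothetical: 50 W base power; CPU at 80%, memory 50%, disk 10%, network idle
p_total = utilization_power(50.0, [30.0, 10.0, 5.0, 2.0], [0.8, 0.5, 0.1, 0.0])
# 50 + 24 + 5 + 0.5 + 0 = 79.5 W
```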

PMC based models
Tools to Determine PMCs: PAPI [2] provides a standard API for accessing PMCs available on most modern microprocessors. It provides two types of events: preset events and native events. Native events correspond to PMCs native to a platform. Likwid [59] provides command-line tools and an API to obtain PMCs for Intel, POWER8, and AMD processors on the Linux OS. For Nvidia GPUs, the CUDA Profiling Tools Interface (CUPTI) [3] can be used to obtain the PMCs for CUDA applications. Intel PCM [1] is used for reading PMCs of the core and uncore (which includes the QPI) components of an Intel processor. It supports the statistical analysis of core frequencies, QPI power, and DRAM activities. Linux Perf [61], also called perf_events, can be used to gather the PMCs for CPUs in Linux. It also comes as a profiling tool suite including perf stat, perf record, perf report, perf annotate, perf top, and perf bench.
From the literature, we can divide the approaches to select the PMCs into the following categories:
1. Approaches that consider all the PMCs to capture all possible contributors to energy consumption. To the best of our knowledge, we found no research works that adopt this approach. This could be due to several reasons:
• Gathering all PMCs requires huge programming effort and time.
• Interpretation (for example, visual) of the relationship between energy consumption and PMCs is difficult especially when there is a large number of PMCs.
• Dynamic or runtime models must choose PMCs that can be gathered in just one application run.
• Typically, simple models (those with fewer parameters) are preferred over complex models not because they are accurate but because simplicity is considered a desirable virtue.
2. Approaches that are based on a statistical methodology such as correlation and principal component analysis for the selection of PMCs.
3. Approaches that use expert advice or intuition to pick a subset (that may not necessarily be determined in one application run) that, in the experts' opinion, is a dominant contributor to energy consumption.
4. Approaches that select PMCs using a theoretical model of the energy of computing, which is a manifestation of the fundamental physical law of energy conservation [52,53].

Critiques of PMCs for Energy Predictive Modeling:
There are research works that have critically examined and highlighted the poor prediction accuracy of PMCs for energy predictive modeling. Economou et al. [22] highlight a fundamental limitation, which is the inability to obtain all the PMCs simultaneously or in one application run. They also mention the lack of PMCs to model the energy consumption of disk I/O and network I/O. McCullough et al. [42] evaluate the competence of predictive power models for modern node architectures and show that linear regression models exhibit prediction errors as high as 150%. They suggest that direct physical measurement of power consumption should be the preferred approach to tackle the inherent complexities posed by modern node architectures. O'Brien et al. [45] survey predictive power and energy models focusing on the highly heterogeneous and hierarchical node architectures in modern HPC computing platforms. They report prediction errors of linear PMC-based energy predictive models as high as 60%.

Models employing utilization variables and PMCs
Economou et al. [22] propose a linear power predictive model that employs CPU, disk, and network utilizations and a PMC containing the off-chip memory access count. Rivoire et al. [49] compare five full-system real-time power models. Four of these models are utilization-based, whereas the fifth is the model proposed by [22]. They report that the PMC-based model is the best overall in terms of accuracy since it is able to account for the majority of the contributors to the system's dynamic power.
Khokhriakhov et al. [37] propose a qualitative linear dynamic energy model employing CPU utilization and PMCs to explain the energy nonproportionality discovered on their multicore CPU platforms. The model is a linear function of u cpu , the CPU utilization; p l , the PMC containing the time of page walk caused by a load miss in the dTLB; p s , the PMC containing the time of page walk caused by a store miss in the dTLB; and t, the execution time of the application. The model is used to show that the energy nonproportionality is due to the activity of the data translation lookaside buffer (dTLB), which is disproportionately energy expensive.

Important surveys on energy predictive models
Mobius et al. [43] present a survey of power consumption models for single-core and multicore processors, virtual machines, and servers. They conclude that linear regression-based approaches dominate and that one prominent shortcoming of these models is that they use static instead of variable workloads for training. Dayarathna et al. [19] present an in-depth survey on data center power modeling. Bridges et al. [11] present a survey of techniques to monitor and model the energy consumption of GPUs. They cover in-depth PMC-based modeling of GPUs. They also state that the accuracy of results from internal power meters must be thoroughly verified using external power meters. O'Brien et al. [45] survey predictive power and energy models focusing on the highly heterogeneous and hierarchical node architectures in modern HPC computing platforms.

Recent advancements in the energy predictive models employing PMCs
In all aforementioned works, a sound theoretical framework to understand the fundamental significance of the model variables with respect to the energy consumption and the causes of inaccuracy or the reported wide variance of accuracy of the models has been lacking.
The theory of energy of computing has progressively matured over the past three years, starting with a proposal of a criterion for the selection of PMCs in the research work [52], followed by a formal description of the theory and its practical implications in [53]. Shahid et al. [52] propose a novel property of PMCs called additivity, which is based on an intuitive and simple rule: if a PMC is intended to be employed as a model variable in a linear energy predictive model, then its count for a compound application should be equal to the sum of its counts for the executions of the base applications forming the compound application. A compound application is defined as the serial execution of two applications. The rule is based on the experimental observation that the dynamic energy consumption of a serial run of two applications is the sum of the dynamic energy consumptions observed for the sole executions of each application. The authors study the additivity of PMCs provided by the two mainstream frameworks, Likwid [59] and PAPI [2], on a modern Intel Haswell multicore server. They demonstrate that many PMCs that are available on modern processors through Likwid and PAPI and that are employed in state-of-the-art models are non-additive.
Shahid et al. [53] propose a novel theory of energy of computing and unify its practical implications to increase the prediction accuracy of linear energy predictive models in a consistency test, which contains a suite of properties that include determinism, reproducibility, and additivity to select model variables and constraints for model coefficients. By applying the consistency test, the authors reduce the average prediction error of state-of-the-art linear regression models from 31% to 18%. Shahid et al. [51] demonstrate that the accuracy of energy predictive models based on three popular mainstream techniques (linear regression, random forests, and neural networks) can be improved by following the properties of the consistency test, which includes selecting PMCs based on the property of additivity. They show that the removal of non-additive PMCs reduces the average prediction error of linear regression models from 31% to 18%, random forest models from 38% to 24%, and neural network models from 30% to 24%.

Theory of energy of computing: Practical implications for linear energy predictive models
In this section, we present a brief overview of the theory of energy of computing proposed in [53]. The theory of energy of computing is a formalism containing properties of PMC-based energy predictive models that are manifestations of the fundamental physical law of energy conservation. The properties capture the essence of single application runs and characterize the behavior of serial execution of two applications. They are intuitive and experimentally validated and are formulated based on the following observations:
• In a fully dedicated and stable environment, with each execution of a single application being represented by the same PMC vector, for any two applications, the PMC vector of their serial execution will always be the same.
• An application run that does not perform any work does not consume or generate energy. It is represented by a null PMC vector (where all the PMC values are zeros).
• An application with a PMC vector that is not null must consume some energy. Since PMCs account for energy-consuming activities of applications, an application with any energy-consuming activity higher than zero activity must consume more energy than zero.
• Finally, the consumed energy of a compound application is always equal to the sum of the energies consumed by the individual applications. The serial execution of two applications, called the base applications, forms a compound application.
The practical implications of the theory for constructing accurate and reliable linear energy predictive models are unified in a consistency test. The test includes the following selection criteria for model variables, model intercept, and model coefficients:
• Each model variable must be deterministic and reproducible. In the case of PMC-based energy predictive models, multiple runs of an application keeping the operating environment constant must return the same PMC count.
• Each model variable must be additive. The property of additivity is further summarized in the following section.
• The model intercept must be zero.
• Each model coefficient must be positive.
The first two properties are combined into an additivity test for the selection of PMCs. A linear energy predictive model employing PMCs that violates the properties of the consistency test will have poor prediction accuracy. By definition and intuition, PMCs are all pure counters of energy-consuming activities in modern processor architectures and as such must be additive. Therefore, according to the theory of energy of computing, any consistent, and hence accurate, energy model that employs only PMCs must be linear. This also means that any non-linear energy model employing only PMCs will be inconsistent and hence inherently inaccurate. A non-linear energy model, in order to be accurate, must employ non-additive model variables in addition to PMCs.
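The linearity-plus-additivity argument can be made concrete with a small sketch (all PMC vectors and coefficients below are made up): a zero-intercept linear model over additive counters necessarily yields additive energy predictions.

```python
import numpy as np

# Hypothetical PMC vectors for two base applications
pmc_a = np.array([1.0e9, 2.0e7, 5.0e5])
pmc_b = np.array([4.0e8, 1.0e7, 3.0e5])
alpha = np.array([2.0e-9, 5.0e-8, 1.0e-6])  # positive coefficients, zero intercept

def predict(pmcs):
    # E = alpha_1*p_1 + ... + alpha_n*p_n (no intercept term)
    return float(alpha @ pmcs)

e_a, e_b = predict(pmc_a), predict(pmc_b)
e_compound = predict(pmc_a + pmc_b)  # counters of the serial run add up
```

Here e_compound equals e_a + e_b up to floating-point rounding; a non-zero intercept or non-linear terms would break this equality, which is why the consistency test forbids them.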
While the theory is proposed with PMC-based energy predictive models as focal point, it is applicable for any model variables that are pure counters of energy-consuming activities.
In this work, we show how model variables based on resource utilizations can be designed to leverage the theory.

Linear energy predictive models employing utilization variables and PMCs
We will now look at the mathematical expressions for the linear energy predictive models employing utilization variables and PMCs. The model parameters (or the regression coefficients) are constrained to be positive to meet the requirements of the consistency test.
For the dynamic energy predictive models that employ PMCs, the mathematical form is shown below: E pmc = α 1 × p 1 + · · · + α n × p n (5) where E pmc is the dynamic energy consumption, {p 1 , . . . , p n } are the PMCs, and {α 1 , . . . , α n } are the regression coefficients or the model parameters. We ignore the stochastic noise term.
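The positive-coefficient constraint from the consistency test can be enforced at fitting time with non-negative least squares. A sketch on synthetic data (PMC counts and coefficient values below are made up):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
P = rng.uniform(1.0e6, 1.0e9, size=(100, 4))          # PMC counts p_1..p_n per run
alpha_true = np.array([3.0e-9, 0.0, 7.0e-9, 1.0e-8])  # non-negative cost per event
E = P @ alpha_true + rng.normal(0.0, 0.05, size=100)  # measured dynamic energy (J)

# NNLS solves min ||P a - E|| subject to a >= 0, giving coefficients
# that satisfy the positivity requirement by construction
alpha_hat, residual_norm = nnls(P, E)
```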
We consider models employing the utilizations of CPU and memory as model variables. Since utilizations are high-level variables and not pure counters, we have to design model variables that are based on utilizations and that represent energy consumption. We provide expressions for model variables corresponding to CPU and memory only, since they are the most stressed by the applications used in our experimental study and therefore are the dominant consumers of dynamic energy. Similar model variable expressions, however, can be derived for disk and network utilizations for platforms where these two make non-trivial contributions to dynamic energy consumption.
CPU utilization during a time period represents the proportion of time the CPU is busy doing work divided by the total amount of time in the period. The product of CPU utilization and the maximum power (thermal design power) gives an estimate of the power consumption of the CPU, that is, the average power consumption during the time period. The model variable, which is this product multiplied by the execution time of the application, represents the energy consumption. The CPU utilization model variable, u cpu , therefore is determined using the equation below:

u cpu = Ū cpu × TDP cpu × t (6)

Similarly, the memory utilization model variable, u mem , is determined using the equation below:

u mem = Ū mem × TDP mem × t (7)

where Ū cpu and Ū mem are the average CPU and memory utilizations, TDP cpu and TDP mem are the thermal design powers of the CPU and memory, and t is the execution time of the application.
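A sketch of computing the two utilization model variables from average utilizations, TDPs, and execution time (all numbers below are hypothetical):

```python
def utilization_variables(avg_u_cpu, avg_u_mem, tdp_cpu_w, tdp_mem_w, t_s):
    """Energy-like model variables: average utilization x TDP x execution time."""
    u_cpu = avg_u_cpu * tdp_cpu_w * t_s  # approximate CPU energy (J)
    u_mem = avg_u_mem * tdp_mem_w * t_s  # approximate memory energy (J)
    return u_cpu, u_mem

# Hypothetical: 80% CPU and 40% memory utilization, 120 W / 15 W TDPs, 30 s run
u_cpu, u_mem = utilization_variables(0.8, 0.4, 120.0, 15.0, 30.0)
```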
The mathematical form of the dynamic energy predictive models employing the utilization variables is shown below:

E u = β 1 × u cpu + β 2 × u mem (8)

where E u is the dynamic energy consumption, {u cpu , u mem } are the utilization variables, and {β 1 , β 2 } are the regression coefficients or the model parameters.
The mathematical form of the dynamic energy predictive models employing both utilization variables and PMCs is shown below:

E upmc = γ 1 × u cpu + γ 2 × u mem + θ 1 × p 1 + · · · + θ n × p n (9)

where E upmc is the dynamic energy consumption and {γ 1 , γ 2 , θ 1 , . . . , θ n } are the regression coefficients or the model parameters.

Additivity of model variables
The property of additivity (first proposed in [52]) is based on an intuitive and simple rule that if a model variable is employed in a linear energy predictive model, its count for a compound application should be equal to the sum of its counts for the executions of the base applications forming the compound application.
The additivity of a model variable is determined as follows. We first obtain the counts of the model variable for the separate executions of the base applications. Then, we run the compound application and record the count for the model variable. Typically, the main computations for the compound application consist of the main computations of the base applications executed one after the other. If the count of the model variable for the compound application is equal to the sum of its counts obtained for the base applications (within a tolerance of 5%), the model variable is categorized as potentially additive. Otherwise, it is labeled as non-additive.
For each model variable, we determine the maximum percentage error. For a compound application, the percentage error is calculated as follows:

error = (|e c − (e b1 + e b2 )| / e c ) × 100

where e c , e b1 , and e b2 are the model variable values for the compound application and the two constituent base applications, respectively. The additivity test error for a model variable is the maximum of the percentage errors over all the compound applications in the experimental test-suite. The most additive model variables are employed in a model for better prediction accuracy. We go into the details of this selection process in the experiments section.
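The additivity check described above can be sketched as follows. We normalize the error by the compound-application count, which is one reasonable choice of denominator; the 5% tolerance matches the one stated above, and the example counts are made up:

```python
def additivity_error_pct(e_compound, e_base1, e_base2):
    """Percentage deviation of the compound-run count from the sum of base-run counts."""
    return abs(e_compound - (e_base1 + e_base2)) / e_compound * 100.0

def is_potentially_additive(errors_pct, tolerance_pct=5.0):
    """A variable passes if its worst-case error over all compound runs is within tolerance."""
    return max(errors_pct) <= tolerance_pct

# Hypothetical counts: compound run 1.03e9 vs base runs 5.0e8 + 5.0e8
err = additivity_error_pct(1.03e9, 5.0e8, 5.0e8)  # about 2.9%
```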

Experimental setup
We start with the selection of model variables using the consistency test, followed by a comparison of the prediction accuracy of application-specific and platform-level linear energy predictive models.
Our experimental platforms include the Intel Haswell and Intel Skylake multicore CPU servers, whose specifications are given in Table 1. Our application test suite is composed of highly optimized scientific applications, such as DGEMM and FFT from the Intel math kernel library (MKL), and applications from the NASA benchmarking suite. Table 2 lists the applications along with their descriptions. For each application run, we measure the following: (a) dynamic energy consumption, (b) execution time, (c) PMCs, and (d) utilization variables. The dynamic energy consumption during the application execution is measured using a WattsUp Pro power meter and is obtained programmatically via the HCLWattsUp API [5] (section 4 of the supplemental [54]). The power meter is periodically calibrated using an ANSI C12.20 revenue-grade power meter, the Yokogawa WT210. The calibration methodology is explained in section 6 of the supplemental [54]. PMCs are obtained using the Likwid tool [59] and Linux Perf [61].
To ensure the reliability of our results, we follow a statistical methodology where the sample mean for a response variable (energy, time, PMC, utilization variables) is obtained from multiple experimental runs. The sample mean is calculated by executing the application repeatedly until it lies in the 95% confidence interval and a precision of 0.05 (5%) has been achieved. For this purpose, Student's t-test is used, assuming that the individual observations are independent and their population follows the normal distribution. We verify the validity of these assumptions by plotting the distributions of observations. The experimental methodology to determine the sample mean is described in section 3 of the supplemental [54]. The prediction error of a model is calculated as error = |E_P − E_G| / E_G × 100, where E_P is the prediction by the model and E_G is the ground truth value. The average prediction error for n data points is calculated as (∑_{i=1}^{n} error_i)/n, where error_i is the prediction error for data point i.
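The stopping rule for the sample mean and the prediction-error formula can be sketched as follows; the `measure` callable, the truncated t-value table, and the synthetic observations are illustrative stand-ins, not the actual experimental harness:

```python
import math
from statistics import mean, stdev

# Two-sided 95% Student's-t critical values (df -> t); larger df fall back
# to the large-sample normal approximation.
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def sample_mean(measure, precision=0.05, min_runs=3, max_runs=1000):
    """Repeat measure() until the 95% confidence interval of the sample
    mean lies within `precision` (5%) of the mean, then return the mean."""
    obs = [measure() for _ in range(min_runs)]
    while len(obs) < max_runs:
        m, s = mean(obs), stdev(obs)
        t = T95.get(len(obs) - 1, 1.96)
        half_width = t * s / math.sqrt(len(obs))
        if half_width <= precision * m:
            return m
        obs.append(measure())
    return mean(obs)

def prediction_error(e_p, e_g):
    """Prediction error of a model: |E_P - E_G| / E_G * 100."""
    return abs(e_p - e_g) / e_g * 100.0

# Illustrative use with a synthetic, low-variance "measurement".
vals = iter([100.1, 99.9, 100.0] * 400)
m = sample_mean(lambda: next(vals))
print(round(m, 2))
```

With low-variance observations the rule stops after the minimum number of runs; noisier response variables require more repetitions.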
We now apply the consistency test to select PMCs and utilization variables.

Measurement and selection of Performance Monitoring Counters (PMCs)
Likwid tool [59] offers 164 and 323 PMCs on HCLServer1 and HCLServer2, respectively. To collect all the PMCs, each application must be executed about 53 and 99 times on HCLServer1 and HCLServer2, respectively. This is due to the limited number of hardware registers (3-4) available to store the PMCs. We apply the first stage of the consistency test, where we check if the PMCs are deterministic and reproducible, as follows:
1. We eliminate PMCs with counts less than or equal to 10. The eliminated PMCs have no significance for modeling the energy consumption of our platforms because we found them to be non-reproducible. We also remove several PMCs whose counts equal zero. The reduced set contains 151 and 298 PMCs on HCLServer1 and HCLServer2, respectively.
2. We compare the PMCs using three different tools: Likwid, PAPI, and Linux Perf. We eliminate the PMCs that show differences. After this elimination, the total number of PMCs reduces to 115 and 224 on HCLServer1 and HCLServer2, respectively.
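A sketch of these two elimination steps, under the assumption that per-PMC counts from each tool are available as dictionaries (all names and counts below are illustrative):

```python
def filter_pmcs(counts_by_tool, threshold=10, rel_tol=0.0):
    """counts_by_tool: {tool_name: {pmc_name: count}}. Drops PMCs with
    counts <= threshold (non-reproducible or zero), then drops PMCs whose
    counts disagree across tools (exact agreement when rel_tol=0)."""
    tools = list(counts_by_tool.values())
    common = set.intersection(*(set(t) for t in tools))
    kept = []
    for pmc in sorted(common):
        values = [t[pmc] for t in tools]
        if min(values) <= threshold:          # step 1: low or zero counts
            continue
        if max(values) - min(values) > rel_tol * max(values):  # step 2
            continue
        kept.append(pmc)
    return kept

likwid = {"CPU_CYCLES": 10_000, "PAGE_WALKS": 4, "L2_MISSES": 2_000}
papi   = {"CPU_CYCLES": 10_000, "PAGE_WALKS": 4, "L2_MISSES": 3_500}
perf   = {"CPU_CYCLES": 10_000, "PAGE_WALKS": 4, "L2_MISSES": 2_000}
survivors = filter_pmcs({"likwid": likwid, "papi": papi, "perf": perf})
print(survivors)  # PAGE_WALKS is below threshold; L2_MISSES disagrees
```

In practice the survivor lists are the reduced sets of 115 and 224 PMCs reported above.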
The total work performed during the execution of an application in our test suite is entirely due to CPU and main memory activities. To find their contributions towards the dynamic energy, we use two synthetic applications, A and B, performing floating-point operations and memcpy() operations on all the memory blocks, respectively. We execute A employing all processor cores for 10 s and measure its dynamic energy consumption, which is equal to 1337 joules. We then execute B for the same amount of time and discover that its dynamic energy consumption is insignificant and cannot be captured within the statistical confidence of 95%. We increase the execution times of A and B to 20 and 30 s and find the dynamic energy consumptions of A to be 2596 joules and 3821 joules, and those of B to be 0 joules and 4 joules, respectively. We conclude that the major contribution to dynamic energy consumption is due to CPU activities. Therefore, we remove the PMCs that belong to the Likwid main memory group from any further analysis, for two reasons. First, the memory activities make a negligible contribution to the dynamic energy consumption on our platforms. Second, the low counts for memory PMCs add noise that affects the training of the models and unduly worsens their prediction accuracy. The main CPU activities can be grouped as PMCs belonging to cache, branch instructions, micro-operations (uops), floating-point instructions, instruction decode queue, and cycles. We denote them as prime PMCs.
The second stage of the consistency test involves application of the additivity property. We automate the determination of a PMC's additivity using a tool called AdditivityChecker (section 9 of the supplemental [54]). We discover that all the prime PMCs fail the additivity test for a vast set of applications with a specified tolerance of 5% on our platforms.

Measurement and selection of utilization variables
We follow the steps below to determine the utilization variables (u_cpu, u_mem) on the HCLServers:
• Using an automated shell script, we collect the average CPU and memory utilization (in percent) for the platform using the Linux ps tool. The script reads the CPU and memory utilization every 0.25 s during the application execution.
• The CPU utilization for an application is the average utilization of the individual cores employed in the execution of that application.
• For an application, the trapezoidal rule is used to determine the average utilization using the utilization profile for the application.
• Finally, the average CPU and memory utilizations are multiplied with the corresponding TDPs and the execution time of the application.
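The steps above can be sketched as follows; the TDP value and the utilization profile are illustrative, not measurements from our servers:

```python
def average_utilization(samples, dt=0.25):
    """Average utilization (as a fraction) of a profile sampled every dt
    seconds, using the trapezoidal rule over the utilization-time curve."""
    integral = sum((a + b) / 2.0 * dt for a, b in zip(samples, samples[1:]))
    return integral / (dt * (len(samples) - 1))

def utilization_model_variable(samples, tdp_watts, exec_time_s, dt=0.25):
    """Model-variable value: average utilization x TDP x execution time."""
    return average_utilization(samples, dt) * tdp_watts * exec_time_s

cpu_profile = [0.2, 0.9, 1.0, 1.0, 0.9]  # fractions, sampled every 0.25 s
value = utilization_model_variable(cpu_profile, tdp_watts=85.0, exec_time_s=1.0)
print(value)
```

The same computation is applied to the memory utilization profile with the memory TDP.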
We now apply the consistency test to each utilization variable on HCLServer1 and HCLServer2. By executing the applications (Table 2) with different problem sizes, we find both variables to be deterministic and reproducible.
To study the additivity of the utilization variables, we take the same application set and compose 60 and 40 compound applications from the base applications on both the servers. The additivity test reveals that both variables are highly additive (with errors less than 5%) for all the applications. Therefore, we conclude that both utilization variables can be employed as model variables in any application-specific and platform-level linear energy predictive model.
Since the contribution of memory activities towards the dynamic energy consumption of the applications is insignificant on our platforms, we analyze the impact of u_mem as a model variable. With no constraints on the model coefficients, we find the model coefficient of u_mem to be negative for all the models constructed in our experiments, with the exception of the application-specific model for FFT. For the models with the negative coefficient, we remove the memory utilization variable, since it violates the properties of the consistency test, and re-construct the models. We also find that the removal of the memory utilization variable from the model for FFT reduces the prediction power of the model by only 0.02×. Therefore, for the consistency of the experimental results, we remove u_mem as a model variable from our energy predictive models.
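The u_mem elimination step can be sketched as follows, using synthetic data and an ordinary least-squares fit with zero intercept as a stand-in for the paper's LR models:

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with zero intercept; returns coefficients."""
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

rng = np.random.default_rng(0)
u_cpu = rng.uniform(0.5, 1.0, size=50)
u_mem = rng.uniform(0.0, 0.1, size=50)
# Synthetic energy: driven by u_cpu only, mirroring our platforms where
# memory activity contributes negligibly to dynamic energy.
energy = 120.0 * u_cpu + rng.normal(0.0, 0.5, size=50)

X = np.column_stack([u_cpu, u_mem])
coeffs = fit_linear(X, energy)
if coeffs[1] < 0:  # negative u_mem coefficient violates the consistency test
    coeffs = fit_linear(u_cpu[:, None], energy)  # refit without u_mem
print(coeffs)
```

When the target variable carries no memory-driven component, the u_mem coefficient is free to go negative, which is exactly the violation the consistency test flags.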

Experimental results
We divide the experiments into the following two groups:
1. Group 1: We study the accuracy of application-specific energy predictive models on HCLServer1 and HCLServer2 using utilization variables only, prime PMCs only, and both utilization variables and PMCs.
2. Group 2: We study the accuracy of platform-level energy predictive models on HCLServer1 and HCLServer2 using utilization variables only, prime PMCs only, and both utilization variables and PMCs. We divide the experiments in this section into two classes, class A and class B. In class A, we explore the prediction accuracies of models for a limited set of applications (DGEMM and FFT). In class B, we analyze models that employ data-sets composed using all the applications in our test suite for a wide range of problem sizes.

Study of accuracy of application-specific energy predictive models
We select two highly optimized scientific applications from the Intel Math Kernel Library (MKL): 2-dimensional Fast Fourier Transform (FFT) and dense matrix-multiplication (DGEMM). The experimental steps are as follows:
• We build two data-sets, containing the compound and base applications, to study the additivity of PMCs for FFT and DGEMM. Using the additivity test errors, we select the most additive PMCs that are common to both applications.
• By executing FFT and DGEMM for a range of problem sizes, we build a vast data-set containing dynamic energy consumptions, utilization variables, and PMCs to build the energy predictive models.
• We employ utilization variables only, PMCs only, and both utilization variables and PMCs as predictor variables in linear regression (LR) models.
• Finally, we analyze the prediction accuracy of the LR models.
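These steps can be sketched end-to-end on synthetic data; the PMC weights, dataset, and train/test split below are illustrative, not measurements from our platforms:

```python
import numpy as np

def train_lr(X, y):
    """Ordinary least squares with zero intercept."""
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def avg_prediction_error(X, y, coeffs):
    """Average of |E_P - E_G| / E_G * 100 over the data points."""
    pred = X @ coeffs
    return float(np.mean(np.abs(pred - y) / y) * 100.0)

rng = np.random.default_rng(1)
n = 200
u_cpu = rng.uniform(0.5, 1.0, n)
pmcs = rng.uniform(1e6, 1e8, (n, 4))  # stand-ins for a1..a4
# Synthetic energy with both utilization- and PMC-driven components.
y = 80.0 * u_cpu + pmcs @ np.array([2e-6, 1e-6, 5e-7, 1e-7]) \
    + rng.normal(0.0, 1.0, n)

X_util = u_cpu[:, None]                  # utilization variables only
X_both = np.column_stack([u_cpu, pmcs])  # utilization variables + PMCs
train, test = slice(0, 140), slice(140, n)

errors = []
for X in (X_util, X_both):
    c = train_lr(X[train], y[train])
    errors.append(avg_prediction_error(X[test], y[test], c))
print([round(e, 2) for e in errors])
```

On this synthetic data the utilization-only model misses the PMC-driven component entirely, so the combined model is far more accurate, which is the qualitative pattern our experiments exhibit.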

Experiments to select PMCs and utilization variables
We present the methodology to select the utilization variables and PMCs to be employed in the models in the following steps:
• We build a dataset of 50 base applications using different problem sizes for DGEMM and FFT to apply the additivity test. The range of problem sizes for DGEMM is 6500 × 6500 to 20 000 × 20 000, and for FFT it is 22 400 × 22 400 to 29 000 × 29 000. We select these ranges because of the reasonable execution times (>3 s) of the applications on our platforms.
• For each application in a dataset, we measure the following: PMCs, utilization variables, dynamic energy consumption, and execution time. We also build a dataset of 30 compound applications from the serial execution of base applications. The additivity test based on the two datasets reveals that several PMCs are highly additive and common to both applications. The utilization variables for both applications are highly additive (with additivity test errors less than 0.5%) and highly positively correlated with energy.
• From the additivity test results on HCLServer1 and HCLServer2, we select the PMCs that are commonly additive with additivity test errors of less than 1%. In total, there are eight such PMCs on each server, represented as the sets SA = {a1, a2, a3, a4, a5, a6, a7, a8} for HCLServer1 and SB = {b1, b2, b3, b4, b5, b6, b7, b8} for HCLServer2. All these PMCs belong to the dominant PMC groups of the CPU that represent the energy-consuming activities of our platforms.
• We calculate the correlation for all PMCs in SA and SB with the dynamic energy consumption. The PMCs and their correlations with dynamic energy consumption are given in Table 3.
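The correlation-based selection can be sketched as follows; the names a1-a8 mirror the set SA, but the data is synthetic and purely illustrative:

```python
import numpy as np

def top_correlated(pmc_matrix, energy, names, k=4):
    """Rank PMC columns by Pearson correlation with dynamic energy and
    return the top k (name, correlation) pairs."""
    corrs = [float(np.corrcoef(pmc_matrix[:, j], energy)[0, 1])
             for j in range(pmc_matrix.shape[1])]
    order = np.argsort(corrs)[::-1]
    return [(names[j], round(corrs[j], 3)) for j in order[:k]]

rng = np.random.default_rng(2)
energy = rng.uniform(50.0, 500.0, 100)
names = [f"a{i}" for i in range(1, 9)]
# Columns track energy with progressively more noise (purely synthetic).
pmcs = np.column_stack(
    [energy * (i + 1) + rng.normal(0.0, 20.0 * (i + 1) ** 2, 100)
     for i in range(8)])
result = top_correlated(pmcs, energy, names)
print(result)
```

The top four names returned here play the role of the sets SA-Corr and SB-Corr used below.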

Energy predictive models for DGEMM and FFT
We build a dataset containing the dynamic energy consumption, execution time, PMCs (Table 3), and utilization variables for MKL-DGEMM and MKL-FFT for a range of problem sizes on our platforms (Table 1). The number of data points in the dataset and the range of problem sizes are given in Table 4. The dataset is split into two subsets for training and testing the models.
We build models for MKL-FFT and MKL-DGEMM using LR by employing the predictor variables from the training sets given in Table 4 for HCLServer1 and HCLServer2. The models are evaluated using the test dataset and are divided into two categories (S1 and S2) as given below: S1: Linear energy predictive models for FFT and DGEMM executing on HCLServer1.
• S1-PMC-Corr-FFT and S1-PMC-Corr-DGEMM employ the top four most positively correlated PMCs, {a1, a2, a3, a4}.
• S1-UPMC-FFT and S1-UPMC-DGEMM employ {u_cpu, a1, a2, a3, a4}.
S2: Linear energy predictive models for FFT and DGEMM executing on HCLServer2.
Tables 5 and 6 show the minimum, average, and maximum prediction errors for the models in categories S1 and S2, respectively. Fig. 2 compares the average prediction accuracies of the models and of Intel Running Average Power Limit (RAPL) [50] in both categories. On both our servers, RAPL is an on-chip power sensor that employs a voltage regulator current monitor (VR IMON) for both the CPU and DRAM. VR IMON is an analog circuit within a voltage regulator (VR) that tracks an estimate of the power as the VR supplies current to the CPU. RAPL samples this reading periodically (every 100 µs to 1 ms). S1-UPMC-FFT and S1-UPMC-DGEMM yield the minimum average prediction errors of 9.2% and 11% among the models in category S1, respectively. Similarly, S2-UPMC-FFT and S2-UPMC-DGEMM have the minimum average prediction errors of 19% and 9.4% in category S2, respectively. Table 4 gives the data-set for the application-specific models on HCLServer1 and HCLServer2 (the range of problem sizes and step for each application).
Discussion
Following are the salient observations from the results:
• The models employing only utilization variables have poor prediction accuracy in all model categories.
• Intel RAPL has better average prediction accuracy than the utilization models.
• The average prediction accuracy for models employing additive PMCs (in the sets SA and SB) is better than that of models using only utilization variables and of Intel RAPL. The accuracy further improves for the models that employ the top four most positively correlated PMCs (the sets SA-Corr and SB-Corr) along with the additive PMCs.
• The most accurate models for the DGEMM and FFT applications employ five and six PMCs, respectively. Therefore, at least six hardware registers must be dedicated to storing the PMCs so that these models can be used online. Currently, only 3-4 hardware registers are dedicated to storing PMCs during an application run on our experimental platforms.
• Fig. 2 shows that the patterns of prediction accuracy for the two applications differ between the two platforms. On HCLServer1, the FFT models have better prediction accuracy than the DGEMM models; it is the reverse on HCLServer2. The best set of PMCs employed as model variables for the applications is different for the two platforms. This illustrates that the same set of model variables may not represent the energy-consuming activities of an application on all platforms, even if the platforms share the same set of PMCs. The differences in the employed model variables translate into differences in the average prediction accuracy for the dynamic energy consumption of the same application executing on different platforms. Therefore, we conclude that the prediction accuracy of PMC-based models is sensitive not just to the application but also to the platform.
• The best prediction accuracy is achieved for models that employ both utilization variables and PMCs. This is because they capture most of the energy-consuming activities during an application execution on our platforms.

Study of accuracy of platform-level energy predictive models
In this section, we study the accuracy of platform-level energy predictive models using the test suite (Table 2). We divide our experiments into two classes:
1. Class A: Comparison of the prediction accuracy of models employing utilization variables, PMCs, and both utilization variables and PMCs for two applications, DGEMM and FFT.
2. Class B: Comparison of the prediction accuracy of models employing utilization variables, PMCs, and both utilization variables and PMCs for a dataset obtained using a diverse set of applications executed over a wide range of problem sizes on HCLServer1 and HCLServer2.

Class A: Analysis of prediction accuracy of energy predictive models for DGEMM and FFT
The experiments in this class are run on HCLServer1 and HCLServer2 (Table 1). Since we chose commonly additive PMCs and utilization variables for MKL-DGEMM and MKL-FFT for the application-specific models (Section 6.1), we combine the datasets for both applications. We build the following two categories of models using the extended dataset:
• S1-MMFT: S1-UPT-MMFT, S1-PMC-MMFT, S1-PMC-Corr-MMFT, and S1-UPMC-MMFT are LR-based models on HCLServer1 employing the utilization variable only, the PMCs in the set SA, the PMCs in the set SA-Corr, and the utilization variable together with PMCs (u_cpu and the set of PMCs SA-Corr), respectively.
The number of data points in the training and test datasets for the models are 153 and 66 on HCLServer1, and 490 and 211 on HCLServer2, respectively. Table 7 shows the minimum, average, and maximum prediction errors for the models built in class A. Fig. 3 compares the average prediction accuracies for the models in categories S1-MMFT and S2-MMFT. The results for S1-MMFT and S2-MMFT show that models employing both utilization variables and PMCs perform (2.1×, 1.8×, 3.3×) and (2.3×, 2×, 2×) better in terms of average prediction accuracy when compared with models employing utilization variables only, PMCs only, and Intel RAPL, respectively.

Discussion
Following are the salient observations from the results:
• Intel RAPL performs the worst in terms of average prediction accuracy.
• The average prediction accuracy improves for models employing both utilization variables and PMCs. S1-UPMC-MMFT and S2-UPMC-MMFT result in the minimum average prediction errors of 10.4% and 17% on HCLServer1 and HCLServer2, respectively.
• The average prediction accuracy of the best-performing application-specific models employing both utilization variables and PMCs is better than that of the models constructed with the combined dataset for the two applications. As the number of applications increases, the average prediction accuracy approaches that of the platform-level models.

Class B: Analysis of prediction accuracy of energy predictive models for a broad set of applications
We choose HCLServer1 for the experiments in this section. We select six PMCs (x_1 to x_6 in Table 8), which are widely used in energy predictive models (Section 3) and belong to the dominant energy-consuming PMC groups. We build a dataset of 277 points as base applications by executing the applications from our test suite with different problem sizes. This dataset is used to train the models. We build a test dataset containing points for 50 compound applications, which are composed of serial executions of the base applications. Each point contains the dynamic energy consumption and PMCs for the execution of an application. We apply the additivity test and find no PMC to be additive within the tolerance of 5%. We list the PMCs and their additivity test error percentages in Table 8.
We build six LR models, {LR1, LR2, LR3, LR4, LR5, LR6}, each containing a decreasing number of non-additive PMCs. Model LR1 employs all six selected PMCs as predictor variables. Model LR2 is based on the five most additive PMCs; PMC x_4 is removed because it has the highest non-additivity. Model LR3 uses the four most additive PMCs, and so on until model LR6, which contains only the most additive PMC, x_6.
Table 9 presents the linear predictive models (LR1-LR6), built using zero intercepts and positive coefficients, with their minimum, average, and maximum prediction errors; LR1 and LR2, for example, both yield errors of (6.6, 31.2, 61.9)%.
We compare the predictions of the models with the system-level physical power measurements using HCLWattsUp, which we consider to be the ground truth. The minimum, average, and maximum prediction errors for the models are given in Table 9. It can be seen that LR5, employing the two most additive PMCs, yields the most accurate PMC-based energy predictive model.
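The construction of the nested models LR1-LR6 can be sketched as follows; the additivity errors below are illustrative placeholders chosen to reproduce the ordering described in the text, not the values of Table 8:

```python
# Hypothetical additivity test errors (%) per PMC; x4 is the least additive
# and x6 the most additive, as in the text.
additivity_error = {"x1": 30.0, "x2": 25.0, "x3": 20.0,
                    "x4": 60.0, "x5": 12.0, "x6": 7.0}

def nested_models(errors):
    """Rank PMCs from most to least additive (smallest error first) and
    build nested predictor sets, dropping the least additive PMC each step."""
    ranked = sorted(errors, key=errors.get)  # most additive first
    return [sorted(ranked[:k]) for k in range(len(ranked), 0, -1)]

models = nested_models(additivity_error)
for i, pmcs in enumerate(models, start=1):
    print(f"LR{i}: {pmcs}")
```

With these placeholder errors, LR2 drops x4 and LR6 keeps only x6, matching the construction described above.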
We then expand our dataset to 586 points on HCLServer1 using the applications in our test suite (Table 2). The input parameters for the applications used to build the dataset are as follows: For each application configuration, we measure the dynamic energy consumption, execution time, PMCs, and utilization variables. 410 and 176 points are used to train and test the models, respectively. We build three platform-level models, PL-UPT, PL-PMC, and PL-UPMC, employing utilization variables only, PMCs only, and both utilization variables and PMCs on HCLServer1. Table 10 shows the prediction accuracies (also shown in Fig. 4 as bar charts). The results show that the models employing both utilization variables and PMCs have 2.60×, 1.42×, and 1.96× better average prediction accuracy than the models employing utilization variables only, PMCs only, and Intel RAPL, respectively.
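The reported improvement factors are ratios of average prediction errors, which can be sketched as follows; the baseline error values are illustrative, back-filled to be consistent with the reported 14.36% and the quoted ratios rather than taken from Table 10:

```python
# Hypothetical average prediction errors (%) per model; only PL-UPMC's
# 14.36% is a value reported in the text.
avg_error = {"PL-UPT": 37.3, "PL-PMC": 20.4, "RAPL": 28.1, "PL-UPMC": 14.36}

def improvement_factors(errors, best="PL-UPMC"):
    """Improvement of the best model over each baseline, expressed as the
    ratio of average prediction errors (higher = bigger improvement)."""
    return {name: round(err / errors[best], 2)
            for name, err in errors.items() if name != best}

factors = improvement_factors(avg_error)
print(factors)
```

These ratios reproduce the 2.60×, 1.42×, and 1.96× figures quoted above.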

Discussion
Following are the salient observations from the results:
• The minimum average prediction error of 14.36% is obtained for the model employing both utilization variables and the most additive PMCs.
• While utilization variables capture the overall energy consumption trend of the application executions, they do not holistically capture all the energy-consuming activities during the execution of an application. Models that employ both utilization variables and PMCs that are highly additive and highly energy-correlated account for most of the energy-consuming activities during the execution of an application and hence are found to provide the best accuracy.

Discussion and future work
We now summarize the most important findings from our experiments:
• We analyzed the prediction accuracy of linear energy predictive models employing utilization variables only, PMCs only, and both utilization variables and PMCs. The utilization variables capture the overall energy consumption trend (or the average energy consumption) of the profiles of the application executions, but they fail to capture the variations in the profiles. Models based on highly additive and highly energy-correlated PMCs accurately capture these variations. They do not, however, account for some energy-consuming activities that are captured by the high-level utilization variables.
• The best models employing both utilization variables and PMCs exhibit 3.6× and 2.6× better average prediction accuracy than models employing utilization variables only and PMCs only. The average prediction accuracies of the application-specific models employing both utilization variables and PMCs for FFT and DGEMM are {10.41%, 10.98%}, and {19.98%, 9.40%} on HCLServer1 and HCLServer2 respectively. The average prediction accuracy of the platform-level model employing both utilization variables and PMCs is 14.36% on HCLServer1.
• The memory activities do not contribute towards the dynamic energy consumption of the applications on our platforms. Therefore, we remove the PMCs related to memory activities from the models in our analysis. With no constraints on the model coefficients, we find the model coefficient of u_mem to be negative for all the models constructed in our experiments, with the exception of the application-specific model for FFT, and the removal of the memory utilization variable from the FFT model reduces its prediction power by only 0.02×. Therefore, for the consistency of the experimental results, we remove u_mem as a model variable from our energy predictive models.
• We observe that the patterns of prediction accuracy for the application-specific models for FFT and DGEMM differ between the two experimental platforms. On HCLServer1, the FFT models have better prediction accuracy than the DGEMM models; it is the reverse on HCLServer2. The best set of PMCs employed as model variables for the applications is different for the two platforms. This illustrates that the same set of model variables may not represent the energy-consuming activities of an application on all platforms, even if the platforms share the same set of PMCs. Therefore, we conclude that the prediction accuracy of PMC-based models is sensitive not just to the application but also to the platform.
• The most accurate application-specific models for DGEMM and FFT employ five and six PMCs, respectively. Therefore, at least six hardware registers must be dedicated to storing the PMCs so that these models can be used online. Currently, only 3-4 hardware registers are dedicated to storing PMCs during an application run on our experimental platforms.
In our future work, we will pursue two related lines of research. First, we will continue to find improvements to the prediction accuracy of linear energy predictive models by adding influential model variables using the theory of energy of computing. Second, we will analyze how energy predictive models employing both utilization variables and PMCs can be combined with system-level measurements using power meters for accurately and efficiently determining application component level decomposition of energy consumption and energy optimization of parallel applications on heterogeneous hybrid computing platforms.

Conclusion
In this work, we performed a comparative study of the prediction accuracy of models employing utilization variables only, PMCs only, and combining both utilization variables and PMCs through the lens of the theory of energy of computing on modern multicore CPU platforms. We discovered that employing utilization variables only in linear energy predictive models does not capture all the energy-consuming activities during application execution. However, combining utilization variables with PMCs that are highly additive and highly correlated with energy consumption gave the most accurate linear energy predictive model. Our experimental results showed that application-specific and platform-level models using both utilization variables and PMCs perform up to 3.6× and 2.6× better in terms of average prediction accuracy when compared with models employing utilization variables only and highly additive PMCs only, respectively.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.