Continuous Real-Time Estimation of Power System Inertia Using Energy Variations and Q-Learning

With the growing emphasis on mitigating climate change, the power industry is moving toward renewable energy sources as an alternative to fossil fuel-based power plants. The transition to renewable energy has created numerous challenges, one of which is the low levels of inertia that impact the stability of power systems. Therefore, inertia monitoring has become an integral part of power system operation to dispatch renewable energy sources while maintaining frequency stability. This article presents an online method to continuously estimate the inertia of a power system. The inertia is computed from data provided by Phasor Measurement Units (PMUs) using small variations in frequency and power under ambient conditions. The method uses electrical and kinetic energy variations to compute inertia. In addition, a $Q$ -learning-based method is presented to identify mechanical power changes to discard invalid inertia estimates. The method is demonstrated using the IEEE-39 bus system to monitor the regional inertia of the test system.


I. INTRODUCTION
T HE MODERN power grid is moving toward renewable generation from conventional power plants based on fossil fuels to decarbonize the power sector. However, there are many challenges in the transition toward renewable sources. One of the main constraints limiting renewable energy penetration is the declining inertia of the grid. Immediately after a disturbance, the power imbalance in the grid is compensated by the kinetic energy stored in the rotating machines connected to the grid by delivering (or absorbing) excess power to (or from) the grid. This spontaneous response of the grid to balance the power mismatch is called the inertial response.
The decrease (or increase) in kinetic energy due to inertial response is mirrored in the angular velocity of the machines and hence, the grid frequency. The kinetic energy of a machine is proportional to the product of its moment of inertia and its angular velocity squared. Therefore, for the same power imbalance, the corresponding deviation in its velocity will be slow if the moment of inertia is high. As a result, if the collective inertia contributed by all the machines on the grid, known as the grid inertia, is high, the deviation in the frequency will be slow. However, if the inertia is low, the frequency deviation will be rapid and may become severe enough to lead to a blackout before the frequency control reserves act.
A significant amount of grid inertia comes from the synchronous turbo generators of conventional power plants. However, the traditional plants are retired to reduce CO 2 emissions and are heavily replaced by renewable sources with lower or no inertia. The grid inertia, therefore, reduces considerably. With reduced inertia, the frequency deviates more rapidly upon a disturbance, increasing the probability of outages [1]. It also makes the frequency control balancing supply and demand more difficult [2]. In addition, the inertia also varies significantly in time due to the intermittent nature of renewable generation. Also, the inertia distribution is not uniform throughout the grid, affecting the angular stability in addition to the frequency stability of the system [1], [3]. Thus, it is evident that inertia plays a vital role in determining grid stability. Hence, continuous monitoring of regional inertia is critical to future grid operation with high renewable penetration to ensure the grid's security.
With the advent of synchrophasor technology in smart power systems, there is a growing usage of Phasor Measurement Units (PMUs) across the globe to monitor the grid parameters in real time. The grid inertia can be estimated and monitored using the measurements reported from PMUs installed at various locations of the grid, such as generators and tie lines. Several research articles have proposed different strategies for estimating power system inertia. Papers [4], [5], and [6] provide a few offline techniques to estimate inertia upon events. The online methods can be broadly categorized into discrete and continuous techniques based on the algorithm's timeline. The discrete methods estimate inertia upon the occurrence of an event, such as loss of generation or fault [1], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. The methods presented in [1], [7], [8], [9], and [10] used the swing equation to estimate the inertia of areas [1], whole system [7], [8], [9], or individual sources [10] in the event of a power imbalance. Similarly, papers [11] and [12] use energy-based methods to estimate inertia. Cai et al. [13], Yang et al. [14], and Panda et al. [15] used oscillations after a fault to compute inertia. In [16], nonsynchronous inertia of a microgrid is evaluated upon a disturbance. However, these methods rely on occurrences of grid disturbances to estimate inertia and do not continuously monitor inertia.
The continuous methods are online methods that estimate inertia continuously using real-time measurements [17], [18], [19], [20], [21], [22], [23]. Reference [17] uses micro perturbations to estimate the inertia at a node, while [18] uses them to calculate the inertia of the entire grid. However, the geographical inertia distribution is not quantified, and also the method needs additional equipment to inject perturbation and capture frequency variations [18]. Papers [19], [20], [21], [22], and [23] use ambient measurements under normal operating conditions to monitor inertia. They employ the system identification methodology to evaluate inertia from the small and continuous frequency and power fluctuations. However, the algorithm's computational complexity increases with larger systems with numerous measurements. The methods also assume that the mechanical power is constant, resulting in inaccurate estimations if the mechanical power varies. In [24], the interarea modes under normal operating conditions are used to estimate inertia. The algorithm's efficacy relies on the visibility of the modes and the identification of the correct mode for a particular area. As seen from the above review, studies on inertia estimation are still emerging, and further research is required to overcome the above limitations.
This research aims to continuously estimate the regional inertia of a power system in real time using PMU data during normal operation, considering mechanical power variations. The objective is achieved using two algorithms running in parallel. The first algorithm estimates inertia at each measurement window, assuming constant mechanical power. Simultaneously, the second algorithm checks for changes in the mechanical power in the current window using measurements and inertia estimations of the current and previous windows. If the second algorithm detects mechanical power variation, then the corresponding inertia estimate is rejected; otherwise, it is accepted. This article's key contributions are given as follows.
1) A novel method for estimating regional inertia from ambient power and frequency data is proposed. The inertia is calculated by comparing the changes in electrical and kinetic energy. 2) An innovative strategy based on the Q-learning algorithm is proposed to automatically detect measurement windows with mechanical power variations. To the author's knowledge, this is the first time a method for detecting mechanical power changes has been suggested.
The following is the overall structure of this article. Section II begins with the mathematical model and then explains the proposed method for inertia estimation. The Q-learning-based strategy for detecting mechanical power changes is described in Section III. The demonstration of the methods in the test system and the corresponding results is reported in Section IV. This article concludes in Section V.

II. ENERGY-BASED METHOD FOR INERTIA ESTIMATION USING AMBIENT DATA A. MATHEMATICAL MODEL
Under ambient conditions, i.e., under normal operating conditions, there will be small frequency fluctuations due to the continuous variation between generation and demand. Assuming that an area of a power system is represented by an equivalent rotating machine, its rotational motion neglecting damping can be expressed as P m -mechanical power input to the area, P g -electrical power output of the area, J eq -moment of inertia of the equivalent machine representing the area, and f-frequency of the area. In general, the Center Of Inertia (COI) frequency is used to denote the frequency of the area. At the instants where the Rate Of Change Of Frequency (ROCOF) (df /dt) is zero, the mechanical power is equal to electrical power. ROCOF is zero at the maximum and minimum frequencies.
The maximum and minimum together can be called the extremum of the frequency.
Let t 0 be a moment of extremum. Let f 0 , P m,0 , and P g,0 be the corresponding frequency, mechanical power input, and electrical power output at t 0 . Since ROCOF is zero at t 0 , P m,0 = P g,0 according to (1). Following the extremum, let P g,t be the change in electrical power corresponding to time t. It is the difference between the electrical power at time t (P g,t ) and P g,0 . f t is the frequency at time t. The mechanical power is assumed constant and equal to P m,0 . Integrating both sides of (1) with respect to time starting from the moment of extremum at t 0 Converting to per-unit using the base power S and rated frequency f rated H eq is the inertia constant of the area-equivalent machine. It depicts the duration for which the kinetic energy can support the area in the event of a generation loss equal to the base power before becoming depleted. The Left-Hand Side (LHS) of (6) is the change in electrical energy, whereas the Right-Hand Side (RHS) is the change in kinetic energy of the area. The variables are marked with an overline to indicate per-unit values. The method to estimate inertia using (6) is discussed in Section II-B.

B. METHOD FOR ESTIMATION
All the generators in the area are assumed to be equipped with PMUs, which report frequency and power measurements to the inertia monitoring system. Using PMU measurements, the electrical power output of the area is calculated as the total power generation from all generators in the area. The frequency of the area is represented by the average of the frequencies reported from all generators instead of COI frequency. As the inertia values of the generators needed to compute the COI frequency are unknown, the COI frequency cannot be used to denote the frequency of the area. A rolling-window approach is used to estimate inertia with a fixed window length and a fixed refresh rate. The estimation is repeated every few seconds (determined by the refresh rate) using the latest window of frequency and power measurements. For the frequency signal in the current window, the extremums are detected by finding positive and negative peaks of the signal [25]. The detection of peaks using the above method results in multiple unwanted peaks, which are eliminated using minimum peak width criteria. Fig. 1 shows the sample of ambient frequency and electrical power fluctuations corresponding to area 4 of the IEEE 39 bus test system. As seen in the figure, at each extremum point, the frequency stays almost stationary, but the corresponding power varies considerably. Thus, even a slight shift in the frequency extremum's position will result in an appreciable difference in the corresponding power measurement. The noise can also influence the positioning of the frequency extremum. Since the estimation algorithm uses small changes in electrical power with respect to P g,0 measured at each extremum point, even a slight deviation of the measured P g,0 from the actual mechanical power input P m,0 will result in a considerable error in the estimation. This problem can be handled by slightly adjusting the instants of the frequency extremums to match the instants where the measured electrical power is equal to P m,0 . However, the value of P m,0 is unknown.
If the mechanical power is constant throughout the window, the electrical power at all frequency extremum points should theoretically be the same and equal to the constant mechanical power input (P m,0 ). However, practically, the values differ slightly from each other. It can be observed in Fig. 1 that all the red dots at the bottom graph (electrical power at frequency extremums) are located roughly at the same level. This property can be used to: 1) estimate the value of mechanical power, P m,0 , for the current window; 2) identify whether there is a change in mechanical power in the current window rather than directly assuming that the mechanical power is constant. The detection of change in mechanical power is discussed in Section III.
The value of mechanical power P m,0 for a given window is estimated as the average of electrical power measurements associated with all the frequency extremums of the window. Once it is estimated, the instants of frequency extremums are fine-tuned to match the instants where the measured electrical power is equal to the estimated P m,0 . Fig. 2 shows the fine-tuning of frequency extremum instants for a sample window of measurements given in Fig. 1. The fine-tuning is carried out only if the new position is close to the original extremum point. Otherwise, the point is excluded for estimation.
In a single measurement window, multiple frequency extremums may be detected. The following steps are repeated for each extremum after they have been fine-tuned. 1) t 0 is set as the time instant corresponding to the selected extremum. 2) Starting from t 0 , using measurements at each time instant, t > t 0 , till the next extremum point, multiple linear equations are formed using (6). The equations have one unknown parameter, H eq , and the rest are calculated from the power and frequency measurements of the area. The trapezoidal rule can be used to calculate the integral in the LHS of (6). Finally, all equations formed for each frequency extremum are gathered to form a system of linear equations for the current window, which is depicted by where A and B are vectors representing the quantities (f 2 0 −f 2 t ) and t t 0 P g,t dt at each time instant put together for every extremum point. The solution to (8) is the estimate of inertia for the current window. The least-squares algorithm may be used to find the solution. However, if no frequency extremum is detected in a window or if all extremums are eliminated by fine-tuning, the corresponding estimate for the window is flagged as zero.

III. DETECTION OF MECHANICAL POWER CHANGES
When the mechanical power supply changes, the fundamental assumption used in developing the method for inertia estimation, discussed in Section II, is violated. As a result, the corresponding inertia estimate is invalid and needs to be discarded. Hence, a method is required for identifying the measurement windows with mechanical power changes so that the corresponding estimates can be discarded.

A. COEFFICIENT OF VARIATION
As discussed in Section II-B, if the mechanical power remains constant in a window, the electrical power measurements corresponding to the frequency extremums of the window are roughly equal. Therefore, the relative variability among these measurements can identify whether the mechanical power is constant in a given window. For this purpose, a statistical measure, Coefficient Of Variation (CV), is used. CV is selected because it is independent of the unit of measurement scale σ and μ are the standard deviation and mean of the data set. The data set here refers to the electrical power measurements at all frequency extremums of the given window. Fig. 3 illustrates the variation in CVs for windows with constant and variable mechanical power. The figure shows the frequency (f ), electrical power (P g ), and mechanical power (P m ) measurements of windows with a constant and variable mechanical power corresponding to area 4 of the IEEE 39 bus system. It can be seen from the figure that the CV is significantly higher for the window with variable mechanical power than it is for the window with constant mechanical power. Consequently, the CV can be used as a metric to identify changes in mechanical power.

B. STRUCTURE OF THE Q-LEARNING-BASED ALGORITHM
As illustrated in Fig. 3, if the CV is above a particular threshold, it can be inferred that there is a change in mechanical power in the given window. However, the threshold is not known a priori and is different for different systems, areas, noise levels, and operating conditions. Therefore, a Q-learning-based method is proposed that is adaptive and flexible to detect mechanical power changes for a given window. It learns the acceptable CV ranges for each area beyond which the mechanical power input is identified to have changed. Q-learning is a model-free reinforcement learning method that finds the optimal action for a given state [26]. The method interacts with the system and takes a random action for a given state. In return, a reward or penalty is awarded depending on whether or not the action is correct. Thus, the algorithm successively learns the correct action for different states by directly interacting with the system. The Q-learning algorithm has a table that contains state-action values [26]. Generally, this   CVs from 0.001 to 0.002, and so on. The boundaries of the states can be determined by examining the distribution of CV values. After collecting sufficient CVs from all areas, a histogram can be used to identify an approximate point below which most of the lower-end CVs are located. Moving from left to right of the histogram, this is a point with the lowest count after the first peak, as shown by a black dotted line in Fig. 4. The range of CVs below this point can be divided into 10-20 bins of equal and smaller width. Similarly, for the CVs above this point, 5-10 bins of equal and larger width can be created. For each state, the actions are to either "reject" or "accept" the estimated inertia. The reject action implies that changes in mechanical power are detected. The accept action implies that the mechanical power was constant throughout the window.
The training of the algorithm is not considered separately. The algorithm is directly used online, and it learns the state-action values of the Q-table while online. In this way, the algorithm updates the Q-table automatically when the topology or operating conditions of the system change. The temporal difference method is used to update the Q-table [27]. The Q-table is first initialized with zero values. Each time the algorithm selects an action for a given state, the Q-value associated with the state-action pair is updated with a new value. Soon, the entire Q-table will be updated with new values as the algorithm encounters more and more new states. Let s i be the current state and a j be the action taken by the algorithm. The new value for Q(s i , a j ) is computed using the update rule given in where lr is the learning rate (0 < lr ≤ 1) that determines how quickly the Q-values change, and r is the reward for the selected action decided by the reward mechanism. The reward mechanism is discussed in Section III-C. Equation (10) does not consider the expected future rewards for updating the Q-value, as the future state is independent of the current action. Since there is no separate training phase, the Q-table will continue to update as long as it is in service. The algorithm chooses the action with the highest Q-value for a given state. The reject action is preferred over the accept action when the Q-values of both actions are equal in a given state. Based on the action chosen by the algorithm, the inertia estimate is updated using H smooth,n = αH n + (1 − α)H smooth,n−1 (12) where n is the index for the rolling windows and also the index for the corresponding inertia estimates. h n is the raw estimate of inertia for the nth window calculated using (8), and H n is the validated estimate for the nth window after rejecting the estimates of the windows with mechanical power changes. cumsum n = 0 ensures that the estimate is within acceptable limits. cumsum n is defined in (13). H smooth,n is the final smoothed estimate for the nth window after exponential smoothing. α is the corresponding smoothing constant. It can range from 0 to 1. Since the inertia is estimated using small fluctuations in the measurements, there may be occasional outliers in the estimates due to measurement errors, noise, and other factors. Hence, exponential smoothing is used to overcome this problem by smoothing the validated estimates. Because inertia is a time-varying quantity that might suddenly change when generators are disconnected and reconnected, it is essential to place more emphasis on the most recent estimates. Since it provides greater weight to recent estimates than to older estimates, exponential smoothing is employed in place of other approaches.

C. REWARD MECHANISM
The reward mechanism to update the Q-values is developed based on the assumption that the temporal variation of inertia is slower than the refresh rate used for the estimation. The refresh rate is about 1-2 s, and the inertia is not expected to change substantially within seconds. Hence, the above assumption seems valid. When the mechanical power varies, the accompanying inertia estimate can take any value and may abruptly differ from the value estimated in the preceding window. Additionally, it can happen that during fine-tuning, all the frequency extremum instants are excluded because their corresponding P g values are far from their average. Therefore, the estimates fluctuate between random values or are flagged as zeros as long as the mechanical power changes. The same is illustrated in Fig. 5. The figure shows the plots of inertia estimates (h n ) and average mechanical power in each window corresponding to area 3 of the IEEE 39 bus system. As seen from the figure, the inertia estimates are inconsistent when the mechanical power is varying and consistent when it is constant. This rationale is used to formulate the reward mechanism. A cumulative sum is calculated for every estimate to measure its consistency. The formula for the same is given in cumsum n = (1 + cumsum n−1 ) chk lim chk cns (13) where cumsum n is the cumulative sum at the nth estimate, and chk lim and chk cns are the binary variables to check for limits and consistency. According to (14), the variable chk lim is set to one, if the estimated inertia (h n ) falls within the maximum (H max ) and minimum (H min ) limits. Otherwise, it is set to zero. From (15), if the current estimate falls within a specific range, determined by the variation limit (var lim ), of the previous estimate, the variable chk cns is set to one. H min , H max , and var lim are set to 0, 20, and 0.1, respectively. If there is a sudden change in the current estimate or if the estimate exceeds the limits, the cumulative sum becomes zero. However, if the current estimate is consistent with the previous estimate, the cumulative sum increases by one. The cumulative sum shows the number of consecutive consistent estimates. In addition to the cumulative sum, another criterion is also used to assign rewards. Whenever the accept action takes place, the corresponding CV value is stored to create a data set of accepted CV values (ACV set ). This data set is used to compute a dynamic threshold (th dyn ) for each window using the expression given in Q 3 is the third quartile and IQR is the interquartile range. The reward mechanism based on the cumulative sum and the dynamic threshold is detailed in Table 2. The rewards are decided based on the action taken. The total reward (r) for the selected action is r1 + r2 + r3, each calculated according to the procedure detailed in Table 2. The reward r, thus calculated, is used in (10) to update the associated state-action value. The state-action value is not altered if 0 < cumsum < 10.
A flowchart of the entire algorithm, including the Q-learning-based method, to estimate the inertia of a power system area for the nth window is given in Fig. 6. The energy-based inertia estimation process is depicted in blue. In contrast, the Q-learning-based method for detecting mechanical power changes is depicted in green. The dotted lines in  the figure show the process and data flow for updating the Q-table to use in the next estimation window. As new measurements arrive, these steps are repeated for each window to monitor inertia continuously.

IV. SIMULATION AND RESULTS
The proposed methods for estimating inertia and detecting mechanical power changes are demonstrated using the modified IEEE 39 bus test system by monitoring its regional inertia [28]. The system is tested with both constant and varying mechanical power to demonstrate the effectiveness of the proposed algorithms in both scenarios. The test system has ten generators and four areas. It is split into four areas considering the coherency of the generators based on frequency responses [21], [29]. The estimated regional inertia values are verified against the actual inertia of the test system. The sum of inertia constants of all generators in an area is used to compute the area's actual inertia. The test system's Single-Line Diagram (SLD) is shown in Fig. 7. The simulation is carried out in PSCAD. As discussed in [21], small and continuous fluctuations in the system are created by introducing Gaussian variations in all the loads of the system. Slow variations up to 0.1 Hz are used to simulate continuous fluctuations in the load. Overall, the load varies within a range of ±0.10% of the actual load. Noises of different levels are separately added to the simulation measurements to generate noisy data comparable to a real system. The governor control is used to cause changes in mechanical power. The governor control adjusts the input mechanical power in response to the frequency deviation. Except for generator 1 (G1), which represents the external grid, all generators are equipped with governors [28]. In this work, the dead band of the governors is set to 0.036 Hz, a typical value for a 60-Hz system. So, the governors will respond and change the mechanical power input of the generators whenever the absolute frequency deviation crosses 0.036 Hz from the nominal frequency.
The voltage and current phasors at the generator buses are calculated using the FFT algorithm to emulate PMU measurements. The FFT algorithm samples the voltages and currents at 16 samples per cycle. The phasors are then used to obtain the frequency and power output. Each generator's frequency and power output are reported to the inertia monitoring system at 60 frames/s. This reporting rate was chosen based on the standard recommendation for a 60-Hz system [30]. The sum of power generation from all generators in the area is used to compute the area's electrical power output. The frequency of an area is calculated as the average of frequencies measured at the generator buses.
A Butterworth low-pass filter with a cut-off frequency of 1 Hz is used to filter high-frequency noises before using the frequency and power measurements for inertia estimation. For the Q-table, 19 discrete states are used by dividing continuous CV values into multiple ranges. States 1-10 represent CVs from 0 to 0.02 in steps of 0.002. Since most lower-end CVs are less than 0.02, it is decided to have a smaller step size until 0.02. States 11-18 represent CVs from 0.02 to 0.1 in steps of 0.01, and state 19 is for CVs above 0.1. The four areas of the system will have four Q-tables, one for each area. A typical value of 0.1 is chosen for the learning rate, lr, for updating Q-tables. The smoothing constant α for exponential smoothing is also typically set to 0.1.

A. CONSTANT MECHANICAL POWER
This section highlights the performance of the energy-based method for inertia estimation presented in Section II. The system is simulated for 10 min. The mechanical power is kept constant by keeping the governors out of service. The inertia of each area is estimated on a rolling-window basis with a 30-s window length and a 1-s refresh rate, yielding 571 estimates for 10 min of data. The plot of the initial raw estimates (h n ) before applying Q-learning and smoothing is shown in Fig. 8. The figure shows that the estimates are very close to the actual value. The mean value of the absolute errors of all the estimates compared to the actual value ( abs,μ ) is also furnished in the figure.
Results from the proposed method are compared to those obtained using the system identification-based approach presented in [21]. Paper [21] estimates inertia from ambient measurements under constant mechanical power. It employs Numerical Algorithms For Subspace State Space System Identification (N4SID) to identify the system model from the measurements and estimates inertia from the identified model's step response. Table 3 compares the raw estimates from the proposed energy-based method (before applying Q-learning and smoothing) to both the raw and exponentialsmoothed estimates (smoothing constant as defined in [21]) computed using the algorithm given in [21]. In Table 3, H act stands for actual inertia, μ for the mean of the estimates, σ for the standard deviation, and abs,μ for the mean of the absolute errors of all the estimates compared to the actual value. The inertia values are specified on a 1000-MVA   base. The results show that the accuracy of the energybased method is significantly higher than that of the system identification-based method. With the suggested approach, the computation time is also substantially reduced. The computation time per estimate for the system identification-based approach is between 100 and 200 ms, but for the energybased method, it is between 1 and 5 ms. In addition, the system identification-based approach uses a longer window length in the range of hundreds of seconds, whereas the proposed energy-based method uses a shorter 30-s window length. These results demonstrate the effectiveness and advantage of the suggested approach compared to the system identification-based approach.
The application of the Q-learning algorithm to the inertia estimates obtained from the energy-based method to detect mechanical power changes is illustrated in Fig. 9. In the figure, the "blue" and "orange" plots denote the initial raw (h n ) and final smoothed (H smooth,n ) estimates of inertia, respectively. The gray and green shaded regions represent the instants of reject and accept actions taken by the Q-learning method. The Q-learning method starts learning directly from the initial raw estimates of inertia (h n ). Initially, all state-action values of the Q-table are set to zero. Furthermore, if both action values for an input state are the same, the reject action is given preference over the accept action. As a result, the Q-learning algorithm rejects all the estimates until it learns and updates the Q-table. From (11), the reject action should return the previous smoothed estimate. However, because no estimate has been validated yet by the algorithm, smoothed estimates are not present. So, the final smoothed estimates (orange) at the beginning are displayed as null until the algorithm has learned enough to validate an input inertia value. Once the algorithm validates and accepts an estimate, if the reject action occurs again, the method returns the previous smoothed estimate as per (11). From the figure, it can be seen that, except for the initial learning period, all estimates are accepted. This result suggests that the Q-learning method was successful in determining that the mechanical power was constant throughout.
The performance of the proposed method is tested at different noise levels. The noise level is defined as the percentage of noise relative to the ambient variations. As an illustration, Fig. 10 shows the first window of frequency and electrical power measurements corresponding to area 2 with and without additional noise. abs,μ of the initial raw and final smoothed inertia estimates under different noise levels is shown in Fig. 11. It is evident from the figure that the method is reasonably accurate even at high noise levels.

B. VARIABLE MECHANICAL POWER
The system is simulated for 10 min. The governors are taken back into service. For example, Fig. 12 shows the simulated data of frequency (f ), electrical power (P g ), and mechanical power (P m ) for areas 1 and 3. For the power measurements, a base of 1000 MVA is used. It is observed from the figure that the power and frequency fluctuations are very small. In addition, it is clear that area 1's mechanical power remains constant because it lacks a governor.
The inertia of each area is estimated on a rolling-window basis with the same window length and refresh rate used in Section IV-A. The initial raw estimates (h n ) and the final smoothed estimates (H smooth,n ) obtained after applying the Q-learning technique are plotted in Fig. 13. The actions taken by the Q-learning method at each window are also depicted in the figure. The "red" curve in the figure denotes each area's average mechanical power in each window. As seen from the figure, the initial raw estimates corresponding to the region of varying mechanical power are significantly deviating from the actual value. The figure also clearly shows that the accept action coincides with the region of constant mechanical power, and the reject action coincides with the region of varying mechanical power. This result confirms that the Q-learning method could effectively identify windows corresponding to mechanical power changes and reject the accompanying incorrect estimates. In addition, the figure also provides abs,μ in % for the final smoothed estimates. These errors are observed to be very small, demonstrating the algorithm's efficacy in accurately estimating inertia using ambient measurements with mechanical power changes. abs,μ of the final smoothed estimates of inertia under different noise levels is shown in Fig. 14. It is evident from the figure that the method is accurate even at high noise levels. The values of Q-tables of each area at the end after the final estimation corresponding to the cases of without and with noise (25%) are shown as a bar graph in Fig. 15. The Q-values of the states have been updated based on the range of CVs encountered by each area for each case. For the case with noise, the states with positive values for the accept action have spread out and shifted to the right as compared to the case with no noise, indicating that the range of acceptable CV values is higher and wider for the case with noise.  The sensitivity of the method's accuracy to different values of exponential smoothing constant (α) and learning rate (lr) is investigated. Fig. 16 depicts the variation in abs,μ with respect to various α and lr values for the case with a noise level equal to 25%. When α is varied, lr is kept constant at 0.1, and when lr is varied, α is fixed to 0.1. A value of 1 for α denotes no smoothing. The figure shows that with no smoothing, the error is approximately 10%, even with high noise. Furthermore, it can be seen from the figure that the error increases gradually as α increases. The highest   increment in error is 5.8% when α is changed from 0.1 to 1. However, as observed from the figure, the error does not have a correlation with lr. When lr is increased, the error remains almost constant or fluctuates within a small range.

C. DYNAMIC PERFORMANCE 1) INERTIA TRACKING
The inertia values of areas 3 and 4 are changed to new values at 390 s to test the algorithm's ability to track varying inertia. The inertia values are changed by modifying the inertia constants of the generators in the areas. The rolling windows have a 30-s length. Therefore, the windows that end at instants between 390 and 419 s have a mixture of measurements from both the old and new operating conditions. The windows that end at instants between 405 and 419 s will have the majority of measurements corresponding to the new operating conditions. From 420 s, the windows of measurements will fully reflect the new circumstances. So, it takes around 15-30 s for the inertia estimates without smoothing to reach the new values. Fig. 17 shows the plot of final smoothed estimates tracking the varying inertia of areas 3 and 4 for α = 0.1. The figure also shows the transition time for the inertia estimates to reach within 2% of the new inertia values for different exponential smoothing constants (α). As seen in the figure, as α is increased, the transition time reduces and saturates between 15 and 30 s. The learning rate (lr) does not affect this transition as the range of CV values does not vary and the inertia estimates gradually change over the transition time.

2) INERTIA TRACKING WITH SUDDEN NOISE INCREASE
Similar to Section IV-C.1, the inertia values of areas 3 and 4 are changed to new values at 390 s. In addition, the noise level is suddenly increased from no noise to 25%. Fig. 18 shows the plot of final smoothed estimates tracking the varying inertia of areas 3 and 4 with a sudden increase in noise for different learning rates. Because of the increased noise level, the range of acceptable CV values widens, and there are fluctuations in the raw estimates due to noise. Even then, the algorithm can still track the modified inertia values by learning the new range of acceptable CVs for all lr values. For area 4, the final smoothed estimates take a similar transition time (28 s) to arrive at the new inertia for all lr values. For area 3, though, they are different. The transition time is 38, 38, and 137 s for lr equal to 0.1, 0.5, and 0.9, respectively. Multiple simulations have shown that the transition is consistently smooth and quick with low lr ≤ 0.3. In contrast, it is inconsistent when lr is high. In some cases, as in the case of area 4, the transition time is similar to that of low lr. Other times, the transition takes longer, as in the case of "lr = 0.9" for area 3. From (10), high lr implies that the latest rewards are given more weight than the previously learned Q-values. Hence, if lr is high, the Q-values may not be stable and can fluctuate significantly with changes in the rewards. This fluctuation in Q-values might be the reason for the inconsistency in the transition time to a new noise level when lr is high.
Overall, the results imply that the developed energy-based inertia estimation algorithm can reliably estimate inertia using small, continuous fluctuations in ambient measurements. In addition, the Q-learning method can successfully identify the instances of mechanical power variation, enabling incorrect inertia estimates to be rejected. The method can track time-varying inertia values even under changing operating conditions.

V. CONCLUSION
This research introduced a new energy-based method to estimate inertia using frequency and power data from PMUs under ambient conditions. It also presented a novel way to detect changes in mechanical power based on the Q-learning algorithm. A comprehensive strategy combining both methods is described for the continuous and real-time estimation of regional inertia values of a power system considering mechanical power variations. The accuracy and adaptability of the method are successfully verified and demonstrated using the IEEE-39 bus system. Compared to the method based on system identification, the proposed method is shown to have significantly better accuracy and higher computational efficiency. The effectiveness of the method under different noise levels is also validated.
The proposed solution based on Q-learning does not require any additional training and can be directly used online. The reward mechanism is versatile and can be customized to meet system-specific needs. Due to its computational efficiency, the method is easily scalable to larger systems. Further research is required to test the applicability of the method to estimate virtual inertia from renewable sources. Also, the algorithm's performance under high renewable penetration needs to be verified.