Accuracy of Solid-State Residential Water Meters under Intermittent Flow Conditions

Accurate measurement of customers' water consumption is a crucial component of water utility sustainability. During the last decade, sophisticated measuring technologies without moving components, known as solid-state water meters or static meters, have emerged. Solid-state water meters promise improved accuracy and more processing and transmission capabilities than traditional mechanical meters. Because all these new features are extremely demanding on electric energy, a compromise needs to be reached between energy consumption and battery life. The usual approach adopted by manufacturers is to reduce the frequency with which static meters take measurements of the circulating flow. This reduction in signal sampling frequency can have a significant effect on the accuracy of the instruments when measuring water consumption events of 30 s or less, which are common among residential customers. The research presented analyses the metrological performance of 28 commercially available solid-state water meters from six different manufacturers in the presence of intermittent flows of various durations. The results show that the magnitude and dispersion of the error under intermittent flows are significantly larger than under steady-state flow conditions. The ultrasonic meters examined were more influenced by the intermittency than the electromagnetic meters.


Influence of meter resolution on test uncertainty
One of the premises of this study is to test solid-state meters under conditions as similar as possible to those found in the field. For this reason, the resolution of the meter, which is in litres in all cases, was not modified in the laboratory. In order to determine how this affected the uncertainty of the test, up to 17 repetitions were conducted for some meters in certain tests under steady flow conditions (Table S-1). Figure S-1 shows quantile-quantile plots of the results obtained for the M0016 meter, which was tested 14 times at 500, 1000, 1500, and 2000 L/h.
Analysis of the error distribution shows that the error lies within ±1%. In addition, some values are more likely than others. Since the test bench is highly repeatable under steady flow conditions, this behaviour can be explained by the limited resolution of the solid-state meters. Consequently, testing the meters in the laboratory under conditions like those found in the field implies an increase in the uncertainty of the test. This invites reflection on whether the resolution of solid-state meters could be modified according to the situation in which they operate (laboratory or field).
Table S-1, Table S-2, and Table S-3 show the number of repetitions carried out per meter and test performed. From these tables it can be inferred how much data has been used to calculate an average error value or a box-whisker graph.
[Fragment of the repetition tables; the column headers and the first meter identifier are not recoverable]
(unidentified meter): 3 3 3 3 3 3 3 5 5 5 5 3 3 3
M0008: 3 3 3 3 3 3 3 5 5 5 5 3 3 3
M0010: 3 3 3 3 3 3 3 5 5 5 5 3 3 3
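The resolution effect discussed in this section can be illustrated with a minimal sketch. It assumes a hypothetical test volume of 100 L, a register that truncates readings to a 1 L resolution, and a hypothetical true meter error of +0.45%; none of these values comes from the test bench used in the study, and the function names are illustrative.

```python
import math

def indicated_volume(start_l, true_volume_l, meter_error, resolution_l=1.0):
    """Volume obtained from two register readings truncated to the resolution."""
    # Volume an ideal-resolution register would accumulate during the test.
    registered = true_volume_l * (1.0 + meter_error)
    first = math.floor(start_l / resolution_l) * resolution_l
    last = math.floor((start_l + registered) / resolution_l) * resolution_l
    return last - first

def observed_errors(true_volume_l, meter_error, resolution_l=1.0, phases=10):
    """Distinct test errors (%) seen when the test starts at different register phases."""
    errors = set()
    for k in range(phases):
        start = k * resolution_l / phases  # fraction of a litre already accumulated
        volume = indicated_volume(start, true_volume_l, meter_error, resolution_l)
        errors.add(round(100.0 * (volume - true_volume_l) / true_volume_l, 6))
    return sorted(errors)

# Hypothetical test: 100 L passed through a meter whose true error is +0.45 %.
errors = observed_errors(true_volume_l=100.0, meter_error=0.0045)
print(errors)
```

Under these assumptions the test can only report an error of 0% or 1%, never the underlying 0.45%, which is consistent with some error values being more likely than others when the resolution is coarse relative to the test volume.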

Hypothesis testing methods overview
Hypothesis testing is one of the foundations of inferential statistics [1]. In the present study, it is used for two purposes: i) to test whether the variances of two normally distributed populations are equal; ii) to compare the means of two normally distributed populations whose variances are unknown. The hypothesis testing has the following structure. The null hypothesis (the equality of means or variances of two populations in the case studied), which is retained unless there is strong evidence against it, is contrasted with the alternative hypothesis (there is a difference between the population means or variances), which is two-sided in the cases studied. A decision rule is then established to reject or not reject the null hypothesis depending on the level of significance, which is the probability of rejecting the null hypothesis when it is true (type I error) or, in other words, the probability of obtaining a false positive.
The level of significance should be as small as possible. However, for a given sample size, a decrease in the type I error leads to an increase in the type II error (the probability of rejecting the alternative hypothesis when it is true, i.e. the probability of obtaining a false negative). Therefore, doubts about the veracity of the result obtained increase if the significance level chosen is very small.
Effect size and power, in turn, are used to interpret the results of the hypothesis testing properly. The power is the complement of the type II error probability, i.e. the probability of accepting an alternative hypothesis that is true (true positive). Therefore, the power does not affect the decision rule, but describes its properties. For example, an increase in the number of meters tested leads to an increase in the power of the hypothesis test for the same level of significance [2], which can be used to assess the return on investment in a larger sample. The effect size is the difference between the null and alternative hypotheses in units of the population standard deviation [1]. Thus, the importance of the difference found can be determined through the effect size, since a statistically significant difference is sometimes not meaningful from a technical point of view [2].
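As an illustration of the effect size defined above, the following minimal sketch computes Cohen's d for two independent groups using the pooled standard deviation. This is one common formulation and is not necessarily the exact estimator implemented in the R packages used in the study; the error values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d for independent samples: mean difference over the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical mean errors (%) for two groups of meters of unequal size.
older = [0.5, 0.9, 0.2, 0.7, 0.4]
newer = [0.3, 0.6, 0.1, 0.8, 0.5, 0.4, 0.2, 0.7]
d = cohens_d(older, newer)
print(f"Cohen's d = {d:.3f}")
```

Because d is expressed in units of the within-group standard deviation, the same raw difference in mean error yields a larger d when the errors are less scattered.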
Finally, the p-value is used to evaluate the hypothesis test. The p-value is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis were true [1]. Hence, it gives the lowest significance level at which the null hypothesis can be rejected, and can be used instead of pre-assigned significance values [2]. In the present study, the significance level has been fixed in advance at 0.05, so if a p-value < 0.05 is obtained, the null hypothesis is rejected, based on a procedure that produces erroneous rejections with a 5% probability [3].
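The decision rule described above can be sketched as follows. The two-sided p-value of a Student's t statistic is obtained here by numerically integrating the t density with Simpson's rule, which is an illustrative substitute for the routines available in statistical software; the t statistic and the degrees of freedom in the example are hypothetical.

```python
import math

def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|) under the null hypothesis (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # probability mass between 0 and |t_stat|
    return max(0.0, 1.0 - 2.0 * central_half)

ALPHA = 0.05  # significance level fixed in advance, as in the study
p_value = two_sided_p_value(t_stat=2.5, df=11)  # hypothetical statistic
reject_null = p_value < ALPHA
print(f"p = {p_value:.4f}, reject H0: {reject_null}")
```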
Regarding the specific methods used, the equality of variances between the populations under comparison is tested with the F test [2], after verifying the normality of the data by means of the Shapiro-Wilk test [4] and the inspection of quantile-quantile plots with marginal histograms. Thereafter, the t-test [2] is conducted to assess whether the performance of the M2 meters varies with age (independent samples) and flow conditions (paired samples). Finally, the effect size (Cohen's d) and the power of the test [5] are calculated to analyse the results, which can be found in section 9 of the supplementary material.
R-statistics [6] is the software used to perform hypothesis testing. The external packages needed are: 1) compute.es [7] and effsize [8] to calculate Cohen's d in independent and paired samples, respectively; 2) pwr [9] to determine the power of the test.
Table S-4 and Table S-5 show the mean and standard deviation of the error obtained per meter and test performed under steady flow conditions. These results are used as a reference for comparisons with the error obtained in tests under intermittent flow conditions.
Table S-4. Average error obtained per test under steady flow conditions (T1) and meter. "Type of meter" and "Meter" are the first (*) and second (**) column, respectively.
[...] (T2, T3, T4, T5, T6 and T7) obtained per meter. This is intended to complement Figures 13 and 14

Percentage of test results outside the error range of ±5%
According to ISO 4064-1 in all its versions since 1993, the maximum permissible error (MPE) of a class 2 water meter is set to ±5% in the lower range, between Q1 and Q2, and ±2% in the upper range, between Q2 and Q4. Taking the largest MPE as reference, this section reports the number of tests in which the error of the meters tested was greater than the MPE for the various test types considered. For a specific test condition, if a high percentage of errors falls outside the permissible limits, acceptable metrology will only be achieved if the errors of different tests compensate each other. This implies that the measurement of individual water consumptions under these conditions is highly inaccurate, and that acceptable accuracy can only be achieved when a large number of consumptions is considered. Consequently, the actual accuracy of the meter mainly depends on the ability of the algorithms implemented by the manufacturer to reduce the bias caused by the sampling of the flow signal. At present, none of the international regulations and standards related to water meters takes this consideration into account in the test program.
[...] Figure S-6, the percentage of tests in which measurement errors are larger than ±5% is exceptionally high, frequently reaching figures close to 100%, implying that the error of all tests conducted for that specific test type was greater than ±5%. In comparison with ultrasonic meters, electromagnetic meters also showed large percentages of tests falling outside the MPE. However, it can be inferred from the results obtained that the ability of their algorithms to compensate positive and negative errors is significantly better than that of the algorithms implemented in ultrasonic meters.
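The indicator used in this section can be sketched in a few lines. The example below counts the share of tests whose error exceeds the ±5% limit and compares it with the mean error of the same tests; the error values are hypothetical and only illustrate how individual errors can fall outside the limit while largely compensating each other.

```python
from statistics import mean

MPE_PERCENT = 5.0  # largest MPE for a class 2 meter, as taken in this section

def share_outside_mpe(errors_percent, mpe=MPE_PERCENT):
    """Percentage of test results whose absolute error exceeds the MPE."""
    outside = sum(1 for e in errors_percent if abs(e) > mpe)
    return 100.0 * outside / len(errors_percent)

# Hypothetical errors (%) of repeated tests of one type on one meter.
errors = [-12.0, 9.5, -7.0, 11.0, -1.5, 0.5]
share = share_outside_mpe(errors)
average = mean(errors)
print(f"{share:.1f}% of tests outside ±{MPE_PERCENT}%, mean error {average:.2f}%")
```

In this made-up case two thirds of the tests exceed the limit even though the mean error is close to zero, which is the situation in which acceptable accuracy is only reached over a large number of consumptions.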

Comparison of the calculated and measured error in tests T6 and T7
The objective of this section is to study the difference between the error measured on the test bench (Errm) and the error calculated from the error curve and the consumption pattern (Errc) when the meters are tested under intermittent and variable flow conditions. The results obtained in tests T6 and T7 are used to conduct this analysis. The description of these tests can be found in section 2.3 of the research paper.
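One way to obtain a calculated error from an error curve and a consumption pattern is a volume-weighted average of the steady-state error at each flow rate. The sketch below follows that approach with linear interpolation of the error curve; this is an assumption made for illustration, since the exact procedure is described in the research paper, and the curve, pattern, and measured value are hypothetical.

```python
from bisect import bisect_right

def interpolate_error(flow_lph, curve):
    """Linearly interpolate the error (%) at a flow rate; curve is sorted (L/h, %) pairs."""
    flows = [q for q, _ in curve]
    if flow_lph <= flows[0]:
        return curve[0][1]
    if flow_lph >= flows[-1]:
        return curve[-1][1]
    i = bisect_right(flows, flow_lph)
    (q0, e0), (q1, e1) = curve[i - 1], curve[i]
    return e0 + (e1 - e0) * (flow_lph - q0) / (q1 - q0)

def calculated_error(pattern, curve):
    """Volume-weighted error (%) for a pattern of (flow L/h, duration s) events."""
    total_volume = 0.0
    weighted_error = 0.0
    for flow_lph, duration_s in pattern:
        volume_l = flow_lph * duration_s / 3600.0
        total_volume += volume_l
        weighted_error += volume_l * interpolate_error(flow_lph, curve)
    return weighted_error / total_volume

# Hypothetical steady-state error curve and 5 s consumption events between 200 and 600 L/h.
curve = [(200.0, -1.0), (400.0, 0.0), (600.0, 1.0)]
pattern = [(200.0, 5.0), (400.0, 5.0), (600.0, 5.0)]
err_c = calculated_error(pattern, curve)
err_m = 2.1  # hypothetical error measured on the test bench (%)
print(f"Errc = {err_c:.2f}%, Errm - Errc = {err_m - err_c:.2f}%")
```

A large gap between Errm and Errc indicates that the steady-state error curve alone does not describe how the meter behaves under intermittent flow.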
The value used as the measured error is the average of the errors obtained in the repetitions conducted per test (Table S-[...]). Table S-6 and Table S-7 summarise the results obtained for each meter per test conducted. In the case of electromagnetic and mechanical meters, the greatest differences between calculated and measured error are detected in test T6-P1 (consumption duration of 5 seconds and flow rate range between 200 and 600 L/h). In contrast, the response of the ultrasonic meters has been more variable. When the results are analysed by type of meter, M1, M3 and M8 generally perform worse at 5 second consumptions (T6), while for M2 and M7 the 10 second consumptions (T7) are more adverse. In addition, the response is poorer for the P1 (low flow range) and P3 (high flow range) profiles than for P2, in which the flow variability is higher.
Figure S-8 shows, by means of box-whisker plots, the distribution of the difference between the measured and calculated error per meter. The main conclusions are consistent with those of the research paper: i) under intermittent flow conditions, mechanical meters maintain repeatability, but tend to over-register; ii) ultrasonic meters have a very variable response, with both over- and under-registration problems; iii) the tested electromagnetic meters behave very similarly under steady and intermittent flow conditions. Thus, the electromagnetic meters tested measure correctly under intermittent and variable flow conditions, making them more competitive from a metrological point of view. However, the sample size should be increased to verify these results. Regarding ultrasonic meters, the great variability of response and the notable difference observed between the measured and calculated error raise concerns about their ability to guarantee that all users will be treated equally.
It can also happen that water utilities are adversely affected or favoured when the errors of different consumptions do not cancel out. Unfortunately, these behaviours can only be identified through testing under intermittent flow conditions. For this reason, this type of test should be included in the ISO 4064 and OIML R49 meter approval test program to protect both users and water utilities.

Hypothesis testing results
The existence of statistically significant differences in the performance of M2 meters of different ages and under intermittent and steady flow conditions is analysed in this section. Other meter types examined in this study are not included in the analysis due to sample size limitations. In any case, the principles of the methodology followed here can be applied in the future to other comparisons, such as assessing the performance of different meter types or testing the divergences between various batches from the same manufacturer.
The first analysis studies the difference between the average error of two groups of meters: one manufactured in 2014 and in service for approximately 4 years, and another manufactured between 2017 and 2018. The aim is to identify the influence of the service period of a meter on its metrology and whether its performance can be affected by changes implemented in the firmware over time. Two independent samples of 5 (2014) and 8 (2017-2018) meters are available, so the design is unbalanced. Each sample has been tested under steady (T1) and intermittent (T3, T4 and T5) flow conditions at 200, 500 and 2000 L/h. The representative error of a given test is the mean of the errors obtained in each repetition carried out (the number of repetitions per test can be found in Table S-1 and Table S-2).
The method used to solve the problem described is the independent (unpaired) Student's t-test. The assumption of normality must be verified beforehand, as must homoscedasticity, since it modifies the formulation of the test. The verifications conducted are: a) the Shapiro-Wilk test and quantile-quantile plots with marginal histograms to check whether each defined group satisfies normality; b) the F test to compare two variances and box-whisker plots to assess homoscedasticity in each comparison carried out. The significance level has been set at 0.05, and the effect size and power of the test are used to interpret the results. Finally, the optimal sample size to obtain an appropriate test power (80%) has been calculated, given an effect size, for balanced experimental designs.
Table S-9 summarizes the results obtained in the t and F tests. The conclusion is that the behavioural differences between the two age groups defined are not statistically significant. Therefore, it can be stated that the changes implemented in the firmware of the M2 meters between 2014 and 2017-2018 and the service period of the meters manufactured in 2014 have not affected their metrology. Regarding homoscedasticity, significant differences have only been detected in the comparison of variances for the T3-500 L/h case. This result is supported by the boxplot shown in Figure S-11.
The effect sizes are generally smaller than 0.2. Hence, the difference between population means is relatively small compared to the variability within groups. In only four comparisons has the effect size exceeded 0.5. This result reinforces the conclusion that the differences found are negligible. On the other hand, the power of the test is generally low. For effect sizes as small as those detected, a very large sample size would be necessary to reach power values around 0.8 (Figure S-12).
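The relationship between effect size and required sample size can be sketched with the usual normal approximation for a two-sided, two-sample comparison with balanced groups. This is an approximation chosen for illustration, not the calculation performed with the pwr package in the study, which is based on the t distribution and gives slightly larger values.

```python
from math import ceil
from statistics import NormalDist

def meters_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate units per group for a balanced two-sample, two-sided test."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Conventional small, medium and large effect sizes.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {meters_per_group(d)} meters per group")
```

Under this approximation, an effect size of 0.2 requires several hundred meters per group to reach 80% power, whereas an effect size of 0.8 requires about 25.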
Since the aim is mainly to identify those cases where the effect size is large (>0.8), the optimal sample size for the hypothesis testing proposed is 25 meters per group (Figure S-12). Consequently, the conclusions are considered preliminary until the number of tested units can be increased.
The second analysis studies the differences in the average behaviour of M2 meters tested under steady and intermittent flow conditions. In this case, the hypothesis test is conducted on dependent (paired) samples. The control group is composed of the average errors obtained in the tests under steady flow conditions, and the number of units available per group is 13.
[...] for each group is close to a normal distribution. Therefore, the assumption of normality that allows the use of the t and F tests is satisfied.
[...] Figure S-14, where the error distribution in T1 tests (steady flow conditions) has a notably lower dispersion than that obtained in the tests under intermittent flow conditions. Although the proposed intermittent flow tests force the meters into extreme operating conditions, the fact that this type of meter shows such scattered behaviour generates uncertainty about its correct performance in the field. Although the average error of a fleet of M2 meters is within the limits set by the standard and does not show over- or under-registration bias, it could be benefiting some users over others, or the water utilities could be adversely affected if the errors do not cancel out.
Regarding the result of the t-test, T1-T5/200 L/h is the only case in which a statistically significant difference in the average error has been detected. This case has an associated effect size of -0.756, which in absolute value corresponds to an error difference of 0.41%.
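The paired comparison described above can be sketched as follows. The example computes the paired t statistic and a standardized effect size from the per-meter differences between steady and intermittent tests. The effect size is taken here as the mean difference divided by the standard deviation of the differences, which is one common definition and may differ from the estimator used by the effsize package; the error values for the 13 meters are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 5 % critical value of Student's t for 12 degrees of freedom (13 pairs).
T_CRITICAL_DF12 = 2.179

def paired_t(steady, intermittent):
    """Return (t statistic, degrees of freedom, effect size) for paired samples."""
    diffs = [s - i for s, i in zip(steady, intermittent)]
    n = len(diffs)
    mean_diff, sd_diff = mean(diffs), stdev(diffs)
    t_stat = mean_diff / (sd_diff / sqrt(n))
    effect_size = mean_diff / sd_diff  # assumed definition for paired data
    return t_stat, n - 1, effect_size

# Hypothetical mean errors (%) of 13 meters under steady and intermittent flow.
steady = [0.2, 0.1, 0.4, 0.3, 0.0, 0.5, 0.2, 0.3, 0.1, 0.4, 0.2, 0.3, 0.1]
intermittent = [0.9, 0.3, 0.2, 1.1, 0.6, 0.4, 1.0, 0.1, 0.8, 0.9, 0.5, 1.2, 0.0]
t_stat, df, effect = paired_t(steady, intermittent)
significant = abs(t_stat) > T_CRITICAL_DF12
print(f"t = {t_stat:.3f}, df = {df}, d = {effect:.3f}, significant: {significant}")
```

With this definition the t statistic equals the effect size multiplied by the square root of the number of pairs, so for a fixed number of meters a larger scatter of the differences lowers both quantities.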
This result invites two reflections: 1) the definition of significance needs to be expanded [2] to take into account the magnitude of the effect size, since there are occasions in which the t-test detects a significant difference in the average error that is negligible from a technical point of view, as occurs in the T1-T5/200 L/h case; 2) the effect size detected is directly related to the dispersion of the variable considered; therefore, when the variability observed in the error is small, the effect size to be detected reaches high values (>0.8) and, consequently, the number of meters per group can be smaller without affecting the power of the test. In contrast, when the dispersion of the variable is greater, the effect size decreases and the number of units per group should be greater to reach appropriate power values (Figure S-15).