Is smart water meter temporal resolution a limiting factor to residential water end-use classification? A quantitative experimental analysis

Water monitoring in households provides occupants and utilities with key information to support water conservation and efficiency in the residential sector. High costs, intrusiveness, and practical complexity limit appliance-level monitoring via sub-meters on every water-consuming end use in households. Non-intrusive machine learning methods have emerged as promising techniques to analyze observed data collected by a single meter at the inlet of the house and estimate the disaggregated contribution of each water end use. While fine temporal resolution data allow for more accurate end-use disaggregation, there is an inevitable increase in the amount of data that needs to be stored and analyzed. To explore this tradeoff and advance previous studies based on synthetic data, we first collected 1 s resolution indoor water use data from a residential single-point smart water metering system installed at a four-person household, as well as ground-truth end-use labels based on a water diary recorded over a 4-week study period. Second, we trained a supervised machine learning model (random forest classifier) to classify six water end-use categories across different temporal resolutions and two different model calibration scenarios. Finally, we evaluated the results based on three different performance metrics (micro, weighted, and macro F1 scores). Our findings show that data collected at 1- to 5-s intervals allow for better end-use classification (weighted F-score higher than 0.85), particularly for toilet events; however, certain water end uses (e.g., shower and washing machine events) can still be predicted with acceptable accuracy even at coarser resolutions, up to 1 min, provided that these end-use categories are well represented in the training dataset. Overall, our study provides insights for further water sustainability research and widespread deployment of smart water meters.


Introduction
Strong emphasis on sustainability in water use has been increasingly brought to light by growing population and urbanization (Cosgrove and Loucks 2015), coupled with climate change impacts on water resources (Pastor-Jabaloyes et al 2018, Karamouz and Heydari 2020). With existing limitations on water resource availability, new developments to increase water storage and supply are often physically or economically constrained. Therefore, better management of existing water resources has become an issue of paramount importance (Mazzoni et al 2021). Public utilities are now investing significant resources and efforts in the development and implementation of water management strategies, both on the supply and the demand side, to ensure future water security (Jain and Ormsbee 2002, Herrera et al 2010). On the demand side, these strategies include water saving technologies, new water policy regulations, public awareness and education campaigns, rebate programs for water-efficient devices, leakage management, and source substitution (e.g., supplying nonpotable end uses with greywater, recycled water, or harvested rainwater (Dixon et al 1999)) (Inman and Jeffrey 2006, Cominola et al 2015, Beal et al 2016, Ntuli and Abu-Mahfouz 2016). Besides their direct effect on water resources, residential water conservation and efficiency strategies can help save water-related energy required for water treatment, distribution, and heating (Srinivasan et al 2011). Residential end uses are responsible for more than 70% of all water-related energy use (Escriva-Bou et al 2018). However, the effectiveness of these measures hinges on an accurate estimate of water demand, which in turn requires a detailed understanding of how and when water is used in the residential sector.
Access to high resolution water consumption data can help improve our knowledge of water demand, identify specific fixture/appliance end uses (e.g., toilet, shower, washing machine, outdoor irrigation), or detect anomalies, such as leaks (Luciani et al 2019). Smart water meters, which can provide the fine resolution data necessary to discern end uses, have been proven essential in supporting water conservation and efficiency measures in practice (Britton et al 2013).
Conventional residential water meters typically collect coarse resolution data and require manual readings, limiting the understanding of household-scale water use characteristics and patterns in time. Conversely, smart (or digital) water meters enable the collection and automated reporting of fine resolution water use data, thereby allowing planners and utilities to better understand demand patterns and enact management strategies. Smart metering can support accurate demand characterization and forecasts and, hence, improve the operation and long-term planning of water supply and distribution systems (Sønderlund et al 2016), or promote durable conservation behaviors. In addition, detailed knowledge about water consumption at the household level can also translate into financial savings for home occupants, especially when complemented with information about individual end uses (e.g., Blokker et al 2010).
Obtaining information on residential end uses is not a trivial problem. Information about residential water demand at the end-use level could, in principle, be obtained through direct measurements via intrusive monitoring, i.e., by installing sub-meters at all household end uses. However, this approach is often practically or economically infeasible from a utility perspective and would likely be rejected by home occupants due to its intrusive nature. Instead, water utilities are increasingly installing residential smart water meters that collect fine resolution water consumption data at the service line or entrance into the home, providing aggregate water data, which are so far primarily used for billing purposes (Fogarty et al 2006, Froehlich et al 2009). Similarly to previous experiences in the electricity sector, limits to directly collecting water-use data at the residential end uses have motivated the development of several non-intrusive disaggregation approaches, which decompose a signal measured at the household level (i.e., aggregate water use) into the individual contribution of each end use (Cominola et al 2017, Di Mauro et al 2020, Bethke et al 2021). Several state-of-the-art disaggregation techniques require additional sensing on the premise plumbing infrastructure and/or a manual characterization of each end use (Fogarty et al 2006, Kim et al 2008). These techniques can be intrusive, expensive, and time consuming; thus, they are not easy to develop or replicate at large scales (Froehlich et al 2009, Srinivasan et al 2011, Ellert et al 2016, Ntuli and Abu-Mahfouz 2016). Other disaggregation techniques use single-source data collected at the household inlet point. They can classify end uses in a non-intrusive way (Figueiredo et al 2013, Rahimpour et al 2017), with the accuracy of results varying across different data sampling temporal resolutions (e.g., 1-10 s vs minutes; Clifford et al 2018, Vitter and Webber 2018).
Understanding the tradeoff between the value of the information provided by fine-resolution data and the economic and operational costs of the metering system is crucial to inform the design of future metering networks and associated analytics to facilitate customer data interpretation.
The availability of fine-resolution smart meter data generates several opportunities for advancing water demand management. Sub-minute sampling resolution is essential for most water end-use disaggregation algorithms to provide a reliable categorization of household level water use into different fixtures/appliances (e.g., shower, toilet, dishwasher, etc.) (Nguyen et al 2013, Abdallah and Rosenberg 2014). However, high resolution metering inevitably increases the amount of data the water utility must collect, process, and manage. Sampling at 1 s resolution, for instance, implies replacing the typical 12 monthly readings per user per year with over 31.5 million data readings. Large amounts of data can compromise hardware and software performance due to issues with meter power sources, battery life, telemetry network capacity, data gaps, and billing software, besides requiring utilities to acquire new skill sets for their employees (Suero et al 2012).
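The data-volume figure above follows from simple arithmetic. The short sketch below reproduces it for the sampling intervals considered in this study, assuming a 365-day year and one reading per interval; the function name is illustrative only.

```python
# Readings per user per year at a fixed sampling interval.
# Assumption: non-leap year, one reading stored per interval.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def readings_per_year(interval_s: int) -> int:
    """Number of meter readings per year for a sampling interval in seconds."""
    return SECONDS_PER_YEAR // interval_s


for interval in (1, 5, 10, 30, 60):
    print(f"{interval:>2} s -> {readings_per_year(interval):,} readings/year")
```

Under these assumptions, moving from 1 s to 1 min sampling reduces the yearly record by a factor of 60, which is the storage side of the tradeoff examined in this study.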
Among the existing literature that has already explored the implications of data sampling resolution for water end-use disaggregation (e.g., Wonders et al 2016), Cominola et al (2018) developed an analysis based on synthetic time series of water end uses generated with STREaM, the STochastic Residential water End-use Model. Their model relies on statistical distributions of end-use characteristics derived from a large dataset of disaggregated water end uses from over 300 single-family households in nine US cities (DeOreo 2011). STREaM generated synthetic time series of water end uses with diverse sampling resolutions, which were analyzed with a multi-resolution assessment framework to identify potentially critical thresholds in data resolution for different aspects of information retrieval and demand management. While such studies compensate for the shortage of (or limited access to) data by using stochastic modeling to generate synthetic disaggregated water use data, a data gap remains because ground-truth water end-use observations from real-world data are scarce (Di Mauro et al 2020, Di Mauro et al 2021).
Here, we address the challenge of testing if and how the theoretical results obtained in the literature on synthetic data change when similar analysis is replicated directly on real-world data. Compared to synthetic data, real-world data might be characterized by higher signal noise, data gaps, and limited dataset size for model calibration. We build on the above modeling efforts through collection and analysis of observed data from a monitored study home in the Midwest United States, exploring the tradeoffs between data sampling resolution and performance in water end-use classification. We examine different data sampling resolutions and explore water end-use disaggregation results by aggregating 1 s water consumption data from a four-person study household to coarser resolutions. We evaluate a set of performance metrics regarding water end-use classification using supervised machine learning informed by ground-truth end-use labels obtained from a water diary recorded over a 4-week study period. Findings from our multi-resolution assessment can support further research and assist utilities in quantifying the benefits associated with second-to-minute data sampling resolutions and the costs of adopting and maintaining fine-resolution metering infrastructures.
The major contributions of this work include:
• Training and testing a water end-use classification model on real-world observation data obtained with a single-point smart meter for a four-person household coupled with labels from a water diary.
• Quantifying the effects of temporal data sampling resolution on the performance of water end-use classification.
• Analyzing the tradeoff between end-use classification performance and data sampling resolution under two scenarios characterized by different model calibration strategies.

Metering setup, data collection, and temporal aggregation
In this study, we use data from a single-point smart water metering system installed at a four-person, single-family, fully-detached residence in the Midwest United States, collecting 1-s resolution flow rate data over a 4-week study period from September 3 to October 1, 2021. Aggregate indoor household water use data were collected from a custom ally® electromagnetic flow meter provided by Sensus, installed on the main water supply pipe into the residents' home. In addition to measuring flow rate (gal/min), the meter also sensed temperature (K) and pressure (psi) data at a 1-s resolution. Although these pressure and temperature data are useful to water system operations, they are not as valuable to demand disaggregation due to the large impact the distribution system has on these variables. We validated this assumption through feature analysis based on correlation and data visualization (see figures S15-S18 in the supporting information (https://stacks.iop.org/ERIS/2/045004/mmedia)). Consequently, we focus our analysis on flow rate data. The water meter writes data to a data acquisition system, a connected computer running a script that parses the raw data into a timestamped comma-separated value format for further analysis.
To examine the effects of data sampling temporal resolution on water end-use classification, we aggregated the 1-s resolution time series to resolutions of 5 s, 10 s, 30 s, and 1 min. The 1-min resolution has been recognized as a critical threshold for certain end-use data analytics in the electricity sector (Carrie Armel et al 2013). Similarly, a previous study based on analysis of synthetic data identified the same threshold as critical for end-use disaggregation in the water sector. Here, we test these findings with an experimental study based on real-world data and aim to identify a similar critical data sampling resolution threshold for water end-use classification in the residential sector. Because the study is based on a single four-person household, we also preliminarily compare its water consumption patterns with those of a larger study to check that the study home is representative of larger scale behavioral patterns.
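The aggregation step can be illustrated with a minimal sketch. It assumes that coarser series are obtained by averaging consecutive 1 s flow readings within fixed windows; the exact resampling routine used in the study is not stated here, and the function name and example trace are hypothetical.

```python
from statistics import fmean


def aggregate_flow(flow_1s, window_s):
    """Average consecutive 1 s flow readings (gal/min) into window_s-second bins.

    A trailing partial window, if any, is averaged over the samples it contains.
    """
    if window_s < 1:
        raise ValueError("window_s must be a positive number of seconds")
    return [
        fmean(flow_1s[i:i + window_s])
        for i in range(0, len(flow_1s), window_s)
    ]


# Hypothetical 20 s trace: 5 s idle, 10 s at 2.0 gal/min, 5 s idle.
trace = [0.0] * 5 + [2.0] * 10 + [0.0] * 5
print(aggregate_flow(trace, 5))   # one value per 5 s window
print(aggregate_flow(trace, 10))  # one value per 10 s window
```

In the 10 s series, the event no longer aligns with the window boundaries, so its flow is spread over both windows at half the original rate. This smearing is what makes short events harder to separate at coarser resolutions.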
During the study period, the home occupants manually recorded a written water diary of labeled end uses, which provided the ground-truth end-use data for model training and validation. In this study, six types of indoor water end uses contributed to the total household water demand: faucets, toilets, showers, refrigerator, dishwasher, and washing machine. The 4-week period was selected based on previous studies and practicality (Beal et al 2011, DeOreo et al 2016). The water diary included end-use labels, start time, and date, all completed by the household occupants. More details about the diary are reported in the supporting information, including the water diary template (figure S19) and an example of completed recordings (figure S20). This data collection included only factual data such that this work was determined not to meet the definition of human subjects research and, therefore, did not require Institutional Review Board (IRB) approval. Documentation of this IRB decision is available upon request. Limitations that naturally arose during the water diary process were as follows:
• Events that occupants forgot to record in the diary could not be labeled after the disaggregation of the data.
• Start times listed in the diary would sometimes correspond to events that occurred 1-2 min before the reported time, implying that occupants would sometimes fill in the diary after the event.
• Specifically for faucet events, occupants mentioned occasionally leaving the faucet on to avoid reporting multiple events, resulting in long faucet durations that can represent atypical behavior in the model training process.
• The water diary was completed manually and was unreadable for some events.
• Some reported events did not match the information received from the meter.
In addition to these limitations, a power outage created a 2-day data gap in the smart water meter dataset, where the water diary was completed but measured water flow was missing.

End-use disaggregation
The end-use disaggregation step separates concurrent water use events from one another and from single events; summed over time, the separated events reproduce the original time series collected at the single-point residential water meter. While end-use disaggregation and end-use classification sometimes coalesce into one concept in the literature, in this study we consider disaggregation as the first step of the end-use classification process (Nguyen et al 2018). Single events are defined as those that occur in isolation (e.g., dishwasher only), while combined or concurrent events have simultaneous occurrences of water usage (e.g., a toilet flush during a shower). A single i-th water use event $E_i$ can be quantitatively characterized by a vector of features $F_i$, which includes values of, e.g., start time, end time, average flow, and volume of that event. Separating and identifying overlapping, or concurrent, water use events is a significant challenge in residential water studies, and the accuracy of existing smart meter disaggregation models decreases substantially when these types of events are encountered (Cominola et al 2015). Concurrent events occur especially during longer duration events such as showers or outdoor irrigation. Thus, disaggregating concurrent events from one another by leveraging information on the characteristics of individual fixtures or by learning the patterns of individual end uses is essential for creating a comprehensive water profile for the household.
In this analysis, we used the disaggregation model from Bethke et al (2021), developed based on the method created by Nguyen et al (2013) to separate concurrent events by calculating the vector gradients of the flow rate data to identify start and end times of overlapping events. Once we separated events with the above disaggregation approach, we manually labeled each appliance/fixture water event based on the water diary and examined the events further with the classification model described below. We repeated this process for every resampled resolution as well as the original 1 s data. At coarser resolutions, the performance of the disaggregation model deteriorated when detecting multiple short duration events happening simultaneously (e.g., hand washing), or short duration events happening on top of a long duration event. Therefore, in addition to naturally having fewer observations at coarser resolutions, the number of events that we were able to match with the diary also decreased (figure S21).
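The idea of using flow gradients to find the start and end times of overlapping events can be illustrated with a deliberately simplified sketch. This is not the model of Bethke et al (2021) or Nguyen et al (2013); it assumes idealized, step-like flow signals, and the function name, threshold, and example trace are hypothetical.

```python
def split_concurrent_events(flow, min_step=0.2):
    """Split an aggregate flow trace into events using first differences.

    A rise of at least min_step opens an event; a drop of at least min_step
    closes the open event whose rise is closest in magnitude to the drop.
    Returns (start_index, end_index, flow_rate) tuples, end index exclusive.
    """
    open_events = []  # (start_index, step_size) for events still running
    events = []
    for t in range(1, len(flow)):
        step = flow[t] - flow[t - 1]  # discrete gradient of the flow signal
        if step >= min_step:
            open_events.append((t, step))
        elif step <= -min_step and open_events:
            # step is negative here, so ev[1] + step compares rise and drop sizes
            best = min(open_events, key=lambda ev: abs(ev[1] + step))
            open_events.remove(best)
            events.append((best[0], t, best[1]))
    return events


# Hypothetical trace (gal/min, 1 s samples): a 2.0 gal/min shower with a
# 1.5 gal/min toilet refill superimposed on it.
flow = [0, 0, 2, 2, 2, 3.5, 3.5, 3.5, 2, 2, 2, 2, 0, 0]
print(split_concurrent_events(flow))
```

With real meter data, flow ramps, noise, and temporal averaging blur these steps, which is consistent with the degradation at coarser resolutions described above.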

End-use classification
After disaggregating the original water use time series, we labeled each event by matching it with the water diary. We then trained a random forest (RF) classifier to perform appliance/fixture end-use classification, using the disaggregated water events resulting from the previous step of end-use disaggregation. The classification algorithm allocated each data point (i.e., an i-th water use event $E_i$) in the dataset to one of the labeled classes, after training on tuples of events and associated features ($E_i$, $F_i$).
RF models were introduced by Breiman (2001) as ensemble learning algorithms and have been shown to be outstanding predictive models in classification tasks (Herrera et al 2010, James et al 2013). RFs are built using the same fundamental principles as decision trees and bagging (bootstrap aggregation). Bagging introduces randomness into the tree building process by building many trees on random subsets of the training data drawn with replacement; this sampling process is also known as bootstrapping. Bagging then aggregates the predictions across all the trees, which reduces the variance of the overall procedure and improves predictive performance (Géron 2019). However, bagged trees can be correlated with one another, which limits the effect of variance reduction. RFs help reduce variance further by injecting more randomness into the training process (Hastie et al 2009). Like bagging, the RF algorithm draws random bootstrap samples from the training set. However, while bagging provides each tree with the full set of features, RFs select a random subset of features at each split, which makes trees more independent of each other and often results in better variance-bias tradeoffs (table S1) (Hastie et al 2009, Probst et al 2019). In this study, the two features of average flow and duration were eventually selected to build the final models, based on the results of our feature importance analysis (figure S22). Therefore, the search for the split variable was limited to a random subset of the two chosen features. Feature importance was computed by evaluating which features contributed the most to the generalization power of the model (permutation-based feature importance; Breiman 2001).
To understand the mechanism used by RF models, it is necessary to understand the construction of classification decision trees. The goal of such a tree is to partition data into small and homogeneous groups. When travelling down the tree, data are split at decision points called nodes, with each possible outcome of a split forming a branch of the tree. To perform each partitioning operation, a decision is based on an index (e.g., the Gini index), which allows RF models to partition the nodes of each tree into more homogeneous groups that contain a larger proportion of one class in each subsequent node (Kuhn and Johnson 2013). The Gini index is calculated as in equation (1), where $C$ is the total number of classes in the model and $p_{nk}$ is the probability of the occurrence of class $k$ at node $n$. In this study, six different classes were evaluated based on typical household end uses: faucets (f), toilets (t), showers (s), refrigerator (r), dishwasher (d), and washing machine (w). The sum of all probabilities at a certain node is equal to one (see equation (2)):

$$\mathrm{Gini}_n = 1 - \sum_{k=1}^{C} p_{nk}^{2} \qquad (1)$$

$$\sum_{k=1}^{C} p_{nk} = 1 \qquad (2)$$

Other metrics similar to the Gini index can be used to build decision trees, including cross entropy and misclassification error. However, the Gini index is the most commonly used metric in the literature (James et al 2013). Moreover, according to Raileanu and Stoffel (2004), the Gini index and entropy disagree in only 2% of all cases, yet entropy is generally slower to compute because it requires a logarithmic function. For the above reasons, we used the Gini index in this study.
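As a worked illustration of node impurity, the sketch below computes the Gini index of a node from the class labels of the events it contains, assuming the common form of one minus the sum of squared class proportions. The label letters follow the class abbreviations above; the example node contents are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity of a tree node, given the class labels of its events."""
    n = len(labels)
    if n == 0:
        raise ValueError("a node must contain at least one event")
    counts = Counter(labels)
    # p_k = count_k / n is the estimated class probability at this node.
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical node: 6 toilet, 2 faucet, and 2 shower events.
mixed_node = ["t"] * 6 + ["f"] * 2 + ["s"] * 2
pure_node = ["t"] * 5

print(gini_index(mixed_node))  # 1 - (0.6^2 + 0.2^2 + 0.2^2)
print(gini_index(pure_node))   # a node with a single class has zero impurity
```

A split is preferred when it sends events to child nodes whose impurity is lower than that of the parent, i.e., when each child is dominated by one end-use class.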
Besides the split criterion, the RF algorithm involves several other hyperparameters that can be tuned to optimize model performance. While studies have shown that RF models are less sensitive to tuning than other algorithms such as support-vector machines (Probst et al 2019), modest performance gains can still be valuable considering the limitations that naturally come with a small dataset. Using grid search, we specified ranges for the RF hyperparameters, exhaustively tried all possible combinations, and selected the best one. The grid initially included the minimum number of samples at each leaf (2-5), the minimum number of samples required for a split (2, 5, 8, 12), the number of sub-features (1, 2), the maximum depth (3-10), and the number of trees (10, 20, 50, 100, 200).
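To give a sense of the size of this search, the sketch below enumerates the grid with the standard library only. It assumes that the ranges 2-5 and 3-10 denote inclusive integer ranges; the dictionary keys are illustrative names patterned on common RF implementations, not necessarily those used in the study.

```python
from itertools import product

# Hyperparameter ranges as listed in the text (inclusive integer ranges assumed).
param_grid = {
    "min_samples_leaf": [2, 3, 4, 5],
    "min_samples_split": [2, 5, 8, 12],
    "max_features": [1, 2],
    "max_depth": list(range(3, 11)),
    "n_estimators": [10, 20, 50, 100, 200],
}


def iter_grid(grid):
    """Yield one dict per hyperparameter combination (exhaustive grid search)."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combinations = list(iter_grid(param_grid))
print(len(combinations))  # 4 * 4 * 2 * 8 * 5 candidate models
print(combinations[0])
```

Each candidate would then be trained and scored, and the best-scoring combination retained. With a small dataset, this exhaustive search remains computationally affordable.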

Model calibration and data sampling resolution scenarios
We consider two scenarios for calibration analysis of the classification model: the '1 s only calibration' (Scenario 1), and 'multi-resolution calibration' (Scenario 2).
'1 s only calibration' (Scenario 1): In this scenario, the RF model was trained only on the measured data at the 1 s resolution. Extended time series of 1 s resolution water use data are not usually available from utility records, but they can be collected in small-scale customized and experimental smart meter installations. With this scenario, we test whether investing efforts and resources in gathering a small model calibration dataset at sub-minute resolution is worth the potential gain in classification accuracy at coarser resolutions as well. Our assumption behind this scenario is that the features of water use events can be more accurately learned from data collected at higher resolutions. In the 1 s trained RF model scenario, we split the labeled data into train (70% of the data) and test (30% of the data) datasets. The test set was used to assess the model performance at the 1 s resolution. Then, the entire resampled dataset at each of the other resolutions was used separately as a test set to compare the performance of the model on coarser resolutions.
'Multi-resolution calibration' (Scenario 2): In this scenario, we trained a different RF model for each resolution (5 s, 10 s, 30 s, and 1 min) on its own dataset and compared the models' performances both with one another and with Scenario 1. We thereby examine the value of retraining the RF model specifically for different temporal resolutions to quantify differences in performance across sampling resolutions and, in comparison with Scenario 1, across different model training strategies. To retain the value of limited data and improve generalizability of the models, we implemented a k-fold cross-validation strategy (Hawkins et al 2003). We thus split the training set into k subsets, called folds, and then iteratively fit the model k times, each time training on k − 1 folds and validating on the remaining fold. In this study, we fit the model with k = 10. At the end of training, we averaged the performance across all validation folds as the final performance value for the model.
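The k-fold procedure can be sketched with the standard library as follows. This is a minimal illustration of the splitting and averaging logic only; the shuffling seed, function names, and example sample count are hypothetical and do not reproduce the study's actual splits.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def mean_score(fold_scores):
    """Final performance value: average of the per-fold validation scores."""
    return sum(fold_scores) / len(fold_scores)


splits = list(k_fold_indices(100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Every event is used for validation exactly once and for training k − 1 times, which is why this strategy suits small labeled datasets such as a 4-week water diary.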

Performance metrics
RF is a noise-robust technique. However, when considering imbalanced problems, canonical machine learning algorithms generally tend to be biased towards the majority group. This behavior happens because such algorithms assume the number of objects in each group to be roughly similar (Krawczyk 2016, Ribeiro and Reynoso-Meza 2020). However, the minority class is often the most important when dealing with skewed distributions, and a performance metric should be chosen so as to overcome such bias. While we do not directly balance the dataset used in this study because of its limited size, in this analysis we evaluate and compare the model performance using different formulations of the F1-score (FS). Specifically, we compare (i) micro-FS, which is a global metric attributing equal importance to each sample, thus placing emphasis on the most represented labels, (ii) macro-FS, which attributes equal importance to each class, and (iii) weighted-FS, which computes the weighted average of the FS values obtained for individual classes. While using these metrics does not solve class imbalance, we examine different FS formulations to see whether or not our classifier is biased towards well represented classes.
Micro-FS (usually referred to as simply FS) is a global performance metric that puts more emphasis on the most represented labels in the dataset since it gives each sample the same importance. Labels that are underrepresented in the dataset do not heavily influence the overall micro-FS if the model performs well on the other, more common classes. Micro-FS (equation (3)) is defined as the harmonic mean of the precision (equation (4)) and recall (equation (5)):

$$\mathrm{FS} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (3)$$

$$\mathrm{precision} = \frac{TP}{TP + FP} \qquad (4)$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \qquad (5)$$

where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, respectively.
Macro-FS (short for macro-averaged F1 score) is used to assess the quality of classification in problems with multiple classes. The macro-FS gives the same importance to each class, with low values for models that only perform well on the common classes while performing poorly on the classes with less data. The macro-FS is defined as the mean of class-wise FS in equation (6):

$$\text{macro-FS} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{FS}_i \qquad (6)$$

where $i$ is the class index and $N$ is the number of classes/labels.
The weighted-average FS (equation (7)) is calculated by taking the mean of all per-class F1 scores while considering the number of actual occurrences of each class in the dataset:

$$\text{weighted-FS} = \frac{1}{H} \sum_{i=1}^{N} h_i \, \mathrm{FS}_i \qquad (7)$$

where $i$ and $N$ are as above, $h_i$ is the number of actual occurrences of class $i$, and $H$ is the total number of aggregated elements across all classes. The weighted-FS formulation modifies the macro-FS to account for class imbalance, while imbalance is not considered in micro-FS and macro-FS.
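The three formulations can be compared on a small, hypothetical imbalanced example. The sketch below is a plain-Python illustration of the standard definitions (per-class F1 from true positives, false positives, and false negatives); the labels and counts are invented for illustration and are not results from this study.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro, macro, weighted) F1 for single-label multi-class data."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1  # predicted class gains a false positive
            fn[actual] += 1     # actual class gains a false negative

    def f1(tp_, fp_, fn_):
        # Equivalent to the harmonic mean of precision and recall.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    support = Counter(y_true)  # actual occurrences of each class
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(support[c] * per_class[c] for c in classes) / len(y_true)
    return micro, macro, weighted


# Hypothetical test set: 8 toilet events (all correct) and 2 dishwasher
# events, one of which is misclassified as a toilet.
y_true = ["t"] * 8 + ["d"] * 2
y_pred = ["t"] * 8 + ["d", "t"]
micro, macro, weighted = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3), round(weighted, 3))
```

In this example the macro score is the lowest of the three because the poorly predicted minority class counts as much as the majority class, which illustrates why comparing the formulations can reveal bias towards well represented classes.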

Data characterization: time-of-day visualization
To check whether our study home is representative of households at a larger scale, we initially visualized the time-of-day and day-of-week distribution of three major classes of events (shower, washing machine, and dishwasher) to find regular patterns of consumption similar to those displayed in larger datasets. Much of the occupants' water consumption occurs during typical weekday mornings and evenings. Figure 1(a) depicts shower end use distribution throughout the week and time of the day in our study home. The results show that showers have a more sporadic pattern of use on weekends, while during weekdays most of them occur during regular morning and evening peak hours. These behavioral patterns align with the time-of-day and day-of-week distribution of showers reported in an analysis of water end use data gathered for 762 US households (Cominola et al 2020), shown in figure 1(b). The time-of-day and day-of-week distribution figures for the washing machine (figure S1) and dishwasher (figure S2) also show similar results. Washing machine events are observed mostly during weekends with a wide distribution throughout the day, while dishwashers are typically used throughout the week, either in the mornings or evenings. Comparison of the results shows similar patterns between our study home and the larger study of US households used in Cominola et al (2020), suggesting that our study home results are potentially transferable. Similar widespread end-use data would help water planners and managers understand water consumption patterns, consumer behavior, and temporal variability. Decreasing consumption during peak time on a widespread scale could contribute to lowering overall peak demand for the local utility and reduce pressure on existing water infrastructure.

Comparative multi-resolution scenario analysis
The overall RF model performance across different resolutions in both calibration scenarios is presented in figure 2. Grey lines represent Scenario 1 (1 s only calibration) and blue lines represent Scenario 2 (multi-resolution calibration). The micro-FS, weighted-FS, and macro-FS are represented with dashed, solid, and dotted lines, respectively. We observe that Scenario 2 gives higher performance across different temporal resolutions regardless of the performance metric. For both 1 s and 5 s resolutions, the micro-FS and weighted-FS values are similar: 0.91 and 0.89 for the micro- and weighted-FS, respectively, at the 1 s resolution, and 0.87 and 0.85 for the micro- and weighted-FS, respectively, at the 5 s resolution. The macro-FS generally shows the lowest values for all resolutions for both scenarios. We observe a mild decrease in performance metrics with coarser temporal resolutions in Scenario 2, while performance metrics decrease notably for resolutions coarser than 5 s in Scenario 1, dropping as low as 0.2 for the 1 min resolution.
Overall, our results indicate that the RF models learned end use event features better when trained at the same data sampling resolution that they are tasked to use to classify unseen events, provided that a training dataset with labelled events at that resolution is available. If classification models are trained for application on data measured at the same resolution (Scenario 2), those models can perform at an acceptable level of performance even at coarser resolutions, depending on the relative importance of different end use classes. This observation has important implications related to the tradeoffs between fine-resolution data collection and increased data analytics needs. For instance, if a utility wants an estimate of water consumption by the main indoor water uses in households (e.g., toilets and showers), the 1 min resolution model still provides an acceptable performance (weighted-FS equal to 0.73). This performance is lower than the FS of 0.89 obtained for the 1 s resolution model, but this loss in model accuracy is balanced by the benefit of gathering, storing, and analyzing fewer data observations at the coarser temporal resolution. Conversely, if detailed information on all end uses is required, only the 1 s and 5 s resolutions provide high performance predictions on all end use classes; for less represented end uses, performance is compromised at coarser resolutions.

Detailed end-use classification results
Our detailed RF model testing results are presented in figure 3, where the predicted classes (right) are compared to the actual classes (left). Figure 3(a) represents the entire 1 s resolution set of events, while figure 3(b) zooms in on shorter duration events for clarity. The average flow rate (gal/min) and duration (s) were used as identifying features for our model. Of the total 654 events labeled, we used 196 events as a test set. The model predicts the test set with an accuracy of 92% and a weighted-FS of 0.89, which is noteworthy given the fact that the training dataset had limited observations in some classes such as dishwasher and washing machine. The model correctly predicts 179 events out of 196 total events of the test set.
Yet, the high model performance in all classes might overstate the overall ability of our RF models to classify unseen end use events. Our results might imply that, due to the fine temporal resolution of the data, the model discerns the constant range of duration and average flow of those end uses with automatic water consumption cycles (e.g., washing machine, dishwasher) and detects them correctly. However, since our study represents a single household only, the model might be overfitting on data from automatic appliances due to the invariance of duration and flow in these specific appliances; thus, results on these end uses may not be generalizable.
It is important to note that, while individual toilet uses are typically homogeneous in terms of water consumption volume and duration, even considering dual-flush systems, the combination of toilet and bathroom faucet uses is difficult to detect and disaggregate because such uses are often almost simultaneous (e.g., use of the toilet and subsequent handwashing within the same minute). Although temporal resolutions finer than 1 min reduce disaggregation errors (Mazzoni et al 2021), we were not able to disaggregate all toilet events followed by faucets. Instead, we labeled these events as toilets, attributing the subsequent faucet use to the toilet use. As a result, toilets have a wider range of flow and duration, as shown in figure 3.
Figure 4 shows the classification results for Scenario 1 (1 s only calibration) applied to the resampled 5 s (figure 4(a)) and 1 min (figure 4(b)) resolutions, selected as examples at the two extremes of the considered spectrum of data resolutions. Here, we report our analysis results in both US customary units (gal/min) and SI units (L min⁻¹). In comparing different temporal resolutions, coarser resolutions tend to compress data points on the vertical axis (i.e., decrease average event flow) and extend their range on the horizontal axis (i.e., increase event duration) due to temporal averaging. For example, toilet events that originally ranged from 1.7-3 gal/min (6.4-11.4 L min⁻¹) average flow in the measured 1 s resolution tend to shift to 1-2.5 gal/min (3.8-9.5 L min⁻¹) in the 5 s resolution and decrease further to 0.4-0.8 gal/min (1.5-3 L min⁻¹) in the 1 min resolution. Event duration increases with coarser temporal resolution such that the total event volume matches the volume in the original 1 s resolution measurements.
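The effect of temporal averaging on the two classification features can be illustrated with a minimal sketch, assuming simple block averaging of a 1 s flow series and an event defined by its nonzero samples. The helper names (`block_average`, `event_features`) and the synthetic event (40 s at 2.4 gal/min) are hypothetical and do not reproduce our resampling code or measured data.

```python
def block_average(flow, factor):
    """Average consecutive `factor`-sample blocks of a 1 s flow series (gal/min)."""
    return [sum(flow[i:i + factor]) / factor for i in range(0, len(flow), factor)]


def event_features(flow, step_s):
    """Return (duration in s, average flow in gal/min, volume in gal) over nonzero samples."""
    active = [q for q in flow if q > 0]
    duration = len(active) * step_s
    avg_flow = sum(active) / len(active)
    volume = sum(active) * step_s / 60.0  # gal/min times seconds, converted to gallons
    return duration, avg_flow, volume


# Hypothetical event: 40 s at 2.4 gal/min, starting 7 s into a 120 s window.
one_second = [2.4 if 7 <= t < 47 else 0.0 for t in range(120)]

features = {
    step: event_features(block_average(one_second, step), step)
    for step in (1, 5, 60)
}
for step, (duration, avg_flow, volume) in features.items():
    print(f"{step:>2} s: duration={duration} s, avg flow={avg_flow:.2f} gal/min, volume={volume:.2f} gal")
```

Under these assumptions the event volume is unchanged at every resolution, while the apparent duration grows and the average flow shrinks, which is the shift in feature space that a model calibrated only on 1 s data does not see during training.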
These shifts in the values of end-use features lead to decreased model performance with coarser temporal resolutions, up to a point where, as shown in figure 4(b), the model can no longer detect any toilet events. The model still correctly predicts showers and a few washing machine events at the 1 min resolution; however, the model application to the 1 min data predicts most other end uses as a faucet under Scenario 1.
Figure 5 shows the confusion matrices of water end use classification across the events of our four-person study household for Scenario 1. Faucets (f) account for the most frequent end uses, followed by toilets (t). The matrices show the total number of events labeled for each resolution, the actual classes, and the predicted classes by the model. The results for the 5 s resolution show that of 382 total events that we were able to match with the water diary, 324 events were classified correctly (figure 5(a)). The main misclassifications were in predicting 14 actual toilet end uses as faucets and four actual faucet end uses as toilets. This misdetection mostly occurs for data that fall in the area with average flows of 1-1.5 gal/min (3.8-5.7 L min⁻¹) and durations of 25-50 s (see figure S5 in the supporting information). For the 1 min resolution (figure 5(b)), only 187 events had corresponding end uses in the water diary due to disaggregation errors where the model was not able to separate concurrent events because of loss of information that naturally accompanies coarser resolutions. Out of these 187 events, 92 were classified correctly. The classification model predicts 135 events as faucets. While only 73 of these events are actually faucets, they still account for 40% of the prediction accuracy, motivating consideration of F1-score metrics due to the imbalanced dataset.
Figure 6 shows the confusion matrices of end use water consumption for Scenario 2, the multi-resolution calibration, for the 5 s (figure 6(a)) and 1 min (figure 6(b)) resolutions. Compared to Scenario 1 (figure 5(a)), the 5 s resolution performance in Scenario 2 has either slightly improved or stayed the same, with the exception of refrigerator events (r). The 1 s resolution trained model in Scenario 1 had a better performance in predicting 5 s resolution refrigerator events. The prediction of toilets improved notably from 43 to 51 out of 57 events. The main misclassifications were in predicting 5 actual toilet end uses as faucets and 6 actual faucet end uses as toilets. This misdetection mostly occurs for data that fall in the area with average flows of 1-1.5 gal/min (3.8-5.7 L min⁻¹) and durations of 25-50 s (see figure S5 in the supporting information). Overall, the 5 s resolution has a high performance under both scenarios, with performance metrics slightly less than those of the 1 s resolution (as shown in figure 2). At the 1 min resolution, our model correctly predicts 139 of 187 labeled events, having the highest prediction accuracy in washing machine (100%), faucet (90%), and shower (88%) events. These results imply that if any of the aforementioned end uses are of importance, the 1 min resolution can still be informative.
Inspecting the diagonals of the confusion matrices, we see how figure 6(b) improves on figure 5(b), increasing correct predictions from 93 to 139. The 1 min resolution model is still not able to discern refrigerator faucet events (r) from tap faucet events (f); however, this misclassification is not a critical issue since the refrigerator faucet is a faucet in nature. A noteworthy observation is that although the 1 min resolution model under Scenario 2 incorrectly classifies one shower and one actual faucet event as a washing machine (i.e., false positive, FP in equation (4)), it does not label any actual washing machine event as another end use (i.e., false negative, FN in equation (5)), which leads to a higher recall in this specific class (100%) than in the 1 s resolution model (86%) (refer to figure S12) and the 5 s resolution model (95%), with the tradeoff of lower precision (88% versus 95% in the 1 min and 5 s resolutions, respectively). Additional confusion matrices at other temporal resolutions are available in figures S9-S14 of the supporting information. In general, misclassifications do not cause significant degradation in predicting total water consumption if they are infrequent and roughly symmetric across the diagonal (Srinivasan et al 2011). For example, if toilet events are misclassified as faucet events while the same (or nearly the same) number of faucet events are misclassified as toilet events, these misclassifications can cancel out in terms of the accurate total number of events for those classes.
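The cancellation argument can be checked with a short sketch that compares predicted and actual event counts per class. The function name `class_count_errors` is hypothetical. The off-diagonal counts (5 and 6) and the toilet diagonal (51) follow the 5 s Scenario 2 numbers reported above, whereas the faucet diagonal (200) is a placeholder assumed only for illustration, and all other classes are omitted.

```python
def class_count_errors(confusion, labels):
    """Predicted-minus-actual event count per class.

    confusion[i][j] = events of actual class labels[i] predicted as labels[j].
    """
    n = len(labels)
    errors = {}
    for k, name in enumerate(labels):
        actual = sum(confusion[k])                          # row sum
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        errors[name] = predicted - actual
    return errors


labels = ["toilet", "faucet"]
# 51, 5, and 6 follow the text; 200 is an assumed placeholder, not a study result.
confusion = [
    [51, 5],
    [6, 200],
]
errors = class_count_errors(confusion, labels)
misclassified = confusion[0][1] + confusion[1][0]
print(misclassified, errors)
```

In this two-class simplification, 11 misclassified events shift each class total by only one event. The same cancellation carries over to volumes only to the extent that the swapped events are of similar size.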

Broader implications
Overall, our study contributes to the literature showing that smart water meters provide water utilities with more accurate and less labor-intensive information, enabling better knowledge on changing water demands (Gurung et al 2015). High resolution temporal and spatial water consumption data have undeniable social and technical benefits. Smart metering contributes to more accurate water demand forecasting, demand management strategies, and better informed utility operations and planning strategies (McDaniel and McLaughlin 2009, Cominola et al 2015, Salomons et al 2020). Detailed water consumption patterns, which enable researchers to investigate the relationships between human behaviors and the water cycle as part of a broader socio-environmental scale, can now be obtained with advanced analytics, enabled by fast-paced computing power improvement and metering technology allowing data collection with unprecedented temporal and spatial granularity (Flint et al 2017, Zipper et al 2019). While these advances support greater understanding of water consumption patterns and water-related human behaviors, we also acknowledge that there are potential privacy concerns regarding individuals and communities that need to be addressed and appreciated. Water consumption information transformed from the meter acts as an information side channel (McDaniel and McLaughlin 2009), exposing household habits and behaviors. End uses like showers and toilets have detectable water consumption signatures, making end-use classification information prone to potential privacy abuse. Consequently, well established privacy policies would benefit utilities in appropriate water demand management. Additionally, researchers have an ethical responsibility to protect participant confidentiality.
Recent studies have addressed privacy issues in both the water and energy sectors and presented solutions to overcome privacy related constraints to maximize the potential of granular data (Khurana et al 2010, Molina-Markham et al 2010, Gurstein 2011, Amin 2012, Cole and Stewart 2013, Harter et al 2013, Sankar et al 2012, Helveston 2015, Park and Cominola 2020, Salomons et al 2020). For instance, smart meter data can be used without invading individual privacy by aggregating data to coarser spatial or temporal scales as presented in our study. Nevertheless, as shown in this study, aggregation limits the ability of end-use classification, or any water consumption related research, to explore fine-scale behavioral dynamics for better demand modeling. Therefore, any research intersecting with human behavior should prioritize confidentiality (e.g., via anonymized data collected over a large sample of households) while providing sufficient information to enable future improvements in that field. While the formulation of privacy and security protection strategies is not within the scope of this study, we acknowledge that privacy and security considerations must be addressed and proactively planned for prior to collecting data throughout the research process so that modern metering technologies could be leveraged to their full extent while securing customer privacy (Meyer 2018).
From the findings of this study, we can identify the following limitations and opportunities for future research. First, future studies could focus on assessing how our results generalize when data from a larger household sample or homes from different socio-demographic, geographical, and climate contexts are available. Second, in this study we only considered six classes of indoor water uses from a four-person household. Further research could include outdoor water use and test end-use disaggregation capabilities on houses with different sizes. Third, as highlighted in the methods, end use datasets are often imbalanced, i.e., the number of events in each end use class might vary substantially. While here we considered class imbalance a posteriori, by assessing the disaggregation results with different formulations of the F-score, an alternative approach to be tested when larger datasets are available is to balance the classes a priori (i.e., before performing the classification), e.g., by oversampling/undersampling, which would mitigate the problem of class imbalance. Finally, while here we only considered RF classifiers and a specific approach for disaggregation, future studies could comparatively assess the performance of different models, possibly accounting for multi-class events.
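As an illustration of the a priori balancing mentioned above, the following minimal sketch applies random oversampling, one of several possible balancing strategies and not one used in this study. The function name `random_oversample` and the seven synthetic events are hypothetical. In practice, such resampling would be applied to the training split only, so that duplicated events do not leak into the test set.

```python
import random
from collections import Counter, defaultdict


def random_oversample(events, labels, seed=0):
    """Duplicate minority-class events at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for event, label in zip(events, labels):
        by_class[label].append(event)
    target = max(len(items) for items in by_class.values())
    out_events, out_labels = [], []
    for label, items in by_class.items():
        extra = rng.choices(items, k=target - len(items))  # sampled with replacement
        out_events.extend(items + extra)
        out_labels.extend([label] * target)
    return out_events, out_labels


# Hypothetical (average flow in gal/min, duration in s) training events; not study data.
events = [(1.2, 20), (1.0, 15), (1.4, 30), (1.1, 12), (2.5, 45), (2.2, 50), (2.0, 480)]
labels = ["faucet"] * 4 + ["toilet"] * 2 + ["shower"]

balanced_events, balanced_labels = random_oversample(events, labels)
counts = Counter(balanced_labels)
print(counts)
```

Because the added events are exact duplicates, this approach equalizes class frequencies without adding new information about minority end uses, which is one reason larger datasets remain preferable.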

Conclusion
In this analysis, we present a supervised approach to classify residential water end-use events and test it across multiple temporal resolutions. We measured water use in a four-person household with a 1 s resolution smart water metering system and labeled events based on a water diary over a 4-week study period. We investigated two different scenarios of model calibration in evaluating the effect of temporal resolution on end-use classification performance. The first scenario consisted of training an RF classifier on the original 1 s resolution data only and testing it also on other labeled temporal resolution datasets (i.e., 5 s, 10 s, 30 s, 1 min). In this scenario, our model exhibited high overall performance on the 1 s and 5 s resolution water use events and classified certain classes of end uses with fairly good accuracy for the 10 s resolution. The performance decreased notably for the 30 s and 1 min resolutions.
The second scenario consisted of training separate models for each temporal resolution using k-fold cross-validation. We saw that performance at coarser temporal resolutions improved in this second scenario, with F1-score performance metrics as high as 0.89 for certain end-use classes at the finer resolutions. A weighted F1-score above 0.85 was obtained in this scenario for disaggregation tasks performed at the 1 s and 5 s resolutions.
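For readers less familiar with k-fold cross-validation, the splitting step can be sketched as follows. The function name `k_fold_indices` and the choice of k = 5 are assumptions made for illustration; 187 is the number of labeled events at the 1 min resolution. The sketch neither shuffles nor stratifies the events, both of which are common refinements in practice.

```python
def k_fold_indices(n_events, k):
    """Yield (train_idx, test_idx) pairs; each event lands in exactly one test fold."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_events // k + (1 if i < n_events % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_events))
        yield train_idx, test_idx
        start += size


# k = 5 is an assumed value for illustration only.
folds = list(k_fold_indices(187, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

A separate model would be trained on each `train_idx` and scored on the matching `test_idx`, with the reported metric averaged over folds.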
Our results reveal detailed information that can help utilities and residents make informed water conservation and efficiency decisions based on detailed knowledge on water demands. The analysis of classification model performance versus temporal resolution considering different F1-score formulations provides insight for future water management regarding the selection of an efficient monitoring resolution based on priorities and data management capabilities.
In addition, our approach performed end-use classification of data aggregated at different temporal resolutions that are closer to the resolutions of commercial smart water meters (i.e., 1 min). Thus, while data collected at a finer resolution (e.g., 1 s) might not be available to water utilities due to data management and analysis tradeoffs, we demonstrate possible model extensions to broader contexts in the field of residential water demand monitoring.
Ultimately, disaggregating and classifying water events obtained from residential smart water meter data reveals detailed information about how water is consumed within households. Understanding water consumption profiles and performance of different resolutions presents opportunities for improved residential water conservation and efficiency and long-term water resource sustainability (Attari 2014, Inskeep and Attari 2014, Goulas et al 2022). Our study presents an experimental example of how using smart water meter data can provide end-use information to pinpoint opportunities for improved efficiency within residential buildings.

Author contributions
Z.H. analyzed data, created and implemented the classification models and quantified performance metrics, and analyzed and summarized results; A.C. assisted with creating the classification models and performance metrics and contributed to scoping the analysis; A.S.S. formulated the analysis scope, supervised the research, and acquired funding to support the analysis; all authors contributed to writing and reviewed the manuscript.
Universität Berlin) and Riccardo Taormina (TU Delft) for sharing the results reproduced in figures 1(b), S1(b) and S2(b). This work was supported by the National Science Foundation, Grant CBET-1847404; the opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the National Science Foundation.