A cluster analysis approach to sampling domestic properties for sensor deployment

Sensors are an increasingly widespread tool for monitoring utility usage (e.g., electricity) and environmental data (e.g., temperature). In large-scale projects, it is often impractical and sometimes impossible to place sensors at all sites of interest, for example due to limited sensor numbers or access. We test whether cluster analysis can be used to address this problem. We create clusters of potential sensor sites using factors that may influence sensor measurements. The clusters provide groups of sites that are similar to each other, and that differ between groups. Sampling a few sites from each group provides a subset that captures the diversity of sites. We test the approach with two types of sensors: utility usage (gas and water) and outdoor environment. Using a separate analysis for each sensor type, we create clusters using characteristics from up to 298 potential sites. We sample across these clusters to provide representative coverage for sensor installations. We verify the approach using data from the sensors installed as a result of the sampling, as well as using other sensor measures from all available sites over one year. Results show that sensor data vary across clusters, and vary with the factors used to create the clusters, thereby providing evidence that this cluster-based approach captures differences across sensor sites. This novel methodology provides representative sampling across potential sensor sites. It is generalisable to other sensor types and to any situation in which influencing factors at potential sites are known. We also discuss recommendations for future sensor-based large-scale projects.

In parallel with this enhanced access to data, there has been an increasing practical drive for energy savings [40,41], renewable energy utilisation [42], healthy home environments [43], and smart control of domestic systems [44,45]. Establishing the influences on these factors requires research into energy usage, water usage and environmental conditions in domestic settings, which relies on feasible monitoring and effective sensor placement.
The purpose of the current study is to address a core long-standing problem for monitoring sensors. The problem is selecting representative and optimised locations for placing sensors [46,47] when it is not possible to place sensors in all potential locations, for example, due to limited numbers of sensors or difficulties with installation. Suboptimal placement of sensors can result in collecting data across similar situations, thereby providing potentially redundant information. Sensors could instead be more usefully deployed across a variety of sites [48], and by targeting locations through representative sampling.
Research on optimum placement of sensors is often necessarily specific to the type of sensor and the application [e.g., water contamination: [49], fluid flow: [50]], with placement design requiring solutions for specific spaces and sensors [51]. For indoor air quality or temperature, fluid dynamic modelling of pollutant or heat diffusion can highlight locations that would be most useful to monitor [52], with methods typically adapted to reduce the computational complexity [46,51,53], or incorporate influences from occupants and the building [54]. For studies on households, previous research has placed sensors in a subset of homes and assessed representativeness of the subset using census data, in order to provide a subset of sensor data alongside a full survey dataset [55].
Selection of sensor locations is often aimed at maximising coverage of a known two- or three-dimensional physical space. In these circumstances, mathematical and spatial analysis techniques can be used to provide complete or optimised coverage, for example, for gas detection [56], security monitoring [47], and environmental sensors [57]. However, sensor placement is not always driven by spatial location; it can instead be driven by other potential influences, such as, in the current study, household characteristics and road distance.
Methods for optimising sensor placement include minimising uncertainty in the data estimates [58], evolutionary algorithms and neural networks [59], with machine learning as a promising emerging approach [60]. However, given specific applications and tailored methods, the techniques are not often readily accessible to those requiring sensor monitoring, and cannot guarantee a generalisable solution for users such as building owners [61].
Furthermore, the optimal placement of sensors might not necessarily comprise uniform coverage of a known feature space; rather, the coverage needs to reflect the values and weightings of features of the potential locations themselves. For example, in the current study, the purpose is to deploy limited sensors across a representative sample from our participants' homes and their locations. We therefore wish to use a data-driven approach to ensure we capture the similarities and variety specifically within our cohort.
In the current study we use cluster analysis in a novel methodology for selecting a representative sample of homes for sensor installation. Cluster analysis is an established technique for grouping individuals according to similar feature values [62], whilst also representing variety across groups. The benefits of this approach for sensor placement are that (1) it is generalisable to new settings and applications; (2) it is data-driven, so that the groups are defined by the set of potential sensor locations rather than being influenced by feature combinations that may not exist; and (3) groups are based on known features that are expected to influence the monitored data.
Clustering methods have been widely applied within the field of sensors and monitoring, including selecting locations for sensor placement. However, unlike our study, clustering was performed on sensor data, rather than only using factors believed to influence those sensor data (i.e., in the current study, household and local environment characteristics). In one study, clusters based on sensor data were further refined with spatial clustering, as detailed below [61].
Clustering by electricity usage has also been used to examine the household characteristics within each cluster [3,4], and supports findings that energy usage is influenced by characteristics such as household size and the time spent at home [81,82].
For environmental measures, clustering has been used to determine patterns in temperature and humidity data [83] and comfort levels across clusters [24], to cluster climate data [84][85][86], and establish emergent geographical patterns [87]. Clustering of air pollution data groups monitoring stations or cities with similar readings, which can inform network development [88][89][90] and appropriate pollution reduction methods [91]. Air quality data has also provided a base for testing methods for clustering of time-series data [92,93].
Specifically applied to optimising sensor placement, clustering has also been primarily based on the sensor measurements. For the detection of water leaks, estimated pressure changes across the water distribution network were clustered into different types, and the most informative locations from each type were selected for pressure monitoring [94].
For environmental sensors (temperature, humidity and luminance), sensors were also clustered into groups based on the sensor data [61]. Separate cluster analyses for different areas of the building allowed for the influence of air conditioning. Clusters were refined using spatial clustering results. Strategies were provided and used to place a limited number of sensors for monitoring an office environment. Sensor coverage was validated by comparison with data from a full set of sensors [61].
The purpose of the current study is to apply cluster analysis in a novel methodology for representative sampling from potential sensor sites. We apply this method to sampling across domestic properties in order to inform the placement of gas, water and environmental sensors. In contrast to previous work, cluster analysis is performed using household and environment characteristics, as opposed to sensor data, to create clusters of similar homes. The method is tested with two specific sensor types, but is generalisable to any situation in which influencing factors at potential locations are known. This methodology addresses a resource limitation problem when sensors are limited in number and need to be placed to maximise coverage of the potential dataset. Groups are defined by the features of the potential sensor locations, such that they represent the similarities and diversity within the cohort to be sampled. Sensors are placed into some homes in each cluster to provide a representative sample across all types of home. The resulting clusters and chosen sensor placements are verified using one year of utility usage data and environmental measurements collected at 3-30 min intervals across a maximum of 280 homes.

Overview
In the next section we describe the broader project to provide the context for the current aim of representatively sampling potential sensor sites (Section 3). There are four main steps to this study. The first is to conduct cluster analyses to provide groups of similar homes from which to sample (Section 4). Secondly, the most appropriate cluster solution is chosen for each application (utility sensors and environmental sensors) (Section 5). Thirdly, sensors of each type are installed using the chosen cluster solutions to inform installation sites (Section 6). Finally, we use the sensor data collected to assess whether this clustering approach achieved its purpose of representative sampling of potential sensor sites (Section 7). Fig. 1 provides an overview of the four steps, including the datasets used and the number of homes or sensors available at each relevant step.

The Smartline project
Over 300 households were recruited to take part in Smartline, from domestic properties that are managed by Coastline Housing, a housing association in Cornwall, South West UK. The overarching aim of the Smartline project is to investigate opportunities for technology to support healthier and happier living in homes and communities [95][96][97][98][99][100][101]. To our knowledge, Smartline is the largest domestic project of its kind to date, although non-domestic projects are ongoing [102].

Smartline data
Survey, sensor and housing data were collected from the participating homes, following informed written consent. The large dataset is a unique combination of cross-sectional and time-series data, including household characteristics and behaviours, environmental readings, and utility usages.

Surveys
Face-to-face surveys were conducted with 329 participants between September 2017 and November 2018. In the broader Smartline project, survey data were collected to provide information about the home, the household, occupant behaviours, community interactions, health and wellbeing.

Sensors
On the Smartline project, utility usage sensors and indoor environmental sensors were installed in up to 280 homes from October 2017 onwards. The broader purpose of the sensors, within Smartline, was to provide information on the indoor environment and utility usage, to be considered in relation to occupant health and wellbeing. Environmental sensors external to the home were also installed to provide a context when considering the indoor environment.
Utility readings comprised electricity, gas and water usage. Each sensor was installed on the utility supply meter to provide an overall measure of usage for the whole property. Readings were recorded every 3-7.5 min. Indoor and external environmental measures comprised air temperature, relative humidity (RH), volatile organic compounds (VOCs), equivalent CO2 (eCO2) and particulate matter up to size 2.5 μm (PM2.5), together with PM10 for the external environmental sensors. Measurements were taken every 3-5 min in the living room and bedroom, and every 30 min for the external sensors.
All sensors were manufactured by Invisible Systems Ltd. and installed by Blue Flame (Cornwall) Ltd. from October 2017 onwards. Table 1 provides sensor details. Given the proximity of homes taking part in this study, the sensor gateway in one home can be close enough to transmit readings from sensors in other homes, such that multiple readings can be captured within the update interval. The update interval of 7.5 min is standard for these sensors, providing 8 readings per hour. A shorter interval was chosen for sensors when battery life or mains power allowed (see Table 1). An update interval of 30 min was chosen for external sensors, based on estimates that the battery would then last for two years.

Current study
Smartline electricity and indoor environmental sensors were available for installation in 280 homes. However, there were limited numbers of water, gas and external environmental sensors. The aim of the current study was to select sites for the placement of these sensors in order to capture a representative range of homes and environments across the Smartline cohort area.
In this study, survey responses are used as factors for the cluster analyses in order to create groups of similar homes to be sampled across. The same survey factors are also used as predictors in regression analyses to verify the cluster solutions. More details are provided about the measures used as factors and predictors in relevant sections below. A reference table for terminology is presented in Table 5.
Data from sensors are used to verify whether the cluster analysis approach was successful in informing representative placement of sensors that were limited in number. The purpose of analysing these data is not to draw conclusions about the measures themselves, but is to verify the clustering method.

Clustering methods
This section presents the cluster analysis methods, using known factors to create groups of similar homes, from which sensor sites can be sampled.
Two sets of cluster analyses were conducted on factors representing characteristics of the potential sensor sites. The purpose of each was to determine a set of homes to provide a representative sample for the placement of utility usage sensors and external environmental sensors, one type per analysis. After assigning homes into clusters, we sampled from each cluster to provide a subset of homes that captured the range of characteristics across all homes.
The first analysis was for the placement of sensors monitoring water and gas usage (m³). The second analysis was for the placement of external environmental sensors for temperature (°C), relative humidity (RH, %), volatile organic compounds (VOCs, parts-per-billion), equivalent CO2 (eCO2, parts-per-million), and particulate matter of sizes 2.5 μm and 10 μm (PM2.5 and PM10, μg/m³).
The process for each analysis was the same except different sets of factors were included for the clustering. Because of changes in participating households, availability of data for cluster factors, availability of additional sensors, and practical limitations in installing utility sensors, two stages were conducted for each analysis.
For the utility sensors, the first stage was conducted on 248 homes in November 2017 for the placement of up to 50 water and 50 gas sensors. The second stage was conducted on 298 homes in October 2019. Between the first and second stages 81 households withdrew from the project and 131 joined. In addition, more suitable data became available for the cluster factors, as described below. Installations were restricted to 22 water and 41 gas sensors (see Section 6), so the second-stage clustering also allowed a more appropriate number of clusters to be chosen.
For the external sensors, the first stage was conducted on 291 homes in November 2017 for the placement of 30 sensors. In February 2019, an additional 30 sensors became available. However, at this time, 103 households from the first stage had left the project, while 95 households had joined. To determine sites for placing the additional sensors, we conducted a second stage to repeat the analysis on the updated cohort of 283 homes.

Factors
Factors considered relevant for affecting water and gas use were as follows. Previous research has shown that these factors can affect water usage [103,104] and gas usage [105,106].
1. Property-type (flat or house).
2. Property-size (number of bedrooms or rooms).
3. Property-age (years).
4. The time spent inside the home (two factors in the first-stage analysis).
5. Number of occupants per age-group.
The factors considered relevant for affecting external air measurements were as follows.
1. Latitude.
2. Longitude.
3. Elevation.
4. [...] [110] and DTM data as a measure of surrounding cover (e.g., trees, buildings).
5. The distance from the nearest A-road.
It seems unlikely that there will be sufficient variation in latitude and longitude within the project area to reflect global climate differences. However, latitude and longitude were used in order to capture the relative location of the property within the local context. Variation in latitude and longitude is likely to reflect other underlying groupings or variables that are unknown in advance, such as differences in surroundings that affect wind speed or local microclimates. Latitude and longitude therefore allow sensors to be distributed over the project area and capture differences in local surroundings. We also included other factors that may more directly influence environmental measures. Previous research has shown influences of elevation on temperature and, in part, on wind speed [111] (although those findings were made with larger and higher elevations than in the current study), whilst surrounding cover or shelter is also likely to affect temperature and humidity by mediating wind speed. Finally, distance to the nearest main road can affect air quality [112].

Table 1 (sensor details)
Electricity: Split core current transformer providing energy and current measurements using the output low AC voltage (0-0.33 V AC). Readings recorded every 3 min.
Gas: Pulse transmitter (ref: QC0145b). The sensor takes the pulse output from the utility meter installed in the home, and operates with either rotary or electronic pulse meters. Count of the number of pulses generated per 7.5 min. In line with equipment it is connected to.
Water: Pulse transmitter (ref: QC0145c). The sensor takes the pulse output from the utility meter installed in the home, and operates with either rotary or electronic pulse meters. Count of the number of pulses generated per 7.5 min. In line with equipment it is connected to.

Determining factor values
The factors for the utility-sensor placement analysis (property-type, bedrooms, property-age, part- and full-time employed counts, and age-group counts) were determined using Coastline Housing data in the first stage. In the second stage, rooms (previously bedrooms), time at home (previously employment counts) and age-group counts were replaced by participant survey responses.
Latitude, longitude, Easting and Northing of the properties were determined from the postal code [113]. The DTM and DSM values were calculated as the mean average of the 1-m resolution data across a 9 m × 9 m square surrounding the property's Easting and Northing. Distance to the nearest A-road was measured in metres for each latitude and longitude pair using the distance tool on Google Maps [114].
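The window averaging described above can be sketched as follows. This is a minimal illustration rather than the processing actually used: it assumes the 1-m resolution data are already loaded as a grid of rows and columns, that the property's Easting and Northing have been converted to a cell index, and that cells falling outside the grid are simply skipped. The function name `window_mean` and the example grid are hypothetical.

```python
def window_mean(raster, row, col, half_width=4):
    """Mean of the cells in a square window centred on (row, col).

    half_width=4 gives a 9 x 9 cell window, i.e. 9 m x 9 m at 1-m resolution.
    Assumption: cells outside the grid are ignored rather than padded.
    """
    values = [
        raster[r][c]
        for r in range(row - half_width, row + half_width + 1)
        for c in range(col - half_width, col + half_width + 1)
        if 0 <= r < len(raster) and 0 <= c < len(raster[0])
    ]
    return sum(values) / len(values)


# Hypothetical 20 x 20 grid in which elevation rises by 1 m per row.
dtm = [[float(r) for _ in range(20)] for r in range(20)]
centre_value = window_mean(dtm, 10, 10)  # averages rows 6 to 14
print(centre_value)
```

The same function would be applied to both the DTM and DSM grids for each property.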

Pre-processing
Factors in a cluster analysis can carry different weightings in influencing the clustering calculations due to differences in the variance and magnitude of the factor values. Transformation into z-scores standardises values to the same mean and standard deviation; each z-score is calculated by subtracting the mean of the values and dividing by the standard deviation. All factors were transformed into z-scores to ensure similar variances and therefore similar weights in the cluster analysis. For utility-sensor factors, given small numbers and high skew in some counts of people, and to maintain relative magnitudes across those count data, we used the standard deviation and mean across all people-count factors to calculate z-scores.
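As an illustration of the two standardisations described above, the sketch below computes ordinary per-factor z-scores and z-scores that share one mean and standard deviation across several people-count factors. The factor names and values are invented, and the use of the population standard deviation is an assumption, since the text does not state which form was used.

```python
from statistics import mean, pstdev


def z_scores(values, mu=None, sigma=None):
    """Subtract the mean and divide by the standard deviation.

    Assumption: population standard deviation (pstdev) when none is supplied.
    """
    mu = mean(values) if mu is None else mu
    sigma = pstdev(values) if sigma is None else sigma
    return [(v - mu) / sigma for v in values]


def pooled_z_scores(columns):
    """Standardise several count factors with one shared mean and SD,
    so that relative magnitudes across the count factors are kept."""
    pooled = [v for col in columns.values() for v in col]
    mu, sigma = mean(pooled), pstdev(pooled)
    return {name: z_scores(col, mu, sigma) for name, col in columns.items()}


# Hypothetical values for four homes.
property_age_z = z_scores([10, 20, 30, 40])
people_counts = {
    "aged_under_18": [0, 0, 1, 3],
    "aged_18_65": [1, 2, 2, 1],
}
people_z = pooled_z_scores(people_counts)
print(property_age_z)
print(people_z)
```

With the pooled version, a count of one person maps to the same z-score in every people-count factor, which would not hold if each factor were standardised separately.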

Correlations
In cluster analysis, correlations between factors can be problematic if the factors are representing the same underlying characteristic. A characteristic captured by multiple factors contributes more influence on the cluster process than characteristics that are only captured by one factor. The correlation between each pair of factors was therefore checked. We calculated correlations using values from all homes available.
Figs. 2 and 3 present the correlation coefficients, and show high correlations between some factors for both the utility-sensor and the external-sensor analyses. However, while some of the bases for the factors overlap, it was decided that each of the factors also brings its own quality. For example, number of full-time employed correlates with the number aged 18-65, but the first provides information about daytime occupancy while the second gives the number of adults, both of which may influence utility usage for different reasons.
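The pairwise check described in this section could be sketched as below. The text does not state which correlation coefficient was used or what counted as a high correlation, so the Pearson coefficient and the 0.7 threshold are assumptions made for illustration, as are the factor values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def flag_correlated(factors, threshold=0.7):
    """List factor pairs whose absolute correlation reaches the threshold.

    Assumption: the threshold is illustrative, not taken from the study.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(factors.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical factor values for five homes.
factors = {
    "full_time_employed": [0, 1, 2, 2, 1],
    "aged_18_65": [1, 1, 2, 3, 2],
    "property_age": [30, 10, 50, 20, 40],
}
flagged = flag_correlated(factors)
print(flagged)
```

Flagged pairs are candidates for review only; as in the example of employment and age-group counts above, a correlated pair may still be retained when each factor carries distinct information.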

K-means clustering
We employed a k-means clustering technique, in which the distance between potential sensor sites and the cluster centre (i.e., centroid) is minimised by iteratively updating the membership of the clusters according to the closest centroid then recalculating the centroid location as the mean of the cluster members [62,115].
K-means was chosen above other clustering methods due to the numerical nature of the factors, and its applicability to this dataset [98]. In this study, other benefits of using k-means are to provide a generic and accessible approach for application in other settings. It is probably the most widely used clustering approach, with extensive documentation, tutorials and tools for implementation, and is applicable to any set of numerical factors.  Latitude and longitude data represent fewer than half of the factors in one of the applications we are testing. However, it is worth noting that k-means clustering is sometimes avoided for latitude and longitude data. Distortion can occur due to changes in the distance between longitudes with the curvature of the earth, and locations can be overrepresented or underrepresented due to random selection of starting centroids from the dataset [116]. However, in our study, earth curvature is minimal, and it is advantageous to capture data-driven weightings to ensure representative coverage of our participant homes and locations, rather than obtain uniform coverage of a predefined space.
The k-means algorithm was set to create 50 models with different initial seeds for centroid locations and return the model with the lowest resulting inertia, which is the sum of squared distances between the homes and the centroid within each cluster. Each model converged when the relative change in the inertia was less than 0.0001 between iterations. Models were created for 5 to 25 clusters to determine the number of clusters (k) that provides the most appropriate solution. The whole process was conducted ten times (hereafter called run 1 to 10) and selection of the final solution was guided by consistency of solutions across different runs, which reflects less susceptibility to noise.
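The procedure in the preceding paragraph can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the routine used in the study: starting centroids are sampled from the data points, an empty cluster keeps its previous centroid, and the toy data and the small range of k are invented so that the example runs quickly.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, tol=1e-4, max_iter=300):
    """One k-means model from one random start; returns (inertia, labels, centroids)."""
    centroids = rng.sample(points, k)  # assumption: seeds drawn from the data
    prev_inertia = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Recalculate each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
        inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        # Stop when the relative change in inertia falls to the tolerance.
        if prev_inertia is not None and prev_inertia - inertia <= tol * prev_inertia:
            break
        prev_inertia = inertia
    return inertia, labels, centroids


def kmeans(points, k, n_init=50, seed=0):
    """Fit n_init models from different starts and keep the lowest inertia."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_init)), key=lambda m: m[0])


# Hypothetical standardised factor values: three well-separated groups of homes.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (0.0, 10.0), (0.0, 11.0), (1.0, 10.0), (1.0, 11.0)]

# The study scanned k = 5 to 25; a smaller range suits this toy dataset.
for k in range(2, 6):
    inertia, labels, centroids = kmeans(points, k)
    print(k, round(inertia, 3))
```

Repeating the scan with different values of `seed` would correspond to the separate runs, and comparing the resulting cluster memberships across runs gives the consistency check used to guide the final choice.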
For the two stages of the utility-sensor analysis, factors differed across the two. We therefore conducted two separate cluster analyses, providing two independent cluster solutions. Locations of sensors, which were placed according to the first stage, were verified in the context of the second-stage clusters.
Across the two stages of the external-sensor analysis, the factors remained consistent. We could have therefore added the new homes to the cluster solution from the first stage. However, we instead conducted the second-stage analysis on the complete set of 283 homes independently of the first stage because some households had withdrawn. In addition, 30 additional sensors were still to be placed, so we verified the location of existing sensors, and compensated for any lack of cluster coverage by the placement of a new sensor.
For both the utility-sensor and external-sensor analyses, the classification of homes by cluster was validated, and the overlap between first- and second-stage solutions was quantified to ascertain a consistent structure between the two stages. The supplementary material provides details.

Resulting clusters
This section describes the results of the clustering analyses. Measures of fit are presented, which were used to select the most appropriate cluster solutions.

Measures of fit: inertia and silhouette
Two measures of fit were plotted across the different numbers of clusters. These plots were used to visually determine the point(s) at which a so-called elbow occurs, reflecting a change in the rate of change of the measure. Figs. 4 and 5 provide plots for the run that was ultimately chosen as containing the solution for each analysis.
The first measure used was the inertia, defined earlier [e.g., see Ref. [126]]. The second was the silhouette [127], which is an inverse measure of overlap of the clusters. It is calculated as the normalised difference between the mean distance between members within a cluster and the mean distance between those cluster members and the members in the nearest other cluster.
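For reference, the sketch below computes the silhouette in its commonly used form: for each member, the mean distance to the nearest other cluster minus the mean distance to its own cluster, divided by the larger of the two, then averaged over all members. It assumes Euclidean distance, at least two clusters, and a score of zero for single-member clusters; the points and labels are invented.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette over all points (Euclidean distance assumed)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention assumed for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical example: two tight pairs of homes, grouped well and then badly.
homes = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = silhouette(homes, [0, 0, 1, 1])
overlapping = silhouette(homes, [0, 1, 0, 1])
print(well_separated, overlapping)
```

Values close to 1 indicate compact, well-separated clusters, while values near or below zero indicate overlap, which is why a higher silhouette is preferred when comparing solutions.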

Selecting a solution
Plots for all ten runs revealed a change in rate for each measure at k = 9 and k = 6 for the first and second stages of the utility-sensor analysis respectively, and at k = 11 and k = 7 for the first and second stages of the external-sensor analysis respectively. In all cases, except the second stage of the utility-sensor analysis, the expected number of sensors supported a larger number of clusters than that identified. In addition, inertia and silhouette generally improve with more clusters. It was therefore decided to choose larger numbers of clusters than indicated by these initial solutions.
For the first stage of the utility-sensor analysis, k = 16 was indicated by two of the ten runs, while other values of k > 9 were indicated by one run at most. Of those two runs, one solution gave a cluster with only two members, which was considered too small for our sampling purposes, so the other run was chosen. This run had the second largest silhouette and the fourth lowest inertia of all runs for k = 16. Fewer sensors were placed than originally planned (22 for water and 41 for gas). Therefore, in the second-stage analysis, k = 6 was chosen to allow at least two sensors per cluster. Four runs gave the lowest inertia, and the third highest silhouette. All four provided identical cluster membership.
For the first stage of the external-sensor analysis, all runs indicated visual elbows in the rate of change for inertia at k = 17 and for silhouette at k = 14 and k = 17. Given a maximum of 30 sensors available at this first stage, the solution with 14 clusters was chosen to allow for more sensors per cluster. All runs gave identical solutions for k = 14.
For the second stage of the external-sensor analysis, 30 additional sensors were available, providing a maximum of 60 sensors. We therefore decided to use 15 or more clusters. Elbows were indicated at k = 15 in inertia for three runs and in silhouette for five runs, with little consistency for values of k > 15. One run indicated an elbow at k = 15 for both inertia and silhouette, and provided the lowest inertia and highest silhouette across all runs. This cluster solution was therefore chosen to make recommendations for placing the additional sensors.

Sensor placements
Given the clustering solutions selected, homes within each cluster were ordered by Euclidean distance from their cluster centroid to provide recommendations for the placing of sensors at representative sites. However, there were installation restrictions for many of the sites that were recommended following the first-stage cluster analyses, as detailed below. In addition, some households withdrew from the project between the cluster analyses and the installation of the sensors. In all cases, if a recommended site was not available, the next site in terms of distance from the centroid was instead recommended.
Figs. 6 and 7 show the clusters, with each cluster represented by a different colour. Each dot represents one home, with a line connecting to its cluster centroid. Open circles indicate the sites suggested for sensor placement, and open squares represent sites at which sensors were placed. Given nine (first-stage) or eight (second-stage) factors for the utility-sensor cluster analysis and five factors for the environmental sensors, only two factors are plotted for visual clarity. In Fig. 6, principal components of the cluster factors are used [Scikit-learn PCA; [128]]. However, Euclidean distance of homes from the centroid was determined using all cluster factors. In Fig. 7, to ensure anonymity and to increase visual clarity, jitter is applied to individual homes.
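The ordering and fallback rule described above can be sketched as follows. The home identifiers, factor values and centroid are invented, and the helper names are hypothetical; the sketch assumes each home is represented by its standardised cluster-factor values.

```python
from math import dist


def rank_by_centroid_distance(homes, centroid):
    """Home IDs ordered from nearest to farthest from the cluster centroid."""
    return sorted(homes, key=lambda home_id: dist(homes[home_id], centroid))


def recommend(ranked, unavailable, n_sites):
    """First n_sites homes in rank order, skipping any that cannot be used."""
    available = [home_id for home_id in ranked if home_id not in unavailable]
    return available[:n_sites]


# Hypothetical cluster of four homes with two standardised factors each.
cluster_homes = {
    "A": (0.1, 0.0),
    "B": (1.0, 1.0),
    "C": (0.0, -0.3),
    "D": (2.0, 0.5),
}
centroid = (0.0, 0.0)  # assumed here; normally taken from the cluster solution

ranked = rank_by_centroid_distance(cluster_homes, centroid)
# Home "A" is assumed unavailable (e.g. unsuitable meter or withdrawal).
chosen = recommend(ranked, unavailable={"A"}, n_sites=2)
print(ranked, chosen)
```

Reversing the ranked list gives the homes most distant from the centroid, which is how a distant site could be picked alongside the close ones.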
Based on the first-stage clustering of the utility sensors, the recommended choice was for two sites close to the centroid and one site distant from the centroid, giving 16 clusters * 3 sites = 48 sensors.
Installation of gas sensors was restricted by some homes having a gas meter that does not produce a pulse output, making it unsuitable for the pulse sensor to be used. Installation of water sensors was also restricted due to some participants preferring to avoid the installation disruption (e.g., having a hole cut into the back of the kitchen cupboard). Of the 48 recommended homes, none were suitable for gas sensors, and water sensors were installed as planned in 5. When a home was not available, the next home in terms of distance from the centroid was instead approached for installation. In total, 22 water sensors and 41 gas sensors were successfully installed. However, 11 of the gas sensors were placed in homes not in the first-stage clustering because the participants had been recently recruited. For both water and gas, there were clusters without any sensors placed due to the installation restrictions. See first-stage clusters without open squares in Fig. 6 (water cluster numbers 6, 10, 13, 14, 15; gas cluster numbers 0, 6, 8, 13, 14, 15).
For second-stage clusters, some households with sensors installed following the first stage had withdrawn from the project or had missing survey data, leaving 21 homes with water sensors and 32 with gas sensors. Given all sensors had already been installed following the first-stage analysis, no recommendations were made following the second stage.
Across the final six clusters, there were 1-5 water sensors per cluster, with four clusters having at least four sensors. The water-sensor home that was closest to the cluster centroid was positioned as follows. In two clusters, it was at the closest home to the centroid, and in three it was tenth closest at most. In one cluster (number 3), with 49 unique members and four water sensors, it was at the 36th home from the centroid. There were 2-14 gas sensors per cluster. In four clusters, a gas sensor was within eight homes of the centroid, and in the remaining two clusters (numbers 2 and 4) it was 27th.
Fig. 6. Utility-sensor clusters for the first (right panels) and second (left panels) stages, for water (upper panels) and gas (middle panels), and coefficient weight for "1st" and "2nd" principal components (PC) for each cluster factor (lower panels). See text for details.
For the first stage of the external sensors, the recommended choice was for one site close to the centroid and one distant from the centroid, giving 14 clusters * 2 sites = 28 sensors. The installation of external sensors was restricted by requiring a suitable mount for the sensor. Of the 28 recommended sites, 7 had sensors installed. As for gas and water, when a site was not available, the next site in terms of distance from the centroid was instead considered. Of the 30 sensors planned, 27 sensors were successfully installed including three installed at sites not in the first-stage cluster analysis.
Recommendations for second-stage placements were made under constraints from the 27 sensors that had already been placed. Two of the 15 clusters contained no existing external sensors, and five had only one. Distribution of the additional 30 sensors was recommended across the clusters to give 2-6 sensors per cluster. Twenty-eight additional sensors were successfully placed, achieving 2-6 sensors per cluster, and 55 sensors altogether. Of the 55 sites recommended, sensors were placed at 53, with two of the homes each having two sensors at different orientations. Four clusters had only two sensors. Three of these clusters contained at most eight sites. In the other, all sites in the cluster were located on adjacent roads. In 14 of the 15 clusters, an external sensor was positioned at the closest distance to the cluster centroid, and in the remaining cluster it was placed at the second closest distance.

Clustering verification
In this section, we verify the appropriateness of the clustering methodology to provide a representative sample of sensor sites from the full set of potential sites. If the clusters created are successful in capturing a range of sensor sites, then we would expect the resulting sensor data to vary across clusters, and vary with the factors used to define the clusters. Such variation would suggest that the clusters are meaningful with respect to the sensor data being collected.
Overall, we wanted to test for differences in sensor data between clusters. Water, gas and external sensors were deliberately distributed across clusters, resulting in limited numbers per cluster, and providing limited statistical power for the effects of cluster. However, other related measures across all homes can be used to test for an effect of cluster. First, we test for relationships between measures that could be related, for example between internal and external temperature, and between electricity and gas usage. Second, we test for differences between clusters for these related measures, which should occur if the clusters are meaningful for these types of measures. Third, we test for relationships between the cluster factors and the data from water, gas and external sensors.
For each sensor, readings were used from 1st November 2018 to 31st October 2019, taking the mean hourly usage for utilities and the mean reading for environmental measures. To allow for any differences in the interval between the readings, all readings were interpolated to a resolution of 1 min before means were calculated. For utilities, the usage was summed for each hour before calculating the mean hourly usage. Sensors were excluded if readings did not span the date-range, and outliers were also excluded. For utilities, values were excluded that were more than 12 standard deviations from the mean and dates were excluded if zero usage was recorded. Nine gas sensors were excluded for a mean hourly usage below 0.01 m³ or for zero usage on 25% of days or more. Thirty-two electricity sensors were excluded for a mean hourly usage below 0.08 kWh or for a visually abrupt change in usage during the date-range.
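The resampling and outlier steps above can be illustrated with a short sketch. It is not the study's pipeline: the function names are hypothetical, linear interpolation is an assumption (the interpolation method is not stated), and the sample standard deviation is assumed for the 12-standard-deviation rule.

```python
import math
from bisect import bisect_right
from statistics import mean, stdev


def interpolate_to_minutes(times_min, values):
    """Linearly interpolate irregular readings onto a 1-minute grid.

    times_min: strictly increasing timestamps in minutes.
    values: one reading per timestamp.
    Linear interpolation is an assumption of this sketch.
    """
    start, end = math.ceil(times_min[0]), math.floor(times_min[-1])
    grid = []
    for t in range(start, end + 1):
        j = bisect_right(times_min, t)  # index of the first timestamp > t
        if j == len(times_min):  # t coincides with the last timestamp
            grid.append(values[-1])
            continue
        t0, t1 = times_min[j - 1], times_min[j]
        v0, v1 = values[j - 1], values[j]
        grid.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return grid


def drop_outliers(values, n_sd=12.0):
    """Keep values within n_sd sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return list(values)
    return [v for v in values if abs(v - m) <= n_sd * s]


# Irregularly spaced readings: the raw mean (3.0) over-weights the early,
# densely sampled period; the mean of the 1-minute grid does not.
grid = interpolate_to_minutes([0.0, 1.0, 10.0], [0.0, 0.0, 9.0])
print(mean([0.0, 0.0, 9.0]), mean(grid))
```

Interpolating before averaging matters when the reporting interval changes during the year, because otherwise periods with frequent readings would dominate the mean.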
Most external sensors did not capture data during the full year. Therefore, to maximise the number of valid external sensors, we used two restricted date-ranges: winter, from 1st December 2018 to 28th February 2019, and summer, from 1st May 2019 to 31st July 2019. Table 2 provides the final numbers of sensors for each type. Table 3 provides descriptive statistics for the factors used for the cluster analysis.

Correlations between related sensor measurements
The factors used for the utility-sensor cluster analysis should also influence electricity usage. We therefore tested for correlations between water and gas usages and electricity usage in homes with the relevant utility sensors in place. Fig. 8 shows significant positive relationships between utility usages. For the external sensors, the external air could influence the air internal to the home. We tested for correlations between external-internal pairs of measures using only those homes at which external sensors were installed. Relative humidity (RH) is dependent on air temperature because temperature determines the maximum amount of water that air can hold. To compare external and internal humidity, we therefore compared absolute humidity as well as RH. Absolute humidity was calculated for each reading from the temperature (T) and RH sensors following [129].
There are significant correlations between electricity and the other utility usages, and there is reason to expect that the factors used for the water and gas cluster analysis could also affect electricity. There were no significant correlations between the internal and external environmental measures. However, the lack of significant relationships could reflect the complexity of the indoor environment, as discussed in the Discussion section below.
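For readers who wish to reproduce the humidity comparison, absolute humidity can be derived from T and RH as sketched below. The exact expression taken from [129] is not reproduced in this section, so the sketch assumes a commonly used Magnus-type approximation for saturation vapour pressure combined with the ideal gas law for water vapour; the function name and constants are assumptions and may differ from the formula used in the study.

```python
import math


def absolute_humidity(temp_c, rh_percent):
    """Approximate absolute humidity in g/m^3 from T (deg C) and RH (%).

    Assumption: Magnus-type saturation vapour pressure and ideal gas law.
    This may not match the expression in the paper's reference [129].
    """
    # Saturation vapour pressure over water, in Pa.
    e_sat = 611.2 * math.exp(17.67 * temp_c / (temp_c + 243.5))
    # Actual vapour pressure from relative humidity.
    e = e_sat * rh_percent / 100.0
    # Ideal gas law: density = e / (R_v * T), with R_v ~ 461.5 J/(kg K).
    density_kg_m3 = e / (461.5 * (temp_c + 273.15))
    return 1000.0 * density_kg_m3


# The same RH corresponds to very different water content at different
# temperatures, which is why absolute humidity was compared as well as RH.
print(round(absolute_humidity(20.0, 50.0), 2))
print(round(absolute_humidity(5.0, 50.0), 2))
```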

Relationships between clusters and sensor data
In this section, we wish to test whether sensor readings vary across clusters. To have sufficient statistical power we used the measures that have large numbers of sensors in each cluster, namely the electricity and the indoor environmental measures.
Each household was assigned to the cluster with the nearest centroid for each of the utility-sensor clustering and the external-sensor clustering, both using the second-stage clusters.
For the utility-sensor clusters, an effect of cluster for electricity usage would suggest that the correlated water and gas measures also vary across clusters.
For the external-sensor clusters we test for an effect of cluster for each internal environmental measure. Despite no significant correlations between the indoor and external environmental measures, an effect of cluster would suggest that the factors used for the external-sensor clustering do influence environmental conditions. Data were analysed using a one-way k-level (number of clusters) ANOVA for each dependent measure. When Bartlett's test for equal variances was violated (p < 0.05), a Kruskal-Wallis test was performed instead.
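The test-selection rule described above can be sketched with SciPy as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: the function name `cluster_effect` and the example data are hypothetical, and it relies on `scipy.stats.bartlett`, `scipy.stats.f_oneway` and `scipy.stats.kruskal`.

```python
from scipy import stats


def cluster_effect(groups, alpha=0.05):
    """Test for an effect of cluster on one dependent measure.

    groups: list of lists, one list of readings per cluster.
    Uses a one-way ANOVA unless Bartlett's test rejects equal
    variances (p < alpha), in which case Kruskal-Wallis is used.
    """
    _, p_bartlett = stats.bartlett(*groups)
    if p_bartlett < alpha:
        stat, p = stats.kruskal(*groups)
        return "kruskal-wallis", stat, p
    stat, p = stats.f_oneway(*groups)
    return "anova", stat, p


# Hypothetical readings for three clusters with equal variances.
equal_var = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 3.0, 4.0, 5.0, 6.0],
    [6.0, 7.0, 8.0, 9.0, 10.0],
]
# Hypothetical readings where one cluster is far more variable.
unequal_var = [
    [9.9, 10.0, 10.1, 10.05, 9.95],
    [-40.0, 0.0, 20.0, 50.0, 100.0],
    [4.8, 4.9, 5.0, 5.1, 5.2],
]
print(cluster_effect(equal_var)[0])
print(cluster_effect(unequal_var)[0])
```

Returning the name of the test alongside the statistic makes it clear whether a reported value is an F or a χ² statistic, as in the results that follow.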
For the utility-sensor clusters, there was a significant effect of cluster for electricity usage, N = 110, χ²(5) = 29.979, p < 0.001. For the external-sensor clusters, effects of cluster are provided in Table 4. There were significant effects of cluster for temperature in the living room and RH in both rooms.
Table 3 Descriptive statistics for each clustering factor for homes with water, gas, and external sensors.
The analyses have revealed effects of cluster, but interpretation of influences on the differences is limited given that the cluster factors are not represented in the analysis. Linear regressions were therefore conducted to establish which cluster factors were predictors of the sensor readings. Regressions were only conducted for those measures that revealed an effect of cluster. Variance inflation factors for all predictor variables in all regressions were below 4, except latitude (13.3-13.7) and longitude (10.0-10.2). The plots of residuals against fitted values showed no evidence of heteroscedasticity for all regression models.
The external-sensor cluster factors were used as predictors in three separate regressions to predict temperature in the living room and RH in each room. The model for RH in the bedroom was not significantly different from the null model with no predictors (F(5, 231) = 1.45, p = 0.207). The regression models for temperature and RH from the living room exhibited a significant overall relationship between the predictors and the outcome (both F(5, 225) ≥ 2.76, p < 0.020). Temperature showed a strong trend towards increasing further south (latitude coefficient = −54.113, p = 0.055) and increased with cover (coefficient = 0.104, p = 0.025). RH increased further north (latitude coefficient = 225.270, p = 0.025) and further west (longitude coefficient = −95.977, p = 0.023), and decreased with cover (coefficient = −0.369, p = 0.025).

Relationships between cluster factors and utility-sensor and external-sensor data
In this final verification, we test for variation of the sensor data with the factors used for clustering. We used the cluster factors in regressions as predictors for the measures from the sensors that were placed as a result of the clustering process. Sensors were purposely distributed across clusters to capture data from a range of homes. An ANOVA was therefore not appropriate given the small numbers per cluster (after outlier removal: 0 to 6). Separate regressions were performed for homes with water sensors, homes with gas sensors, and the external sensors.
For homes with gas sensors, there were no occupants in the 0-12 and 13-17 age-groups, and the variance inflation factors were high for age-group 66+ (13.0), property-age (34.2) and number of rooms (12.5). We therefore removed property-age from the predictors and summed the number of occupants into a single measure.
Variance inflation factors for all predictor variables in all regressions were below 4, except for latitude (15.67 or 15.84) and longitude (8.54). The plots of residuals against fitted values showed no evidence of heteroscedasticity for all regression models.
Water usage was successfully predicted by the model, while the gas usage model only revealed a trend towards significance over the null model with no predictors (F(8, 10) = 7.51, p = 0.002 and F(4, 8) = 2.99, p = 0.087, respectively). Fig. 10 provides the regression coefficients and p-values.⁴ Water usage increased with the numbers of 13-17 and 66+ year-olds, and the strongest predictor for gas usage was number of rooms. There were also multiple trends towards significant predictors for water. Effects of factors differed between utilities, suggesting that usages of all utilities are not necessarily affected by the same factors.
For the external sensors, each home was assigned the readings from the nearest external sensor. Only homes that had a unique set of cluster factor values were included in the analysis, giving 38 homes for winter and 50 for summer. All measures were successfully predicted by the respective model in both winter and summer, except VOCs in winter, which was not significant, and summer particulate matter, which showed trends towards significance. Figs. 11-13 provide the Fs, regression coefficients and p-values.
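Assigning each home the readings of its nearest external sensor can be sketched as below. How distance was measured is not stated in this section, so the sketch assumes great-circle (haversine) distance on latitude and longitude; the function names, sensor ids and coordinates are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_sensor(home, sensors):
    """Return the id of the sensor closest to the home.

    home: (lat, lon) tuple; sensors: dict mapping sensor id to (lat, lon).
    Haversine distance is an assumption of this sketch.
    """
    return min(sensors, key=lambda sid: haversine_km(*home, *sensors[sid]))


# Hypothetical sensor locations and one hypothetical home.
example_sensors = {"s1": (50.26, -5.05), "s2": (50.21, -5.30)}
print(nearest_sensor((50.25, -5.06), example_sensors))
```

Because several homes can share the same nearest sensor, homes with identical cluster factor values would contribute duplicate rows to a regression, which is consistent with restricting the analysis to homes with a unique set of factor values.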
In winter, temperature increased further south, further east and with distance from an A-road, and decreased with elevation. In summer, temperature also decreased with elevation. RH in winter showed patterns opposite to temperature, increasing further north and further west, and with elevation. In summer, there was a trend towards an increase in RH with elevation.
In summer, VOCs increased further west, and decreased with surrounding cover and with distance from an A-road. In winter, eCO₂ increased further west and with elevation. In summer, eCO₂ also decreased with distance from an A-road. Particulate matter (PM) increased further north in both winter and summer, and further west in winter. PM levels increased with elevation in both winter and summer. PM decreased with distance from an A-road, although these relationships only revealed a trend towards significance in summer. These results demonstrate that the external measures have relationships with the factors used to define the clusters.

Discussion
We present a methodology to achieve representative sampling for placement of a limited number of sensors. The methodology used cluster analysis to segment potential sensor sites into similar groups, and then selected sites from across those groups to provide a representative sample over the variety of potential sites. Meaningfulness of the clusters with respect to the sensors was verified using the sensor data collected. These verification results provide evidence that the clusters exhibit differences across sensor data, and that the cluster factors selected for clustering the potential sensor sites have relationships with relevant sensor measurements.
The aim of this study was to develop, implement and test the cluster-based methodology. While not a direct aim of the study, it is first interesting to discuss and interpret the relationships found in the verification analyses. We then outline the pitfalls revealed by our study. The overall aim of the study is then considered, and finally future examples for use of the sensor data are presented.
We tested for correlations between the different types of sensor data. For utility usage, significant correlations were observed between electricity and water usages and between electricity and gas usages. These relationships are to be expected given that water and gas usage are influenced by similar factors as those found to influence electricity usage [131].
No significant correlations between indoor and external air measures were observed. The lack of relationships is perhaps not surprising given that the indoor environment results from a complex interaction of the built environment (e.g., building thermal properties, permeability, solar exposure, insulation levels) and human behaviours (occupancy, heating, ventilation)⁶ [e.g., Ref. [132]]. We might have expected stronger relationships during the summer, reflecting more window-opening behaviour in response to warm weather. However, during the winter, it seems more likely that the indoor environment would arise from internal sources due to lack of ventilation and use of heating.⁶ In addition, correlations between the indoor and outdoor environments can be lagged, such that internal changes follow external changes, with external air quality levels and window-opening behaviours mediating that relationship [133]. Such dependencies may have been missed in the current analyses given that correlations were assessed using mean values rather than between individual streams of sensor data.
Our verification analyses revealed significant relationships between the factors used to create the clusters and the sensor data. We found that water usage is related to the number of teenagers in the home [104], while gas usage is weakly related to number of rooms [106]. The relationships we observed are in line with previous work showing relationships between energy usage and household size, but not with time spent at home [81]. These relationships between the cluster factors and sensor readings demonstrate that the factors are meaningful with respect to water and gas usage, and provide evidence that the clustering methodology for sampling achieved its purpose.
For the external sensors, latitude and longitude were included to distribute sensors across the locality and capture local variation in environmental influences that were unknown (e.g., microclimates). The analyses showed significant relationships between some external environmental measures and latitude or longitude, despite other potentially stronger influences from the surrounding environment (e.g., sensor orientation).
Increased elevation was associated with decreased external temperature and increased RH, eCO₂, and particulate matter, with stronger patterns in winter than summer. Cover showed no significant relationships with any external environmental measure, except decreased VOCs in the summer. Increased distance from a main road was associated with decreases in VOCs and particulate matter, in line with previous findings [112].
Despite the lack of correlations between indoor and external environmental measures, three indoor measures did exhibit differences across the clusters, and the factors used to create the clusters did show significant relationships with the external sensor data. These results are consistent with meaningful factors and clusterings for informing the deployment of the external sensors.
The process highlights two main pitfalls and subsequent recommendations for sampling using cluster analysis. First, the installation of utility sensors was limited such that some recommendations could not be implemented, resulting in some clusters with no sensors installed. This sparsity was resolved by repetition of the cluster analysis. However, we recommend that the cluster analysis is constrained such that each cluster includes a minimum number of sites that are known in advance to be suitable for sensor installation. Alternatively the technical feasibility of installing a sensor at each location could be assessed, and clusters created to include a range of different levels of feasibility.
Second, the participant base underwent changes between the cluster analyses and the sensor data analysis. Such changes are to be expected, given that time is required to allow sensor data to be collected. However, the impact could be attenuated in a variety of ways, depending on suitability for the project. The number of clusters could be decreased, such that the loss of participants has a reduced impact on each cluster. The entire set of potential participants could be known and clustered in advance to ensure that participants of all types are represented, such that new recruits can be assigned to existing clusters. Time could be allowed for stabilisation of the participant base. For example, some of our participants withdrew when the survey was conducted, at the beginning of the project. Allowing such a milestone to pass before performing the cluster analysis would have reduced the number of homes subsequently lost.
Fig. 11. Coefficients (and p-values⁵) for the predictors of temperature and RH in winter (N = 38) and summer (N = 50).
⁶ We thank an anonymous reviewer for raising these points.
The main aim of this study was to provide a simple (non-specialist) and generalisable solution to the deployment of sensors when numbers are limited. This resource limitation presents a problem for selecting representative sites at which to place sensors [46,47], whilst also avoiding redundancy in resource deployment [48]. As reviewed earlier, previous approaches to selecting locations for a limited number of sensors are usually specific to the type of sensor or setting, and can involve application-specific techniques. Clustering approaches use sensor data to provide similar groupings [61,94], which requires collecting or estimating the data before recommendations can be made. While the specificity of these approaches is beneficial for some applications, generalisability and accessibility for future users can be limited [61]. Our novel methodology offers the following benefits.
Cluster analysis is an accessible approach that can be readily implemented, facilitated by freely available software packages, and online tutorials and resources. It results in groups of similar items, which can then be sampled in a targeted and strategic manner to provide representation across the diverse range of potential sensor sites.
In this study, we present two applications of this clustering method. The factors we chose to cluster potential sensor locations in the current research were specific to households and the outdoor environment. However, the method does not rely on using these factors, rather they reflect information that we had available in advance that might affect sensor readings. Other applications may have different potentially influencing factors. For example, the relevant factors differ between our two example applications (utilities and environment). In other applications these factors could be replaced by any potential explanatory variables (e.g., Fig. 14).
This methodology is therefore applicable to any type of sensor as long as at least some of the factors influencing sensor measurements are known. This data-driven approach allows creation of similar groups of possible locations that together cover the sensor space, allowing targeted yet unbiased selection of representative locations.
Biased sampling could be accounted for after data collection. However, sufficient variation within the sample would still be required to be able to characterise the bias. For example, if all homes selected had four rooms, then the effect of number of rooms on the data could not be estimated without strong assumptions, so extrapolation to homes with a different number of rooms would not be possible.
Furthermore, cluster analysis for representative sampling can be applied to domains other than sensor placement. For example, large cohorts of participants can be categorised into archetypes or personas that capture the core characteristics of the cohort without preconceptions or bias in categorisation, given a bottom-up data-driven approach [98]. Such personas can then be used to sample from the cohort for more detailed investigation, such as in-depth qualitative interviews, while being confident that different ranges of multidimensional characteristics are being represented [e.g., Ref. [134]].
Beyond this methodology, and in line with the aims of the broader project, ongoing and future studies will use the data collected from these sensors to investigate relationships between energy usage, indoor and external environments, including temperatures and air quality. There are many research questions that could be addressed with these data. For example, differences between indoor and external temperature data could be used to assess thermal performance, and in conjunction with energy usage could be used to assess energy efficiency. Separately, changes over time in relative humidity or air quality could be assessed in response to environmental interventions (e.g., positive pressure units).
The sensor dataset will also be published for future researchers.

Conclusion
The aim of this study was to test the use of cluster analysis to provide a representative sample of sensor sites when sensors cannot be installed at all available sites. Clusters were successfully created for two types of sensors in separate analyses. Sites for sensor installation were then sampled from each cluster according to the distance of the site from the cluster centre, in order to capture typical and atypical members of each cluster group. Results of analyses to verify the clusterings showed that clusters did capture differences across sensor data, and that the cluster factors selected for segmenting and sampling the potential sensor sites had relationships with relevant sensor measurements. These results suggest that the clusters were meaningful, and successfully captured a range of homes and sensor sites. In conclusion, when sensor deployment is limited, for example, by sensor numbers or access issues, this cluster-based methodology can provide a representative subset of sensor sites by sampling across clusters in order to capture the variety across potential sites.

Terminology
Please see Table 5.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.
Table 5
Factor: In this study, factor denotes a variable or characteristic of a home, household or potential sensor site. It represents a numerical property. The combination of factor values describing a given item is used to calculate distance in the cluster analysis. Items with similar factor values will be grouped together.
Cluster analysis: A process to create groups (i.e., clusters) of items that are similar to each other and that differ across groups. First, cluster centres are randomly created, then a two-step iterative process is used to minimise the inertia. (1) Items are assigned to their nearest cluster. (2) The cluster centres are recalculated as the average of their item members.
Inertia: A measure of how well the items are clustered into similar groups. It is calculated as the sum of squared distances between the homes and the centroid within each cluster. Distance is the distance between factors used to define the clusters.
Silhouette: A measure of overlap of the clusters, with a larger value representing less overlap. It is calculated as the normalised difference between the mean distance between members within a cluster and the mean distance between those cluster members and the members in the nearest other cluster.
VOC: Volatile organic compounds. These gases can be emitted by substances such as paints, cleaning products, furnishings and cosmetics.
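The definitions of cluster analysis, inertia and silhouette given above can be made concrete with a small pure-Python sketch. It is illustrative only and is not the software used in the study: the function names are hypothetical, Euclidean distance is assumed, initial centres are drawn from the items themselves (one possible reading of "randomly created"), and the silhouette uses the usual (b − a) / max(a, b) normalisation, which is assumed to match the definition above. At least two non-empty clusters are required for the silhouette.

```python
import math
import random


def kmeans(items, k, seed=0, init=None, max_iter=100):
    """Two-step iterative clustering: assign items, then recalculate centres.

    items: list of tuples of factor values. init: optional starting centres;
    otherwise k items are drawn at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    centres = list(init) if init is not None else rng.sample(items, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each item to its nearest cluster centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(x, centres[c])) for x in items
        ]
        if new_labels == labels:  # assignments stable, so centres are final
            break
        labels = new_labels
        # Step 2: recalculate each centre as the average of its members.
        for c in range(k):
            members = [x for x, lab in zip(items, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres


def inertia(items, labels, centres):
    """Sum of squared distances between items and their cluster centre."""
    return sum(math.dist(x, centres[lab]) ** 2 for x, lab in zip(items, labels))


def mean_silhouette(items, labels):
    """Mean of (b - a) / max(a, b) over items; larger means less overlap."""
    clusters = sorted(set(labels))
    scores = []
    for i, x in enumerate(items):
        own = [
            math.dist(x, y)
            for j, y in enumerate(items)
            if labels[j] == labels[i] and j != i
        ]
        if not own:  # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(x, y) for y, lab in zip(items, labels) if lab == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two hypothetical, well-separated groups of sites described by two factors.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centres = kmeans(points, 2, init=[(0.0, 0.0), (10.0, 10.0)])
print(labels, inertia(points, labels, centres), mean_silhouette(points, labels))
```

In practice, inertia and silhouette would be compared across candidate numbers of clusters, and factors on different scales would be standardised before distances are calculated.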