Towards Smart Sustainable Cities: Addressing semantic heterogeneity in building management systems using discriminative models

Building Management Systems (BMS) are crucial in the drive towards smart sustainable cities because they have proven effective in significantly reducing the energy consumption of buildings. A typical BMS is composed of smart devices that communicate with one another to achieve their purpose. However, the heterogeneity of these devices and their associated meta-data impedes the deployment of solutions that depend on the interactions among these devices. Automatically inferring the semantics of these devices using data-driven methods provides an ideal solution to the problems brought about by this heterogeneity. In this paper, we undertake a multi-dimensional study of inferring the semantics of IoT devices using machine learning models. Using two datasets with over 67 million data points collected from IoT devices, we developed discriminative models that produced competitive results. In particular, our study highlights the potential of Image Encoded Time Series (IETS) as a robust alternative to statistical feature-based inference methods. Leveraging just a fraction of the data required by feature-based methods, our evaluations show that this encoding competes with, and in many cases outperforms, traditional methods.


Introduction
Over 50% of the world's population live in cities [1]. This trend, which is expected to continue for the foreseeable future, will inadvertently create challenges for the infrastructure, economy and environment of cities [1,2]. In light of this, cities are clearly a strategic sector in the fight against global issues such as climate change. Naturally, there has been an intensified global push to make urban areas more sustainable in order to alleviate these challenges [2,3,4,5]. An effective approach to achieving sustainability has been to make cities smart through the use of ICT [6,7]. To eliminate the common conflation of smartness and sustainability in the context of cities, we adopt the notion of smart sustainable cities [8]. According to the United Nations Economic and Social Council, a smart sustainable city is an innovative city that uses information and communication technologies (ICTs) and other means to improve quality of life, efficiency of urban operation and services, and competitiveness, while ensuring that it meets the needs of present and future generations with respect to economic, social, environmental and cultural aspects [9].
Several approaches seek to reduce energy consumption in cities using ICT. This is an important avenue, supported by studies estimating that ICT solutions could cut total greenhouse gas emissions by up to 16.5% [10]. Among these solutions, Building Management Systems (BMS) are easily one of the most significant [11,12,13]. A BMS can help reduce a building's energy consumption by almost 15% by managing its heating, lighting, ventilation and air conditioning [14].
Typically, a BMS employs Internet of Things (IoT) technology composed of smart devices and a communication infrastructure [13]. These smart devices could be sensors that monitor environmental phenomena or actuators that perform certain actions [11]. Beyond reducing the energy demands of buildings, a BMS can be used to improve the well-being of building dwellers and promote livability [15]. The significance of a BMS becomes clearer when one considers the relevance of buildings in society. It has been shown that people spend up to 90% of their lives inside buildings [16]. This implies a huge cost in providing the services, such as heating, ventilation and lighting, that make buildings habitable and comfortable [15]. In the US alone, buildings consume 30% of the energy generated and make up nearly 41% of the annual energy budget [17,18,19]. With this in mind, it becomes clear how improving building management can benefit broader critical issues such as climate change, pollution, and sustainable and livable cities [19,11,2]. Consequently, the link between Building Management Systems and smart sustainable cities becomes too strong to ignore [17,18,11].
The smart devices in a BMS need to communicate with one another in order to achieve a shared goal [11]. However, this communication, otherwise known as interoperability, can be impeded by heterogeneity. Indeed, this has been identified as one of the key challenges in deploying a true smart sustainable city [6,11].
In this paper, we associate the heterogeneity of an IoT device with the data generated by that device. Typically, IoT devices are labelled using meta-data [18]. An example is B1_Temp_Sensor, which refers to a temperature sensor in Building 1. We refer to this labelling as the semantics of the device and recognise two categories of heterogeneity. 1) The first is brought about by the different naming conventions employed by vendors and maintainers of these devices [18,20]. For example, a sensor IoT device may be labelled either Sensor or Sns although the two names denote the same device type recording the same phenomenon. 2) The second is introduced by subtle differences between the phenomena recorded by devices. For example, two devices may both be labelled temperature whereas in reality they differ, with one reading taken from a living room and the other from a boiler room.
In addition to these problems, the meta-data could be incorrectly labelled, outdated or missing altogether [18]. Bearing all this in mind, we further clarify the challenges by considering a BMS composed of multiple sensing devices in two scenarios. In the first scenario, the BMS is processing IoT data whose labelling is incorrect, outdated or absent. The problem here is to associate this data with the correct semantic group and update the labelling. In the second scenario, the BMS is processing data that is labelled identically but is semantically different (see the second category of heterogeneity above). Here, the problem is to deduce this semantic difference and update the labelling to reflect it. In both situations, these problems can be addressed by automatically inferring the semantics of devices using data-driven methods [21]. In our experiments, we used datasets that incorporate these heterogeneities and challenges.
There have been some attempts to address this problem using paradigms such as multi-label classification and active learning [18,17,20]. However, the literature lacks a comprehensive exposition of data-driven methods in conjunction with the domain challenges posed. Hence, in this paper, we carry out a quantitative study of both supervised and unsupervised learning methods for inferring the semantics of IoT devices. In particular, we propose the use of Image Encoded Time Series (IETS) as an alternative to traditional descriptive feature-based methods for model development.
In addition, we compare our proposed technique to traditional methods. Our study is based on two large datasets that contain a total of over 67 million data points with 22 semantic types.
We formulate the problem as one of multi-class classification. In our evaluations, we observed that models trained with IETS features generated from a fraction (one month) of the datasets outperform feature-based techniques in many cases. This observation highlights the potential of IETS as an alternative to feature-based model development strategies and can address the problem of data availability that impedes the development of inference models in IoT settings. To the best of our knowledge, this is the first attempt to provide a comparative study of machine learning methods for semantic inference of IoT devices. Our contributions in this paper can be summarized as follows:
• We carry out a study that develops and quantitatively compares supervised and unsupervised inference methods on two large IoT datasets.
• We propose Image Encoded Time Series (IETS) and demonstrate that it can be a viable alternative to traditional feature-based methods for model development.
The rest of this paper is organized as follows. Section 2 discusses related work. We describe the proposed IETS method in Section 3. Section 4 describes the methods and materials used. Results are presented and discussed in Section 5, followed by our conclusions in Section 6.

Related Work
Several approaches have been proposed to address the problem of semantic heterogeneity in IoT devices. Project Haystack was developed to standardize the naming conventions of devices in BMS [22]. This convention is yet to be widely adopted, which may be explained by its struggle to accommodate the vast array of IoT devices and semantics in the wild. Moreover, with legacy systems, it may be cheaper or more feasible to maintain existing naming conventions [11]. Hence the focus of recent approaches on automating the inference of these semantics. An approach to automate the deployment of an energy management system by automatically inferring semantic labels was proposed in [23]. The authors demonstrated their proposal by utilizing linguistic techniques and computing the semantic similarity values of the labels. In [24], the authors sought to infer the location of sensors in a building. In particular, they focused on evaluating the feasibility of linear correlation measures over spatial dependency measures. In [18], the authors proposed a data-driven framework to infer the semantics of building systems. Similar to our approach, they considered the meta-data of these devices as their semantics. The work in [18] is the closest to our work in this paper.
Inference methods for IoT data are related to methods from the field of signal processing, such as time series classification. Time series classification methods can be grouped into instance-based and feature-based methods [25,26,27,28]. Instance-based classification refers to the matching of new time series data to a previously identified class based on an encoding of the data stream. An example of this paradigm is time series shapelets [28], where the authors proposed the use of a segment of a time series that is representative of its class, called a shapelet. While this idea is interesting in theory and plausible on small datasets, identifying an adequately representative shapelet can easily become computationally expensive on large datasets. Furthermore, [29] performed a comparative evaluation of methods and demonstrated that shapelets may not be suitable for the classification of time-series images.
Another example of instance-based classification is described in [30], where the authors proposed a method to infer the semantics of time series data using the depictive slopes of their linear approximations. Feature-based classification transforms the time series into a vector of features derived from the data itself, which is then used to make classifications [25]. Our work in this paper is an example of feature-based classification.
The literature on classification using time series images is still developing. In [31], the authors proposed an encoding technique for time series data that mitigates the effect of data drift. Paparrizos et al. [32] introduced GRAIL, an algorithm for representation learning on time series. While both approaches are methodologically similar to ours, they are compared only with instance-based methods, whereas we compare against feature-based methods.
Hatami et al. [29] performed classification of time series images using neural networks. They transformed the time series data into a 2-dimensional array using recurrence plots [33]. In comparison with other methods for time series classification, their approach outperformed state-of-the-art methods in more than 50% of cases. Wang et al. [34] proposed two new encoding strategies for time series data: Gramian Angular Fields and Markov Transition Fields. These techniques exhibited competitive results when evaluated across standard datasets. Our work marries advances in time series imaging to semantic inference of IoT devices. Furthermore, we comprehensively evaluate the performance of discriminative algorithms across multiple dimensions. To the best of our knowledge, this is the first body of work to carry out a study of this nature.

Image Encoded Timeseries (IETS)
The goal of IETS is to derive an alternative representation for time series data using a transformation function, where this new representation is used as input features to a model. For this purpose, we considered two encoding strategies: recurrence plots [33] and Gramian Angular Fields [34]. These encodings have produced results competitive with the state of the art [29]. However, in our initial experiments, we observed that the recurrence encoding performed better. Thus, this paper only discusses the recurrence encoding strategy and the results derived using it.
Recurrence is a common characteristic of many physical systems. For example, one would expect outside temperature readings to recur at certain points in time. The recurrence plot was proposed by Eckmann et al. [33] as a visualization tool. Its basic premise is to portray the times at which recognised states of the system recur. The first step is to extract the m-dimensional phase space trajectories from the original time series. For the purpose of this study, we set the dimension to 2. The output of the recurrence function is a square matrix where each cell indicates whether the states represented by the trajectories recur at the corresponding pair of times. The recurrence plot is formally expressed as follows:

R_{i,j} = \Theta\left(\epsilon - \left\| \vec{x}_i - \vec{x}_j \right\|\right), \quad i, j = 1, \ldots, N

where N is the number of considered states \vec{x}_i, \epsilon is a threshold distance, \| \cdot \| is a norm and \Theta(\cdot) is the Heaviside function. In our experiments, we restricted the number of observations for each instance to 720 time points, which corresponds to one month's worth of observations.
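As an illustrative sketch, the recurrence computation above can be expressed in pure Python as follows; the embedding dimension, threshold value and sine test signal are assumptions chosen for the example, not the paper's exact settings:

```python
import math

def recurrence_matrix(series, dim=2, eps=0.1):
    """Recurrence plot: R[i][j] = Heaviside(eps - ||x_i - x_j||),
    where x_i are dim-dimensional phase-space states obtained by
    time-delay embedding of the series."""
    states = [series[i:i + dim] for i in range(len(series) - dim + 1)]
    n = len(states)
    R = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Heaviside step of (eps - Euclidean distance between states)
            R[i][j] = 1 if eps - math.dist(states[i], states[j]) >= 0 else 0
    return R

# A periodic signal recurs, so off-diagonal cells light up as well.
signal = [math.sin(0.5 * t) for t in range(30)]
R = recurrence_matrix(signal, dim=2, eps=0.1)
```

The resulting matrix is symmetric with an all-ones main diagonal, since every state trivially recurs with itself.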

Dataset
We used two datasets: the first is the openly available REFIT smart home dataset [35]; the second is an industrial dataset. Both datasets contain a combined total of 67 million data points. As can be seen from the meta-data of both datasets, the REFIT dataset exhibits heterogeneity across the phenomena being measured. For the industrial dataset, the heterogeneity spans both the measurements and the devices. This reflects the types of heterogeneity considered in this paper, discussed in Section 1. See Table 1 for a summary of the datasets.

Preprocessing and Feature Generation
Both datasets contained missing values and non-uniform time intervals. The interval of the original recordings was less than a minute and we observed that most of the readings were redundant. Hence, we resampled all data recordings to hourly intervals by averaging. We handled the missing values by linear interpolation as described in [36]. Our problem is formulated as one of multi-class classification, and we encoded the semantic labels as integers. For all the experimental learning tasks, we generated three types of features: statistical features, IETS features and cosine distances. The statistical features were descriptive statistical values which we deemed sufficiently discriminative: the mean, standard deviation, variance, stationarity, kurtosis, and skewness. We encoded stationarity on an ordinal scale of 0 to 2, where 2 is the highest stationarity. The IETS features were generated as described in Section 3 using the recurrence function. We used different aggregation methods to scale down the dimensionality of the features, as discussed in Section 5.4. Finally, the cosine features were generated using the cosine distances between every instance of the random variables.
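A minimal pure-Python sketch of this preprocessing (hourly averaging followed by linear interpolation of missing hours); the epoch-second timestamps and readings below are illustrative, not drawn from the datasets:

```python
def resample_hourly(timestamps, values):
    """Average all readings falling in the same hour bucket.
    timestamps are seconds since epoch; hours with no readings give None."""
    buckets = {}
    for t, v in zip(timestamps, values):
        buckets.setdefault(t // 3600, []).append(v)
    hours = range(min(buckets), max(buckets) + 1)
    return [sum(buckets[h]) / len(buckets[h]) if h in buckets else None
            for h in hours]

def interpolate_linear(series):
    """Fill None gaps by linear interpolation between known neighbours."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (out[b] - out[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = out[a] + step * (i - a)
    return out

# Two sub-hourly readings in hour 0, none in hour 2.
hourly = resample_hourly([0, 600, 3600, 10800], [1.0, 3.0, 4.0, 8.0])
filled = interpolate_linear(hourly)  # gap at hour 2 interpolated
```

In practice a dataframe library would handle both steps, but the sketch makes the bucketing and gap-filling logic explicit.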

Learning Techniques
In this study, the inference problem is regarded as a supervised learning one. Each of the methods listed below was chosen after extensive preliminary experiments. For brevity, we only present and discuss the most promising methods in this paper.

Supervised Learning Algorithms
We employed six supervised learning algorithms for our experiments, chosen because they had the best performance in our preliminary experiments: Logistic Regression, Random Forest (RF), Decision Trees (DT), K-Nearest Neighbours (KNN), AdaBoost, and Support Vector Machines (SVM) [37,38,39,40,41]. We evaluated these algorithms using the F-score and accuracy. Here, accuracy is the proportion (in percent) of correctly classified data. We use both the macro- and micro-averaged variants of the F-score. The general formulation of the F-score is as follows:

F = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

where

Precision = \frac{True\ Positives}{True\ Positives + False\ Positives}, \quad Recall = \frac{True\ Positives}{True\ Positives + False\ Negatives}
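The macro- and micro-averaged variants can be sketched as follows in pure Python; the toy label vectors in the usage line are illustrative:

```python
from collections import Counter

def f_scores(y_true, y_pred):
    """Return (macro F-score, micro F-score) for multi-class labels.
    Macro averages per-class F1; micro pools TP/FP/FN over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(t, false_pos, false_neg):
        prec = t / (t + false_pos) if t + false_pos else 0.0
        rec = t / (t + false_neg) if t + false_neg else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

macro, micro = f_scores([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
```

For single-label multi-class data, the micro-averaged F-score equals accuracy, which is why the two can be treated interchangeably on balanced datasets.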

Clustering
We employed unsupervised learning using clustering techniques to explore the structure of the datasets. Three clustering algorithms were used: spectral clustering [42], K-medoids [43], and K-means [44]. We evaluated the clustering methods using purity and entropy as described in [45]. We define both metrics in equations 5 and 6:

Purity(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j| \quad (5)

where \Omega = \{\omega_1, \ldots, \omega_k\} is the set of clusters returned and C = \{c_1, \ldots, c_j\} is the set of classes.

Entropy(\Omega, C) = \sum_{\omega \in \Omega} \frac{N_\omega}{N} H(\omega) \quad (6)

where \Omega and C represent the set of clusters and the set of classes respectively as in equation 5, N_\omega is the size of cluster \omega and N is the size of the entire dataset. H(\omega) is the entropy of a single cluster, defined as follows:

H(\omega) = - \sum_{c \in C} \frac{|\omega_c|}{n_\omega} \log \frac{|\omega_c|}{n_\omega}

Here, n_\omega is the size of cluster \omega and, for every class c \in C, |\omega_c| is the number of type-c points in cluster \omega.
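A small sketch of both metrics, with each cluster represented as the list of ground-truth labels of its members; the toy clustering below is illustrative:

```python
import math
from collections import Counter

def purity(clusters):
    """Fraction of all points that match their cluster's majority class."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

def entropy(clusters):
    """Size-weighted average of the per-cluster label entropy H(w)."""
    total = sum(len(c) for c in clusters)
    score = 0.0
    for c in clusters:
        n = len(c)
        h = -sum((k / n) * math.log(k / n) for k in Counter(c).values())
        score += (n / total) * h
    return score

clusters = [["temp", "temp", "humidity"], ["power", "power"]]
```

A perfect clustering has purity 1 and entropy 0; mixed clusters lower the former and raise the latter.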

Measure of Similarity
We computed similarities between semantic types using the cosine similarity measure. Cosine similarity is defined as follows:

Cosine\ Similarity(X_s, X_y) = \frac{X_s \cdot X_y}{\|X_s\| \, \|X_y\|}

where X is a vector representation of the features; here, the features are the statistical descriptive features generated for each instance. The subscripts s and y denote two random variables. Based on the features extracted, we computed the cosine similarities between each instance of the random variables.
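A direct sketch of this measure in pure Python; the vector values in the examples are illustrative:

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Orthogonal feature vectors score 0; parallel ones score 1 (up to rounding).
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```
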

Algorithmic Complexity
The computational cost of the operations described in this paper is modest for both datasets. The time to generate the features is constant per time unit, since we process one time-series file at each time unit; it follows that this process can be parallelized to improve speed. The memory requirement for this step is O(n), where n is the length of a time series file. The final stage of the feature generation process stacks the features into a matrix, which increases the memory and storage requirements by a factor of K, where K is the number of time series files processed. The time and memory required to train the models are O(n), with n denoting the number of features in that case.

Results and Discussion
In this section, we present and discuss the results of our experiments as implemented by the methods outlined in Section 4. All experiments were carried out on a Dell PowerEdge R710 Linux server with 20.5 GB of RAM and an Intel Xeon octa-core CPU at 2.40 GHz.

Clustering
We performed exploratory analysis of the datasets using clustering methods. We considered three algorithms: spectral clustering, K-medoids and K-means. The features used here are the descriptive statistical features. We evaluated the performance of these algorithms using two metrics: purity and entropy, as defined in equations 5 and 6. A high purity percentage and a low entropy score are better. We present the results of these algorithms in Figures 2 and 3. The colours in these figures represent the semantic labels plotted against the TSNE vectors on both axes [46]. We can see a clear community-like structure in the distribution of the semantic labels using the statistical descriptive features. This suggests that supervised methods can be successful in inferring the semantics of the IoT devices.
Furthermore, evaluating the performance of the algorithms, as shown in Table 2, indicates that spectral clustering performs best, with scores of 68% (purity) and 1.19 (entropy) for the REFIT dataset and 29.3% (purity) and 2.0 (entropy) for the industrial dataset. Interestingly, the lower purity score recorded for the industrial dataset could be explained by the fact that some semantic labels in this dataset are very similar but measured by different devices. This relates back to the problems highlighted in Section 1.

Cosine Distances
We further explored the dataset by measuring the similarity between the semantic labels in both datasets. We computed the cosine similarity using the average of 20 random samples of each semantic label. We present the results of this analysis in Fig. 1.
In the polar plot for the REFIT dataset shown in Fig. 1, we see that Air Temp (at 0°) is closest in similarity to Nominal Temp, then Relative Humidity, Air Humidity, Actual Temp, Surface Temp, Brightness, Intensity, Power Consumption and Motion, in that order. We also observe that the angular distance between semantic types is comparable to the semantic similarity between the types. For example, Power Consumption and Intensity are very close together (almost completely overlapping), while there is a significant distance between Motion and its closest semantic neighbour (Intensity). These observations strongly suggest that cosine similarity captures the semantic similarity between the labels.

Supervised Learning with Descriptive Statistical Features
These experiments were carried out using the algorithms outlined in Section 4. The results presented here are the averages taken over five runs of each model on the test set. At each run, the test set is shuffled. Prior to each experiment, the class sizes for each dataset were balanced by downsampling to the smallest occurring class size. For both datasets, we split the training set and test set equally. We performed five-fold cross-validation on the training set and the best hyper-parameters were selected using an iterative search technique. The labels remain as described in Section 4. The features used here are the statistical descriptive features extracted from the time series data as described in Section 4. We evaluated the models using the F-score and present the results in Fig. 4. The y-axis holds the F-score values for each semantic label. The labels for each dataset are explained using the legends to the right of each plot. In Fig. 4, we can see that for both datasets AdaBoost performs the best. These results are very impressive, as the model achieves an F-score of 70% in many cases, particularly when we consider how semantically close the labels in the industrial dataset are, as shown in Fig. 1.

Figure 1: Polar representation of the cosine similarity values of the semantic types present in the datasets. Each arrow represents a semantic label. The labels at 0° are reference points. The angular distance between any two labels denotes their similarity.

Supervised Learning using Cosine Similarity Metric
For the next investigation with supervised learning, we use the cosine distance as defined in Eqn. 8. We first validated its potential as described in Section 5.1.2, observing that the cosine distances could adequately discriminate the labels in both datasets. Hence, we proceeded to use these distances as features for the supervised learning algorithms. We generated an m × n feature matrix, where m is the number of time series files and n is the number of semantic labels. For each semantic label, we held a signature vector, which is the average of twenty random time series files of that type. With these signature vectors, we computed a cosine similarity value between each data point and each signature vector. Thus, a matrix cell for a particular time series file holds the similarity value between that file and one of the signature vectors. The models were trained using the same methodology described in Section 5.2. We present the results in Table 4.
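The signature-vector construction just described can be sketched as follows; the label names, tiny vectors and default sample count below are illustrative assumptions:

```python
import math
import random

def _cosine(x, y):
    # Dot product over the product of the two vector norms.
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def signature_vectors(series_by_label, samples=20):
    """One signature per label: the element-wise mean of up to
    `samples` randomly chosen series of that label."""
    sigs = {}
    for label, series_list in series_by_label.items():
        chosen = random.sample(series_list, min(samples, len(series_list)))
        sigs[label] = [sum(col) / len(col) for col in zip(*chosen)]
    return sigs

def cosine_features(series_list, sigs):
    """m x n matrix: similarity of each series to each label signature."""
    labels = sorted(sigs)
    return [[_cosine(s, sigs[l]) for l in labels] for s in series_list]

sigs = signature_vectors({"temp": [[1.0, 0.0]], "humidity": [[0.0, 1.0]]})
features = cosine_features([[1.0, 0.0]], sigs)
```

Each row of `features` is then used as the predictor vector for one time series file.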

Supervised Learning with Image Encoded Time Series (IETS)
As described in Section 3, we generated features for the semantic labels in the datasets using the recurrence function. We present the recurrence plots of some of these labels in Figures 5 and 6. Recall that we select only 720 observations from the time series files for all experiments involving IETS. Notice in Fig. 5 the clear textural similarity between time series of type temperature and humidity. We see this similarity again in Fig. 6, where the time series of type temp exhibit the same textural similarity observed in Fig. 5. The temperature time series from both datasets look very similar but distinctly different from the other types.
We use these encoded representations as features for the supervised learning algorithms. The output of the recurrence function is a matrix of size 720 × 720. This presents a huge computational cost to process, which we mitigate by downsampling. Thus, we downsampled the image matrix by aggregation to four reduced dimensions: 8, 16, 32 and 48. We then transformed these matrices into vectors by flattening before using them as predictors in the algorithms. The same learning methodology described in Section 5.2 is adopted here. We present the evaluations of the models trained using IETS in Tables 3 and 4.
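The downsample-then-flatten step can be sketched as follows; block averaging is one reasonable reading of "aggregation" (an assumption), and the sketch further assumes the target size divides the matrix dimension evenly, with a 4 × 4 toy matrix standing in for the 720 × 720 recurrence output:

```python
def downsample_flatten(matrix, size):
    """Average each (n/size x n/size) block of an n x n matrix,
    then flatten the reduced size x size result into a vector."""
    n = len(matrix)
    step = n // size
    features = []
    for bi in range(size):
        for bj in range(size):
            block = [matrix[i][j]
                     for i in range(bi * step, (bi + 1) * step)
                     for j in range(bj * step, (bj + 1) * step)]
            features.append(sum(block) / len(block))
    return features

quadrants = [[1, 1, 2, 2],
             [1, 1, 2, 2],
             [3, 3, 4, 4],
             [3, 3, 4, 4]]
vec = downsample_flatten(quadrants, 2)  # one mean per quadrant
```
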

Comparison of Methods
We evaluate the models trained using the descriptive statistical features, IETS features and cosine distances. In all cases, we balanced the datasets by downsampling to the smallest occurring class size. The results presented here are the averages calculated over five runs of each experiment. We present our evaluations in Tables 3 and 4.

Figure 2: Plots of the REFIT dataset. The first plot is that of the ground truth semantic labels. For visualization purposes, we have used the TSNE projections for the axes of all the plots. Spectral clustering performs best with a purity score of 68% and entropy score of 1.19 compared to other clustering algorithms. See Table 2 for the evaluations of the algorithms.
In Table 3, we present the macro-averaged F-scores using the descriptive statistical features (DF) and IETS. The best result of the two methods for each algorithm is highlighted in bold. We can see that for the REFIT dataset, using the descriptive features leads to better results than using IETS. On the other hand, for the industrial dataset, IETS leads to an overall improvement in model performance. Given the semantic closeness of the labels in the industrial dataset, we deduce that features generated using the IETS encoding are better at discriminating semantically close labels.
In Table 4, we present the best average results for each case with the standard deviations alongside. We compare the results for models trained using the descriptive statistical features (DF), IETS features and cosine distances (CD). For the IETS features, we show the results for the reduced dimensions considered: 8, 16, 32 and 48. Furthermore, we show how the models fare across varying training sizes: 20%, 40%, 70% and 80%. In each case, the test size is the complement of the training size. This experimental methodology is designed to highlight the generalization power of the models. The metric used here is accuracy, which is equivalent to the micro-averaged F-score for a balanced dataset. Values highlighted in bold are the best results for that case with statistical significance. We see in Table 4 that IETS is the clear winner in many cases.
Furthermore, in the cases where it does not win, it is almost as good as the best performing method for that case or is the next best method. For example, in the industrial dataset, with a training size of just 20%, the features derived from IETS give the best accuracies of 72% and 68%, respectively. The significance is further emphasized when we take into consideration that IETS uses just one month's worth of observations, compared to the other methods that use the entire dataset. It can also be seen that IETS seems to perform better on the industrial dataset. This may be related to the fact that many of its semantic types are closely related but measured by different devices (e.g. FCU temp vs AirTemp). While the descriptive features may have been good enough for the REFIT dataset, the close similarities exhibited in the industrial dataset are best discriminated using IETS.

Figure 3: Plots of the Industrial dataset. The first plot is that of the ground truth semantic labels. For visualization purposes, we have used the TSNE projections for the axes of all the plots. Spectral clustering performs best with a purity score of 29.3% and entropy score of 2.0 compared to other clustering algorithms. See Table 2 for the evaluations of the algorithms.

Conclusions
In this paper, we conducted a study investigating the feasibility and performance of different supervised learning methods for inferring the semantics of IoT devices in smart buildings. In particular, we highlight the potential of Image Encoded Time Series (IETS) as an alternative to descriptive statistical features. We show that, using just a fraction of the data in a time series file, this encoding can produce features that compete with traditional statistical feature-based methods. This is similar in spirit to data-driven paradigms such as budgeted learning [47,48] and zero- and one-shot learning [49,50,51] that seek to improve learning on small datasets.
The approaches described in this study will be beneficial to machine learning research endeavours, particularly for IoT in smart buildings, where the lack of good-quality data can be an impediment [21]. A key takeaway from the evaluations is that a machine learning panacea for the problem of semantic heterogeneity does not yet exist. We see this in the fact that our approaches perform differently on the two datasets. For example, the IETS method outperforms the traditional feature-based methods on the industrial dataset (see Table 3). Thus, these methods could outperform each other in different scenarios.
We recognize that this study did not incorporate seasonality in the experimental process. This is a limitation, as seasonality could affect the results produced from the encoding for different semantic labels. For future work, we will investigate the effect of time series seasonality on the IETS inference methods. Likewise, we will examine the relation between the number of time series points and the performance of the inference models in an attempt to determine a globally optimal size for different semantic types. In addition, we will extend our methods to paradigms like transfer learning, as it will be interesting to see how they scale to different tasks and domains [52,53].

Table 4: Comparing supervised methods. The metric used here is accuracy (equivalent to the micro F-score). Results presented here are the averages taken over five runs; standard deviations are shown alongside. For each method, the result of the best performing algorithm is presented. The best result for each training size is in bold*. Second best results are underlined. Best results represent statistically significantly superior values relative to the second best at the α = 0.05 level of significance using the t-test.