Data-driven Modelling of Smart Building Ventilation Subsystem

Considering the advances in building monitoring and control through networks of interconnected devices, effective handling of the associated rich data streams is becoming an important challenge. In many situations the application of conventional system identification or approximate grey-box models, partly theoretic and partly data-driven, is either infeasible or unsuitable. The paper discusses and illustrates an application of black-box modelling, achieved using data mining techniques, for the purpose of smart building ventilation subsystem control. We present the implementation and evaluation of a data mining methodology on data collected over one year of operation. The case study is carried out on four air handling units of a modern campus building for preliminary decision support for facility managers. The data processing and learning framework is based on two steps: raw data streams are compressed using the Symbolic Aggregate Approximation method, and the resulting segments are then input into a Support Vector Machine algorithm. The results are useful for deriving the behaviour of each unit in its various modes of operation and can be built upon for fault detection or energy efficiency applications. Challenges related to online operation within a commercial Building Management System are also discussed, as the approach shows promise for deployment.

I. INTRODUCTION

A. Smart Buildings Context
Buildings have become major drivers of energy consumption and quality of life challenges in modern, urbanised society. As the potential impact of implementing advanced sensing, computing and communication is steadily realised, they have also become smart. In a technical context we define smartness as having the building comply with the dual objectives of occupant awareness and energy efficiency, achieved by modelling, simulation and control over the network of field devices and controllers. This leads to increased requirements on the control strategies, which must balance online the needs of the building users for comfort with the needs of the building operator for reduced costs. Furthermore, dynamic energy pricing and electrical grid balancing constraints impose real-time requirements which are often addressed by means of demand response (DR) schemes.
Data mining has usually been a tool for knowledge discovery from large databases of transactional information, customer records or other types of structured and unstructured business information. Recently, data mining methods have started to be applied to measurements coming from the physical world, driven by the emergence of Internet of Things (IoT) solutions in industrial and smart city scenarios [1]. This has been enabled to a large extent by more efficient electronic components, communication protocols and advanced algorithms for data processing and dissemination. At their core, data mining techniques attempt to identify meaningful, non-trivial patterns in large bodies of data, which are subsequently built into validated models and re-evaluated periodically to incorporate new information from the underlying process. The models can be purely data-based or enhanced with domain-specific expert knowledge. The data sources are either large static databases of historical measurements, stored on dedicated machines or in a distributed manner in the cloud, or dynamic data streams that have to be evaluated online for timely outcomes.

Data Availability: The data used to support the findings of this study are available from the corresponding author upon request.

The authors are with the Department of Automatic Control and Industrial Informatics, University "Politehnica" of Bucharest, 313 Splaiul Independentei, 060042 Bucharest, Romania (grigore.stamatescu@upb.ro).
We focus on the application of such advanced data processing techniques in the field of building automation systems (BAS). In particular, as buildings are a key driver of energy consumption in modern society, the collected information can be used to implement improved decision support systems and control algorithms that optimise overall building operation. Within a modular BAS structure, the heating, ventilation and air conditioning (HVAC) subsystem plays a dominant role as the main driver of both energy consumption and user satisfaction with the working environment, the latter mainly through subjective assessment of indoor thermal conditions. Air handling units (AHU) are used to ventilate the building with the optimal quantity of fresh air, filtering and conditioning it, through both heating and cooling, within limited margins. AHU control is responsible for enabling the best ratio of outside and recirculated air delivered to the indoor thermal zones of the building. In the context of this contribution the focus is on medium to large commercial buildings, where the savings potential has already been validated and the application of advanced data processing and control is thus justified in practical deployments.

B. Related Work
Though academic and industrial research have developed advanced control strategies, in practice most buildings in which some sort of BAS is available operate on rule-based control or Proportional Integral Derivative (PID) control loops with static schedules and significant oversight from the building operator. A broad overview of data science methods applied to buildings is presented by [2]. The main focus is on assisting building professionals in the field of energy management with appropriate decision support tools. The data mining process is described in conjunction with the specific nature of the application, going from raw collected data to preprocessing, processing, modelling, prediction and validation. [3] proposes a two-level load forecasting architecture covering both very short term load forecasting, with a prediction horizon of several hours to one day, and short term load forecasting, with energy prediction from one day to several weeks. Data processing follows a Lambda Architecture which splits the data into two layers. Each layer treats the data with a different Autoregressive Integrated Moving Average (ARIMA) algorithm, depending on the required time window, with real-time processing for hourly prediction or slower processing for daily prediction.
A paper focused on energy-efficiency improvements leveraging available building-level data for data mining is [4]. The authors list the main predictive tasks in which data mining of large quantities of measurements and contextual information is relevant. These cover building energy demand prediction, building occupancy and occupant behaviour, and fault detection and diagnosis (FDD) for building systems. [20] and [21] further argue, through broader studies, the relevance of data-driven approaches in timely building energy efficiency applications.
Deployment of distributed sensor networks for finer grained spatio-temporal monitoring of indoor conditions is performed by [6]. The authors argue that the statistical modelling of the indoor environment as non-parametric Gaussian processes can lead to reliable information that is fed back to the building management system in order to improve the HVAC control. Wireless sensors can be implemented with limited costs as compared to conventional wired sensors and the monitoring architecture can be adjusted dynamically in order to best capture field level information. In [7] a thermal comfort application using collected HVAC IoT data is presented. Building-level benchmarking data sets [8] are highly important to assess algorithm performance and produce reproducible outcomes. The authors present a large database of one year data from 507 non-residential building energy meters, mainly from university campuses. A model based predictive control for maintaining thermal comfort in buildings is applied in [9]. The optimal comfort index is achieved by a cost function depending on both occupant comfort and energy cost.
As compared to traditional model-based control (MBC), data-driven control (DDC) represents an emerging field of study which accounts for the need to manage the data deluge produced by dense temporal and spatial monitoring of various systems. A broad survey on the specific nature of DDC and its comparison to MBC in various control structures is given by [10]. Within this concept, the steps of data mining and classification for prediction and assessment are seen mainly as a higher level supervisor to field level control loops, tuning control parameters and set-points and providing contextual information which contributes to improved robustness. One good application example, used here as a reference for DDC, relies on random forests of regression trees [11]. In this case multi-output regression trees are used to represent the system dynamics over the prediction horizon and the control problem is solved in real time in closed loop with the physical plant.
Big data analytics for smart city electricity consumption is presented in [12]. The authors use computational intelligence algorithms to model the consumption of eight university buildings. The outcome consists of offline policies to optimise energy usage across the campus. In [13] a different application is described, using decision trees for occupancy estimation in office buildings. Occupancy modelling and estimation is a critical task in smart buildings, as the occupancy level and its accurate forecasting directly impact the HVAC conditioning strategy of the building and help avoid wasteful control. Fault and anomaly detection with a rule-based system is described in [5]. The main contributions relate to building automated anomaly detection rules with regard to energy efficiency. This is achieved by combining data mining on historical data with expert information about energy efficiency. [14] illustrate the results of the BRIDGE diagnosis strategy on a dedicated building sensor test bed. By considering sensor faults as data deviations, FDD can accurately detect abnormal conditions. FDD for ventilation subsystems is also covered by [22] using a graph-based approach. [15] describe in detail the explicit data modelling process for smart building evaluation. A case study is carried out for energy forecasting of a target building using techniques such as Bayesian Regularized Neural Networks and Random Forests. SVM are also considered but provide weaker results in this specific scenario. Finally, in [16] SVM is applied to a regression problem where, instead of a class label, the output of the algorithm consists of a numeric value.
The current paper also builds upon our own previous work dedicated to decision support systems for renewable energy campus microgrids [17] and to Model Predictive Control (MPC) for building simulations [18]. Earlier work has also included exploratory data analysis from a single building AHU, without further analysis and implementation of learning models at a larger scale [19]. In this context we have developed the contributions towards better understanding of collected data from smart buildings. Figure 1 summarises this section with regard to the role of data mining for DDC in this scenario. This generic approach is mapped onto our particular scenario as well. Each of the ventilation units implements local control loops which have to comply with setpoints given by the building operator according to occupancy schedules or seasonal adjustments. Without influencing the low-level control, we look at input-output data to indirectly characterise the system behaviour, with the end goal of improving the control loop parameters and setpoints through a learning framework.

C. Objectives and Structure
The main objective of this paper is thus to illustrate the application of a data mining methodology to smart buildings for more energy efficient operation. We consider the chosen approach to be representative beyond the current case study of learning the behaviour of the ventilation subsystem, as well as extensible to other relevant subsystems.
The main contributions of the work can be summarised as follows:
• argumentation of data-driven building modelling as an alternative to conventional system identification or grey-box model approximations;
• presentation of a case study that approaches the black-box modelling of air handling unit (AHU) data as a data mining problem with control applications.
The rest of the paper is structured as follows. Section II focuses on several specific data mining techniques which can serve as a suitable tool for HVAC subsystem data analysis and processing in smart buildings. Section III reviews the exploratory data analysis steps taken to better understand the data, in conjunction with expert knowledge from the facility management staff. Section IV presents a case study of one year of multi-AHU data collected in a medium size multi-functional campus building. We present here the main results of our study, namely the compressed representation of the time series data, feature engineering and finally the results of training various multi-class SVM classifiers on three input datasets. Section V highlights the conclusions of the paper with an outlook on future work and implementation.

II. DATA MINING PIPELINE: THEORETICAL BACKGROUND AND PROPOSED APPROACH
Time series data-mining [23] refers to the application of data mining methodology and algorithms to time series data. When the data is stored in conventional relational databases or unstructured databases, these methods help by representing the data into a new format and allowing the search in the new representation which leads to fewer original data being inspected. Subsequently the method has to offer guarantees about the search results and provide a means to visualise and assess the results. Types of problems that can be solved using time series data mining include similarity matching, clustering and classification of the relevant segments. For classification the segments are determined using methods such as Fourier decomposition, Haar coefficients resulting from wavelet transformations and Piecewise Aggregate Approximation (PAA), becoming features, and then using suitably defined distance metrics the individual examples are assigned to predefined classes. This type of approach has become highly relevant in the Internet of Things age where data collected from the physical world includes a relevant time component whether in the monitoring and control of manufacturing lines, transportation system, the environment and smart city or smart building applications.
We present our approach to applying time series data mining techniques to the problem of building subsystem modelling. The end goal is to extract relevant information from building sensor time series which can finally assist the facility management department by being integrated into a decision support system or directly into the control framework. The approach illustrates two main techniques. Initially we use Symbolic Aggregate Approximation (SAX) [24], [25] to assign symbols to parts of the time series, yielding a unified lower dimension representation. Support Vector Machines (SVM), a common technique for classification of discrete patterns of events, are subsequently used to classify the collected data in order to achieve an indirect model of indoor conditions and operation. The application is focused on the measured exhaust air temperature of the air handling units, as a proxy of mixed indoor temperature. We use the other measurements as well as contextual information (outside weather, time of day, scheduling, weekends etc.) as additional features in the classification model.
Time series data mining includes various methods that have been applied to non-parametric modelling. SAX has been used by many researchers in various fields and relies on assigning symbols to time series segments based on the observed range of values. Ranges are identified through the data histogram or in a uniform manner. The method provides linear complexity and opens up the use and application of multiple statistical learning tools. One of the tuning factors, the number of regions, can significantly influence the quality of the result.
The method extends PAA [26] by assigning symbols to the PAA identified segments. The segments can thus be incorporated into a Markov model to compute the probability of the observed patterns for future observations. According to the PAA method description, a time series $X = (x_1, \ldots, x_n)$ of length $n$ is approximated by a vector $\bar{X} = (\bar{x}_1, \ldots, \bar{x}_M)$ of any length $M \le n$. Each element $\bar{x}_i$ of the vector is calculated by:

$$\bar{x}_i = \frac{M}{n} \sum_{j = \frac{n}{M}(i-1)+1}^{\frac{n}{M} i} x_j$$

This means that we reduce the dimensionality of the time series from $n$ to $M$ samples by initially dividing the original data into $M$ equally sized frames and then computing the mean value of each frame. Putting the mean values together we obtain a new sequence which is considered to be the PAA transform (approximation) of the original data. With regard to computational considerations, the PAA transform complexity can be reduced from $O(NM)$ to $O(Mm)$, with $m$ being the number of frames as tuning parameter of the method. The distance measure is defined as:

$$D_{PAA}(\bar{X}, \bar{Y}) = \sqrt{\frac{n}{M}} \sqrt{\sum_{i=1}^{M} (\bar{x}_i - \bar{y}_i)^2}$$

It has been shown by the proposers of the method that PAA satisfies the lower bounding condition and guarantees no false dismissals, such that:

$$D_{PAA}(\bar{X}, \bar{Y}) \le D(X, Y)$$

where $D(X, Y)$ is the Euclidean distance between the original series.

Support Vector Machines [27] for classification problems extend linear boundary classification problems by imposing a minimum distance $b > 0$ from the class separator. In the case of two classes with labels $y_i \in \{-1, +1\}$, and with $w$ scaled such that the training examples $x_i$ closest to the separator lie at distance $b = 1 / \|w\|$:

$$y_i (w^T x_i + w_0) \ge 1$$

SVM searches for the optimal solution by minimising the objective function:

$$J(w) = \frac{1}{2} \|w\|^2$$

which corresponds to maximising the distance $b$. As the classes might be very close to each other or even overlapping, a slack variable $\xi$ is introduced as follows:

$$y_i (w^T x_i + w_0) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

To keep the value of the slack variable $\xi$ small, a penalising factor $C$ is added to the cost function:

$$J(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$$

The SVM parameters $w$ and $b$ are obtained by minimisation of $J$ with the imposed constraints. The vector $w$ represents a linear combination of the training examples as:

$$w = \sum_i \alpha_i y_i x_i$$

The general form of SVM is expressed as:

$$f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i \, k(x_i, x) + w_0 \right)$$

where the kernel function $k(\cdot, \cdot)$ can take various forms accounting for non-linear class discriminants: linear, polynomial, Gaussian, radial basis function, etc.

This is one significant advantage of SVM, which comes at the cost of increased computing effort and optimisation constraints.
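The PAA transform and its lower-bounding distance described above can be illustrated with a minimal sketch. It assumes that the series length is an exact multiple of the number of frames; the function names `paa`, `euclidean` and `paa_distance` and the two toy series are hypothetical and are not taken from the study.

```python
import math


def paa(series, m):
    """Piecewise Aggregate Approximation: mean of each of m equally sized frames."""
    n = len(series)
    if n % m != 0:
        # Assumption of this sketch: frames have identical integer length.
        raise ValueError("series length must be a multiple of m")
    frame = n // m
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(m)]


def euclidean(x, y):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def paa_distance(x_bar, y_bar, n):
    """Distance between two PAA vectors, rescaled to the original length n."""
    m = len(x_bar)
    return math.sqrt(n / m) * euclidean(x_bar, y_bar)


# Two toy series of length n = 8, reduced to M = 4 frames (illustrative values).
x = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0, 7.0, 5.0]
y = [2.0, 2.0, 5.0, 1.0, 4.0, 6.0, 9.0, 9.0]
x_bar, y_bar = paa(x, 4), paa(y, 4)

# The PAA distance never exceeds the true Euclidean distance (lower bounding).
print(paa_distance(x_bar, y_bar, len(x)), "<=", euclidean(x, y))
```

Because the reduced distance is a lower bound, a search over PAA vectors can discard candidates without the risk of dismissing a true match.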
In our classification problem, namely the correct assignment of streaming data to one of the installed ventilation units, we encounter a multi-class SVM problem, as an extension of the binary SVM problem described above. This is handled using the one-versus-one approach [28]. This method leads to the evaluation of all pairs of target classes, with k(k − 1)/2 binary classifiers. The outcome of the classification is achieved by counting the votes that the individual classifiers assign to the test example. Figure 2 summarises the proposed approach with regard to the implemented data processing and learning steps, as well as their use either for human decision support or in automatic control loops at the BAS level.
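The one-versus-one voting scheme can be sketched as follows. The pairwise rule `nearest_centre` and the per-unit centre values are purely hypothetical stand-ins for trained binary SVMs; only the enumeration of the k(k − 1)/2 pairs and the vote counting reflect the approach described above.

```python
from collections import Counter
from itertools import combinations

# Class labels of the four ventilation units.
CLASSES = [1, 2, 3, 5]

# Hypothetical one-dimensional class centres, used only to obtain a toy pairwise rule.
CENTRES = {1: 21.0, 2: 22.0, 3: 23.5, 5: 25.0}


def nearest_centre(a, b, x):
    """Toy binary classifier for the pair (a, b): pick the class with the closer centre."""
    return a if abs(x - CENTRES[a]) <= abs(x - CENTRES[b]) else b


def one_vs_one_predict(classes, binary_decide, x):
    """Evaluate every pair of classes and return the class with the most votes."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[binary_decide(a, b, x)] += 1
    return votes.most_common(1)[0][0], votes


pairs = list(combinations(CLASSES, 2))   # k(k - 1)/2 pairs for k = 4
winner, votes = one_vs_one_predict(CLASSES, nearest_centre, 23.2)
print(len(pairs), winner, dict(votes))
```

Each binary classifier only sees the observations of its two classes, which keeps the individual training problems small.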

III. EXPLORATORY DATA ANALYSIS
We carry out a case study using data mining methodology on measurements taken from four air handling units (AHU) installed in a medium sized campus building. The building is a 7-story, 9000 sqm facility commissioned in 2016, hosting the PRECIS research center. It contains multiple research laboratories, multi-function spaces and meeting rooms, as well as a large auditorium and administrative offices. It is located at 44°26'06.0"N 26°02'44.0"E in a temperate continental climate with hot summers and cold winters. Cooling is handled using on-site electric chillers while heating is provided from a district heating network. For monitoring and control of the various building subsystems, a commercial Building Management System (BMS) software solution from Honeywell is implemented on a central server and used by the facility management staff.
The AHUs are the main components of the ventilation subsystem of the building. They handle the ventilation function of the HVAC subsystem by extracting air from the building and supplying it with a mix of fresh air, which serves the air quality objective, and recirculated air, which serves the energy efficiency objective. The target building has four AHUs: AHU1 and AHU2 for the top half of the building and AHU3 and AHU5 for the bottom half. AHU setpoints for input temperature and air pressure are set by the building operator given various seasonal, occupancy and usage factors. In our focus building each AHU handles between 10 and 20 thermal zones from similarly placed and oriented areas. Though the AHU is controlled by local temperature and air pressure PID loops, the resulting exhaust air from the building is influenced by multiple other factors, such as zone-level temperature setpoints input by the user, occupancy and activity levels, and internal loads, which make the high-level analysis relevant and worthwhile in the absence of extensive field sensors for observing the system. Figure 3 presents a picture of the target building along with a representative BMS screen for one of the ventilation units. Access to the BMS software is either local or remotely enabled. Within this technical context we proceeded to collect the necessary raw data for our study. Data is stored in a SQL structured database on the central BMS server and has been collected offline for the purpose of this study. From each of the four AHU systems we have access to the exhaust, input and recirculation temperatures, sampled at five minute intervals. In addition we collected the reference outdoor temperature and air humidity values from AHU1. Humidity is not used in the current analysis. The reference period of the study is the year 2017, more specifically the period from January 7th to December 31st, for a total of 359 days and 103392 data samples for each of the 14 measurement points.
The original time series at yearly, monthly and daily timescales are illustrated for AHU1 exhaust and input temperatures in Figure 4. The input temperature refers to the building indoor temperature (AHU output temperature) and the exhaust temperature is determined by the indoor activity disturbance.
Preprocessing the raw data improves the quality of the data that is input into the algorithms. These steps can help mitigate faulty or erratic sensor behaviour or communication issues that can result in misleading values or gaps in the time series. By analysing the time series we have observed some zero values as well as noise in some of the data. Two preprocessing steps have thus been implemented:
• outlier removal: mainly zeros, which are replaced with the previous non-zero value in the series; the overall occurrence is below 0.1% of the data set;
• a smoothing spline for noise removal, even though the sampling is relatively infrequent (5 min); the fitting spline parameters have been adjusted empirically.
The resulting time series are input to the next stage in the processing pipeline.
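The two preprocessing steps can be sketched as below. The zero replacement follows the description above; for the noise removal, a centred moving average is swapped in as a simple stand-in, since the spline parameters used in the study are not given. The function names, the window size and the sample readings are hypothetical.

```python
def replace_zeros_with_previous(values):
    """Replace zero readings with the previous non-zero value in the series."""
    cleaned = []
    last_valid = None
    for v in values:
        if v == 0 and last_valid is not None:
            cleaned.append(last_valid)
        else:
            # Leading zeros have no previous value and are left unchanged.
            cleaned.append(v)
            if v != 0:
                last_valid = v
    return cleaned


def moving_average(values, window=3):
    """Centred moving average; a stand-in here for the smoothing spline."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Illustrative exhaust temperature readings with two faulty zero samples.
raw = [21.5, 0, 0, 22.0, 22.4, 0, 22.1]
clean = replace_zeros_with_previous(raw)
smooth = moving_average(clean, window=3)
print(clean)
```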
The exhaust temperature data is considered as an aggregated proxy for indoor temperature in the corresponding thermal zones assigned to each AHU. The building operator currently sets the input temperature setpoint, pressure setpoint and recirculation ratio setpoint empirically. The main preprocessing steps taken were to eliminate all zero values, due to sensor faults or communication issues, and to apply a smoothing spline. Descriptive statistics of the resulting series are listed in Table I. These include the minimum, maximum, average and standard deviation, as well as skewness and kurtosis to characterise the underlying probability distribution.
Skewness, $s = E[(x - \mu)^3] / \sigma^3$, quantifies the asymmetry of the data around the sample mean. Negative skewness indicates data unbalanced to the left and positive skewness data unbalanced to the right. Kurtosis, $k = E[(x - \mu)^4] / \sigma^4$, is used as a metric for how outlier-prone a distribution is. Distributions with a value of $k$ higher than 3 are considered to be outlier-prone. In addition we compute Pearson's correlation coefficient $r$ between each measurement array and the outdoor temperature to indirectly determine the recirculation proportion. Figure 5 presents the histograms for the four data series, which illustrate qualitatively the underlying probability distributions of the exhaust temperature values. AHU1 and AHU2 values correspond to more predictable zones of laboratory and office space. AHU3 and AHU5 cover multi-function and administrative spaces. For subsequent data processing, z-score standardisation is applied: $x_n = (x - \mu) / \sigma$. As contextual information, Figure 6 illustrates the outdoor temperature for the analysed period. The measurement profile is typical of a temperate continental climate with hot summers and cold winters, separated by short shoulder seasons in which the best control strategy is no control strategy, leveraging outdoor air as much as possible for ventilation.
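The descriptive statistics defined above can be computed with a short sketch. It assumes population (not sample-corrected) moments; the function names and the toy series are hypothetical.

```python
import math


def moments(x):
    """Return mean, standard deviation, skewness and kurtosis (population moments)."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    skewness = (sum((v - mu) ** 3 for v in x) / n) / sigma ** 3
    kurtosis = (sum((v - mu) ** 4 for v in x) / n) / sigma ** 4
    return mu, sigma, skewness, kurtosis


def zscore(x):
    """z-score standardisation: x_n = (x - mu) / sigma."""
    mu, sigma, _, _ = moments(x)
    return [(v - mu) / sigma for v in x]


def pearson(x, y):
    """Pearson's correlation coefficient r between two equally long series."""
    mx, sx, _, _ = moments(x)
    my, sy, _, _ = moments(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (sx * sy)


# Toy symmetric series: skewness is zero and kurtosis is below 3.
series = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, sigma, s, k = moments(series)
# A perfectly linear relation yields r = 1 up to rounding.
r = pearson(series, [2.0 * v + 1.0 for v in series])
print(mu, sigma, s, k, r)
```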

IV. RESULTS
The results section describes the outcome of the following two steps in the data mining processing pipeline. The preprocessed time series are represented in an aggregate form, initially through numeric PAA segments followed by symbolic SAX segments. The key tuning parameters are the number of segments for the daily representation of each time series and the alphabet size into which these segments are codified by SAX. A trade-off between the number of approximating elements and the granularity of each element is presented. With the chosen configuration this leads to a reduction by a factor of around 30 in the data to be processed. Several SVM classifiers are then trained and tested on the input data with various configurations. To enable direct comparison, all training is carried out on a single desktop PC running Windows 7 Professional and MATLAB R2017b, with a quad-core i7 3.6 GHz processor and 16 GB RAM. The results are presented relative to a common benchmark, namely training an SVM classifier with a fine Gaussian kernel function on the original dataset. We highlight the outcomes while considering that the original dataset contains substantial redundant information, given the fine grained timescale of data collection for the slow underlying thermal processes, whereas the aggregated representation might lose some information but provides faster training times, which eventually lead to faster online adjustments of an initially lower performing model.

A. DATA AGGREGATION
We apply the PAA/SAX algorithms to the time series in order to illustrate the patterns derived from each operation mode. By operation mode we intend to better understand the patterns resulting from the underlying thermal behaviour given occupancy and usage patterns, from weather fluctuations, and from the control strategy decided for each of the ventilation units by the facility manager. Here we present the results for AHU1 at different timescales, with the input data being sampled at five minute intervals over one year. The time scales are similar to those of the exploratory data analysis, in the sense that the yearly representation allows the observation of major seasonal trends in operation of the ventilation system. At the monthly level we are then able to better differentiate shorter term events and usage and weather variations within each season. Finally, the daily representation is what we use further on, as it gives the best representation of usage at a fine grained scale. The tuning parameters of the SAX algorithm are the number of segments w and the alphabet size a. The results are shown in Figure 7 at the yearly scale for two configurations: w = 8, a = 4 and w = 10, a = 6. After carrying out multiple experiments on all the available data we find that the latter parameters offer a better and more consistent representation. With the tuning parameters w = 10, a = 6 the monthly and daily sample aggregations for the AHU1 normalised series are illustrated in Figure 8.
There are several factors influencing the aggregated patterns which we should account for. For the usage patterns these relate to work and school holidays, weekends, the structure of the academic year, as well as the work schedule at the daily scale. Using the identified sequence we achieve a compact representation of the data, to be correlated with the expert knowledge of the building operator. The resulting segments can be used to reconstruct the original data. Appendix A presents the figures showing the full outcome of the application of the SAX method on further data.
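The daily aggregation with w = 10 segments and a = 6 symbols can be sketched as follows. Two assumptions are made that the text does not confirm: the symbol ranges are taken as equiprobable regions of a standard normal distribution, as in the standard SAX formulation, and frame boundaries are rounded down to integer sample indices because 288 daily samples are not divisible by 10. The function name `sax_word` and the synthetic day are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev


def sax_word(series, w, a):
    """Convert a series into a SAX word of w symbols over an alphabet of size a."""
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma for v in series]          # z-score standardisation
    n = len(z)
    # Integer frame boundaries (assumption: floor rounding for non-divisible lengths).
    bounds = [i * n // w for i in range(w + 1)]
    paa = [mean(z[bounds[i]:bounds[i + 1]]) for i in range(w)]
    # Breakpoints splitting the standard normal into a equiprobable regions (assumption).
    breakpoints = [NormalDist().inv_cdf(j / a) for j in range(1, a)]
    letters = "abcdefghijklmnopqrstuvwxyz"
    # The symbol index is the number of breakpoints lying below the segment mean.
    return "".join(letters[sum(p > b for b in breakpoints)] for p in paa)


# Synthetic day of 288 five-minute samples: coolest at midnight, warmest at midday.
day = [22.0 - 2.0 * math.cos(2.0 * math.pi * t / 288) for t in range(288)]
word = sax_word(day, w=10, a=6)
print(word)
```

Each day of 288 samples is thus reduced to a ten-symbol word, which is consistent with the reduction by a factor of around 30 reported for the aggregated datasets.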

B. SVM CLASSIFICATION
We use SVM because it provides a flexible and powerful classifier when parametrising various kernel functions, in spite of the increased computational effort for the optimisation problem. SVM is an inherently binary classifier, so for our problem of modelling the patterns of the four ventilation units we use the extension to the multi-class setting as described in Section II. The one-versus-one approach involves training k(k − 1)/2 binary classifiers, with k the number of target classes, in our case four, and then voting on the assignment of each observation to a particular class. The advantage is that each of the classifiers is trained on a smaller sample of the input data. Three datasets result and are used in our experiments for training and investigating the performance of the SVM classifiers: dataset-proc, dataset-paa and dataset-sax. The datasets have the same number of features but a different number of observations. dataset-proc has 411692 observations representing the preprocessed sensor readings from the ventilation units. dataset-paa and dataset-sax both have 14360 observations. For training the classifiers we reconstruct the time series based on the PAA and SAX representations. The data table header for dataset-paa, with one example from each class, is illustrated in Table 2. There are four numeric features originating from temperature measurements at the AHU level and four binary features which codify the temporal aspects with regard to the day of the week and season used in training the models.
These are the following:
• evac: exhaust temperature of the AHU; normalised temperature values for dataset-proc and dataset-paa and aggregated segment labels for dataset-sax;
• in: input temperature of the AHU;
• rec: recirculation temperature of the AHU;
• ext: outdoor temperature; the same for all units;
• iswkn: binary feature used to represent if the measurement was on a weekend (1) or not (0);
• iswinter: binary feature used to represent if the measurement was during winter (1) or not (0);
• issummer: binary feature used to represent if the measurement was during summer (1) or not (0);
• isshoulder: binary feature used to represent if the measurement was during a shoulder season (1) or not (0); spring and autumn months are considered to be shoulder season.
The target class represents the numeric labels of the individual ventilation units 1, 2, 3 and 5.
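The four binary temporal features can be derived from a measurement timestamp as in the sketch below. The assignment of months to seasons (December to February as winter, June to August as summer) is an assumption made for illustration, since the exact month boundaries are not stated; the function name `temporal_features` is hypothetical.

```python
from datetime import datetime

# Assumed season definitions; the month boundaries used in the study are not stated.
WINTER_MONTHS = {12, 1, 2}
SUMMER_MONTHS = {6, 7, 8}


def temporal_features(timestamp):
    """Return the binary contextual features for one measurement timestamp."""
    iswinter = int(timestamp.month in WINTER_MONTHS)
    issummer = int(timestamp.month in SUMMER_MONTHS)
    return {
        "iswkn": int(timestamp.weekday() >= 5),        # Saturday = 5, Sunday = 6
        "iswinter": iswinter,
        "issummer": issummer,
        "isshoulder": int(not (iswinter or issummer)),  # spring and autumn
    }


first_sample = temporal_features(datetime(2017, 1, 7, 0, 0))
spring_sample = temporal_features(datetime(2017, 4, 12, 10, 5))
print(first_sample)
print(spring_sample)
```

By construction exactly one of the three season flags is set for every observation.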
Based on this input data we go on to train the SVM multi-class classifiers and describe the results. Initial testing has shown that the particularities of the data set require a more complex class boundary, so we present the results obtained with a cubic and a fine Gaussian SVM respectively. The cubic SVM kernel function has the form $k(x_j, x_k) = (x_j x_k^T + 1)^3$. The Gaussian SVM kernel function is expressed as $k(x_j, x_k) = \exp(-\lVert x_j - x_k \rVert^2)$. In practice we use a fine Gaussian model with the kernel scale $\sqrt{P}/4$, where $P$ is the number of features. The tuning parameter for the fine Gaussian model is the scale factor, which is determined automatically for the presented results, on a similar random seed, using a heuristic subsampling approach. A total of six classifiers are thus evaluated, two for each of the three training datasets. Each classifier is evaluated according to the following metrics: accuracy, sensitivity, specificity and area under curve (AUC). All results have been obtained using 5-fold cross-validation for improved statistical robustness.
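To make the two kernels concrete, the sketch below evaluates a cubic polynomial kernel and a Gaussian kernel on two hypothetical observations with P = 8 features, using the fine Gaussian kernel scale sqrt(P)/4 ≈ 0.71. The feature values are invented for illustration. The way the scale enters the Gaussian kernel (features divided by the scale before the squared distance is taken) is an assumption of this sketch, not a statement about the toolbox used in the study.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def cubic_kernel(u, v):
    """Cubic polynomial kernel (u . v + 1)^3."""
    return (dot(u, v) + 1.0) ** 3


def fine_gaussian_scale(n_features):
    """Kernel scale sqrt(P)/4 for P features, as stated in the text."""
    return math.sqrt(n_features) / 4.0


def gaussian_kernel(u, v, scale=1.0):
    """Gaussian kernel exp(-||u - v||^2) on features divided by `scale`.

    Dividing the features by the scale is an assumption of this sketch.
    """
    sq_dist = sum(((a - b) / scale) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist)


# Two hypothetical observations with P = 8 features:
# four normalised temperatures followed by four binary flags.
x_a = [0.2, 0.5, 0.4, 0.1, 0, 1, 0, 0]
x_b = [0.3, 0.4, 0.4, 0.2, 1, 1, 0, 0]

scale = fine_gaussian_scale(len(x_a))
k_cubic = cubic_kernel(x_a, x_b)
k_gauss = gaussian_kernel(x_a, x_b, scale)
```

Under this assumption a smaller scale makes the kernel decay faster with distance, so the resulting decision boundary can follow finer detail in the data.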
With regard to classifier evaluation, in the binary case, given the number of positive examples $P$ and negative examples $N$ in the input data set, running the classification yields a number of true positive ($TP$) and true negative ($TN$) examples. The accuracy of the classifier is given by $ACC = (TP + TN)/(P + N)$. Sensitivity, also denoted as true positive rate (TPR), is computed as $TPR = TP/P$. Specificity, also known as true negative rate (TNR), is computed as $TNR = TN/N$. The precision, or positive predictive value (PPV), of a classifier is defined as $PPV = TP/(TP + FP)$, where $FP$ is the number of false positive examples. Table 3 shows the aggregated metrics for the six multi-class classifiers.
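As a worked illustration of the binary definitions above, the following sketch computes the four quantities from confusion counts. The counts are hypothetical and are not taken from Table 3.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and precision from confusion counts."""
    p = tp + fn  # all positive examples
    n = tn + fp  # all negative examples
    return {
        "ACC": (tp + tn) / (p + n),
        "TPR": tp / p,
        "TNR": tn / n,
        "PPV": tp / (tp + fp),
    }


# Hypothetical counts: 100 positive and 100 negative examples.
metrics = binary_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these counts the accuracy is 170/200 = 0.85, the sensitivity 80/100 = 0.80, the specificity 90/100 = 0.90 and the precision 80/90 ≈ 0.89.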
The metrics used for evaluation in our multi-class scenario are defined as follows:
• Accuracy: the accuracy metric is a weighted average of the true positive rates across all four classes;
• AUC: the average area under the curve. The ROC curve depicts the relation between the false positive rate and the true positive rate when the pairwise discrimination threshold is varied; in our case we report the average AUC between the target class and the sum of the negative classes.

Figures 9-11 below show the best results for each data set in graphical form. We represent the confusion matrix and receiver operating characteristic (ROC) curves for the best AUC indicator.
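The multi-class accuracy definition can be checked on a small confusion matrix. The sketch below assumes that each class is weighted by its share of the observations, in which case the weighted average of the per-class true positive rates equals the fraction of correctly classified observations. The matrix entries are hypothetical and do not reproduce the reported results.

```python
LABELS = [1, 2, 3, 5]

# Hypothetical confusion matrix: rows are true units, columns predicted units.
CONFUSION = [
    [50, 5, 3, 2],
    [4, 40, 4, 2],
    [6, 2, 30, 2],
    [1, 1, 3, 45],
]


def per_class_tpr(matrix, labels):
    """True positive rate of each class: diagonal entry over its row sum."""
    return {
        label: matrix[i][i] / sum(matrix[i])
        for i, label in enumerate(labels)
    }


def weighted_accuracy(matrix, labels):
    """Average of per-class TPRs, weighted by the class share of observations."""
    total = sum(sum(row) for row in matrix)
    tpr = per_class_tpr(matrix, labels)
    return sum(
        (sum(matrix[i]) / total) * tpr[label]
        for i, label in enumerate(labels)
    )


tpr_by_unit = per_class_tpr(CONFUSION, LABELS)
accuracy = weighted_accuracy(CONFUSION, LABELS)
```

Here 165 of the 200 observations lie on the diagonal, so the weighted average comes to 0.825, the same value as the overall fraction of correct predictions.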
Finally we present a comparative evaluation of the training time and performance for the case study. Table 4 presents the training time in seconds as well as the prediction speed in observations per second for each of the six trained classifiers.
A graphical representation is shown in Figure 12, which is useful to assess the magnitude of the improvements. We observe how, with the same number of features, when the [...] subsampling method, training times for the cubic polynomial SVM kernel are significantly larger than those for the Gaussian kernel SVM on the datasets used.

The paper has presented a practical application of a data mining methodology to data collected from the ventilation subsystem of a smart building. The work has been motivated mainly by the need to better understand black-box building and operator dynamics in order to improve the control strategies. The presented framework is applicable beyond the described case study to other challenges which might relate to occupant comfort and/or energy efficiency in a building context. The results are replicable through the available MATLAB datasets and associated scripts. The main contribution of the study is the resulting model of the Air Handling Unit of the studied smart building, which can be used further with direct impact on the energy efficiency of the ventilation system.
Our main goal has been to achieve a better understanding of indoor temperature and ventilation operational patterns, which can lead to an improvement in the building control, either off-line or on-line. Future work will focus on the automatic adjustment of AHU setpoints based on learned analytical rules and models, thereby achieving data hypervisor control. For on-line operation, in order to link the developed software routines and modules with the commercial building management system, the implementation of a suitable middleware platform such as VOLTTRON [29] in the target building is foreseen. With regard to the core data mining and learning techniques, further investigation into the most relevant kernel functions and hyperparameter tuning can be pursued. Both high-level tools and low-level libraries for machine learning offer good potential to improve training performance, including by leveraging cloud-based infrastructures for data analysis and processing.