Deep learning in fault detection and diagnosis of building HVAC systems: A systematic review with meta analysis

The building sector accounts for a significant share of global energy consumption, and Heating, Ventilation and Air Conditioning (HVAC) systems contribute the highest portion of building energy consumption. Therefore, the potential for energy saving by improving the efficiency of HVAC systems is huge, and various fault detection and diagnosis (FDD) methods have been studied for this purpose. Amongst all types of existing FDD methods, data-driven ones are regarded as the most effective. Although deep learning (DL) methods, a relatively new branch of data-driven approaches, have shown promising results, a comprehensive review of DL applications in this area is absent. To fill this research gap, this systematic review with meta analysis analyses the relevant studies both quantitatively and qualitatively. The review is conducted by searching Web of Science, Science-Direct, and Semantic search. In total, 47 eligible studies are included in this review following the preferred reporting items for systematic reviews and meta-analyses (PRISMA) protocol. Six of the 47 studies are identified as eligible for meta analysis of the effectiveness of DL methods for FDD. The most used DL method is the 2D convolutional neural network (CNN). Results suggest that DL methods are promising for HVAC FDD. However, most studies use simulation or lab experiment data, and real-world complexities are not fully investigated. Therefore, DL methods need to be further tested with real-world scenarios to support decision-making.


Introduction
According to the latest statistics [1] of 2022, the building sector accounts for one-third of global energy consumption and has been identified as one of the main contributors to climate change. Heating, Ventilation and Air Conditioning (HVAC) systems contribute the highest portion (38%) of building energy consumption. However, HVAC systems can be faulty [2], leading to both system performance degradation and energy penalty [3,4]. Energy savings of 5%-30% are achievable by applying Fault Detection and Diagnosis (FDD) [5].
Kim and Katipamula [6] provided a general review of methods for HVAC FDD, which were classified into three categories: quantitative model-based, qualitative model-based, and process history-based methods. According to the authors, data-driven approaches were the most suitable for complicated systems such as HVACs, where detailed knowledge of the physics of the system is not available. In addition, such approaches are robust to noise and can extract the underlying structure of a data set. The main drawback of their review is that its focus area is not clear due to the broad scope. Only a few methods were selectively reviewed by the authors without deep analysis, and the selection criteria used in the review were not justified either. In addition, since modern techniques such as DL methods were not well developed at that point, the methods reviewed in that study are now outdated.
Shi and O'Brien [7] separated the HVAC FDD process into several sub-tasks. The main tasks identified by the authors were feature generation, fault detection, symptom generation, fault diagnosis, etc. These relevant tasks and components were reviewed in this study. However, the scope of this study is also broad since its focus area is the entire development process of HVAC FDD. As a result, each component involved was discussed on a general level, and a detailed discussion of various methods was not available in this review.
Zhao et al. [8] reviewed the FDD methods for building energy systems up to the year 2018, with a focus on Artificial Intelligence (AI) algorithms. This review found that physical models played a dominant role in FDD in the early years, while the use of AI models increased significantly in the recent decade. In addition, the application of Deep Learning (DL) methods was not found in this area until 2018; therefore, only one DL-relevant paper, by Guo et al. [9], was reviewed in this study.
Similar to Zhao et al. [8], data-driven FDD methods were reviewed by Mirnaghi et al. [10]. With regard to faults, chiller and air handling unit (AHU) systems were identified as the two major components of interest in terms of severity and cost. The potential of using DL for FDD was identified in this study. However, they found that the application of DL in this area was limited, since only two DL-relevant studies [9,11] were covered in this review.
Li et al. [12] reviewed various algorithms used for feature engineering involved in the development of an FDD approach. Five DL papers [9,13-16] were reviewed in this study. They concluded that DL methods were useful feature extractors. However, the focus area was limited to feature engineering. Therefore, the perspective of this study regarding DL methods for FDD was somewhat narrow.
Buffa et al. [17] reviewed the control and fault detection strategies for District Heating and Cooling (DHC) systems. The focus was to compare traditional control strategies with more advanced control methods that have been applied to the fourth and fifth generation DHC systems. FDD methods and their applications to DHC systems were less discussed. In addition, only two DL studies [16,18] were reviewed in this study.
It is concluded from the existing reviews published in the last five years that DL methods for HVAC FDD are a relatively new research topic that only started in 2018, though DL methods have been reported to be more effective for HVAC FDD in recent studies.
Two main reasons make DL methods perform better than conventional machine learning methods for HVAC FDD. First, due to component coupling, changing thermodynamics, local and long-term temporal dependencies, etc., operational data of HVAC systems are complex in nature, and deep learning methods have been shown to be more effective at modelling such complexity and the multi-scale temporal dependencies of HVAC faults [19]. Second, data scarcity is a common problem in the field of HVAC FDD, which leads to performance degradation of conventional machine learning methods; deep learning based generative models have been used successfully in recent HVAC FDD studies [20] to mitigate the data scarcity issue.
However, none of the existing reviews regarding HVAC FDD was conducted systematically, and no quantitative analysis was conducted to investigate the effectiveness of DL methods for HVAC FDD.
In contrast, a systematic review delivers a clear and comprehensive overview of available evidence on DL methods for HVAC FDD. Moreover, it helps to identify research gaps in the current understanding of this field. It can highlight methodological concerns in relevant studies that can be used to improve future work in this area. Lastly, it can be used to identify questions for which the available evidence provides clear answers and thus for which further research is not necessary.
Against this background, the aim of this study is to provide a systematic review of DL methods for HVAC FDD as well as evaluate their effectiveness through meta analysis. The following research questions are the focus of this review.
• Which DL methods have been used for HVAC FDD?
• What are the advantages and disadvantages?
• Are DL methods effective for HVAC FDD?
• What are the potential areas of improvement?
To answer questions 1, 2, and 4, a qualitative analysis of the papers included in the systematic review is conducted; to answer question 3, a quantitative meta analysis based on an eligible subset of the reviewed papers is conducted. The scope covers complete HVAC systems as well as their main components, e.g., chillers and air handling units.
The rest of the paper is organized as follows. Section 2 describes the methods used for systematic review and meta analysis. Section 3 provides an overview of the 47 included studies based on the method in Section 2.1, while Section 4 compares different DL methods used in the 47 studies qualitatively. Section 5 performs the meta analysis based on eligible studies filtered according to the method used in Section 2.2, which analyses the effectiveness of DL methods for HVAC FDD quantitatively. The current status and areas for improvement are discussed in Section 6, followed by the conclusions. A full list of abbreviations used in this paper is summarized in the appendix.

Methods
In this study, a systematic review is conducted following the preferred reporting items for systematic reviews and meta-analyses (PRISMA) protocol [21]. Three independent researchers have been involved in collecting and reviewing the data according to the protocol. Web of Science, Semantic Scholar, and Scopus are used to identify studies regarding fault detection and diagnosis of building HVAC systems using DL methods. In addition, reference checking is performed to retrieve relevant papers that are potentially missed by database searching. This review is complemented with a meta analysis that analyses the effectiveness of DL methods using a subset of eligible studies.

Search strategy
The process starts with defining keywords to build the search strings. Keyword combinations, the corresponding synonyms, and abbreviations relevant to three subjects: "building HVAC", "fault detection and diagnosis", and "deep learning" are used for searching. To assure the quality of search results, the keyword list is reviewed and agreed upon by three independent reviewers. In addition, an external domain expert is involved in the process of keywords discussion and double-checking.
After finalizing the keywords, eligibility criteria are specified for filtering out non-relevant studies. To be more specific, a study is included in this review only if all the following criteria are met: 1) the scientific work is written in English; 2) it is published in peer-reviewed scientific journals, conference proceedings, or books; 3) the studied HVAC systems are used in buildings, not in other domains; 4) an implementation of fault diagnosis is present; 5) deep learning methods are used; 6) the work is not a review article.
The literature search is performed on 25th March 2022. The overall process and results are shown in Fig. 1.
As shown in Fig. 1, 139, 2121, and 184 articles are retrieved from Web of Science, Semantic search, and Scopus, respectively. After removing duplicates, 2264 articles are left for the first-stage screening based on titles and abstracts, of which 2173 are excluded, mostly because they are not relevant to DL, FDD, or building HVAC. After this stage, 90 articles are available for the second-stage screening, in which the full text of each article is checked thoroughly, and 47 articles fulfil the eligibility criteria. In addition, reference lists of review articles excluded at the screening phase are re-examined so that potentially relevant studies are not missed. All three reviewers discuss and agree upon the results at each stage of the flowchart. Finally, 47 eligible articles are included in this review.
Among these 47 articles, an article that does not report the confusion matrix, or whose dataset is not accessible, is included for qualitative analysis but excluded from meta analysis. As a result, 6 studies [22-27] are identified as eligible for meta analysis.

Qualitative synthesis
The 47 eligible papers for qualitative synthesis are reviewed and summarized in detail from various dimensions. Method-wise, the information extracted from these papers includes the DL building blocks used, the benchmarked methods, and the network structure type. Fault-wise and HVAC system-wise, information about the faults and system types is extracted. Data-wise, the description of the dataset, fault generation methods, and data format are summarized. In addition, other supplementary information is extracted, such as whether the proposed approach is validated using a real-world dataset and whether FDD of faults at different severity levels is performed.

Data for quantitative synthesis
The American Society of Heating, Refrigerating and Air-Conditioning Engineers research report 1043 (ASHRAE RP-1043) dataset [28] is used in all six studies eligible for meta analysis. It includes seven types of faults at four severity levels: condenser fouling, excess oil, refrigerant leak, refrigerant overcharge, reduced condenser water flow, reduced evaporator water flow, and non-condensable gas in refrigerant. Confusion matrices of all severity levels are reported in [22]. The severity level of the confusion matrix used in [23,24] is unknown, the confusion matrix of severity level one is reported in [25], and that of severity level two is reported in [26]. Although confusion matrices of all four levels are reported in [27], the severity level four confusion matrix is not correctly constructed due to one missing category. Therefore, only the other three confusion matrices reported in that study are used for meta-analysis.
To fulfil the requirement that the input format has to be a 2 × 2 confusion matrix, the results of the included studies are categorized into two classes and aggregated across all severity levels using the one-versus-the-rest approach [29]. For the quantitative analysis of fault detection, results of the seven faults are consolidated into one faulty class versus the normal class. For the quantitative analysis of diagnosing a specific type of fault, results of the remaining classes are consolidated into one class.
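The consolidation described above can be illustrated with a minimal sketch. It assumes a confusion matrix with rows as true classes and columns as predicted classes; the function name `collapse_one_vs_rest` and the three-class counts are hypothetical and are not taken from the reviewed studies.

```python
import numpy as np

def collapse_one_vs_rest(cm, positive):
    """Collapse a K x K confusion matrix (rows = true class, columns = predicted)
    into TP, FN, FP, TN counts, treating the classes in `positive` as one class."""
    cm = np.asarray(cm)
    pos = np.zeros(cm.shape[0], dtype=bool)
    pos[list(positive)] = True
    tp = cm[np.ix_(pos, pos)].sum()
    fn = cm[np.ix_(pos, ~pos)].sum()
    fp = cm[np.ix_(~pos, pos)].sum()
    tn = cm[np.ix_(~pos, ~pos)].sum()
    return int(tp), int(fn), int(fp), int(tn)

# Hypothetical 3-class example: class 0 = normal, classes 1 and 2 = two fault types.
cm = [[90, 6, 4],
      [3, 85, 12],
      [2, 8, 90]]

# Fault detection: all fault classes pooled against the normal class.
detection = collapse_one_vs_rest(cm, positive=[1, 2])
# Fault diagnosis for fault type 1: that fault versus all remaining classes.
diagnosis = collapse_one_vs_rest(cm, positive=[1])
```

Note that in the detection setting a fault of one type predicted as another fault type still counts as a true positive, since the system is correctly flagged as faulty.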

Quantitative synthesis
There are two types of meta analytic summary: summary points, e.g., summary sensitivity, specificity, diagnostic odds ratio (DOR), and summary lines, i.e., summary receiver operating characteristic (SROC) curve, which shows summary of test performance, visual assessment of threshold effect, and heterogeneity of data in ROC space between sensitivity and specificity [30]. Summary points are based on true positive (TP), true negative (TN), false positive (FP), and false negative (FN) calculated from the confusion matrix.
To be more specific, sensitivity is the proportion of actual positive cases that are correctly identified as positive (TP out of TP plus FN), and specificity is the proportion of actual negative cases that are correctly identified as negative (TN out of TN plus FP). In the context of DL methods for HVAC FDD, sensitivity represents the ability of DL methods to correctly identify the system with faults. Specificity represents the ability of DL methods to correctly identify the system without faults.
DOR is defined as the ratio of the odds of the test being positive if the subject has a disease relative to the odds of the test being positive if the subject does not have the disease. In terms of HVAC FDD, it describes the odds of positive fault detection in HVAC systems with faults relative to the odds of positive fault detection in those without any faults. This measure incorporates information about both sensitivity and specificity, tends to be reasonably constant regardless of the diagnostic threshold, and is regarded as one of the most useful single indicators for diagnostic testing [31]. DOR is calculated using Eq. (3).
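The equations referred to in this subsection are not reproduced in the text. Under the standard definitions of these measures, and assuming the numbering is consistent with the reference to Eq. (3), they can be written as follows.

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \tag{1}\\
\text{Specificity} &= \frac{TN}{TN + FP} \tag{2}\\
\text{DOR} &= \frac{TP / FN}{FP / TN} = \frac{TP \cdot TN}{FP \cdot FN} \tag{3}
\end{align}
```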
One common problem of ratio summary statistics is that the number scale is not symmetric; e.g., a DOR of 0.5 (a halving) and a DOR of 2 (a doubling) are opposites that should average to no effect, yet the average of 0.5 and 2 is not a DOR of 1 but a DOR of 1.25. Therefore, log transformation is commonly used in the literature to avoid the asymmetry problem, and it is adopted in this study for the same reason. Because DOR alone does not reflect the variance [32], an SROC curve is constructed to analyse the relation between sensitivities and specificities.
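The asymmetry argument can be checked numerically with a short sketch that uses only the two DOR values from the example above; the variable names are illustrative.

```python
import math

dors = [0.5, 2.0]  # a halving and a doubling, as in the example above

# Averaging on the raw ratio scale does not return "no effect" (DOR = 1).
arithmetic_mean = sum(dors) / len(dors)

# Averaging on the log scale and transforming back gives the geometric mean.
log_mean = sum(math.log(d) for d in dors) / len(dors)
back_transformed = math.exp(log_mean)
```

On the log scale the two values are equidistant from zero, so their mean transforms back to a DOR of 1, whereas the raw average is 1.25.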
After data extraction and formatting, separate effect sizes from each study are pooled together using random-effects models to incorporate the heterogeneity between different studies [33,34]. Sensitivity analysis is performed to evaluate the influence of a single study on the pooled results using the leave-one-out approach [35,36]. Funnel plots [37] are used to analyse the potential publication bias caused by the selective publication of favourable or statistically significant results. The overall process of meta-analysis is shown in Fig. 2.
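As an illustration of the pooling and leave-one-out steps, the following sketch pools logit-transformed sensitivities with the DerSimonian-Laird estimator. The choice of estimator, the 0.5 continuity correction, the function names, and the study counts are all assumptions made for this example; the text does not state which random-effects estimator or software was used.

```python
import math

def logit_sensitivity(tp, fn):
    """Logit-transformed sensitivity and its approximate variance
    (a 0.5 continuity correction guards against zero cells)."""
    tp, fn = tp + 0.5, fn + 0.5
    return math.log(tp / fn), 1.0 / tp + 1.0 / fn

def pool_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate (assumed estimator)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)  # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    return sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical (TP, FN) counts for four studies; not taken from the reviewed papers.
studies = [(950, 50), (480, 20), (1990, 10), (870, 130)]
effects, variances = zip(*(logit_sensitivity(tp, fn) for tp, fn in studies))

pooled = inv_logit(pool_random_effects(effects, variances))

# Leave-one-out sensitivity analysis: re-pool with each study removed in turn.
leave_one_out = [
    inv_logit(pool_random_effects(effects[:i] + effects[i + 1:],
                                  variances[:i] + variances[i + 1:]))
    for i in range(len(studies))
]
```

A large shift between `pooled` and one of the `leave_one_out` values would indicate that the corresponding study has a strong influence on the pooled result.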

Results
This section summarizes all the information extracted from the eligible papers remaining after the systematic review process of Section 2. Characteristics of included studies are summarized in Table 1.
From Table 1, DL-based hybrid methods are used in nine studies, while the remaining 38 studies use single DL models. Of these studies, eleven apply 2D CNN, and the least applied methods are convolutional variational autoencoder (CVAE), bi-directional gated recurrent unit (BDGRU), and deep bidirectional long short-term memory neural networks (BDLSTM), with one study per method. There are six studies each for deep MLP and LSTM methods, followed by generative adversarial network (GAN), 1D CNN, and deep belief network (DBN), with 5, 4, and 3 studies, respectively. The different DL methods for HVAC FDD are discussed below.

Deep MLP based methods
From the reviewed literature, vanilla deep MLP-based methods have been less used in recent years; three out of six deep MLP studies were conducted in 2019.
One reason is that training a deep MLP model using the conventional backpropagation approach with random initialization suffers from the vanishing gradient problem. In addition, the deep multilayer perceptron (MLP) model can get stuck in a local optimum if the amount of training data is small [71], which is common in HVAC FDD since operational data of HVAC systems with labelled faults are difficult to acquire. Therefore, the performance of a deep MLP model could be even worse than that of a shallow MLP in the early years [28]. From the reviewed studies, two approaches are used to mitigate these drawbacks. The first approach uses an encoder-decoder structured model with an unsupervised layer-wise training paradigm [47,49,57], then initializes the classifier with the pre-trained weights and fine-tunes the whole model for FDD. Because this approach does not involve training multiple layers from scratch in one shot, the vanishing gradient problem is alleviated.
Fig. 2. Flowchart of meta-analysis.
In addition, the distribution from unlabelled data can be learned from the pre-training process, which prevents the model from being stuck in the local optima.This approach is particularly useful if labelled training samples are scarce, while there are abundant unlabelled training samples, which aligns with the data availability of HVAC systems in practice.
The second approach uses another optimization method, such as the simulated annealing algorithm, to optimize the parameters [24].
Compared to the first approach, the second approach is less used, because it only mitigates the local minima problem, while other issues discussed above cannot be effectively addressed using the second approach.
In general, components within an HVAC system are strongly coupled, and the output of each component is time-dependent. However, temporal information cannot be effectively utilized by a vanilla deep MLP model by design. Therefore, other DL methods such as recurrent neural networks (RNNs), which are more effective in time series modelling, are considered a more natural choice for HVAC FDD than a deep MLP model.

1D CNN based methods
Compared to MLPs, 1D CNNs show superior performance in extracting local temporal features from sequential data in terms of FDD [72]. 1D CNNs can be used either as a feature extractor connected to a downstream classifier for FDD [42] or as an end-to-end classifier [60,62,63], which offers greater design flexibility. Although 2D CNN-based methods can also be adapted for sequence data analysis, extra data conversion steps are needed, which cost extra computation and involve potential information loss. A 1D CNN processes building operational data directly by design without extra data conversion steps. In addition, a 1D CNN generally has fewer trainable parameters than a 2D CNN due to the kernel shapes, allowing a 1D CNN to use larger kernel sizes and add more layers while maintaining the parameter size and computation cost. Such high efficiency enables a 1D CNN to be applied for online FDD or real-time condition monitoring of an HVAC system.
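The parameter argument can be made concrete with a small sketch that counts the weights and biases of a single convolution layer. The helper names and the layer sizes (16 input channels, 32 output channels) are hypothetical, and square 2D kernels are assumed.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 1D convolution layer."""
    return in_channels * out_channels * kernel_size + out_channels

def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution layer with a square kernel."""
    return in_channels * out_channels * kernel_size ** 2 + out_channels

# Hypothetical layer: 16 input channels, 32 output channels.
p2d_k3 = conv2d_params(16, 32, 3)  # 3 x 3 kernel
p1d_k3 = conv1d_params(16, 32, 3)  # length-3 kernel
p1d_k9 = conv1d_params(16, 32, 9)  # length-9 kernel
```

For the same channel configuration, a 1D kernel of length 9 has the same parameter budget as a 3 × 3 2D kernel, which is why a 1D CNN can afford larger kernels or more layers at a comparable size.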

2D CNN based methods
Recent studies show that utilizing multiple sensors with sensor fusion techniques leads to higher accuracy and more robust FDD results [73,74]. A 2D CNN is designed to process multidimensional matrices, and this property enables it to utilize sensor fusion techniques to improve the FDD performance. Commonly, multiple sensors are installed at various locations of the HVAC systems, and the corresponding meter measurements can be consolidated together with the temporal information to form the input for a 2D CNN. By doing so, the superior capabilities of image analysis and automatic feature extraction of a 2D CNN can be fully utilized.
Moreover, formulating the data for a 2D CNN makes it feasible to apply transfer learning approaches for HVAC FDD, which is extremely useful for tackling the lack of labelled building operational data. In addition, the transfer learning approach can leverage the knowledge learned from one HVAC system to facilitate the FDD for HVAC systems with different system specifications [65,67], which is another considerable advantage compared to the conventional approaches that are system specific. On the other hand, apart from 2D CNN, the DL methods included in this review have not yet been applied for HVAC FDD using the transfer learning paradigm.
Methods used to formulate the input data for a 2D CNN include converting numeric features of the original data into the corresponding images with the y-axis representing the time and the x-axis representing the features [13,16,39,40], or with the y-axis representing the features and the x-axis representing the time [53], and placing each measured feature value row-wise and column-wise in a matrix at each timestep [23,51]. However, the data conversion process causes extra computational cost, and information embedded in the original 1D data can be distorted during the conversion procedure.
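The first two input formulations can be sketched as a sliding-window conversion. The function name `to_2d_samples`, the window and stride values, the random data, and the min-max scaling step are assumptions for illustration; the reviewed studies may use different preprocessing.

```python
import numpy as np

def to_2d_samples(readings, window, stride=1, time_on_rows=True):
    """Slice a (timesteps, sensors) array into overlapping 2D samples.

    With time_on_rows=True each sample has shape (window, sensors), i.e. time on
    the y-axis and features on the x-axis; otherwise each sample is transposed.
    """
    readings = np.asarray(readings, dtype=float)
    starts = range(0, readings.shape[0] - window + 1, stride)
    samples = np.stack([readings[s:s + window] for s in starts])
    # Assumed preprocessing: min-max scale each sensor to the range [0, 1].
    lo = readings.min(axis=0)
    span = readings.max(axis=0) - lo
    samples = (samples - lo) / np.where(span == 0, 1.0, span)
    return samples if time_on_rows else samples.transpose(0, 2, 1)

# Hypothetical data: 100 timesteps from 8 sensors.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8))
x = to_2d_samples(data, window=16, stride=4)
x_t = to_2d_samples(data, window=16, stride=4, time_on_rows=False)
```

Each resulting sample can be treated as a single-channel image, which is where the extra conversion cost mentioned above arises.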
Another scenario that makes a 2D CNN a natural choice over other methods is when the collected data for analysis are images, which can be fed into a 2D CNN without data conversion. Although using image data directly eliminates the extra data conversion steps and the potential information loss involved in this procedure, the model performance is influenced by the quality of the images, which depends on various factors such as the type of cameras, weather conditions, etc. [36].

LSTM
RNNs are the most commonly used neural network (NN) architecture for time series data modelling in general. The feedback loops of the recurrent cells inherently capture the temporal patterns of the sequences [23]. Owing to this property, RNNs are effective in modelling highly time-dependent and strongly coupled events, and such characteristics are common within an HVAC system.
Similarly, due to the recurrent structures by design, RNN-based methods are more robust to changes in system dynamics when performing FDD for an HVAC system with changing thermodynamics. In the literature, while other methods failed to detect the fault when the AHU system changed from occupied to unoccupied, RNN-based methods successfully detected the faults throughout the whole period regardless of such changes [11].
However, the main drawback of a conventional RNN is its limited ability to model long-term temporal patterns because of exploding and vanishing gradients [75]. Therefore, a vanilla RNN is rarely used for modelling complex HVAC systems, while its variant, LSTM, is more widely used in the most recent HVAC FDD literature. The designed gating mechanisms and cell states enable the LSTM to perform better in modelling long-term temporal dependencies embedded inside the sequence data [76]. To improve the performance of a vanilla LSTM for HVAC FDD [22,43-45], optimization methods such as the genetic algorithm are used in [59] to optimize the learning rate, batch size, etc. As opposed to such optimization, sets of hyperparameters and network structure configurations are hand-crafted and evaluated in [19] to search for the optimal setting. It is also concluded from that study that adding more LSTM neurons or layers potentially causes low convergence speed and overfitting problems.

GRU
Gated recurrent unit (GRU) is another variant of RNN, and compared to LSTM, the structure of a GRU is relatively simple. Therefore, using a GRU is beneficial when the computational cost is a bottleneck. In terms of HVAC FDD, such characteristics of a GRU are highly favoured by an online HVAC health state monitoring system, as potential faults can be detected earlier if the model is more efficient. In addition, from the maintenance and cost perspectives, fixing a fault at a less severe stage is much easier for a technician and more cost-effective [77]. On the other hand, the less complex structure can potentially limit the capability of a GRU compared to that of an LSTM when modelling a complex HVAC system. Although a systematic evaluation of GRU versus LSTM in terms of HVAC FDD performance is not found in the literature, studies showed that GRU achieved the same or even better performance for FDD of industrial machinery and chemical processes [78,79].

BDLSTM/BDGRU
Other useful variations of RNNs are bidirectional LSTM/GRU. Useful information from the other direction, which a unidirectional LSTM/GRU misses, can be captured using these variations. In terms of HVAC FDD, minor early-stage faults are difficult to detect due to the almost non-observable deviations without discriminative features. Such faults can be more easily identified if the deviation can also be analysed in reverse chronological order. Therefore, using a BDLSTM/BDGRU can lead to better performance for detecting minor drifting faults than using a conventional LSTM/GRU [80]. In addition, HVAC systems present coupling and time-varying dynamics in general. For instance, because the circulation of refrigerant is a closed loop, the output value of a component involved in the cycle at the current time step is not only influenced by its previous state but also corresponds to the subsequent state. Such characteristics align with the bidirectional design and make BDLSTM/BDGRU useful for HVAC FDD [41,66], although the number of parameters of a bidirectional architecture is doubled compared to the conventional LSTM/GRU, which results in higher computational cost.
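The relative sizes of LSTM, GRU, and their bidirectional counterparts can be illustrated with a simple parameter count. The sketch assumes the textbook formulation with one bias vector per gate; the helper name and the layer sizes are hypothetical, and specific libraries may differ by a small constant.

```python
def recurrent_params(input_size, hidden_size, gates, bidirectional=False):
    """Parameter count of a single recurrent layer.

    Each gate owns an input weight matrix, a recurrent weight matrix and one
    bias vector. gates=4 corresponds to an LSTM and gates=3 to a GRU.
    Some libraries keep two bias vectors per gate, which adds a small constant.
    """
    per_direction = gates * (hidden_size * (input_size + hidden_size) + hidden_size)
    return per_direction * (2 if bidirectional else 1)

# Hypothetical sizes: 20 sensor features, 64 hidden units.
lstm = recurrent_params(20, 64, gates=4)
gru = recurrent_params(20, 64, gates=3)
bdlstm = recurrent_params(20, 64, gates=4, bidirectional=True)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width, and a bidirectional layer has exactly twice the parameters of its unidirectional counterpart.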

Generative models

GAN and VAE
Most data-driven methods in the literature for HVAC FDD are supervised learning based [20]. However, faults are rare in real HVAC systems; therefore, most methods are tested using either simulation or experimental datasets, which can deviate from reality. Two types of DL-based generative models, i.e., VAE [128] and GAN [129], and their variations [20,55,56,81,82] have been used to improve the feasibility of applying such methods for real-world HVAC FDD. GANs have been widely used to generate synthetic images close to real-world samples for computer vision tasks [83]. However, the conventional GAN needs to be modified to be adapted to HVAC data. For example, the up-sampling structure of the generator needs to be replaced by a down-sampling structure corresponding to the dimension of the building operational data, as the feature dimension of HVAC data is generally much lower than that of image data. GANs can generate samples that fit the real distribution without explicitly assuming a certain form of probability distribution. However, training a GAN is difficult due to the problems of mode collapse, non-convergence, and instability [84]. In contrast, training a VAE is relatively easy, as gradient descent algorithms can be directly applied to minimize the reconstruction loss and the Kullback-Leibler divergence loss [85]. Although, according to the literature, data generated by a VAE are less varied, especially for high-dimensional data such as text and images [86], this is generally not a problem for the data of HVAC systems. Because the dimensionality of building operational data is much lower than that of image and text data, VAEs can be more effective for HVAC fault generation than GANs.
However, none of the HVAC FDD approaches based on GANs and VAEs is end-to-end. The output of the generative model needs to be fed to a downstream classifier for FDD; therefore, the FDD result can be negatively impacted if the quality of the data generated by the upstream generative model is poor.

Deep belief networks
DBN is a probabilistic generative model that stacks multiple layers of Restricted Boltzmann Machines (RBMs) [87]. In general, deep learning methods require a large amount of labelled data for training, while the lack of labelled building operational data makes applications of DL methods for HVAC FDD using real building data challenging. GANs and VAEs address this data challenge by synthetic data generation, while DBN tackles the problem from a different angle: the training process of a DBN consists of unsupervised layer-by-layer pre-training and supervised fine-tuning stages. Therefore, the unlabelled data can be fully utilized by the unsupervised learning process of the DBN [88].
Moreover, the pre-trained weights can be used to initialize the following layers within the network, improve the model generalization ability, and reduce the risk of overfitting [89], which is critical to HVAC FDD given the limited amount of faulty data available from a real-life building system. Another advantage of DBN is that, with the staged training strategy, the vanishing gradient problem of training a deep structured neural network is alleviated, which makes it feasible to add more hidden layers to model complex nonlinearities of an HVAC system and extract representative features of faults at different abstraction levels [90]. In addition, as opposed to GANs and VAEs, a separate classifier is not needed in the studies [9,48,50] using DBN approaches, which mitigates the non-end-to-end problem found in GANs and VAEs.
However, DBN has been less used in recent years, because the same problems can be solved by a stacked autoencoder (AE), which shares with the DBN the advantages of unsupervised, layer-wise, multi-stage training, weight initialization using the pre-trained model, etc. In addition, training a DBN is difficult due to its complex data models [91], while a stacked AE can be trained easily using the ordinary back-propagation method. Therefore, it can be found from the literature that encoder-decoder-based approaches are more widely used by researchers for HVAC FDD.

DL based hybrid models
A hybrid DL approach refers to ensembling multiple DL models into one hybrid model or stacking different types of DL building blocks to form one hybrid structured model.Hybrid models are more robust as they often complement the advantages of the individual techniques involved and improve the overall performance [92].
The hybrid models used in the reviewed studies can be categorized into three types.
The first type involves combining 1D CNN with LSTM/GRU/BDLSTM [15,23,26], which utilizes both local and global temporal features extracted by the 1D CNN and the LSTM/GRU, respectively. Results of these studies show that this hybridization strategy works well for diagnosing gradual and minor faults such as fouling components at the early stage, which is generally challenging [22]. This hybridization approach can be further enhanced by attention algorithms in the feature extraction process. A fused attention mechanism that consists of self-attention (SA) [93] and external-attention (EA) [94] is used in [68]. The SA is used to weigh the features based on their importance, while the EA is used to discover the correlation between different features. As a result, SA enables the model to focus more on the relatively important features, while EA allows the model to learn the most discriminative features across the whole dataset. Results show that the proposed method works well for an imbalanced dataset with a small amount of faulty data, which addresses the data scarcity issue in this problem domain.
The second type combines a deep AE with 2D CNN [95]. The deep AE generates a high-quality multi-dimensional residual, which represents the thermodynamic deviation and is a critical feature for HVAC FDD; once reshaped into a matrix, the residual data can be processed effectively by the 2D CNN.
The third type ensembles multiple DL building blocks, e.g., ensembles of deep MLPs [25] and of 1D CNNs [52]. Because an ensemble model is more robust to variance, it can be used to remove noisy data [25] before FDD. More importantly, although a vanilla 1D CNN is effective in modelling fast signals in short fixed-length segments, its performance degrades when extracting discriminative features from multiscale monitoring signals, which are common in HVAC systems. For example, within an AHU system, signals such as temperature and humidity change slowly, while the airflow rate fluctuates rapidly.
The third type of hybrid approach mitigates this issue by using multiple kernels of different sizes to capture temporal features at different time scales, i.e., a larger kernel can capture the overall tendency of slow-changing signals, while a smaller kernel can capture the sudden changes of rapidly changing signals [93]. Owing to the efficiency of 1D CNN, the computation cost of such an approach remains reasonable. Based on the discussions above, the advantages and disadvantages of each method are briefly summarized in Table 2.
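As a rough illustration of the multi-scale idea, the sketch below filters a toy signal with averaging kernels of two sizes. The function names, the kernel sizes (3 and 9), and the signal are illustrative assumptions and are not taken from the reviewed studies; an actual 1D CNN learns its kernel weights rather than fixing them.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate a 1D signal with a kernel, edge-padded to keep the length."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [signal[0]] * left + list(signal) + [signal[-1]] * right
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def multi_scale_features(signal, kernel_sizes=(3, 9)):
    """Return one filtered channel per kernel size (fixed averaging kernels)."""
    return {k: conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes}


# Toy signal: a slow upward drift with one sudden spike at index 10.
signal = [0.1 * t for t in range(20)]
signal[10] += 5.0

features = multi_scale_features(signal)

# Height of the spike above the drift that survives each filter.
spike_small = features[3][10] - 0.1 * 10
spike_large = features[9][10] - 0.1 * 10
```

In this toy case the small kernel retains more of the sudden spike, while the large kernel smooths it away and mostly follows the slow drift, which mirrors the division of labour described above.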

Meta analysis
To answer quantitatively whether DL methods are effective for HVAC FDD, a meta-analysis is conducted, and the results are discussed below.

Effectiveness analysis of fault detection
In terms of the quantitative results of fault detection, Fig. 3 shows the pooled sensitivity of DL for HVAC fault detection. In total, there are 51,003 fault cases, amongst which 48,221 are correctly identified. The sensitivity values range from 0.855 to 0.996, with the lowest value reported by Tra. The highest sensitivity values are reported by Wang and Han; although the upper bound of the 95% confidence interval (CI) is the same in both studies, the 95% CI of Wang's study is narrower owing to its larger sample size, which indicates that Wang's results are relatively more reliable. The pooled sensitivity is 0.985, which means that an estimated 98.5% of the actual fault cases are correctly detected by the DL methods. The 95% CI is 0.958-0.995, suggesting that the sensitivity of fault detection lies between 95.8% and 99.5%.
The pooled specificity of DL for HVAC fault detection is depicted in Fig. 4. The number of true negative cases is 6620, and the total number of negative cases is 7449. The specificity values range from 0.697 to 0.997. The lowest specificity is reported by Tra, while the highest is reported in Wang's and Li's studies; the 95% CI of Wang's study is narrower owing to its large sample size, which suggests a more reliable result than that of Li's study. The pooled specificity is 0.966, meaning that an estimated 96.6% of the fault-free cases are correctly classified as normal by the DL methods. The pooled 95% CI is 0.894-0.990, indicating that the specificity of fault detection lies between 89.4% and 99.0%.
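For reference, per-study sensitivity and specificity with a 95% interval can be computed from confusion counts as sketched below. The Wilson score interval and the counts are assumptions chosen for illustration; this section does not state which interval the forest plots use. Note also that a pooled estimate from a meta-analytic model generally differs from the crude ratio of summed counts (48,221/51,003 ≈ 0.945 for the fault cases), because the studies are weighted rather than simply added together.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def sensitivity(tp, fn):
    """Share of actual fault cases that are flagged as faulty."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual fault-free cases that are classified as normal."""
    return tn / (tn + fp)


# Hypothetical confusion counts for a single study (not from the review).
tp, fn, tn, fp = 950, 50, 480, 20

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)
```

With a larger sample the interval narrows for the same proportion, which is why a study with more test cases yields a tighter CI in a forest plot.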
It is also observed from the forest plots that Tra's results deviate from those of the other five studies. The main reason is that Tra's study uses only 1% of the labelled training data, as its experiment design aims to simulate the real-world scarcity of faulty data. However, the discriminative fault information within such a small amount of labelled data can be unrepresentative or insufficient to train the model, since DL models generally require a large amount of training data to achieve good performance.
The conclusion of a single study may fail to generalize to other studies because the DL methods, sample sizes, and objectives differ across studies, whereas meta-analysis combines, summarizes, and interprets the results of the primary studies to derive, to a certain extent, a generalized conclusion. Results show that both the pooled sensitivity and the pooled specificity are greater than 95%, which suggests that DL methods can achieve high sensitivity and specificity for HVAC fault detection in general.
Fig. 5 shows the SROC curve of HVAC fault detection using DL methods with the estimated pooled log DOR. The summary estimate represents the pooled result of sensitivity and specificity from the individual primary studies, and the surrounding dashed grey line represents its corresponding 95% CI. The pooled log DOR is 7.578 with a 95% CI of 5.933-9.222, which means that, on the log scale, the odds of a fault being flagged by the DL methods are approximately 7.6 higher for HVAC systems with faults than for those without any faults. A higher DOR value suggests better discrimination ability of the DL methods in terms of HVAC fault detection. The p-value is lower than 0.0001. A p-value measures the evidence against a null hypothesis, which in this context is that HVAC FDD using DL methods is not effective. A p-value of 0.05 or less is typically regarded as statistically significant, so the result indicates strong evidence against the null hypothesis. In addition, the area under the curve (AUC) equals 0.988. AUC measures the ability of a classifier to distinguish between classes, and an AUC value greater than 0.9 suggests that the discrimination ability of DL methods is outstanding [96].
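The log DOR is the logarithm of the ratio between the odds of an alarm in faulty systems and the odds of an alarm in fault-free systems. A minimal sketch of the per-study calculation is given below, using the standard large-sample standard error; the counts, the function name, the 0.5 continuity correction, and the use of the natural logarithm are assumptions made for illustration rather than details confirmed by the review.

```python
import math


def log_dor(tp, fn, fp, tn, correction=0.5):
    """Natural-log diagnostic odds ratio and its 95% CI from a 2x2 table.

    The continuity correction is added to every cell only if some cell is zero.
    """
    cells = [tp, fn, fp, tn]
    if min(cells) == 0:
        tp, fn, fp, tn = (c + correction for c in cells)
    value = math.log((tp * tn) / (fp * fn))
    se = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    return value, (value - 1.96 * se, value + 1.96 * se)


# Hypothetical counts for one study (not taken from the review).
value, (low, high) = log_dor(tp=950, fn=50, fp=20, tn=480)

# Back-transforming the pooled log DOR of 7.578, assuming a natural logarithm.
pooled_dor = math.exp(7.578)
```

Under the natural-log assumption, a log DOR of 7.578 corresponds to a DOR of roughly 1.95 × 10^3, so the value reported on the log scale should not be read directly as a ratio.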
To evaluate the validity and robustness of the pooled summary, a sensitivity analysis is performed by iteratively removing one study at a time and pooling the remaining studies. The result is shown in Table 3. The log DOR ranges from 7.313 to 8.409 when one primary study is excluded, and the most influential study is Tra's. This result is aligned with the observations from Figs. 3 and 4, and the cause is discussed above.
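The leave-one-out procedure can be sketched as follows. For brevity, the sketch pools log DOR values with fixed-effect inverse-variance weights, which is an assumption and may differ from the model behind Table 3; the study names, effects, and standard errors are hypothetical.

```python
def pool_inverse_variance(effects, std_errors):
    """Fixed-effect pooled estimate using inverse-variance weights."""
    weights = [1 / se**2 for se in std_errors]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def leave_one_out(names, effects, std_errors):
    """Re-pool the effect with each study removed in turn."""
    results = {}
    for i, name in enumerate(names):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_errors = std_errors[:i] + std_errors[i + 1:]
        results[name] = pool_inverse_variance(rest_effects, rest_errors)
    return results


# Hypothetical log DOR values and standard errors for four studies.
names = ["Study A", "Study B", "Study C", "Study D"]
effects = [8.0, 8.2, 7.9, 4.0]
std_errors = [0.5, 0.5, 0.5, 0.5]

full = pool_inverse_variance(effects, std_errors)
loo = leave_one_out(names, effects, std_errors)

# The most influential study shifts the pooled estimate the most when removed.
most_influential = max(loo, key=lambda n: abs(loo[n] - full))
```

Here the hypothetical "Study D" plays the role of a deviating study: removing it moves the pooled value the furthest.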
Another problem that potentially distorts the estimated effect under investigation is publication bias, which refers to an editorial predilection for publishing particular findings, e.g., positive results, and which leads authors not to submit negative findings for publication. In addition, large studies are more likely to be published because authors may be strongly tempted to dredge through the data from an essentially negative study to find positive results and publish only those [97]. The funnel plot in Fig. 6(a) is used to investigate the publication bias. Each dot represents an individual study, the dashed middle line represents the overall effect, and the two dashed lines on the left and right represent the bounds of the corresponding 95% CI. Ideally, the included studies should scatter symmetrically around the overall effect line. However, an asymmetrical pattern is observed in Fig. 6(a), especially for Tra's study, which deviates from the estimated true effect. The result suggests the presence of publication bias. The reason for the deviation of Tra's study is the same as discussed above, i.e., the authors of this study deliberately reduced the amount of labelled data used for training.
To further confirm this assumption, a basic outlier detection for meta-analysis is performed using the brute-force approach proposed by Harrer et al. [98]. A study is defined as an outlier if its 95% CI lies outside the 95% CI of the pooled effect. Tra's study is confirmed as an outlier according to this method. To investigate the influence of the outlier on the publication bias, Tra's study is removed, and the results of the remaining five studies are used to generate the funnel plot. As shown in Fig. 6(b), the observed pattern is more symmetrical than that in Fig. 6(a).
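The outlier rule stated above reduces to a non-overlap check between two intervals, sketched below. The pooled interval is the log DOR CI reported for fault detection; the study names and their intervals are hypothetical and serve only to exercise the rule.

```python
def find_outliers(study_cis, pooled_ci):
    """Flag studies whose 95% CI does not overlap the pooled 95% CI."""
    pooled_low, pooled_high = pooled_ci
    return [
        name
        for name, (low, high) in study_cis.items()
        if high < pooled_low or low > pooled_high
    ]


# Pooled log DOR CI from the text; per-study intervals are hypothetical.
pooled_ci = (5.933, 9.222)
study_cis = {
    "Study A": (6.5, 9.0),   # inside the pooled CI
    "Study B": (7.0, 10.1),  # overlaps the upper bound
    "Study C": (2.1, 5.2),   # entirely below the pooled CI
}

outliers = find_outliers(study_cis, pooled_ci)
```

Only the interval that lies entirely outside the pooled CI is flagged; partial overlap is not enough to mark a study as an outlier under this rule.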

Effectiveness analysis of HVAC fault diagnosis
Since various types of faults exist in the systems, it is challenging to enumerate the results of each fault quantitatively. According to the survey [99], condenser fouling is ranked as the most critical chiller fault in terms of repair cost and occurrence frequency. Therefore, condenser fouling fault diagnosis results are selected for analysis, and the same meta-analysis process is used. Figs. 7 and 8 show the pooled sensitivity and specificity of diagnosing the condenser fouling fault, respectively. From Fig. 7, there are 7304 samples of condenser fouling fault, amongst which 7234 are correctly diagnosed. The pooled sensitivity is 0.995 with a 95% CI of 0.983-0.998. The lowest sensitivity value is 0.976, which is still very close to 1. Two studies deviate from the rest: the reason for Tra's study is discussed in Section 5.1, while the sample size of Liu's study is the smallest amongst all included studies and thus results in a relatively wide CI. Apart from the small sample size, another reason is that all included studies report high sensitivity values with high lower CI bounds, so the CI of each study is very narrow. Therefore, although the CI width of Liu's study is just 0.043, it is still relatively wide compared with the extremely narrow CIs of the other studies.
In terms of the pooled specificity, there are 51,148 samples without the condenser fouling fault, and 47,607 of them are correctly classified as free of condenser fouling. The pooled specificity is 0.981 with a 95% CI of 0.946-0.993. The lowest reported specificity is 0.823, from Tra's study; the specificity values reported in the other studies are all greater than 0.96 with a narrow CI. From Fig. 9, the AUC is greater than 0.9, which suggests that DL methods discriminate condenser fouling faults from normal cases effectively. Besides, the pooled log DOR is 7.639 with a 95% CI of 6.140-9.138, and the p-value is lower than 0.0001, which means that, on the log scale, the odds of a condenser fouling diagnosis by the DL methods are approximately 7.6 higher for cases with condenser fouling faults than for those without.
The sensitivity analysis result using the leave-one-out approach is shown in Table 4. For the same reason as above, Tra's study is the most influential; excluding it increases the pooled log DOR from 7.639 to 8.377.
Publication bias in condenser fouling fault diagnosis is also investigated, and Tra's study is again detected as an outlier using Harrer's method. Fig. 10(a, b) shows the funnel plots before and after removing Tra's study.
The distribution of individual studies changes after removing Tra's study: Liu's and Li's studies move into the ideal funnel, while Gao's study falls outside the lower bound of the pooled 95% CI, represented by the left dashed line. Because the number of included studies is small, changing one influential study can result in a relatively significant change in the overall distribution. In addition, the results show asymmetrical patterns in both funnel plots, due to either publication bias or the small number of included studies.

The current status and knowledge gaps
The distribution of different DL methods for HVAC FDD is shown in Fig. 11. The most commonly used DL methods are 2D CNN and hybrid DL models, which are used in 11 and 9 studies, respectively. The figure shows that researchers have favoured 2D CNN, LSTM, 1D CNN, and hybrid DL methods in the past three years. More diverse data formats, such as images, are collected and explored by researchers, which enables the application of 2D CNNs and unleashes the potential of transfer learning to address data scarcity. Another main reason why transfer learning is favoured by researchers is that it enables knowledge learned from one system to be generalized to other systems with different operating characteristics. The improved generalization ability is a major advantage over other DL methods, which are typically customized for a specific building energy system and hence are limited by insufficient extrapolation capabilities [8].
On the other hand, LSTM and 1D CNN are natural choices for sequential HVAC data analysis because they do not require data format conversion. When FDD efficiency is critical, 1D CNN can be the more suitable option. Hybrid DL methods are popular because they effectively model complex systems that cannot be handled by a single DL model, e.g., using an ensemble multi-scale DL model to capture temporal features of an HVAC system at different time scales.
Although the reviewed studies show promising results, most are not tested against a real-world scenario using an imbalanced dataset. The overall imbalance ratio (IR) distributions of the training, validation, and testing datasets of the reviewed studies are shown in Fig. 12. The IR is calculated using Eq. (4):
$$\mathrm{IR} = \frac{N_{maj}}{N_{min}} \qquad (4)$$
where $N_{maj}$ is the sample size of the majority class and $N_{min}$ is the sample size of the minority class. The dataset is balanced if the IR equals one and imbalanced if the IR is greater than one. From Fig. 12, most studies do not specify the proportion of faulty and normal conditions. More specifically, one study uses an imbalanced training dataset and three use both imbalanced and balanced datasets. Regarding the testing data, two studies evaluate both imbalanced and balanced datasets, while the validation datasets used in the reviewed studies are either balanced or reported without an IR. Therefore, the satisfactory results reported in the studies remain to be validated using real-world data.
Fig. 13 shows the learning paradigms of DL methods adopted by researchers. Although supervised learning has been the most used learning paradigm in the past five years, other learning paradigms have been actively explored. Because of the data scarcity problem, the vanilla supervised learning paradigm is difficult to apply to real-world HVAC FDD in practice. Transfer learning and semi-supervised learning are two learning paradigms newly adopted in 2021 and 2022 that have shown promising results for solving the data scarcity problem. However, there are only one semi-supervised learning application and two transfer learning applications so far; thus, these two learning paradigms remain a research gap.
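The imbalance ratio referred to in Eq. (4) can be computed directly from class labels, as in the minimal sketch below. The function name and the toy labels are illustrative assumptions; for datasets with more than two classes, the sketch takes the largest and smallest class counts, which is one common convention and not necessarily the one used in the reviewed studies.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Toy dataset: 900 normal samples and 100 fault samples.
labels = ["normal"] * 900 + ["fault"] * 100
ir = imbalance_ratio(labels)

# A balanced dataset gives an IR of exactly one.
balanced_ir = imbalance_ratio(["normal"] * 50 + ["fault"] * 50)
```

Reporting this single number for the training, validation, and testing splits would make the class proportions of a study explicit.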
Apart from the learning paradigms, generative models can be another research direction for solving the same data scarcity problem. Amongst the 47 reviewed studies, generative models are used in only six. Therefore, there is room for generative models to improve the feasibility of applying DL methods to real-world HVAC FDD. This holds especially for VAE-based methods, which are easier to train than GANs. Besides, since the dimensionality of HVAC operational data is relatively low compared to image data, the quality of the generated results does not differ significantly. However, GANs are used in five studies, while only one study utilizes a VAE-based method for HVAC operational data generation.
Method-wise, other knowledge gaps identified from the reviewed studies include FDD of simultaneous faults and FDD of unknown/new faults. Only six studies test FDD of simultaneous faults, and only one study investigates FDD of unknown/new faults. In addition, FDD of non-steady processes is another aspect that has not been well tested. Taking the reverse cycle defrosting process of a VRF system as an example, when defrosting control is initiated, the compressor speed decreases to prepare for the change in direction of the four-way valve. The suction and discharge pressures then change rapidly in opposite directions: the discharge pressure drops and the suction pressure rises sharply until a balance is reached. The series of drastic changes involved in such a transient can cause various faults, such as valve leakage and stuck valves. FDD of non-steady processes is tested in only four studies. These characteristics are commonly found in real-world HVAC systems; they increase the complexity of FDD and have not been thoroughly studied yet.
Data-wise, the lack of public benchmarking datasets of real building systems limits the development of DL approaches for real-world HVAC FDD. Fault data used in 44 out of the 47 reviewed studies come from either lab experiments or simulations, including the two most used public datasets from ASHRAE. However, real-world complexities are generally not fully covered by lab experiment or simulated data, e.g., neither simultaneous faults nor unknown/new fault scenarios are incorporated in the ASHRAE datasets. Composing a public dataset that covers real-world complexities would be highly beneficial for future researchers developing approaches applicable to FDD of real building systems. Last but not least, current studies utilize only two types of data format: 38 studies use numerical data input, and nine studies analyse image data for FDD. However, with the development of DL methods, state-of-the-art results have been achieved for various multimodal data analysis tasks. Therefore, collecting and exploring multiple forms of data, such as documented issue logs, recorded/monitored audio, and video clips of the HVAC systems, can be beneficial and should be considered in future studies.

Conclusions
This study performs a systematic review with meta analysis of DL methods for HVAC FDD. In total, 47 studies are included in the qualitative analysis, and six studies are included in the quantitative analysis. In terms of fault detection, the pooled sensitivity is 0.985 (95% CI: 0.958-0.995), the pooled specificity is 0.966 (95% CI: 0.894-0.990), the pooled log DOR is 7.578 (95% CI: 5.933-9.222), and the pooled AUC is 0.988. Regarding fault diagnosis, one of the most critical faults in the literature, i.e., the condenser fouling fault, is selected for meta analysis. The pooled sensitivity of diagnosis is 0.995 (95% CI: 0.983-0.998), the pooled specificity is 0.981 (95% CI: 0.946-0.993), the pooled log DOR is 7.639 (95% CI: 6.140-9.138), and the pooled AUC is 0.989. The results suggest that DL methods can be effectively used for HVAC FDD. Although the published results are satisfactory, the potential publication bias indicates that studies without significant results may exist but remain unpublished.
It is also concluded that the most used DL method is 2D CNN, owing to the development of transfer learning and the diversity of data formats. In addition, most of the existing studies use simulation/lab experiment data because of the challenge of collecting fault data from real building systems. Challenges in real-world HVAC systems, including FDD of non-steady processes, simultaneous faults, and unknown/new faults, are research areas to be explored in the future.

F. Zhang et al.

Fig. 5. SROC curve and the pooled log DOR of DL for HVAC fault detection.

Fig. 6. (a) The funnel plot of study results, (b) The funnel plot of study results without the outlier study.

Fig. 9. SROC curve and the pooled log DOR of DL for HVAC fault diagnosis.

Fig. 10. (a) The funnel plot of study results, (b) The funnel plot of study results without the outlier study.

Fig. 12. Distribution of data imbalance ratio (training, validation, and testing data from left to right).

The detailed search query is specified below: (LSTM OR CNN OR convolutional OR long short OR deep OR GAN OR adversarial OR transfer OR graph OR DNN OR DANN OR BDLSTM OR recurrent OR generative OR GNN OR RNN OR GRU) AND (fault detection and diagnosis OR fault diagnosis OR fault detection and) AND (HVAC OR air conditioning OR ventilation OR heating OR cooling OR district heating OR building OR heat pump OR chiller OR air handling unit OR heat pumps OR chillers OR air handling units)

Table 1
Characteristics of the reviewed studies.

Table 2
Advantages and disadvantages of each method.

Table 3
Sensitivity analysis of Log DOR for HVAC fault detection using DL.

Table 4
Sensitivity analysis of Log DOR for condenser fouling fault diagnosis using DL.