A training strategy for hybrid models to break the curse of dimensionality

  • Moein E. Samadi,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Writing – original draft, Writing – review & editing

    moein.samadi@rwth-aachen.de

    Affiliations Institute for Computational Biomedicine, RWTH Aachen University, Aachen, Germany, Joint Research Center for Computational Biomedicine, RWTH Aachen University, Aachen, Germany

  • Sandra Kiefer,

    Roles Conceptualization, Investigation, Methodology, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Max Planck Institute for Software Systems, Saarland Informatics Campus, Saarbrücken, Germany

  • Sebastian Johaness Fritsch,

    Roles Data curation, Validation, Writing – review & editing

    Affiliations Department of Intensive Care Medicine, University Hospital RWTH Aachen, Aachen, Germany, Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany

  • Johannes Bickenbach,

    Roles Data curation, Resources, Validation, Writing – review & editing

    Affiliation Department of Intensive Care Medicine, University Hospital RWTH Aachen, Aachen, Germany

  • Andreas Schuppert

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Resources, Supervision, Writing – review & editing

    Affiliations Institute for Computational Biomedicine, RWTH Aachen University, Aachen, Germany, Joint Research Center for Computational Biomedicine, RWTH Aachen University, Aachen, Germany

Abstract

Mechanistic/data-driven hybrid modeling is a key approach when the mechanistic details of the processes at hand are not sufficiently well understood and inferring a model purely from data is too complex. By integrating first principles into a data-driven approach, hybrid modeling promises a feasible data demand together with the ability to extrapolate. In this work, we introduce a learning strategy for tree-structured hybrid models to perform a binary classification task. Given a set of binary labeled data, the challenge is to use them to develop a model that accurately assesses labels of new unlabeled data. Our strategy employs graph-theoretic methods to analyze the data and deduce a function that maps input features to output labels. Our focus here is on data sets represented by binary features, in which the label assessment of unlabeled data points is always extrapolation. Our strategy shows the existence of small sets of data points within given binary data for which knowing the labels allows for extrapolation to the entire valid input space. An implementation of our strategy yields a notable reduction of training-data demand in a binary classification task compared with different supervised machine learning algorithms. As an application, we have fitted a tree-structured hybrid model to the vital status of a cohort of COVID-19 patients requiring intensive-care unit treatment and mechanical ventilation. Our learning strategy shows the existence of patient cohorts for whom knowing the vital status enables extrapolation to the entire valid input space of the developed hybrid model.

Introduction

By learning from data, Machine Learning (ML) makes it possible to model complex systems in which the mechanisms controlling the system are poorly understood [1]. Such challenges appear particularly in medical and biomedical contexts [2, 3]. Here, ML has become one of the most practicable tools for building predictive models [4, 5]. However, the predictions obtained with ML methods are only reliable within the convex hull of the given training data [6, 7]. Extrapolation, i.e., accurate prediction beyond the convex hull of the given training data, is conceptually impossible without further enhancement of the machinery [7–9].

Another drawback of ML methods is that they suffer from the curse of dimensionality (COD) [10]. The COD refers to the high demand for training data, which is usually exponential in the complexity of the model [11, 12]. Towards breaking the COD, ML methods such as deep neural networks (DNNs) have been developed for specific classes of input-output (i-o) functions [13, 14]. However, DNNs also cannot offer a generic solution for all classes of functions that need to be approximated [15].

These drawbacks of pure ML methods hamper sufficient performance of predictive models in medical and biomedical applications [16, 17]. In particular, attempts to develop diagnostic and prognostic models for the individual patients suffering from Coronavirus disease 2019 (COVID-19) have shown a moderate performance alongside poor generalizability [18, 19]. The large number of potentially relevant features for prognosis and diagnosis of COVID-19 requires novel data analysis and predictive-model development methods [20]. The obtained models should be capable of making reliable predictions not only for multi-dimensional feature spaces but also from data of biased or small cohorts of patients [19, 21].

The aforementioned problems can be tackled by integrating a priori available knowledge of the system structure into ML learning processes, which is done in Structured Hybrid Models (SHMs) [22, 23]. SHMs can be realized by modular neural networks [24] with given connections among input features and network modules: each module of the first layer represents a known sub-process of the overall system and takes a subset of the input features as its input, every other module reads inputs from previous layers to compute its (intermediate) output. The final output modules then combine the precomputations to determine the overall output of the system. Each module of the network is represented either via known physics-based equations (white-boxes) or via an unknown black-box to be trained by ML methods. As attested by the COD, the complexity of each black-box module, which is the number of examples needed to determine the input-output function of the module, scales exponentially with the dimension of its input vectors [13]. By employing various black-box modules with fewer input variables, the overall complexity of an SHM is usually much lower than the respective complexity of pure ML methods where a single black-box deals with the entire input vector. This way, SHMs can serve as a framework to overcome the conceptual drawbacks of ML.
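To make the complexity argument concrete, the following sketch counts truth-table entries as a rough proxy for the number of examples a black box needs. It assumes that every module has binary inputs and a binary output; the function names and the module sizes (3, 3, 3) are illustrative choices and not part of the cited results.

```python
def table_entries_monolithic(d: int) -> int:
    """Truth-table size of a single black box that reads all d binary inputs."""
    return 2 ** d


def table_entries_tree_shm(module_sizes: list[int]) -> int:
    """Total truth-table size of a two-layer tree-structured model.

    Each first-layer module reads its own block of binary inputs; the output
    module reads one binary value per first-layer module (an assumption).
    """
    first_layer = sum(2 ** n for n in module_sizes)
    output_module = 2 ** len(module_sizes)
    return first_layer + output_module


print(table_entries_monolithic(9))        # 512
print(table_entries_tree_shm([3, 3, 3]))  # 32
```

Under these assumptions, a single black box over nine binary inputs has 512 table entries, whereas the modular variant has 32 in total.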

Particularly in process modeling, for example in chemical engineering [25–27], input-output relations are modeled as a composition of unknown black-box and known white-box modules. In such a hybrid structure, the overall model maps input data in $\mathbb{R}^n$ to the model outputs. The number of inputs for each black-box module in the hybrid structure is typically much lower than the total number of inputs n to the network. It was shown in [22, 23] that all unknown functions of the black-box modules in a tree-structured network can be uniquely determined as long as the training data set is distributed (in a strong formal sense) around a d-dimensional manifold in $\mathbb{R}^n$ with sufficient differentiability, where d is a bound on the number of inputs to the black-box modules. In case d < n, the trained hybrid model can extrapolate, which reduces the training-data demand and thereby helps to break the COD.

However, the superiority of hybrid models in terms of data-demand reduction and extrapolability as described in [22, 23] is based on the availability of densely distributed training data on low-dimensional subsets of $\mathbb{R}^n$. This property restricts applications of hybrid modeling to cases where highly correlated input-data distributions are available around low-dimensional manifolds within the input data space. In contrast to such controlled systems, observational data [28], such as in clinical data repositories, reflect mostly uncontrolled systems where the data distribution is not squeezed around low-dimensional manifolds. Moreover, observational data are often discrete or even binary.

In this work, we focus on binary data in order to provide a systematic extension of the hybrid models presented in [22, 23] towards hybrid models with randomly distributed training data within a binary feature space. This is meaningful for three reasons. First, binary hybrid models exhibit all characteristics regarding data-demand reduction and extrapolability of generic hybrid models without the specific numerical challenges of training on continuous data. Second, any monotonic discrete black-box function can be represented by a composition of binary black-box functions. This indicates that learning strategies for binary hybrid models generalize to generic, discrete black-box-based models. Moreover, we expect that learning strategies derived from binary models can be generalized further, even to continuous feature spaces, because discrete grid-based functions can be interpreted as local approximations of smooth continuous functions. Third, for binary data divided into training and test sets, any label assessment of unlabeled data points in the test set is always extrapolation, since any binary data point not contained in the training data lies outside the convex hull of the training data. Hence, a high prediction accuracy of a hybrid model on the test data, i.e., its out-of-sample forecast performance, is a direct indicator of the extrapolability of the hybrid model.

In this paper, we study classification tasks for binary labeled data represented by binary features. Given a set of data, the challenge is to use them to develop a model that accurately assesses labels of new unlabeled data. We present a learning strategy to compute a function that maps input features to output labels. We assume that the structure of the mapping between features and labels is known a priori and fits an SHM with an underlying tree structure. Our strategy uses graph-theoretic methods to deduce labels of new data points and to obtain a function that maps input features to output labels. It turns out that the classification efficiency of our hybrid model outperforms various supervised ML algorithms, namely Deep Neural Network (DNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Additionally, our method shows the existence of small sets of data points for which knowing their labels allows for extrapolation to the entire feature space. Accordingly, our algorithm promises a lower training-data demand than sole data-driven methods.

In an application of our strategy, we have fitted a tree-structured hybrid model to the vital status of a cohort of COVID-19 patients requiring intensive-care unit treatment and mechanical ventilation. Our learning strategy shows the existence of patient cohorts for whom knowing the vital status enables extrapolation to the entire valid input space of the developed hybrid model.

The ability of hybrid models to extrapolate can boost applications of ML in medical and clinical research. In medical contexts, ML faces a variety of barriers. On the one hand, the patient-specific disease-driving mechanisms are often widely unexplored [17]. On the other hand, medical data repositories tend to be biased by specific patient cohorts and restricted in size, especially when compared to the reported data demands in DNN applications. Moreover, the pooling of clinical data from heterogeneous sources requires a high degree of administrative effort due to data privacy regulations. As a consequence, pure ML in medicine is currently focused on specific tasks such as time-series analysis and pattern recognition, where data is accessible from wearable devices and medical imaging technologies [16]. Therefore for clinical studies, the integration of knowledge and ML in a hybrid-model setting is essential, particularly for the development of predictive models that can make reliable predictions even outside the convex hull of the given data.

The paper is organized as follows. In the next section, we introduce tree-SHMs for binary classification. Then we present our learning strategy, comprising the Conflict-Graph construction and the Label determination, and we explain the graph-theoretic machinery we use. For the application, we then summarize the synthetic and COVID-19 data sources. Next, we discuss the classification efficiency and the training-data demand of our learning strategy on both the synthetic and the COVID-19 data. The paper ends with a conclusion and suggestions for future projects.

Material and methods

Model

This section introduces our hybrid model, which integrates available measurement data into a priori knowledge about the system. The input to our model consists of data obtained from measurements, e.g. physiological data. The data is represented as d-dimensional vectors for some $d \in \mathbb{N}$, where each entry corresponds to one feature that is assumed to be binary. So the input vectors are elements of $\{0, 1\}^d$.

The general task is to learn an unknown function that assigns to each potential input vector its label, which can again be a 0 or a 1, depending on whether the data point belongs to the first or the second of the two classes that constitute the classification task. We use a given training-data set of input data and associated labels to learn the model. The training-data set covers only a small subset of the d-dimensional hypercube that contains all valid input vectors. From this information, we need to draw conclusions to also predict labels of unlabeled new data points.

In an SHM, see Fig 1 for a schematic example, mechanistic understanding of the underlying system is used to partially pre-determine the structure of a network which maps input variables to output values by combining several sub-computations performed in modules, where each module represents a separate sub-process within the overall system. In our setting, each network module is considered a black-box and will be trained using the available measurement data, up to redundant invariants [22, 23].

Fig 1. A tree-structured hybrid network.

The network maps binary input variables $x \in \{0, 1\}^9$ to binary outputs $y \in \{0, 1\}$. Three first-layer black-box modules each have separate input variables, and a single black-box module processes the partial outputs of the first layer to compute the overall output of the network, which can then be interpreted as a decision for one of the two considered classes.

https://doi.org/10.1371/journal.pone.0274569.g001

We furthermore assume that the hybrid network has a tree structure with two layers: in the first layer, independent black-box modules operate on separate input variables to complete sub-computations of the main classification. In the second layer, a black-box module processes the outputs of the modules from the previous layer towards the overall output of the network. Without loss of generality, we can assume that the entries of the input vectors are ordered according to the modules: that is, when there are k first-layer black-box modules, then the first first-layer black-box module has as its input the first $n_1$ entries of the overall input, and for i ∈ {2, …, k}, the i-th module takes as input the $n_i$ entries of the overall input starting at position $1 + \sum_{j=1}^{i-1} n_j$.

In the described setting, the overall i-o relation can be represented as a k-dimensional orthotope (hyperrectangle), where k is the number of first-layer black-box modules of the network. Each cell of the orthotope represents a binary input vector, holding the corresponding output label of the vector (which is 0 or 1). Each axis of the orthotope corresponds to the possible inputs of one first-layer module and thus, the i-th axis has length $2^{n_i}$.

An orthotope related to the network structure of Fig 1 is depicted in Fig 2. The network in Fig 1 is tree-structured with three first-layer black-box modules. The number of input variables for each module is 3. Therefore, the orthotope is a 3-dimensional hypercube with $2^3$ elements on each axis of the cube. So, altogether, the orthotope has $2^9$ cells, one for each valid binary input vector for the SHM. We equip each cell corresponding to a training-data point with a label that indicates the correct classification of the data point. The task is now to determine the i-o function, or, equivalently, to predict the labels for all cells in the orthotope.
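As a small illustration of how an input vector is located in the orthotope, the sketch below splits a vector into its per-module blocks and reads each block as a binary number. The helper name `cell_coordinates` and the choice to order the cells along each axis by binary value are assumptions made for this example; the text does not fix an ordering.

```python
def cell_coordinates(x, module_sizes):
    """Return one coordinate per first-layer module for the input vector x.

    The i-th coordinate is the i-th block of x interpreted as a binary
    number, so it ranges over 0 .. 2**n_i - 1 (ordering is an assumption).
    """
    if len(x) != sum(module_sizes):
        raise ValueError("length of x must equal the sum of the module sizes")
    coords, start = [], 0
    for n in module_sizes:
        block = x[start:start + n]
        coords.append(int("".join(str(bit) for bit in block), 2))
        start += n
    return tuple(coords)


# Network of Fig 1: three modules with three binary inputs each.
print(cell_coordinates((0, 0, 0, 1, 1, 0, 1, 0, 1), (3, 3, 3)))  # (0, 6, 5)
```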

Fig 2. The orthotope for the network structure of Fig 1.

The dimension of the orthotope is equal to the number of first-layer black-box modules of the tree-structured hybrid network of Fig 1. Each cell of the orthotope, characterized by three coordinates, represents an input data point holding the corresponding output label. Each intersection of a hyperplane with the orthotope holds input data with a constant input for a specific first-layer black-box module. For example, the uppermost horizontal blue slice of the orthotope illustrates all input vectors whose last three entries are 1, 1, 1.

https://doi.org/10.1371/journal.pone.0274569.g002

Learning strategy

Here we present our learning strategy for SHMs, which performs a binary classification task for binary input data. Recall from the previous section that we assume 2-layer tree-shaped SHMs whose first layer, as well as the output layer, consist of black-box modules. The training strategy consists of two parts called the Conflict-Graph construction and the Label determination. Together, the two procedures serve to evaluate the effect of local modifications in the input on the overall output of the SHM. For some intuition, we give a detailed description of the two procedures first. We then provide the full algorithm in pseudocode.

We assume that our SHM has k first-layer black-box modules and denote for every i ∈ {1, …, k} by $n_i$ the number of inputs to the i-th module. For example, the SHM of Fig 1 has k = 3 first-layer black-box modules with $n_1 = n_2 = n_3 = 3$. For a vector $v \in \{0, 1\}^{n_i}$, we call every vector $x \in \{0, 1\}^d$ for which for all $j \in \{1, \dots, n_i\}$ the $(j + \sum_{l=1}^{i-1} n_l)$-th component of x equals the j-th component of v an i-extension of v. For example in the SHM of Fig 1, the 9-dimensional vector (0, 0, 0, 1, 1, 0, 1, 0, 1) is a 1-extension of v = (0, 0, 0). We call i-extensions v*, w* of vectors $v, w \in \{0, 1\}^{n_i}$ equivalent if v* and w* are component-wise equal except (possibly) for the components with indices in $\{1 + \sum_{l=1}^{i-1} n_l, \dots, \sum_{l=1}^{i} n_l\}$. For instance, v* = (0, 0, 0, 1, 1, 0, 1, 0, 1) and w* = (1, 0, 0, 1, 1, 0, 1, 0, 1) are equivalent 1-extensions of v = (0, 0, 0) and w = (1, 0, 0) for the SHM of Fig 1. The idea behind this definition is that we want to extend inputs from single modules to inputs for the entire SHM.
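The two definitions can be checked mechanically. The following sketch uses hypothetical helper names (`module_slice`, `is_extension`, `are_equivalent`) and 0-based Python indexing for positions, while the module index i stays 1-based as in the text.

```python
def module_slice(i, module_sizes):
    """Half-open, 0-based index range of the inputs read by module i (1-based)."""
    start = sum(module_sizes[:i - 1])
    return start, start + module_sizes[i - 1]


def is_extension(x, v, i, module_sizes):
    """True if x is an i-extension of v, i.e. the i-th block of x equals v."""
    lo, hi = module_slice(i, module_sizes)
    return tuple(x[lo:hi]) == tuple(v)


def are_equivalent(x, y, i, module_sizes):
    """True if x and y agree everywhere outside the block of module i."""
    lo, hi = module_slice(i, module_sizes)
    return tuple(x[:lo]) == tuple(y[:lo]) and tuple(x[hi:]) == tuple(y[hi:])


sizes = (3, 3, 3)                      # SHM of Fig 1
v_star = (0, 0, 0, 1, 1, 0, 1, 0, 1)
w_star = (1, 0, 0, 1, 1, 0, 1, 0, 1)
print(is_extension(v_star, (0, 0, 0), 1, sizes))   # True
print(are_equivalent(v_star, w_star, 1, sizes))    # True
```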

Step 1: Conflict-graph construction.

In a binary classification setting, the i-o function of the i-th interior black-box module can be studied via a conflict graph G(V, E), see Fig 3. Here, the vertex set V corresponds to the set of all input vectors of the related black-box module. More precisely, the set V consists of the elements in $\{0, 1\}^{n_i}$, where $n_i$ is the number of input bits to the module. The set E is defined as follows: there is an edge between vertices u and v precisely if there exist equivalent i-extensions u* and v* of u and v, respectively, with different output labels.

Fig 3. A 2-colorable graph representing the i-o function of a black-box module.

Vertices of the graph denote all inputs of the module. Different colors of the vertices represent different outputs of the module. The graph has $8 = 2^3$ vertices because the module takes three binary variables as inputs.

https://doi.org/10.1371/journal.pone.0274569.g003

Our learning strategy translates the information given by the training data into a conflict graph for every black-box module. For a graph G(V, E), the 2-coloring problem on G can be phrased as the task to find a mapping f: V → {0, 1} such that adjacent vertices always have distinct f-values. For a 2-colorable graph, the partition of the vertex set into the two color classes is unique if and only if the graph G is connected, i.e., there is a path between any two vertices in the graph [29]. By the definition of the edge set, the introduced conflict graphs must all be bipartite, i.e., 2-colorable. Indeed, consider an edge connecting two vertices u and v in the conflict graph for the i-th module $M_i$. This means, by definition, that there are equivalent i-extensions of u and v which have different overall outputs. Since the i-extensions are equivalent, this change in the output can only be caused by different (intermediate) outputs of $M_i$ on u and v. Hence, assigning to every vertex in the conflict graph for $M_i$ the output of $M_i$ on the corresponding $n_i$-dimensional 0–1-vector constitutes a valid choice for f, proving the 2-colorability.

Suppose we are given a tree-SHM that maps binary input variables $x \in \{0, 1\}^d$ to outputs $y \in \{0, 1\}$ and has k first-layer black-box modules. For a set of training data vectors x with associated labels y, the steps of the Conflict-graph construction are as follows:

For i ∈ {1, …, k}, to construct the edges of the graph $G_i$ of the i-th module, consider all pairs of vertices $v, w \in V(G_i)$ (i.e., all elements in $\{0, 1\}^{n_i}$) and insert an edge between v and w precisely if there exist equivalent i-extensions v*, w* of v and w such that v* is labeled 0 and w* is labeled 1.
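The construction step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `conflict_edges` and the dictionary representation of the labeled data are assumptions. Labeled points are grouped by their context, i.e. by all entries outside the block of module i, so that any two points in one group are equivalent i-extensions.

```python
from itertools import combinations


def conflict_edges(labeled, i, module_sizes):
    """Edges of the conflict graph of module i (1-based).

    labeled maps full input tuples to labels 0/1. Two module inputs v, w are
    joined if some pair of equivalent i-extensions carries different labels.
    """
    lo = sum(module_sizes[:i - 1])
    hi = lo + module_sizes[i - 1]

    # Group labeled points by their entries outside the block of module i.
    groups = {}
    for x, label in labeled.items():
        context = x[:lo] + x[hi:]
        groups.setdefault(context, []).append((x[lo:hi], label))

    edges = set()
    for members in groups.values():
        for (v, label_v), (w, label_w) in combinations(members, 2):
            if label_v != label_w:
                edges.add(frozenset((v, w)))
    return edges


data = {(0, 0, 1, 1): 0, (1, 0, 1, 1): 1, (1, 1, 1, 1): 1, (0, 0, 0, 0): 0}
print(conflict_edges(data, 1, (2, 2)))
```

In the toy data, module 1 receives the two edges {(0, 0), (1, 0)} and {(0, 0), (1, 1)}, because these inputs lead to different labels under the same context (1, 1).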

The 2-coloring problem of $G_i$ has an unambiguous solution, i.e., the partition of the vertex set induced by the two colors is unique, if and only if $G_i$ is a connected graph. In that case, we can determine the internal i-o function f of the i-th first-layer module $M_i$ up to a permutation of 0 and 1 by starting in an arbitrary vertex and assigning to it the value, say, 0. Then, by using a breadth-first search, we can compute the partition of $V(G_i)$ into two sets $V_i^0$ and $V_i^1$, where for j ∈ {0, 1}, the set $V_i^j$ contains all inputs x to $M_i$ with f(x) = j.
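A breadth-first 2-coloring of a connected conflict graph can be sketched as below. The function name `two_color` and the convention of returning `None` when the graph is disconnected or contains an odd cycle are choices made for this example.

```python
from collections import deque


def two_color(vertices, edges):
    """Breadth-first 2-coloring starting from the first vertex with color 0.

    Returns a dict vertex -> 0/1 if the graph is connected and bipartite,
    and None otherwise (the coloring is then not unique or does not exist).
    """
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    if not adjacency:
        return {}

    start = next(iter(adjacency))
    color = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in color:
                color[w] = 1 - color[u]
                queue.append(w)
            elif color[w] == color[u]:
                return None  # odd cycle: not 2-colorable
    return color if len(color) == len(adjacency) else None


print(two_color(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")]))
# {'a': 0, 'b': 1, 'c': 0, 'd': 1}
```

Swapping the start color yields the complementary assignment, which mirrors the statement that the module function is determined only up to a permutation of 0 and 1.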

However, even if every graph Gi is connected, this only gives us information about the functions computed by the first-layer modules, i.e., for the intermediate outputs. To obtain the overall i-o function, we still need to combine this knowledge to compute the final outputs for all possible inputs. In the second step of our strategy, we, therefore, focus on determining actual output labels.

Step 2: Label determination.

Having prepared the graphs $G_i$ for all the modules, the Label determination aims to determine the unknown output labels of specific input data points by using the knowledge about the i-o functions of interior black-box modules acquired in Step 1. The determined labels for new input data points serve as new training data for Step 1 and may reduce the number of connected components in the updated $G_i$, thus providing more information about the i-o function of the black-box modules. The Label determination applies the following well-known result due to Kőnig [30] to the bipartite graphs $G_i(V, E)$.

Theorem. A graph G(V, E) is 2-colorable if and only if it has no cycles of odd length.

Building on this insight, the following procedure tries to determine the labels of input data that are not yet labeled. Recall that for a tree-structured hybrid network with binary input vectors and k first-layer black-box modules, we can use a k-dimensional orthotope with cell labels to embody the i-o relation.

  I. Let x be the lexicographically smallest d-dimensional binary vector for which the label has not yet been determined.
  II. For i ∈ {1, …, k}, check if assigning label ‘0’/‘1’ to x creates a cycle of odd length in any of the $G_i$ (by the definition of their edge relation). If it does, assign to x the opposite label, so as to maintain the 2-colorability of the graph by Kőnig’s Theorem.
  III. If II was successful (i.e., a label was assigned) and there are still unlabeled input vectors, go to Step 1. If II was not successful and x was not the vector $1^d$, i.e., not the lexicographically largest vector, update x to the lexicographically next unlabeled input vector and repeat II. Otherwise, terminate.

Fig 4 gives a schematic representation of the Label determination procedure. By labeling previously unlabeled vectors, we expand the training data set for the Conflict-graph construction. Therefore, we can now repeat Step 1 and update the graphs $G_i$ by inserting additional edges according to the new information. This way, alternating between Step 1 and Step 2, the procedure stops when we have filled the entire orthotope that states the labels for the possible input vectors or reached a situation where we cannot deduce further labels for the empty cells. The former case means that our learning strategy can extrapolate to the entire valid input space. In the latter case, however, additional training data points would be required to determine the labels of all unlabeled data points.

Fig 4. Schematic representation of the Label determination procedure.

The left figure shows an intersection of a hyperplane of constant inputs, say 0, 0, 0, for Module 2 with the orthotope of the network of Fig 1 (i.e., a “red slice”). The right figure represents the related conflict graph for Module 1. In accordance with Kőnig’s theorem, adding the dashed line to the edge set breaks the bipartiteness of the graph. Since assigning label 0 to the input vector (0, 0, 0, 0, 0, 0, 0, 0, 1) would imply the existence of an edge between (0, 0, 0) and (1, 0, 1) in the conflict graph, the label for the ‘?’ cell must be 1.

https://doi.org/10.1371/journal.pone.0274569.g004

Algorithm 1 shows in pseudocode how the Conflict graph construction and the Label determination are combined to form our final algorithm.

Algorithm 1 The full training strategy in pseudocode.

Input: A list of pairs $(x, \ell)$ of distinct data points $x \in \{0, 1\}^d$ and binary labels $\ell \in \{0, 1\}$; integers $n_i$ for all i ∈ {1, …, k} (for some k) with $\sum_{i=1}^{k} n_i = d$.

Output: A partial mapping $f: \{0, 1\}^d \to \{0, 1\}$ representing the label predictions that can be derived from the input data.

1: For every input $(x, \ell)$, set $f(x) \leftarrow \ell$.

2: for i ∈ {1, ⋯, k} do

3:  Initialize a graph $G_i(V_i, E_i)$ with $V_i = \{0, 1\}^{n_i}$ and $E_i = \emptyset$.

4: end for

5: while f is not total do    //[i.e., there are x where f(x) is undefined]

6:  for i ∈ {1, ⋯, k} do

7:   for $v, w \in V(G_i)$ do

8:    if there are equivalent i-extensions v*, w* of v, w such that f(v*) = 0 and f(w*) = 1 then

9:     $E_i \leftarrow E_i \cup \{\{v, w\}\}$.

10:    end if

11:   end for

12:  end for

13:  u ← false    //u stores whether the following procedure updates f

14:  for $x \in \{0, 1\}^d$ do

15:   if f(x) is undefined then

16:    for i ∈ {1, ⋯, k} do

17:     if setting f(x) ← 0 would create a cycle of odd length in $G_i$ (by updating the edge set as described in Lines 7–9) then

18:      f(x) ← 1

19:      u ← true

20:     else if setting f(x) ← 1 would create a cycle of odd length in $G_i$ then

21:      f(x) ← 0

22:      u ← true

23:     end if

24:    end for

25:   end if

26:  end for

27:  if u = false then

28:   return f

29:  end if

30: end while

31: return f

Data sources

Synthetic data.

To benchmark our learning strategy, we generated 30 tree structures for 2-layered SHMs with binary inputs $x \in \{0, 1\}^d$, d ∈ {8, 9, 10, 11, 12}, three black-box modules in the first layer, and binary output labels $y \in \{0, 1\}$. The details and the schematic representation of the tree structures used for generating the synthetic data are depicted in S1 Appendix. Each structure was randomly constructed in a way that each first-layer black-box module operates on 2–6 separate input entries and forwards the partial results to the black-box output module.

We now describe how we obtained the labels for the synthetic data sets. For the associated SHM H of each data set, we generated random i-o functions for the three first-layer black-box modules and the output module. Then for each d-dimensional input to the network, we computed the partial outputs of the first layer followed by the overall output of the output layer, which we used as the label for the corresponding input vector.
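The labeling procedure can be sketched as follows. How the random i-o functions were drawn is not specified here, so the sketch assumes independent, uniformly random truth tables; the function name, the seed parameter, and the module sizes in the usage line are illustrative.

```python
import random
from itertools import product


def random_tree_shm_labels(module_sizes, seed=0):
    """Label every binary input of a two-layer tree-SHM with random modules.

    Assumption: each module is a truth table whose entries are drawn
    independently and uniformly from {0, 1}.
    """
    rng = random.Random(seed)
    first_layer = [
        {bits: rng.randint(0, 1) for bits in product((0, 1), repeat=n)}
        for n in module_sizes
    ]
    output_module = {
        bits: rng.randint(0, 1)
        for bits in product((0, 1), repeat=len(module_sizes))
    }

    labels = {}
    for x in product((0, 1), repeat=sum(module_sizes)):
        partial, start = [], 0
        for table, n in zip(first_layer, module_sizes):
            partial.append(table[x[start:start + n]])  # intermediate output
            start += n
        labels[x] = output_module[tuple(partial)]       # overall output
    return labels


labels = random_tree_shm_labels((3, 3, 2))
print(len(labels))  # 256
```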

COVID-19 data.

This analysis was approved by the local ethical review board (EK 091/20; Ethics Committee, Faculty of Medicine, RWTH Aachen, Aachen, Germany). The Ethics Committee waived the need to obtain informed consent for the collection and analysis of the retrospectively obtained, de-identified data as well as for the publication of the results of the analysis. All methods were carried out in accordance with relevant guidelines and regulations.

Concerning the COVID-19 data, the studied population consists of patients with confirmed COVID-19 who had been admitted to the Intensive Care Unit (ICU) at University Hospital RWTH Aachen. The analyzed cohorts consisted of severely ill patients requiring invasive mechanical ventilation at least once throughout their ICU stay.

The clinical information of 63 adult patients (age ≥18 years) was collected between March and the end of June 2020. The median age was 62 years (interquartile range 58–70 years), and 66.7% of the patients (n = 42) were male. Twenty-seven patients did not survive their ICU treatment, resulting in a mortality rate of 42.9%. The median length of stay in the ICU was 27.0 days (interquartile range 16.3–50.8 days). Table 1 presents the biometric and physiological parameters of the studied cohort of COVID-19 patients on ICU admission, including the physiological parameters required for the sequential organ failure assessment score (SOFA score [31]).

Table 1. Biometric and physiological parameters of the studied COVID-19 patients on the ICU admission.

Values are represented as n (%) or median (interquartile-range).

https://doi.org/10.1371/journal.pone.0274569.t001

Almost all of the biometric and physiological patient information was collected as continuous values in diverse ranges and scales. Since our learning strategy requires binary data, we converted the initial continuous data into binary representations. As outlined in S2 Appendix, the first step of the data binarization was to use a decision-tree classifier to classify patients according to their vital status. We used biometric information and physiological parameters from the first seven days of the patients’ ICU stay as the attributes of the classification. S1 Table depicts the median and the interquartile range of the physiological parameters used in the decision-tree classifier. In the second step of the data binarization, we binarized the most important patient features obtained from the decision-tree classifier. The binarization threshold for each feature is its related critical value in the decision tree classifier. Table 2 shows these features and the critical values used for the binarization. Finally, we labeled the 5-dimensional binarized clinical patient data based on a 0.75 threshold on the mortality ratio. The obtained binary patient information and the associated mortality labels constitute the labeled data set for testing our learning strategy.

Table 2. Critical values for binarization of the most important COVID-19 patient features.

https://doi.org/10.1371/journal.pone.0274569.t002

Results and discussion

Classification efficiency in the synthetic data

The main advantage of hybrid models compared with data-driven ones is the ability to extrapolate, i.e., to accurately predict labels for data points outside the convex hull of the given training data. In binary data, the convex hull of the training data only contains the training data itself. So in order for a binary classification algorithm to be meaningful, the extrapolation ability is crucial.
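The claim about the convex hull of binary data can be verified with a short argument. The notation ($z$, $x^{(j)}$, $\lambda_j$) is introduced here for illustration only.

```latex
Let $z \in \{0,1\}^d$ be a convex combination of training points
$x^{(1)}, \dots, x^{(m)} \in \{0,1\}^d$:
\[
  z = \sum_{j=1}^{m} \lambda_j x^{(j)}, \qquad
  \lambda_j > 0, \qquad \sum_{j=1}^{m} \lambda_j = 1 .
\]
Fix a coordinate $c$. If $z_c = 0$, then $\sum_{j} \lambda_j x^{(j)}_c = 0$
is a sum of nonnegative terms, so $x^{(j)}_c = 0$ for all $j$.
If $z_c = 1$, then $\sum_{j} \lambda_j \bigl(1 - x^{(j)}_c\bigr) = 0$,
so $x^{(j)}_c = 1$ for all $j$.
Hence $x^{(j)} = z$ for every $j$: a binary point lies in the convex hull
of the training data only if it is itself a training point.
```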

We summarize the extrapolability of our learning strategy on the synthetic data. We consider three different sizes $N_{\mathrm{tr}}$ of training-data sets containing 20%, 30%, and 40% of the whole data sets. For each training-data size, we sampled, for each of the 30 SHM structures and for input dimensions d ∈ {8, 9, 10, 11, 12}, five training-data sets of the corresponding size from the $2^d$ possible labeled data points. Then we executed our learning strategy and summarized the outcomes with the following performance measures: classification accuracy, recall, precision, and F1-score [32]. Table 3 shows the results for the different training-data sizes $N_{\mathrm{tr}}$.
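For reference, the listed measures can be computed from the confusion counts as in the sketch below. It uses the standard definitions with label 1 as the positive class; returning 0.0 when a denominator is zero is a convention chosen for this example, and how points with undetermined labels enter the evaluation is not covered here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```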

Table 3. Classification results of the hybrid model on the synthetic data.

https://doi.org/10.1371/journal.pone.0274569.t003

For randomly chosen training-data sets containing at least 40% of the entire valid input space, the average of the classification accuracy is close to 1. In particular, for training data sizes Ntr of at least 40% of the entire valid input space, the median and the lower quartile of the classification accuracy are 1, and the mean of the classification accuracy is above 0.99. Furthermore, there exist (randomly sampled) training-data sets with classification accuracy equal to 1 even with Ntr equal to 20% of the entire valid input space. So the tree structure of the SHM suffices to guarantee the existence of small data sets with classification accuracy equal to 1. This property, which cannot be observed in pure ML methods, underlines that hybrid models have a high potential to reduce the training-data demand, see also [22, 23].

To compare the training-data demand and classification efficiency of our method with other ML classifiers, we solved the same classification problem on the same synthetic data using different supervised learning methods, namely DNN, SVM, RF, and LR. We used grid-search cross-validation [33] as the hyperparameter tuning method for SVM, RF, and LR. In particular, we employed 5-fold stratified cross-validation on shuffled training data. The performances of the selected hyperparameters and trained models were then measured on a dedicated evaluation set that was not used during the model selection step. For DNNs, we used the Keras Tuner [34, 35] hyperparameter optimization framework to optimize the hyperparameters for each data dimension. The details of the optimized hyperparameters of the employed ML methods are shown in S3 Appendix.

As summarized in Table 4, the classification efficiency of our hybrid model notably outperforms the other ML models, especially for smaller training-data sizes Ntr. Fig 5 shows that the classification accuracy increases much more quickly with additional training data in our classifier than in the other models. In particular, the median of the classification accuracy of our strategy for training-data sizes Ntr of at least 20% of the entire valid input space approaches 1, whereas, for the DNN classifier, it is still 0.93 for training-data sizes equal to 40% of the entire valid input space.

Fig 5. The distribution of classification accuracies for binary classification on the synthetic data.

For each model and each size Ntr of the training data, we sampled 150 training data sets with input dimension d ∈ {8, 9, 10, 11, 12} and visualized the measured performance as box plots.

https://doi.org/10.1371/journal.pone.0274569.g005

Table 4. The comparison of the median of the binary classification measurement results on the synthetic data.

https://doi.org/10.1371/journal.pone.0274569.t004

The superiority of our proposed methodology in the designed binary classification over the other supervised ML methods is firstly due to the reduced complexity of the SHM. An SHM employs several black-box modules with fewer input variables each, so its overall complexity is usually much lower than that of an ML method in which a single black-box deals with the entire input vector. Secondly, our method can extrapolate, which is not the case for the other ML models, and this extrapolability boosts its classification performance. As one of the consequences of the Curse of Dimensionality, the volume of the convex hull of D-dimensional data scales by 1/D! [36]. In an SHM, the union of the convex hulls of the sub-process modules covers the volume in which the hybrid model can make accurate predictions. As the volume of the union of the convex hulls of the sub-processes is notably larger than the volume of the convex hull of the original data, i.e., 1/d_1! + 1/d_2! + … + 1/d_k! ≫ 1/D!, where d_1 + d_2 + … + d_k = D, the binary classification performance of the hybrid classifier should exceed that of the other ML classifiers using a single black-box, which only guarantee faithful predictions inside the convex hull of the original data.
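A small numeric example makes the scaling argument concrete. The sketch below assumes a hypothetical split of a D = 10 dimensional input into two modules with five inputs each; this split is chosen for illustration and is not one of the SHM structures of this study.

```python
from math import factorial


def hull_volume_scale(dim):
    """Scaling factor 1/dim! of the convex-hull volume in `dim` dimensions."""
    return 1.0 / factorial(dim)


def modular_volume_scale(module_dims):
    """Sum of the scaling factors of the individual modules."""
    return sum(hull_volume_scale(d) for d in module_dims)


D = 10                 # dimension of the full input vector
module_dims = [5, 5]   # hypothetical module input sizes with 5 + 5 = D

single = hull_volume_scale(D)                 # 1/10!
modular = modular_volume_scale(module_dims)   # 1/5! + 1/5! = 1/60
ratio = modular / single                      # 10!/60 = 60480
```

Under this assumption the sum over the modules is more than four orders of magnitude larger than the factor for a single 10-dimensional black-box.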

Statistical analysis on the synthetic data classification results

We set up a statistical test for comparing the hybrid model with the other ML classifiers (DNN, SVM, LR, and RF) on the synthetic data. It has been shown that non-parametric tests are suitable for statistical comparisons of classifiers since they do not assume normal distributions or homogeneity of variance in accuracies or any other measure for the evaluation of classifiers [37]. In particular, the Friedman test with the corresponding post-hoc tests is recommended for comparing more than two classifiers over multiple data sets [37], which is the case in our problem.

The Friedman test is a non-parametric counterpart of the repeated-measures ANOVA [37, 38]. First, it separately ranks the algorithms for each data set according to their classification performances. Then it determines whether or not there is a statistically significant difference between the average ranks of the algorithms. The null-hypothesis H0 of the Friedman test states that all the algorithms are equivalent and so their ranks are equal. The Friedman statistic can be approximated by the Chi-squared distribution when the number of data sets n or the number of classifiers k is large enough (i.e. n > 15 or k > 4), which is the case in our problem. For the significance level of α = 0.001 and DF = 4 degrees of freedom (the number of classifiers that we are comparing minus one), the critical Chi-squared value equals 18.467. That is, if the Chi-squared value calculated in our test is greater than the critical value of 18.467, then the null-hypothesis is rejected in favor of the alternative hypothesis H1 that the algorithms are not equivalent.
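The ranking and the test statistic can be sketched in a few lines of Python. The sketch uses the standard form of the statistic given in [37], χ²_F = 12n/(k(k+1)) · (Σ_j R_j² − k(k+1)²/4), where R_j is the average rank of the j-th algorithm. It does not apply a tie correction, and the accuracy table is made up for illustration; it is not the STAC implementation used for Table 5.

```python
def average_ranks(scores):
    """Rank one row of scores: rank 1 is the highest score, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Return (chi-squared statistic, average ranks) for a data-set x algorithm table."""
    n, k = len(table), len(table[0])
    rank_rows = [average_ranks(row) for row in table]
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks


# Made-up accuracies: rows are data sets, columns are five algorithms.
accuracies = [
    [1.00, 0.95, 0.93, 0.90, 0.85],
    [0.99, 0.94, 0.92, 0.88, 0.84],
    [1.00, 0.96, 0.91, 0.89, 0.86],
    [0.98, 0.93, 0.92, 0.87, 0.83],
    [1.00, 0.97, 0.94, 0.90, 0.82],
    [0.99, 0.95, 0.90, 0.89, 0.85],
]
CRITICAL_VALUE = 18.467  # Chi-squared critical value for alpha = 0.001 and DF = 4
chi2, mean_ranks = friedman_statistic(accuracies)
reject_h0 = chi2 > CRITICAL_VALUE
```

In this toy table the first algorithm is ranked first on every data set, so the statistic takes its maximal value n(k − 1) = 24 and H0 is rejected.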

Table 5 shows the Friedman test results of comparing the five algorithms on 270 different data sets of our synthetic data using the Statistical Tests for Algorithms Comparison (STAC) Python library [39]. The resulting Friedman statistic (Chi-squared value) is 379.611, which rejects the null-hypothesis. Furthermore, Table 6 presents the ranking of the algorithms based on their average ranks over all data sets, showing that our method is the best-performing algorithm.

Table 5. The Friedman test with significance level of 0.001.

https://doi.org/10.1371/journal.pone.0274569.t005

We proceeded with the Holm method [40] as a post-hoc test to compare the ML classifiers with the hybrid model as a control model. The null-hypothesis in this case states that the control method is equivalent to the other algorithms (compared in pairs). The null-hypothesis is rejected if the P value adjusted by the Holm method is lower than the significance level α = 0.001 or if the test statistic z is greater than the critical value of 3.090 (for α = 0.001). The z value for comparing the i-th and j-th classifier is $z = (R_i - R_j) / \sqrt{k(k+1)/(6n)}$, where R_i is the average rank of the i-th algorithm, k is the number of classifiers, and n is the number of data sets [37]. Table 7 shows the results of the post-hoc test using the STAC Web Platform [39]. All the pairwise comparisons reject the null-hypothesis in favor of the alternative hypothesis that the control method, here the hybrid model, is not equivalent to the other algorithms.
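The post-hoc procedure can be sketched as follows, using the z statistic from [37], z = (R_i − R_j)/√(k(k+1)/(6n)), and the Holm step-down adjustment. The average ranks below are hypothetical placeholders and are not the values of Table 6. The use of a one-sided normal P value is an assumption made here so that it matches the critical value 3.090 quoted for α = 0.001; the STAC platform may compute P values differently.

```python
from math import erfc, sqrt


def z_statistic(rank_i, rank_j, k, n):
    """z value for comparing two average ranks over n data sets and k classifiers."""
    return (rank_i - rank_j) / sqrt(k * (k + 1) / (6.0 * n))


def one_sided_p(z):
    """Upper-tail probability of the standard normal distribution (assumed one-sided test)."""
    return 0.5 * erfc(z / sqrt(2.0))


def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):
        candidate = min(1.0, (m - step) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


ALPHA = 0.001
K, N = 5, 270                  # classifiers and data sets, as in the Friedman test
control_rank = 1.5             # hypothetical average rank of the control method
other_ranks = {"DNN": 2.8, "SVM": 3.1, "RF": 3.6, "LR": 4.0}  # hypothetical ranks

z_values = {name: z_statistic(rank, control_rank, K, N) for name, rank in other_ranks.items()}
adjusted = dict(zip(z_values, holm_adjust([one_sided_p(z) for z in z_values.values()])))
rejected = {name: p < ALPHA for name, p in adjusted.items()}
```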

Table 7. Post-hoc test using the hybrid model as the control method.

https://doi.org/10.1371/journal.pone.0274569.t007

We also investigated the influence of the dimensionality of the synthetic data set and of the noise intensity on the classification efficiency of our model and compared the results with the other ML models. As the influences of the data dimensionality and the noise intensity on classification are highly dependent on the size of the training data Ntr, we performed the investigation for a fixed training-data size of N = 200 data points (or patients), a size that is meaningful for clinical studies.

The detailed results of the influence of the data dimensionality and noise intensity in the classification efficiency can be found in S4 Appendix, while visualization of the variation in model performance across different algorithms can be drawn from Fig 6. The classification efficiency of the hybrid model shows robustness against the increase of the data dimensionality, while, as attested by the COD, the performance of the other supervised ML algorithms notably suffers from the increase in the data dimension. Moreover, although adding noise to data can cause undesirable consequences to the prior knowledge that a hybrid model is built upon, the classification efficiency of the hybrid model outperforms other ML models even for noisy data.

Fig 6. The influence of the dimensionality and the noise intensity of the synthetic data in the classification efficiency.

The effect of data dimensionality (left) and noise intensity (right) in the average of the classification accuracy for 150 experiments executed for each model and for N = 200 data points.

https://doi.org/10.1371/journal.pone.0274569.g006

Lastly, we compared the time efficiency of our method with the other ML methods again with a fixed number of N = 200 training data points. Fig 7 shows the average running time for training the models and evaluating the predictions. The running time of the hybrid model is in the range of the other ML models even though the time needed for the hyperparameter optimization of the ML models is not considered.

Fig 7. The running time efficiency.

The comparison of the average running time of the examined methods for 150 experiments executed for each model and for N = 200 data points.

https://doi.org/10.1371/journal.pone.0274569.g007

Mortality estimation for cohorts of COVID-19 patients

To validate the potential applications of our strategy in life sciences, we studied the mortality in a cohort of 63 severely ill COVID-19 patients requiring ICU treatment. First, we fitted an SHM with an underlying tree structure to the COVID-19 data that maps the five binarized patient features (see Table 2) to the corresponding vital status. As discussed in S1 Appendix, the clinical and physiological information of the 63 patients was mapped to twenty different 5-dimensional binary representations out of the possible 2^5 = 32. Based on the knowledge about the nature of the patient features in Table 2, we generated a hybrid network that consisted of two first-layer black-box modules, see the Biometric and Physiology modules in Fig 8. The Biometric module operates on the binarized age and BMI attributes of the patients. The Physiology module receives inputs related to the accumulation value of two physiological parameters, namely PaO2/FiO2 and the urine output. Fig 8 also illustrates the associated orthotope of the binarized COVID-19 data. The empty cells represent input data for which the label, i.e., the vital status, is still to be determined. The i-o functions obtained from the interior black-box modules of the COVID-19 hybrid network after training are shown in S1 Fig.

Fig 8. The SHM and the associated orthotope for the COVID-19 data.

The upper figure shows the hybrid network mapping five binarized patient features to their vital status. The lower figure depicts the associated orthotope consisting of 20 cells with labels and 12 cells without a label.

https://doi.org/10.1371/journal.pone.0274569.g008

Fig 9 illustrates the results of our learning strategy to predict the vital status of the patients. To cross-validate the results of our strategy, we partitioned the 20 available 5-dimensional binary representations of the patient data into a training and a validation set. The test-data set then consisted of the twelve unlabeled 5-dimensional binary representations of the patient data and the validation set. The out-of-sample forecast performance was calculated for randomly selected training-data sets with 5 ≤ Ntr ≤ 20 for the tree-structured SHM. The out-of-sample forecast performance of an ML classifier is its test accuracy, which is the number of data points for which the label has been predicted correctly divided by the total size of the test-data set. Similarly, one can define the out-of-sample forecast performance of an SHM that performs a classification task as the number of unlabeled data points for which the SHM computes the right output divided by the total number of unlabeled valid inputs. The results on the COVID-19 data confirm the existence of training-data sets constituting ≈40% of the entire valid input space that have out-of-sample forecast performance equal to 1. Our method also yields out-of-sample forecast performance equal to 1 for all tested (randomly chosen) training-data sets of size at least 62% of the entire valid input space, which consists of the twenty different 5-dimensional binary representations of the considered 63 patients. The filled orthotope related to the COVID-19 hybrid network after training is shown in S2 Fig.
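The definition of the out-of-sample forecast performance can be written down as a short function. In this sketch, `reference` is a hypothetical dictionary holding the correct label of every valid input, `training_cells` is the set of inputs whose labels were given to the model, and `predicted` holds the labels computed for the remaining inputs; the names and the toy 2-bit input space are illustrative assumptions.

```python
def forecast_performance(predicted, reference, training_cells):
    """Fraction of valid inputs outside the training set that are labeled correctly."""
    test_cells = [cell for cell in reference if cell not in training_cells]
    if not test_cells:
        raise ValueError("the training set covers the entire valid input space")
    correct = sum(1 for cell in test_cells if predicted.get(cell) == reference[cell])
    return correct / len(test_cells)


# Toy 2-bit input space (not the COVID-19 data).
reference = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 1}
training_cells = {(0, 0), (1, 1)}
predicted = {(0, 1): 0, (1, 0): 0}   # one of the two unlabeled cells is correct

performance = forecast_performance(predicted, reference, training_cells)
```

A value of 1 means that the labels of all valid inputs outside the training set were recovered.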

Fig 9. The out-of-sample forecast performance for the vital status of COVID-19 patients.

The x-axis shows the percentage of the full input space that was used as the training data. For each training-data set size, we randomly sampled 1000 training data sets and measured the forecast performance. For randomly chosen training-data sets with a size of at least 56% of the entire valid input space, the median of the out-of-sample forecast performance equals 1.

https://doi.org/10.1371/journal.pone.0274569.g009

Conclusion

In this paper, we proposed a learning strategy for binary classification tasks in which the classification can be computed by a tree-structured network with binary input vectors. We designed a structured hybrid model where the mechanistic knowledge about the system consists of the tree structure of the network that computes the output. The learning strategy is described for hybrid models with randomly distributed training data instead of densely distributed training data on low-dimensional manifolds as assumed in [22, 23], and thereby it can be considered a systematic extension of those works.

Compared with purely data-driven methods, our strategy promises a lower data demand since fewer parameters need to be trained. As another direct result of incorporating prior knowledge into the modeling, it enables extrapolation. We evaluated our method by comparing its classification performance on synthetic data with various supervised ML algorithms. The numerical results testify to the lower data demand of our model as well as its ability to extrapolate to the entire valid binary input space.

We also applied our strategy to construct a tree-structured hybrid network that predicts the vital status of COVID-19 patients requiring intensive-care unit treatment and mechanical ventilation. The results show that our strategy can capture the mapping between binarized clinical patient information collected during the ICU stay and the patients' vital status. As our application shows, the proposed learning strategy for training hybrid predictive models in clinical studies has the potential to extrapolate, i.e., make reliable predictions outside the convex hull of the given clinical data. This property can boost applications of ML in medical and clinical research where small-sized or biased clinical data sets occur.

There are two major limitations in this study that could be addressed in future research. First, the general method introduced in this paper is limited to binary input data. We are convinced that our method can be extended to continuous input data since the input to any network is always specified within a finite precision and can therefore be discretized and binarized. We plan to develop a proper data binarization step preparatory to the training step to handle this limitation. Second, the study focused on tree-structured networks. In a non-tree structured network, some input features are connected to more than one black-box module, which limits the Conflict-Graph construction part of our training strategy. One way to overcome this limitation is to omit those input features forming a non-tree structure and train the model with the remaining features for all possible combinations of the omitted features. However, this approach is not efficient in terms of training data demand. The generalization of the training strategy introduced in this paper to non-tree structured networks is a follow-up project that seems to require a heuristic approach.

Supporting information

S1 Appendix. Schematic representation of the tree structures.

Tree-structured networks mapping x ∈ {0, 1}^d, where d ∈ {8, 9, 10, 11, 12}, to binary output labels y ∈ {0, 1}.

https://doi.org/10.1371/journal.pone.0274569.s001

(PDF)

S2 Appendix. COVID-19 data binarization.

Conversion of the clinical and physiological COVID-19 patient features to binary variables.

https://doi.org/10.1371/journal.pone.0274569.s002

(PDF)

S3 Appendix. Hyperparameter optimization results.

The optimized hyperparameters resulting from grid-search cross-validation and Keras Tuner for the supervised ML methods.

https://doi.org/10.1371/journal.pone.0274569.s003

(PDF)

S4 Appendix. The influence of the dimensionality and noise intensity of the synthetic data on the classification efficiency.

A comparison of the effect of data dimensionality and noise intensity on the classification efficiency of the hybrid model and the other supervised ML algorithms for the synthetic data.

https://doi.org/10.1371/journal.pone.0274569.s004

(PDF)

S1 Fig. The i-o functions obtained from the interior black-box modules of the COVID-19 hybrid network after training.

(Above:) An overview of the i-o function of the Biometric module, the Physiology module, and the output modules of the COVID-19 hybrid network. The circular and the radial axes represent the binary inputs to the module and mortality rates for the considered 63 COVID-19 patients, respectively. The low mortality rates reflect noise in the patient data. (Below:) The active cells of the orthotope related to each black-box module are highlighted in blue.

https://doi.org/10.1371/journal.pone.0274569.s005

(PDF)

S2 Fig. Filled orthotope of the COVID-19 hybrid network.

The filled orthotope of the COVID-19 network after performing the learning strategy. The black binary numbers represent the vital status in the original data, and the orange binary numbers display the predicted vital status.

https://doi.org/10.1371/journal.pone.0274569.s006

(PDF)

S1 Table. Physiological parameters used by the decision tree classifier of the COVID-19 patients’ vital status.

The physiological parameters required for the SOFA score assessment. The parameters were evaluated for the first 7 days of ICU stay.

https://doi.org/10.1371/journal.pone.0274569.s007

(PDF)

Acknowledgments

Moein E. Samadi’s contribution to this work was performed as part of the Helmholtz School for Data Science in Life, Earth, and Energy (HDS-LEE).

References

  1. Shalev-Shwartz S, Ben-David S. Understanding machine learning: From theory to algorithms. Cambridge university press; 2014 May 19.
  2. Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep learning for computational biology. Molecular systems biology. 2016 Jul;12(7):878. pmid:27474269
  3. Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface. 2018 Apr 30;15(141):20170387. pmid:29618526
  4. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Briefings in bioinformatics. 2017 Sep 1;18(5):851–69. pmid:27473064
  5. Gawehn E, Hiss JA, Schneider G. Deep learning in drug discovery. Molecular informatics. 2016 Jan;35(1):3–14. pmid:27491648
  6. Hooker G. Diagnosing extrapolation: Tree-based density estimation. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining 2004 Aug 22 (pp. 569-574).
  7. Barbiero P, Squillero G, Tonda A. Modeling generalization in machine learning: A methodological and computational study. arXiv preprint arXiv:2006.15680. 2020 Jun 28.
  8. Van Can HJ, Te Braake HA, Dubbelman S, Hellinga C, Luyben KC, Heijnen JJ. Understanding and applying the extrapolation properties of serial gray-box models. AIChE journal. 1998 May;44(5):1071–89.
  9. Bartley ML, Hanks EM, Schliep EM, Soranno PA, Wagner T. Identifying and characterizing extrapolation in multivariate response data. PloS one. 2019 Dec 5;14(12):e0225715. pmid:31805095
  10. Altman N, Krzywinski M. The curse(s) of dimensionality. Nat Methods. 2018 Jun 1;15(6):399–400. pmid:29855577
  11. Kpotufe S. Escaping the curse of dimensionality with a tree-based regressor. arXiv preprint arXiv:0902.3453. 2009 Feb 19.
  12. Clarke R, Ressom HW, Wang A, Xuan J, Liu MC, Gehan EA, Wang Y. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature reviews Cancer. 2008 Jan;8(1):37–49. pmid:18097463
  13. Bach F. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research. 2017 Jan 1;18(1):629–81.
  14. Mallat S. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016 Apr 13;374(2065):20150203.
  15. Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing. 2017 Oct 1;14(5):503–19.
  16. Chen D, Liu S, Kingsbury P, Sohn S, Storlie CB, Habermann EB, et al. Deep learning and alternative learning strategies for retrospective real-world clinical data. NPJ digital medicine. 2019 May 30;2(1):1–5. pmid:31304389
  17. Fröhlich H, Balling R, Beerenwinkel N, Kohlbacher O, Kumar S, Lengauer T, et al. From hype to reality: data science enabling personalized medicine. BMC medicine. 2018 Dec;16(1):1–5. pmid:30145981
  18. Knight SR, Ho A, Pius R, Buchan I, Carson G, Drake TM, et al. Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: development and validation of the 4C Mortality Score. BMJ. 2020 Sep 9;370. pmid:32907855
  19. Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020 Apr 7;369. pmid:32265220
  20. Syeda HB, Syed M, Sexton KW, Syed S, Begum S, Syed F, et al. Role of machine learning techniques to tackle the COVID-19 crisis: Systematic review. JMIR medical informatics. 2021 Jan 11;9(1):e23811. pmid:33326405
  21. Sharafutdinov K, Fritsch SJ, Marx G, Bickenbach J, Schuppert A. Biometric covariates and outcome in COVID-19 patients: are we looking close enough? BMC infectious diseases. 2021 Dec;21(1):1–9. pmid:34736400
  22. Schuppert A. Extrapolability of structured hybrid models: a key to optimization of complex processes. In Equadiff 99: (In 2 Volumes) 2000 (pp. 1135–1151).
  23. Fiedler B, Schuppert A. Local identification of scalar hybrid models with tree structure. IMA Journal of Applied Mathematics. 2008 Jun;73(3):449–76.
  24. Schmidt AL, Bandar ZU. Modularity: a concept for new neural network architectures. In Proc. IASTED International Conf. Computer Systems and Applications 1998 Mar (pp. 26-29).
  25. Thompson ML, Kramer MA. Modeling chemical processes using prior knowledge and neural networks. AIChE Journal. 1994 Aug;40(8):1328–40.
  26. Von Stosch M, Oliveira R, Peres J, de Azevedo SF. Hybrid semi-parametric modeling in process systems engineering: Past, present and future. Computers & Chemical Engineering. 2014 Jan 10;60:86–101.
  27. Kahrs O, Marquardt W. The validity domain of hybrid models and its application in process optimization. Chemical Engineering and Processing: Process Intensification. 2007 Nov 1;46(11):1054–66.
  28. Overhage JM, Overhage LM. Sensible use of observational clinical data. Statistical methods in medical research. 2013 Feb;22(1):7–13. pmid:21828172
  29. Ries B, de Werra D. On two coloring problems in mixed graphs. European Journal of Combinatorics. 2008 Apr 1;29(3):712–25.
  30. Konig D. Theorie der endlichen und unendlichen Graphen. American Mathematical Soc.; 2001.
  31. Vincent JL, et al. Use of the SOFA score to assess the incidence of organ dysfunction/failure in intensive care units: results of a multicenter, prospective study. Critical care medicine. 1998 Nov 1;26(11):1793–800. pmid:9824069
  32. Goutte C, Gaussier E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European conference on information retrieval 2005 Mar 21 (pp. 345-359). Springer, Berlin, Heidelberg.
  33. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. the Journal of machine Learning research. 2011 Nov 1;12:2825–30.
  34. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, et al. Tensorflow: large-scale machine learning on heterogeneous distributed systems (2016). arXiv preprint arXiv:1603.04467. 2015;52.
  35. O’Malley T, Bursztein E, Long J, Chollet F, Jin H, Invernizzi L, et al. Keras Tuner. 2019. github.com/keras-team/keras-tuner.
  36. Cascos I. The expected convex hull trimmed regions of a sample. Computational Statistics. 2007 Dec;22(4):557–69.
  37. Demšar J. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine learning research. 2006 Dec 1;7:1–30.
  38. Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the american statistical association. 1937 Dec 1;32(200):675–701.
  39. Rodríguez-Fdez I, Canosa A, Mucientes M, Bugarín A. STAC: a web platform for the comparison of algorithms using statistical tests. In 2015 IEEE international conference on fuzzy systems (FUZZ-IEEE) 2015 Aug 2 (pp. 1-8). IEEE.
  40. Holm S. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics. 1979 Jan 1:65–70.