Weakly Supervised Learning Creates a Fusion of Modeling Cultures
The past two decades have witnessed the great success of the algorithmic modeling framework advocated by Breiman et al. (2001). Nevertheless, the excellent prediction performance of these black-box models relies heavily on the availability of strong supervision, i.e., a large set of accurate and exact ground-truth labels. In practice, strong supervision can be unavailable or expensive, which calls for modeling techniques under weak supervision. In this comment, we summarize the key concepts in weakly supervised learning and discuss some recent developments in the field. Using algorithmic modeling alone under weak supervision might lead to unstable and misleading results. A promising direction would be integrating the data modeling culture into such a framework.
Weakly Supervised Learning, Algorithmic Modeling, Data Modeling
As an important think piece for both the statistics and machine learning communities, Breiman et al. (2001) laid out the contrast between two cultures of modeling: data modeling and algorithmic modeling. It pointed out the limitations of data modeling and the opportunities and potential of algorithmic modeling. Over the past two decades, the vision of Breiman et al. (2001) for algorithmic modeling has been validated by the rapid development and application of complicated yet effective algorithmic models, e.g., deep learning. Meanwhile, new challenges and opportunities emerge every day as we continue to deal with data of increasing size and complexity. In this comment, we offer a brief discussion of recent developments in the field of weakly supervised learning and discuss how it creates a need for data modeling thinking in an algorithmic modeling framework.
Following the taxonomy introduced in Breiman et al. (2001), the data modeling culture refers to methods that explicitly assume a stochastic model for data generation. Often, methods of this culture have shallow structures and are easy to interpret. Typical examples include linear regression and logistic regression, to name a few. The validity of such methods is backed by the probabilistic properties of their outputs, assessed through tools such as goodness-of-fit tests and residual analyses. In contrast, the algorithmic modeling culture aims to learn the complex and unknown nature of true data generation mechanisms through "black-box" algorithms. Typical examples of this culture include decision trees, support vector machines (SVM), and neural networks (NN). The training and evaluation of these algorithms are guided by predictive accuracy.
The past two decades have witnessed the rapid expansion and success of the algorithmic modeling culture. From self-driving cars (Bojarski et al., 2016) to virtual assistants (Devlin et al., 2018), complicated algorithmic models such as deep neural networks (DNNs) have demonstrated their potential for leveraging today's big data and affordable high-performance computational resources to produce predictions that are comparable to human performance. However, training such algorithms to attain impressive performance relies heavily on a large volume of training data with high-quality labels (see Figure 1), which are often expensive or even unavailable in many real-world applications. In particular, such strong supervision becomes substantially scarcer in more specialized application domains, such as healthcare (Miotto et al., 2018) and ecological studies (Christin et al., 2019; Tang et al., 2021), where domain expertise is vital in data labeling. As a result, the lack of strong supervision in many real-world applications significantly limits the applicability and generalization of algorithmic models.
Weakly supervised learning (WSL) (Zhou, 2018) addresses the more realistic setting where supervision is available but weak, under various practical scenarios. It expands the reach of conventional supervised learning and has garnered considerable interest in applications (e.g., Jorgensen et al., 2008; Oquab et al., 2015; Peyre et al., 2017). In algorithmic modeling, strong supervision comes from a large set of accurately labeled data. Such supervision may be weakened in broadly three ways: incomplete supervision, inexact supervision, and inaccurate supervision (Zhou, 2018).
Let X be the input features. Let Y be the outcome of interest. When Y's values are available in the training data as labels, they provide strong supervision for the algorithmic modeling, through a loss function L(Y, Ŷ), where Ŷ denotes the model's prediction (Figure 1). In practice, the exact and accurate values of Y are often unavailable in the training data. Instead, let Ỹ be the observed (weak) labels in the training data. Here, we introduce a unified notation, W, for the generating mechanism of the weakened supervision Ỹ. Using the above notation, the framework of WSL is summarized in Figure 2. WSL shares the same learning goal with methods of the algorithmic modeling culture in Breiman et al. (2001), that is, to train a function f(X) such that Y can be accurately predicted or approximated by f(X). Here, the challenges arise mostly from the lack of strong labels Y and the need to create effective supervision based on the observed Ỹ.
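As a minimal illustration of the three weakening mechanisms in this notation, the sketch below applies toy versions of W to a vector of true binary labels Y. The function names, the use of `None` for a missing label, and the "any positive member" rule for bag-level labels are our illustrative assumptions and are not taken from Zhou (2018).

```python
from typing import Dict, List, Optional, Sequence, Set


def weaken_incomplete(y: Sequence[int], labeled: Set[int]) -> List[Optional[int]]:
    """Incomplete supervision: only indices in `labeled` keep their label."""
    return [label if i in labeled else None for i, label in enumerate(y)]


def weaken_inexact(y: Sequence[int], bags: Dict[str, List[int]]) -> Dict[str, int]:
    """Inexact supervision: one coarse label per bag (1 if any member is 1)."""
    return {name: max(y[i] for i in members) for name, members in bags.items()}


def weaken_inaccurate(y: Sequence[int], flipped: Set[int]) -> List[int]:
    """Inaccurate supervision: binary labels at indices in `flipped` are wrong."""
    return [1 - label if i in flipped else label for i, label in enumerate(y)]


# True labels Y, which would be unobserved in a real weakly supervised task.
y_true = [0, 1, 1, 0, 0, 1]

y_incomplete = weaken_incomplete(y_true, labeled={0, 1})
y_inexact = weaken_inexact(y_true, bags={"a": [0, 1, 2], "b": [3, 4], "c": [5]})
y_inaccurate = weaken_inaccurate(y_true, flipped={2, 3})
```

In each case the learner sees only the output of W, never `y_true`, which is why the form of W has to be assumed or modeled.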
Directly applying algorithmic modeling to data with weakened training labels, without considering the weak supervision generating mechanism W, could lead to results that are unstable and overfitted (Frénay and Verleysen, 2013; Van Engelen and Hoos, 2020). Take semi-supervised learning as an example, which can be thought of as a special case of incomplete supervision (Zhou, 2018). Most algorithms in the field of semi-supervised learning rely on the assumption that the labels are missing completely at random. When this assumption is violated in real data, semi-supervised learning algorithms may actually degrade the learning performance, compared to applying supervised learning methods directly to the labeled portion of the dataset (Zhu, 2008). Another example is training DNNs with noisy training labels, i.e., inaccurate supervision. Zhang et al. (2017) provided empirical results showing that DNNs can fit training data with randomly shuffled labels arbitrarily well. Not surprisingly, the generalization performance of the trained DNNs on test sets was no better than random guessing. Even for the relatively shallow tree ensemble models, numerical experiments have shown that the adaptive boosting algorithm (AdaBoost) disproportionately focuses on learning mislabeled instances when label noise exists (Dietterich, 2000). Therefore, an algorithmic modeling framework under weak supervision needs to explicitly acknowledge the weakening mechanism W.
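To see why the missing-completely-at-random assumption in the semi-supervised example matters, the following sketch computes the class balance among labeled instances when the chance of being labeled depends on the true label. The function name and the numbers are hypothetical; the calculation is simply Bayes' rule.

```python
def labeled_positive_rate(prior_pos: float, p_label_pos: float, p_label_neg: float) -> float:
    """P(Y = 1 | instance is labeled), by Bayes' rule.

    prior_pos   : P(Y = 1) in the full population
    p_label_pos : P(labeled | Y = 1)
    p_label_neg : P(labeled | Y = 0)
    """
    labeled_pos = prior_pos * p_label_pos
    labeled_neg = (1.0 - prior_pos) * p_label_neg
    return labeled_pos / (labeled_pos + labeled_neg)


# Missing completely at random: labeling ignores Y, so the labeled
# subset preserves the population class balance.
mcar_rate = labeled_positive_rate(prior_pos=0.2, p_label_pos=0.1, p_label_neg=0.1)

# Labeling depends on Y (positives are five times as likely to be labeled):
# the labeled subset over-represents the positive class.
mnar_rate = labeled_positive_rate(prior_pos=0.2, p_label_pos=0.5, p_label_neg=0.1)
```

In this toy setting the labeled subset suggests a positive rate of 5/9 against a population rate of 0.2, so a method that treats labeled and unlabeled instances as exchangeable starts from a distorted picture of the data.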
The entire promise of WSL lies within the assumption that the weak labels Ỹ in the training data carry partial information about Y through the weak supervision generating mechanism W. Most current methods in WSL assume that the mapping from Y to Ỹ by W is independent of the features X and the true labels Y. A more realistic scenario, however, would be that the mechanism W depends on both X and Y. Consider the joint distribution of the observed weakened labels Ỹ and the features X,

P(Ỹ, X) = Σ_Y P(Ỹ|Y, X) P(Y|X) P(X). (1)
As shown in Figure 2, our learning goal remains to fit P(Y|X), the unknown data generating mechanism, with a model f(X), even when we lack direct observations of Y. To allow information in Ỹ to be passed on to the learning of f(X), it is critical to model W in Figure 2, i.e., P(Ỹ|Y, X) on the right-hand side of Equation (1). In Figure 3, we introduce a highly generalized notation g(Ỹ|Y, X) to encapsulate models and approaches for the mechanism W. In practice, characterizing P(Ỹ|Y, X) could be challenging as information is often scarce. In fact, without additional information beyond the training data, it is not possible to effectively leverage the weak supervision that is offered by Ỹ (Frénay and Verleysen, 2013; Zhou, 2018). In the weakly supervised learning literature, additional information for constructing g(Ỹ|Y, X) has been introduced in the form of assumptions on W and/or small sets of data with observed Y. This is primarily motivated by the need for transparency and interpretability of g(Ỹ|Y, X), so that prior knowledge can be incorporated into "end-to-end" modeling frameworks. In other words, it is desirable to have the modeling of W be "assumption-driven" rather than "data-driven" or "accuracy-driven", which creates a role for the data modeling culture within an algorithmic modeling framework (Figure 3). Some progress has been made in the literature to address each type of weak supervision.
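One simple instance of g(Ỹ|Y, X), with Ỹ denoting the observed weak label, is a class-conditional noise model in which the weak label depends on Y but not on X. W then reduces to a transition matrix T with T[y][k] = P(Ỹ = k | Y = y). Under that assumption (ours, for illustration only), the weak-label distribution given X follows by summing over the unobserved Y. All names below are hypothetical.

```python
from typing import List, Sequence


def weak_label_distribution(
    p_y_given_x: Sequence[float],
    transition: Sequence[Sequence[float]],
) -> List[float]:
    """Return P(Y_weak = k | X) = sum_y P(Y_weak = k | Y = y) * P(Y = y | X).

    p_y_given_x : model output f(X), a probability vector over true classes
    transition  : transition[y][k] = P(Y_weak = k | Y = y); each row sums to 1
    """
    n_weak = len(transition[0])
    return [
        sum(p_y * transition[y][k] for y, p_y in enumerate(p_y_given_x))
        for k in range(n_weak)
    ]


# Assumed noise model: class 0 is mislabeled 10% of the time, class 1 30%.
T = [[0.9, 0.1],
     [0.3, 0.7]]
p_clean = [0.25, 0.75]  # a hypothetical f(X) for one instance
p_weak = weak_label_distribution(p_clean, T)
```

Comparing `p_weak`, rather than `p_clean`, with the observed weak labels is one way to let an assumed W enter the loss while f(X) still targets P(Y|X).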
For incomplete supervision, where labels are only available for a small subset of the training data, active learning algorithms (Settles, 2009) attempt to better extract label information by "actively" querying an "oracle" (e.g., a human annotator) for the labels of selected unlabeled instances. This framework has been widely used in image classification (Joshi et al., 2009; Kapoor et al., 2007; Li and Guo, 2013). Assuming the existence of an "oracle", the key component of active learning is to choose the most "valuable" instance to query. To this end, measures of informativeness and representativeness of individual observations have been proposed (Settles, 2009). For example, Bayesian active learning methods estimate the expected improvement of each instance query through nonparametric models such as Gaussian processes and through Monte Carlo estimation (e.g., Gal et al., 2017; Kapoor et al., 2007; Roy and McCallum, 2001). As another approach to incomplete supervision, semi-supervised learning algorithms (Chapelle et al., 2009; Zhu, 2005) utilize the unlabeled training data as well as the labeled data to improve prediction accuracy. Transductive methods were proposed to obtain label predictions for unlabeled data points (Van Engelen and Hoos, 2020), including the use of probabilistic models, such as Markov random fields and Gaussian random fields, for label assignments (e.g., Shental and Domany, 2005; Wu et al., 2012; Zhu and Ghahramani, 2002).
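As a sketch of how a "valuable" instance might be chosen, the example below uses predictive entropy, one common informativeness measure (uncertainty sampling), to pick the next query from an unlabeled pool. The pool, the probabilities, and the function names are hypothetical, and this is not the Bayesian expected-improvement approach cited above.

```python
import math
from typing import Dict, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_query(predictions: Dict[str, Sequence[float]]) -> str:
    """Return the id of the unlabeled instance with the highest entropy."""
    return max(predictions, key=lambda key: entropy(predictions[key]))


# Hypothetical class probabilities from the current model on the unlabeled pool.
pool = {
    "x1": [0.95, 0.05],
    "x2": [0.55, 0.45],
    "x3": [0.80, 0.20],
}
next_query = select_query(pool)
```

Here the instance whose prediction is closest to uniform is sent to the oracle; a representativeness term could be added to the score to avoid querying outliers.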
Inexact supervision addresses the situation where the given labels are at coarser scales than desired. For example, in many real-world object segmentation tasks, only image-level training labels are available, while the task is to localize each object. Multi-instance learning (Zhou and Zhang, 2007) is one such example with a bag-of-instances setup: instances x_ij are organized in bags X_j, and the labels in the training set are only given at the bag level. A common assumption for this task is that the bag-level class probability is the maximum of all the instance-level class probabilities within the bag. This assumption bridges the gap between instance predictions and observed bag labels. Another example is the concept labeling method (Chenthamarakshan et al., 2011), which assumes a soft bag-instance structure. In their Bayesian modeling framework, each document (instance) X has a distribution P(V|X) over the concepts of the ontology V (bag). It is assumed that the outcome variable of interest Y, the category, is conditionally independent of the document content X given the ontology concept V. It follows that P(Y|X) = Σ_V P(Y, V|X) = Σ_V P(Y|V)P(V|X). As a result, by separately modeling the document-to-concept distribution P(V|X) and the concept-to-class distribution P(Y|V), the instance-level document label predictions P(Y|X) can be obtained.
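Both assumptions in this paragraph are short computations, sketched below with made-up numbers. The function names, concept names, and category names are hypothetical and are not taken from the cited papers.

```python
from typing import Dict, Sequence


def bag_probability(instance_probs: Sequence[float]) -> float:
    """Bag-level positive probability under the max assumption."""
    return max(instance_probs)


def class_given_document(
    p_concept_given_doc: Dict[str, float],
    p_class_given_concept: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """P(Y|X) = sum over V of P(Y|V) * P(V|X)."""
    classes = next(iter(p_class_given_concept.values())).keys()
    return {
        y: sum(p_v * p_class_given_concept[v][y] for v, p_v in p_concept_given_doc.items())
        for y in classes
    }


# Multi-instance learning: the bag is as positive as its most positive instance.
p_bag = bag_probability([0.1, 0.7, 0.3])

# Concept labeling with two hypothetical concepts and two categories.
p_v_given_x = {"genetics": 0.8, "imaging": 0.2}
p_y_given_v = {
    "genetics": {"biology": 0.9, "physics": 0.1},
    "imaging": {"biology": 0.4, "physics": 0.6},
}
p_y_given_x = class_given_document(p_v_given_x, p_y_given_v)
```

In both cases an explicit structural assumption links what the model predicts at the instance level to what is observed at the coarser level.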
Inaccurate supervision concerns the situation where labels are a noisy version of the ground truth. To learn with noisy labels, many algorithms assume that the noise is randomly generated. Brodley and Friedl (1999) proposed to first identify the potentially mislabeled instances and then perform label correction. Northcutt et al. (2021) proposed the Confident Learning framework, which determines which labels are more likely to be contaminated, based on an estimated joint distribution of the true label Y and the observed label Ỹ. The data programming approach proposed by Ratner et al. (2016) is a paradigm for integrating noisy labels from multiple sources and deriving a better training set using a dependency graph that incorporates different assumptions on the weak supervision generating mechanisms.
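The thresholding idea behind the framework of Northcutt et al. (2021) can be sketched in a few lines. This is a simplified illustration under our own assumptions: it takes out-of-sample predicted probabilities as given, requires every class to appear at least once among the observed labels, and omits the estimation of the joint distribution and the pruning steps of the published method. It is not the authors' implementation.

```python
from typing import List, Sequence


def class_thresholds(labels: Sequence[int], probs: Sequence[Sequence[float]]) -> List[float]:
    """Per-class threshold: mean predicted probability of class j
    over the instances whose observed label is j."""
    n_classes = len(probs[0])
    thresholds = []
    for j in range(n_classes):
        scores = [p[j] for y, p in zip(labels, probs) if y == j]
        thresholds.append(sum(scores) / len(scores))
    return thresholds


def flag_suspect_labels(labels: Sequence[int], probs: Sequence[Sequence[float]]) -> List[int]:
    """Indices whose confidently predicted class differs from the observed label."""
    thresholds = class_thresholds(labels, probs)
    suspects = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        # Classes whose predicted probability clears that class's threshold.
        confident = [j for j, t in enumerate(thresholds) if p[j] >= t]
        if confident:
            guess = max(confident, key=lambda j: p[j])
            if guess != y:
                suspects.append(i)
    return suspects


# Hypothetical out-of-sample predicted probabilities for six instances.
observed = [0, 0, 0, 1, 1, 1]
predicted = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model is confident it is class 1
    [0.2, 0.8],
    [0.3, 0.7],
    [0.4, 0.6],
]
suspects = flag_suspect_labels(observed, predicted)
```

Instances for which no class clears its threshold are left unflagged, so the rule only questions a label when the model is confident about a different class.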
For any WSL framework, optimizing the generalization performance of the learned model f(X) for P(Y|X) remains the main goal. However, it is important to consider the practical issues caused by the imperfection of available data and to construct "end-to-end" learning frameworks that take raw training data and deliver reliable final models. In this comment, we argue that the scarcity of strong supervision in many real-world applications calls for a fusion of modeling cultures that allows creative combinations of assumption-driven and data-driven approaches. Many open problems and challenges remain to be explored. In particular, in many real-world learning tasks, all the above weak supervision scenarios may apply at the same time (e.g., noisy and inexact labels are only available on a small subset, as seen in Tang et al. (2021)). Most of the existing weakly supervised learning methods focus only on a single type of weak supervision. As a result, the weak supervision generating mechanism P(Ỹ|Y, X) is usually over-simplified in practice. Much of the statistical literature from the data modeling culture, e.g., robust statistics and methods for missing data, may find application in end-to-end workflows of weakly supervised learning. In addition, many current methods in WSL incorporate assumptions on the weak supervision in an ad hoc fashion. For the same reasons that have led to the lack of strong supervision in the training data, it is also impractical to assume that one can validate the learning framework using prediction accuracy on test data alone. Systematic model checking with respect to the weak supervision generating mechanism W is needed.
Columbia University
1255 Amsterdam Avenue, New York, NY 10027
ct2747@columbia.edu
Columbia University
1255 Amsterdam Avenue, New York, NY 10027
gy2277@columbia.edu
Columbia University
1255 Amsterdam Avenue, New York, NY 10027
tian.zheng@columbia.edu