Taxonomy of machine learning paradigms: A data-centric perspective

Machine learning is a field composed of various pillars. Traditionally, supervised learning (SL), unsupervised learning (UL), and reinforcement learning (RL) are the dominating learning paradigms that inspired the field since the 1950s. Based on these, thousands of different methods have been developed during the last seven decades used in nearly all application domains. However, recently, other learning paradigms are gaining momentum which complement and extend the above learning paradigms significantly. These are multi-label learning (MLL), semi-supervised learning (SSL), one-class classification (OCC), positive-unlabeled learning (PUL), transfer learning (TL), multi-task learning (MTL), and one-shot learning (OSL). The purpose of this article is a systematic discussion of these modern learning paradigms and their connection to the traditional ones. We discuss each of the learning paradigms formally by defining key constituents and paying particular attention to the data requirements for allowing an easy connection to applications. That means, we assume a data-driven perspective. This perspective will also allow a systematic identification of relations between the individual learning paradigms in the form of a learning-paradigm graph (LP-graph). Overall, the LP-graph establishes a taxonomy among 10 different learning paradigms.

Machine learning has a long history, and systematic studies date back at least to the 1940s. Although there is no definite starting date for the field, Arthur Samuel is frequently credited with coining the term "machine learning" in the 1950s while working on checkers playing (Samuel, 1959). In general, one needs to distinguish between a machine learning paradigm and methodological realizations thereof. For instance, supervised learning (SL) is a machine learning paradigm that deals with either classification or regression problems (Carbonell et al., 1983). In the case of classification, this allows the categorization of different instances into separate classes, for example, with a support-vector machine (SVM; Vapnik, 1995), which is a particular method that can be used for such an analysis. Since the beginnings of machine learning, the field has focused on three major learning paradigms: SL, unsupervised learning (UL), and reinforcement learning (RL). For these paradigms, thousands of different methods and algorithms (Bishop, 2006; Flach, 2012; Hastie et al., 2009) have been developed over the decades and studied in various application domains ranging from biology, medicine, and engineering to sociology and psychology (Dwyer et al., 2018; Smolander et al., 2019; J. Wang et al., 2018).
We think it is fair to say that to this day, machine learning is dominated by the above three learning paradigms and methodological realizations thereof, to the extent that most people would struggle to even name extensions thereof. For instance, a recent review of machine learning applications in manufacturing focused only on SL, UL, and RL (Wuest et al., 2016). Similarly exclusive reviews from other domains can be found in Jordan and Mitchell (2015), Libbrecht and Noble (2015), and Vamathevan et al. (2019). The purpose of this article is a systematic review and discussion of machine learning paradigms beyond the traditional ones.
Specifically, we discuss seven more recent learning paradigms that extend and complement the three traditional machine learning paradigms. These learning paradigms are multi-label learning (MLL), semi-supervised learning (SSL), one-class classification (OCC), positive-unlabeled learning (PUL), transfer learning (TL), multi-task learning (MTL), and one-shot learning (OSL). In this article, each of these learning paradigms is introduced formally by defining key constituents. We will pay particular attention to the data requirements to allow an easy connection to applications. That means we assume a data-driven perspective.
There are many survey papers focusing on a variety of subfields of machine learning. For instance, there are reviews about methods for SSL (Van Engelen & Hoos, 2020), OCC (Rodionova et al., 2016), PUL (Bekker & Davis, 2020), few/OSL (Y. Wang et al., 2020), TL (Weiss et al., 2016), MTL (Y. , and MLL (Gibaja & Ventura, 2014), or applications in a variety of different fields. However, our paper is different. Specifically, the novel contribution of our paper is fourfold. First, instead of focusing on methods or applications of methods, we focus on learning paradigms. That means we are concerned with the fundamental ideas and definitions underlying machine learning paradigms. Second, we do not discuss just one learning paradigm but present 10 different machine learning paradigms side-by-side. This gives a global overview and insights by avoiding tunnel vision. Third, despite the formal nature of a learning paradigm, we assume a data-centric perspective, which has two advantages: for applications, this makes it easier to select the most appropriate paradigm for a given problem, and it simplifies the formal comparison of machine learning paradigms. Fourth, we provide a comparison of 10 machine learning paradigms in the form of an interrelation diagram, which we call the learning-paradigm graph (LP-graph). In the LP-graph, two learning paradigms are connected if one can map one paradigm onto another one by a data or information influencing operation (which we define in Section 12). This establishes a taxonomy among the 10 learning paradigms. Finally, we would like to mention that, to the best of our knowledge, so far there is no review that covers all 10 machine learning paradigms side-by-side and discusses their relations.
This article is organized as follows. In the next section, we present information about the research methods needed for the remainder of the paper. Then we briefly review the three traditional machine learning paradigms, that is, SL, UL, and RL, to make the later comparisons more clear. Thereafter, we discuss seven modern machine learning paradigms and present applications thereof. The following section discusses the interrelation between all machine learning paradigms and introduces formal mappings between these. This will establish a taxonomy. Finally, the paper finishes with conclusions.

| RESEARCH METHODS
In this article, our focus is on machine learning paradigms and their relations. According to Kuhn (1970) and Capra (1996), a paradigm is characterized as follows.
A scientific paradigm is a set of concepts, patterns, or assumptions to which those in a particular professional community are committed and which form the basis of further research.
In order to further emphasize the importance of a paradigm in a scientific context, the term worldview has been suggested as a synonym to describe "a way of thinking about and making sense of the complexities of the real world" (Kaushik & Walsh, 2019;Patton & Fund, 2002). Since machine learning is a scientific field, the above definition can be directly applied to define the machine learning paradigm.
The brief discussion above about the term paradigm should make it clear that machine learning paradigms correspond to worldviews of learning from data and, as such, are conceptually more important than individual methods or models used for studying practical application problems. Put simply, machine learning paradigms allow one to perceive the entire universe, whereas the methods correspond to its stars.
In this article, we discuss a variety of machine learning paradigms that are beyond the well-known traditional paradigms SL, UL, and RL. Specifically, we will discuss seven further learning paradigms: SSL, OCC, PUL, OSL, TL, MTL, and MLL. Furthermore, we discuss the interrelations between all these machine learning paradigms.

| Literature review of the different machine learning paradigms
In order to show that each of the above learning paradigms is already in use, at least to a certain extent, and not just a theoretical construct with no value for real-world applications, we conducted a literature study. The result of this literature search provides us with information about (1) the frequency of usage of each machine learning paradigm and (2) its time evolution.

| Frequency of usage of a machine learning paradigm
For our literature study, we conducted a keyword search for published articles in Google Scholar, PubMed, Web of Science (WoS), and the Conference on Neural Information Processing Systems (NIPS) and counted the frequency of listed articles. The results are shown in Figure 1, where we show an overview of the usage of the 10 different machine learning paradigms. In order to show all results in one figure, we rescaled the numbers logarithmically (base 10) because the frequency of published articles varies greatly for the different paradigms.
We selected Google Scholar as a reference literature database for general publications in all areas of science, PubMed for publications in biology and the biomedical sciences, WoS as a well-curated citation database, and NIPS for its leading role in machine learning conferences. This diversity should allow us to obtain a broad overview of the appearance of the machine learning paradigms in very different fields of research across nearly all scientific areas. From this literature search, we can draw the following conclusions.

Figure 1: Number of published articles about different machine learning paradigms as found in different sources. The y-axis is on a logarithmic (base 10) scale because the number of articles varies greatly.
From Figure 1, one can make a number of different observations. First, despite the fact that the total number of publications in the different sources is considerably different, the overall pattern seems quite similar. Specifically, the order in the number of published articles across the different machine learning paradigms is similar across the sources (Figure 1). This is remarkable considering the vast differences in the underlying scientific communities, because PubMed represents application studies in biology and the biomedical sciences, whereas NIPS is a flagship conference for technical methods and machine learning theories.
Second, there is a large factor range between the usage of advanced machine learning paradigms and the traditional machine learning paradigms, for example, SL. Specifically, for TL one finds factors of 7.2, 1.8, 2.4, and 9.3 for Google Scholar, PubMed, WoS, and NIPS, whereas for PUL the factors are 791.9, 81.1, 236.7, and 268.3, respectively. That means, for instance, for every published article in WoS about PUL one finds on average 236.7 (= 19,882/84) articles about SL. From these factor ranges one can conclude that all advanced machine learning paradigms are used to a much lesser extent than SL, UL, and RL regardless of the scientific community, as exemplified by the four different sources in Figure 1.

| Time evolution of the usage of machine learning paradigms
In order to complement the information about the number of published articles about different machine learning paradigms shown in Figure 1, we also studied the time evolution. Specifically, in Figure 2, we show the time series of the number of publications about different machine learning paradigms according to Google Scholar (A.) and Web of Science (B.). We use Google Scholar (GS) and Web of Science (WoS) because GS represents general publications in all areas of science, while WoS is a well-curated database.
As one can see from Figure 2a,b, the three traditional machine learning paradigms, SL, UL, and RL (denoted by the numbers 1-3 in the figure), were the first to be adopted in the literature and to this day quantitatively dominate the number of publications. It is this observation on which we base the use of the term "modern" when distinguishing the seven nontraditional machine learning paradigms from the three traditional ones.

| Traditional machine learning paradigms
Before we discuss the advanced machine learning paradigms, we briefly review the three traditional learning paradigms.
This will allow a well-defined comparison in later sections to identify similarities and differences between the traditional and advanced learning paradigms.

| Supervised learning
For SL, one needs two components: (i) data and (ii) a task. The data provide information about the instances in the form

D = {(x_i, y_i)}_{i=1}^n, (1)

with x_i ∈ 𝒳 and y_i ∈ 𝒴, where 𝒳 is the feature space (also known as input space), 𝒴 is the outcome space, and n is the sample size. In general, the elements in the feature space represent information about an experiment, for example, the measurement of gene expression values, sensor signals, or the buying behavior of a consumer. Depending on the underlying experiment, the dimensionality of the feature space and its type are defined. For instance, measuring the gene expression of m genes results in m-dimensional feature vectors x ∈ 𝒳 with dim(x) = m. Furthermore, the way the components of a feature vector are measured defines the scale (or level) of a measurement. For example, this can correspond to real values, integer values, or categorical variables. For simplicity, in the following we assume 𝒳 = ℝ^m, that is, the feature space corresponds to m-dimensional real numbers. For the elements in 𝒴, we need to distinguish two cases. Case (i): elements in 𝒴 are categorical. Case (ii): elements in 𝒴 are real valued. The first case leads to classification problems, whereas the second corresponds to regression problems. In its simplest form, a categorical space leads to binary classification, that is, 𝒴 = {C_1, C_2} with class labels C_1 and C_2. In this case, 𝒴 is called the label space. In contrast, a one-dimensional real-valued 𝒴 = ℝ leads to (multiple) linear (or nonlinear) regression. Hence, the scale of measurement of 𝒴 decides whether one has a classification or a regression problem.
Given the definition of elements in 𝒳 and 𝒴, their common occurrence is given by the joint probability distribution P(X, Y), which allows one to draw samples, that is, (x, y) ~ P(X, Y). Integration over the 𝒴 space results in the marginal distribution P(X), that is, P(X) = ∫ P(X, Y) dY, of feature vectors with X ∈ 𝒳. Combining all of the above allows us to provide the definition of a domain.
Definition 2.1. A domain D consists of a feature space 𝒳 and a marginal probability distribution P(X), where X ∈ 𝒳; that is, D = {𝒳, P(X)}.
The second component of SL is a task. Put simply, the task provides a mapping from the feature space 𝒳 into the outcome space 𝒴. Formally, a task, T, is defined as follows.
Definition 2.2. A task T consists of an outcome space 𝒴 and a prediction function f(X) with f : 𝒳 → 𝒴. In the case of classification, the prediction function f assigns labels f(X) ∈ 𝒴 with X ∈ 𝒳, whereas for regression the prediction function assumes continuous values, that is, f(X) ∈ ℝ.
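To make the two components concrete, the following minimal sketch (with hypothetical toy data) shows a data set D = {(x_i, y_i)} and one possible prediction function f : 𝒳 → 𝒴, here a nearest-centroid classifier; any SL method, from SVMs to neural networks, plays this same role.

```python
import numpy as np

# Hypothetical toy data set D = {(x_i, y_i)}: feature space X = R^2,
# label space Y = {0, 1} (binary classification).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],   # class 0
              [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# One possible prediction function f : X -> Y, learned from D:
# assign a new point to the class with the nearest class mean (centroid).
centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def f(x):
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(f(np.array([0.1, 0.1])))  # a point near class 0
print(f(np.array([1.0, 1.0])))  # a point near class 1
```

The same data set D would support regression instead if y contained real values rather than categorical labels.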

| Unsupervised learning
The difference between SL and UL is that for UL, there is no outcome space 𝒴 available. From this follow two implications. First, the data assume the form

D = {x_i}_{i=1}^n, (2)

with x_i ∈ 𝒳. Here, 𝒳 is again the feature space and n the sample size. Second, due to the nonavailability of an outcome space, a task, as given in Definition 2.2, is no longer defined. We would like to remark that the latter is the reason UL finds application in exploratory data analysis (Hoaglin et al., 1983). For completeness, we would like to remark that it is also used for unsupervised classification, data discretization, and dimensionality reduction.
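As an illustration of learning without an outcome space, the following sketch clusters unlabeled data of the form (2) with a minimal k-means loop; the data and the choice of k = 2 are assumptions for the example.

```python
import numpy as np

# Unlabeled data D = {x_i}: no outcome space Y is available, so the goal is
# to discover structure, here via a minimal k-means clustering (k = 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)),    # one group around (0, 0)
               rng.normal(3, 0.1, (20, 2))])   # one group around (3, 3)

centers = X[[0, -1]].copy()        # initialize with two data points
for _ in range(10):
    # assign each point to its nearest center, then update the centers
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centers))           # approximately the two group means
```

Note that the "labels" here are discovered from the data alone, not given in advance; this is the sense in which no task in Definition 2.2 exists.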

| Reinforcement learning
In contrast to SL and UL, RL is considerably different. In RL, there are no given data, either in the form of Equation (1) or (2). Instead, an agent interacts with an environment via actions that lead to a change in the state of the environment, resulting in a new state (Sutton & Barto, 1998). The iterative application of this cycle generates a sequence of actions and states from which the agent aims to learn the best policy for action making. Importantly, the agent receives feedback about the quality of its actions consisting only of a scalar reinforcement signal indicating either "reward" or "punishment." Hence, no quantitative feedback is available that could be used for the learning process by the agent. Overall, RL aims at learning a policy to maximize the future return, which consists of all future rewards (Kaelbling et al., 1996). In Figure 3, we show the basic components that define the key elements of RL.
Given the above formulation of RL, it is no surprise that the historical roots of the field are inspired by behavioral psychology (Dayan & Abbott, 2001). Furthermore, it is interesting to note that RL has been used to describe the related action-perception cycle (Emmert-Streib, 2003;Sperry, 1952).
In order to establish an optimal policy, the agent faces the dilemma of exploring new states of the environment while maximizing its overall return at the same time. This dilemma is called the exploration versus exploitation tradeoff. To balance both goals, many different strategies have been developed, providing different forms of learning approaches for this problem. For instance, by making Markovian assumptions for the transition between states one obtains a Markov Decision Process (MDP; Van Otterlo & Wiering, 2012), whereas for limited sensing capabilities of the agent the problem needs to be formulated as a Partially Observable Markov Decision Process (POMDP; Jaakkola et al., 1995). For each assumption, practical realizations have been suggested (Hauskrecht, 2000), for example, Q-learning for MDP (Watkins & Dayan, 1992) or deep variational learning for POMDP (Igl et al., 2018).
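As an illustration of these ideas, the following sketch runs tabular Q-learning (Watkins & Dayan, 1992) on a hypothetical five-state chain environment; the agent receives only a scalar reward upon reaching the goal state and balances exploration and exploitation with an epsilon-greedy rule. The environment and all parameter values are assumptions for the example.

```python
import numpy as np

# Tabular Q-learning on a hypothetical chain environment: states 0..4,
# actions 0 (left) / 1 (right); reaching state 4 yields reward 1, all other
# transitions yield 0 -- only a scalar reinforcement signal is observed.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.4         # learning rate, discount, exploration
rng = np.random.default_rng(1)

def step(s, a):
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, float(s2 == n_states - 1)  # (next state, reward)

for _ in range(500):                      # episodes
    s = 0
    for _ in range(100):                  # cap the episode length
        # epsilon-greedy: explore with probability eps, otherwise exploit
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # Q-learning update toward the reward plus discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
        if s == n_states - 1:
            break

policy = np.argmax(Q, axis=1)             # greedy policy derived from Q
```

After training, the greedy policy moves right in every non-terminal state, illustrating how a policy is learned purely from the scalar reinforcement signal.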
Overall, RL is quite different from all other machine learning paradigms discussed above and below due to its generative character. This is also reflected in its application domains. For instance, popular applications are in robotics, game playing, question-answering, trading, and recommendation systems.

| MODERN MACHINE LEARNING PARADIGMS
The above presentation of the traditional machine learning paradigms (SL, UL, and RL) provides not only information about their basic definitions but also information on the characteristics of the underlying data. This is important because it allows a data-driven perspective. Specifically, depending on the characteristics of the data, a learning paradigm can be selected or excluded. This may not be unique, but it at least allows a preselection of certain principal approaches.
Following this line of thought allows us to specify further data with additional characteristics. The advanced machine learning paradigms discussed in the following sections can be distinguished in this way. Specifically, in the following we discuss the seven learning paradigms: 1. Multi-label learning; 2. Semi-supervised learning; 3. One-class classification; 4. Positive-unlabeled learning; 5. Transfer learning; 6. Multi-task learning; 7. Few/one-shot learning.
As we will see below, each of these learning paradigms has different requirements for the underlying data. Hence, they do not merely provide alternative algorithmic or computational approaches for existing data characteristics but establish new conceptual frameworks in the form of machine learning paradigms. In order to emphasize this aspect, in the following we largely neglect algorithmic realizations and statistical estimation techniques that address numerical implementations.
Regarding the presentation order, we made the following selection. We start by discussing MLL because it is a generalization of multiclass classification and is most closely related to SL. Thereafter, we present SSL, OCC, and PUL, which are the next closest to the traditional learning paradigms and are also related to each other. The following two learning paradigms, TL and MTL, are likewise related to each other and are further extensions of SL. However, both paradigms require significant modifications to the traditional conceptual framework. Finally, we discuss few/OSL, which is most different from all other learning paradigms. In summary, the order of the following machine learning paradigms reflects the distance to the traditional learning paradigms with respect to the extensions/modifications required. We will return to this argument in Section 12.

Figure 3: Basic components of reinforcement learning. The policy, state transition, and reward function define the agent and environment, respectively. The overall goal is the maximization of the expected return V of all future rewards.

| MULTI-LABEL LEARNING
The idea of MLL is to generalize the class labels of traditional classification from single-valued entities to sets of variable size (Tsoumakas & Katakis, 2007). Therefore, the number of labels produced by a prediction function is variable.

| Definition of MLL
In order to formally define MLL, we need to modify the definition of a data set D. Specifically, for MLL, D is defined as

D = {(x_i, Y_i)}_{i=1}^n,

with x_i ∈ 𝒳 and Y_i ⊆ 𝒴. Here, Y_i can assume any subset of 𝒴, which makes the size of such a set variable, that is, the size is not constant. One can represent such a Y_i as a binary vector b of length |𝒴|. That means a component of b is 1 if the corresponding label is in Y_i and zero otherwise. The goal of MLL is to find a prediction function f that maps the elements of D correctly. Formally, the task is defined as follows.
Definition 4.1. For multi-label learning, a task T consists of a label space 𝒴 and a prediction function f : 𝒳 → 2^𝒴. Here, 2^𝒴 corresponds to the power set of 𝒴, which is the set of all subsets of 𝒴.

| Methodological approaches
For MLL, there are two key conceptual approaches allowing the categorization of available methods.
First, approaches based on problem transformation can be further subdivided into (i) transformation to binary classification, (ii) transformation to label ranking, and (iii) transformation to multiclass classification (Gibaja & Ventura, 2014). Such approaches convert an MLL problem by means of transformations into well-established problem settings. Examples of this are Classifier Chains (Read et al., 2011), which transforms an MLL problem into a binary classification task; Calibrated Label Ranking (Fürnkranz et al., 2008), which maps MLL into the task of label ranking; and Random k-label sets (Tsoumakas et al., 2010), which transforms MLL into the task of multiclass classification. If such a mapping is performed, it is called label powerset (Tsoumakas et al., 2009).
From the definition of MLL and the description of transformation methods, one may wonder why 2^𝒴 is not always directly mapped to a multiclass classification problem because, theoretically, such a mapping is always possible. However, there is a practical problem with this for large |𝒴|. For instance, let us assume 𝒴 = {y_1, …, y_20}. In this case, the size of the power set is 1,048,576 (= 2^20). Hence, if we mapped the multi-label problem to a multiclass classification, one would have 1,048,576 different classes. It is clear that this can result in severe learning problems for such a classifier. For this reason, MLL tries to be more resourceful.
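The binary-vector representation of a label set and the exponential growth of the power set can be illustrated with a few lines of Python (the label names here are hypothetical):

```python
# A hypothetical label space Y and one instance's label set Y_i.
labels = ["sports", "politics", "finance"]
Y_i = {"sports", "finance"}

# Binary-vector representation b: component j is 1 iff label j is in Y_i.
b = [1 if lab in Y_i else 0 for lab in labels]
print(b)          # [1, 0, 1]

# Size of the power set 2^Y grows exponentially with |Y|: for |Y| = 20,
# a naive mapping to multiclass classification already needs 2**20 classes.
print(2 ** 20)    # 1048576
```

This is why the transformation methods above aim to exploit structure in the label sets rather than enumerating all subsets as separate classes.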

| SEMI-SUPERVISED LEARNING
The idea of SSL is to use both labeled and unlabeled data for performing a SL task (Chapelle et al., 2006).

| Definition of SSL
For defining SSL formally, one needs the definition of a domain D and a task T and the characterization of the data.
The definition of a domain is similar to SL; however, the resulting data are different. Specifically, for SSL, the data consist of two parts: a labeled part, D_L = {(x_i, y_i)}_{i=1}^{n_L}, and an unlabeled part, D_U = {x_j}_{j=1}^{n_U}. This means that the available data are of the form D = D_L ∪ D_U. In Figure 4, we visualize such data by showing data points with a positive label in blue, data points with a negative label in green, and unlabeled data points in gray.
Formally, SSL can be defined as follows.
Definition 5.3. Given a domain D with task T, labeled data D_L = {(x_i, y_i)}_{i=1}^{n_L} with x_i ∈ 𝒳 and y_i ∈ 𝒴, and unlabeled data D_U = {x_j}_{j=1}^{n_U}, semi-supervised learning is the process of improving the prediction function f by utilizing the labeled and unlabeled data.
We would like to remark that because the data D_L are labeled, SSL can be used for classification or regression problems.
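One simple way to make Definition 5.3 concrete is self-training, sketched below: a model fit on D_L pseudo-labels D_U and is then refit on the union. This is only one of many SSL strategies; the toy data and the nearest-centroid model are assumptions for the example.

```python
import numpy as np

# D_L: two labeled points; D_U: unlabeled points drawn near the two classes.
rng = np.random.default_rng(2)
X_L = np.array([[0.0], [1.0]])
y_L = np.array([0, 1])
X_U = np.concatenate([rng.normal(0, 0.05, 10),
                      rng.normal(1, 0.05, 10)])[:, None]

def fit_centroids(X, y):
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(cent, X):
    return np.argmin(np.abs(X - cent.T), axis=1)

cent = fit_centroids(X_L, y_L)                 # step 1: learn from D_L only
y_U = predict(cent, X_U)                       # step 2: pseudo-label D_U
cent = fit_centroids(np.vstack([X_L, X_U]),    # step 3: refit on D_L u D_U
                     np.concatenate([y_L, y_U]))
```

The refit centroids use all 22 points instead of 2, which is the sense in which the unlabeled data "improve" the prediction function.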

| Methodological approaches
For SSL, a broad variety of methods have been proposed. However, there are two key concepts based on which they can be distinguished (X. Zhu & Goldberg, 2009): inductive and transductive learning. Both concepts are fundamentally different from each other, and the training and prediction parts of such methods differ greatly (Gammerman et al., 2013). Put simply, induction can be formulated as follows.
Induction is reasoning from observed training cases to general rules, which are then applied to the test cases.
In contrast, transduction has the following meaning. Transduction is reasoning from observed, specific (training) cases to specific (test) cases.
It is important to note that this implies that transductive learning does not distinguish between the training and testing steps of a model. Instead, in contrast to inductive learning, it uses both the training and testing data for training the model. As a consequence, transductive learning does not build a predictive model. For this reason, if one wants to test a new instance, one needs to train the model again on all available data. This is not necessary for inductive learning because it produces a predictive model that can be used for new instances without retraining.
It is interesting to note that many transductive learning approaches are either explicitly or implicitly graph-based because they propagate information between different data points, which can be seen as nodes in a graph (W. Liu et al., 2012; Van Engelen & Hoos, 2020).
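A minimal sketch of such a graph-based transductive approach is label propagation on a small, hand-built path graph: the labeled nodes are clamped while label mass spreads to the unlabeled nodes. The graph and its labels are assumptions for the example, and note that the "test" nodes are part of training, as transduction requires.

```python
import numpy as np

# Adjacency matrix of a 5-node path graph; nodes 0 and 4 are labeled
# (classes 0 and 1), nodes 1-3 are unlabeled.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
F = np.zeros((5, 2))
F[0, 0] = 1.0            # node 0 carries label 0
F[4, 1] = 1.0            # node 4 carries label 1

D_inv = np.diag(1.0 / A.sum(axis=1))
for _ in range(100):
    F = D_inv @ A @ F    # each node averages its neighbors' label mass
    F[0] = [1.0, 0.0]    # clamp the labeled nodes
    F[4] = [0.0, 1.0]

# Node 1 inherits label 0, node 3 inherits label 1 (node 2 is the boundary).
pred = F.argmax(axis=1)
```

Adding a new node would change the graph, so the whole propagation must be rerun, which is exactly why transductive methods do not yield a reusable predictive model.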
A very recent comprehensive review of SSL including details about algorithmic realizations can be found in Van Engelen and Hoos (2020).

| ONE-CLASS CLASSIFICATION
The idea of OCC is to distinguish instances of one particular class from instances outside this class (Moya, 1993; Moya & Hush, 1996; Tax, 2001). This is quite different from ordinary classification, and for this reason, OCC has also been called outlier detection, novelty detection, anomaly detection, or concept learning (Japkowicz, 1999; Ruff et al., 2018). Hence, OCC focuses on one particular class only.

| Definition of OCC
OCC is either based on data containing only positive instances, that is, D_p = {(x_i, y_i)}_{i=1}^{n_p} with x_i ∈ 𝒳 and y_i ∈ 𝒴, or on data that are a combination of positive and unlabeled instances, that is, D = D_p ∪ D_u with x_i ∈ 𝒳. It is assumed that the unlabeled x_i also have labels, but these are not known to us. We would like to remark that the case D = D_p ∪ D_u is usually called PUL, which we discuss in the next section.
Definition 6.1. Given D = D_p, one-class learning is the process of improving the scoring function z : 𝒳 → ℝ that assigns novelty scores to previously unseen test instances x ∈ 𝒳.
Using such scores, a decision is made based on thresholding (Pimentel et al., 2014).

| Methodological approaches
According to Khan and Madden (2014), one-class learning approaches can be categorized with respect to the way they use the training data. This allows one to distinguish approaches utilizing only positive data from approaches that learn from positive and unlabeled data. The latter has found widespread interest and is called PUL. Due to the importance of such methods, we discuss this subcategory of one-class learning in the next section.
From a methodological point of view, there are three key concepts for OCC that use only positive-labeled data (Bartkowiak, 2011; Tax, 2001). First, density estimation methods estimate the density of the data points having a positive label; a new instance is then classified according to a threshold (Tarassenko et al., 1995). Second, boundary estimation methods focus on setting boundaries around a small set of points, called target points. Example methods from this category utilize SVMs or neural networks (Manevitz & Yousef, 2000; Schölkopf et al., 1999). Third, reconstruction methods fit a model of the target data and score a new instance by its reconstruction error.
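The density-estimation variant can be sketched as follows: a Gaussian (an assumed model, chosen here for simplicity) is fit to positive-only data, the negative log-density serves as the novelty score z(x), and a quantile of the training scores provides the threshold.

```python
import numpy as np

# Positive (target) class only: one-dimensional samples from N(0, 1).
rng = np.random.default_rng(3)
X_pos = rng.normal(0.0, 1.0, 500)

mu, sigma = X_pos.mean(), X_pos.std()

def z(x):
    # Negative log-density of the fitted N(mu, sigma^2); higher = more novel.
    return 0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi))

# Threshold chosen so that 99% of the training scores are accepted.
threshold = np.quantile(z(X_pos), 0.99)

print(z(0.0) < threshold)   # a typical point is not flagged as novel
print(z(8.0) > threshold)   # a far-away point is flagged as novel
```

Note that no negative examples are needed at any point; the decision boundary comes entirely from the positive class and the chosen quantile.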
For completeness, we would like to remark that one can also find other categorizations of methodological approaches in the literature. For instance, in Perera et al. (2021), one-class learning is divided into six categories: (i) statistical; (ii) representation-based; (iii) deep learning-based; (iv) discriminative methods; (v) generative models; and (vi) knowledge distillation, whereas in Chandola et al. (2009), OCC is divided into (i) classification-based; (ii) nearest-neighbor-based; (iii) clustering-based; (iv) statistical-based; (v) information theoretic-based; and (vi) spectral-based approaches.
It is interesting to note that OCC using only positive-labeled data for density estimation is conceptually similar to statistical hypothesis testing (Emmert-Streib & Dehmer, 2019). However, methodologically these approaches are different because OCC is not based on the concept of a sampling distribution, which precisely specifies not only the estimation but also its statistical interpretation. In contrast, OCC approaches for density estimation are broader and, for this reason, vary considerably in their interpretation.
Comprehensive reviews of OCC learning can be found in Khan and Madden (2014) and Rodionova et al. (2016).

| POSITIVE-UNLABELED LEARNING
For PUL, we are facing a classification problem when only labeled instances of one class are available. In addition, we have unlabeled data which can come from any class, but their labels are unknown. For this reason, we have labeled data from one class (termed "positive") complemented by unlabeled data. The goal is to utilize these data for a classification task.

| Definition of PUL
For obtaining the data, we assume that n_p positive samples are randomly drawn from the conditional distribution P(x | Y = +1) and n_u unlabeled samples are randomly drawn from the marginal distribution P(x) (Niu et al., 2016), resulting in the two data sets D_p = {x_i}_{i=1}^{n_p} and D_u = {x_j}_{j=1}^{n_u}. Hence, in total, we have the data D = D_p ∪ D_u with n = n_p + n_u samples. Furthermore, we assume that labels in 𝒴 also exist for the x_j ∈ D_u; however, these are not observed.
Due to the lack of observable instances for the entire label space 𝒴, the problem is limited to a binary label space, which simplifies its complexity.
Definition 7.1. The task T of positive-unlabeled learning consists of a binary label space 𝒴 = {+1, −1} and a prediction function f : 𝒳 → 𝒴. Based on this definition and the above assumptions, PUL can be formally defined as follows.
Definition 7.2. Given D ¼ D p [ D u , positive-unlabeled learning is the process of improving the prediction function f of the binary task T utilizing D p and D u .
Such approaches exploit inductive and transductive learning, both of which adopt an iterative procedure to obtain reliable negative training data from the unlabeled data (Perera et al., 2021). An example of an inductive PU learning algorithm, which uses bagging SVMs to infer a gene regulatory network (GRN), is presented in Mordelet and Vert (2014).

| Methodological approaches
The main methodological approaches for PUL can be distinguished as follows.
First, two-step methods use the unlabeled data in step one to identify negative instances and then, in step two, apply a traditional classifier. Second, weighting methods estimate real-valued weights for the unlabeled data and then learn a classifier based on these weights. The weights represent the likelihood, or conditional probability, that an unlabeled instance belongs to a certain class. Hence, the problem is converted into a (constrained) regression problem. Recently, a generative adversarial network (GAN) called GenPU has been introduced for PU learning (Hou et al., 2017). GenPU consists of a number of generators and discriminators playing a minimax game. These components simultaneously generate positive and negative samples with realistic properties, which can then be used with a standard classifier.
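The two-step idea can be sketched as follows; the distance-based selection of reliable negatives in step one is an illustrative heuristic for this example, not a specific published algorithm, and the toy data are assumptions.

```python
import numpy as np

# D_p: labeled positives; D_u: unlabeled mix of positives and negatives.
rng = np.random.default_rng(4)
X_p = rng.normal(0.0, 0.3, (30, 2))                  # positive, labeled
X_u = np.vstack([rng.normal(0.0, 0.3, (30, 2)),      # unlabeled positives
                 rng.normal(3.0, 0.3, (30, 2))])     # unlabeled negatives

# Step 1: treat the unlabeled points farthest from the positive centroid
# as "reliable negatives" (a simple heuristic for the sketch).
c_pos = X_p.mean(axis=0)
d = np.linalg.norm(X_u - c_pos, axis=1)
X_n = X_u[d > np.median(d)]

# Step 2: train a traditional binary classifier on positives vs. reliable
# negatives -- here a nearest-centroid rule.
c_neg = X_n.mean(axis=0)

def f(x):
    return +1 if np.linalg.norm(x - c_pos) < np.linalg.norm(x - c_neg) else -1
```

In practice, step one is often iterated and step two uses a stronger classifier (e.g., an SVM), but the division of labor between the two steps is the same.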
For completeness, we would like to mention that there are also approaches that can be categorized as neither a two-step method nor a weighting method. For instance, the method introduced in Yu and Li (2007) aims at increasing the number of positive instances by combining a graph-based SSL method with the two-step approach.
For comprehensive reviews of PUL, the reader is referred to Bekker and Davis (2020), Jaskie and Spanias (2019), and B. Zhang and Zuo (2008).

| TRANSFER LEARNING
The basic idea of TL is to utilize information from one task to improve the learning for a second one. In order to distinguish the two tasks from each other, the former is called the source task and the latter the target task. Correspondingly, for each task there is a domain and data, distinguished in a similar way. In Figure 5, we show a visualization of the underlying idea of TL.

| Definition of TL
Similar to SL (see above), for TL we also need the definition of a domain, D, and a task, T.
Definition 8.1. A domain D consists of a feature space χ and a marginal probability distribution P(X), where X = {X_1, …, X_n} ∈ χ, that is, D = {χ, P(X)}.
FIGURE 5 Visualization of training and testing for transfer learning (top) and multi-task learning (bottom). For transfer learning, task 1 is usually called the source task and task 2 the target task. A crucial difference between transfer learning and multi-task learning is that for the latter all tasks are equal, whereas the former focuses only on task 2 (the target task). Furthermore, it is important to note that for multi-task learning all tasks are evaluated independently from each other.
Definition 8.2. A task T consists of a label space Y and a prediction function f(X) with f : X → Y, that is, T = {Y, f(X)}.
The prediction function f(X) is learned from a data set D = {(x_i, y_i)}_{i=1}^n with x_i ∈ X, y_i ∈ Y, and i ∈ {1, …, n}, where n is the sample size. Some machine learning methods explicitly provide probabilistic estimates of f in the form of conditional probability distributions, that is, f(X) = P(Y | X). Hence, this is a generalized form of a prediction function because in the deterministic case it reduces to a delta distribution δ_{x,y}.
For TL, one needs to distinguish between two kinds of domains and tasks, which are called the source domain, D_S, and source task, T_S, and the target domain, D_T, and target task, T_T, with corresponding source data, D_S, and target data, D_T. From these, one can now formally define TL.
Definition 8.3. Given a source domain, D_S, with source task, T_S, and a target domain, D_T, with target task, T_T, transfer learning is the process of improving the prediction function, f_T, of the target task by utilizing D_S and T_S.
The above definition is quite general in the sense that it does not specify various aspects. Hence, specifying these leads to different subtypes of TL. In the following, we distinguish various subtypes from each other.
• Case D_S = D_T and T_S = T_T: This corresponds to the traditional machine learning setting in which we learn f_S from the source data D_S and continue the learning process with the target data D_T, where the resulting prediction function is renamed to f_T. From this follows that TL is obtained from D_S ≠ D_T or T_S ≠ T_T. Here it is important to emphasize the "or" between the conditions, which results in three different cases.
• Case D_S ≠ D_T: Given that D_S = {χ_S, P_S(X)} and D_T = {χ_T, P_T(X)}, this can correspond either to χ_S ≠ χ_T or to P_S(X) ≠ P_T(X). Homogeneous TL: The case where the feature spaces of the source domain and the target domain are the same, that is, χ_S = χ_T, is called homogeneous TL. Heterogeneous TL: The case where the feature spaces of the source domain and the target domain are different, that is, χ_S ≠ χ_T, is called heterogeneous TL.
• Case T_S ≠ T_T: This can correspond either to Y_S ≠ Y_T or to f_S(X) ≠ f_T(X). Y_S ≠ Y_T: This case means that the label spaces of the source task and the target task are different. For instance, this can be due to a different number of classes in the source task and the target task. f_S(X) ≠ f_T(X): Given that the prediction functions generalize to conditional probability distributions, this means P_S(Y | X) ≠ P_T(Y | X).

| Methodological approaches
For TL, a variety of different perspectives have been suggested for the categorization of this learning paradigm. For instance, one could assume a view with respect to traditional paradigms, distinguishing between inductive, transductive, and unsupervised TL (Pan & Yang, 2009), or a model-based view (Zhuang et al., 2020). However, the most common categorization is based on "what to transfer" (Pan & Yang, 2009). (1) For feature-based TL, good feature representations are learned from the source task and assumed to be useful for the target task as well. Hence, in this case, the knowledge transfer between source task and target task is via learned feature representations. (2) For parameter-based TL, some parameters or prior distributions of hyperparameters are transferred from the source task to the target task. This assumes a similarity between the source model and the target model. Unlike multi-task learning, where both the source and target tasks are learned simultaneously, for TL we may apply additional weight to the loss of the target domain to improve overall performance. (3) The idea of instance-based TL is to reuse parts of the instances from the source task for the target task. Usually, instances cannot be used directly; instead, this is accomplished via instance weighting. (4) Relational-based TL assumes that instances are not independent and identically distributed but dependent. This implies that the underlying data form some kind of network, for example, a transcription regulatory network or a social network.

| MULTI-TASK LEARNING
The idea of MTL compared to TL is two-fold. First, instead of considering exactly two tasks, the source and the target task, in MTL there can be m > 2 tasks. Second, these m tasks do not have one or more dedicated targets; rather, all tasks are equally important. That means there are m source tasks and m target tasks (Caruana, 1997).

| Definition of MTL
Formally, MTL can be described as follows.
Definition 9.1. Given m learning tasks, {T_k}_{k=1}^m, where all tasks or a subset of tasks are related, multi-task learning aims to improve each learning task T_k by utilizing information from some or all other models.
For clarity, we would like to emphasize that for each learning task T_k there is a corresponding domain D_k = {χ_k, P(X_k)} and data set D_k given, from which information can be utilized. In the following, we denote the data set of task k by D_k = {(x_ki, y_ki)}_{i=1}^{n_k} with x_ki ∈ χ_k, y_ki ∈ Y_k, and i ∈ {1, …, n_k}, where n_k is the sample size.
• Case: The case where x_ki = x_li and n_k = n_l = n for all k, l ∈ {1, …, m} and i ∈ {1, …, n} is called multi-view learning. In this case, the x-values of the data D_k for all tasks are identical but can have different labels, that is, Y_k ≠ Y_l for all k, l ∈ {1, …, m}.

| Methodological approaches
For MTL, there are three key methodological approaches used to study such problems (Y. Zhang & Yang, 2018). First, feature-based MTL models assume that different tasks share the same or at least similar features. This also includes methods that perform feature selection or a transformation of the original features. Second, parameter-based MTL models utilize parameters between different models to relate the learning between different tasks. Examples of this include methods based on regularization or priors on model parameters. In general, this conceptual approach is very diverse, with many different realizations. Third, instance-based MTL models estimate weights for the membership of instances in tasks and then use all instances for learning all tasks in a weighted manner. Comprehensive reviews of MTL can be found in Y. Zhang and Yang (2018), Ruder (2017), and Sosnin et al. (2019).
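As an illustration of the parameter-based, regularization-style coupling of tasks, the following sketch uses scikit-learn's MultiTaskLasso, whose L2,1 penalty forces all tasks to select features jointly. The synthetic tasks, which share the same inputs and the same two relevant features, are constructed purely for illustration.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(2)

# m = 3 related regression tasks sharing the same inputs and the same
# two relevant features out of ten.
n, d, m = 100, 10, 3
X = rng.normal(size=(n, d))
W = np.zeros((d, m))
W[:2, :] = 1.0                                # the tasks share their support
Y = X @ W + 0.1 * rng.normal(size=(n, m))     # one target column per task

# The L2,1 penalty of MultiTaskLasso zeroes whole coefficient rows across
# all tasks simultaneously -- a regularization-based coupling of the tasks.
mtl = MultiTaskLasso(alpha=0.1).fit(X, Y)

# Features used by at least one task (coef_ has shape (m, d)).
shared_support = np.any(mtl.coef_ != 0, axis=0)
```

Fitting each task with an ordinary Lasso would allow the tasks to pick different features; the joint penalty is what implements the "tasks are related" assumption of Definition 9.1.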

| FEW/ONE-SHOT LEARNING
The idea of few/OSL is to utilize a (large) training set for learning a similarity function which is then used in combination with a very small data set containing only one or a few instances about unknown classes to make predictions about these unknown classes (Fei-Fei et al., 2006). Hence, few/OSL utilizes semantic information from the training data to deal with few/one instances for new classes that are unknown from the training data. In Figure 6, we summarize the idea of few/OSL.

| Definition of OSL
Few/OSL utilizes three key components.
(1) A labeled data set D.
(2) A support set D_su.
(3) A query q representing a new instance for which a class label should be predicted.
The labeled data D is given by D = {(x_i, y_i)}_{i=1}^n with x_i ∈ X, y_i ∈ Y, and i ∈ {1, …, n}, where n is the sample size, X the feature space, and Y the label space. If the cardinality of the label space is larger than two, that is, |Y| > 2, then we have a multi-class classification problem, otherwise a binary one. The data set D serves as training data to learn a similarity function g. This similarity function will then be used for evaluating the similarity of a query q to the instances given in the support set D_su. The support set D_su is defined as follows.
Definition 10.1. The support set is given by D_su = {(x_si, y_si)} with s ∈ {1, …, S} and i ∈ {1, …, n_s}, providing n_s labeled instances for each of S classes with y_si ∈ Y′. For n_1 = ⋯ = n_S = 1, one obtains one-shot learning, and for n_i > 1 for all i ∈ {1, …, S} with n_i small, few-shot learning. For n_1 = ⋯ = n_S = n, this is called n-shot, S-way learning.
It is important to note that the label spaces of the support set D_su and the training data D are different, that is, Y ≠ Y′. Hence, the semantic transfer from the training data is accomplished via the similarity function, and the support set is utilized as a kind of dictionary to look up the similarity with the query q. In this way, it is possible to make predictions about new classes that have not been present in the training data.
The task that is important for few/OSL is to learn a prediction function, f_su : X → Y′, that maps into the classes given by Y′, not Y.
Definition 10.2. The task T_su for few/one-shot learning consists of outcome space Y′ and prediction function f_su(X) with f_su : X → Y′.
FIGURE 6 Overview of few/one-shot learning. There are three key components: (1) Labeled data set D. (2) Support set D_su with Y′ ≠ Y. (3) Query q representing a new instance for which a class label should be predicted. For the testing, the prediction function f_su is used to evaluate the similarity between q and the instances in the support set D_su.
The distinction between Y′ and Y may appear strange at first because it means that the classes of the training data and the testing data are different. So how can one learn from the instances provided by the training data for the testing data when the outcome spaces are entirely different? The trick of few/OSL is to assume that the similarity structure among instances in the training data and in the testing data is the same. Hence, learning such a similarity function in the form of the function g allows one to learn from the training data for the testing data despite the fact that Y′ ≠ Y.
We would like to remark that the above assumption about the similarity among instances in the training data and the testing data determines the quality of the outcome. Specifically, for infinitely large training data, it should be possible to learn the similarity function g with high accuracy. However, in the case where the similarity in the testing data is not captured by g, the prediction function f_su will not be able to provide meaningful results. Strictly, this is true irrespective of the sample size of the training data and the number of instances in the support set. Hence, if the similarity assumption is violated, no learning occurs even in the limit of infinitely large sample sizes.
Based on the above definitions, few/OSL can now be defined as follows.
Definition 10.3. Given a training data set D and a support set D_su, few/one-shot learning is the process of improving a prediction function, f_su : X → Y′, for task T_su by utilizing D and D_su.

| Methodological approaches
In order to establish a few/OSL model, there are essentially two main conceptual approaches.

These are semantic transfer via similarities and semantic transfer via features.
First, semantic transfer via similarities means that knowledge extracted from the training data is utilized for unknown classes via learning similarity concepts. An example of this is the Siamese network used in Koch et al. (2015). Here, the authors learn an image verification task instead of predicting the classes of instances directly. Conceptually, this means learning the similarity (or lack thereof) between pairs of instances. This network is trained on D and then utilized with D_su, where an instance from D_su, that is, x_su,i, is used together with a query x. If x is similar to x_su,i, then the predicted class is y_su,i. Second, semantic transfer via features has been suggested by Bart and Ullman (2005). The authors showed that the similarity of novel features to existing features learned from training data can help in feature adaptation.
Recently, deep learning approaches have been used. For instance, in Vinyals et al. (2016), a neural architecture called Matching Networks has been introduced that utilizes an augmented memory including an attention kernel. Another example is Relation Network (RelNet) introduced in Sung et al. (2018). RelNet learns an embedding and a deep nonlinear distance metric with a convolutional neural network for comparing query and sample items.
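The look-up of a query against the support set can be sketched as follows. For illustration, the similarity function g is simply a fixed cosine similarity over hand-crafted embedding vectors, whereas in the approaches above g (or the embedding feeding it) would be learned from the training data D; the class names and vectors are hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Similarity function g; fixed here, learned from D in practice.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(query, support):
    # f_su: assign the query the label of the most similar support
    # instance (1-shot, S-way classification).
    best = max(support, key=lambda item: cosine(query, item[0]))
    return best[1]

# Support set D_su: one embedded instance for each of S = 3 new classes,
# whose labels come from Y' and never occur in the training data D.
D_su = [(np.array([1.0, 0.1, 0.0]), "zebra"),
        (np.array([0.0, 1.0, 0.1]), "okapi"),
        (np.array([0.1, 0.0, 1.0]), "tapir")]

q = np.array([0.9, 0.2, 0.1])   # query embedding
print(predict(q, D_su))         # -> zebra
```

Note that the quality of the prediction rests entirely on how well g captures the similarity structure of the new classes, exactly as discussed above.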

| APPLICATIONS
From the above definitions and descriptions of the individual learning paradigms, it seems clear that their application potential is enormous due to the extended flexibility offered by these modern paradigms. In order to underline this impression, we provide in Table 1 a brief overview of real-world application domains to which these learning paradigms have already been applied. Specifically, this table provides some example studies for four different kinds of data, namely, biomedical data, text data, sensor data, and image data.
We used these four data categories to cover a large range of fields from which such data can arise. To make this clearer, we searched the Web of Science to obtain the number of publications from such fields. Specifically, we searched the number of publications for applications in engineering, telecommunications, imaging science, automation, biology, acoustics, robotics, and materials. The results of this search are shown in Figure 7. There we show detailed information for SSL, OCC, PUL, OSL, TL, MTL, and MLL. We would like to note that the y-axis is again on a logarithmic (base 10) scale due to the large differences in the observed number of publications.
Comparing the scale of the y-axis in Figure 7 with Figure 1 indicates that all of the modern learning paradigms, especially PUL or OSL, are severely underutilized in essentially all fields. We hypothesize that the main reason for this difference is not the inadequacy of particular learning paradigms for certain application fields but a lack of knowledge in the application-oriented communities, caused by the inaccessible presentation of modern learning paradigms. Importantly, for the application-oriented communities, algorithmic details of methods or learning paradigms are in general less actionable than information about the applicability of such methods or learning paradigms to particular data. For this reason, in this article, we assumed a data-driven perspective, making it easy to decide whether a particular learning paradigm is basically suited for analyzing a given data set. This allows one to narrow down all options so that one can then focus on the remaining learning paradigms and the selection of appropriate methods.

| INTERRELATIONS BETWEEN MACHINE LEARNING PARADIGMS
After reaching conceptual clarity of the different learning paradigms, their definitions, and applications, we study now the relations between them.
In Figure 8, we show an interrelation diagram for all 10 machine learning paradigms. In this figure, the used acronyms correspond to SL, UL, RL, MLL, SSL, OCC, PUL, TL, MTL, and OSL. We constructed this diagram by defining mappings between the learning paradigms. Specifically, if there is a relation between two paradigms in the form of a mapping then there is an edge connecting these. Each edge has a direction defining a start node (S), an end node (E), and a label. Overall, the interrelation diagram defines a directed, labeled graph, we call the LP-graph.
For the mappings between the learning paradigms, one needs to distinguish between different cases corresponding to different types of mappings. In the following, we distinguish between four different mapping types.
1. Data-deleting mapping: This type of mapping deletes data from the available set of data of a learning paradigm. Specifically, in order to map from SSL to UL, PUL, and SL, and from SL to OCC, one can define the following data-deleting mappings.
F I G U R E 8 The shown diagram, called the learning-paradigm graph (LP-graph), provides information about connections between the different machine learning paradigms. The acronyms correspond to supervised learning (SL), unsupervised learning (UL), reinforcement learning (RL), multi-label learning (MLL), semi-supervised learning (SSL), one-class classification (OCC), positive-unlabeled learning (PUL), transfer learning (TL), multi-task learning (MTL), and one-shot learning (OSL). Two nodes in the diagram are connected via a specific type of mapping (see main text) indicated by the color.
As one can see, each of the above mappings deletes a part of the available data given by D. Due to the fact that different learning paradigms are based on different data, the meaning of D is paradigm specific. 2. Task-redefining mapping: The second type of mapping does not require a deletion of data but a redefinition of a task. In Figure 8, this occurs just once, from MLL to SL: the mapping d_MLL→SL converts a multi-label designation into multi-class categories. 3. Multiple mapping (data-deleting and task-redefining): The third type of mapping provides a simultaneous data-deleting and task-redefining mapping. Such a mapping is required to connect TL → SL, MTL → TL, and OSL → SL.
Redefining the task: Here, i and j with i ≠ j correspond to just one task each from {1, …, m}. For simplicity, one can assume i = 1 and j = 2.
This type of mapping is more complex because the redefinition of the task is a significant deviation from the original problem. 4. Type-changing mapping: The fourth type of mapping is the most severe one in the sense that it changes the characteristics of a learning paradigm entirely. This mapping is required for RL → SL.
In order to obtain a better understanding of the LP-graph and its mappings, we discuss in the following some of its implications.
The label of an edge in the LP-graph provides information about the type of a mapping: In general, a label indicates what needs to be deleted or changed from node S to node E. Hence, all mappings correspond to a reduction of information from node S to node E. For instance, in order to obtain UL from SSL, one needs to delete the data providing labeled information, called D_L, from the formulation of the SSL problem.
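Such data-deleting mappings are conceptually simple. The following sketch illustrates two of them, d_SSL→UL (discard the labeled data D_L) and d_SL→OCC (keep only instances of the target class), on toy data; the data representation is, of course, purely illustrative.

```python
# An SSL data set: a labeled part D_L and an unlabeled part D_U.
D_L = [([0.2, 0.4], "a"), ([0.9, 0.1], "b")]
D_U = [[0.5, 0.5], [0.3, 0.8]]

def d_ssl_to_ul(labeled, unlabeled):
    """Data-deleting mapping SSL -> UL: drop D_L entirely, leaving the
    plain unlabeled data set on which UL operates."""
    return list(unlabeled)

def d_sl_to_occ(data, target="a"):
    """Data-deleting mapping SL -> OCC: keep only instances of the
    target class, since one-class classification observes a single class."""
    return [(x, y) for x, y in data if y == target]

D_ul = d_ssl_to_ul(D_L, D_U)
D_occ = d_sl_to_occ(D_L, target="a")
```

Both mappings are reductions: they only discard information, which is exactly why their inverses would require the creation of new data.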
Considering the direction of a mapping, there is only one directed edge that starts from SL (see Figure 8). In this case, information from SL toward OCC needs to be deleted, whereas for all other mappings involving SL the deletion of information occurs toward SL. We would like to remind the reader that each mapping has reductionist properties, that is, a mapping deletes information in the form of available data and potentially redefines tasks. Hence, the inverse of such a mapping would require the addition or creation of information in the form of data. Theoretically, this is of course possible; however, we consider the deletion/reduction of information the simpler and more natural perspective because the addition/creation of information is more demanding with respect to explaining the origin of this new information. Nevertheless, a reversed perspective may be fruitful, despite the fact that we did not find good arguments for it.
MTL, TL, and OSL require the redefinition of a task: In addition to the deletion of data, MTL, TL, and OSL also require a redefinition of the underlying task. This is emphasized by the purple mappings in Figure 8. For instance, for the mapping from MTL to TL one needs to delete the tasks 3 to m, including the underlying data D_3 to D_m. Furthermore, one needs to redefine the target task, which for TL includes only task 2 but not task 1. Similarly, for the mapping from OSL to SL one needs to delete the support data D_su and redefine the task because new instances from Y′ can no longer be studied.
The mapping for RL is a pseudo-mapping: The mapping between RL and SL requires severe modifications. One reason for this is the explorative character of RL that requires the agent to choose actions (make decisions) for changing the state of the environment. This process can be seen as a data generation which is entirely absent in any other learning paradigm. Instead, all other learning paradigms assume that the data are already given.
Interestingly, a proof-of-concept for such a mapping has been provided by Wiering et al. (2011). Specifically, the authors used RL together with definitions for actions and the reward function to perform a standard classification task. Their idea is to define an intricate Classification Markov Decision Process (CMDP) that models the classification task as a sequential decision-making problem by allowing the agent to explore the input data. Whether such an approach has a substantial advantage over traditional classification methods is currently unclear and remains to be seen.
Meta-learning is not a learning paradigm but a meta-learning paradigm: One may wonder why meta learning (Thrun & Pratt, 1998; Vanschoren, 2019) has not been included in the LP-graph. The reason for this is that meta learning, which is also called "learning to learn" or "lifelong learning," is more than a learning paradigm. Specifically, it assumes two interrelated learning mechanisms, that is, an inner mechanism for a base learner and an outer mechanism that helps the inner mechanism to learn and improve (Hospedales et al., 2020), where the outer mechanism leads to a kind of knowledge transfer for the inner mechanism. The iterations of the interaction between the outer and inner learning mechanisms, which are called episodes, are an important element of meta learning. Due to this iterative element, meta learning has been interpreted as an evolutionary principle of learning (Schmidhuber, 1987). Hence, all this establishes meta learning as a meta paradigm of learning paradigms.
From the explanation above, it should be clear that none of the 10 learning paradigms discussed in this article are genuinely within a meta learning framework despite the fact that TL and MTL contain knowledge transfer elements. However, the missing part is the iteration establishing many episodes of learning.
The LP-graph is a taxonomy of learning paradigms: The LP-graph shown in Figure 8 provides a bird's-eye view of the relations between the different machine learning paradigms. Given the fact that each learning paradigm itself represents a particular problem class based on dedicated definitions (Sections 2.2 to 10) and applications (Section 11), which can be quite sophisticated, the LP-graph masks this complexity by projecting transitions between the learning paradigms. Hence, the LP-graph provides information not contained in any individual learning paradigm but only in the collective thereof. In addition, the structure of the LP-graph organizes the learning paradigms hierarchically due to the direction of the mappings, where a key element of the mappings is the data-centric perspective.

| CONCLUSIONS
In this article, we provided a discussion of 10 machine learning paradigms. Specifically, we defined key constituents of SL, UL, RL, MLL, SSL, OCC, PUL, TL, MTL, and OSL and their data requirements. Our data-driven perspective allowed a systematic identification of relations between the individual learning paradigms in the form of an LP-graph. This established a taxonomy among the seven modern learning paradigms and the three traditional paradigms, that is, SL, UL, and RL. Overall, the joint presentation and discussion of all those machine learning paradigms should allow a wider appreciation of modern learning paradigms by the broader community and foster their applications.