Supervised Classification Problems – Taxonomy of Dimensions and Notation for Problems Identification

The paper proposes a taxonomy for categorizing the main features of supervised learning classification problems and a notation for identifying the categories of such problems. The proposed taxonomy is based on a review and analysis of the recent literature. It allowed the construction of a landscape of decision problem factors influencing supervised learning processes. To enable a concise and coherent identification of supervised classification problems, we suggest a notation for describing and identifying various supervised learning classification problem types and their critical features. The notation consists of 5 fields representing, in sequence, a structure and properties of decision classes, structural model and properties of attributes, features of the data source, and the performance measure used for constructing and evaluating a classifier. The proposed notation is open and can be extended when new developments within machine learning theory require it.


I. INTRODUCTION
Classification is a term commonly used for describing the process of distinguishing and distributing kinds of "things" into different groups. Classification can be viewed as the assignment of elements to pre-defined classes [1] or as the act or process of dividing things into groups according to their type [14]. Classification remains one of the main topics of scientific research and is vital to practically all domains of human activity. As has been recently observed, "most of the classifications are still based on the evaluation of resemblances between objects that constitute the empirical data. This one is almost always computed by the means of some notion of distance and some algorithms of aggregation of classes" [58].
Classifications are produced using different reasoning schemes. For example, in statistics, a classification task (also called discrimination problem) requires a classification rule for assigning new data to one of the known classes. Such a rule is identified based on a set of data containing observations (or instances), whose category membership is known [3]. In the machine learning terminology [4], classification is viewed as learning from examples (observations, instances), where a training set consisting of correctly identified observations is available and used to induce a model that describes and distinguishes data classes [11]. Such a model is called a classifier [5]. Classifiers are sometimes referred to as mathematical functions that map input data into a category. Such functions are also called the hypothesis.
Classification problems are encountered in many real-world activities where finding effective methods and tools requires an interdisciplinary effort [1], [2]. Classification methods and tools have been, and still are, under development in numerous research areas of computer science, such as image processing and analysis [6], [7], computer vision [8], signal recognition [9], decision support systems [10], knowledge discovery in databases and data mining [5], and data science [14]. It would be impossible to list here all application areas where machine learning-based classifiers have proven to bring breakthrough advantages. Examples of such areas include genetics and genomics [127], agriculture [128], molecular and materials science [129], physical sciences [130], hydrology [131], chemistry [132], manufacturing [133], and medical image analysis [134], to mention just a few.
In the current paper, we study classification problems that can be solved automatically. Besides, it is assumed that processes of classifier learning are based on the principle of supervised learning, where a classifier, often referred to as a classification model, is learned from a set of examples. Under this assumption classification is a two-stage process. The first stage involves learning from data to induce a classification model. The second one involves using the model to predict the class or category of instances with unknown class labels. Research effort undertaken by specialists in various areas of computer and data sciences has resulted in providing a wide range of classification models and approaches for constructing classifiers. These are based on different paradigms, with different scopes, different complexities, and varying degrees of performance. All of them are, however, based on a 2-dimensional conceptual data model consisting of the set of examples (instances, observations) and each example consisting of feature (attribute) values which are assumed to be in some unknown way related to the class or category an instance belongs to. Within the set of examples labels (categories) of instances are known.
The simplicity of such a conceptual model does not, unfortunately, mean that inducing classifiers assuring the required level of performance is an easy task. There is a multitude of factors that make the task complex and difficult or even impossible. Among such factors one should mention, for example, limited availability of examples, ambiguity, uncertainty, and distortions of feature values, different methods of representing and coding feature values, class imbalance among the instances available for inducing a classifier, presence of concept drift, presence of outliers, complex feature values, and many others. One of the most difficult barriers encountered in developing classification methods is the sheer size of the data available for deriving a classifier. This barrier may apply to both data dimensions: the number of features and the number of instances. It is well known that data analysis, including classification, is more complex in the so-called Big Data environment [12], [13].
The main contribution of this paper is the identification and ordering of factors influencing the construction and outcomes of classification models in supervised machine learning. Besides, we also propose a taxonomy for the categorization of supervised classification problems. In the available literature on machine learning, the authors have not found any attempt to provide a wider scheme for categorizing supervised learning problems based on a variety of combinations of factors important from the point of view of constructing effective classification models. Our goal is to provide specialists and laymen with a simple classification scheme allowing them to identify what kind of classification problem they are facing. Consequently, they can narrow their search for a suitable classification model by first considering methods that have proven successful in solving similar problems.
The proposed classification scheme is based on several characteristics of supervised classification problems referred to as their dimensions. Under the proposed scheme the type of the supervised classification problem can be identified considering the following factors:
- Characteristics and properties of the problem categories (category view).
- Structure and properties of the dataset available for learning a classifier (attribute view).
- Sources of the dataset available for learning (data source view).
- Criterion or criteria associated with the supervised classification problem (performance criteria view).
Besides the classification scheme, we also propose a notation for supervised classification problems. It has been inspired by the notation used in the theory of scheduling for the identification of scheduling problems. That notation was suggested in 1979 by R.L. Graham et al. [87] and has been widely used ever since. The reason for proposing a notation that denotes supervised classification problems in a short but coherent way, based on their main properties, is to enable efficient communication between specialists developing classification models.
Classification problems arise in real life in a variety of settings. Researchers have studied hundreds of classification problems and it would be impossible to list all known variants in a single paper. We believe that offering a framework for ordering and grouping classification problems could be of value to both the researchers looking for methods and tools, and the practitioners trying to find a method or tool for solving their particular classification problems. The proposed classification scheme and notation should be considered as a part of the meta-analysis of machine learning.
We would like also to stress that the proposed classification scheme does not depend on techniques, methods, and tools used for inducing classification models. It is also independent of techniques and methods used at the data pre-processing stage, for example for dealing with the missing data problem or outliers removal. There is, of course, a relation between a particular supervised classification problem type and methods or techniques which might be more effective for solving its cases as compared with other techniques. Identifying such relations could be a subject of analysis when constructing or selecting a learner.
The remainder of this paper is organized as follows. In the next section, the problem of classification is formally defined. Section III contains a review of the different problems and strategies of learning from data based on the relevant literature. Section IV provides a short overview of studies on complexity of classification problems. In Section V, we propose a notation for classification problems. Section VI includes conclusions and discussion on some open research problems.

II. CLASSIFICATION PROBLEM FORMULATION
In the case of supervised learning, some examples are needed to induce a learner. The set of examples, denoted U, is a non-empty, finite set called the universe. An example $x \in U$ is represented by a fixed set of attributes (features), $A = \{a_1, a_2, \ldots, a_n\}$, where $n$ is the number of attributes. Each attribute $a_i$, $i = 1, \ldots, n$, has a value $a_i(x) \in V_{a_i}$, where $V_{a_i}$ is the set of all possible values for attribute $a_i$. $V_{a_i}$ is also called an attribute domain. It is assumed that one particular attribute contains the value representing the class label of the example [11].
The aim of learning from examples is to obtain a model (classifier, learner) able to reveal the value of the unknown class label by identifying the dependence between attribute values (our independent variables) and the value of the class label (our dependent categorical variable) using the set of available instances.
The class labels belong to a finite set of predefined decision categories (classes) $C = \{c_l : l = 1, \ldots, k\}$, where $k$ is the number of these categories. Hence, the set U for the classification task can be defined as:

(1) [equation missing in the source]

In [11] the process of learning from examples was called "concept learning", i.e. "search through a predefined space of potential hypotheses for the hypothesis that best fits the training examples". When instances are represented by n-dimensional input data (space) and each instance $[x_{1j}, \ldots, x_{nj}] \in \Re^n$, $j = 1, \ldots, |U|$, then the mission of a classifier is to map instances to the discrete class set, i.e. $h: \Re^n \to C$.
The machine learning algorithm, called a learner, produces a classifier $h \in H$. The classifier is induced from the set U. H is called the hypothesis space and consists of all possible hypotheses that can be produced during the learning process. Thus, given a dataset U, a set of hypotheses H, and a performance criterion or criteria F, the learning algorithm L outputs a hypothesis $h \in H$ optimizing F. Learning from examples therefore amounts to constructing L able to determine the best possible $h \in H$ with respect to the adopted performance measure or measures F.
In case the classifier output is evaluated using a single criterion with the performance measure expressed as $f \in F$, learning from examples may be formulated as a process maximizing the performance measure with respect to the hypothesis h, such that:

$$h = \arg\max_{h \in H} f(h) \qquad (2)$$

The role of a classifier is to predict the category of an instance whose category is unknown.
Definition: A classifier h is a function assigning examples from D to a predefined set of categories as shown in equation (3):

(3) [equation missing in the source]

where $D_l$, $l = 1, \ldots, k$, denotes the subset of the data set D whose examples are labelled by the class $c_l \in C$, under the following conditions:

[conditions missing in the source]

In view of the above, the classification task is to assign a class $c_l \in C$ to the instance $x \in D$.
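The formulation above, in which a learner L returns the hypothesis from H that maximizes a single performance measure f, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the hypothesis space consists of threshold rules on one numeric attribute, f is plain accuracy on the set U, and the names `make_threshold_classifier`, `accuracy`, and `learn` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (single attribute value, class label)


def make_threshold_classifier(t: float) -> Callable[[float], int]:
    """One hypothesis h: predict class 1 when x >= t, otherwise class 0."""
    return lambda x: 1 if x >= t else 0


def accuracy(h: Callable[[float], int], data: Sequence[Example]) -> float:
    """Performance measure f: fraction of examples classified correctly."""
    return sum(1 for x, c in data if h(x) == c) / len(data)


def learn(data: Sequence[Example], thresholds: List[float]):
    """Learner L: enumerate the hypothesis space H and return the argmax of f."""
    hypothesis_space = [(t, make_threshold_classifier(t)) for t in thresholds]
    return max(hypothesis_space, key=lambda th: accuracy(th[1], data))


# Illustrative set of labelled examples U (assumed data).
U = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]
best_t, h = learn(U, thresholds=[0.0, 1.0, 2.0, 3.0])
```

Exhaustive enumeration is only feasible because this H is tiny; practical learners search the hypothesis space heuristically.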
The process of classification involves several steps. A natural sequence of these steps can be seen as follows:
− Sample collection,
− Selection of instances and attributes for learning,
− Carrying out other pre-processing actions (cleaning, removing outliers, etc.),
− Producing the training set,
− Induction of the classifier using instances from the training set,
− Using the induced classifier to predict classes of instances with unknown class labels.
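The sequence of steps listed above can be sketched end to end. This is a toy illustration under stated assumptions, not a method taken from the paper: the records, the attribute names, the rule of dropping records with missing values, and the nearest-centroid classifier are all chosen only to keep the example short.

```python
from statistics import mean

# Step 1: sample collection (hypothetical records; None marks a missing value).
raw_sample = [
    {"id": 1, "height": 1.0, "width": 1.2, "label": "A"},
    {"id": 2, "height": 1.2, "width": 0.8, "label": "A"},
    {"id": 3, "height": None, "width": 1.0, "label": "A"},
    {"id": 4, "height": 3.0, "width": 3.2, "label": "B"},
    {"id": 5, "height": 3.2, "width": 2.8, "label": "B"},
]

# Step 2: selection of attributes for learning (the identifier is left out).
SELECTED = ["height", "width"]


def preprocess(sample, attributes):
    """Steps 2-4: keep selected attributes, drop incomplete records,
    and return the training set as (values, label) pairs."""
    rows = []
    for record in sample:
        values = [record[a] for a in attributes]
        if all(v is not None for v in values):
            rows.append((values, record["label"]))
    return rows


def induce(training_set):
    """Step 5: induce a nearest-centroid classifier (one centroid per class)."""
    by_class = {}
    for values, label in training_set:
        by_class.setdefault(label, []).append(values)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_class.items()}


def predict(centroids, values):
    """Step 6: assign the class whose centroid is closest to the instance."""
    def sq_dist(label):
        return sum((v - m) ** 2 for v, m in zip(values, centroids[label]))
    return min(centroids, key=sq_dist)


training_set = preprocess(raw_sample, SELECTED)
centroids = induce(training_set)
prediction = predict(centroids, [2.9, 3.1])
```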

III. PROPERTIES OF THE CLASSIFICATION PROBLEMS
The first known classifier based on the supervised learning paradigm was proposed back in 1967 by Cover and Hart [52]. Since then numerous dedicated and universal classification algorithms have been proposed and published (see, for example, Table I).
It is now widely recognized that the selection of methods and tools for constructing classifiers should be preceded by an analysis and assessment of properties of the classification problem at hand. Such an analysis should be carried out considering the following dimensions, shown also in Fig. 1:
− Structure, cardinality, scales, and category-feature relations of classes (categories).
− Structural model of data and data characteristics.
− Data source features.
− Classifier performance criterion (or criteria).

A. THE NUMBER, PROPERTIES, AND STRUCTURE OF CLASSES (CATEGORY VIEW)
It is assumed that in the case of supervised learning the classification problem requires deciding to which class an instance belongs. In supervised learning, unlike semi-supervised and unsupervised learning, it is assumed that the number of classes (categories) and their possible labels are known at the outset. In the case of two classes, the problem is called binary classification [37]. In many cases, the binary classification problem has served as a basis for introducing new classification methods as, for example, in the case of the support vector machine (SVM) approach [38].
In the case of binary classification, the cardinality of the set of decision categories is equal to 2 (i.e. $|C| = 2$), and classifying is carried out into one of the two known classes. However, in many practical applications of machine learning techniques, the number of classes is greater than two, i.e. $|C| > 2$ (see for example [40], [41], [42]). In such a case, a multiclass classification problem is considered and instances belong to one of three or more classes [39]. It should be noted that binary classification is a special case of multiclass classification. On the other hand, multiclass classification can be seen as a natural extension of the binary classification problem.
In a special case where $|C| = 1$, one deals with the one-class classification or unary classification problem. The unary classification problem requires identification of instances belonging to one particular class only. Such a class is arbitrarily referred to as the positive or target one, and it is assumed that the positive class is well characterized by instances. Instances that do not belong to the target class are assumed to belong to the negative class; however, they do not form a statistically representative sample of the negative concept [26], [27], [31]. In the discussed case the aim of the classifier is either to identify only one class amongst all the others possible or to identify positive instances when the negative class examples are either not available, not adequately sampled, or ill-defined.
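The distinction drawn so far depends only on the cardinality of the set of decision categories C. A minimal sketch of that rule follows; the function name `cardinality_type` and the returned strings are hypothetical, chosen only for illustration.

```python
def cardinality_type(categories):
    """Name the problem type implied by |C|, the number of distinct categories."""
    k = len(set(categories))
    if k == 0:
        raise ValueError("the set of decision categories must not be empty")
    if k == 1:
        return "one-class (unary)"
    if k == 2:
        return "binary"
    return "multiclass"


one_class = cardinality_type(["target"])
binary = cardinality_type(["spam", "ham"])
multiclass = cardinality_type(["cat", "dog", "bird"])
```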
So far we have considered classification problems where each instance belongs to a single class and is associated with a single class label. Such problems can be solved using methods of single-label learning for training the classification model [25]. However, in numerous situations classification problems are multi-labelled, where one instance can be naturally associated with multiple, non-exclusive labels. Examples include document, gene, and image categorization. In multi-label learning, the aim is to learn a model mapping instances to the powerset of the decision categories set C. Both single-label and multi-label classification are based on a fixed set of labels. Compared with single-label classification, which predicts only one label for each instance, multi-label classification is more complicated: the label sets of instances differ and the number of labels per instance is not fixed. A review of multi-label classification tools and algorithms can be found in [25], [43], and [45]. A special category of multi-label classification problems encountered in mining data streams is discussed in section III.D. In the literature, the term multi-label classification is often used interchangeably with the term multi-dimensional classification. In fact, multi-dimensional classification can be viewed as a generalization of multi-label classification where each data instance is associated with multiple class variables. The goal of multi-dimensional classification is to assign each data instance to multiple classes. In multi-dimensional classification, class labels are allowed to have more than a single value [48]. Real-life cases of the multi-dimensional approach, including bio-informatics and multi-fault diagnosis, are discussed in [47] and [48].
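The statement that a multi-label model maps instances to the powerset of C can be made concrete with a short sketch. The category names and the helper functions `powerset`, `to_indicator`, and `from_indicator` are illustrative assumptions; the binary indicator vector is one common way of encoding a label set, not a representation prescribed by the paper.

```python
from itertools import combinations

C = ["politics", "sport", "science"]  # hypothetical fixed set of labels


def powerset(categories):
    """All subsets of C, i.e. every label set a multi-label model may output."""
    return [frozenset(s)
            for r in range(len(categories) + 1)
            for s in combinations(categories, r)]


def to_indicator(label_set, categories):
    """Encode a label set as a binary vector aligned with the order of C."""
    return [1 if c in label_set else 0 for c in categories]


def from_indicator(vector, categories):
    """Decode a binary vector back into a label set."""
    return frozenset(c for c, bit in zip(categories, vector) if bit)


targets = powerset(C)
encoded = to_indicator({"sport", "science"}, C)
decoded = from_indicator([1, 0, 1], C)
```

The output space has 2^|C| elements, which is one way to see why multi-label classification is harder than predicting a single label out of |C|.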
Another recently investigated classification problem category is multi-output classification. The task of multi-output classification is to simultaneously predict multiple outputs for a single input. In such a case, the output values may belong to diverse data types, such as binary, nominal, ordinal, and real-valued variables. The problem of multi-output learning has so far attracted the interest of researchers from many areas including, for example, speech recognition, language processing, motion tracking, computer vision, document processing, and ranking in information retrieval [46]. Multi-output classification is also known as multitask classification or multi-output multi-class classification. Multi-label classification can be seen as a special case of the multi-output classification problem.
All the classification problems discussed above assume that a training set is available for inducing a classifier, consisting of instances each represented by a single attribute vector, where each such vector has an associated class, classes, or labels. In many areas, this assumption does not hold and the classification problems belong to the category of multiple instance classification [50]. In multiple instance classification, the aim is to learn a classifier based on a training set of bags. Each bag contains multiple instances and has an associated class. However, the labels of the individual instances within a bag are unknown. According to [51] "bags may also contain instances that are not necessarily relevant, do not convey any information about bag class, or are better related to other classes of bags". A set of bags can be used as a training set, and multi-instance classification aims to predict the class of unlabelled bags. Examples of multiple-instance classification problems can be found in medicine, chemistry, image recognition, etc. More information on multi-instance learning can be found in [51].
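The bag-based setting can be sketched as follows. The sketch relies on an assumption that the text above does not state: the commonly used rule that a bag is positive when at least one of its instances looks positive. The `Bag` class, the averaging `instance_score`, the candidate thresholds, and the data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Bag:
    instances: List[Sequence[float]]  # attribute vectors; their labels are unknown
    label: int                        # only the bag-level class is given


def instance_score(x: Sequence[float]) -> float:
    """Hypothetical instance scorer: the mean of the attribute values."""
    return sum(x) / len(x)


def predict_bag(instances, threshold: float) -> int:
    """Assumed rule: a bag is positive if any instance scores above the threshold."""
    return 1 if any(instance_score(x) >= threshold for x in instances) else 0


def fit_threshold(bags: List[Bag], candidates: List[float]) -> float:
    """Pick the threshold with the highest bag-level accuracy on the training bags."""
    def bag_accuracy(t: float) -> float:
        return sum(predict_bag(b.instances, t) == b.label for b in bags) / len(bags)
    return max(candidates, key=bag_accuracy)


training_bags = [
    Bag([[0.1, 0.2], [0.9, 0.8]], 1),
    Bag([[0.2, 0.1], [0.3, 0.2]], 0),
    Bag([[0.7, 0.9]], 1),
    Bag([[0.4, 0.2], [0.1, 0.1]], 0),
]
threshold = fit_threshold(training_bags, candidates=[0.2, 0.5, 0.9])
```

Note that learning uses bag labels only; no instance inside a bag is ever given its own label.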
Usually, it is assumed by default that label values can be measured using interval or ratio scales. Such an assumption does not always hold. It turns out that when label values are defined using nominal or ordinal scales, the resulting supervised classification problems may require different approaches and techniques.
Ordinal classification is a special case of multiple class problems. Hence, ordinal classification problems can be solved using standard approaches, as in the case of other multiple class problems. While applying multiple class classification methods to ordinal data sometimes works, the outcome can be unsatisfactory since classes are treated equally without considering their interconnections and relative superiority [106]. Ordinal classification problems are further discussed, among others, in [107], [112], [108], [110]. A special case of ordinal classification is monotonic classification [109]. In ordinal classification, the different labels show an ordering relation, related to the specific nature of the target variable. If, additionally, a set of monotonicity constraints has been imposed on the relationship between independent and dependent variables, then the problem is known as monotonic classification.
Other special cases are multiple criteria ordinal classification problems. An ordinal classification problem with multiple criteria consists of the assignment of objects to a finite number of ordered classes. Objects are characterized by attributes with ordered value sets and monotonicity constraints assuring that a higher value of an object on an attribute, with other values being fixed, should not decrease its class assignment. The problem has been studied by several authors, among them [114], [113], [118], [119].
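The monotonicity constraint described above can be checked directly on a labelled dataset. The sketch below assumes that ordered classes are encoded as integer ranks and that all attributes are numeric with "higher is better" semantics; it uses componentwise dominance, which includes the single-attribute case mentioned in the text. The names `dominates` and `monotonicity_violations` are hypothetical.

```python
def dominates(x, y):
    """True when x is at least as high as y on every attribute."""
    return all(a >= b for a, b in zip(x, y))


def monotonicity_violations(data):
    """Return index pairs (i, j) where instance i dominates instance j
    but is assigned to a strictly lower class."""
    violations = []
    for i, (xi, ci) in enumerate(data):
        for j, (xj, cj) in enumerate(data):
            if i != j and dominates(xi, xj) and ci < cj:
                violations.append((i, j))
    return violations


# Illustrative data: (attribute values, class rank); the last row breaks monotonicity.
consistent = [((1, 1), 0), ((2, 1), 1), ((2, 2), 2)]
inconsistent = consistent + [((3, 3), 1)]

no_violations = monotonicity_violations(consistent)
found = monotonicity_violations(inconsistent)
```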
In nominal classification, categories are mutually exclusive. According to Warrens [135], nominal classification can be further divided into two types. The distinction depends upon the presence of the category "absence". When there is no "absence" category, a classification can be described as having several unordered categories of "presence" characterizing possible cases. This type of nominal classification is referred to as regular, as opposed to a dichotomous-nominal classification. The history of developments in nominal classification can be found in [111].
Classification problems where there is some structure (hierarchical or not) among the classes form a wide category of structured classification problems [117]. According to the above authors, hierarchical classification can be seen as a particular type of structured classification problem, where the output of the classification algorithm is defined over a class taxonomy. Class taxonomy can be defined as a tree-structured regular concept hierarchy defined over a partially ordered set (C,≺), where C is a finite set that enumerates all class concepts in the application domain, and the relation ≺ represents the "IS-A" relationship [121].
According to [120], in many real-world classification problems, one or more classes can be divided into subclasses or grouped into superclasses, and instances can belong to more than one class simultaneously at the same hierarchical level. In this case, the classes follow a hierarchical structure, usually a tree or a directed acyclic graph. These problems are known in the machine learning literature as hierarchical multiple label classification problems. Evaluation measures for hierarchical classification were discussed in [116]. An approach for hierarchical multi-label classification was proposed in [115].
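A class taxonomy with an "IS-A" relation can be represented very compactly. The sketch below assumes the tree-structured case (each class has at most one parent) and uses a made-up taxonomy; `ancestors`, `is_a`, and `expand_labels` are hypothetical helper names. A directed acyclic graph would need a set of parents per class instead.

```python
# Hypothetical tree-structured taxonomy: each class maps to its parent (None for the root).
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
}


def ancestors(c, parent=PARENT):
    """All classes above c in the taxonomy, nearest first."""
    result = []
    while parent[c] is not None:
        c = parent[c]
        result.append(c)
    return result


def is_a(c, d, parent=PARENT):
    """True when class c is a (strict) descendant of class d."""
    return d in ancestors(c, parent)


def expand_labels(labels, parent=PARENT):
    """Close a label set under IS-A, as hierarchical multi-label targets usually are."""
    expanded = set(labels)
    for c in labels:
        expanded.update(ancestors(c, parent))
    return expanded


dog_path = ancestors("dog")
full_labels = expand_labels({"dog", "bird"})
```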
The category view of the supervised classification problems is shown in Fig. 2. Category view focuses on cardinality, structure, and properties of categories (labels) encountered in different classification problems.

B. DATA VIEW
A natural structure of data used to induce learners is the relational database model or decision table model, where data are kept in tables [62]. Each row in the table represents an instance (example) with a unique class label. The columns of the table hold attribute values, and each instance usually has a value for each attribute. The model is appropriate for binary learning and multi-class learning, and data organized in a table-like structure are called well-structured. Well-structured data are also used in the case of multiple instance learning, although the class labels are not necessarily provided for all instances from bags belonging to the training set [51]. One-class classification is also based on the discussed structure, even though only positive class instances are guaranteed and the negative class instances can be absent, unlabelled, or not properly defined.
A more complex data structure is required for multi-output learning. In this case, the outputs can be of various types and structures. Different output structures typical for multi-output learning, including independent binary vectors, independent real-valued vectors, rankings, sequences, graphs, trees, links, images, text, audio, and time series, are discussed in [55].
VOLUME XX, 2017
FIGURE 2. Supervised classification problems - category view ("*" denotes remaining options).

FIGURE 3. Data view - a general scheme.
Unstructured data either does not have a pre-defined data model or is organized loosely. Unstructured data has an internal structure, but it is not predefined through data models. It might be human-generated, or machine-generated in a textual or a non-textual format. Unstructured data can be defined as all the data that is not structured. Unstructured data is mostly qualitative. Examples of such data include all kinds of text and audio data, e-mails, web pages, business documents, FAQ's, multimedia content, spatial data, molecular structures, chemical structures, and others [56], [88], [89]. Solving classification problems based on unstructured data is referred to as learning from unstructured data [88], [89].
Structured data is most often quantitative. When data is well-structured, the process of solving classification problems is less demanding in terms of implementation and computation [88]. However, learning from structured data such as sequences, trees, or graphs is less trivial than learning from data organized as decision tables. Challenges in learning from structured data arise in the so-called hybrid domains, where, for example, continuous and discrete structures are mixed. Dealing with hybrid structures and structures representing social networks is discussed in [90]. In [91] it was observed that even learning from structured data is nontrivial. The main reasons behind this finding include ignorance of structural information on input and output domains and the occurrence of high-dimensional structured data containing huge numbers of features and labels. Current methods for learning from structured data are also limited in handling large, isolated substructures [92].
Apart from structured and unstructured data, some specialists also use the term semi-structured data. Semi-structured data has characteristics of both structured and unstructured data. These data are not structured using a relational database model, but they have elements of semantic markup that enforce hierarchies, assuring that some structure is kept [57]. An example of semi-structured data is also discussed in [93], where conditions for learning from semi-structured data are presented. An approach for learning from semi-structured data was suggested in [94], where a genetic programming algorithm for extraction of multiple tree-structured patterns from tree-structured data was proposed.
From the point of view of practical application, important examples of structured data include time-series data and multi-view data. When data is represented by multiple, distinct feature sets, one deals with multi-view data and multi-view learning. An excellent survey of multi-view learning approaches and algorithms can be found in [124]. Another important and challenging problem in data mining is time-series classification. In time-series classification, a training dataset is a collection of pairs $(X_i, Y_i)$, where $X_i$ is either a univariate or multivariate time series and $Y_i$ is its corresponding label vector. Reviews related to time-series classification can be found in [126] and [125].
Knowledge of the structural data model can guide the learning process: it informs the choice of algorithm or tool and of the learning strategy needed to produce a strong system that generalizes well and can deal with different types of data, from numbers to text and from well-defined structures to undefined ones. Hence, we consider structural properties of data as an important dimension of classification problems influencing learning processes. On the other hand, we assume that quality of data is not a dimension of classification problems. Cleansing data, removing outliers, dealing with missing data, reducing data, and other imputation efforts remain important tasks at the pre-processing stage.
To sum up this subsection, a general scheme for the data view is shown in Fig. 3, while a graphic representation of the structural data model is proposed in Fig. 4.

C. ATTRIBUTE VIEW
The traditional machine learning paradigm is based on the processing of examples as multidimensional vectors of attributes. Each attribute has a domain determined by the attribute type. The domain of each attribute may be either symbolic or numerical. The majority of the machine learning algorithms deal with the following types of attributes [61]:
− Numerical (continuous) attributes - they take real or integer values and can have an infinite number of states.
− Nominal attributes (also called categorical) - their values come from a predefined set of possible values.
− Ordinal attributes - they are numeric or nominal, and contain values that have a meaning in terms of ranking or order.
− Discrete attributes - they have a finite or countably infinite set of numerical or categorical values. When the domain of such an attribute consists of two possible values, the attribute type is referred to as binary.
− Complex attributes - they reach beyond a simple attribute-value pair and can be represented by a more complex structure such as, for example, a graph [76].
Attributes belong to the qualitative, quantitative, or complex types. The attribute view, covering the attributes of instances (samples) in supervised machine learning, is presented in Fig. 5.
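The attribute types listed above can be captured in a small schema that records, for each attribute, its type and its domain. The sketch is illustrative only: `Kind`, `Attribute`, `accepts`, and `rank` are hypothetical names, complex attributes are left out, and representing an ordinal domain as an ordered tuple of values is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional, Sequence


class Kind(Enum):
    NUMERICAL = "numerical"
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    BINARY = "binary"


@dataclass(frozen=True)
class Attribute:
    name: str
    kind: Kind
    values: Optional[Sequence[Any]] = None  # the domain; order matters for ORDINAL

    def accepts(self, v: Any) -> bool:
        """Check whether v belongs to the attribute domain."""
        if self.kind is Kind.NUMERICAL:
            return isinstance(v, (int, float)) and not isinstance(v, bool)
        return self.values is not None and v in self.values

    def rank(self, v: Any) -> int:
        """Position of v in the ordering of an ordinal attribute."""
        if self.kind is not Kind.ORDINAL:
            raise TypeError("rank is defined for ordinal attributes only")
        return list(self.values).index(v)


temperature = Attribute("temperature", Kind.NUMERICAL)
colour = Attribute("colour", Kind.NOMINAL, ("red", "green", "blue"))
size = Attribute("size", Kind.ORDINAL, ("small", "medium", "large"))
smoker = Attribute("smoker", Kind.BINARY, (False, True))
```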
The type of attributes characterizing the set of instances (examples) may influence the choice of the machine learning technique for solving a classification problem. Hence, in the literature the different techniques of learning from examples have been discussed including, for example, learning from numeric data [63], learning with symbolic attributes [65], learning problem with categorical data [64], learning from spatial data [66], learning from ordinal data [69], learning from collective data (bags of words or items) [67], learning from discrete data [67], and learning from multidimensional data [67], [68].

D. DATA SOURCE VIEW
The basic approach for solving classification problems using machine learning techniques assumes that data remain unchanged during the learning process, that is, that the data are static. This kind of learning is called batch learning [76], and learners are trained using a single batch of data. Batch learning ignores all new data and focuses entirely on previously learned concepts [70]. Batch learning relies on the assumption that data coming from the data source has a stationary distribution.

FIGURE 5. Attribute view - types of instance attributes in supervised machine learning.
In numerous real-life applications, batch learning is impossible or impractical. More and more often the size of the available datasets outpaces the capability of computational hardware to analyze them. One method of dealing with the problem is to apply so-called incremental algorithms that sequentially process chunks or packages of data one by one, combining the results from each chunk. Data chunks can be formed by the user to overcome problems with computational resources or may come sequentially in a natural way from the data source. Learning from the current chunk and modifying the model once the prediction results have been revealed, so that it is ready for the next chunk, is called incremental learning, and the data source producing a sequence of chunks is called the incremental data source.
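The chunk-by-chunk scheme described above, predicting on the current chunk and then updating the model once the true labels are revealed, can be sketched as follows. The nearest-centroid model kept as running sums, the class and function names, and the data are assumptions made for illustration; any model that supports incremental updates could take its place.

```python
class IncrementalCentroidClassifier:
    """Keeps per-class running sums so each chunk can be absorbed and then discarded."""

    def __init__(self):
        self.sums = {}    # class label -> per-attribute sums
        self.counts = {}  # class label -> number of instances seen

    def partial_fit(self, chunk):
        """Update the model with one chunk of (attribute vector, label) pairs."""
        for x, c in chunk:
            if c not in self.sums:
                self.sums[c] = [0.0] * len(x)
                self.counts[c] = 0
            self.sums[c] = [s + v for s, v in zip(self.sums[c], x)]
            self.counts[c] += 1

    def predict(self, x):
        """Return the class with the nearest centroid, or None before any training."""
        if not self.sums:
            return None

        def sq_dist(c):
            return sum((v - s / self.counts[c]) ** 2 for v, s in zip(x, self.sums[c]))

        return min(self.sums, key=sq_dist)


def run_incremental(chunks):
    """Predict on each chunk first, then learn from it once labels are revealed."""
    model = IncrementalCentroidClassifier()
    accuracies = []
    for chunk in chunks:
        if model.sums:  # the first chunk can only be used for training
            hits = sum(model.predict(x) == c for x, c in chunk)
            accuracies.append(hits / len(chunk))
        model.partial_fit(chunk)
    return model, accuracies


chunks = [
    [((0.0, 0.0), "a"), ((4.0, 4.0), "b")],
    [((0.5, 0.2), "a"), ((3.8, 4.1), "b")],
    [((1.0, 1.0), "a"), ((3.0, 3.5), "b")],
]
model, accuracies = run_incremental(chunks)
```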
A special case of incremental learning is online learning. Online learning is needed to deal with an endless stream of received data such as, for example, sensor data, currency rates, stock market indexes, or video streams. In online learning, the class label of an instance is predicted immediately when the instance arrives, and the true class label is revealed afterward. In the next step, the incoming instance is incorporated into the training dataset. Learning in such an environment is categorized as online learning [76], and the data source as the online data source or data stream.
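The predict-then-update cycle of online learning can be illustrated with a short sketch. The perceptron learner, the function name `prequential_accuracy`, and the example stream are assumptions chosen for brevity; the literature cited here does not prescribe a particular learner. Labels are assumed to be encoded as +1 and -1.

```python
def prequential_accuracy(stream, n_features, learning_rate=1.0):
    """Predict each instance on arrival, then update once its true label is revealed.

    stream yields (feature_vector, label) pairs with label in {+1, -1}.
    Returns (accuracy, weights, bias) after the stream has been consumed.
    """
    weights = [0.0] * n_features
    bias = 0.0
    correct = 0
    total = 0
    for x, label in stream:
        # Step 1: predict immediately, before the true label is known.
        score = sum(w * v for w, v in zip(weights, x)) + bias
        predicted = 1 if score >= 0 else -1
        # Step 2: the true label is revealed; record the outcome.
        correct += predicted == label
        total += 1
        # Step 3: incorporate the instance (perceptron update on a mistake).
        if predicted != label:
            weights = [w + learning_rate * label * v for w, v in zip(weights, x)]
            bias += learning_rate * label
    return (correct / total if total else 0.0), weights, bias


# Hypothetical four-instance stream; any iterable or generator can be used.
stream = [([1, 0], 1), ([0, 1], -1), ([2, 0], 1), ([0, 2], -1)]
accuracy, weights, bias = prequential_accuracy(stream, n_features=2)
```

Because every instance is used for testing before it is used for training, the returned accuracy reflects performance on data the model had not yet seen.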
If the distribution of data from the data source is not constant the domain is said to have a non-stationary distribution. In this case, changes of the underlying data distribution known as the concept drift may occur [73], [74], [77].
Among the challenges facing online learning is the scalability of the learning process, which should be ready to learn from thousands of training examples. Almost at the same time, classification decisions have to be made for new data flowing into the system [80]. Another challenge is coping with possible concept drift and the dynamic character of the observed data source [71], [72], [75], which requires timely and accurate drift detection mechanisms. It should be noted that distribution changes may occur not only in the feature space but also in the class space, or simultaneously in both of these spaces [70]. Difficulties should also be expected when the data stream used to induce a learner is class imbalanced.
Many real-world problems involve data which are multi-label data streams [49], [95]. The problem of multi-label classification is characterized by unique properties as compared with other types of classification problems. A special feature of multi-label data streams is that the set of labels is not fixed at the outset and may change during the learning. Besides, more than one label may be assigned to incoming items during the classification process. Examples of multi-label data streams include data from different sensor applications, traffic management, web exploration, manufacturing processes, as well as from social media networks, where a photo posted on Facebook or Twitter might be labelled continuously and differently by users [49]. Another example is the categorization of incoming mail, where each email may be relevant to a thematic label, as well as to a label concerning confidentiality. Such labels are called orthogonal [95]. On the other hand, labels may also be correlated. A review of the multi-label data stream learning algorithms can be found in [54], [95].
Data used to induce learners may be stored in one central repository. Alternatively, the data may be physically and geographically distributed using, for example, cloud computing technologies, in which case distributed data sources are considered [76].
Data stored in multiple separate sites may be homogeneous. In such a case, each site consists of instances defined on the same set of attributes. When the sets of attributes differ between the sites, the stored data are heterogeneous, although some attributes may be common to several sites. The separate data sets may also have distinct structures (format differences, semantic differences, etc.) [78], and could have been subject to horizontal and vertical data fragmentation [79].
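The distinction between homogeneous and heterogeneous distributed data can be checked mechanically by comparing the attribute sets of the sites. The sketch below is illustrative only; the function `describe_sites`, the site names, and the attribute names are hypothetical.

```python
def describe_sites(site_attributes):
    """Classify distributed sites by comparing their attribute sets.

    site_attributes maps a site name to the attribute names stored there.
    Returns a pair: "homogeneous" or "heterogeneous", and the sorted list
    of attributes common to all sites.
    """
    if not site_attributes:
        raise ValueError("at least one site is required")
    attribute_sets = [frozenset(attrs) for attrs in site_attributes.values()]
    first = attribute_sets[0]
    # Homogeneous: every site is defined on exactly the same attributes.
    same = all(s == first for s in attribute_sets)
    kind = "homogeneous" if same else "heterogeneous"
    # Attributes shared by all sites (may be non-empty even if heterogeneous).
    common = frozenset.intersection(*attribute_sets)
    return kind, sorted(common)


# Hypothetical sites holding transaction data.
same_schema = {"site_a": ["amount", "country"], "site_b": ["country", "amount"]}
mixed_schema = {"site_a": ["amount", "country"], "site_b": ["amount", "merchant"]}
```

For `mixed_schema` the sites are heterogeneous, yet they still share the attribute `amount`, which matches the remark that some attributes may be common.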

E. PERFORMANCE CRITERIA VIEW
As has been shown in Section II, classification problems belong to a wide class of optimization problems. A performance criterion (performance measure), or a set of criteria, cannot be considered a feature of the particular optimization problem, since the choice of the criterion is in the hands of the user who carries out the optimization process. Besides, it is usual that a problem can be solved to optimum using different performance measures. The above observations also hold for classification problems. Nevertheless, there exists a set of criteria that is commonly used when solving classification problems using machine learning techniques. Among the measures belonging to this set, one can list classification accuracy, classification error, classification cost, sensitivity, specificity, the area under the curve, F1 score, precision, recall, and many others [86].
There are, however, classification problems where a narrow range of possible performance measures is justified. For example, in the case of imbalanced data classification problems, a meaningful set of criteria includes, among others, Geometric Mean (G), Area Under the Curve (AUC), Balanced Accuracy (BACC), and Matthews Correlation Coefficient (MCC). Another example of supervised classification problems where specialized performance measures are better suited than the standard ones is ordinal classification. Cardoso and Sousa [122] discuss the problem and propose a specialized criterion for measuring the performance of ordinal classification, named Ordinal Classification Index (OCI).
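To illustrate why plain accuracy can mislead on imbalanced data, the following sketch computes G, BACC, and MCC from the four cells of a binary confusion matrix, using their standard textbook definitions. The function name `imbalance_metrics` and the example counts are assumptions made for illustration; AUC and OCI are not covered because they require ranking scores or ordinal structure.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Compute accuracy, G, BACC and MCC from binary confusion-matrix counts."""
    total = tp + fn + fp + tn
    # True positive rate and true negative rate (0.0 when a class is absent).
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "g_mean": math.sqrt(sensitivity * specificity),
        "bacc": (sensitivity + specificity) / 2,
        # MCC is set to 0.0 by convention when its denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical case: 10 positives among 1000 instances, and a classifier
# that labels everything as negative.
always_negative = imbalance_metrics(tp=0, fn=10, fp=0, tn=990)
```

In this example the accuracy is 0.99 although no positive instance is detected, whereas G is 0.0, BACC is 0.5, and MCC is 0.0.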
The majority of studies on learning from data are focused on a single-objective optimization, where the aim is to optimize a single performance measure selected by the user [82]. The problem of data classification can be also formulated and solved as the multi-objective optimization case. Solving multi-objective classification problems using the machine learning techniques and the supervised learning paradigm has been studied, for example, in [81], [83], [84], [85], [86].
In subsections III.A to III.E we have discussed various factors and properties of the supervised classification problems that may influence the learning process and could be decisive in selecting an effective learning technique or algorithm. In Fig. 6, the landscape of factors characterizing decision problems and influencing the supervised learning processes is shown.

IV. COMPLEXITY ISSUES
The complexity of the classification problems can be studied in several aspects. One of them takes into account the properties of the classifier induced from the available training set. If such a classifier, that is the function h assigning instances to a predefined set of decision classes, is linear and, at the same time, its predictions assure the required performance level, then the complexity of the classification problem can be considered as a low one. In such a case, finding a linear combination of features that characterizes or separates two or more classes of objects is not a difficult task. If, however, a linear discriminant function cannot be found or it does not assure the required performance level then one has to look for a non-linear function and the problem becomes more complex [97].
Though there are so far only a few formal results reported in the literature on the complexity of the machine learning classification problems, some interesting ideas for the two-class problems were suggested by Zhao and Wu in [97]. According to [97]: "if a two-class problem is not K-degree linear separable, then we refer to it as a K-degree linear nonseparable two class problem", where K is the number of hyperplanes needed to discriminate between each pair of instances from different classes. Zhao and Wu in [97] further state that: "a two-class problem has K-degree classification complexity if it is K-degree linear separable but not (K-1) degree linear separable". The proposed concept of the classification complexity can be used to design a multi-layer perceptron with the minimum required number of layers.
The classification complexity can also be seen as depending on both the feature space and the data size. Big sets of data may cause an excessive demand for computational resources. On the other hand, a small training set can appear deceptively simple. However, when the cardinality of the set of attributes of such a training set is high, the classification problem may not be easy to solve satisfactorily [96].
Early studies on the effects of dimensionality, sample size, and the structure of the classification algorithm on misclassification concentrated on probability distance measure bounds, entropy measures, interclass distance measures, scatter matrices, information-theory-based approaches, boundary methods, and feature space partitioning methods [44], [105].
Singh in [96] emphasized that the classification problem complexity should be studied considering decision boundaries. He proposed two measures of classification complexity based on feature space partitioning: purity and neighborhood separability and compared them with probabilistic distance measures and several other nonparametric estimates of classification complexity.
In [98] the complexity of a discrimination problem has also been discussed, taking account of the data structure and the number of data. The authors show that an incomplete or sparse sample (a relatively small data set) adds a level of complexity. In other words, when the sample is too small the problem may appear only deceptively simple. The small data effects are also considered in Vapnik's VC-dimension theory [99].
The classification complexity can be also evaluated using the computational learning theory (CoLT). It allows estimating potentials of learning algorithms for function approximation and generalization [100]. The CoLT theory is also related to the probably approximately correct (PAC) learning. PAC learning provides a way to quantify the computational difficulty of a machine learning task [101]. The theory is concerned with binary classification, but it remains valid for cases with more classes [102].
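As a concrete illustration of how PAC learning quantifies difficulty, one standard textbook result (stated here as background, not taken from [100]-[102]) bounds the number of training examples sufficient for a learner that outputs a hypothesis consistent with the training data, chosen from a finite hypothesis set H in the realizable case: m >= (1/epsilon) (ln|H| + ln(1/delta)), where epsilon is the tolerated error and 1 - delta the required confidence. The function name `pac_sample_bound` and the example values are assumptions.

```python
import math


def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Sufficient sample size for a consistent learner over a finite hypothesis set.

    Implements m >= (ln|H| + ln(1/delta)) / epsilon (realizable case).
    """
    if hypothesis_count < 1 or not 0 < epsilon < 1 or not 0 < delta < 1:
        raise ValueError("need |H| >= 1 and epsilon, delta in (0, 1)")
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


# Hypothetical setting: 2**10 candidate hypotheses, error at most 0.1,
# confidence at least 0.95.
examples_needed = pac_sample_bound(2 ** 10, epsilon=0.1, delta=0.05)
```

The bound grows only logarithmically with the size of the hypothesis set but linearly with 1/epsilon, so demanding a smaller error is the more expensive requirement.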

FIGURE 6. The landscape of decision problem factors influencing the supervised learning processes.
The problem of classification complexity is still considered from several different perspectives. Among them are approaches that estimate the complexity of a classification problem using different measures. These measures focus on estimating the shape and size of the decision boundary, for binary as well as for multiclass classification problems respectively (see, for example, [103] and [104]). Complexity measures may support various supervised machine learning tasks, including data preprocessing, design of machine learning algorithms, and choice of a classifier adequate to the features of the available data [103].

V. NOTATION FOR IDENTIFICATION OF THE SUPERVISED LEARNING CLASSIFICATION PROBLEMS
Apart from the classification problem complexity, a rational choice of preprocessing tasks and, later on, the machine learning technique or algorithm, requires identification of all relevant features of the problem at hand. To make this task easier we suggest using a special notation. The idea is inspired by the notation introduced and used in the field of operations research for scheduling problems [87].

A. COMPONENTS OF THE PROPOSED NOTATION
To identify a supervised classification problem we propose to use the following 4-tuple of fields: α | β | γ | δ, where each field is a colon-separated string of symbols.
The first field, denoted α, represents the category view. It consists of 4 subfields describing structure, cardinality, scale, and the relation category-features. Symbols within a field are separated by a colon. Unknown or undefined values are replaced by *. The first subfield denotes the type of structures and may include one of the following symbols:
Example: Notation She:Mlc:I/R:Sil|*|*|* refers to a single label, multiple class problem with the hierarchical structure of categories that can be measured using interval/ratio scale.
The second field, denoted β, represents the data view. This field contains three subfields. The first subfield describes the structure model of data, the second the data distribution, and the third the data size. The first subfield may include the following symbols:
- Sts: time series structured data.
The second subfield may include the following symbols:
- Reg: regular distribution of data.
- Imb: imbalanced distribution of data.
The third subfield may include the following symbols:
- Rsi: regular data size.
Example: Notation She:Mlc:I/R:Sil|Und:Reg:Rsi|*|* refers to a single label, multiple class problem with the hierarchical structure of categories that can be measured using interval/ratio scale. Besides, data are unstructured, with regular distribution and regular size.
The third field of the proposed notation, denoted γ, represents the attribute type and contains two subfields. The first subfield describes data type and may consist of the following symbols:
- Nds: nonstationary data stream (nonstationary dynamic, nonstationary online data source, data with a concept drift). Data arrive one by one.
- Dun: data stream of the unknown character (dynamic, online data source of the unknown character). Data arrive one by one.
Example: Notation She:Mlc:I/R:Sil|Und:Reg:Rsi|Con:Hom|* refers to a single label, multiple class problem with the hierarchical structure of categories that can be measured using interval/ratio scale. Besides, data are unstructured, with regular distribution and regular size. In addition, data are continuous and come from a homogenous data source.
The fourth field, denoted δ, represents the performance criterion and may contain one of the following symbols:
- Sin: single objective performance criterion.
- Mop: multiple objective performance criteria.
Example: Notation She:Mlc:I/R:Sil|Und:Reg:Rsi|Con:Hom|Sin refers to a single label, multiple class problem with the hierarchical structure of categories that can be measured using interval/ratio scale. Besides, data are unstructured, with regular distribution and regular size. In addition, data are continuous and come from a homogenous data source. The problem is to optimize a single objective performance criterion.
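Because the notation is a plain string with fields separated by "|" and symbols separated by ":", it can be processed automatically. The following sketch splits a notation string into its four fields; the function name `parse_notation` and the dictionary keys are illustrative assumptions, and no validation against the allowed symbol lists is attempted.

```python
FIELD_NAMES = ("category view", "data view", "attribute type", "performance criterion")


def parse_notation(notation):
    """Split a notation string into its four fields and their symbols.

    A field given as "*" (unknown or undefined) is mapped to None; a "*"
    used for a single subfield is kept as the string "*".
    """
    fields = [field.strip() for field in notation.split("|")]
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    parsed = {}
    for name, field in zip(FIELD_NAMES, fields):
        if field == "*":
            parsed[name] = None
        else:
            # Symbols within a field are separated by a colon.
            parsed[name] = [symbol.strip() for symbol in field.split(":")]
    return parsed


problem = parse_notation("She:Mlc:I/R:Sil|Und:Reg:Rsi|Con:Hom|Sin")
partial = parse_notation("She:Mlc:I/R:Sil|*|*|*")
```

Whitespace around the separators is ignored, so a spaced form such as "She : Mlc : I/R : Sil | Und : Reg : Rsi | Con : Hom | Sin" parses to the same result.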

B. EXAMPLE CASES
To illustrate how the proposed notation can be used to describe the classification problem, several examples are discussed in this subsection.
Example 1. To implement the machine learning system for credit card fraud detection the following arbitrary assumptions have been made:
− The system is expected to decide whether a credit card transaction is fraudulent or not.
− The system will be used by more than one organization.
− Attributes of transactions are nominal, continuous, discrete, and symbolic.
− On the whole, there will be many more non-fraudulent transactions than fraudulent ones.
− Data sources are distributed.
− Transactional data come from a stationary data stream.
For the considered case the following notation can be used: She : Bin : I/R : Sil | Sts : Iim | Mix : Ddis : Hom : Sds | Sin
Example 2. In [95], the problem of multi-label stream classification was considered. To deal with it, a Multiple Windows (MW) approach with a word bag model and a single performance criterion was proposed. The authors transformed the multi-label problem into multiple binary problems and solved each problem independently. The approach was validated using three large real-world multi-label datasets as shown in Table II. Classification problems solved in [95] can be denoted, using the proposed notation, as follows:
Example 3. In [93], the collective intelligence system called RealTravel is discussed. The system has been designed to work in an environment where:
− Data are distributed.
− Data are represented by different types, i.e. they are text, numbers, photos, etc., and thus are semi-structured, mixed, and multidimensional.
− The system generates multi-class recommendations.
− Evaluation of the recommendation quality is carried out based on a multi-objective approach.
For the above case the notation can be as follows: She : Mlc | Uud | Mix : Umu : Ddis | Mop

VI. THE RESEARCH EFFORT
To evaluate the research effort spent on the development of models and tools designed for solving various types of classification problems we show in Fig. 7 the number of publications and their h-indexes as provided by Web of Science and Scopus. From Fig. 7 it appears that binary classification and multi-class classification problems have been studied most intensively among all classification problems.

VII. CONCLUSION
The main contributions of the paper include:
− Offering a review of the current research effort in the field of supervised learning covering various types of classification problems tackled in the relevant literature.
− Proposing an original taxonomy for categorizing main dimensions of the supervised learning classification problems ordered by a category view, data view, attribute view, data source view, and performance criteria view.
− Proposing a simple notation for identification of the supervised learning classification problem categories.
The proposed taxonomy is based on the analysis of factors relevant for constructing and solving the supervised learning classification problems. The analysis of the machine learning publications has enabled compiling the landscape of decision problem factors influencing the supervised learning processes. The proposed notation offers a concise and coherent way to describe various supervised learning classification problem types and their critical features. The ultimate goal of both the proposed taxonomy and the notation is to provide those interested in supervised learning with a simple way to identify the main factors that have to be considered when looking for a method and a tool for solving a particular supervised classification problem. The proposed notation is open and can be further extended to take into account new methods and techniques. It could also be a starting point for constructing a decision support system or recommender able to help a layman in the machine learning field to select the proper method or tool for solving the problem at hand. Constructing such a system will be the focus of future research.