Rigorous and compliant approaches to one-class classification

https://doi.org/10.1016/j.chemolab.2016.10.002

Highlights

  • Many qualitative analytical issues should be addressed by one-class classification.

  • We present ‘rigorous’ and ‘compliant’ approaches for model development.

  • In the ‘rigorous’ approach, only information from the target class is employed.

  • In the ‘compliant’ approach, the non-target class data is also utilised.

  • The pros and cons of both methods are critically discussed.

Abstract

A large number of real-world problems requiring qualitative answers should be addressed by one-class classification (OCC), as in authentication studies, verification of particular claims, and quality control. The key feature of OCC is that models are developed using only samples from the target class, so that representative sampling is not strictly required for non-target classes. By contrast, in the discriminant analysis (DA) approach, all of the classes considered (at least two) have a non-negligible influence on the definition of the delimiter. It follows that faults in the definition of the classes involved, or in the representative sampling of each of them, may bias the classification rules. A key aspect of one-class classification concerns model optimisation. When the optimal modelling conditions are sought by considering parameters such as type II error or specificity (‘compliant’ approach), information from the non-target class is used and may therefore bias the model. To build pure class models (‘rigorous’ approach), only information from the target class should be considered: in other words, optimisation should be performed considering only type I error, or sensitivity. In the present study, the ‘compliant’ and ‘rigorous’ approaches are critically compared on real case studies by applying two novel modelling techniques: partial least squares density modelling (PLS-DM) and data-driven soft independent modelling of class analogy (DD-SIMCA).

Introduction

One-class classification (OCC) [1], [2] consists of describing a target class of objects and detecting whether a new object resembles this class or not. The term class modelling is often used to denote OCC methods [3]. In some sense, this approach is the opposite of the discrimination problem, which is to allocate a new object to one of several distinct and exhaustive classes [4]. The critical difference between OCC and discriminant analysis (DA) is that the OCC model is developed using target class samples only.

The work of Harold Hotelling on multivariate quality control (1947) can be considered the first example of multivariate one-class classification in chemistry [5]. The unequal class models (UNEQ) method was developed by Derde and Massart (1986) as an evolution of these concepts [6]. This method, closely related to quadratic discriminant analysis (QDA), is based on the hypothesis of a multivariate normal distribution in the class to be modelled and defines the width of the class space based on Hotelling's T² statistic, at a selected confidence level.
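For orientation, a T²-based acceptance rule of this kind can be written in its textbook form. The notation is assumed here and is not taken from the source: x̄ and S are the mean vector and covariance matrix estimated from n training samples of the target class measured on p variables, x is a new sample independent of the training set, and F is the (1−α) quantile of the F distribution with p and n−p degrees of freedom. The exact formulation used in UNEQ [6] may differ.

```latex
T^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})^{\mathsf T}\, \mathbf{S}^{-1}\, (\mathbf{x} - \bar{\mathbf{x}}),
\qquad
\text{accept } \mathbf{x} \text{ if } \;
T^2(\mathbf{x}) \le \frac{p\,(n-1)(n+1)}{n\,(n-p)}\; F_{1-\alpha;\,p,\,n-p}
```

Under the multivariate normality hypothesis, a new target-class sample is rejected with probability α, which is the type I error fixed by the chosen confidence level.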

The first method specifically developed for one-class classification in chemometrics was soft independent modelling of class analogy (SIMCA), by Svante Wold [7], [8]. This method performs PCA on the samples of the class to be modelled, the SIMCA model being defined as the range of sample scores on the significant PCs. A critical distance, at a given confidence level, is obtained by applying the Fisher F statistic to the residuals of each training sample with respect to the model, and is used to define the boundaries of the SIMCA class space around the model.

OCC modelling is a rather new strategy in comparison with DA. The classical OCC version does not utilise any information about non-target (extraneous) classes, even when data regarding such classes are available. We call this approach ‘rigorous’. As a contribution to the development of OCC techniques, we consider the outcomes that result when the rigorous concept is violated. The most common violation, which we call the ‘compliant’ approach, makes use of some relevant non-target information that can influence the results of OCC modelling.
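The distinction between the two approaches can be illustrated with a minimal sketch of model selection. Everything in it is an assumption made for illustration: the `Candidate` record, the two selection criteria, and the accept/reject decisions are hypothetical and are not the procedure or the data of this study. The rigorous rule looks only at target-class decisions, while the compliant rule also consults decisions on non-target samples.

```python
from dataclasses import dataclass


def sensitivity(target_accepted):
    """Fraction of target-class samples accepted by the model (1 - type I error)."""
    return sum(target_accepted) / len(target_accepted)


def specificity(nontarget_accepted):
    """Fraction of non-target samples rejected by the model (1 - type II error)."""
    return sum(not a for a in nontarget_accepted) / len(nontarget_accepted)


@dataclass
class Candidate:
    """Hypothetical record of one candidate model setting (e.g. a number of PCs)."""
    label: str
    target_accepted: list     # decisions on held-out target samples (True = accepted)
    nontarget_accepted: list  # decisions on non-target samples (True = accepted)


def select_rigorous(candidates, alpha):
    """Target-class information only: pick the setting whose held-out
    sensitivity is closest to the nominal level 1 - alpha."""
    return min(candidates,
               key=lambda c: abs(sensitivity(c.target_accepted) - (1 - alpha)))


def select_compliant(candidates):
    """Non-target data also used: pick the setting with the highest specificity,
    breaking ties by sensitivity (one possible criterion among many)."""
    return max(candidates,
               key=lambda c: (specificity(c.nontarget_accepted),
                              sensitivity(c.target_accepted)))


# Made-up decisions, for illustration only.
candidates = [
    Candidate("2 PCs", [True] * 19 + [False] * 1, [True] * 6 + [False] * 14),
    Candidate("5 PCs", [True] * 16 + [False] * 4, [True] * 1 + [False] * 19),
]

rigorous = select_rigorous(candidates, alpha=0.05)
compliant = select_compliant(candidates)
print(rigorous.label, compliant.label)
```

With these made-up numbers the two rules disagree: the rigorous rule keeps the setting whose sensitivity matches the nominal level, whereas the compliant rule prefers the setting that rejects more non-target samples at the cost of sensitivity.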

The main objective of the present study is the comparison between the outcomes of the ‘rigorous’ and ‘compliant’ approaches. For this purpose, two different OCC methods, namely partial least squares density modelling (PLS-DM) [9] and data-driven soft independent modelling of class analogy (DD-SIMCA) [10], are employed. Method descriptions are presented in 3.1 Dataset, 3.2 Dataset. An additional goal is to compare these techniques using two real-world examples.

Section snippets

Figures of merit

Performances of one-class classifiers are usually reported using two parameters: sensitivity and specificity. Sensitivity is the fraction of samples of the target class which are correctly recognised as consistent with the model. It can also be defined as the rate of true positives and, therefore, it is complementary to type I error (i.e., the false negative rate). Specificity is the fraction of samples extraneous to the target class which are correctly recognised as inconsistent with the

Materials

We consider two different datasets. The first, Olives, comprises samples of natural origin: olives in brine. Variability among such samples is inevitable. In the present study, variability is taken into account both within a single harvest year and between different harvest years. The second dataset, Remedy, consists of samples of artificial origin: uncoated tablets. Here, variability between samples is much lower and mainly manifests as variation between batches.

DD-SIMCA

As mentioned above, two types of models, ‘rigorous’ and ‘compliant’, are considered. The results regarding model sensitivity are presented in Table 3. The best results for the ‘rigorous’ model are obtained with 3 PCs and type I error α=0.01. Both a-priori α values are in good agreement with a-posteriori sensitivity calculated for subsets T1, I1 and E1.

At the same time, specificity is not completely satisfactory (see Fig. 2a and Table 4). Misclassification results are originated from

Results for dataset Remedy

Unlike the Olives case, we consider three peer subsets corresponding to three different manufacturers. Samples originating from each manufacturer are considered in turn as target class samples, and three OCC models are built accordingly.

Discussion

Comparing the two OCC methods, we can conclude that DD-SIMCA is a global modelling method, while PLS-DM represents a local approach. At a fixed level of type I error, α, the first method has only one free parameter, the number of PCs, that can be used for tuning in the ‘compliant’ approach. When the number of PCs is increased, training sensitivity varies around the given sensitivity level (1–α), while validation sensitivity decreases. These tendencies are observed due to evident

Conclusions

A distinct feature of OCC is the possibility to build a model for one class without in-depth information regarding other classes or samples. In the ‘rigorous’ OCC approach, all model parameters and validation procedures are based only on information regarding the target class. This can be considered an advantage of OCC, especially for solving authentication problems. At the same time, for overlapping datasets, this is a drawback. When the classes under study are well separated, the

Acknowledgments

Financial support by the Italian Ministry of Education, Universities and Research (MIUR) is acknowledged – Research Project SIR 2014 “Advanced strategies in near infrared spectroscopy and multivariate data analysis for food safety and authentication”, RBSI14CJHJ (CUP: D32I15000150008).
