An alternative approach to dimension reduction for pareto distributed data: a case study

Roccetti, Marco; Delnevo, Giovanni; Casini, Luca; Mirri, Silvia

doi:10.1186/s40537-021-00428-8

Research
Open access
Published: 25 February 2021

An alternative approach to dimension reduction for pareto distributed data: a case study

Marco Roccetti ORCID: orcid.org/0000-0003-1264-8595¹,
Giovanni Delnevo¹,
Luca Casini¹ &
…
Silvia Mirri¹

Journal of Big Data volume 8, Article number: 39 (2021) Cite this article

2889 Accesses
33 Citations
Metrics details

Abstract

Deep learning models are tools for data analysis suitable for approximating (non-linear) relationships among variables for the best prediction of an outcome. While these models can be used to answer many important questions, their utility is still harshly criticized, being extremely challenging to identify which data descriptors are the most adequate to represent a given specific phenomenon of interest. With a recent experience in the development of a deep learning model designed to detect failures in mechanical water meter devices, we have learnt that a sensible deterioration of the prediction accuracy can occur if one tries to train a deep learning model by adding specific device descriptors, based on categorical data. This can happen because of an excessive increase in the dimensions of the data, with a correspondent loss of statistical significance. After several unsuccessful experiments conducted with alternative methodologies that either permit to reduce the data space dimensionality or employ more traditional machine learning algorithms, we changed the training strategy, reconsidering that categorical data, in the light of a Pareto analysis. In essence, we used those categorical descriptors, not as an input on which to train our deep learning model, but as a tool to give a new shape to the dataset, based on the Pareto rule. With this data adjustment, we trained a more performative deep learning model able to detect defective water meter devices with a prediction accuracy in the range 87–90%, even in the presence of categorical descriptors.

Introduction

When data scientists write a book on how a supervised deep learning project is to be structured, what it is lived at the beginning, i.e., when input data are analyzed, takes up only the first third, or so. The bulk of the story, in fact, is usually considered what happens next: with the model training and validation process, and then with the testing phases of the outcomes [1].

Unfortunately, the first phase of data analysis and preparation is almost never considered as a silver bullet, and it often remains an underinvested branch in the deep learning practice. In some sense, our scientific community has not been as effective at developing strategies to construct reliable datasets as we have with those to learn from them. But that needs to change, under the penalty of severe consequences, ranging from inaccurate predictions to the lack of explainability of our models [2, 3]. Further, the progress we need to achieve cannot be in the abundance of data we collect, alone. It should be also in our ability to make sure that our project actually benefits from the particular data we choose. Focusing all into a huge amount of data can be a good premise, in fact; yet, not asking any questions about their usage, role and the value that can be drawn from it, will turn that premise into the first motivation behind a failure. In other words, data is just an initial representation of a situation, but key remains the way we analyze it. We go so far as to say that data is there, mostly to stimulate an accurate analysis. Whether a single piece of data is either descriptive or predictive, or even prescriptive, in nature, is just what can be understood through an in-depth analysis of that data [4,5,6,7].

In this context, the target of our study is to demonstrate that putting a proper data analysis at the core of a deep learning project can assist one in identifying the most accurate data descriptors for that project. While it is well known in fact that, in computing, a data descriptor is just a structure containing information that describes data, yet data descriptors for deep learning can span several diverse aspects, as the data’s provenance and type, up to its storage schema. In essence, descriptors encapsulate a basic knowledge about the data, and can thus be used as starting points for the construction of a trustworthy dataset on which a deep learning model can be safely trained.

Along this line, in this paper we describe a deep learning design experience, where we had initially a trouble on developing an appropriate deep learning model able to detect failures in mechanical water meter devices, because we tried to train that model by merging together the numerical information relative to water consumption with some device descriptors based on categorical information, thus resulting into an explosion in data dimensionality, that soon determined a deterioration of the prediction accuracy [8, 9]. After several unsuccessful experiments conducted with alternative methodologies that either permitted to reduce the data space dimensionality or employed more traditional machine learning algorithms, we changed the training strategy. In essence, we moved towards an accurate statistical analysis of the initial data, culminating with the application of an approximation of the 80/20 Pareto rule, that made us understand that categorical descriptors could not be part of the contents our model had to learn. Starting from this changed perspective, we devised a new strategy where categorical descriptors were used just as a driver for data selection, rather than being fed as input to the model. This way, we kept under control the dimensionality of the learning space and, at the same time, we achieved satisfying results of model prediction accuracy, in terms of detection of defective devices, reaching values in the range from 87 to 90%.

Anticipating a part of the final results shown in this paper, we introduce Fig. 1, where the Z, X and Y variables represent, respectively, the prediction accuracies, obtained in the following three different situations: (1) The deep learning model is trained only with numerical data (X); (2) The deep learning model is trained with a mix of numerical and categorical data (Y); (3) The deep learning model is trained with a selection of numerical data, based on a Pareto analysis conducted on the categorical data (Z).

As portrayed in Fig. 1, and discussed at length in the remainder of the paper, Z (90%) outperforms Y (83%).

In conclusion, in this paper we demonstrate that it is possible to train a deep learning model that achieves excellent prediction accuracy levels, even in the presence of both numerical and categorical descriptors. This approach was devised to select the training data, based on a Pareto analysis conducted on the categorical descriptors, thus avoiding the explosion of the data space dimensions while keeping intact the statistical coherence of the portion of the dataset selected for training. We have provided empirical evidence that this approach maintains its validity even if compared with more traditional space dimension reduction methodologies and classical machine learning algorithms, as well.

The remainder of this paper is structured as follows. In the Section devoted to the Related work, the focus is put on what can happen to the dimensions of a learning space when categorical variables are employed, along with a survey on the techniques usually adopted to manage this situation. In the Methodology Section: (i) we present the initial dataset on which we have worked, (ii) we illustrate some (unsuccessful) deep learning model training experiments that employed classical techniques to reduce the data dimensionality space, (iii) we describe the approach through which this initial dataset was reshaped, along with an analysis that demonstrates that these data re-adjustment operations do not change the statistical coherence of the selected data, and (iv) we finally illustrate how those data can be used to train a deep learning model. In the Section devoted to describing the Results, instead, we present and discuss the results we achieved with our approach. The Section devoted to the Discussion supplies: (i) some reflections on the advantages and limitations of our approach, along with a comparison with some alternative machine learning methods, and) (ii) a practical guide on how to use our classifier culminating with the adoption of the additional technique of Bagging, able to further increase the model performances. The final Section provides the Conclusions and terminates the paper with some concluding remarks.

Related work

At the core of this paper lies the question if all types of descriptors should be presented to a deep learning model in a way that is opaque to most of the designers, rather than subjecting them to an accurate data analysis before using them for training. In particular, this problem exacerbates when we have to deal with two very different types of data: i.e., numerical vs. categorical. In fact, while numerical data are measurable in nature, and easily manageable, categorical data, instead, represents a collection of information, that can be divided into groups; e.g., black and white; and, as such, they can take on numerical values (for example: 1 indicating black, and 2 indicating white), but those numbers do not have a precise mathematical meaning. This is the reason why working with categorical descriptors can easily lead to an increase of the dimensions of the space under investigation. This phenomenon may go so fast that the available data become sparse, with a consequent loss of statistical significance [10].

To understand this phenomenon, take, for example, one of the most common techniques applied to encode categories into numerical values: the one-hot encoding technique [11, 12]. Consider a categorical variable with the following values: Yes, No, and Prefer not to say. They can be encoded with the following vectors {[1, 0, 0], [0, 1, 0], [0, 0, 1]}. This produces a new, three-dimensional space, with a total amount of twenty-seven points. However, the only interesting points remain three, and are orthogonal, equidistant, and sparse. Simply said, we have yielded a three-dimensional vector space, with a new dimension for each original value (yes, no, prefer not to say). Unfortunately, things can even get worse. If we had three categories, each with three values, we would get a nine-dimension space, as this would come with the product of the number of categories times the number of possible values [13, 14].

Hence, a general problem can be posed: how to manage categorical variables, while keeping the dimensionality of the resulting space under control. To this aim, many statistical techniques have been proposed in literature that are used to face this problem. Typically, the recurrent idea behind all those methods is as follows:

i.
Consider a high dimensional categorical space.
ii.
Apply a procedure for reducing the number of variables, without loss of information.
iii.
Identify new variables with greater meaning, and finally,
iv.
Keep as the ultimate target that of maintaining visible a lot of points, in this reduced space, to be used as representative examples on which a supervised learning model can be trained.

What is also very common is the fact that the procedure for reducing the dimensionality rests upon the idea of representing the categorical space with a few orthogonal (uncorrelated) variables that capture most of its [15]. In the remainder of this Section, we are going to provide a few details on the principal techniques of this family. Before beginning with this review, we briefly anticipate here that our method will be different. We will avoid to use categorical descriptors as input to the model to be trained. Instead, they will be used as a driver for data selection, thus eliminating, from the start, the need for a dimensionality reduction of the categorical space.

Among the traditional methods mentioned above, probably, the Correspondence Analysis (with all its variants) is the most known one. Akin to the Principal Component Analysis, the Correspondence Analysis (or CA) provides a solution for projecting a set of data onto lower-dimensional plots. Essentially, CA aims at visualizing the rows and the columns of a contingency table as points in a low-dimensional space, so that a global view of the data is made available, yet easily interpretable [16,17,18]. Identically derived from the Principal Component Analysis, we have the CATegorical Principal Components Analysis (or CATPCA). Here, again, the final goal is to reduce the data dimensions by projecting them onto a low-dimensional plane, with the plus that the relationships among observed variables are not assumed to be linear [19].

Of interest in this field, it is also the so-called Multi-Dimensional Scaling (MDS) technique. Technically speaking, MDS is used to translate information about the pairwise distances among a set of n objects into a configuration of n points, mapped into an abstract Cartesian space. In essence, this technique is proven to be useful to display the information contained in a distance matrix, while providing a form of non-linear dimensionality reduction [20, 21]. Sometimes, also some kind of structural equation modeling is employed to individuate groups, or subtypes, in the case of multivariate categorical data. These are called latent classes, as detailed in the following references [22,23,24,25].

Another interesting technique, in the context of multivariate statistics, is that of Binning. Here the target is somewhat different, since, at the basis, we have a form of data quantization. Essentially, all the data values falling into a given interval (the bin, indeed) are all replaced by a single representative value. A typical example, which is provided to explain this technique, is that of representing the ages of a group of people with intervals of consecutive years, rather than with each single age value [26]. Needless to say, going for binning is a delicate choice, since some pieces of information can come sacrificed. Nonetheless, it may result in a valid option when dealing with categorical variables, because a large amount of less frequent values, which could increase the dimensions of the resulting space, can be instead all grouped under a unique generic value (e.g., Other). This way, we yield just one dimension for an entire group of categorical values.

In the following Sections, instead, we will illustrate an alternative approach, where some categorical descriptors were put to good use in this complex context, without any need for dimension reduction. Rather than becoming a portion of the examples on which the learning algorithm was trained, they will be used to select the data to be presented to the learning algorithm.

Methodology

We now present, first, some preliminary information relevant to the present study, second, a description of the methods we used to innovate our approach.

Dataset description: type of variables, deep learning and prediction accuracy

As already mentioned, we were presented with the problem of designing a deep learning model able to predict the imminent failure of a device that measures water consumption in a water distribution network. Initially, we worked on a huge real-world dataset, fed with about one million mechanical water meter devices and with over fifteen million water meter readings of consumed water, supplied by a company that distributes water over a large area in Northern Italy. This large dataset spanned a period in time, from the beginning of 2014 to the end of 2018. To train our deep learning model, at the end of a long validation process which is described at length in [8, 9], we decided to use a smaller dataset, comprised of just those water meter devices, with at least three valid numerical readings. This dataset contained exactly 17,714 devices; where 15.652 were non-defective ones, and the remaining 2.062 were defective.

This dataset can be summarized by means of its eight main attributes, as reported in Table 1. Besides the first attribute, relative to the ID of the water meter, the second and third attributes are of numerical type and are used to report how much water is consumed with the passage of time (Readings and Readings Dates). The final attribute (i.e., 8) does not require any explanation.

Table 1 Dataset: main attributes

An alternative approach to dimension reduction for pareto distributed data: a case study

Abstract

Introduction

Related work

Methodology

Dataset description: type of variables, deep learning and prediction accuracy

Dimension reduction with a Pareto analysis

Results

Discussion

Conclusions

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Authors' information

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publications

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords