1 Introduction

Quantile functions, defined as the generalised inverse of cumulative distribution functions, have nice properties that make them a valuable inferential tool. For instance, sums and convex linear combinations of quantile functions are still quantile functions. As a consequence, it is possible to construct arbitrary new quantile functions that have great flexibility and a small number of parameters (see, for instance, Karvanen (2006)). Thus, we can obtain distributions with a wide range of different shapes and also the exact or approximate form of many common distributions, including the normal, Student’s T and logistic distributions. See Gilchrist (2000) for a clear introduction to the use of quantile functions, their properties, and the main estimation methods.

Various flexible quantile functions have been proposed in the literature. The so-called g-and-k distribution (Haynes et al. 1997; Rayner and MacGillivray 2002) is defined as a generalization of the Gaussian distribution with additional skewness and kurtosis parameters. Freimer et al. (1988) introduced the quantile-based representation of the generalized Lambda distribution. Sankaran et al. (2016) proposed a new quantile function based on the sum of generalized Pareto and Weibull quantile functions.

Quantile functions that are linear in their parameters have desirable inferential properties, as will be shown in the following. Well-known examples are the flattened logistic distribution (Sharma and Chakrabarty 2019) and the generalized flattened logistic distribution (Chakrabarty and Sharma 2021).

Quantile functions can be estimated according to different strategies. Distributions that have analytical L-moments can be estimated by matching sample L-moments with their theoretical counterparts, in the same spirit as the method of moments (see, for instance, Chakrabarty and Sharma (2021)). Maximum likelihood estimation is possible as well; however, if the quantile function is not invertible - as is usually the case - then, for each observation of the data sample, say x, a numerical inversion needs to be carried out to find the corresponding percentile u, thus making the parameter estimation process numerically unstable and computationally expensive (Rayner and MacGillivray 2002). An alternative illustrated in Gilchrist (2000) is based on the minimization of the L1 norm between the order statistics and their theoretical medians, leading to a least absolute deviation method. Without explicit density functions, Bayesian estimation cannot be applied directly; however, Allingham et al. (2009) and Drovandi and Pettitt (2011) developed an Approximate Bayesian Computation (ABC) strategy for the estimation of some classes of quantile functions.

In this work we show that the family of linear quantile functions can be efficiently estimated using least squares by exploiting the properties of the order statistics. We also derive a statistical test, together with its asymptotic null distribution, to check whether two estimated quantile functions have the same parameters. We then show how the procedure can be used for classification, by constructing a simple Naïve Bayes classifier based on quantile distributions, where the proposed testing procedure provides variable selection and variable importance in a two-class problem. Empirical studies indicate that the proposed variable screening can help the classification task and, in this perspective, it is an alternative to variable weighting (see, for instance, Jiang et al. (2018) and Jiang et al. (2019)) or to structure extensions via hidden variables (Jiang et al. 2008). A completely different approach where quantile functions are used for classification is reported in Farcomeni et al. (2022).

The rest of the paper is organised as follows. In the next section we outline linear quantile functions and define our least squares estimator. Asymptotic results are given in Sect. 2.3, where we also derive the null distribution of relevant test statistics. In Sect. 3 we discuss how to use linear quantile functions for supervised classification and variable selection. Simulation studies are reported in Sect. 4 and the proposed strategy is illustrated on real data in Sect. 5. Some concluding remarks are given in Sect. 6.

2 Quantile-based distributions

Denote by \(F(x;\varvec{\theta })\) a right-continuous distribution function depending on a vector of parameters \(\varvec{\theta }\) of length p. The quantile distribution function can be defined as in Parzen (1979):

$$\begin{aligned} F^{-1}(u;\varvec{\theta }) = Q(u;\varvec{\theta })= \inf \{ x: F(x;\varvec{\theta }) \ge u \}, \end{aligned}$$

for \(0<u<1\).

As in Tukey (1965), we call

$$\begin{aligned} q(u;\varvec{\theta }) = Q'(u;\varvec{\theta }), \end{aligned}$$

the quantile density function, which is related to the density function as:

$$\begin{aligned} f(x;\varvec{\theta }) = \frac{1}{q(F(x;\varvec{\theta }))}. \end{aligned}$$
(1)

For certain probability distributions the quantile function can be derived in analytical form through the inversion of the cumulative distribution function. Some examples are reported in Table 1. Most probability distributions, however, do not admit closed-form quantile functions; one notable example is the Gaussian distribution. The converse also holds: a quantile function can be defined without reference to an explicit probability distribution function.

Table 1 Quantile functions of some probability distributions

An interesting family of quantile functions is given by the ones that are linear in their parameters. Starting from the symmetric quantile function of the logistic distribution:

$$\begin{aligned} Q(u;\varvec{\theta }) = \alpha + \beta [\log {u} - \log {(1-u)} ] \end{aligned}$$
(2)

Sharma and Chakrabarty (2019) proposed the flattened version

$$\begin{aligned} Q(u;\varvec{\theta }) = \alpha + \beta \left[ \log {\frac{u}{1-u}} +\kappa u \right] , \end{aligned}$$

where the additional component indexed by the shape parameter \(\kappa \) regulates the flatness of the peak of the distribution. They derived classical and quantile-based properties of the distribution and compared its flexibility with that of the logistic distribution in empirical fitting exercises.

More recently, Chakrabarty and Sharma (2021) proposed a generalization of the flattened logistic distribution (fgld):

$$\begin{aligned} Q(u;\varvec{\theta }) = \alpha + \beta \left[ (1-\delta )\log {u} - \delta \log {(1-u)} + \kappa u\right] \end{aligned}$$
(3)

that proved to be very flexible and outperformed the existing strategies in terms of model fitting. Figures 1 and 2 show the range of shapes this distribution can take.

Fig. 1 fgld with \(\alpha =5,\,\beta =2,\,\kappa =1\) and varying \(\delta \)

Fig. 2 fgld with \(\alpha =5,\,\beta =2,\,\delta =0.5\) and varying \(\kappa \)
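To make the shape comparison concrete, the following minimal sketch (our own illustration, not code from the paper) evaluates the fgld quantile function of Eq. (3) on a grid of percentiles for the parameter values used in Figs. 1 and 2; the function name fgld_quantile is ours.

```python
# Minimal sketch: evaluate the fgld quantile function of Eq. (3) on a grid of
# percentiles, for the parameter values used in Figs. 1 and 2.
import numpy as np

def fgld_quantile(u, alpha, beta, delta, kappa):
    """Q(u) = alpha + beta * [(1 - delta) log(u) - delta log(1 - u) + kappa u]."""
    u = np.asarray(u, dtype=float)
    return alpha + beta * ((1 - delta) * np.log(u) - delta * np.log1p(-u) + kappa * u)

u = np.linspace(0.001, 0.999, 999)
# Fig. 1-style comparison: fixed alpha = 5, beta = 2, kappa = 1, varying delta.
curves_delta = {d: fgld_quantile(u, 5, 2, d, 1) for d in (0.1, 0.3, 0.5, 0.7, 0.9)}
# Fig. 2-style comparison: fixed alpha = 5, beta = 2, delta = 0.5, varying kappa.
curves_kappa = {k: fgld_quantile(u, 5, 2, 0.5, k) for k in (0, 1, 2, 4)}
print({d: round(q[len(u) // 2], 3) for d, q in curves_delta.items()})  # medians
```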

2.1 Least squares estimation

In order to estimate the quantile function \(Q(u,\varvec{\theta })\), different strategies can be applied. L-moments matching (Chakrabarty and Sharma 2021) requires the analytical form of the L-moments of the quantile function, along the same lines as the method of moments. Maximum likelihood is a possible alternative strategy, but it requires the numerical approximation of the percentile corresponding to each observation and the inversion of the derivative of the quantile function, thus resulting in a computationally expensive method (Rayner and MacGillivray 2002). In a Bayesian perspective, an Approximate Bayesian Computation (ABC) method has been developed (Allingham et al. 2009; Drovandi and Pettitt 2011) for specific classes of quantile functions, but again at the price of a considerable computational burden.

In Gilchrist (2000) two estimation methods based on ‘lack of fit criteria’ are introduced, denoted as distributional least absolutes and distributional least squares. The first is based on the minimization of the L1 norm between the order statistics and their theoretical medians. The second consists of minimizing the L2 norm between the expected and the observed order statistics. Gilchrist highlights that, if no analytical form for the expected order statistics is available, they need to be approximated by a Taylor series expansion. For this reason the author champions the L1 norm approach, which does not require such a derivation. Here, instead, we develop a framework under which the least squares approach can be used effectively and efficiently, with a closed-form solution, and we also derive some theoretical results.

In fact, there is a specific link between theoretical order statistics and quantile-based distributions (David and Nagaraja 2004). More specifically, the expected value of an order statistic can be expressed in terms of the quantile distribution as follows:

$$\begin{aligned} E[X_{(i)}] = \frac{1}{B(i,n-i+1)} \int _0^1 Q(u;\varvec{\theta })\, u^{i-1}(1-u)^{n-i}\,du. \end{aligned}$$
(4)

As stated in the following Lemma, if the quantile function is linear in its parameters, the expected value of the theoretical order statistics takes a similar linear form that simplifies the estimation method.

Lemma 1

If a quantile distribution function is linear with respect to its parameters, then the expected order statistics of that distribution will also be linear with respect to those same parameters.

The proof is given in the Appendix. Take for instance the simple quantile function \(Q(u;\varvec{\theta })=\theta _0+\theta _1 u\), with \(\theta _1>0\) and \(\varvec{\theta }=(\theta _0,\theta _1)\). Then by solving the integral in (4) we easily get

$$\begin{aligned} E[X_{(i)}] = \theta _0 + \theta _1\,\frac{i}{n+1} = \begin{bmatrix} 1&\frac{i}{n+1} \end{bmatrix}\, \begin{bmatrix} \theta _0 \\ \theta _1 \end{bmatrix} = \textbf{b}_i^\top \varvec{\theta }. \end{aligned}$$

For a quantile function with a quadratic term in u, \(Q(u;\varvec{\theta })=\theta _0+\theta _1 u + \theta _2 u^2\), similarly we get

$$\begin{aligned} E[X_{(i)}] = \theta _0 + \frac{i}{n+1}\,\theta _1 + \frac{i\,(i+1)}{(n+2)(n+1)}\,\theta _2. \end{aligned}$$

Thus, for any linear quantile function, the expected values of the order statistics can be written as

$$\begin{aligned} E[X_{(i)}] = \textbf{b}_i^\top \varvec{\theta }, \end{aligned}$$

where \(\textbf{b}_i\) are p-dimensional vectors of known coefficients.
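As a quick numerical illustration (a sketch of ours, with arbitrary parameter values), the linearity of the expected order statistics can be checked by Monte Carlo for the quadratic quantile function above, simulating through the inverse-transform method \(X = Q(U)\) with U uniform on (0, 1).

```python
# Monte Carlo check of E[X_(i)] = b_i' theta for Q(u) = t0 + t1 u + t2 u^2,
# with b_i = (1, i/(n+1), i(i+1)/((n+1)(n+2))). Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, 2.0, 0.5])          # (theta0, theta1, theta2), theta1 > 0
n, n_rep = 20, 100_000

u = rng.uniform(size=(n_rep, n))           # inverse-transform sampling: X = Q(U)
x = theta[0] + theta[1] * u + theta[2] * u**2
x.sort(axis=1)                             # each row is an ordered sample
mc_mean = x.mean(axis=0)                   # Monte Carlo estimate of E[X_(i)]

i = np.arange(1, n + 1)
B = np.column_stack([np.ones(n), i / (n + 1), i * (i + 1) / ((n + 1) * (n + 2))])
print(np.max(np.abs(mc_mean - B @ theta)))  # small (Monte Carlo error only)
```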

Now, given a sample of IID observations \((x_1,\ldots ,x_n)\) from \(X \sim F(\varvec{\theta })\), denote by \(x_{(i)}\) the observed i-th order statistic. We can minimize:

$$\begin{aligned} \phi (\varvec{\theta }) = \sum _{i=1}^n \left( x_{(i)} - E[X_{(i)}] \right) ^2 =\sum _{i=1}^n \left( x_{(i)} - \textbf{b}_i^\top \varvec{\theta }\right) ^2 \end{aligned}$$
(5)

with respect to \(\varvec{\theta }\).

The resulting least squares estimation method is very efficient, since it provides a closed-form solution for the parameters.

Defining \({\textbf {B}}\) as the \(n \times p\) matrix having the vectors \({\textbf {b}}_i^\top \) as rows, and denoting by \(\textbf{X}_{(\cdot )}\) the ordered random sample, the estimate of \(\varvec{\theta }\) is

$$\begin{aligned} {\hat{\varvec{\theta }}} = (\textbf{B}^\top \textbf{B})^{-1}\textbf{B}^\top \textbf{X}_{(\cdot )}. \end{aligned}$$
(6)

Furthermore we have:

$$\begin{aligned} E[{\hat{\varvec{\theta }}}] = (\textbf{B}^\top \textbf{B})^{-1}\textbf{B}^\top E[\textbf{X}_{(\cdot )}] = (\textbf{B}^\top \textbf{B})^{-1}\textbf{B}^\top \textbf{B}\,\varvec{\theta }= \varvec{\theta }\end{aligned}$$
(7)

and

$$\begin{aligned} V[{\hat{\varvec{\theta }}}] = (\textbf{B}^\top \textbf{B})^{-1}\textbf{B}^\top \,\varvec{\Sigma }\,\textbf{B}(\textbf{B}^\top \textbf{B})^{-1} \end{aligned}$$

where \(V[\textbf{X}_{(\cdot )}] = \varvec{\Sigma }\) is the covariance matrix of the order statistics. So the estimator \({\hat{\varvec{\theta }}}\) is unbiased but, given the correlation among order statistics, we cannot invoke the BLUE property of the Gauss-Markov theorem.
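A minimal sketch of the closed-form estimator in Eq. (6) is given below; it is written generically, since any linear quantile function is fully specified, for estimation purposes, by its matrix \(\textbf{B}\). The example uses the simple linear quantile function \(Q(u;\varvec{\theta })=\theta _0+\theta _1 u\), i.e. a uniform sample; the function name quantile_ls_fit is ours.

```python
# Sketch of the closed-form least squares estimator of Eq. (6):
# theta_hat = (B'B)^{-1} B' X_(.), with X_(.) the ordered sample.
import numpy as np

def quantile_ls_fit(x, B):
    x_ord = np.sort(np.asarray(x, dtype=float))
    theta_hat, *_ = np.linalg.lstsq(B, x_ord, rcond=None)
    return theta_hat

# Example: Q(u) = theta0 + theta1*u, i.e. a Uniform(theta0, theta0 + theta1) sample.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(2.0, 5.0, size=n)          # true theta = (2, 3)
i = np.arange(1, n + 1)
B = np.column_stack([np.ones(n), i / (n + 1)])
print(quantile_ls_fit(x, B))               # approximately [2, 3]
```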

2.2 An example: the flattened generalised logistic distribution

In this section, we derive the results needed for least squares parameter estimation of the flattened generalized logistic (fgld) quantile function defined in Eq. (3). To this aim it is convenient to re-parameterise the quantile function as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} \alpha = \theta _0 \\ \beta \kappa = \theta _1 \\ \beta (1-\delta ) = \theta _2 \\ \beta \delta = \theta _3 \\ \end{array}\right. } {\left\{ \begin{array}{ll} \alpha = \theta _0 \\ \beta = \theta _2 + \theta _3 \\ \delta = \frac{\theta _3}{\theta _2 + \theta _3} \\ \kappa = \frac{\theta _1}{\theta _2 + \theta _3} \end{array}\right. } \end{aligned}$$

The quantile distribution function of the fgld becomes:

$$\begin{aligned} Q(u) = \theta _0 + \theta _1\,u + \theta _2\,\log {u} - \theta _3\,\log {(1-u)} \end{aligned}$$
(8)

To estimate the parameters via least squares we need to derive the expected value of the order statistics.

Lemma 2

The expected value of the i-th order statistic of the flattened generalised logistic distribution is equal to:

$$\begin{aligned} E[X_{(i)}] = \theta _0 + \theta _1\,\frac{i}{n+1} + \theta _2\,(\psi (i) - \psi (n+1)) + \theta _3\,(\psi (n+1) - \psi (n-i+1)) \end{aligned}$$
(9)

where \(\psi (\cdot )\) indicates the digamma function, which is defined as the derivative of the logarithm of the gamma function.

Therefore, in this case we get \(\textbf{b}_i=\Big (1,\frac{i}{n+1},\psi (i) - \psi (n+1),\psi (n+1) - \psi (n-i+1) \Big )\). For a proof see the Appendix.
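A sketch (ours) of the corresponding design matrix and of the back-transformation to \((\alpha ,\beta ,\delta ,\kappa )\) given in Sect. 2.2 follows; the example fits a logistic sample, for which \(\delta =0.5\) and \(\kappa =0\), so the least squares fit should approximately recover these values.

```python
# Rows b_i of Lemma 2 for the fgld, and the map back to (alpha, beta, delta, kappa).
import numpy as np
from scipy.special import digamma

def fgld_B(n):
    i = np.arange(1, n + 1)
    return np.column_stack([
        np.ones(n),
        i / (n + 1),
        digamma(i) - digamma(n + 1),
        digamma(n + 1) - digamma(n - i + 1),
    ])

def fgld_natural_parameters(theta):
    t0, t1, t2, t3 = theta
    beta = t2 + t3
    return t0, beta, t3 / beta, t1 / beta   # (alpha, beta, delta, kappa)

rng = np.random.default_rng(2)
x = rng.logistic(loc=5, scale=2, size=2000)           # logistic: delta = 0.5, kappa = 0
theta_hat = np.linalg.lstsq(fgld_B(x.size), np.sort(x), rcond=None)[0]
print(fgld_natural_parameters(theta_hat))             # roughly (5, 2, 0.5, 0)
```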

In order to compute the variance of the estimator we also need to derive the covariance matrix for the order statistics of the fgld.

Lemma 3

The n-dimensional covariance matrix of the order statistics, \(\varvec{\Sigma }\), of the flattened generalised logistic distribution has diagonal variances given by

$$\begin{aligned} V[X_{(r)}] =&\ \theta _1^2\,\frac{r (n-r+1)}{(n+1)^2 (n+2)} + \theta _1\theta _2\,\frac{2 (n-r+1)}{(n+1)^2} + \theta _1\theta _3\,\frac{2 r}{(n+1)^2} \\&+ \theta _2^2\,(\psi _1(r)-\psi _1(n+1)) + 2\,\theta _2\theta _3\,\psi _1(n+1) + \theta _3^2\,(\psi _1(n-r+1)-\psi _1(n+1)) \end{aligned}$$

with \(r=1,\ldots ,n\) and where \(\psi _1(\cdot )\) indicates the trigamma function, which is the derivative of digamma function \(\psi (\cdot )\).

The covariance between any two order statistics of the flattened generalised logistic distribution is equal to:

$$\begin{aligned} Cov[X_{(r)},X_{(s)}] =&\ \theta _1^2\left[ \frac{r (n-s+1)}{(n+1)^2 (n+2)} \right] + \theta _1\theta _2\left[ \frac{(n-s+1) (r+s)}{(n+1)^2 s} \right] + \theta _1\theta _3 \left[ \frac{r (2 n-r-s+2)}{(n+1)^2 (n-r+1)}\right] \\&+ \theta _2^2 \left[ \psi _1(s) - \psi _1(n+1) \right] + \theta _2\theta _3\left[ (\psi (n+1)-\psi (n-r+1)) (\psi (n+1)-\psi (s)) + \psi _1(n+1) \right] \\&+ \theta _3^2 \left[ \psi _1(n-r+1)-\psi _1(n+1) \right] - \theta _2\theta _3\,\xi (n,r,s) \end{aligned}$$

where

$$\begin{aligned} \xi (n, r, s) = \Gamma (s-r)\,\Gamma (n-s+1) \sum _{h=1}^{\infty } \frac{1}{h}\, \frac{\Gamma (h+r)}{\Gamma (n+h+1)}\,(\psi (n+h+1) - \psi (h+s)) \end{aligned}$$

for \(r,s=1,\ldots ,n\).

A sketch of the proof is given in the Appendix.
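The following code is our own transcription of Lemma 3; the infinite series \(\xi (n,r,s)\) is truncated at an arbitrary number of terms and the ratios of gamma functions are evaluated on the log scale for numerical stability.

```python
# Covariance matrix of the fgld order statistics (Lemma 3), with xi truncated.
import numpy as np
from scipy.special import digamma, polygamma, gammaln

def fgld_xi(n, r, s, n_terms=2000):
    h = np.arange(1, n_terms + 1)
    log_w = gammaln(s - r) + gammaln(n - s + 1) + gammaln(h + r) - gammaln(n + h + 1)
    return np.sum(np.exp(log_w) / h * (digamma(n + h + 1) - digamma(h + s)))

def fgld_Sigma(n, theta):
    t0, t1, t2, t3 = theta
    psi1 = lambda z: polygamma(1, z)                  # trigamma function
    S = np.empty((n, n))
    for r in range(1, n + 1):
        S[r - 1, r - 1] = (                           # V[X_(r)]
            t1**2 * r * (n - r + 1) / ((n + 1) ** 2 * (n + 2))
            + t1 * t2 * 2 * (n - r + 1) / (n + 1) ** 2
            + t1 * t3 * 2 * r / (n + 1) ** 2
            + t2**2 * (psi1(r) - psi1(n + 1))
            + 2 * t2 * t3 * psi1(n + 1)
            + t3**2 * (psi1(n - r + 1) - psi1(n + 1))
        )
        for s in range(r + 1, n + 1):                 # Cov[X_(r), X_(s)], r < s
            c = (
                t1**2 * r * (n - s + 1) / ((n + 1) ** 2 * (n + 2))
                + t1 * t2 * (n - s + 1) * (r + s) / ((n + 1) ** 2 * s)
                + t1 * t3 * r * (2 * n - r - s + 2) / ((n + 1) ** 2 * (n - r + 1))
                + t2**2 * (psi1(s) - psi1(n + 1))
                + t2 * t3 * ((digamma(n + 1) - digamma(n - r + 1))
                             * (digamma(n + 1) - digamma(s)) + psi1(n + 1))
                + t3**2 * (psi1(n - r + 1) - psi1(n + 1))
                - t2 * t3 * fgld_xi(n, r, s)
            )
            S[r - 1, s - 1] = S[s - 1, r - 1] = c
    return S
```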

2.3 Asymptotic results

In this section we derive the asymptotic distribution of the estimator of the fgld defined in Eq. (6). First notice that this estimator can be expressed as a linear combination of the order statistics:

$$\begin{aligned} {\hat{\varvec{\theta }}} = \sum _{i=1}^n \varvec{c}_{in} X_{(i)}, \end{aligned}$$

where the coefficients \(\varvec{c}_{in}\) are vectors of the same length p as \({\hat{\varvec{\theta }}}\).

Lemma 4

The coefficients \(\varvec{c}_{in}\) for the least squares estimator of the fgld are continuous and bounded.

The proof is given in the Appendix. Given this lemma we can derive the following theorem.

Theorem 1

The least squares estimator for the parameters of the fgld linear quantile function has an asymptotically normal distribution:

$$\begin{aligned} \varvec{{\hat{\theta }}} \xrightarrow []{d} N_p(\varvec{\theta },\varvec{\Gamma }) \end{aligned}$$
(10)

with \(\varvec{\Gamma }=(\textbf{B}^\top \textbf{B})^{-1}\textbf{B}^\top \varvec{\Sigma } \textbf{B}(\textbf{B}^\top \textbf{B})^{-1}\).

The proof of Theorem 1 is shown in the Appendix.

Given the previous result, the null hypothesis that the sample comes from a quantile function with parameters \(\varvec{\theta }_0\) can be tested as stated in the following theorem.

Theorem 2

The null hypothesis \(H_0: \varvec{\theta }=\varvec{\theta }_0\) can be checked through the test statistic

$$\begin{aligned} (\varvec{{\hat{\theta }}} - \varvec{\theta }_0) \xrightarrow []{d} N_p(\varvec{0},\,\varvec{\Gamma }), \end{aligned}$$

where for the fgld quantile function the matrices \(\textbf{B}\) and \(\varvec{\Sigma }\) are known quantities derived in Lemmas 2 and 3.

As a simple consequence we can also test the hypothesis that two observed samples come from the same population \(H_0: \textbf{B}\varvec{\theta }_1=\textbf{B}\varvec{\theta }_2\) which is equivalent to \(H_0: \varvec{\theta }_1= \varvec{\theta }_2\).

Under the previous assumptions we get

$$\begin{aligned} (\varvec{{\hat{\theta }}}_1 - \varvec{{\hat{\theta }}}_2) \xrightarrow []{d} N_p(\varvec{0},\,2\,\varvec{\Gamma }) \end{aligned}$$

or alternatively

$$\begin{aligned} \frac{1}{2}(\varvec{{\hat{\theta }}}_1 - \varvec{{\hat{\theta }}}_2)^\top \varvec{\Gamma }^{-1} (\varvec{{\hat{\theta }}}_1 - \varvec{{\hat{\theta }}}_2) \xrightarrow []{d} \chi ^2_p. \end{aligned}$$
(11)
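A sketch of the resulting two-sample test is given below, reusing the fgld_B and fgld_Sigma helpers sketched in the previous sections; it assumes equal group sizes and, as a pragmatic choice of ours, evaluates \(\varvec{\Gamma }\) at a pooled parameter estimate, since Lemma 3 gives \(\varvec{\Sigma }\) as a function of \(\varvec{\theta }\).

```python
# Two-sample test of Eq. (11) for the fgld; assumes fgld_B and fgld_Sigma above.
import numpy as np
from scipy.stats import chi2

def fgld_two_sample_test(x1, x2):
    n = len(x1)                                   # both groups of size n
    B = fgld_B(n)
    H = np.linalg.solve(B.T @ B, B.T)             # (B'B)^{-1} B'
    th1, th2 = H @ np.sort(x1), H @ np.sort(x2)
    Sigma = fgld_Sigma(n, (th1 + th2) / 2)        # Sigma at a pooled theta estimate
    Gamma = H @ Sigma @ H.T
    diff = th1 - th2
    stat = 0.5 * diff @ np.linalg.solve(Gamma, diff)
    return stat, chi2.sf(stat, df=4)              # chi-square with p = 4 df

rng = np.random.default_rng(3)
print(fgld_two_sample_test(rng.logistic(0.0, 1, 100), rng.logistic(0.3, 1, 100)))
```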

3 Application to supervised classification

Let Y be a categorical random variable taking values \(y=\{1,\ldots ,K\}\), where K denotes the total number of classes, and let \(\textbf{X}= (X_1, \dots , X_p)\) be a set of observed variables. One of the most widely used classification methods in the supervised setting is the so-called naïve Bayes classifier (John and Langley 1995; Hand and Yu 2001). Suppose we have a training data set in which both Y and \(\textbf{X}\) are known. According to Bayes' rule, the posterior probability of belonging to a generic class k (\(k=1,\ldots ,K\)) is

$$\begin{aligned} Pr(Y=k \mid {\textbf {X}}={\textbf {x}}) = \frac{\pi _k f({\textbf {x}}\mid Y=k)}{f({\textbf {x}})} = \frac{\pi _k f({\textbf {x}}\mid Y=k)}{\sum _{k'=1}^K \pi _{k'} f({\textbf {x}}\mid Y=k')}, \end{aligned}$$
(12)

where \(\pi _k\) denotes the proportion of units that belong to class k in the training set.

The naïve Bayes classifier assumes conditional independence of the variables given the categorical response

$$\begin{aligned} f(\textbf{x}\mid Y=k) =\prod _{j=1}^p f_j(x_j \mid Y=k), \end{aligned}$$

thus each variable is treated separately.

The class conditional distributions \(f_j(x_j \mid Y=k)\) are usually assumed to be Gaussian. An alternative has been proposed by John and Langley (2013), who suggested the use of kernel density estimation as a tool to allow for more flexible distributional shapes. A further common method is the discretization of all continuous variables, that is, estimating the density function via a step function. For this method the main issue is choosing the breaks that define the categories; a recent heuristic proposal is that of Yang and Webb (2009), the so-called proportional discretization, which is sketched below. This method (approximately) sets the number of bins and the frequency per bin equal to each other, with the added advantage that the tuning parameter is derived automatically from the sample size (n): \(\texttt {width} = \texttt {frequency} \approx \sqrt{n} \).
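As we read it, proportional discretization can be sketched as follows (a rough implementation of ours, not the reference one): the number of bins and the frequency per bin are both set to roughly \(\sqrt{n}\), with empirical quantiles as break points.

```python
# Rough sketch of proportional discretization: ~sqrt(n) equal-frequency bins.
import numpy as np

def proportional_discretize(x):
    x = np.asarray(x, dtype=float)
    n_bins = max(1, int(round(np.sqrt(x.size))))      # bins ~ frequency ~ sqrt(n)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    return np.digitize(x, edges[1:-1]), edges         # bin index per observation

bins, edges = proportional_discretize(np.random.default_rng(4).normal(size=400))
print(len(edges) - 1, np.bincount(bins))              # ~20 bins of ~20 points each
```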

Quantile-based distributions can be applied in this setting with the goal of taking advantage of their flexible and parsimonious specifications and the fast and reliable estimation given by the least squares method.

The application of quantile-based distributions in the naïve Bayes algorithm involves the estimation of \(K\times p\) univariate distributions, similarly to the other methods. Each of the univariate samples is identified by a variable and a category of the response, and its quantile function can be estimated via least squares, provided we choose a linear quantile function. The output of the estimation phase is just a set of parameters: \(\varvec{\theta }_{jk}\), with \(j=1,\dots ,p\) and \(k = 1,\dots , K\). Given a single observation \(\textbf{x}= (x_1, \dots , x_p)\), the class conditional distribution is evaluated as follows, for each variable j and response category k:

$$\begin{aligned} P(X_j = x_j \mid Y = k ) = f_j(x_j; \varvec{\theta }_{jk}) = \frac{1}{q_j(u_j;\varvec{\theta }_{jk})}, \end{aligned}$$

where the density is evaluated through the relationship shown in Eq. (1), and \(u_j\) solves \(x_j = Q(u_j; \varvec{\theta }_{jk})\); it needs to be computed numerically when the quantile function cannot be inverted analytically, as is the case for the fgld.
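A sketch of this evaluation for the fgld is given below (function names are ours): since Q is strictly increasing, the percentile \(u_j\) can be found by a bracketed root search, and the density then follows from Eq. (1).

```python
# Evaluate the fgld density at x: invert x = Q(u) numerically, then f = 1/q(u).
import numpy as np
from scipy.optimize import brentq

def fgld_Q(u, theta):
    t0, t1, t2, t3 = theta
    return t0 + t1 * u + t2 * np.log(u) - t3 * np.log1p(-u)

def fgld_q(u, theta):
    _, t1, t2, t3 = theta                     # quantile density q(u) = Q'(u)
    return t1 + t2 / u + t3 / (1 - u)

def fgld_density(x, theta, eps=1e-12):
    u = brentq(lambda v: fgld_Q(v, theta) - x, eps, 1 - eps)
    return 1.0 / fgld_q(u, theta)

theta_jk = (0.0, 0.5, 1.0, 1.0)               # hypothetical estimate for (j, k)
print(fgld_density(0.3, theta_jk))
```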

As a by-product of the least squares fit, a simple distance measure between two quantile distributions can be derived. Suppose that \(\hat{\varvec{\theta }}_1\) and \(\hat{\varvec{\theta }}_2\) are the estimated parameters of two quantile functions, for instance the quantile functions of classes 1 and 2 in the training sample. Then for each variable we can measure:

$$\begin{aligned} \Vert \textbf{B}\hat{\varvec{\theta }}_1 - \textbf{B}\hat{\varvec{\theta }}_2 \Vert _2 \end{aligned}$$

where \(\Vert \dots \Vert _2\) denotes the Euclidean norm. The formula can thus be interpreted as the Euclidean distance between the two vectors containing the expected order statistics of the two distributions.

The formula can be applied seamlessly in the case of two response classes with an equal number of observations. When the class sizes differ, n can be chosen, for instance, as the minimum class frequency; when there are more than two classes, the distance can be computed for each pair and the maximum pairwise distance retained, meaning that the variable can discriminate at least between those two classes.

This measure can serve to rank variables in terms of their importance, of course limited to their application in the naïve Bayes algorithm. This can be useful in interpreting and explaining the model, in a similar way to the use of variable importance measures derived from algorithms such as random forests.
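A minimal sketch of this distance, reusing the fgld_B helper sketched in Sect. 2.2 and handling unequal class sizes as described above, could read:

```python
# Variable-importance distance || B theta1_hat - B theta2_hat ||_2 for one variable.
import numpy as np

def fgld_class_distance(x1, x2):
    fit = lambda x: np.linalg.lstsq(fgld_B(len(x)), np.sort(x), rcond=None)[0]
    theta1, theta2 = fit(np.asarray(x1, float)), fit(np.asarray(x2, float))
    B = fgld_B(min(len(x1), len(x2)))          # common n: the minimum class size
    return np.linalg.norm(B @ (theta1 - theta2))
```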

Moreover, it can serve as the basis of a variable selection procedure as explained in Sect. 2.3 (Theorem 2). Imagine we have \(K=2\) classes, then a variable is relevant for classification if the null \(H_0: \varvec{\theta }_1 = \varvec{\theta }_2\) is rejected, where \(\varvec{\theta }_1\) and \(\varvec{\theta }_2\) denote the parameters in the two class-populations.

4 Simulation study

In this section we present some empirical studies to evaluate the goodness-of-fit of the illustrated quantile functions in different scenarios, their classification performance in the naïve Bayes algorithm and the behaviour of the asymptotic test.

4.1 Empirical bias

In this first simulation we investigate the goodness-of-fit of three different quantile-based distributions: the simple quantile function with a linear term in u (linear), the quantile function with a quadratic term in u (quad) and the fgld. In order to measure the empirical bias and the variability of the estimators of \(\varvec{\theta }\) we compare the observed order statistics with their expectation according to the three models, by computing this empirical bias measure:

$$\begin{aligned} \sqrt{\frac{\sum _{i=1}^n (x_{(i)} - {\hat{E}}[X_{(i)}])^2}{n}} = \sqrt{\frac{\sum _{i=1}^n (x_{(i)} - \textbf{b}_i^\top {\hat{\varvec{\theta }}})^2}{n}}. \end{aligned}$$

We simulated \(n=100\) observations from four different distributions: a standard normal, a Student's t distribution with 3 degrees of freedom, an exponential distribution with rate parameter equal to 0.5, and a \(\log (\mid t_{\nu = 3}\mid )\), that is, the logarithm of the absolute value of a t distribution (again with 3 degrees of freedom). For each scenario we generated 100 replicates. Table 2 shows the mean of the empirical bias across the replicates for each scenario and model. The standard deviations in brackets offer an indication of the variability of the estimates.

Table 2 Average empirical bias over 100 replicates for 4 distributional scenarios (rows) and for 3 quantile-based distributions (columns). Standard deviations are reported in brackets

Results show that the fgld is by far the most flexible model, being able to fit well in all scenarios.

4.2 Classification

We evaluated the performance of the quantile-based distributions in the naïve Bayes algorithm via a simulation study. We considered the fgld and the quantile function with a quadratic term in u (quad) described in Sect. 2.1. We generated p variables \(X_j\) (\(j = 1, \dots , p\)) of sample size n, according to the four different distributions described in the previous subsection.

We fixed \(K=2\) classes, of equal size n/2. Denote by \(X_{j0}\) the variable \(X_j\) when \(Y=0\) and by \(X_{j1}\) the same variable when \(Y=1\). In order to separate the classes we shifted each variable according to the rule

$$\begin{aligned} X_{j1} = X_{j0} + 0.3\, (-1)^j \quad j = 1,\dots ,p \end{aligned}$$

Alternatively, we applied a scaling as

$$\begin{aligned} X_{j1} = 0.8\,X_{j0} \quad j = 1,\dots ,p \end{aligned}$$

Shifting has been applied to all distributional settings, while scaling has been applied only to the \(\log (\mid t_{\nu = 3}\mid )\) distribution; thus creating five different scenarios: (1) shifted N(0, 1), (2) shifted \(t_{\nu = 3}\), (3) shifted Exp(\(\lambda = 0.5\)), (4) shifted \(\log (\mid t_{\nu = 3}\mid )\) and (5) scaled \(\log (\mid t_{\nu = 3}\mid )\). For each scenario we let \(p=\{10, 50, 100\}\), \(n = \{100, 500, 1000 \}\), and correlated or independent variables.

The five distributional scenarios, three variable set sizes, three sample sizes and two correlation structures lead to ninety settings. For each setting we repeated data generation and estimation 100 times. Misclassification rates were evaluated on test sets generated in the same way as the training samples, and we report the average over the replicates.

We compared these with other choices for the class-conditional distributions: the normal, the kernel density estimate (kde) with Silverman's default rule for the bandwidth, the discrete method with proportional discretization (Yang and Webb 2009), and the generalized extreme value distribution (gev) estimated via maximum likelihood with the R package evd.

Table 3 contains a summary of the computational times for this simulation. We can note that the time needed for the methods based on the least squares estimation of quantile functions is longer than for simpler methods such as the normal and the discrete, but it is manageable even for the larger data sets. Times are particularly affected by the increase in the number of independent variables (p).

Table 3 Computational average times in seconds for training the naïve Bayes classifier and applying its prediction on a test set over the 100 replications for the 5 distributional scenarios. In brackets standard deviations are reported
Fig. 3 Results from a simulation study comparing different methods for the naïve Bayes classifier. Each panel represents a distributional scenario under which the data were simulated. Results are presented as scaled differences from the fgld, where a value higher than 0 means that for a setting (combination of sample size, number of variables and correlation structure) the method had a larger mean misclassification error than the fgld

Results for the classification are presented graphically in Fig. 3 for each data generating distribution, where we collapse over the 18 settings evaluated for each case. We show scaled differences with respect to a reference method for each setting; we chose the fgld as the reference. The scaled differences are computed as follows:

$$\begin{aligned} d_{jk} = \frac{e_{jk} - e_{j1}}{{\bar{e}}_{j}} \end{aligned}$$

where \(j=1,\dots ,18\) indicates the setting for a fixed data generating distribution, \(k=1,\dots ,5\) represents the method (with 1 being the reference method), and \({\bar{e}}_{j}\) is the average test error for that setting. From Fig. 3 we can see that the fgld is very competitive: as expected, it performs worse than the normal when the data are indeed normal, but the discrepancy is minimal; it is the best method otherwise, with the exception of the exponential data, where only the gev performs better.

4.3 Testing procedure

In order to evaluate the performance of the test we assess the distribution of the test statistic for the fgld and for the quad quantile functions under the null hypothesis \(H_0: \theta _1 = \theta _2\), and the power of the test when the null hypothesis is not true. The variance of the order statistics of the quad quantile function, needed for the variance of the least squares estimator, is reported in the Appendix.

4.3.1 Type I error

Under the null hypothesis the two samples come from the same distribution. In order to evaluate the convergence of the test statistic to its null distribution we compare empirical type I errors with the nominal significance level that has been chosen in advance.

A total of 200 sets of parameters have been randomly generated, and for each of them 1,000 two-group samples have been simulated. From each of these 1,000 data sets the test statistic can be computed and the empirical type I error corresponds to the proportion of test statistics above the critical value (the 95th quantile of the \(\chi ^2_{df = 4}\) distribution for the fgld and the \(\chi ^2_{df = 3}\) for the quad). This procedure has been repeated for different group sample sizes, with the same parameter sets, and the results are shown in Fig. 4. As could be expected, the empirical type I error converges to the nominal one as the sample size increases in both cases.

Fig. 4 Distribution of empirical type I errors across 200 parameter sets for the fgld (left panel) and the quad (right panel) for different group sample sizes. As the sample size increases, empirical type I errors get closer to their nominal 5% value

4.3.2 ROC curves

To evaluate the power of the test we have simulated data sets of 1,000 variables, with half of those variables having a different distribution between the two balanced groups, and half having the same distribution. For each variable the p-value associated with the test statistic is computed.

This problem can be re-framed as a classification problem in which the response is whether or not the variable is useful (having a different or equal distribution across the two groups).

In the simulation we know whether the variable is useful or not, so we can evaluate it with the metrics of a classification model, such as a ROC curve. This is particularly suited to the test because the different thresholds (and subsequent classifications) can be interpreted as significance levels.

In Fig. 5 we report the ROC curves for the fgld and the quad, which evaluate whether the test statistics are able to correctly identify useful and non-useful variables. In both cases we can see that as n increases the curves move closer and closer to the top left corner. Even with low sample sizes there are cutoff points for which the test performs extremely well in terms of both sensitivity and specificity.

Fig. 5 ROC curves based on the identification of whether a variable has the same distribution across two groups. Results are obtained by computing the hypothesis test across 1,000 variables, of which only half have the same distribution across the two groups

5 Real data examples

5.1 Benchmark datasets

We have compared the different methods for the naïve Bayes classifier used in Sect. 4.2 on some real datasets commonly used for benchmarking. The chosen datasets are all publicly available from the UCI machine learning repository (Dua and Graff 2019). When available we used the preprocessed version from the R package mlbench (Leisch and Dimitriadou 2021). Table 4 provides some basic information about the datasets used: we can note the general adaptability of the naïve Bayes classifier, which is able to deal with both numerical and categorical variables at the same time and with multi-class response variables.

Table 4 Datasets from the UCI Machine learning repository used for comparing naïve Bayes methods, with some information regarding data size and type

On these data we fitted the models that performed best in the simulation study (Sect. 4.2), namely the fgld, the normal, the kde and the discrete. Results in terms of accuracy from tenfold cross-validation are presented in Table 5. We can note that no method is uniformly superior to the others. In general, the additional flexibility given by the fgld, the kde and the discrete, with respect to the normal, proves advantageous. The fgld performs comparatively well, and there are multiple datasets where it achieves the maximum accuracy.

Table 5 Accuracy from different naïve Bayes methods (columns) applied on 12 benchmark datasets (rows). The results are obtained from tenfold cross-validation

5.2 Variable selection

In this section we illustrate the proposed strategy for variable selection on a real dataset. We revisit data from Altman (1968), available in the R package MixGHD (Tortora et al. 2021), by adding noise variables. The original dataset contains information about \(n=66\) companies that have filed for bankruptcy. Our task is to predict the status of the firms (0 for ‘bankruptcy’ or 1 for ‘financially sound’). The original predictors are two measurements related to the earnings of the firm. On top of these two relevant variables we added 198 irrelevant variables sampled from a standard normal distribution, for a total of \(p = 200\) variables. The goal is to check whether the variable selection procedure developed in Sect. 3 is able to identify the two real variables, and then to compare the accuracy of various naïve Bayes classification algorithms on the complete dataset and for some other values of p.

To this aim we considered the naïve Bayes classifiers, with the previously used methods for estimating the distribution (normal, kde, discretization and fgld). We also compare these classifiers to other commonly used ones: k-nearest neighbors with k = 3, logistic regression and linear discriminant analysis.

First we computed the p values associated with the test for each variable and, by using a procedure for controlling the false discovery rate (the Benjamini-Hochberg procedure), we correctly rejected the null hypothesis only for the two original variables. Next, we re-ordered the variables in ascending order of the obtained p values and compared the classifiers on datasets with an increasing number of variables, where variables with progressively higher p values are included. Results are shown in Table 6 for values of \(\textit{p}=2,50,100,150,200\). A visual representation of the naïve Bayes with the fgld is shown in Fig. 6, where the first 7 variables in terms of p-value are visualised, separated by class, with a histogram and the density from the estimated fgld. It can be noted how the fgld can capture the skewness present in the first two original variables.

Fig. 6 Naïve Bayes classifier with fgld applied to the bankruptcy dataset with added noise variables. The 7 variables with the lowest p values are shown on the columns, while the rows identify the response class. The visualisation includes a histogram and the density function from the estimated fgld

Table 6 Leave-one-out cross validation accuracy for different classification algorithms applied to the bankruptcy dataset with added noise variables. The columns are for different numbers of variables (p), being the ones with the lowest p values for the fgld test

We can note that the naïve Bayes with the fgld reaches its maximum with \(p = 2\), that is, with the original variables. This is the best accuracy obtained in a leave-one-out cross validation scheme, and the method is the best strategy together with logistic regression. As more and more noise variables are included, the performance of all methods deteriorates, with the naïve Bayes classifiers being fairly robust. This robustness, in particular of the normal and kde naïve Bayes classifiers, is also reflected in the fact that they sometimes retain or even improve their accuracy in the presence of a moderate number of noisy variables, probably due to random variation related to the small number of units n. However, the improvement given by the selection is sizable for all methods, and most of them benefit from the selection given by the fgld test, reaching very high accuracies when only the two original variables are included.

6 Concluding remarks

We focused on the family of linear quantile functions and in particular on the so-called generalized flattened logistic distribution (fgld). We presented a least squares estimation procedure and derived its properties. The resulting estimators are unbiased and asymptotically normal, thus allowing us to derive a testing procedure. In the numerical experiments we investigated the performance of the asymptotic test with different sample sizes. Results show that even with low sample sizes the asymptotic test has acceptable power.

In principle one could consider any other linear quantile function, provided that the first and second moments of the order statistics can be derived, which are necessary for the least squares estimator and for the testing procedure, respectively. We remark, though, that in our simulation and real data experiments the fgld distribution, characterized by four parameters, seems to be flexible enough to capture a wide range of shapes. The theoretical results about the fgld distribution have then been used to propose a novel naïve Bayes classifier, based on the quantile distribution rather than the conventional Gaussian density. The fgld quantile function performed very well in all the empirical studies. As by-products, strategies for variable importance and variable selection have been obtained by the simple application of the testing procedure developed in the first part of the work. Notice that the naïve Bayes classifier, under the assumption of conditional independence, requires univariate densities, and for this reason the fgld quantile distribution represents a useful and flexible tool. A challenging extension for future work is to develop an inferential framework for multivariate quantile functions, in the spirit of Farcomeni et al. (2022), with potentially different applications and statistical purposes. One could also consider an extension to quantile regression, where we speculate that the evaluation of the impact of changes in explanatory variables on marginal distributions of an outcome could be straightforward within the family of linear quantile functions (Firpo et al. 2009).