Large-Scale Mode Identification and Data-Driven Sciences

Bump-hunting or mode identification is a fundamental problem that arises in almost every scientific field of data-driven discovery. Surprisingly, very few data modeling tools are available for automatic (not requiring manual case-by-case investigation), objective (not subjective), and nonparametric (not based on restrictive parametric model assumptions) mode discovery that can scale to large data sets. This article introduces LPMode, an algorithm based on a new theory for detecting multimodality of a probability density. We apply LPMode to answer important research questions arising in fields ranging from environmental science, ecology, econometrics, and analytical chemistry to astronomy and cancer genomics.


Goals
Many scientific problems seek to identify modes in the true unknown probability density function f(x) of a variable X, given i.i.d. observations X_1, ..., X_n. The presence of unexpected "bumps" in a distribution is interpreted as an interesting 'new' phenomenon or discovery whose scientific explanation is a research problem.
However, in the era of big data, we face a more complicated situation: modern data-driven sciences routinely gather measurements on tens of thousands or even millions of variables instead of only one. The goal is to learn and compare the multi-modality shape of each variable. This problem of finding structures in the form of hidden bumps arises in many data-intensive sciences. For example, cancer biologists may wish to identify genes with multi-modal expression from a large-scale microarray database, as they might serve as ideal biomarker candidates for discovering unknown cancer subtypes or designing personalized treatment. Applied economists, on the other hand, may wish to understand how the multi-modality pattern or shape of the income-per-capita distribution evolves over time, which might provide insights into economic polarization (or lack thereof). Due to the massive scale of these problems, it is not practical to investigate modality manually on a case-by-case basis. As a practical requirement, we seek to develop algorithms that are automated and systematic. In this paper, we address the intellectual challenge of developing a novel algorithm for 'large-scale nonparametric mode exploration', a problem of outstanding interest at the present time. To the best of our knowledge, no previous work addresses this important multi-disciplinary applied data problem.
Two different classes of bump-hunting methods currently prevail in the literature, providing insights at different levels of granularity and detail: (i) testing multimodality, or deviation from unimodality; (ii) determining how many modes are present in a probability density function. The purpose of this paper is to present a new genre of nonparametric mode identification technique for (iii) comprehensive mode identification: determining the number of modes (along with their locations), as well as the standard errors or confidence intervals of the associated mode positions to assess significance and uncertainty.

Two modeling cultures
The vast majority of bump-hunting techniques available to date can broadly be divided into two classes based on modeling culture. The first line of work is based on parametric mixture models. The Gaussian mixture model (GMM) is the most heavily studied model in this class. GMM-based parametric bump-hunting methodologies are well studied and have a large literature; for details, see Day (1969), Fraley and Raftery (2002), and references therein. The second and most popular approach extracts modes by estimating the kernel density function, thereby completely removing the parametric restrictions. The idea of using kernel density estimates for nonparametric mode identification goes back to the seminal work of Parzen (1962). This was further studied by Silverman (1981) based on the concept of "critical bandwidths" and bootstrapping, which is known to be highly conservative, nonrobust (sensitive to outliers), and to generate different answers under different calibration choices (e.g., bandwidth). For recent applications of the kernel density based approach to mode clustering, see Chacón et al. (2013, 2015) and Chen et al. (2016).
In the present paper, we shall instead focus on developing a comprehensive solution to problem (iii) by blending the traditional parametric and nonparametric statistical modeling cultures. The modeling attitude taken here can be classified as a nonparametrically designed parametric statistical modeling culture. It provides a nonparametric generalization of the (parametric) Edgeworth-like expansion for density approximation in a way that is especially useful for bump identification. Our specialized technique provides a fundamentally new point of view on the problem of mode discovery, borrowing modern nonparametric machinery from Mukhopadhyay and Parzen (2014).

Skew-G density representation
We introduce a new class of density representation schemes, called Skew-G models, which lies at the heart of our analysis. As a prototypical example we consider the acidity index (more specifically acid-neutralizing capacity (ANC), a fundamental index of natural water acid-base status) measured in a sample of n = 155 lakes in North-Central Wisconsin * . Assessing water quality standards by monitoring the acidity level of lakes (a higher level of acidity is a threat to biodiversity) is a matter of paramount importance, as it has a direct impact on our environment. The study has two interrelated goals: to check whether the data support a normal ANC distribution and, if not, to investigate the presence of multiple distinct populations of lakes with different ANC distributions. The smooth bimodal connector density estimate is shown in Fig 1. The green curve shows the LP orthogonal series estimator and the orange curve the LP maximum entropy exponential (MaxEnt) estimate; the corresponding modes are shown in the same colors.
To address this question, our first modeling goal is to understand how the true unknown density differs from the unimodal G, which is the "normal" distribution for the acidity index data. Depending on the problem, data scientists are free to choose any suitable parametric null model as a rough starting point to query the data, thereby making data analysis a more interactive and automatic endeavor; see the examples in Section 4. The inadequacy of the normal density model, clearly visible from Fig 1(a), begs the following question, which is at the heart of our approach: what is the least or minimal nonparametric perturbation of the null parametric model G required to produce a density that best fits the data?
Definition 1. Define the skew-G density model, a universal representation scheme, by

f(x) = g(x) d(G(x); G, F).    (2.1)

Here d(u; G, F) is the goodness-of-fit "connector" or "comparison" density defined as

d(u; G, F) = f(Q(u; G)) / g(Q(u; G)),  0 < u < 1,    (2.2)

where Q(u; G) = inf{x : G(x) >= u} denotes the quantile function. The corresponding comparison distribution function is given by D(u; G, F) = F(Q(u; G)).

Remark 1. In algorithmic terms, skew-G density modeling involves three steps: (1) start with a unimodal parametric candidate model G; (2) manufacture an intermediate density d(u; G, F); and (3) combine them using (2.1) to construct f(x; X). The comparison density acts as a glue to "connect" the parametric unimodal null density G with the true unknown distribution F to provide the best fit in a way that reveals the hidden multimodality. For that reason, we alternatively call d the connector density.
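To make the representation concrete, the following minimal Python sketch (all choices hypothetical: the true f is taken to be a two-component normal mixture and the null model G = N(0,1)) computes the comparison density d(u; G, F) = f(Q(u; G))/g(Q(u; G)) numerically and verifies the skew-G identity f(x) = g(x) d(G(x); G, F):

```python
import math

def phi(x, mu=0.0, sd=1.0):
    # normal density g
    return math.exp(-0.5*((x-mu)/sd)**2) / (sd*math.sqrt(2*math.pi))

def Phi(x, mu=0.0, sd=1.0):
    # normal distribution function G
    return 0.5*(1 + math.erf((x-mu)/(sd*math.sqrt(2))))

def Q(u, lo=-10.0, hi=10.0):
    # quantile function Q(u; G) = inf{x : G(x) >= u}, by bisection
    for _ in range(80):
        mid = 0.5*(lo+hi)
        if Phi(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5*(lo+hi)

# hypothetical "true" density f: bimodal two-component normal mixture
def f(x):
    return 0.5*phi(x, -1.5, 0.6) + 0.5*phi(x, 1.5, 0.6)

# comparison density d(u; G, F) = f(Q(u; G)) / g(Q(u; G)), with G = N(0,1)
def d(u):
    x = Q(u)
    return f(x) / phi(x)

# skew-G identity (2.1): f(x) = g(x) * d(G(x); G, F)
for x in [-2.0, -0.5, 0.0, 1.0, 2.5]:
    assert abs(f(x) - phi(x)*d(Phi(x))) < 1e-8
```

Evaluating d on a grid shows the two peaks of f transferred to u-space, which is exactly the structure the skew-G decomposition is designed to expose.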
The skew-G density formulation simultaneously serves two purposes: (1) Exploratory goodness-of-fit assessment: The hypothesis H_0 : F = G can equivalently be expressed as testing d = 1. Thus the flat uniform shape of the estimated comparison density provides a quick graphical diagnostic for the fit of the unimodal parametric model G to the true distribution F. Fig 1(b) shows the nonparametric comparison density estimate (a specially designed estimation method is discussed in the next section) for the acidity-index data.
(2) Mode identification: The shape of d(u; G, F) not only provides a tool to check the null hypothesis, but also indicates how to repair G so that it adequately fits the data. Data scientists are often interested in understanding whether the assumed hypothesized model G differs from the actual distribution F by having extra mode(s). Looking at Fig 1(b), one may anticipate that the initial clue to the modal structure of the true f might be hidden in the shape of the pre-density d. This is indeed the case, as we demonstrate in the next section.
Remark 2. The comparison density function captures and exposes the modality structure of the data in a more transparent and unambiguous way than the original density f, which is the key observation behind our LPMode algorithm (see Section 3). One reason is the added smoothness that d(G(x)) enjoys compared to the original density f(x), which not only tackles the problem of spurious bumps but also allows efficient estimation via a sparse expansion in a specialized orthogonal basis.
Remark 3. An anonymous reviewer correctly pointed out to us that the density representation formula (Eq. 2.1 and 2.2) 'is a good formulation for scientists because the term d(G(x); G; F) contains the information about how the actual distribution is deviated from the approximated distribution g.' Indeed, in our formulation scientists can choose G based on domain knowledge (such as Novikov et al. (2006)), which we incorporate into our density modeling procedure by viewing it as a 'background distribution.' As a result, it provides the capability to better understand (i) how compatible the data are with the theoretically expected model, and (ii) whether the unexplained part manifests itself as a 'peak'. Note that the modality of d(u; G, F) (not of the original density f(x)) answers the latter question of searching for a bump above background, which we obtain as a byproduct of the proposed LPMode algorithm (see Section 3).

Constructing empirical orthogonal rank polynomials
We introduce a new class of orthogonal polynomials, called the LP family of rank polynomials, and discuss the construction procedure. The word "empirical" in the title conveys the fact that the data determine the shape of the orthogonal basis functions, which act as the fundamental ingredient in the nonparametric approximation of d[G(x)] in L2(R).
Definition 2. The probability integral transformation with respect to the measure G (or rank-G transform) is defined as G(X_F), where X_F indicates X ~ F; distinguish this from the uniformly distributed rank transform F(X_F). For notational simplicity, we suppress the subscript F in X from now on.
Remark 4. Asymptotically, as n tends to infinity, the "empirical" (piecewise-constant) orthonormal LP score functions converge to the "smooth" Legendre polynomials under H_0 : F = G. To emphasize this universal limiting shape of our empirically constructed score functions, we call it the LP basis: Legendre Polynomials of ranks. A substantial departure of the empirical LP basis functions from the Legendre polynomials is a consequence of the fact that, unlike the rank transform F(X), the rank-G transform G(X) is not uniformly distributed when F differs from G, as in the acid data.
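The limiting behaviour described above can be checked with a small simulation. The sketch below assumes, purely for illustration, G = F = N(0,1) (so the rank-G transform is uniform) and uses the first three orthonormal shifted Legendre polynomials on [0,1]:

```python
import math, random

def leg(j, u):
    # orthonormal shifted Legendre polynomials on [0,1], j = 1, 2, 3
    if j == 1: return math.sqrt(3.0)*(2*u - 1)
    if j == 2: return math.sqrt(5.0)*(6*u*u - 6*u + 1)
    if j == 3: return math.sqrt(7.0)*(20*u**3 - 30*u*u + 12*u - 1)
    raise ValueError(j)

def Phi(x):
    return 0.5*(1 + math.erf(x/math.sqrt(2)))

random.seed(7)
n = 20000
xs = [random.gauss(0, 1) for _ in range(n)]   # H0: F = G = N(0,1)
us = [Phi(x) for x in xs]                     # rank-G transform G(X)

# under H0 the score functions Leg_j(G(X)) are approximately orthonormal:
# zero mean and unit second moment in L2(G)
for j in (1, 2, 3):
    m = sum(leg(j, u) for u in us)/n
    v = sum(leg(j, u)**2 for u in us)/n
    assert abs(m) < 0.05 and abs(v - 1) < 0.05
```

When F differs from G, the same construction applied to G(X) no longer yields uniform ranks, which is what drives the empirical LP basis away from the Legendre limit.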
Remark 5. We perform density function approximation of d(G(x)) in the LP Hilbert space. In particular, we design L2 and exponential connector density models whose sufficient statistics are LP special functions. Furthermore, we show that the orthogonal Fourier expansion coefficients can be expressed as functionals of these LP bases.

Estimation and properties
Two methods for probability density expansion are discussed, choosing the sufficient statistics to be LP basis elements, which distinguishes our approach from the classical orthogonal series based methods (Wasserman, 2006; Tsybakov, 2009).
Definition 4. For a random sample X_1, ..., X_n with sample distribution F(x; X), define the sample LP means LP[j; G; F] = (1/n) Sum_{i=1}^{n} T_j(X_i; G), the empirical means of the score functions.

Theorem 1. The square-integrable comparison density function can be represented by a convergent orthogonal series expansion in the LP Hilbert space with the associated Fourier coefficients LP(j; G; F).
To establish the theorem it is enough to show that LP(j; G; F) equals the inner product of Leg_j and d in L2[0,1], which follows by a change of variable u = G(x). To derive the asymptotic null distribution of the LP orthogonal coefficients, first note that under H_0 : F = G they can be expressed as functionals of the uniform (comparison distribution) empirical process U_n(u) = sqrt(n) [D(u; G, F) - u], 0 <= u <= 1, which is known (Shorack and Wellner, 2009) to converge weakly to a limiting Brownian bridge process U. As a consequence, under H_0 the continuous mapping theorem yields the limiting null distribution. We get the following important theorem by a straightforward application of integration by parts followed by Fubini's theorem on the last expression (2.5).
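A toy illustration of these coefficients as detectors of departure from H_0 (all simulation settings here are illustrative): under F = G the sample LP mean of the linear score stays near zero at the sqrt(n) scale, while a location-shifted F pushes it far from zero:

```python
import math, random

def leg1(u):
    # first orthonormal shifted Legendre polynomial on [0,1]
    return math.sqrt(3.0)*(2*u - 1)

def Phi(x):
    return 0.5*(1 + math.erf(x/math.sqrt(2)))

random.seed(1)
n = 5000
# H0 sample: F = G = N(0,1)  ->  LP(1) is approximately N(0, 1/n)
lp_null = sum(leg1(Phi(random.gauss(0, 1))) for _ in range(n))/n
# alternative: F = N(0.5, 1) while G = N(0,1)  ->  LP(1) clearly non-zero
lp_alt = sum(leg1(Phi(random.gauss(0.5, 1))) for _ in range(n))/n

assert abs(lp_null) < 0.06   # within a few null standard deviations
assert lp_alt > 0.2          # far outside the null scale
```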

Model denoising
Akaike's information criterion (AIC) selects the significantly non-zero LP means after arranging them in decreasing order of magnitude; details in Section 3. The LP series estimator is then computed as d(u; G, F) - 1 = Sum_j Leg_j(u) LP(j; G, F), summing over the AIC-selected indices. Interestingly, the proposed AIC-based LP-Fourier coefficient selection criterion can be shown to minimize the MISE estimation error (2.6) (Hart, 1985), where the first term denotes the bias component and the second term the variance.
Remark 6. Instead of Akaike's rule, one can alternatively use Schwarz's BIC rule (Ledwina, 1994) as the selection criterion. In an interesting study, Kallenberg (2000) showed that the asymptotic optimality properties continue to hold for a much more general class of penalties, from the AIC up to penalties even larger than the one in Schwarz's criterion. Thus, from a theoretical perspective, practitioners can use either AIC or BIC to select the 'significant' LP coefficients without much harm.
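The ordering-and-penalization rule can be sketched in a few lines. The coefficient values below are hypothetical; the criterion AIC(k) = (sum of squares of the first k ordered LP means) - 2k/n follows the description in Section 3:

```python
def aic_select(lp_means, n):
    """Order LP means by magnitude; keep the first k maximizing
       AIC(k) = sum of squares of first k LP means - 2k/n."""
    order = sorted(range(len(lp_means)), key=lambda j: -abs(lp_means[j]))
    best_k, best_aic, csum = 0, 0.0, 0.0
    for k, j in enumerate(order, start=1):
        csum += lp_means[j]**2
        aic = csum - 2.0*k/n
        if aic > best_aic:
            best_aic, best_k = aic, k
    return sorted(order[:best_k])

# two strong coefficients, two near the noise level (n = 200 => 0.01 per term)
sel = aic_select([0.30, 0.02, -0.25, 0.01], n=200)
assert sel == [0, 2]   # only the two large coefficients survive
```

The same routine works verbatim with the BIC penalty of Remark 6 by replacing 2k/n with k*log(n)/n.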
For the acid data the resulting skew-G orthogonal series density estimator is given by f(x) = g(x; μ, σ) d(G(x); G, F), where (μ, σ) are simply the MLE estimates. Practitioners are free to use any other plug-in estimators for the parameters of G. To ensure non-negativity of the density estimate we further enhance the L2 approach by computing the maximum entropy (MaxEnt) exponential estimator log d(u; G, F) = θ_0 + Sum_j θ_j Leg_j(u), subject to moment equality constraints on the indices with significantly non-zero LP means. The resulting maximum entropy density estimate provides the best approximation to the comparison density in the sense of information divergence. The estimated specially designed (LP) nonparametric exponential density for the acid index data is given below. It can be considered a nonparametric generalization of the Edgeworth-like expansion (Cox and Barndorff-Nielsen, 1994) for density approximation, which modifies the normal density by multiplying it with Hermite polynomials.
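One way to read the MaxEnt estimator: once the tilt coefficients θ_j are fixed by the moment constraints, θ_0 is simply the log-normalizer that makes the exponential model a proper density on [0,1]. A sketch with hypothetical θ values (trapezoid-rule quadrature):

```python
import math

def leg(j, u):
    # orthonormal shifted Legendre polynomials on [0,1], j = 0, 1, 2
    return [1.0,
            math.sqrt(3)*(2*u - 1),
            math.sqrt(5)*(6*u*u - 6*u + 1)][j]

theta = {1: 0.4, 2: -0.6}   # hypothetical fitted exponential tilts

def unnorm(u):
    return math.exp(sum(t*leg(j, u) for j, t in theta.items()))

# theta_0 is the log-normalizer, computed here by the trapezoid rule
m = 20000
grid = [i/m for i in range(m + 1)]
Z = sum(0.5*(unnorm(grid[i]) + unnorm(grid[i+1]))/m for i in range(m))
theta0 = -math.log(Z)

def d_maxent(u):
    # exp(theta0 + sum_j theta_j Leg_j(u)): non-negative by construction
    return math.exp(theta0)*unnorm(u)

total = sum(0.5*(d_maxent(grid[i]) + d_maxent(grid[i+1]))/m for i in range(m))
assert abs(total - 1.0) < 1e-6   # integrates to one
```

Unlike the truncated L2 series, this exponential form can never go negative, which is exactly why the text introduces it.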
Remark 8. Unlike traditional approaches, where parametric models are constructed before the sufficient statistics, our modeling philosophy starts by constructing a novel data representation via the LP transform; a schematic description of our nonparametrically designed parametric density estimation strategy is: Data → LP nonparametric transform → sufficient statistics selection → parsimonious parametric density models.

Consistency of local mode estimates
Theorem 3. The LP-series approximated comparison density admits a specialized kernel representation, where the kernel K_m is the Christoffel-Darboux kernel of the Legendre system, K_m(u, v) = Sum_{j=0}^{m} Leg_j(u) Leg_j(v), and L_j(u) = (2j + 1)^{-1/2} Leg_j(u).
To prove (2.7), first apply Theorem 1 to verify that the truncated LP series estimator can be written as the integral of K_m against the empirical comparison distribution. Finish the proof by using the Christoffel-Darboux orthogonal polynomial identity to put the kernel in closed form.

Remark 9. (i) By virtue of this kernel representation, one can interpret m^{-1} as the bandwidth parameter. (ii) Due to its simple polynomial structure, the LP kernel satisfies the standard regularity conditions (C.1)-(C.3) with r = 0, 1, 2, 3. In view of Theorem 3 along with (C.1)-(C.3), we can now utilize well-known results on kernel density estimates to prove the consistency of the local modes without reinventing the wheel. The proof essentially follows by arguments similar to those presented in Chen et al. (2016), in conjunction with our Theorem 3 † .
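The kernel representation can be checked numerically. The sketch below uses the sum form K_m(u, v) = Sum_{j<=m} Leg_j(u) Leg_j(v) and verifies its reproducing property on the degree-m polynomial space (simple midpoint quadrature; the grid size and evaluation point are illustrative):

```python
import math

def leg(j, u):
    # orthonormal shifted Legendre polynomials on [0,1], j = 0..3
    polys = [lambda u: 1.0,
             lambda u: math.sqrt(3)*(2*u - 1),
             lambda u: math.sqrt(5)*(6*u*u - 6*u + 1),
             lambda u: math.sqrt(7)*(20*u**3 - 30*u*u + 12*u - 1)]
    return polys[j](u)

def K(m, u, v):
    # LP kernel as an (m+1)-term sum of orthonormal Legendre products
    return sum(leg(j, u)*leg(j, v) for j in range(m + 1))

# reproducing property: integrating K_m(u0, .) against Leg_j returns Leg_j(u0)
N, u0, m = 4000, 0.3, 3
for j in range(m + 1):
    val = sum(K(m, u0, (i + 0.5)/N)*leg(j, (i + 0.5)/N) for i in range(N))/N
    assert abs(val - leg(j, u0)) < 1e-3
```

Since K_m averages exactly m + 1 orthonormal terms, truncating at degree m plays the role of choosing a bandwidth of order 1/m, matching Remark 9(i).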

LPMode algorithm and inference
We now present the computational steps of the LPMode algorithm for nonparametric mode identification (determining locations as well as confidence intervals of the modal positions) by exploiting the duality relationship between the comparison density d and the true unknown density f. Our proposed algorithm provides a computationally efficient and scalable solution to large-scale mode-exploration problems, as demonstrated in Section 4. LPMode starts with a unimodal parametric reference or null distribution G (a plausible candidate for the true unknown F) whose parameters are estimated by maximum likelihood (or any other method: method of moments, etc.).

The LPMode Algorithm

1. LP basis construction. Construct the LP system of orthonormal rank polynomials T_j(X; G) in L2(G) associated with the distribution G (following the recipe given in Section 2.3), satisfying the orthonormality conditions E_G[T_j(X; G)] = 0 and E_G[T_j(X; G) T_k(X; G)] = δ_jk. Our data-driven non-linear rank polynomials T_j(X; G) act as sufficient statistics in our modeling algorithm; they are determined by the observed data X_1, ..., X_n and the reference density G.

2. Computation of LP means. Compute the sample LP means LP[j; G; F] = (1/n) Sum_{i=1}^{n} T_j(X_i; G).
3. Adaptive filtering. Identify the indices j for which LP(j; G; F) is significantly non-zero by applying the AIC model selection criterion to the LP means arranged in decreasing order of magnitude: choose k to maximize AIC(k) = (sum of squares of the first k LP means) - 2k/n.

4. Estimate L2 and maximum entropy comparison densities. Compute the L2 estimator d(u; G, F) - 1 = Sum_j Leg_j(u) LP(j; G, F) and the maximum entropy exponential estimator log d(u; G, F) = θ_0 + Sum_j θ_j Leg_j(u), summing over the significantly non-zero LP means selected in Step (3). The intermediate goodness-of-fit density d(u; G, F) helps uncover the modal shape characteristics of the true density f(x) by capturing the shape features missing from the unimodal G.

† However, to the best of our knowledge, the explicit theoretical arguments for Hausdorff-consistency of an ensemble of 'local modes' first appeared in the 1983 Ph.D. dissertation of Steven Boswell, which has not received its due credit.
5. Skew-G density estimate. Construct the estimator of f(x) using the LP skew-G density model (2.1) as a nonparametric refinement of the parametric null model.

6. Identify local maxima or modes. Identify the modes of f(x) by counting the modes of d(u; G, F), exploiting the duality relationship between the comparison density d and the true unknown density f. Define the sets of local maxima of d and of f; their cardinalities can disagree only in boundary cases, often when M(d) contains the boundary points 0 or 1.
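The six steps can be stitched together in a short end-to-end sketch. Everything below is an illustrative reading of the algorithm, not the authors' implementation: G is Gaussian with MLE parameters, the score functions are taken as shifted Legendre polynomials of G(x) (their H_0 limiting form), the degree cap jmax = 6 and the grid size are arbitrary, and mode finding is a plain grid search on the L2 estimate of d:

```python
import math, random

def Phi(z):
    return 0.5*(1 + math.erf(z/math.sqrt(2)))

def leg(j, u):
    # orthonormal shifted Legendre polynomials on [0,1] via recurrence
    if j == 0:
        return 1.0
    x = 2.0*u - 1.0
    p0, p1 = 1.0, x
    for k in range(2, j + 1):
        p0, p1 = p1, ((2*k - 1)*x*p1 - (k - 1)*p0)/k
    return math.sqrt(2*j + 1)*p1

def lpmode(xs, jmax=6):
    n = len(xs)
    mu = sum(xs)/n
    sd = math.sqrt(sum((x - mu)**2 for x in xs)/n)        # Step 0: G = N(mu, sd), MLE
    us = [Phi((x - mu)/sd) for x in xs]                    # Step 1: rank-G transform
    lp = [sum(leg(j, u) for u in us)/n for j in range(1, jmax + 1)]  # Step 2: LP means
    # Step 3: AIC filtering on magnitude-ordered coefficients
    order = sorted(range(jmax), key=lambda j: -abs(lp[j]))
    best_k, best, csum = 0, 0.0, 0.0
    for k, j in enumerate(order, 1):
        csum += lp[j]**2
        if csum - 2.0*k/n > best:
            best, best_k = csum - 2.0*k/n, k
    keep = set(order[:best_k])
    d = lambda u: 1.0 + sum(lp[j]*leg(j + 1, u) for j in keep)  # Step 4: L2 estimate
    # Steps 5-6: modes of d on a grid, mapped back through the quantile of G
    grid = [i/400 for i in range(1, 400)]
    dv = [d(u) for u in grid]
    modes_u = [grid[i] for i in range(1, len(grid) - 1)
               if dv[i] > dv[i-1] and dv[i] > dv[i+1]]
    def Q(u, lo=-10.0, hi=10.0):
        for _ in range(60):
            mid = 0.5*(lo + hi)
            if Phi(mid) < u:
                lo = mid
            else:
                hi = mid
        return mu + sd*0.5*(lo + hi)
    return [Q(u) for u in modes_u]

random.seed(11)
xs = [random.gauss(-2, 0.7) if random.random() < 0.5 else random.gauss(2, 0.7)
      for _ in range(4000)]
modes = lpmode(xs)
assert any(abs(m + 2) < 0.8 for m in modes)   # a mode near -2 is recovered
assert any(abs(m - 2) < 0.8 for m in modes)   # a mode near +2 is recovered
```

On this synthetic bimodal sample, the unimodal Gaussian null is corrected by the comparison density, and both mixture modes are recovered from the modes of d alone.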
To facilitate nonparametric statistical inference, we seek an approximate sampling distribution of the k modal positions. The comparison density offers a neat way to simulate from the "smooth" nonparametric density estimate f. We compute the standard errors (to assess significance and uncertainty) associated with each putative mode location by the following comparison density based accept-reject inference algorithm.
Comparison Density Based Inference Algorithm

1. Apply the LPMode algorithm on the original i.i.d sample X_1, ..., X_n to get the modal positions and the parameters of the estimated comparison density.
2. Generate a random variable Y from the parametric distribution G; generate U distributed as Uniform[0, 1] (independent of Y).

3. Accept and set x* = Y if U <= d(G(Y); G, F)/c, where c is the supremum of d(u; G, F) over 0 <= u <= 1; otherwise discard Y and go back to the sampling step (2).

4. Repeat the process until we have a simulated sample of size n, producing an i.i.d sample {x*_1, ..., x*_n}.
5. Draw B independent sets of samples, each of size n; repeat steps 1-4 for each dataset and compute the k modes. Return the standard errors and confidence intervals of the corresponding mode locations, which capture the sampling variability.
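Steps 2-4 amount to a standard accept-reject sampler driven by the estimated comparison density, since f(x) = g(x) d(G(x); G, F). A sketch with a hypothetical quadratic d_hat and G = N(0,1); the envelope constant c = sup d_hat is part of this illustration, not of the original text:

```python
import math, random

def Phi(z):
    return 0.5*(1 + math.erf(z/math.sqrt(2)))

# hypothetical estimated comparison density: a quadratic (Leg2) tilt of G;
# it integrates to 1 and stays non-negative on [0,1]
def d_hat(u):
    return 1.0 + 0.894*(6*u*u - 6*u + 1)

c = max(d_hat(i/1000) for i in range(1001))   # envelope constant sup d_hat

def draw_from_skewG(n, seed=0):
    # target f(x) = g(x)*d_hat(G(x)): sample Y ~ G, accept w.p. d_hat(G(Y))/c
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        y = rng.gauss(0, 1)             # step 2: Y ~ G
        u = rng.random()                # step 2: U ~ Uniform[0,1]
        if u <= d_hat(Phi(y))/c:        # step 3: accept-reject condition
            out.append(y)               # keep x* = Y
    return out

xs = draw_from_skewG(2000, seed=3)
assert len(xs) == 2000
assert abs(sum(xs)/len(xs)) < 0.15      # target is symmetric about 0
```

Wrapping this sampler in an outer loop over B replicates, and re-running LPMode on each replicate, yields the bootstrap-style standard errors of Step 5.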
As one of the referees pointed out, our comparison density based inference algorithm is reminiscent of the smoothed nonparametric bootstrap. The only difference is that instead of classical kernel density based bootstrap smoothing (Silverman and Young, 1987), we advocate an LP skew-G density based approach. Nonetheless, it is an interesting connection for further research.

Applications
Scientists across many disciplines are interested in quantifying modality in distributions.

Econometrics
We consider the dynamics of the cross-country distribution of GDP per worker over a span of 50 years (from 1959 to 2008). The data are taken from the Penn World Table (PWT), version 7.0, and output is measured in 2005 International dollars. We seek to investigate the evolving multi-modal structure of the GDP per worker distribution and the associated questions (e.g., when the multi-modality first emerges), which have significant economic importance.
To achieve this we need an automatic bump-hunting algorithm that can learn and compare the multi-modality shapes of the GDP per worker distributions at 50 different points in time without manual case-by-case inspection. Our LPMode is specially designed to tackle this kind of automated large-scale study. The cross-country GDP per worker distributions have a common shape: a peak around zero and a long exponential-like tail (see Fig 6). Thus we select G to be exponential, a common choice for all the time points. Our LPMode algorithm scans through all 50 distributions and provides (a) the number and locations of modes over time; and (b) a nonparametrically corrected parametric specification of the cross-country GDP per worker distribution for every year. Fig 2 shows the dynamics of the modes, which reveals that the sudden, prominent transition from a unimodal to a trimodal distribution happens in the year 1970. This aspect of mode phase-transition in the cross-country GDP per worker distributions was first empirically discovered by Quah (1993), which led to a host of articles on this topic, including Henderson et al. (2008). Economists traditionally use Silverman's kernel density based test for multimodality and have noted three well-known difficulties: (a) it is highly conservative, (b) it suffers from spurious modes in the tails of nonparametric distributions, and (c) it can generate different answers based on various calibration choices (bandwidth, as well as the calibration of Silverman's test itself). Thus, those authors were forced to manually investigate the modal shapes in certain selected years, and found that between 1975-80 the unimodal distribution switches to a bimodal shape.
This is in contrast to our finding, which indicates the distribution shifted to a three-peaked shape (instead of bimodal) in the year 1970 (which marks the emergence of middle-income countries). ‡ The three modes respectively denote the low-, middle-, and high-income countries. Fig 3 plots the comparison density estimates for the years 1960, 1970, 1990 and 2005. The presence of three modes is most clearly visible from the plot of the comparison densities. The evidence of three modes has recently been noted by Pittau et al. (2010), although their year of inception of tri-modality differs from ours. They used a parametric Gaussian mixture model for nine hand-picked years between 1960 and 2000 and noted that, in contrast to the commonly held bimodality view, "the statistical tests that we use indicate the presence of three component densities in each of the nine years that we examine over this period." The gaps between (a) developing and less-developed countries, and (b) developing and most-developed nations are shown in Fig 7. The gap between low- and middle-income countries narrowed substantially from the early 1980s until the late 1990s; thereafter the gap widens, hinting at polarization between these two classes. On the other hand, recent trends suggest that developing economies have increased their rate of convergence in GDP per worker with developed economies (see Fig 8), which is manifested in the tri-modal density shape. The bottom panel of Fig 7 shows the dynamics of the Gini measure, computed by taking the correlation of X and F(X; X) at each time point, which measures the degree of inequality and polarization.

‡ It is interesting to note that the tri-modality is not clear if we plot the histogram, the reason being that extreme observations 'mask the extra middle bump'; we have to be cautious before inferring the modality structure of a distribution simply by looking at a histogram, as this could lead to misleading results.
The Gini measure dropped from 0.94 in 1959 to 0.65 in 1980, increased almost monotonically between 1980 and 1990, and then stagnated for the next decade at the value 0.86, which is practically the same as that of 1960.
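The Gini computation mentioned above can be sketched via the familiar rank-covariance identity Gini = 2 cov(X, F(X))/mean(X); this is one standard reading of "correlation of X and F(X; X)", not necessarily the authors' exact normalization. The sketch cross-checks it against the mean-absolute-difference definition:

```python
def gini(xs):
    # rank-based identity: Gini = 2*cov(X, rank/n)/mean  (population covariance)
    n = len(xs)
    s = sorted(xs)
    mu = sum(s)/n
    cov = sum(x*(i + 1)/n for i, x in enumerate(s))/n - mu*(n + 1)/(2*n)
    return 2*cov/mu

def gini_pairwise(xs):
    # mean-absolute-difference definition, used only for cross-checking
    n, mu = len(xs), sum(xs)/len(xs)
    return sum(abs(a - b) for a in xs for b in xs)/(2*n*n*mu)

xs = [1.0, 2.0, 4.0, 8.0, 16.0]
assert abs(gini(xs) - gini_pairwise(xs)) < 1e-12
assert abs(gini([5.0, 5.0, 5.0])) < 1e-12   # perfect equality gives 0
```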

Cancer genomics
We study the bump-hunting based cancer biomarker discovery and disease prediction using three high-dimensional microarray databases summarized in Table 1.
To investigate the significance of the multi-modal genes for predicting cancer classes, we fit a random forest (Breiman, 2001), a model-free classifier that can handle high-dimensional covariates, under two different scenarios where the feature matrix X contains (a) all the genes; or (b) only the multi-modal genes. The necessary first step is to identify the modal structure of the genes. The huge dimensionality (thousands of gene expression distributions) makes this a challenging problem. However, the LPMode algorithm provides a systematic (and automatic) nonparametric exploration strategy for such large-scale mode identification problems. LPMode categorizes the genes based on their modal shapes, as shown in the top panel of Fig 4. We apply the LPMode bump-hunting based unsupervised gene selection strategy (screening only the multi-modal genes) for supervised classification learning. We randomly sampled 30% of the data to form a test set; this entire process was repeated 100 times, and test-set accuracy rates are shown in the bottom panel of Fig 4. The predictive performance is summarized in Table 2. The surprising fact is that, based only on the multi-modal gene expression signatures (which achieve significant compression by ignoring all the unimodal genes), we get almost the same accuracy for cancer classification. The distributions (over the two classes) of a few selected bimodal genes are shown in Fig 9, which indicates that the multimodal-gene predictors are highly informative for classifying tumor samples and can be considered promising candidates for disease-specific biomarkers. The LPMode algorithm thus provides a new, computationally efficient tool for large-scale bump-hunting that could help cancer biologists discover novel, previously unknown biomarker genes.

Asteroid data
We analyze the physical density (in grams/cm3) of 26 asteroids (Marchis et al., 2006) in the main asteroid belt of our solar system. The goal is to understand the internal structure of asteroids (a largely unexplored research area) and to classify them into various classes based on their density characteristics.
The trimodal shape indicates the presence of three types of asteroids: the low-density peak corresponds to C-type asteroids (composed of dark carbonaceous objects), the middle one to S-type (composed of silicaceous objects), and the highest mode to X-type asteroids (also known as M-type, whose composition is partially known and apparently dominated by metallic iron, possibly with icy and rocky components).
Surprisingly, our purely data-driven findings are in agreement with the modern asteroid taxonomic system: both the Tholen classification and the more recent SMASS classification confirm the existence of three broad categories based on surface composition. Our empirical discovery is also validated by the recent work of Marchis et al. (2012) on "Multiple asteroid systems." Overall, LPMode could be a handy tool for astronomers to obtain more refined asteroid classification schemes as more observations become available in the future. We emphasize that our analysis is purely data-driven: it finds the underlying three types of asteroids without taking into account any astrophysical models of asteroids.

Galaxy color data
We analyze the rest-frame u-r color distribution (which is sensitive to star formation history) of 24,346 galaxies with M_r ≤ −18 and z < 0.08, drawn from the Sloan Digital Sky Survey database (Fukugita et al., 1996; York et al., 2000; Balogh et al., 2004). For details, see http://www.sdss.org/. The LPMode algorithm finds two prominent modes of the galaxy color distribution. The color bimodality conveys important clues for galaxy formation and evolution. In particular, the twin peaks represent two types of galaxy populations, which astrophysicists classify as red (old stellar populations with no or little cold gas) and blue (young stellar populations with abundant cold gas), or alternatively as early- and late-type, produced by two different sets of formation processes. Our purely data-driven discovery can be explained (in terms of what caused this bimodality) using galactic formation theories; see Baldry et al. (2004) for further details.

Analytical chemistry
The pH of bioethanol is one of the most important parameters of the quality of biofuel, since its value establishes the grade of corrosiveness that a motor vehicle can withstand.
Here we consider n = 78 measurements from the proficiency testing (PT) scheme organized by Inmetro, the Brazilian National Metrology Institute (Sarmanho et al., 2015). Nineteen laboratories participated in measuring the pH of bioethanol, and the participating laboratories were allowed to use their own procedure, i.e. a specific electrode for measuring pH. These differing measurement procedures are the main cause of multi-modal distributions in analytical chemistry. The challenge for PT scheme providers is to assess the comparability and reliability of the measurements by identifying and estimating consensus values (modes) along with measures of variability/uncertainty (standard errors of the mode positions); see Lowthian and Thompson (2002) for more details on the role of bump-hunting in proficiency testing.
LP modeling starts by selecting G to be Gaussian, to check whether the measurements deviate from normality. LPMode finds 2 major peaks. The L2 modes are located at 5.35 (0.084) and 6.56 (0.12) and the MaxEnt modes at 5.35 (0.09) and 6.65 (0.15), the larger standard error denoting the higher variability of the second mode. Our analysis strongly suggests the presence of two groups of laboratories based on their pH measurement methods. This is further confirmed by the fact that one group of laboratories carried out the pH measurements with electrodes containing saturated LiCl as the internal filling solution, whereas the other group used electrodes with a 3.0 mol L-1 KCl internal filling solution.

Biological science
We investigate the activity of an enzyme in the blood involved in the metabolism of carcinogenic substances, based on data from n = 245 individuals. The goal is to study the possible bimodality, identifying the slow and fast metabolizers as a polymorphic genetic marker in the general population. This data set was analysed by Bechtel et al. (1993), who identified a mixture of two skewed distributions using maximum likelihood techniques. Richardson and Green (1997) used a normal mixture, estimated via reversible jump MCMC, to estimate the distribution of the enzymatic activity.
LPMode starts by selecting G as the exponential distribution with mean 0.622 (the MLE estimate). The shape of the estimated L2 and MaxEnt comparison densities clearly suggests the presence of bimodality. The LP nonparametric skew-G density estimate f(x) = g(x) × d(G(x); G, F) is shown in Fig 12(B).
Our analysis finds two prominent subgroups of metabolizers, slow and fast, serving as a marker of genetic polymorphism in the population. Note that the two components associated with the two modes are of a very different nature in terms of variability and skewness, which our approach is flexible enough to capture satisfactorily. L2 Modes are located at 0.160 (…). The excess variance of the second component is caused by the underlying skewness, which GMM fails to capture by design.

Philately
Fig 11(B) analyzes the distribution of stamp thicknesses in the Hidalgo stamp data, measured by Walton Van Winkle and analyzed first by Wilson (1983) and subsequently by many researchers, including Izenman and Sommer (1988). Our LPMode algorithm finds 2 major peaks in the estimated L2 and MaxEnt comparison density. L2 Modes are located at 0.0771 (0.00084) and 0.0970 (0.00284), and the MaxEnt Modes at 0.0761 (0.000857) and 0.102 (0.00259). A similar bimodality conclusion was drawn by Efron and Tibshirani (1994) using a bootstrap-based kernel density estimate. Wilson (1983) also arrived at the same bimodal solution and concluded that "the un-watermarked stamps were printed on two different papers!" Interestingly, Wilson found the two modes to be near 0.077 mm and 0.105 mm, practically identical to our finding. Because the data are unduly rounded, traditional mixture models are known to find many spurious modes here.
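The bootstrap-based kernel approach of Efron and Tibshirani is closely related to Silverman's critical-bandwidth test, a textbook diagnostic (not part of LPMode) that can be sketched as follows: find the smallest bandwidth at which the KDE is unimodal, then resample with smoothing at that bandwidth and see how often extra modes appear. The data below are synthetic, centered near the two reported stamp modes; all other settings are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def n_modes(x, h, grid):
    """Number of interior local maxima of a Gaussian KDE with bandwidth h."""
    d = gaussian_kde(x, bw_method=h / x.std(ddof=1))(grid)
    return int(np.sum((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])))

def critical_bandwidth(x, k, grid, n_iter=30):
    """Smallest bandwidth at which the KDE has at most k modes (bisection)."""
    lo, hi = 1e-4, 2.0 * x.std(ddof=1)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if n_modes(x, mid, grid) <= k:
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(2)
# Synthetic stamp-thickness-like data around the two reported modes.
x = np.concatenate([rng.normal(0.077, 0.004, 150),
                    rng.normal(0.100, 0.006, 100)])
grid = np.linspace(x.min() - 0.01, x.max() + 0.01, 400)

h1 = critical_bandwidth(x, 1, grid)   # critical bandwidth for unimodality
B, extra = 100, 0
for _ in range(B):
    # Smoothed bootstrap: resample, then add kernel noise at bandwidth h1.
    xb = rng.choice(x, size=x.size) + h1 * rng.standard_normal(x.size)
    extra += n_modes(xb, h1, grid) > 1
p_value = extra / B   # small p-value -> evidence against unimodality
```

As with any mode-counting diagnostic, rounded data should be jittered first, since heavy rounding is exactly what produces the spurious modes noted above.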

Ecological science
We consider the body size distribution of North American boreal forest mammals, previously investigated in a famous ecological study by Holling (1992). The important question is whether the distribution differs from the "ideal" unimodal normal distribution, and in particular whether multi-modality exists. This question has profound ecological and evolutionary implications.
The LPMode algorithm, via the estimated L2 and MaxEnt comparison densities, strongly reveals the presence of bi-modality, as shown in Fig 11(A). Possible ecological explanations of the observed bimodal pattern are discussed in Holling (1992) and Allen (2006). Ecophysiologists have proposed a number of competing mechanistic hypotheses (examples: the biogeographical hypothesis, the textural discontinuity hypothesis, the community interaction hypothesis, and many more) to understand what could lead to such a pattern.

Simulation studies
An extensive simulation study was conducted to compare our proposed LPMode algorithm with three popular benchmark methods that practitioners routinely use for bump-hunting:
• Kernel density estimation based on Silverman's 'rule of thumb' bandwidth (Silverman, 1986).
• Finite Gaussian mixture model (Fraley and Raftery, 2002). We have used BIC (Bayesian Information Criterion) to select the number of mixture components throughout this comparison.
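Both benchmarks shown above are available off the shelf; the following minimal sketch (on illustrative synthetic data) shows how each one reports multimodality, mirroring the comparison setup: mode counting for the Silverman-bandwidth KDE, and BIC-based selection of the number of components for the Gaussian mixture.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Illustrative bimodal sample (two well-separated Gaussian components).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 200)])

# Benchmark 1: kernel density with Silverman's rule-of-thumb bandwidth;
# count interior local maxima of the estimate on a grid.
kde = gaussian_kde(x, bw_method="silverman")
grid = np.linspace(x.min(), x.max(), 512)
d = kde(grid)
n_kde_modes = int(np.sum((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])))

# Benchmark 2: finite Gaussian mixture, number of components chosen by BIC.
X = x.reshape(-1, 1)
bics = [GaussianMixture(k, random_state=0).fit(X).bic(X) for k in range(1, 6)]
k_best = int(np.argmin(bics)) + 1
print(n_kde_modes, k_best)
```

Note the conceptual gap the comparison probes: the KDE counts density modes directly, whereas the mixture reports components, which need not coincide with modes for skewed or overlapping components.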
The comparison is based on simulated data from the following eight distributions with different characteristics, covering a broad class of modal shapes that arise in practice. Fig 5 plots the densities to show the diversity of the shapes.

Discussion
One of the fundamental problems of statistical science is to detect a signal from large-scale i.i.d. observations X1, . . . , Xn ∼ F, where F is often called the signal-plus-background distribution. The task arises in two scenarios: (a) the first, where the signal hides in the tail of the probability distribution; (b) the second, often encountered in practice, where the signal appears as a bump or mode in the probability distribution.
A huge research enterprise has developed to address (a) under the banner of 'Large-Scale Inference' (Efron, 2010; Mukhopadhyay, 2016). At the same time, problem (b) has gained renewed relevance in the 21st century with the advent of 'big data.' Despite its practical significance and multidisciplinary utility, progress toward a pragmatic mode-discovery tool has not been satisfactory since the pioneering work of Parzen (1962). In this article, we intend to fill that gap by offering a new point of view on the problem of mode identification, whose foundation is rooted in modern nonparametric machinery (Mukhopadhyay and Parzen, 2014).
Our LPMode bump-hunting algorithm achieves three goals all at once:
• The modes of d(u; G, F ) indicate the existence of 'bump(s) above background' (by choosing G to be the background distribution), which is often scientifically more interesting (as was the case with the Higgs boson discovery) than the modality of the original distribution F.
• It avoids the challenging problem of spurious bumps by jointly analyzing the comparison density d(u; G, F ) and the original density estimate f (x), which is fundamentally different