Abstract
We discuss a statistical method for the two-group classification problem with labels \(y=0\) and \(y=1\). We consider a situation in which the conditional distribution given \(y=0\) is well specified by a normal distribution, but the conditional distribution given \(y=1\) (the rare observations in imbalanced data sets) is not well modeled by any specific distribution. Typically, in a case-control study, the distribution in the control group can be assumed to be normal after an appropriate data transformation, whereas the distribution in the case group may depart from normality. In this situation, the maximum t-statistic for linear discrimination, or equivalently Fisher's linear discriminant function, may not be optimal. We propose a class of generalized t-statistics and study their asymptotic consistency and normality. The generalized t-statistic that is optimal in terms of asymptotic variance is derived in a semiparametric manner, and its statistical performance is confirmed in several numerical experiments.
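The equivalence mentioned in the abstract, between maximizing the two-sample t-statistic over linear scores and Fisher's linear discriminant function, can be illustrated with a minimal sketch. The code below is not the generalized t-statistic proposed in the chapter; it covers only the ordinary pooled-variance t-statistic, and the function names (`pooled_cov`, `t_statistic`, `fisher_direction`) and the simulated data are assumptions made for illustration.

```python
import numpy as np


def pooled_cov(x0, x1):
    """Pooled within-group covariance of controls x0 and cases x1."""
    n0, n1 = len(x0), len(x1)
    s0 = np.cov(x0, rowvar=False)
    s1 = np.cov(x1, rowvar=False)
    return ((n0 - 1) * s0 + (n1 - 1) * s1) / (n0 + n1 - 2)


def t_statistic(w, x0, x1):
    """Two-sample pooled-variance t-statistic of the linear score w'x."""
    n0, n1 = len(x0), len(x1)
    diff = (x1.mean(axis=0) - x0.mean(axis=0)) @ w
    se = np.sqrt(w @ pooled_cov(x0, x1) @ w * (1.0 / n0 + 1.0 / n1))
    return diff / se


def fisher_direction(x0, x1):
    """Fisher's direction: S_pooled^{-1} (mean of cases - mean of controls)."""
    return np.linalg.solve(pooled_cov(x0, x1), x1.mean(axis=0) - x0.mean(axis=0))


# Hypothetical imbalanced data: many normal controls, few cases,
# with only a subgroup of the cases shifted (so the cases are non-normal).
rng = np.random.default_rng(0)
x0 = rng.normal(size=(500, 3))
x1 = rng.normal(size=(50, 3))
x1[:10] += 4.0

w_fisher = fisher_direction(x0, x1)
t_fisher = t_statistic(w_fisher, x0, x1)

# No randomly drawn direction should give a larger t-statistic.
t_random = [t_statistic(rng.normal(size=3), x0, x1) for _ in range(200)]
print(t_fisher, max(t_random))
```

By the Cauchy-Schwarz inequality, the Fisher direction attains the largest t-statistic among all linear scores, which is why the two criteria coincide. When the case group is far from normal, as in the simulated subgroup shift here, this maximizer need not be the most efficient choice, which is the setting the abstract describes.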
© 2019 The Author(s), under exclusive licence to Springer Japan KK
About this chapter
Cite this chapter
Komori, O., Eguchi, S. (2019). Generalized T-Statistic. In: Statistical Methods for Imbalanced Data in Ecological and Biological Studies. SpringerBriefs in Statistics. Springer, Tokyo. https://doi.org/10.1007/978-4-431-55570-4_4
DOI: https://doi.org/10.1007/978-4-431-55570-4_4
Publisher Name: Springer, Tokyo
Print ISBN: 978-4-431-55569-8
Online ISBN: 978-4-431-55570-4
eBook Packages: Mathematics and Statistics (R0)