Korean J Leg Med. v.41(2)

Kim, Lee, Cho, Kim, Lee, Ha, and Ahn: Asian Ethnic Group Classification Model Using Data Mining

Abstract

In addition to identifying genetic differences between target populations, it is also important to determine what those differences imply for each population. Cases requiring this kind of inference have increased in recent years, so various statistical methods must be considered. In this study, genetic data were collected from Southeast and Southwest Asian populations, and several statistical approaches were evaluated on Y-chromosomal short tandem repeat (Y-STR) data. To develop a more accurate and practical classification model, we applied gradient boosting and ensemble techniques. In classifying Southeast versus Southwest Asian populations, these models outperformed the decision tree and regression models used previously. In conclusion, this study suggests that additional statistical approaches, such as data mining techniques, can yield more useful interpretations for forensic analyses. These trials are expected to form the basis for further studies extending from the target regions to the entire Asian continent, as well as for the use of additional markers such as mitochondrial DNA.
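The workflow the abstract describes (decision trees combined through resampling-based ensembles such as bagging) can be illustrated with a minimal, self-contained sketch. This is an assumption-laden toy in plain Python, not the authors' actual pipeline (the study appears to have used SAS Enterprise Miner [14]); the locus values, the 0/1 targets, and the `stump_*`/`bagging_*` helpers are hypothetical:

```python
import random
from collections import Counter

def stump_fit(X, y):
    """Learn a one-feature threshold rule (decision stump) minimizing training error."""
    best = None  # (error, feature, threshold, polarity)
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for pol in (0, 1):
                pred = [pol if row[j] <= t else 1 - pol for row in X]
                err = sum(p != v for p, v in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    _, j, t, pol = best
    return (j, t, pol)

def stump_predict(model, row):
    j, t, pol = model
    return pol if row[j] <= t else 1 - pol

def bagging_fit(X, y, n_models=11, seed=0):
    """Bagging: fit one stump per bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(stump_fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Majority vote over the bagged stumps."""
    votes = Counter(stump_predict(m, row) for m in models)
    return votes.most_common(1)[0][0]

# Toy rows: two Y-STR loci as repeat counts; target 0/1 (illustrative values only)
X = [[11, 23], [11, 24], [12, 23], [12, 24], [14, 25], [15, 24], [14, 26], [15, 25]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(X, y, n_models=11)
print(bagging_predict(models, [11, 23]), bagging_predict(models, [15, 26]))
```

Boosting differs only in how the resamples are built: instead of uniform bootstraps, later learners reweight toward the cases earlier learners misclassified [11-13].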

REFERENCES

1. Butler JM. Advanced topics in forensic DNA typing: methodology. San Diego, CA: Academic Press; 2011.
2. Enoch MA, Shen PH, Xu K, et al. Using ancestry-informative markers to define populations and detect population stratification. J Psychopharmacol. 2006;20(4 Suppl):19–26.
3. Li JZ, Absher DM, Tang H, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–4.
4. Rosenberg NA, Pritchard JK, Weber JL, et al. Genetic structure of human populations. Science. 2002;298:2381–5.
5. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–59.
6. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.
7. Opitz D, Maclin R. Popular ensemble methods: an empirical study. J Artif Intell Res. 1999;11:169–98.
8. Rokach L. Ensemble-based classifiers. Artif Intell Rev. 2010;33:1–39.
9. Quinlan JR. Bagging, boosting, and C4.5. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI/IAAI '96); 1996 Aug 4-8; Portland, OR, USA. Vol. 1. Palo Alto, CA: AAAI Press; 1996. p. 725–30.
10. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40.
11. Schapire RE. The strength of weak learnability. Mach Learn. 1990;5:197–227.
12. Freund Y, Schapire RE. A short introduction to boosting. J Jpn Soc Artif Intell. 1999;14:771–80.
13. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38:367–78.
14. Wang R, Lee N, Wei Y. A case study: improve classification of rare events with SAS Enterprise Miner. In: Proceedings of the SAS Global Forum 2015 Conference. Cary, NC: SAS Institute Inc.; 2015.
15. Rahman MM, Davis DN. Addressing the class imbalance problem in medical datasets. Int J Mach Learn Comput. 2013;3:224–8.
16. Purps J, Siegert S, Willuweit S, et al. A global analysis of Y-chromosomal haplotype diversity for 23 STR loci. Forensic Sci Int Genet. 2014;12:12–23.

Fig. 1. Classification analysis process.
Fig. 2. Examples of decision rules.
Fig. 3. Bagging procedure.
Fig. 4. Boosting procedure.
Fig. 5. Undersampling.
Fig. 6. Progress of ethnicity classification model analysis.
Fig. 7. Gradient boosting and decision tree (chi-square) ensemble model separation rule tree.
Table 1.
Details of populations analyzed
Population Sample size Data source
Vietnam 46 Seoul National University
Nepal 69  
India 23  
Vietnam 45 Purps et al. [16]
Philippines 798  
Singapore 104  
India 298  
Total 1,383  
Table 2.
Composition of data
No. Variable Definition
1 Sample Info Nationality information
2 DYS576 Gene
3 DYS389I  
4 DYS448  
5 DYS389II  
6 DYS19  
7 DYS391  
8 DYS481  
9 DYS549  
10 DYS533  
11 DYS438  
12 DYS437  
13 DYS570  
14 DYS635  
15 DYS390  
16 DYS439  
17 DYS392  
18 DYS643  
19 DYS393  
20 DYS458  
21 DYS456  
22 YGATAH4  
23 TARGET 0: Southeast Asian
    1: Southwest Asian
Table 3.
Results of data splitting and undersampling
Category Dataset Count Target rate
Raw data Y_STR_Raw 1,345 72.0:28.0
Data partition Train dataset 846 72.0:28.0
  Validate dataset 364 72.0:28.0
  Test dataset 135 72.0:28.0
Undersampling Train dataset (undersampled) 470 50.0:50.0
  Validate dataset (undersampled) 204 50.0:50.0
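The 50.0:50.0 target rates in Table 3 result from undersampling: majority-class records are randomly discarded until both classes have the minority-class count [14,15]. A minimal stdlib sketch (the `undersample` helper is hypothetical, not the authors' SAS Enterprise Miner sampling node):

```python
import random
from collections import Counter

def undersample(rows, labels, seed=42):
    """Randomly drop majority-class rows until every class matches the minority count."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    minority_n = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, minority_n))  # sample without replacement
    kept.sort()  # preserve original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Imbalance similar to Table 3 (72.0:28.0): 720 majority vs. 280 minority records
labels = [0] * 720 + [1] * 280
rows = list(range(1000))
bal_rows, bal_labels = undersample(rows, labels)
print(Counter(bal_labels))
```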
Table 4.
Result of classification model
No. Model Resampling Misclassification rate ROC index
Train Validate Test Train Validate Test
1 GB and DT (Chi-square) Ensemble Bagging 0.038 0.068 0.044 0.996 0.975 0.992
2 GB and DT (Chi-square) Ensemble Boosting 0.044 0.063 0.037 0.995 0.973 0.990
3 DT (Entropy) and DT (Entropy) Ensemble Bagging and boosting 0.046 0.073 0.037 0.995 0.978 0.992
4 GB and DT (Gini) Ensemble Boosting 0.046 0.078 0.037 0.995 0.968 0.992
5 DT (Gini) and DT (Gini) Ensemble Bagging and boosting 0.055 0.092 0.037 0.993 0.969 0.994
6 DT (Chi-square) and DT (Chi-square) Ensemble Bagging and boosting 0.057 0.083 0.037 0.993 0.977 0.992
7 GB and DT (Entropy) Ensemble 0.063 0.087 0.044 0.980 0.966 0.985
8 GB and DT (Gini) Ensemble 0.063 0.087 0.044 0.980 0.966 0.985
9 GB 0.065 0.063 0.044 0.981 0.966 0.984
10 GB and DT (Chi-square) Ensemble 0.067 0.087 0.052 0.980 0.966 0.984
11 GB and DT (Entropy) Ensemble Bagging 0.069 0.083 0.037 0.981 0.972 0.987
12 DT (Gini) 0.069 0.083 0.052 0.950 0.955 0.969
13 DT (Entropy) 0.069 0.083 0.052 0.950 0.955 0.969
14 GB and DT (Gini) Ensemble Bagging 0.071 0.078 0.037 0.980 0.971 0.988
15 GB and DT (Chi-square) Ensemble Bagging 0.071 0.087 0.037 0.980 0.971 0.988
16 DT (Chi-square) 0.076 0.083 0.059 0.948 0.954 0.967
17 DT (Chi-square) Bagging 0.080 0.073 0.067 0.963 0.970 0.989
18 DT (Gini) Bagging 0.080 0.073 0.067 0.963 0.970 0.989
19 DT (Entropy) Bagging 0.084 0.083 0.059 0.973 0.966 0.983
20 DT (Gini) Boosting 0.137 0.248 0.163 1.000 0.962 0.993
21 DT (Chi-square) Boosting 0.149 0.150 0.126 1.000 0.973 0.993
22 DT (Entropy) Boosting 0.179 0.238 0.185 1.000 0.977 0.991

ROC, receiver operating characteristic; GB, gradient boosting; DT (Chi-square), decision tree model using chi-square statistics; DT (Entropy), decision tree model using entropy statistics; DT (Gini), decision tree model using Gini statistics.
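The two metrics in Table 4 are straightforward to compute: the misclassification rate is the fraction of wrong predictions, and the ROC index is the area under the ROC curve, i.e., the probability that a randomly chosen positive case scores above a randomly chosen negative one. A minimal sketch (hypothetical helper names, assuming the "ROC index" reported here corresponds to the AUC):

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that disagree with the true labels."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the rank statistic: P(score of a positive > score of a negative),
    with ties counting half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one of four cases misclassified; AUC from the raw scores
print(misclassification_rate([0, 0, 1, 1], [0, 1, 1, 1]))   # one error in four
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```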

Table 5.
Ensemble model variable importance
Variable Count of split rules Variable importance
Train Validate
DYS392 1 1.000 1.000
DYS390 2 0.680 0.649
DYS448 1 0.256 0.193
DYS643 1 0.219 0.130
DYS438 1 0.193 0.279