Classifying Oryza sativa accessions into Indica and Japonica using logistic regression model with phenotypic data

In Oryza sativa, indica and japonica are pivotal subpopulations, and other subpopulations such as aus and aromatic are considered to be derived from indica or japonica. In this regard, Oryza sativa accessions are frequently viewed from the indica/japonica perspective. This study introduces a computational method for indica/japonica classification by applying phenotypic variables to the logistic regression model (LRM). The population used in this study included 413 Oryza sativa accessions, of which 280 accessions were indica or japonica. Out of 24 phenotypic variables, a set of seven phenotypic variables was identified to collectively generate the fully accurate indica/japonica separation power of the LRM. The resulting parameters were used to define the customized LRM. Given the 280 indica/japonica accessions, the classification accuracy of the customized LRM along with the set of seven phenotypic variables was estimated by 100 iterations of ten-fold cross-validations. As a result, the classification accuracy of 100% was achieved. This suggests that the LRM can be an effective tool to analyze the indica/japonica classification with phenotypic variables in Oryza sativa.


INTRODUCTION
Oryza sativa (Asian cultivated rice) is known to have five subpopulations which are indica, temperate japonica, tropical japonica, aus and aromatic. Of these, indica and japonica, comprised of temperate japonica and tropical japonica, are known as pivotal subpopulations; aus is known to be related to indica, and aromatic is intermediate between indica and japonica (Garris et al., 2005;Thomson et al., 2007;Kovach et al., 2009;Huang et al., 2012;Schatz et al., 2014;Civan et al., 2015;McCouch et al., 2016;Chin et al., 2017). There are genetic barriers, particularly between indica and japonica, which often challenge rice trait improvements by breeding (Kato & Kosaka, 1930;Chen et al., 2008;Kim, Jiang & Koh, 2009;Zhu et al., 2017). Because each subpopulation often has desirable characteristics for cultivars, overcoming the genetic barriers between the subpopulations will help rice breeders freely introgress desirable genes originating from different subpopulations into an elite line. This can eventually minimize the breeding costs and maximize the sustainability of rice production being threatened by climate changes, human population increases and loss of cultivation land. Classifying Oryza sativa is important because it can help breeders develop effective mating paths to overcome the genetic barriers by identifying accessions that can bridge between indica and japonica. Currently, classifying Oryza sativa is widely conducted with genomic tools because genomic data can reflect the variation of subpopulation characteristics at a molecular level (Garris et al., 2005;Kim, Jiang & Koh, 2009;Zhao et al., 2011;McCouch et al., 2016;Chin et al., 2017). However, the sole use of genomic data may have limitations to the understanding of Oryza sativa diversity because some subpopulation-associated traits are related to cytoplasmic effects (Bao, Sun & Corke, 2002;Zhao et al., 2010;Thomson et al., 2007). To overcome the gap that genomic data cannot fill, the use of phenotypic data is reasonable in that it comprehends the genomic and cytoplasmic effects.
This study introduces how to computationally classify Oryza sativa accessions into indica and japonica by applying multiple phenotypic variables to the logistic regression model (LRM). This study used a publicly available data source containing information on 413 Oryza sativa accessions and demonstrated that the LRM is a promising tool for accurate indica/japonica classification in Oryza sativa by phenotype.

Phenotypic data
A set of phenotypic data and subpopulation information was used, which was originally generated and analyzed by Zhao et al. (2011). The data set consisted of 413 accessions originating from 82 countries, obtained at http://ricediversity.org/data/sets/44kgwas. Out of 32 phenotypic variables, 24 variables were selected because they were quantitative and not confined to a certain geography. The selected variables were divided into morphology (culm habit, flag leaf length, flag leaf width), yield components (panicle number per plant, plant height, panicle length, primary panicle branch number, seed number per panicle, florets per panicle, panicle fertility), seed morphology (seed length, seed width, seed volume, seed surface area, brown rice seed length, brown rice seed width, brown rice volume, seed length/width ratio, brown rice length/width ratio), stress tolerance (straighthead susceptibility, blast resistance) and quality (amylose content, alkali spreading value, protein content). Every accession in the data set belonged to one of the following subpopulation groups: admixed (62), aromatic (14), aus (57), indica (87), temperate japonica (96) and tropical japonica (97). In this study, temperate japonica and tropical japonica were combined into japonica.

Stepwise variable selection and parameter estimation
The LRM formula used in this study can be denoted as follows: where P(japonica|x 1 ,...,x n ) is the probability of an accession being japonica (>0.5) or indica (<0.5) given the predictor variables, x 1 ,. . . , x n ; β 0 is the constant term; β 1 ,. . . , β n are the parameters for the predictor variables, x 1 , . . . , x n , respectively. In Eq. (1), the response variable is binary between indica and japonica, and the predictor variables (phenotypic variables) are quantitative. To identify a set of predictor variables that can maximize the indica/japonica separation power, n sets were prepared from 1D to nD, in which the nD is a set containing all possible combinations of n different phenotypic variables (e.g., 1D contains every single phenotypic variable, 2D contains all possible pairs of phenotypic variables, and so forth). The n was increased by one until the maximum P(japonica|x 1 ,...,x n ) was reached, in which every single selection from the nD set was subject to the following steps: 1. Calculate a set of parameters by fitting the LRM (Eq. (1)) with the phenotypic variables in a selection. 2. Applying the phenotypic variables in the selection used in Step 1 to Eq. (1) with the parameters estimated in Step 1. The resulting value must be between 0 and 1, from which the indica or japonica can be determined at 0.5. In order to estimate the indica/japonica separation power, a receiver operating characteristic (ROC) curve was implemented using an R package called pROC (Robin et al., 2011). In an ROC space, the x-and y-axes ranging between 0 and 1 represent the false positive rate (FPR or specificity) and true positive rate (TPR or sensitivity), respectively. Thus, an area under an ROC curve (AUC) can range between 0 and 1. The closer the AUC is to 1, the higher the indica/japonica separation power. By referring to the resulting AUCs, a set of phenotypic variables that maximized the indica/japonica separation power was identified, and the resulting parameters were used to define the customized LRM.

Applications of the customized LRM
1. Estimating the indica/japonica classification accuracy: given the 280 indica/japonica accessions, the customized LRM along with the related phenotypic variables were taken into 100 iterations of ten-fold cross-validations, from which the 100 values were obtained. The resulting values were averaged into the indica/japonica classification accuracy.
2. Estimating the variable interaction: as the LRM uses multiple phenotypic variables, some portion of indica/japonica classification power might be derived from interactions between variables. The interaction magnitude for every variable with all other variables was calculated using the H -statistic (Frieman & Popescu, 2008). The resulting H -statistic can range between 0 and 1 with no interaction resulting in 0 and full interaction resulting in 1. The H -statistic was calculated using an R package, iml (Molnar, Casalicchio & Bischl, 2018).
3. Classifying accessions in each minor subpopulation group into indica and japonica: the customized LRM was applied to each minor subpopulation group (aromatic, aus, admixed) to divide accessions into indica and japonica. This examination aimed to observe how accessions in each minor subpopulation group are phenotypically divided from the indica/japonica perspective.

Dendrogram-based indica/japonica classification
To draw dendrograms, the genomic data set was obtained at http://ricediversity.org/data/ sets/44kgwas (Zhao et al., 2011). The genomic data set consisted of the accessions in the phenotypic data, genotyped with 36,901 SNPs. Two dendrograms were drawn; one included the 280 indica/japonica accessions, and the other included all 413 accessions. Then, the dendrogram-based indica/japonica classifications were compared with the LRM-based indica/japonica classifications. Each dendrogam was graphed based on a genetic distance matrix in which the genetic distances between two different accessions were calculated using the following equation: where the IBS A,B is the IBS (identical by state) coefficient between A and B, which can range between 0 and 2.

Data availability
With the exception of calculating the IBS matrix, all other computations were conducted using R (R Core Team, 2016). The data set and R scripts used in this study are freely available at https://github.com/bongsongkim/logit.regression.rice.

RESULTS
Stepwise variable selection Figure 1 and Table 1 suggest that more phenotypic variables led to stronger indica/japonica separation power of the LRM, and the fully accurate indica/japonica separation power (AUC = 1) was achieved with a 7D selection. Table 2 summarizes the phenotypic variables yielding the maximum indica/japonica separation power in each set (1D to 7D). The set of phenotypic variables yielding AUC = 1 comprises panicle number per plant, seed number per panicle, florets per panicle, panicle fertility, straighthead susceptibility, blast resistance and protein content.

Estimating the indica/japonica classification accuracy
Given the seven phenotypic variables yielding AUC = 1, the indica/japonica classification accuracy was estimated using the 280 indica/japonica accessions, for which 100 iterations of ten-fold cross-validations were implemented. As a result, the indica/japonica classification accuracy of 100% was obtained. This indicates that the seven phenotypes were certainly impactful for the fully accurate indica/japonica classification. Assuming that some portion of classification power might be derived from interactions between variables, the Hstatistic was calculated for the purpose of estimating how much each variable generates the classification power in collaboration with other variables. Table 3 summarizes the resulting H -statistic values, indicating nearly no interactions between variables. Figure 2 shows the dendrogram-based indica/japonica classification with the 280 indica/japonica accessions, in which the indica-and japonica-varietal clades were accurately divided, and the japonica accessions were further accurately divided into temperate japonica and tropical japonica (Fig. S1). This result shows that the dendrogram-based indica/japonica classification accuracy was 100%.
(3) in descending order are 50943.2 (-) for panicle fertility, 38008.7 (-) for florets per panicle, 35800.4 (+) for seed number per panicle, 5832.5 (-) for panicle number per plant, 491.6 (-) for straighthead susceptibility, 353.7 (+) for blast resistance and 214.6 (-) for protein content. These suggest that, when it comes to the indica/japonica separation power, the panicle fertility is most impactful, followed by florets per panicle, seed number per panicle, panicle number per plant, straighthead susceptibility, blast resistance and protein content. Because of missing phenotypic records, the size of each minor subpopulation group was reduced from 62 to 52 for the admixed group, 14 to 12 for the aromatic group and 57 to 26 for the aus group. Figure 3 shows that Eq. (3) split each subpopulation group into indica and japonica in ratios of 5:47 for the admixed group, 5:7 for the aromatic group and 18:8 for the aus group, respectively. Meanwhile, the dendrogram drawn with the whole accessions (413) shows two major clades (upper and lower), in which temperate japonica, tropical japonica and aromatic formed accurately separate groups within the upper clade (hereafter called japonica-varietal clade), while indica and aus formed accurately separate groups within the lower clade (hereafter called indica-varietal clade). The admixed accessions were spread across all subpopulation groups (Fig. S2). Figure 4 shows three Venn diagrams, each of which represents comparison between the LRM-based and dendrogram-based classifications for each minor subpopulation group; the agreements were 92.3% (48/52) in the admixed group, 58.3% (7/12) in the aromatic group and 69.2% (18/26) in the aus group.

DISCUSSION
The indica and japonica in Oryza sativa can be classified based on phenotypic observations by humans. However, the classification by humans can be subjective so that the classification results could sometimes be biased by an observer's perception. Meanwhile, the indica/japonica classification by the LRM is thoroughly systematic and quantitative. In this aspect, the indica/japonica classification qualities by the LRM are expected to be comparable to or better than the qualities made by humans. In fact, the customized LRM (Eq. (3)) achieved the indica/japonica classification accuracy of 100% given the 280 indica/japonica accessions. Table 1 and Fig. 1 suggest that the more predictor variables there are, the stronger the indica/japonica separation power (AUC). This implies that the variation in a single trait is narrowly distinct between indica and japonica perhaps due to intensive genetic admixture by breeding over history (Zhao et al., 2010;Xu et al., 2012), and that the variation given multiple traits is substantially distinct between indica and japonica because multiple layers of the narrow effects collectively magnify differences between indica and japonica. In this study, the fully accurate indica/japonica separation power (AUC = 1) of the LRM was achieved with a set of seven phenotypic variables (Table 1 and Fig. 1). The H -statistic value of nearly zero for every phenotypic variable indicates that the indica/japonica separation power was not overestimated by unexpected synergetic effects between the phenotypic variables. Table 2 shows that the panicle-related traits, straighthead susceptibility and blast resistance frequently appeared across all sets. This may be related to previous knowledge that the variations in panicle characteristics, straighthead susceptibility and blast resistance are strongly associated with the indica and japonica differentiation: panicles are long and sparse in indica but short and dense in japonica (Bai et al., 2016); straighthead resistance and blast resistance are greater in indica than japonica (Yan et al., 2005;Jia et al., 2011).
The customized LRM (Eq. (3)) was applied to each minor subpopulation group (aromatic, aus, admixed). Applying Eq. (3) to the aromatic and aus groups aimed to see how well the resulting classifications reflect their known evolutionary relationships with both indica and japonica: aromatic is intermediate but narrowly closer to japonica; aus is distinct from but closely related to indica (Garris et al., 2005;Thomson et al., 2007;Kovach et al., 2009;Huang et al., 2012;Schatz et al., 2014;Civan et al., 2015;McCouch et al., 2016;Chin et al., 2017). Regarding the aromatic accessions, the dendrogram-based classification formed the aromatic group distantly from the japonica group in the japonica-varietal clade, which is consistent with the previous knowledge that aromatic is narrowly close to japonica between indica and japonica. The agreement between the dendrogram-based and LRM-based classifications was 58.3% (7/12). The low agreement may be related to the subtle phenotypic similarity between aromatic and japonica (Garris et al., 2005). Regarding the aus accessions, the dendrogram-based classification assigned all of them to the aus group in the indica-varietal clade, which was consistent with the previous knowledge that aus and indica are distinct within a close evolutionary relationship. The agreement of 69.2% (18/26) between the dendrogram-based and LRM-based classifications suggests that phenotypic variation between the aus and japonica groups overlaps to some extent. Perhaps, it may be because the sub-speciation of aus from indica, occurred in a geographically isolated area (Bangladesh, India) under short growing seasons and upland conditions, might have confounded its phenotypic characteristics (Garris et al., 2005;Londo et al., 2006). Regarding the admixed accessions, the dendrogram-based classification dispersed them across indicaand japonica-varietal clades. This indicated that the admixed group covered a wide spectrum of genomic properties. The admixed group was a collection of accessions with uncertain subpopulation membership due to intensive inter-subpopulation mating. This allows us to deduce that the admixed accessions may have a high potential to bridge between indica and japonica. The agreement of 92.3% (48/52) between the dendrogram-based and LRM-based classifications shows reliable classification ability of the LRM given Oryza sativa accessions with uncertain subpopulation membership.

CONCLUSION
This study showed that the indica/japonica classification based on phenotypes can be analyzed using the LRM. A set of phenotypes that can collectively generate the fully accurate indica/japonica separation power can be used for researching indica/japonica differentiation. For example, if a study aims to detect quantitative trait loci (QTL) associated with the indica/japonica differentiation, each phenotype that contributes to the fully accurate indica/japonica separation power can be used for genome-wide association studies (GWAS). Furthermore, research on the indica/japonica differentiation in Oryza sativa may