Genetic diversity and population structure of early and extra-early maturing maize germplasm adapted to sub-Saharan Africa

Assessment and effective utilization of genetic diversity in breeding programs is crucial for sustainable genetic improvement and rapid adaptation to changing breeding objectives. During the past two decades, the commercialization of the early and extra-early maturing cultivars has contributed to rapid expansion of maize into different agro-ecologies of sub-Saharan Africa (SSA) where maize has become an important component of the agricultural economy and played a vital role in food and nutritional security. The present study aimed at understanding the population structure and genetic variability among 439 early and extra-early maize inbred lines developed from three narrow-based and twenty-seven broad-based populations by the International Iinstitute of Tropical Agriculture Maize Improvement Program (IITA-MIP). These inbreds were genotyped using 9642 DArTseq-based single nucleotide polymorphism (SNP) markers distributed uniformly throughout the maize genome. About 40.8% SNP markers were found highly informative and exhibited polymorphic information content (PIC) greater than 0.25. The minor allele frequency and PIC ranged from 0.015 to 0.500 and 0.029 to 0.375, respectively. The STRUCTURE, neighbour-joining phylogenetic tree and principal coordinate analysis (PCoA) grouped the inbred lines into four major classes generally consistent with the selection history, ancestry and kernel colour of the inbreds but indicated a complex pattern of the genetic structure. The pattern of grouping of the lines based on the STRUCTURE analysis was in concordance with the results of the PCoA and suggested greater number of sub-populations (K = 10). Generally, the classification of the inbred lines into heterotic groups based on SNP markers was reasonably reliable and in agreement with defined heterotic groups of previously identified testers based on combining ability studies. Complete understanding of potential heterotic groups would be difficult to portray by depending solely on molecular markers. Therefore, planned crosses involving representative testers from opposing heterotic groups would be required to refine the existing heterotic groups. It is anticipated that the present set of inbreds could contribute new beneficial alleles for population improvement, development of hybrids and lines with potential to strengthen future breeding programs. Results of this study would help breeders in formulating breeding strategies for genetic enhancement and sustainable maize production in SSA.


Background
During the twentieth century, the advances in plant science, especially genetics in conjunction with statistics, have enhanced the progress in the selection of agronomically desirable genotypes following systematic reshuffling of the genome in crop plants including staple food crops through breeding. This has resulted in unprecedented improvement in food production which is expected to continue to play a vital role in the world food security [1,2]. Even though these breeding efforts have fulfilled the demands of intensive agriculture, it has been postulated that selective breeding may lead to the narrowing of the genetic base of crop plants which could seriously jeopardize future crop improvement efforts [3].
Following the green revolution which has benefited mainly continents that host developing countries, there has been an increase in awareness regarding the importance of genetic diversity in food crops [4,5]. During the past 5 decades, the Consultative Group for International Agricultural Research (CGIAR) breeders have been actively contributing to the broadening of the genetic bases of their mandate crops worldwide, especially in the third world, through provision of elite genetic materials to their national partners [6,7]. It has been a routine practice for breeders to infuse new genetic diversity into their base populations depending on the breeding objectives [8]. However, this effort has not resulted in marked changes in the diversity of field crops including major cereals such as maize, rice and wheat [3].
Of the cereal food crops, maize is perhaps the most important for food and economic security in SSA including in West and Central Africa (WCA), covering about a quarter of the total land area under cereal production in the sub-region [9][10][11]. However, in this subregion maize is considered as a multipurpose crop, which is consumed predominantly as a staple food crop by humans as well as poultry feed and raw material for livestock industries [12,13]. In an effort to promote the production of the early and extra-early maize varieties in SSA particularly in WCA, IITA collaborated with the International Maize and Wheat Improvement Center (CIMMYT) and the National Agricultural Research Institutes (NARIs) of WCA in 1987 to initiate systematic research efforts to develop source populations combining earliness with tolerance to moisture stress under the Maize Research Network (WECAMAN) [14]. Since then, other beneficial traits such as resistance/tolerance to maize streak virus (MSV), parasitism to Striga, low N and enhanced nutritional quality (such as quality protein and pro-vitamin A) have also been introgressed into the early and extra-early maize by the IITA-MIP [15].
In maize breeding, germplasm from similar heterotic groups and with desirable agronomic characteristics are usually intermated. Consequently, genotypes of different heterotic groups are separately kept to ensure that the developed populations are heterotic. Through this strategy, inbreds generated from different populations are normally heterotic when crossed, thus giving rise to productive hybrids. For example, in a cross between populations A and B, if the resulting F 1 performed better than the mean of the two parental populations, then the F 1 is described as exhibiting mid-parent heterosis. In contrast, if the performance of the F 1 is superior to that of the better parent, it is described as exhibiting better parent heterosis. In either case, the breeder is guaranteed progress from selection for genetic enhancement of the trait of interest. Derived inbred lines from narrow and/or broad-based populations should also display heterosis as an evidence of high specific combining ability. Such inbred lines are useful as parental lines for commercial hybrid development. These concepts have been used extensively in the IITA-MIP program to develop three narrow-based and twenty-seven broad-based source populations which have been taken through several cycles of improvement followed by the extraction of several multiple-stress tolerant inbred lines for hybrid development. These inbred lines exhibit contrasting degrees of resistance and/or tolerance to S. hermonthica, low N as well as drought stress. Some of the lines are parents of hybrids released in different countries, in SSA in different agro-ecological zones [16].
Classifying inbred lines into heterotic groups is important for exploiting their potential worth in the development of outstanding hybrids and synthetics as well as for developing new heterotic groups. It is therefore of utmost importance to study the extent of genetic variability and heterotic groups in the early and extra-early inbred lines in the IITA-MIP. Information on the genetic diversity and heterotic groups in the early and extraearly inbreds would be beneficial to the hybrid program at IITA as well as the national maize programs in SSA.
During the past two decades, the integration of molecular markers into the IITA-MIP has further facilitated the improvement of the efficiency of the breeding process, resulting in rapid generation of multiple stress tolerant early and extra-early maturing maize varieties and hybrids with enhanced nutritional quality for the countries of WCA [16,17]. This is partly due to the low cost and efficiency of molecular makers as a result of the remarkable technological advancement in molecular genetics, resulting in improvement of DNA-based markers over biochemical and morphological markers. In addition to the cost efficiency, other advantages of DNA markers such as abundance and even distribution throughout the genome, relatively rapid and efficient detection, lower genotyping error rates and generally neutral effect of allelic variation on individuals have made them ideal candidates for utilization in breeding processes [18]. The application of molecular markers for characterization of inbred lines complements and perfects classification into heterotic groups based on combining ability [19]. SNP markers are widely distributed and the most abundant molecular markers throughout the genomes of crop plants, thus making them the most commonly used in genetic studies [20]. The Diversity Arrays Technology (DArT) in combination with the next-generation sequencing platforms known as DArT-seq™ [21][22][23] has been recently introduced. This has provided a good alternative of high throughput marker genotyping platform, and due to its nature, is a perfect option for diversity analysis. The DArTseq has several advantages prominent among which are no prior knowledge about sequencing of the plant genome and the capacity to produce high-density results, possibility to score thousands of unique genomic-wide DNA fragments in a single experiment with low-cost genotype information [24,25]. The DArTseq method has been used in discriminating different species for population studies, diversity studies, characterization of germplasm and studies involving genome-wide association [26][27][28].
Information on diversity is important for estimating the amount of genetic diversity lost due to conservation or selection [29,30]. Acquaah [31] pointed out that the diversity and relatedness among inbred lines obtained from the same population or different populations are necessary in deciding the best breeding strategies to be employed to maximize their potential in a breeding program. Furthermore, combination of pedigree information and genetic distance estimates could be invaluable for placing inbred lines in distinct heterotic groups to help prevent crosses between closely related lines [32]. In order to design the most appropriate product development strategies for successful harnessing of heterosis in maize, comprehension of the extent and patterns of diversity and the relationship among the base materials is crucial for developing new inbred lines, and the choice of testers for selecting outstanding inbred line combinations for hybrid development programs [33].
Towards this end, several studies have been carried out at the molecular level to determine the diversity in the IITA-MIP inbreds, including the early and extraearly inbred lines, but these studies were conducted mostly with either few molecular markers or a limited number of inbred lines developed at specific periods in the IITA-MIP [34][35][36]. Thus, there is a need to assess the genetic differences and inter-relationships among the old and new early and extra-early maturing white, orange and yellow endosperm maize inbreds extracted by the IITA-MIP for effective placement into heterotic groups as well as facilitate successful parent selection for hybrid development.
For the purpose of comprehensive and systematic characterization of the early and extra-early maize inbreds developed in IITA-MIP, 439 early and extra-early inbred lines including some widely used inbred lines by national maize breeders of the savanna agro-ecological zones of WCA, standard testers and parents of some early and extra-maturing hybrids released for cultivation in Nigeria, Ghana and Mali were assembled for this study. These inbred lines were developed in different breeding eras during the past three decades by introgressing novel traits from landraces and exotic germplasm sources including wild relatives such as Zea diploperennis. The present study assessed the genetic diversity and population structure of these inbreds using 9642 DArTseq SNP markers.

Summary statistics of SNP markers and diversity analysis
Among the 18,927 SNPs utilized for the DArTseq genotyping of the inbreds in the present study, 12,485 SNP markers with call rate > 0.8 were informative. Thereafter, markers with minor allele frequency < 0.05 and monomorphic markers were eliminated, resulting in 9642 high quality informative SNPs which were used for further analysis. Of these markers, a total of 1370, 1123, 987, 951, 1047, 710, 734, 793, 706 and 622 SNPs were mapped on chromosomes 1 to 10, respectively. Diversity indices statistics across the 9642 SNPs indicated an average minor allele frequency (MAF) of 0.173 and polymorphic information content (PIC) of 0.206 with a range of 0.015 to 0.500 and 0.029 to 0.375, respectively ( Table 1). The mean expected heterozygosity (0.249) was higher than the observed heterozygosity (0.059) values. Of the 9642 SNP markers, 3930 (40.8%) markers showed PIC values greater than 0.25 and were found to be highly informative.
The analysis of chromosome-wise informative SNP markers revealed that SNP markers varied from 622 on chromosome 10 to 1370 on chromosome 1 with an average of 904 markers per chromosome. The gene diversity (GD), PIC and heterozygosity values among chromosomes were consistent and displayed slight variations among chromosomes. The observed GD among the inbred lines varied from 0.243 on chromosome 8 to 0.259 on chromosomes 1 and 3, PIC varied from 0.201 on chromosome 8 to 0.213 on chromosomes 1 and 3 and heterozygosity ranged from 0.055 on chromosome 9 to 0.062 on chromosome 10 (Fig. 1a). PIC was uniformly distributed among the SNPs with values varying from 0.1 to 0.4, but the distribution of MAF values was asymmetrical and skewed towards lower values. More than two-fifth of the markers (42.8%) had a MAF value in the range of 0.01 to 0.10 (Fig. 1b).

Population structure analysis
The different complementary approaches such as STRUCTURE, Neighbour-Joining phylogenetic trees and PCoA were employed to obtain the information on the population structure of the panel of inbred lines. The value of LnP(D) increased continuously from K = 1 to K = 12; nonetheless, an inflexion point was observed before K = 4 that was obvious after K = 10 (Fig. 2a). The highest K model with an elevated ΔK (K = 10), but K = 4 also had high ΔK values (Fig. 2b). Based on the admixture model in the software STRUCTURE at K = 4 and K = 10, the maize inbred panel of 439 inbred lines was grouped into four and ten sub-populations, respectively, using 9642 SNP markers ( Fig. 2c and d). Introducing different assignment thresholds (0.9, 0.8, 0.7 or 0.6) resulted in greater decrease in the number of unassigned inbred lines (Additional file 1: Figure S1). Nonetheless, 13.1 and 15.5% of the inbred lines in the panel showed probability of association less than 60% and were considered as admixture at K = 4 and K = 10, respectively. Of these admixture lines in the panel, 31 inbreds were found to be common at both K = 4 and K = 10 (Additional file 2: Table S1).
The Neighbor-joining (NJ) method assigned all the 439 inbred lines into four clusters (C 1 to C 4 ) which were further re-grouped into two main-clusters (A and B) (Fig. 3). For the purpose of comparison, each branch of the tree was displayed with the same colour as in the STRUCT URE analysis with K = 4 and K = 10 and the respective sub-population denoted by roman numerals (I to IV) and with numerical digits (1 to 10), respectively ( Fig. 3a and b). Broadly, the groupings of the inbred lines based on the PCoA were also in accordance with the NJ-clustering and model-based population partition in grouping lines into the different sub-populations (Figs. 3 and 4). The PCoA explained 20.59% of the total SNP variation among inbreds across the first two axes. The two-dimensional scatter plot showed that PCoA 1 and PCoA 2 accounted for 11.30 and 9.46% of the total variation, respectively, revealing the presence of four major groups (Fig. 4a).
Despite the inconsistency in the NJ-clustering and STRUCTURE analysis at K = 4 and 10 ( Fig. 3), the PCoA clearly differentiated the sub-population-I (SP-I; red colour; K = 4; comprising 76 inbreds) corresponding to cluster C 1 into two groups (1 and 9) and supported the population structure of the panel of inbred lines obtained  at K = 10 ( Fig. 4). Furthermore, the PCoA indicated substantial differences in the level of intra-population structure in groups 1 and 9 (Fig. 4d). The STRUCTURE analysis at K = 10 showed group 1 as comprising 6 Table S1). Furthermore, an early maturing orange kernel endosperm inbred tester, TZEIOR 129 and two extra-early yellow endosperm inbred testers, TZEEI 79 and 81 were also placed in group 9. The first coordinate axis (PCoA1) described genetic differentiation between sub-population II (SP-II; green colour; K = 4; 111 inbred lines) corresponding to C 2 (NJ clustering) and the other clusters. Furthermore, the STRUCTURE analysis at K = 10 suggested that SP-II comprised group 7 (orange colour; consisting of 71 inbred lines) and group 10 (oak colour; consisting of 37 lines) representing 16.2 and 9.8% of the panel of 436 inbred lines, respectively. However, both groups were not well separated by the first three coordinates of the PCoA indicating their proximity at the genetic level (Fig. 4). Group 7 consisted of both white and orange/yellow kernel inbred lines derived from varying genetic  Table S1).
The C 3 (NJ-cluster) contained the highest number of inbred lines and consisted of sub-populations SP-III and SP-IV, whereas C 4 having the lowest number of inbred lines constituted most of the admixture lines together with few inbred lines representing SP-III. This revealed the inconsistency in the results of the NJ-cluster and STRUCTURE analyses when considering the K value of 4 ( Fig. 3a). High level of similarity was observed in the clustering patterns of STRUCTURE (K = 4 and K = 10) and PCoA for SP-IV/group 3 (Figs. 3 Table S1). Similarly, some inbred lines representing SP-III (blue colour) but grouped with members of SP-1 in C 1 were also clearly differentiated by PCoA, further supporting the new group 6 revealed by STRUCTURE analysis at K = 10 (Fig. 4). All the inbred lines in group 6 (Silver oak, 13 inbreds constituting 2.96% of the panel of inbred lines) contained orange endosperm kernels and originated from 2009 TZE OR1 DT STR population except inbred TZdEEI 13 with low threshold (0.6), derived from TZEE-Y Pop STR.
Interestingly, some orange endosperm kernel inbred lines in cluster 3 (C 3 ) classified as admixture by the STRUCTURE analysis at K = 4 formed new group 2 when the value of K was considered as 10 (Additional file 2: Table S1). Nonetheless, PCoA clearly differentiated the group 2 ( Fig. 4b-d; green colour) but showed their proximity with group 6 suggesting that these groups were very similar. The group 2 representing 2.05% of the panel of inbred lines also shared their pedigree with group 6 which had several inbred lines derived from 2009 TZE OR1 DT STR, a broad-based orange endosperm kernel, drought tolerant and Striga resistant population. Although, NJ clustering partitioned the SP-III (blue colour) of the inbred lines panel at K = 4 into three clusters including some inbreds in C 1 with SP-I (Fig. 3a), the lines were not well separated by PCoA into different groups (4, 5 and 8) except the lines in group 6 which were clearly separated by STRUCTURE analysis at K = 10 ( Fig. 4b-d) Table S1). None of the testers were placed in groups 2, 5 and 6 while five testers (ENT 13, TZEEI 29, TZEEIOR 30, TZEI 10 and TZEI 17) had less than 60% probability of association, and hence were classified as admixture (Additional file 2: Table S1).

Discussion
Manifestation of heterosis and its fixation remain the preferred choice for maximizing gains from selection in crop plants and largely depends on the level of genetic diversity of germplasm base. The advent of PCR based markers, greater genome abundance and high reproducibility, have made SSR markers the 'marker of choice' but the availability of high-density genotyping technologies have resulted in a shift from SSR makers to SNP markers such as DArT which are amenable to highthroughput technology and are considered as 'marker in demand' [18,37]. In the recent past, DArTseq marker platforms have been successfully used to quantify diversity in cereals including maize [36,[38][39][40][41][42][43][44]. The mean PIC value for the SNP dataset in the present study was 0.206 (ranging from 0.029-0.375) and was comparable with the PIC value estimated for tropical maize by Adu et al. [36], both in terms of mean value (0.19) and range (0.01-0.38) but lower than those described by Wu et al. [44] and Zhang et al. [45]. In previous studies, low PIC value for IITA maize germplasm has also been reported when compared with temperate, INERA and CIMMYT germplasms [46]. The low to moderate genetic diversity observed in the IITA maize germplasm may be attributed to the breeding strategies adopted at IITA which cut across the extra-early, early, intermediate, and late maturing groups [47]. The maize inbred panel used in our study consisted of 439 early and extra-early maize inbreds, which was a good representation of the genetic variation of contemporary IITA early and extra-early maturing maize germplasm. Previous diversity studies of early and extra-early maturing tropical maize involved much fewer inbred lines: 17, 22, 92 and 94 have been reported by Badu-Apraku et al. [48], Akaogu et al. [49], Ifie [35] and Adu et al. [36], respectively.
The population structure is important for explaining the heterogeneity of genetic architecture and is mainly affected by spatial and gene exchange isolation [50]. Based on 9642 DArT markers, population structure and patterns of relationship of 439 inbred lines was investigated based on different complementary approaches that clearly revealed the existence of genetically distinct groups in the present panel of inbred lines (Figs. 3 and  4). Our results revealed that the pattern of grouping from population STRUCTURE analysis and PCoA methods was more reliable than the Neighbor-Joining clustering method. These findings are consistent with those reported by Semagn, et al. [30]. Nonetheless, the agreement between STRUCTURE and PCoA methods was unexpected, as PCoA summarized variations between pre-defined groups based on population structure. Contrarily, NJ-cluster showed low concordance with STRUCTURE analysis in respect of the number of groups and assignment of genotypes to their respective groups (Fig. 3). However, clustering methods are prone to possible ambiguity, since a single distance matrix and a clustering algorithm may give rise to several other clusters [30,46,51]. The similarity in grouping patterns obtained with PCoA suggested that the groupings obtained were reasonably reliable despite the discrepancies in number and size of sub-populations/groups (Fig. 4).
Since the late 1990s, when there was a major shift in emphasis from maize breeding for open-pollinated varieties towards hybrid development in WCA region, several efforts have been made to classify the numerous IITA early and extra-early maize inbred lines into heterotic groups using different methods including phenotypic data of measured traits, combining ability effects of multiple traits and molecular markers, but heterotic groups are still not fully established [15,47,52,53]. Akinwale et al. [47] suggested four to five heterotic groups on the basis of the combining ability analysis of selected early white and yellow maize inbred lines and concluded that grouping of inbreds using information from only combining ability studies could lead to contradictory results due to G x E interactions and could result in the classification of the same inbred lines into different heterotic groups in different studies as it relied largely on yield which is a polygenic trait with high influence of environment.
In the present study, different multivariate methods were used to group the panel of IITA-MIP early and extra-early inbred lines into four major clusters, but close examination of the available information clearly indicated greater number of sub-populations. Our results revealed clear population stratification which was consistent with the ancestry, selection history and kernel colours of the inbred lines (Figs. 3 and 4; Additional file 3: Table S2). For example, NJ-clustering, STRUCTURE analysis and PCoA methods consistently grouped all the inbred lines extracted from two early broad-based populations (TZE-W Pop STR 108 and TZE-W Pop STR 104) into a single group (SP-IV and sub-population 3 at K = 4 and 10, respectively) along with lines from other pedigree sources (TZEE-W Pop STR 108, TZEI 1 x TZEI 2 and TZEE-W Pop STR 104) with white endosperm kernels and Striga resistant characteristics (  [55,56]. Furthermore, the inbred lines in five groups (2, 5, 6, 9 and 10) had yellow/orange kernels while the remaining groups (1, 4, 7 and 8) contained both white and yellow/orange endosperm inbred lines (Additional file 2: Table S1). All the inbred lines including some testers containing genes from Zea diploperennis background were clustered into five groups (1, 3, 5, 8 and 9). It was striking that all the inbred lines of group 2 and 6 were derived from a common source, 2009 TZE OR1 DT STR while other groups contained inbreds from different pedigree sources suggesting the existence of substantial diversity within the population or pool from which the inbred lines were extracted [35,51]. For example, clustering of inbreds derived from the orange/yellow endosperm broad-based   population (2009 TZE OR1 DT) and the bi-parental population (TZEI 17 x TZEI 11) in group 9 and most of the inbred lines from the yellow endosperm broad-based population (TZE-Y Pop STR Co) and bi-parental population (TZEI 11 x TZEI 8) in group 10 indicated some common attributes in their ancestry (Fig. 3 B and Fig. 4; Additional file 2: Table S1). It is noteworthy that these inbreds were extracted from TZE-Y Pop DT STR and TZE Comp5-Y DT populations improved for drought tolerance and have DR-Y Pool BC 2 F 2 , KU 1414, and TZi 28 (9499) in their genetic backgrounds as the sources of drought tolerance [57,58]. The TZEE-Y Pop is an extraearly yellow endosperm broad-based population formed by compositing CSP-SR BC5, TZEE-Y SR BC5, CSP × Local Raytiri, and TZEE-Y populations while TZE-Y Pop STR is an early yellow endosperm broad-based population with resistance to Striga and tolerance to drought and was developed by recombining DR-Y Pool BC2F2, KU1414 and the intermediate maturing yellow endosperm inbred line 9499 [57]. Similarly, TZE-Comp 5 is an early maturing population derived by crossing TZESR-WC3 to 10 Striga resistant inbred lines [59]. Therefore, the lack of clear heterotic patterns in tropical maize germplasm compared to temperate germplasm is mainly attributed to the earlier maize breeding focus on the development of broad-based populations and genetic pools at both CIMMYT and IITA [16,33]. This might further explain the reason for the low to moderate diversity in the IITA early maturing maize germplasm, as selection pressure was directed more towards fixing of the favourable allele frequency for specific characteristics such as maturity period (early to extra-early), biotic (MSV and resistance to Striga) and abiotic (tolerance to drought) stresses in the populations via recurrent selection. Thus, the complex clustering patterns of the present set of maize inbred lines was not unexpected as the mixed genetic constitution of the populations and pools may be due to the grouping together of inbreds derived from different base populations. Nevertheless, this has made the task of assigning inbreds into distinct heterotic groups at molecular level difficult. This corroborates the findings of earlier researchers in which molecular markers displayed the existence of complex population structure in tropical maize, including CIMM YT maize lines (CMLs) and researchers were unable to group them into complementary heterotic patterns [30,33,46,51]. Knowledge of the genetic relationship among testers and their efficiencies in grouping other inbred lines is important for a hybrid breeding program to be successful. Therefore, plant breeders are continuously studying inbred testers to determine their efficiencies in grouping other inbred lines. Several promising testers have been identified in the IITA early and extra-early maize improvement program over the past twenty years, but precise information with respect to their specific heterotic groups is still not fully established [16]. In agreement with earlier reports, the two inbred testers, TZdEEI 12 and TZdEEI 7 belonging to the same heterotic group were classified into group 1 while TZEEIOR 109 and TZEEIOR 197 assigned to group 7 also belong to similar heterotic group (Additional files 2 and 4: Tables S1 and S3).
Based on the results of the present study, IITA-MIP breeders could formulate breeding strategies for genetic improvement of early and extra-early maize in SSA. Planned crosses involving representative testers from opposing heterotic groups identified in the present study could be initiated to refine the existing heterotic groups in the IITA-MIP. Results presented in this study could serve as an important guide to parent selection for further population improvement and development of productive hybrids in the IITA-MIP to maximize maize productivity in different agro-ecologies of SSA region. For example, the classification of the maize inbreds into distinct heterotic groups in the present study is expected to facilitate the development of superior hybrids, synthetics, pools and breeding populations possessing resistance/tolerance to multiple stresses (such as drought, low-N, and Striga hermonthica) as well as enhanced nutritional qualities including PVA and quality protein levels of tropical maize. Additionally, the information obtained from the DArT-SNP marker-based genetic distance (GD) estimates can employed to minimize the cost of testing in the IITA-MIP by preventing evaluation of crosses between related lines and assist in eliminating crosses with poor performance [60]. Furthermore, the results of the molecular analyses could be combined with morphological and agronomic testing of the IITA-MIP germplasm to provide complementary information and increase the resolving power of genetic diversity analyses [61]. Finally, the identification of diverse parental combinations will facilitate the creation of segregating progenies with maximum genetic variability for further selection [62] and the introgression of favourable alleles from diverse germplasm sources into available breeding populations as proposed by Thompson et al. [63].
The strategy of IITA-MIP has been to establish a pair of heterotic groups each for the different maturity classes, based on the kernel colour and target breeding objectives using line x tester mating design, North Carolina Design II (NCD II), Diallel mating design, and grouping methods such as SCA effects of grain yield, heterotic grouping based on general and specific combining ability effects of grain yield (HSGCA), heterotic grouping based on general combining ability effects of multiple traits (HGCAMT) and DNA markers. Presently, a pair of heterotic groups has been established in the IITA-MIP for developing white normal endosperm hybrids as well as white QPM hybrids of early and extra-early maturity classes (Additional file 5: Figure S2). Similarly, we have a pair of heterotic groups targeted at developing yellow and orange normal endosperm as well as yellow QPM, and orange QPM hybrids of early and extra-early maturity classes. In practice, it is ideal to have two heterotic groups for each maturity class and endosperm colour for a successful practical maize breeding program. Therefore, the four heterotic groups identified in the present study could pose a major challenge to the present strategic decision of the IITA-MIP to classify the inbred lines in the program into a maximum of three heterotic groups designated as A, B, and C (the mixed group). The number and choice of heterotic groups are arbitrary decisions and a breeding program can have two or more heterotic groups. However, working with two distinct heterotic groups, designated as A and B with subgroups within each group for different maturity classes and endosperm colors would facilitate the management of the heterotic groups and accelerate genetic gains from selection. Nevertheless, several challenges would need to be addressed if this strategy is adopted in our program to ensure accurate classification of invaluable inbred lines in the mixed group C that falls outside the classical A and B heterotic groups. Therefore, our goal is to reduce the heterotic groups identified in the present study into A and B categories. This could be achieved by aligning the heterotic affinities of the elite inbred lines with mixed genetic backgrounds into existing heterotic groups A and B using field evaluations of crosses with testers of known heterotic groups and molecular markers. The heterotic groups of some of the inbred lines derived from the breeding populations in the present study which have been used in developing commercial hybrids in the IITA-MIP are presented in the Additional file 6: Table S4 and Additional file 7: Figure  S3. The inbred lines have been classified into heterotic groups A and B. In an effort to determine whether the classification of the inbred lines into heterotic groups based on SNP markers was reasonably reliable, the selected inbred lines which have been used in developing commercial hybrids in the IITA-MIP were grouped using SNP markers in the present study. The groupings based on the SNP markers were then compared with those based on the mating designs and grouping methods such as the SCA of grain yield, HSGCA and HGCAMT. The classification of the selected early white, yellow, and orange endosperm inbred lines into heterotic groups A and B using SNP markers approximated 64 and 56% respectively for the lines that should have been classified based on the different multivariate methods.
Similarly, placement of the extra-early white, yellow and orange endosperm inbred lines into heterotic groups A and B using SNP markers approximated 71 and 50%, respectively compared to the groupings based on the different multivariate methods. The results of this study revealed close correspondence between the groupings of the inbred lines based on the mating designs/classical grouping methods and the SNP markers. However, there is a need for continuous refinement of the heterotic groups to ensure continuous and adequate genetic gains from selection in the IITA-MIP extra-early and early breeding program. Finally, it should be noted that it would be impracticable to have as many as 24 heterotic populations for the early and extra-early maturity groups alone as presented in Additional file 5: Figure S2, so a strategy has to be developed to prioritize the number of heterotic groups that would be manageable and cost effective for the IITA-MIP extra-early and early breeding program.

Conclusions
The present study has provided useful information on the genetic variability and population structure of early and extra-early maize inbreds with wide adaptation to the different agro-ecologies of the SSA. Using DArTseq technology, the multivariate methods identified four distinct groups which are generally in agreement with the ancestry, selection history and kernel colour of the lines but indicated a complex pattern of genetic structure. Our results suggest that the application of complementary approaches is very efficient in predicting the presence of groups and in placing the genotypes into the different groups based on molecular markers. As an additional tool, the molecular markers are useful for preliminary assignment of inbred lines into prospective groups where discrete heterotic groups are not well established. Nonetheless, the grouping of testers into each potential heterotic group may help reduce the number of actual field crosses that would need to be made to validate the grouping of these inbred lines with a limited number of field evaluations of the crosses. Finally, our study has demonstrated the existence of high level of diversity among the present set of early and extra-early inbred lines of IITA with good adaptation to the SSA maize growing conditions in countries including Nigeria, Ghana and Mali. Consequently, during the past decade, molecular approaches have been adopted in the IITA-MIP to refine genetic diversity and combining ability studies and this has resulted in increased hybrid maize productivity at relatively faster and cheaper rates.

Sample preparation and DNA isolation
For genomic DNA extraction, leaf samples from 8 to 10 seedlings of each inbred line were collected at 3 weeks after planting and stored in the deep freezer (− 80°C), freeze-dried and ground as described by Adu et al. [36]. Total genomic DNA from each sample was extracted following standard DArT procedure [36]. In a 96 well plate, ninety-four samples were placed and individual plates were sealed in accordance with DArT instructions. Finally, all the plates were kept in a shipping box and dispatched to the DArT P/L platform, Genetic Analysis Service for Agriculture (SAGA) facility at CIMM YT, Mexico.

DArTseq genotyping, data filtering and statistical analysis
Wide-genome genotyping of the 439 inbred lines was conducted using DArTseq technology [21,40]. Following a strict quality control process involving parameters such as call rate, data reproducibility (~20% of samples replicated), and rate of monomorphism to remove monomorphic markers, 18,927 SNPs were obtained from the studied germplasm. Molecular markers were filtered again utilizing PLINK 1.9 software and those showing greater than 20% missing data were removed. Moreover, SNPs having a variance close to 0 and the rare SNPs with less than 5% minor allele frequencies (MAF) were also eliminated from the dataset resulting in final dataset containing 9642 DArTseq informative SNPs.
Statistical analysis for genetic diversity parameters including MAF, unbiased estimation of gene diversity, observed and expected heterozygosity (H o and H e ), and PIC value were performed using PowerMarker V3.25 software [64].

Genetic structure analysis
To reveal the genetic structure of the panel of maize inbred lines, all the 9642 DArTseq markers were imported into the Bayesian Markov chain Monte Carlo software STRUCTURE V2.3.4 [65]. In the AD-MIXTURE method, the number of sub-populations varying from k = 1-20, and five times simulations with iterations and burn-ins set to 10,000 were used, with no prior information on the origin of individuals [19]. For the most appropriate k-value within the present panel, the Evanno transformation method was used which is useful and better described the data and also exhibited a low cross-validation error compared to other k values [66]. Following the Evanno ΔK method, the results obtained from STRUCTURE were implemented in Structure Harvester to determine the most suitable value of k. Inbred lines with membership probabilities ≥0.60 were assigned to the corresponding sub-population while less than 0.60 were considered as admixture.
To confirm the assignment of inbreds into the sub-population by STRUCTURE analysis, population phylogeny was also studied by imputing the full set of data into DARwin software [58,67] using neighbor-joining (NJ) tree feature by running 30,000 bootstraps. The phylogenetic tree was constructed in FigTree version 1.4.3 software [68]. The inbred lines in each cluster of the NJ phylogenetic tree were highlighted by different colours corresponding to the results obtained by the STRUCTURE analysis. Finally, principal coordinate analysis (PCoA) was also carried out utilizing the DARwin software [69] to visualize the pattern of genetic differentiation within and between the groups of inbred lines and complemented the pattern of diversity and clustering revealed by STRUCTURE analysis and dendrogram, respectively.