Pearl millet genome sequence provides a resource to improve agronomic traits in arid environments

Draft genome, 994 re-sequenced lines and GWAS for yield-traits provide a resource of genetics and genomics tools for pearl millet researchers and breeders. Supplementary information The online version of this article (doi:10.1038/nbt.3943) contains supplementary material, which is available to authorized users.


Genome assembly
To assemble the pearl millet genome, we used whole genome shotgun (WGS) and bacterial artificial chromosome (BAC) sequencing. Ten small inserts (of ~170, 250, 500 and 800 bp), and 13 large inserts (of ~2, 5, 10, 20 and 40 kb) WGS libraries were constructed using Tift 23D 2 B 1 -P1-P5 10 genotype. These libraries were sequenced on the Illumina HiSeq 2000 and 520 Gb of sequence data, representing 296× genome coverage, were produced (Supplementary Table 1). Two BAC libraries, with an average insert size of ~120 kb, were constructed from Tift 23D 2 B 1 -P1-P5 using EcoRI and HindIII. 972 Gb of sequence data were generated from 100,608 BAC clones at ~80× genome coverage (Supplementary Table 2 and Supplementary  Fig. 1). In brief, 1.49 Tb of sequence data, after stringent filtering and correction steps, were assembled into 1.58 Gb of contigs (sequences without gaps or Ns) and 1.82 Gb of scaffolds (contigs joined with estimated gaps filled in).
Based on k-mer statistics, the pearl millet genome size was estimated to be 1.76 Gb (Supplementary Fig. 2), indicating that ~90% of the genome was assembled. Scaffolds longer than 1 kb totaled 1.79 Gb, with 50% of scaffolds (N50) being longer than 884.95 kb (N50 contig = 18,180 bp) and the largest scaffold spanning 4.82 Mb (Supplementary Table 3). To evaluate the assembly, we generated additional whole genome sequence data with 1× coverage on the PacBio platform. More than 90% of these long reads were mapped back to a scaffold with more than 90% similarity and 90% ratio of aligned length (Supplementary Fig. 3).
Linkage information from three biparental mapping populations, and collinearity with the genome of foxtail millet (Setaria italica) 11 were used to assemble genomic scaffolds into pseudomolecules. We assembled 1.56 Gb into seven pseudomolecules (Pg1 to Pg7, Fig. 1 and Supplementary Table 4). The average GC content of pearl millet (47.9%) is higher than that of foxtail millet (46.1%), sorghum (44.5%), barley (Hordeum vulgare, 44.4%), and rice (43.5%) (Supplementary Fig. 4). We assessed the variability in GC content in 10-kb non-overlapping sliding windows (Supplementary Fig. 5) to show that the observed GC content did not arise from sequencingbased GC bias. The GC content in whole genome coding sequence (CDS; 54.76%) and in 384 expanded gene families (53.14%) was examined as well; it was at a similar proportion to the total genome, providing confidence in this result (Supplementary Table 5 and Supplementary Fig. 6). Analysis of completeness was carried out using the core eukaryotic gene mapping approach (CEGMA), which revealed that >97% of genes were present in the assembly (Supplementary Table 6).

Repetitive sequences
In total, 1.22 Gb of repeat elements were identified in a 1.58-Gb genome assembly, indicating that 77.2% of the assembled genome is repetitive. In addition, because the repetitive parts of the genome are always the parts that are under-represented in the genome assembly, most of the unassembled DNA (0.18 Gb) is most likely repetitive, too. This is not surprising, because multiple repeats will often collapse into a single repeat in an assembly and also because "repeat masking" is often performed before some assembly steps [11][12][13] . We expect the true percentage of repetitive DNA to be a minimum of 80%. This is similar to the proportion of repetitive DNA found in the 2.3-Gb maize genome (>85%), and considerably more than in 730-Mb sorghum 14 (~61%), ~400-Mb foxtail millet 11 (~46%) or 466-Mb rice 15 (~42%) genomes. In common with the pattern in many other plant genomes, long-terminal repeat (LTR) retrotransposons were the most abundant class of repetitive DNA, and comprise >50% of the nuclear genome of pearl millet (Supplementary Table 7). Using RepeatMasker, we found that sequence divergence rates were high (peak at 28%) among long interspersed nuclear elements (Supplementary Fig. 7).

Genes and annotation
A total of 69,398 transcriptome assembled contigs (TACs), amounting to 43 Mb in total, were identified using pearl millet transcriptome sequences from two different studies 16,17 and a new pearl millet transcriptome assembly generated for this study (Supplementary Table 8). Ab initio homology-based gene prediction were combined with transcript assembly to infer a non-redundant set of 38,579 gene models with an average transcript size of 2,420 bp and an average coding sequence of 1,014 bp ( Table 1; Supplementary  Table 9). The average lengths of mRNA, CDS, introns and exons in pearl millet were similar to those reported for other cereal genomes (Supplementary Fig. 8). Among 458 of the most conserved genes in CEGMA, 437 (95.4%) genes were complete but 8 (1.7%) genes were not found in the genome sequence, 8 (1.7%) genes were not included in the gene set, and 5 (1.1%) genes had more than one copy (possibly fragmented genes). In addition, for 956 genes in benchmarking universal single-copy orthologs (BUSCO) analysis, we annotated 96.7% genes, and 95.4% of these are complete. Gene models of rice and Arabidopsis thaliana have been annotated and carefully validated. We chose to use the gene models of rice, which is more closely related to pearl millet than A. thaliana, to investigate the completeness of pearl millet genes. Of the 4,202 single-copy genes in rice, 90.86% have homologs in pearl millet, and 86% of these pearl millet genes were complete when compared with rice gene models (ratio of pearl millet length/rice length 0.8), reflecting the completeness of single-copy genes. Gene density increased toward the ends of pseudomolecules ( Fig. 1), consistent with findings in all other cereal genomes published to date 11,14,15 Table 10).
Predicted pearl millet proteins were compared to those already annotated in ten plant species (Arabidopsis 20 , Brachypodium (Brachypodium distachyon) 21 , banana (Musa acuminata) 22 , barley 23 , foxtail millet 11 , maize 24 , rice 15 , sorghum 14 , soybean (Glycine max) 25 and bread wheat) 26 and, as expected according to evolutionary relatedness, the highest number of orthologs were identified in foxtail millet (74.16%) and the lowest number in Arabidopsis (61.88%; Supplementary Table 11). Reciprocal pairwise comparisons of predicted proteins for 38,579 pearl millet gene models with 385,891 gene models from the same ten plant species (as above) identified 17,949 orthologous groups ( Supplementary Table 12), of which 5,232 contained only a single pearl millet gene, which is suggestive of simple orthology (Supplementary Table 13; Supplementary Fig. 9). In addition to protein-coding genes, we predicted 909 tRNA, 235 rRNA, 183 microRNA (miRNA) and 752 small nuclear RNA (snRNA) genes in our assembly (Supplementary Table 14).

Gene families
We identified unique and shared gene families among different species in the grass subfamilies Panicoideae, Pooideae and Ehrhartoideae using OrthoMCL (Ortho Markov Cluster Algorithm http://orthomcl. org/orthomcl/) 27 . Pearl millet and foxtail millet share 15,887 gene families (of those, 14,398 are also found in sorghum) while pearl millet and barley share 13,607 gene families (Fig. 2a). A total of 15,869 gene families are present in at least one species in each of the three subfamilies (i.e., Panicoideae, Ehrhartoideae and Pooideae) analyzed (Fig. 2a). 354 gene families were substantially expanded in pearl millet and 1,692 gene families were contracted (Fig. 2b). We compared the average length of the genes for the 384 expanded gene families among all the ten species and used "Quantile" statistics concept to estimate the short CDS. In this concept, Q 1 is "25th percentile", Q 3 is "75th percentile" and interquartile range (IQR) is estimated as Q 3 -Q 1 . We consider a length shorter than Q 1 -3(IQR) to be an extreme outlier. By using this method, we found that only 24 (6.25%) genes had substantially shorter CDS in pearl millet genes compared to other species. Thus, only a small proportion of the expanded gene families might be misidentified because of possible partial genes (Supplementary Fig. 10).
Expansion and contraction of gene families between species might also highlight differences in bioinformatics analysis carried out for different genomes. Bias in gene model identification among different studies might render a comparison of expansion or contraction challenging. One potential source of bias is if a gene is split, that is, a A r t i c l e s complete gene is instead annotated as two separate genes. Based on eukaryotic orthologous gene sequences, we estimate that 2.3% of our genes might have been misannotated in this way (Supplementary Table 6). Although we found that 1,692 families were contracted in pearl millet, contraction is 5.4 times more likely than expansion. One explanation may be that there was a far higher proportion of split genes in the reference genomes of the other species that we use for comparison than in our pearl millet assembly. This would make our number of gene family contractions an overestimate. Gene families that seem to be the most greatly expanded are those encoding cutin, suberin, wax biosynthetic genes (P < 10 −6 ) and transmembrane transporters of secondary metabolites (ABC transporters, P < 10 −24 ) (Supplementary Table 15). Triterpenoids are a component of wax, and we also observed a substantial expansion of the gene families associated with terpenoid backbone biosynthesis, and monoterpenoid (P < 0.05) and di-terpenoid biosynthesis (P < 0.005). Notably, increased cuticular wax synthesis improves drought tolerance in Arabidopsis species 28 , while reduced wax production has been associated with drought sensitivity in rice 29 . An enriched repertoire of genes for lipid synthesis and export of macromolecules in pearl millet might contribute to its heat and drought tolerance.
Resistance to pathogens is a crucial contributor to crop yield. The majority of resistance genes in plants contain a nucleotide binding site (NBS). Identification of NBS-containing genes in pearl millet will help to identify putative resistance genes. 378 NBS-encoding genes were manually verified after initial searching, comprising ~1% of the total gene set, similar to the proportion found in other cereal genomes (Supplementary Table 16). NBS-leucine rich repeats (NBS-LRR) genes made up ~43% of the NBS-genes, with NBS-only genes comprising ~41%. Of the 378 NBS-encoding genes, 360 were mapped to one of the seven pseudomolecules, with significantly (Chi-squared test P-value < 10 −10 ) biased distribution among the pseudomolecules; ~26.2% and ~25.7% were located on Pg4 and Pg1, respectively (Supplementary Table 17). These are also the same two pseudomolecules to which a downy mildew resistance quantitative trait locus (QTL) was mapped 30 . We observed large tandem arrays of NBS genes near the telomere region of Pg1 (two 4-gene groups, four 5-gene groups and one 6-gene groups) followed by Pg4 (three 2-gene groups and two 4-gene groups) (Supplementary Fig. 11 and Supplementary Table 18), consistent with a biased distribution of these loci and suggesting that tandem duplication may be an important source of local gene amplification.

Population structure, diversity and domestication
To better elucidate population structure, assess genetic diversity and understand pearl millet domestication, we resequenced 994 lines. The lines resequenced comprised 260 inbred male sterility maintainer (B-) and 320 male fertility restorer (R-) lines, 345 Pearl Millet Inbred Germplasm Association Panel (PMiGAP) lines (including cultivated germplasm from Africa and Asia, elite improved open-pollinated cultivars, hybrid parental inbreds and inbred mapping population parents) 31 Tables 19 and 20). In addition, for PMiGAP lines, 78.9 Gb of data at an average coverage of 0.12× was generated using genotyping by sequencing 32 , while for B-and R-lines, 614.45 Gb of data at 0.59× coverage with an average of 1.06 Gb per sample was generated using RAD sequencing 33 Table 32 and Supplementary Figs. 12-15). We conducted a principal component analysis (PCA) and constructed a neighbor-joining tree based on 450,000 high-quality SNPs. The PCA analysis and phylogenetic tree showed four main clusters, three that contained wild accessions and one that grouped together the cultivated germplasm (Fig. 3a,b). The three wild accession clusters were separated by geographical origin into East, Central and West African clusters (Fig. 3a,b).
The closest of the wild groups to the cultivated samples is from the central part of West Africa (Fig. 3b), indicating that pearl millet originated in this region, consistent with prior research 35 . The oldest archaeological remains, which date to 4,500 years ago, were found in the north-central Sahel, in accordance with our genetic analyses 36 . Studies of archaeological remains found that by 3,500 years ago cultivation of pearl millet was widespread in Sahelian Africa [37][38][39] . Spread of pearl millet agriculture to Asia, and in particular to India also dates to 3,500 years ago 40 . Average pairwise nucleotide diversity within populations (θ π ) and Watterson's estimator of segregating sites (θ ω ) both indicated high diversity among wild accessions (average θ π = 0.00366 and θ ω = 0.00342) compared with PMiGAP (average θ π = 0.00238 and θ ω = 0.00289) on all seven pseudomolecules (Supplementary Table 33). In agreement with the PCA analysis and neighbor-joining tree, we observed strong population structure in the wild accessions and weak population structure in PMiGAP lines (Supplementary Figs. 16 and 17). The weak cultivated pearl millet structure suggests a homogenous genetic diversity across large geographical scale. This pattern is certainly associated with a rapid spread of pearl millet agriculture in Africa and India without major bottlenecks during diffusion. This pattern is expected for inbreds derived from a highly allogamous species. The strong structuration of wild diversity and the central geographical origin of the cultivated sample suggest strong untapped and unique diversity for breeding from wild populations found in East Africa (Sudan, Chad) and the West (Senegal, Mauritania).
Domestication in pearl millet, like that observed in maize 24 , was associated with profound modifications of spike morphology and plant architecture (Fig. 3c). We found several genomic regions that showed reduced diversity in the cultivated (but not wild) species that may harbor genes selected for during domestication. Using a negative log ratio of diversity between cultivated (red) and wild (blue) samples, values close to 1 indicate a tenfold decrease in diversity whereas values close to 0 indicate that diversity is maintained in the cultivated samples. We also identified regions with an excess of differentiation based on a fixation index (F ST ) measure (Supplementary Fig. 18). These analyses provided orthogonal and consistent results and identified 140 genomic regions with values above the 95% threshold for both loss of diversity and differentiation. Using a stringent threshold of 99.5%, and considering only values identified by both statistics, 24 genomic regions had reduced diversity in the cultivated germplasm, of which eight were located on Pg7, six on Pg6 and five on Pg1 (Supplementary Tables  34 and 35). Linkage groups 6 and 7 have previously been identified as carrying QTL that explain most phenotypic differences between wild and cultivated pearl millet germplasm 41,42 . Most of the identified regions have negative Tajima's D values (<−2.0), suggesting a signature of positive selection (Supplementary Table 34). One striking case of diversity loss of more than tenfold was associated with the regulation of an auxin-induced gene PINOID on Pg6. This gene is known as barren inflorescence2 (ref. 43) in maize, and variation in this gene has been associated with phenotypic variation of the inflorescence 44 . Our analyses also pinpointed genes encoding protiens that might be associated with morphogenesis (LIM2 and PINOID on Pg6, Myosin 11 on Pg7) or gene regulation (Basic helix-loop-helix, bHLH110 on Pg3, Zinc Finger on Pg6). Validation of the role(s) of each of these genes in domestication will require functional analyses and further phenotypegenotype association analyses using fine-scale QTL approaches.

GWAS
Genome-wide SNP data were used to compute linkage disequilibrium decay (LDD) in all three germplasm sets. We set the r 2 threshold as 0.2 and observed rapid LDD of less than 0.5 kb in B-and R-lines (48 bp) as well as in PMiGAP lines (84-444 bp) (Supplementary Fig. 19). LDD in pearl millet is on par with that in maize, and we note that both these plants are allogamous 45 . Relatively rapid LDD is expected in sets of lines that represent the variation present in a highly allogamous panmictic population. Grain and stover yield, and its component traits, is of crucial importance in pearl millet and has undergone selection during domestication. We carried out GWAS across 288 testcross progenies of PMiGAP lines for 20 traits, and identified 1,054 strongly significant marker trait associations (MTAs) for 15 traits (Supplementary Table 36 This analysis also identified separate groups of cultivated and wild lines from east, west and central parts of Africa. (c) Morphological differences between wild (i, ii) and cultivated accessions (iii, iv) of pearl millet. Wild accessions have numerous bristled spikes in the inflorescence and low seed density (i), and a plant architecture characterized by numerous basal and aerial branches (ii), with a plant height of around 1 m. Cultivated accessions have exposed seeds and a high seed density per spike (iii), with a few basal branches and no aerial branches (iv).

Genomic prediction of hybrid performance
We applied our resequencing data to carry out genomic selection to predict grain yield for test crosses. Four scenarios of prediction were investigated, namely the performance of grain yield in each of the three environments (control, early stress and late stress) and across environments. We observe high prediction accuracy, measured as the Pearson correlation coefficient between the predicted and observed values, standardized with the square root of the heritability (h = 0.78), amounting to 0.6 for the performance across environments. Analyses of this kind have been undertaken for grain yield in other crops using genomic selection 46 . A modelling study recently found that with this level of prediction accuracy, genomic selection could substantially improve selection gain per year 47 .
We also predicted hybrid performance, by using genomic selection strategy that considers additive and dominance effects. The ridge regression best linear unbiased prediction method 46 was trained using phenotypic grain yield data from 64 pearl millet hybrids grown in five environments in India in replicated trials during the time period 2004-2013. The grain yield data were analyzed with 302,110 SNPs with missing values below 5% and minor allele frequency above 5% for 580 B-and R-lines (Fig. 4a). We found 170 promising hybrid combinations (Supplementary Table 37 and Fig. 4a). Of these, 11 combinations were already used for producing hybrids that showed better performance (Supplementary Table 38). However, 159 combinations have never been used in hybrid breeding (Fig. 4b), and therefore they are good candidates for developing high-yielding hybrids.
We inspected the predicted hybrid performance of all possible 167,910 single-cross combinations by applying hierarchical clustering combined with a heat plot, and examined the potential of this approach to identify promising heterotic groups. The analyses revealed two sets of lines that are predicted to have an average 8% higher hybrid performance when crossed to each other than the total set of 167,910 single-cross combinations ( Fig. 4c and Supplementary  Fig. 21). These predicted high-yield hybrids could be used as a nucleus to establish high-yielding heterotic groups for hybrid pearl millet breeding 48 (Supplementary Tables 37 and 38).

DISCUSSION
Pearl millet is a staple food for more than 90 million people in Africa and Asia. People living in arid and semi-arid regions, in particular, rely on pearl millet, which can crop in the harsh conditions. We sequenced the genome of pearl millet reference genotype Tift 23D 2 B 1 -P1-P5 (available at https://www.ncbi.nlm.nih.gov/assembly/ GCA_002174835.1/). The draft genome assembly presents 90% of the pearl millet genome with N50 of scaffolds as 884.95 kb and 87.2% assembled genome into seven pseudomolecules. The genome assembly of cereal species like pearl millet with high levels of repetitive DNA is always challenging. Therefore, in addition to a WGS approach, BAC-sequence data were used to develop the draft genome assembly and PacBio data were generated to validate the assembly. To achieve chromosome level assembly, one can use new approaches of sequencing such as Bionano Genomics optical mapping and Dovetail Genomics chromosome confirmation capture data in different combinations 49 .
Our analysis identified 38,579 protein-coding genes, of which 27,893 (72.30%) were annotated. CEGMA and BUSCO analyses together with comparison with gene models of rice have indicated completeness of predicted genes in pearl millet. Expansion of gene families associated with terpenoid backbone biosynthesis and monoterpenoid and diterpenoid biosynthesis in the genome might explain the high level of heat and drought tolerance in pearl millet as compared to other cereals.
Genome sequence can provide information either about specific genomic regions or specific genes that are associated with agronomically important traits including grain and fodder yield. Pearl millet fodder is the main feedstock for ruminant (and other) livestock, and breeding to improve fodder quality and yield is of crucial importance to both the meat and the dairy industries. In order to identify loci or variants associated with agronomic features, we undertook a largescale resequencing effort. Resequencing of the PMiGAP set revealed that small structural rearrangements, such as insertions and deletions in the genome have occurred throughout the evolution of pearl millet. This is similar to observations made in maize: a third or more of maize genes seem to be optional. Frequent insertions and deletions pose  substantial challenges to resequencing efforts because self-pollinated and small-genome species such as rice are easier to sequence and analyze as compared to cross-pollinated and large-genome species like maize owing to their increased genomic structural variability 50 .
With an objective to save the cost, but without losing information, 1.68× coverage WGRS data and 0.12× GBS data were generated on PMiGAP lines and 0.59× coverage RAD-sequencing data were generated on B-and R-lines. The sequence information from the more genetically diverse PMiGAP inbred panel will be of broader use for genome-wide association mapping and allele mining. All of these sequences are available at https://www.ncbi.nlm.nih.gov//sra/?term= SRP063925. Resequencing data of almost 1,000 pearl millet lines (963 inbreds of cultivated pearl millet and 31 heterozygous wild individuals, available at https://www.ncbi.nlm.nih.gov// sra/?term=SRP063925) provides researchers and breeders with an enormous resource of genome-wide variations including SNPs, indels, SSRs and structural variations (Supplementary Tables 23-32) for mining alleles of genes with significant MTAs and for developing pearl millet hybrids with increased heterosis. Our analysis on resequencing data on PMiGAP lines together with phenotyping data for 20 traits for GWAS and genomic selection suggests that simultaneous improvement of grain and stover yield might be feasible in pearl millet. Indeed, improved grain and stover yield performance of hybrids in India has been noted over the past 50 years, which underlines the potential for further improvements that could be informed by our analyses.
We also show the use of the genome sequence and resequencing information to make predictions of test-cross hybrid performance. After inspecting predicted hybrid performance of 167,910 single-cross combinations, we identified 159 pair of lines that have not been used so far for hybrid breeding but can exhibit high hybrid performance. This type of analysis has considerable potential for accelerating future rates of selection gain. Our prediction models were also applied to define heterotic pools for pearl millet for South Asia, which could be crucial for increasing the efficiency of hybrid breeding programs in the same region.
Together the draft genome and resequencing data provide a resource for the research community that should enable a better understanding of trait variation and accelerate the genetic improvement of pearl millet. For instance, we identified 1,054 MTAs for 15 agronomic traits that will be useful for pearl millet breeding. Our findings will also contribute to a better understanding of the genetic basis of the exceptional drought and heat tolerance of pearl millet as we have identified expansion of gene families associated with drought and heat tolerance. A detailed understanding of how well pearl millet crops do in hot, arid and semi-arid regions might enable engineering of not only pearl millet but also other cereal crops like rice, maize and wheat, which are currently able to provide only limited produce in arid or semi-arid regions. This is especially important owing to the pressing need for heat-and drought-tolerant cereal crops in the coming years.

METhODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.  Whole genome shotgun sequencing and assembly. We constructed 10 small insert libraries including 4 with 170 bp insert, 2 with 250 bp insert, 2 with 500 bp inserts and 2 with 800 bp insert, and 13 mate-pair libraries including 4 with 2 kb insert, 4 with 5 kb insert, 2 with 10 kb insert, and 2 with 20 kb insert and 1 with 40 kb insert from pearl millet genotype Tift 23D2B1-P1-P5. To make libraries with ~170 to ~800 bp inserts, high quality DNA samples were sheared, end-repaired, and ' A' bases were added to the 3′ end of the DNA fragments to facilitate ligation to adapters. Fragments in the appropriate size range were selected after separation on an agarose gel and amplified using PCR. For mate-pair libraries, a biotinylation reaction was performed after fragmentation and end-repair. Then DNA fragments of the required size were selected and circularized. Circular DNAs were sheared into approximately 400-600 bp fragments, and biotinylated fragments were captured for terminal modification and adaptor ligation to construct libraries. Paired end reads were generated for each library on an Illumina HiSeq 2000 platform.
For BAC library construction, DNA from pearl millet genotype Tift 23D 2 B 1 -P1-P5 was fragmented using HindIII and EcoRI, and then ligated into vector pCC1BAC. The ligations were transformed into E. coli DH10b host cells. After DNA isolation from BAC clones, Covaris LE220 system was used to shear DNA into ~500 bp. Agilent Bravo Automated Liquid Handling Platform and an Agilent BenchCel Microplate Handler were used to construct BACs for sequencing. Then 96-microTUBE plates (Covaris) were used as sample vessels for automated batch processing followed by index adaptor ligation and size selection 51 . Generally, the sizes of the BAC ranged from 80 -180 kb and fragments for sequencing were about 500 bp. In total 100,608 BAC clones were constructed and HiSeq 2000 was used for sequencing paired end reads of each BAC clone.
For each library, we filtered the reads that comprised more than 5 percent of "Ns" or polyA structure, and also removed reads that possessed 20 or more bases with quality score less than or equal to 7. Reads with >10 bp aligned to the adaptor sequence (allowing ≤3 bp mismatch) were considered as adaptor contaminants and removed. Additionally, paired-end reads with a total length smaller than the library insert size allowing a window of 30 bp were removed. We also trimmed the reads if the quality of bases at the head or tail of the reads was low.
k-mer analysis. We performed k-mer analysis 52 for the estimation of the genome size of pearl millet genotype Tift 23D 2 B 1 -P1-P5. Genome size was estimated by the formula: Genome size = k-mer_num/Peak_depth where k-mer_num was the total number of k-mers and Peak_depth was the expected value of k-mer depth obtained from the distribution curve. The number of k-mers (generally K = 17) was calculated from short fragment size reads with a one bp slide, and then the frequency of each k-mer was determined. A distribution curve of depth versus frequency was plotted, where the x-axis represents the depth and the y-axis represents the proportional frequency at that depth divided by the total frequency of all the depths.
Development and improvement of genome assembly. For WGS assembly, clean reads were assembled by SOAPde novo 53 (Version 2.04) (parameters: pregraph -s assembly.lib -K 63 -R -d 1 -o pm; contig -g pm -R; map -s assembly. lib -g pm -k 45; scaff -g pm). The k-mer frequency follows a Poisson distribution when read length << genome size 54 . Short insert libraries were assembled into contigs. The reads were mapped back onto the contigs to estimate overlap between contigs. Gapcloser 53 (Version1.10, parameter: -a pm.scafSeq.fill -b reads.lib -o pm.scafSeq.fillGap -t 24) within the SOAPde novo package was used to fill gaps in the scaffold with paired end reads. BAC-by-BAC sequencing of 100,608 BAC clones was conducted to improve the quality of the genome assembly. Each sequenced BAC was assembled separately by SOAPde novo. First, sequences shorter than 2,000 bp or having more than 30% unknown bases in BAC clones were discarded. The remaining sequences were then pooled with WGS scaffolds together to extend and collapse redundant sequences.
For improving WGS-based assembly, BAC-sequence data were included in analysis using Rabbit package 55 . This package consists of three modules: Relation Finder, Overlapper and Redundancy Remover. In the first step, 40 bp at the end of each sequence was trimmed as they turn to be of lower quality. Then overlapping between sequences were detected by BLAT 56 with minimum overlap length set to be 3,000 bp. In second module for extension, overlapping with identity greater than 90% were merged and sequences were extended. To avoid the duplicates in the final assembly, segmental duplications and divergent haplotypes were identified and filtered based on the Poisson-based k-mer model following methods described in Liu et al. 52 . To evaluate the assembly of the pearl millet genome, we first calculated the length and N50 distribution for the BAC sequences. The BAC lengths ranged from 80-140k, and their N50s were from 10-40k (Supplementary Fig. 22). Gaps can occur in the fragmented BAC assemblies since the insert size of the pair end reads is 500 bp. PacBio reads were processed using Blasr (processed with PBJelly pipeline) to evaluate the assembled sequence.
GBS and SNP calling on mapping populations. GBS libraries were prepared using restriction enzyme ApeKI as described by Elshire et al. 32 . The MAPPOP1 and MAPPOP2 populations were sequenced at 384-plex (that is, 384 samples per flowcell lane) on an Illumina HiSeq 2000, while the MAPPOP3 population was sequenced at 96-plex (96 samples per flowcell lane). SNPs were called using the TASSEL-GBS pipeline in TASSEL v4.1.32 57 . The TASSEL-GBS pipeline incurs an overhead for each separate pseudomolecule processed, hence we concatenated the thousands of individual scaffolds into ~20 megascaffolds to ease computation. Reads were processed into clean 64 bp "tags" and mapped against the reference scaffolds with Bowtie 2 (ref. 58). SNPs were called with the DiscoverySNPCallerPlugin in TASSEL, with minimal filters to reduce the number of false positives due to sequencing errors (minor allele frequency ≥ 0.01, minor allele count ≥ 10, genotype calls in at least 10% of samples) (Supplementary Code 1). RAD sequencing. Genomic DNA of each B-and R-individual was digested with EcoRI. After electrophoresis, DNA fragments of the desired lengths were gel purified. Adaptor ligation and DNA cluster preparation were performed and fragments were sequenced on an Illumina HiSeq 2000 platform. Similarly, 29 DNA libraries were constructed for B-and R-lines (580 samples) and sequenced using the RAD-Seq approach 33 .
Genetic map construction. SNPs called from the GBS data on three populations (MAPPOP1, MAPPOP2 and MAPPOP3) were first filtered for quality based on minor allele frequency, missingness and heterozygosity (Supplementary Code 2). Linkage groups were defined based on hierarchical clustering of SNPs and ordered with MSTMap. For each population, we created three maps: one from stringently filtered SNPs, one from moderately fileted SNPs, and one mapping GBS sequencing tags back to the stringently filtered map (Supplementary Code 2). The framework map generated in the largest RIL population (Tift 99B × Tift 454) formed the basis of an initial colinearity study between pearl millet and foxtail millet, and the resulting comparative knowledge was used to incorporate additional scaffolds for which orthology to the foxtail millet genome had been established using BLASTP (to identify putative orthologous pearl millet and foxtail millet genes at an E-value threshold of 1e-5) and MCScanX 59 (to identify colinear segments of at least five syntenic genes between pearl millet and foxtail millet) analyses into the framework map. The genetic maps generated for each of the crosses, and the map that we built based on collinearity information between pearl millet and foxtail millet, were merged using ALLMAPS 60 with the most weight assigned to the synteny map followed by the stringent SNP maps, the moderately filtered SNP maps, and finally the GBS sequencing tags (Supplementary Code 3). Linkage group numbering was adopted as per an existing consensus map 17 based on mapping SSR sequences to the assembled genome (Supplementary Code 3).
Repeat annotation, gene prediction and genome annotation. We searched the genome for tandem repeats with Tandem Repeats Finder 61 (Version 4.04) (parameters: 2 7 7 80 10 50 2000 -d -h). Transposable elements (TEs) were identified in the genome by a combination of homology-based and de novo approaches 62 . For homology-based predictions, we used the repeat database Repbase16.10 63 to identify known repeats in the genome assembly with the program RepeatMasker 64 (Version 3.3.0) (parameter: -nolow -no_is -norna -parallel 1 -lib RepeatMaskerLib.embl.lib). At the protein level, RepeatProteinMask, a software in the RepeatMasker package, was used to perform RMBlast against the TE protein database (parameter: -noLowSimple -pvalue 0.0001). For de novo prediction, the programs RepeatModeler 65 (Version 1.0.5) and LTR_FINDER 66 (Version 1.0.5) were used on the entire genome to generate a pearl millet repeat database, which was subsequently used as input library with RepeatMasker (Version 3.3.0) to identify TEs.
For predicting genes, we applied several approaches: (i) Homology-based prediction: Proteins previously annotated in other species (Supplementary Table 9) were mapped to the genome using BLAT 56 (Version 34) with default parameters. Alignments in which the coverage of the query protein was less than 0.3 were removed. In addition, if there were multiple BLAT hits (BLAT output was set to the five best hits), secondary hits were removed if their aligned length was less than 0.3 of the aligned length of the top BLAT hit to filter paralogs with lower sequence identity. GeneWise 67 (with parameter -trev -sum -genesf) was used to predict spliced alignments. (ii) De novo gene prediction: AUGUSTUS 68 (Version 2.5.5,-species = maize-uniqueGeneId = true-noInFrameStop = true-gff3 = on-strand = both) and Fgenesh 69 (Version 1.3) were used to detect gene models in the repeat masked genome. (iii) Prediction based on transcript sequences: The assembled transcriptome sequences were aligned to the genome assembly using BLAT (Version 34) using the parameters identity ≥ 0.98 and coverage ≥ 0.98 to generate spliced alignments. (iv) Integration evidence: Source evidence generated from the three approaches mentioned above were integrated using GLEAN 70 to produce a consensus gene set.
To annotate the function of the final gene models, protein sequences were aligned against KEGG 71 (release 58) and SwissProt 18 (release 20156) with BLASTP (E-value ≤ 1.0e-05) to find the best matches. InterProScan 19 (Version4.8, performed with profilescan, blastprodom, hmmsmart, hmmpanther, hmmpfam, fprintscan and patternScan analysis) was used to identify motifs and domains in the proteins encoded by the gene models along with gene ontology annotations 72 . For ncRNA annotation, tRNA genes in the assembly were identified by tRNAscan-SE 73 (Version 1.23). rRNA genes were aligned with plant query sequences (rRNA from Arabidopsis and rice species) using BLASTN with an E-value threshold of 1.0e-05. Other non-coding RNAs, such as miRNAs and snRNAs were predicted by homology searches against the Rfam database 74 using the INFERNAL 75 (Version 0.81) software.
RNA seq data generation and development of transcriptome assembly. The transcriptome sequence data were generated from individuals "9-8" and "3-9" accessions at IRD. Library preparation and sequencing (PE 100 bp) on an Illumina Hi-Seq 2000 platform was performed by Fasteris (Plan-les-Ouates, Switzerland). A total of 81,207,232 and 74,187,066 sequence reads were obtained for "3-9" and "9-8", respectively. Adaptor sequences were trimmed and reads were processed for de novo assembly using Velvet 1.0.18 76 and then Oases 0.1.18 77 . Several values of hash length were tested to optimize the assembly: 39, 51, 63, 65, 69 and 73. The obtained assemblies were compared for their ability to map raw reads using BWA 78 . We consequently decided for a hash length of 73. The transcript assembly was then searched for redundancy. Contigs sharing identity over ≥95% of the length of the shortest sequence in a set of putative homologous sequences were clustered. The final transcript assembly contained 50,313 contigs, with a total of 36,479,993 nucleotides. Three transcriptomes (Zeng et al. 16 , Rajaram et al. 17 , and the transcriptome data generated at IRD, France, available under BioProject ID PRJNA391885) were combined and clustered using CDHIT-EST 79 with default parameters to eliminate redundancy at the sequence level. Then, CAP3 80 was used to assemble the contigs. Ns on either end of the resultant contigs were trimmed. Finally, contigs of at least 200 bp in length were used in gene annotation.
Gene family and phylogenetic analysis. For gene family analysis, BLASTP with an E-value cutoff of ≤ 1.0e-05 was used to compare all annotated pearl millet protein sequences against a protein data set of 10 sequenced plant species (Arabidopsis 20 , Brachypodium 21 , banana 22 , barley 23 , foxtail millet 11 , maize 24 , rice 15 , sorghum 14 , soybean 25 and T. urartu 26 ). The proteins were clustered using OrthoMCL 27 (-mode 3) to define gene families which included both paralogs and orthologs. The number of gene families in each species and genus was calculated based on the composition of the OrthoMCL clusters. Genes that were single copy in an OrthoMCL cluster for all species analyzed were selected to construct a phylogenetic tree using the PhyML (parameters: -d nt -b -4 -m HKY85 -a e -c 4 -t e) program 81 (Version 3.0). Divergence times between pearl millet and other species were estimated using MCMCTREE 82 with default parameter. First, the gene family size for each species was calculated based on the output of OrthoMCL, and rooted tree in newick format. CAFE 83 (-p 0.05 -t 4 -r 10000 -filter) was used to predict the expansion and contraction of gene family numbers based on the phylogenetic tree and gene family statistics.
Population analysis. Population genetic analyses of the PMiGAP lines, including PCA and diversity detection were conducted essentially as described for rice by Xu and colleagues 84 . We used a subset of 450,000 SNPs, with a missing rate <10% across PMiGAP lines and wild accessions. Briefly, for PCA, eigenvector decomposition of the SNP genotype data was calculated using the R function eigen 85 . A Tracey-Wisdom test with default parameter settings was performed to determine the significance of axes using the twstats program. To build a phylogenetic tree, the percentage of pairwise nucleotide differences between individuals (p-distance) was calculated 85 . The program fneighbor (PHYLIPNEW v3.69.650 within the package EMBOSS v6.6.0.0; parameter: -matrixtype s -treetype n) was used to construct a neighbor joining tree. The resulting tree was edited and visualized using MEGA5 86 by choosing Radiation style. Population structure was assessed using the program Snmf (-k K -c) 87 . Five runs were performed and the values with the smallest Cross-Entropy for K from 2 to 7 were selected to generate the structure graphs. To better assess the structure, we performed the analysis in a geographical context, using TESS3 88 that takes geographical coordinates of the sample into account. Furthermore, parameters of population genetic diversity π, θ ω and differentiation (F ST ) were calculated based on the SNP data as described earlier 85 . To analyze diversity across the genome, we used a window of 100 kb and calculated the diversity π, θ ω and differentiation F ST for each window for PMiGAP lines and wild accessions using BioPerl modules (Bio::PopGen::Statistics and Bio::PopGen:: PopStats) on a sliding window of 100 kb using genotype data. The effective sequence length (without Ns) in each window was used as the denominator to calculate per-bp values. We then calculated a minus log of the ratio of diversity between cultivated and wild samples: -log (π cultivated/ π wild). For this log ratio of diversity and differentiation, we retained the most extreme values using a classical threshold of 95% for a unilateral test and a more stringent threshold of 99.5%. This later stronger stringent threshold was used to identify the most likely gene candidates selected during domestication. Loci with higher levels of differentiation (most extreme F ST ) and stronger loss of diversity in the cultivated compared to the wild accessions were considered to be provisionally involved in the domestication process.
Identification of NBS domain, TIR domain, LRR motif and CC motif. All pearl millet proteins were assessed for the presence of NBS domains (PF00931, NB-ARC) using the Hidden Markov Model based method implemented in hmmsearch (version 3.0) 89 with an e-value cutoff = 1. To filter false positive hits, all identified NBS containing proteins were screened against the Pfam-A database. NBS domains that overlapped with other domains identified at lower e-values were filtered out. Likewise, the TIR domain (PF01582) was used as query against all pearl millet proteins with hmmsearch and further checked by looking at the overlapping domains. To detect LRR motifs, predicted NBS encoding proteins were searched against 10 LRR families in LRR clan (CL0022) with an e-value cutoff = 1. All regions predicted as LRR motifs and not overlapping with other domains identified with lower e-values were considered real LRR motifs. The data used in the downstream analysis were controlled with MAF 0.05 and missing rate 0.5. SNPs with a mean depth > 100 and missing rate > 0.5 were removed. The remaining SNPs were used in further analyses. Variants for wild lines that used in population structure and domestication analysis were detected together with PMiGAP accessions and processed with the same strategy (BAM and VCF files available at http://ceg. icrisat.org/ipmgsc/).
The BAM files from each resequenced accession was analyzed by Breakdancer (version 1.1.2) 91 with default parameters to detect structural variation namely, deletions, insertions, inversions, and intra-chromosomal translocations. Breakdancer results of accessions that come from a same line (see Supplementary Table 32) were combined to remove redundancy and to calculate the number and length of the rearrangements.
Using SNP data sets from PMiGAP lines, Haploview software 92 (-maxdistance 250 -minMAF 0.05 -dprime -memory 5096) was used to calculate correlation coefficient (r 2 ) values for LD. The average (r 2 ) values between pairwise distances (bp) were calculated and figures were plotted using R.

Statistical analysis.
Phenotyping data and GWAS analysis. For establishing marker trait associations, 288 test cross hybrids were generated by crossing of PMiGAP lines as pollen parents with a common seed parent ICMA 843-22. These hybrids were grouped by maturity (early, medium early, medium and late) and phenotyped for 20 morphological traits under two drought stress conditions (early and late stress) along with controls (or no stress) for two years (2011,2012). Experiments were conducted in an alpha-lattice designs with two replications in three test environments during Summer 2011 and 2012 (January to May) in the red precision (RP) experimental fields at the ICRISAT, Patancheru, Telengana, India (545 m above mean sea level, 17.53° N latitude and 78.27° E longitude). The early maturity group consisted of lines which had days to 50% flowering (DFF) from 42-52 days; the medium-early maturity group consisted of lines with DFF from 53-57 days; the medium maturity group consisted of entries with DFF from 58-62 days; the late maturity group consisted of lines which recorded more than 62 days for DFF. Early drought stress is a more severe stress imposed by withholding irrigation from about one week before flowering until maturity. Late stress is a less severe drought stress initiated during early grain-filling by withholding irrigation from 50% flowering time till maturity.
The three test environments consisted of early-onset of stress, late-onset stress, and a common, fully-irrigated non-stress treatment. Drought stress was imposed by withholding irrigation from about one week before flowering in early-onset treatment, while drought stress in the late-onset treatment was imposed by withholding irrigation from 50% flowering. Data were recorded for a total of 20 traits namely, grain yield (GYHA), panicle yield (HYHA), panicle harvest index (PHI), time to 75% flowering (TB), plant height (PH), panicle length (EL), panicle diameter (ED), panicle number (HCHA), number of tillers per plant (Till), biomass yield (BM), grain harvest index (HI), thousand grain weight (TGW), grain number per panicle (GNP), grain number per m 2 (GNM2), agronomic score (AgS), stover dry matter fraction (DMF) and vegetative growth index (GI). PH, EL, and ED were measured on the main stems of five representative plants of each entry in a plot at maturity. At harvest, data were recorded from the harvested area on plant population (PCHA), panicle numbers (HCHA) and fresh stover yield (FSWTHA). Effective tiller number (Till) was calculated as the ratio HCHA/PCHA. HYHA, GYHA and TGW were recorded after oven drying for about 24 h. Stover dry matter yield (DMY) was estimated from plot FSWTHA using the fresh and dry weights of a chopped subsample of stover from each plot. BM was calculated as HYHA + DMY on a plot basis. Grain number per panicle (GNP) was derived from primary data as [(GYHA/HCHA)/ (TGW/1000)]. Grain harvest index was calculated as the ratio between grain yield and biomass yield at harvest, and panicle harvest index as the ratio between grain weight and panicle weight. Flowering time was recorded as days from seedling emergence to stigma emergence for 75% of the main shoots in a plot. The traits measured include grain yield (kg/ha), panicle yield (kg/ha), panicle harvest index (%), time to 50% flowering (number of days), plant height (cm), panicle length (cm), panicle diameter (cm), panicle number, tillers per plant, biomass yield (kg/ha), vegetative growth index (kg/ha/day), grain harvest index (%), fresh stover yield (t/ha), stover dry matter yield (kg/ha), stover dry matter fraction, 1000-grain mass (g), grain number per panicle, and grain number per m 2 (Supplementary Data set 2). Analysis of variance for all traits was performed using the PROC MIXED procedure in SAS 9.3 (SAS Institute Inc 2013) with Kenward-Roger degree of freedom approximation method considering replicates and accessions as fixed effects, whereas incomplete blocks within each replication were considered as random effects for combined intra and inter block analysis. Best linear unbiased estimates (BLUEs) were calculated for all accessions.
For GWAS analysis, a total of 3,117,056 SNPs retained after filtering the minor alleles (MAF<0.05) and 20% missing data were used. Marker-trait associations were established using AOV model with a bloc effect for maturity group in R (Phenotype~Bloc+SNP). We tested the suitability of the model by plotting the observed P-values from the association test against an expected (cumulative) probability distribution. These quantile-quantile (q-q) plots clearly indicated that we corrected properly for population stratification (Supplementary Fig. 23). Significance of associations between loci and traits were determined adjusting for multiple testing by using FDR at a 0.001 threshold level and considering p value lower than 10 −10 .
Genomic prediction analysis for testcross performance. Grain yield performance of 259 PMiGAP lines was used for hybrid prediction analysis. In our analysis, flowering time was considered as a cofactor. For genomic prediction analysis, we performed a one-stage phenotypic data analysis on 259 PMiGAP lines as test cross hybrid trials using a linear mixed model that included genotype, flowering time, year, stress, interaction among genotype, stress and year, replication, incomplete block and residual effects. The effect of flowering time was always assumed to be fixed. When estimating variance components, all other effects were assumed to be random. To get the BLUE of each line, we set the genotype effect as fixed.
The All phenotypic data analyses were done using the ASreml-R 3 software 93 .
A total of 2,235,060 SNPs with <20% missing rates were used with above mentioned phenotyping data for genomic prediction analysis. We used the genomic best linear unbiased prediction (G-BLUP) model for genomic selection: y Zg e n = + + 1 m , where y refers to n-dimensional vector of phenotypic records, 1 n is an n-dimensional vector of ones, m is the mean, g is an n-dimensional vector of additive genotypic values and e is an n-dimensional vector of residual terms.
In the model y Zg e n = + + 1 m we assume that m is a fixed parameter, and g, e are random parameters with e N I e ∼ ( , ) 0 2 s and g N G g ∼ ( , ) 0 2 s , where G denotes the n × n genomic relationship matrix. G was calculated as follows: Let X = (x ij ) be the n × p matrix of SNP markers, where x ij equals the number of a chosen allele at the j th locus for the i th genotype. Let p j be the allele frequency of the j th marker. W = (w ij ) is an n × p matrix with w ij = x ij − 2p j .
Then we have Note that when calculating the kinship coefficient for two genotypes, only those markers without missing values in both genotypes were considered. The accuracy of genomic prediction was evaluated by fivefold cross-validation with a total of 100 cross-validation runs. The cross-validated prediction accuracy was calculated as the Pearson product-moment correlation between predicted and observed genotypic values of the lines in the test set. The GBLUP model was implemented using the R software 94 .
Hybrid prediction analysis. Grain yield of 64 pearl millet hybrids grown at five locations in India (Jamnagar, Anand, SK Nagar, Mahuva, Kothara) dur- Kharif and Summer. We adopted randomized block design with a spacing of 60 cm between the rows and 10-15 cm between the plants and adopted standard agronomic practices. The 64 hybrids were generated by crossing 20 male and 23 female lines. By using the grain yield phenotyping data for 64 hybrids as mentioned above, we used the following linear mixed model to estimate the variance components as well as BLUEs: Yield ~Genotype + Replication.
To estimate variance components, all effects were treated as random. The BLUEs for each environment were calculated by the same mixed model but modelling genotype as fixed effect. Repeatability was estimated as R N

Data availability
Policy information about availability of data Data availability statement All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated source data -A description of any restrictions on data availability A full data availability statement is included in the manuscript.

Required accession codes
Data deposition is mandated for certain types of data. Confirm that all relevant data have been deposited into a public repository and that all accession codes are provided.
Accession codes will be available before publication No data with mandated deposition All relevant accession codes are provided Data presentation

Image integrity
Confirm that all images comply with our image integrity policy.
Unprocessed data must be provided upon request. Please double-check figure assembly to ensure that all panels are accurate (e.g. all labels are correct, no inadvertent duplications have occurred during preparation, etc.).

Data distribution
Data should be presented in a format that shows data distribution (dot-plots or box-and-whisker plots), with all box-plot elements (e.g. center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers) defined. If bar graphs are used, the corresponding dot plots must be overlaid.
Confirm that all data presentation meets these requirements.
Confirm that in all cases where the number of data points is <10, individual data points are shown.

Structural data
Policy information about special considerations for specific types of data If this study did not involve data of these types, check here and skip the rest of this section.

Electron microscopy
For all electron microscopy work, confirm that you have deposited any density maps and coordinate data in EMDB.

Macromolecular structures
For all macromolecular structures studied, confirm that you have provided an official validation report from wwPDB.

Code availability
Policy information about availability of computer code

Code availability statement
For all studies using custom code, the Methods section must include a statement under the heading "Code availability" describing how readers can access the code, including any access restrictions.
A full code availability statement is included in the manuscript No custom code used

Research animals
Policy information about studies involving animals; follow the ARRIVE guidelines for reporting animal research If this study did not use animals and/or animal-derived materials for which ethical approval is required, check here and skip the rest of this section.

Ethical compliance
Confirm that you have complied with all relevant ethical regulations and that a statement affirming this is included in the manuscript.

Ethics committee
Confirm that you have stated the name(s) of the board and institution that approved the study protocol in the manuscript.

Human research participants
Policy information about studies involving human research participants If this study did not involve any human research participants, check here and skip the rest of this section.

Ethical compliance
Confirm that you have complied with all relevant ethical regulations and that a statement affirming this is included in the manuscript.

Ethics committee
Confirm that you have stated the name(s) of the board and institution that approved the study protocol in the manuscript.

Informed consent
Confirm that informed consent was obtained from all participants.

Identifiable images
For publication of identifiable images of research participants, confirm that consent to publish was obtained and is noted in the Methods.
Authors must ensure that consent meets the conditions set out in the Nature Research participant release form. For flow cytometry studies, confirm that you have completed the additional flow cytometry module.
For ChIP-seq studies, confirm that you have completed the additional ChIP-seq module.
I certify that all the above information is complete and correct.