An improved draft of the pigeonpea (Cajanus cajan (L.) Millsp.) genome

The first draft of the pigeonpea (Cajanus cajan (L.) Millsp. cv. Asha) genome, with 511 Mb of assembled sequence, has a low genome coverage of about 60%. Here we present an improved version of this genome with 648.2 Mb of assembled sequence for this popular pigeonpea variety, which is preferred by millers and is resistant to fusarium wilt and sterility mosaic diseases. With the addition of 137 Mb of assembled sequence, the present version has the highest genome coverage available for pigeonpea to date. We predicted 56,888 protein-coding genes, of which 54,286 (95.4%) were functionally annotated. In the improved genome assembly we identified 158,432 SSR loci, designed flanking primers for 85,296 of these and validated them in silico by e-PCR. The raw data used to improve the genome assembly are available in the SRA database of NCBI under accession numbers SRR5922904, SRR5922905, SRR5922906 and SRR5922907. The genome sequence update has been deposited at DDBJ/EMBL/GenBank under the accession AFSP00000000, and the version described in this paper is the second version (AFSP02000000).


Specifications
Provides an updated and much improved draft genome assembly of pigeonpea.
Provides genome-wide SSR marker information that can be used to target highly variable regions across pigeonpea varieties and other closely related taxa for breeding applications.

Data
We present an improved draft genome assembly of pigeonpea, which has an estimated total genome size of 858 Mb [1]. Pigeonpea is the fourth most important food legume, and owing to its high protein, mineral and vitamin contents it plays a significant role in the eradication of protein-calorie malnutrition in Asia and Africa [2]. The Illumina HiSeq sequence reads generated in this study have been deposited in the NCBI-SRA database (SRR5922904, SRR5922905, SRR5922906, SRR5922907) and the improved draft genome of pigeonpea has been deposited in the NCBI-WGS database (AFSP02000000). The data presented here include tables and figures describing the different library types of Illumina sequence data (Table 1), the statistics of the improved draft genome assembly (Table 2) and the identified genome-wide SSRs (Fig. 1), along with their flanking primer sequences (Supplementary Table 1).

Plant material, DNA isolation and genome sequencing
High-quality DNA was isolated from the leaves of pigeonpea variety 'Asha' using the CTAB method [3]. The DNA was fragmented to median fragment sizes of 350 bp, 550 bp, 3 kb and 5 kb and used for whole-genome shotgun, paired-end and mate-pair sequencing on the Illumina HiSeq-2000 platform (Illumina, San Diego, CA).

Genome sequencing, de-novo assembly and gene annotation
The Illumina sequence reads were quality-checked using FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and adapter sequences along with poor-quality bases were removed using Trimmomatic v 0.36 [4] (Table 1). The high-quality Illumina reads were assembled de novo using CLC Genomics Workbench version 7.1 (CLC Bio, Aarhus, Denmark, http://www.clcbio.com/). The improved draft version of the assembly (Table 2) was generated with GAM-NGS [5] by merging the first draft assembly, based on 454-GS-FLX sequences [6], with the new Illumina-based draft assembly.
The improved merged draft assembly consists of 360,028 contigs with a total size of 648.2 Mb and covers 75.6% of the genome, which is about 16 percentage points higher than the published first draft genome of pigeonpea [6], and 11% higher than another published draft of pigeonpea [7]. To predict the protein-coding genes, the improved draft assembly was first repeat-masked using RepeatModeler and RepeatMasker [8], followed by ab initio gene prediction using the FGENESH module of the MolQuest v. 4.5 software package (http://www.softberry.com). The predicted genes were annotated by BLASTX (E < 10^-6) [9] search against the NCBI non-redundant (nr) protein database using Blast2GO [10].
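The coverage figures follow from dividing the assembled sequence by the estimated genome size. The short Python sketch below recomputes them from the rounded sizes quoted in the text (511 Mb, 648.2 Mb and 858 Mb); the function and variable names are illustrative only, and because the inputs are rounded the last decimal place may differ slightly from the figures reported here.

```python
# Sizes in Mb as quoted in the text; 858 Mb is the estimated genome size [1].
GENOME_SIZE_MB = 858.0
FIRST_DRAFT_MB = 511.0
IMPROVED_MB = 648.2


def coverage_percent(assembled_mb, genome_mb=GENOME_SIZE_MB):
    """Return assembled sequence as a percentage of the estimated genome size."""
    return 100.0 * assembled_mb / genome_mb


first = coverage_percent(FIRST_DRAFT_MB)
improved = coverage_percent(IMPROVED_MB)
gain = improved - first  # difference in percentage points

print(f"first draft coverage:    {first:.1f}%")
print(f"improved draft coverage: {improved:.1f}%")
print(f"gain: {gain:.1f} percentage points")
```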

Identification of genome wide SSR and designing of PCR primers
The improved draft version was screened for simple sequence repeat (SSR) loci using MISA (http://pgrc.ipk-gatersleben.de/misa/); the output is tabulated and represented graphically in Fig. 1. Primers flanking the SSRs were designed with Primer3 [11], and primer specificity was checked in silico using e-PCR [12]. The complete details of the SSR primers are available in Supplementary Table 1.
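To illustrate what screening a sequence for SSR loci involves, the following Python sketch finds perfect di- to hexanucleotide repeats with a regular expression. It is not MISA and does not reproduce the analysis reported here: the minimum repeat counts in MIN_REPEATS are assumed values chosen for illustration (the thresholds used in this study are not stated in the text), and find_ssrs is a hypothetical helper name.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative only,
# not the thresholds used in this study).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, repeat_count) for perfect SSRs in seq.

    Coordinates are 0-based, with end exclusive.
    """
    seq = seq.upper()
    hits = []
    for unit, min_rep in sorted(min_repeats.items()):
        # One motif of length `unit`, then at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "ATAT" is really the dinucleotide "AT").
            if any(
                motif == motif[:k] * (unit // k)
                for k in range(1, unit)
                if unit % k == 0
            ):
                continue
            count = (m.end() - m.start()) // unit
            hits.append((m.start(), m.end(), motif, count))
    return sorted(hits)


# Toy sequence: an (AT)8 repeat and a (GAA)5 repeat with short flanks.
example = "GGC" + "AT" * 8 + "CCGT" + "GAA" * 5 + "TTC"
for start, end, motif, count in find_ssrs(example):
    print(f"({motif}){count} at {start}-{end}")
```

In a real workflow the flanking sequence on either side of each hit would then be passed to a primer design tool, as described above for Primer3.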

Transparency document. Supporting information
Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.dib.2017.11.066.