De novo genome assembly and annotation of rice sheath rot fungus Sarocladium oryzae reveals genes involved in Helvolic acid and Cerulenin biosynthesis pathways

Background Sheath rot disease caused by Sarocladium oryzae is an emerging threat to rice cultivation worldwide. However, limited genomic resources and limited knowledge of pathogenesis are major obstacles to developing disease management strategies. We therefore sequenced the whole genome of a highly virulent S. oryzae field isolate, Saro-13, at 82x sequence depth.
Results The genome size of S. oryzae was 32.78 Mb, with a contig N50 of 18.07 Kb and 10,526 protein-coding genes. Functional annotation of the protein-coding genes revealed that the S. oryzae genome has many expanded gene families, including the major facilitator superfamily, proteinases, zinc finger proteins, sugar transporters, dehydrogenases/reductases, cytochrome P450, WD domain G-beta repeat and FAD-binding proteins. Gene orthology analysis showed that around 79.80 % of S. oryzae genes were orthologous to genes of other Ascomycetes fungi. The polyketide synthase dehydratase, ATP-binding cassette (ABC) transporter, amine oxidase, and aldehyde dehydrogenase family proteins were duplicated in larger proportion, suggesting adaptive gene duplication in response to varying environmental conditions. Thirty-nine secondary metabolite gene clusters encoded polyketide synthases, nonribosomal peptide synthases, and terpene cyclases. Protein homology based analysis identified nine putative candidate genes involved in the helvolic acid biosynthesis pathway. The genes were arranged in a cluster, and the structural organization of the gene cluster was similar to that of the helvolic acid biosynthesis cluster in Metarhizium anisopliae. Around 9.37 % of S. oryzae genes were identified as pathogenicity genes that have been experimentally proven in other phytopathogenic fungi and are listed in the pathogen-host interaction database. In addition, we report 13,212 simple sequence repeats (SSRs), which can be deployed in pathogen identification and population dynamics studies in the near future.
Conclusions The large set of pathogenicity determinants and putative genes involved in helvolic acid and cerulenin biosynthesis will have broad implications for Sarocladium disease biology. This is the first genome sequencing report for this fungus, and the genomic resources developed in this study will help researchers worldwide to understand the Rice-Sarocladium interaction. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2599-0) contains supplementary material, which is available to authorized users.


Background
Sarocladium oryzae [(Sawada) W. Gams & D. Hawksw] is an Ascomycetes fungus that causes sheath rot disease in rice. It has recently emerged as a major threat to rice production in rice-growing ecosystems worldwide. In addition to rice, this fungus infects other important cereal food crops such as maize, sorghum, pearl millet, finger millet, and foxtail millet [1]. Commonly occurring weed species in rice fields also act as collateral hosts and as a source of natural inoculum in endemic areas [2].
S. oryzae produces white, sparsely branched, septate mycelium. Conidiophores are branched once or twice, with 3-4 phialides in a whorl. The conidia are aseptate, hyaline and cylindrical, and are borne at the tips of the phialides [3]. The conidium germinates and invades rice through the stomata and through wounds caused by insects. The mycelium later grows intercellularly within vascular and mesophyll tissues [4]. The pathogen infects the uppermost leaf sheath enclosing the young panicle; lesion length may range from 1 to 5 cm, and in severe cases the lesion may enlarge to cover the whole flag leaf sheath. The necrotic lesions on the flag leaf sheath retard translocation of nutrients from the foliage to the panicle, leading to complete suppression of panicle exsertion. This results in partially filled, chaffy grains and yield losses ranging from 3 to 85 % [5,6]. Despite the considerable losses caused by this fungus, its life cycle and infection biology have been little studied. Sheath rot symptoms can also be induced by application of Cerulenin; this was demonstrated using Cerulenin-negative mutants, which did not produce rot symptoms [7]. In addition, virulent strains of the fungus are known to secrete proteinases at significantly higher levels than less virulent strains, indicating a possible role of fungal proteinases in plant pathogenicity.
The genomic resources for S. oryzae in public databases (NCBI) are limited to internal transcribed spacer (ITS) region sequences of ribosomal DNA and our previous QTL mapping study [8]. Due to lack of information on genes involved in pathogenicity/virulence, host-pathogen interactions and microsatellites markers, rice-Sarocladium pathosystem has not been studied well at global level. Considering these facts, we sequenced whole genome of highly virulent isolate of S. oryzae (Saro-13) from major rice growing region of South India. This is the first report of de novo genome assembly and annotation of S. oryzae. We carried out detailed analyses of gene families, secondary metabolite gene clusters, pathogenicity related genes, transposable repeat elements, phylogenetic relationship with other fungi and microsatellites. In addition, we analysed putative genes involved in helvolic acid and cerulenin biosynthesis pathways, which are very important in Sarocladium disease biology. The genomic resources generated from this study can be translated into designing better disease management strategies to mitigate sheath rot disease epidemics globally and widen the understanding of rice-Sarocladium pathosystem.

Isolation of fungus and confirmation
Diseased flag leaf sheaths sampled from over 25 locations in the major rice-growing regions of Karnataka state, India, were used for isolation of the fungus. Diseased sheaths were surface sterilized using 0.05 % mercuric chloride solution, followed by three washes with sterile water. Sterilized diseased sheath pieces were incubated at room temperature for 4-5 days, and germinating spores were transferred to potato dextrose agar (PDA) medium. Based on the morphological features of conidiophores, phialides and conidiospores [3], the fungus was identified as S. oryzae. The virulence test of S. oryzae was carried out by standard mycelial inoculation [9] and detached tiller assay. The virulent field isolate Saro-13, isolated from Shrirangapatna (12.401035°N, 76.695754°E), Mandya District, a major rice-growing region in the Cauvery command area, was selected for whole genome sequencing. The internal transcribed spacer (ITS) region was characterized using ITS-4 and ITS-5 primers [10] to confirm the identity of the fungus as S. oryzae.

DNA isolation, Illumina library preparation and sequencing
The virulent strain of S. oryzae (Saro-13) was grown in potato dextrose broth (PDB) medium for three days. The mycelium was ground in liquid nitrogen, and genomic DNA was isolated using the nucleo-pore gDNA fungal and bacterial mini kit (Genaxy, Catalogue# NP-7006D). DNA quality and quantity were assessed by Nanodrop and Qubit (Applied Biosystems), respectively. The genomic DNA was sheared to generate fragments of approximately 400-600 bp in a Covaris microtube with the E220 system (Covaris, Inc., Woburn, MA, USA). The fragment size distribution was checked using an Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA) with the high sensitivity DNA Kit (Agilent Technologies). The fragmented DNA was cleaned up using HighPrep beads (MagBio Genomics Inc, Gaithersburg, Maryland). The Illumina paired-end library was prepared as per the manufacturer's instructions using the NEXTflex DNA sequencing Kit (Catalogue # 5140-02, Bioo Scientific). The paired-end library was sequenced on an Illumina NextSeq 500 at Genotypic Technologies, Bengaluru, with a read length of 151 nts from both ends of each fragment.

Preprocessing of raw sequence reads
Low-quality bases with a Phred score below Q30 (base-call accuracy below 99.9 %) and adapter sequence contamination in the raw Illumina reads were discarded using FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html).
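The relationship between a Phred score and base-call accuracy can be made explicit with a short sketch. The function names are illustrative, and the Phred+33 quality encoding is an assumption (it is the usual encoding for current Illumina FASTQ files, but the source does not state it); this is not the FASTX-Toolkit implementation.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def phred_to_accuracy_percent(q: float) -> float:
    """Base-call accuracy, in percent, implied by Phred score q."""
    return 100 * (1 - phred_to_error_probability(q))


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def low_quality_positions(quality_string: str, threshold: int = 30) -> list[int]:
    """Indices of bases whose Phred score falls below the threshold."""
    scores = decode_phred33(quality_string)
    return [i for i, q in enumerate(scores) if q < threshold]


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q}: accuracy {phred_to_accuracy_percent(q):.2f} %")
```

Under this definition Q30 corresponds to one expected error per 1,000 base calls, and Q40 to one per 10,000.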

Genome assembly and functional annotation
De novo assembly of S. oryzae was performed using the SPAdes assembler [11]. SPAdes corrected sequencing errors in the reads and performed scaffolding to output de novo assembled scaffolds. The assembled scaffolds were screened for sequences of mitochondrial genome contaminants. Gene prediction was performed using Augustus 3.0.3 (--species=fusarium_graminearum --strand=both --genemodel=complete) [12,13]. Functional annotation of genes was done by homology search against Ascomycetes protein sequences of SwissProt (http://www.uniprot.org) using BLASTP with an e-value threshold of 1e-10. The annotation of protein domain structures was performed using InterProScan5 software [14]. The gene ontology (GO) terms were assigned by the KAAS server [15].

Pathogenicity genes in S. oryzae
The fungal pathogenicity genes were retrieved from the Pathogen-Host Interaction (PHI) database [21] (http://www.phi-base.org) and BLASTP was performed against the S. oryzae proteome. Protein alignments with more than 40 % identity and 70 % query coverage were considered as putative pathogenicity genes in S. oryzae.
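The identity and coverage filter described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the BLASTP results were written as tab-separated text with the four columns query ID, subject ID, percent identity and percent query coverage (for example via BLAST+ `-outfmt "6 qseqid sseqid pident qcovs"`), and the sequence names in the usage example are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float        # percent identity of the alignment
    query_coverage: float  # percent of the query covered by the alignment


def parse_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse tab-separated lines: query, subject, identity, query coverage."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        query, subject, identity, coverage = line.split("\t")[:4]
        yield Hit(query, subject, float(identity), float(coverage))


def filter_hits(hits: Iterable[Hit],
                min_identity: float = 40.0,
                min_query_coverage: float = 70.0) -> list[Hit]:
    """Keep hits exceeding both thresholds (strictly more than, as in the text)."""
    return [h for h in hits
            if h.identity > min_identity and h.query_coverage > min_query_coverage]


if __name__ == "__main__":
    # Hypothetical example records, not data from the study.
    example = [
        "query_1\tsubject_A\t55.2\t90",
        "query_2\tsubject_B\t35.0\t95",   # fails the identity threshold
        "query_3\tsubject_C\t80.0\t60",   # fails the coverage threshold
    ]
    for hit in filter_hits(parse_hits(example)):
        print(hit)
```

The same function covers the helvolic acid candidate search if the thresholds are changed, although a "minimum 50 %" criterion would call for `>=` rather than `>`.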

Secondary metabolite gene cluster analysis
The scaffold sequences of S. oryzae were analysed for secondary metabolites gene clusters using antiSMASH [25].

Pathway analysis of helvolic acid and Cerulenin biosynthesis
We retrieved amino acid sequences of putative genes involved in helvolic acid biosynthesis from Aspergillus genome database (AspDB; http://www.aspgd.org) and protein homology search was carried out with S. oryzae genes. The genes with minimum 50 % identity and 70 % query coverage were considered as putative candidates in helvolic acid biosynthesis pathway. In addition, a homology search was also performed against NCBI non-redundant protein database to obtain homologous sequences in closely related fungal species. The protein domain based search was performed to identify putative genes involved in Cerulenin biosynthesis.

Prediction of repeats and simple sequence repeats (SSRs)
The S. oryzae scaffold sequences were subjected to de novo repeat prediction using RepeatMasker [26]. Reference-based repeat analysis was done by comparison with the reference repeat library of RepBase (http://www.girinst.org/repbase/). The whole genome of S. oryzae was analyzed to determine the distribution and frequency of various types of SSRs using the Microsatellite Identification tool (MISA) [27] (http://pgrc.ipk-gatersleben.de/misa/). The minimum number of repeat units was set to 10 for mono-, 6 for di-, and 5 for tri-, tetra-, penta- and hexanucleotide motifs.
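To make the SSR thresholds concrete, the following sketch scans a sequence for perfect microsatellites with a regular expression. It is not MISA; it interprets the thresholds above as minimum repeat counts per motif length (an assumption), ignores compound SSRs, and uses illustrative function names.

```python
import re

# Minimum number of repeat units per motif length (mono ... hexa).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit
    (so 'AA' or 'ACAC' are rejected, 'AC' is accepted)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k)
                   for k in range(1, n))


def find_ssrs(sequence: str, min_repeats: dict[int, int] = MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    found = []
    for unit_len, n_min in min_repeats.items():
        # One captured unit followed by at least (n_min - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n_min - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_primitive(motif):
                repeats = len(match.group(0)) // unit_len
                found.append((match.start(), motif, repeats))
    return sorted(found)


if __name__ == "__main__":
    demo = "TTG" + "A" * 10 + "CCT" + "GA" * 6 + "TTC" + "AAG" * 5 + "C"
    for start, motif, repeats in find_ssrs(demo):
        print(start, motif, repeats)
```

Because the scan is non-overlapping within each motif length, rare edge cases at the boundary between two adjacent repeats may be missed; a dedicated tool should be used for real analyses.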

Genome assembly and annotation
The S. oryzae isolate, Saro-13 was selected for whole genome sequencing based on virulence study, and was confirmed by mycelial morphology, colony characteristics and ITS sequencing. S. oryzae (Saro-13) produced sparsely branched mycelium with orange pigmentation on potato dextrose agar (PDA) medium. The conidium was single-celled, cylindrical and hyaline in structure (Additional file 1). The ribosomal DNA internal transcribed spacer (ITS) region of S. oryzae isolate Saro-13 was sequenced using Sanger sequencing platform. Then, the ITS sequence was analysed by BLASTN to confirm the identity of Saro-13. The top 20 hits with e-value of 0 confirmed the identity of Saro-13 isolate as S. oryzae (Additional file 2).
Saro-13 was isolated from a major rice-growing region in the Cauvery canal-irrigated area of South Karnataka, India. Sequencing was carried out on an Illumina NextSeq 500. A total of 20,963,198 raw reads (paired reads, ~100x depth) were generated, with a read length of 151 nts. Discarding low-quality reads left 17,854,048 reads, corresponding to approximately 82x sequence depth of high-quality data, and these reads were assembled using the SPAdes genome assembler [11]. The assembly produced 5,856 contigs with a total consensus genome size of 32,778,109 bp. The maximum contig size was 89,796 bp and the minimum contig size was 209 bp. The average contig size was 5,597 bp and the contig N50 was 18.07 Kb, indicating an assembly of good quality for downstream analysis (Table 1). The simple gene structures of most fungi facilitate accurate gene prediction. Moreover, the majority of fungal species lack EST data that could be used in the gene prediction process. As a result, gene prediction in fungi relies heavily on either de novo or comparative gene prediction models [28,29]. Ab initio gene prediction using Augustus 3.0.3 revealed that the S. oryzae genome harbors 10,526 protein-coding genes. Of these, 9,658 were annotated against the UniProt fungal protein database and the remaining 868 genes had no significant annotation (Table 1). The average gene length was 1,689 bp, with 294 genes per Mb of genome. This indicates that the S. oryzae genome is gene dense, like those of other fungi. The average distance between genes was 1.08 Kb, and the GC content of the coding regions was 45 %. There were 29,293 exons with a total length of 15.97 Mb. The average exon length was 545.74 bp, with 2.78 exons per gene. Overall, 18,769 introns were present in the 10,526 genes, with a total length of 1.71 Mb and an average intron length of 93.85 bp (Table 2). The average exon and intron lengths and the number of introns per gene in S. oryzae are in concordance with those of other sequenced Ascomycetes fungi such as Neurospora and Magnaporthe [30-32]. The structural uniformity of genes among Ascomycetes fungi may provide a unique opportunity to study their evolution.
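For readers less familiar with the assembly metrics quoted in this section, the sketch below shows how a contig N50 and an approximate sequencing depth are commonly computed. The function names are illustrative, and the depth calculation assumes that the reported read count refers to individual 151-nt reads; under that assumption it reproduces the reported value of roughly 82x.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least half
    of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


def sequencing_depth(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


if __name__ == "__main__":
    # Toy contig set; the study's per-contig lengths are not listed in the text.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
    # Read count, read length and assembly size as reported in this section.
    print(round(sequencing_depth(17_854_048, 151, 32_778_109), 2))
```

The same depth formula applied to the 20,963,198 raw reads gives a value close to 97x, consistent with the "~100x" stated for the raw data.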

Orthology, multigene families and phylogenetic relationship of Ascomycetes fungi
Orthologous genes result from speciation, and a clear delineation of orthologous relationships between species helps to reconstruct the evolution of those species [33]. Moreover, orthology is the most accurate way to identify differences and similarities and to transfer functional gene information from model organisms to uncharacterized, newly sequenced genomes [34]. To predict orthologous genes and gene family duplications among five sequenced Ascomycetes fungi (S. oryzae, M. oryzae, A. chrysogenum, F. graminearum and F. oxysporum), we clustered their proteomes using the orthoMCL tool. The clustering of proteomes resulted in 13,185 ortholog groups (Fig. 2), of which 5,495 were core orthologous groups (COGs) among the Ascomycetes fungi. Among the COGs, 3,246 were single-copy ortholog genes, suggesting that they are putative essential genes. There were 480 orthologous groups, consisting of 1,159 genes, that were duplicated (more than one copy of the gene) in the S. oryzae genome. The largest multigene family encoded polyketide synthase dehydratase (PF14765), followed by ABC transporter (PF00005), ABC transporter transmembrane region (PF00664), Aldehyde dehydrogenase family (PF00171), Fibronectin type III-like domain (PF14310), PA14 domain (PF07691), Copper amine oxidase (PF02727), WSC domain (PF01822), and OPT oligopeptide transporter protein (PF03169). The polyketide synthase dehydratase gene family is known to produce secondary metabolites and is essential for fungal virulence [35,36] during host invasion. The ABC transporters also play a vital role in pathogen virulence [37] by exporting noxious toxins and by supporting survival of the fungus under adverse environmental conditions. The aldehyde dehydrogenase family of proteins are involved in production  [38].
The phylogenetic relationship between S. oryzae and other Ascomycetes fungi was inferred from the protein similarity of one hundred randomly chosen single-copy ortholog genes from the orthoMCL analysis. Based on the WAG model [39] of protein evolution, S. oryzae was most closely related to M. oryzae (causal organism of rice blast disease), followed by A. chrysogenum, F. oxysporum and F. graminearum (Fig. 3). The closer relatedness to Magnaporthe implies a shared gene arsenal required for adaptation to the same host.

To our knowledge, pathogenicity genes/factors have not so far been determined in the S. oryzae genome due to the lack of genomic resources. S. oryzae infects aerial parts of the rice plant, especially the uppermost leaf sheath enclosing the young panicles. To identify putative genes involved in pathogenicity, we analysed the S. oryzae proteome for pathogen-host interaction (PHI) database genes, secretory proteins, carbohydrate-active enzymes (CAZymes), secondary metabolites, transporters, and transcription factors that are required to colonize host tissue.
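The group categories used above (core groups, single-copy core groups, and groups duplicated in S. oryzae) can be illustrated with a small sketch. It assumes a clustering output with one group per line in the form `group_id: species|gene species|gene ...`, which is the layout commonly produced by orthoMCL; the species codes and gene names below are hypothetical placeholders, and this is not the authors' pipeline.

```python
from collections import Counter


def species_counts(group_line: str) -> Counter:
    """Count genes per species in one 'group_id: sp|gene sp|gene ...' line."""
    _, members = group_line.split(":", 1)
    return Counter(member.split("|", 1)[0] for member in members.split())


def summarize_groups(lines, species, target):
    """Tally core, single-copy core, and target-duplicated ortholog groups."""
    summary = {"groups": 0, "core": 0, "single_copy": 0, "duplicated_in_target": 0}
    for line in lines:
        if not line.strip():
            continue
        counts = species_counts(line)
        summary["groups"] += 1
        # Core group: every species contributes at least one gene.
        if all(counts[s] >= 1 for s in species):
            summary["core"] += 1
            # Single-copy core group: exactly one gene from every species.
            if all(counts[s] == 1 for s in species):
                summary["single_copy"] += 1
        # Duplicated in the target species: more than one copy of the gene.
        if counts[target] > 1:
            summary["duplicated_in_target"] += 1
    return summary


if __name__ == "__main__":
    SPECIES = ["sory", "mory", "achr", "fgra", "foxy"]  # hypothetical codes
    example = [
        "grp1: sory|g1 mory|g1 achr|g1 fgra|g1 foxy|g1",
        "grp2: sory|g2 sory|g3 mory|g2 achr|g2 fgra|g2 foxy|g2",
        "grp3: sory|g4 sory|g5",
        "grp4: mory|g3 fgra|g3",
    ]
    print(summarize_groups(example, SPECIES, target="sory"))
```

Whether a group that contains only paralogs of a single species counts as "duplicated" is a design choice; the sketch counts it, since the text defines duplication simply as more than one copy of the gene.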

a. Putative Pathogen-Host Interaction (PHI) genes
The PHI database is a collection of experimentally verified virulence-associated genes from fungi, oomycetes and bacteria [21]. All 10,526 protein sequences of S. oryzae were aligned to PHI fungal genes using BLASTP (e-value 1e-10). We identified 953 (9.06 % of total genes of S. oryzae) putative PHI genes in S. oryzae, with homologs spanning 59 different fungal species. The highest number of homologs was found in Fusarium graminearum (483 genes), followed by Magnaporthe oryzae (145 genes), Aspergillus fumigatus (66 genes), Candida albicans (36 genes), Botrytis cinerea (20 genes), Cryptococcus neoformans (18 genes), Fusarium oxysporum (18 genes), and other fungal species (167 genes) (Additional files 3 and 4). We assume that these genes are putative candidate pathogenicity determinants in S. oryzae, as their role in pathogenesis has already been proven in their respective host species (cross-species pathogenicity) [40]. These preliminary results pave the way for future work to dissect pathogenicity genes in S. oryzae.

b. Secretory proteins
The secretome analysis of S. oryzae proteome revealed 391 proteins harboring signal peptides (SPs) (Additional file 5). The aspartyl protease domain (Asp) containing secretory proteins were enriched in the S. oryzae genome and are mainly involved in proteolytic activity (hydrolysis of peptide bonds). Another class of secretory proteins like Tyrosinase is known to be involved in melanin production. Other important domain containing secretory proteins like hydrophobic surface binding protein A (HsbA), cupin, fungal hydrophobin, and lipase were enriched in the S. oryzae genome (Additional file 6).

c. CAZymes
Carbohydrate-Active Enzymes (CAZymes) play a vital role in metabolism of structural components of cell wall and storage glucans in plant pathogens.

d. Transcription factors (TFs)
Transcription factors (TFs) play a vital role in signal transduction pathways by acting as a link between signal flow and target gene expression. Mining the specific repertoire of TFs in a genome gives an overview of the active pathways in that genome [41]. Around 351 (3.34 % of total genes) protein sequences encoded 7 different classes of TFs in S. oryzae (Table 4). Among the seven classes of TFs, Zn2Cys6 (288 genes) was the most abundant, followed by C2H2 zinc finger (38 genes), bZIP (14 genes), heteromeric CCAAT type (7 genes), MADS-box (2 genes), Myb (1 gene), and Grainyhead/CP2 (1 gene). A similar distribution of Zn2Cys6 TFs in Ascomycota was reported by Todd and co-workers [42]. These TFs have multiple functions in fungi, controlling cellular processes such as fungal fitness, sugar and amino acid metabolism, gluconeogenesis and respiration, vitamin synthesis, chromatin remodeling, nitrogen utilization and response to drugs and stress [43]. These TFs are also known to control the calcineurin signaling pathway, which is important for fungal pathogenicity. It is reported  [44] and Botrytis cinerea [45] affected the formation of infection structure resulting in reduced pathogenicity. Another major class of TFs is the C2H2 zinc finger, one of the most common DNA-binding motifs; around 38 genes contained this motif in S. oryzae. The basic leucine zipper (bZIP) domain-containing TFs form the third largest family in the S. oryzae genome and are known to regulate cellular growth and differentiation. There were 14 genes encoding bZIP TFs in S. oryzae. Deletion mutants of these TFs showed defects in mycelial growth and development and reduced pathogenicity in the Magnaporthe pathosystem [46]. The repertoire of TFs indicates that the S. oryzae genome harbors diverse classes of TFs required for activation of most fungal pathogenicity genes.

e. Cytochrome P450 enzymes and membrane transporters
The cytochrome P450 enzymes in fungi carry out a wide range of bioconversions of complex polyaromatic hydrocarbons (PAHs) and steroid compounds mediated by monooxygenase enzymes [47]. There were 93 genes distributed across 82 various cyp gene families [48] in S. oryzae based on fungal cytochrome P450 database (FCPD) (Additional files 8 and 9). Genes encoding for plasma membrane transporters will help in assimilating the products degraded by CAZymes. The protein family classification of S. oryzae proteome revealed 212 genes encoding for major facilitator superfamily (MFS) and 120 genes encoding for sugar and other transporters. As compared to other gene families, MFS membrane transporters were high indicating their role in transporting small solutes in response to chemiosmotic ion gradients during pathogenesis.

f. Pathway analysis of helvolic acid and cerulenin secondary metabolites production
Secondary metabolites (SMs) are small bioactive molecules that are essential for fungal growth and development. At the same time, SMs provide protection against various environmental stresses. The biosynthesis of SMs is catalyzed by nonribosomal peptide synthases (NRPSs), polyketide synthases (PKSs), hybrid NRPS-PKS enzymes, prenyltransferases (DMATSs), and terpene cyclases (TCs). The catalytic activity of these enzymes results in the production of, respectively, nonribosomal peptides, polyketides, NRPS-PKS hybrids, indole alkaloids, and terpenes [49]. Searching for SM gene clusters revealed that the S. oryzae genome is enriched with PKSs and TCs, followed by NRPSs and NRPS-PKS hybrid clusters (Fig. 4). Several studies have reported that S. oryzae produces the SMs helvolic acid and cerulenin [50-53].
Fig. 4 The secondary metabolite gene clusters in S. oryzae. NRPSs: nonribosomal peptide synthases; PKSs: polyketide synthases; TCs: terpene cyclases; DMATSs: prenyltransferases
The biosynthetic pathways of these SMs were found to be different, and concomitant production of the two metabolites might have a synergistic effect in host invasion by changing cell permeability, leading to leakage of electrolytes in the host tissue [52,54-56]. So far, studies on helvolic acid and cerulenin have been restricted to chromatographic assays, and gene- and protein-level information on the pathways involved in their metabolism is unknown in S. oryzae. Based on our SM analysis, we hypothesize that PKSs, TCs and NRPSs could be the putative enzymes involved in the biosynthesis of these two metabolites in S. oryzae. We critically examined the proteome of S. oryzae to screen for candidate genes involved in the biosynthesis of these SMs. Helvolic acid is a steroidal antibiotic whose production is known to be controlled by a cluster of genes in Aspergillus flavus [57] and Metarhizium anisopliae [58].
Initial BLASTP searches of the S. oryzae proteome against A. flavus protein sequences identified nine candidate genes in S. oryzae. The structural analysis showed that these genes were single-exon genes arranged in clusters. Among these genes, four (SoG_03551.T1, SoG_04319.T1, SoG_09546.T1, and SoG_03005.T1) encoded cytochrome P450, two (SoG_03552.T1 and SoG_03554.T1) encoded transferase family proteins, and one each encoded a short chain dehydrogenase (SDR) (SoG_04320.T1), a squalene-hopene cyclase (SoG_05635.T1), and a 3-ketosteroid-delta-1-dehydrogenase (SoG_03553.T1) (Fig. 5 and Additional file 10). The structural arrangement of the gene clusters was more similar to that of Metarhizium anisopliae strain NwlB-02 (NCBI Locus ID: 129929) than to that of A. flavus.
Another important SM produced by S. oryzae is Cerulenin, whose biosynthesis is closely related to fatty acid synthesis [56]. The structure of Cerulenin, (2S,3R)-2,3-epoxy-4-oxo-7,10-dodecadienoyl amide, was determined by mass and NMR spectroscopic methods [59]. We examined the enzymes of the dodecanoic acid pathway within fatty acid biosynthesis. Six enzymes (FabD, FabB, FabF, FabG, FabA and FabZ) are involved in the biosynthesis of trans-dodeca-2-enoyl-[acp], an intermediate of the dodecanoic acid pathway (Additional file 11). The major protein domains of these enzymes are acyltransferase, oxidoreductase, and lyase domains. We identified putative candidate genes involved in Cerulenin biosynthesis based on protein domain annotation. There were 97 short chain dehydrogenase (SDR), 24 enoyl-(acyl carrier protein) reductase, 12 acyltransferase, seven beta-ketoacyl synthase, and 25 oxidoreductase genes in the S. oryzae genome. These genes are of future interest for understanding Cerulenin biosynthesis, since Cerulenin is mainly used as an antifungal antibiotic and anticancer agent that inhibits fatty acid and steroid biosynthesis [60,61]. Knowledge of these pathway genes can be utilized for therapeutic and industrial purposes, for example by exploring genetic engineering approaches to convert a pathogenic strain into a non-pathogenic strain for commercial use.

Repetitive DNA content
Repetitive DNA is an integral part of fungal genomes. Repeat sequences play a vital role in generating genetic diversity, genome expansion and might also be detrimental  (Fig. 6). Among 2,349 dinucleotide microsatellites, the 'GA/TC' type (29.54 %) was the most enriched in the genome, followed by the 'AG/CT' type (27.42 %) and the 'AC/GT' type (18.09 %). The 'CG/CG' type dinucleotide microsatellites were present in the lowest proportion (0.47 %). Among trinucleotide microsatellites (969), around 8.18 %, 7.52 % and 7.47 % of SSRs were of the 'AAG/CTT', 'GAA/TTC' and 'CTG/GAG' types, respectively. The 'CTA/TAG' type was the least frequent (0.25 %) in the genome of S. oryzae. Tetranucleotide SSRs were poorly represented (3.16 %) in the S. oryzae genome. The most frequent tetranucleotide repeats were of the 'TTTC/GAAA' type, followed by 'AAGA/TCTT', 'AAAG/CTTT', and 'TCCA/TGGA'. The overall analysis showed that the relative abundance of tetra, penta and hexa SSR types was low compared to that of mono, di and tri SSR types in the S. oryzae genome. Similar observations were made in other Ascomycetes fungi such as A. nidulans, S. cerevisiae, F. graminearum, M. oryzae, and N. crassa [66]. Hence, the SSRs identified in this study will be of considerable value for studying population diversity, evolutionary patterns and the virulence pattern of S. oryzae in rice-growing regions worldwide.

Conclusions
Rice sheath rot disease caused by S. oryzae is an emerging disease in rice-growing regions. The lack of genomic resources for S. oryzae motivated us to take up this sequencing effort and report the first draft genome of S. oryzae. Whole genome sequencing and de novo assembly revealed a genome size of 32.78 Mb. The genome of this fungus codes for 10,526 proteins based on an ab initio gene prediction algorithm. Furthermore, functional annotation of proteins showed that 73.23 % of the total genes were distributed across 2,820 protein families. The gene ontology annotation showed that 12.21, 39.1 and 47.33 % of genes were involved in biological processes, cellular components and molecular functions, respectively. Comparative orthology studies revealed that 8,400 genes were orthologous to genes of other Ascomycetes fungi and the remaining 2,126 genes were unique to S. oryzae. Multigene families such as polyketide synthases, ABC transporters and other pathogenicity-related genes were distributed across 480 orthologous groups. The expansion of these gene families through natural selection suggests a survival advantage for this pathogen in acclimatizing to diverse environmental conditions. The overall analysis showed that S. oryzae has large sets of pathogenicity-related genes encoding secreted effectors, proteinases, secondary metabolism enzymes, transporters, carbohydrate-active enzymes, cytochrome P450 enzymes and transcription factors. This diversification and maintenance of a large arsenal of diverse virulence factors may be required for S. oryzae to colonize a wider range of host species. More interestingly, helvolic acid biosynthesis pathway genes were found in a single cluster encoding cytochrome P450 monooxygenase, transferase, short chain dehydrogenase (SDR), squalene-hopene cyclase, and 3-ketosteroid-delta-1-dehydrogenase.
Genome-wide identification of microsatellites revealed that around 43.71 % of SSRs were di, tri and tetra types, which could be explored in pathogen identification and population dynamic studies. Prior to elucidation of this draft genome sequence, very little was known about molecular mechanisms involved in pathogenicity and research in this area was limited to metabolite studies. Indeed, the availability of this genome in the public domain from our sequencing effort will now allow the researchers to carry out accelerated and rational experiments to dissect Rice-Sarocladium interaction that may help to articulate better disease control measures.

Data availability
The genome assembly/contigs are deposited in the NCBI/DDBJ/GenBank genome database under the accession number LOPT01000000. The raw sequence reads are deposited in the NCBI SRA database under accession number SRX1639538.
Distribution of mono, di and tri repeat motifs in S. oryzae. Inner, middle and outer circles represent mono, di and tri repeat types, respectively