RicePilaf: a post-GWAS/QTL dashboard to integrate pangenomic, coexpression, regulatory, epigenomic, ontology, pathway, and text-mining information to provide functional insights into rice QTLs and GWAS loci

Abstract Background As the number of genome-wide association study (GWAS) and quantitative trait locus (QTL) mappings in rice continues to grow, so does the already long list of genomic loci associated with important agronomic traits. Typically, loci implicated by GWAS/QTL analysis contain tens to hundreds to thousands of single-nucleotide polymorphisms (SNPs)/genes, not all of which are causal and many of which are in noncoding regions. Unraveling the biological mechanisms that tie the GWAS regions and QTLs to the trait of interest is challenging, especially since it requires collating functional genomics information about the loci from multiple, disparate data sources. Results We present RicePilaf, a web app for post-GWAS/QTL analysis that performs a slew of novel bioinformatics analyses to cross-reference GWAS results and QTL mappings with a host of publicly available rice databases. In particular, it integrates (i) pangenomic information from high-quality genome builds of multiple rice varieties, (ii) coexpression information from genome-scale coexpression networks, (iii) ontology and pathway information, (iv) regulatory information from rice transcription factor databases, (v) epigenomic information from multiple high-throughput epigenetic experiments, and (vi) text-mining information extracted from scientific abstracts linking genes and traits. We demonstrate the utility of RicePilaf by applying it to analyze GWAS peaks of preharvest sprouting and genes underlying yield-under-drought QTLs. Conclusions RicePilaf enables rice scientists and breeders to shed functional light on their GWAS regions and QTLs, and it provides them with a means to prioritize SNPs/genes for further experiments. The source code, a Docker image, and a demo version of RicePilaf are publicly available at https://github.com/bioinfodlsu/rice-pilaf.


Background
Rice is a global food staple feeding half of humanity. To address the dual concerns of meeting the demands of a growing world population while minimizing contribution to climate change, scientists and breeders are continually seeking genetic sources for high-yield, sustainable, and robust rice varieties [1]. Paramount to this task are the identification and elucidation of the genetic and molecular basis of agronomically important traits.
The biological interpretation of statistical loci-trait associations is challenging for a number of reasons. First, a typical GWAS can implicate hundreds to thousands of SNPs due to high linkage disequilibrium (LD); estimates of LD extend to hundreds of kilobases for some rice populations [15, 16]. QTL mappings and GWAS peaks likewise may have tens to hundreds of underlying genes. Not all of the SNPs/genes in the reported loci will be causal, requiring a mechanism to narrow down the candidate list. Next, complex traits are likely influenced by multiple SNPs/genes, which individually are only able to explain a small amount of variation. Teasing out biological meaning therefore requires taking into account coexpression and coregulation patterns of groups of genes.
Furthermore, there might be many associated SNPs that are in intergenic and noncoding regions, given that a majority of SNPs in genotyping assays and genotype databases do not lie inside gene models [17, 18]. This requires incorporating regulatory information in the post-GWAS analysis. Lastly, since SNP genotypes are typically called against the Nipponbare reference, GWAS/QTLs are typically reported only in Nipponbare coordinates, even though several high-quality genome assemblies of a variety of accessions are now available. Given that a large number of between-population genomic variations have been reported [14], a pangenomic view of gene sets implicated by GWAS/QTL mapping is necessary.
Thus, the post-GWAS/QTL mapping task of prioritizing genes or identifying biological mechanisms that link them to the phenotype requires the integration of GWAS/QTL mapping results with genomic information from a host of other data sources. While tools for computational post-GWAS analysis have been reported for other species (e.g., [19, 20]), there are a limited number of tools dedicated to rice, one of which is Rice Galaxy [21] (now folded into CropGalaxy [22]), which utilizes genome position information and lift-over across a few rice genomes.
Here we report RicePilaf, a web app for post-GWAS/QTL analysis that integrates rice GWAS/QTL mapping results with pangenomic, coexpression, epigenomic, ontology, pathway, regulatory, and literature-mining information coming from various data sources and produces interactive web reports allowing users to, for example, search and sort data tables, move and click network nodes to display information on pertinent genes, and download the analysis results in text format (Fig. 1). It is built using the Python-based Dash and Flask frameworks. All dependencies are bundled into a Docker image; hence, it works on any of the major operating systems. It can be run locally on a web browser or provided as a web service. It is free and open source. Further details of the software are provided in the Availability of Source Code and Requirements section. We demonstrate the utility of the software using recent GWAS/QTL analysis on 2 key traits connected to modern rice cultivation: yield under drought and preharvest sprouting.

RicePilaf overview
RicePilaf takes as input a set of genomic intervals (Fig. 2), obtained from a QTL analysis or from clumping of LD-linked statistically significant SNPs from a GWAS, for example, computed by the LD-clumping procedure of PLINK [23]. It performs a series of novel bioinformatics analyses on the input intervals, which we overview here and describe in more detail in the Methods section.
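To make the input concrete, the following is a minimal sketch of how a list of genomic intervals might be parsed before any downstream analysis. The semicolon-separated `Chrom:start-end` syntax, the `Interval` type, and the `parse_intervals` function are assumptions made for illustration and are not taken from the RicePilaf source code; the coordinates are the preharvest sprouting loci discussed in the use case.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


# Assumed syntax: "Chr01:1523625-1770814", several intervals separated by ";".
_INTERVAL_RE = re.compile(r"^(\w+):(\d+)-(\d+)$")


def parse_intervals(text: str) -> list[Interval]:
    """Parse semicolon-separated 'Chrom:start-end' strings into Interval tuples."""
    intervals = []
    for token in text.split(";"):
        token = token.strip()
        if not token:
            continue  # tolerate a trailing separator
        match = _INTERVAL_RE.match(token)
        if match is None:
            raise ValueError(f"Malformed interval: {token!r}")
        chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
        if start > end:
            raise ValueError(f"Start exceeds end in {token!r}")
        intervals.append(Interval(chrom, start, end))
    return intervals


loci = parse_intervals("Chr01:1523625-1770814;Chr04:4662701-4670717")
for locus in loci:
    print(locus.chrom, locus.start, locus.end)
```

Validating the intervals up front means that later steps (gene retrieval, lift-over, enrichment) can assume well-formed coordinates.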

Gene list
RicePilaf begins by retrieving the gene models overlapping the input intervals in the Nipponbare reference. For each gene model, it provides the following: (i) gene description and orthology information obtained from the Rice Gene Index (RGI) [24]; (ii) protein, protein domains, and protein family information from UniProt [25], InterPro [26], and Pfam [27], obtained by automated queries using PyRice [28]; and (iii) scientific literature associating the gene to traits, obtained from QTARO [29] and our in-house text-mined dataset.
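The retrieval of gene models overlapping an interval can be sketched as a simple coordinate comparison. The snippet below is illustrative only: the `Gene` type, the `genes_overlapping` function, and the gene records are hypothetical, coordinates are assumed to be 1-based and inclusive, and a linear scan stands in for whatever indexed lookup the application actually uses.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    start: int
    end: int


def genes_overlapping(genes, chrom, start, end):
    """Return genes sharing at least one base with the closed interval [start, end]."""
    return [
        g for g in genes
        if g.chrom == chrom and g.start <= end and g.end >= start
    ]


# Hypothetical gene models, not real annotations.
genes = [
    Gene("geneA", "Chr04", 4660000, 4662800),  # straddles the left boundary
    Gene("geneB", "Chr04", 4665000, 4668000),  # fully inside the interval
    Gene("geneC", "Chr04", 4671000, 4675000),  # outside, to the right
    Gene("geneD", "Chr01", 4665000, 4668000),  # same coordinates, other chromosome
]

hits = genes_overlapping(genes, "Chr04", 4662701, 4670717)
print([g.gene_id for g in hits])
```

Note that a gene only partially inside the interval (such as `geneA` here) still counts as overlapping under this definition.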

Lift-over
Nipponbare serves as the gold-standard reference genome sequence and genomic coordinate system. Given that genotype calls are made against the Nipponbare reference, GWAS/QTL mapping results are reported in Nipponbare coordinates. Recently, high-quality genome builds of several accessions have been published [30-32]. Comparative genomics has revealed an abundance of duplications, deletions, insertions, inversions, and translocations during the evolution of rice [14, 33, 34].
For GWAS/QTL analysis on populations that are not derived from or do not include Nipponbare, by relying only on its genome and its annotation, we will likely miss genes and regulatory features linked to the phenotype of interest. Examples of genes of agronomic importance that are absent in Nipponbare (but present in indica or aus) include Sub1 [35].
RicePilaf can lift over the intervals in the Nipponbare reference coordinates to several other recently published genomes, representing major rice populations. Using the RGI [24] database, it retrieves the genes overlapping the lifted-over intervals and their orthologs. This pangenomic view of gene sets may be useful if, for example, the GWAS/QTL mapping is on an accession that is closer to a genome other than Nipponbare (Fig. 3).
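The comparison of gene sets across genomes can be illustrated with a small sketch. It assumes that a table mapping each target-genome gene to its Nipponbare ortholog is available (as orthology resources such as the RGI provide); the function name, the three-way classification, and all gene identifiers below are hypothetical and chosen only for illustration.

```python
def classify_target_genes(ref_interval_genes, target_genes, ortholog_of):
    """Classify genes found in a lifted-over interval of a target genome.

    ref_interval_genes: set of reference gene IDs inside the original interval.
    target_genes: iterable of target-genome gene IDs in the lifted-over interval.
    ortholog_of: dict mapping a target gene ID to its reference ortholog ID;
                 genes without a known ortholog are simply absent.
    """
    common, elsewhere, no_ortholog = set(), set(), set()
    for gene in target_genes:
        ref = ortholog_of.get(gene)
        if ref is None:
            no_ortholog.add(gene)      # no reference counterpart at all
        elif ref in ref_interval_genes:
            common.add(gene)           # ortholog lies in the original interval
        else:
            elsewhere.add(gene)        # ortholog exists, but outside the interval
    return common, elsewhere, no_ortholog


# Hypothetical identifiers.
ref_genes = {"ref1", "ref2", "ref3"}
target = ["tgt1", "tgt2", "tgt3", "tgt4"]
orthologs = {"tgt1": "ref1", "tgt2": "ref3", "tgt3": "ref9"}

common, elsewhere, no_ortholog = classify_target_genes(ref_genes, target, orthologs)
print(sorted(common), sorted(elsewhere), sorted(no_ortholog))
```

Separating genes whose ortholog lies outside the original interval from genes with no ortholog at all is a design choice of this sketch; both groups are candidates that an analysis restricted to the reference interval would miss.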

Coexpression network analysis
Complex traits are influenced by hundreds of SNPs/genes that individually are only able to explain a small amount of variation in the trait. Genes with the same or similar biological functions or involved in the same pathway are likely to be also coexpressed [36-38]. Coexpression networks provide a means to identify sets of genes acting together to produce a trait. An additional benefit of a coexpression network is that for genes with poor annotations or unknown functions, their membership in a dense subnetwork containing well-characterized genes might be a way to uncover incomplete functional information. Furthermore, coexpression networks have been used for post-GWAS analysis in a number of plants and animals [20, 39, 40].
To identify genes that may be acting collectively to result in a trait, RicePilaf searches rice coexpression networks, RiceNet v2 [41] and RCRN [42], for modules (communities or clusters) of genes that are statistically enriched in the genes overlapping the input intervals. Functional characterization of the modules is done via enrichment analysis against several ontology and pathway databases from agriGO [43], KEGG [44], and Oryzabase [45] (Fig. 4).

Regulatory feature enrichment
A majority of rice SNPs in genotyping assays and genotype databases are located in noncoding regions [14, 17, 18]. For example, the 3,000 Rice Genomes Project found twice as many SNPs in intergenic regions as within genes [46]. Unsurprisingly, GWAS/QTL mappings also report many noncoding trait-associated variants. It is likely that these influence the activity of regulatory elements. One possible causal link is that variants could alter transcription factor binding affinity, leading to changes in the expression of target genes, ultimately resulting in phenotypic variation [47]. Post-GWAS tools for human data that overlap GWAS results with regulatory features such as transcription factor (TF) binding sites, chromatin accessibility, and histone marks [48, 49] have been previously reported.
To investigate variants that might be affecting the binding activity of transcription factors, RicePilaf searches for transcription factors whose known/predicted binding sites provided by PlantRegMap [50] significantly overlap with the input intervals.

Text mining
The gene list provided for the coexpression and regulatory enrichment analyses can be supplemented by genes retrieved from the pangenome lift-over and from querying our in-house dataset obtained by text mining PubMed abstracts on rice gene-trait associations (Fig. 5). Additionally, the same text-mining dataset is used to find scientific literature related to the genes overlapping the input interval.

Epigenomic information
For traits that are tissue specific, it may be desirable to deprioritize genes whose epigenetic markers suggest transcriptional inactivity. Using the embeddable Integrative Genomics Viewer [51], RicePilaf displays selected BED files obtained from the RiceENCODE database [52], which contains tissue-specific chromatin accessibility, histone modification, and DNA methylation data, among others, obtained from high-throughput sequencing experiments.

Summary of results
The summary of results of the various analyses performed by a user is integrated into an interactive table and displayed on a dedicated summary page. The rows of the table correspond to the candidate genes, and the columns contain summary statistics such as the number of pathways containing the gene, the number of pathways or ontology terms associated with the coexpression network cluster to which the gene belongs, and the number of articles in which the gene was found based on our text-mining results. The rows of the table can be sorted based on multiple columns, allowing the user to prioritize the genes based on criteria they deem important.
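Multi-column sorting of such a summary table can be sketched in a few lines. The column names, gene identifiers, and counts below are hypothetical, and the `prioritize` function is an illustration of the idea rather than the application's own table logic.

```python
# Hypothetical summary rows: one per candidate gene.
rows = [
    {"gene": "geneA", "pathways": 2, "module_terms": 5, "articles": 0},
    {"gene": "geneB", "pathways": 2, "module_terms": 9, "articles": 1},
    {"gene": "geneC", "pathways": 0, "module_terms": 9, "articles": 4},
    {"gene": "geneD", "pathways": 2, "module_terms": 9, "articles": 3},
]


def prioritize(rows, columns):
    """Sort rows in descending order by several numeric columns.

    Earlier columns take precedence; later columns only break ties.
    """
    return sorted(rows, key=lambda row: tuple(-row[col] for col in columns))


ranked = prioritize(rows, ["pathways", "module_terms", "articles"])
print([row["gene"] for row in ranked])
```

Changing the order of `columns` changes which criterion dominates, which mirrors how a user might weigh pathway membership against literature support.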

Demonstration of use case
We demonstrate the functionality and features of RicePilaf by applying it to a QTL analysis on yield under drought and a recent GWAS on preharvest sprouting.

Candidate genes underlying yield-under-drought QTLs in rice
We utilized RicePilaf to examine the genes underlying a large-effect QTL for improved rice yield under drought (qDTY12.1). Dixit et al. [53] undertook in silico characterization and quantitative PCR (qPCR) analyses of 53 intra-QTL candidate gene models underlying qDTY12.1 [54]. Candidate genes were based on the annotation of the Nipponbare reference genome, which was the best-annotated genome at that time.
RicePilaf includes genomes from the circum-Aus N22 and circum-Basmati ARC 10497 subpopulations, which are more closely related to the rice varieties used as donors to qDTY12.1, and hence we anticipate the identification of novel genes in the QTL that are not found in the Nipponbare reference. Using the physical interval of qDTY12.1 in the Nipponbare reference genome (Chr12:15121175-18184336 bp, estimated from the physical positions of simple sequence repeats or SSRs used in the QTL mapping by Dixit et al. [53] and Mishra et al. [55]), a lift-over analysis was conducted against the N22 and ARC 10497 genomes. In total, 389 genes were found in this QTL interval in Nipponbare and 142 genes in the lift-over regions in N22. Among these, 109 genes are common to the Nipponbare and N22 cultivars, while 28 genes are unique to N22 (Table 1). For the ARC 10497 lift-over, 142 genes are in the ARC 10497 lift-over regions, with 111 genes common to Nipponbare and ARC 10497. Twenty-four genes are unique to ARC 10497 (Table 2).
Coexpression network analysis was conducted using the intra-QTL genes from qDTY12.1 and run with RiceNet v2 as the coexpression network, ClusterONE as the module detection algorithm, and 0.3 as the minimum cluster density for module detection.
From these analyses done using RicePilaf, a targeted set of novel candidates was identified from the 2 reference genomes of subpopulations that are more closely related to the QTL donor varieties; these candidates were not reported at the time of the initial studies. Two additional candidate genes from the interacting QTL qDTY2.3 were also identified. These additional candidates can be used in future studies to further understand the mechanism of yield-under-drought stress across various drought-tolerant rice varieties and drought-tolerance QTL interactions at the gene level.

Post-GWAS analysis of preharvest sprouting
Preharvest sprouting (PHS) is a condition in which seeds lose dormancy and germinate prior to harvest, thus negatively affecting grain yield and quality [56]. A recent GWAS on PHS using a panel of 277 accessions representing temperate and tropical japonica and indica populations found the loci Chr01:1523625-1770814 and Chr04:4662701-4670717 to be significantly associated with this trait [12].

Gene list and lift-over
We lifted over the PHS loci to the indica IR64 genome since the PHS GWAS by Lee et al. [12] contains indica accessions. We found 36 gene models overlapping the PHS loci in Nipponbare, of which 22 had orthologs in the corresponding IR64 intervals. Interestingly, of the genes unique to IR64, there were 3 whose Nipponbare orthologs were not contained in the original Nipponbare intervals (Table 3). These genes were not considered in the PHS GWAS by Lee et al. [12], demonstrating the benefit of lift-over. We added them to the 36 Nipponbare genes for further analysis.

Coexpression network analysis
Out of 2,608 modules found by running ClusterONE on RiceNet v2 with the minimum cluster density set to 0.3, we found 39 modules that were enriched in the genes obtained in the previous step (adjusted P < 0.05). Without the 3 additional genes, there were 36 modules, further emphasizing the importance of RicePilaf's lift-over feature. These 36 modules provide a narrower list of candidate genes possibly involved in PHS that could be experimentally tested.
Among the top 3 enriched modules, namely modules 690 (adjusted P = 0.001214), 2425 (adjusted P = 0.001214), and 901 (adjusted P = 0.001405), common enriched gene ontology terms include tryptophan biosynthetic process and serine-type carboxypeptidase (SCP) activity. Tryptophan has been reported to impact seed dormancy and PHS in wheat [57], and some SCPs and SCP-like proteins are known to be involved in the regulation of seed germination in rice and other crops [58-60]. In module 690, the phytohormone jasmonic acid, which, along with its ...
In module 901, enriched gene ontology terms include the activities of β-glucosidase and β-amylase; certain genes belonging to these classes have been reported to be upregulated during pregermination and early germination, presumably for their role in starch degradation [64], which is consistent with starch and sucrose metabolism also appearing as an enriched pathway. Another pathway of interest in this module is the biosynthesis of various secondary plant metabolites such as coumarin, whose ability to inhibit abscisic acid catabolism has been used to block PHS and vivipary in rice [65].
Except for the starch and sucrose metabolism pathway, these aforementioned ontology terms and pathways were not reported in the PHS GWAS by Lee et al. [12], showing how RicePilaf's coexpression network analysis can provide further functional insights into rice GWAS loci. The genes in the discovered modules (as in Fig. 8) can also be investigated experimentally for possible involvement in PHS and related biological processes.
To demonstrate the utility of RicePilaf providing multiple module detection algorithms in enriching post-GWAS analysis, we explored the enriched modules when the algorithm was set to FOX instead (with the weighted community clustering metric set to 0.05). Compared to when only the 36 gene models overlapping the PHS loci in Nipponbare were considered, the inclusion of the 3 Nipponbare orthologs from lift-over (Table 3) resulted in 10 additional enriched modules; in total, out of 4,416 discovered modules, 34 modules were found to be enriched.
Among the top 5 enriched modules, modules 1093 (adjusted P = 0.02398) and 1331 (adjusted P = 0.02398) in particular are enriched in germination- and growth-associated ontology terms related to the activity of key phytohormones (e.g., regulation of auxin biosynthetic process, auxin homeostasis, and gibberellic acid homeostasis), plant development (e.g., shoot system development and lateral root development), and metabolic activities (e.g., sucrose metabolic process and regulation of starch biosynthetic process). Plant embryo stage and seedling development stage also appear as enriched plant ontology terms in module 1331.

Enrichment in transcription factor binding sites
The top transcription factors whose binding sites significantly overlap with the PHS loci were CAMTA, FAR1, and ERF (each with an adjusted P of 0.1). These transcription factor families have been reported to be involved in biotic and abiotic stress response [66, 67].

Limitations
RicePilaf integrates several existing tools and methods and depends on currently available datasets (Table 4) and thus carries the limitations inherent in those tools and data. For example, genomes and gene models are limited to the reference genomes available at the time of writing; however, we intend to update genomes periodically on the public site. Data on regulatory and coexpression networks in rice depend on the currently available data from a limited number of tissues and conditions.
RicePilaf currently also lacks the means to incorporate tissue-specific gene expression information to narrow down the list of candidate genes, since currently available expression data have a limited range of tissues or use older technology that is not straightforward to integrate (e.g., [68, 69]). Similarly, epigenomic data such as chromatin accessibility and histone modification marks, which are known to be condition and tissue specific, are currently available for a very limited variety of tissues and come from only a few samples. In the lift-over functionality, there may be limitations related to possible multiplicity of alignment (in case of large segmental duplications). Currently, we force LAST to output only one-to-one alignments; thus, homologous regions arising from duplications cannot be interrogated.

Handling database updates
RicePilaf integrates information from multiple databases that are bound to see upgrades in the future. Additionally, in order for RicePilaf to respond quickly to user queries, we preprocess raw data from these databases to precompute information such as alignments, module detection, identification of enriched modules, ontology and pathway enrichment analysis, and annotations of PubMed abstracts. We incorporated the scripts for downloading and preprocessing into a Snakemake pipeline [70]. These scripts, along with all the necessary dependencies, are bundled into a Docker image (separate from the image for running the app), which can be downloaded from the code repository.

Figure 8: Coexpression network analysis of the loci Chr01:1523625-1770814 and Chr04:4662701-4670717, known to be significantly associated with preharvest sprouting [12]. The graph is a module (module 690) that is enriched in ontology terms and pathways related to seed germination, dormancy, and vivipary, such as activities of β-glucosidase, β-amylase, starch and sucrose metabolism, and biosynthesis of coumarin. The shaded nodes indicate genes that fall within the specified loci.

Adding new features
RicePilaf follows a modular and extensible design that allows for easy addition and updating of features in the future. One key feature of immediate interest is collecting additional information from remote RESTful application programming interfaces (APIs). Databases that provide API access include UniProt [71], Gramene [72], and AgroLD [73]. Another key feature is including complementary data retrieval APIs like PyRice [28] to expand and facilitate a broader search of information.

Conclusion
RicePilaf enables rice breeders and scientists to quickly cross-reference GWAS/QTL analysis results with a variety of rice databases. There have been several publications that identify potential QTLs and GWAS regions that remain poorly characterized for their specific mechanisms, and the overarching philosophy that drives this software development effort is the desire to solve this big unknown. This software platform is intended as a tool for understanding the functionality of many other QTLs and for finding ways to dissect their mechanistic details. Otherwise, we will just be building layers over layers without getting to the core. As an example, there are at least 3 other QTLs identified at the International Rice Research Institute now that operate as multigene QTLs. Hence, having an analysis pipeline for dissecting such regions for critical genes will add value as we and others discover more multigene QTLs in the future, especially in rice, mainly due to its compact genome. RicePilaf is easy to install, as it simply requires downloading a Docker image, along with the preprocessed dataset, and spinning up the container. It is also easy to use, as it runs in a browser, providing a user-friendly interface and a set of interactive web reports.

Methods

Lift-over
The lift-over from Nipponbare to a target genome is performed as follows. Pairwise whole-genome sequence alignment between Nipponbare and the target genome is precomputed using LAST as described in [74]. LAST produces a set of one-to-one local alignments (i.e., a base pair in Nipponbare aligns to at most 1 base pair in the target and vice versa). Additionally, there is no constraint for the alignments to be colinear, which allows for capturing complex inter- or intrachromosomal genome rearrangements. These precomputed alignments map Nipponbare genomic intervals to orthologous regions in the target. The set of gene models overlapping the target intervals is obtained from the genome annotations provided by the RGI [24]. The same project also provides orthologous gene groups, which can be used to compare GWAS/QTL gene sets across different accessions.
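The idea of projecting an interval through precomputed one-to-one alignments can be sketched as follows. This is a simplified illustration, not the procedure used with LAST output: it assumes that alignments have already been reduced to ungapped blocks with 0-based, half-open coordinates, and the `Block` type, the `lift_over` function, and the example blocks are all hypothetical.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """One ungapped, one-to-one alignment block (0-based, half-open)."""
    ref_chrom: str
    ref_start: int
    ref_end: int
    tgt_chrom: str
    tgt_start: int
    tgt_end: int
    strand: str  # "+" for same orientation, "-" for an inverted match


def lift_over(blocks, chrom, start, end):
    """Project the reference interval [start, end) onto the target genome.

    Returns a sorted list of (target_chrom, target_start, target_end) pieces,
    one per alignment block that intersects the query interval.
    """
    lifted = []
    for b in blocks:
        if b.ref_chrom != chrom:
            continue
        lo, hi = max(start, b.ref_start), min(end, b.ref_end)
        if lo >= hi:
            continue  # no intersection with this block
        if b.strand == "+":
            t_lo = b.tgt_start + (lo - b.ref_start)
            t_hi = b.tgt_start + (hi - b.ref_start)
        else:
            # Inverted block: reference offsets count back from the target end.
            t_lo = b.tgt_end - (hi - b.ref_start)
            t_hi = b.tgt_end - (lo - b.ref_start)
        lifted.append((b.tgt_chrom, t_lo, t_hi))
    return sorted(lifted)


# Hypothetical blocks: the second one is inverted and on another chromosome.
blocks = [
    Block("Chr01", 100, 200, "Chr01", 1000, 1100, "+"),
    Block("Chr01", 200, 300, "Chr05", 500, 600, "-"),
]
pieces = lift_over(blocks, "Chr01", 150, 250)
print(pieces)
```

Because blocks are not required to be colinear, a single reference interval may map to several target pieces on different chromosomes or strands, as the example shows.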

Coexpression network analysis
RicePilaf integrates coexpression information in 3 stages: detection of modules (also known as communities or clusters), identification of modules enriched in the GWAS/QTL genes, and functional characterization of these modules by ontology and pathway enrichment analysis. We describe these steps in detail below.

Module/community detection
First, RicePilaf identifies modules of genes given a coexpression network; users can select either RiceNet v2 [41] or the Rice Combined Mutual Ranked Network (RCRN) [42]. For RiceNet v2, the coexpression network used is the component network derived from the coexpression of Oryza sativa genes across microarray experiments. For RCRN, the integrated network is used.

Identification of enriched modules
From among the detected gene modules, RicePilaf performs overrepresentation analysis to identify which modules are statistically enriched in view of the coexpression network and the GWAS/QTL-implicated genes. To this end, a 2 × 2 contingency table is constructed, with all the genes across the detected modules comprising the background gene set. The columns count the number of genes implicated by GWAS/QTL (versus those that are not implicated), and the rows count the number of genes present in the module being tested (versus those that are not present). A 1-tailed Fisher's exact test is then applied, followed by multiple-testing correction via the Benjamini-Hochberg method [81]. A module is considered enriched if its adjusted P value is less than 0.05.
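The test described above can be sketched with the standard library alone. The right-tailed Fisher's exact test on a 2 × 2 table is equivalent to a hypergeometric tail probability, which is what the first function computes; the second applies the Benjamini-Hochberg adjustment. The function names and all counts are hypothetical, and this is an illustration of the statistics rather than the code used in RicePilaf.

```python
from math import comb


def fisher_right_tail(k, module_size, n_implicated, n_background):
    """P(X >= k): probability of seeing at least k implicated genes in a module.

    k: implicated genes observed in the module.
    module_size: number of genes in the module.
    n_implicated: implicated genes in the whole background.
    n_background: all genes across the detected modules.
    """
    max_k = min(module_size, n_implicated)
    total = comb(n_background, module_size)
    tail = sum(
        comb(n_implicated, i) * comb(n_background - n_implicated, module_size - i)
        for i in range(k, max_k + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical example: 20 background genes, 5 of them implicated,
# and three modules of 4 genes holding 3, 1, and 0 implicated genes.
raw = [fisher_right_tail(k, 4, 5, 20) for k in (3, 1, 0)]
adj = benjamini_hochberg(raw)
enriched = [p < 0.05 for p in adj]
print(raw, adj, enriched)
```

With only three tests the correction is mild; with thousands of modules, as in the use case, the adjustment matters considerably more.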

Functional characterization via ontology and pathway enrichment analysis
The likely biological functions of the enriched modules are inferred by performing enrichment analysis across several ontology and pathway databases. For the ontology enrichment analysis, RicePilaf displays results for 3 sets of ontologies: (i) gene ontology, (ii) trait ontology, and (iii) plant ontology. Gene ontology annotations, which cover cellular components, molecular functions, and biological processes, are aggregated from the Rice Annotation Project Database (RAP-DB) [75], agriGO v2.0 [43], and Oryzabase [45]. Trait and plant ontology annotations, which focus on phenotypic attributes, are obtained from Oryzabase [45].
For identifying enriched pathways, RicePilaf supports both overrepresentation analysis via clusterProfiler [82] and topology-based analysis via Pathway-Express [83] and Signaling Pathway Impact Analysis (SPIA) [84]. Pathway maps are obtained from KEGG [44]; accessions are mapped to KEGG identifiers using the R package riceidconverter [85] and mapping tables from RAP-DB [75]. An ontology term or pathway is considered enriched if its adjusted P value after Benjamini-Hochberg correction [81] is less than 0.05.

Enrichment of regulatory features
RicePilaf determines if the input GWAS/QTL intervals are enriched for binding sites of TFs. This is done by first computing overlaps between GWAS/QTL genomic intervals and predicted binding sites of a TF using Pybedtools [86, 87]. For TFs that have a nonempty intersection, the statistical significance of the overlap is computed using MCDP2 [88], and multiple testing across multiple TFs is accounted for by Benjamini-Hochberg correction of the significance values [81]. Binding site information of almost 250 TFs is obtained from PlantRegMap [50]. For each TF, PlantRegMap provides several sets of predicted binding sites depending on how the prediction was performed, namely (i) simple motif scanning using FIMO [89], (ii) motif scanning paired with conserved element information [50], or (iii) the FunTFBS method [50], and on what target sequence was used: whole genome versus promoter region defined as −500/+100 bp of the transcription start site. RicePilaf exposes these choices to the user.
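The promoter definition of −500/+100 bp around the transcription start site depends on gene orientation, which a short sketch makes explicit. The `promoter_window` function, its 1-based inclusive coordinate convention, and the example positions are assumptions made for illustration; the exact boundary convention used by PlantRegMap is not stated here.

```python
def promoter_window(tss, strand, upstream=500, downstream=100):
    """Return (start, end) of the promoter region around a transcription start site.

    Coordinates are assumed to be 1-based and inclusive. "Upstream" means
    lower coordinates on the "+" strand and higher coordinates on the "-" strand.
    """
    if strand == "+":
        return max(1, tss - upstream), tss + downstream
    if strand == "-":
        return max(1, tss - downstream), tss + upstream
    raise ValueError(f"Unknown strand: {strand!r}")


# Hypothetical transcription start sites.
print(promoter_window(10000, "+"))
print(promoter_window(10000, "-"))
print(promoter_window(200, "+"))  # clipped at the chromosome start
```

Restricting predicted binding sites to such windows trades coverage of distal regulatory elements for a smaller, promoter-focused search space, which is presumably why both options are exposed to the user.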

Text mining
Around 17,000 scientific abstracts were retrieved from PubMed by using a curated list of PubMed identifiers provided by the Oryzabase database [45]. This list provides manually checked PubMed entries related to rice genomics. A natural language processing pipeline was written using Python to extract named entities from these abstracts. This pipeline combines the HunFLAIR named entity recognition tagger [90] with spaCy [91], Natural Language Toolkit (NLTK) [92], and other libraries. It identifies 4 types of named entity annotations: gene names (e.g., "OsMAPK2" or "MOC1"), species (e.g., "Oryza sativa" or "Magnaporthe oryzae"), chemicals (e.g., "gibberellic acid" or "nitrogen"), and disease or

Figure 1: RicePilaf graphical abstract. RicePilaf crosses GWAS/QTL-mapping results with multiple data sources on rice. A summary of databases used in RicePilaf is provided in Table 4.

Figure 2: A screenshot of the input interface.

Figure 3: A screenshot showing an example of the result of lift-over from Nipponbare coordinates to another genome, IR64 in this case.

Figure 4: A screenshot showing an example of coexpression network analysis in RicePilaf.

Figure 5: A screenshot showing an example of the result of searching against a text-mining derived database containing information from PubMed abstracts.

Table 1: Intra-QTL genes from the lift-over of QTL qDTY12.1 from Nipponbare that are unique to the N22 genome

Table 2: Intra-QTL genes from the lift-over of qDTY12.1 from Nipponbare that are unique to the ARC 10497 genome

Table 3: Genes unique to IR64 in the preharvest sprouting GWAS loci

Table 4: Summary of datasets used