CitrusKB: a comprehensive knowledge base for transcriptome and interactome of Citrus spp. infected by Xanthomonas citri subsp. citri at different infection stages

Abstract Citrus canker type A is a serious disease caused by Xanthomonas citri subsp. citri (X. citri), which is responsible for severe losses to growers and to the citrus industry worldwide. To date, no canker-resistant citrus genotypes are available, and there is limited information regarding the molecular and genetic mechanisms involved in the early stages of the citrus canker development. Here, we present the CitrusKB knowledge base. This is the first in vivo interactome database for different citrus cultivars, and it was produced to provide a valuable resource of information on citrus and their interaction with the citrus canker bacterium X. citri. CitrusKB provides tools for a user-friendly web interface to let users search and analyse a large amount of information regarding eight citrus cultivars with distinct levels of susceptibility to the disease, with controls and infected plants at different stages of infection by the citrus canker bacterium X. citri. Currently, CitrusKB comprises a reference citrus genome and its transcriptome, expressed transcripts, pseudogenes and predicted genomic variations (SNPs and SSRs). The updating process will continue over time by the incorporation of novel annotations and analysis tools. We expect that CitrusKB may substantially contribute to the field of citrus genomics. CitrusKB is accessible at http://bioinfo.deinfo.uepg.br/citrus. Users can download all the generated raw sequences and generated datasets by this study from the CitrusKB website.


Introduction
Citrus canker A, caused by the Gram-negative bacterium Xanthomonas citri subsp. citri (X. citri), is one of the main diseases affecting citrus trees and is a threat for orange production in several countries around the world (1)(2)(3). The symptoms of citrus canker on susceptible trees include raised brownish circular lesions on leaves, stems and fruits (4). Several disease management procedures have been applied in attempts to control citrus canker, including pruning of infected trees, post-harvested treatment of fruits, decontamination of equipment and personnel and, within endemic regions of citrus canker, it is applied a spray of copper-containing chemicals to protect young leaves and fruit against bacterium infection (5,6). Furthermore, additional measures to control the disease involve planting windbreaks in the citrus orchards, control of the citrus leaf miner Phyllocnistis citrella and production of healthy citrus nursery trees (6,7).
Due to these challenging attempts for disease control, several efforts have been made to understand the mechanisms of plant-pathogen interaction and the disease development molecular basis, with the objective to establish more effective measures for the control of citrus canker.
The molecular basis and the genetic mechanisms involved in the early stages of the citrus canker development may be revealed by studies on citrus species and cultivars with different levels of susceptibility to the disease. For instance, kumquats (Fortunella spp.) and 'Mexican' lime (Citrus aurantifolia (Christm.) Swingle) are considered resistant and susceptible to citrus canker, respectively. Furthermore, other citrus cultivars exhibit intermediate levels of susceptibility and resistance to the disease, such as the sweet oranges (Citrus sinensis L. Osbeck) 'Bahia', 'Hamlin', 'Valencia' and 'Pera', and the mandarins 'Ponkan' (Citrus reticulata Blanco) and 'Satsuma' (Citrus unshiu Marcovitch) (7,8) (Figure 1). Therefore, the study of the biological processes of the plant-pathogen interaction at molecular level in such diverse citrus genotypes may help to better understand citrus canker disease and may allow identifying potential targets for disease management. The use of the high-throughput RNA-Seq technology can be used to detect the induced or repressed genes in the host during the bacterial infection process. Then, the transcriptome can unveil the genetic and molecular mechanisms that confer the different levels of citrus susceptibility and resistance against the bacterial pathogen. However, current transcriptomebased studies did not include a large range of citrus cultivars exhibiting different levels of resistance and susceptibility to citrus canker. For instance, a previous study was carried out using microarray-based analysis in sweet oranges 'Pera' and 'Cristal' varieties (C. sinensis) and 'Mexican' lime (C. aurantifolia) and 'Siciliano' lemon (C. limon (L.) Burm. f.) (9). In another investigation, based on RNA-Seq analysis, only two cultivars, the canker-resistant 'Meiwa' kumquat (Fortunella crassifolia Swingle) and the canker-susceptible 'Newhall navel sweet orange' (C. sinensis) were compared (10). In addition, RNA-Seq technique was also used to develop a high confident reference of citrus transcriptome based on 12 Citrus species from all main phylogenetic groups (11).
In this study, we aimed to determine early genetic expression changes in citrus plants under the development of the citrus canker disease using RNA-Seq approach to sequence the expressed mRNA of eight different citrus genotypes inoculated with X. citri at early stages of the infection. All sequenced data were analysed to detect the differentially expressed genes between compared genotypes. Thus, we have developed a transcriptome knowledge base of distinct citrus cultivars, the CitrusKB, to facilitate further research on the biology of citrus. Despite other citrus databases, such as Citrus sinensis annotation project (12) and Citrus Genome Database (https://www.citrusgenomedb.org), Cit-rusKB is the only web tool that comes to integrate a wide range of citrus species and cultivars exhibiting different levels of resistance to citrus canker. The Cit-rusKB is a user-friendly web-based interface that provides tools to allow researchers to visualize, search, analyse, recover and browse information on citrus and X. citri interactome. indicates that a citrus genotype is more resistant to citrus canker disease, as well as 'More susceptible' indicates that a citrus genotype is less resistant to citrus canker disease.

Construction and content
Currently, CitrusKB hosts a database to study the effects of initial stages of citrus canker disease including in vivo RNA-Seq data of 8 citrus genotypes, which have been prepared during 3 infection time points, at 24, 48 and 72 hours post-inoculation (hpi) of X. citri. In addition, the web tool includes the sequences of other citrus reference genomes that were available in other databases.

Inoculation of citrus genotypes with X. citri
The eight citrus genotypes exhibiting different levels of resistance to citrus canker included in the study were: 'Kumquat' (Fortunella spp.); Mandarins 'Ponkan' (Citrus reticulata) and 'Satsuma' (Citrus unshiu); sweet oranges (Citrus sinensis) 'Bahia', 'Valencia', 'Pera' and 'Hamlin' and 'Mexican' lime (Citrus aurantifolia). The trees were maintained in a greenhouse under controlled temperature (28 • C) and photoperiod conditions (light/dark periods of 12 h). The trees were pruned 3 weeks prior to inoculation in order to produce uniform young leaves, with 75% of leaf expansion (ideal condition for bacterial inoculation test in citrus (13). The citrus trees were grafted on 'Rangpur' lime (Citrus limonia Osbeck). The X. citri strain 306 (14) was streaked from glycerol stock on Nutrient Agar (0.3% beef extract, 0.5% peptone, 1.5% agar) for 72 h at 28 • C. A single colony was transferred to a new Nutrient Agar plate and after 72 h at 28 • C, a bacterial sample was diluted into sterile bi-distilled water and the inoculum was prepared by dilution to 10 8 cfu/ml (0.3 OD reading at 600 nm). The bacterial suspension was infiltrated into the abaxial surface of 5 leaves of each of the 3 plants in the entire leaf blade by using a 1 mL syringe without needle. Similarly, leaves of other three check plants were infiltrated with sterile distilled water (control).

Citrus RNA extraction and sequencing
Treated and control plants were kept in a growth chamber under controlled temperature (28 • C) and photoperiod conditions (light/dark periods of 12 h). At 24, 48 and 72 h after inoculation, 3 leaves inoculated with X. citri and showing canker symptoms (infected -I) and 3 leaves inoculated with sterile water and without canker symptoms (control -NI) were collected from each plant and immediately frozen in liquid nitrogen and further stored at −80 • C.
For each time point, frozen leaf tissue was ground with mortar and pestle using liquid nitrogen. Total RNA was extracted from macerated leaf samples using Trizol (Invitrogen, Carlsbad, California, USA), according to the manufacturer's protocol. RNA quality and yield were determined by using Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, California, USA) and Qubit 2.0 Fluorometer (Invitrogen). Only RNA having integrity number (RIN) ≥ 8.0 were used. The cDNA libraries were multiplexed and then subjected to high-throughput RNA-Seq using Illumina HiScanSQ (Illumina Inc., San Diego, California, USA). The sequenced samples (8 genotypes, 6 libraries each: I and NI for each time) produced 48 raw RNA-Seq libraries, corresponding to more than 50 millions of 50 bp single-end reads per library.

Transcriptome analysis
The raw data of each library were filtered using the software NGS-QC Toolkit (15) to remove bad quality, adaptors and contaminated sequences. The software was configured with the following parameters: Primer/Adaptor library: Rapid Library (Standard); Homopolymer trimming: On; Length of the homopolymer to be removed: 8; Cut-off read length for HQ: 70%; Cut-off quality score: 20. The libraries were then aligned to the Citrus sinensis reference genome (16) using the Tophat2 (v2.1.14) (17) plus Cufflinks (v2.2.1) (18) softwares with default parameters. The identification of novel transcripts and isoforms was performed based on Cufflinks (v2.2.1) software (18), with default parameters, using the reference transcriptome annotation of Citrus sinensis reference genome (16). All identified genes, including annotated and novel isoforms, are available at   Figure 2).

Gene expression analysis
For differential gene expression analysis (DEG), the highquality reads used for transcriptome annotation step were mapped against the citrus reference genome using the STAR software (v2.4.1) with default parameters (25). Next, the absolute number of aligned reads was extracted for each individual transcript using the HTSeq software (v0.6.1) (26). Further, the normalized expression variation with statistical significance analysis for each individual transcript was calculated between samples. The False Discovery Rate (FDR) correction (27) over the statistical significance found between the samples was applied to control false-positive significance transcript expression variation. These analysis were performed using DESeq, a Bioconductor R package (28). The transcripts were considered as differentially expressed when statistical significance (adjusted P-value) was less than 0.05.

Simple sequence repeats
The search for Simple Sequence Repeats (SSR) was performed using the TandemSmart software (version v1.0, unpublished). This approach performs custom SSR search using a reference sequence combined with high-throughput sequencing alignment data. This data is then used to validate detected SSRs and providing normalized occurrence values in a RPKM-based calculation for each SSR marker. Higher RPKM-based values denote higher expressed markers. The detected markers through all citrus genotypes, stored in a PostgreSQL database, describe the following information: sample name for each SSR occurred, sequence ID, sequence motif, start and end coordinates for SSR in reference sequence, motif length, length of repeat and RPKM-based value.

Single nucleotide polymorphism
To detect single nucleotide polymorphisms (SNPs), RNA-Seq data were aligned to Citrus sinensis reference genome using the STAR aligner (25) with default parameters. Next, the genetic variants having expressed all alternate alleles were extracted using the SAMtools package (29). The data generated were stored within the database and are represented by the following attributes: sequence ID, position, reference allele and SNP.

CitrusKB implementation
CitrusKB was developed on a web-based environment and is executed on a Linux operating system (Debian v7.7), which includes several common software packages: Apache HTTP server, PostgreSQL database, PHP, Java and Perl. The website (Figure 3) was developed using PHP, Java and HTML languages, and Twitter Bootstrap was used to achieve an enhanced user interaction. A Basic Local Alignment Search Tool (BLAST) (30) server is available in CitrusKB environment. As genome browser for transcriptome visualization, JBrowse (31) was implemented, showing all details of the transcripts and their alignments at nucleotide level.

Local blast
To provide an intuitive web-based graphical user interface and a rapidly search in a large volume of data available in the CitrusKB, we used the standalone BLAST server (v2.2.26) (32). To create the BLAST alignment database, we used the makeblastdb program of the NCBI BLAST (v2.2.30) software package (33). Using web forms, users can perform nucleotide searches using the BLASTN tool and protein searches, using BLASTP tool.

Sequence search
To perform searches for sequences from transcriptome, users can use as parameter a single or multiple sequence IDs, annotation keywords or gene names. The transcript sequence is retrieved by an 'in-house' script that performs the search over the stored transcriptome.

Gene expression
In this section, users can view the gene expression profile of the sequenced libraries. For each citrus cultivar, differential analysis was performed to determine expression variation between pairs of conditions, including the following information: sequence ID, annotation, InterProScan

Genetic molecular markers
The SSR and SNP can be visualized by two distinct tables.
In the first

Tools page
To facilitate the interpretation and to visualize the transcriptome annotation and differential expression analysis results, CitrusKB provides two different third-part commonly used tools for transcriptome analysis: (a) Venn diagrams, based on jvenn (34), and (b) HeatMaps, implemented by the Heatmapper software (35). Further details on how to use these tools are shown in the topic 'Using CitrusKB: a study case'.

Download page
CitrusKB provides a user-friendly interface for the download of entire data sets. The HTTP links provide the download of all RNA-Seq raw sequencing reads from the 48 libraries, available in FASTQ format and compressed as Gziped files. Users can also download the following data: reference genome, assembled reference transcriptome sequences and gene structure in GFF format, all the gene expression analysis, and other details of SSRs and SNPs genetic markers. The raw data are also available under BioProject number PRJNA470961.

Using CitrusKB and discussion
The following section summarizes one example of CitrusKB usability. It is based on a study of differentially expressed genes (DEGs), revealing important transcripts found in the spatial and temporal citrus and X. citri interactome. The used thresholds during transcript search in the CitrusKB gene expression tables to consider a DEG were set as: foldchange ≥ 2 for up-regulated gene, fold-change ≤ −2 for down-regulated gene, both with an adjusted P-value ≤ 0.05 (FDR-corrected). The resulting tables were exported in the 'TSV' file format ('export data' button) and directly imported in the CitrusKB Venn Diagram Tool ('Tools' tab). The GAF file available in the 'download section' of the Cit-rusKB was used for the exploration of the enriched GO terms, with the use of the WEGO tool. In addition, the TSV file can be also imported in the CitrusKB HeatMap Tool ('Tools' tab) and interactively inspected in the form of heat maps. The annotation products, Gene Ontology terms and IDs, predicted NOGs and enzyme code of each DEG transcript can be retrieved through the gene expression tables or exported in the TSV file that can be further explored for other biological interpretations and/or downstream analysis. It is worth to mention that the following study consist in a general example of the application of the CitrusKB which was not intended to show a comprehensive transcriptome analysis, but to illustrate the database use, the built-in tools to facilitate the transcriptome data integration, visualization and comparison. The objective to include this example was to present the power of the CitrusKB to revel the plant-pathogen related gene targets for further and more advanced studies. Researchers working with different plant-pathogen interaction topics can raise specific questions using the CitrusKB gene expression tables search engine and genome browser.
A case of study: global overview of the DEG of the Citrus and X. citri interactome with identification of transcripts potentially related to plant-pathogen interactions The analysis of the predicted DEGs of eight citrus genotypes in three-time stages (24,48, and 72 h) revealed a plethora of DEG transcripts which can be strictly related with plant pathogen interaction, such as defense mechanisms and resistance and susceptibility to X. citri. In this analysis, the eight genotypes were grouped in five clusters: the first contains the highly resistant 'Kumquat' genotype; the second contains the resistant Mandarins 'Ponkan' and 'Satsuma'; the third with the less-resistant sweet oranges 'Pera' and 'Valencia'; the fourth with the more susceptible genotypes 'Hamlin' and 'Bahia', and the fifth with the highly susceptible 'Mexican' lime genotype. In summary, 'Kumquat', 'Ponkan' and 'Satsuma', 'Hamlin' and 'Bahia' stand as the genotypes with the larger number of up-regulated transcripts (average of 2350), whereas 'Kumquat' exhibits the largest number of down-regulated transcripts (1602) (Figure 4). The 'Kumquat' genotype also exhibits the largest number of exclusive up-regulated (961) and down-regulated transcripts (1602). The list of the transcripts used to generate the Venn diagrams is provided as supplementary material (Table S1) for further exploitation.
In general, the presented analysis also revealed 3827 (7%) up-regulated and 4889 (9%) down-regulated unique DEGs, and thus representing an overall pattern of C. sinensis response to X. citri infection. Among those, 642 up-regulated and 782 down-regulated transcripts correspond to hypothetical or uncharacterized proteins, and thus, may be considered as potential new targets for functional studies of plant-pathogen interaction mechanisms (Table S2). In addition, 2307 up-regulated and 2743 down-regulated transcripts are classified in NOGs ( Table 2). The most prevalent DEGs NOGs belongs to 'transcription', 'signal transduction mechanisms', 'posttranslational modification, protein turnover, chaperones', 'carbohydrate transport and metabolism' and 'secondary metabolites biosynthesis, transport and catabolism'. There are more than 40% of up-regulated DEGs related to 'replication, recombination and repair', and 'cytoskeleton' NOGs categories in comparison to down-regulated DEGs. In contrast, much more enriched down-regulated DEG were identified, such as those related to 'transcription', 'cell wall/membrane/envelope biogenesis', 'carbohydrate transport and metabolism', 'coenzyme transport and  metabolism', 'inorganic ion transport and metabolism' and 'secondary metabolites biosynthesis, transport and catabolism'. These findings may suggest for a global set of genes potentially related to the plant response against the pathogen infection. For instance, among the identified NOGs (Figure 2A-2B), 440 out of 2296 (19%), 1118 out of 4488 (24%) and 551 out of 2814 (19%) represent DEGs related to 'transcription', 'signal transduction mechanisms', and 'posttranslational modification, protein turnover, chaperones', respectively, which may regulate the citrus metabolism, through the inactivation and activation of genes in a putative attempt to control the infection caused by the pathogen. Conversely, 69 out of 290 (23%) DEGs related to 'defense mechanisms' were also identified, revealing several transcripts with detoxification related functions ( Figure 2A, Table S2) which may be involved in the scavenge of the reactive oxygen species (ROS) generated by aerobic metabolism, and also by other processes, such as growth development, programmed cell death and response to biotic and abiotic environmental stimuli. All these processes can be linked to X. citri infection. Overall, these results summarized the use of CitrusKB to identify several transcripts (Table S2) with potential correlation to plant-pathogen interactions.

Conclusions
The knowledge base for transcriptome of Citrus interactome, CitrusKB, was created as a web tool to researches on citrus biology, in particular to the genes involved on plant-pathogen interaction. The CitrusKB also provides tools on a user-friendly web interface that allows researchers to visualize, search, analyse and browse information regarding citrus and their interaction with the citrus canker pathogen X. citri. To the best of our knowledge, this is the first in vivo database for analysis of citrus infected by a pathogen, and we expect that it will bring a substantial contribution for researches in the area, particularly in the field of molecular plant-pathogen interactions. We will continue updating CitrusKB by incorporating novel literature data as they get available over time.