RNA sequencing (RNA-Seq) of lymph node, spleen, and thymus transcriptome from wild Peninsular Malaysian cynomolgus macaque (Macaca fascicularis)

The cynomolgus macaque (Macaca fascicularis) is an extensively utilised nonhuman primate model for biomedical research due to its biological, behavioural, and genetic similarities to humans. Genomic information of cynomolgus macaque is vital for research in various fields; however, there is presently a shortage of genomic information on the Malaysian cynomolgus macaque. This study aimed to sequence, assemble, annotate, and profile the Peninsular Malaysian cynomolgus macaque transcriptome derived from three tissues (lymph node, spleen, and thymus) using RNA sequencing (RNA-Seq) technology. A total of 174,208,078 paired end 70 base pair sequencing reads were obtained from the Illumina Hi-Seq 2500 sequencer. The overall mapping percentage of the sequencing reads to the M. fascicularis reference genome ranged from 53–63%. Categorisation of expressed genes to Gene Ontology (GO) and KEGG pathway categories revealed that GO terms with the highest number of associated expressed genes include Cellular process, Catalytic activity, and Cell part, while for pathway categorisation, the majority of expressed genes in lymph node, spleen, and thymus fall under the Global overview and maps pathway category, while 266, 221, and 138 genes from lymph node, spleen, and thymus were respectively enriched in the Immune system category. Enriched Immune system pathways include Platelet activation pathway, Antigen processing and presentation, B cell receptor signalling pathway, and Intestinal immune network for IgA production. Differential gene expression analysis among the three tissues revealed 574 differentially expressed genes (DEG) between lymph and spleen, 5402 DEGs between lymph and thymus, and 7008 DEGs between spleen and thymus. Venn diagram analysis of expressed genes revealed a total of 2,630, 253, and 279 tissue-specific genes respectively for lymph node, spleen, and thymus tissues. This is the first time the lymph node, spleen, and thymus transcriptome of the Peninsular Malaysian cynomolgus macaque have been sequenced via RNA-Seq. Novel transcriptomic data will further enrich the present M. fascicularis genomic database and provide future research potentials, including novel transcript discovery, comparative studies, and molecular markers development.

In this present study, we sequence three tissues-lymph node, spleen, and thymus, 137 harvested from cynomolgus macaques exposed to urban environments using the Illumina HiSeq 138 2500 platform. A reference-guided assembly was performed to assemble the sequencing reads by 139 mapping the reads to the M. fascicularis reference genome, followed by empirical analysis of 140 differential gene expression (EDGE test), and functional annotations and categorisation of the 141 assembled reads. The dataset presented here will prove valuable for future immunological or 142 drug translational studies involving cynomolgus macaques as NHPs. Three male conflict M. fascicularis individuals from the state of Selangor in Malaysia 161 were captured by DWNP. The three individuals were adolescents with total length of 79.0cm, 162 80.5cm, and 91.0cm and weighing at 1.63kg, 1.43kg, and 2.11kg respectively, and appeared to 163 belong in the same family group. Based on visual examination, the macaques were free of 164 disease. Lymph node, spleen, and thymus tissues of the three individual macaques were 165 harvested and stored separately in 1.5 ml tubes filled with RNAlater RNA Stabilization Agent 166 (Qiagen), which were incubated in 4°C overnight, and subsequently stored in -80°C. Total RNA samples were extracted from the nine harvested tissues using Qiagen's 171 RNeasy Mini Kit. Due to the presence of genomic DNA in the RNA extract, a modified protocol 172 was employed by utilizing Epicentre's DNase I solution to remove genomic DNA 173 contamination. The intactness of the RNA extracts were visualised using 1% agarose gel, and 174 their respective concentrations quantitated using a Qubit fluorometer (Life Technologies, United 175 States of America). The integrity of the RNA extracts were determined using Agilent's 2100 176 Bioanalyzer (Agilent Technologies, Inc., United States of America) via an RNA Pico chip. RNA 177 samples with RIN of more than 7.0 were selected for the subsequent library preparation step.  A total of nine sequencing libraries were prepared, with three replicates for each tissue 182 type. The library preparation step consisted of two main sections; the removal of ribosomal RNA 183 (rRNA) from the total RNA, and the conversion of the rRNA-depleted RNA to cDNA. rRNA 184 depletion was carried out using the Ribo-Zero™ Gold Kit (Human/Mouse/Rat) (Epicentre, 185 United States of America), while the conversion of treated RNA to cDNA was performed using 224 225 Empirical analysis of differentially expressed genes (EDGE test) was carried out on 226 normalised expression profiles to identify differentially expressed genes between the lymph 227 node, spleen, and thymus tissue samples. Three separate tissue comparisons were performed -228 spleen vs. lymph, thymus vs. lymph, and spleen vs. thymus. Differentially expressed genes were 229 filtered by selecting for genes with normalised fold change values of > 2 or < -2. Selection of 230 significantly differentially expressed genes were performed by filtering for genes with FDR 231 corrected p-value < 0.05. Genes were assigned Gene Ontology and KEGG pathway annotations by applying gene 236 ontology and pathway association files (.gaf2) obtained from the UniProtKB and KEGG 237 databases respectively. Functional and pathway categorisation of expressed genes (normalised 238 expression value > 1) were carried out using a suite of online databases. GO categorization was 239 carried out via Panther Database version 11.1 released on 2016-10-24 (http://pantherdb.org/) 240 utilising Macaca mulatta as the reference organism. Classifications of genes into biological 241 process, molecular function, and cellular component GO categories were visualised via the 242 "Functional classification viewed in pie chart" analysis. For KEGG pathway categorisations, 243 DAVID Bioinformatics Resources version 6.8 (https://david-d.ncifcrf.gov/home.jsp) was utilised 244 with the M. fascicularis as the background organism. EASE threshold score was set at 0.01 when 245 determining significantly enriched pathway categories. KEGG pathway mappings with 246 highlighted mapped objects were retrieved from KEGG mapper version 2.8 tool in the KEGG 247 website (http://www.kegg.jp/kegg/mapper.html). To identify and visualise tissue-specific genes in lymph node, spleen, and thymus tissues, 252 expressed genes for each tissue type were input to the web tool Venn Diagrams 253 (http://bioinformatics.psb.ugent.be/webtools/Venn). Validation of RNA-Seq gene expression profiles were performed using 258 NanoStringnCounter XT gene expression assay (NanoString Technologies Inc., Seattle, WA, 259 USA). A total of 24 genes were selected for validation based on fold change patterns and 260 immune-related functions. An additional three housekeeping genes (ACTB, GAPDH, and 261 YWHAZ) were also selected for reference genes normalising of digital counts data. Fold change 262 patterns for 24 genes in their respective tissue comparisons were compared between the RNA-263 Seq and NanoString platforms. Custom CodeSet probes were designed from M. fascicularis 264 FASTA sequences obtained from the reads mapping assembly from this study. Table S1 lists the 265 genes utilised for the RNA-Seq data validation and their respective probe pair sequences. The 266 amount of total RNA utilised for the hybridisation step was 200 ng. Digital scanning of RNA-267 probe hybrids were performed at High 280 Field of View (FOV) setting. Hybridisation and 268 digital scanning of the RNA-probe hybrids were performed at High Impact Research Centre in 269 Universiti Malaya, Malaysia. We investigated the transcriptome of three tissues in triplicates obtained from the 276 Malaysian M. fascicularis using the RNA-Seq platform. The number of raw sequence reads 277 obtained from the high-throughput sequencing of lymph node, spleen, and thymus tissues were 278 50,180,478, 63,956,392, and 60,071,208 respectively with a sequence length of 70 base pairs for 279 each read. Raw sequence reads were trimmed and filtered to remove PhiX Illumina positive 280 control sequences, adapter and index barcode sequences, and low quality sequences based on 281 base quality Q 20 . Post-trimming, the number of sequence reads for lymph node, spleen, and 282 thymus tissues were 47,559,293, 60,285,505, and 56,583,352 respectively, with the average 283 length of the sequence reads ranging from 57.5 base pairs to 60.3 base pairs. The percentage of 284 reads that were retained post-trimming ranged from 96.52% to 97.73%, indicating that at least 285 96% of the reads were above base quality Q 20 and were suitable for reference-guided assembly.

287
The filtered reads from each replicate were then mapped to 36,233 genes within the M. 288 fascicularis reference genome obtained from NCBI (Accession: GCF_000364345.1). The total 289 number of paired and broken paired reads mapped to the reference genome for lymph, spleen, 290 and thymus samples are 25,476,162, 37,536,104, and 30,782,715 reads respectively. For the 291 counting of fragments mapped to exon and intron regions, only paired end reads were taken into 292 account. The overall mapping percentage (sum of percentage of reads mapped in pairs and 293 broken pairs) ranged from 53.00% to 63.00% with lymph showing lower overall mapping 294 percentage compared to spleen and thymus. Table 1 summarises the filtering and mapping 295 statistics of lymph node, spleen, and thymus sequence reads. Supplementary information on the 296 individual mapping statistics for each individual library are included in Table S2. 298 Table 1: Filtering and mapping statistics of lymph, spleen, and thymus sequence reads to Macaca 299 fascicularis reference genome (GCF_000364345.1). Reads were mapped to gene regions only 300 with a maximum number of hits for a read = 10. Only intact paired reads were taken into account 301 when counting the fragments by type. To assign Gene Ontology (GO) and KEGG pathway annotations to the genes, gene 329 ontology and pathway annotation files (.gaf2) were obtained from UniProtKB and KEGG 330 database respectively and imported to CLC Genomics Workbench. The numbers of genes with 331 gene ontology and pathway annotations are listed in Table 2.  To classify the genes into GO categories, expressed genes were input to the Panther 337 Database. For lymph node tissue, 15,611, 7876, and 6141 genes were categorised to more than 338 one GO categories in the Biological Process, Molecular Function, and Cellular Component 339 domains respectively. In the spleen tissue, 14,838, 7465, and 5892 genes were categorised to 340 more than one GO categories in the Biological Process, Molecular Function, and Cellular 341 Component domains respectively. While in the thymus tissue, 12,161, 6143, and 5029 genes 342 were categorised to more than one GO categories in the Biological Process, Molecular Function, 343 and Cellular Component domains respectively. The distribution of the genes into different level 2 344 GO categories are shown in  Expressed genes were also input to DAVID Bioinformatics Resources to classify the 355 genes into KEGG pathway categories (Table 4). In the lymph node, spleen, and thymus, 4848, 356 4658, and 3893 genes respectively were assigned to one or more KEGG pathway categories. For 357 lymph node, spleen, and thymus, the Global overview and maps pathway umbrella category 358 contains the highest number of expressed genes, followed by the Signal transduction category.  The total number of expressed genes enriched in the KEGG pathway Immune system 364 category for lymph node, spleen, and thymus were 266, 221, and 138 genes respectively. The 365 four Immune system pathways that were enriched include Platelet activation pathway, Antigen 366 processing and presentation, B cell receptor signalling pathway, and Intestinal immune network 367 for IgA production. Genes enriched in both Platelet activation and B cell receptor signalling 368 pathways are expressed in lymph node, spleen, and thymus tissues. For the Platelet activation 369 pathway, 87 genes were expressed in all three tissues, including TLN1, TLN2, FERMT3, ITGB3, 370 FGA, FGB, and FGG which are involved in the activation and aggregation of platelets. Genes 371 involved in the B cell receptor signalling pathway had 51 genes expressed across all tissues, 372 which include genes such as SYK, PIK3AP1, AKT1, AKT2, AKT3, CHUK, IKBKB, and IKBKG 373 that also take part in the PI3K-Akt signalling pathway and NF-kappa B signalling pathway. 374 Genes enriched in the Antigen processing and presentation pathway were expressed in both 375 lymph node and spleen tissues with 61 and 60 genes respectively. All 60 lymph genes expressed 376 in the Antigen processing and presentation pathway were co-expressed in spleen, with genes 377 involved in the MHC I (PSME1, PSME2, PSME3, TAP1, TAP2) and MHC II (CD74, LGMN, 378 CTSS) pathways. The Intestinal immune network for IgA production pathway was specifically 379 enriched in lymph node, with 41 genes enriched in this pathway including genes such as IL5, 380 IL6, IL10, TGFB1, TNFSF13, and AICDA which are involved in the proliferation of B cells into 381 IgA+ B cells. The Platelet activation pathway is activated by the adhesion of ADP to purinergic 400 receptor P2Y that begins the complement and coagulation cascade involving genes TLN2, 401 FERMT3, ITGB3, FGA, FGB, and FGG at the end of the cascade. TLN2 codes for the 402 cytoskeleton protein Talin-2 and is a component that links actin cytoskeleton with integrin and 403 catalyses focal adhesion signalling pathways (Zhang et al., 2008). FERMT3, together with TLN1 404 and TLN2, activates integrins beta-3 (ITGB3), and when activated, triggers platelet-platelet 405 interactions via binding of fibrinogens alpha, beta, and gamma. These fibrinogens (respectively 406 encoded by FGA,FGB, and FGG) interact to form a polymerisation that form a fibrin matrix that 407 physically plugs ruptured endothelial surface.

409
In the B cell receptor signalling pathway, a response to foreign antigen binding to the B 410 cell receptor (BCR) is for the gene SYK to encode enzyme tyrosine kinase. In effect, tyrosine 411 kinase activates the phosphatidylinositol 3'-kinase (PI3K)-Akt signalling pathway via the BCAP 412 signalling adapter encoded by PIK3AP1. The PI3K-Akt signalling pathway then activates the 413 AKT kinase coded by AKT1, AKT2, and AKT3 that functions to ensure cell survival by reducing 414 oxidative stress acting on the cell, thus preventing cell apoptosis. AKT kinase also activates the 415 Nuclear factor-kappa B (NF-kappa B) signalling pathway via the IKK enzyme complex (coded 416 by CHUK, IKBKB, and IKBKG) which activate genes that further ensures cell survival (Faissner 417 et al., 2006).

419
In the MHC I pathway, genes PSME1, PSME2, and PSME3 encode for proteasome 420 activator PA28, an immunoproteasome involved in the Antigen processing and presentation 421 pathway. Genes TAP1 and TAP2 encode for antigen transporters 1 and 2 respectively, and is 422 located on the membrane of the endoplasmic reticulum of a cell. PA28 activator complex play a 423 role in presenting antigens to the antigen transporters, which antigen transport the antigens from 424 cytoplasm to the endoplasmic reticulum to be presented to the MHC I molecules. The CD74 425 gene is significant in the MHC II pathway for its antigen processing regulatory function. It codes 426 for HLA class II histocompatibility antigen gamma chain protein that binds to the invariant chain 427 (Ii) polypeptide to form the MHC II protein complex. The complex is then migrated to the 428 endoplasmic reticulum of the cell where the complex is cleaved by legumain (coded by LGMN) 429 and cathepsin S (coded by CTSS) to a smaller and more stable form of MHC II called CLIP that 430 is ultimately presented on the cell surface.

432
In the Intestinal immune network for IgA production pathway, as MHC II proteins 433 recognise and present antigens to the A proliferation-inducing ligand (APRIL) receptor (coded 434 by TNFSF13), a cascade begins that ultimately differentiate B cells into immunoglobulin A+ B 435 cells. Cytokines interleukin-5, interleukin-6, and interleukin-10 (coded by IL5, IL6, and IL10 436 respectively) together with transforming growth factor beta-1 (coded by TGFB1) regulate the 437 proliferation of B cells and eosinophils. The AICDA gene that codes for activation induced 438 cytidine deaminase enzyme, also facilitates the class-switch of B cells into IgA+ B cells. Two pathways are enriched in the TNF signalling pathway involving genes TNFRSF1A 443 and TNFRSF1B that code for the TNFR1 and TNFR2 receptors respectively. Each receptor, 444 when bound with their respective ligands activate two separate pathways with different 445 functions. When the TNFR1 receptor is activated, a signalling cascade activates the MAPK 446 signalling pathway cascade which is mediated by the expression of the TAK1-TAB1 kinase 447 complex coded by genes MAP3K7 and TAB1. The activation of the MAPK signalling pathway 448 ultimately regulates the expression of Transcription factor AP-1 (coded by JUN), which is 449 known to play a role in apoptosis, cytokine regulation, and leukocyte activation (Foletta, Segal & 450 Cohen, 1998;Kappelmann, Bosserhoff & Kuphal, 2014). The activation of the TNFR2 receptor 451 by the ligand LTA activates a signalling cascade that is mediated by TNF receptor associated 452 factors 1 and 3 (coded respectively by TRAF1 and TRAF3). The TNF receptor associated factors 453 then recruit Apoptosis inhibitors 2 and 3 (coded by BIRC2 and BIRC3) to ensure cell survival 454 (Wang et al., 2012). Out of the three tissues, thymus has the highest number of pathways that are specifically 459 enriched with genes. A total of 984 genes were distributed across 12 pathway categories as 460 analysed with DAVID Bioinformatics Resources using the Macaca fascicularis genome 461 reference as the background. Five immune-related pathways were identified, which include Ras 462 signalling pathway, Rap1 signalling pathway, RNA degradation, N-Glycan biosynthesis 463 pathway, and Protein processing in endoplasmic reticulum. Two enriched pathway categories, 464 the Ras signalling pathway and Rap1 signalling pathway, each had 142 and 140 genes expressed 465 in their pathways. The Ras signalling pathway functions as a binary molecular switch to regulate 466 cellular processes such as cell proliferation, migration, differentiation, or growth. Several other 467 pathways are activated by the Ras pathway, and based on the enrichment analysis of the 468 expressed macaque thymus genes, the Mitogen-activated protein kinase (MAPK) cascade, and 469 the Rap1 signalling pathway were both activated by the Ras signalling pathway. The Rap1 470 signalling pathway also acts as a molecular switch to activate cellular processes including the 471 MAPK cascade. These linked pathways suggest a role in the proliferation and differentiation of 472 T cells in the thymus, as the T cell receptor signalling pathway precedes both the Ras and Rap1 473 signalling pathways. The expressed genes as illustrated in Figures S12 and S13 suggest that the T 474 cells were undergoing proliferation and differentiation in the individual macaques, however the 475 lack of expressed genes enriched in the T cell receptor signalling pathway and also the lack of 476 the activation of αβ T cell receptor (TCR) in both Ras and Rap1 signalling pathways further 477 suggests that the proliferation and differentiation of the T cells was not caused by interactions 478 with foreign antigens. The Ras signalling pathway is also known to activate apoptosis functions 479 in a cell, and based on the expression of genes RASSF1, RASSF5, and STK4 in the Ras signalling 480 pathway, it is suggested that the apoptosis function is switched on and likely functions in the cell 481 death of thymocytes that fail the positive and negative selections during the developmental 482 process.  In the lymph node tissue, CLEC4G, CCL20, SHOX2, SHISA3, and LOC102127387 are 534 the top five most expressed lymph node-specific genes. Gene CLEC4G is expressed in liver and 535 lymph node sinusoidal endothelial cells, and codes for C-type lectin domain family 4 member G 536 protein. Previous studies have identified the protein as a glycan-binding receptor, and is 537 suggested to be involved in cell-cell adhesion processes and acts as a receptor for antigen 538 clearance in lymph nodes (Liu et al., 2004). CCL20 codes for chemokine C-C motif ligand 20 539 that acts as a ligand for chemokine C-C receptor CCR6. The CCL20-CCR6 ligand-receptor pair 540 is responsible for the chemotaxis of lymphocytes, neutrophils, proinflammatory IL17 producing 541 helper T-cells (Th17), and the regulatory T-cells in response to inflammation (Frick et al., 2016). 542 SHOX2, a gene which codes for Short stature homeobox protein 2, is a transcriptional factor 543 involved in various embryological development processes (Branchi et al., 2016), and is 544 implicated in Turner syndrome, whereby gene knockout of SHOX2 in mice results in skeletal 545 growth abnormalities and short stature phenotypes (Gu et al., 2008). SHISA3, which codes for 546 protein shisa-3 homolog, is a member of the Shisa family of proteins that modulate WNT and 547 FGF signalling in developmental processes (Chen et al., 2014).