A tree frog (Boana pugnax) dataset of skin transcriptome for the identification of biomolecules with potential antimicrobial activities

Increases in the prevalence of multiply resistant microbes have necessitated the search for new molecules with antimicrobial properties. One noteworthy avenue in this search is inspired by the presence of native antimicrobial peptides in the skin of amphibians. Having the second highest diversity of frogs worldwide, Colombian anurans represent an extensive natural reservoir that could be tapped in this search. Among this diversity, species such as Boana pugnax (the Chirique-Flusse Treefrog) are particularly notable, in that they thrive in a diversity of marginal habitats, utilize both aquatic and arboreal habitats, and are members of one of few genera that are known to mount a robust immunological response against the fungus Batrachochytrium dendrobatidis, which has decimated the population of frogs worldwide. To search for molecules with potential antimicrobial activity, we have assembled and annotated a reference transcriptome from the skin of four wild captured B. pugnax from Antioquia, Colombia. Analysis of potential antimicrobial and immunological components was performed using ontology analyses, we identified several antimicrobial chemokines with particularly strong potential for exhibiting broadscale antimicrobial activities, as well as several genes related to rapid alteration of transcriptional (KRAB zinc finger protein) and phosphorylation (MAPK) responses to exogenous stressors. We also found eight families of transmembrane transport proteins, including sodium, potassium and voltage-dependent calcium channels, which will be invaluable in future studies aimed at more precisely defining the diversity and function of cationic antimicrobial peptides with alpha-helical structures. These data highlight the utility of frogs such as Boana pugnax in the search of new antimicrobial molecules. Moreover, the molecular datasets presented here allow us to expand our knowledge of this species and illustrate the importance of preserving the vast potential of Colombian biodiversity for the identification of useful biomolecules.


a b s t r a c t
Increases in the prevalence of multiply resistant microbes have necessitated the search for new molecules with antimicrobial properties. One noteworthy avenue in this search is inspired by the presence of native antimicrobial peptides in the skin of amphibians. Having the second highest diversity of frogs worldwide, Colombian anurans represent an extensive natural reservoir that could be tapped in this search. Among this diversity, species such as Boana pugnax (the Chirique-Flusse Treefrog) are particularly notable, in that they thrive in a diversity of marginal habitats, utilize both aquatic and arboreal habitats, and are members of one of few genera that are known to mount a robust immunological response against the fungus Batrachochytrium dendrobatidis , which has decimated the population of frogs worldwide. To search for molecules with potential antimicrobial activity, we have assembled and annotated a reference transcriptome from the skin of four wild captured B. pugnax from Antioquia, Colombia. Analysis of potential antimicrobial and immunological components was performed using ontology analyses, we identified several antimicrobial chemokines with particularly strong potential for exhibiting broadscale antimicrobial activities, as well as several genes related to rapid alteration of transcriptional (KRAB zinc finger protein) and phosphorylation (MAPK) responses to exogenous stressors. We also found eight families of transmembrane transport proteins, including sodium, potassium and voltage-dependent calcium channels, which will be invaluable in future studies aimed at more precisely defining the diversity and function of cationic antimicrobial peptides with alpha-helical structures. These data highlight the utility of frogs such as Boana pugnax in the search of new antimicrobial molecules. Moreover, the molecular datasets presented here allow us to expand our knowledge of this species and illustrate the importance of preserving the vast potential of Colombian biodiversity for the identification of useful biomolecules.
© • There are few transcriptomic data on Colombian frogs and frogs from the family Hylidae. Such data will aid in understanding the evolution of mechanism that contribute to protection against pathogenic microorganisms and bioprospecting of peptides with antimicrobial activity. • This study will provide useful comparative data for similar studies in other frogs and in the search for antimicrobial agents in Boana pugnax .

RNA sequencing and transcriptome assembly
Sequencing yielded a total of 809,394,848 raw reads. Prior to assembly ( Fig. 1 ), adapter sequences and ribosomal RNAs were filtered from the population of raw reads leaving a total of 573,994,007 clean reads (Specifications table). Assembly of these reads yielded a collection of 1,136,865 transcripts with an average GC content of 43.0%, which is similar to that of other frogs such as Rana temporaria 44% [1] and Oreobates cruralis 45% [2] . The alignment-free abundance estimation methods Kallisto was used to filter the transcripts based on expression values [4] . A total of 1,008,042 transcript were retained after filtering the most highly expressed isoform per gene and from those 252,491 were transcriptionally supported isoforms ( > 1 TPM, Transcripts per Million). The quality of the assembly was evaluated using BUSCO [5] and by calculating the E90N50 (N50 for the top 90% expressed transcripts), resulting in a value of 600 bp ( Fig. 2 )  Fig. 3. Number of transcripts aligned to various databases. There were eight databases with 104,603,835 proteins used to align the transcripts. The largest number of them was aligned to the COG/KOG database, followed by the Xenopus database that includes the two species, X. laevis and X.tropicalis . Other databases with an alignment of 10,0 0 0 transcripts were Swissprot, PDB and Pfam. Among the databases with the lowest number of transcripts are KEGG and the mitochondria of NCBI.

Annotation
Using Transdecoder, we identified 112,473 candidate open reading frames. Putative homologs were identified for 72,018 of these by alignment to multiple databases. The largest number of predicted protein alignments were found in searches against the COG / KOG database with 17,047, followed by Xenbase with 13,174, Pfam, PDB, and Swissprot with 10,0 0 0, NCBI nonredundant 8850, KEGG with 2461 and Mitochondria with 486 as shown in Fig. 3 Building on homology-based annotations, functional annotations were carried out to more fully characterize functions represented within the skin and to specifically identify genes and pathways associated with innate defense. For this we used PANTHER (information from 4242 genes), EggNOG (information from 17,047 genes) and KEGG (information from 2461 genes), ( Fig.  4 ).
In the analyzes of proteins involved in the immune system pathways using PANTHER pathways, we found 21 proteins, of these 17 (81%) are related to the innate immune response and four (19%) to the adaptive immune response ( Fig. 5 ). Pathways related to innate immunity included several genes related to inflammation mediated by cytokines and chemokines and the MAP kinase signaling pathway, which integrates transduction of stress responses to mediate transcription. Among the proteins associated with chemokine signaling were several receptors of chemokines, including: C-X-C chemokine receptor type 2 (CXCR2, receptor of CXCL1 and CXCL2), C-X-C chemokine receptor type 3 (CXCR3, receptor of CXCL9 and CXCL10), C -C chemokine receptor type 7 (CCR7), C -C chemokine receptor type 8 (CCR8), and C -C chemokine receptor type 9 (CCR9), and the GTPase KRas. The latter encodes for the RAS protein that is part of a signaling pathway known as the RAS / MAPK pathway (Furukawa, 2015). The NOD-like receptor signaling pathway, complement, and coagulation cascades and platelet activation pathways were also highly represented. These pathways share a common feature in that they all contribute to the first line of defense against pathogens and other invaders, the innate response [5,6] .

Protein domain identification using Pfam
To supplement functional information gathered from homology-based studies, we also performed analyses to identify enriched domains within the skin transcriptome. In general, the most frequent domains in the transcriptome were those involved in transcriptional regulation, signal transduction pathways of the immune system and cellular proliferation ( Table 1 ). The 10  Putative genes related to pathways and its association to immune response. The pathways related to the immune system to which these genes were associated mainly to chemokine signaling pathway. These components are associated with 81% of the innate immune system and 19% of the adaptive system.

Transmembrane transport complexes with alpha helical structure
We found 32 families of proteins and 8 classes of proteins within the search for transmembrane transport complexes in the transcriptome of B. pugnax ( Table 2 ). Among these are the transmembrane receptor regulatory/adaptor protein (PC00226), the voltage-gated calcium channel (PC00240), the voltage-gated potassium channel (PC00242), the voltage-gated sodium channel (PC00243), the ion channel (PC00133), the ligand-gated ion channel (PC00141), the GABA  receptor (PC0 0 023) and the acetylcholine receptor (PC0 0 037). All of these gens ae predicted to possess transmembrane alpha helical secondary structures ( Fig. 7 ).

Ethics approval
The Ethics Committee for Animal Experimentation of the University of Antioquia approved this project, under endorsement 569. Specimen collection and access to genetic resources was approved by the Ministry of Environment on September 24, 2015 for non-commercial research number 119. Collections are deposited under record 080 at the Museum of Herpetology University of Antioquia (MHUA) within the biological collections of the Alexander von Humboldt Institute.

Sample collection
Four individuals of the species Boana pugnax ( B. pugnax) were collected in Vegas de la Clara (University of Antioquia property)latitude 66.815.224 longitude -752.194.104, Antioquia, Colombia, on June 13, 2016 at an average temperature of 25 °C. The individuals had an average weight of 9.11 g (10 g, 6 g, 9 g, 11.47 g) and an average length of 6.25 cm (6 cm, 5.5 cm, 6.5 cm, 7 cm). Animals were transported to the laboratory in sterile plastic bags at 25 °C.

Sample preparation for RNA-seq
Frogs were anestietized and euthanized by intercardial injection of 200 mg/kg buffered tricaine, after which the dorsal and ventral skin was removed under sterile conditions and deposited in 2 mL vials with 1.5 mL RNAshield (ZYMO REASEARCH) to avoid RNA degradation. We designated each frog as frog 1, 2, 3 or 4, with 2 samples collected per individual, totaling 8 samples. Pooling biological replicate RNA samples, such as those derived from a number of experimentally similar animals, may retain biological information, while reducing the cost of sequencing [6,7] . We therefore mixed the tissues of frog 1 with 2 and 3 with 4, prior to RNA extraction for a total of 4 samples (frog 1-2 ventral, frog 1-2 dorsal, frog 3-4 dorsal and frog 3-4 ventral).
We used a TissueLyser II R Qiagen to homogenize the tissues at 30 oscillations/second for 10 min. After tissue maceration, we add 1 mL TRIZOL R (Life Technologies) to each sample for the extraction of RNA. Total RNA was resuspended in 300 μL of Shield R Zymo RNA stabilization buffer and shipped to Macrogen for library preparation and sequencing. The integrity of the RNA (RIN) was verified using an Agilent 2100 Bioanalyser (Agilent Technologies) and all samples yielded RIN scores > 8, which was considered of sufficiently high quality for library preparation [8] . The construction of 4 barcoded the TruSeq (Ilumina) cDNA libraries was carried out according to manufacturer protocols and the resulting libraries were paired-end sequenced (2 × 100 bp) using an Illumina HiSeq 40 0 0 with a depth of 40 million reads per sample.

Read filtering
Verification of read quality was carried out using the FASTQC software [8] https://www. bioinformatics.babraham.ac.uk/projects/fastqc/ ). Reads were subsequently trimmed to remove sequencing adapters and ribosomal RNA in order to improve the accuracy of transcriptome assembly ( Fig. 1 ). Prior to assembly, reads were filtered using SORTMERNA-v2.1 [9] in order to remove reads derived from ribosomal RNAs. This filtering step used the databases SILVA 16S bacteria, SILVA 16S archaea, SILVA 18S eukarya, SILVA 23S bacteria, SILVA 187 23s archaea, SILVA 28S eukarya, Rfam 5S archaea / bacteria, Rfam 5.8S eukarya. We then removed sequencing and PCR adapters with TRIMMOMATIC-v0.32 [10] . Finally, FASTQC was used to review post filtered read quality before proceeding with de novo assembly of the transcriptome.

Transcriptome assembly
Filtered reads were assembled using Trinity v2.9.0 [11] . Assembly completeness was assessed using BUSCO-v1 to identify conserved orthologs in the Vertebrata, Eukaryota and Metazoa databases [5] . Following the pipeline reported by Trinity ( https://github.com/trinityrnaseq/ trinityrnaseq/wiki/Trinity-Transcript-Quantification ) the alignment-free abundance estimation methods Kallisto was used to filter the transcripts based on expression values [3] . The transcriptome of B. pugnax was deposited in GenBank under SRA accession number: SRP151854 (BioProject PRJNA476387). Coding regions were predicted based on the identification of the longest ORF using TransDecoder-v3 [11] .

Transcript abundance and expression level analyses
Sequence reads generated from dorsal and ventral skin samples were aligned to the reference transcriptome using Bowtie2 [16] , and RSEM (RNA-Seq by Expectation Maximization) was used to obtain estimates of transcript abundance for all transcripts [17] . Expression levels were calculated as transcripts per million (TPM). Differential expression of genes was evaluated statistically using EBseq. Transcripts were considered differentially expressed between samples (ventral vs dorsal) with FDR (False Discovery Rate) < 0.05. This list of differentially expressed genes was used for analyses of gene ontology enrichment using EggNOG, PantherDB and KEGG pathways as described above.

Identification of transmembrane transport complexes with alpha helices
We used Panther to identify transmembrane transport proteins that will serve as a template to identify new antimicrobial peptide sequences. Once the proteins were obtained, we search their structure in the protein data bank ( https://www.rcsb.org/ ) to select those with alpha helices in the transmembrane region.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.