Improved annotation of Lutzomyia longipalpis genome using bioinformatics analysis

Lutzomyia longipalpis, a sand fly, is a vector-spreading pathogenic protozoan in the New World. MicroRNA (miRNA) is evolutionarily-conserved non-coding RNA, which plays critical roles in various biological processes. To date, the functions of most proteins in L. longipalpis are unknown, and few studies have addressed the roles of miRNAs in this species. In the present study, we re-annotated the protein-coding genes and identified several miRNAs using a set of comparative genomics tools. A large number of L. longipalpis proteins were found to be homologous with those in the mosquito genome, indicating that they may have experienced similar selective pressures. Among these proteins, a set of 19 putative salivary proteins were identified, which could be used for studying the transmission of Leishmania. Twenty-one novel miRNAs were characterized, including two miRNAs, miR-4113-5p and miR-5101, which are unique to L. longipalpis. Many of the targets of these two genes were found to be involved in ATP hydrolysis-coupled proton transport, suggesting that they may have important roles in the physiology of energy production. Topology analysis of the miRNA-gene network indicated that miR-9388-5p and miR-3871-5p regulate several critical genes in response to disease development. In conclusion, our work provides a basis for improving the genome annotation of L. longipalpis, and opens a new door to understanding the molecular regulatory mechanisms in this species.


INTRODUCTION
Lutzomyia longipalpis, commonly known as the sand fly, is a major blood-feeding vector for the transmission of Leishmania (Teixeira et al., 2018). When a sand fly bites, it introduces saliva with infective promastigotes into the host. Leishmania living in the intestine of the sand fly can evade the pro-oxidative responses of L. longipalpis (Cunha et al., 2018). Once infected with Leishmania, an individual exhibits fever and hyperglobulinemia (Salomon et al., 2015). Currently, no effective vaccine is available, and increasing drug resistance has been reported for this disease (Batista et al., 2016). The control of L. longipalpis will be important in the foreseeable future.

Protein function annotation
Because most of the proteins are annotated as ''uncharacterized protein'', re-annotation of L. Longipalpis is needed. From the SwissProt protein dataset, we extracted animal sequences. Pairwise sequence alignment was conducted between L. Longipalpis and the SwissProt-Animal database using BLAST with a cutoff E-value ≤ 1e−5. For sequences whose homologs were not identified by BLAST, we applied our newly-developed package SSEalign (Yang & Hu, 2018) to detect homology. This algorithm aligns the secondary structure elements of two proteins. Because the protein structure is relatively conserved compared with the primary sequence, SSEalign can identify protein homologs in distantly related species. Subsequently, the functions of the UHP proteins in L. longipalpis were re-annotated. After functional annotation of L. longipalpis proteins, protein homology was compared with those of other insects, such as mosquito (Aedes aegypti) and fruit fly (Drosophila melanogaster). Dillon et al. (2006) investigated EST sequences to identify protein functions involved in insect-parasite relationships. The whole genome annotation of L. longipalpis was released by Baylor College of Medicine (2012). The transcriptomes of L. longipalpis were sequenced by Abrudan et al. (2013) and the sand fly proteins were compared with their homologs in P. papatasi. Because of the importance of these studies, we have compared our annotation with their results. Previous studies have shown that several proteins found in the saliva of L. longipalpis contribute to the enhancement of Leishmania pathogenesis. The newlyannotated salivary proteins are discussed in this paper.

miRNA identification
The miRNA is a type of conserved non-coding RNAs which plays a critical role in the regulation of transcription. We aligned the sequences of known animal miRNAs against EST sequences of L. longipalpis using the tool Bowtie (Langmead et al., 2009) with a mismatch toleration of ≤ 2. The upstream and downstream sequences of the miRNA candidates were identified and could be regarded as the transcribed precursor miRNA (pre-miRNA) fragment of the correspondent miRNAs. The pre-miRNA sequences were aligned against the NCBI non-redundant (nr) protein database using BLASTx to remove the protein-coding sequences for subsequent analysis. The secondary structures of captured pre-miRNAs were predicted using the minimum free energy (MFE) theory by RNAfold (Gruber, Bernhart & Lorenz, 2015). Those sequences meeting the following rigorous criteria were considered to be novel miRNAs in this study: (1) the pre-miRNA could fold into an appropriate stem-loop hairpin structure; (2) no more than five unpaired nucleotides were present in hairpin structure; (3) the mature miRNA sequences were in one arm of the hairpin structure.

Target prediction
The miRNAs bind to the three-prime untranslated region (3 -UTR) of the messenger RNA (mRNA) to regulate the genes involved in cell processes. All known 3 -UTR sequences were downloaded from the UTRdb database (Grillo et al., 2010). Each gene and miRNA pair was identified, and the interaction potential of the pairs were checked using the tool RNAhybrid (Kruger & Rehmsmeier, 2006). The miRNA-gene pairs with a p-value ≤ 0.01, and minimal free energy ≤ −25 kcal/mol were considered to be real interactions in this study. The targeted genes were analyzed using gene ontology (GO) enrichment through the PANTHER system (Mi et al., 2019). The genes overrepresented in molecular function, cellular component and biological process were identified and further analyzed.

Species comparison
All known insect miRNAs were extracted from the miRBase database and compared to the L. longipalpis miRNAs. The numbers of homologous miRNA in different insects were calculated and the distribution was plotted. Those miRNAs that were present in L. longipalpis but absent in other insects were identified and defined as ''unique miRNAs'' in L. longipalpis. The target genes of unique miRNAs were further analyzed to uncover the potential mechanism.

Network analysis
An miRNA-gene network of L. longipalpis was constructed using Cytoscape (Kohl, Wiese & Warscheid, 2011). Topology analysis of the network identified biomarkers in the proteinprotein interaction network of disease (Yan et al., 2016). The nodes and edges of the miRNA-gene network were then analyzed. Several critical miRNAs were picked, and the possible mechanism was investigated.

Homolog distribution of proteins
We compared the proteins of L. longipalpis and other insects, including the mosquito (A. aegypti) and fruit fly (D. melanogaster). Interestingly, 88.2% (9312/10552) of the L. longipalpis proteins had homologs in A. aegypti, while 83.1% (8766/10552) of L. longipalpis proteins shared homology with D. melanogaster (Fig. 2). The high proportion of homology between L. longipalpis and mosquito indicates that there may have been similar selective pressures on the two species.

Comparisons with previous studies
A preliminary version of the L. longipalpis whole genome was published by the Baylor College of Medicine (Baylor College of Medicine, 2012). In this genome, 10,729 proteincoding genes were predicted using a eukaryotic genome annotation pipeline. However, 8,829 coding proteins remain annotated as ''uncharacterized protein'' (UHP), indicating that their functions are unknown. Using several bioinformatics tools, the functions of 6,213 UHPs were identified in this study (Table S1). Dillon et al. have obtained a library of 10,203 ESTs of L. longipalpis. Among these ESTs, 5,962 sequences were found to meet the threshold of E-value ≤1e−5 using BLASTx. This number is slightly smaller than our identified 6213 proteins. Dillon et al. have compared the proteins in the Drosophila and Anopheles databases, and have estimated sequence similarities of 44.7% and 45.9%, respectively. As shown in the paragraph regarding 'Homolog distribution of proteins', we found that 83.1% and 88.2% of L. longipalpis proteins were homologous to fruit fly and mosquito genomes, respectively, a higher number than those identified in Dillon et al. 's (2006) study. Abrudan et al. (2013) have obtained a total of 8Mb of transcriptome in L. longipalpis and assembled the reads into contigs. By searching against non-redundant protein (nr) and InterPro databases, 4,411 proteins were found to be associated with known protein categories. The level of Abrudan et al.'s (2013) annotation result is smaller than our results. The major methodological difference in our work is that we applied a combination of primary sequence alignment using BLAST and secondary structure comparison using SSEalign. Our work is also based on the more precise SwissProt database, and we have checked the protein secondary structures for the homology in the twilight zone. Therefore, our method has higher precision than those of Dillon et al. 's (2006) and Abrudan et al.'s (2013) studies.

miRNA identification
In the current version of the L. longipalpis genome, no miRNAs have been reported. Thus, the identification of miRNAs was important. Mature miRNAs are highly conserved across different species, which could facilitate miRNA identification using computational methods. A set of 48,885 miRNAs were identified in the miRBase database, from which 37,808 animal miRNAs were extracted. A total of 35,918 non-redundant EST sequences of L. longipalpis were obtained from the NCBI database. The animal miRNA sequences were searched against EST sequences using Bowtie (Langmead et al., 2009). A total of 165 EST sequence fragments were found with slight divergence (mismatch ≤ 2) with known animal miRNAs, and these fragments were considered as miRNA candidates. The secondary structures of the precursor sequences of these miRNA candidates were analyzed using RNAfold. After filtering by a series of strict criteria as described in the 'Material and Methods' section, 21 candidates remained, which were considered to be miRNAs from L. longipalpis. To check the effectiveness of these miRNAs, we downloaded the small RNA sequencing data (accession number SRP055920) of L. longipalpis and tested the miRNA expression. Results showed that all of our miRNAs were present in this transcriptome data, indicating that our method is reliable. The characteristics of L. longipalpis miRNAs, including length, MFE and frequency of MFE (fMFE) are listed in Table 1. The length of the miRNAs ranged from 18nt to 24nt, a length frequently observed in known miRNAs. MFE and fMFE discriminate miRNAs from other non-coding RNAs. A negative MFE value indicates the thermodynamical stability of the RNA sequence, while fMFE value reflects the Boltzmann-weighted probability of forming a thermodynamically stable structure (Trotta, 2014). In this study, the MFE value of pre-miRNAs ranged from −79.28 kcal/mol to −20.33 kcal/mol, which is much lower than a similar study of miRNA identification in fish (Huang et al., 2015). The fMFE values were also low, indicating that the identified sequences are true miRNAs in L. longipalpis.
The distribution of nucleotide content in the miRNAs was not uniform. The nucleotides A and U were more predominant than G and C. A lower GC content suggests a higher possibility of a pre-miRNA folding into a hairpin structure (Carmel, Shomron & Heifetz, 2012). The 21 identified miRNAs were categorized into 20 different families. Only one miRNA family, miR-2545, had more than one member: llo-miR-2545a-3p and llo-miR-2545b. No miRNAs were identical to their homologs in other species, a finding which needs to be further investigated in the future.

Target prediction
Multiple genes contribute to miRNAs participating in biological processes. Target identification is therefore essential for understanding their functions. A set of 166 targeted genes (Table S2) were identified by tool RNAhybrid in this study. After removing duplicated in the dataset, only 143 unique genes were observed. Several transcription factors were observed in the targeted genes, including the TATA-box-binding protein Tbp, and a Fork head domain-containing protein Fd96ca. We conjecture that these transcription factors could play a critical role in signaling transmission in L. longipalpis. A detailed investigation is critical to understanding the role of miRNAs in L. longipalpis. Thus, GO enrichment analysis of the identified target was carried out using PANTHER analysis. The top GO terms in biological process, cellular component and molecular function, were recorded (Fig. 3). In the biological process group, the miRNA-regulated genes were highly involved in translation (GO:0006412), oogenesis (GO:0048477), and axon guidance (GO:0007411). In the molecule function group, metal ion binding

Species comparison
After searching the miRBase database, we found 43% (9/21) of the L. longipalpis miRNAs were homologous to miRNAs in Tribolium castaneum (red flour beetle), while this vector showed lower homology with Drosophila melanogaster (fruit fly) (Fig. 4). These results suggest that the characteristics of L. longipalpis are more similar to those of beetles than to fruit fly. Two miRNAs, llo-miR-4113-5p and llo-miR-5101, were found to have no homology with other insects and were considered to be unique to L. longipalpis. The target genes of these unique miRNAs were enriched in ATP hydrolysis-coupled proton transport (GO:0015991). Network analysis showed both miRNAs target genes coding V-type ATPases that regulate proton transportation across the plasma membrane. We suggest that the presence of llo-miR-4113-5p and llo-miR-5101 in L. longipalpis raises the possibility of selectively targeting V-type ATPase complexes with specific inhibitors. The target genes of miRNAs were compared with their homologs in other insects. Results showed that a high number of genes were shared with D. melanogaster (Fig. 5).

Network analysis
To investigate the regulation network of miRNAs in L. longipalpis, we constructed an miRNA-gene network using Cytoscape (Fig. 6). Thirteen miRNAs and 166 edges were included in this network. Most genes were found to be co-regulated by more than two miRNAs, and these genes are understood to work systematically through co-regulated miRNAs. Several ribosomal proteins have been found to be regulated by multiple miRNAs, including the 60S ribosomal protein L3 (Rpl3) and the 60S ribosomal protein L13a (Rpl13a). The ribosome carries out a vital cellular function, the synthesis of proteins from amino acids, and these miRNAs could play an essential role in cell growth. Several critical miRNAs, including llo-miR-9388-5p and llo-miR-3871-5p, were found to target genes  involved in Leishmania-related processes, llo-miR-9388-5p was found to target nuclear cap-binding protein Cbp80 with a p-value of 7.2E-4. It has been reported that Cbp80 could form the cap-binding complex eIF4G to affect translation control in Leishmania (Yoffe et al., 2009). llo-miR-3871-5p has also been found to target the calpain protein Sol, with a p-value of 8.1E-4. The calpain inhibitors have been known as leading compounds for the treatment of leishmaniasis, suggesting an indirect role of llo-miR-3871-5p in Leishmania transmission (Ennes-Vidal et al., 2017). These results suggest that llo-miR-9388-5p and llo-miR-3871-5p are critical molecules involved in Leishmania-induced infection and

DISCUSSION
Leishmaniasis is one of the most important parasitic diseases on Earth, and L. longipalpis is the main vector for the spread of this disease. The incidence of leishmaniasis is Figure 6 The layout of the miRNA-gene network. This network was constructed by tool RNAhybrid with cutoff p-value ≤ 0.01. A set of 13 miRNAs and 143 unique genes are shown. A triangle represents a miRNA, while an elliptical node represents a gene. For clarity, genes with no interaction were not shown. Full-size DOI: 10.7717/peerj.7862/ fig-6 gradually increasing due to a range of risk factors. Leishmaniasis can be controlled by managing L. longipalpis. In recent years, DNA sequencing technology has provided a deeper understanding of the genome of L. longipalpis. A set of 10,552 proteins is reported in its currently-published genome, among which 8,829 proteins are labeled as ''uncharacterized protein'' or ''hypothetical protein'' (UHP); indicating that their functions are unknown. By using the sequence alignment software BLAST and SSEalign, we found that 6,213 UHP homologs could be found in other species.
Using PANTHER enrichment analysis, we found that these 6,213 genes were significantly enriched in 12 metabolic pathways, using a cutoff p-value of ≤ 0.001 (Table 2). Among these, 46 genes were members of the Wnt signaling pathway, and 27 genes were part of the Nicotinic acetylcholine receptor signaling pathway. We also found that genes in two vascular-associated metabolic pathways, the PDGF signaling pathway and angiogenesis, was statistically significant enriched, with p-values of 1.23E-5 and 2.66E-4, respectively. These two pathways included several important regulatory factors, including Ets4 and Ets6. The ETS transcription factors are believed to be expressed during vascular development, and may function in the formation and maintenance of blood lineages (Craig & Sumanas, 2016), suggesting a special role in angiogenesis in the L. longipalpis genome.
Based on these results, we identified several important protein groups, including salivary proteins. L. longipalpis saliva has been reported to modify the bite site environment and play a major role in Leishmania transmission (Tavares et al., 2011;Martin-Martin et al., 2018). A set of 19 salivary proteins were identified, including salivary protein Mys2 and salivary plasminogen activator Urtg (Table 3). The Mys2 proteins were previously identified in Sitobion avenae, and were reported as effective RNAi targets in vivo (Wang et al., 2015). The plasminogen activator Urtg has been implicated as an effector in the nervous system of the vampire bat (Courtney et al., 2005). These salivary proteins could be used to study Leishmania transmission. Using the EST-based approach, we identified 21 new miRNAs in L. longipalpis. The MFE values of these miRNAs ranged from −79.28 kcal/mol to −20.33 kcal/mol, significantly lower than the average MFE values of known miRNAs, indicating that the results of the miRNA identification are highly reliable. Though a target prediction pipeline, we found that the target genes of these miRNAs are significantly enriched in metal ion binding functions, indicating a possible role in proton transportation. By comparison with other insect miRNAs, we found llo-miR-4113-5p and llo-miR-5101 to be unique in the L. longipalpis genome. These two miRNAs were found to regulate V-type ATPase complexes, indicating a possible proton role in transportation across the plasma membrane.

CONCLUSIONS
Lutzomyia longipalpis is a biological vector transmitting the protozoan Leishmania in the New World. In this study, we re-annotated L. longipalpis genome using bioinformatics tools. The functions of UHP proteins were assigned and novel miRNAs were identified. We found a high proportion of homology between L. longipalpis and other vectors, indicated that they were probably subjected to similar selective pressures. The functions of 6,213 previous un-annotated proteins were characterized. Twenty-one novel miRNAs were identified based on their EST sequences. The target genes of these miRNAs were predicted and enriched using RNAhybrid and PANTHER. We found that these targeted genes were highly enriched in metal ion binding, ATP binding and actin filament binding. Two miRNAs, llo-miR-4113-5p and llo-miR-5101, were found to be unique to L. longipalpis Table 3 Identified salivary proteins in L. longipalpis. All these proteins were annotated with ''uncharacterized protein'' in previous version. when compared to other insect miRNAs. Both miRNAs target V-type ATPases involved in proton transportation across the plasma membrane, suggesting potential roles in energy production. Another two miRNAs, llo-miR-9388-5p and llo-miR-3871-5p, were found to target nuclear cap-binding protein Cbp80 and calpain protein Sol, respectively. These proteins have been previously reported to be involved in the progression of Leishmania infection. In summary, our results lay the basis for the discovery of the function of protein-coding and non-coding genes in L. longipalpis genome, and for understanding the molecular regulatory mechanisms in this species.