Virome analysis for identification of novel mammalian viruses in bats from Southeast China

Bats have been shown as important mammal resevoirs to carry a variety of zoonotic pathogens. To analyze pathogenic species in bats from southeast coastal regions of China, we performed metagenomic sequencing technology for high throughput sequencing of six sentinels from southeast coastal area of China. We obtained 5,990,261 high quality reads from intestine and lung tissue of 235 bats, including 2,975,371 assembled sequences. 631,490 reads predicted overlapping sequences for the open reading frame (ORF), which accounts for 2.37% of all the sequences (15,012/631,490). Further, the acquired virus sequences were classified into 25 viral families, including 16 vertebrate viruses, four plant viruses and five insect viruses. All bat samples were screened by specific PCR and phylogenetic analysis. Using these techniques, we discovered many novel bat viruses and some bat viruses closely-related to known human/animal pathogens, including coronavirus, norovirus, adenovirus, bocavirus, astrovirus, and circovirus. In summary, this study extended our understanding of bats as the viral reservoirs. Additionally, it also provides a basis for furher studying the transmission of viruses from bats to humans.

that the composition of bat viromes differs depending on geographical location and bat species. These studies show that the study of the bat viruses by the metagenomic analysis can be insightful.
The coastal wetlands southeast of China consist of growing ports, industrial districts, and port cities. Studies show the occurrence of natural focal diseases, such as Dengue fever and hemorrhagic fever with renal syndrome, in species-rich and densely populated southeast of China 15,16 . It is important to understand the distribution of pathogens in different animals across different habitats. An understanding of the natural habitat of bat-associated viruses can prevent newly emerging and re-emerging zoonoses. To expand these studies to southest China, Here we combined second-generation sequencing technology with meta-genomics to understand the outbreak of new infectious diseases caused by animal-origin pathogens and explore the unknown viruses from the natural environment, humans, and animals. In this study, we collected 235 bats from six locations in the southeastern coastal area of China from July 2015 to August 2015. Using Illumina platform for sequencing gut and lung tissue from bats, we detected a total of 25 species of the virus family, including norovirus, which was detected first. Also, we sequenced Myotis formosus for the first time from the southeastern coastal area and found that astrovirus. This work extended our understanding the diversity of bats harboring virsues and provide new clues to monitor these transmittable zoonotic viruses.

Results
Sampling. A total of 235 adult bats were captured live with nets near or in human dwellings from Zhoushan (ZS), Daishan (DS), Xiamen (XM), Changle (CL), Shishi (SS), and Lianjiang (LJ) between July to August in 2015 in Zhejiang and Fujian Provinces of China (Fig. 1). All bats looked healthy and had no obvious clinical signs at capture. Based on the mitochondrial cytochrome b gene, the bats can be divided into five species based on their muscle tissues: Rhinolophus ferrumequinum, Myotis formosus, Scotophilus kuhlii, Myotis davidii, and Rhinolophus pusillus. All intestine and lung samples were classified and combined into 12 pools and then subjected to metagenomic sequencing.
High throughput sequencing. The original data was processed with in-house scripts. After quality control and host read removal, a total of 5,990,261 reads were employed for further analysis. The read length was between 80 bp and 100 bp. For each sample, an average of 89.85 Mbp data were obtained (range: 48.16Mbp-167.31Mbp). The percentage of bases with a quality score of > = Q20 relative to the total bases in each sample ranged from 94.23% to 96.70% (mean percentage: 95.35%). The corresponding percentages of high-quality bases (> = Q30 and > = Q40) in each sample ranged from 87.57% to 92.34% (mean percentage: 89.76%) and from 62.41% to 72.80% (mean percentage: 67.57%), respectively. The filtered clean data matched with the NCBI bacterial genome, fungal genome, and virus database. Reads on the reference database were compared, and 327,159 reads were detected as bacterial sequences, 298,008 reads (4.93%) were detected as eucaryon sequences, 51 reads (0.00085%) were detected as archaeal sequences, and 2,180 reads (0.036%) were detected as virus sequences. 49.67% (2, Supplementary Table S1. Plant viruses accounted for 0.51% (77/15012) of annotated tags, which included the following four viral families: Luteoviridae, Totiviridae, Virgaviridae, and Phycodnaviridae. Phycodnaviridae accounts for the largest proportion of plant viruses (77%). Phycodnaviridae virus infection is widely present in fresh water algae or seaweed across the world.
It is worth mentioning that 35.27% virus sequences shared low similarity with known viral sequences. To verify the reliability of metagenomic sequencing results, six viruses (coronavirus, mastadenovirus, mamastrovirus, circovirus, norovirus, and bocavirus) closely related to human disease were selected, and nucleic acid from bat tissue samples of all 235 bats was amplified by PCR (  Table S1). All bat samples were amplified by a nested PCR which targeted a 440 bp fragment of conserved protein RdRp; 16 positive amplifications were detected from ZS, CL, and SS groups. The ZS group had seven out of 45, CL group had five out of 55, SS group had four out of 35 positive amplifications from Rhinolophidae family, including 14 cases of Rhinolophus pusillus and two cases of Rhinolophus ferrumequinum. Subsequently,16 positive amplifications were further sequenced and phylogenetically analyzed against Alphacoronavirus and Betacoronavirus samples (Fig. 3). Nine sequences shared 91.4%-99.8% identity to Betacoronavirus, SARS coronavirus SZ3, and SARS-related bat coronavirus in a large branch, while the remaining seven sequences belonged to the Alphacoronavirus, along with BtRf-AlphaCoV/YN2012, and Human coronavirus 229E in a large branch.
Detection and identification of Adenovirus. Adenoviruses (AdVs) are one of the most common viruses that infect humans and animals. These infections cause a variety of sub-clinical to lethal symptoms and outcomes 19 . In this viral metagenomic analysis, we found 21 tags showing identity with genus Mastadenovirus in group ZS, one tag showing identity with genus Mastadenovirus in XM group and 15 tags showing identity with genus Mastadenovirus in CL group (Supplementary Table S1). Further screening of intestine from the bat samples by performing PCR of a partial hexon gene sequence of mammalian adenovirus (766 nt) confirmed that three sequences existed in the intestine of Rhinolophus pusillus from the CL group but not the other two groups. At present, partial genomic sequences(10603 bp) of bat adenovirus CLR6 were obtained by using degenerated primers, including three major ORFs (E1B1, Iva2,Pol, Genbank:MF278269). Phylogenetic analysis based on a partial hexon gene sequence (Fig. 4) showed that bat adenoviruses, CLR1, CLR6, and CLR10 shared 80.0% to 98.2% nucleotide sequence identity to each other and 55.1% to 78.7% nucleotide sequence identity with bat adenovirus as previously reported in China and Germany [20][21][22] . Therefore, the three bat adenoviruses were termed as Mastadenovirus-related virus 1.

Detection and identification of Astroviruses.
Mamastrovirus in the family Astroviridae infects many mammals, including humans, and causes gastroenteritis. In our study, three tags related to the genus Mamastrovirus were obtained from CL group (Supplementary Table S1). We tested all the bat intestine by hemi-nested PCR using pan-bat astrovirus primers which targeted a 422 nt fragment of ORF1b 23 . We found that ZS, DS, and CL groups had one, one, and two positive amplifications in the intestine sample. The positive amplification rate in Rhinolophus pusillus was 1.6% (3/187) in ZS and CL groups. In group DS, Myotis formosus showed positive amplification in 4% (1/25) of samples, while all other bat species were negative. We sequenced four amplicons and identified four bat AstVs. These four bat AstVs showed 65.4% to 96.7% nucleotide sequence identity to each other and 61.6% to 87.1% nucleotide sequence identity with bat AstVs as reported previously in Hong Kong, Guangxi, and Germany (Fig. 5).    (Fig. 8). However, bat norovirus detected in this study could be further grouped into two branches with 66.3%-78.4% nt identity between them. Nine noroviruses amplified from ZS shared the highest identity (99%) with each other. The positive viruses amplified from Rhinolophus pusillus shared the highest identity (84%) with bat norovirus from YunNan of China,and they were likely a new norovirus species based on the new ICTV criteria.

Discussion
In this study, we collected samples for metagenomic analysis from southeast coastal areas of China as the region has vast coastline, abundant species, and suitable temperature. Bats are the only flying mammals which venture out every day and some of the species also retain seasonal migratory habits. Additionally, they as a viral reservoirs, by carrying viruses to different regions, including human habitats. A total of six sampling points along the southeast coast were selected, and the sequencing results were obtained for 2.37% tags ratio of assembled viral sequences. However, 95.73% of sequences were related to vertebrate virus which is far higher than previously reported percentages (8.97%-45%) 13,14,17 . This may be due to two reasons. First, reason includes the use of intestine and lung tissue from bats as samples for the virus metagenomics analysis; Before the advent of virus metagenomics, detection samples mostly consisted of bat faeces and throat swabs. This caused a large virus interference from the environment, and higher plant and insect virus ratio compared to this study 13,17,29 . In addition, due to the updated virus database, we had abundant reference sequences compared to previous studies. Through a complex comparison process, we annotated a total of 25 viruses, including 16 infected vertebrate viruses. This ratio is higher than previous research, but compared to the other studies fewer phage sequences were obtained. However, this result is limited by less sampling and may not reflect the real situation. Six viruses closely related to human pathogenic virus species were selected for PCR verification of all 235 bat specimens. We found that the virus carrying rates and PCR results were consistent with the metagenomic sequencing results. For example, coronavirus sequence was only detected in ZS, CL, and SS group in the metagenomic sequencing results, while 11 positive amplications of coronavirus were obtained in PCR, including six cases in ZS group, four cases in CL group, and two cases in the SS group. During PCR detection of the circovirus, positive amplifications were detected in the ZS group, CL group, and LJ group, while metagenomic sequencing results only in the CL group notes to one tags. These differences may arise when metagenomic samples and PCR samples are amplified in different parts of the intestinal tissue or in bats with low viral load. Overall, the metagenomic sequencing results can provide strong technical support to classify virus spectrum. At the same time, these results also show that different bats carry different viruses in different regions. Rhinolophus pusillus can carry the most variety of viruses, including bocavirus, coronavirus, adenovirus, iflaviruse, porcine circovirus, astrovirus, and norovirus. Bats can carry a variety of viruses at a given time and different kinds of bats could carry different specific viruses. For example, in a collection of 55 Rhinolophus in the CL, norovirus, coronavirus, circovirus, and adenovirus were tested positive in Rhinolophus pusillus, while in Rhinolophus ferrumequinum only coronavirus tested positive. For future studies,it would also be interesting to study how different viruses exist at the same time without causing a pathogenic situation in bats.
After Rhinolopus sinicus was suggested to be the natural reservoir of SARS coronavirus which caused the 2003 pandemic, it was further confirmed that SARS-like virus strain SHC014-CoV isolated in China has the ability to re-emerge from viruses circulating in bat populations using the SARS-CoV reverse genetics system 30 . Coronavirus sequence was detected in Rhinolophus pusillus and Rhinolophus ferrumequinum in our study, but not in other bat species in ZS, CL, SS groups. PCR and metagenomic sequencing results were consistent. Coronavirus is the main host for Rhinolophus bat in China which is also consistent with previous research. Coronavirus has a high positive rate (6.8%-18.2%), especially in ZS and CL groups due to the amplification of limited fragments. However, for future research we will focus on the positive samples which detect full-length sequence amplification of coronavirus in bats.
Various types of adenoviruses of the genus Mastdenovirus infect a range of different mammals and cause respiratory, ocular, and gastrointestinal diseases 31 . The acquired partial sequence of the adenovirus CLR1/6/10 hexon from the three bats has the closest relation to bat mastadenovirus WIV9 (only 78.7% at the nucleic acid level) and porcine adenovirus 5 (only 63.1% nucleotide sequence identities); Moreover, our findings showed very close genetic relationships among the three bat adenoviruses CLR1/6/10 in Rhinolophus pusillus; although, they were divergent from the current mammalian adenoviruses. This indicates that the three bat adenoviruses CLR1/6/10 are new mammalian adenoviruses. Previous studies have reported both high prevalence rate and genetic diversity of astroviruses (AstVs) in Chinese bats 23,32,33 . All the bat AstVs in this study were clustered in a large clade together with previously reported bat astroviruses 23,33,34 , and a large branch of bat AstVs were distinct from other branches of human or ovine AstVs. The astroviruses detected in this study exhibited various degrees of sequence similarity with those described in previous studies. For example, bat astrovirus BatAstv/ DSM1 showed 87.1% nucleotide sequence similarity with FJ571098 also found previously in Guangxi 34 , while bat Astv/CLR2, which we detected in Rhinolophus pusillus from CL, showed the nucleotide sequence similarity of only 66.6% with partial RdRp gene in known bat astroviruses, which suggested a novel species of bat astrovirus. Our findings suggest that astroviruses were genetically diverse not only within a single bat species but even within some individuals.
Among the known members of the family Circoviridae, PCV-2 is the only pathogen related to mammalian disease 35 . Circoviruses are commonly found in bats and show a large genetic diversity. Notably, Bt CV ZSR42 in our study shared the highest sequence identity with PCV, Bt CV CLR6, and Bt CV LJR22 clustered into avian Circovirus. This indicates that PCV has emerged, potentially as the result of a cross-species jump from birds into swine, most likely through intermediate contact with wild boars 36 . Bat bocaviruses were first identified in M. myotis in 2012 by Wu 13 . We first detected bat bocaviruses in the intestine of Myotis davidii and it was less than 85% homologous to known bat bocaviruses. Although large double stranded DNA viruses possess mutation rates far lower than RNA viruses, some small single-stranded DNA (ssDNA) viruses appear to mutate and have substitution rates closer to RNA viruses than double-stranded DNA viruses. These reports indicate that bocaviruses have potential cross-species transmission.
This study for the first time screened bats carrying norovirus in the coastal areas of China based on the metagenomic sequencing results. Norovirus sequences were detected in intestinal samples of Rhinolophus pusillus in ZS and CL groups and norovirus carrying rate reached 20% in ZS group. ZS has food rich in insects for bats. It is also easy for norovirus carrying bats to breed through faeces or direct contact with contaminated water or food. Although, there is no direct evidence that bats with norovirus can infect humans, norovirus carried by bat and the infection of human type GIV norovirus homology reached 76.8%, suggesting that bats may serve as norovirus natural host, a potential threat to human existence.
In conclusion, through the metagenomic sequencing of bat samples in southeast China, many mammalian viruses were annotated including some yet-unreported viruses. Furthermore, the PCR primers were designed according to the results of sequencing and the positive fragments were screened by PCR for phylogenetic analysis. We found that these viruses had low similarity with known viruses and suggest they could be categorized as new viruses. Also adenoviruses and circoviruses were also detected in Myotis formosus and Myotis davidii for the first time. However, further experiments are required to test if these viruses could infect humans. In addition, it is necessary to further investigate more samples in different locations to increase our understanding of the global diversity of bat viruses. Sample colletion and Nucleic acid extraction. After completion of collection from each sample site, all the bats were immediately dissected. The viscera of the sample was sub-packed in 1.5 ml EP tubes, labeled and placed in liquid nitrogen tank, shipped back to the laboratory, and put in −80 °C for cryopreservation until further process.
The sample was divided into six groups based on their location. The intestinal and lung tissues of the six groups were obtained and divided into 12 pools. A tip of every tissue (about 0.1 g) was placed in the pool. 10 times of SM buffer was added and every pool was ground until it was homogenized (50 mM Tris, 10 mM MgSO4, 0.1 M NaCl, pH7.5). The homogenate was centrifuged at 12000 g for 10 min at 4 °C but only the supernatant was used. The supernatant of each group was passed through 0.22 μm Pellicon II filters (Millipore, Billerica, MA) to filter out the broken tissue, bacteria, and other impurities. The filtrate was collected for nucleic acid digestion with 150 μl of following digestive products: filtrate 130 μl filtrate, 20U DNaseI (TaKaRa, Dalian, China), 5 μl 10 × DNaseI Buffer, 1 μl 10 mg/ml RNaseA (TaKaRa). The system was digested for 1 h at 37 °C. After digesting, the RNA was extracted with Qiagen RNAase minikit. The viral nucleic acid is used as reverse transcription template after suspending with 35 μl RNase-free H 2 O.
Reverse transcription and random PCR. The nucleic acid was subjected to reverse transcription to get the first cDNA line of the total RNA by using a previously reported method 10 . Briefly, the miscible mixture which included 33 µl nucleic acid and 2 μl of 50 μM hexameric random primers labeled with 12 groups (20-bp tag sequences), was reacted in a water bath at 75 °C for 5 minutes, immediately put in ice bath for 2 minutes, and then mixed with 3 μl 10 mM dNTP, 20U RNAsin (TaKaRa), 5 × RT buffer, M-MLVRTase 5U (TaKaRa). The mixture was put in 42 °C for 1 hour to finish reverse transcription and get the first cDNA line. This was followed by a hot bath to inactivate the reverse transcriptase. The RNA in the mixture was mixed with 1U RnaseH at 37 °C for 30 minutes. The first cDNA mixture was purified by ethanol precipitation method. Double Stranded cDNA synthesis and sequence-independent single primer amplification (SISPA). To obtain double stranded cDNA (dscDNA), exo-Klenow fragment (TaKaRa) and random primers were added for 60 min at 37 °C and then inactivated at 75 °C for 10 min. To obtain more viral nucleic acid products, dscDNA was subjected to SISPA amplification using the TaKaRa Extaq amplification system, and 20-bp tag primers (without hexamers) were used as the amplification primers in the above reverse transcription. A 25 μl reaction system containing 1 μl dscDNA mixture, 1 μl 20 mM tag primers, 10 × PCR Buffer, 4 μl 2.5 mM dNTPs, ExTaq Polymerase (1U), and ddH 2 O was denatured at 94 °C for 3 min, 30 cycles (94 °C for 30 s, 54 °C for 30 s, 54 °C for 30 s) and 72 °C for 5 min. The amplified products were purified using QIAquick PCR Purification Kit and dissolved in 50 μL TE buffer.
Second-generation sequencing. The 12 groups of double-stranded DNA products were mixed into a sample labeled with different tag sequences and Illumina sequencing in one lane was performed by the Beijing Genome Institute (BGI, Shenzhen, China). 2.5-5 ng DNA was sheared into ~170 bp fragments by Covaris, and subjected to gel electrophotometry to examine the quality of the fragmented DNA. The fragmented DNA was combined with End Repair Mix and incubated at 20 °C for 30 min. The end-repaired DNA was purified with QIAquick PCR Purification Kit (Qiagen), the A-Tailing Mix was then added, and it was incubated at 37 °C for 30 min. The purified Adenylate 3′Ends DNA, Adapter, and Ligation Mix were combined and the ligation reaction was incubated at 20 °C for 15 min. Adapter-ligated DNA was selected by running 2% agarose gel to recover the target fragments. The gel was purified with QIAquick Gel Extraction kit (QIAGEN). Several rounds of PCR amplification with PCR Primer Cocktail and PCR Master Mix were performed to enrich the Adapter-ligated DNA fragments. The PCR products were then selected by running another 2% agarose gel to recover the target fragments. The gel was purified with QIAquick Gel Extraction kit (QIAGEN). The final library was quantitated in two ways: the average molecule length was quantified using the Agilent 2100 bioanalyzer instrument (Agilent DNA 1000 Reagents) and the library was quantified by real-time quantitative PCR (QPCR) (TaqMan Probe). The quantified libraries were amplified on cBot to generate the cluster on the flowcell (TruSeq PE Cluster Kit V3-cBot-HS, Nucleotide sequence and maximum-likelihood phylogenetic analysis. The pre-product was sequenced according to Illumina standard technological process. The raw data was processed based on the internal program, including deleting the adapter and the host sequence, removing duplicated reads and a certain number of reads with low quality value (having > 2 N bases), and getting clean data. The reference database was built with internal procedures to extract the bacterial, fungal, archaeal organisms, and viral sequences from the nucleotide database, which was blasted with the sequences filtered by Short Oligonucleotide Analysis Package (SOAPaligner, version 2) 37 . Based on the following results, the sequence of reads with high correlation degree was given species classification. Different reads and sequences were merged for their high similarity and homology. In all the blast results, optimal results were used as the gene annotation with the parameter of E value < 10e-5. Functional analysis of all the genes was performed by BLAST alignment against KEGG (Kyoto Encyclopedia of Genes and Genomes) and eggNOG (Evolutionary genealogy of genes: Non-supervised Orthologous Groups) database. Table S2) of the six viruses were designed for PCR validation, primer for cornovirus and astrovirus were synthesized according to previous reported 38,23 , while primer for norovirus, adenovirus, bocavirus, and circovirus designed based on the of metagenomic sequencing results obtained in this study using Primer 5 (Premier Biosoft International, Palo Alto,CA). Nucleic acids of each individual sample were extracted using the QIAGEN kit. The PCR reaction system consisted of 18 μl PCR reaction solution (TAKARA), 0.5 μl of the upstream primer (10 μmol/L), 0.5 μl of the downstream primer (10 μmol/L), and 1 μl of the DNA template. PCR conditions were the following: pre-denaturation at 94 °C for 3 min; 30 cycles at 94 °C for 30 s, 52 °C for 30 s and 72 °C for 1 min for 30 cycles of in-house reaction; extension at 72 °C for 10 min. Positive PCR products were sequenced in both directions by an ABI 3730 DNA Analyzer (Invitrogen, Beijing, China).

Evolutionary analysis. The positive sequence obtained by PCR amplification was aligned by BLAST in
GenBank. The homologous sequences were downloaded from GenBank. The results were compared by MEGA6 39 after comparing the sequences by ClustalW. Phylogenetic reconstructions were performed using MEGA6 and the maximum likelihood method or the neighbor-joining method was performed with 1,000 bootstrap replicates.
Nucleotide Sequence Accession Numbers. The data from Solexa sequencing have been deposited in the GenBank Sequence Reads Archive under accession numbers PRJNA379515. All genome sequences of selected viruses have been deposited in GenBank under accession numbers KY775066 to KY775104.