The electronic tree of life (eTOL): a net of long probes to characterize the human microbiome from RNA-seq data


 Background

Microbiome analysis generally requires PCR-based or metagenomic shotgun sequencing, sophisticated programs, and large volumes of data. Alternative approaches based on widely available RNA-seq data are constrained because of sequence similarities between the transcriptomes of microbes/viruses and those of the host, compounded by the extreme abundance of host sequences in such libraries. Current approaches are also limited to specific microbial groups. There is a need for alternative methods of microbiome analysis that encompass the entire tree of life.
Results

We report a method to specifically retrieve non-human sequences in human tissue RNA-seq data. For cellular microbes we used a bioinformatic 'net', based on filtered 64-mer small subunit rRNA sequences across the Tree of Life (the 'electronic tree of life', eTOL), to comprehensively (98%) entrap all non-human rRNA sequences present in the target tissue. Using brain as a model, retrieval of matching reads, re-exclusion of human-related sequences, followed by contig building and species identification, is followed by reconfirmation of the abundance and identity of the corresponding species groups. We provide methods to automate this analysis. A variant approach is necessary for viruses. Again, because of significant matches between viral and human sequences, a 'stripping' approach is essential. In addition, contamination during workup is a potential problem, and we discuss strategies to circumvent this issue. To illustrate the versatility of the method, we report the use of the eTOL methodology to unambiguously identify exogenous microbial and viral sequences in human tissue RNA-seq data across the entire tree of life including Archaea, Bacteria, Chloroplastida, basal Eukaryota, Fungi, and Holozoa/Metazoa, and discuss the technical and bioinformatic challenges involved.
Conclusions

This generic methodology may find wider application in microbiome analysis including diagnostics.


Introduction
There is growing interest in the role that the microbiome plays in health and disease. Microbes are ubiquitous and are present in all body compartments including the gut, lung, oral and nasal cavities, as well as in body tissues including liver and kidney, as highlighted by the Human Microbiome Project [1,2] (https://hmpdacc.org/).
Microbiome characterization is generally based on two key techniques -ribosomal RNA (rRNA) analysis, and metagenomics (reviewed in [3][4][5][6][7][8][9][10]. In the rst, rRNA or rDNA sequences are ampli ed using a suite of PCR primers, sequenced, and compared against the database. Because this depends on the use of short PCR primers, the method may lack speci city -risking ampli cation of unrelated sequences -in addition to missing any species whose rRNA sequence diverges from the primers. In addition, differences in abundance require quantitative PCR ampli cation of each sample followed by deep sequencing, and small differences in abundance (e.g., 2-4-fold changes) are di cult to detect.
The second method is based on metagenomic analysis through shotgun sequencing and genome assembly. This requires many-fold more data, often reaching Tb levels, and requires dedicated tools to remove human sequences and to assemble contigs. Moreover, because high sequencing depth is necessary, assembly-based methods are restricted to highly abundant members of the microbiome.
Moreover, metagenomics does not easily address differential abundance. Both methods require extensive wet-lab work and can require machine-learning tools to unravel the true extent of the microbiome [11].
An ancillary technique based on short k-mers (generally 31-mers), such as Kraken [12] and CLARK [13], can be employed in conjunction with both metagenomics and RNA-seq, but also requires careful interpretation.
Each of these techniques has advantages and disadvantages. In addition to relatively high demands on data processing, and sometimes low selectivity, the different methodologies have often given con icting results. We illustrate this through studies on the brain.
Like other tissues, the brain carries a burden of endogenous microbes and viruses [14], although these have not been well characterized, and some have questioned whether there is indeed a brain microbiome [15]. However, an increasing body of evidence suggests that microbes readily enter the brain (e.g. [16], can be detected by in situ immunohistochemistry of brain [17], and that infection may play a role in neurodegenerative disorders such as Alzheimer disease (AD) [18]. Nevertheless, there has been extensive debate about which microbes are present. Chronic in ammation and infection caused by spirochetes have been suggested to contribute to the slow progression of AD [19]. Chlamydia pneumoniae shows associations with late-onset AD [20], and other bacteria such as Proteobacteria, Actinobacteria, and Firmicutes, as well as Fungi such as Malassezia, Alternaria, and Candida spp., have been reported [21]. An important PCR-based study revealed multiple bacterial species in AD brain [22]. Other work has focused on periodontal pathogens such as Porphyromonas gingivalis [23]. However, in other studies very few microbes of this class (Bacteriodetes) were found, and other key species implicated such as spirochetes and Chlamydia (see above) were not well represented [21,22]. Furthermore, beyond speci c target groups, the relative abundances of these different species have not been established.
In addition, the majority of studies to date have focused on select microbial groups such as bacteria or fungi. Several classes of cellular microbes across the known Tree of Life have not been widely studied to date, including Archaea, Amoebozoa, Chloroplastida, and Eukaryota. This relative dearth of such broader analysis extends more generally beyond our target tissue (the brain) to tissues analyzed in the majority of endeavors.
Many viruses are also present in human tissues. Herpes simplex virus 1 (HSV-1) sequences were discovered in AD brain around 30 years ago [24]. Infections with viruses such as HSV-1 are widespread in the population; these generally remain in a silent (latent) form life-long, but may be reactivated because of stress, in ammation, or other factors, leading to proliferation and localized damage. Other viruses, notably human herpesvirus 6A (HHV-6A) and 7 (HHV-7), have also been suggested to be associated with AD. Readhead et al. used a modi ed ViromeScan work ow [25] and found that the abundance of these two viruses among 515 viral species was increased in the transcriptomes of AD brain in three of four cohorts compared to normal brain. For speci c viruses, unique 31-mers were generated with Jelly sh [26], which can e ciently count k-mers up to 31 nt in length e ciently, and BLAST, which was used to lter the 31-mers with homology to human sequences. The RNA-seq reads that are possible human or bacterial sequences were ltered and then mapped to the ltered 31-mers. The viral abundance analysis was performed with 31-mer count matrices [25].
However, Chorlton [27] challenged these results, suggesting that uncorrected P values were used in multiple testing, low-complexity sequences were abundant in the datasets, local alignment by Bowtie2 required only a relatively low number of matching bases, and the BM tagger (ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/bmtagger/) had a 2% false negative rate when ltering humanderived reads. Also, hepatitis C virus and eradicated variola virus were found in 100% and 97.5% of samples, respectively, using the modi ed ViromeScan. KrakenUniq [28], which performs e cient k-mer counts in metagenomics and can better identify false-positive reads, found few HHV-6A reads and no HHV-7 reads were detected [27]. Allnut et al. used digital droplet PCR (ddPCR) to amplify speci c regions of HHV-6A and HHV-6B in 708 brain sections [29]. PathSeq [30], which has high speci city and sensitivity in distinguishing between human and non-human sequences, was also used as a complementary method with RNA-seq data, which contained part of the cohorts also used by Readhead et al., to screen for pathogens from more than 25,000 microbes, containing 118 human viruses. Neither of these methods found associations between HHV-6 and AD [29], and the true contribution of herpes and other viruses to human brain disease remains unknown.
In addition to viruses, endogenous retroviruses and retroelements constitute a further class of replicative elements that might also contribute to human disease.
Key objectives in the present report have been to devise methods that (i) extend to the entire tree of life, (ii) can accurately determine both the identity and abundance of microbes including viruses and retroelements in human tissues, (iii) are applicable to widely available RNA-seq datasets such as those listed at the National Center for Biotechnology Information (NCBI) database of sequence read archives (which now contain over 25 million Terabases of open-access sequence information https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?), (iv) do not require sophisticated computer expertise or dedicated computer programs other than those that are widely available and/or freely downloadable, and (v) allow a pictorial representation of the microbiome that facilitates comparative interpretation of the results.
We report the development of two related methods. First, an electronic tree of life (eTOL) approach based on a 'net' of 16S/18S rRNA sequences across all cellular lifeforms to comprehensively retrieve all nonhuman sequences. Second, a 'stripping' method based on viral genomes to unambiguously detect key viral species. In this methodology paper we focus on technical and bioinformatic issues -case studies are presented that illustrate the versatility of the method; application of the eToL methodology to the human microbiome in select tissues will be presented elsewhere (in preparation).

Methodology Development
Our methodology is centrally based on BLAST screening of RNA-seq libraries using a suite of filtered probes (64-mers for microbial rRNA and whole-genome viral sequences). We present here the detailed rationale behind this approach and discuss relevant technical issues.
Cellular microbes: the electronic Tree of Life (eToL) approach According to the three-domain system, cellular life can be classified into Archaea, Bacteria, and Eukaryota [31] . To address the full diversity of cellular lifeforms we consulted the Open Tree of Life (OToL; https://tree.opentreeoflife.org/), a National Science Foundation (NSF) collaborative effort across 10 institutions that synthesizes phylogenetic trees based on sequence and taxonomic data [32][33][34] . For completeness, competing phylograms include LifeMap {National Center for Biotechnology Information (NCBI) version; http://lifemap-ncbi.univ-lyon1.fr/}, and we refer to this where appropriate. Because OToL is not definitive on the placement of bacteria, we followed Schulz et al. [35] for a recent reevaluation of the evolutionary phylogeny of bacteria, and synthesized a compromise ToL that includes all taxonomic groups, extending from Archaea and Bacteria to include Amoebozoa, Chloroplastida, Fungi, basal Eukaryota, and Holozoa/Metazoa ( Figure 1).
From this tree 126 species were selected that were judged to be equally divergent from each other as estimated by their spacing/position on the tree, taking into account the relative diversity of Fungi in particular. tree was modified and rooted with the outgroup bacteria using iTOL tool (v6) [40] , as shown in Figure 2.
As will be noted from the figure, it is not entirely consistent with current trees for cellular organisms, and some phylogenetically distinct classes showed intermingling in their precise 16S/18S rRNA sequences. However, the objective of this work was not to address the phylogenetic relationships between the diverse species, but instead to devise a means to detect/extract, as far as possible, a comprehensive compendium of non-human sequences in human tissue. We therefore proceeded to use this collection of 120 rRNA sequences as a route towards evaluating the true complexity of cellular lifeforms.

Databases
The RNA-seq datasets used in this work are listed in Table S2.

Probe length and specificity
Probing for microbes in RNA-seq data has generally exploited the fact that a typical cell contains many thousands of ribosomes (Box 1), whereas housekeeping transcripts are less well represented, often by a factor of 100 or more. However, a central problem in using rRNA probes to detect specific microbial matches in human RNA-seq data is that many, if not the majority, of full-length sequence probes detect significant (according to conventional criteria regarding the likelihood of detection by chance) matches, albeit often partial, in human rRNA. The problem is accentuated with long probes (bacterial 16S rRNA is ~1.5 kb in length, eukaryotic 18S rRNA is ~1.9 kb in length; for comparison, bacterial 23S large subunit rRNA is ~3 kb, and eukaryotic 28S rRNA is ~5 kb in length). As probe length increases, so too does the likelihood of finding matches in human sequences. Conversely, using short sequences (in the range of 20-30 nucleotides) for PCR or k-mer analysis risks losing specificity, even though (at high stringency) these should be unique in the human genome/transcriptome, but this does not take into account the diversity of polymorphisms in the human population, de novo mutations, and sequencing errors. Given that in silico analysis by BLAST is formally equivalent to wet-lab probing by nucleic acid hybridization, we based our design on previous calculations that a minimum probe length of 62 nt is required, at a biologically plausible/significant level of 85% identity, to detect a unique sequence in a random collection of nucleotides of the size of the human genome (ca 3 x 10e9 nt) with a likelihood of 0.1 of encountering a similar sequence by chance [41] . For simplicity, we adopted a probe length of 64 nt.
rRNA hypervariable or constant regions rRNA genes of bacteria and fungi, in particular, are known to contain hypervariable regions [42,43] as well as conserved regions. Although the use of hypervariable regions has the advantage that it can identify exact species (or groups of species), the drawback is that probes based exclusively on hypervariable regions are likely to miss other species that have a slightly different sequence, arguing against pre-selection. By contrast, probes based entirely on conserved regions may detect species irrespective of their class. As a further argument against pre-selection of regions, rRNAs contain a high degree of secondary structure that can (unpredictably) impede reverse transcriptase-mediated copying into cDNA, and some pre-designed probes may find few matches in RNA-seq archives. To illustrate, the number of RNA-seq reads matching different regions of E. coli 16S rRNA in a test dataset differed by a factor of 100 (not presented). As a working compromise, we generated probe sequences without any pre-selection based on knowledge of the target sequence (e.g., of variable versus conserved regions). Probe redundancy (a potential outcome of random design) is addressed in the sections below.

Probe generation and nomenclature
We generated 64-mer probes from the 120 rRNA sequences (~10 per kb, noting that not all available sequences are complete). The non-overlapping 64-mer sequences were devised as probes by semi-random selection using probe.py script. The version of python was v3.6.3 [44] . This generated a list of 1323 probes. In naming the probes we aimed to devise a simple, but easily remembered, nomenclature. Probes were therefore prefixed by a single letter for each of the major domains as follows: A, Archaea; B, Bacteria; C, Chloroplastida (algae and plants), D, Amoebozoa; E0, basal Eukaryota (that may constitute a clade of their own); F, Fungi; and H, Holozoa/Metazoa. Group G was not allocated, and may be retained should any new branches of lifeforms to be discovered (if there are any). We use the term Chloroplastida to denote all organisms containing chloroplast-related organelles in preference to the term Viridiplantae because the latter implies that algae are plants [45] .
Probes were thus named >X (code of ToL domain and number)_abridged species name_rRNA (16S or 18S)_probe number, for example, A_Hsalinarum_16S1.

BLAST analysis
Analysis is based on the basic local sequence alignment search tool, BLAST/BLAST+ [46,47] , that is now widely accepted as the gold standard for detecting significant sequence similarities. In this work we carefully explored different settings for similarity detection, including wordsize (default = 28), match/mismatch parameters (default = 1,−2), and gap costs (linear), and in no case did this improve selectivity. We acknowledge that the default settings for evaluating sequence similarity have been carefully optimized by NCBI staff and their advisers, and we identified no obvious improvements. The basic settings (optimize for highly similar sequences) are that BLAST detects 64-mer sequences with 4 mismatches (94% identity), but also delects 28-mer sequences with no mismatches and 38-mer sequences with a single mismatch (which could be as low as 50% identity over the full length of the probe; values correct at time of writing, October 2021; these settings may evolve with updates of the NCBI website). However, we observed that coverage for the majority of matches detected (80-90%) were generally in the range 70-100% and, because a second screening step is applied, the default settings were retained. Cut-off scores are discussed in the sections below.

Removal of probes matching human sequences
Although 64-mer probes based on rRNA are less likely (because of their length) than fulllength rRNA sequences to find matches in human databases by chance, rRNA is relatively conserved across species, and many of the probes found matches in human sequences. The 64-mers were therefore screened for probes that were were significantly similar to human sequences (default megablast task and -entrez_query "Homo sapiens {Organism}") by BLASTn v2.11.0+ with nt database [48] . About 300 probes found matches in human sequence databases (default parameters), and these probes were discarded. The final list of probes (N = 1017) without matches in the human database was generated in Fasta format and is given in Table S3.
Determining the number of matching reads in SRAs A script was devised to count the number of matches to each probe according to our selection criteria (see also below). Readcounts after filtering can also be determined by accessing the hit table csv file and using the COUNTIF function of Microsoft Excel in the format =COUNTIF(A1:A5000,"probename"), a similar function is available in Apache Open Office (https://www.openoffice.org/welcome/credits.html) and other widely accessible spreadsheet programs.
Refiltering is necessary: sequence similarity does not follow rules of logic A central issue encountered in this work is that sequence similarity, as assessed by BLAST, does not follow the rules of basic logic/mathematics. Mathematics dictates that, if A = B, and B = C, then A = C. Logic dictates that if A > B, and B > C, then A > C. The same rules do not apply to similarity between nucleotide (or protein) sequences. If sequence A is significantly similar (σ) to sequence B, i.e., A σ B, and B σ C, one may not conclude that A σ C. Conversely, if A is not similar (/σ) to B, and B σ C, one may not conclude that A /σ C. This is because, in three sequences A-C, sequence A (e.g., the probe) may not be similar to B (e.g., human sequences), but there may be an intermediate sequence C that is significantly similar to both A and B. Applied to human RNA-seq data, we observed that a large number of BLAST matches retrieved from human tissue are still of human origin. For this reason second-round filtering of all matches detected is necessary. In addition, representative human genome(s) and transcriptomes at NCBI do not encompass the full diversity of polymorphic variants that are present across the human population, and variant sequences may be encountered that achieve a threshold significant match with a filtered (i.e., no matches in human) probe despite being of human origin.
All matches retrieved from human RNA-seq libraries were therefore revalidated to exclude human sequences. This involves (i) retrieving the match sequences, (ii) searching again for homologies to human with downloaded nt database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/; accessed 20 May 2021). Because the retrieved sequences are longer than 64 nt, the threshold score is adjusted according to the mean readsize in each dataset. The cutoffs were as follows. If the bitscore is >160 (MSBB) or >100 (Miami) or >126 (Rockefeller), the sequence was considered to be human in origin, and was discarded. For other datasets (liver, skin, and 17 brain samples a cutoff of >150 was applied; cutoffs for other databases were >160 for the Mount Sinai Brain Bank and >150 for the Edinburgh Brain Bank, not presented). Three scripts were generated for this step, EDDIE_ToL.sh, Abundance_ToL.py and Abundance_count.py. Finally, duplicate reads detected with different probes (if any) were filtered to allocate reads to a single probe showing the highest sequence similarity, and the number of confirmed non-human reads were counted for each probe (Abundance_count.py).

Normalization: housekeeping transcripts
Because different tissue samples contain different amounts of RNA, sometimes partially degraded, and different RNA-seq protocols have different efficiencies of conversion to cDNA for sequencing, we normalized the number of readcounts obtained to the amount of biological material in the SRA. For this purpose we relied on three different housekeeping genes: phosphoglycerate kinase (PGK1), hydroxymethyl-CoA reductase (HMGCR), and neuron-specific enolase (NSE). An earlier study accurately assessed the number of transcripts per cell of PGK1-driven GFP is 46-64 transcripts per cell [49] (Box 2). We found no significant differences between PGK1 and NSE levels in these datasets, whereas HMGCR expression was lower. Importantly, different target tissues will require their own panel of housekeeping gene probes. Of note, the expression levels of PGK1 and NSE1 in human brain appear to be constant as a function of age (https://hbatlas.org/pages/development).
As before, 64-mer probes were devised, and BLAST searching was employed to determine the number of host cells that each SRA corresponds to. To minimize statistical fluctuation the mean of PGK (two probes) and NSE1 (two probes) was adopted for normalization. The number of host cells thus equals the mean readcounts of PGK1 and NSE probes divided by 50, and in each case the raw readcount was normalized to the number of microbe transcripts per host cell by dividing the by the estimated number of host cells. Other tissue types will require dedicated housekeeping gene probes (discussed in Box 2).
Visualization and display of readcounts: heatmapping An objective of this work was to generate a pictorial description of the distribution of (filtered) sequence matches in a given sample. Two methods were used, both based on the sequential order of probes in the taxonomic groups A (Archaea) to H (Holozoa). In the first, Morpheus software at the Broad Institute of MIT and Harvard (Cambridge, MA, USA; https://software.broadinstitute.org/morpheus/) was used for the display either with or without conversion to log2. All matches below a cutoff value (specified in the figures) are shown in blue, and all values between cutoff and a maximum value (dictated by the experiment in hand) are shown on a white to red scale where white = cutoff and red = maximum (or above). This presentation is outlined in Figure 4. In the second, the normalized data for each probe (reads per host cell) were summed for each organism, converted to log2 scale, and heatmaps plotted using the pheatmap package (v1.0.12) of R.
Because of the wider range of abundances for retroelements and viruses, these can require different displays (these are specified in all cases).
rRNA depletion in RNA-seq libraries Ribosomal RNA represents ~80% of the total RNA in each cell. Because most SRA studies in the data repositories have focused on endogenous human transcripts rather than on exogenous microbes, it is commonplace to use protocols to deplete rRNA. Precipitation with LiCl and/or selection of poly(A) RNA using immobilized oligo(dT) are widely used, but rarely remove more than 50-80% of rRNA. To address this we probed different datasets based on raw RNA-seq versus selection for polyA + RNA before sequencing. As shown in Figure 5A, there was no clear difference in the abundance of microbial rRNA transcripts between raw and polyA+ SRA datasets. This is likely to reflect (i) inefficiency of rRNA removal, (ii) different kits are likely to have different efficiencies of removal, and (iii) the extent of variation is comparable to that seen when the same tissue is worked up using alternative protocols in different laboratories.
An alternative approach is to specifically deplete host rRNA in a sample through hybridization to specific complementary oligonucleotides [50] . This method can achieve >90% removal of host (e.g., human) rRNA but does not remove other rRNAs such as those of exogenous microbes. No adjustment for rRNA removal was therefore performed.

Redundancy check
Because all our probes for cellular microbes are based on 16S/18S rRNA, we expected that some probes in the collection would be similar to others, perhaps because they identify highly conserved regions. A concern was therefore that the patterns we observe might to some extent reflect the degree of conservation across different species, as reflected by the number of probes in the collection with similar sequences, rather than the abundance of a target species. To address this, we catenated the entire probe collection into a single file and queried it by BLAST with the individual probes. Of the 1017 probes, 634 (62.3%) were unique within the collection, 100 (9.8%) found 2 matches (i.e., detected one similar sequence), and 122 (12.0%) gave 3-5 matches, and the remainder gave >5 matches (15.9%) ( Figure 6A). We refer to the number of matches identified by each probe as 'redundancy'. Nevertheless, inspection revealed that the 'next best' matches were on average 48.8 nt in coverage, with 2.5 mismatches, falling to homology stretches as short as 31-mers with one mismatch, arguing that the extent of sequence overlap/duplication within the collection is relatively low.
However, to formally address whether the signals detected in human tissues might reflect the number of probe/probe similarities, we plotted the extent of 'redundancy' against total readcounts across the cohort. As shown in Figure 6B, there was a very slight correlation between redundancy and total readcount (trendline). However, the signals with the highest readcounts were at the lower end of the redundancy scale. Figure 6C-E shows the outcome of normalizing the readcount for each probe to the redundancy of the probe, indicating that normalization does not have a major effect on the overall pattern. In Figure 6F we examined the effect of sequentially deleting readcounts for probes with high reduncancy {20 or above (20+), 10 or above (10+), 5 or above (5+)} and then values for probes with reduncancies of 2-5, leaving only signals generated by unique probes within the collection. As shown, removal of redundant probes led to some degradation of the signal, but (i) many highly abundant signals remained despite restriction to unique probes, and (ii) the remaining signals continued to encompass the entire tree of life. However, coverage was slightly compromised, and probing the collection of known human commensals/pathogens (N = 104) with the 634 'unique' probes (not presented) failed to detect 6 species (5.8%; compared to only 2% missed with the complete probe list, next section).
One potential way to avoid this issue would be to generate a far larger collection of random probes, and then to discard probes with partial overlaps/duplications in the probe collection, and this could be done in the future. Overall, our observations argue that the overall pattern of microbe signals detected is not dependent on probe redundancy. Because Page 13/45 our primary intention was to retrieve the maximum number of non-human sequences, we elected not to remove partially overlapping sequences from the probelist because this could compromise retrieval. A drawback is that different probes might potentially retrieve the same sequences. However, this ambiguity is resolved by separate confirmation with 23S/28S sequences and contig building for identification (see below).
The probe collection is largely comprehensive To address whether the filtered probe collection is sufficiently comprehensive, we used it to screen a manually assembled list of rRNAs for known human-associated species. A list of 104 organisms known to be present in human samples was assembled according to Archaea found in human gut [51] , bacteria (listed in Wikipedia: https://en.wikipedia.org/wiki/Pathogenic_bacteria), fungi (https://en.wikipedia.org/wiki/Pathogenic_fungus), helminths [52] , and a selection of rarer species assembled manually. Where two or more closely related species were listed, a single representative species was selected, noting that some species with similar names may in fact show significant divergence. Where the corresponding rRNA sequence was not available, the sequence from a closely related species was selected. In each case 16S/18S rRNA sequences were downloaded from nucleotide database of NCBI. This collection was named the PATHLIST (Table S4). Probing PATHLIST using the probe collection found matches in 98% of cases, and only missed two species (Discussion), indicating that the eToL v1.0 collection is largely sufficient for our purposes. Subsequent versions of eToL will include these 'missing species'.

Specificity and cross-matching
To determine the extent of cross-matching between probes designed from the different taxonomic groups (A-H), the 1000+ probes were used to screen the PATHLIST dataset of human-associated microbes. Because very few known human-associated microbes fall into the groups C (Chloroplastida), D (Amoebozoa, and E0 (basal Eukaryota), these were not included. As shown in Figure 5B, other than some overlaps between Fungi and Holozoa (as expected, given their relateness), there was little cross-matching between groups, attesting to the group-specificity of the probes in the collection.
Species identification, contig generation, and diversity The eToL net collects a compendium of non-human sequences (confirmed by second-round filtering) in a target tissue. For identification, filtered matching reads for key signals of interest were downloaded, contigs were generated using one of several online tools (e.g.,

CAP3
assembly at the Rhone-Alpes Bioinformatics Pole PRABI-Doua; http://doua.prabi.fr/cgi-bin/run_cap3; also EG assembler https://www.genome.jp/toolsbin/eassembler4.cgi?status=seqclean&pmode=all [53] ; a summary of available tools is given at https://onlinetoolweb.com/contig-assembly-online-tool/), and their abundances determined by BLAST of each contig generated from human tissue against the collection of sequence matches, retaining only sequences with 100% (or near-100%) identity to the probe. Where necessary key contigs are used as probes to retrieve additional sequences from the same libraries. For sequence identification, BLAST at NCBI allows retrieval of the closest homologs. However, an average of 4000 matches were retrieved across 20 SRA datasets, but this can be as high as 16000. Computation time for contig building using EGassembler is estimated at 30 minutes for 4000 matches, rising to 5 h for 16000 matches [53] (computation time is approximately proportional to the square of the number of sequences). By contrast, contig assembly by phylogenetic group (A-H, where the mean number of matches for each bacteria and fungi was ~1000) is significantly faster and this number of sequences can be assembled into contigs in ~5 minutes [53] . Nevertheless, because the microbial sequences retrieved are not monophyletic, and show divergence within a single individual sample and between different samples, as illustrated in Box 3; high-stringency contig building risks excluding important contributors to the microbiome and, as in all such microbiome analyses, caution is necessary in interpretation.
Validation with 23S/28S and mitochondrial DNA Because this approach employs a large number of probes (>1000), any observed high or low abundance in a particular tissue SRA could occur by chance (Bonferroni correction).
To circumvent this issue the same protocols were used to devise fresh probes from 23S/28S rRNA for the same (or the closest available) key species signal of interest, and these were then used to reprobe the SRAs for matches. Again, all matches were filtered against human sequences, contigs assembled, followed by species identification.
An alternative approach that is valid for eukaryotic microbes including Fungi, but not for Bacteria or Archaea, is to employ mitochondrial DNA (mtDNA) sequences that have been 'stripped' (below) against human sequences.
Viruses: whole-genome 'stripping' Because viruses have no ribosomes, we turned to whole-genome sequences. For viruses where detailed transcriptomes are available, both in latency and in productive infection, we recommend the use of 64-mer probes based on abundant transcripts. However, transcriptome data are only available for a minority of virus types, and for general applications we based our methodology on intact viral genomes. Nevertheless, these also contain internal homologies with human sequences (see below). For methodology development we based our analysis on the report in Neuron by Readhead et al. [25] where they used a k-mer method to screen human brain SRAs for 515 viruses. Although this method is prone to false positives (Introduction) it appears to be immune to false negatives.
For methodology development we therefore selected the top 20 viruses in terms of readcounts, representing >99.9% of all matches found in human brain (B. Readhead, pers.

comm.).
To identify matching sequences in the human genome/transcriptome, complete genomes for these 20 key viruses were used to search human sequences in the NCBI databases (BLAST/nucleotide collection (nr/nt), organism name = Homo sapiens). This revealed many matches, some relatively close (Table S5). To refine this approach we applied a 'stripping' method, as follows. All regions of sequence similarity between the viral genomes and human sequences identified by BLAST were deleted from the viral genome. In addition, lowcomplexity regions were removed. The first-round stripped genomes were then searched again against H. sapiens, revealing further matches. Four rounds of stripping were necessary to remove all homologies with the human genome/transcriptome. Because some viruses are present as integrated copies in around 1% of the human population (in particular with homology to HHV-6A, HHV-6B, and potentially HHV-7) [54] , these are present in the NCBI databases of 'human' sequences, and these were manually curated to remove key matching sequences (notably human telomeric repeats that are present in the viral genomes). The stripped genomes are given in Table S6.

Retroelements
Page 16 /45 The principal retroelements in the human genome are long and short interspersed nuclear elements (LINEs and SINEs, respectively). LINE activation has been reported in neurological disease [55] . LINEs belong to several subfamilies, and representative 64-mer probes were devised for each of these. SINE are generally relatively short highly conserved sequences related to human 7SL RNA, and probes were included to cover SINE elements. The human genome also includes several classes of human endogenous retroviruses (HERVs) that have been implicated in diverse disorders including neurodegenerative disease [56] , and we included probes to assess their abundance. Following the same nomenclature scheme as before, probes for retroelements including HERVs were allocated domain code R.

Recognizing and excluding contamination
Microbiome identification through rRNA sequencing and metagenomics is prone to various types of contamination that can obscure true signals [57][58][59] . Contamination may be classified into two principal catagories (Table 1A). Type 1 contamination includes contamination of the sample and reagents used to examine it, whereas type II concerns in vivo biocontamination, as discussed below.
(i) Reagent contamination: type 1A Molecular biology reagents used to work up samples for sequence analysis are often contaminated with microorganisms. Salter et al. reported that sample dilution by a factor of 10 3 to 10 4 was necessary before contaminants represent 50% of the signal [57] . In terms of RNA-seq from PCR amplified material, this means that 0.1% of the signals could originate from contaminating material. In a series of 1000 signals, one or more may therefore arise from reagent contamination. Common bacterial contaminant species are given in Table 1, Table S1, and Table S2 of [57] , and Figure 5 of [60] . In addition, organisms commonly encountered in tap water may be consulted (Table S2). Although duplicate and/or blank samples worked up independently have been recommended [59] , this is not possible with pre-existing RNA-seq data. In addition, workup and sequencing of blank samples generally fails because instrument settings reject very low numbers of sequence reads (Azenta Life Sciences/GENEWIZ, authorized personal communication).
(ii) Sample contamination: type 1B This is an important issue because, if samples are not dissected under fully sterile conditions, they may become contaminated by exogenous organisms, for example airborne spores and microbes from human skin. These latter are likely to contaminate some samples, and analysis is complicated by the fact that the skin microbiome is very diverse and differs according to body site, individual, and geographical location [61-63] . One possible way to tackle this issue is to exclude known human skin microbes. In the present work we subtracted brain datasets against a series of RNA-seq datasets for human skin (Results).
Caution is necessary because, for example, a common skin fungus, Malassezia, has been directly implicated in human diseases such as psoriasis and has been reliably been detected in other human tissues [64] .

(iii) In vivo biocontamination: types 2A and 2B
This falls into two subtypes. Type 2A concerns contamination in vivo through life-long exposure to environmental agents. For example, we have observed signals in RNA-seq data corresponding to barley (Hordeum vulgare, not presented). Although airborne contamination of samples is not excluded, we suspect that in vivo contamination may take place. For example, inhaled ultrafine manganese oxide particles readily enter the central nervous system [65] . In mice exposed to microparticles (5 μm) and macroparticles (20 μm) in drinking water, both types of particle entered body tissues [66] , and environmental exposure to both small (<2.5 μm) and large (2.5-10 μm) particles has been associated with cognitive decline in human [67] . Barley pollen (25 μm) is in the same size range; over the course of a lifetime it is possible that these might also enter the circulation including brain vasculature, and from there into the brain itself. A similar route of infection could apply to microbial spores.
Type 2B contamination concerns contamination of the target tissue in vivo before sampling. For example, in studying a body tissue obtained postmortem, the cause of death should be taken into consideration. To illustrate, in many elderly patients (the principal source of postmortem tissues) death is often precipitated by severe infection, often pulmonary, and it is possible if not likely that microbes enter the circulation and are thus present in diverse body tissues, independently of any disease process under investigation (Table 1A).
(iii) Strategies to exclude contamination Key recommendations are summarized in Table 1 B. However, there is likely to be substantial overlap between microbes that are representative of the natural microbiome in human tissues and common contaminants of human origin such as from the skin.

(iv) Viruses and contamination
The problem of virus contamination is less severe because analysis is based on RNA-seq data, and contamination with mammalian cells expressing virus transcripts is thought to be unlikely. By contrast, contamination with viral genomes is possible, but these (particularly for DNA viruses) may be recognized because genomic reads are unlikely to correspond to the viral transcriptome. Viral reads were therefore mapped to the viral transcriptome to determine their authenticity (not presented).

Microbes and viruses: how many cells/genomes are being detected?
This analysis is based, for cellular organisms, on the number of copies of rRNA transcripts in each sample. The question therefore arises of how many cells are present, which in turn depends on the number of ribosomes (or viral transcripts) per cell. Because this parameter introduces a further complexity, further discussion is provided in Boxes 1 and 2.

Tree of life
To generate the probe collection for cellular microbes, 120 key organisms were selected from Open Tree of Life Project and other sources to represent, as far as possible, the full spectrum of cellular organisms.
The key organisms cover the domains of Archaea (A), Bacteria (B0-6), Chloroplastida (C1-4), Amoebozoa (D), basal Eukaryota (E0), fungi (F0-6) and Holozoa/Metazoa (H0-3) (Figure 1). To address the distribution of these organisms across the Tree of Life, the 16S/18S sequences corresponding to these 120 organisms were downloaded and used to build a phylogenetic tree (Figure 2). Because Archaea and Eukaryota are more related, the Bacteria group was used as outgroup [68]. This tree has relatively good statistical support because the bootstrap values for most of nodes are over 70. The nodes that have weak support (bootstrap value <70) are shown in gradient colors, ranging from red (7) to blue (69). The 16S rRNA sequences form two monophyletic groups, Archaea and Bacteria, whereas only one monophyletic group, Holozoa, was observed within the 18S rRNA sequences of eukaryotes. The species, especially from Chloroplastida and Amoebozoa, were not clustered well (Figure 2), and their nodes show weak support.
We then devised 64-mer probes for these species (the rationale is presented in the Methodology section), as shown in the pipeline (Figure 3), and ltered them against human sequences. To validate this approach, we checked whether the probe collection detects known human pathogens/commensals. With few exceptions, all organisms in the PATHLIST were detected in the probe collection, generally with multiple matches. The mean number of matches between the probe collection and each sequence in the PATHLIST was 25.8, and the median was 15. Only two species were not detected, Leishmania donovani and Ascaris lumbricoides. These data suggest that the probe collection covers around 98% of species for which sequences are available (future editions of eToL will be revised to include any missing lineages).
We also developed a uniform method of display based on heat-mapping, as illustrated in Figure 4.
A concern is that rRNA depletion (e.g., through polyA RNA selection) in RNA-seq libraries might compromise detection. However, this was not found to be a systematic problem, and robust microbe signals were detected in both polyA+ and unselected RNA-seq datasets ( Figure 5). A further concern is whether the probe collection we have developed contains probes that signi cantly overlap with each other, and thus constitute partial duplicates. As described in Methodology, this was carefully addressed ( Figure 6). The whole probe collection detected 98% of a test list of human pathogens and commensals, whereas selection of the unique probes detected only 92%. We will address this issue in further re nements, but the current analysis was continued with the complete list of probes.

Excluding contamination
Contamination is increasingly recognized to be a problem in microbiome analysis (reviewed by de Goffau et al. 2018). We therefore adapted our protocols to address this issue. A signature of contamination is that different samples processed in parallel have consistent signals in all samples. Multi-positive signals were deleted from both the Miami and Rockefeller datasets. As shown in Figure S1, many of the signals in the Miami dataset may be contaminants, whereas the same issue was not encountered with the Rockefeller dataset ( Figure S1).
Mutiple independent datasets: the brain has its own microbiome The second strategy was to only consider signals that are present in independent datasets from the same tissue. We therefore screened 17 independent RNA-seq datasets from human brain. As shown in Figure   7A, the patterns were substantially conserved despite entirely independent workup.
To rigorously con rm that the microbial signals detected in brain do not arise through contamination, we sequentially subtracted the signals from the 17 independent brain samples against tapwater, and then against human skin. Speci cally, if any signal appeared in any sample of tapwater or skin, all brain values were set to zero. Both procedures markedly depleted the brain microbial signal. However, multiple signals remained ( Figure 7B). This argues that brain has its own microbiome that differs from that of skin, and is unlikely to represent either type 1 or type 2 contamination.
Tissue differences: liver versus brain Previous studies highlighted likely overlaps between the brain and skin microbiomes (e.g., [21,22]), which may be unsurprising because both tissues have an epithelial developmental origin. Subtraction of brain against skin signals could conceal species that are both (i) present in skin, and (ii) are opportunistic invaders of the CNS. We therefore focused on a different tissue, liver. As shown in Figure S2, the microbiome pro les for brain and liver display many similarities, but also some evident differences. The identities of key differentials were established by contig building and probing of NCBI datasets ( Figure  S3). In the samples analyzed, Malassezia spp. were found to be brain-speci c, whereas Staphylococcus aureus appeared to be liver-speci c (Table S7). The differential presence of key species was con rmed by second-round reprobing using 23S/28S-based probes (not presented).

Heterogeneity
We addressed whether individual probes are generally detecting speci c organisms, or clusters of organisms. We observe that, even using these highly selective probes, we identify clusters of organisms that are not monophyletic. We illustrate this in Box 3 through a case study. It is important to note that each probe detects a cluster of related species rather than a unique species (Box 3), and the exact identity of each sequence group retrieved from human tissue must be revalidated by 23S/28S analysis.

Viruses and retroelements
In terms of readcounts, viruses were less abundant than the other microbes ( Figure 8A). Adenovirus C was the most abundant in these samples. Other viruses such as HSV-1, CMV, HHV-6A and TTV were also present in some individuals (not presented), but their overall abundance was low (comprehensive analysis using this methodology is in preparation).
We also screened, using 64-mer probes, the abundance of select retroelements and endogenous retroviruses in tissue samples. As shown in Figure 8B, these are very well represented, although at present their potential contributions to health and disease remain unknown.
How many microbes are there in brain?
Readcounts, normalized on a per cell basis, do not immediately indicate whether microbes in brain are (relative to host cells) rare or abundant. We therefore considered further normalization factors including the number of rRNA copies that are present in typical microbial cells of different types (Box 1) to calculate how many cells/genomes are present in normal human brain; we elaborate on this point in the Discussion section and in Box 2.

Discussion
We report a new method to comprehensively analyze the entire microbiome of human tissue samples from transcriptomic (RNA-seq) data. The method comprises a 'net' of >1000 probes that covers all organisms from the known spectrum of lifeforms -the electronic Tree of Life (eToL). The method reported here is not intended to replace other established methodologies, but to provide an alternative that does not require any dedicated computer programs beyond those that are already widely available to the community (e.g., BLAST and BLAST+). Primarily based on 16S/18S sequences from cellular organisms, the approach has been extended to cover viruses and retroelements. Although computationally intensive, it is less demanding than metagenomics, and avoids the problems of non-selectivity encountered in earlier studies. Moreover, for the rst time it addresses the entire ToL rather than selected subgroups of microbes.
Our study is based on a spectrum of cellular organisms (N = 120) that span the entire ToL (Figures 1 and   2), 20 viruses that are reported to be particularly abundant in our target tissue (brain), and 11 types of retroelements. For cellular species, the maximum likelihood phylogenetic tree shows the relationships between the rRNA sequences of the different cellular organisms, but species from domains such as Chloroplastida, Amoebozoa, and basal Eukaryota were not well clustered. Some nodes have weak statistical support although the overall statistical support is good. This may be because only partial rRNA sequences were available for some species. In addition, the branch of Enterocytozoon bieneusi (F1_Ebieneusi_18S) is long (not presented), which means it has high divergence compared to the other species, emphasizing the exceptional diversity of the fungi. However, the robustness of multiple alignment tools plays an important role in the accuracy of the phylogenetic tree. Although MAFFT has been reported to have good alignment accuracy, the MUSCLE tool was found to have better performance in reconstructing trees in our study. Other aligners may also be used to align such distant sequences [69].
To improve the accuracy of the alignments, rRNA secondary structures may also need to be considered [70].
Earlier studies of microbe analysis encountered problems including non-speci city and crossreactivity with human sequences. For example, the Jelly sh tool used by Readhead et al. counts k-mers with maximum length of 31 bases [25]. However, 31-mers may lack speci city. The K value (the frequency of probes aligning to target sequences by chance) can be used to calculate the probe length that is necessary for accurate target detection. By considering the background of incomplete but abovethreshold matches (K b value), the minimum probe length for nding unique matches in a typical mammalian genomic library was calculated to be 62 nt at 85% sequence identity and a K b of 0.1 [41].
Therefore, our eToL approach was based on 64-mer (or longer) probes, that have been carefully ltered to remove any sequences matching human sequences. The 64-mers were semi-randomly selected because, if we used a sliding window method to produce overlapping 64-mers, the computation time required to process the many thousands of probes generated would become prohibitive.
The collection of probes (eToL) is designed as a 'net' to entrap all non-human microbial sequences. For this reason the identity of each probe does not indicate the exact species present, and retrieval of sequences from human tissue is necessary for species identi cation. However, the matches to our probes discovered in RNA-seq data unambiguously con rmed the presence of microorganisms because all matches were double-checked to exclude human sequences. Therefore, the false positive rate of detecting human sequences was reduced to near-zero.
The 64-mer probe collection for cellular organisms is believed to be largely comprehensive because 98% of known human pathogens and commensals were detected by the probe collection. In future editions of the ToL probe list will include probes for missing species such as Leishmania donovani and Ascaris lumbricoides, as well for species that are less well represented in the probe list.
We took strict precautions to recognize and exclude potential contaminants of either type 1 or type 2 (Table 1). This revealed that some RNA-seq datasets are potentially diluted by contaminant species inadvertantly introduced during sample preparation and workup. Subtraction of matches against likely sources of contamination including human skin and tap water reduced but did not eliminate the microbiome signals, arguing that key species are indeed present in human brain (in preparation). To con rm the identity of these brain-resident species, the matches were retrieved from brain, the exact species identi ed, and then further validated by 23S/28S rRNA analysis, and for eukaryotic species, analysis of mitochondrial DNA (in preparation).
For viruses, we report that the false positives in the study of Readhead et al. may be explained, in part, by homologies between viral and human sequences. For example, HHV-6A, 6B, and 7 (that were asserted to be increased in AD) have pronounced matches with human telomeric DNA repeats, HHV-3 and HHV-8 contain sequences similar to human thymidylate synthase (TYMS), the HHV-4 genome has matches to human interleukin 10 receptor variant (IL10RV), and the genome of variola virus (the agent of smallpox, also detected in human brain by Readhead et al. contains homologies to human ribonucleotide reductase subunits (RRM1 and RRM2B) ( Table S5). The major virus type in human brain was identi ed to be adenovirus type C (transcript mapping will be reported elsewhere).
Overall, this work reveals that a remarkable diversity of microbes are present in brain (and liver) samples.
All major taxonomic groups are represented. In addition to bacteria and fungi, as previously reported, we report that microbes ranging from Archaea to Amoebozoa, Chloroplastida, basal Eukaryota, and Holozoa/Metazoa are present in human brain. Few viruses were encountered, the majority being of the adenovirus C class.
Cautious application of the eToL method may go some way towards answering the question of whether there is indeed a brain microbiome [15]. Because this is a Methodology paper, we do not report systematically on the identities of the target organisms or experimental con rmation (in preparation).
The approach presented here, unlike other strategies, has the advantage that all cellular organisms across the ToL can be addressed in a single screen. For viruses, the genome-stripping method offers a route to ensure that only viral sequences are detected. In addition, the method has advantages over other strategies in terms of speci city, and employs widely available analytical tools (BLAST and BLAST+).
Simple modi cations to the protocol make it applicable to other species such as non-human primates and rodents where whole-genome sequence data are available for stripping. The methodologies presented here may nd broad application in the analysis of microbes and viruses in widely available RNA-seq data for human tissues, and thereby enhance our understanding of the role of the microbome in human physiology in health and disease. In addition, given that RNA-seq data can now be obtained for under $400, the method may nd application in the bioanalysis of human samples such as skin, oral, nasal, and pulmonary samples, as well as of blood, urine, and cerebrospinal uid.

Declarations
Ethics approval and consent to participate Readcounts of rRNA provide an indicator of how many microbes are present, but there is no direct one-to-one relationship between rRNA readcounts and microbial cell number. The absolute abundance of ribosomes in each cell (ribosomes per cell, RBPC) depends on both growth rate and cell type/size. For bacteria such as Escherichia coli, RBPC values can be up to 70 000 during periods of rapid growth, but as low as a few thousand in poor growth conditions (http://book.bionumbers.org/how-many-ribosomes-are-in-a-cell/). In another bacterium, Sphingomonas, RBPC can be as low as 200 [71] under poor growth conditions. We assume that bacteria in human solid tissues grow very slowly, and a compromise estimate of 2000 RBPC has been adopted. The same value has been assumed for Archaea because they resemble bacteria in terms of size.
Eukaryotic cells are generally much larger, contain more ribosomes, and for yeast under fast-growing conditions 200 000 RBPC have been reported [72,73] . However, as in E. coli, ribosome content depends on growth rate [74] . We therefore applied a similar reduction for what we expect to be very slow-growing cells, giving an estimate of 10000 RBPC (fivefold greater than in bacteria), noting that the exact RBPC values will depend on the species. Although this value could apply to other eukaryotes, it may be an underestimate because, for example, some large mammalian cells can have over a million ribosomes [75] ; however, this extreme high value is unlikely to be representative of microbes more generally.

Virus-infected cells
The problem in estimating how many genomes are present per host cell becomes more acute for viruses because we must distinguish between latent/quiescent infection, where viral transcript counts are expected to be similar to those of endogenous housekeeping genes, and virulent replication, where there can be thousands of genomes/transcripts per cell. For this reason detection of viral sequences in RNA-seq data is likely to detect highly replicative viruses, but could potentially miss low-level latency-related transcripts from quiescent viruses. This point is discussed in more detail in Box 2.

Box 2. Normalization to endogenous transcripts and detection limits
A key question concerns the extent of coverage of each sequence read archive (SRA). To address this we used PGK1 probes to normalize RNA-seq data. PGK1 stands out because Kempe et al. [49] carefully measured the level of expression of PGK1-driven GFP and accurately measured 46-64 transcripts per cell. In support, it is estimated that ~12 000 genes are expressed per typical cell, and between 360 000 and 10 6 mRNAs are present: the average number of transcripts, per gene, is thus in the range 30-83, strictly comparable to the estimate of ca 50 PGK1 transcripts per cell determined by Kempe et al.
Screening of RNA-seq libraries revealed that PGK1 match numbers are in the general range of 10 to over 250 per library, with a median of 152, arguing that each sequence library roughly equates to the sequence content of ~3 cells, although more recent RNA-seq datasets appear to be a little larger (10 cells). This conclusion is substantiated by considerations of SRA file sizes (3-10 GB). Very approximately, an uncompressed file containing 1 MB of information contains 1 megabase of nucleotides, and a file containing 1 GB of information contains ~1 gigabase of nucleotides. The entire transcriptome of a cell, at 600 000 mRNAs per cell, and average mRNA length = 2.2 kb, equates to 1 320 000 000 nucleotides (1.3 GB of data). However, rRNA removal is usually far from complete and, in addition, 1/3 of the information in a typical SRA file is not sequence data (each entry contains details of the specific read in question). To cover the (non-rRNA) transcriptome of a single cell therefore requires a minimum of 2.5 GB of information in FASTA format. Although they may be partly compressed in some formats, consideration of SRA file size (3-10 GB) is consistent with the interpretation that a single SRA equates to the transcriptome of a small number of cells. A legitimate concern is therefore that poorly abundant (but biologically relevant) microbes/viruses may not be detected (see below).
Another consideration is that rRNAs tend to contain regions of secondary structure that reverse transcriptase may find difficult to copy into cDNA. However, this is unpredictable, and we have made no explicit allowance for this factor.
Average read length is a further consideration. Using 64-mer probes, if the mean read length is 150 nt, then (if mRNAs are randomly sampled) some 40% of reads will not sufficiently overlap with the probe, but this falls to ~30% for 250 nt reads. However, this factor is identical for housekeeping genes and for microbial transcripts, and does not affect normalization.
Although our normalization is based on PGK1 transcripts at 50 per host cell, even using two independent probes for PGK1 we observed some statistical fluctuation. To dampen this effect we considered two other genes. The first, hydroxymethyl glutaryl-CoA reductase (HMGCR) is another housekeeping gene, but transcript levels were on average fivefold lower, making it unsuitable. Instead we turned to a neuron-specific housekeeping gene, neuron-specific enolase (NSE, also known as ENO2). It can legitimately be argued that, because NSE is neuron-specific, it is not entirely representative of our target tissue (brain) where neurons only constitute a little over one half of all host cells. However, we saw excellent agreement between NSE and PGK1 transcripts, and for simplicity our results are presented following normalization to the mean number of PGK1 and NSE transcripts per cell (or per 100 host cells), assuming that 1 host cell = (mean {PGK1 and NSE1})/50. Although this probe combination is probably useful for normalizing RNA-seq data from brain, in other tissues it will be necessary to carefully choose the most appropriate panel of housekeeping genes for designing normalization probes. However, multiple viruses are present in human tissues at low levels, where they persist throughout a lifetime [14] , and it remains to be determined whether latent/quiescent infection such viruses compromises cell function; the lower sensitivity may therefore not be a drawback in the analysis of human physiology in health and disease.
Box 3. Case study: signals detected are not monophyletic To illustrate the complexity of computer searching, we report the case of searching the first liver SRA with a 64-mer probe, B3_AcidobacteriaKBS96_16S_8. This found 44 matches, of which eight were 100% identical, whereas the major class (36) contained one or more mismatches. Contig building generated 6 contigs, of which two were highly abundant, the others less so. Sequence comparisons and tree building (ClustalOmega) revealed that they were divergent in sequence ( Figure I). To extend the two most abundant contigs, these were used to reprobe the SRA, and matches were used to generate contigs again. This gave two contigs of 496 and 336 nt . The former was 97% identical to an uncultured Betaproteobacteria clone, whereas the latter was 100% identical to a Gammaproteobacteria species, Acinetobacter. liver 1 found closely related, but not 100% identical, sequences in liver 2, whereas the second most abundant contig in liver 1 found no close relatives in liver 2. This illustrates two points: (i) the exact species present in a tissue sample are not monophyletic, but represent a spectrum of related microbes; and (ii) what is true of one tissue sample may not be true of a near-identical sample from a different individual. The best approach may be to address multiple samples from different individuals, and identify the commonalities.   The phylogenetic tree for cellular microbes. The phylogenetic tree was obtained from Open Tree of Life project (modi ed from https://tree.opentreeo ife.org/) extended to include a recent bacterial consensus tree (main text for details).

Figure 2
Phylogenetic tree built with rRNA sequences for the key 120 organisms. 16S rRNA sequences for Archaea (red) and Bacteria (orange), and 18S rRNA sequences for Chloroplastida (yellow), Amoebozoa (green), basal Eukaryota (grey blue), Fungi (blue) and Holozoa (purple). Bootstrap values of most of the nodes are over 70 (black), indicating strong phylogenetic support. Nodes with bootstrap values lower than 70 have weaker phylogenetic support and are shown in a color gradient from red (low value) to blue (high value. Note that the placements of Fungi, Chloroplastida, and Amoebozoa are intermingled, whereas the other groups are tightly clustered.  Analysis of RNA-seq of total RNA versus polyA+ RNA, and class speci city of the different probes. (A) Two brain (cortex) polyA+ RNA datasets were available for study, these were matched with two (cortex) total RNA datasets. Although the total number of microbial matches spanned a 10-fold range (inset), as expected given independent sample preparation and sequencing methods, the gure shows that polyA+ RNA selection does not remove all rRNA, and the number of overall matches (inset) for the two polyA+ datasets was intermediate between the two matched total RNA datasets. (B) (Left) The whole probe collection was used to probe the list of 100+ human-associated microbes (PATHLIST). (Right) Probes for classes A (Archaea), B (Bacteria), F (Fungi), and H (Holozoa) were then separately used to probe PATHLIST subclassi ed into the same groups. Classes C (Chloroplastida), D (Amoebozoa), and E0 (basal Eukaryota) were not studied because too few human-associated species are known. The gure demonstrates that each probe class principally detects species of the same class, although some crossmatching was observed between Fungi and Holozoa as expected because of evolutionary relationships. The pro le of the cellular microbiome in human brain. (A) Schematic of the relative representation of microbes in normal brain (cortex): overall readcounts from 17 independent brain samples (below) indicate that bacteria and fungi are the major species in brain. (B) Pro les of readcounts in 17 independent brain samples showing different patterns in different brain RNA-seq datasets (where B9 and B16 have distinctive patters) but overall conservation of the pro le. Estimated abundances from the calculated numbers of rRNA per cell in the different organisms are as follows Archaea (10 -5 microbes per host cell), Bacteria (0.14), Chloroplastida (0.06, but type 2 contamination has not yet been excluded, Table   1), Amoebozoa (0.01), basal Eukaryota (0.01), Fungi (0.05), Holozoa (0.05, possibly because of crossmatching with Fungi, Figure 5B). Bacteria and Fungi represent 41% and 13% of the total burden (together >50%). (C) Signals that are present in >75% of all samples. (D) The brain has its own microbiome.