Iroki: automatic customization and visualization of phylogenetic trees

Phylogenetic trees are an important analytical tool for evaluating community diversity and evolutionary history. In the case of microorganisms, the decreasing cost of sequencing has enabled researchers to generate ever-larger sequence datasets, which in turn have begun to fill gaps in the evolutionary history of microbial groups. However, phylogenetic analyses of these types of datasets create complex trees that can be challenging to interpret. Scientific inferences made by visual inspection of phylogenetic trees can be simplified and enhanced by customizing various parts of the tree. Yet, manual customization is time-consuming and error-prone, and programs designed to assist in batch tree customization often require programming experience or complicated file formats for annotation. Iroki, a user-friendly web interface for tree visualization, addresses these issues by providing automatic customization of large trees based on metadata contained in tab-separated text files. Iroki’s utility for exploring biological and ecological trends in sequencing data was demonstrated through a variety of microbial ecology applications in which trees with hundreds to thousands of leaf nodes were customized according to extensive collections of metadata. The Iroki web application and documentation are available at https://www.iroki.net or through the VIROME portal http://virome.dbi.udel.edu. Iroki’s source code is released under the MIT license and is available at https://github.com/mooreryan/iroki.


INTRODUCTION
Community and population ecology studies often use phylogenetic trees as a means to assess the diversity and evolutionary history of organisms. In the case of microorganisms, declining sequencing cost has enabled researchers to gather ever-larger sequence datasets from unknown microbial populations within  (Robinson et al., 2016)), but often have complex user interfaces or complicated file formats to enable complex annotations. Iroki strikes a balance between flexibility and usability by combining visualization of trees in a clean, user-friendly web interface with powerful automatic customization based on simple, tab-separated text (mapping) files. Given its focus on automatic customization and a core set of key features, Iroki's user interface can remain lean and easy to learn while still enabling complex customizations. In addition to specifying simple color gradients directly in the mapping file, Iroki also provides a dedicated module allowing the user to generate custom gradients to embed their data into color space, enhancing visualization. Iroki stays responsive even when customizing large trees, and it does not require an account or uploading potentially sensitive data to an external service.

Here, Iroki was used to customize large trees containing hundreds to thousands of leaf nodes according

Iroki is a web application for visualizing and automatically customizing taxonomic and phylogenetic trees with associated qualitative and quantitative metadata. Iroki is particularly well suited to projects in microbial ecology and those that deal with microbiome data, as these types of studies generally have rich

Iroki is built with the Ruby on Rails web application framework. The main features of Iroki are written entirely in JavaScript, allowing all data processing to be done client-side. This provides the additional benefit of eliminating the need to transfer potentially private data to an online service.

Iroki consists of two main modules: the tree viewer, which also handles customization with tab-separated text files (mapping files), and the color gradient generator, which creates mapping files to use in the tree viewer based on quantitative data (such as counts) from a tab-separated text file similar to the

Inner node labels may represent support values (e.g., bootstrap results) or other comments that describe the inner nodes. If inner labels are numeric, then inner nodes can be decorated with filled and unfilled circles that allow quick identification of branches with high support. The semantics of support labels are key to proper tree representations (Czech et al., 2017). As Iroki currently does not implement tree rerooting, Iroki handles these specifics implicitly rather than giving the option to map inner node labels to branches or to the nodes themselves.

While Iroki is focused mainly on automatic customization via mapping files, some interactive features are included, such as node selection and the ability to modify labels after a tree has been submitted. Finally, various aspects of the tree can be adjusted directly through Iroki's user interface.

Color gradient generator
Iroki's color gradient generator accepts tab-separated text files (similar to the classic-style count tables exported by VIROME (Wommack et al., 2012) or QIIME 1 (Caporaso et al., 2010)) and converts the numerical data (e.g., counts/abundances) into a color gradient. Several single-, two-, and multi-color

Iroki reads numerical data from tab-separated text files. Similar to the mapping file for the tree viewer, the first column should match leaf names in the tree, and the remaining columns describe whatever aspect of the data is of interest to the researcher (e.g., counts or abundance). In a dataset with M observations and N variables, the input file will then have M + 1 rows (the first row is the header) and N + 1 columns (the first column specifies observation names). From this data, Iroki can generate color gradients in a variety of ways.
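The input layout described above (M + 1 rows by N + 1 columns) can be illustrated with a short Python sketch. This is not Iroki's own parser, which runs in JavaScript in the browser; the function name `read_count_table`, the header label `name`, and the example values are all hypothetical and chosen only for illustration.

```python
import csv
import io


def read_count_table(text):
    """Parse a tab-separated count table.

    Returns (variable names, observation names, rows of floats).
    The first row is treated as the header and the first column as
    the observation (leaf) names, as described in the text.
    """
    # Skip completely empty lines (e.g., a trailing blank line).
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    variables = header[1:]  # N variable names
    names, counts = [], []
    for index, row in enumerate(body, start=1):
        if len(row) != len(header):
            raise ValueError(
                f"data row {index}: expected {len(header)} columns, got {len(row)}"
            )
        names.append(row[0])
        counts.append([float(value) for value in row[1:]])
    return variables, names, counts


# Hypothetical table: M = 2 observations, N = 3 variables.
EXAMPLE = (
    "name\tsample_1\tsample_2\tsample_3\n"
    "otu_a\t10\t10\t10\n"
    "otu_b\t25\t5\t0\n"
)

variables, names, counts = read_count_table(EXAMPLE)
print(len(names), "observations x", len(variables), "variables")
```

Rejecting ragged rows up front keeps later per-observation statistics from silently mixing columns.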

Observation means A color gradient is generated based on the mean value of each observation across all variables. In this case, each observation $i$ would be represented as

\[ \bar{c}_i = \frac{1}{N} \sum_{j=1}^{N} c_{ij}, \]

where $c_{ij}$ is the value (e.g., count) of observation $i$ in variable $j$.

Observation "evenness" A color gradient is generated based on the "evenness" of observation $i$ across all $N$ variables. Then, each observation $i$ is represented by Pielou's evenness index (Pielou, 1966) calculated across all variables:

\[ J_i = \frac{-\sum_{j=1}^{N} p_{ij} \ln p_{ij}}{\ln N}, \]

where $N$ is the number of variables and $p_{ij}$ is the proportion of observation $i$ in variable $j$ (i.e., $c_{ij} / \sum_{j=1}^{N} c_{ij}$).
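As a rough illustration of these two summaries, the Python sketch below computes the mean and Pielou's evenness for a single observation (one row of counts). The function names are hypothetical, and the handling of edge cases (treating zero counts as contributing nothing to the entropy, and rejecting rows with fewer than two variables or a zero total) is an assumption of this sketch rather than documented Iroki behavior.

```python
import math


def observation_mean(counts):
    """Mean of one observation's values across all N variables."""
    return sum(counts) / len(counts)


def pielou_evenness(counts):
    """Pielou's evenness: Shannon entropy of the proportions divided by ln(N)."""
    n = len(counts)
    total = sum(counts)
    if n < 2 or total <= 0:
        # Assumption: evenness is undefined here, so fail loudly.
        raise ValueError("need at least two variables and a positive total")
    shannon = 0.0
    for c in counts:
        if c > 0:  # by convention, 0 * ln(0) is taken as 0
            p = c / total
            shannon -= p * math.log(p)
    return shannon / math.log(n)


# Hypothetical observations across three variables.
even_row = [10.0, 10.0, 10.0]    # spread equally: evenness close to 1
skewed_row = [25.0, 5.0, 0.0]    # dominated by one variable: lower evenness

print(observation_mean(skewed_row))
print(pielou_evenness(even_row), pielou_evenness(skewed_row))
```

Both rows have the same mean, so a gradient built on means would color them identically, whereas the evenness gradient separates them.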
(and optionally scaled) count matrix X, with observations as rows and variables as columns, the following decomposition is obtained:
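One common way to reduce such a matrix to a single coordinate per observation is to project the centered rows onto the first principal axis. The Python sketch below does this with power iteration on X^T X, a stand-in chosen here to avoid external libraries; it is not necessarily the decomposition Iroki uses. All function names are hypothetical, and rescaling the scores to [0, 1] (so they could index a color gradient) is an assumption of this sketch.

```python
import math
import random


def center_columns(matrix):
    """Subtract each column's mean so every variable has mean zero."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]


def first_axis(centered, iterations=1000, tol=1e-12, seed=0):
    """Leading eigenvector of X^T X, found by power iteration."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n_cols = len(centered[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(n_cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in centered]  # X v
        w = [
            sum(scores[i] * centered[i][j] for i in range(len(centered)))
            for j in range(n_cols)
        ]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("matrix has no variance to project")
        w = [x / norm for x in w]
        delta = max(abs(a - b) for a, b in zip(w, v))
        v = w
        if delta < tol:
            break
    return v


def project_to_unit_interval(matrix):
    """Score each observation on the first axis, rescaled to [0, 1]."""
    centered = center_columns(matrix)
    axis = first_axis(centered)
    scores = [sum(a * b for a, b in zip(row, axis)) for row in centered]
    low, high = min(scores), max(scores)
    return [(s - low) / (high - low) for s in scores]


# Hypothetical data: four observations, two perfectly correlated variables.
positions = project_to_unit_interval(
    [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
)
print(positions)
```

The sign of a principal axis is arbitrary, so the resulting positions may run in either direction along the gradient; only the relative ordering and spacing of observations is meaningful.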

Decorating the tree in this way allows the user to explore the data and look for high-level trends.

and add bar charts to visualize the data (Fig. 3). Coloring of the dendrogram with the Viridis color palette (a dark blue, teal, green, yellow sequential color scheme) was based on a 1-dimensional projection of sample conductivity, oxygen, and latitude calculated using Iroki's color gradient generator. The color gradient generator was also used to make the color palettes used for the bar charts.

Additionally, the previous study used g23, the gene for major capsid protein, to survey the viral community.

It is possible that a functional protein like RNR is more connected with environmental conditions than a

Clusters B and C also offer a good point of comparison (Fig. 3). In addition to the similarity of their viewer to the Canvas viewer will allow users to visualize and customize huge trees quickly and easily.

Various example datasets from microbial ecology studies were analyzed to demonstrate Iroki's utility.

PeerJ reviewing PDF | (2019:09:41688:1:1:NEW 3 Jan 2020) Manuscript to be reviewed

Competing interests
The authors declare that they have no competing interests.

Changes in OTU abundance in two sample groups. Approximate maximum-likelihood tree of hide SSU rRNA OTUs that showed differences in relative abundance between STEC positive and STEC negative cattle hide samples. Branch and leaf dot coloring represents the p-value of a Mann-Whitney U test (dark green: p ≤ 0.05, light green: 0.05 < p ≤ 0.1, gray: p > 0.1) testing for changes in OTU abundance between STEC positive samples and STEC negative samples. Inner bar heights represent log-transformed OTU abundance, and outer bars represent the abundance ratio between STEC positive and STEC negative samples (blue bars for OTUs with higher abundance in STEC positive samples and brown bars for OTUs with higher abundance in STEC negative samples). Taxa labels show the predicted Order and Family of the OTU and are colored by the predicted phylum using the Paul Tol Muted color palette included with Iroki.

Figure 3. Tara Oceans virome similarity with associated metadata. Average-linkage hierarchical clustering of sample UniFrac distance based on RNR sequences mined from 41 Tara Oceans viromes. Major and sub-clusters of samples (A-G) are labeled. Branch color is based on a scaled, 1-dimensional projection of sample conductivity, oxygen, and latitude onto the cubehelix color gradient. Samples that are more similar to each other in branch color represent those that are more similar to each other with respect to the environmental parameters in the ordination. The first bar series (purple) represents sample conductivity (mS/cm), the second bar series (orange) represents sample dissolved oxygen levels (µmol/kg), and the third bar series (brown/green) represents sample latitude (degrees). For the first two bar series, shorter bars with lighter colors indicate lower values, while longer bars with darker colors indicate higher values. For the third series, longer, dark brown bars indicate samples with extreme negative latitudes, whereas longer, dark blue bars indicate samples with extreme positive latitudes. Samples with intermediate latitudes are represented by shorter, light colored bars. Sample labels represent the station from which the virome was acquired and are colored by sampling depth, with light blue representing surface samples and dark blue representing samples from the deep chlorophyll maximum at that station.