Software Tool Article

Visualizing balances of compositional data: A new alternative to balance dendrograms

[version 1; peer review: 2 approved]
PUBLISHED 14 Aug 2018

This article is included in the RPackage gateway.

Abstract

Balances have become a cornerstone of compositional data analysis. However, conceptualizing balances is difficult, especially for high-dimensional data. Most often, investigators visualize balances with the balance dendrogram, but this technique is not necessarily intuitive and does not scale well for large data. This manuscript introduces the 'balance' package for the R programming language. This package visualizes balances of compositional data using an alternative to the balance dendrogram. This alternative contains the same information coded by the balance dendrogram, but projects data on a common scale that facilitates direct comparisons and accommodates high-dimensional data. By stripping the branches from the tree, 'balance' can cleanly visualize any subset of balances without disrupting the interpretation of the remaining balances. As an example, this package is applied to a publicly available meta-genomics data set measuring the relative abundance of 500 microbe taxa.

Keywords

compositional data, coda, balances, ilr, visualization, rstats, r

Introduction

A composition is a vector of positive measurements that sum to an arbitrary total [1]. Examples of compositions include measurements recorded in parts per million (ppm) or percentages, but also include measurements that are less obviously parts of a whole (e.g., count data generated by next-generation sequencing [2]). A component is one part of a composition. Compositional data analysis (CoDA) deals with the analysis of compositions. Because compositional data are constrained by an arbitrary sum (e.g., proportions are bounded from zero to one), they exist in a non-Euclidean space that renders conventional statistical methods invalid. To deal with compositionality, CoDA typically begins with a log-ratio transformation that maps the data into an unbounded space where conventional statistical methods can be used. The simplest transformations, the centered log-ratio transformation and the additive log-ratio transformation, use a simple reference as the denominator of the log-ratio. A more complex transformation, the isometric log-ratio transformation, transforms the composition with respect to an orthonormal basis [3]. Alternatively, one could analyze the log-ratio of each component to every other component directly [4,5].
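As a minimal sketch of the two simplest transformations, the base R code below computes the centered and additive log-ratios for a toy data set. The function names clr and alr, the toy values, and the choice of the last component as the additive reference are assumptions made for illustration; they are not part of the balance package.

```r
# Toy 3-part compositions (rows = samples); the values are illustrative only.
x <- rbind(c(10, 20, 70),
           c(5, 45, 50))

# Centered log-ratio: log of each part over the geometric mean of its sample.
clr <- function(x) {
  logx <- log(x)
  sweep(logx, 1, rowMeans(logx), "-")
}

# Additive log-ratio: log of each part over one chosen reference part.
alr <- function(x, ref = ncol(x)) {
  log(x[, -ref, drop = FALSE] / x[, ref])
}

clr_x <- clr(x)  # same dimensions as x; each row sums to zero
alr_x <- alr(x)  # one column fewer than x

# Scale invariance: multiplying the data by a constant changes nothing.
isTRUE(all.equal(clr(x * 100), clr_x))
```

Because both transformations depend only on ratios, the arbitrary total of each sample drops out of the result.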

Balances use a sequential binary partition (SBP) to define an orthonormal basis that splits the composition into a series of non-overlapping groups [6]. This design allows for an interpretation of the data at the level of the isometric log-ratio coordinates [7]. The SBP contains a diverging set of contrasts that are each interpretable as a measure of “Group 1 vs. Group 2” (following an isometric log-ratio transformation). For a D-part composition, the SBP defines D − 1 balances that decompose the variance such that the sum of the sample-wise variances for each balance in the tree equals the total sample-wise variance [6]. Balances (like the centered log-ratio transformation and the isometric log-ratio transformation) satisfy all properties required for compositional data analysis: scale invariance, permutation invariance, perturbation invariance, and sub-compositional dominance (reviewed in [8] and elsewhere).
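The variance decomposition can be checked numerically. The sketch below, written in base R, uses the first five samples of Table 1 and the SBP of Table 2; the helper sbp_to_basis is a hypothetical name introduced here for illustration and is not a function of the balance package.

```r
# First five samples of the expenditures data (Table 1).
x <- rbind(c(640, 328, 147, 169, 196),
           c(1800, 484, 515, 2291, 912),
           c(2085, 445, 725, 8373, 1732),
           c(616, 331, 126, 117, 149),
           c(875, 368, 191, 290, 275))

# SBP from Table 2: components as rows, balances as columns.
sbp <- cbind(z1 = c(1, 1, 1, -1, -1),
             z2 = c(1, -1, -1, 0, 0),
             z3 = c(0, 1, -1, 0, 0),
             z4 = c(0, 0, 0, 1, -1))

# Hypothetical helper: turn each SBP column into an orthonormal basis vector.
sbp_to_basis <- function(sbp) {
  apply(sbp, 2, function(s) {
    r <- sum(s > 0)  # number of positively valenced components
    t <- sum(s < 0)  # number of negatively valenced components
    v <- numeric(length(s))
    v[s > 0] <- sqrt(t / (r * (r + t)))
    v[s < 0] <- -sqrt(r / (t * (r + t)))
    v
  })
}

basis <- sbp_to_basis(sbp)
balances <- log(x) %*% basis  # one column per balance

# Total variance, taken as the summed variance of the centered log-ratios.
logx <- log(x)
clr_x <- logx - rowMeans(logx)
total_var <- sum(apply(clr_x, 2, var))

balance_var <- apply(balances, 2, var)
isTRUE(all.equal(sum(balance_var), total_var))  # the D - 1 balances sum to the total
balance_var / total_var                         # proportion explained by each balance
```

The equality holds because the D − 1 basis vectors are orthonormal and each sums to zero, so the balances are a rotation of the centered log-ratio coordinates.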

Although balances have proved useful for the analysis of compositional data, their usual application depends on generating a meaningful SBP. Sometimes, this involves manually creating an SBP based on expert opinion, with or without the assistance of exploratory analyses [6]. However, using expertise to build an SBP is not always desirable, especially for high-dimensional data (where each composition can measure thousands of components). Principal balance analysis is a data-driven alternative that, similar to principal component analysis, seeks to identify an SBP whose balances successively explain the maximal variance of a data set (a computationally expensive procedure approximated with heuristics) [9,10]. In the field of meta-genomics, where next-generation sequencing is used to count the relative abundance of microbe taxa, scientists have applied balances of SBPs to summarize and classify microbiome samples [11]. One study defined the SBP by hierarchically clustering the microbe taxa based on the outcome of interest [12]. Another defined the SBP based on the phylogenetic relationship between microorganisms [13].

Once an SBP is generated, its balances can be visualized using a balance dendrogram [14]. The balance dendrogram illustrates (a) the distribution of samples across each balance, (b) the relationship between balances along the SBP tree, and (c) the decomposition of variance [6,15]. In addition, a balance dendrogram can show differences between sub-groupings of samples by coloring facets of the box plots. Although balance dendrograms capture a large amount of information, they may not provide the optimal visualization of balances. First, by building the figure around a tree, balance dendrograms place emphasis on the relationship between the balances, and not on the balances themselves. Second, each box plot has a unique scale positioned sporadically along the tree, such that direct comparisons between one balance and all others become difficult. Third, the decomposition of variance uses lines that run parallel to the dendrogram branches, potentially confusing these concepts through use of a common symbol. In this software article, I present the R package balance for visualizing balances of compositional data. This package provides an alternative to the balance dendrogram that I hope will simplify balances for scientists less familiar with compositional data analysis.

Methods

Implementation

Within the R package universe, there are three standalone and well-documented tools for general compositional data analysis: compositions [16], robCompositions [17], and zCompositions [18]. The compositions::CoDaDendrogram function plots an archetypal balance dendrogram. There are also a number of domain-specific tools, tailored to next-generation sequencing data, and shown to work effectively [19,20]: ALDEx2 [21,22] and ANCOM [23] for differential abundance analysis, SparCC [24] and SPIEC-EASI [25] for the correlation analysis of sparse networks, propr [26,27] for proportionality analysis, and philr [13] for the analysis of phylogeny-based balances. Of these, the philr package computes balances and visualizes them with dendrograms, but does not plot a balance dendrogram per se.

The balance package is available for the R programming language and uses ggplot2 [28] to visualize the distribution of samples across balances of a sequential binary partition (SBP) matrix. Each balance is calculated by the formula:

$$b_i = \sqrt{\frac{|i_p|\,|i_n|}{|i_p| + |i_n|}} \, \log\left[\frac{g(i_p)}{g(i_n)}\right]$$

for the balances b = [b_1, ..., b_{D−1}], where g(x) is the geometric mean of x, i_p is the sub-composition of positively valenced components, and i_n is the sub-composition of negatively valenced components. Here, |i_p| denotes the length of the sub-composition, that is, the number of components it contains.
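To make the formula concrete, the following base R sketch computes the first balance by hand for the first sample of Table 1, using the first column of the SBP in Table 2. The helper gmean and the variable names are assumptions introduced for illustration only.

```r
# Geometric mean of a positive vector.
gmean <- function(v) exp(mean(log(v)))

# First sample of the expenditures data (Table 1).
sample1 <- c(housing = 640, foodstuffs = 328, alcohol = 147,
             other = 169, services = 196)

# First column of the SBP (Table 2): +1 and -1 mark the two groups.
z1 <- c(1, 1, 1, -1, -1)

i_p <- sample1[z1 > 0]  # positively valenced components
i_n <- sample1[z1 < 0]  # negatively valenced components

# Balance: normalizing coefficient times the log-ratio of geometric means.
coef <- sqrt(length(i_p) * length(i_n) / (length(i_p) + length(i_n)))
b1 <- coef * log(gmean(i_p) / gmean(i_n))
b1
```

A positive value means that the geometric mean of the positively valenced group exceeds that of the negatively valenced group for this sample.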

Operation

The balance package [29] computes and visualizes balances of compositional data. It requires few package dependencies, has negligible system requirements, and runs quickly on a standard laptop computer (e.g., any modern budget CPU with 4 GB RAM). To use balance, the user must provide a compositional data set (e.g., Table 1: samples as rows and components as columns) and a sequential binary partition (SBP) matrix (e.g., Table 2: components as rows and balances as columns). Below, balance is shown for an example data set from robCompositions [17].

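library(robCompositions)
library(balance)
# example compositional data: 20 samples (rows) by 5 components (columns)
data(expenditures, package = "robCompositions")
# SBP matrix: 5 components (rows) by 4 balances (columns); see Table 2
y1 <- data.frame(c(1, 1, 1, -1, -1), c(1, -1, -1, 0, 0),
                 c(0, +1, -1, 0, 0), c(0, 0, 0, +1, -1))
# compute the balances and draw the two-panel figure
res <- balance(expenditures, y1)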

Table 1. An example of a compositional data set with 20 sample compositions measuring 5 components each.

As compositional data, the total expenditure for each individual is arbitrary. These example data are taken from robCompositions [17].

housing  foodstuffs  alcohol  other  services
640      328         147      169    196
1800     484         515      2291   912
2085     445         725      8373   1732
616      331         126      117    149
875      368         191      290    275
770      364         196      242    236
990      415         284      588    420
414      305         94       68     112
1394     440         393      1161   636
1285     374         363      785    487
1102     469         243      496    388
1717     452         452      1977   832
1549     454         424      1345   676
838      386         155      208    222
845      386         211      317    280
1130     394         271      490    386
1765     466         524      2133   822
1195     443         329      974    523
2180     521         553      2781   1010
1017     410         225      419    345
Table 2. An example of a sequential binary partition (SBP) matrix with 5 components split into 4 balances.

These example data are taken from robCompositions [17].

z1   z2   z3   z4
 1    1    0    0
 1   -1    1    0
 1   -1   -1    0
-1    0    0    1
-1    0    0   -1
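Before plotting, it can help to confirm that the two inputs agree in shape. The sketch below is a hypothetical base R helper, check_sbp, that is not part of the balance package; it tests only basic necessary conditions (matching dimensions, valid codes, and two non-empty groups per balance), not whether the partition is fully hierarchical.

```r
# Hypothetical helper: basic sanity checks for a data set and its SBP matrix.
check_sbp <- function(x, sbp) {
  sbp <- as.matrix(sbp)
  stopifnot(
    nrow(sbp) == ncol(x),        # one SBP row per component
    ncol(sbp) == ncol(x) - 1,    # D - 1 balances for D components
    all(sbp %in% c(-1, 0, 1)),   # only the three valid codes
    all(colSums(sbp > 0) >= 1),  # every balance has a positive group
    all(colSums(sbp < 0) >= 1)   # every balance has a negative group
  )
  invisible(TRUE)
}

# Two samples from Table 1 and the SBP from Table 2.
x <- rbind(c(640, 328, 147, 169, 196),
           c(1800, 484, 515, 2291, 912))
y1 <- data.frame(c(1, 1, 1, -1, -1), c(1, -1, -1, 0, 0),
                 c(0, +1, -1, 0, 0), c(0, 0, 0, +1, -1))

ok <- check_sbp(x, y1)  # returns TRUE invisibly, or stops with an error
```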

Optionally, users can color components or samples based on user-defined groupings. To do this, users must provide a vector of group labels for each component via the d.group argument (or for each sample via the n.group argument). The boxplot.split argument facets the box plots similar to the balance dendrogram [15].

group <- c(rep("A", 10), rep("B", 10))
res <- balance(expenditures, y1, n.group = group, boxplot.split = TRUE)

Figure 1 compares the balance dendrogram to its alternative using the robCompositions data [17].


Figure 1. This figure shows a balance dendrogram and its alternative, both prepared using the data from Table 1 and Table 2.

On the left, the first branch of the balance dendrogram shows how the “services” and “other” components are contrasted against the remaining components. The box plot positioned at the branch shows the distribution of samples within this balance. The length of the trunk shows the proportion of variance explained by this balance. On the right, this same information is captured by a two-panel figure. The top balance in the left panel shows how the “services” and “other” components are contrasted against the remaining components. The top balance in the right panel shows the distribution of samples within this balance. In the right panel, the line length shows the range of the sample distribution, while its thickness shows the proportion of variance explained. Note that the median of this first contrast sits slightly positive, meaning that most samples spend more on [“alcohol”, “foodstuffs”, “housing”] than on [“other”, “services”], as measured by the geometric mean of each group.

Use cases

As a use case, a publicly available microbiome data set is analyzed using balances. These data measure the abundance of microbe taxa in the feces of diabetics and their non-diabetic relatives [30], making it a truly relative data set. Since these data contain many zeros that disrupt the log-ratio transformations, the zeros are first replaced through imputation by the zCompositions package. See the Supplementary Information for a demonstration of other pre-processing steps.

To identify balances for visualization, a sequential binary partition (SBP) matrix is made by hierarchically clustering components based on their proportionality measure ϕs (used here as a dissimilarity measure [27]), thus joining together components that covary similarly across all samples. The ape [31] and philr [13] packages transform the tree object into an SBP ready for analysis and visualization.

# for compositional data with samples as rows
data.no0 <- zCompositions::cmultRepl(data, method = "CZM")
pr <- propr::propr(data.no0, metric = "phs")
h <- hclust(as.dist(pr@matrix))
phylo <- ape::as.phylo(h)
sbp <- philr::phylo2sbp(phylo)
# it is helpful to name the balances
colnames(sbp) <- paste("z", 1:ncol(sbp))
res <- balance::balance(data.no0, sbp, size.text = 4, size.pt = 1)
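To show how a clustering tree maps onto an SBP, the following base R sketch derives the partition directly from the merge table of hclust, without ape or philr. It is an illustration under stated assumptions and not the procedure used above: the log-ratio variance stands in for the ϕs dissimilarity, the data are the first eight samples of Table 1, the helper members is hypothetical, and the column order and signs may differ from those returned by philr::phylo2sbp.

```r
# First eight samples of the expenditures data (Table 1).
x <- matrix(c( 640, 328, 147,  169,  196,
              1800, 484, 515, 2291,  912,
              2085, 445, 725, 8373, 1732,
               616, 331, 126,  117,  149,
               875, 368, 191,  290,  275,
               770, 364, 196,  242,  236,
               990, 415, 284,  588,  420,
               414, 305,  94,   68,  112),
            ncol = 5, byrow = TRUE,
            dimnames = list(NULL, c("housing", "foodstuffs", "alcohol",
                                    "other", "services")))
D <- ncol(x)

# Assumption: log-ratio variance is used as the dissimilarity between components.
vlr <- outer(seq_len(D), seq_len(D),
             Vectorize(function(i, j) var(log(x[, i] / x[, j]))))
dimnames(vlr) <- list(colnames(x), colnames(x))
h <- hclust(as.dist(vlr))

# Hypothetical helper: list the components under one entry of the merge table.
# Negative entries are single components; positive entries are earlier merges.
members <- function(k, merge) {
  if (k < 0) return(-k)
  c(members(merge[k, 1], merge), members(merge[k, 2], merge))
}

# Each merge becomes one balance: left branch +1, right branch -1, others 0.
sbp <- matrix(0, nrow = D, ncol = D - 1)
for (m in seq_len(D - 1)) {
  sbp[members(h$merge[m, 1], h$merge), m] <- 1
  sbp[members(h$merge[m, 2], h$merge), m] <- -1
}

# Put the root of the tree first and name the balances.
sbp <- sbp[, rev(seq_len(D - 1)), drop = FALSE]
dimnames(sbp) <- list(colnames(x), paste0("z", seq_len(D - 1)))
sbp
```

The first column contrasts the two branches at the root of the tree, so it contains no zeros; each later column splits one of the groups formed above it.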

Supplementary Figure 1 visualizes all 499 balances and contains the same information that a balance dendrogram would contain: (a) the left panel dot plot shows the components being contrasted, (b) the right panel box plot shows the distribution of samples across each balance, and (c) the right panel line length shows the range of the balance (the range should cleanly approximate the decomposition of variance for the purpose of exploratory visualization, though line width can optionally show the actual proportion of explained variance if desired). However, unlike a balance dendrogram, components and samples are projected on a common scale that facilitates direct comparisons and accommodates high-dimensional data. Yet, the main advantage of the balance package is that, by stripping the branches from the tree, it becomes possible to visualize any subset of balances without disrupting the interpretation of the remaining balances. In Figure 2, we subset the visualization to include only the top 10 most explanatory balances, ranked by the proportion of variance explained.

# full balances stored in results of balance plot
balances <- res[[3]]
vars <- apply(balances, 2, var)
rank <- order(vars, decreasing = TRUE)[1:10]
res <- balance::balance(data.no0, sbp[, rank], size.text = 4)
# then view for further study
sbp[, rank]


Figure 2. This figure shows the top 10 most explanatory balances, ranked by the proportion of variance explained.

The left panel shows how select microbe taxa are contrasted against others. The right panel shows the corresponding distribution of samples within each balance, with the line length showing the range of the distribution. Many of the most explanatory balances occur toward the base of the sequential binary partition (SBP) matrix. Yet, this subset visualization is not feasible with the balance dendrogram. Note that the order among the top 10 balances is determined procedurally to place the base of the tree at the top of the figure.

The d.group and n.group arguments offer a way to organize the results in a meaningful way. For example, the d.group can label microbes that most interest investigators, while the n.group can label patients based on clinical findings. Here, colored components (d.group) indicate the availability of supplemental meta-transcriptomic data, while colored samples (n.group) indicate the presence or absence of type-1 diabetes. In Figure 3, we repeat the visualization of the top 10 most explanatory balances, with points colored by the user-defined groupings.


Figure 3. This figure shows the top 10 most explanatory balances, ranked by the proportion of variance explained, with points colored by the user-defined groupings.

The left panel shows how select microbe taxa are contrasted against others. The right panel shows the corresponding distribution of samples for each group within each balance, with the line length showing the total range of the distribution. There appears to be a difference in the median values of diabetics and non-diabetics for some balances. One could test the significance of these differences using conventional statistical methods like the Student’s t-test [32]. Note that the order among the top 10 balances is determined procedurally to place the base of the tree at the top of the figure.
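As a minimal sketch of such a test, the base R code below compares one balance between two groups. Because the microbiome data are not reproduced here, it uses the expenditures data of Table 1 with the arbitrary “A” and “B” labels from the earlier example, so the resulting p-value carries no scientific meaning. Note that t.test applies the Welch correction by default; var.equal = TRUE requests the classical Student's t-test.

```r
# Expenditures data (Table 1): 20 samples by 5 components.
expenditures <- matrix(c(
   640, 328, 147,  169,  196,
  1800, 484, 515, 2291,  912,
  2085, 445, 725, 8373, 1732,
   616, 331, 126,  117,  149,
   875, 368, 191,  290,  275,
   770, 364, 196,  242,  236,
   990, 415, 284,  588,  420,
   414, 305,  94,   68,  112,
  1394, 440, 393, 1161,  636,
  1285, 374, 363,  785,  487,
  1102, 469, 243,  496,  388,
  1717, 452, 452, 1977,  832,
  1549, 454, 424, 1345,  676,
   838, 386, 155,  208,  222,
   845, 386, 211,  317,  280,
  1130, 394, 271,  490,  386,
  1765, 466, 524, 2133,  822,
  1195, 443, 329,  974,  523,
  2180, 521, 553, 2781, 1010,
  1017, 410, 225,  419,  345),
  ncol = 5, byrow = TRUE,
  dimnames = list(NULL, c("housing", "foodstuffs", "alcohol",
                          "other", "services")))

# Arbitrary sample labels, as in the n.group example.
group <- factor(c(rep("A", 10), rep("B", 10)))

# First balance (Table 2): housing, foodstuffs, alcohol vs. other, services.
z1 <- c(1, 1, 1, -1, -1)
r <- sum(z1 > 0)
t <- sum(z1 < 0)
logx <- log(expenditures)
b1 <- sqrt(r * t / (r + t)) *
  (rowMeans(logx[, z1 > 0]) - rowMeans(logx[, z1 < 0]))

# Student's t-test of the balance between the two groups.
tt <- t.test(b1 ~ group, var.equal = TRUE)
tt$p.value
```

When many balances are tested at once, as with the 499 balances of the microbiome data, the p-values would typically need an adjustment for multiple testing (e.g., with p.adjust).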

Summary

Compositional data measure parts of a whole such that the total sum of the composition is irrelevant and each part is only interpretable relative to the others. Log-ratio transformations offer a way to transform the data into an unbounded space where the analyst can apply conventional statistical methods. One such transformation is the isometric log-ratio transformation, which transforms the composition with respect to an orthonormal basis. Balances use a sequential binary partition (SBP) to define an orthonormal basis that splits the composition into a series of non-overlapping groups. Balances can help the investigator identify trends in relative data, and are often visualized using a balance dendrogram. However, the balance dendrogram is not necessarily intuitive and does not scale well for large data. This paper introduces the balance package for the R programming language, a package for visualizing balances of compositional data using an alternative to the balance dendrogram. This alternative contains the same information coded by the balance dendrogram, but projects data on a common scale that facilitates direct comparisons and accommodates high-dimensional data. By stripping the branches from the tree, balance can cleanly visualize any subset of balances without disrupting the interpretation of the remaining balances.

Data availability

All data used for this analysis were acquired from the supplement of Heintz-Buschart et al. [30]. The supplement of this manuscript contains code to pre-process these data and reproduce the analysis.

Software availability

Software and source code available from: https://github.com/tpq/balance

Archived source code at time of publication: https://doi.org/10.5281/zenodo.1326860 [29]

Software license: GPL-2

how to cite this article
Quinn TP. Visualizing balances of compositional data: A new alternative to balance dendrograms [version 1; peer review: 2 approved] F1000Research 2018, 7:1278 (https://doi.org/10.12688/f1000research.15858.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Open Peer Review

Reviewer Report 03 Sep 2018
Marc Noguera-Julian, IrsiCaixa AIDS Research Institute, Badalona, Spain;  Universitat de Vic-Universitat Central de Catalunya, Catalonia, Spain 
Approved
Manuscript by Quinn entitled “Visualizing balances of compositional data” introduces a new way to visualize balances and component partitions in compositional data. Typically, these are visualized using dendrograms where branches represent the component partitions (obtained through an external method and/or ... Continue reading
HOW TO CITE THIS REPORT
Noguera-Julian M. Reviewer Report For: Visualizing balances of compositional data: A new alternative to balance dendrograms [version 1; peer review: 2 approved]. F1000Research 2018, 7:1278 (https://doi.org/10.5256/f1000research.17311.r37158)
Reviewer Report 28 Aug 2018
Vera Pawlowsky-Glahn, Department of Computer Science, Applied Mathematics and Statistics, University of Girona, Girona, Spain 
Juan José Egozcue, Department of Civil and Environmental Engineering, Polytechnic University of Catalonia, Barcelona, Spain 
Approved
We have not been able to access the R-library “balance”. Consequently, we base our comments exclusively on the text presented. This circumstance motivates the answer “partly” to some of the previous questions. Thus, we cannot evaluate how well the software ... Continue reading
HOW TO CITE THIS REPORT
Pawlowsky-Glahn V and Egozcue JJ. Reviewer Report For: Visualizing balances of compositional data: A new alternative to balance dendrograms [version 1; peer review: 2 approved]. F1000Research 2018, 7:1278 (https://doi.org/10.5256/f1000research.17311.r37159)