LipidSpace: Simple Exploration, Reanalysis, and Quality Control of Large-Scale Lipidomics Studies

Lipid analysis has gained significant importance due to the enormous range of lipid functions, e.g., energy storage, signaling, or structural components. Whole lipidomes can be quantitatively studied in depth thanks to recent analytical advancements. However, the systematic comparison of thousands of distinct lipidomes remains challenging. We introduce LipidSpace, a standalone tool for analyzing lipidomes by assessing their structural and quantitative differences. A graph-based comparison of lipid structures is the basis for calculating structural space models and subsequently computing lipidome similarities. When study variables such as body weight or health condition are added, LipidSpace uses machine-learning approaches to determine lipid subsets across all lipidomes that describe these study variables well. The user-friendly GUI offers four built-in tutorials and interactive visual interfaces with PDF export. Many supported data formats allow an efficient (re)analysis of data sets from different sources. An integrated interactive workflow guides the user through the quality control steps. We used this suite to reanalyze and combine already published data sets (e.g., one with about 2500 samples and 576 lipids in one run) and made discoveries beyond the published conclusions with the potential to fill gaps in the current understanding of lipid biology. LipidSpace is available for Windows and Linux (https://lifs-tools.org).


S.2) Quality control measures and approaches
Benford's law: According to number theory, the first digits of a set of numbers do not occur with equal probability when the numbers span several orders of magnitude. A set of numbers violating this property (Benford's law [1]) might indicate lipid quantities that span only a few orders of magnitude or a poorly performed data imputation for missing values.
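
The check described above can be illustrated with a minimal sketch. It is not LipidSpace's implementation: the function names and the chi-square-style deviation score are assumptions chosen for illustration. The only fact taken from Benford's law itself is the expected frequency log10(1 + 1/d) for a leading digit d.

```python
import math
from collections import Counter


def first_digit(x):
    """Return the leading non-zero decimal digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. 0.0042 -> '4.2...e-03'.
    return int(f"{abs(x):.15e}"[0])


def benford_deviation(values):
    """Compare observed first-digit frequencies with Benford's expectation.

    Returns (observed, expected, chi_square); the first two are dicts that
    map the digits 1..9 to relative frequencies. The chi-square score is a
    hypothetical summary measure, larger meaning a stronger deviation.
    """
    digits = [first_digit(v) for v in values if v and math.isfinite(v)]
    n = len(digits)
    if n == 0:
        raise ValueError("no non-zero finite values")
    counts = Counter(digits)
    expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    observed = {d: counts.get(d, 0) / n for d in range(1, 10)}
    chi_square = sum(
        n * (observed[d] - expected[d]) ** 2 / expected[d] for d in range(1, 10)
    )
    return observed, expected, chi_square
```

Quantities that cover many orders of magnitude (for example, successive powers of two) should yield a small score, whereas values confined to a single order of magnitude should yield a much larger one.
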
Principal component analysis for blank assessment: Blanks can indicate good technical data acquisition when they are measured at evenly distributed points throughout the complete sample measurement process. Blanks should contain a similar analyte composition and thus be quite similar. Blanks that do not cluster well together in a PCA may indicate issues in the sample acquisition process.

Coefficient of variation / Relative standard deviation:
With respect to a selected nominal study variable, CV histograms of the lipids within the respective groups are plotted. CV values exceeding 25-30 % may indicate poor measurement precision for samples or individual lipids.
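
As a minimal sketch of this measure, the following computes the CV per lipid within each group of a nominal study variable and reports the values above a threshold. The table layout, the function names, and the default threshold of 30 % are assumptions for illustration and do not reflect LipidSpace's internal data model.

```python
from statistics import mean, stdev


def coefficient_of_variation(quantities):
    """CV in percent: sample standard deviation divided by the mean."""
    m = mean(quantities)
    if m == 0:
        raise ValueError("CV is undefined for a zero mean")
    return 100.0 * stdev(quantities) / m


def flag_high_cv(table, groups, threshold=30.0):
    """Return {(lipid, group): cv} for every CV above the threshold.

    table:  {lipid name: {sample name: quantity}}   (hypothetical layout)
    groups: {sample name: value of the nominal study variable}
    """
    flagged = {}
    for lipid, per_sample in table.items():
        by_group = {}
        for sample, quantity in per_sample.items():
            by_group.setdefault(groups[sample], []).append(quantity)
        for group, quantities in by_group.items():
            # A standard deviation needs at least two measurements.
            if len(quantities) >= 2:
                cv = coefficient_of_variation(quantities)
                if cv > threshold:
                    flagged[(lipid, group)] = cv
    return flagged
```
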

Comparison of structural similarity of samples on either qualitative level or quantitative level:
The hierarchical dendrogram supports quality control in numerous ways. For instance, lipidomics identification results from two different laboratories can be quickly assessed for agreement based on shared lipid presence. When quantitative data are turned off in the dendrogram, one can see whether samples measured in different laboratories distribute randomly or form two separate clusters. If the latter is the case, the laboratories may have used different methods, leading to the identification of non-concordant lipid sets. When working with model organisms, reference lipidome tables can be sourced from the literature to compare the performance of one's own control measurements to the reference on a quantitative level. The formation of well-separated clusters for the reference samples and one's own control measurements (even with applied quantity normalization) may again hint at issues stemming from the pre-analytical or sample acquisition process.
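
The idea of a purely qualitative comparison can be illustrated with a deliberately simplified stand-in. LipidSpace derives lipidome distances from structural space models; the sketch below instead uses the Jaccard distance on lipid names, which is a swapped-in technique that only captures shared lipid presence. All names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(lipids_a, lipids_b):
    """1 - |intersection| / |union| of two sets of lipid names."""
    a, b = set(lipids_a), set(lipids_b)
    union = a | b
    if not union:
        return 0.0  # two empty lipidomes are treated as identical
    return 1.0 - len(a & b) / len(union)


def pairwise_presence_distances(lipidomes):
    """lipidomes: {sample name: iterable of lipid names}.

    Returns {(sample_1, sample_2): distance} for every unordered pair.
    """
    return {
        (s1, s2): jaccard_distance(lipidomes[s1], lipidomes[s2])
        for s1, s2 in combinations(sorted(lipidomes), 2)
    }
```

If samples from two laboratories report non-concordant lipid sets, the distances within each laboratory would be small and those between laboratories large, which is the pattern that produces two separate clusters in a dendrogram.
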

Adjustable p-value distribution:
When a nominal study variable with at least two categories is chosen, a p-value distribution plot is available. The type of test is adjustable (Student's t-test, Welch's t-test, Kolmogorov-Smirnov test, or ANOVA for the comparison of more than two categories). A uniform distribution of p-values might indicate either that no regulation exists between these categories or that a preceding experiment between these categories (e.g., knockout vs. wildtype) did not succeed.
Adjustable volcano plot for nominal study variables with two categories: An enhancement of the p-value distribution is a volcano plot. Lipid quantities are compared between both categories, with their logarithmic ratio (fold change) reported on the x-axis and their (negative logarithmic) p-value on the y-axis, often resulting in a volcano-shaped scatter plot. Again, points not exceeding any predefined limits might indicate the absence of regulation between both categories or that a preceding experiment between these categories (e.g., knockout vs. wildtype) did not succeed.
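
The coordinates of a single point in such a plot can be sketched as follows. The p-value is taken as given (from whichever test is selected), and the base-2 logarithm for the fold change as well as the default limits are common conventions assumed here for illustration, not values prescribed by LipidSpace.

```python
import math
from statistics import mean


def volcano_coordinates(group_a, group_b, p_value):
    """Return (log2 fold change, -log10 p) for one lipid.

    group_a, group_b: positive quantities of the lipid in both categories.
    p_value: result of a previously performed statistical test.
    """
    fold_change = mean(group_b) / mean(group_a)
    return math.log2(fold_change), -math.log10(p_value)


def exceeds_limits(x, y, min_abs_log2_fc=1.0, max_p=0.05):
    """True if a point lies beyond both (hypothetical) predefined limits."""
    return abs(x) >= min_abs_log2_fc and y >= -math.log10(max_p)
```

A lipid whose mean quantity quadruples between the categories lands at x = 2 on the fold-change axis; whether it counts as regulated then depends on its p-value and the chosen limits.
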

Figure S3: Visualization of an exemplary lipidomics space analysis. In total, eleven samples (six human and five mouse samples) were analyzed in this experiment, resulting in eleven structural lipid space models (tiles 2-12) and a comprehensive global lipid space model (tile 1, top left).

Figure S4: Visual representation of study variable values in the dendrogram. When selecting nominal study variables (left) in LipidSpace, the branches show a pie chart of lipidome distributions associated with the values in the respective subtree. When a numerical study variable is selected (right), the best separation value is computed between two connected sub-branches, and the pie charts show the distribution of lipidomes having a higher or lower value. For example, here the study variable 'cholesterol (pmol/mg protein)' (right) is shown. The top level can be separated at the value 84993.3. The color green indicates a value less than 84993.3. For each horizontal branch, a new separation value is computed as explained in Figure S5.

Figure S5: Determination of the optimal value for separating two numerical sets. (left) Both sets are sorted and their cumulative distribution functions are computed; the value with the highest cumulative difference is their optimal separation value. (right) A histogram illustrates the distribution of both sets. For this example, the sets were extracted from a dataset from a study on human plasma published by Saw et al. [2]. The study variable is the concentration of cholesterol (mmol/l) for 359 human samples.
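
The procedure in this caption can be sketched as follows, assuming that "highest cumulative difference" means the largest absolute difference between the two empirical cumulative distribution functions (the same quantity that underlies the two-sample Kolmogorov-Smirnov statistic). The function name and the tie-breaking rule (the smallest qualifying value wins) are assumptions.

```python
def best_separation_value(set_a, set_b):
    """Return (value, difference) where the empirical cumulative
    distribution functions of both sets differ the most."""
    if not set_a or not set_b:
        raise ValueError("both sets must be non-empty")
    a, b = sorted(set_a), sorted(set_b)
    best_value, best_diff = None, -1.0
    i = j = 0
    # Walk over all observed values in ascending order.
    for v in sorted(set(a + b)):
        # i and j count the elements <= v in each set.
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        diff = abs(i / len(a) - j / len(b))
        if diff > best_diff:
            best_value, best_diff = v, diff
    return best_value, best_diff
```

For two fully separated sets, the difference reaches 1 at the largest value of the lower set, so that every element of one set lies at or below the separation value and every element of the other lies above it.
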

Figure S6: Probability distribution of two arbitrary carbon chains having x double bond position matches. We based our comparison on the two established lipid databases LIPID MAPS and SwissLipids. For each database, a distinct set of fatty acyl chains with information on at least one double bond position was extracted. Double bond (DB) positions were compared when counting either from the carbonyl carbon (forward) or from the methyl group (omega / backward). The probability that two arbitrary fatty acyl chains with a different DB composition have zero matching double bond positions is at least 73 % in both directions for the LIPID MAPS entries and between 64 % and 71 % for the SwissLipids entries.
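
The two counting directions can be made concrete with a small sketch. The chain representation (number of carbons plus double bond positions counted from the carbonyl carbon) and the function names are assumptions for illustration; the conversion omega = number of carbons - position is the standard relation between both numbering schemes.

```python
def omega_positions(n_carbons, db_positions):
    """Convert DB positions counted from the carbonyl carbon into
    positions counted from the methyl end (omega numbering)."""
    return {n_carbons - p for p in db_positions}


def matching_double_bonds(chain_a, chain_b, from_methyl_end=False):
    """Count double bond positions shared by two fatty acyl chains.

    Each chain is a tuple (number of carbons, DB positions counted
    from the carbonyl carbon), e.g. (18, [9]) for FA 18:1(9).
    """
    (n_a, db_a), (n_b, db_b) = chain_a, chain_b
    if from_methyl_end:
        return len(omega_positions(n_a, db_a) & omega_positions(n_b, db_b))
    return len(set(db_a) & set(db_b))


# Example chains (positions counted from the carbonyl carbon).
oleic = (18, [9])
palmitoleic = (16, [9])
linoleic = (18, [9, 12])
arachidonic = (20, [5, 8, 11, 14])
```

With these chains, oleic and palmitoleic acid share one position in the forward direction but none in omega numbering, whereas linoleic and arachidonic acid share none in the forward direction but two in omega numbering. The result therefore depends on the counting direction, which is why the caption reports both.
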

Figure S7: Preservation of the spatial organization of lipids in the structural space model. Here, a list of 14 diacylglycerophosphocholine lipids PC 12:0/[12-24,26]:0 was analyzed individually (left), forming a sequential arc with increasing fatty acyl chain length. On the right-hand side, the 14 lipids were analyzed along with a set of 500 lipids from different lipid categories. The 14 PC lipids preserve the sequential arc (magnified field), although it is slightly deformed. Please note that both the lipid class diacylglycerophosphocholine and the principal components have the same abbreviation, PC.

Figure S8: Reanalysis of a lipidomics study on human plasma. The results computed by LipidSpace are in agreement with the results reported by Saw et al. [2]. The lipid species PC O-40:7 and PE O-40:7 have the highest group separation potential.

Figure S9: Reanalysis of a lipidomics study on mouse platelets. In the study of Peng et al. [3], the most abundant glycerophosphoglycerol (PG) species are significantly regulated between the conditions wild type (WT) and knockout (KO) with a p-value < 0.001.

Figure S10: Study on different ethnicities. A separation model for the Chinese and Indian populations based on data derived from Saw et al. [2] was computed. Both populations were separated with an accuracy of 90.65 % using a panel of 23 lipids.

Figure S11: Comparison of seven different lipidomics experiments. A hierarchical clustering was performed by calculating the structural space between all 702 lipid species over all 1037 samples within seven datasets from four different lipidomics studies (Ishikawa et al. [4], Sales et al. [5], Saw et al. [2], Wolrab et al. [6]), and computing the pairwise distances between all lipidomes based on their corresponding lipid spaces. This clustering does not take the lipid abundances into consideration. Three studies (red, green, blue) have no further hierarchical structure within their branches because no differences in lipid composition were reported.