Topological data analysis: A promising big data exploration tool in biology, analytical chemistry and physical chemistry
Introduction
Topology, a subfield of pure mathematics, is the mathematical study of shape. Although topologists usually study abstract objects, they have recently developed what is called Topological Data Analysis (TDA) [1]. The idea is to use topology to visualize and explore high-dimensional, complex real-world data sets. This concept has been applied successfully to topics such as gene expression profiling of breast tumors [2], [3], T-cell reactivity to antigens in different types of diabetes [4], viral evolution [5] and population activity in the visual cortex [6], but also to less expected topics such as 22 years of voting behavior of members of the US House of Representatives [7] and the characteristics of NBA basketball players as described by their performance statistics [7].
The two main tasks of TDA are the measurement of shape and its representation. One fundamental idea of TDA is to consider a data set as a sample, or point cloud, taken from a manifold in some high-dimensional space (Fig. 1a). The sample data are used to construct simplices, generalizations of intervals, which are in turn glued together to form a kind of wireframe approximation of the manifold. This manifold and the wireframe represent the shape of the data. Many data analysis methods, such as chemometric tools, are already available for exploring data sets. However, they are not yet ready for the analysis of the big data sets that will be generated in many areas such as biology, analytical chemistry or physical chemistry.
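The step from point cloud to wireframe can be sketched with a Vietoris-Rips complex, one common way of building simplices from pairwise distances. The text above does not say which construction is used, so the choice of construction, the function name `rips_complex` and the scale parameter `eps` are assumptions made for illustration only.

```python
from itertools import combinations
from math import dist


def rips_complex(points, eps):
    """Build vertices, edges and triangles of a Vietoris-Rips complex.

    Two points are joined by an edge when they lie within `eps` of each
    other; three points span a triangle when all three pairs are joined.
    Illustrative sketch only, not the construction used in the article.
    """
    vertices = list(range(len(points)))
    edges = [
        (i, j)
        for i, j in combinations(vertices, 2)
        if dist(points[i], points[j]) <= eps
    ]
    edge_set = set(edges)
    triangles = [
        (i, j, k)
        for i, j, k in combinations(vertices, 3)
        if (i, j) in edge_set and (i, k) in edge_set and (j, k) in edge_set
    ]
    return vertices, edges, triangles


# Toy point cloud: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# At eps = 1.1 only the four sides are joined, so the wireframe is a
# closed loop with no filled triangle.
vertices, edges, triangles = rips_complex(square, eps=1.1)
print(len(vertices), len(edges), len(triangles))  # 4 4 0
```

Raising `eps` to 1.5 also joins the two diagonals, which fills the loop with triangles; the scale at which the data are viewed is therefore part of what the wireframe reports.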
The main question is now why topology is well suited to such data analysis. In general, TDA is considered to have three key properties. The first is coordinate invariance. Topology studies shapes in a coordinate-free way: topological constructions do not depend on the coordinate system chosen, but only on the distance function that specifies the shape. In Fig. 1b, the two letters A (each made up of millions of points) could represent a data set of samples analyzed with two different analytical platforms (different coordinate systems), while the topological construction extracts the main features of the shape. The second key property is deformation invariance. Topological properties are unchanged when a geometric shape is stretched or deformed. In Fig. 1c, the letter A is deformed, but its key features, the two legs and the closed triangle, are still retrieved in the topological representation. It is because our brain works in a topological way that one can recognize the letter A regardless of the font used [8]. In general, topologists consider TDA to be a method that is less sensitive to noise, since it is able to pick out the shape of a data set despite countless variations or deformations. The third property is compression. If we are willing to sacrifice a little detail, a simple representation of the fundamental properties of the letter A, i.e. a closed triangle and two legs, can be obtained (Fig. 1d). Considering this letter A as a big data set with millions of points, TDA can generate in this case a topological network with five nodes and five edges. This compressed representation thus encodes all these relationships in a very simple form. For all these reasons, this highly scalable method offers a good opportunity to analyze the very big data sets that will be generated in biology, analytical chemistry or physical chemistry.
The main purpose of this work is to introduce TDA and to highlight the properties that will be needed to manage our future data structures.
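As a small illustration of the compression property, the five-node, five-edge network of the letter A can be written down directly and its features counted. The node labels and the helper `summarize` below are hypothetical; only the node and edge counts, the closed triangle and the two legs come from the text.

```python
# Hypothetical labels for the five-node, five-edge network of the letter A.
a_edges = [
    ("apex", "left_joint"),
    ("apex", "right_joint"),
    ("left_joint", "right_joint"),  # crossbar that closes the triangle
    ("left_joint", "left_foot"),    # left leg
    ("right_joint", "right_foot"),  # right leg
]


def summarize(edges):
    """Count nodes, edges, connected components, loops and legs of a graph."""
    nodes = sorted({n for edge in edges for n in edge})
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    degree = {n: 0 for n in nodes}
    for a, b in edges:
        parent[find(a)] = find(b)
        degree[a] += 1
        degree[b] += 1

    components = len({find(n) for n in nodes})
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "components": components,
        # Independent loops of a graph: edges - nodes + components.
        "loops": len(edges) - len(nodes) + components,
        # A leg ends in a node that has a single neighbor.
        "legs": sorted(n for n in nodes if degree[n] == 1),
    }


summary = summarize(a_edges)
print(summary["loops"], summary["legs"])  # 1 ['left_foot', 'right_foot']
```

The summary depends only on which nodes are connected, not on where they are drawn, which is why a stretched or rotated letter A gives the same one loop and two legs.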
Section snippets
Topological data analysis
Before going into detail, a general framework of TDA is first introduced to show how it can be used to generate a topological network from our data and how to interpret it. The final network represents the shape of our data, and this shape carries meaning for data exploration. TDA uses mathematical functions as lenses on the data, much as a microscope objective is used to bring a sample into focus (Fig. 2). Different lenses highlight different aspects of a data set. Due to
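One way a lens can turn a point cloud into a network is a Mapper-style construction: cover the range of lens values with overlapping intervals, cluster the points falling in each interval, and link clusters that share points. The excerpt above does not spell out the algorithm, so the sketch below is an assumption, and every name and parameter in it (`mapper_graph`, `cover_intervals`, `cluster`, `overlap`, `eps`) is hypothetical.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def cover_intervals(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into equal intervals, each widened on both sides
    by `overlap` times the interval length."""
    length = (hi - lo) / n_intervals
    pad = overlap * length
    return [
        (lo + k * length - pad, lo + (k + 1) * length + pad)
        for k in range(n_intervals)
    ]


def cluster(indices, points, eps):
    """Group indices into connected components of the eps-neighbor graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining if dist(points[i], points[j]) <= eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, lens, n_intervals, overlap, eps):
    """Nodes are clusters of points; edges join clusters sharing a point."""
    values = [lens(p) for p in points]
    nodes = []
    for a, b in cover_intervals(min(values), max(values), n_intervals, overlap):
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(cluster(members, points, eps))
    edges = [
        (u, v)
        for u, v in combinations(range(len(nodes)), 2)
        if nodes[u] & nodes[v]
    ]
    return nodes, edges


# Toy data: 24 points on a circle, viewed through the lens "x coordinate".
circle = [(cos(2 * pi * k / 24), sin(2 * pi * k / 24)) for k in range(24)]
nodes, edges = mapper_graph(
    circle, lens=lambda p: p[0], n_intervals=3, overlap=0.3, eps=0.3
)
print(len(nodes), len(edges))  # 4 4
```

On this toy circle the 24 points are compressed into four nodes joined in a single loop, so the network keeps the one feature that defines the shape. A different lens or a different number of intervals would give a different network, which is the sense in which different lenses highlight different aspects of a data set.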
Samples
Analysis of single bacteria is a hot topic, particularly in the framework of air biomonitoring. This is largely due to the need to develop new spectroscopic instrumentation capable of detecting agents in real time for civil and military applications. In this context, four bacterial strains were prepared in this work, namely Staphylococcus epidermidis (a Gram-positive bacterium), Pseudomonas fluorescens (a Gram-negative bacterium), Pseudomonas syringae and Escherichia coli (a Gram-negative
Results and discussion
The main aim of this part is to observe the behavior of common data analysis tools versus TDA when exposed to different data structures induced by different experimental conditions. First, when working with spectroscopic data sets, it is almost compulsory to apply a spectral pretreatment in order to suppress artifacts or unwanted variance. Because finding a good preprocessing algorithm, or a combination of several, is not always a trivial task, it is interesting to see if raw data can be
Conclusion
The main objective of this work was to introduce the new concept of topological data analysis and to compare it with conventional chemometric tools. This allowed us to highlight useful properties of the method. Indeed, TDA was able to retrieve valuable information from different data structures with very low signal-to-noise ratio, variable shifts and missing data. As a consequence, it might be regarded as a very robust and promising method to cope with such situations. From a general point of view, it
Acknowledgment
L.D. is grateful for the scientific and technical support from Devi Ramanan and Alan Lehman at Ayasdi Inc., Menlo Park, CA.
References (13)
- et al., J. Autoimmun. (2014)
- Bull. Amer. Math. Soc. (2009)
- et al., J. Chem. Phys. (2009)
- et al., Proc. Natl. Acad. Sci. U. S. A. (2011)
- et al., Proc. Natl. Acad. Sci. U. S. A. (2013)
- et al., J. Vis. (2008)