Multi-level Massive Data Visualization: Methodology and Use Cases

This research focuses on massive data visualization based on dimensionality reduction methods. We propose a new methodology that divides the whole data visualization process into separate interactive steps. In each step, a part of the data can be selected for further analysis and visualization, and a different dimensionality reduction method can be chosen. The choice of method depends on the desired accuracy measures and on visualization samples. In addition, statistical measures of the identified clusters are provided. We have developed a tool that implements the proposed methodology, using the R language and the Shiny package. In the paper, the principles of the methodology and the features of the tool are presented by describing a specific use case.


Introduction
Big data analytics is the process of investigating big data to uncover hidden and useful information for better decisions. It involves a visual presentation of data that makes it possible to see hidden relations between objects which cannot be detected using conventional data analysis methods (Zubova et al., 2016). Recently, this topic has been widely investigated by various researchers (Yongjie et al., 2018), (Khomtchouk et al., 2017), (Xuedi et al., 2018), (Domeniconi, 2004), (Diamond and Mattia, 2017).
Our main goal is to improve the visual data analysis process and to propose new, effective ways to analyse and visualize massive data. In this paper, we present a multi-level methodology for interactive visualization of massive data that is based on dimensionality reduction methods. Dimensionality reduction refers to the process of taking a data set with a usually large number of dimensions and creating a new data set with fewer dimensions that preserves as much of the initial information as possible (Menon, 2007). In our case, we always reduce the initial number of dimensions to two (these dimensions are named D1 and D2).
Our previous research (Zubova et al., 2018) has shown that the speed and accuracy of dimensionality reduction depend on the number of analysed data items and on the initial number of dimensions that describe them. The kind of data might also have an influence. Therefore, here we present an interactive tool which allows changing various settings at different stages of visual data analysis. The R language and its package Shiny were used for developing the tool.
We assume that, at the beginning, computational speed is the most important factor for data visualization. In further steps of the visualization process, the demand for accuracy gradually increases. This requires using more accurate, but possibly slower, methods. During each step, the selected data set is divided into smaller clusters. In the end, the most accurate method processes the data. It would require too many resources at the beginning of dimensionality reduction, but in the end the data set is small enough to be processed in the most accurate way. Therefore, visual samples together with accuracy measures are needed to decide which method should be applied in a particular case.
The remainder of this paper is organized as follows. In Section 2, we describe the principles of the proposed methodology in detail. Section 3 presents the use case, which reveals the features of the developed tool. Finally, conclusions are drawn in Section 4.

Multi-level Data Visualization Methodology
Here we propose and describe a visualization methodology which divides the data visualization process into separate steps (Fig. 1). At each stage, a particular dimensionality reduction-based visualization method can be applied, taking the data volume and type into account. The methods are selected according to their speed and accuracy. The more of the initial information a dimensionality reduction method preserves, the more accurate it is. The possible accuracy measures are described in Subsection 2.2. When the data are processed and visualized, it is possible to see the statistical measures of all features of each data cluster. Further analysis can be performed for the selected data cluster only.
The process of data visualization and analysis is separated into several steps:
1. First of all, an initial data set is loaded. A detailed description of this action is presented in Subsection 2.1.
2. At the initial stage, the accuracy of the chosen dimensionality reduction-based visualization method is not so important; therefore, the fastest one can be used. Thus, a 2-dimensional dataset is created, and the data are visualised on a 2D scatter plot.
3. The decision maker can select a part of the data items on the plot for further visualization/analysis. If requested, the selected data are visualized by different methods. Accuracy measures and descriptive information are provided as well. These tool features are described in more detail in Subsection 2.3.
4. Based on the provided plots and accuracy measures, the user can choose the best method for a particular case. We assume that the deeper we go, the more accurate, but possibly slower, methods might be required.
   a. If a simple zoom of the selected plot area is chosen, then the selected items can be filtered from the 2-dimensional dataset created in the previous step. In such a case, there is no need to execute the dimensionality reduction process again. The selected items are presented on the plot.
   b. If the user chooses to apply a different dimensionality reduction method or to re-apply the same method, then the selected items are filtered from the initial dataset, which contains all initial dimensions. Before this action, it is also possible to add new items to the initial dataset (from the selected source file). Afterwards, the user chooses the desired method, which is applied for dimensionality reduction. The "working" 2-dimensional dataset is updated, and the data are visualised on the 2D plot.
5. If the user chooses to perform further analysis and zoom the plot, then the process continues again from Step 3.
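The difference between Steps 4a and 4b can be sketched in a few lines. The tool itself is written in R; the Python below is only an illustrative sketch, and the names `MultiLevelSession`, `first_two_features`, `select`, `zoom` and `reduce` are hypothetical, not taken from the tool. A rectangular selection is assumed for Step 3.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point2D = Tuple[float, float]
Reducer = Callable[[List[Sequence[float]]], List[Point2D]]


def first_two_features(rows: List[Sequence[float]]) -> List[Point2D]:
    """Hypothetical stand-in for a fast reduction method: keep two features."""
    return [(float(r[0]), float(r[1])) for r in rows]


class MultiLevelSession:
    """Holds the initial data set and the 'working' 2-dimensional data set."""

    def __init__(self, initial: Dict[int, Sequence[float]], reducer: Reducer):
        self.initial = dict(initial)            # item ID -> all initial dimensions
        self.working: Dict[int, Point2D] = {}   # item ID -> (D1, D2)
        self.reduce(list(self.initial), reducer)  # Steps 1-2

    def reduce(self, ids: List[int], reducer: Reducer) -> None:
        """Step 4b: filter items from the initial data set and reduce them again."""
        rows = [self.initial[i] for i in ids]
        self.working = dict(zip(ids, reducer(rows)))

    def select(self, d1_range: Tuple[float, float],
               d2_range: Tuple[float, float]) -> List[int]:
        """Step 3: IDs of the items inside a rectangular plot selection."""
        (lo1, hi1), (lo2, hi2) = d1_range, d2_range
        return [i for i, (d1, d2) in self.working.items()
                if lo1 <= d1 <= hi1 and lo2 <= d2 <= hi2]

    def zoom(self, ids: List[int]) -> None:
        """Step 4a: simple zoom, filtering the working 2D data without re-reducing."""
        self.working = {i: self.working[i] for i in ids}


data = {1: (0.0, 0.0, 5.0), 2: (1.0, 1.0, 6.0), 3: (9.0, 9.0, 7.0)}
session = MultiLevelSession(data, first_two_features)
picked = session.select((0.0, 2.0), (0.0, 2.0))   # items 1 and 2
session.zoom(picked)                               # Step 4a
zoomed = dict(session.working)
session.reduce(picked, lambda rows: [(r[2], r[0]) for r in rows])  # Step 4b
```

Keeping the initial data set separate from the working 2D data set is what makes Step 4a cheap: a zoom only filters existing coordinates, while Step 4b goes back to all initial dimensions.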

Loading and analysing the initial data
The process of loading and analysing data is presented in Fig. 2.
If the data are already classified (items are assigned to particular classes), then the data file and the class file (containing the assignment of items to classes) are loaded.
If the data are not classified, then only the data file is loaded. The desired clustering method and its parameters are chosen in the next step. The required parameters depend on which clustering method is chosen. When the parameters are set, the initial data are clustered. If the results do not satisfy the user, the parameters can be changed and the clustering repeated.
Finally, in both cases (classified or not classified data), it is possible to get statistics (in tables and graphs) of the initial data.
Fig. 2. Loading and analysing initial data

Data analysis
A detailed analysis of the selected items can be performed to make a proper choice of the dimensionality reduction method which will be applied in the next step of the data visualization process.
We use well-known methods for dimensionality reduction: Multidimensional Scaling (MDS), Principal Component Analysis (PCA), Independent Component Analysis (ICA), Principal Curves, Locally Linear Embedding (LLE), and Isometric Mapping (Isomap) (Menon, 2007), (Domeniconi, 2004), (Fodor, 2002), (Mizuta, 2007), (Rosaria et al., 2014), (Sorzano et al., 2014). The multidimensional data are processed by these dimensionality reduction methods, and six sets of 2-dimensional data are obtained. They are presented in 2D scatter plots. Different methods visualize the same data in different ways. Therefore, the possibility to choose between several methods makes it possible to find the one that best suits a particular kind of data.
The methods currently implemented in the tool are most appropriate for processing continuous data. In further stages, methods applicable to categorical data (e.g. CATPCA) will be added.
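As an illustration of what reducing to the two dimensions D1 and D2 involves, here is a minimal PCA sketch in plain Python. It is not the tool's implementation (the tool is written in R). It uses power iteration with deflation, which assumes the start vector is not orthogonal to the direction being sought, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def covariance_matrix(centered: List[List[float]]) -> List[List[float]]:
    """Sample covariance matrix of mean-centred rows."""
    m, n = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]


def leading_direction(cov: List[List[float]], iterations: int = 500) -> List[float]:
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    n = len(cov)
    start = [float(i + 1) for i in range(n)]
    length = math.sqrt(sum(x * x for x in start))
    v = [x / length for x in start]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Project rows onto the two directions of largest variance (D1, D2)."""
    n = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]
    cov = covariance_matrix(centered)
    first = leading_direction(cov)
    # Deflation: remove the variance explained by the first direction.
    lam = sum(first[i] * sum(cov[i][j] * first[j] for j in range(n))
              for i in range(n))
    deflated = [[cov[i][j] - lam * first[i] * first[j] for j in range(n)]
                for i in range(n)]
    second = leading_direction(deflated)
    return [(sum(a * b for a, b in zip(r, first)),
             sum(a * b for a, b in zip(r, second))) for r in centered]


# Four items with three features; the third feature is constant.
rows = [(2.0, 0.0, 1.0), (-2.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, -1.0, 1.0)]
coords = pca_2d(rows)
```

In this small example, the constant third feature carries no variance, so the two retained dimensions coincide with the first two features.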
The accuracy measures (Stress, Spearman coefficient, Shannon entropy) of the results obtained by the different methods are calculated and presented as well:
• Stress. It shows the relative difference between the distances in the two spaces. It is computed from a squared loss function. The closer the Stress value is to 0, the more accurate the dimensionality reduction method is. For the MDS method, we use the R function mds() from the package 'smacof' to find the Stress value (WEB, c). To get the Stress value for the other methods, the calculations are performed by the Stress formula.
• Spearman coefficient (Spearman's rank correlation coefficient). It is a statistical measure used to discover the strength of a link between two sets of data (Hauke and Kossowski, 2011). This measure uses the ranks of the variables instead of their values. Possible values range from -1 (strong negative relation) to 1 (strong positive relation). A value of zero means that there is no statistical link between the datasets. To calculate this measure, the R function cor() with method "spearman" was used.
• Shannon entropy. We used the R function entropy() from the package 'entropy', which estimates the Shannon entropy of a random variable from the corresponding observed items (WEB, b), (Hausser and Strimmer, 2009).
Each case is individual, so a single best method cannot be determined in advance. It depends on the data type, size, etc. Therefore, the tool provides the accuracy measures of all methods. In further stages, a measure of the expected execution time will be added. This will help in cases when several methods have similar accuracy measures and speed becomes the decisive factor.
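The three measures can be illustrated with a short, self-contained Python sketch. This is not the authors' code, which relies on the R functions named above. The paper does not spell out its Stress formula, so the normalized form used here is an assumption. The entropy function is a plain plug-in estimate from observed counts, and how the tool applies it to the reduction results is not specified in the text.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Hashable, List, Sequence


def pairwise_distances(points: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distances for all item pairs (i < j), in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def normalized_stress(high: List[float], low: List[float]) -> float:
    """Assumed form: sqrt(sum (d_high - d_low)^2 / sum d_high^2)."""
    num = sum((h - l) ** 2 for h, l in zip(high, low))
    den = sum(h ** 2 for h in high)
    return math.sqrt(num / den)


def ranks(values: Sequence[float]) -> List[float]:
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def shannon_entropy(labels: Sequence[Hashable]) -> float:
    """Plug-in entropy estimate in nats from observed counts."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


# An embedding that preserves every pairwise distance exactly.
original = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
embedded = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d_high, d_low = pairwise_distances(original), pairwise_distances(embedded)
stress = normalized_stress(d_high, d_low)      # 0.0 for a perfect embedding
rho = spearman(d_high, d_low)                  # close to 1 for a perfect embedding
h = shannon_entropy(["a", "a", "b", "b"])      # ln 2 for two equally likely classes
```

Comparing the pairwise distances before and after reduction is one common way to apply both Stress and the Spearman coefficient to an embedding. Whether the tool compares distances or some other quantity is an assumption here.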
The proposed methodology also enables these additional features:
• A list of the selected items together with their features is formed and shown in a data table. Labels (IDs) can be placed on the selected items;
• When a particular point is selected in one graph, the corresponding point is shown (highlighted) on the other graphs.

Demonstration of tool prototype
The proposed multi-level large data visualization methodology has been implemented in a tool prototype. In this section, we describe a test dataset and a use case which demonstrates how the designed tool prototype implements the features of the proposed methodology.
The tool prototype has been implemented in the R language. The Shiny package was used to create an interactive user interface; therefore, the tool works as a web application (WEB, c).

Test dataset
Data objects are also called items, instances, samples, or observations. Features are also called attributes, parameters, properties, variables, or dimensions. Objects described by the same features $x_1, x_2, \ldots, x_n$ form a data set. A combination of values of all features characterizes a particular object $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, $i \in \{1, \ldots, m\}$, where $n$ is the number of features and $m$ is the number of objects.
If the objects are described by more than one feature, the data characterizing the objects are called multidimensional data. If the number of features is $n$, then $X_1, X_2, \ldots, X_m$ are the $n$-dimensional data items.
If the data set consists of a lot of objects, i.e. the number $m$ is large enough, then the data set is called a large data set. If the number $n$ is large, then the data set is called a high-dimensional data set.
We used a test dataset which contains information about frogs. It was previously used by others in several classification tasks related to the challenge of recognizing anuran species through their calls. This dataset was created by segmenting 60 audio records belonging to 4 different families, 8 genera, and 10 species. Each audio record corresponds to one specimen (an individual frog). The spectral entropy and a binary cluster method were used to detect the audio frames belonging to each syllable. After the segmentation, 7195 syllables were obtained (WEB, a). In this case, we use a smaller number of items.
As the data are already classified (items are assigned to particular classes), there are two files:
• Data file (containing the data). It has 2610 items and 10 attributes (dimensions).
• Class file. It has 2610 items (one for each corresponding item in the data file) and only one attribute, which defines the class each item belongs to. All items are assigned to one of 4 clusters (families): 68 items belong to the 1st cluster (Bufonidae), 542 items to the 2nd cluster (Dendrobatidae), 1000 items to the 3rd cluster (Hylidae), and 1000 items to the 4th cluster (Leptodactylidae).

Data loading and analysis
In the beginning, we must choose the type of data: classified or not classified. Here we present the case where not classified data are analysed; therefore, only one file (the data file) is loaded. At the first step, the tab for not classified data is selected. The next step requires choosing a clustering method (Fig. 4).
If the K-means method is chosen, then it is enough to specify the number of clusters. The tool shows how many items are in each cluster: e.g. the first cluster has 606 items, the second one has 160 items, etc. (Fig. 5).
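For readers unfamiliar with K-means, the following Python sketch shows the basic algorithm (Lloyd's iterations) and the per-cluster item counts that the tool reports. It is an illustration only, not the clustering routine used by the tool. Seeding the centres with the first k items is a simplifying assumption that keeps the example deterministic.

```python
import math
from collections import Counter
from typing import List, Sequence


def k_means(points: Sequence[Sequence[float]], k: int,
            iterations: int = 100) -> List[int]:
    """Lloyd's algorithm; the first k points seed the centres (an assumption)."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each item goes to its nearest centre.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                      for p in points]
        # Update step: each centre moves to the mean of its items.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
labels = k_means(points, k=2)
sizes = Counter(labels)   # number of items in each cluster
```

As in the tool, the only parameter the user supplies is the number of clusters. The result generally depends on how the centres are initialized.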
If the dbscan method is chosen, then the number of clusters, the number of neighbourhoods and epsilon need to be specified (Fig. 6). The epsilon value is selected according to the plot: we look for the point at which the biggest change in the distances between neighbourhoods is observed. In this case, the epsilon value was set to 0.3.
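Reading epsilon off the plot can be approximated programmatically. The Python sketch below assumes that the plot shows the sorted distance from each item to its k-th nearest neighbour, which is a common heuristic for DBSCAN. The paper does not state exactly what its plot contains, so both this reading and the function names are assumptions.

```python
import math
from typing import List, Sequence


def k_distances(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Sorted distance from every item to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


def suggest_epsilon(sorted_kdist: List[float]) -> float:
    """Value just before the largest jump in the sorted k-distance curve."""
    jumps = [b - a for a, b in zip(sorted_kdist, sorted_kdist[1:])]
    return sorted_kdist[jumps.index(max(jumps))]


# Four items forming a tight group and one isolated item.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
curve = k_distances(points, k=2)
eps = suggest_epsilon(curve)   # 1.0: the last value before the curve jumps
```

On real data the curve is rarely this clean, so the suggested value is best treated as a starting point to check visually.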
In the case of the dbscan method, there are 3 clusters: the first cluster has 2120 items, the second one has 371 items, and the third one has 13 items. Regardless of which clustering method is chosen, it is always possible to get the statistics of each cluster. The average, standard deviation, minimum and maximum values of each data feature are shown (Fig. 7). The "Show plot" button makes it possible to see the statistics graphically (Fig. 8).
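The per-cluster statistics can be computed as in the following Python sketch. It is illustrative only: the function name and the output layout are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def cluster_statistics(rows: Sequence[Sequence[float]],
                       labels: Sequence[Hashable]) -> Dict[Hashable, List[dict]]:
    """For each cluster, the mean, sd, min and max of every feature."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    summary = {}
    for label, members in groups.items():
        summary[label] = [
            {
                "mean": statistics.mean(col),
                # Sample standard deviation; undefined for one item, so use 0.0.
                "sd": statistics.stdev(col) if len(col) > 1 else 0.0,
                "min": min(col),
                "max": max(col),
            }
            for col in zip(*members)   # one column per feature
        ]
    return summary


rows = [(1.0, 10.0), (3.0, 10.0), (5.0, 20.0)]
labels = [1, 1, 2]
summary = cluster_statistics(rows, labels)
# summary[1][0] describes the first feature of cluster 1: mean 2.0, min 1.0, max 3.0
```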

Multi-level data visualization
In this case, PCA is chosen from the 6 available methods for the initial data visualization. The selected data can then be processed by a dimensionality reduction method different from the one used in the previous step. In this case, the MDS method was applied to visualize the selected data (Fig. 11). Such visualization of the selected items by a chosen method can be repeated as many times as needed. The data selected in the previous step are visualized by one more method, ICA (Fig. 12).
The visualization results reveal that the items belonging to the green cluster are clearly separated from the rest of the data. However, some items belonging to the yellow and red clusters overlap. If we are not sure which method should be used, the selected data can be visualized by all 6 methods in order to choose the best (preferred) one. Figures 13, 14 and 15 present a part of the data, selected in the previous step, visualized by the MDS, PCA, ICA, Principal Curves, LLE and Isomap methods. We also get the accuracy measures of each method (Fig. 16). Fig. 17 presents the accuracy measures graphically. In all the plots, it is also possible to see the ID of each item. This makes it possible to see which cluster each item belongs to (Fig. 18).
When the data are presented in a 2D graph, we can see the relationships between the items and the cluster each item belongs to. However, in such a graph, we do not see the exact parameters (values of the initial dimensions) of each item. Therefore, the list of the selected items and their features is presented in a table. This makes it possible to find the characteristics that distinguish each cluster.

Conclusions
In this paper, we have proposed a methodology that enables multi-level massive data visualization in an interactive way. We have developed a tool which implements the proposed methodology. The principles of the methodology and the features of the tool are presented by describing a specific use case.
The proposed methodology improves the visual data analysis process and brings new possibilities. It allows zooming into and analysing selected parts of the data. Different dimensionality reduction methods can be applied in each particular case according to the data type and size. The 2D plots are supplemented by statistical characteristics. This enables the investigation of the same data from different points of view.
A point selected in one graph is shown (highlighted) on the other graphs. Outlier detection is also enabled. Outliers are extreme values that deviate from the other observations in the data. They may indicate variability in measurement, experimental errors or a novelty (Xuedi et al., 2018).

Fig. 9. Dimensionality reduction methods
Fig. 10 presents the initial data (containing all items) visualized by the PCA method. Different clusters are distinguished by different colours. Some clusters have more items (e.g. blue, brown, black), others fewer (e.g. red, yellow, green). Some clusters also overlap. It is possible to select a part (or, if needed, all) of the data points for further analysis (Fig. 10, right).

Fig. 11. Data visualization by using the MDS method

Fig. 13. Data visualization by using the MDS and PCA methods

Fig. 16. Accuracy measures of the dimensionality reduction methods