Visualization of Skewed Data: A Tool in R

In this work we present a visualization tool specifically tailored to deal with skewed data. The technique is based upon the use of two types of notched boxplots (the usual one, and one which is tuned for the skewness of the data), the violin plot, the histogram and a nonparametric estimate of the density. The data is assumed to lie on the same line, so the plots are compatible. We show that a good deal of information can be extracted from the inspection of this tool; in particular, we apply the technique to analyze data from synthetic aperture radar images. We provide the implementation in R.


Introduction
work set the basis for the Exploratory Data Analysis, which is the art of seeking for relevant information within the data with the least possible distributional assumptions about the underlying process. Such quest is frequently based upon graphical representations.
Among the schematic plots which survived or emerged since the advent of powerful personal computers, one should mention, for univariate data, the scatterplot, the histogram (defined and discussed in Pearson 1895), the boxplot (Hubert & Vandervieren 2008), the beanplot (Kampstra 2008), the shifting boxplot (Marmolejo & Tian 2010), the violin plot (Hintze & Nelson 1998), and their many variations. a Universidade Federal de Pernambuco. E-mail: rayospina@gmail.com b MSc Candidate. E-mail: amlarangeiras@gmail.com c Universidade Federal de Alagoas. E-mail: acfrery@gmail.com The histogram and the boxplot are the most used plots which convey information about the shape of the underlying distribution. They work in the same fashion: they extract and display key quantifiers from the data. These quantifiers can be tuned for specific situations as, for instance, the choice of the bins in the histograms (the Feedman-Diaconis, Sturges, and Scott options in the hist function available in R).
A key point when using more than a single graphical presentation of the same data set is to clearly convey the same or complementary information. A common mistake is simply showing side-by-side several summaries, but the precision and extent of enhancement such juxtaposition provides is arguable. Since no single plot is able to provide all the relevant information in every conceivable case, a possible solution for this problem consists of presenting the plots with clear visual clues of their same origin: the data set.
Visualization techniques are often used to drive important decisions. If the information conveyed by the graphical summaries is hindered, decisions may be biased or completely wrong. For instance, Doulgeris, Anfinsen & Eltoft (2011) present a segmentation procedure for polarimetric synthetic aperture radar (Pol-SAR) imagery which, albeit automatic, exhibits the quality of the product at each iteration in the form of histograms overlapped with fitted densities; the closer the fit, the better the result. When the data is overly asymmetric, the automatic presentation is hard to grasp as the abscissas span a huge interval.
In this article we present a tool for the visual display of skewed data developed and freely available in R. The tool is based on the integration and coordination of several graphical representations, some of them tailored to this kind of data. We test the tool on PolSAR data, which exhibit intense asymmetry.
Section 2 presents the graphical summaries that will be integrated in our visualization tool. Section 3 presents the data: different types of land cover as retrieved by Synthetic Aperture Radar -SAR sensors. This kind of data is prone to presenting extreme deviations from the Gaussian hypothesis, as they are heavily skewed. Finally, section 4 concludes the paper with further suggestions. The Appendix provides details about the implementation and instructions to obtain the code and the data.

The summaries and their coordination
In the following we define and comment advantages and disadvantages of some commonly used graphical summaries of data. The use of these summaries is illustrated with the same data set: a sample from a PolSAR image of the Niigata area; see Fig. 6(c). As presented in Table 1, the VV polarization is the one with strongest asymmetry, so the data henceforth presented come from this band. More information about this and other images is given in Section 3.

Histograms and kernels
In Data Exploratory Analysis the probability density estimation is one of the main tools to extract data information about the distribution of the underlying population. The estimated density may help revealing patterns and features representative of the targeted object for data modeling, analysis and decision management.
Generally speaking, the problem of density estimation can be defined as the estimating processes of the unknown distribution by means of the also unknown density function f on the set of attributes of data X using the information of the observations of the random sample of size n, namely X = {X 1 , . . . , X n } ⊂ X draw from the target function f . The Histogram is a basic form of nonparametric density estimator where the region covered by X is usually divided into equal-sized bins whose height is proportional to the count of hits within that bin. This estimator is depends on the choice of bin width h and the starting points of the bin too. These two values will determine how the data will be grouped i.e. to which bin the data will belong to. The number of bins k should be related to the bin width h as, for instance, k = range(X)/h. Different rules to choose h are available. For example, Scott (1979) proposed to h = 3.5 σ/n 3 with σ the sample standard deviation, while Sturges (n.d.) proposed k = 1 + log 2 (n). Freedman & Diaconis (1981) proposed to use h = 2IQR(X)/n 3 where IQR(·) stands for the interquartile range of the data set. These methods usually affect strongly the histogram by the start end points of the bins and their width. Additionally, the histogram present the inconvenient feature of non-smoothness (Silverman 1986).
Thus, Rosenblatt (1956) and Parzen (1962) developed the kernel density estimator that is smoth, controls bin boundary effects and (under very mild conditions) also converges to the true density, but faster than the histogram. A "kernel" is any smooth function (generally, a symmetric probability density) that depends on the bandwith parameter h which controls both the spread and the orientation. Just like in histograms, h determines the smoothness of the estimation. In practice, the choice of kernel is less important than tat of h. For example, a small value of h will lead to under-smoothing and masking important features of the data, such as skewness and multimodality. On the other hand, rough curves produced by larger values of h yield smoother estimates but might dodge significant peaks or other important structure estimate (Silverman 1986). Figure 1 shows the histogram along with several density estimates using different bandwidths for the intensity VV of Niigata data set. We see how sensitive the estimate of f in relation to h. Note that the graphs reveal the strong data asymmetry. In particular, due to the combined effect of asymmetry and large data spread, the bulk of the information is confined to a small region, approximately in the interval [0, .7], while there is little to visualize in the remaining area of the plot which spans in [.7, 2.5].
R produces high-quality and fully customizable histograms with the hist function.  Tukey (1973) introduced the boxplot to analyze univariate data sets by graphically displaying important core statistics of the data. The plot extracts a few descriptive parameters and shows information about location, spread, skewness as well as the tails of the data. A boxplot shows the data distribution in terms of its quartiles, labelled Q 1 , Q 2 , Q 3 (the first, second and third quartiles).

Boxplots and variants
Define the interquartile range as IQR = 1.5(Q 3 − Q 1 ) the boxplot is comprised of the following elements: 1. a box, with horizontal lines at Q 1 , Q 2 (the median) and Q 3 ; 2. vertical lines at W L = Q 1 − IQR and W U = Q 1 + IQR (the "whiskers", omitted inside the box); and 3. individual observations: all observations outside the (W L , W U ) range (outliers), plus two observations on either en which just fall inside this range.
A variation of the boxplot is the notched boxplot (McGill, Tukey & Larsen 1978) which is useful for determining whether two samples were drawn from the same population in terms of their median values. The notch displays a confidence interval around the median based on the Gaussian hypothesis: Q 2 ± 1.57 · IQR/ √ n.
According to Chambers, Cleveland, Kleiner & Tukey (1983), although not a formal test, if two boxes notches do not overlap there is "strong evidence" (95% confidence) that their medians differ.
R's default graphical tools include the boxplot function which has the option notch=TRUE to add a notch to the box.
In many situations, mainly when with skewed data, the boxplot may erroneously identify values which exceed the whiskers as outliers. To correct this distortion, Hubert & Vandervieren (2008) proposed an adjusted boxplot for skewed distributions. The main idea is the inclusion of the medcouple introduced by Brys, Hubert & Struyf (2004) as a robust measure of skewness in the determination of the whiskers. Also, the adjusted boxplot can be useful as a fast tool for automatic outlier detection, without making any assumption about the distribution of the data. The function adjbox of the R package robustbase can be used to produce an adjusted boxplot. Figure 2 illustrates these three types of boxplot with the Niigata VV data set. Notice that the classical and notched boxplots look alike; cf. figures 2(a) and 2(b), resp. This is typical of large samples, n = 4446 in this case, for which the width of the notch becomes negligible. The difference between both classical and notched boxplots and the adjusted boxplot shown in Fig. 2(c) is noticeable. While the two former identify numerous observations as outliers, the latter only considers a few as surprising data. This last graphical representation is more adequate and may lead to better informed decisions than the former. Esty & Banfield (2003) proposed the box-percentile plot as a variant of the boxplot which allows the sides of the plot to convey more information, presenting details about the distribution of the data. The width of the box is not fixed, but proportional the the number of data. In this way, the box-percentile plot summarizes more than the histogram, but shows more details than the boxplot. The HMisc package in R computes and displays this graphical summary with the function bpplot().
Generally speaking, boxplots are useful for small or moderate-sized data sets. The classical boxplot can be expected to label an increasing number of observations as outliers as the sample size grows. To improve the boxplot in this direction, Hofmann, Kafadar & Wickham (2011) proposed the letter-value boxplot by displaying more detailed information over the tails of distribution using letter values, but only out to the depths where the letter values are reliable estimates. In this form the "outliers" can be defined as a function of the most extreme letter value shown. The function LVboxplot, implemented in R in the appendix of Hofmann et al. (2011), can be used to produce letter-valued boxplots.
Recently, Marmolejo & Tian (2010) provided a comprehensive literature review on boxplot graphs, and proposed an important variant, the Shifting boxplot. This graphical summary incorporates the mean as basilar information instead of the median. The methodology supports conducting parametric tests. The code that produces is can be obtained directly from the authors. Figure 3 presents these last three graphical representations of the Niigata VV data set. Although the visual complexity is somewhat increased from the plots presented in Fig. 2, the amount of visual information is also enhanced.   Kampstra (2008) introduced the beanplot, an alternative to the boxplot for visual comparison of univariate data between groups. In this plot, individual observations are shown as small lines in a one-dimensional scatter plot, and then a density estimator of data distribution and the mean are showed. An interesting feature of this plot is that bimodalities and possible duplicated measurements are easily exposed. R has the beanplot package which includes a function for this purpose.
Finally, Hintze & Nelson (1998) presented the violin plot: a combination of a boxplot and a (doubled) kernel density plot. The violin plot does not include the individual points, but it displays the median and a box indicating the interquartile range. It is useful when comparing multiple groups and with large dataset. Overlaid on this boxplot is a kernel density estimation plot. There is also a vioplot package in R.

The proposed tool
We discussed in the previous section a number of graphical summaries. Each conveys an important aspect of the properties of the sample. Our proposal consists of the simultaneous use of them in a synchronized fashion, for an enhanced visualization of those different aspects.
In order to assess more adequately data whit highly asymmetry and to extract the largest possible amount of information and features from the data data set, we propose the simultaneous use of the histogram enhanced with density estimation, the boxplot with notches (B-N) , the violin plot (V-P), the shifting boxplot (S-P), the adjusted boxplot (A-B), and the box-percentile plot (B-P). All graphical summaries are synchronized with respect to a location parameter estimate. The proposed tool follows the guidelines proposed by Tufte (2001) for quality visual display of quantitative information, in particular, the ratio information/ink was maximized by showing only graphical elements which convey essential information.
The median is shown by as a vertical line connecting all plots A rug plot is displayed in the middle of the graphical summaries to reinforce the position of the data, and to reinforce the visual notion that the data is common to all plots. 3. Application to SAR data SAR sensors provide relevant and complementary information about the target. Their use has proven valuable from as diverse applications as the mapping of the surface of Venus by the Magellan and Venera missions (Arvidson, Schulte, Kwok, Curlander, Elachi, Ford & Saunders 1988), the unearthing of lost Maya  The main characteristics that make valuable the data provided by SAR sensors are (i) their ability to provide images with high spatial resolution independent from daylight, cloud coverage and weather and environmental conditions as fog, smoke, smog, rain etc., and (ii) the fact that their return is the result of complex interactions between the incident signal and the target, which complement the information available in the visible and near-visible spectrum. The recent tutorial by Moreira et al. (2013) is an excellent starting point for the reader interested in this field.
The simplest form SAR data adopt is the intensity. Mejail, Jacobo-Berlles, Frery & Bustos (2003) presented evidence that the G 0 I model is able to describe many types of target textures, from textureless (such as crops) to extremely textured (as, for instance, urban areas), but including areas with moderate texture (e.g. forests). This model had been proposed by Frery, Müller, Yanasse & Sant'Anna (1997), and was later extended to the full polarimetric case by Freitas, Frery & Correia (2005). The G 0 I distribution has positive support, and it is indexed by three parameters: the number of looks, which describes the signal-to-noise ratio, the scale, which amounts for the energy incident in the receiving antenna relative to the emitted power, and the texture. Depending on a relationship between the two last parameters, the r-th order moment of this this distribution is infinite.
The visualization of both the density and data from this distribution can be demanding, since extreme observations are expected. Among the works which present qualitative analyses of SAR data, Frery, Correia & Freitas (2007) and Doulgeris et al. (2011) make critical decisions based on such information. The former decides which joint distribution will be used for the classification, while the latter forms a stopping rule for the iterative segmentation.
In the following we present the data that will be analyzed with the proposed technique. Fig. 6 presents the color composites of three images from different PolSAR sensors and areas. Fig. 6(a) is from the San Francisco area, and the area under analysis is highlighted in yellow; it is a textureless sample from the sea. Fig. 6(b) is from the Death Valley, and the sample in red has moderate texture. Fig. 6(c) is from Niigata, and the sample in yellow has extreme texture since it is from an urban area. Table 1 presents quantitative summary information about these data sets. As expected, regardless the range, the image and the polarization of the data, there is intense skewness and kurtosis in all cases. Figures 7, 8 and 9 show the results of applying the proposed technique to each polarization of the samples from figures 6(a), 6(b) and 6(c), respectively. As can be seen from the values presented in Table 1, these data sets are comparable within each image, but the ranges differ widely among images.
The samples presented in Fig. 7 look alike when comparing the histograms and fitted densities, but important differences arise in the violin and shifting boxplots. In these graphical summaries is clearer that the HV band has more spread than the other two. The extent can be visually quantified by the shifting boxplots.
The data from the Death Valley image, summarized in Fig. 8, are the most symmetric; this confirms the values presented in Table 1. Nevertheless, one observes in the boxplots that there are outliers to the right of the three samples. If asymmetry is assumed, the adjusted boxplot also detects ouliers to the left of two of the three polarizations, namely in figures 8(a) and 8(c). Fig. 9 shows at a glance the different behavior among polarizations, with the HV channel exhibiting more spread than the other two; cf. the box-percentile and violin plots Although HH and VV polarizations behave alike, the adjusted boxplot reveals that the former has more outliers than the latter.

Conclusions and future work
As presented in the examples, the use of synchronized graphical summaries promoted the discovery of information conveyed by the data. Had only one type of plot been used, some of these features would have not been identified.
Synchronization is essential for retaining the ability to compare graphical representations of the same data set. A loose presentation of two or more of the plots would not allow the discovery of such information.
More customizable options are being added to the tool as, for instance, the ability to choose interactively the order in which the plots appear.
In its current version, the function does not return any object. This can be easily customized, for instance using lists.  V−P S−P q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q A−B B−P