Descriptive statistics and visualization of data from the R datasets package with implications for clusterability

The manuscript describes and visualizes datasets from the datasets package in the R statistical software, focusing on descriptive statistics and visualizations that provide insights into the clusterability of these datasets. These publicly available datasets are contained in the R software system, and can be downloaded at https://www.r-project.org/, with documentation provided at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html. Further information on clusterability is found in the companion to this article, To Cluster or Not to Cluster: An Analysis of Clusterability Methods? (https://doi.org/10.1016/j.patcog.2018.10.026). Brief descriptions and graphs of the variables contained in each dataset are provided in the form of means, extrema, quartiles, standard deviation and standard error. Two-dimensional plots for each pair of variables are provided. Original references to the data sets are included when available. Further, each dataset is reduced to a single dimension by each of two different methods: pairwise distances and principal component analysis. For the latter, only the first component is used. Histograms of the reduced data are included for every dataset using both methods.


Data
This paper highlights statistical summaries and visualizations with nine tables and eighteen figures for selected data from the datasets package within R software [12,13], detailed in Section 2. Tables provide means, medians, ranges, standard deviations, and standard errors for all variables. Figures show plots of each pair of variables and unidimensional summaries of all datasets. For Figs. 10–18, the left plots are histograms of the sets of pairwise Euclidean distances for the corresponding dataset, and the right plots are histograms of the first principal component (PC1).
Specifically, waiting and eruption times of Old Faithful are described numerically in Table 1 and plotted in Fig. 1; projections via pairwise distances and PC1 are in Fig. 10. Iris flower measurements are in Fig. 2, distances and PC1 are plotted in Fig. 11, and descriptive statistics are in Table 2. North American river lengths are summarized in Table 3 and plotted in Fig. 3, with pairwise distances in Fig. 12. Table 4 quantifies demographics in Swiss provinces. Pairwise plots of variables, distances between points, and PC1 are in Figs. 4 and 13. Table 5 quantifies employee favorability proportions for seven behaviors, which are plotted in Figs. 5 and 14. Table 6 reports stopping distance and speed for 50 cars. Plots and projections are in Figs. 6 and 15. Table 7 reports tree dimensions, which are plotted in Fig. 7, along with projections in Fig. 16. Table 8 enumerates ratings of US federal judges, plotted in Fig. 8; distances and PC1 are in Fig. 17. Table 9 summarizes state-level crime related variables. Visualizations are in Figs. 9 and 18.

Value of the Data
R is a free, powerful statistical programming language compatible with Windows and Unix systems and is utilized by statisticians, computer scientists, data scientists, and other analysts, such as biologists working in genomics. The R datasets package [12,13] includes dozens of datasets for use in education and research. Data are often used for demonstration by students learning established methodology and researchers testing new methods.
Descriptive statistics for all features and visual presentation of the data in the form of two-dimensional plots in a single document may help researchers and students to quickly comprehend the content of the data and evaluate which data may be best suited to their goals.
Simultaneous presentation of the descriptive statistics and each trio of plots (original 2D plots of data, distances, and principal component) shows researchers the component features and their ranges in each dimension, along with two unidimensional visual summaries for each dataset. The first principal component is a useful one-dimensional summary for each dataset. Histograms of pairwise distances yield one-dimensional visualizations applicable to cluster analysis. These visual summaries are easier for researchers to evaluate for research and educational purposes than raw data or text.
The graphs presented have implications for clustering and clusterability, as described in our accompanying article [1].

Experimental design, materials, and methods
The data highlighted in this paper are a subset of nine datasets suitable for cluster analysis: faithful, iris, rivers, swiss, attitude, cars, trees, USJudgeRatings, and USArrests. The following is a brief summary of each dataset; details are in Section 2.1. Faithful [2,7] includes eruption times and waiting times between eruptions of the geyser known as Old Faithful. Iris [5] consists of sepal length and width and petal length and width for 150 flowers. Rivers [10] provides lengths of 141 rivers in North America. Swiss [11] includes six demographic and fertility variables for 47 Swiss provinces. Attitude [3] measures worker attitudes for seven items from a survey of employees at a large financial company. Cars [4] provides speed and stopping distance for 50 cars. Trees [14] includes three measurements each on 31 black cherry trees. USJudgeRatings [8] includes lawyers' ratings of US Superior Court judges. USArrests [10] provides crime variables for each state. Section 2.1 includes additional details, numerical summaries, and plots for real datasets from the R datasets package [12] used in our accompanying study [1]. Section 2.2 focuses on the two unidimensional projections of the data. Code used to produce all items in this paper is included in the file entitled "DIBcode.R."

Descriptive statistics and raw data plots
Numerical summaries of the data were calculated using the stat.desc() function within the pastecs package [6]. The summaries we display consist of the minimum, maximum, range, median, mean, standard deviation, and standard error. Scatter plots of the two-dimensional projections based on all pairs of variables are provided for each dataset. These sets of two-dimensional projections are produced using the plot() command in R [12]. For example, the command plot(iris) produced the projections in Fig. 2.
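As a rough illustration of the summaries described above, the following sketch computes the same seven quantities in base R for the faithful data. It is not the authors' DIBcode.R script. The helper name describe is hypothetical, and the code assumes the variables are numeric with no missing values. If the pastecs package happens to be installed, the matching rows of stat.desc() are printed for comparison.

```r
# Illustrative sketch only; `describe` is a hypothetical helper, not part of pastecs.
# Assumes a numeric vector with no missing values.
describe <- function(x) {
  c(min    = min(x),
    max    = max(x),
    range  = max(x) - min(x),
    median = median(x),
    mean   = mean(x),
    SD     = sd(x),
    SE     = sd(x) / sqrt(length(x)))  # standard error of the mean
}

# One column of summaries per variable in the faithful data
summary_tab <- sapply(faithful, describe)
print(round(summary_tab, 3))

# If pastecs is available, pull the corresponding rows from stat.desc()
if (requireNamespace("pastecs", quietly = TRUE)) {
  rows <- c("min", "max", "range", "median", "mean", "std.dev", "SE.mean")
  print(round(pastecs::stat.desc(faithful)[rows, ], 3))
}

# Scatter plots for all pairs of variables, as in the plot(iris) call above
plot(iris)
```

Computing the standard error as the standard deviation divided by the square root of the sample size is the usual definition for the standard error of the mean, which is also what the SE.mean row of stat.desc() reports.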
The following subsections provide background on each dataset used and the variables contained therein.

Faithful
The faithful dataset [2,7] contains two variables for the Old Faithful geyser. The first is the eruption duration, and the second is the waiting time between eruptions. Both are measured in minutes. Fig. 1 displays the data. Table 1 summarizes the statistical properties of these features.

Iris
The iris [5] dataset contains measurements on 150 flowers from the following three species: Iris setosa, versicolor, and virginica. The variables, all measured in centimeters, include the sepal length and width and petal length and width. Descriptive statistics for the four features are included in Table 2. These features, along with the species, are displayed in Fig. 2.

Rivers
In the rivers [10] dataset, the length, in miles, is recorded for 141 major rivers in North America. The mean, median, extrema, standard deviation, and standard error of the river lengths are provided in Table 3. Because the data contain only one variable, there is no two-dimensional projection; instead, see the one-dimensional plot in Fig. 3.

Swiss
The swiss [11] data includes 47 French-speaking nineteenth-century Swiss provinces, each of which contains six measures of socio-economic status and fertility. Fertility is measured using a standardized variable [13]. The remaining five variables are percentages corresponding to agricultural workers, high scores on the army exam, education past primary school, members of the Catholic religion, and infant deaths. Pairwise plots of the 47 points for each pair of measures are included in Fig. 4. Numerical summaries are found in Table 4.

Attitude
The dataset attitude [3] consists of seven employment behavior variables based on a survey completed by employees within a large company in the financial sector. Thirty departments were randomly selected, and the responses of the approximately thirty-five employees within each department were aggregated to calculate the seven measures. Each measure represents the proportion of favorable responses within a department to one of seven questions.
Each of the seven questions could receive a favorable or unfavorable answer on one of the following themes: overall rating, handling of employee complaints, the granting of special privileges to some individuals and not others, ample opportunity to learn, raises given based on performance, critical evaluations, and opportunities for advancement. Descriptive statistics and plots of the raw data are found in Table 5 and Fig. 5.

Cars
Recorded in the 1920s, cars [4] consists of 50 observations and two variables, representing speed and stopping distance. Speed is measured in miles per hour. Stopping distance is measured in feet. Table 6 includes numerical summaries of the stopping distance and speed across the fifty cars. Fig. 6 contains a plot of these two features.

Trees
The trees [14] dataset is depicted in Fig. 7. Features include measurements of the girth, height, and volume of timber in 31 felled black cherry trees. The units for girth, height, and volume are inches, feet, and cubic feet, respectively. Descriptive statistics on these variables are included in Table 7.

USA judge ratings
The dataset USJudgeRatings [8] contains 43 observations with ratings from lawyers on twelve elements related to judges from the U.S. Superior Court.
The following are the twelve elements: number of contacts of lawyer with judge, judicial integrity, demeanor, diligence, case flow managing, prompt decisions, preparation for trial, familiarity with law, sound oral rulings, sound written rulings, physical ability, and worthiness of retention.
Descriptive statistics are found in Table 8. Plots of the scores for the forty-three judges for each pair of elements are included in Fig. 8.

USA arrests
The USArrests [10] dataset, summarized in Table 9 and Fig. 9, contains measurements from 1973 for each of the fifty states on four variables: urban population percentage and the number of arrests per 100,000 residents for assault, murder, and rape.

One-dimensional projections
One-dimensional summaries of all data are discussed in this section. Two projections are examined side by side. The first is the set of pairwise distances between the points. The second is the first principal component.
Histograms were made using the hist() function in R. For histograms of the set of dissimilarities, the distance metric employed in the present manuscript is Euclidean distance, defined as the square root of the sum of the squares of the differences between the values of each variable for a pair of observations. Distances were computed using the dist() function in R. All data were scaled to have unit variance before analysis using the scale() function in R. Principal component analysis [9] was executed in R via singular value decomposition using the prcomp() function. The first principal component of each scaled dataset was extracted and examined visually with histograms.
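The steps in the preceding paragraph can be sketched in a few lines of base R. This is an illustrative sketch rather than the authors' DIBcode.R script. It uses the four numeric columns of iris as an example, relies on the default arguments of scale(), dist(), and prcomp(), and uses plot titles chosen here for illustration. Note that scale() also centers each variable by default, which does not change the pairwise distances.

```r
# Illustrative sketch only, using the four numeric columns of iris.
x <- scale(iris[, 1:4])     # center and scale each variable to unit variance

# Projection 1: the set of pairwise Euclidean distances
d <- as.vector(dist(x))     # dist() uses method = "euclidean" by default

# The first entry of d is the distance between observations 1 and 2;
# check it against the definition (root of the summed squared differences)
manual <- sqrt(sum((x[1, ] - x[2, ])^2))
stopifnot(isTRUE(all.equal(manual, d[1])))

# Projection 2: scores on the first principal component
pca <- prcomp(x)            # PCA computed via singular value decomposition
pc1 <- pca$x[, 1]

# Side-by-side histograms: distances on the left, PC1 on the right
op <- par(mfrow = c(1, 2))
hist(d,   main = "Pairwise distances (iris)",        xlab = "Distance")
hist(pc1, main = "First principal component (iris)", xlab = "PC1")
par(op)
```

For n observations, the distance vector has n(n - 1)/2 entries, so the left histogram summarizes 11,175 distances for the 150 iris flowers, while the right histogram summarizes 150 component scores.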
The distributions of the pairwise Euclidean distances and first principal component are shown in side-by-side plots in Figs. 10–18. Code to produce the plots is included in the supplementary material. Because this paper is not focused on classification, the species variable from the iris data is not used in dimension reduction. Rather, the unidimensional reductions are computed based only on the first four features. Histograms of the pairwise distances and first principal component for iris are found in Fig. 11. For all other datasets, all variables were used for dimension reduction. Pairwise distances for the rivers data are included in Fig. 12. However, no dimension reduction by principal component analysis is performed for rivers, because the data are already one-dimensional and principal component analysis is not recommended for such data.

Table 4 Descriptive statistics for the swiss data. Fertility is measured via a standardized variable. Agriculture is the percentage of males in the population employed in agriculture. Examination is the percentage of draftees receiving the highest mark on the army examination. Education is the proportion of the population of draftees with education beyond primary school. Catholic is the percentage of the population who identifies as Catholic. SD and SE, respectively, denote the standard deviation and standard error.

Table 6 Descriptive statistics for the cars data. Speed is measured in miles per hour; distance is measured in feet. SD and SE, respectively, denote the standard deviation and standard error.

Table 7 Descriptive statistics for the trees data. Girth is measured in inches, while height is in feet, and volume in cubic feet. SD and SE, respectively, denote the standard deviation and standard error.