A New Descriptive Statistic for Functional Data: Functional Coefficient Of Variation

In this study, we propose a new descriptive statistic, the coefficient of variation function, for functional data analysis and present its use. We recommend the coefficient of variation function especially for comparing the variation of multiple curve groups whose mean functions differ. In addition, obtaining coefficient of variation functions in terms of cubic B-splines enables the interpretation of their first and second derivative functions and provides stronger inference for the original curves. The use and effects of the proposed statistic are reported on a well-known data set from the literature. The results show that the proposed statistic reflects the variability of the data properly, and that it does so more clearly than the standard deviation function, especially as the mean functions differ.


Introduction
The increase in storage power of computers has led to new types of data which require new tools for analysis. Functional data analysis is one of those new tools which emerged to meet this requirement. In the last 15-20 years, studies on functional data analysis began to spread after the seminal work of Ramsay and Silverman (1997).
A functional datum is not a single observation but rather a set of measurements along a continuum that, taken together, are to be regarded as a single entity, curve or image. Usually the continuum is time, and in this case the data are commonly called "longitudinal." But any continuous domain is possible, and the domain may be multidimensional (Levitin et al., 2007).
Functional data analysis techniques such as functional principal component analysis, functional canonical correlation analysis, and functional regression analysis are used in a wide range of domains, from medical data to psychological data. A detailed review of applications of functional data analysis can be found in Ullah and Finch (2013).
Our aim in this paper is to propose a new descriptive statistic, the "coefficient of variation", for functional data. Although the standard deviation function and the mean function themselves are not new, a coefficient of variation (CV) function is needed to compare the variation between curve groups, especially when the mean curves differ between groups.
The paper is organized as follows. Section two presents the common descriptive statistics of functional data: the mean function and the standard deviation function. The concept and formulation of the proposed coefficient of variation function are then introduced. In section three, the well-known Berkeley Growth Curve Data is investigated with both the common functional descriptive statistics and the functional CV, and the benefits and effects of the proposed statistic are discussed. Finally, section four presents conclusions and suggestions.

Mean and Standard Deviation Functions
Classical descriptive statistics for univariate data can similarly be applied to functional data with minor modifications. The mean function is a simple analogue of classical mean for univariate data. It can be calculated by averaging the functions pointwise across replications.
In functional data, $N$ observation curves are assumed to be observed at the points $t_1, t_2, \ldots, t_n$. The $j$-th point on the $i$-th curve is denoted by $x_i(t_j)$, where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, n$.
Briefly, the mean curve, which corresponds to the mean at each observation point, is obtained by Eq. (1):
$$\bar{x}(t) = \frac{1}{N} \sum_{i=1}^{N} x_i(t) \qquad (1)$$
Mean curves can be compared with each other at all time points across curve groups to see the time ranges in which an evident decrease or increase of the means occurs.
Alternatively, if the functions have a large noise component or other undesirable nonsmooth components, it may be preferable to first approximate them by suitable splines and then take the average of the approximations (Ramsay, 1982). In other words, the accuracy of the estimates can be increased by smoothing (Rice & Silverman, 1991).
In functional data analysis, variation is measured by variance and covariance functions. The variance function is defined as
$$\mathrm{Var}_X(t) = \frac{1}{N-1} \sum_{i=1}^{N} \left[ x_i(t) - \bar{x}(t) \right]^2, \qquad t \in [a,b],$$
where $[a,b]$ is the real interval on which the function is defined. The positive square root of this function is called the standard deviation function. As with the mean function, the variance function or standard deviation function can be smoothed if required.
These functions are also simple analogues of classical univariate measures and have similar interpretations. Standard deviation function shows the variation between curves across time points. Detailed information on descriptive statistics of functional data can be found in Shang (2015).
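The pointwise definitions above can be illustrated with a minimal sketch in plain Python. It assumes that all curves are sampled on a common grid and that the variance uses the divisor N - 1; the function names and the toy curves are hypothetical and are not taken from the paper's Matlab code, which works with basis coefficients rather than raw grid values.

```python
import math


def mean_function(curves):
    """Pointwise mean of N curves sampled on a common grid of n points."""
    n_curves = len(curves)
    # zip(*curves) iterates over grid points, yielding the N values at each point.
    return [sum(values) / n_curves for values in zip(*curves)]


def variance_function(curves):
    """Pointwise sample variance across curves (divisor N - 1, an assumption)."""
    n_curves = len(curves)
    means = mean_function(curves)
    return [
        sum((x - m) ** 2 for x in values) / (n_curves - 1)
        for values, m in zip(zip(*curves), means)
    ]


def sd_function(curves):
    """Pointwise standard deviation: positive square root of the variance."""
    return [math.sqrt(v) for v in variance_function(curves)]


# Toy example: three made-up curves observed at three common time points.
toy_curves = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
]
toy_mean = mean_function(toy_curves)
toy_sd = sd_function(toy_curves)
print(toy_mean)
print(toy_sd)
```

In this toy example both the mean and the spread grow along the grid, which is the situation in which a raw standard deviation function is hard to compare across groups.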

A New Statistic: Coefficient of Variation Function
For univariate data, when the degree of variation must be compared from one data series to another, and especially when the means and/or scales differ drastically, the coefficient of variation is more useful than the standard deviation. Since it is the ratio of the standard deviation to the mean, the coefficient of variation measures the variation of a data series independently of its measurement units, so it can be used to compare data with different scales or different means. The coefficient of variation is most appropriate for ratio-scale measurements, since these possess a meaningful zero value, which makes the mean easier to interpret. Non-negative data are preferable because they guarantee a non-negative mean.
Functional data need similar interpretations. If the variability of multiple functional data groups is to be compared, especially when the mean functions of these groups differ, coefficient of variation functions can be used instead of standard deviation functions. However, no such statistic has been proposed in the literature so far. In this study, the following coefficient of variation (CV) function is proposed:

CV function = Standard deviation function / Mean function
To find the CV function, the standard deviation coefficients and the mean function coefficients are computed first. The standard deviation coefficients are then divided by the mean function coefficients to obtain the CV function coefficients, and the CV functions are formed via basis functions. However, as is the case for the univariate CV, the CV function is very sensitive near zero: when the curves have a mean function very close to zero, the CV function is strongly affected. Hence, use of the proposed CV function is not recommended in that case.
Among other approaches, the basis function approach is primarily used when estimating the curves in functional data analysis because of its differentiability. The complete computational procedure for the estimation of the mean, standard deviation, and proposed CV functions according to the basis function approach is given in Appendix A. The calculated coefficients for the CV functions can be found in Appendix B. The Matlab code files of Ramsay (2015) are partly used and modified in order to calculate the proposed statistic. The modified code files can be found on the corresponding author's website (Keser & Deveci Kocakoç, 2015).
Alphanumeric Journal Volume 4, Issue 2, 2016
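As a rough illustration of the ratio behind the CV function, the following Python sketch evaluates the standard deviation divided by the mean directly at each grid point. This is a simplification and not the coefficient-based procedure of Appendix A: it assumes curves sampled on a common grid, uses the divisor N - 1 for the variance, and the function name, the `min_abs_mean` threshold, and the toy data are all hypothetical.

```python
import math


def cv_function(curves, min_abs_mean=1e-8):
    """Pointwise CV: standard deviation divided by mean at each grid point.

    Returns None at grid points where |mean| < min_abs_mean, because the
    ratio is unstable when the mean function is close to zero.
    """
    n_curves = len(curves)
    cv = []
    for values in zip(*curves):  # the N observed values at one grid point
        mean = sum(values) / n_curves
        variance = sum((x - mean) ** 2 for x in values) / (n_curves - 1)
        if abs(mean) < min_abs_mean:
            cv.append(None)
        else:
            cv.append(math.sqrt(variance) / mean)
    return cv


# Two made-up groups with the same pointwise spread but different mean levels.
low_group = [[9.0, 19.0], [10.0, 20.0], [11.0, 21.0]]
high_group = [[99.0, 199.0], [100.0, 200.0], [101.0, 201.0]]
cv_low = cv_function(low_group)
cv_high = cv_function(high_group)

# A group whose mean is zero at its single grid point: the CV is not reported.
cv_zero_mean = cv_function([[-1.0], [0.0], [1.0]])

print(cv_low, cv_high, cv_zero_mean)
```

Both toy groups have a standard deviation of 1 at every grid point, so their standard deviation functions coincide, while the CV functions differ by a factor of ten because the mean levels differ.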

An Application to Height Data
The Berkeley Growth Data (Tuddenham & Snyder, 1954) are analyzed in order to present the benefits of the proposed CV function. The growth data form a well-known data set that has been used and investigated in many studies on functional data, such as Ramsay and Silverman (2005). Here, we first examine the individual functions, mean functions, and derivative functions for girls and boys. Subsequently, the difference between the mean functions is tested and presented by means of a functional t-test. The effects of this difference on the standard deviation function and the CV function are then investigated.

Examination of Individual Functions, Mean Functions and Derivative Functions
Individual functions for girls and boys are constructed by cubic B-splines and given in Figures 1a and 1b. When these figures are examined, it can be seen that the individual functions for both girls and boys increase regularly. Since cubic B-splines provide the first and second derivative functions, we can also examine the growth rate and acceleration. The first derivative functions (Figures 2a and 2b), which give the growth rate, show the pubertal growth spurts for both groups. As expected, the first derivative functions have a negative slope, the opposite of the individual functions. Growth spurts seem to happen with a jump between ages 10 and 12 for girls and between ages 12 and 15 for boys. In addition, slight individual shifts can be seen for each of the girls and boys. The mean functions and their first derivatives, given in Figures 3a and 3b, may be more useful for seeing the group differences. When the mean functions in Figure 3a are examined, we can see that the means of boys and girls follow each other very closely, with boys ahead of girls, up to the age of 11. After 11, girls get ahead of boys, and the difference in means becomes larger. After age 13, boys again get ahead of girls.
Although the group mean functions give insight into each gender's growth pattern, it is the first derivatives of the mean functions that give the starting ages of the growth spurts and thus enable us to investigate the data thoroughly. The first derivatives of the mean functions in Figure 3b show that the growth spurts of boys follow the growth spurts of girls very closely. In particular, the growth spurts at ages 10-12 for girls and 12-15 for boys can be recognized very distinctly. The real difference begins around age 12, when the boys' growth spurt actually starts. After age 13, the boys' mean function clearly starts to rise. All these interpretations are in accordance with reality and with other studies using this data set.
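The idea of reading a growth spurt from a first derivative can be sketched in Python as follows. This is only an illustration under stated assumptions: it replaces the cubic B-spline derivative used in the paper with simple finite differences, the helper names are hypothetical, and the height values are made-up numbers, not the Berkeley data.

```python
def first_derivative(times, values):
    """Finite-difference derivative on a grid that may be unevenly spaced.

    Central differences are used at interior points and one-sided
    differences at the two end points.
    """
    n = len(times)
    derivative = []
    for j in range(n):
        lo = max(j - 1, 0)
        hi = min(j + 1, n - 1)
        derivative.append((values[hi] - values[lo]) / (times[hi] - times[lo]))
    return derivative


def peak_velocity_age(times, velocity, start, end):
    """Age within [start, end] at which the velocity is largest."""
    candidates = [(v, t) for t, v in zip(times, velocity) if start <= t <= end]
    best_velocity, best_age = max(candidates)
    return best_age, best_velocity


# Synthetic mean height curve (cm) at yearly ages; illustrative values only.
ages = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
heights = [128.0, 133.0, 138.0, 144.0, 152.0, 159.0, 163.0, 165.0, 166.0]

velocity = first_derivative(ages, heights)
spurt_age, spurt_velocity = peak_velocity_age(ages, velocity, 9.0, 15.0)
print(spurt_age, spurt_velocity)
```

The search is restricted to an age window because growth velocity in early childhood can exceed the pubertal peak, so a global maximum would not necessarily identify the spurt.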

Examination of Standard Deviation Functions and CV Functions
To see whether the difference between the mean functions is statistically significant, a functional t-test may be used. Ramsay and Silverman (2005) suggest a pointwise test approach based on a permutation method. This method randomly reassigns the curve labels and calculates the test statistic each time; the result is referred to as the permuted test statistic. The process is repeated ten thousand times to obtain a null distribution, which provides a reference for evaluating the maximum of T(t). The test statistic is also calculated from the original data and is referred to as the observed test statistic. A p-value is obtained as the proportion of permuted test statistics that are greater than or equal to the observed test statistic. The null hypothesis is rejected for high values of the test statistic (Lee, 2005; Ramsay et al., 2009; Coffey & Hinde, 2011). This procedure is advantageous because it is distribution independent. Moreover, it is an exact level α test because of the properties of the permutation test, so it gives valid p-values (Lee, 2005). For the steps of functional t-tests based on a permutation method, see Cox and Lee (2008), Yaree (2011), and Keser (2014).
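The permutation procedure described above can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the study: it assumes curves sampled on a common grid, takes T(t) to be the absolute two-sample t-statistic with unpooled variances, and the function names, the `n_perm` and `seed` parameters, and the toy curves are hypothetical.

```python
import math
import random


def t_function(group_a, group_b):
    """Absolute pointwise two-sample t-statistic on a common grid."""
    n_a, n_b = len(group_a), len(group_b)
    stats = []
    for values_a, values_b in zip(zip(*group_a), zip(*group_b)):
        mean_a = sum(values_a) / n_a
        mean_b = sum(values_b) / n_b
        var_a = sum((x - mean_a) ** 2 for x in values_a) / (n_a - 1)
        var_b = sum((x - mean_b) ** 2 for x in values_b) / (n_b - 1)
        std_err = math.sqrt(var_a / n_a + var_b / n_b)
        diff = abs(mean_a - mean_b)
        if std_err == 0.0:
            # Degenerate case: no variation within either group at this point.
            stats.append(0.0 if diff == 0.0 else math.inf)
        else:
            stats.append(diff / std_err)
    return stats


def permutation_t_test(group_a, group_b, n_perm=10000, seed=0):
    """Permutation p-values for the pointwise statistics and for max T(t)."""
    rng = random.Random(seed)
    observed = t_function(group_a, group_b)
    observed_max = max(observed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    exceed_pointwise = [0] * len(observed)
    exceed_max = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign whole curves to the two groups
        permuted = t_function(pooled[:n_a], pooled[n_a:])
        if max(permuted) >= observed_max:
            exceed_max += 1
        for j, (perm_stat, obs_stat) in enumerate(zip(permuted, observed)):
            if perm_stat >= obs_stat:
                exceed_pointwise[j] += 1
    return {
        "observed": observed,
        "p_max": exceed_max / n_perm,
        "p_pointwise": [count / n_perm for count in exceed_pointwise],
    }


# Made-up curves on a two-point grid: similar at the first point, shifted at the second.
toy_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
toy_b = [[0.5, 10.5], [1.5, 11.5], [2.5, 12.5], [3.5, 13.5]]
result = permutation_t_test(toy_a, toy_b, n_perm=1000, seed=0)
print(result["p_pointwise"], result["p_max"])
```

Whole curves are shuffled between the groups, rather than individual points, so that the dependence along each curve is preserved under the null distribution.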
In this study, the observations for both girls and boys are analyzed over the age range 1-18 years at 31 time points. The result of the functional t-test is given in Figure 4. With the pointwise functional t-test, the time at which girls and boys start to differ can be revealed. Time points where the observed statistic exceeds the pointwise critical value indicate a difference in means. If the observed statistic exceeds the conservative reference line (maximum critical value), then the difference is statistically significant at the 0.05 level. When the pointwise critical value is taken as the reference, the means of girls and boys tend to differ starting from approximately age 14. When the conservative reference line (maximum critical value) is taken into consideration, the strong differences start approximately halfway through the boys' pubertal growth spurt and last until age 18. This corresponds to the period between ages 14 and 18, which is fully consistent with the expectations of the study. Based on this t-test, it can therefore be concluded that gender has an effect on the group means of the height data.
Figures 5a and 5b show the standard deviation and CV functions, respectively. Both the standard deviation function and the CV function clearly show the change in variation for boys and girls during their growth spurts. However, the CV function in Figure 5b shows this change more distinctly, since it also accounts for the change in the mean. In particular, the effect of the group mean after the significant difference can be seen better in the CV function. As can be seen from Figure 5b, the variation of girls exceeds the variation of boys after age 16 once the group mean information is introduced into the statistic. In other words, as the group mean changes, the relative group variation also changes. The standard deviation function does not give this information. Hence, the CV function provides better insight into the effect of the mean function and the changes in variation between groups.

Conclusions
In this study, we propose a coefficient of variation function for functional data analysis as a descriptive statistic. Height data of girls and boys are analyzed in order to demonstrate the usefulness of the proposed CV function in detecting changes in variation during growth spurts. The CV function is found to be a practical descriptive statistic, especially when the mean curves of functional data groups are different. We found that as the group mean changes, the relative group variation may also change, but the standard deviation function does not capture this information.
Hence, when we compare more than one group of functional data, CV function can be utilized to have a better insight on the effect of mean function and the variation changes between groups. The availability of first and second derivatives of CV function also strengthens its utilization.
This study gives a hint of new opportunities in analysis of functional data. Simulation studies or comparative studies may be conducted to show different uses of CV function in special cases. Comparative studies with standard deviation may also be conducted.