A Method to Select Bivariate Copula Functions

Copula functions have been extensively used in applied statistics, becoming a good alternative for modeling the dependence of multivariate data. Each copula function has a different dependence structure. An important issue in these applications is the choice of an appropriate copula function model for each case where standard classical or Bayesian discrimination methods could be not appropriate to decide by the best copula. Considering only the special case of bivariate data, we propose a procedure obtained from a recently dependence measure introduced in the literature to select an appropriate copula for the statistical data analyses.


Introduction
In many different areas of knowledge such as medicine, engineering, economy and ecology, it is possible to have a set of observations obtained from variables whose natural behavior has some dependence structure.To model this dependence, there are many statistical techniques, models and indexes introduced in the literature as for instance frailty models, correlation coefficients, concordance coefficients, etc. (See Goethals, Janssen & Duchateau 2008).
Since the introduction of Sklar's theorem (Sklar 1959), many parametric, nonparametric and semiparametric models were proposed derived from different copula functions assuming different probability distributions, including methods for constructing models for copulas from different probability distributions (e.g., see, Durante & Sempi 2015, Nelsen 2006), most of which are parametric models.
Copula function are used to link marginal distributions with a joint distribution.For specified univariate marginal distribution functions F 1 (t 1 ), F 2 (t 2 ), . . ., F m (t m ), the function C(F 1 (t 1 ), F 2 (t 2 ), . . ., F m (t m )) = F (t 1 , t 2 , . . ., t m ), which is defined using a copula function C, results in a multivariate distribution.On the other hand, any multivariate distribution function F can be written in the form of a copula function; i.e, if F (t 1 , t 2 , . . ., t m ) is a joint multivariate distribution function with univariate marginal distribution functions F 1 (t 1 ), F 2 (t 2 ), . . ., F m (t m ), there is a copula function C(u 1 , u 2 , . . ., u m ), so that: If every F i is continuous, then C is unique and u i = F i .For the special case of bivariate distributions, m = 2.
The approach to formulating a multivariate distribution using a copula is based on the concept that a simple transformation can be made of each marginal variable so that each transformed marginal variable has a uniform distribution.Then the dependence structure can be expressed as a multivariate distribution on these obtained uniforms, and a copula is a multivariate distribution with marginally uniform random variables.Consequently there are many families of copulas that differ in the details of their dependence structure.In the bivariate case, let T 1 and T 2 be two random variables with continuous distribution functions F 1 and F 2 .The probability integral transformation can be applied separately to both random variables to define U 1 = F 1 (t 1 ) and U 2 = F 2 (t 2 ), where U 1 and U 2 have uniform (0, 1) distributions, but are usually dependent if T 1 and T 2 are dependent (Independent T 1 and T 2 imply that U 1 and U 2 are independent).Specifying dependence

63
between T 1 and T 2 is the same as specifying dependence between uniform random variables U 1 and U 2 .In this case, there is a need to specify a bivariate distribution between two uniform distributions, i.e, a copula.
Within each application field, it is needed in general, a suitable measure of the strength of dependence of the two random variables in each application.Dependence scalar measure indexes or global measures of dependence for two random variables have been studied by many authors as Jogdeo (1982), Lancaster (1982), Drouet & Kotz (2001), Balakrishnan & Lai (2009) and in many cases the use of scalar dependence bivariate indexes could be not the best way to represent complex dependence structure, so other dependence structures have been introduced in the literature, see Kowalczyk, Pleszczynska et al. (1977), Bjerve & Doksum (1993), Drouet & Kotz (2001) and Bairamov, Kotz & Kozubowski (2003).Considering the case of two dependent random variables, generally it is very difficult to establish non linear dependence structures using indexes making necessary the use of copula functions but, it is very common to have difficulties to decide on the best copula function to be fitted by the data since the literature has not yet presented many discrimination methods for copula models.In a recent paper, Ledwina (2015) proposed a new function valued measure of dependence for two random variables T 1 and T 2 also presenting its basic properties.This proposed measure has a simple form which explores only cumulative distribution functions taking values in the [−1, 1] interval treating both variables symmetrically.The correlation order or the equivalent concordance order is the quadrant order restricted to the class of distributions where fixed margins are preserved.The Ledwina dependence measure that assumes two random variables T 1 and T 2 is expressed as: where and is the joint distribution function for T 1 and T 2 .
A simple empirical estimator for the Ledwina dependence giving in (2) is proposed considering the bivariate data set of n pairs (t 1i , t 2i ), i = 1, . . ., n replacing F (t 1 , t 2 ), F 1 (t 1 ) and F 2 (t 2 ) with their respective empirical estimates, F n (t 2 ) = N umber of obs T 2 ≤ t 2 n for fixed values t 1 and t 2 .
Given that q treats both variables T 1 and T 2 symmetrically, knowledge of q and the marginal distributions allows one to recover the joint distribution function for T 1 , T 2 .Properties of q and further details can be found en (Ledwina 2015).
Using equation (2) it is possible to obtain a copula-based measure of dependence assuming joint distribution functions F (t 1 , t 2 ) with continuous margins F 1 (t 1 ) and F 2 (t 2 ), as follows: where, In this paper, it is proposed an index obtained by a modification of the Ledwina measure and with the evaluation of its performance considering five copula functions.This paper is organized as follows: in Section 2, it is introduced some special copula functions; in Section 3, it is proposed a method for discrimination of different copula functions developed with the Ledwina dependence measure; in Section 4, it is presented the results obtained from a simulation study and in Section 5, it is presented some concluding remarks.

Some Special Copula Functions
In this section, it is introduced some copula functions that are explored in the present study.In all cases, and θ is the dependence parameter.

Clayton Copula
The Clayton (1978) copula function models asymmetrical data structures with high dependence in the left tail indicating an expanding cloud.The Clayton copula is know as the Pareto bivariate copula since it is possible to obtain it from the survival function of the the bivariate Pareto distribution (Hutchinson and Lad 1990).Additionally this copula function is considered as a special case of the Lomax copula function.The Clayton copula function has the following analytical structure: When θ → ∞ the dependence is perfect and positive and if θ → 0 the variables are independent.

Frank Copula
The Frank copula function (Frank 1979) is appopriate to model weak dependence structures with positive linear trend.This copula function has the following analytical structure: The maximum dependence value is reached when θ → ∞ and the minimum when θ → −∞.When θ → 0 it is possible to assume independence between the variables.

Gumbel-Hougaard copula
The Gumbel-Hougaard copula function introduced by for details see Gumbel (1960a), Gumbel (1961) and (Hougaard 1986) is useful to model data structures with strong dependence in upper tail and weak dependence in lower tail where it is expected that the upper data show a strong correlation and the lower data shows weakly correlation.The analytical form of the Gumbel-Hougaard copula function is defined as: with θ ≥ 1.The perfect dependence is obtained when θ → 0 and if θ = 1 the is possible to assume independence between the variables.

Farlie-Gumbel-Morgenstern Copula (FGM Copula)
The first reference on the FGM copula functions family is the Eyraud in 1938 (For details see Nelsen 2006).This copula can be considered when the data set to be analyzed shows a weak and non linear dependence structure (Meintanis 2007).In this case, the scater plots obtained from the data are very similar with the plots obtained from data sets of independent variables.The FGM copula (Nelsen 2006) is defined by, for −1 ≤ θ ≤ 1.Therefore, it is possible to assume independence between the variables when θ = 0.This copula function models weak linear dependence structures

Gumbel-Barnett Copula (GB Copula)
The GB copula introduced by Gumbel (1960a) and Barnett (1980), has an analytical structure defined by, Revista Colombiana de Estadstica 42 (2019) 61-80 for 0 ≤ θ ≤ 1. Independence corresponds to θ = 0. Two random variables whose dependence structure can be fitted with a bivariate Gumbel distribution do not present linear correlations.This copula function can be obtained from tjoint bivariate Gumbel distribution with standard exponential marginal distributions.

Algorithms to Simulate Data with Dependence Structure Type Copula
The algorithms used to simulate data from the copula function structures presented in section 2 were obtained from different sources.The FGM data were simulated using an algorithm published in the Johnson & Kotz (1972) book, algorithms to simulate data with Clayton, Frank and Gumbel-Hougaard were obtained from studies published by Hofert (2008).To simulate data from the Gumbel-Barnett copula function, an algorithm was developed by the authors based on the results presented by Gumbel (1960b).

Data with Copula Clayton Dependence
The data with dependence structure type Clayton copula were simulated using the algorithm as follows: • Set a value for the parameter of dependence θ • The u 1 and u 2 values have dependence with type Clayton copula function

Data with Dependence Type Copula Frank
The data with dependence structure type Frank copula were simulated using the algorithm as follows: • Set a value for the parameter of dependence θ • Generate u 1 ∼ U (0, 1) and w ∼ U (0, 1) ) the u 1 and w values • The u 1 and u 2 values have dependence with type Frank copula function

Data with Dependence Type Copula Gumbel-Hougaard
The data with dependence structure type Gumbel-Hougaard copula were simulated using the algorithm as follows: • Set a value for the parameter of dependence θ • Generate an observation x from a positive stable distribution )) θ , 1 {θ=1} , 1 ) • Generate v 1 ∼ U (0, 1) and v 2 ∼ U (0, 1) where the pair (v 1 , v 2 ) have a dependence structure Gumbel-Hougaard • Repeat m times the previous steps to obtain a vector of pairs of data with Gumbel-Hougaard dependence structure

Data with Dependence Type Copula Fgm
The data with dependence structure type FGM copula were simulated using the algorithm as follows: • Set a value for the parameter of dependence θ Revista Colombiana de Estadstica 42 (2019) 61-80 A Method to Select Bivariate Copula Functions 69 • Repeat m times the previous steps to obtain a vector of pairs of data with FGM dependence structure

Data with Dependence Type Copula Gumbel-Barnett
In this case, the algorithm used was: • Set a value for the parameter of dependence θ • Generate u 2 ∼ U (0, 1) and w ∼ U (0, 1) • Obtain a value of x as solution of the non-linear equation 1−(1+θx)e −(1+θy)x − w = 0

Relationship Between the Kendall's Tau and the Copula Parameter of Dependence
In general, for some copula function families it is possible to have θn = g(τ ) being g a differentiable function.The relationship between the concordance measure Kendall's tau τ and the copula function parameter can be very important to estimate the dependence parameter using the moments method.According to Nelsen (2006) it is possible to obtain an expression for the kendall's tau from the copula function as follows; Solving (10) for each copula function considered, it is obtained in Table 1 the results of interest. Where:

Estimation of the Copula Dependence Parameter
Given that the copula function is a expression of a multivariate probability distribution , under a statistical point of view, it is possible to associate a likelihood function given the data and to use standard maximum likelihood or Bayesian methods to get estimates for the dependence parameter.

Maximum Likelihood Estimation
To obtain the maximum likelihood estimates (MLE), it is needed to have density function, In this way, the MLE are obtained maximizing the log likelihood function For each considered copula function introduced in section 2, the log likelihood function was derived as follows: • Clayton copula function • Gumbel-Hougaard copula function Revista Colombiana de Estadstica 42 (2019) 61-80 • Frank copula function • FGM copula function • Gumbel-Barnett copula function

Bayesian Estimation
To estimate the dependence parameters using Bayesian methods, it was assumed non informative prior distributions considering the range of values in the parametric space.In general, the posterior distribution for the dependence parameter has the form: where π(θ) is the prior distribution and L(θ | u 1 , u 2 ) is the likelihood function.For all cases, a U nif orm(a, b) distribution was assumed as a non informative distribution for the dependence parameter.To obtain values for the hyperparameters (a,b), it was assumed the existence of an expert in the subject of study, whose knowledge can be expressed through Kendall's tau.When the expert opinion for the dependence level was weak it was assumed τ ∈ (0, 0.33) if in according with the expert, the expected dependence was moderate then τ ∈ (0.33, 0.66) and for a strong dependence it was assumed τ ∈ (0.66, 1).Using the relationship between the Kendall's tau and dependence parameter presented in (10), it was obtained the intervals showed in Table 2. To estimate the dependence parameter of the FGM and Gumbel Barnett copula functions it was assumed a Uniform(0,1) prior distribution.For the FGM copula function, it was considered a positive range of values in the parameter space a

The Goodness-Of-Fit (GOF) Method
This method to evaluate the goodness of fit of data to copula functions, was developed by Deheuvels (1981) and Genest & Rémillard (2004).The idea is to compare the empirical copula C n defined as: with the parametric copula function C ( u, v) using a statistical hypothesis test.In this way, the authors define the Cramer-Von Mises Statistic as: and the asymptotic distribution of the test statistics S n was derived from the process C n (u 1 , u 2 ) depending on the unknown distribution C( û1 , û2 ).

An Approach to Decide by an Appropriate Copula Function Using the Ledwina Dependence Measure
To discriminate the best copula function among k different functions that could be candidates to be fitted by the sample data (n pairs of observations), it is proposed from a modification of (2) the following discrimination index given by, where; Given that to compute q it is necessary to estimate C(u 1 , u 2 ), the dependence parameter θ must be estimated using some statistical method reported in the literature (for instance, maximum likelihood).To obtain q * it is needed to compute F n (t 1 , t 2 ), F n (t 1 ) and F n (t 2 ) as in equation (3).When the k indexes are computed, the model with minimum value for (15) is choosed as the best model to be fitted by the data.

A simulation Study
Using the algorithms in section 3, we conducted a simulation study to evaluate the performance of our proposed index.We carried out our procedure 1000 times considering three different sample sizes (n = 50, 100, 500, 1000) to simulate vectors of pairs of observations (t 1i , t 2i i = 1, 2, . . ., n) for each of the five copula functions introduced in section 2. With each data set, it was estimated the dependence parameter using maximum likelihood and Bayesian methods.From these estimates, it is estimated the cumulative probability C(u 1 , u 2 ) associated to each pair of observations.This procedure was used considering the simulated data from each assumed copula function.With the values of C(u 1i , u 2i ) it is computed q(u 1i , u 2i ) applying equation (3).Each data set was used to compute the empirical copula and to obtain q * (u 1i , u 2i ).Finally, it was computed the proposed index given by equation ( 13) considering each copula function.
The obtained results were compared with those obtained using the Goodness Of Fit (GOF) method to select copula functions, where the null hypothesis is the model fitted by the data set (See details in Kojadinovic, Yan & Holmes 2011).The GOF procedure could be problematic when there is rejection of the null hypothesis which indicates a particular copula function but there is no indication of the best copula to be fitted by the data set.Observe that this hypothesis test must be applied for each proposed copula function.For each simulated data set it was fitted all copula functions introduced in section 2 and the procedure was carried out 1000 times.Following, it was counted the number of times that each procedure identified the true copula model.

Selection of Values for the Copula Dependence
Parameter (Weiss 2011) studied different dependence levels measured with Kendall's tau.In accordance with those authors, we decided to use three dependence levels within the range of values that the dependence parameter can take for each copula function: Weak dependence (τ = 0.2), moderate dependence (τ = 0.5) and strong dependence (τ = 0.8).For each established tau value, it was obtained the associated value of the copula parameter in each of five considered models.For the dependence parameters of the Gumbel-Barnett and FGM copula functions it was used the same values assumed by Tovar & Achcar (2012), Tovar & Achcar (2013).See Table 3.

Results of the Simulation Study
Table 5 shows the results obtained when using the proposed method to select the best copula function among the five assumed copulas.The percentage of times that the method selected the copula, changes in sample size, the level of dependence and the method that was used to estimate the dependence parameter are reported.Table 3 also presents the results obtained using the GOF method for the classifications under the same study scenarios.When the structure of the data dependence was modeled using in a Clayton copula function, the proposed method had no problems in selecting the correct copula a good percentage of the time (regardless of the level of dependence or sample size) although with small amounts of data (n = 50) a minimum amount of misclassified samples were observed.When comparing the classification rates with those obtained using the GOF method, it can be said that, contrary to what was observed for the proposed method, the GOF procedure requires large amounts of data (500 or more) to obtain high percentages of correct classifications for all levels of dependence.
If the data have a Frank copula dependence structure, our procedure identifies the copula less than 50% of the time when the dependence is weak and the sample size is below 100.With larger sample sizes, the percentage of correct classification increases, coming close to the unit when the sample size is 1000.For moderate and strong dependences, the percentages of a good classification improve noticeably; and when the sample size is 1000, no classification errors were observed.For this copula function, the inferential method used to estimate the dependence parameter of the copula might have a negligible effect on the percentage of correct responses.For this copula, the GOF method can identify the real copula function only when there are large amounts of data in the sample, regardless of the strength of the dependence between the variables.For the data with a weak Gumbel-Hougaard type of dependence, the proposed method obtained a good percentage of classification when the samples were over 500 and the parameter was been estimated using the maximum likelihood method.For those cases where we fitted the copula function using a Bayesian estimate, the results were quite poor.
For a moderate GH dependence, the rating capacity of the proposed method is good with sample sizes over 100, regardless of the procedure used in the estimation of the dependence parameter.If the GH dependence is strong but there are only 100 or fewer data, it is necessary to get maximum likelihood estimates in order to obtain good ranking results.For dependence structures with this copula function, the GOF method had difficulties in identifying the correct copula.Its capacity for identification is good only when there are amounts of data higher than 500 and the GH dependence is moderate or strong.For weak dependence structures using a Gumbel-Barnett copula function,the proposed method of classification has better performance with maximum likelihood estimators and sample sizes of at least 100 observations although in this case it is possible to get rates up to 20% of poor classifications.For moderate-level dependence with a sample size of at least 500 observations, it is possible to obtain a high probability of identifying the correct copula function; and if the dependence structures are strong, the rating capacity is quite good when the parameter is estimated using maximum likelihood, regardless 77 method for all data sets.The method easily identifies data from copula functions with an analytical structure such as the Clayton copula regardless some additional topics as the sample size or the dependence level.For structures type Frank copula function, the performance of the method depends of the sample size and it works better for moderate and strong dependence.If the dependence structure is like Gumbel-Hougaard copula function, to obtain a correct classification it is necessary to have sample sizes greater than 100.With dependence structures like Gumbel-Barnett copula function, the observed percentages of correct classification are better when the maximum likelihood estimation method is used to obtain the proposed index.If a Uniform(0,1) prior distribution is assumed for the dependence parameter and the dependence level is strong, the method does not classify correctly, and the performance of the measure could be improved if informative prior distributions are considered but this topic is beyond the scope of this paper and will be topic of a new study.In general, when the dependence structure shares similarities with other copula functions, the correct classification depends on different aspects such as sample sizes and the strength of dependence.The proposed procedure fails to identify data with dependence that can be modeled using the FGM copula function.It can be deduced that very weak linear dependence structures (close to independence) would be difficult to identify using the proposed method, given that, the procedure measures the difference between the observed probabilities assuming a proposed copula and the independence copula weighted by a factor that in the FGM copula case is a part of its structure.In this way, it is possible to deduce that for this copula functions family, it is necessary to conduct a detailed study on the performance of discrimination measures as it was proposed in this study.In our study, we used the GOF method to compare the classification capacity of the proposed method since we considered that method is the most general among those we found in the literature.It is important to point out that other methods have been proposed in the literature but for some specific copula function families (For instance see, Topçu 2016).

Figure 1 :
Figure 1: Plots of data under different Clayton dependence structures.

Figure 2 :
Figure 2: Plots of data under different Frank dependence structures.

Figure 3 :
Figure 3: Plots of data under different Gumbel-Hougaard dependence structures.

Figure 4 :
Figure 4: Plots of data under different FGM dependence structures.

Figure 5 :
Figure 5: Plots of data under different GB dependence structures.
) and u 2 = 1 − exp(−y) one pair of values with Gumbel-Barnett dependence structure • Repeat m times the previous steps to obtain a vector of pairs of data with Gumbel-Barnett dependence structure

Table 1 :
dt Kendall's tau and copula function parameters

Table 2 :
Prior intervals for dependence parameter and non informative prior distributions.

Table 3 :
Values assumed for the copula dependence parameters in the simulation study.