Causal-Effect Analysis using Bayesian LiNGAM Comparing with Correlation Analysis in Function Point Metrics and Effort

Software effort estimation is a critical task for successful software development: it is necessary for appropriately managing task assignment and schedules, and consequently for producing high-quality software. Function Point (FP) metrics are commonly used for software effort estimation. To build a good effort estimation model, the explanatory variables corresponding to FP metrics must be independent to avoid a multicollinearity problem. For this reason, previous studies have analyzed correlations between FP metrics. However, previous results on these relationships are inconsistent. To obtain evidence about such inconsistent results and achieve more effective effort estimation, we propose a novel analysis that investigates causal-effect relationships between FP metrics and effort. We use an advanced linear non-Gaussian acyclic model called BayesLiNGAM for our causal-effect analysis, and compare the correlation relationships with the causal-effect relationships between FP metrics. In this paper, we report several new findings, including the most effective FP metric for effort estimation, obtained by our analysis using two datasets.


Introduction
Software effort estimation is an important task in software development: it predicts the development cost necessary to meet a scheduled software release deadline. In real industrial situations, however, many software projects fail to estimate effort accurately, and thus exceed their cost and scheduled deadline. For instance, the CHAOS report (The Standish Group, 1994) points out that on average 89% of companies exceed their estimated costs. In addition, Molokken and Jorgensen (2003) report that the development time delay reaches approx. 30% and up to 40% of the scheduled time.
To address such problems and achieve more accurate effort estimation, many effort estimation models have been studied so far (Wen et al., 2012). Effort estimation models are often regression models (e.g. linear regression models), and use metrics to estimate effort. Among such metrics, the most widely used ones are FP (Function Point) metrics.
(CC) are worse than using Within-Company Datasets (WC) in effort estimation. In contrast to Jeffery et al. (2000), Briand et al. (1999) and Mendes et al. (2005) report that CC is as good as WC. Kitchenham et al. (2007) present a systematic review that summarizes such reports. However, the review cannot determine which of WC or CC is better.
To remedy the inconsistencies among the results of different researchers, it is important to analyze the relationships among metrics for effort estimation. The reason is that an effort estimation model using metrics (e.g. a linear regression model) gives a misleading result due to the multicollinearity problem (Farrar and Glauber, 1967) if the explanatory variables corresponding to the metrics (e.g. FP metrics) are not independent. So far, many studies (Kitchenham and Känsälä, 1993; Jeffery and Stathis, 1996; Lokan, 1999; Uzzafer, 2016) have investigated the relationships between FP metrics using correlation analysis. However, they have also reported inconsistent results: the explanatory variables can be either dependent or independent (Jeffery and Stathis, 1996; Kitchenham and Känsälä, 1993).
In this paper, we propose a novel analysis that investigates causal-effect relationships between FP metrics and effort in addition to correlations between FP metrics. Causal-effect relationships can provide additional information on the relationships among metrics: for example, that a certain correlation is spurious, or that some metrics are uncorrelated with other metrics yet have causal-effect relationships with them. In our study, we assume that FP metrics and effort are modeled by a Linear Non-Gaussian Acyclic Model (LiNGAM) (Shimizu et al., 2006). In particular, we adopt an advanced LiNGAM called BayesLiNGAM (Hoyer and Hyttinen, 2009) to identify the causal-effect relationships between FP metrics and effort.
We address the following three research questions and obtain findings for each of them:

RQ1. Are correlation coefficients between FP metrics in our dataset similar to those in previous research?
The correlation coefficients in our datasets are similar to the majority of results in previous research. Previous studies (Kitchenham and Känsälä, 1993; Jeffery and Stathis, 1996; Lokan, 1999; Uzzafer, 2016) investigate relationships between FP metrics; however, they have reported inconsistent results. Thus, we investigate the correlations in our datasets.

RQ2. How many bootstrap samples should we use?
A sufficient number of bootstrap samples is 100. BayesLiNGAM occasionally extracts wrong causal-effect relationships. To overcome this deficiency, we adopt a general random resampling approach called bootstrap sampling (Efron, 1992). Thus, we investigate this RQ to select a sufficient number of samples for bootstrap sampling.

RQ3. What are causal-effect relationships between FP metrics and Effort?
The strengths of the causal-effect relationships are similar to those of the correlation relationships; however, the directions of the causal-effect relationships depend on the dataset.
The main contributions of our paper are as follows:
1. We present the first investigation of the causal-effect relationships between FP metrics and effort using two datasets.
2. We show that the causal-effect relationships can provide additional relationships between FP metrics and effort.
From our results, the correlation coefficients in our datasets are similar to the majority of results in previous research. In addition, the existence of the causal-effect relationships is similar to that of the correlation relationships; however, the directions of the causal-effect relationships depend on the dataset. Interface, one of the FP metrics, often has neither strong correlation coefficients nor causal-effect relationships with other FP metrics. Interestingly, however, Interface has causal-effect relationships to effort. This means Interface is an independent metric. Therefore, if we use Interface as an explanatory variable in an effort estimation model, it does not cause a multicollinearity problem. In contrast, the FP metrics other than Interface have both causal-effect relationships and correlation relationships with each other. Those metrics may lead to a multicollinearity problem.
The organization of this paper is as follows: Section 2 introduces related work and BayesLiNGAM. Section 3 explains the experimental setup and the datasets used. Section 4 presents the research questions and answers. Section 5 discusses questions arising from the experimental results. Section 6 describes threats to validity. Section 7 presents a conclusion and future work.

Background
2.1 Motivating Example
Analyzing a relationship between factors (e.g. FP metrics) using only a correlation coefficient involves a risk. We describe this risk using the following example. In software development, a project sometimes falls into a runaway status (Takagi et al., 2005). An expert developer with long experience is often assigned to rescue a runaway project. Then, when we analyze whether the effort of the projects that the expert developer belongs to is high or low, the high-effort projects that fall into a runaway status and the projects that the expert developer belongs to are strongly correlated. Such a correlation can lead to the misunderstanding that the project requires a high effort because of the expert developer, and thus we may choose a wrong solution (e.g. removing the expert developer from the project).
Therefore, it is risky to determine the reason for a high-effort project using correlation analysis only. If we investigate the causal-effect relationship between the expert developer and the high-effort projects, we can avoid reaching the wrong solution. This motivates using not only a correlation analysis but also a causal-effect analysis in our approach.

Related Work
2.2.1 Effort Estimation
Software effort (hereafter, effort) is a measure of the total working time for software development. So far, various studies (Molokken and Jorgensen, 2003; Wen et al., 2012; Jorgensen and Shepperd, 2007) have proposed effort estimation approaches. FP metrics (Albrecht and Gaffney, 1983), which are provided by the International Function Point Users Group (IFPUG) to measure the size of software, are commonly used to build effort estimation models. Albrecht first developed the FP methodology at IBM, and Albrecht and Gaffney (1983) originally proposed adopting FP metrics for effort estimation. Ahn et al. (2003) adopt FP metrics for effort estimation of software maintenance.
FP metrics measure five elementary function types to estimate the size of software: two data function types, internal logical files (File) and external interface files (Interface), and three transactional function types, external inputs (Input), external outputs (Output), and external inquiries (Enquiry). These function types are used as explanatory variables for an effort estimation model under the hypothesis that large software requires large effort (Abran et al., 2002).
In general, an estimation model (e.g. a regression model) assumes that the explanatory variables are independent (Farrar and Glauber, 1967). To check this assumption, many studies (Lokan, 1999; Jeffery and Stathis, 1996; Kitchenham and Känsälä, 1993; Uzzafer, 2016) have reported correlations between FP metrics. For instance, Kitchenham and Känsälä (1993) report that FP metrics are correlated with each other and are not well-formed. In addition, Lokan (1999) indicates that the results of existing research are inconsistent.
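As an illustration of why dependent explanatory variables matter, the following minimal sketch computes the variance inflation factor (VIF) for a pair of predictors. The VIF is not used in this paper; it is introduced here only as a common multicollinearity diagnostic. The synthetic data and all function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF of either predictor in a two-predictor regression: 1 / (1 - r^2)."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)


# Synthetic, skewed "metrics" (illustrative only, not taken from any dataset).
rng = random.Random(0)
metric_a = [rng.expovariate(1.0) for _ in range(500)]
independent = [rng.expovariate(1.0) for _ in range(500)]
dependent = [a + 0.1 * rng.expovariate(1.0) for a in metric_a]

vif_independent = vif_two_predictors(metric_a, independent)
vif_dependent = vif_two_predictors(metric_a, dependent)
print(f"VIF with an independent metric: {vif_independent:.2f}")
print(f"VIF with a dependent metric:    {vif_dependent:.2f}")
```

A VIF close to 1 indicates that the second predictor adds no collinearity, while a large VIF indicates that the regression coefficients become unstable.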
In this paper, we first perform a correlation analysis; that is, we calculate correlation coefficients between FP metrics in our datasets to compare them with previous research. We next identify causal-effect relationships between FP metrics and effort for a more detailed analysis.
Finally, Kitchenham and Känsälä (1993) and Jeffery and Stathis (1996) report Pearson correlation coefficients between FP metrics and Effort. For instance, Kitchenham and Känsälä analyze the coefficients and use stepwise multivariate regression to build the effort estimation model. Jeffery and Stathis report the coefficients between FP metrics and Effort, and those between Unadjusted Function Points (UFP) and Effort. There are some inconsistent results between Kitchenham and Känsälä and Jeffery and Stathis. Differently from their work, in this paper we use Kendall's $\tau_B$ (Sprent and Smeeton, 2016) to analyze correlation coefficients between FP metrics, and focus on causal-effect relationships between FP metrics and Effort.

Causal Discovery
A causal-effect relationship is an important relationship in engineering for estimating and solving an industrial problem. Solving an industrial problem requires deciding whether each metric is an explanatory variable or an objective variable when building an estimation model. The causal-effect relationship can support this decision.
In addition, if we identify causal-effect relationships correctly, we can control the values of arbitrary metrics using an intervention (Pearl, 2002). Under an intervention, when a variable in a probability model is changed by a disturbance effect, we can observe the effect on the whole probability model by considering the direct effect of that variable (Pearl, 2002). Consequently, we can consider a probability model whose variables can be intentionally changed by a disturbance effect. In contrast, a correlation is only a result of analyzing data and cannot account for a change by a disturbance effect.
To identify causal-effect relationships, we typically use counterfactual thinking or structural causal models (Holland et al., 1985; Robins, 1986; Hernán, 2004; Heinze-Deml et al., 2017; Pfister et al., 2017; Shimizu et al., 2006; Hoyer and Hyttinen, 2009). Counterfactual thinking uses a fact contrary to what happened. For instance, to identify a causal-effect relationship we consider two facts: a student did well on an exam because she was coached by her teacher, and she did not do well on the exam because she was not coached by her teacher. We then compare these two facts to identify whether the coaching is a cause of her exam result. However, it is difficult to compare the two facts (Holland et al., 1985), since only one of them can be observed for the same person. Structural causal models are defined on numerical models. For instance, Shimizu et al. (2006) use a Linear Non-Gaussian Acyclic Model for causal discovery.
In this paper, we use a type of structural causal model. The proposed approach uses a Directed Acyclic Graph (DAG) (Pearl, 2002) to describe causal-effect relationships between factors (metrics). Identifying a DAG is difficult; however, Shimizu et al. (2006) report that the DAG is identifiable when we assume non-Gaussian instead of Gaussian disturbance densities. Another example of causal discovery is the study by Green et al. (2017), who report causal-effect relationships between social transitions (e.g. getting a job) and both smoking and drinking. In addition, causal discovery is often applied in the medical field (e.g. finding the adverse effects of drugs) (Kleinberg and Hripcsak, 2011).

Linear Non-Gaussian Acyclic Models (LiNGAM)
Previously, it was considered that causal-effect relationships cannot be extracted from observed data alone when the data have no time information. However, recent studies (Shimizu et al., 2006) show that causal-effect relationships can be extracted from observed data alone under certain assumptions. One such assumption is that the data follow a Linear Non-Gaussian Acyclic Model (LiNGAM). LiNGAM is a data-generating model satisfying the following three properties:
1. A Directed Acyclic Graph (DAG) represents a one-to-one mapping between observed variables $x_i$, $i \in \{1, \dots, n\}$.
2. The value assigned to each variable $x_i$ is a linear function of the values already assigned to the variables, plus a disturbance (noise) term $e_i$, and plus a constant term $c_i$, that is
$$x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i + c_i \tag{1}$$
where $k(i)$ is a causal order and $b_{ij}$ are the coefficients.
3. The disturbances $e_i$ are continuous random variables with non-Gaussian distributions and are independent of each other (Shimizu et al., 2006).
LiNGAM calculates all possible causal orders. Thus, if we consider many variables, the number of causal orders increases explosively. We discuss this problem in more detail in Section 5.6.
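To make the data-generating process concrete, the following minimal sketch draws samples from a two-variable LiNGAM with causal order x1 then x2. It only simulates the model; it does not perform causal discovery. The coefficient value, the choice of Laplace disturbances, and all function names are assumptions made for illustration.

```python
import random


def simulate_lingam(n_samples, b21=0.8, c1=0.0, c2=0.0, seed=0):
    """Draw samples from a two-variable LiNGAM with causal order x1 -> x2.

    x1 = e1 + c1
    x2 = b21 * x1 + e2 + c2
    e1 and e2 are independent Laplace (hence non-Gaussian) disturbances.
    """
    rng = random.Random(seed)

    def laplace():
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        return rng.expovariate(1.0) - rng.expovariate(1.0)

    samples = []
    for _ in range(n_samples):
        e1, e2 = laplace(), laplace()
        x1 = e1 + c1
        x2 = b21 * x1 + e2 + c2
        samples.append((x1, x2))
    return samples


def ols_slope(samples):
    """Least-squares slope of x2 regressed on x1."""
    n = len(samples)
    m1 = sum(x1 for x1, _ in samples) / n
    m2 = sum(x2 for _, x2 in samples) / n
    cov = sum((x1 - m1) * (x2 - m2) for x1, x2 in samples)
    var = sum((x1 - m1) ** 2 for x1, _ in samples)
    return cov / var


samples = simulate_lingam(5000)
slope = ols_slope(samples)
print(f"estimated b21 = {slope:.3f}")
```

Regressing the effect on the cause recovers a slope close to the chosen coefficient; identifying the direction itself requires the non-Gaussianity assumption and a method such as LiNGAM.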

Bayesian Discovery of Linear Acyclic Causal Models
In our approach, we extract causal-effect relationships by using simple Bayesian inference on LiNGAM (BayesLiNGAM) (Hoyer and Hyttinen, 2009). BayesLiNGAM calculates the posterior probabilities of the possible DAGs from the given data only. The posterior probabilities are calculated as follows:
$$P(G_m \mid D) = \frac{P(D \mid G_m)\, P(G_m)}{P(D)} \tag{2}$$
where $G_m$ are the different possible DAGs, $N$ is the number of data samples, and $D = \{\mathbf{x}^1, \dots, \mathbf{x}^N\}$ is the observed dataset. Here $P(D)$ is a constant that simply normalizes the distribution. $P(G_m)$ is the prior probability distribution over DAGs and incorporates any domain knowledge that we have. When we do not have any knowledge, we assume a uniform prior probability distribution over all DAGs. The marginal likelihoods $P(D \mid G_m)$ are calculated as follows:
$$P(D \mid G_m) = \int p(D \mid \theta, G_m)\, p(\theta \mid G_m)\, d\theta \tag{3}$$
where $\theta$ consists of all the parameters (i.e. the coefficients $b_{ij}$, the constants $c_i$, and the disturbance densities $p_i(e_i)$). $p(\theta \mid G_m)$ is calculated under three assumptions: $b_{ij}$ follows a standard Gaussian distribution with zero mean and unit variance, $c_i$ is zero, and $p_i(e_i)$ is modeled by a parameterization of the densities. Two quite basic parameterizations of $p_i(e_i)$ are implemented: a simple two-parameter exponential family distribution combining the Gaussian and Laplace distributions, and a finite mixture of Gaussians density family. The integral is calculated by the Laplace approximation. We use this approach (Hoyer and Hyttinen, 2009) for our experiment.

International Journal of Mathematical, Engineering and Management Sciences Vol. 3, No. 2, 90-112, 2018 ISSN: 2455-7749
Here we need to compute an approximation to (3). By the definition of LiNGAM (Hoyer and Hyttinen, 2009)

Outputs of BayesLiNGAM
We describe the outputs of BayesLiNGAM to help interpret the analyzed data. Fig. 2 shows an example of an output of BayesLiNGAM. First, we input two observed variables, Metric A and Metric B, to BayesLiNGAM. Each variable has N samples. Then, BayesLiNGAM calculates the posterior probabilities of the causal-effect relationships for all possible combinations of metrics. The posterior probabilities tell us which causal-effect relationship is the most probable. In this example, two metrics have three possible relationships: Metric A is a cause of Metric B, Metric B is a cause of Metric A, and no causal-effect relationship.
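The final step, turning per-DAG marginal likelihoods into posterior probabilities via Bayes' rule, can be sketched as below for the three hypotheses of the two-metric example. This is not the BayesLiNGAM implementation: the marginal likelihoods are taken as given, the log-likelihood values are made up, and the function name is hypothetical.

```python
import math


def posterior_probabilities(log_marginal_likelihoods, priors=None):
    """Apply Bayes' rule over candidate DAGs, working in log space for stability.

    log_marginal_likelihoods maps a DAG label to log P(D | G_m).
    priors maps a DAG label to P(G_m); a uniform prior is assumed if omitted.
    """
    labels = list(log_marginal_likelihoods)
    if priors is None:
        priors = {label: 1.0 / len(labels) for label in labels}
    log_joint = {
        label: log_marginal_likelihoods[label] + math.log(priors[label])
        for label in labels
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_joint.values())
    unnormalized = {label: math.exp(v - peak) for label, v in log_joint.items()}
    total = sum(unnormalized.values())  # proportional to P(D)
    return {label: v / total for label, v in unnormalized.items()}


# Hypothetical log marginal likelihoods for the three two-variable hypotheses.
log_ml = {"A -> B": -1410.2, "B -> A": -1412.5, "none": -1420.0}
posterior = posterior_probabilities(log_ml)
best = max(posterior, key=posterior.get)
for label, p in posterior.items():
    print(f"{label}: {100 * p:.1f}%")
```

With a uniform prior, the ranking of the hypotheses is determined entirely by the marginal likelihoods.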

Experimental Setup
For our experiments, we use two datasets, called the China dataset and the Finnish dataset. Table 1 summarizes the number of samples, the number of all metrics, and the metrics adopted in our analysis for each dataset.

China Dataset
The China dataset is a dataset in the PROMISE data repository (Menzies et al., 2016) obtained from 499 software development projects. It has 19 metrics. Among them, we use five FP metrics (Interface, Output, Enquiry, Input, File) and a metric for effort, Effort.

Finnish Software Effort Dataset
The Finnish Software Effort Data Set (Sigweni et al., 2015) is a dataset obtained from many companies in Finland. It has 46 metrics. Among them, we use the five most commonly used FP metrics (IntFP, OutFP, InqFP, InpFP, EntFP) and a metric for effort, Worksup.
There are some differences between the China and Finnish datasets. For instance, the China dataset has many smaller projects with smaller efforts, whereas the Finnish dataset has many larger projects with larger efforts. Fig. 3 shows histograms of the values of Effort in both datasets. We can observe that the China dataset has more projects than the Finnish dataset at small effort values, and the Finnish dataset has more projects than the China dataset at large effort values. Note that the China dataset has approx. 100 more projects than the Finnish dataset.
In addition, the values of the FP metrics are similar in the China and Finnish datasets. Each FP metric is skewed and has many outliers. Fig. 4 shows boxplots of the FP metrics in the China and Finnish datasets. In each boxplot, the median is not located at the center of the box. Table 2 shows Pearson's moment coefficient of skewness (skewness) (You, 2016). The skewness is a measure of the asymmetry of a distribution, defined as follows:
$$\gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$$
where $\mu$ and $\sigma$ are the mean and the standard deviation of $X$. In summary, all values in Table 2 are positive, and therefore it is reasonable to conclude that the FP metrics are skewed in these datasets.
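Pearson's moment coefficient of skewness is the third standardized moment. The following minimal sketch computes it with population (biased) moments; whether Table 2 uses this or a bias-corrected estimator is not stated, so that choice is an assumption, as are the toy values.

```python
def moment_skewness(values):
    """Pearson's moment coefficient of skewness using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Toy values (not taken from either dataset).
right_skewed = [1, 1, 1, 2, 2, 3, 10]
symmetric = [1, 2, 3, 4, 5]

print(moment_skewness(right_skewed))  # positive: long right tail
print(moment_skewness(symmetric))     # zero for a symmetric sample
```

A positive value indicates a long right tail, which matches many small projects plus a few very large ones.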

Motivation
We first need to analyze and confirm the correlation coefficients between FP metrics for our datasets. As mentioned before, Lokan (1999) reports that the correlation coefficients between FP metrics are inconsistent across previous results. For instance, Kitchenham and Känsälä (1993) report that Output is significantly correlated with Input, Inquiries and Files. However, Jeffery and Stathis (1996) report that they have no significant correlation.
We use Kendall's $\tau_B$ (Sprent and Smeeton, 2016) to analyze the correlation coefficients between FP metrics for our datasets. Kendall's $\tau_B$ is the variant of Kendall's $\tau$ that takes ties into account. Kendall's $\tau$ measures a correlation for ordinal data, and is also used in the previous studies with which we compare our results.
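For reference, the tie-corrected rank correlation can be sketched directly from its textbook definition. This is a straightforward O(n^2) illustration with made-up data, not the implementation used in the paper.

```python
import math
from collections import Counter


def kendall_tau_b(xs, ys):
    """Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C and D count concordant and discordant pairs, n0 = n(n-1)/2,
    and n1, n2 count the pairs tied in xs and in ys, respectively.
    """
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    n0 = n * (n - 1) // 2
    n1 = sum(t * (t - 1) // 2 for t in Counter(xs).values())
    n2 = sum(t * (t - 1) // 2 for t in Counter(ys).values())
    return (concordant - discordant) / math.sqrt((n0 - n1) * (n0 - n2))


# An extreme outlier does not change the rank correlation.
print(kendall_tau_b([1, 2, 3, 4, 5], [1, 2, 3, 4, 100]))  # 1.0
# Ties in both variables are handled by the correction terms.
print(kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3]))  # 0.8
```

Because only the ordering of values enters the computation, outliers and skewed distributions do not distort the coefficient.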

Approach
Kendall's $\tau_B$ is a rank correlation, and can therefore be calculated even when the projects include outliers or skewed data. Since the China and Finnish datasets have many outliers and skewed FP metrics, Kendall's $\tau_B$ is effective for our evaluation.
In addition, we do not preprocess the data, since Kendall's $\tau_B$ is a non-parametric measure and we do not need to assume a distribution of the data.
The correlation coefficients for our datasets are compared with those in previous research. We collect the results of previous research from the literature review by Lokan (1999), who uses the correlation coefficients reported by Kitchenham and Känsälä (1993) and Jeffery and Stathis (1996). In addition, the correlation coefficients are compared by a statistical test. The null hypothesis of the statistical test is that there is no correlation between two FP metrics.

Results
The correlation coefficients between FP metrics obtained by our analysis are similar to the previous results by Kitchenham and Känsälä (1993) and Lokan (1999). For our datasets, we agree with the results by Kitchenham and Känsälä (1993) and Lokan (1999) on the correlation coefficients between FP metrics. On the other hand, we disagree with the result by Jeffery and Stathis (1996) on the correlation coefficients.

RQ2: How many bootstrap samples should we use?
4.2.1 Motivation
In our analysis, we adopt BayesLiNGAM, an approach for extracting causal-effect relationships; however, it occasionally extracts wrong causal-effect relationships. To overcome this deficiency, in our previous work (Kondo and Mizuno, 2016), we created 15 new datasets from one original dataset by randomly sampling 150 samples 15 times. We analyzed the 15 new datasets with BayesLiNGAM, and conducted majority voting to decide which causal-effect relationship is true. However, there was no evidence supporting the choice of 15 as the number of new datasets.
To obtain evidence for a sufficient number of new datasets, in this paper we adopt a general random re-sampling approach, bootstrap sampling (Efron, 1992), in the phase that creates new datasets. This approach provides a heuristic answer to how many new datasets are sufficient: we plot the distribution and confirm whether it is smooth.

Approach
Bootstrap sampling is a general procedure that estimates the sampling distribution of a model to verify the model's performance (Efron, 1992). The sampling distribution is generated by plotting the performance of the model on bootstrap samples. Bootstrap samples are generated by repeatedly drawing N samples at random with replacement from an original dataset that has N samples. In general, bootstrap sampling can be used to evaluate the performance of a model when the outputs of the model are underspecified. Fig. 5 shows the procedure of our experiments, which extracts causal-effect relationships using BayesLiNGAM. The procedure is as follows:
1. We create two sets (for the China and Finnish datasets), each of which consists of M datasets of N samples. M is the number of bootstrap samples, and N is the sample size of a dataset (i.e. 499 and 407, respectively).
2. The posterior probabilities of the three causal-effect relationships between pairs of metrics are calculated from the M datasets by BayesLiNGAM for the China and Finnish datasets, respectively.
3. We plot the three posterior probabilities of causal-effect relationships using the M datasets, and check the distributions.
Here, we define the smoothness of a distribution. A distribution of the causal-effect relationships is smooth if it satisfies either of the following two conditions under the following assumption.
Assumption: We only consider the distributions of the causal-effect relationships that are calculated using more than half of the bootstrap samples.

Conditions:
1. The absolute difference between the posterior probability (value on the x-axis) of the mode and that of the second mode is less than or equal to 5, or greater than or equal to 50.
2. The difference between the number of entities (value on the y-axis) at the mode and that at the second mode is greater than or equal to 10.
The assumption removes the distributions of causal-effect relationships that are not calculated on more than half of the bootstrap samples. We suppose that such causal-effect relationships might not be true.
The first condition selects the distributions whose mode and second mode have either similar or clearly different posterior probabilities. For instance, if the posterior probabilities of the mode and the second mode are very close (i.e., the difference is less than or equal to 5), it is reasonable that these values belong to the same distribution and lie in the same peak. On the other hand, if the probabilities are very far from each other (i.e., the difference is greater than or equal to 50), it is reasonable that these values belong to different distributions. Otherwise (if the first condition does not hold), the values possibly belong to a distribution having two peaks (e.g. a mixture model).
The second condition considers the values on the y-axis of a distribution. If the difference between the y-axis values of the mode and the second mode is small (i.e., the second condition does not hold), and the first condition does not hold, it is reasonable that they belong to a distribution having two peaks. Fig. 6(c) shows a distribution that is not smooth, because the posterior probabilities of the mode and the second mode are close and the difference between their y-axis values is small.
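The smoothness rule can be sketched as a small predicate over a histogram of posterior probabilities. The text does not fully specify how the mode and second mode are determined, so this sketch assumes they are the two most frequent histogram bins; the function name, the input format, and the handling of a single-bin histogram are likewise assumptions.

```python
def is_smooth(histogram, n_bootstrap):
    """Check the smoothness definition on one sampling distribution.

    histogram maps a posterior probability in percent (x-axis) to the number
    of bootstrap datasets with that value (y-axis).
    Returns None when the assumption is not met, i.e. the relationship was
    calculated on at most half of the bootstrap samples.
    """
    if sum(histogram.values()) <= n_bootstrap / 2:
        return None
    ranked = sorted(histogram.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return True  # a single peak; treated as smooth in this sketch
    (x_mode, y_mode), (x_second, y_second) = ranked[0], ranked[1]
    gap = abs(x_mode - x_second)
    condition_1 = gap <= 5 or gap >= 50      # same peak, or clearly separate
    condition_2 = (y_mode - y_second) >= 10  # the mode clearly dominates
    return condition_1 or condition_2


# Hypothetical histograms for M = 100 bootstrap samples.
print(is_smooth({90: 60, 95: 30, 10: 10}, 100))  # True
print(is_smooth({90: 30, 70: 28, 10: 5}, 100))   # False
print(is_smooth({90: 20, 10: 10}, 100))          # None
```

The second example fails both conditions: the two most frequent bins are 20 points apart and almost equally populated, which suggests a two-peaked distribution.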
Here, we need to decide the disturbance density $p_i(e_i)$ for BayesLiNGAM. This density is used to calculate the marginal likelihood for BayesLiNGAM and describes the distribution of a disturbance term. We adopt a finite mixture of Gaussians density (MoG), since it provides better performance than the Gaussian and Laplace distributions (Hoyer and Hyttinen, 2009). For the number of mixture components of the MoG, we choose five based on our experience (Kondo and Mizuno, 2016).
We compare two numbers of bootstrap samples, 15 and 100. The upper limit is 100 in our experiment, because Tantithamthavorn et al. (2017) state that 100 is a sufficient value for bootstrap sampling.
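The resampling step of the procedure (creating M datasets of N samples each, drawn with replacement) can be sketched as follows. The toy project records and the function name are hypothetical; only the dataset size of 499 and M = 100 come from the text.

```python
import random


def bootstrap_datasets(dataset, m, seed=0):
    """Create m bootstrap datasets, each the same size as the original.

    Every dataset is drawn uniformly at random with replacement, so a
    project may appear several times or not at all in a given dataset.
    """
    rng = random.Random(seed)
    n = len(dataset)
    return [rng.choices(dataset, k=n) for _ in range(m)]


# Toy stand-in for a table of projects: (Interface, Output, Effort) tuples.
projects = [(i % 7, 3 * i % 11, 10 * i + 5) for i in range(499)]
resamples = bootstrap_datasets(projects, m=100)
print(len(resamples), len(resamples[0]))
```

Each resampled dataset would then be passed to BayesLiNGAM separately, yielding one set of posterior probabilities per dataset.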

Results
Fifteen bootstrap samples are not enough, since the sampling distribution obtained with 15 bootstrap samples is not smooth. Figs. 6 and 7 show three sampling distributions of posterior probabilities where M is 15 for the China (between Output and Enquiry) and Finnish (between Interface and Enquiry) datasets. For instance, Fig. 6(c) for "Enquiry is causal to Output" does not show a smooth sampling distribution.

From our results, 100 is a sufficient number of bootstrap samples.
When bootstrap sampling uses 100 samples, the sampling distribution is smooth. Figs. 8 and 9 show three sampling distributions of posterior probabilities where M is 100 for the China and Finnish datasets. For instance, Fig. 8(c) for "Enquiry is causal to Output" shows a smooth sampling distribution.
Figs. 8(b) and 9(b) also do not show a clear distribution. However, the posterior probabilities are concentrated around 0 or 100, and the numbers of datasets (y-axis) at posterior probabilities of 0 and 100 are similar. Thus, it is reasonable to conclude that BayesLiNGAM cannot assign a single posterior probability to this causal-effect relationship, and instead shows two types of posterior probabilities. More details are discussed in Section 5.1.
In summary, 100 bootstrap samples are sufficient for bootstrap sampling. In addition, BayesLiNGAM cannot decide on a single posterior probability of the causal-effect relationship in some cases.

Motivation
Knowledge of the correct causal-effect relationships can contribute to building the more accurate estimation models needed for software development in industry. However, the causal-effect relationships between FP metrics and Effort for effort estimation have not yet been analyzed.

Approach
To extract causal-effect relationships, we adopt BayesLiNGAM with bootstrap sampling, where the number of bootstrap samples is set to 100 based on the answer to RQ2. Fig. 10 shows the flow of our experiments. The procedure is as follows:
1. We create two sets (for the China and Finnish datasets), each of which consists of 100 datasets of N samples. N is the size of a dataset (i.e. 499 and 407, respectively).
2. The 100 causal-effect relationships between pairs of metrics are calculated from the 100 datasets by BayesLiNGAM for the China and Finnish datasets, respectively.
3. The causal-effect relationships between pairs of metrics are determined by majority voting over the 100 causal-effect relationships. These causal-effect relationships are referred to as #1. The second-largest ones are referred to as #2.
4. #1 and #2 denote the possibilities of causal-effect relationships.
Table 4 shows, for the China dataset, the directions of the causal-effect relationships and the number of datasets that indicate those directions for #1 and #2 in the upper triangular matrix, and the correlation coefficients in the lower triangular matrix. The symbol "→" means that the row metric is causal to the column metric. The symbol "←" means that the column metric is causal to the row metric. "None" means that there is no causal-effect relationship between the row metric and the column metric. The number in brackets is the number of bootstrap samples. For instance, consider the cells for Interface and Output in Table 4. None for #1 indicates that there is no causal-effect relationship between Interface and Output. The number in brackets, 46, indicates that this result is obtained from 46 bootstrap samples. → for #2 indicates that Interface is causal to Output. This result is obtained from 41 bootstrap samples.
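The majority voting that yields #1 and #2 can be sketched as below. The sketch assumes that each bootstrap dataset votes for its highest-posterior relationship, which the text does not state explicitly. The counts 46 and 41 are the Interface and Output example from Table 4, while the remaining 13 votes and all function names are made up for illustration.

```python
from collections import Counter


def vote(posteriors):
    """Return the relationship with the highest posterior for one dataset.

    Returns None when no posterior could be calculated for that dataset.
    """
    if not posteriors:
        return None
    return max(posteriors, key=posteriors.get)


def majority_vote(votes):
    """Return the most frequent (#1) and second most frequent (#2) votes.

    Each result is a (relationship, count) pair, or None if unavailable.
    Datasets without a result (None votes) are ignored.
    """
    counts = Counter(v for v in votes if v is not None)
    ranked = counts.most_common(2)
    first = ranked[0] if len(ranked) > 0 else None
    second = ranked[1] if len(ranked) > 1 else None
    return first, second


votes = ["None"] * 46 + ["->"] * 41 + ["<-"] * 13
first, second = majority_vote(votes)
print("#1:", first)   # ('None', 46)
print("#2:", second)  # ('->', 41)
```

Reporting #2 alongside #1 makes it visible when the vote is close, as it is in this example.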

Results
In the China dataset, when FP metrics and Effort have small correlation coefficients, the possibilities of causal-effect relationships are low, and when FP metrics and Effort have strong correlation coefficients, the possibilities of causal-effect relationships are high. Thus, causal-effect relationships and correlation coefficients are related. For instance, Interface has small correlation coefficients with the other metrics except Effort, and it has low possibilities of a causal-effect relationship with the other metrics except Effort. In addition, Output has a smaller correlation coefficient with Enquiry than with the other metrics, and it also has a low possibility of a causal-effect relationship with Enquiry.

Discussion and Findings
In this section, we discuss questions arising from the results of our analysis and the findings obtained from them.

The sampling distributions for a few causal-effect relationships have two different distributions by bootstrap sampling using 100 samples.
The sampling distributions obtained by bootstrap sampling sometimes consist of two different distributions (i.e. they satisfy neither the first nor the second condition for smooth distributions in Section 4.2.2). For example, "Output is causal to Enquiry" and "Interface is causal to Enquiry", as shown in Figs. 8 and 9, have two different distributions. Bootstrap sampling typically generates a single sampling distribution, and therefore these results are unusual.
However, this circumstance does not affect the identification of a causal-effect relationship by BayesLiNGAM based on bootstrap sampling. The pairs of metrics involved in such cases have a clear difference between the possibilities of the causal-effect relationships. For instance, the sampling distribution of "Output is causal to Enquiry" in the China dataset has two different distributions. Nevertheless, the sampling distribution of no causal-effect relationship for this pair of metrics is smooth and has many datasets achieving high posterior probabilities, as in Figs. 8(a) and 8(b). In addition, the pair of metrics has a large difference between #1 and #2, as shown in Table 4.

A few causal-effect relationships have a small difference between #1 and #2.
Identifying a causal-effect relationship is difficult when the difference between #1 and #2 is small, since we cannot identify which causal-effect relationship is more likely from the bootstrap sampling. For instance, the difference between #1 and #2 for Interface and Output is small in both the China and Finnish datasets. BayesLiNGAM cannot always indicate the correct causal-effect relationships in such cases. Investigating a further decision method would be useful for cases where the difference between #1 and #2 is small and it is therefore difficult to identify a causal-effect relationship by BayesLiNGAM.

BayesLiNGAM sometimes cannot extract a posterior probability for a causal-effect relationship.
Although we conducted bootstrap sampling, BayesLiNGAM could not calculate a posterior probability for a few datasets (bootstrap samples). Table 6 shows an example: the number of datasets for the pair Interface and Input in the Finnish dataset. BayesLiNGAM successfully calculates causal-effect relationships for 93 datasets, but the calculation fails for 7 datasets. Nevertheless, we can identify a causal-effect relationship, since we obtain calculation results for almost all datasets. In particular, identifying a causal-effect relationship is more important than calculating the posterior probabilities of all bootstrap datasets.

Causal-effect relationships can explain inconsistent results between WC and CC.
Kitchenham et al. (2007) indicate that some studies show inconsistent results on whether or not WC and CC differ in estimating effort. Our results indicate that causal-effect relationships differ depending on the dataset. Differences in causal-effect relationships across WC and CC can lead to such inconsistent results, since different causal-effect relationships have different tendencies. Therefore, the proposed method can be used to analyze relationships across metrics of WC and CC, and to compare estimation results across WC and CC.
If WC has inconsistent causal-effect relationships, as in our results, and the metrics of CC are also inconsistent, this gives one reason why WC is sometimes better than CC and at other times only as good as CC. If WC has consistent causal-effect relationships and CC does not, this indicates that CC is sometimes as good as WC but has weaker points than WC does.

Interface and Output are the best independent explanatory variables for effort estimation and controlling effort, respectively.
RQ3 investigates the directions of causal-effect relationships between FP metrics, and those between FP metrics and Effort. According to the results, the causal-effect relationships between FP metrics are inconsistent, and therefore it is difficult to draw general findings. On the other hand, the causal-effect relationships between FP metrics and Effort are consistent: FP metrics are causal to Effort in both datasets. Therefore, it is reasonable that every metric can be useful for estimating effort as an independent explanatory variable, and we only need to consider the multicollinearity problem. From this viewpoint, Interface often has neither causal-effect relationships nor correlation relationships with other FP metrics. Therefore, it is one of the best independent explanatory variables for effort estimation.
In addition, this interpretation can be used to control effort through FP metrics, since FP metrics have causal-effect relationships with Effort. In particular, Output is a valuable metric under this interpretation, since the #1 value for Output is high in every dataset.

To how many metrics can BayesLiNGAM be applied?
In this paper, we use BayesLiNGAM only to investigate relationships between pairs of metrics drawn from the FP metrics and Effort. BayesLiNGAM can be applied to any number of metrics. However, there is a computational problem: the number of DAGs (and hence the number of combinations of causal-effect relationships considered), and thus the calculation time, increases explosively with the number of metrics. Indeed, the implementation of BayesLiNGAM used in our experiment issues a notification that there are too many inputs if we use more than five metrics. To overcome this problem, Hoyer and Hyttinen (2009) propose an alternative approach that uses greedy search. Hoyer and Hyttinen (2009) report that their approach can estimate causal-effect relationships among more than six metrics while reducing the calculation time. Investigating causal-effect relationships among more than two metrics would be interesting future work.
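The growth in the number of DAGs can be quantified with Robinson's recurrence for counting labeled DAGs. The sketch below is only an illustration of that combinatorial growth; the function name `count_dags` is our own and is not part of the BayesLiNGAM implementation.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes, computed with Robinson's recurrence."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 7):
    print(n, count_dags(n))
# 1 1
# 2 3
# 3 25
# 4 543
# 5 29281
# 6 3781503
```

For two metrics there are only 3 candidate DAGs (either direction, or no edge), whereas six metrics already give more than 3.7 million.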

How do we decide which correlation relationships or causal-effect relationships to believe?
In general, causal-effect relationships are more trustworthy than correlation relationships, because correlation relationships are sometimes spurious, as shown in Fig. 1. Therefore, if causal-effect analysis and correlation analysis give conflicting results, we should check whether the correlation relationships are spurious.
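One common way for a spurious correlation to arise is a hidden common cause. The following sketch simulates that situation under assumed, hypothetical settings (variables `z`, `x`, `y`, uniform noise, 5000 samples); it is not a reproduction of Fig. 1. `x` and `y` never influence each other, yet they are strongly correlated until the effect of `z` is removed.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, zs):
    """Residuals of a least-squares fit of ys on zs (with intercept)."""
    n = len(ys)
    mz, my = sum(zs) / n, sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]

rng = random.Random(0)
n = 5000
z = [rng.uniform(-1, 1) for _ in range(n)]        # hidden common cause
x = [zi + 0.3 * rng.uniform(-1, 1) for zi in z]   # x depends only on z
y = [zi + 0.3 * rng.uniform(-1, 1) for zi in z]   # y depends only on z

raw = pearson(x, y)                                    # strong correlation
partial = pearson(residuals(x, z), residuals(y, z))   # close to zero
print(round(raw, 2), round(partial, 2))
```

The partial correlation is only computable here because `z` is observed in the simulation; when the common cause is unobserved, this check is not available.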

Threats to Validity

Construct Validity
We use Kendall's τ to calculate correlation coefficients instead of Pearson correlation coefficients. Kendall's τ has also been adopted in previous studies and is more robust to skewed data and outliers; our datasets are skewed and have many outliers. Thus, it is valid to adopt Kendall's τ to calculate correlation coefficients.
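The robustness argument can be illustrated with a small sketch. It uses the simple tau-a variant of Kendall's τ for brevity, which is an assumption and may differ from the variant used in our analysis, and the values in `size` and `effort` are hypothetical toy data with one extreme outlier.

```python
def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Tied pairs count as neither concordant nor discordant."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical toy data: perfectly monotone, but the last point is an outlier.
size = [1, 2, 3, 4, 5, 6]
effort = [1, 2, 3, 4, 5, 1000]

tau = kendall_tau_a(size, effort)  # 1.0, because every pair is concordant
r = pearson(size, effort)          # pulled well below 1 by the outlier
print(tau, round(r, 3))
```

Because Kendall's τ depends only on the ordering of the values, the size of the outlier does not affect it, whereas the Pearson coefficient is dominated by that single point.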
To use BayesLiNGAM, we assume that the disturbance density is a finite mixture of Gaussian densities with five components. That is, we approximate the population distribution of the data by a mixture of five Gaussian densities.
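As a minimal sketch of what this assumption means, a five-component Gaussian mixture density has the form p(e) = w_1 N(e; m_1, s_1) + ... + w_5 N(e; m_5, s_5), where the weights w_k sum to one. The parameter values below are hypothetical and chosen only to produce a non-Gaussian shape; they are not the values estimated by BayesLiNGAM.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, means, stds):
    """Density of a finite Gaussian mixture: weighted sum of component densities."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Hypothetical five-component parameters (illustrative only).
weights = [0.4, 0.25, 0.15, 0.1, 0.1]
means = [0.0, 0.0, 0.0, 2.0, -2.0]
stds = [0.3, 1.0, 3.0, 0.5, 0.5]

# Riemann-sum check over [-30, 30] that the mixture integrates to about one.
step = 0.01
area = sum(mixture_pdf(-30 + i * step, weights, means, stds) * step
           for i in range(6001))
print(round(area, 4))
```

Mixing narrow and wide components in this way allows a sharply peaked, heavy-tailed density, which is one kind of non-Gaussian disturbance a linear non-Gaussian model can exploit.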
For the experimental analysis, we use two datasets, the China and Finnish datasets, which have been adopted in previous studies on effort estimation (Sigweni et al., 2016; Bettenburg et al., 2012). Thus, it is valid to use these datasets.

External Validity
Correlation coefficients between FP metrics have already been investigated in previous studies, and our results are similar to the majority of previous results. Therefore, we consider our results on correlation coefficients to be general.
We consider the results on causal-effect relationships to be general as well, since we use two different datasets and adopt bootstrap sampling, which supports obtaining a general result.

Reliability
We use the BayesLiNGAM implementation (available at https://www.cs.helsinki.fi/group/neuroinf/lingam/bayeslingam/) written by Hoyer and Hyttinen, who originally proposed BayesLiNGAM. Thus, the reliability of the BayesLiNGAM results is high.
In addition, we provide all data and scripts that are used for our study at https://se.is.kit.ac.jp/~mkondo/BayesLiNGAM.tar.bz2. Thus, anyone can easily conduct and confirm our analysis.

Conclusion
In this paper, we presented a causal-effect analysis between FP metrics and effort using BayesLiNGAM. Using the proposed analysis, we can investigate the directions of causal-effect relationships among the metrics. Therefore, our analysis can support building a good effort estimation model.
From the results of our analysis using two datasets, we confirmed that causal-effect relationships between FP metrics are similar to the correlation relationships between them, and that most causal-effect relationships have the same directions. However, a few causal-effect relationships have different directions in different datasets.
We also confirmed that when FP metrics and effort have a correlation, they also have causal-effect relationships. Thus, correlations between FP metrics and effort are not spurious correlations.
In addition, according to our results, Interface, one of the most used FP metrics, does not have strong correlation coefficients or causal-effect relationships with other FP metrics. This result indicates that Interface is the best FP metric for building an effort estimation model, since it does not cause a multicollinearity problem.
Our future work includes extracting new features from the original features (e.g. metrics) to solve the multicollinearity problem. We could create new features that overcome the multicollinearity problem by integrating correlated features. Although a stepwise regression approach (Mendes and Mosley, 2001) has already been proposed to remove correlated features, we plan to create new features that contribute to improving the performance of an objective task. In particular, we are interested in adopting a neural network approach.