Quadrilateral Interval Type-2 Fuzzy Regression Analysis for Data Outlier Detection

This paper presents a fuzzy regression analysis method based on general quadrilateral interval type-2 fuzzy numbers for data outlier detection. The Euclidean distance for general quadrilateral interval type-2 fuzzy numbers is provided. In the sense of this Euclidean distance, parameter estimation laws for the type-2 fuzzy linear regression model are designed. Then, an outlier detection-oriented parameter estimation method is proposed using the data deletion-based type-2 fuzzy regression model. Moreover, based on the fuzzy regression model and the root mean squared error method, an impact evaluation rule is designed for detecting data outliers. Finally, an example is provided to validate the presented methods.


Related Works and Motivations.
Regression analysis is widely used in engineering science, social science, economics, and finance [1][2][3][4], since it is a significant and comprehensive method for analyzing the dependence between dependent variables and one or more independent variables. Generally, in a regression model, deviations (errors) between the estimated and the observed values are attributed to random variations and/or measurement errors. To this end, some statistical analysis techniques are usually developed for model determination. Nevertheless, in practice, the deviations are sometimes caused by the indefiniteness of the system structure or by incomplete measurement data. In such a case, the uncertainties are ordinarily regarded as fuzziness rather than randomness. Then, to deal with the fuzziness, regression analysis based on fuzzy data or fuzzy numbers, called fuzzy regression analysis, was proposed and developed in [5][6][7].
Unlike conventional regression techniques, which are based on statistical methods, the fuzzy regression methodology is nonstatistical and uses possibility theory and fuzzy set theory [8,9] for modeling and analysis. Thus, it is more appropriate for managing uncertainties. Zadeh [9] has introduced two types of fuzzy sets to describe the uncertainties [10][11][12]. The first type of fuzzy set, called the type-1 fuzzy set, has several typical extended forms, such as the interval fuzzy set, triangular fuzzy set, trapezoidal fuzzy set, pentagonal fuzzy set, and intuitionistic fuzzy set [13]. Because of some difficulties in dealing with uncertainties, such as the inaccurate description of language uncertainties and the parameter perturbation in the uncertainties, the type-2 fuzzy set was presented to tackle this problem. Its degree of ambiguity is characterized by two subordinate membership functions. This is actually an extension of the type-1 fuzzy set and leads to wider applicability for complex systems. In return, however, using type-2 fuzzy sets makes the calculation of the fuzzy regression complicated. For simplicity of calculation, Mendel et al. [14] specialized the type-2 fuzzy sets to interval type-2 fuzzy sets, which can also describe the uncertainties better than the type-1 fuzzy sets. In much research on and many applications of the fuzzy regression model, fuzzy numbers based on triangular fuzzy sets are most frequently applied due to their simplicity. However, such fuzzy numbers manifest some limitations in observation, particularly when the complete observed output of the predicted model is required [15].
In many practical studies, the complete description of regression problems greatly depends on the property of input-output data [16]. Different types of input-output data lead to dissimilar analysis results of the fuzzy regression.
There are typically four cases of fuzzy regression analysis in the literature. These are the cases based on crisp-input crisp-output (CICO) [17], crisp-input fuzzy-output (CIFO) [15,18], fuzzy-input crisp-output (FICO) [19], and fuzzy-input fuzzy-output (FIFO) [19][20][21] observations. Most commonly, the case of CIFO data is studied in practice. In this paper, the fuzzy regression of interest is analyzed based on CIFO data. Specifically, the case where the predictor variable is crisp but the parameters (coefficients) are fuzzy is considered. Therefore, the observed responses in this case are naturally fuzzy.
In summary, there are three main approaches to fuzzy regression analysis: the least squares fitting criterion, the minimum fuzziness criterion, and the interval regression analysis approach [22]. For example, in [23], mathematical programming methods were used for estimating the parameters of a fuzzy regression model in the trapezoidal case and the triangular case. The work in [24] developed a fuzzy regression model and used the least squares method to estimate the coefficients in the sense of distance. The authors of [15] presented a modified fuzzy linear model, based on which all the observed data can be enveloped by the identified model output. A tolerance approach was introduced in [25] for the construction of fuzzy regression coefficients based on a possibilistic linear regression model with fuzzy data. In this paper, the least squares method will be used to calculate the estimation error of the fuzzy regression values.

Contributions of This Work.
Actually, the kernel interval in a triangular fuzzy number is limited to a single point equal to the midpoint of the support interval. As an extension of the triangular fuzzy numbers, the trapezoidal fuzzy numbers can fill those gaps. That is why we develop the fuzzy regression model using the trapezoidal interval type-2 fuzzy numbers and even consider a more general case, the quadrilateral interval type-2 fuzzy numbers, in this paper. Meanwhile, the model structure is assumed to be linear, which is common in the literature. Consequently, the corresponding fuzzy regression problem becomes a parameter estimation problem of the regression model.
In another research field, experimenters often fail to collect the data of all individuals due to measurement methods, preservation methods, and human factors. It results in incomplete observation values of some data indicators in the samples, which is common in clinical trials, socioeconomic statistics, environmental ecology, and other research areas. In fact, under normal circumstances, samples with missing values cannot fully reflect the real characteristics of the systems of interest and the internal relationship between variables. Thus, improper treatment may even lead to large deviations in the results. Therefore, how to deal with the missing data and extract information from data correctly and effectively is an important issue in statistical inference.
However, it is often unavoidable that a certain proportion of outliers or strong influence points is mixed into the actual data due to the interference of many factors, such as negligence errors and rounding errors. Once outliers are mixed in, these fuzzy regression methods become impractical and may even lead to wrong conclusions. It results in incomplete observation values of some data indicators in the samples, which is common in clinical trials, socioeconomic statistics, environmental ecology, and other research areas [26][27][28]. Hence, the influence analysis of outliers on models is an important part of statistical diagnosis. To this end, this paper investigates the fuzzy regression analysis based on the type-2 trapezoidal fuzzy numbers with regard to data outlier detection. The contributions of the paper can be summarized as follows:
(1) Firstly, the definition of interval trapezoidal type-2 fuzzy numbers is provided. The parameter estimation laws of the fuzzy linear regression model based on trapezoidal fuzzy numbers are designed in the sense of Euclidean distance.
(2) Then, some parameter estimation laws in terms of the data outlier detection are synthesized for the trapezoidal fuzzy regression model.
The rest of the paper is organized as follows. Section 2 describes the fuzzy linear regression model and the quadrilateral type-2 fuzzy numbers. Sections 3 and 4 present parameter estimation laws of the fuzzy regression model. Section 5 provides the impact evaluation rule. Section 6 gives a simulation example. Section 7 summarizes the paper.

Model Description and Preliminaries
Let $(x_i, y_i)$, $i = 1, 2, \ldots, n$, denote a set of pairs of observation data, where $x_i \in \mathbb{R}$ is the predictor variable (input) and $y_i \in \mathbb{R}$ is the observed response variable (output). For each observation $(x_i, y_i)$, the functional form of the linear regression model is formulated as
$$ y_i = \alpha + \beta x_i, \qquad i = 1, 2, \ldots, n, \tag{1} $$
where $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}$ are the unknown parameters to be estimated. Some definitions used in developing the theoretical results are presented in the Appendix.
The uncertainty of a type-2 fuzzy set $F$ can be described by a bounded region, namely the projection of $F$ onto the $(x, u)$ plane, which is called the footprint of uncertainty and is expressed by $\mu_F(x, u)$. The upper-bound membership function $\mu^U_F(x, u)$ and the lower-bound membership function $\mu^L_F(x, u)$ of an interval type-2 fuzzy number each correspond to a type-1 fuzzy set.
An illustrative description and comparison of the type-2 membership functions under different cases is shown in Figure 1. The trapezoidal membership function (2), as shown in Figure 1(b), is a special case of the quadrilateral membership function [29], where $h_s$ and $h_t$ are the membership values of the second and third elements $a_2$ and $a_3$, respectively. The most commonly used triangular type-2 membership function, illustrated in Figure 1(c), is actually a special case of the trapezoidal one considered in this paper. Figure 1(d) shows a crisp value for comparison with the mentioned fuzzy numbers.
The upper- and lower-bound membership functions of the trapezoidal type-2 fuzzy numbers are expressed in the following form: … $a^U_4$, and $0 \le h^L \le h^U \le 1$ denote the heights of the two trapezoids. Considering $h_s \ne h_t$, we give the following definition of general quadrilateral interval type-2 fuzzy numbers.
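As an illustration of such membership functions, the following minimal Python sketch evaluates one piecewise-linear bound of a quadrilateral fuzzy number and the resulting membership interval of an interval type-2 number. The names `Quad`, `membership`, and `fou_interval` are hypothetical, and the piecewise-linear shape (rising to $h_s$ at $a_2$, moving linearly to $h_t$ at $a_3$, falling to zero at $a_4$) is an assumption based on the verbal description above, not a reproduction of the paper's formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quad:
    """One bound (upper or lower) of a quadrilateral fuzzy number.

    Assumes a1 <= a2 <= a3 <= a4; hs and ht are the membership grades at
    a2 and a3. hs == ht gives a trapezoid; a2 == a3 as well gives a triangle.
    """
    a1: float
    a2: float
    a3: float
    a4: float
    hs: float = 1.0
    ht: float = 1.0

    def membership(self, x: float) -> float:
        if x < self.a1 or x > self.a4:
            return 0.0
        if x < self.a2:  # rising edge from 0 up to hs
            return self.hs * (x - self.a1) / (self.a2 - self.a1)
        if x <= self.a3:  # top edge from hs to ht
            if self.a3 == self.a2:
                return max(self.hs, self.ht)
            t = (x - self.a2) / (self.a3 - self.a2)
            return self.hs + t * (self.ht - self.hs)
        # falling edge from ht down to 0
        return self.ht * (self.a4 - x) / (self.a4 - self.a3)


def fou_interval(upper: Quad, lower: Quad, x: float):
    """Membership interval [lower, upper] of an interval type-2 number at x."""
    return lower.membership(x), upper.membership(x)
```

For example, `fou_interval(Quad(1, 2, 4, 5, 1.0, 0.8), Quad(1.5, 2.5, 3.5, 4.5, 0.6, 0.5), 3.0)` returns the lower and upper membership grades at $x = 3$, which together bound the footprint of uncertainty at that point.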
Based on the discussion above, we analyze the regression model (3) for the general case in which quadrilateral type-2 fuzzy numbers are used. The CIFO-based interval type-2 fuzzy regression model is formulated as follows:
$$ y_i = \alpha + \beta x_i, \qquad i = 1, 2, \ldots, n, \tag{3} $$
where $\alpha$ and $\beta$ denote two fuzzy numbers, which are quadrilateral interval type-2 fuzzy numbers in this paper, and $y_i$ is the observed response.
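To make the combination of a crisp input with fuzzy coefficients concrete, the sketch below computes the four vertices of one bound of $\alpha + \beta x$ using standard fuzzy arithmetic: multiplying by a negative scalar reverses the vertex order of $\beta$. The function names are hypothetical, the membership grades are ignored (normal case assumed), and `shift_inputs` illustrates the common trick of subtracting the smallest input so that all inputs become nonnegative.

```python
def predict_vertices(alpha, beta, x):
    """Vertices of alpha + beta * x for one bound (upper or lower).

    alpha and beta are 4-tuples of ascending vertices. For x >= 0 the
    vertices combine index by index; for x < 0 the order of beta is
    reversed so that the result stays ascending.
    """
    if x >= 0:
        return tuple(a + b * x for a, b in zip(alpha, beta))
    return tuple(a + b * x for a, b in zip(alpha, reversed(beta)))


def shift_inputs(xs):
    """Subtract the minimum input so that every shifted input is >= 0."""
    x_min = min(xs)
    return [x - x_min for x in xs]
```

With `alpha = (1, 2, 3, 4)` and `beta = (0.5, 1, 1.5, 2)`, a negative input pairs $\alpha_1$ with $\beta_4$, $\alpha_2$ with $\beta_3$, and so on, which keeps the resulting vertices in ascending order.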
Before proceeding further, we give the following definition of the Euclidean distance of two quadrilateral interval type-2 fuzzy numbers.

Remark 1.
The coefficients $a$ and $b$ used in (A.6) can be adjusted as needed. If $a = b = 1/2$, then the Euclidean distance …
In the following section, the fuzzy regression is analyzed in the sense of the defined Euclidean distance for the estimation of $\alpha$ and $\beta$ in (3). Furthermore, the parameter estimation with respect to data outliers will also be discussed subsequently.
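The role of the weights $a$ and $b$ can be illustrated with a short Python sketch. Since (A.6) is not reproduced in this section, the form used here is only an assumption: the squared distance is taken as $a$ times the sum of squared differences of the four upper-bound vertices plus $b$ times the same sum for the lower-bound vertices, and the membership grades are ignored. The function name `distance` is hypothetical.

```python
import math


def distance(A, B, a=0.5, b=0.5):
    """Assumed weighted Euclidean distance between two interval type-2 numbers.

    A and B are pairs (upper_vertices, lower_vertices), each a 4-tuple.
    The weights a and b balance the upper and lower bounds.
    """
    (A_up, A_lo), (B_up, B_lo) = A, B
    upper = sum((p - q) ** 2 for p, q in zip(A_up, B_up))
    lower = sum((p - q) ** 2 for p, q in zip(A_lo, B_lo))
    return math.sqrt(a * upper + b * lower)
```

Under this assumed form, choosing $a = b = 1/2$ weights the upper and lower bounds equally, while $a > b$ would emphasize mismatch in the upper bound.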

Parameter Estimation of the Type-2 Fuzzy Regression Model
For the considered CIFO observation data, we write the interval type-2 fuzzy parameters $\alpha$ and $\beta$ in (3) in terms of their upper and lower bounds, with vertices $\alpha^\kappa_\tau$ and $\beta^\kappa_\tau$ ($\tau = 1, 2, 3, 4$, $\kappa \in \{U, L\}$) and membership grades $h^\kappa_{s r}$ and $h^\kappa_{t r}$ ($r \in \{\alpha, \beta\}$). According to (3), the representation of the fuzzy observed response $y_i$ depends on the sign of $x_i$: one quadrilateral type-2 fuzzy number results if $x_i > 0$, and a different one results if $x_i < 0$.
Remark 2.
Actually, one can find the minimum $x_{\min}$ of all $x_i$, whether it is positive or negative, and then subtract it from each $x_i$, so that shifted inputs that are all nonnegative are obtained. Therefore, we will investigate the regression analysis by considering $x_i \ge 0$.
Based on the fuzzy regression model (3), we aim to design the estimates of the bounds of $\alpha$ and $\beta$ by minimizing the distance between the resulting fuzzy number and the observed response $y_i$. As discussed in Remark 2, we can consider the case of $x_i \ge 0$ for the parameter estimation. We provide the estimation laws for the general case of the membership grades $h^\kappa_{s r}$ and $h^\kappa_{t r}$ ($r \in \{\alpha, \beta\}$, $\kappa \in \{U, L\}$) in the following theorem.
Proof. According to Definition 4, for $n$ observation pairs $(x_i, y_i)$, the sum $s(\alpha, \beta)$ of the squared Euclidean distances between $y_i$ and $\alpha + \beta x_i$ subject to the fuzzy numbers $\alpha$ and $\beta$ is
$$ s(\alpha, \beta) = \sum_{i=1}^{n} d^2(y_i, \alpha + \beta x_i). $$
Then, we take the partial derivatives of $s(\alpha, \beta)$ with respect to $\alpha^\kappa_\tau$ and $\beta^\kappa_\tau$ ($\tau = 1, 2, 3, 4$ and $\kappa \in \{U, L\}$), respectively, and set them to zero, which yields a set of algebraic equations for $\kappa \in \{U, L\}$. Let $\hat{\alpha}$ and $\hat{\beta}$ be the estimates of $\alpha$ and $\beta$, respectively. Then, solving the algebraic equation sets obtained above, we get the estimates of the bounds of $\alpha$ and $\beta$ exactly as expressed in Theorem 1. This completes the proof.
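The minimization in the proof can be sketched numerically under simplifying assumptions that are not taken from the paper: the membership grades are fixed at 1 (normal case), all inputs are nonnegative, and the squared distance is a weighted sum of squared vertex differences. In that setting the objective separates by vertex, so each of the eight vertex sequences is fitted by ordinary least squares on its own. The names `ols` and `fit_bound` are hypothetical.

```python
def ols(xs, ys):
    """Ordinary least-squares intercept and slope for crisp data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_bound(xs, responses):
    """Estimate the vertices of alpha and beta for one bound.

    responses[i] is the 4-tuple of vertices of one bound of y_i, and all
    xs are assumed nonnegative, so vertex t of y_i is modeled as
    alpha[t] + beta[t] * x_i.
    """
    fits = [ols(xs, [r[t] for r in responses]) for t in range(4)]
    alpha = tuple(intercept for intercept, _ in fits)
    beta = tuple(slope for _, slope in fits)
    return alpha, beta
```

Calling `fit_bound` once with the upper-bound vertices and once with the lower-bound vertices gives all sixteen vertex estimates. Note that this unconstrained sketch does not enforce the ordering of the estimated vertices, which a complete treatment would have to check.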

Parameter Estimation against the Data Deletion Fuzzy Regression Model
To evaluate the impact of the $j$-th data point $(x_j, y_j)$ in the regression analysis based on the regression model (3), we can delete $(x_j, y_j)$ and detect whether it is an outlier or a strong influence point by comparing the changes in the statistical inference results. The regression model (1) with the $j$-th data point $(x_j, y_j)$ deleted is called the data deletion-based regression model, which is represented as
$$ y_i = \alpha_{[j]} + \beta_{[j]} x_i, \qquad i = 1, 2, \ldots, n, \; i \ne j, \tag{12} $$
where $\alpha_{[j]}$ and $\beta_{[j]}$ are the two quadrilateral interval type-2 fuzzy parameters. By the parameter estimation method in Theorem 1, the following results can be drawn for the parameter estimation with the $j$-th data point deleted.

Theorem 2.
Consider the trapezoidal fuzzy regression model (12) with a set of observation data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where each $y_i$ is a quadrilateral interval type-2 fuzzy number. If the $j$-th data point $(x_j, y_j)$ is deleted, then the following estimates of $\alpha_{[j]}$ and $\beta_{[j]}$ in the sense of the Euclidean distance are designed for $i, j = 1, 2, \ldots, n$.
Proof. Similar to the case of $s(\alpha, \beta)$ in Theorem 1, after the $j$-th data point $(x_j, y_j)$ is deleted, the sum $s(\alpha_{[j]}, \beta_{[j]})$ of the squared Euclidean distances between $y_i$ and $\alpha_{[j]} + \beta_{[j]} x_i$ results in the following, for $i, j = 1, 2, \ldots, n$:
$$ s(\alpha_{[j]}, \beta_{[j]}) = \sum_{i=1,\, i \ne j}^{n} d^2(y_i, \alpha_{[j]} + \beta_{[j]} x_i). $$
Then, following the steps in the proof of Theorem 1, the results in Theorem 2 can be obtained accordingly. The detailed proof is omitted to save space.
Mathematical Problems in Engineering
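The data deletion step can be sketched as a leave-one-out refit. The code below is a minimal illustration under assumptions not taken from the paper: normal membership grades, nonnegative inputs, and a distance that separates by vertex, so that each vertex is refitted by ordinary least squares on the remaining $n - 1$ points. The names `ols`, `fit_bound`, and `fit_without` are hypothetical.

```python
def ols(xs, ys):
    """Ordinary least-squares intercept and slope for crisp data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_bound(xs, responses):
    """Vertex-wise least-squares estimates (alpha, beta) for one bound."""
    fits = [ols(xs, [r[t] for r in responses]) for t in range(4)]
    return tuple(f[0] for f in fits), tuple(f[1] for f in fits)


def fit_without(xs, responses, j):
    """Estimates of alpha_[j] and beta_[j]: refit with the j-th point removed."""
    keep = [i for i in range(len(xs)) if i != j]
    return fit_bound([xs[i] for i in keep], [responses[i] for i in keep])
```

Comparing `fit_without(xs, responses, j)` with `fit_bound(xs, responses)` shows how strongly the $j$-th observation pulls the estimated vertices; indices here are zero-based, unlike the one-based numbering in the text.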

Impact Evaluation Rule for the Data Outlier Detection
Since y [j] and y introduced above are two type-2 fuzzy numbers, it is inconvenient to compare their difference. For this reason, a suitable statistical measure is usually suggested in order to compare the impact quantitatively.
In this paper, we introduce the standard deviation of the regression equation as the statistical measure to analyze the impact of the data deletion. The standard deviation of the regression equation (3) is defined in Definition 5 in the Appendix.
From Definition 5, we know that the standard deviation of the regression equation is actually the average deviation between the observed value and the estimated value. Apparently, the smaller the standard deviation is, the closer the estimated value is to the observed value, as well as the closer the observation points are clustered around the fuzzy regression model.
When calculating the standard error in (A.8), we first obtain the parameters $\alpha$ and $\beta$ by solving the extreme-value problem in the statistical analysis and then estimate the regression value. This uses two rounds of statistical calculations, and thus two degrees of freedom are consumed. Therefore, the denominator in (A.8) is $n - 2$ rather than $n$.
According to the data deletion-based type-2 fuzzy regression model (12), when the $j$-th data point is deleted, the corresponding standard deviation is denoted by $\sigma_j$. Specifically, the square of $\sigma_j$ for the type-2 fuzzy regression model (12) with fuzzy $\alpha$ and fuzzy $\beta$ is calculated in the sense of the defined Euclidean distance. For the data deletion-based fuzzy linear regression model in (12), let $\sigma_j^2$ be the metric of the impact on the regression model (12). Evidently, if $\sigma_j^2$ increases after deleting the $j$-th data point, then the impact is greater and this data point may be an outlier; otherwise, the $j$-th data point is normal.
The derived results reduce to the case of the trapezoidal type-2 fuzzy regression model when $h^\kappa_{s r} = h^\kappa_{t r}$. Besides, it becomes the normal case when $h^\kappa_{s r} = h^\kappa_{t r} = 1$. Moreover, one can set $\alpha^\kappa_2 = \alpha^\kappa_3$ and $\beta^\kappa_2 = \beta^\kappa_3$ ($\kappa \in \{U, L\}$) when dealing with triangular type-2 fuzzy numbers in practice.
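One way to read the impact evaluation rule is sketched below in Python. Because the expression for $\sigma_j^2$ is not reproduced in this section, the sketch rests on several labeled assumptions: normal membership grades, nonnegative inputs, a squared distance equal to a weighted sum of squared vertex differences with weights $a$ and $b$, and $\sigma_j^2$ evaluated over all $n$ observations using the estimates obtained without the $j$-th point, with denominator $n - 2$. Under this reading, $\sigma_j^2$ can only grow relative to the full-data fit, and it grows most when an influential point is deleted. All function names are hypothetical.

```python
def ols(xs, ys):
    """Ordinary least-squares intercept and slope for crisp data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit(xs, ys):
    """Fit both bounds; ys[i] = (upper_vertices, lower_vertices)."""
    params = []
    for bound in (0, 1):  # 0 = upper bound, 1 = lower bound
        params.append([ols(xs, [y[bound][t] for y in ys]) for t in range(4)])
    return params  # params[bound][t] = (intercept, slope)


def sq_dist(y, params, x, a, b):
    """Assumed squared distance between y and the fitted value at x >= 0."""
    total = 0.0
    for weight, bound in ((a, 0), (b, 1)):
        for t, (intercept, slope) in enumerate(params[bound]):
            total += weight * (y[bound][t] - intercept - slope * x) ** 2
    return total


def impact_scores(xs, ys, a=0.5, b=0.5):
    """sigma_j^2 for every j: refit without point j, score on all n points."""
    n = len(xs)
    scores = []
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        params = fit([xs[i] for i in keep], [ys[i] for i in keep])
        sse = sum(sq_dist(ys[i], params, xs[i], a, b) for i in range(n))
        scores.append(sse / (n - 2))
    return scores
```

The candidate outlier is then the index with the largest score, for example `max(range(len(xs)), key=scores.__getitem__)` with `scores = impact_scores(xs, ys)`. Whether a large score is large enough to declare an outlier remains a judgment call that this sketch does not settle.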

Simulation Example
In this part, we provide an example to validate the presented fuzzy regression model and the designed impact evaluation rule for data outlier detection. We borrow a set of data, namely the estimation errors reported in Table 9 of [30], and treat them as type-2 trapezoidal fuzzy numbers. Table 1 gives the considered interval type-2 trapezoidal fuzzy numbers as the observed data.
Based on this set of observed data, we detect whether any of them is an outlier or a strong influence point by using the designed impact evaluation rule. For simplicity, we use the normal trapezoidal type-2 fuzzy numbers with $h^\kappa_{s r} = h^\kappa_{t r} = 1$ ($r \in \{\alpha, \beta\}$, $\kappa \in \{U, L\}$) for the type-2 fuzzy parameters $\alpha$ and $\beta$. Considering $y_i = \alpha + \beta x_i$, according to Theorem 1 and by setting $a = b = 0.5$, we obtain the estimates of $\alpha$ and $\beta$ of the resulting type-2 fuzzy regression equation (17). Then, in the following steps, we can use this fuzzy regression equation to calculate the standard deviation of the regression value after deleting the $j$-th data point. Based on the fuzzy regression model (17), according to Theorem 2, we obtain the standard deviations shown in Table 2.
From the standard deviations in Table 2, one can find that, after deleting the 14th data point, the standard deviation of the estimate of the fuzzy regression equation is larger than that obtained after deleting any other point. This can be observed clearly in Figure 2. Thus, the 14th data point is likely an outlier, which means that the corresponding estimation error in Table 1 is the least accurate one.

Conclusion
This paper has dealt with the problems of fuzzy regression analysis and data outlier detection based on general quadrilateral interval type-2 fuzzy numbers. The Euclidean distance for the type-2 fuzzy numbers has been provided and used for the parameter estimation and the standard deviation. Some parameter estimation laws of the quadrilateral interval type-2 fuzzy linear regression model have been designed. Finally, the impact evaluation rule has been designed using the data deletion-based fuzzy regression model. Data outlier detection can be achieved by calculating the standard deviations of the fuzzy regression values.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.