Seismic risk in the east of the Bayan Har block based on the POT model

Abstract This article elaborates on a peaks over threshold (POT) model with earthquake samples based on the Pareto distribution. We analyze the earthquakes using the POT model in the eastern Bayan Har block. The result shows that the area experiences an earthquake above 7.5 every 30 years and a large earthquake above 7.8 every 100 years. The average magnitude of earthquakes above 5.2 that occur in the area every 5 years is 7.1 or greater. The upper limit of the theoretical magnitude predicted by the model is 8.7. Therefore, the risk of geological hazards is high in this area.


Introduction
The extreme value theory is an effective method for early warning and risk assessment of extreme events with low frequency but great harm. The generalized extreme value distribution (GEV) (Jenkinson 1955) was first introduced into earthquake early warning by Nordquist (1945). Other scholars (Epstein and Lomnitz 1966;Chen and Lin 1973;Tuncel et al. 1974) systematically studied the maximum magnitude prediction methodology. Coles et al. (1999) showed the dependence measures for extreme value analyses. Because only the maximum value of the group is used for the analysis, the data information is lost. This affects the effectiveness of the prediction. People pay more attention to large earthquakes that will cause heavy losses but do not pay much attention to small earthquakes with little destructive power. Therefore, the Pareto distribution is reasonable for early earthquake warning analysis.
The scholars (Balkema and Haan 1974;Pickands 1975) proved that the limit distribution of excess of a given threshold is the generalized Pareto distribution (GPD). It is also called the POT model (peaks over threshold) and is widely used in risk assessment (Castillo 1988;Coles and Walshaw 1994;Coles and Tawn 1996;Embrechts and Mikosch 2003;Hussain 2021;Wen et al. 2021). The researchers improved the model in several aspects (Hosking and Wallis 1987;Davison and Smith 1990;Beirlant and Teugels 1996;Ju arez & Schucany 2004). Scholars applied GEV and GPD to seismic risk analysis (Castillo and Hadi 1997;Sornette 2003;Pisarenko et al. 2014;Dutfoy 2019). Pisarenko et al. (2008) proposed a new approach to characterize the tail distribution of earthquake magnitudes.  studied the seismic risk index based on the Pareto distribution. Dutfoy (2021) built a seismic prediction model based on the generalized Pareto distribution with different observation periods. Barra and Vega-Jorquera (2021) discussed the properties of the q-Pareto distribution and its application in earthquake prediction.
Due to complex geological conditions and frequent earthquakes, positive earthquake warnings and risk assessments are necessary to ensure people's lives and minimize economic losses. The east of the Bayan Har block is an area with dense fault zones and intense activity. Three earthquakes occurred after 2000 (Wenchuan Ms 8.0 in 2008; Ya'an Ms 7.0 in 2013; ABA Ms 7.0 in 2017) in this area. The intense earthquake attracted the attention of some scholars (Li et al. 2018). Chen et al. (2013) studied the division and deformation characteristics of the east of the Bayan Har block. Wen et al. (2011) discussed the movement and deformation of the north and east boundary fault systems. He proposed that the 2008 Wenchuan earthquake was the latest of the large earthquake sequences. Li et al. (2016) used the fault source model of the seismogenic structure and the time-dependent seismicity model to predict the seismic risk of the Bayan Har block. He identified earthquake-prone areas in the next 50 years.
Due to the effectiveness of the POT model in medium-and long-term earthquake prediction, the earthquake data in the east of Bayan Har block are studied based on the POT model in this article.

Extreme value distribution
The theory of extreme value distribution is discussed in detail (Coles 2001;Shi 2006). Extreme value analysis is based on the Fisher-Tippett theorem (Fisher and Tippett 1928) as follows.
Let X 1 , X 2 , :::, X n be a series of independent and identically distributed random variables with a distribution function FðxÞ: Let M n ¼ maxfX 1 , X 2 , :::, X n g, if there exist constants fa n >0g, fb n g, and non-degenerate function KðxÞ, then lim n!1 P M n Àb n a n x ¼ KðxÞ, x 2 R: Then KðxÞ must be the distribution function of either Gumbel, Fr echet or Weibull. These three distributions can be merged into the GEV distribution. The GEV has a distribution function denoted by: where n is the shape parameter, l is the location parameter and r is the scale parameter. The Fr echet case is obtained when n>0, the Weibull when n<0, while the Gumbel case is defined when n ¼ 0: Because we are interested in the tail distribution of the magnitude, the distribution of the excess XÀu is important. Scholars (Balkema and Haan 1974;Pickands 1975;Leadbetter 1984) proved that if the distribution function F belongs to the attraction domain of a GEV distribution, the limiting distribution of the excesses is the generalized Pareto distribution (GDP). That is, if X is a random variable that holds Equation (1), then the distribution function of the excess is: where, It is the distribution function of the two-parameter generalized Pareto distribution. The shape parameter n is the same as that of the GEV distribution. On the basis of the above theory, we build a POT model in the next section.

The POT model based on Pareto distribution
In this section, estimations of the GPD parameters and seismic risk indexes are given.

Determination of the threshold
The first set of conditions to construct a POT model is the determination of a reasonable threshold. For a reasonable threshold u, the mean excess function is and eðuÞ is the linear function of u: Let N u ¼ P n i¼1 I jX i >uj be the number of samples over the threshold. The mean excess function of the samples is If the mean excess function of the samples e n ðuÞ swings near a straight line, we can select u as the threshold.

Estimation of the GPD parameters
When n 6 ¼ 0, the log-likelihood function is where x i 2 Dðn,rÞ ¼ ½0, þ 1 n>0 ½0, Àr=n n<0: The maximum likelihood estimation of the parameters ðn,rÞ can be calculated using the numerical method. Estimates of parameters ðn,rÞ are ðn,rÞ: The confidence interval for each parameter can be calculated by the Fisher information matrix: where Z 1Àa is the quantile of the standard normal distribution and M cov is the covariance matrix of n andr:

Estimation of the distribution function and the quantile of the POT model
We can get the estimation of the distribution function of the excess as follows. From Equation (3), F u ðxÞ ¼ FðxþuÞÀFðuÞ

1ÀFðuÞ
; which leads to From Equation (4), F u ðxÞ ¼ ð1 þ n x r Þ À 1 n : The frequency N u =n above the threshold u is the estimation of FðuÞ, where N u is the number of samples over the threshold. Then, FðuÞ ¼ N u n : We can get the estimation of Fðu þ xÞ as follows: Then, we can get the estimation of FðxÞ : wherer 0 ¼r N u n n , h ¼ u þr n N u n n À 1 : From Equation (9), we get the estimation of the p-quantile x p : where Fðx p Þ ¼ p, 0<p<1: The confidence level of x p is calculated using the Delta method (Pisarenko et al. 2008): x If n<0 and p ! 1, the estimation of the upper limit point x Ã of the support F is:

Seismic risk indexes and their estimation
Based on the above conclusions, the seismic risk indexes are defined as follows.

Return level and return period
Under the assumption of seismic elastic rebound theory (Wood 1911) , large earthquakes occur periodically. Combined with the research results of Qian (2013), the return period and the return level were defined as follows. The meaning of the return level with the return period T is that the average number of earthquakes over the threshold is 1. For a given magnitude u, Eð P i T I jX i !uj Þ ¼ 1: That is, u ¼ F À1 ð1À1=TÞ and u is the quantile of F: Here, we can denote u by x p : Fðx p Þ is the theoretical return period with the return level x p and Fðx p Þ ¼ p ð0<p<1Þ: If the sample is daily-observed data, there are 365 days per year, and the return period T (unit: year) with the return level of x p satisfied: where Fðx p Þ ¼ p ð0<p<1Þ: From Equation (8), the return period estimate can be obtained: If the given return period is T years, bring p ¼ 1À1=ð365TÞ into Equation (11), and we can estimate the return level: 3.4.2. Expected return magnitude Let x ¼ xðTÞ be the return level, to describe the magnitude exceeding xðTÞ,  proposed the expected return magnitude. xðTÞ ¼ EðXjX>xðTÞÞ ¼ xðTÞ þ EðXÀxðTÞjX>xðTÞÞ: From Equation (5), EðXÀxðTÞjX>xðTÞÞ ¼r 1Àn þ n 1Àn ½xðTÞÀu: The estimate of the expected return magnitude is as follows: x ðTÞ ¼xðTÞ þr

Earthquake case data
The data in this article are from the National Seismic Science Data Sharing Center (https://data.earthquake.cn/). Nineteen thousand two hundred twenty-one records before December 2019 are the research samples in the scope of the article (Pisarenko & Sornette 2003). The magnitude of the surface wave M s is used to describe the magnitude of the earthquake. If the magnitude of the surface wave is missing, we converted the magnitude of the local earthquake M l to M s according to the equation   Earthquakes above M s 5.0 in the east of the Bayan Har block have been recorded since 1920 (Huang et al. 1994). All earthquake cases have been recorded since the establishment of the China Seismic Network in 1970. It shows that the earthquakes above M s 4.0 and M s 3.0 in the area were completed in 1950 and 1965, respectively, as shown in Figure 2. Because the large earthquakes above the threshold are studied in this article, 19,065 earthquake data after 1960 are selected as samples.

Threshold of the POT model
The threshold u is crucial. If u is too large, there will be few samples, and the estimation variance is significant. If it is too small, the assessment of the Pareto distribution is partial.
If the excess obeys the Pareto distribution, the value of the shape parameter n does not change as the threshold increases. For the qualified threshold u, the mean excess function eðuÞ is a linear function of u, and the mean excess function of the samples swings near a straight line with a fixed slope.
The relationships between shape parameter and threshold, as well as mean residual life and threshold, are shown, respectively, in Figures 3 and 4.
When the threshold u 2 ½5, 6:4, the estimation of the shape parameter n remains unchanged, as shown in Figure 3. Empirical residual life fluctuates slightly along a straight line with a fixed slope in Figure 4. It is reasonable to select u ¼ 5.2 as the threshold. The number of samples at the point is 96, representing about 0.5% of the total data, which meets the need for tail modelling.

Parameter estimation and the POT model
According to the POT model in Section 3, we calculated the parameters using the maximum likelihood estimation method based on Matlab. Package 'POT' in R (Ribatet and Dutang 2022) is convenient for applying the POT methodology. The estimations of the shape parameter n and the scale parameter r of the Pareto distribution are shown in Table 1.
The density function curve of the Pareto distribution shown in Figure 5 is consistent with the outline of the histogram. The theoretical quantile is consistent with the empirical quantile from the Q-Q chart shown in Figure 6.
The return level on the logarithmic coordinate axis is convex, as shown in Figure  7. The scatter of order statistics based on samples over the threshold swings near the return level, which is in the middle of the confidence interval of the return level. So, it is reasonable to describe the statistical law of the excess with the POT model. For comparison, GEV and exponential distribution are used to analyze the same data.
We built the GEV model with the annual maximum as the samples. The final results of the parameters obtained by the GEV method can be summarized as follows.n ¼ À0:1542,l ¼ 5:0594,r ¼ 0:8805: The upper limit point x Ã of the support F isx Ã ¼lÀr=n for the GEV model. And the theoretical maximum magnitude is about M s 10.8 from the above equation. This result is not in line with reality. Because the annual maxima are the research data in the GEV model, the information may be lost. If two or more magnitudes more significant than the threshold are observed in a given T interval, the GEV method keeps only the largest one, while the GDP method will use all of them.
Seismicity can be deduced from the research results in the article (Chen and Lin 1973) that the magnitude approximately obeys the exponential distribution with the distribution function FðxÞ ¼ 1Àe À1:056x : From Equation (12), the approximate CDF of the POT model is:     A comparison between the two distributions is shown in Figure 8, in which the statistical law of the sample tail in the POT model is more consistent with the sample information. The upper limit of the confidence interval ofn in Table 1 is less than 0, indicating that the return level of the magnitude predicted by the model has a theoretical upper limit and the theoretical maximum magnitude is about M s 8.7 according to Equation (13). The above analysis shows that the POT model based on Pareto distribution is suitable for seismic risk analysis of the east of Bayan Har block and the prediction conclusion is reasonable.

Seismic risk assessment in the east of Bayan Har block
We assess the risk of an earthquake by the return period and the return level. From Section 4.3, for the given return level x p , the estimation of the return periodT can be obtained by Equation (14). For a given return period, the corresponding return level can be calculated according to Equation (15). We summarize the estimates of the return level for the given return period in Table 2.
The 100-year return level is M s 7.8 and the confidence interval is (7.4, 8.3), indicating that earthquakes with a magnitude of about M s 8.0 will occur in the east of Bayan Har every 100 years. This result is consistent with the fact that there was an earthquake of Ms 8.0 in 2008. There are four samples that exceed the lower limit of In the prediction results, the return level of the 30-year return period is M s 7.5. In the last 100 years, earthquakes exceeding M s 7.5 occurred 3 times in 1947, 1973 and 2008, respectively. The time interval between the three of them is about 30 years. The above conclusions show that the earthquake predicted by the POT model is consistent with the sample information.
The expected return magnitudes of the eastern Bayan Har block within 10 years are higher than the upper limit of the corresponding confidence interval, which indicates that the seismic frequency exceeding the predicted result in this area is high. The average earthquake magnitude exceeding the threshold within 3 years is M s 7.0 and the average extent within 5 years is not less than M s 7.1. These two results, combined with the theoretical upper limit of M s 8.7 predicted above, further show that geological activities in the east of the Bayan Har block are not only frequent but also violent.

Conclusions
1. The construction of the POT model for seismic risk analysis is described in detail. Compared to the GEV and exponential distributions, the POT model based on the generalized Pareto distributions can objectively describe the seismic risk in the east of the Bayan Har block. Based on parameter estimation, the strong earthquake magnitude distribution of the Bayan Har block is: The model can effectively predict an occurrence that exceeds a certain magnitude threshold. 2. The tail distribution of the data can be explained by the POT model.
The predicted results of the strong earthquake distribution, the seismic risk in a given period, the return period, and the return level in the east of the Bayan Har block can provide a valuable reference for formulating emergency management plans. It gives an idea to determine the upper limit of the magnitude of the earthquake.
If the earthquake data are focused on one or more interrelated fault zones, the research will be more targeted, and the results will be more meaningful.