Decision-making Risks of Operational Tests from a Statistical Point of View

The operational test, together with the evaluation based on it, is the most important basis for the Department of Defense's decisions on acquiring weapons and equipment. However, relatively little attention has been paid to the credibility of its conclusions. To evaluate the decision-making risks quantitatively, this paper studies the relationship between the two types of errors in operational tests. By analysing the interaction among the factors that affect decision risk, it shows that ensuring the tests have sufficient statistical power, something that has usually been overlooked in operational tests, is of great practical significance.


Introduction
The operational test is a process of observing, by simulating actual combat under the control of professional warfighters, how well a weapon or piece of equipment completes its combat tasks and how well it is adapted to the given combat task. Many factors affect the reliability of its conclusions, such as funding consumption, schedule, resource allocation, and the credibility of the test conclusions. Among these, the most fundamental is the reliability of the test results, because the other factors evolve from it directly or indirectly. Therefore, from a statistical perspective, this paper investigates the decision risks of extrapolating the relevant assessment indicators of weapons and equipment from test data.

Statistical hypothesis test
Hypothesis testing is the process of making a hypothesis about the parameter(s) of a population and then deducing from sample information whether the hypothesis is tenable; this process is called a statistical test of hypothesis [1].
The process of arriving at a conclusion by hypothesis testing is analogous to a proof by contradiction. The conclusion or theoretical result that the researcher wants to support is called the alternative hypothesis, denoted H1. The conclusion or theory opposed to the alternative hypothesis is called the null hypothesis, denoted H0. Next, a function completely determined by the sample data, the test statistic, is constructed, along with the probability distribution it follows if H0 is true. Finally, the sample data obtained from the test are substituted into the test statistic to calculate, under this distribution, the probability of observing the sample or cases more extreme than it, the p-value. If the p-value is very small, then with the sample as evidence H0 is considered very suspicious and H1 should be supported. On the other hand, if the p-value is large, then with the sample as evidence H0 cannot be denied, and therefore H1 cannot be supported.
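The procedure above can be sketched numerically for a one-sided z-test on a normal mean with known standard deviation; the sample values below are hypothetical, chosen only to illustrate the mechanics:

```python
# Hypothetical sketch of the hypothesis-testing procedure: a one-sided z-test
# on a normal mean with known sigma (the numbers are illustrative, not from
# the paper).
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, mu0, sigma, n):
    """p-value for H0: mu <= mu0 against H1: mu > mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))   # test statistic
    return 1 - NormalDist().cdf(z)                # P(Z >= z) under H0

p = z_test_p_value(sample_mean=10.8, mu0=10.0, sigma=1.0, n=9)
print(round(p, 4))  # small p-value -> reject H0 at the 0.05 level, support H1
```

Here the observed mean 10.8 lies 2.4 standard errors above 10, so the p-value is well below 0.05 and H0 would be rejected.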

Two types of errors, significance levels, statistical power and effect size
In a hypothesis test, rejecting H0 when H0 is true is called a type I error; failing to reject H0 when H1 is true is called a type II error. These are the two types of risk that researchers face when extrapolating population information from sample data.
The statistical tool for controlling the risk of a type I error is the significance level: the acceptable probability of making a type I error, denoted α. When the p-value calculated from the sample data is less than α, we say "H0 is rejected at the α level"; otherwise, we say "H0 cannot be rejected at the α level". The statistical tool for controlling the risk of a type II error is the statistical power: the probability of not making a type II error, denoted (1 − β), where β is the probability of making a type II error.
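Both error rates can be checked by simulation. The following sketch borrows the numbers of the figure-1 example (sample mean with standard deviation 1, mean 10 under H0 and 12 under H1, α = 0.05) and estimates the two error probabilities by Monte Carlo:

```python
# Monte Carlo sketch of the two error rates, using the figure-1 numbers:
# sample mean ~ N(10, 1) under H0 and N(12, 1) under H1, alpha = 0.05.
import random
from statistics import NormalDist

random.seed(0)
alpha = 0.05
c = 10 + NormalDist().inv_cdf(1 - alpha)   # rejection threshold under H0

trials = 100_000
type1 = sum(random.gauss(10, 1) > c for _ in range(trials)) / trials   # H0 true
type2 = sum(random.gauss(12, 1) <= c for _ in range(trials)) / trials  # H1 true
print(round(type1, 3), round(type2, 3))    # type1 ≈ alpha; power = 1 - type2
```

The simulated type I error rate stays near α by construction, while the type II error rate is whatever the test design leaves it at, which is exactly why power must be checked separately.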
Effect size (d) measures provide an index of how much impact a treatment actually has on the dependent variable [2]; the most common measures were proposed by Cohen [3]. Taking figure 1 as an example, suppose the population is normally distributed with known variance, H0 is that the population mean is at most 10 (μ ≤ 10), and H1 is that the population mean is at least 12 (μ ≥ 12). With a sample of size n, if H0 is true the sample mean follows a normal distribution with mean 10 and standard deviation 1; if H1 is true, it follows a normal distribution with mean 12 and standard deviation 1. Once the significance level α is determined, a threshold c for the sample mean can be determined: c is the (1 − α) quantile of the distribution under H0. By definition, H0 can be rejected only if the sample mean is greater than c. Therefore, the area of the red region to the right of c under the H0 distribution is the significance level α, and the area of the blue region to the right of c under the H1 distribution is the statistical power (1 − β). The relationship between the two types of errors is shown on the right side of figure 1: if H0 or H1 is true, the probability of making a type I or type II error is α or β, respectively. The d in figure 1 is Cohen's d, d = (μ1 − μ0)/σ [3], where σ is the common standard deviation of the two distributions.
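The threshold, power, and effect size of the figure-1 example can be computed directly; this is a sketch using the values stated in the text:

```python
# Numerical sketch of the figure-1 setup: sample mean ~ N(10, 1) under H0
# and N(12, 1) under H1, alpha = 0.05.
from statistics import NormalDist

alpha, mu0, mu1, sd = 0.05, 10.0, 12.0, 1.0

c = mu0 + NormalDist().inv_cdf(1 - alpha) * sd   # rejection threshold: the
                                                 # (1 - alpha) quantile under H0
power = 1 - NormalDist(mu1, sd).cdf(c)           # P(sample mean > c | H1 true)
d = (mu1 - mu0) / sd                             # Cohen's d
print(round(c, 4), round(power, 4), d)
```

With these numbers the effect size is d = 2 and the power comes out to roughly 0.64, so even a large effect leaves a substantial type II error risk at this sample size.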

Influence of sample size on statistical power
Since the effect size is objective, and the significance level α adopts the accepted standard (generally 0.05), adjusting the sample size mainly affects the statistical power (1 − β). Take the normal example above: with sample size n, the sample mean obeys N(10, 1) under H0 and N(12, 1) under H1. From c1 = 10 + z(1−α) × 1 = 10 + 1.6449 = 11.6449, where z(1−α) is the (1 − α) quantile of the standard normal distribution, the statistical power is 1 − β = Φ(12 − c1) = Φ(0.3551) ≈ 0.6388, where Φ is the standard normal distribution function. If the sample size is doubled to 2n, the sample mean obeys N(10, 1/2) and N(12, 1/2) respectively, that is, its standard deviation decreases to σ_x̄ = 1/√2. In the same way it can be derived that c2 = 10 + 1.6449/√2 = 11.1631, and the statistical power is then 1 − β = Φ((12 − c2)√2) = Φ(1.1834) ≈ 0.8817.
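The calculation above can be reproduced as a short sketch; doubling the sample size shrinks the standard deviation of the sample mean from 1 to 1/√2, which pulls the threshold down and raises the power:

```python
# Sketch of the sample-size calculation in the text: the same test at sample
# size n and at 2n (mu0 = 10, mu1 = 12, sample-mean sd 1 at size n).
import math
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.95)        # (1 - alpha) quantile, alpha = 0.05

def threshold_and_power(sd_mean, mu0=10.0, mu1=12.0):
    c = mu0 + z95 * sd_mean             # rejection threshold under H0
    return c, 1 - NormalDist(mu1, sd_mean).cdf(c)

c1, p1 = threshold_and_power(1.0)                # sample size n
c2, p2 = threshold_and_power(1 / math.sqrt(2))   # sample size 2n
print(round(c1, 4), round(p1, 4))  # threshold ≈ 11.6449, power ≈ 0.6388
print(round(c2, 4), round(p2, 4))  # threshold ≈ 11.1631, power ≈ 0.8817
```

The same function can be reused to find the sample size needed to reach any target power, by decreasing sd_mean (i.e. increasing n) until the returned power exceeds the target.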

Influence of lack of statistical power on test results
Statistical power has a decisive influence on the conclusion of a test, but it is often ignored in current research [4][5][6][7]. Because of funding constraints, the statistical power of many studies is very low, and some of their reported results are type I or type II errors, the so-called false positives or false negatives. This is so common in the academic community that even many results published in top journals cannot avoid it [4][5][6][7]. References [5][6][7] discuss the subjective and objective causes of this situation, while references [4][8] focus on how severely the ability of a test to detect effects is weakened when statistical power is lacking. Figure 3 illustrates this: it shows the positive predictive value (PPV) obtained by tests with different statistical power, as the results with a real effect account for different proportions of all results. Reference [8] points out that most studies expect to obtain striking conclusions, yet such conclusions account for only a small proportion of all conclusions; since the statistical power of the tests cannot be guaranteed to be sufficient, a very high proportion of the results claimed to have effects are actually false positives, that is, the PPV is very small.
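The PPV relationship behind figure 3 follows from Bayes' rule: of all tested hypotheses, a fraction are truly effective and are detected with probability (1 − β), while the rest are ineffective but still "detected" with probability α. A minimal sketch:

```python
# PPV sketch in the style of figure 3: `prior` is the proportion of tested
# hypotheses that are truly effective; alpha and power are as defined above.
def ppv(prior, power, alpha=0.05):
    true_pos = prior * power            # truly effective and detected
    false_pos = (1 - prior) * alpha     # ineffective but "detected" anyway
    return true_pos / (true_pos + false_pos)

print(round(ppv(prior=0.1, power=0.2), 2))   # low power  -> PPV ≈ 0.31
print(round(ppv(prior=0.1, power=0.8), 2))   # high power -> PPV ≈ 0.64
```

At the same prior proportion of real effects, raising the power from 0.2 to 0.8 roughly doubles the PPV, which is the practical argument for not skimping on sample size.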

Conclusion
From what has been discussed above, improving the correctness of operational test and evaluation conclusions is of great practical significance, because whether the DoD's weapons and equipment have the claimed operational effectiveness and operational suitability bears on the life safety of the warfighter and the outcome of wars. As figure 3 shows, when the proportion of truly effective systems is 0.1 and the power is (1 − β) = 0.2, the PPV is only about 0.31; that is, of every 10 systems under test (SUT) that pass, only about 3 actually have the claimed operational effectiveness and operational suitability. The consequences are obviously catastrophic. Therefore, the statistical power of OT, and hence its sample size, must be studied seriously.