Sound and Relatively Complete Belief Hoare Logic for Statistical Hypothesis Testing Programs

We propose a new approach to formally describing the requirement for statistical inference and checking whether a program uses the statistical method appropriately. Specifically, we define belief Hoare logic (BHL) for formalizing and reasoning about the statistical beliefs acquired via hypothesis testing. This program logic is sound and relatively complete with respect to a Kripke model for hypothesis tests. We demonstrate by examples that BHL is useful for reasoning about practical issues in hypothesis testing. In our framework, we clarify the importance of prior beliefs in acquiring statistical beliefs through hypothesis testing, and discuss the whole picture of the justification of statistical inference inside and outside the program logic.


Introduction
Statistical inferences have been increasingly used to derive and justify scientific knowledge in a variety of academic disciplines, from natural sciences to social sciences. This has significantly raised the importance of statistics, but also brought concerns about inappropriate procedures and incorrect interpretations of statistics in scientific research. Notably, previous studies have pointed out that many research articles in biomedical science contain severe errors in applying statistical methods and interpreting their outcomes [1]. Furthermore, large proportions of these errors have been reported for basic statistical methods, possibly performed by researchers who can use only elementary techniques. In particular, the concept of statistical significance, evaluated using p-values, has been commonly misused and misinterpreted [2].
One of the main issues behind these human errors is that the logical aspects of statistical inference are described informally or implicitly in natural language, and handled manually by analysts who may not fully understand the statistical methods. In particular, this makes analysts overlook assumptions necessary for statistical methods, and hence choose inappropriate methods. Nevertheless, to our knowledge, no prior work on formal methods has specified the preconditions for statistical inference or verified the choice of statistical techniques.
In this paper, we propose a method for formalizing and reasoning about statistical inference using symbolic logic. Specifically, we introduce a sound and relatively complete belief Hoare logic (BHL) to formalize the statistical beliefs acquired via hypothesis tests, and to prevent errors in the choice of hypothesis tests by describing their preconditions explicitly. We demonstrate by examples that this logic can be used to reason about practical issues concerning statistical inference.

Contributions
Our main contributions are as follows:
• We propose a new approach to formalizing and reasoning about statistical inference in a program. In particular, this approach formalizes and checks the requirements for statistical methods to be used appropriately.
• We define an epistemic language to express statistical beliefs obtained by hypothesis tests on datasets. Specifically, we formalize a statistical belief in a hypothesis ϕ as the knowledge that either (i) ϕ holds, (ii) the sampled dataset is unluckily far from the population, or (iii) the population does not satisfy the requirements for the hypothesis test. Then we introduce a Kripke model for hypothesis tests to define the interpretation of this language.
• Using this epistemic language, we construct belief Hoare logic (BHL) for reasoning about statistical hypothesis testing programs. Then we prove that BHL is sound and relatively complete w.r.t. the Kripke model for hypothesis tests.
• We clarify the importance of prior beliefs in acquiring statistical beliefs, and prove essential properties of statistical beliefs by using our framework.
• We show that BHL is useful for reasoning about practical issues concerning statistical inference, such as p-value hacking and multiple comparison problems.
• We provide the whole picture of the justification of statistical beliefs acquired via hypothesis tests inside and outside BHL. In particular, we discuss the empirical conditions for hypothesis tests and the epistemic aspects of statistical inference.
To the best of our knowledge, this is the first attempt to introduce a program logic that can specify the requirements for hypothesis tests to be applied appropriately. We consider this a first step toward building a framework for formalizing and verifying the validity of empirical science and data-driven artificial intelligence.

Relation with the Preliminary Version
A preliminary version of this work, considering only a sound but not complete program logic, appeared in [3]. The main novelties of this paper relative to that version are:
• We introduce a sound and relatively complete belief Hoare logic (BHL) that has a simpler set of axioms and inference rules than our preliminary version [3].
• We extend the notion of a possible world with a hypothesis test history, and redefine the assertion language. This enables us to provide a more rigorous model for statistical beliefs and to prove the relative completeness of BHL.
• We add propositions and discussion for hypothesis formulas and show the importance of prior beliefs in hypothesis testing by using our framework in Section 7.
• We present all proofs for our technical results in Appendix B.

Related Work
Hoare logic [4] is a form of program logic for an imperative programming language. This program logic has since been extended and adapted to handle various types of programs and assertions, including heap-manipulating programs [5], hybrid systems [6], and probabilistic programs [7]. Atkinson and Carbin propose an extension of Hoare logic with epistemic assertions [8]. In their work, an epistemic assertion is used to reason about the belief of a program about a partially observable environment, but their logic does not deal with statistical beliefs arising from statistical tests conducted in a program. To the best of our knowledge, ours is the first program logic that formalizes the concept of statistical beliefs in hypothesis testing.
Epistemic logic [9] is a branch of logic for reasoning about knowledge and belief [10,11]. It has been used to specify and verify various knowledge properties in systems, e.g., authentication [12] and anonymity [13,14]. Many previous works on epistemic logic incorporate certain notions of degrees of belief and confidence [15]. Notably, Bacchus et al. [16] define the degree of belief in a possible-world semantics where each world is associated with a weight, and the degree of belief in a formula ϕ is defined as the normalized sum of the weights of all accessible possible worlds satisfying ϕ. However, this line of studies has not modeled the degree of belief in the sense of statistical significance in a hypothesis test. In contrast, our framework models the degree of belief in terms of a p-value without assigning a weight to a possible world.
Fuzzy logic [17] is a branch of many-valued logic where the truth values range over [0, 1]. It has been used to model and reason about the degrees of uncertainty in beliefs and confidence [18,19]. To the best of our knowledge, however, no prior work on fuzzy logic can reason about the correct application of statistical hypothesis testing.
The first attempt to express statistical properties of hypothesis tests using modal logic is the work on statistical epistemic logic (StatEL) [20,21]. These works introduce a belief modality weaker than S5, and a Kripke model with an accessibility relation defined using a statistical distance between possible worlds. Unlike our work, however, StatEL cannot describe the procedures of statistical methods or reason about their correctness.
From a broader perspective, many studies formalize and reason about programs based on knowledge [22] and beliefs [23]. For example, Sardina and Lespérance [24] extend the situation calculus-based agent programming language GOLOG [25] with BDI (belief-desire-intention) [26] agents. Belle and Levesque [27] propose a belief-based programming language called ALLEGRO to deal with the probabilistic degrees of beliefs in programs with noisy acting and sensing. However, no prior work appears to have studied belief-based programs involving statistical hypothesis testing.

Plan of the Paper
In Section 2, we review fundamental concepts from statistical hypothesis testing. In Section 3, we present an illustrating example to explain the basic ideas of our framework. In Section 4, we introduce a Kripke model for describing statistical properties and define hypothesis testing. In Section 5, we introduce the syntax and the semantics of an imperative programming language Prog. In Section 6, we define an assertion language, called epistemic language for hypothesis testing (ELHT), that can express statistical beliefs. In Section 7, we clarify the importance of prior beliefs in acquiring statistical beliefs, and show the essential properties of statistical beliefs in our framework. In Section 8, we introduce belief Hoare logic (BHL) for formalizing and reasoning about statistical inference using hypothesis tests. Then we show the soundness and relative completeness of BHL. In Section 9, we apply our framework to the reasoning about p-value hacking and multiple comparison problems using BHL. In Section 10, we provide the whole picture of the justification of statistical beliefs inside and outside BHL. In Section 11, we present our final remarks.
In Appendix A, we present examples of the instantiations of derived rules with concrete hypothesis test methods. In Appendix B, we show the proofs of the propositions on assertions, basic results on structural operational semantics, remarks on parallel compositions, and the proofs of BHL's soundness and relative completeness.

Preliminaries
In this section, we introduce notations used in this paper and recall background on statistical hypothesis testing [28,29].
Let N, R, and R≥0 be the sets of non-negative integers, real numbers, and non-negative real numbers, respectively. Let [0, 1] be the set of non-negative real numbers less than or equal to 1. We denote the set of all finite vectors of elements in S by S*, the set of all multisets of elements in S by P(S), and the set of all probability distributions over a set S by DS.

Statistical Hypothesis Testing
Statistical hypothesis testing is a method of statistical inference about an unknown population x (the collection of items of interest) on the basis of a dataset y sampled from x. In a hypothesis test, an alternative hypothesis ϕ1 is a proposition that we wish to prove about the population x, and a null hypothesis ϕ0 is a proposition that contradicts ϕ1. The goal of the hypothesis test is to determine whether we accept the alternative hypothesis ϕ1 by rejecting the null hypothesis ϕ0.

[Table: the tails of a test (two/one), the prior knowledge assumed, and the corresponding alternative hypothesis ϕ1 and null hypothesis ϕ0.]

In a hypothesis test, we calculate a test statistic t(y) from a dataset y, and see whether the value t(y) contradicts the assumption that the null hypothesis ϕ0 is true. Specifically, we calculate the p-value, which expresses the degree of likeliness of obtaining t(y) when the null hypothesis ϕ0 is true. If the p-value is smaller than a threshold (e.g., 0.05), we regard the dataset y as unlikely to have been sampled from a population satisfying the null hypothesis ϕ0; hence we reject ϕ0 and accept the alternative hypothesis ϕ1.
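The decision step just described can be sketched in a few lines. The following is a minimal sketch (the function names are ours, not from the paper), assuming a test statistic that follows the standard normal N(0, 1) under the null hypothesis:

```python
import math

def normal_cdf(x: float) -> float:
    """CDF of the standard normal N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_tailed_p_value(t: float) -> float:
    """Pr[|r| >= |t|] for r ~ N(0, 1): the two-tailed p-value of a statistic t."""
    return 2.0 * (1.0 - normal_cdf(abs(t)))

def reject_null(t: float, threshold: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value falls below the threshold."""
    return two_tailed_p_value(t) < threshold
```

For instance, `reject_null(3.0)` rejects the null hypothesis at the 0.05 level, while `reject_null(1.0)` does not.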
A hypothesis test is based on a statistical model P (ξ, θ) with unknown parameters ξ, known parameters θ, and (assumed) probability distributions of the parameters ξ.
Example 1 (Z-test for two population means). As an illustrating example, we present the two-tailed Z-test for the means of two populations. We introduce its statistical model as two normal distributions N(µ1, σ²) and N(µ2, σ²) with a known variance σ² and unknown true means µ1, µ2. Let y1 and y2 be two given datasets where each data value was sampled from N(µ1, σ²) and N(µ2, σ²), respectively.
In the Z-test, we wish to prove the alternative hypothesis ϕ1, expressing µ1 ≠ µ2, by rejecting the null hypothesis ϕ0, expressing µ1 = µ2. When the p-value is small enough, the datasets y1 and y2 are unlikely to have been sampled from the same distribution, i.e., the null hypothesis µ1 = µ2 is unlikely to hold. Hence, in the Z-test, if the p-value is smaller than a certain threshold (e.g., 0.05), we reject the null hypothesis ϕ0 and accept the alternative hypothesis ϕ1.
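As a concrete sketch of Example 1 (helper names are ours), the two-tailed Z-test for two population means with a known common variance can be computed as:

```python
import math

def z_test_two_means(y1, y2, sigma: float):
    """Two-tailed Z-test for the null hypothesis mu1 = mu2 with a known
    common variance sigma^2. Returns the test statistic and its p-value."""
    n1, n2 = len(y1), len(y2)
    mean1 = sum(y1) / n1
    mean2 = sum(y2) / n2
    # Under the null hypothesis, the statistic follows the standard normal N(0, 1).
    z = (mean1 - mean2) / (sigma * math.sqrt(1.0 / n1 + 1.0 / n2))
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p
```

With identical samples the statistic is 0 and the p-value is 1; with clearly separated sample means the p-value is tiny and the null hypothesis is rejected.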

Illustrating Example
Throughout the paper, we use the following simple illustrating example to explain the basic ideas of our framework.
Example 2 (Comparison tests on drugs). Let us consider three drugs 1, 2, 3 that may decrease blood pressure. To compare the efficacy of these drugs, we perform experiments and obtain a set yi of the reduced values of blood pressure after taking drug i. Then we apply hypothesis tests on the dataset y = (y1, y2, y3). We assume that the data values in yi have been sampled from the population that follows a normal distribution N(µi, σ²) with a mean µi and a variance σ². For simplicity, we consider the situation where we know the variance σ² but do not know the means µi.
Suppose that drug 1 is composed of drugs 2 and 3, and we investigate whether drug 1 has better efficacy than both drugs 2 and 3. Then we take the following procedure:
• We first compare drugs 1 and 2 concerning the average decreases in blood pressure. We apply a two-tailed Z-test A12 (Example 1) to see whether the means of the populations are different, i.e., µ1 ≠ µ2. In this test, the alternative hypothesis ϕ12 is the inequality µ1 ≠ µ2, and the null hypothesis ¬ϕ12 is the equality µ1 = µ2.
• Let α ij be the p-value when only comparing drugs i and j.
• If α12 ≥ 0.05, the Z-test A12 does not reject the null hypothesis ¬ϕ12 and concludes that the efficacy of drugs 1 and 2 may be the same. Then we are no longer interested in drug 1, and skip the comparison with drug 3.
• If α12 < 0.05, the Z-test A12 rejects the null hypothesis ¬ϕ12 and concludes that the alternative hypothesis ϕ12 is true. Then we apply another Z-test A13 to check whether the alternative hypothesis ϕ13 def= (µ1 ≠ µ3) is true.
• Finally, we calculate the p-value of the combined test A consisting of A 12 and A 13 , with the conjunctive alternative hypothesis ϕ 12 ∧ ϕ 13 .
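The procedure above can be sketched as follows. This is an illustrative script (the function names and result layout are ours), reusing the two-tailed Z-test of Example 1 and the min bound on the combined conjunctive p-value discussed later in this section:

```python
import math

def z_p_value(y_a, y_b, sigma):
    """Two-tailed Z-test p-value for equal means, known variance (Example 1)."""
    z = (sum(y_a) / len(y_a) - sum(y_b) / len(y_b)) / (
        sigma * math.sqrt(1.0 / len(y_a) + 1.0 / len(y_b)))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def compare_drugs(y1, y2, y3, sigma, threshold=0.05):
    """Sequential procedure of Example 2: run A12; only on rejection, run A13."""
    alpha12 = z_p_value(y1, y2, sigma)                 # test A12
    if alpha12 >= threshold:
        return {"alpha12": alpha12, "accepted": None}  # skip the comparison with drug 3
    alpha13 = z_p_value(y1, y3, sigma)                 # test A13
    # For the conjunctive alternative phi12 /\ phi13, the combined p-value
    # is bounded by min(alpha12, alpha13).
    return {"alpha12": alpha12, "alpha13": alpha13,
            "combined_bound": min(alpha12, alpha13),
            "accepted": alpha13 < threshold}
```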
Overview of the Framework. In our framework, we describe a procedure of statistical tests as a program using a programming language (Section 5); in Example 2, we denote the Z-test program comparing drugs i and j by C_{ij}, and the whole procedure by:

C ≡ C_{12}; if α_{12} < 0.05 then C_{13} else skip.

Then we use an assertion logic (Section 6) to describe the requirement for the hypothesis tests as a precondition formula. In Example 2, the precondition is given by:

(y_1 ∼^{n_1} N(µ_1, σ²)) ∧ (y_2 ∼^{n_2} N(µ_2, σ²)) ∧ (y_3 ∼^{n_3} N(µ_3, σ²)) ∧ P(ϕ_{12} ∧ ϕ_{13}) ∧ κ_∅.

In this formula, y_i ∼^{n_i} N(µ_i, σ²) represents that a set y_i of n_i data is sampled from the population that follows the normal distribution N(µ_i, σ²). The modal formula P(ϕ_{12} ∧ ϕ_{13}) represents that before conducting the hypothesis tests, we have the prior belief that the alternative hypothesis ϕ_{12} ∧ ϕ_{13} may be true (see Section 7 for discussion). The formula κ_∅ represents that no hypothesis test has been conducted previously.
The statistical belief we want to acquire is specified as a postcondition formula. In Example 2, the postcondition expresses that, by testing on the dataset y, when we believe ϕ_{12} with a p-value α_{12} ≤ 0.05, we believe the combined hypothesis ϕ_{12} ∧ ϕ_{13} with a p-value at most min(α_{12}, α_{13}).
Finally, we combine all the above and describe the whole statistical inference as a judgment. In Example 2, we write the judgment {pre} C {post} consisting of the precondition, the program, and the postcondition above. By proving this judgment using the rules of BHL (Section 8), we conclude that the statistical inference is appropriate.
We remark that the p-value bound can be larger when the purpose of testing differs. Suppose that in Example 2, drug 1 was a new drug and we wanted to find out whether it had better efficacy than at least one of drugs 2 and 3. Then the procedure performs both tests A12 and A13, and the alternative hypothesis is ϕ_{12} ∨ ϕ_{13}, with a p-value greater than α_{12} and α_{13} (and at most α_{12} + α_{13}). This is the multiple comparisons problem [30], which arises when the combined alternative hypothesis is disjunctive. We explain more details in Section 8.
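To see why the disjunctive case inflates the error, one can simulate the family-wise error rate of several tests whose null hypotheses are all true. The following is a sketch (function names ours) under the assumption of independent standard-normal test statistics:

```python
import math
import random

def two_tailed_p(z: float) -> float:
    """Two-tailed p-value of a statistic z under N(0, 1)."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def familywise_error(num_tests: int, trials: int = 20_000,
                     threshold: float = 0.05, seed: int = 0) -> float:
    """Estimate the probability that at least one of `num_tests` independent
    Z-tests wrongly rejects its (true) null hypothesis."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under every null hypothesis, each test statistic is a standard normal draw.
        if any(two_tailed_p(rng.gauss(0.0, 1.0)) < threshold
               for _ in range(num_tests)):
            hits += 1
    return hits / trials
```

With one test the estimate sits near the per-test level 0.05; with two tests it rises (to about 1 − 0.95² ≈ 0.0975 under independence), staying below the union bound 0.05 + 0.05, matching the bound α_{12} + α_{13} for the disjunctive alternative.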

Model
In this section, we introduce a Kripke model for describing statistical properties and formally define hypothesis tests.

Variables, Data, and Actions
We introduce a finite set Var of variables comprising two disjoint sets of invisible variables and of observable variables: Var = Var_inv ∪ Var_obs. We can directly observe the values of the latter, but not those of the former. Throughout the paper, we use y as an observable variable denoting a dataset sampled from the population.
We write O for the set of all data values, which consists of the Boolean values, integers, real numbers, distributions of data values, and lists of data values. A dataset is a list of lists of data values. In particular, we deal with a list of real vectors as a dataset. Then the vectors range over X def= R^l for an l ∈ N. A distribution over a population has type DX, and a dataset has type list X. We remark that distributions and datasets are elements of O; i.e., DX ⊆ O and list X ⊆ O. ⊥ denotes the undefined value.
We write d ∼ D^n for the sampling of a set d of n data from a population D where all these data are independent and identically distributed (i.i.d.). Let Smpl be a set of i.i.d. samplings of datasets from populations (e.g., d ∼ D^n), and Cmd be a set of program commands (e.g., v := e and skip). Then we define an action as a sampling of a dataset or a program command; i.e., Act = Smpl ∪ Cmd. In Section 5, we instantiate Cmd with concrete commands used in a programming language.

States and Possible Worlds
We introduce the notions of states and possible worlds equipped with test histories.We write A for a finite set of hypothesis tests we consider.

Definition 1 (States).
A state is a tuple (m, a, H ) consisting of (i) the current assignment m : Var → O ∪ {⊥} of data values to variables, (ii) the action a ∈ Act that has been executed in the last transition, and (iii) the test history H : (list X ) → P(A) that maps a dataset d to the multiset of all hypothesis tests that have used the dataset d.
We remark that H (d) is a multiset rather than a set, because the same test on the same dataset d can be performed multiple times.
Definition 2 (Possible worlds). A possible world w is a sequence of states (w[0], w[1], . . ., w[k−1]) where w[i] is the i-th state in w. w[0] and w[k−1] are called the initial state and the current state, respectively. The length k is denoted by len(w). We write (m_w, a_w, H_w) for the current state w[k−1] of a possible world w. We assume that the test history is empty at the initial states.

Since a possible world records all updates of data values, it can be used to model the updates of knowledge and beliefs. As in previous works on epistemic logic [10], agents' knowledge and beliefs are defined from their observation of possible worlds.

Definition 3 (Observation). The observation of a state w[i] = (m, a, H) is defined by obs(w[i]) = (m_obs, a, H) with an assignment m_obs : Var → O ∪ {⊥} such that m_obs(v) = m(v) for all v ∈ Var_obs, and m_obs(v) = ⊥ for all v ∈ Var_inv. The observation of a world w is given by obs(w) = (obs(w[0]), . . ., obs(w[k−1])).
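Definitions 1–3 can be mirrored by a small data structure. The following is a hypothetical Python encoding (the variable names and the partition of Var are illustrative, and the test history is simplified to a dictionary from datasets to tuples of test names):

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

# Illustrative partition of Var into observable and invisible variables.
VAR_OBS = {"y", "v"}
VAR_INV = {"mu1", "mu2", "h_y_A"}

@dataclass(frozen=True)
class State:
    memory: Dict[str, Any]                 # assignment m : Var -> O ∪ {⊥}
    action: Any                            # last executed action a ∈ Act
    history: Dict[Any, Tuple[str, ...]]    # test history H: dataset -> tests applied

World = List[State]  # a possible world is a sequence of states

def obs_state(s: State) -> State:
    """Mask invisible variables: m_obs(v) = m(v) on Var_obs, ⊥ (None) elsewhere."""
    masked = {v: (s.memory[v] if v in VAR_OBS else None) for v in s.memory}
    return State(masked, s.action, s.history)

def obs(w: World) -> World:
    """Observation of a world: pointwise observation of its states."""
    return [obs_state(s) for s in w]
```

Note that, as in Definition 3, the action and the test history survive observation; only the invisible part of the memory is masked.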

Kripke Model
We introduce a Kripke model with labeled transitions, in which two kinds of relations between worlds appear: transition relations labeled with actions, and an accessibility relation for the epistemic modality.

Definition 4 (Kripke model). A Kripke model is a tuple M = (W, (→a)_{a∈Act}, R, (V_w)_{w∈W}) consisting of:
• a non-empty set W of possible worlds;
• for each action a ∈ Act, a transition relation →a ⊆ W × W;
• an accessibility relation R ⊆ W × W;
• for each w ∈ W, a valuation V_w that maps a k-ary predicate symbol to a set of k-tuples of data values.
We assume that each world in a model has the same sets Var inv and Var obs of variables.
In Section 5.2, we instantiate the actions in a Kripke model with concrete program commands described in a programming language, and define the transition relation →a for each command a; e.g., the transition by the assignment v := 1 is formally defined as (w, w′) ∈ [[v := 1]] in Section 5.2.

Formulation of Hypothesis Testing
Next, we formalize the notion of hypothesis tests as follows.
Definition 5 (Hypothesis tests). We consider a basic test type s ∈ {L, U, T}, representing a lower-tailed, an upper-tailed, and a two-tailed test, respectively. A hypothesis test is a tuple A^{(s)}_{ϕ0} = (ϕ0, t, D_{t,ϕ0}, ⪯^{(s)}, P) consisting of:
• ϕ0 is an assertion, called a null hypothesis;
• t is a function that maps a dataset d ∈ list X to its test statistic t(d), usually with range(t) = R^k for a k ≥ 1;
• D_{t,ϕ0} is a probability distribution of the test statistic when the null hypothesis ϕ0 is true;
• ⪯^{(s)}_t ⊆ range(t) × range(t) is a likeliness relation, where for a test type s and for values r and r′ of the test statistic, r ⪯^{(s)} r′ represents that r is at most as likely as r′; for brevity, we often omit t and (s) to write ⪯^{(s)} and ⪯;
• P(ξ, θ) denotes the population following a statistical model P with unknown parameters ξ and known parameters θ.
For brevity, we abbreviate A^{(s)}_ϕ as A_ϕ, A^{(s)}, or A. We denote by P_A the statistical model P of a hypothesis test A, and by A a finite set of hypothesis tests we consider.
The likeliness relation r ⪯^{(T)} r′ expresses |r| ≥ |r′|; for the upper-tailed test (resp. lower-tailed test), r ⪯^{(U)} r′ (resp. r ⪯^{(L)} r′) expresses r ≥ r′ (resp. r ≤ r′). For instance, in the two-tailed Z-test of Example 1, when the null hypothesis ϕ0 is true, the test statistic t(y1, y2) follows the standard normal distribution N(0, 1).

Next, we define disjunctive/conjunctive combinations of hypothesis tests. Intuitively, a disjunctive combination A_{ϕ1∨ϕ2} (resp. conjunctive combination A_{ϕ1∧ϕ2}) is a hypothesis test with a null hypothesis ϕ1 ∨ ϕ2 (resp. ϕ1 ∧ ϕ2) that performs two hypothesis tests A_{ϕ1} and A_{ϕ2} in parallel.

Definition 6 (Combination of tests). For i = 1, 2, let A_{ϕi} = (ϕi, ti, D_{ti,ϕi}, ⪯^{(s)}_i, Pi) be two hypothesis tests. The disjunctive combination of A_{ϕ1} and A_{ϕ2} is given by A_{ϕ1∨ϕ2} = (ϕ1 ∨ ϕ2, t, D_{t,(ϕ1,ϕ2)}, ⪯^{(s)}, P), where t(y1, y2) = (t1(y1), t2(y2)); D_{t,(ϕ1,ϕ2)} is a coupling of D_{t1,ϕ1} and D_{t2,ϕ2} (i.e., it is a joint distribution such that D_{t1,ϕ1} and D_{t2,ϕ2} are the marginal distributions of D_{t,(ϕ1,ϕ2)}); (r1, r2) ⪯^{(s)} (r′1, r′2) iff r1 ⪯^{(s)}_1 r′1 and r2 ⪯^{(s)}_2 r′2; and P is a coupling of P1 and P2. Similarly, the conjunctive combination of A_{ϕ1} and A_{ϕ2} is defined analogously, except that (r1, r2) ⪯^{(s)} (r′1, r′2) iff r1 ⪯^{(s)}_1 r′1 or r2 ⪯^{(s)}_2 r′2.

Then we define a function ⟨y, A⟩ to decompose a combined test into individual tests.

Definition 7. For a combination A of n hypothesis tests A1, . . ., An and a tuple of n datasets y = (y1, . . ., yn), the multiset of all the pairs of datasets and tests is ⟨y, A⟩ = {(y1, A1), . . ., (yn, An)}. For instance, ⟨(y1, y2), A_{ϕ1∧ϕ2}⟩ = {(y1, A_{ϕ1}), (y2, A_{ϕ2})}.
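At the level of p-values, the combinations of Definition 6 and the decomposition of Definition 7 can be sketched as follows (the function names are ours; the min and sum bounds match the discussion of Example 2 in Section 3):

```python
from collections import Counter

def combine_conjunctive(p1: float, p2: float) -> float:
    """Bound on the combined p-value for the conjunctive alternative
    phi1 /\\ phi2: both tests must reject, so under any coupling the joint
    rejection probability is at most min(p1, p2)."""
    return min(p1, p2)

def combine_disjunctive(p1: float, p2: float) -> float:
    """Bound for the disjunctive alternative phi1 \\/ phi2: one rejection
    suffices, so the union bound gives at most p1 + p2 (capped at 1)."""
    return min(p1 + p2, 1.0)

def decompose(datasets, tests) -> Counter:
    """Definition 7 as a multiset: a combined test splits into the pairs
    (y_i, A_i) of datasets and component tests."""
    return Counter(zip(datasets, tests))
```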

A Simple Programming Language
We introduce an imperative programming language Prog.

Syntax of Prog
Let Fsym be the set of all function symbols, where constants are dealt with as functions of arity 0. We define the syntax of Prog by the following BNF:

T ::= bool | int | real | T1 × T2 | list(T)
e ::= v | f(e1, . . ., ek)
c ::= skip | v := e
C ::= c | C1; C2 | C1 ∥ C2 | if e then C1 else C2 | loop e do C

where v ∈ Var_obs and f ∈ Fsym. Then a program can handle only observable variables. T represents types: a type is either bool for Boolean values, int for integers, real for real numbers, T1 × T2 for pairs consisting of a value of type T1 and a value of type T2, or list(T) for lists of values of type T. e represents expressions that evaluate to values: an expression is either a variable v or a function call f(e1, . . ., ek); the latter is typically a call to a function that computes a test statistic. c and C represent commands and programs, respectively. We give their intuitive explanation as follows.
• skip does nothing.
• v := e updates v with the result of an evaluation of e.
• C1 ∥ C2 executes C1 and C2 in parallel; they may share some data.
• if e then C 1 else C 2 executes C 1 if e evaluates to true; executes C 2 if e evaluates to false.
• loop e do C iteratively executes C as long as e evaluates to true.
For instance, the programs in Section 9 conform to the programming language Prog. Hereafter we assume that all programs are well-typed, although we do not explicitly mention the types. Checking this condition for our language can be done by adapting a standard type-checking algorithm to our setting.
We write upd(C) for the set of all variables that may be updated by executing C, defined by induction on the structure of C. Then we impose the following restriction on every occurrence of C1 ∥ C2: the variables updated by C1 do not occur in C2, and vice versa. This restriction ensures that an execution of C1 does not interfere with that of C2, and vice versa.
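The definition of upd and the restriction on parallel compositions can be sketched over a small hypothetical AST (the tuple encoding is ours; par_ok tests the sufficient condition that the two sides update disjoint variables):

```python
def upd(cmd):
    """Set of variables that may be updated by executing cmd."""
    kind = cmd[0]
    if kind == "skip":
        return set()
    if kind == "assign":                 # ("assign", v, e)
        return {cmd[1]}
    if kind in ("seq", "par"):           # ("seq"/"par", C1, C2)
        return upd(cmd[1]) | upd(cmd[2])
    if kind == "if":                     # ("if", e, C1, C2)
        return upd(cmd[2]) | upd(cmd[3])
    if kind == "loop":                   # ("loop", e, C)
        return upd(cmd[2])
    raise ValueError(kind)

def par_ok(c1, c2) -> bool:
    """Check that C1 ∥ C2 updates disjoint sets of variables, so that the
    two executions cannot interfere through updated variables."""
    return upd(c1).isdisjoint(upd(c2))
```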

Semantics of Prog
We define the semantics of Prog over a Kripke model M (Section 4.3). The semantics is based on the standard structural operational semantics (e.g., [31]).
For a possible world w ∈ W and n = len(w), we write w[n−1] for its current state. As in Figure 1, we define a binary relation that relates a pair ⟨C, w⟩ consisting of a program C and a possible world w to its next step of execution. If C has terminated, the next step is a possible world w′; otherwise the execution continues to ⟨C′, w′⟩. We remark that the semantics of a program contains the trace of commands executed in it. Hence, even if programs finally produce the same result, their semantics may be different. For instance, when the value of a variable v is 1, the executions of the two programs v := v + 1 and v := 2 * v result in different worlds with the same memory.
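The trace-recording semantics can be illustrated by a minimal small-step interpreter for a fragment of Prog (the encoding and labels are ours); it shows that v := v + 1 and v := 2 * v, starting from v = 1, end with the same memory but different worlds:

```python
def step(cmd, world):
    """One step of <C, w> -> <C', w'>: returns (remaining command or None,
    extended world). A world is a list of (memory, action) pairs."""
    memory = dict(world[-1][0])
    kind = cmd[0]
    if kind == "skip":
        return None, world + [(memory, "skip")]
    if kind == "assign":              # ("assign", v, f, label), f : memory -> value
        _, v, f, label = cmd
        memory[v] = f(memory)
        return None, world + [(memory, ("assign", v, label))]
    if kind == "seq":                 # ("seq", C1, C2)
        rest, world2 = step(cmd[1], world)
        return (cmd[2] if rest is None else ("seq", rest, cmd[2])), world2
    raise ValueError(kind)

def run(cmd, world):
    """Iterate steps until the program terminates; return the final world."""
    while cmd is not None:
        cmd, world = step(cmd, world)
    return world
```

Running both assignments from the same initial world yields identical final memories but distinct recorded action traces, hence distinct worlds.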

Remark on Parallel Compositions
Since parallel compositions are nondeterministic, w′ ∈ [[C1 ∥ C2]](w) may not be unique. However, the resulting worlds are essentially the same: by the restriction on parallel compositions (Section 5.1), C1 and C2 update disjoint sets of variables, so the possible results differ only in the interleaving order of actions and agree on the final assignment of values to variables.

Procedures of Hypothesis Testing
We define the interpretation of a program f_A for a hypothesis test A = (ϕ0, t, D_{t,ϕ0}, ⪯^{(s)}, P) with a null hypothesis ϕ0, a test statistic t, a test type s, and a statistical model P. For a dataset y and an assignment m, [[f_A(y)]]_m represents the p-value:

[[f_A(y)]]_m = Pr_{r ∼ D_{t,ϕ0}} [ r ⪯^{(s)} t(m(y)) ],

which is the probability that a value r is at most as likely as the test statistic t(m(y)) when it is sampled from D_{t,ϕ0} in the world where the null hypothesis ϕ0 is true.
As in Figure 1, the execution of a program v := f_A(y) for a hypothesis test A updates the test history H so that A is added to the multiset H(m(y)) of all tests using the dataset m(y). Formally, the operation ⊎ in Figure 1 is the union of multisets: the updated history maps m(y) to H(m(y)) ⊎ {A}. To refer to the test history H_w in a possible world w, we introduce a history variable h_{y,A} ∈ Var_inv for each variable y and each hypothesis test A. A history variable h_{y,A} takes an integer value representing the number of executions of a hypothesis test A on a dataset y. Since h_{y,A} is an invisible variable, it never appears in a program.
The interpretation of h_{y,A} is consistent with the test history H_w; namely, m_w(h_{y,A}) represents the number of occurrences of A in the multiset H_w(m_w(y)). As shown in Figure 1, if a program command updates H_w, the values m_w(h_{y,A}) of the history variables h_{y,A} are also updated consistently. Although h_{y,A} is an invisible variable, the test history H_w itself is observable (Definition 3) and is used to define knowledge.
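The bookkeeping of the test history and the history variables can be sketched as follows (the names are ours); the assertion checks the consistency invariant between h_{y,A} and H(m(y)):

```python
from collections import Counter

def run_test(memory, history, hist_vars, y, test_name, p_value_fn):
    """Execute v := f_A(y): return the p-value, add A to the multiset
    H(m(y)), and bump the history variable h_{y,A}."""
    dataset = memory[y]
    history[dataset] = history.get(dataset, Counter()) + Counter([test_name])
    hist_vars[(y, test_name)] = hist_vars.get((y, test_name), 0) + 1
    # Consistency invariant: h_{y,A} equals the multiplicity of A in H(m(y)).
    assert hist_vars[(y, test_name)] == history[dataset][test_name]
    return p_value_fn(dataset)
```

Running the same test on the same dataset twice leaves a multiplicity of 2 in both the history and the history variable, as intended by Definition 1.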

Assertion Language
We define an assertion language called epistemic language for hypothesis testing (ELHT) that can express knowledge and statistical beliefs by using modal operators.

Syntax of the Assertion Language
We introduce the syntax of the assertion language ELHT. We define assertion terms, formulas, and predicate symbols for statistical notions. Then we introduce the modality of statistical beliefs as disjunctive knowledge.

Assertion Terms
We introduce assertion terms to denote data values as follows. Recall that Var is the set of all variables and Fsym is the set of all function symbols. We introduce a set IntVar of integer variables denoting finite tuples of integers such that IntVar ∩ Var = ∅. Then the set ATerm of assertion terms is defined by:

u ::= x | i | f(u1, . . ., uk)

where x ∈ Var, i ∈ IntVar, and f ∈ Fsym. Notice that the assertion terms can deal with invisible variables, unlike the program terms (Section 5.1). Fsym includes function symbols denoting families of probability distributions (e.g., N for normal distributions) and those denoting data operations (e.g., mean for calculating the mean of data values).

Assertion Formulas
We define the syntax of assertion formulas with a modal operator K for knowledge. As in previous studies, a formula Kϕ expresses that we know ϕ. Formally, for a set Pred of predicate symbols, the set Fml of formulas is defined by:

ϕ ::= η(u1, . . ., un) | ¬ϕ | ϕ ∧ ϕ | ∀i. ϕ | Kϕ

where η ∈ Pred, u1, . . ., un ∈ ATerm, and i ∈ IntVar. In the formulas, the quantifiers ∀ and ∃ never appear inside the epistemic modality K. They are used only to prove the relative completeness of our program logic in later sections. We remark that there is no universal/existential quantification over the observable and invisible variables. We denote the set of all variables occurring in a formula ϕ by fv(ϕ).
As syntax sugar, we use disjunction ∨, implication →, and the existential quantifier ∃. We also define epistemic possibility P as usual by Pϕ def= ¬K¬ϕ.

Hypothesis Formulas
We introduce notations for alternative/null hypotheses in hypothesis tests (Section 2). Recall that an alternative hypothesis ϕ1 is a proposition that we wish to prove, and that a null hypothesis ϕ0 is a proposition that contradicts the alternative hypothesis ϕ1. We write ¬ϕ1 for the null hypothesis corresponding to an alternative hypothesis ϕ1. In Section 7.1, we define it as syntax sugar and discuss the details.

Predicate Symbols
We introduce the following predicate symbols for statistical notions:
• u = u′ represents the equality of two data values u and u′.
• y ∼^n x expresses that a dataset y consists of n data sampled from a population x.
• y ∼ x represents that a dataset y has been sampled from a population x.
For brevity, we define the syntax sugar ∼_S for a multiset S of pairs of variables and hypothesis tests. Intuitively, ∼_{(y,A)} represents that a dataset y has been sampled from a population that satisfies the hypothesis test A's requirement. Formally, ∼_S def= ⋀_{(y,A)∈S} y ∼ P_A(ξ_A, θ_A), where P_A(ξ_A, θ_A) denotes the population following the statistical model P_A for a test A (Section 4.4). When S is a singleton {(y, A)}, we abbreviate ∼_{{(y,A)}} as ∼_{y,A}.

Modality of Statistical Beliefs
We use the following syntax sugar for formulas on executions of hypothesis tests. κ describes the record of all hypothesis tests conducted so far: κ_S is the formula representing that S is the multiset of every pair (y, A) of a dataset y and a hypothesis test A that has been applied to y. This formula is defined as equations between history variables h_{y,A} (Section 5.2) and their values by κ_S def= ⋀_{y,A} (h_{y,A} = n_{(y,A,S)}), where the conjunction ranges over all variables y and all tests A ∈ A, and n_{(y,A,S)} is the integer representing the number of occurrences of (y, A) in the multiset S. When S is a singleton {(y, A)}, we abbreviate κ_{{(y,A)}} as κ_{y,A}.
As syntax sugar, we introduce the statistical belief modality K^{<ε}_{y,A}. Intuitively, a statistical belief K^{<ε}_{y,A} ϕ expresses that we believe a hypothesis ϕ based on a statistical test A on an observed dataset y with a certain error level (p-value) less than ε. We formalize this as the knowledge that either (i) the hypothesis ϕ holds, (ii) the observed dataset y is unluckily far from the population (from which y is sampled), or (iii) the dataset y did not come from a population that satisfies the test A's requirement (e.g., a population following a normal distribution).
Formally, for a hypothesis test A_{¬ϕ} with an alternative hypothesis ϕ and its null hypothesis ¬ϕ, we define K^{<ε}_{y,A_{¬ϕ}} ϕ def= K(ϕ ∨ τ^{<}_{y,A}(ε) ∨ ¬∼_{y,A}). As the dual modality, we define the statistical possibility P^{<ε}_{y,A} by P^{<ε}_{y,A} ϕ def= ¬K^{<ε}_{y,A} ¬ϕ. For brevity, we often omit the subscript ¬ϕ from A_{¬ϕ} to abbreviate K^{<ε}_{y,A_{¬ϕ}} ϕ as K^{<ε}_{y,A} ϕ. We also write K^{ε}_{y,A} instead of K^{=ε}_{y,A}, and K^{ε} ϕ instead of K^{ε}_{y,A} ϕ.

Semantics of the Assertion Language
We define semantics for the assertion language ELHT.

Interpretation of Assertion Terms and Formulas
We introduce an interpretation function I : IntVar → Z* that assigns a finite tuple of integers to each integer variable. Then we define the interpretation of assertion terms in a world w in the standard way. We define the interpretation of formulas in a world w in a Kripke model M = (W, (→a)_{a∈Act}, R, (V_w)_{w∈W}) (Definition 4) as follows; M is sometimes omitted when it is clear from the context.

Interpretation of Predicate Symbols
We define the interpretation of predicate symbols. Let A = (ϕ, t, D_{t,ϕ}, ⪯^{(s)}, P) be a hypothesis test. Recall that the population's distribution has type DX, and that ⪯^{(s)} is the likeliness relation (Section 4.4). In a world w, we interpret the predicate symbols as follows.

(DX )×N
There is an i ∈ N s.t. .
Intuitively, the set V_w(ν_{y,A}) consists of only the p-value with which the hypothesis test A on the dataset m_w(y) rejects the null hypothesis ϕ. Then M, w |= ν_{y,A}(ε) represents that in a possible world w, the observation of a dataset y is unlikely to occur (except with probability ε) according to the hypothesis test A, where the test statistic follows the distribution D_{t,ϕ} in the world w. Formally, the p-value Pr_{r∼D_{t,ϕ}}[ r ⪯^{(s)} t(m_w(y)) ] is the probability that a value r is at most as likely as the test statistic t(m_w(y)) when it is sampled from the distribution D_{t,ϕ} in the possible world w; e.g., when D_{t,ϕ} is the standard normal distribution N(0, 1), then Pr_{r∼N(0,1)}[ r ⪯^{(T)} 1.96 ] = Pr_{r∼N(0,1)}[ |r| ≥ 1.96 ] = 0.05. We remark that the p-value is not a probability in the real world, but a probability in the possible world w where the null hypothesis ϕ is true. Analogously, the interpretation of ν_{y,A} is defined in terms of a range of p-values; for instance, ν^{<}_{y,A}(ε) represents that the p-value of a test A on a dataset y is less than ε. The interpretation of the syntactic sugar κ_S is given in terms of the test history H_w that maps a dataset o to the multiset of all hypothesis tests applied to o in the world w (Section 4.2).
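To make the p-value computation above concrete, here is a small Python sketch (our own illustration, not part of the formal semantics), assuming a test statistic that follows N(0, 1) under the null hypothesis, as in the two-tailed Z-test example:

```python
from math import erf, sqrt

def std_normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_tailed_p_value(t: float) -> float:
    """Pr_{r ~ N(0,1)}[ |r| >= |t| ]: the probability that a sample r
    is at most as likely as the observed test statistic t."""
    return 2.0 * (1.0 - std_normal_cdf(abs(t)))

# For t = 1.96 the two-tailed p-value is approximately 0.05.
print(round(two_tailed_p_value(1.96), 3))
```

The probability is computed under the null distribution, matching the remark above that the p-value lives in the possible world where the null hypothesis is true.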

Interpretation of Statistical Belief Modality
The interpretation of the statistical belief modality K^{<ε}_{y,A} is given as follows.
Intuitively, K^{<ε}_{y,A} ϕ expresses a belief that an alternative hypothesis ϕ on the population is true. For a two-tailed test A, the defining implication means that if we consider a possible world w′ where the null hypothesis ¬ϕ is true and the dataset y is drawn from a population satisfying the test A's requirement, then the execution of A would conclude that the observation of the dataset y is unlikely to occur (with exceptions at most ε), i.e., w′ |= τ^{<}_{y,A}(ε). See Sections 6.4, 6.5, and 7 for discussion. Although the modality K expresses knowledge in the sense of S5, the syntactic sugar K^{<ε}_{y,A} ϕ represents a belief rather than knowledge. This is because ϕ can be false when the observation is unlikely or the requirement fails; i.e., we may hold a false belief in ϕ (i) when the sampled dataset y is unluckily far from the population, or (ii) when the dataset y did not come from a population that satisfies the test A's requirement.

Remark on the Universe of the Kripke Model
We remark that the universe W of the model M is assumed to include all possible worlds we can imagine. If there is no possible world satisfying a null hypothesis ¬ϕ in M, then the alternative hypothesis ϕ is satisfied in all worlds in M, and hence so are Kϕ and K^{<ε}_{y,A} ϕ. This implies that if we cannot imagine a possible world where ¬ϕ is true, then we already know that ϕ is true without conducting the hypothesis test A.

When Hypothesis Tests are Meaningful
The formula K^{<ε}_{y,A} ϕ expresses a belief after conducting a hypothesis test A on a dataset y, and covers the following two cases where the execution of A is not useful: (i) we knew that the alternative hypothesis ϕ is true without conducting the test A; (ii) we know that the requirement for the test A on y is not satisfied.
Hence, deriving only the formula K^{<ε}_{y,A} ϕ is not sufficient to conclude the correctness of the alternative hypothesis ϕ from the execution of the hypothesis test A.
Formally, K^{<ε}_{y,A} ϕ is satisfied also when we have the prior knowledge that (i) ϕ is satisfied or (ii) the test A's requirement is not satisfied. Thus, the execution of A is meaningful only when we do not have this prior knowledge, i.e., only when we have the prior belief that ¬ϕ and the test's requirement hold together.
For the outcome of the test A to be meaningful, the requirement of A on y must hold in the real world. In practice, however, we usually have limited knowledge of the population (Section 10), and may not know whether the population satisfies the requirement. For this reason, in Sections 8 and 9, we aim to derive a statistical belief K^{<ε}_{y,A} ϕ under the assumption that the requirement holds as a precondition, instead of requiring knowledge of it.

Prior Beliefs and Posterior Statistical Beliefs in ELHT
We clarify the importance of prior beliefs in acquiring statistical beliefs, and prove essential properties of statistical beliefs by using the assertion language ELHT.

Hypothesis Formulas
To formalize the prior beliefs for hypothesis testing, we introduce notations for the alternative and null hypotheses in the assertion language ELHT.
We use two formulas ϕ_U and ϕ_L to represent the alternative hypotheses of an upper-tailed test and a lower-tailed test, respectively. Then ϕ_U and ϕ_L cannot be true simultaneously; i.e., |= ¬ϕ_U ∨ ¬ϕ_L. The alternative hypothesis of the two-tailed test is defined by ϕ_T def= ϕ_U ∨ ϕ_L. We also introduce syntactic sugar ¬ϕ_T, ¬ϕ_U, and ¬ϕ_L for the corresponding null hypotheses. Example 5 (Hypothesis formulas in Z-tests). For µ1, µ2 ∈ R, the two-tailed, upper-tailed, and lower-tailed Z-tests (Example 1) have the alternative hypotheses ϕ_T def= (µ1 ≠ µ2), ϕ_U def= (µ1 > µ2), and ϕ_L def= (µ1 < µ2). This is because the upper-tailed (resp. lower-tailed) test is based on the assumption that µ1 ≥ µ2 (resp. µ1 ≤ µ2). The null hypotheses of these tests are ¬ϕ_s def= (µ1 = µ2) for s ∈ {T, U, L}. See Table 2 for a summary of the hypothesis formulas in Z-tests.

Prior Beliefs in Hypothesis Tests
We formally describe the prior knowledge of hypothesis testing using epistemic formulas. We show an example in Table 3.

Prior Beliefs in Two-Tailed Z-Tests
For an application of the two-tailed Z-test to be meaningful, we are supposed to have the prior beliefs that µ1 > µ2 is possible (denoted by Pϕ_U) and that µ1 < µ2 is possible (denoted by Pϕ_L). ELHT naturally explains why these prior beliefs are essential for interpreting the results of hypothesis tests. Assume that, in a world w, we had neither of these prior beliefs but obtained a statistical belief K^α_{y,A} ϕ_T by conducting a two-tailed hypothesis test A; i.e., w |= ¬Pϕ_U ∧ ¬Pϕ_L ∧ K^α_{y,A} ϕ_T. Since ¬Pϕ_U ∧ ¬Pϕ_L is equivalent to K(¬ϕ_U ∧ ¬ϕ_L), i.e., to K¬ϕ_T, we already know that the alternative hypothesis ϕ_T is false regardless of the result of the test A (which aims to show that ϕ_T is true). Clearly, the execution of the test A is meaningless when we know that ϕ_T is false. For this reason, the prior beliefs Pϕ_U and Pϕ_L are essential for the application of the two-tailed test to be meaningful. We remark that even if we do not have these prior beliefs, the definition of the formula K^α_{y,A} ϕ_T is still consistent with the principle of hypothesis testing (although the test is useless, as mentioned above). Recall that the statistical belief is defined as the disjunctive knowledge that ϕ_T holds, that the observation of y is unlikely, or that the test's requirement fails; hence, under K¬ϕ_T, we learn that either (i) the sampled dataset y is unluckily far from the population, or (ii) y was sampled from a population that does not satisfy the requirement for the hypothesis test A on y.

Prior Beliefs in One-Tailed Z-Tests
When we apply the upper-tailed Z-test, we are supposed to have the prior belief that µ1 > µ2 is possible (denoted by Pϕ_U), and the prior knowledge that µ1 < µ2 is impossible (denoted by ¬Pϕ_L or by K¬ϕ_L).
This prior knowledge K¬ϕ_L is used to select an upper-tailed test rather than a two-tailed one. In Proposition 1, we show that K¬ϕ_L is logically equivalent to K(ϕ_U ∨ ¬ϕ_U); i.e., under the knowledge K¬ϕ_L, either the alternative hypothesis ϕ_U (i.e., µ1 > µ2) or the null hypothesis ¬ϕ_U (i.e., µ1 = µ2) holds. Hence, the prior knowledge K¬ϕ_L allows for applying the upper-tailed test. Without this prior knowledge, we cannot apply the upper-tailed test, because we do not know whether one of the alternative hypothesis ϕ_U and the null hypothesis ¬ϕ_U holds.
In conclusion, we can use our assertion logic to explain that the prior knowledge K¬ϕ_L is crucial for applying the upper-tailed test. Symmetrically, the lower-tailed test requires the prior knowledge K¬ϕ_U, as indicated in Table 3.

Posterior Beliefs in Z-Tests
We remark that the prior beliefs on alternative hypotheses do not change even after conducting hypothesis tests. For example, in the case of the two-tailed Z-test, the prior belief Pϕ_L ∧ Pϕ_U remains to hold after conducting the test and obtaining a p-value α > 0. This is because the statistical belief is defined as a disjunctive knowledge K(ϕ_T ∨ τ_{y,A}(α) ∨ ...) including the error cases, and thus cannot yield any knowledge about the alternative hypothesis itself (e.g., Kϕ_T or K¬ϕ_T).

Properties of Prior Beliefs on Hypotheses
Now we show basic properties of prior beliefs on hypotheses as follows.
Proposition 1 (Basic properties of prior beliefs). Recall that ϕ_T def= ϕ_U ∨ ϕ_L.
1. In a two-tailed hypothesis test, either the alternative hypothesis ϕ_T or the null hypothesis ¬ϕ_T is always satisfied; i.e., |= ϕ_T ∨ ¬ϕ_T.
2. We know that the lower tail ϕ_L is impossible iff we know that either the alternative hypothesis ϕ_U or the null hypothesis ¬ϕ_U of the upper-tailed test is satisfied: K¬ϕ_L ↔ K(ϕ_U ∨ ¬ϕ_U).
3. We know that the upper tail ϕ_U is impossible iff we know that either the alternative hypothesis ϕ_L or the null hypothesis ¬ϕ_L of the lower-tailed test is satisfied: K¬ϕ_U ↔ K(ϕ_L ∨ ¬ϕ_L).
The proof is shown in Appendix B.1.
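The second item can be checked propositionally; the following sketch is our own rendering (not the paper's proof), assuming, as in the Z-test example, that the null hypothesis of the upper-tailed test is ¬ϕ_U ∧ ¬ϕ_L. The knowledge modality K is then applied to both sides of the resulting equivalence:

```latex
\begin{align*}
\phi_U \lor (\neg\phi_U \land \neg\phi_L)
  &\equiv (\phi_U \lor \neg\phi_U) \land (\phi_U \lor \neg\phi_L)
     && \text{distributivity} \\
  &\equiv \phi_U \lor \neg\phi_L
     && \text{since } \phi_U \lor \neg\phi_U \equiv \top \\
  &\equiv \neg\phi_L
     && \text{since } \models \neg\phi_U \lor \neg\phi_L
        \text{ gives } \phi_U \to \neg\phi_L.
\end{align*}
```

The third item follows symmetrically by exchanging the roles of ϕ_U and ϕ_L.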

Type II Error
Symmetrically to the p-value (type I error rate) α, the type II error rate β is the probability that a hypothesis test A does not reject the null hypothesis ¬ϕ when the alternative hypothesis ϕ is true. For instance, in the two-tailed Z-test (Example 1), β is the probability that the Z-test fails to reject the null hypothesis µ1 = µ2 when the alternative hypothesis µ1 ≠ µ2 is true. We remark that β is determined by the effect size |µ1 − µ2|/σ; for a smaller distance |µ1 − µ2|, it is more difficult for the Z-test to distinguish the null and alternative hypotheses, hence the type II error rate β is larger.
Formally, let y′ be a dataset such that the p-value α of a test A is 0.05 in a world w; i.e., w |= K^{0.05}_{y′,A} ϕ. To calculate the type II error rate β, we consider an effect size es > 0. Suppose that the hypothesis ξ def= (es = |µ1 − µ2|/σ) is satisfied, i.e., w |= ξ. Then w |= ϕ. The belief about the type II error is expressed by w |= K^β_{y′,A} ¬ξ; i.e., in the world w, we believe that ξ is false with a degree β of belief, although ξ is actually true in w.
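The dependence of β on the effect size can be made numerically concrete. The following Python sketch is our own illustration (the function name and the one-sample setting are assumptions, not the paper's construction); it computes β for a two-tailed Z-test at significance level 0.05 with known variance:

```python
from math import erf, sqrt

def std_normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def type_ii_error(effect_size: float, n: int, z_crit: float = 1.96) -> float:
    """Type II error rate beta of a two-tailed Z-test at significance 0.05.

    Under the alternative, the test statistic follows N(delta, 1) with
    noncentrality delta = effect_size * sqrt(n); beta is the probability
    that it still falls inside the acceptance region [-1.96, 1.96]."""
    delta = effect_size * sqrt(n)
    return std_normal_cdf(z_crit - delta) - std_normal_cdf(-z_crit - delta)

# With effect size 0 the null and alternative coincide: beta = 1 - 0.05 = 0.95.
# A larger effect size makes the hypotheses easier to distinguish, so beta shrinks.
print(type_ii_error(0.0, 25), type_ii_error(0.5, 25))
```

This mirrors the remark above: a smaller effect size makes the hypotheses harder to distinguish, so β grows toward 1 − α.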

Properties of Statistical Beliefs
The statistical possibility P^{<ε}_{y,A} ϕ means that we think a null hypothesis ϕ may be true after a hypothesis test A did not reject ϕ at a significance level ε. Next, we show basic properties of statistical beliefs as follows.

(SBν)
The output of f_A is the p-value of the hypothesis test A on the dataset y; i.e., |= ν_{y,A}(f_A(y)).

(SB4)
If we believe ϕ based on a test A, then we know this statistical belief; i.e., |= K^{ε}_{y,A} ϕ → K K^{ε}_{y,A} ϕ.

(BHT-∧) Let A0 be the conjunctive combination of A1 and A2. If we execute A1 on the dataset y1 and A2 on y2 separately, then we obtain the statistical belief on ϕ1 ∧ ϕ2 with the p-value at most min(f_{A1}(y1), f_{A2}(y2)).
The proof is shown in Appendix B.1.

Belief Hoare Logic for Hypothesis Testing
We introduce belief Hoare logic (BHL) for formalizing and reasoning about statistical inference using hypothesis tests.

Hoare Triples
We define an environment as a pair Γ = (Γ_inv, Γ_obs) consisting of an invisible environment Γ_inv and an observable environment Γ_obs that assign types to invisible variables and to observable variables, respectively. We write Γ |= ϕ if M, w |= ϕ for any model M and any world w that respects the type information in Γ (i.e., the type of w(v) is Γ(v) for any v ∈ Var). Let Env be the set of all possible environments.
A judgment is of the form Γ {ψ} C {ϕ} where Γ ∈ Env, ψ, ϕ ∈ Fml, and C ∈ Prog. Intuitively, this represents that whenever the precondition ψ is satisfied, executing the program C results in satisfying the postcondition ϕ if C terminates.
We say that a judgment Γ {ψ} C {ϕ} is valid iff for any model M and any possible world w, if M, w |= ψ, then M, w′ |= ϕ for all w′ ∈ [[C]](w). A valid judgment Γ {ψ} C {ϕ} expresses the partial correctness of the program C: it respects the precondition ψ and the postcondition ϕ up to the termination of C.

Inference Rules
We define the inference rules for belief Hoare logic (BHL).The rules consist of those for basic command constructs (Figure 2) and for hypothesis tests (Figure 3).
The rules in Figure 2 for the basic constructs are the same as those for a standard imperative programming language; readers are referred to a standard textbook on Hoare logic [4] for details. We add the following remarks on a few rules.
Figure 4: (TWO-HT), (LOW-HT), and (UP-HT) are derived rules for two-tailed, lower-tailed, and upper-tailed tests, respectively, where ϕ_T, ϕ_U, ϕ_L are alternative hypotheses (Section 7.1) and κ_∅ is given in (10). (MULT-∨) is for Bonferroni's method with the disjunctive combination A of A1 and A2. (MULT-∧) is for the conjunctive combination A of A1 and A2.
• In the rules (IF) and (LOOP), the guard condition e is a Boolean expression implicitly used as a logical predicate in the preconditions and the postconditions. Translating a Boolean expression into an ELHT assertion is straightforward.
• The rule (CONSEQ) is used to weaken the precondition and strengthen the postcondition of a triple.The relation Γ |= ϕ is used in this rule.
The rules in Figure 3 are characteristic of BHL. (HIST) describes the properties of an execution of a hypothesis test command f_A on a dataset y. Essentially, this rule states that the precondition is obtained by substituting the p-value f_A(y) for the variable v in the postcondition ψ. (HIST) differs from (UPDVAR) in that an execution of f_A on y also increments the history variable h_{y,A} by 1. Recall that h_{y,A} denotes the number of all executions of f_A on y and is updated only by an execution of f_A on y.
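The effect of a test command v := f_A(y) on a world can be mimicked operationally. The following Python sketch is our own illustration (run_test and the dict/Counter representation of memory and history are not part of the paper's semantics); it shows that a test execution both writes the p-value and increments the history counter h_{y,A}:

```python
from collections import Counter

def run_test(memory: dict, history: Counter, v: str,
             y: str, test_name: str, p_value: float) -> None:
    """Mimic v := f_A(y): store the p-value in v and record the execution
    of test A on dataset y in the history (cf. the history variable h_{y,A})."""
    memory[v] = p_value
    history[(y, test_name)] += 1  # h_{y,A} counts every execution of f_A on y

memory, history = {}, Counter()
run_test(memory, history, "alpha", "y1", "A1", 0.03)
run_test(memory, history, "alpha", "y1", "A1", 0.03)
# The memory only keeps the last p-value, but the history remembers both runs.
print(memory["alpha"], history[("y1", "A1")])
```

Because the history is never decremented, repeated executions cannot be hidden, which is the point exploited in the p-value-hacking analysis of Section 9.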
The rule (PAR) in Figure 3 exchanges the sequential composition C1; C2 with the parallel composition C1 ∥ C2. We recall that in Section 5.1, the restriction upd(C1) ∩ Var(C2) = upd(C2) ∩ Var(C1) = ∅ is imposed to ensure that an execution of C1 does not interfere with that of C2, and vice versa.

Soundness and Relative Completeness
BHL satisfies soundness and relative completeness.
We prove Theorem 1 in Appendix B.4 and Theorem 2 in Appendix B.5.

Derived Rules
In Figure 4, we show useful derived rules for typical forms of hypothesis tests. These can be instantiated to a variety of concrete test methods; see Appendix A.

Derived Rules for Single Hypothesis Tests
The derived rules (TWO-HT), (LOW-HT), and (UP-HT) correspond to two-tailed, lower-tailed, and upper-tailed hypothesis tests, respectively. Recall that the formulas ϕ_L, ϕ_U, and ϕ_T (def= ϕ_L ∨ ϕ_U) denote the alternative hypotheses for the lower-tailed, upper-tailed, and two-tailed tests, respectively (Section 7.1).
The derived rule (TWO-HT) states that we can perform a two-tailed test f_{A^{(T)}} on a dataset y if we have the prior belief Pϕ_L ∧ Pϕ_U that both the lower tail ϕ_L and the upper tail ϕ_U are possible before performing the test (Section 7.2). If the test f_{A^{(T)}} on y returns a p-value α ∈ [0, 1], we obtain the statistical belief K^α_{y,A} ϕ_T on the alternative hypothesis ϕ_T. The derivation of (TWO-HT) applies (CONSEQ) using Proposition 3 (BHT). We remark that the derivation does not use the entailment from ψ to the test's requirement together with Pϕ_L ∧ Pϕ_U. However, for the hypothesis test A to be meaningful, the postcondition must imply the requirement and Pϕ_L ∧ Pϕ_U, as mentioned in Sections 6.5 and 7.2.
If we have the prior belief Pϕ_L ∧ ¬Pϕ_U (resp. ¬Pϕ_L ∧ Pϕ_U) that only the lower tail ϕ_L (resp. upper tail ϕ_U) is possible, then we can apply (LOW-HT) (resp. (UP-HT)) and obtain the statistical belief K^α_{y,A} ϕ_L (resp. K^α_{y,A} ϕ_U). The derivations of (LOW-HT) and (UP-HT) are similar to that of (TWO-HT).

Derived Rules for Multiple Hypothesis Tests
The derived rule (MULT-∨) corresponds to reasoning about two tests A^{(s)}_1 on y1 and A^{(s)}_2 on y2 with a disjunctive alternative hypothesis ϕ1 ∨ ϕ2. As illustrated in Section 3, a typical example is to test whether a drug has better efficacy than at least one of two other drugs. Hereafter, we abbreviate A^{(s)}_1 and A^{(s)}_2 as A1 and A2, respectively. The precondition in (MULT-∨) expresses that we have obtained a statistical belief K^{α1}_{y1,A1} ϕ1 on an alternative hypothesis ϕ1 with a p-value α1. If we obtain an output α2 of the second test A2, we cannot conclude that α2 is the p-value for the disjunction ϕ1 ∨ ϕ2, because the p-value when performing the two tests A1 and A2 simultaneously can be larger than both α1 and α2. This is known as the multiple comparison problem.
In contrast, the derived rule (MULT-∧) formalizes reasoning about multiple tests with a conjunctive alternative hypothesis ϕ1 ∧ ϕ2 (e.g., the program C_drug in Example 2, which tests whether a drug has better efficacy than both other drugs). According to statistics textbooks (e.g., [30]), this does not make the p-value higher; i.e., the p-value is at most min(α1, α2). (MULT-∧) guarantees the correct procedure for conjunctive hypotheses. The derivation of (MULT-∧) applies (CONSEQ) using Proposition 3 (BHT-∧).
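The two bounds can be summarized in a small Python sketch (our own illustration; the function names are assumptions), encoding exactly the p-value bounds stated by (MULT-∨) and (MULT-∧) above:

```python
def disjunctive_p_bound(alpha1: float, alpha2: float) -> float:
    """(MULT-∨): for the alternative φ1 ∨ φ2, the combined p-value can only
    be bounded by the Bonferroni sum α1 + α2 (capped at 1)."""
    return min(1.0, alpha1 + alpha2)

def conjunctive_p_bound(alpha1: float, alpha2: float) -> float:
    """(MULT-∧): for the alternative φ1 ∧ φ2, the combined p-value is at
    most min(α1, α2), as stated in the derived rule."""
    return min(alpha1, alpha2)

# The disjunctive bound exceeds each individual p-value; the conjunctive
# bound does not.
print(round(disjunctive_p_bound(0.03, 0.04), 2))
print(round(conjunctive_p_bound(0.03, 0.04), 2))
```

The asymmetry between the two bounds is precisely why a disjunctive claim needs a Bonferroni-style correction while a conjunctive claim does not.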

Reasoning About Hypothesis Testing Procedures Using BHL
In this section, we apply our framework to reasoning about p-value hacking and multiple comparison problems using BHL.

Reasoning About p-Value Hacking
P-value hacking (a.k.a. data dredging) is a scientifically malignant technique for obtaining a low p-value. A typical example is to conduct hypothesis tests on different datasets, ignore the experiment showing the higher p-value, and report only the lower one.
Our framework can describe and reason about programs for p-value hacking. For example, the following program C_hack conducts a hypothesis test A1 on a dataset y1 and another test A2 on y2, and reports only the lower p-value α while ignoring the higher one. We write ϕ1 and ϕ2 for the alternative hypotheses of the tests A1 and A2, respectively. Let S = {(y1, A1), (y2, A2)}, and let A be the conjunctive combination of A1 and A2.
Based on the discussion of prior knowledge in Section 6.5, we assume that we do not have the prior knowledge that these hypotheses are true or that the datasets did not come from populations satisfying the requirements of the tests. For the reported value α to be an actual p-value, the corresponding formula needs to hold as a postcondition of C_hack; thus, a statistical belief formula must hold at the end of the first line of C_hack due to the rules (UPDVAR) and (IF). By applying (CONSEQ) and the definition of the statistical belief modality, a further formula needs to hold. By assumption (13), this formula implies Pκ_{y1,A1} ∨ Pκ_{y2,A2}. By Proposition 3 (BHκ), we obtain κ_{y1,A1} ∨ κ_{y2,A2}; i.e., only one of the two hypothesis tests has been conducted. However, by applying (PAR) and (HIST) to the first line of C_hack, κ_{{(y1,A1),(y2,A2)}} needs to be satisfied; i.e., both tests must have been conducted. Hence we reach a contradiction. Therefore, we cannot conclude that the reported value α is the actual p-value.
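The harm of C_hack-style reporting can also be seen numerically. The following Monte Carlo sketch is our own illustration (it assumes the standard fact that a p-value is uniformly distributed on [0, 1] under a true null hypothesis with a continuous test statistic); it shows that reporting the minimum of two p-values rejects far more often than the nominal level:

```python
import random

def p_hacking_rejection_rate(alpha: float, trials: int, seed: int = 0) -> float:
    """Simulate C_hack-style reporting under both null hypotheses: run two
    independent tests, report only the smaller p-value, and count how often
    it falls below alpha.

    Under a true null with a continuous statistic, each p-value is uniform
    on [0, 1], so the reported minimum rejects with probability
    1 - (1 - alpha)^2, not alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        p1, p2 = rng.random(), rng.random()  # p-values of A1 on y1 and A2 on y2
        if min(p1, p2) < alpha:  # the "hacked" report
            rejections += 1
    return rejections / trials

rate = p_hacking_rejection_rate(0.05, 20000)
print(rate)  # close to 1 - 0.95**2 = 0.0975, well above the nominal 0.05
```

This matches the logical analysis above: the reported α is not an actual p-value, because its distribution under the nulls is no longer the one a p-value must have.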

Reasoning About Multiple Comparison with Conjunctive Alternative Hypotheses
We illustrate how BHL reasons about the multiple-comparison program C_drug from Example 2. In this example, the derivation of the judgment Γ {ψ_pre} C_drug {ϕ_post} given in (3) guarantees that the hypothesis tests are applied appropriately in the program C_drug.

Reasoning About Multiple Comparison with Disjunctive Alternative Hypotheses
In contrast, the program C_multi def= C12 ∥ C13 in (4) has a disjunctive alternative hypothesis ϕ12 ∨ ϕ13 and thus exhibits a multiple comparison problem. Figure 6 shows the derivation tree for C_multi. Since the alternative hypothesis ϕ12 ∨ ϕ13 is disjunctive, we apply (MULT-∨) to obtain the belief K^{≤α12+α13}(ϕ12 ∨ ϕ13), with a p-value that may be larger than α12 and α13 but is at most α12 + α13.

Discussion
In this section, we provide the whole picture of the justification of statistical beliefs inside and outside BHL. A statistical belief derived in a program relies on the following three issues: (i) the validity of the hypothesis test methods themselves, (ii) the satisfaction of the empirical conditions required by the hypothesis tests, and (iii) the appropriate usage of hypothesis tests in the program. In our framework, these are respectively addressed by (a) the validity of BHL's axioms and rules, (b) the (manual) confirmation of the preconditions in a judgment, and (c) the derivation tree for the judgment.

Validity of Hypothesis Test Methods
The validity of hypothesis test methods is not ensured by mathematics alone. The philosophy of statistics has a long history of debate over the proper interpretation of hypothesis testing. One of the most notable examples is the debate between frequentist and Bayesian statistics, which still leaves many issues open [32].

We also remark that statistical methods occasionally involve approximations of numerical values. Even when an approximation method has a theoretical guarantee, we may need to confirm the validity of its application empirically, e.g., by experiments in the specific situation in which we apply the statistical methods.

For these reasons, we do not attempt to formalize the "justification" of hypothesis test methods within BHL, and leave it for future work. Instead, we define simple axioms that can be instantiated with the hypothesis tests commonly used in practice and explained in textbooks, e.g., [28, 29]. We then focus on the logical aspects of the appropriate usage of hypothesis tests, which has been a long-standing practical concern but has not previously been formalized using symbolic logic.

One advantage of this approach is that we do not adhere to a specific philosophy of statistics: we can model both frequentist and Bayesian statistics by instantiating the derived rules for hypothesis tests (Appendix A).

Clarification of Empirical Conditions
Hypothesis test methods usually assume some empirical conditions on the unknown population from which the dataset is sampled. Typically, many parametric tests require that the population follow a normal distribution. For instance, the Z-test in Example 1 assumes that the population follows a normal distribution with known variance, but this cannot be rigorously confirmed or justified in general.

In some cases, such conditions on the unknown population can be confirmed approximately or partially (i) by exploratory observation of the sampled data and (ii) by prior knowledge of properties of the population (outside the statistical inference). However, there is no general method for rigorously justifying such empirical conditions; their formal justification would require further research in statistics.

In the present paper, the empirical conditions on the unknown population remain assumptions from the viewpoint of formal logic. Hence, we describe empirical conditions as preconditions of a judgment in BHL. Explicitly specifying the preconditions is useful for preventing errors in the choice of statistical methods. Furthermore, when formalizing empirical science in future work, it will be crucial to clarify the empirical conditions that justify scientific conclusions.

Epistemic Aspects of Statistical Inference
One of our contributions is to show that epistemic logic is useful for formalizing statistical inference. Although the outcome of a hypothesis test is knowledge determined by the test action, it may form a false belief; i.e., a rejected null hypothesis may be true, and a retained one may be false. Hence, the formalization of statistical inference deals with both truth and beliefs, for which epistemic logic is suitable.

The key to formalizing statistical beliefs is to introduce a Kripke semantics with a possible world where a null hypothesis is true (Section 6.2). This possible world may not be the real world where we actually apply the hypothesis test. Notably, the p-value in the test is a probability defined in this possible world, not in the real world.

Our Kripke semantics is essential for modeling the appropriate usage of hypothesis tests in the real world. We make a distinction between (i) "ideal" possible worlds where all requirements for the hypothesis tests are satisfied and (ii) the real world where hypothesis tests are actually conducted but their requirements may not be satisfied. Without this distinction, we would deal only with mathematical properties of hypothesis test methods satisfied in "ideal" possible worlds, and could not discuss the appropriateness of the actual application of the hypothesis tests in the real world.

By using this model, we have clarified that statistical beliefs depend on prior beliefs (Section 7.2). By using the possibility modality P, certain requirements for hypothesis tests are formalized as prior beliefs, which may not be true or confirmed in the real world. For example, the choice between two-tailed and one-tailed tests depends on the prior belief about whether both the lower tail and the upper tail are possible before applying the test.

Finally, the update of statistical beliefs by a hypothesis test is modeled using a transition between possible worlds. Since the world records the history of all hypothesis tests, BHL does not allow for hiding any tests to manipulate the statistics (e.g., in p-value hacking and in multiple comparisons in Section 9).

Conclusion
In this work, we proposed a new approach to formalizing and reasoning about statistical inference in programs. Specifically, we introduced belief Hoare logic (BHL) for describing and checking the requirements for applying hypothesis tests appropriately. We proved that this logic is sound and relatively complete w.r.t. the Kripke model for hypothesis tests. Then we showed that BHL is useful for reasoning about practical issues in hypothesis testing. In our framework, we clarified the importance of prior beliefs in acquiring statistical beliefs. We also discussed the whole picture of the justification of statistical inference. We emphasize that this appears to be the first attempt to introduce a program logic for the appropriate application of hypothesis tests.
In ongoing work, we are extending our framework to other kinds of statistical methods. We are also developing a verification tool based on this framework.
supported by JST, PRESTO Grant Number JPMJPR2022, Japan, and by JSPS KAKENHI Grant Number 21K12028, Japan. Tetsuya Sato is supported by JSPS KAKENHI Grant Number 20K19775, Japan. Kohei Suenaga is supported by JST CREST Grant Number JPMJCR2012, Japan.
where p and q are the density functions of D_p and D_q, respectively. The probability distributions p(D_q) and q(D_q) are the push-forward measures of D_q along p and q, respectively. The likelihood function L is defined by L(y|ξ0) = ∏_{i=1}^n q(y_i) and L(y|ξ1) = ∏_{i=1}^n p(y_i), and the test statistic t(y) = L(y|ξ0)/L(y|ξ1) is called the likelihood ratio. In the likelihood ratio test, for a given p-value α and a threshold k such that Pr_{d1,…,dn∼Dq}[ t((d1, …, dn)) ≤ k ] ≤ α, if we have t(y) ≤ k, the likelihood L(y|ξ0) is too small to accept the distribution D_q. We then conclude that the other candidate D_p fits y better (thus this test is lower-tailed). The p-value of this test is given by Pr_{d1,…,dn∼Dq}[ t((d1, …, dn)) ≤ t(y) ]. By instantiating the p-value [[f_{A}(y)]], we obtain (A.2). By applying the derived rule (LOW-HT), we obtain a valid BHL judgment corresponding to the likelihood ratio test. We can deal with Bayesian hypothesis tests in an analogous way.
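As a concrete illustration of the likelihood ratio test above, the following Python sketch is our own (the candidate distributions N(0, 1) and N(2, 1) and all function names are assumptions); it computes the log of the likelihood ratio t(y) for i.i.d. samples, working in log space for numerical stability:

```python
from math import log, pi

def log_normal_pdf(x: float, mu: float, sigma: float = 1.0) -> float:
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * log(2.0 * pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def log_likelihood_ratio(y, mu_null: float = 0.0, mu_alt: float = 2.0) -> float:
    """log t(y) = log L(y | ξ0) - log L(y | ξ1) for i.i.d. samples y,
    with D_q = N(mu_null, 1) under the null and D_p = N(mu_alt, 1) under
    the alternative. A very negative value means D_q fits y poorly."""
    return sum(log_normal_pdf(x, mu_null) - log_normal_pdf(x, mu_alt) for x in y)

# Data near the alternative mean drives the ratio down (reject D_q = N(0, 1));
# data near the null mean drives it up (retain D_q).
print(log_likelihood_ratio([2.0, 2.1, 1.9]), log_likelihood_ratio([0.0, 0.1, -0.1]))
```

Rejecting when t(y) ≤ k corresponds to rejecting when this log-ratio falls below log k, matching the lower-tailed form of the test.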
Example 8 (Bayesian hypothesis test). Consider the Bayesian likelihood ratio test with a dataset y of sample size n, prior distributions D_p, D_q ∈ DR with density functions p′ and q′, and posterior distributions D_{p(z)}, D_{q(z)} ∈ DR with density functions p(−|z) and q(−|z). The goal of this test is to determine whether the dataset y is sampled from D_{q(z)} where z follows D_q. The alternative hypothesis ξ = ξ1 (resp. the null hypothesis ξ = ξ0) is that y is sampled from D_{q(z)} where z follows D_q (resp. from D_{p(z)} where z follows D_p). As with Example 7, this test requires the prior knowledge K(ξ = ξ0 ∨ ξ = ξ1). We first define the following statistical model with the parameter ξ.

Likeliness relation
Unlike the (classical) likelihood ratio test, the test statistic t(y) is the Bayes factor, that is, the ratio of the marginal likelihoods L(y|ξ0) = ∫ q′(z) ∏_{i=1}^n q(y_i|z) dz and L(y|ξ1) = ∫ p′(z) ∏_{i=1}^n p(y_i|z) dz. As with the likelihood ratio test, we obtain a valid BHL judgment for the Bayesian hypothesis test by applying the derived rule (LOW-HT).
We recall that the executions of the single commands skip, v := e, and v := f_A(y) are deterministic; hence the semantic relation [[a]] of each action a is functional, and these semantic functions can be written explicitly. If the memories of the current states of two worlds are identical, execution paths starting at these worlds can be simulated by each other, and can be written explicitly.
Lemma 2. Let w 1 , w 2 be two possible worlds.Suppose that m w1 (v) = m w2 (v) holds for all v ∈ Var(C) ∩ Var obs .If C, w 1 −→ k w 1 for some w 1 , then there are l ∈ N with 0 ≤ l ≤ k and a sequence a 1 , a 2 , . . ., a l of actions such that: • Case C ≡ loop e do C and [[e]] mw = ⊥.We have l = 0 and for all v ∈ It suffices to show that for the first step C, w 1 −→ C , w 1 of execution, We prove this by induction on the inference tree as follows.Recall that by assumption, m w1 (v) = m w2 (v) holds for all v ∈ Var(C) ∩ Var obs .
• • The other cases are shown in a similar way.

Appendix B.3. Remarks on Parallel Compositions
We present some remarks on parallel compositions.We first show that in general, parallel compositions contain sequential compositions.Lemma 3.For any possible world w, we have We first show that this can be an execution of C 1 C 2 starting at the world w.
. By the assumptions of this lemma, we have Second, we show the converse of the above lemma.Let w be a world such that C 1 C 2 , w −→ * w .By Lemma 2, there is a sequence a 1 , . . ., a n of actions such that: Then we can decompose it into executions of C 1 and C 2 in the following sense.
Lemma 5.The sequence a 1 , . . ., a n of actions can be decomposed into two subsequences a Proof.By assumption, there is a k ≥ 0 such that C 1 C 2 , w −→ k w .We prove this lemma by induction on k.If k = 0, 1, the statement holds vacuously.Suppose k = k + 2 for k ≥ 0. We decompose that execution into -Case w = [[a 1 ]](w).We have L of a 2 , . . .a n such that: Since the action a 1 is performed in the program C 1 and upd(C 1 )∩Var Thus, by Lemma 2, we conclude: • Case γ ≡ C 2 , w for some w .By definition, we should have C 1 , w −→ w and C 2 , w −→ * w .Then w = w 1 and m w (v) = m w 1 (v) holds for all v ∈ Var(C 2 ) ∩ Var obs .We have the following two cases.
-Case w 1 = w.We immediately obtain Thus by Lemma 2, we obtain • The other cases are proved in a similar way.
Executions of programs can be nondeterministic due to parallel compositions. However, since two programs composed in parallel do not interfere with each other, their executions result in the same memory and test history.

Lemma 6. For any
Now we define: where the sequence a 1 , . . ., a n is decomposed into the subsequences a L . By induction hypothesis, we have q w1 (w 1 ) |= I ϕ 1 .
Since w 2 is an arbitrary possible world such that (q w (w ), w 2 ) ∈ R, we conclude q w (w ) |= I Kϕ 1 .
The direction from right to left can be proved straightforwardly by Lemma 3.
Proof of (B.8) in Lemma 8. We prove (B.8) by induction on ψ as follows.We define: where a = v := f A (y) and H = H w {m w (y) → {A}}.
Proof of Theorem 1. We obtain the validity of the axioms and rules for the basic constructs in Figure 2 as usual.
We show the validity of (HIST) as follows, recalling the precondition in (HIST). To derive BHL's relative completeness, we show the expressiveness of ELHT.
Proposition 4 (Expressiveness). The assertion language ELHT is expressive; i.e., for every program C and every formula ϕ ∈ Fml, there is a formula F ∈ Fml such that [[F]]_I = wp_I(C, ϕ).
Proof.Let w ∈ W and ϕ ∈ Fml.By the definition of the weakest precondition, it is sufficient to prove that there is a formula F ϕ C ∈ Fml such that: We show this by induction on the program C. The proof is analogous to [4] except for the case of parallel composition.To describe this using the assertion language ELHT, we replace the possible worlds w i (i = 0, . . ., k) with equivalent assertions as follows.Let v = (v 1 , . . ., v l ) be all observable and invisible variables occurring in C or ϕ.Then v ∩ IntVar = ∅.For i = 0, . . ., k and j = 1, . . ., l, let s ij = w i (v j ) and s i = (s i1 , . . ., s il ) ∈ Z l .We write ϕ[ si /v] for the assertion obtained by the simultaneous substitution of s i for v in ϕ.Then each w i can be converted into the equivalent substitution   Finally, we prove the relative completeness of BHL as follows.
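The base case of this kind of induction is the classical weakest precondition for assignment, wp(v := e, ϕ) = ϕ[e/v], computed by backward substitution. The sketch below is only illustrative: it represents assertions as plain expression strings rather than formulas of the paper's assertion language ELHT, and handles only assignments and sequencing.

```python
# Minimal sketch of weakest-precondition computation by backward
# substitution. Assertions are Python expression strings over integer
# variables; this is a hypothetical rendering, not ELHT.

import re

def substitute(assertion, var, expr):
    """Return assertion[expr/var] via whole-word textual substitution."""
    return re.sub(rf"\b{var}\b", f"({expr})", assertion)

def wp(program, post):
    """wp of a list of (var, expr) assignments, applied back to front."""
    pre = post
    for var, expr in reversed(program):
        pre = substitute(pre, var, expr)
    return pre

# wp(x := x + 1; y := x * 2, y > 4) = ((x + 1) * 2) > 4
C = [("x", "x + 1"), ("y", "x * 2")]
assert wp(C, "y > 4") == "((x + 1) * 2) > 4"
```

Sequencing composes as wp(C1; C2, ϕ) = wp(C1, wp(C2, ϕ)), which is why the assignments are processed back to front.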
Proof of Theorem 2. Assume that a judgment Γ ⊢ {ψ} C {ϕ} is valid. Let w be a world such that w |= ψ. By the validity of the judgment, we have w ∈ wp_I(C, ϕ).

By Proposition 4, there exists a formula F^ϕ_C that expresses the weakest precondition; that is:

and R may relate possible worlds. A transition relation w −a→ w′ represents a transition from a world w to another w′ by performing an action a. An observability relation w R w′ represents that two possible worlds w and w′ have the same observation, i.e., obs(w) = obs(w′). Then for any worlds w and w′, w R w′ implies H_w = H_{w′}. In Section 6, this relation is used to model knowledge in the conventional Hintikka style.

Definition 4 (Kripke model). A Kripke model is a tuple M = (W, (−a→)_{a∈Act}, R, (V_w)_{w∈W}) consisting of:
• a non-empty set W of possible worlds;
• for each a ∈ Act, a transition relation −a→ ⊆ W × W;
• an observability relation R = {(w, w′) ∈ W × W | obs(w) = obs(w′)};

as the semantic relation [[a]]. For example, in a transition w −(v:=1)→ w′, an assignment action v := 1 is executed and the resulting state

(Program terms) e ::= v | f(e, . . ., e)
(Commands) c ::= skip | v := e
(Programs) C ::= c | C; C | C ∥ C | if e then C else C | loop e do C

(m, a, H) is the current state w[n − 1] with an assignment m : Var → O ∪ {⊥}, an action a in the last transition in w, and a test history H. For the assignment m of the current state, we define the evaluation [[e]]_m of a program term e inductively by [[v]]_m = m(v) and [[f(e1, . . ., ek)]]_m = [[f]]([[e1]]_m, . . ., [[ek]]_m).
where −→* is the transitive closure of −→. When the program C does not terminate, [[C]](w) = ∅. With the semantic relation [[c]] for single commands c of the programming language Prog, we instantiate the transition relation in the Kripke model M (Definition 4); i.e., we define the transition relation −c→ as the semantic relation [[c]].
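As a concrete (and deliberately simplified) reading of these semantics, the command fragment skip / v := e / C; C can be interpreted as a function from memories to memories. The sketch below is a hypothetical rendering only: it omits test histories, parallel composition, conditionals, loops, and the Kripke structure.

```python
# Minimal interpreter for the skip / assignment / sequencing fragment,
# over a "world" consisting of a memory only (test histories omitted).

import operator

def eval_term(e, m):
    """Evaluate a term: an int literal, a variable name, or (f, args...)."""
    if isinstance(e, int):
        return e
    if isinstance(e, str):
        return m[e]                     # [[v]]_m = m(v)
    f, *args = e                        # [[f(e1,...,ek)]]_m
    return f(*(eval_term(a, m) for a in args))

def exec_cmd(C, m):
    """Execute a command tree, returning the resulting memory."""
    tag = C[0]
    if tag == "skip":                   # skip leaves the memory unchanged
        return m
    if tag == "assign":                 # ("assign", v, e): set m(v) to [[e]]_m
        _, v, e = C
        m2 = dict(m)
        m2[v] = eval_term(e, m)
        return m2
    if tag == "seq":                    # ("seq", C1, C2): run C1 then C2
        _, C1, C2 = C
        return exec_cmd(C2, exec_cmd(C1, m))
    raise ValueError(f"unknown command: {tag}")

prog = ("seq", ("assign", "v", 1),
               ("assign", "w", (operator.add, "v", 2)))
assert exec_cmd(prog, {"v": 0, "w": 0}) == {"v": 1, "w": 3}
```

Nondeterminism only enters the full language through parallel composition, which is why this deterministic fragment can be given as a plain function.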

Figure 1: Rules of execution of programs. The operation on a test history H is defined in (7).
[[u]]^I_m of an assertion term u w.r.t. I and an assignment m : Var → O ∪ {⊥} inductively by [[x]]

Example 4 (Statistical belief in Z-tests). Recall again the two-tailed Z-test for two population means in Example 1. The alternative hypothesis is ϕ def= (µ1 ≠ µ2), and the null hypothesis ¬ϕ is given by µ1 = µ2. As in Example 3, we denote this Z-test by

Figure 5: An outline of the proof for the illustrating program C_drug in Example 2.

• Case γ ≡ C1′ ∥ C2, w′ for some C1′ and w′. By definition, we should have C1, w −→ C1′, w′. Then we have the following two cases.

- Case w′ = w. By the induction hypothesis, we obtain C1′, w −→* w1 and C2, w −→* w2. Then, we also have C1, w −→ C1′, w −→* w1.

Table 1: Hypotheses in the Z-tests.
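For concreteness, the p-value f_A(y) of the two-tailed two-sample Z-test appearing in these examples can be computed as follows. This is standard statistics (assuming known population variances), not code from the paper; the function name and signature are hypothetical.

```python
# Two-tailed two-sample Z-test: H1: mu1 != mu2 against H0: mu1 = mu2,
# assuming known population variances var1 and var2.

from math import erf, sqrt

def z_test_two_tailed(mean1, mean2, var1, var2, n1, n2):
    """Return the p-value of the two-tailed Z-test for two means."""
    z = (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)
    # Standard normal CDF via the error function.
    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
    return 2.0 * (1.0 - Phi(abs(z)))

# Equal sample means give z = 0 and hence p = 1: no evidence against H0.
assert abs(z_test_two_tailed(5.0, 5.0, 1.0, 1.0, 30, 30) - 1.0) < 1e-12

# A clear difference in means yields a small p-value.
assert z_test_two_tailed(6.0, 5.0, 1.0, 1.0, 30, 30) < 0.001
```

The upper-tailed and lower-tailed variants replace the factor 2(1 − Φ(|z|)) with 1 − Φ(z) and Φ(z), respectively.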

Table 3: Prior belief/knowledge in the Z-tests (Example 1), where ϕU and ϕL are respectively the alternative hypotheses of the upper-tailed and lower-tailed Z-tests in Table 2.
For any S ⊆ Var × A, we have |= κS ↔ PκS ↔ KκS.

A proof is shown in Appendix B.1. Next, we present the relationships between hypothesis tests and statistical beliefs.

Proposition 3 (Statistical beliefs by hypothesis tests). Let y, y1, y2 ∈ Var_obs. Let f_A, f_{A1}, and f_{A2} be programs for hypothesis tests A, A1, and A2 with alternative hypotheses ϕ, ϕ1, and ϕ2, respectively. Let S = {(y1, A1), (y2, A2)}.

1. (BHT) If we execute the test A on the dataset y, then we obtain the statistical belief on ϕ with the p-value f_A(y); i.e., |= κ_{y,A} → K^{f_A(y)} ϕ.

3. (BHT-∨) Let A0 be the disjunctive combination of A1 and A2. If we execute A1 on the dataset y1 and A2 on y2 separately, then we obtain the statistical belief on ϕ1 ∨ ϕ2 with the p-value at most f_{A1}(y1) + f_{A2}(y2); i.e., |= κ_S → K^{f_{A1}(y1) + f_{A2}(y2)} (ϕ1 ∨ ϕ2).
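The bound in (BHT-∨) is a union (Bonferroni-style) bound: the significance level attached to the disjunction ϕ1 ∨ ϕ2 is at most the sum of the two separate p-values. A trivial numeric sketch (the p-values and the threshold below are hypothetical, chosen only for illustration):

```python
# Union bound for the disjunctive combination of two tests (BHT-∨):
# the belief in phi1 ∨ phi2 carries p-value at most f_A1(y1) + f_A2(y2).

def combined_disjunctive_p(p1, p2):
    """Upper bound on the p-value for the disjunction phi1 ∨ phi2."""
    return min(1.0, p1 + p2)

p1, p2 = 0.03, 0.01          # hypothetical p-values of the two Z-tests
p = combined_disjunctive_p(p1, p2)
assert abs(p - 0.04) < 1e-9

# At a (hypothetical) significance level of 0.05, the combined belief
# in phi1 ∨ phi2 still clears the threshold.
assert p < 0.05
```

Note that the bound degrades as tests accumulate, which is one reason the logic tracks p-values explicitly rather than a fixed significance level.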