Bayesian Tests of Two Proportions: A Tutorial With R and JASP

The need for a comparison between two proportions (sometimes called an A/B test) often arises in business, psychology, and the analysis of clinical trial data. Here we discuss two Bayesian A/B tests that allow users to monitor the uncertainty about a difference in two proportions as data accumulate over time. We emphasize the advantage of assigning a dependent prior distribution to the proportions (i.e., assigning a prior to the log odds ratio). This dependent-prior approach has been implemented in the open-source statistical software programs R and JASP. Several examples demonstrate how JASP can be used to apply this Bayesian test and interpret the results.

repeatedly peek at interim results and stop data collection as soon as the p-value is smaller than some predefined α-level (Goodson, 2014). However, this practice inflates the Type I error rate and hence invalidates an NHST analysis (Jennison & Turnbull, 1990; Wagenmakers, 2007). Thirdly, standard NHST does not allow users to incorporate detailed expert knowledge. For example, among conversion rate optimization professionals it is widely known that online advertising campaigns often yield minuscule increases in conversion rates (cf. Johnson, Lewis, & Nubbemeyer, 2017; Patel, 2018). Such knowledge may affect NHST planning (i.e., knowledge that the effect is minuscule would necessitate the use of very large sample sizes), but it is unclear how it would affect inference. 1 As we will see below, in the Bayesian framework it is conceptually straightforward to enrich statistical models with expert background knowledge, thereby resulting in more informed statistical analyses (Lindley, 1993).
It should be acknowledged, however, that non-standard (i.e., less popular) forms of frequentist analyses exist that alleviate some of the concerns listed above. For instance, sequential inference can be carried out by the Sequential Probability Ratio Test (Schnuerch & Erdfelder, 2020; Wald, 1945) or by Safe Testing (e.g., Grünwald, de Heide, & Koolen, 2021). In addition, it has been argued that evidence of absence can be obtained by means of an equivalence test, in which the null hypothesis is defined as an effect size that falls outside a region of practical equivalence (e.g., King, 2011; Tango, 1998). An in-depth discussion of the pros and cons of frequentist inference is beyond the scope of this article.

Bayesian Statistics
The limitations of standard frequentist statistics can be overcome by adopting a Bayesian data analysis approach (e.g., Deng, 2015;Kamalbasha & Eugster, 2021;Stucchio, 2015). In Bayesian statistics, probability expresses a degree of knowledge or reasonable belief (Jeffreys, 1961) and in principle Bayesian statistics fulfills all three desiderata listed above (e.g., Wagenmakers et al., 2018). In the next sections we introduce two approaches to Bayesian A/B testing. The two approaches make different assumptions, ask different questions, and therefore provide different answers (cf. Dablander et al., 2022).

The 'Independent Beta Estimation (IBE) Approach'
Let n_A denote the total number of observations and y_A the number of successes for Group A; let n_B and y_B denote the corresponding quantities for Group B. The commonly used Bayesian A/B testing model is specified as follows:

y_A ∼ Binomial(n_A, θ_A)
y_B ∼ Binomial(n_B, θ_B)

This model assumes that y_A and y_B follow independent binomial distributions with success probabilities θ_A and θ_B. These success probabilities are assigned independent beta(α, β) distributions that encode the relative prior plausibility of the values for θ_A and θ_B. In a beta distribution, the α value can be interpreted as a count of hypothetical 'prior successes' and the β value as a count of hypothetical 'prior failures' (Lee & Wagenmakers, 2013):

θ_A ∼ beta(α_A, β_A)
θ_B ∼ beta(α_B, β_B)

Data from the A/B testing experiment update the two independent prior distributions to two independent posterior distributions as dictated by Bayes' rule:

p(θ_A | y_A, n_A) = p(θ_A) × p(y_A, n_A | θ_A) / p(y_A, n_A)
p(θ_B | y_B, n_B) = p(θ_B) × p(y_B, n_B | θ_B) / p(y_B, n_B)

where p(θ_A) and p(θ_B) are the prior distributions and p(y_A, n_A | θ_A) and p(y_B, n_B | θ_B) are the likelihoods of the data given the respective parameters.

1) For example, with the data in hand one may find that p = 0.15, and that the power to detect a minuscule effect was only 0.20. However, power is a pre-data concept and consequently it remains unclear to what extent the observed data affect our knowledge (Wagenmakers et al., 2015). Moreover, the selection of the minuscule effect is often motivated by Bayesian considerations (i.e., it is a value that appears plausible, based on substantive domain knowledge).
Hence, the reallocation of probability from prior to posterior is brought about by the data: the probability increases for parameter values that predict the data well and decreases for parameter values that predict the data poorly (Kruschke, 2013; van Doorn, Matzke, & Wagenmakers, 2020; Wagenmakers, Morey, & Lee, 2016). Note that whenever a beta prior is used and the observed data are binomially distributed, the resulting posterior distribution is also a beta distribution. Specifically, if the data consist of s successes and f failures, the resulting posterior distribution equals beta(α + s, β + f) (Gelman et al., 2013; van Doorn et al., 2020). 2 Ultimately, practitioners are most often interested in the difference δ = θ_B − θ_A between the success rates of the two experimental groups, as this difference indicates whether the experimental condition shows the desired effect.
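To make the beta(α + s, β + f) update concrete, here is a minimal Python sketch of the conjugate update (the counts are hypothetical, starting from uniform beta(1, 1) priors):

```python
def update(alpha, beta, successes, failures):
    """Conjugate beta-binomial update: beta prior + binomial data -> beta posterior."""
    return alpha + successes, beta + failures

# Hypothetical counts: 50/100 successes in one group, 65/100 in the other,
# each starting from a uniform beta(1, 1) prior
a_A, b_A = update(1, 1, 50, 50)   # -> beta(51, 51)
a_B, b_B = update(1, 1, 65, 35)   # -> beta(66, 36)
print(a_A, b_A, a_B, b_B)           # 51 51 66 36
print(round(a_B / (a_B + b_B), 3))  # posterior mean, 66/102 = 0.647
```

Because the update only adds counts, the posterior after seeing the data in two batches is identical to the posterior after seeing all data at once.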
2) When the prior and the posterior belong to the same family of distributions they are said to be conjugate.

Consider a fictitious example inspired by the classic bee experiments of von Frisch (1914). A researcher wishes to test whether honey bees have color vision by comparing the behavior of two groups of bees. The experiment involves a training and a testing phase. In the training phase, the bees in the experimental condition are presented with a blue and a green disc. Only the blue disc is covered with a sugar solution that bees crave. The control group receives no training. In the testing phase, the sugar solution is removed from the blue disc, and the behavior of both groups is observed. If the bees in the experimental condition have learned that only the blue disc contains the appetising sugar solution, and if they can discriminate between blue and green, they should preferentially explore the blue disc instead of the green disc during the testing phase. The researcher finds that in 65 out of 100 times, the bees in the experimental group continued to approach the blue disc after the sugar solution was removed. The bees that were not trained approached the blue disc 50 out of 100 times. In the remainder of this section, we will refer to the bees in the control condition as Group A and to the bees in the experimental condition as Group B. The R file for this fictitious example can be found in the Supplementary Materials.

Before setting up this A/B test and collecting the data, the prior distribution has to be specified so that it represents the relative plausibility of the parameter values. For the present example, the researcher specifies two uninformative (uniform) beta(1,1) priors. After running the A/B test procedure, the priors are updated with the obtained data.
With bayesAB, the calculation of the posterior distributions is done by feeding both the priors and the data to the bayesTest function:

R> library(bayesAB)
R> bees1 <- read.csv2("bees_data1.csv")
R> AB1 <- bayesTest(bees1$y1, bees1$y2,
+    priors = c('alpha' = 1, 'beta' = 1),
+    n_samples = 1e5, distribution = 'bernoulli')

A more detailed explanation of the function and its arguments can be obtained by typing ?bayesTest into the R console. The results can be obtained and visualized by executing:

R> summary(AB1)
R> plot(AB1)

Figure 1 shows the two independent posterior distributions that plot(AB1) returns.
To plot these posterior distributions, bayesTest makes use of the rbeta function, which draws random numbers from a given beta distribution. To obtain each posterior distribution the package first exploits conjugacy: the number of successes s is added to the α value of each group's prior distribution and the number of failures f is added to the respective β value (e.g., Kruschke, 2015; Kurt, 2019). Thus, the posterior distribution for θ_A is beta(α_A + s_A, β_A + f_A) and that for θ_B is beta(α_B + s_B, β_B + f_B). The rbeta function draws random samples from each posterior distribution and the density of these samples is shown in Figure 1. 3 We can see that Group B's posterior distribution for the success probability assigns more mass to higher values of θ. This suggests that the success probability of the trained bees is higher, which in turn implies that bees have color vision.

Figure 1

Independent Posterior Beta Distributions of the Success Probabilities for Groups A and B
Note. The plot is produced by the bayesAB package with the fictitious bee data (i.e., A = 50/100 versus B = 65/100) described in the main text. The analysis used two independent beta(1, 1) priors.
The bayesAB package also returns a posterior distribution for the 'conversion rate uplift', that is, the difference between the success rates expressed as a proportion of θ_A. The advantage of expressing the difference as a proportion of θ_A is that a change from 1% to 2% (i.e., a doubling of the conversion rate) is seen to be much more impressive than a change from 50% to 51%. The associated disadvantage is that a small change can appear more impressive than it really is. The posterior distribution for the conversion rate uplift is computed from the random samples obtained for the two beta posteriors shown in Figure 1. As shown in Figure 2, the posterior distribution for the uplift peaks at around 0.4, indicating that the most likely increase in bee approaches to the blue disc equals 40%. Also, most posterior mass (i.e., 98.4% of the samples) is above zero, indicating that we can be 98.4% certain that Group B approaches the blue disc more often than Group A. Note that this statement assumes that it is a priori equally likely that the training in the experimental condition B had a positive or negative effect on the rate at which the bees approach the blue disc, and that the possibility that both groups have the same approach rate is deemed impossible from the outset; this is an important point to which we will return later.

The posterior probability that θ_B > θ_A can also be obtained analytically (Schmidt & Mørup, 2019). The formula is not implemented in the bayesAB package; our R implementation can be found in the OSF repository (Hoffmann, Hofman, & Wagenmakers, 2022). For the above example, p(θ_B > θ_A | data) = 0.984.

3) The posterior distributions are available analytically, so at this point the rbeta function is not needed; it will become relevant once we start to investigate the posterior distribution for the difference between θ_A and θ_B.
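The same 0.984 can be approximated without the analytic formula by Monte Carlo: draw from each beta posterior and count how often the draw for B exceeds the draw for A. A short Python sketch using only the standard library (the posterior parameters follow from the beta(1,1) priors and the bee counts):

```python
import random

random.seed(2022)  # fixed seed so the sketch is reproducible

# Bee data with beta(1, 1) priors: A = 50/100 -> beta(51, 51); B = 65/100 -> beta(66, 36)
n_samples = 100_000
wins = sum(random.betavariate(66, 36) > random.betavariate(51, 51)
           for _ in range(n_samples))
p_hat = wins / n_samples
print(round(p_hat, 3))  # close to the analytic value of 0.984
```

With 100,000 draws the Monte Carlo error is on the order of 0.001, which is why sampling-based packages such as bayesAB report essentially the analytic answer.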
In fact the entire posterior distribution for the difference between the two independent beta distributions is available analytically (Pham-Gia, Turkkan, & Eng, 1993). The left-hand panel of Figure 3 shows the posterior distribution of the difference between the two independent beta posteriors from the bee example. Unfortunately, the analytic calculation fails for values of α and β above roughly 70, which occur with strong advance knowledge or large sample sizes. 4 In this case, one can instead employ a normal approximation, the result of which is shown in the right-hand panel of Figure 3. 5

Figure 3
Posterior Distributions of the Difference δ = θ_B − θ_A for the Fictitious Bee Data (i.e., A = 50/100 Versus B = 65/100)

Note. The left-hand panel shows the analytic distribution of the difference between two independent beta distributions (Pham-Gia et al., 1993). The right-hand panel shows the normal approximation of the difference between two independent beta distributions.
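The normal approximation needs only the means and variances of the two beta posteriors: δ is then approximately normal with mean m_B − m_A and variance v_A + v_B. A Python sketch for the bee data:

```python
import math

def beta_mean_var(a, b):
    """Mean and variance of a beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

m_A, v_A = beta_mean_var(51, 51)  # group A posterior (50/100, beta(1,1) prior)
m_B, v_B = beta_mean_var(66, 36)  # group B posterior (65/100, beta(1,1) prior)
mu = m_B - m_A                    # approximate posterior mean of delta
sd = math.sqrt(v_A + v_B)         # approximate posterior sd of delta

# P(delta > 0) under the normal approximation, via the error function
p_pos = 0.5 * (1 + math.erf(mu / (sd * math.sqrt(2))))
print(round(mu, 3), round(sd, 3))  # 0.147 0.068
print(round(p_pos, 2))             # about 0.98, in line with the sampling result
```

Unlike the analytic formula of Pham-Gia et al. (1993), this approximation remains numerically stable for arbitrarily large α and β.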
One advantage of the Bayesian approach is that the data can also be added to the analysis in a sequential manner. This means that the evidence can be assessed continually as the data arrive and the analysis can be stopped as soon as the evidence is judged to be compelling (Deng, Lu, & Chen, 2016). As a demonstration, Figure 4 plots the posterior mean of the difference between θ_A and θ_B as well as the 95% highest density interval (HDI) of the difference in a sequential manner. The HDI narrows with increasing sample size, indicating that the range of likely values for δ gradually becomes smaller. After some initial fluctuation, the posterior mean difference between θ_A and θ_B (i.e., the orange line) settles between 0.1 and 0.2. The R code for the sequential computation can be found in the OSF repository (Hoffmann, Hofman, & Wagenmakers, 2022). Note that the results are not analytic: they are based on repeatedly drawing samples. This sampling process introduces variability, such that when the same analysis is executed again on the same data, the outcomes will differ slightly. The numerical variability can be made arbitrarily small by drawing more samples.
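The narrowing of the interval can be illustrated with the normal approximation from above: holding the observed rates fixed at hypothetical values of 0.5 and 0.65, the width of an approximate 95% interval for δ shrinks as the per-group sample size n grows.

```python
import math

def delta_interval_width(n, rate_a=0.5, rate_b=0.65):
    """Width of an approximate 95% interval for delta after n trials per group.
    Hypothetical interim data: observed rates held fixed at rate_a and rate_b."""
    a_A, b_A = 1 + rate_a * n, 1 + (1 - rate_a) * n   # beta(1,1) prior + data
    a_B, b_B = 1 + rate_b * n, 1 + (1 - rate_b) * n
    var = (a_A * b_A / ((a_A + b_A) ** 2 * (a_A + b_A + 1))
           + a_B * b_B / ((a_B + b_B) ** 2 * (a_B + b_B + 1)))
    return 2 * 1.96 * math.sqrt(var)

widths = [round(delta_interval_width(n), 3) for n in (10, 50, 100)]
print(widths)  # the interval narrows as data accumulate
```

The width shrinks roughly with the square root of the sample size, which is why early interim looks are far less informative than later ones.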
4) The reason for this is numerical overflow from Appell's first hypergeometric function.
5) The appendix contains the formulas from Schmidt and Mørup (2019) and Pham-Gia et al. (1993), as well as the formulas for the normal approximation.
In sum, the IBE approach allows practitioners to judge the size and direction of an effect, that is, the difference between the two success probabilities. It is important, however, to recognize the assumptions that come with this approach. In the next section, we will elaborate on these assumptions and their consequences.

Assumptions of the IBE Approach - The IBE approach makes two important assumptions. The first assumption is that the two success probabilities are independent: learning about the success rate of one experimental condition does not affect our knowledge about the success rate of the other condition (Howard, 1998). In practice, this assumption is rarely valid. Howard (1998) explains this with the following example:

    Do English or Scots cattle have a higher proportion of cows infected with a certain virus? Suppose we were informed (before collecting any data) that the proportion of English cows infected was 0.8. With independent uniform priors we would now give ℋ1 (p1 > p2) a probability of 0.8 (…) In very many cases this would not be appropriate. Often we will believe (for example) that if p1 is 80%, p2 will be near 80% as well and will be almost equally likely to be larger or smaller. (We are still assuming it will never be exactly the same.) (p. 363)

The second assumption of the IBE approach is that an effect is always present; that is, training the bees to prefer a certain color may increase or decrease approach rates; it is never the case that the training is completely ineffective. This assumption follows from the fact that a continuous prior does not assign any probability to a specific point value such as δ = 0 (Jeffreys, 1939; Williams, Bååth, & Philipp, 2017; Wrinch & Jeffreys, 1921). Thus, using the IBE approach practitioners can only test whether the alterations in the experimental group yield a positive or a negative effect.
Obtaining evidence in favor of the null hypothesis, which was one of the desiderata listed by Gronau et al. (2021), is not possible with this approach. Hence, the IBE approach does not represent a testing effort, but rather an estimation effort (Jeffreys, 1939). To allow for both hypothesis testing and parameter estimation, a Bayesian A/B testing model has to be able to assign prior mass to the possibility that the difference between the two conditions is exactly zero. It should be acknowledged that this can be achieved when the success probabilities are assigned beta priors (e.g., Günel & Dickey, 1974; Jamil et al., 2017; Jeffreys, 1961); however, here we follow the recent recommendation by Dablander et al. (2022) and adopt an alternative statistical approach.

The Logit Transformation Testing (LTT) Approach
An A/B test model that assigns prior mass to the null hypothesis of no effect was introduced by Kass and Vaidyanathan (1992) and implemented by Gronau et al. (2021). In contrast to the IBE approach, this model assigns a prior distribution to the log odds ratio, thereby accounting for the dependency between the success probabilities of the two experimental groups. The LTT approach is specified as follows:

y_A ∼ Binomial(n_A, θ_A)
y_B ∼ Binomial(n_B, θ_B)
log( θ_A / (1 − θ_A) ) = γ − ψ/2
log( θ_B / (1 − θ_B) ) = γ + ψ/2

As before, this model assumes that y_A and y_B follow binomial distributions with success probabilities θ_A and θ_B. However, the success probabilities are now a function of two parameters, γ and ψ. Parameter γ indicates the grand mean of the log odds, while ψ denotes the distance between the two conditions (i.e., the log odds ratio; Bland & Altman, 2000; Hailpern & Visintainer, 2003). The hypothesis that there is no difference between the two groups can be formulated as a null hypothesis: ℋ0: ψ = 0. Under the alternative hypothesis ℋ1, ψ is assumed to be nonzero. By default, both parameters are assigned normal priors:

ψ ∼ N(μ_ψ, σ_ψ²)
γ ∼ N(μ_γ, σ_γ²)

While the choice of a prior for γ is relatively inconsequential for the comparison between ℋ0 and ℋ1, the choice of a prior for ψ is far-reaching: it determines the predictions of ℋ1 concerning the difference between versions A and B. In other words, ψ is the test-relevant parameter. 6

We consider four hypotheses that may be of interest in practice:

ℋ0: θ_A = θ_B; the success probabilities θ_A and θ_B are identical.
ℋ1: θ_A ≠ θ_B; the success probabilities θ_A and θ_B are not identical.
ℋ+: θ_B > θ_A; the success probability θ_B is larger than the success probability θ_A.
ℋ−: θ_A > θ_B; the success probability θ_A is larger than the success probability θ_B.

By comparing these hypotheses, practitioners may obtain answers to the following questions:

1. Is there a difference between the success probabilities, or are they the same? This requires a comparison between ℋ1 and ℋ0.
2. Does group B have a higher success probability than group A, or are the probabilities the same? This requires a comparison between ℋ+ and ℋ0.
3. Does group A have a higher success probability than group B, or are the probabilities the same? This requires a comparison between ℋ− and ℋ0.
4. Does group B have a higher success probability than group A, or does group A have a higher success probability than group B? This is the question that is also addressed by the IBE approach discussed earlier, and it requires a comparison between ℋ+ and ℋ−.
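The roles of γ and ψ can be made concrete with a small sketch, assuming the logistic parameterization used by the abtest package (θ_A and θ_B are the inverse logit of the grand mean minus and plus half the log odds ratio):

```python
import math

def logistic(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def success_probs(gamma, psi):
    """Map grand mean gamma and log odds ratio psi to the two success probabilities."""
    return logistic(gamma - psi / 2), logistic(gamma + psi / 2)

# Under H0 (psi = 0) the two probabilities coincide
t_A, t_B = success_probs(0.3, 0.0)
print(round(t_A, 4), round(t_B, 4))

# A nonzero psi separates them, and the log odds ratio is recovered exactly
t_A, t_B = success_probs(0.3, 0.6)
log_or = math.log(t_B / (1 - t_B)) - math.log(t_A / (1 - t_A))
print(round(log_or, 6))  # 0.6
```

This makes explicit why a prior on ψ induces a dependent prior on (θ_A, θ_B): both probabilities move together with γ, and only ψ controls their separation.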
To quantify the evidence that the observed data provide for and against the hypotheses we compare the models' predictive performance. 7 For two models, say ℋ0 and ℋ+, the ratio of their average likelihoods for the observed data is known as the Bayes factor (Jeffreys, 1939; Kass & Raftery, 1995; Wagenmakers et al., 2018):

BF_{+0} = p(data | ℋ+) / p(data | ℋ0),

where BF_{+0} indicates the extent to which ℋ+ outpredicts ℋ0. The evidence from the data is expressed in the Bayes factor, but to compare two hypotheses in their entirety, the a priori plausibility of the hypotheses needs to be considered as well. Bayes' rule describes how we can use the Bayes factor to update the relative plausibility of the two competing models after having seen the data (Kass & Raftery, 1995; Wrinch & Jeffreys, 1921):

p(ℋ+ | data) / p(ℋ0 | data) = [ p(data | ℋ+) / p(data | ℋ0) ] × [ p(ℋ+) / p(ℋ0) ],

that is, posterior odds = Bayes factor × prior odds. The prior odds quantify the plausibility of the hypotheses before seeing the data, while the posterior odds quantify the plausibility of the two hypotheses after taking the data into account (Wagenmakers et al., 2018). The Bayes factor is the evidence: the change from prior to posterior plausibility brought about by the data.

6) Note that the overall prior distribution for ψ can be considered a mixture between a 'spike' at 0 coming from ℋ0 and a Normal 'slab' coming from ℋ1 (e.g., van den Bergh et al., 2021).

7) We use the terms 'model' and 'hypothesis' interchangeably.
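The odds-updating rule is plain arithmetic. For example, with equal prior probabilities and a Bayes factor of 4.7 (the rounded value obtained for the bee data later in this article):

```python
# Updating prior odds with a Bayes factor
bf_plus0 = 4.7       # BF for H+ over H0 (rounded value from the bee example)
prior_odds = 1.0     # equal prior probabilities: p(H+) = p(H0) = 0.5
posterior_odds = bf_plus0 * prior_odds
posterior_prob = posterior_odds / (1 + posterior_odds)
print(round(posterior_prob, 3))  # 0.825 (the article's 0.826 uses the unrounded BF)
```

With equal prior odds, the posterior probability is simply BF / (1 + BF), which is why a Bayes factor near 5 translates into a posterior probability of roughly 0.83.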

Implementation of the LTT Approach in R and JASP -
To demonstrate the analyses with the LTT approach we can use the abtest package (version 1.0.1, Gronau, 2019) in R (R Core Team, 2020). The functionality of this package has also been implemented in JASP (JASP Team, 2020). Below we first discuss the R code and then turn to the JASP implementation. Note that the same analysis can also be performed with other, more general software for Bayesian inference, such as JAGS (Plummer, 2003) or Stan (Carpenter et al., 2017). It is recommended that a hypothesis be specified before setting up the A/B test (McFarland, 2012). For the previous example, it can be assumed that the researcher hypothesized that bees may indeed have color vision and that the bees in Group B would therefore approach the blue disc relatively frequently during the testing phase. Hence, from a Bayesian perspective, we may want to compare the directional hypothesis ℋ+ (i.e., bees in Group B approach the blue disc more often than bees in Group A) against the null hypothesis ℋ0 (i.e., there is no difference in the approach rate between Groups A and B). To prevent the undesirable impact of hindsight bias it is likewise recommended to specify the prior distribution for the log odds ratio ψ under ℋ+ before having inspected the data.
For illustrative purposes we assume that in the present example there is little prior knowledge, which motivates the specification of an uninformed standard normal prior distribution, ℋ1: ψ ∼ N(0, 1), which is also the default in the abtest package. With the hypotheses of interest specified and the prior distributions assigned to the test-relevant parameter, we are almost ready to execute the Bayesian hypothesis test using the ab_test function. This function requires the data, parameter priors, and prior model probabilities. For the present example, we set the prior probabilities of ℋ+ and ℋ0 equal to 0.5 and we assign the grand mean parameter γ a relatively uninformative standard normal prior distribution:

R> library(abtest)
R> bees2 <- as.list(read.csv2("bees_data2.csv")[-1, -1])
R> prior_prob <- c(0, 0.5, 0, 0.5)
R> names(prior_prob) <- c("H1", "H+", "H-", "H0")
R> AB2 <- ab_test(data = bees2, prior_par = list(mu_psi = 0,
+    sigma_psi = 1, mu_beta = 0, sigma_beta = 1),
+    prior_prob = prior_prob)

As shown in the code above, the standard normal prior on ψ is specified by assigning values for mu_psi and sigma_psi to the prior_par argument of the ab_test function. The prior model probabilities are specified by feeding a vector which specifies the probability for the hypotheses ℋ1, ℋ+, ℋ−, and ℋ0 to the prior_prob argument.
The ab_test function then returns the Bayes factors and the prior and posterior probabilities of the hypotheses. A more detailed explanation of the function and its arguments can be obtained by typing ?ab_test into the R console. For the bee example, the Bayes factor BF+0 equals 4.7, meaning that the data are approximately 5 times more likely under the alternative hypothesis ℋ+ than under the null hypothesis ℋ0. A Bayes factor of about 5 is generally considered moderate evidence (e.g., Jeffreys, 1939; Lee & Wagenmakers, 2013). The robustness of this conclusion can be explored by changing the prior distribution on ψ (i.e., by varying the mean and standard deviation of the normal prior distribution) and observing the effect on the Bayes factor. Figure 5 visualizes the robustness of the Bayes factor across a range of values for μ_ψ and σ_ψ. The Bayes factor is highest for low σ_ψ values and μ_ψ ≈ 0.6. The heatmap shows that our conclusion regarding the evidence for ℋ+ over ℋ0 is relatively robust. The plot can be produced with:

R> plot_robustness(AB2, mu_range = c(0, 2), sigma_range = c(0.1, 1),
+    bftype = "BF+0")

A sequential analysis tracks the evidence in chronological order. Figure 6 shows how the posterior probability of either hypothesis unfolds as the observations accumulate. The figure indicates that after some initial fluctuations, and a tie after about 90 observations, the last 110 observations cause the probability of the alternative hypothesis to increase steadily until it reaches its final value of 0.826. Because we consider only two hypotheses, the probability of ℋ+ is the complement of that of ℋ0. The sequential analysis can be obtained as follows:

R> plot_sequential(AB2)

Figure 6 also visualizes the prior and posterior probabilities of the hypotheses as a probability wheel. The probability of ℋ+ has increased from 0.5 to 0.826 while the posterior plausibility of ℋ0 has correspondingly decreased from 0.5 to 0.174.
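As a rough cross-check of BF+0 ≈ 4.7, the two marginal likelihoods can be approximated by brute-force grid integration over γ and ψ. This is only a sketch, not the sampling scheme abtest actually uses, and the grid ranges and resolution below are assumptions:

```python
import math

def loglik(gamma, psi, y_a=50, n_a=100, y_b=65, n_b=100):
    """Binomial log-likelihood under the logit parameterization (bee data)."""
    t_a = 1.0 / (1.0 + math.exp(-(gamma - psi / 2)))
    t_b = 1.0 / (1.0 + math.exp(-(gamma + psi / 2)))
    return (y_a * math.log(t_a) + (n_a - y_a) * math.log(1 - t_a)
            + y_b * math.log(t_b) + (n_b - y_b) * math.log(1 - t_b))

def normpdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

dg = dp = 0.02                            # grid resolution (assumed fine enough)
gammas = [-3 + i * dg for i in range(301)]
psis = [j * dp for j in range(1, 201)]    # psi > 0 under H+

# Marginal likelihood under H0 (psi fixed at 0), gamma ~ N(0, 1)
m_0 = sum(math.exp(loglik(g, 0.0)) * normpdf(g) * dg for g in gammas)

# Marginal likelihood under H+: half-normal prior on psi (density 2*N(0,1) for psi > 0)
m_plus = sum(math.exp(loglik(g, p)) * normpdf(g) * 2 * normpdf(p) * dg * dp
             for g in gammas for p in psis)

bf_plus0 = m_plus / m_0
print(round(bf_plus0, 1))  # should land in the vicinity of the reported 4.7
```

Binning errors and the truncated grid mean the result will not match abtest to the last digit, but it shows that the Bayes factor is nothing more mysterious than a ratio of two averaged likelihoods.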

Figure 6
Having collected evidence for the hypothesis that trained bees prefer the blue disc more than untrained bees do, one might wish to quantify the size of this difference in preference. To do so, we switch from a testing framework to an estimation framework. For this purpose, we adopt the two-sided model ℋ1 and use Bayes' rule to obtain the posterior distribution for the log odds ratio. Figure 7 shows the result, as produced via the plot_posterior function. The dotted line in Figure 7 displays the prior distribution, the solid line displays the posterior distribution (with a 95% central credible interval [CI]), and the posterior median and 95% CI are displayed on top. For our fictitious bee example, Figure 7 indicates that the log odds ratio is 95% probable to lie between 0.024 and 1.111. It is important to realize that this inference is conditional on ℋ1 (van den Bergh et al., 2021), which features a prior distribution that makes two strong assumptions: (1) the effect is either positive or negative, but never zero; (2) a priori, effects are just as likely to be positive as they are to be negative (i.e., the prior distribution for the log odds ratio is symmetric around zero).

The abtest R package has also been implemented in JASP, allowing teachers, students, and researchers to obtain the above results with a graphical user interface. A screenshot is provided in Figure 8. There are two ways in which the abtest functionality can be activated in JASP. The first method, shown in Figure 8, is to activate the Summary Statistics module using the blue '+' sign in the top right corner. Clicking the Summary Statistics icon on the ribbon and selecting Frequencies → Bayesian A/B Test brings up the interface shown in Figure 8. Using the Summary Statistics module, users only need to enter the total number of successes and the sample sizes in the two groups. As shown in Figure 8, the input panel offers similar functionality to the abtest R package.
The slight difference in outcomes is due to the fact that the results for the directional hypotheses ℋ+ and ℋ− involve importance sampling. The second method to activate the abtest functionality in JASP is to store the results in a data file, open it in JASP, click the Frequencies icon on the ribbon, and then select Bayesian → A/B Test. When the data file contains the intermediate results, this second method allows users to conduct a sequential analysis such as the one shown in Figure 6.
To showcase the different approaches to Bayesian A/B testing we now apply the methodology to two example data sets. 8 The first example data set features real data collected on the 'Rekentuin' online learning platform, and the second example data set features fictitious data constructed to be representative of online webshop experiments (i.e., relatively small effect sizes and relatively high sample sizes).

Example I: The Rekentuin

The Rekentuin A/B Experiment
Rekentuin (Dutch for 'math garden') is a tutoring website where children can practice their arithmetic skills by playing adaptive online games. The Rekentuin website is visited by Dutch elementary school children between the ages of 4 and 12. During the testing interval from the 22nd of January 2019 to the 5th of February 2019, a total of 15,322 children were active on Rekentuin.

Figure 9

Screenshots from the Rekentuin Web Environment
Note. The left-hand panel shows a screenshot of a Rekentuin landing page. The page shows that the child has earned three crowns for the category 'optellen' (Dutch for 'addition'). The right-hand panel shows a screenshot of an addition problem in the Rekentuin web environment. The coins at stake are displayed in the bottom right corner.
The left-hand panel of Figure 9 shows a screenshot of a Rekentuin landing page. In Rekentuin, children earn coins by quickly solving simple arithmetic problems that are organized into different classes (e.g., addition, subtraction, division, etc.). An example of an addition problem is shown in the right-hand panel of Figure 9, with the coins at stake shown in the bottom right corner. The children can use the coins they have gained to buy virtual trophies (not shown). The better a given child performs, the more trophies they are able to add to their trophy cabinet. The prospect of earning trophies motivates the children to participate and perform well (for details see Brinkhuis et al., 2018; Klinkenberg, Straatemeier, & van der Maas, 2011). On the Rekentuin landing page, the plant growth near each class of arithmetic problem indicates the extent to which that class was recently practiced; practice makes the plants grow, whereas periods of inactivity make the plants wither away.
In 2019, the developers of Rekentuin faced the challenge that many children would preferentially engage with the class of arithmetic problems that they had already mastered (e.g., addition), which is a sensible strategy if the goal is to maximize the number of coins gained. To incentivize the children to practice other classes of arithmetic problems (e.g., subtraction) the developers implemented a 'crown' for the types of games that the children had already mastered (see Figure 9, left-hand panel). Children could gain up to three crowns for each type of game. Thus, in order to obtain more crowns, children had to engage more frequently with the types of games they had played less often. However, the crowns did not have the desired effect: instead of decreasing the playtime on crown games, the playtime on crown games actually increased.
To induce the children to play other games, the Rekentuin developers constructed a less subtle manipulation: they removed the virtual reward (i.e., the coins) from the crown games. To test the effectiveness of this manipulation, the Rekentuin developers designed an A/B test. Half of the children continued playing on an unchanged website (Version A), whereas the other half could no longer earn coins for crown games (Version B). The children playing Version B were not notified of the change but had to discover the changes for themselves.
The question of interest is whether changing the incentive structure for crown games (i.e., removing the coins) had the desired effect. To address this question we analyzed the Rekentuin data set using the two Bayesian A/B testing approaches outlined earlier.

Method Preregistration
The data were collected by Abe Hofman and colleagues on the Rekentuin website in 2019. All intended analyses were applied to synthetic data and the associated analysis scripts were stored on a repository at the OSF. We did not inspect the data before the preregistration was finalized. All preregistration materials as well as the real data are available in the Supplementary Materials section.

Data Preprocessing
Our analysis concerns the last game that each child played during the testing interval: was it a crown game or not? By examining only the last game we obtain a binary variable (required for the present A/B test) and also allow children the maximum opportunity to experience that crown games no longer yield coins.
We excluded children from the analyses according to two criteria. Firstly, we excluded 8573 children who did not play any crown game during the time of testing because they could not have experienced the experimental manipulation in Version B. Secondly, we excluded 350 children who only played one crown game and it was their last game, because for these children we cannot observe the potential influence of the manipulation on their playing behavior. In total, we therefore excluded 8923 children.

Descriptives
The Rekentuin data are summarized in Table 1, which indicates the number of children who played a crown game or a non-crown game as their last game. In the control condition, 2272 out of 3178 children (≈ 71.5%) played a non-crown game as their last game; in the treatment condition, with the coins for crown games removed, this was the case for 2596 out of 3221 children (≈ 80.6%). It appears the manipulation had a large effect. We now quantify the statistical evidence using the Bayesian A/B test.

Note. Children in Version B (no coins available in crown games) played more non-crown games.

Rekentuin A/B Test: The IBE Approach
As before, in the IBE approach we assigned two uninformed beta(1, 1) distributions to the success probabilities of Versions A and B. 9 Figure 10 displays the resulting two independent posterior distributions.
9) Researchers with access to pre-intervention data could instead consider using an informed prior distribution, although there is always a risk that the pre-intervention data differ from the post-intervention data on some unknown dimension.

Figure 10
Note. Version A corresponds to the unchanged Rekentuin website. Version B denotes the Rekentuin version where the children could not earn coins for crown games. The plot is produced by the bayesAB package.
Consistent with the intuitive impression from Table 1, virtually all of the posterior mass for Version B lies at higher values of θ_non-crown than that for Version A. This suggests that the success probability of the modified Rekentuin version is higher, and that removing the coins from the crown games had a positive impact on the number of non-crown games played. Figure 11 shows the conversion rate uplift. The distribution peaks at around 0.12, indicating that the most likely conversion increase equals 12%. Moreover, all posterior mass (i.e., 100% of the samples) lies above zero. In other words, we can be relatively certain that Version B is better than Version A. Note that this statement assumes that it is a priori equally likely that the alterations in Version B had a positive or negative effect on the rate at which the children played non-crown games.
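The bayesAB computations are straightforward to reproduce outside R. The following Python sketch (a re-implementation for illustration, not the package itself) draws from the two beta posteriors implied by the Table 1 counts and approximates both P(θ^B > θ^A) and the uplift distribution:

```python
import numpy as np

rng = np.random.default_rng(2021)

# Posterior under beta(1, 1) priors: beta(successes + 1, failures + 1).
# "Success" = the child's last game was a non-crown game (Table 1).
theta_a = rng.beta(2272 + 1, 906 + 1, size=1_000_000)  # Version A: 2272/3178
theta_b = rng.beta(2596 + 1, 625 + 1, size=1_000_000)  # Version B: 2596/3221

p_b_better = np.mean(theta_b > theta_a)   # posterior P(theta_B > theta_A)
uplift = (theta_b - theta_a) / theta_a    # conversion rate uplift samples

print(round(p_b_better, 4))        # ~1.0: virtually all mass favors Version B
print(round(np.mean(uplift), 3))   # ~0.127, consistent with the peak near 0.12
```

Sampling from the two independent beta distributions is exactly what makes the IBE approach easy to extend to derived quantities such as the uplift.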
In addition to the bayesAB package output, we computed the posterior probability of the event θ^B_non-crown > θ^A_non-crown using the formula reported by Schmidt and Mørup (2019). For the Rekentuin data, p(θ^B > θ^A) ≈ 1. An exact calculation of the posterior distribution of the difference δ = θ^B_non-crown − θ^A_non-crown between the two independent beta distributions fails because the data set is too large; we therefore calculated δ using the normal approximation. Figure 12 plots the resulting probability distribution of the difference δ. The distribution is very narrow and peaks at around 0.09.
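The normal approximation amounts to matching the first two moments of each beta posterior. A minimal Python sketch, using the Table 1 counts (the helper function is ours):

```python
from math import sqrt

def beta_moments(successes, failures):
    """Mean and variance of a beta(successes + 1, failures + 1) posterior."""
    a, b = successes + 1, failures + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

m_a, v_a = beta_moments(2272, 906)   # Version A (control)
m_b, v_b = beta_moments(2596, 625)   # Version B (coins removed)

# Normal approximation to delta = theta_B - theta_A
delta_mean = m_b - m_a
delta_sd = sqrt(v_a + v_b)
lo = delta_mean - 1.96 * delta_sd
hi = delta_mean + 1.96 * delta_sd

print(round(delta_mean, 3))         # ~0.091, the peak of the distribution
print(round(lo, 3), round(hi, 3))   # ~0.070 and 0.112
```

The resulting interval matches the range of likely values for δ reported below for the sequential analysis.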

Figure 11
Conversion Rate Uplift for the Rekentuin Data
Note. The conversion rate indicates the proportion of children that played a non-crown game. The uplift is calculated by dividing the difference in conversion by the conversion in A. The plot is produced by the bayesAB package.

Figure 12
Posterior Distribution of the Difference δ = θ^B_non-crown − θ^A_non-crown for the Proportion of Non-Crown Games Between the Two Rekentuin Website Versions
Note. Children in Version B (the modified website version) played more non-crown games compared to children playing on website Version A.

Figure 13
Sequential Analysis of the Difference Between the Success Probabilities (i.e., θ^B_non-crown − θ^A_non-crown) of the Two Rekentuin Versions
Note. The orange line plots the posterior mean of the difference. The grey area visualizes the width of the highest density interval as a function of sample size n.

Figure 13 plots the sequential analysis of the posterior mean of the difference between θ^A_non-crown and θ^B_non-crown, as well as the 95% HDI of the difference. After some initial fluctuation, the posterior mean difference settles at approximately 0.09, while the HDI becomes more narrow with increasing sample size. The range of likely values for δ eventually spans approximately 0.071 to 0.112.
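A sequential display like Figure 13 can be mimicked by recomputing the approximate 95% interval for δ at increasing sample sizes. The Python sketch below is illustrative only: the real arrival order of the children is not reproduced here, so Bernoulli streams are simulated at the observed rates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: the children's arrival order is not available here, so we
# simulate Bernoulli streams at the observed non-crown rates from Table 1.
n_total = 3000
y_a = rng.binomial(1, 0.715, n_total)   # Version A, observed rate ~0.715
y_b = rng.binomial(1, 0.806, n_total)   # Version B, observed rate ~0.806

def delta_interval(s_a, s_b, n):
    """95% interval for delta under beta(1, 1) priors (normal approximation)."""
    def moments(s, n):
        a, b = s + 1, n - s + 1
        return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))
    (m_a, v_a), (m_b, v_b) = moments(s_a, n), moments(s_b, n)
    centre, half = m_b - m_a, 1.96 * np.sqrt(v_a + v_b)
    return centre - half, centre + half

widths = []
for n in (100, 500, 3000):
    lo, hi = delta_interval(int(y_a[:n].sum()), int(y_b[:n].sum()), n)
    widths.append(hi - lo)

print([round(w, 3) for w in widths])  # the interval narrows roughly as 1/sqrt(n)
```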

Rekentuin A/B Test: The LTT Approach
For the LTT approach, we compare the directional hypothesis ℋ+ (i.e., children in Version B play more non-crown games than children in Version A) against the null hypothesis ℋ0 (i.e., the proportion of non-crown games played does not differ between Versions A and B). We employed a truncated normal distribution with μ = 0 and σ² = 1 under the alternative hypothesis, as there is a range of parameter values that seem plausible (see, for example, Cameron, Banko, & Pierce, 2001; Tang & Hall, 1995). In particular, it is plausible that removing the coins from the crown games results in a marked change.
The observed sample proportions of 0.806 for Version B and 0.715 for Version A suggest that the children in Version B played more non-crown games than those in Version A. The Bayes factor BF+0 that assesses the evidence in favor of our hypothesis that the children in Version B played more non-crown games equals 7.944e+14. This means that the data are about 800 trillion times more likely to occur under the alternative hypothesis ℋ+ than under the null hypothesis ℋ0. In sum, the Bayes factor indicates overwhelming evidence for the alternative hypothesis (e.g., Jeffreys, 1939; Lee & Wagenmakers, 2013).
Figure 14 visualizes the dependency of the Bayes factor on the prior distribution for ψ by varying the mean μ_ψ and standard deviation σ_ψ of the normal prior distribution. The heatmap shows that the Bayes factor is robust: the data indicate extreme evidence across a range of different values for the prior distribution on ψ.
Figure 15 tracks the evidence for either hypothesis in chronological order. After about 800 observations, the evidence for ℋ+ is overwhelming. The posterior probabilities of the hypotheses are also shown as a probability wheel at the top of Figure 15. The green area visualizes the posterior probability of the alternative hypothesis and the grey area that of the null hypothesis. The data have increased the plausibility of ℋ+ from 0.5 to almost 1, while the posterior plausibility of the null hypothesis ℋ0 has correspondingly decreased from 0.5 to almost 0.
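The abtest package obtains BF+0 by numerical integration; a rough cross-check is possible with a Savage-Dickey density ratio computed under a normal approximation to the likelihood of the log odds ratio. The Python sketch below is our own approximation (not the abtest algorithm) and recovers the order of magnitude of the reported Bayes factor:

```python
from math import log, sqrt
from statistics import NormalDist

z = NormalDist()

# Observed counts (Table 1): success = last game was a non-crown game
a_s, a_f = 2272, 906   # Version A
b_s, b_f = 2596, 625   # Version B

# Sample log odds ratio and its Wald standard error
psi_hat = log((b_s * a_f) / (b_f * a_s))
se = sqrt(1 / a_s + 1 / a_f + 1 / b_s + 1 / b_f)

# Conjugate normal update: N(0, 1) prior on psi combined with the
# normal likelihood approximation N(psi_hat, se^2)
post_prec = 1 + 1 / se ** 2
post_mean = (psi_hat / se ** 2) / post_prec
post_sd = 1 / sqrt(post_prec)

# Savage-Dickey for the one-sided test: ratio of (truncated) prior to
# posterior density at psi = 0
prior_at_0 = 2 * z.pdf(0)                          # half-normal density at 0
post_at_0 = z.pdf(post_mean / post_sd) / post_sd   # posterior mass below 0 is negligible
bf_plus_0 = prior_at_0 / post_at_0

print(f"{bf_plus_0:.3e}")  # same order of magnitude as the reported 7.944e+14
```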

Figure 15
The Flow of Posterior Probability for H0 and H+ as a Function of the Number of Observations Across Both Rekentuin Versions
Note. The prior and posterior probabilities of the hypotheses are displayed on top.
In sum, the evidence in favor of the alternative hypothesis is overwhelming. To complete the picture, we quantified the difference between the two Rekentuin versions by estimating the size of the log odds ratio. Figure 16 shows the prior and posterior distribution for the log odds ratio under the two-sided model ℋ1. The dotted line displays the prior distribution and the solid line displays the posterior distribution (with 95% central CI). The plot indicates that, given that the log odds ratio is not exactly zero, there is a 95% probability that it lies between 0.386 and 0.622, with a posterior median of 0.504. In R and in JASP, this prior and posterior plot may also be shown on a different scale: as an odds ratio, relative risk, absolute risk, and individual proportions.
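Because the sample is large and the prior is weak, the reported posterior median and credible interval are close to the sample log odds ratio and its Wald interval. A quick Python check using the Table 1 counts:

```python
from math import log, sqrt, exp

# Counts from Table 1 (success = non-crown game as last game)
a_s, a_f = 2272, 906   # Version A
b_s, b_f = 2596, 625   # Version B

log_or = log((b_s * a_f) / (b_f * a_s))             # sample log odds ratio
se = sqrt(1 / a_s + 1 / a_f + 1 / b_s + 1 / b_f)    # Wald standard error

print(round(log_or, 3))                              # ~0.505 (reported median: 0.504)
print(round(log_or - 1.96 * se, 3),
      round(log_or + 1.96 * se, 3))                  # ~0.388, 0.621 (reported: 0.386, 0.622)
print(round(exp(log_or), 2))                         # odds ratio ~1.66
```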

Figure 16
Prior and Posterior Distribution of the Log Odds Ratio Under H1 for the Rekentuin Data Set
Note. The median and the 95% credible interval of the posterior density for the Rekentuin data are shown in the top right corner.

Example II: The Fictional Webshop
The Rekentuin manipulation directly targeted children's motivation to play the games. Common A/B tests for web development purposes implement more subtle manipulations that result in much smaller effect sizes. In this section we analyze such a case. Consider the following fictitious scenario: an online marketing team seeks to improve the click rate on a call-to-action button on their website's landing page. To this end, they devise an A/B test. Half of the website visitors read 'Try our new product!' (Version A), and the other half reads 'Test our new product!' (Version B). 10 The success of the website versions is measured by the rate at which website visitors click on the call-to-action button.
To demonstrate the analyses we use synthetic data. The corresponding R code can be found in the Supplementary Materials. Table 2 provides the number of clicks in each group. The conversion rate equals 1131/10000 = 0.1131 in Version A and 1275/10000 = 0.1275 in Version B. The company now wishes to determine whether and to what extent the observed sample difference in proportions translates to the population.
10) This example was inspired by a real conversion rate optimization project at https://blog.optimizely.com/2011/06/08/optimizely-increases-homepage-conversion-rate-by-29/

Note. Visitors confronted with Version B clicked the call-to-action button more often than those confronted with Version A.

The IBE Approach
We again use the bayesAB package in R to analyze the data according to the IBE approach using the default independent beta(1, 1) distributions on θ A and θ B (Portman, 2017;R Core Team, 2020). The left-hand panel of Figure 17 illustrates the two independent posterior distributions.
We can see that Version B's posterior distribution for the success probability assigns more mass to higher values of θ. This suggests that the click-through rate for Version B's message 'Test our new product!' is higher than that for Version A's message 'Try our new product!'. The right-hand panel of Figure 17 depicts the conversion rate uplift. The posterior distribution for the uplift peaks at around 0.125, indicating that the most likely conversion increase equals 12.5%. Also, most posterior mass (i.e., 99.9% of the samples) lies above zero, indicating that we can be 99.9% certain that Version B is better than Version A rather than the other way around.
The analytically calculated posterior probability of the event θ^B > θ^A equals 0.999 (Schmidt & Mørup, 2019). Figure 18 shows the posterior distribution of the difference δ, which we again calculated with the normal approximation because the exact computation is infeasible for a data set of this size. The distribution peaks at 0.014. A difference of this size is relatively large for a conversion rate optimization endeavor (Browne & Jones, 2017).
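Both reported numbers follow from a normal approximation to the two beta posteriors implied by the Table 2 counts; a short Python sketch (the helper function is ours):

```python
from math import sqrt
from statistics import NormalDist

def beta_moments(successes, failures):
    """Mean and variance of a beta(successes + 1, failures + 1) posterior."""
    a, b = successes + 1, failures + 1
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

m_a, v_a = beta_moments(1131, 8869)   # Version A: 'Try our new product!'
m_b, v_b = beta_moments(1275, 8725)   # Version B: 'Test our new product!'

delta_mean = m_b - m_a
delta_sd = sqrt(v_a + v_b)
p_b_better = NormalDist().cdf(delta_mean / delta_sd)   # P(theta_B > theta_A)

print(round(delta_mean, 3))   # ~0.014, the peak of the difference distribution
print(round(p_b_better, 3))   # ~0.999
```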

Figure 18
Posterior Distribution of the Difference δ = θ^B − θ^A for the Click-Through Proportion Between the Two Fictitious Website Versions
Note. Visitors confronted with Version B clicked more often on the call-to-action button than visitors confronted with Version A.

Figure 19 plots the posterior mean of the difference between θ^A and θ^B, as well as the 95% HDI of the difference, in a sequential manner. With increasing sample size, the HDI becomes more narrow, indicating that the range of likely values for δ shrinks. After some initial fluctuation, the posterior mean difference between the two success probabilities settles at approximately 0.014.

The LTT Approach
Before the data can be analyzed according to the LTT approach, a prior distribution for the log odds ratio has to be specified. For this purpose, it is important to note that the subtle manipulations of common A/B tests generally result in very small effect sizes. The effect size of website changes (i.e., the difference in conversion rates between the baseline version and its modification) is typically as small as 0.5% or less (Berman et al., 2018). This means that the analysis of such data requires an exceptionally narrow prior distribution that peaks at a value close to 0 in order for the shape of the prior distribution to do justice to the relative plausibility of parameter values.
For the present example, we will compare the impact of different prior distributions on the analysis outcome (i.e., a sensitivity analysis). Suppose that the online marketing team specifies three prior distributions. Firstly, there is the prior distribution specified by a team member who is still relatively unfamiliar with conversion rate optimization. This team member lacks substantive knowledge about plausible values of the log odds ratio and prefers to use the uninformed standard normal prior, truncated at zero to represent the expectation of a positive effect (see Figure 20, left-hand panel). The two remaining prior distributions come from team members with prior knowledge of A/B testing; consequently, these distributions are much more narrow. One prior comes from a team member who is optimistic about the conversion rate increase in Version B. Specifically, this team member believes the most likely value of the odds ratio to be around 1.20 (i.e., μ_ψ = 0.18; see Figure 20, center panel). The final prior comes from a team member who believes the most likely value of the odds ratio to be around 1.05 (i.e., μ_ψ = 0.05; see Figure 20, right-hand panel). All three prior distributions are truncated at zero to specify a positive effect. In sum, the data will be analyzed with three priors: the uninformed prior with μ_ψ = 0 and σ_ψ = 1, the optimistic prior with μ_ψ = 0.18 and σ_ψ = 0.005, and the conservative prior with μ_ψ = 0.05 and σ_ψ = 0.03.

Figure 19
Sequential Analysis of the Difference Between the Click-Through Probabilities (i.e., θ^B − θ^A) of the Two Fictitious Webshop Versions
Note. The orange line plots the posterior mean of the difference. The grey area visualizes the width of the highest density interval as a function of sample size n. The left-hand panel shows the sequential analysis of the difference with the y-axis ranging from −1 to 1; the right-hand panel shows the same analysis with the y-axis ranging from −0.1 to 0.1.

We used the abtest package (Gronau, 2019) in R (R Core Team, 2020) and JASP for the present analysis. Before estimating the size of the effect, we first evaluate the evidence that there is indeed a difference between Versions A and B. Overall the data support the hypothesis that visitors confronted with Version B click on the call-to-action button more often than those confronted with Version A. Specifically, compared to the null hypothesis ℋ0, the data are about 11 times more likely under the uninformed alternative hypothesis ℋ+u, about 80 times more likely under the optimistic alternative hypothesis ℋ+o, and about 27 times more likely under the conservative alternative hypothesis ℋ+c. 11 The influence of the prior distribution on the Bayes factor can be explored more systematically with the Bayes factor robustness plot, shown in Figure 21. Varying both the mean μ_ψ and the standard deviation σ_ψ of the prior distribution on ψ shows that BF+0 mostly ranges from about 10 to about 60.
The evidence is generally less compelling for prior distributions that are relatively wide (i.e., high σ_ψ) or relatively peaked away from zero (i.e., low σ_ψ and high μ_ψ). In both scenarios, substantial predictive mass is wasted on effect sizes that are unreasonably large and were unlikely to manifest themselves in the context of the present webshop A/B experiment.
11) It follows from transitivity that the optimistic colleague outpredicted the conservative colleague by a factor of 80/27 ≈ 2.96.

Figure 21
Bayes Factor Robustness Plot for the Fictitious Webshop Data
Note. Overall, there is strong evidence for H 1 over H 0 across a range of reasonable values for μ ψ and σ ψ . The evidence is less compelling when the prior for the log odds ratio is relatively wide (i.e., when σ ψ is relatively high) or far away from zero (i.e., when μ ψ is relatively high).
The prior and posterior probabilities of the hypotheses are displayed on top of Figure 22. For the uninformed prior, the optimistic prior, and the conservative prior, the posterior probabilities for ℋ+ are approximately equal to 0.920, 0.988, and 0.965, respectively. This illustrates that even though the three priors provide different levels of evidence as measured by the Bayes factor, the overall interpretation is approximately the same. Figure 22 also shows the flow of posterior probability for each of the three prior distributions as a function of the fictitious incoming observations. For all distributions, a clear and consistent pattern of preference starts to emerge only after 10,000 observations, which is when the posterior probability of ℋ+ gradually rises while that of ℋ0 decreases accordingly.

Figure 22
Flow of Posterior Probability for H0 and H+ as a Function of the Number of Observations Across Both Fictitious Website Versions
Note. The left-hand panel shows the sequential analysis with the uninformed prior. The center panel shows the sequential analysis with the optimistic prior. The right-hand panel shows the sequential analysis with the conservative prior.
In sum, the fictional webshop data present strong to very strong evidence for the claim that the conversion rate is higher in Version B than in Version A (Lee & Wagenmakers, 2013). Being assured that the effect is present, the online marketing team now wishes to assess the size of the effect. Figure 23 displays the prior and posterior distribution for the log odds ratio using the three different priors. The prior distribution is plotted as a dotted line and the posterior distribution as a solid line (with 95% central CI).

Under the assumption that the effect is non-zero, the left-hand panel of Figure 23 indicates that the posterior median of the log odds ratio is 0.135, with a 95% CI ranging from 0.050 to 0.220, when using the uninformed prior distribution. The middle panel of Figure 23 displays the posterior distribution for the log odds ratio when using the optimistic prior distribution. The leftward shift of the posterior distribution indicates that the effect is somewhat smaller than expected; the posterior median is 0.179 and the 95% CI ranges from 0.168 to 0.190. The change from prior to posterior distribution is only modest, and this reflects the fact that the optimistic prior was relatively peaked, meaning that the prior belief in the relative plausibility of the different parameter values was very strong. Finally, the right-hand panel of Figure 23 displays the posterior distribution when using the conservative prior distribution. The rightward shift of the posterior distribution indicates that the effect is somewhat larger than expected; the posterior median is 0.072 and the 95% CI ranges from 0.029 to 0.116. The general pattern in Figure 23 is that the change from prior to posterior is more pronounced when prior knowledge is weak.

Figure 23
Prior and Posterior Distribution of the Log Odds Ratio Under H1 for the Fictitious Webshop Data Set
Note. The median and the 95% credible interval of the posterior density for the fictitious webshop data are shown in the top right corner. The left-hand panel shows the uninformed prior and the posterior distribution of the log odds ratio under ℋ1. The center panel shows the optimistic prior and the posterior distribution of the log odds ratio under ℋ1. The right-hand panel shows the conservative prior and the posterior distribution of the log odds ratio under ℋ1.

Concluding Comments
The A/B test concerns a comparison between two proportions and it is ubiquitous in medicine, psychology, biology, and online marketing. Here we outlined two Bayesian A/B tests: the 'Independent Beta Estimation' or IBE approach that assigns independent beta priors to the two proportion parameters, and the 'Logit Transformation Testing' or LTT approach that assigns a normal prior to the log odds ratio parameter. These approaches are based on different assumptions and hence ask different questions. We believe that the LTT approach deserves more attention: in many situations, the assumption of independence for the proportion parameters is not realistic. Moreover, only with the LTT approach is it possible for practitioners to obtain evidence in favor of or against the null hypothesis. 12 Both approaches allow practitioners to monitor the evidence as the data accumulate, and to take prior/expert knowledge into account.
The LTT approach could be extended to include the possibility of an interval-null or peri-null hypothesis to replace the traditional point-null hypothesis (Morey & Rouder, 2011). If the interval is wide, and if ℋ1 is defined to be non-overlapping (such that the parameter values inside the null interval are excluded from ℋ1), then the evidence in favor of the interval-null hypothesis may increase at a much faster rate than that in favor of the point-null (see also Jeffreys, 1939, pp. 196-197; Johnson & Rossell, 2010). Interval-null hypotheses are particularly attractive in fields such as medicine and online marketing, where the purpose of the experiment concerns a practical question regarding the effectiveness of a particular treatment or intervention. An effect size that is so small as to be practically irrelevant will, with a large enough sample, still give rise to a compelling Bayes factor against the point-null hypothesis. This concern can to some extent be mitigated by considering not only the Bayes factor but also the posterior distribution. In the above scenario, the conclusion would be that an effect is present, but that it is very small.
12) It is possible to expand the IBE approach and add a null hypothesis that both success probabilities are exactly equal (e.g., Jeffreys, 1961), yielding an Independent Beta Testing (IBT) approach. A discussion of the IBT is beyond the scope of this paper (cf. Dablander et al., 2022).
Despite its theoretical advantages, the Bayesian LTT approach has been applied to empirical data only sporadically. This issue is arguably due to the fact that many researchers are not familiar with this procedure and the practical advantages that it entails. The fact that the LTT approach had, until recently, not been implemented in easy-to-use software is another plausible reason for its widespread neglect. In this manuscript we outlined the Bayesian LTT approach and showed how implementations in R and JASP make it easy to execute. In addition, we demonstrated with several examples how the LTT approach yields informative inferences that may usefully supplement or supplant those from a traditional analysis.