Bayesian Simultaneous Credible Intervals for Effect Measures from Multiple Markers

Abstract Inference from multiple markers is often encountered in biomedical research, especially when comparing treatments. Most common statistical methods rely on hypothesis testing with null hypothesis of no effect for each marker and methods based on closed testing for getting the combined p-value. This article proposes the use of simultaneous credible intervals obtained from the joint posterior distribution of effect measures of multiple markers for combined inference. The advantage of this method is that it finds the actual effect sizes and also allows different types of effect measures for various markers. We used equal tailed intervals and highest posterior density regions for effect measures of multiple markers. Simulation studies were carried out to examine frequentist properties of these simultaneous credible regions. These studies revealed that the inference based on simultaneous intervals can be different than the inference based on individual marker. The method is demonstrated through two case studies: (a) an observational study of wet age-related macular degeneration (wet-AMD); and (b) reanalysis of a randomized controlled trial for treatment of migraine. The wet-AMD observational study was the motivation for the present research.


Introduction
In clinical studies, multiple markers or outcomes are studied frequently, especially when comparing two groups: treatment and control. The term marker will be used for "marker or outcome" in this article. If multiple markers are independent of each other, then the inference is straightforward. However, this is rarely the case in real-world situation. Although the dependence structure of these markers is often not known explicitly, getting the combined inference from all the markers is crucial. In observational studies, the interest could be in finding the difference between two groups of baseline measurements, between two groups of existing health conditions or in comparing two treatments or markers in naturally or artificially paired units, such as left and right eye. For carrying out combined inference from all the possibly correlated markers, a joint interval estimation method is a natural choice for effect measures derived from multiple markers. The effect measures should also be chosen appropriately for different markers. The most commonly used effect measures for a binary marker are risk difference, risk ratio and odds ratio.
Most of the frequentist methods dealing with multiple markers depend on multiple hypothesis testing. When dealing with multiple markers, in frequentist approach, if the joint distribution of all the parameters of interest is not available in a closed form, one has to rely on the Bonferroni adjustment (Holm 1979;Simes 1986;Hochberg 1988;Hommel, Bretz, and Maurer 2007;Huque et al. 2010). Under the Bonferroni adjustment, equal Type I error is assigned to all the markers, while other modifications of the Bonferroni method suggest allocating the Type I error to selective markers based on domain expert's opinion or based on the p-value ordering. When multiple markers are assessed with various effect measures, the marginal p-values are obtained as the first step and then combined inference is carried out by a combination of these marginal p-values or an adjustment of marginal p-values (Pesarin 1990;Westfall and Krishen 2001). Bayesian alternative of multiple hypothesis testing is also employed quite often (Abramovich and Angelini 2006;Guo and Heitjan 2010). Furthermore, all observed markers may not be equally important for a specific analysis. For example, the hierarchy of the markers of interest is commonly studied before starting a clinical trial. The markers are then divided into families as per their hierarchy. Gate-keeping procedures are used on these families of markers to get the adjusted p-values (Röhmel, Benda, and Läuter 2006;Dmitrienko and Tamhane 2007).
In all these multiple hypothesis testing approaches, the focus is more on controlling the family-wise error rate or the total Type I error than on quantifying the effect. However, when the interest is in quantification of the difference between two groups in the population with respect to various markers, identifying proper effect measures for the markers and obtaining simultaneous confidence or credible intervals of those effect measures can be an appropriate option. The interest may also be in carrying out simultaneous estimation and hypothesis tests with possible different types of effect measures. In many cases, the hierarchy of the markers is not known and the interest lies in identifying a marker or markers that would be more efficient in distinguishing the two groups. When multiple markers are to be studied, vast literature is available for multiple testing but simultaneous testing is rarely seen (Roy and Bose 1953;Gabriel 1969) and simultaneous interval estimation is almost nonexistent. Frequentist methods of simultaneous confidence intervals mainly tackle the problem of continuous response by using multivariate normal distribution. Furthermore, with frequentist approach, the methods to obtain confidence intervals need to be adapted to the effect measure used. These frequentist methods cannot be used for small or sparse data, for which normality assumption is not valid. We refer to Robin et al. (2019) for a good review of analysis methods of multiple markers for small sample studies.
In this article, we propose Bayesian simultaneous credible intervals for effect measures derived from multiple markers, in order to determine the difference between two groups. We extend the method for a single binary marker, discussed in Nurminen and Mutanen (1987), to m (> 1) binary markers. The tail probability from the joint posterior distribution can be treated as one-sided Bayesian p-value for comparison with the frequentist approach of hypothesis testing. The upper-and lower-tail posterior probabilities are plotted for illustration purpose. The R-shiny app and the corresponding R code developed for exhibiting the joint distributions of various combinations of effect measures are available at the gitlab URL given in the supplementary materials. This app can also be used to get the tail probabilities from the joint posterior distribution of a subset of the effect measures for selected markers. Hence, it can be used as Bayesian approach of gate-keeping procedures.
This article is organized as follows. In Section 2, we describe Bayesian methods for simultaneous credible intervals for a single binary marker. Three possible effect measures, viz., risk difference, risk ratio and odds ratio, are used to quantify differences between two groups. In Section 3, we extend this approach to multiple binary markers and we describe the use of this approach for gate-keeping procedures. In Section 4, we discuss the choice of prior parameters. In Section 5, we discuss frequentist properties of the Bayesian credible intervals with the help of simulation studies. In Section 6, applications of these methods to the data from an observational study of 100 patients suffering from wet-type of age-related macular degeneration (wet-AMD) and to the data from a published clinical trial are demonstrated. We compare Bayesian credible intervals from the case study with Wald and Score type frequentist confidence interval for a single marker. Section 7 gives concluding remarks.

One Binary Marker: Review of Bayesian Methods
Our interest is in comparison of two independent groups, with fixed sample sizes N 1 and N 2 , with regard to one binary marker E 1 . Let π i , i = 1, 2 be the probabilities of presence of E 1 for the two groups, respectively.
The difference between the two groups can be measured in terms of effect measures P(π 1 , π 2 ). Examples of these are: difference of proportions P(π 1 , π 2 ) = (π 2 − π 1 ), risk ratio (π 2 /π 1 ), and odds ratio ({π 2 /(1−π 2 )}/{π 1 /(1−π 1 )}). Bayesian credible intervals for these effect measures are studied in detail in Nurminen and Mutanen (1987). In this section we review their methods and in the next section we extend these methods to m multiple markers. The data of the two groups with a binary marker can be written as a 2 × 2 contingency table, Y, as shown in Table 1. For fixed group sizes N 1 and N 2 and given (π 1 , π 2 ), the probability of observing such a table Y can be written as the product of two binomial probabilities.
To understand the difference between the two independent groups with respect to the marker in the Bayesian way, we specify an appropriate prior for (π 1 , π 2 ). We assume that π 1 and π 2 are independent a priori and assign a beta prior to π i with parameters (a i , b i ), i = 1, 2. This specification results into the following joint prior.
where 0 ≤ π i ≤ 1 and a i > 0, b i > 0, for i = 1, 2. The beta distribution (2) is a conjugate prior for the binomial likelihood (1) and hence, the posterior distributions of π 1 and π 2 given Y are also independent beta distributions with parameters (a i + y i1 , b i + y i0 ), i = 1, 2.
The posterior probability that the parameter P(π 1 , π 2 ) is more than p, can be written as follows: F(p|Y) = Pr(P(π 1 , π 2 ) > p|Y). ( For the sake of convenience, we omit explicit conditioning on Y in (3) and hereafter denote F(p|Y) as F(p). The posterior distribution (3) can be approximated by the Monte Carlo estimator: where (π 1s , π 2s ), s = 1, . . . , S are samples from the beta posterior distributions and S is sufficiently large. All the distributional characteristics such as posterior mode, posterior median and the credible intervals can be obtained using (4).

Credible Intervals for Effect Measures
Two methods are generally used to compute the credible intervals for the effect measure from the samples of posterior distribution (Kruschke 2014).

Equal tailed credible interval
Equal Tailed credible Interval (ETI) is defined such that equal probability is observed below the lower limit and above the upper limit of the interval. If the interval is with coverage probability (1−α), then the probability of observing a parameter below its lower limit is α/2 and the probability of observing a parameter above its upper limit is α/2. The equal tailed credible interval always contains the posterior median of the posterior distribution but may exclude the posterior mode.
Let the two-sided credible interval with coverage probability (1 − α) be denoted as (p L , p U ). It is computed asF(p U ) = α 2 and (1 −F(p L )) = α 2 . For a 1-sided interval, if one is interested in large effects, only the upper limit is computed and the lower limit is the lowest possible value of the parameter. For the parameter difference of proportions, the one-sided credible interval is either of the form For one-dimension, these limits can be easily obtained by applying a bisection method or by using existing R packages such as HDInterval (Meredith and Kruschke 2020).

Highest posterior density set
The Highest posterior Density Interval (HDI) is the set of points in the parameter space, which contribute more to the posterior density. It may not be a connected single interval. The set is computed such that the density at any point inside the set is more than the density of every point outside the set. If the interval is denoted by I HDI then for sufficiently small > 0 and This is equivalent to finding a threshold k such that all the points with their neighborhood density more than or equal to k are included in the HDI and all the points with their neighborhood density less than k are excluded from the HDI. An algorithm to compute such a threshold using the bisection method is given in Turkkan and Pham-Gia (1993). However, a direct computation of this threshold is possible by sorting the empirical point densities for a fixed . In this article, we use this direct method, where the density of a point is the empirical density in the neighborhood of the point. For the choice of in empirical density computation, we refer to Waterman and Whiteman (1978). To compute the HDI, it is possible to use some existing R packages that compute approximations of empirical density functions to normal densities. In this article, we have used the actual empirical densities and not their approximations, because the actual empirical densities can be conveniently extended to higher dimension. Since this is the highest posterior density set for the given threshold, the posterior mode always lies inside the set but the posterior median may not be included in the HDI.

Multiple Markers: Bayesian Simultaneous Credible Intervals
We extend the approach of Section 2 to m (> 1) binary markers here. Without loss of generality, let m = 2. As before, we Total y consider two independent groups with fixed and known sizes, N 1 and N 2 . Our interest is in comparison of the two groups with regard to two binary markers E 1 and E 2 . We represent data of these two markers in a 2×2 contingency table for each group as shown in Table 2. Here, y ijk , (j, k = 0, 1; i = 1, 2) is the frequency in each cell of the contingency table for group i and responses j and k for the two markers, respectively. Let π ijk be the corresponding cell probability and π i = {π ijk } j,k=0,1 and π = (π 1 , π 2 ) be the vector of the four probabilities for the two groups. Note that j,k π ijk = 1 for each i. Furthermore, for each group i, the probability of presence of the marker E 1 is π (1) i = π i11 + π i10 and probability of presence of the marker E 2 is π The likelihood for the parameters π given the data Y = (y ijk ; j, k = 0, 1; i = 1, 2), shown in Table 2, for the two groups can be written as the product of two independent multinomial distributions.
The conjugate prior for the multinomial likelihood is a Dirichlet distribution. We assume that π 1 and π 2 have independent Dirichlet distributions with parameters a i = (a i11 , a i10 , a i01 , a i00 ), i = 1, 2, as given by A posteriori π 1 and π 2 are independent and distributed according to the Dirichlet distributions with parameters (a ijk + y ij,k ), i = 1, 2, and j, k = 0, 1. The posterior distributions of any function of π , for example, π i , i = 1, 2, can be obtained from the posterior distribution of π.
Extending the concept of the effect measure P of Section 2, we can now define P 1 (π (1) 2 ) for marker 1 and P 2 (π (2) 1 , π 2 ) for marker 2. For example, P 1 (π (1) 1 . Since the proposed approach allows use of different effect measures simultaneously, we will suppress the explicit formula and write only P j in the sequel. The joint posterior probability of the two measures (P 1 , P 2 ) being at least p = (p 1 , p 2 ), respectively, for the given data, can be written as F(p|Y) = Pr(P 1 > p 1 , P 2 > p 2 |Y).
As before, hereafter we will denote F(p|Y) as F(p).
The above formulation can be extended to multiple binary markers, m > 2, with Dirichlet distribution serving as the prior for 2 m probabilities for each group.

Simultaneous Credible Intervals for Effect Measures
Credible intervals for the effect measures of interest for two markers are the simultaneous intervals obtained from the joint posterior distribution of π. In this section, we extend the methods discussed in Section 2.1 to higher dimensions.

Equal tailed credible interval
The two-sided simultaneous equal tailed credible intervals (ETI) for P j , j = 1, 2 are denoted as (P L1 , P U1 ) and (P L2 , P U2 ) and are computed asF where, F p j is defined for one dimension as in (3) for marker E j andα is the largest possible value of α such that (9) Then, the simultaneous interval denoted by I s is formed by the Cartesian product of (P L1 , P U1 ) and (P L2 , P U2 ).
The interval is computed using a bisection method described in the following algorithm.
A similar algorithm using bootstrap sampling is given in Montiel Olea and Plagborg-Møller (2019). Although the computation seems to be dependent on a seed used for generating the posterior samples, the required accuracy can be obtained with sufficiently large number of samples.
Highest posterior density set A Highest posterior Density Interval or region (HDI) may not be a single connected region. The density at any point inside the HDI is more than the density at every point outside the set. The HDI is computed by ordering the point densities and finding a threshold k such that all the points with their neighborhood density more than or equal to k are included in the HDI and all the points with their neighborhood density less than k are excluded.
Hence, computing the HDI reduces to finding a threshold k such that The HDI is the union of rectangular neighborhoods of points with density more than k.

Gate-Keeping Procedures for Multiple Markers
In order to apply the proposed methods to clinical trials, we briefly describe some gate-keeping procedures and their connection with the proposed methods. When multiple markers have a natural hierarchy of importance based on treatment effect, then the markers are grouped into families according to their importance. Gate-keeping procedures are applied to address the multiplicity problems by explicitly taking into account the hierarchical structure (Röhmel, Benda, and Läuter 2006;Dmitrienko and Tamhane 2007). Gate-keeping procedures are used in clinical trials, where each family plays the role of a gate-keeper for the next family. If the available data supports that there is no treatment effect for the markers from the higher ranked families, then the lower ranked families will not be assessed. In this section, we will elaborate how simultaneous credible regions can perform the following gate-keeping procedures.

Serial Gate-Keeping
In serial gate-keeping, hypothesis testing is sequential. Hence, only if all the null hypotheses in the current family are rejected, the markers from the next family are analyzed. In serial gatekeeping, the interest lies in rejecting all the null hypotheses. Hence, the resultant test statistic is the minimum of all the test statistics or the adjusted p-value is the maximum of the marginal p-values for the markers. The simultaneous intervals discussed in this article correspond to the serial gate-keeping procedure, where the objective is to find the effect of all the markers simultaneously.

Parallel Gate-Keeping
In parallel gate-keeping, the interest lies in rejecting at least one null hypothesis. Hence, the resultant test statistic of the combined hypothesis is the maximum of all the test statistics computed for the markers. This can be translated into a simultaneous interval setting, by defining the interval as a set of all the points having at least one co-ordinate lying in the corresponding marginal posterior interval.

Choice of Prior Parameters
For two binary markers, we specified Dirichlet prior for multinomial probabilities (Section 3). The important question is about the choice of prior parameters (a i11 , a i10 , a i01 , a i00 ), i = 1, 2, especially because it can be interpreted as an increase in the sample size for small sample correction. In order to understand the role of prior parameters from this angle, we look at the posterior means.
The posterior means of the probability of presence of marker 1 for the groups i = 1, 2 are where c i = j,k a ijk , y i1. = y i11 + y i10 and a i1. = a i11 + a i10 .
The parameter of interest is the effect measure P(π (1) 1 , π (1) 2 ) and to study its posterior distribution, we use the posterior distribution of π (1) 1 , we observe that the posterior mean is For m markers, the above expectation is (y 21. − y 11. )/(N + Ma). It is clear that the sample size in each group increases by the sum of the Dirichlet prior parameters. In the case of equal sample sizes and the same prior, the assessment of the effect measure is based on the sample sizes which are increased equally by (4a). Note that m binary markers results into M = 2 m categories and hence, require M Dirichlet parameters for each group. In this case, the sample size increases by M with uniform prior D(1) while it increases by M/2 with Jefferys prior D(1/2). Such increase in the sample size would have visible impact, especially for small sample sizes and large number of markers. Another prior choice, a reference distance prior (Berger, Bernardo, and Sun 2015), is D(1/M)-Dirichlet prior with all parameters equal to the inverse of the number of categories (1/M). This choice of prior increases the sample size only by 1. In a simulation study for multiple markers, we have considered "reference distance prior, " in addition to uniform and Jeffreys prior, Section 5.
It is known that for m = 1 (categories M = 2), Jeffreys prior is β(1/2, 1/2) and it corresponds to adding 1/2 to each cell. This is equivalent to the continuity correction used in the frequentist approach in order to facilitate computation in the case of sparsity. It is to be noted that the reference distance prior is same as Jeffreys prior for m = 1.
The parameter of interest is P(π 1 , π 2 ) but we have chosen to specify priors for (π 1 , π 2 ). In order to visualize the prior distributions of P(π 1 , π 2 ) for chosen priors of π 1 and π 2 , the prior densities for (π 2 − π 1 ) and (π 2 /π 1 ) are shown in Figure 1 of the supplementary materials. For the difference of proportions the uniform priors for (π 1 , π 2 ) gives a triangle density, and beta distributions with large parameters give bell-shaped densities. For the risk ratio the priors with small beta parameters show linear density curve with negative slope for risk ratio ∈ (0, 10) and beta distributions with large parameters give bell-shaped densities.

Simulation Studies
The main aim of this section was to explore the frequentist properties of simultaneous credible regions, using simulation study for m = 1, 2 markers. We have followed the recommendations for frequentist simulation study in Ollila et al. (2022). We also investigate the influence of prior parameters when the sample sizes are small or the data are sparse.
A small simulation study was also performed to examine the effect of well-suited or ill-suited priors. For this study, data were generated with proportions from beta distributions such that they would have small variance and the expected value of π 1 = {0.5} and that of π 2 ∈ {0.3, 0.5, 0.7}). In addition to the three sample size combinations discussed earlier, a larger sample size of 200 in each group was also considered for examining the effect of prior on posterior when data have varied sample size.

Multiple Markers
For multiple markers, to determine the parameters for simulation, we used the real data described in Section 6.1 and other scenarios, including equal probability of presence Table 3. Simulation scenarios for two binary markers: Six scenarios of marginal probabilities of presence of two markers in two groups.
E 1 0.6 0.71 0.9 2 E 2 0.9 0.71 0.9 NOTE: ρ 2 is the Kendall's τ correlation between the two markers and ρ is the corresponding correlation between the bivariate normal distributed variables. Note that when the probability of presence of a maker is 0.5 for both the markers, ρ 2 = (2/π ) arcsin(ρ), for example, scenario 3. For other scenarios, ρ was chosen by trial and error method so that on an average the simulated data gives binary variables with required correlation of ρ 2 .
for both the markers in both the groups, high correlation among markers. A sample of size N i in group i was generated using the given marginal probabilities (π (1) i , π (2) i ) and their association in terms of Kendall's τ . We generated the datasets using the package simdata (Kammer 2020) in R. In the first step, the data from multivariate normal distribution, with specified correlation matrix, were generated. In the next step, the data were dichotomized with the given probabilities to get binary variables. The process of dichotomization of two bivariate normal variables reduces the Pearson's correlation ρ, to Kendall's τ correlation ρ 2 in binary markers. When the dichotomization probability is 0.5 for both the variables then ρ 2 = (2/π ) arcsin(ρ) (Rousson 2014). For other choices of dichotomization probabilities, we configured the value of ρ such that the required ρ 2 is obtained after dichotomization, by performing a large number of simulations. The six scenarios of simulation study are described in Table 3. For each of the six scenarios, 1000 datasets were simulated and analyzed.
Exploration of frequentist properties: For a single marker, we computed two types of frequentist confidence intervals, Wald and Score confidence intervals, and two types of Bayesian credible intervals, ETI and HDI for each effect measure. Noninformative uniform prior and Jeffreys prior were used for Bayesian methods. The estimands in this exploration were the coverage probabilities of each confidence and credible intervals. The coverage probabilities of the true value of the effect parameter were calculated for each type of intervals described above.
For multiple markers, the coverage of simultaneous posterior regions was studied with three priors for multinomial probabilities-uniform, Jeffreys, and reference distance prior. Table 4. Coverage probabilities for 15 scenarios of π 1 , π 2 with balanced sample sizes for the effect measure difference of proportions computed with methods-Bayesian equal tailed credible interval (ETI) with Beta(1, 1) prior, Bayesian highest posterior density credible region (HDI) with Beta(1, 1) prior, ETI with Jeffreys prior, HDI with Jeffreys prior, Wald confidence interval, Newcombe confidence interval(NC) and Miettinen-Nurminen confidence interval (MN).

One Marker
The coverage probabilities for effect measure difference of proportions with equal sample sizes on both groups are given in Table 4. The coverage probabilities for this effect measure with unbalanced sample sizes on groups are given in Table 5. Similar tables for effect measures ratio of proportions and odds ratio are given in the Tables 1-4 in the supplementary materials. For all the effect measures, the ETI and the frequentist score interval showed desired coverage. Wald interval showed good coverage for larger sample sizes as compared to smaller sample sizes. HDI showed the lowest coverage in all the scenarios and for all effect measures. For imbalanced sample sizes when one of the probabilities of presence of marker is 0.1 or 0.9 and the effect measure is ratio of proportions or odds ratio then the coverage probability of Bayesian credible intervals seems to be lower than that of the balanced sample size scenarios.
To understand the effect of prior parameters used in the analysis in comparison with the distributions used for generating the data, a small simulation study was performed. The results of this study are given in Table 6. In the case of informative priors, when the analysis prior was the same as the prior used for simulating the data, the coverage probability was close to the desired value. When the analysis prior and the actual prior were different from each other, the coverage probability was lower, especially for small sample sizes. The effect of ill-specified prior was prominent with imbalanced sample sizes. On the other hand, with medium to large sample sizes, the effect of a prior was negligible. Noninformative or weakly informative priors, like uniform prior and Jeffreys prior, show good coverage probabilities in all the scenarios.

Multiple Markers
The coverage probabilities of the simultaneous ETI and the HDI region are shown in Table 7. Posterior joint distribution of the effect measures of two markers was computed with uniform prior (Dirichlet D(1) prior), Jeffreys prior (Dirichlet D(1/2) prior) and reference distance prior (Dirichlet D(1/4) prior). As expected, equal tailed credible interval showed better coverage as compared to HDI. Since the joint distribution of the effect measures was not available, frequentist simultaneous confidence regions are not reported for multiple markers.
It is to be noted that with noninformative priors, the equal tailed credible region includes the true parameter more often than the HDI , although, the HDI is the smallest region containing the points of highest densities. In all the scenarios of simulation study, the coverage probability with Jeffreys prior was less than the coverage probability obtained with uniform prior. Although the difference is small, it indicates the importance of choice of prior. In scenarios 5 and 6 with highly correlated markers, the coverage probability of HDI was improved as compared to the first 4 scenarios. Further, the coverage probabilities of the posterior region with Jeffreys prior and reference distance approach are comparable, indicating the benefits of the use of reference distance prior over uniform and Jeffreys prior for a large number of markers.

Application 1: Observational Study of Wet-AMD
We considered data of 100 wet age-related macular degeneration (wet-AMD) patients whose baseline and sixth month measurements were available. The measurements included visual acuity (VA), presence of abnormal fluid within the retina as macular Cyst, presence of abnormal fluid between the retina and the retinal pigment epithelium (neuroepithelial detachment, NED), and pigment epithelial detachment (PED). All the patients underwent the same treatment with slight variation in the duration between injections. Wet-AMD being an agerelated disease, the patients were grouped according to their age at the time of diagnosis of the disease-baseline age less than 80 years and baseline age 80 years or above. The aim of this analysis was to understand how the two groups differed with respect to the markers and how the baseline age affected the initial treatment effect in terms of these markers. The markers Cyst, NED, and PED were binary while VA was continuous. For the purpose of illustration, VA was dichotomized-1 indicating that VA at 6 months was greater than or equal to baseline VA, and 0 indicating that VA at 6 months was less than baseline VA. With simultaneous credible intervals, we checked whether these markers distinguished the age groups at the baseline and after initial treatment. A general description of this cohort and data are available in Ollila et al. (2022).

Analysis of One Marker
We analyzed each of the three markers (Cyst, NED, PED) separately, at baseline and at 6 months, using the method described in Section 2.1, and computed 90% credible intervals of difference of proportions in two age groups with noninformative beta(1, 1) prior. The resulting empirical posterior densities are given in Figure 1. In each plot of Figure 1, x-axis has the parameter of interest, Difference of proportions and y-axis has the posterior density. The upper panels show the posterior densities for the baseline measurements of each marker and the middle panels show the posterior densities for the measurements of the same markers after 6 months of treatment. The posterior density of difference of proportions for the marker Cyst after the initial treatment has lesser variance than its baseline plot, and is more concentrated on the positive side of "no effect. " This indicates that the presence of Cyst is generally larger in the age group of 80 and above with ETI of (−0.08, 0.16) at baseline and (−0.03, 0.17) after 6 months of treatment. This reduction in the presence of Cyst shows that the treatment worked better for younger patients. The credible interval of approximately (−0.25, 0.04) for difference of proportions for NED indicates that the presence of NED is lower in the age group of 80 and above at the baseline. After 6 months of treatment, the posterior has not changed drastically. The presence of PED does not seem to be different in the two groups at the baseline and also after the initial treatment, but the posterior distribution after 6 months has shifted left as compared to the baseline posterior distribution.
In all the figures, the triangle on x-axis indicates the posterior median and the filled circle indicates the posterior mode. Almost all of the posterior distributions are unimodal and symmetric about their posterior modes hence, the posterior mode and median coincide. Figure 1(g) shows the posterior distribution of difference of proportions in two age groups for a binary marker indicating improvement in VA at six months after the start of the treatment. It can be seen that the improvement in VA in the first 6 months is much better in the lower age group, since, the 90% credible interval lies completely below zero. Figure 2 shows the posterior densities of baseline Cyst with uniform, Jeffreys and an informative prior. For the presence of Cyst, the prior study results were available in (Chakravarthy, Evans, and Rosenfeld 2010). This information was used to specify an informative prior for Cyst in both the groups. Figure 2(c) shows the posterior density of baseline Cyst measurements with the informative prior. The posterior densities with the informative and the noninformative prior look similar but with the informative prior, the posterior mode is shifted slightly away from zero as compared to the noninformative prior.
Many frequentist methods like Wald, Score, methods with continuity correction (Newcombe 1998) are available for computing confidence interval for the effect measures of a single binary marker. However, most of these confidence intervals are obtained by inverting a specific hypothesis test. We obtained these intervals by testing the null hypothesis of no difference in age groups versus the alternative hypothesis of nonzero difference in the two age groups for each marker. We compared the Bayesian ETIs with these frequentist intervals for one dimension.
The actual values of the credible intervals and the frequentist confidence intervals for the improvement in VA are given in Table 8. The HDI is not included in the table, since, it cannot be written as two limit points because it may not be a connected interval. With the noninformative Beta(1, 1) prior, the Bayesian posterior credible interval is close to the frequentist confidence interval. Overall, we observed that with the noninformative Beta(1, 1) prior, the Bayesian posterior credible intervals are close to the frequentist Agresti-Caffo and Newcombe confidence intervals.

Analysis of Multiple Markers
Next, we analyzed multiple markers jointly for the comparison of the two age groups. In Figure 3, the pairwise joint credible intervals of the markers at the sixth month of the treatment period are shown. On the Y-axis, the values of the effect measure of the first marker are displayed and on the X-axis the values of the effect measure of the second marker are displayed. The dashed rectangle shows the equal tailed rectangular simultaneous interval (ETI) for two markers and the region marked by the smaller rectangles shows the HDI. The mode of the joint posterior density is shown by a dark circle on the region. The inference obtained from each individual marker is reflected to a large extent in the pairwise joint distributions. However, the 90% VA interval in one dimension excluded zero clearly, while, all the simultaneous intervals have zero included. From Figure 3(a), it is clear that the joint posterior of difference of proportions in NED and VA improvement has equal variation on both the effect measures, and the posterior mode is below zero for both, NED and VA. The joint posterior density of the effect measures for  Cyst and VA in Figure 3(c) shows longer right tail for VA. Cyst has a smaller variation than VA. A similar pattern is observed in Figure 3(d) for Cyst and NED. Cyst effect measure has a smaller variation as compared to NED and the distribution has longer right tail for NED. Figure 4 shows Bayesian simultaneous credible intervals for difference of proportions for the two markers Cyst and NED, computed from the joint posterior distribution with three different priors-uniform, Jeffreys and reference distance prior. As compared to uniform prior, Jeffreys prior and reference distance prior have smaller posterior regions. In Figure 4(a) of uniform prior, the posterior mode is away from the observed difference of proportions, while, for Jeffreys prior they are close to each other. In Figure 4(c), the posterior mode is toward zero on both the dimensions from the observed value.
Visualizing simultaneous intervals for three and more markers is not trivial. The cumulative joint probability is computed such that all effect measures are either below a fixed value (left tail probability) or above a fixed value (right tail probability). The cumulative joint probabilities of these diagonal values are plotted in Figure 5. In this example, the diagonal values of difference of proportions are plotted on the x-axis and the corresponding cumulative joint probability is plotted on the y-axis. These plots are obtained from the R code available at the gitlab URL given in the supplementary material.
The uppermost panel shows pairwise (joint distribution of two markers simultaneously) left and right cumulative joint probabilities, respectively. The leftmost curve of both the plots show joint cumulative probability of VA and NED. From their marginal distributions and their joint distribution, it can be easily inferred that the effect measure for these two markers is negative when comparing the two age groups. From this curve, it can also be seen that more than 80% probability mass lies below "zero" value of difference of proportion for NED and VA improvement. Next, the rightmost curve is from the joint posterior of Cyst and PED. The left panel curve of Cyst and PED shows that less than 20% of the joint probability mass lies below zero. This indicates that the difference in the two age groups is positive for this joint distribution. Overall, in pairwise distributions, the pairs containing VA improvement as one of the markers have more probability mass below zero and the pairs with Cyst as one of the markers have more probability mass above zero.
A similar trend is seen in the plots in the middle panel of Figure 5, where the joint posterior tail probabilities are plotted for effect measures of three parameters simultaneously. For the combinations with Cyst as one of the markers, the substantial posterior probability mass lies above zero, while the combinations without Cyst are dominated by the effect measure of improvement in VA. It can be inferred from these graphs that Cyst is an important marker jointly with VA and it helps in differentiating the two age groups.

Application 2: Clinical Trial-Migraine Study
To illustrate the use of the proposed method to a randomized clinical trial, we reanalyzed a small section of the data from a randomized, placebo-controlled, double blind clinical trial designed to compare multiple treatments for migraine with placebo (Ho et al. 2008). We considered two groups of 50 patients each, with one on one of the treatments, and the other on placebo. These data were taken from StatXact (version 12) software (Cytel Inc. 2019). Here, we considered four markers: pain freedom at 2 hr post treatment (EP 1 ), absence of phonophobia at 2 hr post treatment (EP 2 ), absence of photophobia at 2 hr post treatment (EP 3 ) and sustained pain freedom up to a 24 hr period (EP 4 ). Out of these markers, EP 1 , EP 2 and EP 3 were co-primary markers, while, EP 4 was the secondary marker. Figure 6 shows the posterior densities of difference of proportions for all the four markers with uniform prior and Jeffreys prior. From Figure 6, it is clear that the posterior densities  of difference of proportions corresponding to EP 2 and EP 4 are away from zero to the right, and the 90% credible intervals fall completely on the right side of zero. For the markers EP 1 and EP 3 the shift is comparatively less and the lower limit of the credible intervals are close to zero. The markers EP 2 and EP 4 clearly show the treatment effect. The posterior distribution with two noninformative priors, uniform and Jeffreys, show very little difference. Figure 7 shows the joint cumulative posterior distribution for the diagonal values of multiple markers. The inference obtained from individual markers is further strengthened with the inference from the joint distribution. The upper panel shows pairwise joint left and right cumulative probabilities, respectively. The rightmost curve of cumulative joint probability is for the joint probability of EP 2 and EP 4 , and shows more posterior probability mass on the right side of zero. The leftmost curve is the joint probability of EP 1 and EP 3 . A similar trend is seen in the middle panel for joint probabilities of effect measures of various combinations of three markers out of the four. The lower panel shows joint probabilities for all the four markers. It clearly shows the positive treatment effect from the joint posterior. Table 9 lists the frequentist individual and adjusted p-values (adjustment for the presence of other markers) for serial and parallel gate-keeping . It also shows Bayesian empirical posterior probability at "no effect" computed from the marginal posterior and the joint posterior distribution. Since, the frequentist method only provides adjusted p-value and not the confidence interval for multiple markers, it was not possible to compare the Bayesian credible intervals with their frequentist counterparts. Since, the data were small, we have considered the exact nonparametric method from StatXact (Cytel Inc. 2019) to compute the frequentist individual marker p-values and adjusted pvalues.
For all three primary markers, the individual or marginal frequentist and Bayesian p-values are comparable. The joint posterior probability of "no effect" in all the markers simultaneously is very small, indicating the treatment effect. The frequentist serial gate-keeping p-value is 0.076, which is generally inflated because of the discrete nature of the exact p-value computations.
Furthermore, the Bayesian posterior probability of "no effect" for the secondary marker EP 4 is 0.018, and agrees with the frequentist inference of rejection of null hypothesis with the exact p-value of 0.004.
The use of the proposed methods to gate-keeping in clinical trials is illustrated in Figure 8. The equal tailed simultaneous regions at 90% credible level corresponding to two markers EP 1 and EP 4 are shown. The left figure shows the ETI with serial gate-keeping, while, the right figure shows the ETI with parallel gate-keeping.

Discussion
Comparison of two groups with regard to multiple markers occur frequently in randomized clinical trials as well as in observational studies. Statistical inference using all markers simultaneously would provide a coherent approach for the comparison. However, this has rarely been carried out.
In this article, we have attempted to assess m binary markers using simultaneous Bayesian credible intervals. When multiple markers are available for comparison of two groups, the inference based on them simultaneously is often different than that based only on individual marker. There is often loss of information in the latter inference and the results may be misleading. As we have shown in Application 6, visual acuity as a single marker can differentiate between two age groups very well since zero is excluded from the credible region obtained from the marginal distribution. But when visual acuity is considered with the other markers, all the simultaneous credible regions of two markers include the point (0, 0), indicating a not-soextreme difference in age groups, contrary to what is suggested by individual marker. Further, although cyst did not give a clear indication of extreme effect measure when studied as a single marker, all its joint posterior probability curves indicate that cyst is an important marker along with VA, and helps in differentiating the two age groups.
The major advantage of using the proposed Bayesian method is that it works seamlessly for marker-specific effect measures, unlike frequentist methods. Another problem with frequentist approach is-when the effect measures do not follow multinormal distribution, it is difficult to obtain simultaneous confidence intervals and the only inference in such a case is a single pvalue, which is often criticized. With Bayesian analysis, the joint posterior distribution of all the effect measures itself is available and any characteristics of the effect measures can be obtained. As shown in Section 4, the choice of prior parameters for Dirichlet prior enables the estimation of the proportions when the sample size is small or the data are sparse. This can be viewed as an ad hoc approach of continuity corrections employed in the frequentist approach. Based on the simulation studies, we recommend use of Dirichlet prior with all parameters equal to the inverse of the number of categories (1/M) in most situations. This choice of prior is recommended especially for small or sparse data because it doesn't increase the sample size dramatically.
In frequentist approach, an appropriately chosen effect measure is estimated or tested directly. Here we have specified priors on the multinomial probabilities because specification of priors directly for the effect measures result into nuisance parameters. Moreover, the choice of prior may not be obvious for the effect measure.
From the simulation study, it was clear that frequentist methods worked better for equal and large sample sizes for  two groups, as compared to imbalanced or small sample sizes, whereas, Bayesian methods showed overall consistent performance. We have already noted that, although, the highest posterior density region is the smallest region with points of highest densities, it does not show required coverage of true parameters. On the other hand, equal tailed credible regions show good coverage of true parameters, which are comparable to frequentist methods of Agresti-Caffo or Newcombe. The latter are preferred over the traditional Wald or Score confidence intervals. For multiple markers, equal tailed regions showed better coverage than highest posterior density regions. However, the highest posterior region is a better visualization of the posterior probability distribution. In addition, the proposed methods can provide inference for gate-keeping procedures popular in clinical trials. The advantage of these methods is that they can be used even when the hierarchy of markers is not known a priory. This method can also serve as a good Bayesian alternative to simultaneous confidence interval method for tests such as noninferiority test discussed in Tang and Yu (2020).
Throughout this article, we have assumed that the group sizes are fixed. However, similar methods can be developed by appropriately choosing the likelihood function when the groups sizes are unknown. For example, the Poisson likelihood for counts is appropriate in such situations. The gamma distribution is the conjugate for this likelihood. Furthermore, relative risk or risk ratio can be considered as an effect measure for each marker and simultaneous credible intervals can be obtained on similar lines as in Section 6.
The limitations of the proposed methods is the visualization of simultaneous credible regions for large number of markers m.
Equal tailed regions can be specified by their lower and upper limits but the highest posterior regions cannot be given by two endpoints. So, visualization is important for highest posterior regions. Visualizing the simultaneous confidence regions beyond three dimensions is not possible, hence, one has to rely on the cumulative probability curves to understand the joint posterior distribution as in Figure 7. When m increases, the possible combinations of responses increase to 2 m . The methods can easily become computationally intensive and hence, alternative methods might be needed.