Bayesian analysis for social data: A step-by-step protocol and interpretation

The paper proposes Bayesian analysis as an alternative approach for the conventional frequentist approach in analyzing social data. A step-by-step protocol of how to implement Bayesian multilevel model analysis with social data and how to interpret the result is presented. The article used a dataset regarding religious teachings and behaviors of lying and violence as an example. An analysis is performed using R statistical software and a bayesvl R package, which offers a network-structured model construction and visualization power to diagnose and estimate results.• The paper provides guidance for conducting a Bayesian multilevel analysis in social sciences through constructing directed acyclic graphs (DAGs, or "relationship trees") for different models, basic and more complex ones.• The method also illustrates how to visualize Bayesian diagnoses and simulated posterior.• The interpretations of visualized diagnoses and simulated posteriors of Bayesian inference are also discussed.


Method details
In social sciences, the persistence of 'stargazing', p-hacking, and HARKing issues has currently led to a severe reproducibility crisis in which 70% of researchers have failed to reproduce the experiments of other scientists [1][2][3][4] . The crisis forces the academia to react with rigorous study design and preregistration procedures, more careful use of statistical analysis, and interpretation of statistical results [5][6][7] . In this article, we propose that the Bayesian inference approach [8 , 9] , with its natural properties, seemingly offers a solution for analyzing social data. In the following section, we will briefly explain a dataset of Vietnamese folktales that we are going to use as an example to illustrate the method.
The analysis was done using the bayesvl R package (version 0.8.5) in the R statistical software (version 3.6.2) [10] . Similar applications of Bayesian statistics in social data analysis can be found in [11][12][13][14] .

Data in brief
Hereafter, we use one of our latest research studies as an example for performing Bayesian multilevel analysis with social data [14] . The study explores the association between the outcome and the behaviors of lying and violence of main characters under the influence of religious teachings in selected Vietnamese folktales. The dataset consists of binary variables encoded from 307 Vietnamese folktales. The dataset is stored in the bayesvl repository and can be loaded with the following commands: Even though there are 25 binary variables, of which only eight variables are employed in this article: • "Lie": whether the main character lies • "Viol": whether the main character employs violence • "VB": whether the main characters' behaviors express the value of Buddhism • "VC": whether the main characters' behaviors reflect the value of Confucianism • "VT": whether the main characters' behaviors express the value of Taoism • "Int1": whether there are interventions from the supernatural world • "Int2": whether there are interventions from the human world • "Out": whether the outcome of a story is favorable for its main characters

Data analysis with Bayesian statistics
Step 1. model construction First, we establish three different directed acyclic graphs (DAGs), or so-called "relationship trees," from simple to more complex ones, based on the dataset mentioned above.

Model 1. Multiple regression analysis
The first and the most straightforward "relationship tree" exemplified examines the determinants of the behaviors of lying and violence on the outcome of the main character (see Fig. 1 ).
To construct the "relationship tree" in Fig. 1 , one needs to initially create the model and load the variables -represented by nodes -into the model by employing the function bayesvl() and bvl_addNode(), respectively as follows: Because the statistical distribution of all employed variables is binomial, we set "binom" in the function. Besides binomial distribution, the package also provides various types of statistical distribution for the types of data, namely: normal distribution -"norm", categorical distribution -"cat", Bernoulli distribution -"bern", Student's t-distribution -"student", Poisson distribution -"pois", and so on.
After loading all the variables into the "relationship tree", the next step is to grant the regression type to the connection between the independent variables "Lie" and "Viol" and the dependent variable "O" by employing the function bvl_addArc(). The model can be set as the fixed effect type by adding a "slope" into the command: R > model1 < -bvl_addArc(model1, "Viol", "O", "slope") R > model1 < -bvl_addArc(model1, "Lie", "O", "slope") In Bayesian inference, the posterior probability is estimated from a prior probability and a "likelihood function" derived from a statistical model for the observed data. Therefore, setting prior distribution is critical before fitting the model. The prior distribution can be determined based on previous empirical findings, researcher's past experience and personal intuition, or expert opinion [8 , 15] . Nonetheless, preceding empirical works and knowledge do not always exist, so determining prior distribution by researcher's experience or personal intuition might be criticized as intentional subjectivity. In such circumstances, setting prior distribution as "uninformative" or "know nothing priors" can be a prominent alternative, because it can mitigate the criticism of intentional subjectivity and help users fit a new model without firm empirical findings [15] . The package developers utilize uninformative prior distribution with mean 0 and standard deviation 10 (or 100 for alpha) as default. The prior distribution of each relationship in the "relationship tree" is always given at the time the path between two nodes is created employing the function bvl_addArc(), but if the prior distribution is not set, the package will use the default prior distribution. The prior distribution setting method will be clearly explained when constructing model 3 below. One can check the prior distribution of coefficients in model 1 by typing: R > bvl_stanPriors(model1) a_O ~normal(0,100) b_Viol_O ~normal(0, 10) b_Lie_O ~normal(0, 10) Since the prior distribution was not set in bvl_addArc(model1, "Viol", "O", "slope"), the package automatically set prior distribution of b_Viol_O as default distribution which is normal(0, 10). Eventually, the function bvl_bnPlot can help produce the graphical network of the constructed model (see Fig. 2 ).

R > bvl_bnPlot(model1)
To check the structure and mathematical form of the model, one can use the function summary: R > summary(model1) Model Info: nodes: 3 arcs: 2 scores: NA formula:
The dash-line arrow demonstrates the relation between the transformed data and the observation data (see Fig. 3 ). The values of transformed data are generated from the values of two observation data through the mathematical operator " * ". The value of "B_and_Viol" is generated from the multiplication between the values of "VB" and "Viol" by using the function bvl_addArc(). One can use a similar function to give the transformed value to "C_and_Viol" and "T_and_Viol".

Model 3. multilevel regression analysis
One can create a much more complex model of multilevel regression analysis, while only following a similar procedure with two models mentioned above and employing some additional functions. The primary purpose of the third exemplary "relationship tree" is to explore the impacts of lying and violence behaviors, their interaction with religious values, and intervention from the supernatural or human world on the outcome of the main character (see Fig. 5 ).

R > bvl_bnPlot(model3)
One can also check the mathematical construct of each transformed data in the "relationship tree" above by using the function bvl_formula(), like the following examples: Step 2. Fitting the model Before fitting the model using MCMC simulation, one needs to generate the Stan code in R. Because the bayesvl package provides an automatic generation of Stan code, one can use the following commands: The model created from the "relationship tree" can be fitted with MCMC simulation using the function bvl_modelFit(). The structure of the function bvl_modelFit() is partly dissimilar with other currently existent Bayesian analysis packages because it does not require users to construct conventional mathematical relationships among variables as well as set up the prior distribution for each relationship. One only need to input the name of constructed "relationship tree", the dataset, and mandatory set-up for MCMC simulation. As the bayesvl package was coded utilizing the No-U-Turn Sampler (NUTS) sampler [16] , the effective sample size per iteration is usually higher than that utilizing other samplers. However, the simulation is more computationally intensive and timeconsuming. Thus, it should be aware that the model specified with a high number of iterations, chains, and cores might monopolize computing power for a substantial time, especially for less powerful machines The command for model fit in the current exemplary case is shown below:   The model is fitted using four chains, each with 50 0 0 iterations of which the first 20 0 0 are for warmup, resulting in a total of 12,0 0 0 post-warmup posterior samples. In general, the model's simulated results show a good convergence based on two standard diagnostics of MCMC simulation, n_eff, and Rhat. The n_eff represents the effective sample size, which is the number of iterations needed for effective independent samples [8] . If the value is greater than 10 0 0, it is a good signal of a strong correlation between the dependent and independent variables. Rhat value -also known as the Gelman shrink factor and the potential scale reduction factor, shows the convergence of the logarithm [17] . If the value is higher than 1.1, the model is not convergent. The Rhat value is computed using the following mathematical formula [18] : Where ˆ R represents the Rhat value, ˆ V is the estimated posterior variance, and W is the withinsequence variance.

Step 3. Model visual diagnostics
One can aesthetically visualize the convergence diagnostics, posterior distribution, and estimated results. The function bvl_plotTrace() can generate the trace plots of the constructed model.
R > bvl_plotTrace(model3) Fig. 7 displays the trace plot of each parameter in the model, which is a standard visual diagnostic for MCMC work. The first 20 0 0 samples mark the warmup (adaptation, or burn-in) period. During this period, the Markov chains learn to sample more efficiently from the posterior distribution, so samples in the warmup period are not reliable and representative for inference. It should be noted that the trace plot plotted by the function bvl_plotTrace() only shows the samples after the warmup phase. In order to be identified as "clean, healthy" after the warmup period, the Markov chain needs to meet two primary characteristics: stationarity and good mixing. The chain in Fig. 7 is formed from four component chains, each of which obtains 30 0 0 iterations after the warmup period. Visually, if all lines (or paths) stick around a very stable central tendency, the Markov chain can be considered as stationary, while the rapid zig-zag motions of each line can be seen as the signal for a well-mixing chain. In general, no divergent chains are found, which suggests that the autocorrelation function dies out quickly, and the Markov property is satisfactory with the data distribution at hand. Because the MCMC algorithm produces autocorrelated samples, the function bvl_plotAcfs() is another command to check whether the autocorrelation is eliminated (to 0) after certain finite steps. One can visually diagnose the autocorrelation of the model by the following command, which will generate the results in 4 rows and three columns: The mathematical formula for the autocorrelation parameter for lag = L is displayed below: where x t is the sampled value of x at iteration t, T represents the total number of sampled values, and x is the mean of sampled values. From Fig. 8 , we can see that the effective sample size (ESS), which is all above 10 0 0, reduces quickly to 0 before lag 3. This tendency satisfies the Markov property of the chains and, consequentially, ensure computing efficiency. The Gelman Shrink Factor or the Rhat value estimated above can also be visualized by using the function bvl_plotGelmans(): R > bvl_plotGelmans (model3) Measuring how much variance there is between chains relative to how much variance there is within chains is another idea to check the convergence. If the average difference between chains is similar to average difference within chains (when Rhat = 1.0), the chains are well convergent. Nevertheless, the relative value might increase (when Rhat > 1.0) and indicates the less convergent tendency between chains, if there appears at least on orphaned or stuck chain [19] . Fig. 9 illustrates the mean value of potential scale reduction factor for each variable and parameter at 97.5% as well as the shrink factor suggested by Gelman and Rubin [20] . Overall, all the shrink factors get to 1.0 rapidly during the warmup period, which meets the standard of MCMC simulation.

Step 4. Result of visual presentation
Besides the mean and standard deviation of the posterior distribution summarized in the model fit above, one can visually present the estimated posterior distribution of every variable coefficient through histograms. The visualization can be made using the function bvl_plotParams(). We visualize the estimated posterior distribution of every variable in the constructed model in four rows and three columns with the Highest Posterior Distribution Intervals (HPDI) at 89% (see Fig. 10 ). The default HPDI is at 89%; therefore, to adjust the HPDI to 95%, one can simply change the credibility range (credMass) from 0.89 to 0.95.

Conclusion
Recently, the reproducibility crisis and the problems of 'stargazing', p-hacking, or HARKing in statistical analysis have required the scientific community to be more rigorous in conducting research and find solutions for the persistent statistical issues. Thus, the method paper proposes Bayesian analysis as a substitution for the conventional frequentist approach. Bayesian statistics have the advantages of treating all unknown quantities probabilistically and incorporating prior knowledge or belief of scientists into the model as an alternative approach for frequentist analysis in social sciences. The usage of the bayesvl R package for social data analysis also provides the opportunity to construct a "relationship tree" among variables intuitively and graphically visualize simulated posterior, especially in the age of Big Data [21] .

Declaration of Competing Interest
The authors declare that they have no known competing for financial interests or personal relationships that could have appeared to influence the work reported in this paper.