Evidence of Herding and Stubbornness in Jury Deliberations

We explore how the mechanics of collective decision-making, especially of jury deliberation, can be inferred from macroscopic statistics. We first hypothesize that the dynamics of competing opinions can leave a "fingerprint" in the joint distribution of final votes and time to reach a decision. We probe this hypothesis by modeling jury datasets from different states collected in different years and identifying which of the models best explains opinion dynamics in juries. In our best-fit model, individual jurors have a "herding" tendency to adopt the majority opinion of the jury, but as the amount of time they have held their current opinion increases, so too does their resistance to changing their opinion (what we call "increasing stubbornness"). By contrast, other models without increasing stubbornness, or without herding, create poorer fits to data. Our findings suggest that both stubbornness and herding play an important role in collective decision-making.

What mechanisms underlie collective decision-making? Recent research into opinion dynamics has compared statistical patterns in empirical data to models of opinion dynamics [1][2][3][4][5][6][7][8] and tested how opinions change in controlled experimental settings [9,10]. Although both methods have provided substantial insight extending the early studies of collective opinion dynamics [11,12], they have generally not tackled one question our paper aims to answer: how do non-consensus decisions occur [13]? For example, is it possible to distinguish influence from noninfluence when we observe group opinions [14,15]? In other words, if members of a group share the same opinion, can we tell if they arrived at that opinion independently or as a result of their interactions (e.g., through a tendency to "follow" or imitate one another)? If this kind of influence does play a role, why do opinions often remain in non-consensus?
In this paper, we probe these questions by studying jury deliberation. Our work complements previous research in which judicial rulings were found to be affected by factors unrelated to the specific cases [16]. By analyzing jury data in bulk, we aim to understand how the mechanisms of jury opinion formation not directly related to the facts of the case (such as influence and stubbornness) couple together with the laws defining hung juries (situations in which jury opinions are considered too divided to reach a verdict) to shape decision-making patterns observed in data. Furthermore, while groups are thought to create better decisions than single individuals [10,17,18,19], a recent model suggests that correlated juror decisions undermine their collective accuracy [20], a conclusion supported by experiments on crowd wisdom [10]. We have found that there are significant correlations between decisions (see Fig. 13), which, together with the previous model [20], may help explain why juries appear to perform worse than individual judges at correctly acquitting defendants [21]. We are therefore motivated to understand the mechanisms that drive these correlations, which may point to ways in which the quality of jury decisions can be improved.
* kaburghardt@ucdavis.edu

FEATURES OF THE DATA
Jury deliberation is an ideal test bed for models of opinion dynamics. Jurors are exposed to the same information during the trial, are instructed not to discuss the trial with non-jurors, and cannot learn about the trial from outside sources [22][23][24][25][26]. Opinion variation between jurors is therefore likely due to internal factors, such as influence, rather than external factors, such as varying levels of information.
In this paper, we model jury datasets for civil trials in Oregon (OR) [27] and California (CA) [28]. We find qualitatively similar behavior in the Washington (WA) and Nebraska (NE) datasets whose data is less complete [29,30] (Fig. 1), and for criminal trials in the OR dataset (see Fig. 8). We split data by the number of jurors in each case. The OR 6 dataset, for example, corresponds to the Oregon civil trials with six jurors.
The purpose of our modeling efforts is to explain four features of the data shown in Fig. 1. First, we aim to model why the mean deliberation time, T_delib, is lowest when the fraction of jurors voting for the plaintiff in the final vote, V_fp/N, is 0 or 1, and highest when V_fp/N ≈ 0.3 or V_fp/N ≈ 0.6 (Fig. 1a). For each trial, we look at the breakdown between for-plaintiff and for-defendant votes, and we normalize the number of jurors voting for the plaintiff in the final vote, V_fp, by the jury size, N. Juries are dismissed and a new trial is given to the defendant in a civil trial if 1 − φ < V_fp/N < φ, where φ is 0.75 for the OR and CA civil trials [27,31]. This property, observed across all N, also helps explain why juries tend to reach supermajorities, in which V_fp/N ≥ φ or V_fp/N ≤ 1 − φ (Fig. 1c). We also observe that T_delib scales with the trial time, T_trial, as T_delib ∼ (T_trial)^{1/2} (Fig. 1b), a property not strongly correlated with the final vote (Fig. 12), even though both affect the mean deliberation time. Interestingly, the mean deliberation time does not scale strongly with the number of jurors, which is counter to many intuitive models (Fig. 11) [32,33]. Finally, in Fig. 1d, we notice that the deliberation time distribution is heavy-tailed.
In this paper, we fit empirical joint distributions of final votes and deliberation times to model distributions through maximum likelihood estimation of model parameters. Counter-intuitively, we therefore infer opinion dynamics without access to time-series data. Recent work has shown that different dynamical models of group opinion formation create different distributions in the time for groups to reach consensus [34,35], which inspires us to reverse engineer the model that best matches the jury data. Matching the joint distribution, instead of either distribution alone, strongly limits the possible dynamical models that can explain the data. For example, in contrast to many models of group decision-making, juries rarely reach complete agreement before they stop deliberating; it may therefore be possible to match the deliberation time alone with unrealistic models in which all jurors reach agreement. We do not claim, however, that the models we present are the only possible models that can match the data well; they are instead simple models that illustrate mechanisms that drive opinion dynamics.
A major limitation in the data, however, is that no two trials are exactly the same, so aggregating over heterogeneous trials may strongly affect our results [36]. To test for this effect, we split data into more homogeneous groups with the same N and similar T_trial, because N affects V_fp and T_trial affects T_delib. We find, however, that splitting the data does not affect the qualitative behavior of V_fp/N and T_delib (e.g., see Fig. 1). These results suggest that heterogeneity should not significantly affect our findings.
We next introduce three null models, in which jurors reach their opinions independently, then six models incorporating influence, through herding behavior and individual stubbornness, and discuss how well each matches the data.

NULL MODELS
We first create an independent, random vote null model with which to compare other models. For this first model, for each dataset, we reshuffle all juror votes, which creates a binomial distribution of final votes. Not surprisingly, this "one-mode null model" fits data poorly; therefore we propose a slightly more nuanced "two-mode null model." For this, we split the jury data into those with majority for-plaintiff final votes (V_fp/N > 0.5) and the rest (V_fp/N ≤ 0.5), reshuffle juror votes of each subset separately, and then combine the distributions. In both cases, we fix P(T_delib | V_fp/N), the conditional probability for juries to stop deliberating at time T_delib, given the fraction of for-plaintiff votes in the final vote, V_fp/N, to exactly match the empirical data, as an unrealistic but best-case scenario of these null models. Both models produce poor fits of the data compared to other models (Figs. 2e, 4, & 8), with the exception of CA 12 (T_trial = 34-61 hours), in which the two-mode model fits data better than any influence model. Overall, however, a simple model in which opinions are picked at random, independently of each other, does not provide a compelling explanation of the data.

FIG. 1 (caption fragment). […] The mean deliberation time scales as (T_trial)^{1/2} across several datasets, even though T_trial correlates only weakly with V_fp/N (Fig. 12). […] where φ = 0.75, the thresholds between which juries hang [27,31]. (d) The complementary cumulative distribution of deliberation times is heavy-tailed across datasets. Data is taken from Oregon [27], California [28], Washington, and Nebraska [29,30], and error bars represent 90% confidence intervals in the mean.
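The reshuffling behind the one-mode and two-mode null models can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy vote list are hypothetical, and each jury is represented only by its number of for-plaintiff votes. Because a finite pool of votes is shuffled without replacement, the resulting distribution is only approximately binomial.

```python
import random
from collections import Counter


def one_mode_null(final_votes, n_jurors, rng):
    """Pool every juror's vote across all juries, shuffle, and deal them back out."""
    pool = []
    for v_fp in final_votes:
        pool.extend([1] * v_fp + [0] * (n_jurors - v_fp))  # 1 = for plaintiff
    rng.shuffle(pool)
    return [sum(pool[i:i + n_jurors]) for i in range(0, len(pool), n_jurors)]


def two_mode_null(final_votes, n_jurors, rng):
    """Reshuffle majority-for-plaintiff juries and all other juries separately."""
    majority = [v for v in final_votes if v / n_jurors > 0.5]
    rest = [v for v in final_votes if v / n_jurors <= 0.5]
    return one_mode_null(majority, n_jurors, rng) + one_mode_null(rest, n_jurors, rng)


def vote_distribution(votes, n_jurors):
    """Empirical probability of each possible for-plaintiff count 0..N."""
    counts = Counter(votes)
    return [counts[k] / len(votes) for k in range(n_jurors + 1)]
```

For example, `vote_distribution(two_mode_null(votes, 6, random.Random(0)), 6)` gives the null-model vote distribution for a hypothetical list `votes` of six-person jury outcomes.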
We also create a "two-timescale" null model of the deliberation time distribution, in which the time for each juror to make their pre-determined final decision is independent (exponentially distributed), but depends on whether their decision is for the plaintiff or not (hence "two-timescale"). Deliberation ends when the last juror makes their final decision. Separate fitting parameters are used for for-plaintiff and for-defendant votes because for-plaintiff votes usually take longer than for-defendant ones (p-value < 2 × 10^{−2} based on the Mann-Whitney U test for the CA 6, CA 12, OR 6, and OR 12 datasets; no significant difference for the CA 8 dataset), which allows this null model to better agree with the data. We determined distributions by Monte Carlo sampling 10^5 times for each V_fp such that P(V_fp/N) is fixed to be the empirical data distribution as a best-case scenario. In this way, the two-timescale null model is meant to explain how juries stop deliberating, not how they reach their final vote. We find, however, that this model creates a poorer fit to the observed data than the full influence model (to be discussed shortly), despite artificially fixing P(V_fp/N). While other plausible time distributions could be used and the assumption of a homogeneous distribution might not be ideal, disagreement between this idealized model and data points to limitations in similar null models.
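A minimal Monte Carlo sketch of the two-timescale null model is given below. The function and parameter names are hypothetical, and the two exponential rates stand in for the fitted parameters, whose values are not reproduced here. Each juror draws an independent exponential decision time, and the jury's deliberation time is the largest of these.

```python
import random


def sample_deliberation_time(v_fp, n_jurors, rate_plaintiff, rate_defendant, rng):
    """Deliberation ends when the slowest juror reaches their fixed decision."""
    times = [rng.expovariate(rate_plaintiff) for _ in range(v_fp)]
    times += [rng.expovariate(rate_defendant) for _ in range(n_jurors - v_fp)]
    return max(times)


def mean_deliberation_time(v_fp, n_jurors, rate_plaintiff, rate_defendant,
                           n_samples, rng):
    """Monte Carlo estimate of the mean deliberation time for a given final vote."""
    total = 0.0
    for _ in range(n_samples):
        total += sample_deliberation_time(
            v_fp, n_jurors, rate_plaintiff, rate_defendant, rng)
    return total / n_samples
```

Repeating the sampling for every final vote V_fp, weighted by the empirical P(V_fp/N), would give the joint distribution that is compared with the data.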

Full model
Given the relatively poor performance of the null models, we propose an "influence with increasing stubbornness" model that can better describe the datasets. Within the large space of plausible models, we focus on a simple model with few parameters, and then check whether any of these parameters could be removed without affecting the quality of the fit. Furthermore, we focus on a model with herding because of the correlations between juror opinions seen in the data (Fig. 13), which may suggest that jurors have a tendency to follow the majority opinion. In the model, jurors tend to adopt the majority opinion and juries end deliberation at a rate that depends on the current vote (number of jurors currently leaning for the plaintiff and for the defendant). The former incorporates a simple mechanism for juror influence that enables the supermajorities observed in data, while the latter captures jury resistance to hung conditions. In addition, we add a stubbornness property, in which jurors increasingly hold on to their current opinion. This facilitates the strong non-consensus patterns from data. More specifically, as shown in Fig. 3, at each timestep in the model (where a timestep is chosen to be 1 minute, see Methods), a random juror is selected and considers re-evaluating their current opinion with probability 1 − s, where s reflects their stubbornness and depends on how long they have held their current opinion. If they do re-evaluate, they pick the majority opinion with probability p, and the minority with probability 1 − p. At the end of each timestep, the jury stops deliberating with probability q, which depends on the current set of juror opinions.
The stubbornness probability, s, depends both on how long the juror has held their current opinion and on whether the current set of opinions meets the hung condition: […] (1), where t_0 is the time a juror adopted its most recent opinion, τ is the time a juror has held their current opinion, ∆t is the length of a simulation time step, and µ_eff(t) is the rate at which jurors become more stubborn:

$$\mu_{\mathrm{eff}}(t) = \begin{cases} f\mu, & 1 - \phi < V_p(t)/N < \phi \\ \mu, & \text{otherwise} \end{cases} \qquad (2)$$

where f is the reduction in this rate when juries are hung (the current vote, V_p(t), divided by N is between the jury hanging thresholds 1 − φ and φ). If we set the stubbornness probability s to a constant, that would only have the general effect of changing the timescale of the dynamics. We incorporate increasing stubbornness (s grows with τ) as a behavioral hypothesis, which has previously been shown to help explain voter behavior in elections [8,37]. The jury's tendency to reach a non-hung decision is captured by making the stubbornness rate µ_eff(t) lower under hung conditions, meaning that jurors do not hold onto their opinions as strongly as they would otherwise, presumably to lessen the probability that the jury hangs. At the end of each timestep, the probability for a jury to stop deliberating, q, is determined: […] (3). These transition probabilities are constructed from a total of four fitting parameters: µ, α, f, and p, and three fixed parameters: b, the bias of the initial condition; ∆t, the length of a time step; and q_0, which are discussed further in Methods.

Fig. 2 shows that not only can the model explain vote and time distributions, but it can also explain the peaks in deliberation time near the critical fraction of voters V_fp/N ≈ 0.3 and 0.6. This appears to be due to important factors included in the influence model: the instability of juries having 50/50 split decisions, and the ability for juries to stop deliberating even when they have not reached complete consensus (see Supporting Information).
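The update rule described above can be illustrated with a short simulation. This is a sketch under loudly stated assumptions rather than the fitted model: the exponential form used for s, the random tie-break when the jury is evenly split, and all names and parameter values are hypothetical, and the stopping probability borrows the linear form q = q_0 + α|V_p/N − 1/2| that the text gives for the "no hung conditions" variant. Only the reduction of the stubbornness rate by the factor f under hung conditions, and the roles of p, s, and q, follow the description in the text.

```python
import math
import random


def simulate_jury(n_jurors, b, p, mu, f, alpha, q0, phi=0.75, dt=1.0,
                  max_steps=100_000, rng=None):
    """Return (final for-plaintiff votes, deliberation time) for one simulated jury."""
    rng = rng or random.Random()
    opinions = [1 if rng.random() < b else 0 for _ in range(n_jurors)]
    held_since = [0.0] * n_jurors  # time each juror adopted their current opinion
    for step in range(1, max_steps + 1):
        t = step * dt
        frac = sum(opinions) / n_jurors
        hung = (1 - phi) < frac < phi
        mu_eff = f * mu if hung else mu  # stubbornness rate is reduced when hung
        j = rng.randrange(n_jurors)
        tau = t - held_since[j]
        s = 1.0 - math.exp(-mu_eff * tau)  # ASSUMED form; grows with tau
        if rng.random() > s:  # re-evaluate with probability 1 - s
            if frac > 0.5:
                majority = 1
            elif frac < 0.5:
                majority = 0
            else:
                majority = rng.randrange(2)  # ASSUMED tie-break
            new = majority if rng.random() < p else 1 - majority
            if new != opinions[j]:
                opinions[j] = new
                held_since[j] = t
        frac = sum(opinions) / n_jurors
        q = min(1.0, q0 + alpha * abs(frac - 0.5))  # ASSUMED stopping form
        if rng.random() < q:
            return sum(opinions), t
    return sum(opinions), max_steps * dt
```

Running this many times for fixed parameters and histogramming the returned pairs gives a model joint distribution of final vote and deliberation time of the kind that is compared with data.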

Variations of the full model
Having developed a model that explains the data better than the simple null models, we construct variants of the full model in order to identify which mechanisms are most important for capturing the observed patterns. First, we test whether herding affects jury trials by setting p = 0.5 (Figs. 4 & 8). If p = 0.5, a juror would be equally likely to pick the majority opinion as the minority one. We see that the fit is significantly worse, therefore herding appears to affect the outcomes of jury trials. We next test the role of increasing stubbornness by setting µ_eff = 0. Removing the increasing stubbornness parameter also produces significantly poorer fits to the data (Figs. 4 & 8). A similar conclusion is reached in previous work that matches a model to election data in several European countries [8]. Because highly disparate datasets lead to similar conclusions about the importance of increasing stubbornness, we believe it plays a fundamental role in opinion dynamics. Setting the stubbornness probability s to a constant greater than 0 should only generally decrease the timescale of the dynamics, presumably making the final vote distribution more similar to the initial vote distribution, therefore in the interest of space, we leave out further model variants of this type. Finally, to better understand how the hung conditions affect jury behavior, we fit a model with no dependence on hanging: µ_eff(t) = µ and q(t) = q_0 + α|V_p(t)/N − 1/2|. In this "no hung conditions" variant, neither the stubbornness rate nor the quitting rate depends on whether the jury is currently hung. The probability for the jury to end deliberations, however, still increases linearly with the amount of consensus among jurors. To test the importance of the current vote for jury dynamics, we create a "no vote dependence" variant in which µ_eff(t) = µ, and q is a fitted constant. Both of these variants show poorer agreement with the data compared to the full model (Figs. 4 & 9). We finally tested removal of the hung conditions from either the stubbornness rate (Eq. (2)) or the stopping probability (Eq. (3)), but not both. We find that removing the hanging dependence of the stubbornness rate fits the data worse than removing the hanging dependence of the stopping probability (Figs. 7 & 10). Hanging may therefore affect how juror opinions change more than it affects how juries decide to end deliberations.

FIG. 4. Comparison of Models. Normalized log-likelihood functions for the null models and the influence model variants, compared with the full influence model. For each dataset indicated in the legend, log-likelihood functions for these models were normalized by |log(L_full)|, the magnitude of the log-likelihood function of the full model; models above −1 therefore explain the data better than the full influence model, while those below −1 perform worse. (a) The relative fit of the one-mode, two-mode, and two-timescale null models, along with the "no herding" model, in which p = 0.5; the "no stubbornness" model, with µ = 0; the "no vote dependence" model, in which the model dynamics do not depend on the number of jurors voting for the plaintiff; and the "no hung conditions" model, in which jury dynamics do not depend on whether the jury is currently hung. (b) In a zoomed-in graph, the influence model variants seen in (a) perform worse than the full model.
In summary, the full model agrees with data significantly better than the null models (the one-mode, two-mode, and two-timescale null models), as well as variants that remove herding, stubbornness, hung conditions, and vote-dependent behavior.

Findings
What does the influence model suggest about jury deliberation? To begin to answer this question, we examine the best-fit model parameters for the different datasets (Tab. 1). Similar results are found when we look at criminal data from Oregon as well (Tab. 2).
First, we see that the fitted stubbornness rate is usually much lower when juries are hung (f < 1), which suggests that, under hung conditions, jurors significantly reduce the rate at which they stick to their current opinion. Also, the positive estimated values of α indicate that juries are more likely to stop deliberating when they reach near-consensus. Further, p > 0.5 implies herding occurs within the jury, and µ > 0 min^{−1} implies jurors keep their most recent opinion with increasing stubbornness.
In Fig. 5, we see that a parameter in the influence model, α, follows the power law relation α ∼ (T_trial)^{−1/2}, which agrees with Fig. 1b because T_delib ∼ α^{−1} (Eq. (3)). We propose a possible mechanism for the scaling relationship T_delib ∼ (T_trial)^{1/2}: over the course of a trial, the amount of data juries will deliberate on, D, might follow a random walk with a reflecting boundary condition at 0, which implies that α^{−1} ∼ T_delib ∼ D ∼ (T_trial)^{1/2} (see Supporting Information).
We also notice that, across all the data, the herding probability, p, is highest when juries are smallest (Tabs. 1 & 2), while this value drops significantly for datasets with larger N (p-value < 0.05 between any N = 6 dataset and any N = 12 dataset). Previous studies on jury size [38] found that larger juries become hung more frequently, possibly because they have a minority opinion able to better resist the majority. Our study provides evidence for this explanation because larger juries have smaller p values, and therefore jurors who are less likely to follow the majority opinion. We should caution, however, that influence is not necessarily homogeneous across jurors, which may affect our results.

DISCUSSION
We find that models in which jurors make decisions independent from each other disagree with the data. On the other hand, models in which jurors are influenced by each other agree well, at least qualitatively, with the data. Importantly, we found best agreement from a model in which jurors display tendencies to both follow one another and also increasingly stick to their current opinion. This type of behavior was also previously found to be important for explaining voting patterns in elections [8], which suggests that it may be a fundamental mechanism of group decision-making.
Future work is necessary to better understand whether stubbornness or influence can hurt or help collective wisdom. In a recent theoretical paper [20], correlations between jurors were found to sometimes create judgments with lower accuracy than individual jurors when they need to reach a simple majority. In contrast, sequential voting, in which individuals base their decision on the popularity of decisions in the past, has been shown to significantly improve the wisdom of crowds [39,40]. We are not aware of any paper that discusses how stubbornness can empirically help or hurt deliberation, nor does our research directly address how stubbornness and/or influence affects the quality of jury decisions.
Our work could also be extended by building more accurate models and better addressing data heterogeneity. Most of the data is statistically significantly different from the model, based on the two-dimensional Kolmogorov-Smirnov test (p-value < 0.1) [41], pointing to a need for more nuanced models to better explain the data. A more fundamental problem in the datasets, however, is heterogeneity: trials vary in complexity and jurors differ across trials, which can affect how decisions are reached. This may be addressed with controlled experiments in which several groups separately deliberate on the same, or very similar, information. Data on how opinions change over time, as well as the time for juries to reach a verdict, can provide tantalizing clues about the underlying mechanism of opinion dynamics.

METHODS
The jury data we study is taken from Multnomah County, Oregon [27], San Francisco County, California [28], Thurston County, Washington, and Douglas County, Nebraska [29,30]. In the CA and OR datasets, both the deliberation time and the final vote, which can affect each other, are known, but the OR dataset, unlike the other datasets, does not record T_trial. The CA dataset bins T_trial in days, but the WA and NE datasets record both hours and days (roughly 4.5 hours per day in court), therefore we convert each trial day in the CA data into 4 hours. We removed all CA data for which we did not have all of the trial time, the deliberation time in hours, and the final vote. Furthermore, we focus on trials in which jurors only vote on one count to simplify our study (this only removes 138 trials total), and the OR dataset only records the most important count if multiple exist [27]. Once cleaned, we have 53 trials for CA 6, 338 trials for CA 8, and 1726 trials for CA 12 out of 6482 total trials. We do not know whether the kept data was unknowingly biased, although the qualitative similarities suggest that any bias should not significantly affect our results (Fig. 1).
We also removed all data where we did not simultaneously know deliberation time and final vote in OR data (only 4 trials were removed; once cleaned, there were 207 trials for OR 6 jury data, and 951 trials for OR 12 jury data). Finally, we removed data where we did not simultaneously know both the trial time and deliberation time in the WA and NE data. This removed 10 trials for the WA dataset, and 21 trials for the NE dataset (in the cleaned data, there were 141 and 135 trials, respectively). All mean confidence intervals in the data come from bootstrapping data 10^4 times.
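The bootstrapped confidence intervals in the mean can be sketched as below. The text does not say which bootstrap variant was used, so the percentile method here is an assumption, as are the function name and defaults; the 90% level matches the error bars described for Fig. 1.

```python
import random


def bootstrap_mean_ci(values, n_boot=10_000, level=0.90, rng=None):
    """Percentile bootstrap confidence interval for the mean (assumed variant)."""
    rng = rng or random.Random()
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]
```

Applied to the deliberation times of all trials sharing a given final vote, this would give one error bar for the mean deliberation time at that vote.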

Fitting The Data
In the OR dataset, some trials are criminal trials, which have different rules about when juries are hung (see Fig. 8 & Tab. 1) [42], therefore we primarily focus on civil trials. In the CA dataset, all trials were civil trials. In the WA and NE datasets, on the other hand, the final vote was not recorded, therefore we did not attempt to model the dynamics.
In the influence model, juror opinions are initially binomially distributed, with each juror having a probability b of an initially for-plaintiff opinion. The parameter b was chosen such that the probability of simulated juries initially having a majority for the plaintiff, plus 1/2 the probability of simulated juries being evenly split, was equal to Pr(V_fp > N/2) in the dataset. This ensured that the final distribution had a similar value for Pr(V_fp > N/2). In addition, we somewhat arbitrarily set the timestep in simulations to be 1 minute, but simulations with significantly smaller or larger timesteps (as small as 15 seconds, or as large as 4 minutes) are not usually statistically significantly different (p-value > 0.1 using the likelihood ratio test [43]). An exception to this rule is CA 12 with T_trial = 6-10 and 11-18 hours, where the timesteps of 15 seconds and 1 minute are not statistically different, but both are preferred over timesteps of 4 minutes (p-values vary between 0.006 and 0.09). Furthermore, q_0 is arbitrarily set to 0.3α, but varying this value between 0.1α and 1.0α similarly produces statistically equivalent fits (p-values > 0.1). We cannot set this value to 0, however, because it would mean juries never stop deliberating when they are evenly split, which is in disagreement with the data.
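The condition that fixes b can be solved numerically. The sketch below is one way to do so, with hypothetical function names: it evaluates the for-plaintiff majority probability plus half the even-split probability for binomially distributed initial opinions, and uses bisection, which works because this quantity increases monotonically with b.

```python
from math import comb


def initial_plaintiff_weight(n_jurors, b):
    """P(initial majority for plaintiff) + 0.5 * P(initially evenly split)."""
    total = 0.0
    for k in range(n_jurors + 1):
        prob = comb(n_jurors, k) * b**k * (1 - b)**(n_jurors - k)
        if 2 * k > n_jurors:
            total += prob
        elif 2 * k == n_jurors:
            total += 0.5 * prob
    return total


def solve_bias(n_jurors, target, tol=1e-10):
    """Bisection for the b at which initial_plaintiff_weight equals target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if initial_plaintiff_weight(n_jurors, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Here `target` stands for the empirical Pr(V_fp > N/2) of a dataset; by symmetry, a target of 0.5 returns b ≈ 0.5.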
To find p, α, µ, and f, we use maximum likelihood estimation, and then use the log-likelihood function to compare the quality of fits. Some outcomes were predicted to have nearly zero probability in the model even though they occurred in the data; we therefore added a small base probability of between 10^{−4} and 10^{−14} to the models, with no significant qualitative changes (all values shown use a base probability of 10^{−11}). Finally, the distributions we used to fit the influence models to the data were created from 1.6 × 10^5 simulations per parameter value. There was an inherent limit in the probability resolution (6 × 10^{−6}), but we do not believe this significantly affects our results. All parameter confidence intervals come from bootstrapping and fitting the data 10^4 times.
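The likelihood comparison can be sketched as follows, assuming (hypothetically) that each trial is reduced to a discrete outcome such as a pair of final vote and binned deliberation time, and that the simulated model distribution is stored as a dictionary of probabilities. The base probability is added to every outcome as described above, and the normalization follows the Fig. 4 convention in which the full model sits at −1.

```python
import math


def log_likelihood(observations, model_prob, base_prob=1e-11):
    """Sum of log model probabilities over observed outcomes.

    observations: iterable of hashable outcomes, e.g. (v_fp, time_bin).
    model_prob: dict mapping outcome -> simulated probability.
    base_prob: small floor so unseen outcomes do not give log(0).
    """
    return sum(math.log(model_prob.get(obs, 0.0) + base_prob)
               for obs in observations)


def normalized_log_likelihood(loglik_model, loglik_full):
    """Normalize by |log L_full|; values below -1 fit worse than the full model."""
    return loglik_model / abs(loglik_full)
```

Maximum likelihood estimation would then amount to searching over the fitting parameters for the simulated distribution that maximizes `log_likelihood` on the observed trials.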

Acknowledgements
Our work is supported by the Army Research Office under contract W911NF-15-1-0142. KB would like to thank Nicholas Pace and Walter Fontana for enlightening discussions.

SUPPORTING INFORMATION (SI)

How Deliberation Time is Affected by the Final Vote
It might not be intuitive why the deliberation time, T_delib, is highest when jurors are near consensus in both the data and the influence model, while T_delib is lower when jurors are evenly split (see Figs. 1 & 2 in the main text). In this section, we present a simple Markov chain model to better understand this finding.
We find that an evenly split final vote, V_fp/N = 1/2, is rare, as seen in Fig. 2a in the main text. This finding is likely related to the influence model becoming the Majority Voter Model (MVM) when α = µ = 0, i.e., when juries do not stop deliberating and there is no jury stubbornness, because V_p(t)/N = 1/2 is known to be unstable past a critical point in the MVM when the influence of neighbors changes from weak (and opinions are evenly split) to strong (and there is near-unanimous agreement) [44,45]. Using this numerical finding, we create a similar, but much simpler, model in which the number of jurors voting for the plaintiff is represented as a node in a Markov chain, and there is a bias for juries to have greater agreement (see Fig. 6).
In the model, juries begin evenly split (V_p(0)/N = 1/2) but can transition to a new state, V_p(1)/N = 1/2 ± 1/N, with probability (1 − s′)/2. Once jurors reach this new state, they can achieve greater consensus, V_p(t)/N = 1/2 ± 2/N, with probability (1 − s′)/2, or stay in the current state. This pattern can continue until juries stop deliberating with probability q′ at each timestep.
Recall that, in the influence model seen in the main text, a juror will choose not to re-evaluate their opinion with probability s, and even if they do re-evaluate, they may choose to keep their original opinion, therefore it is reasonable for self-loops to exist in the Markov chain model. That said, because s is often less than 1, and p > 1/2, it is reasonable to assume that opinions develop stronger pluralities over time, ergo the Markov chain model captures many qualitative features of the influence model.
Starting from time t = 1, we find that the probability a jury is evenly split by the time they stop deliberating at time t is […], which implies that […], and the probability the jury stops deliberating with an opinion V_fp/N = 1/2 + 1/N (or equivalently V_fp/N = 1/2 − 1/N) is […], and the probability over all time is […], where we use ± to emphasize that the probabilities for V_fp/N = 1/2 + 1/N and V_fp/N = 1/2 − 1/N are the same. Using Pr(1/2, t) and Pr(1/2 ± 1/N, t), we can also find the mean deliberation time conditioned on the final vote: […] and […], where ⟨·⟩ is the average. If s′ → 0 (in other words, V_p(t)/N = 1/2 is very unstable), then we find that […] and […]. In comparison, […] and […].

The probability that deliberation stops at V_fp/N = 1/2 is small, but so is the time that this deliberation would subsequently take. In comparison, V_fp/N = 1/2 ± 1/N is more likely, but mean deliberation time is subsequently higher. If we continue to V_fp/N = 1/2 ± 2/N, T_delib is expected to further increase because a minimum number of timesteps is needed to reach that state. In short, the Markov chain model helps explain why T_delib is low when the jury is evenly split, even though the probability for a jury to be evenly split is low as well. Furthermore, the Markov chain model helps explain why deliberation increases with greater consensus, at least until V_fp/N ≈ 0.3 and V_fp/N ≈ 0.6, when quitting rates substantially increase in the influence model, therefore lowering T_delib again.

FIG. 6. A Markov model that qualitatively describes the dynamics of the influence model seen in the main text. We assume states change as a Markov chain, therefore with probability s′ we remain in state V_p(t)/N = 1/2, but transition to V_p(t)/N = 1/2 ± 1/N with probability (1 − s′)/2. Once we are at a new state, we either transition to V_p(t)/N = 1/2 ± 2/N with the same probability or stay in the current state. Finally, with probability q′, juries stop deliberating.
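The qualitative claims of the Markov chain model can be checked numerically. The sketch below propagates the probability mass exactly rather than sampling, under assumptions that are ours and not necessarily those of the original derivation: the + and − branches are merged into a single distance from the even split, the chain is capped at a maximum distance (for example N/2), and within each timestep the state update happens before the stopping check. All names are hypothetical.

```python
def markov_chain_summary(s_prime, q_prime, max_dist, n_steps=5000):
    """Return (stop_prob, mean_time), indexed by distance from an even split."""
    up = (1.0 - s_prime) / 2.0
    alive = [0.0] * (max_dist + 1)  # mass still deliberating in each state
    alive[0] = 1.0                  # juries start evenly split
    stop_prob = [0.0] * (max_dist + 1)
    time_sum = [0.0] * (max_dist + 1)
    for t in range(1, n_steps + 1):
        moved = [0.0] * (max_dist + 1)
        for d, mass in enumerate(alive):
            if d == max_dist:
                leave = 0.0         # cap: cannot move further from the split
            elif d == 0:
                leave = 2.0 * up    # either sign; stays with probability s'
            else:
                leave = up
            moved[d] += mass * (1.0 - leave)
            if d < max_dist:
                moved[d + 1] += mass * leave
        for d in range(max_dist + 1):
            stopped = q_prime * moved[d]  # stop with probability q'
            stop_prob[d] += stopped
            time_sum[d] += t * stopped
        alive = [(1.0 - q_prime) * m for m in moved]
    mean_time = [ts / sp if sp > 0 else float("nan")
                 for ts, sp in zip(time_sum, stop_prob)]
    return stop_prob, mean_time
```

With a small s′, this sketch gives a low stopping probability and a short mean deliberation time at the even split, and longer mean times for states farther from it, in line with the argument above.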

Random Walk Stopping Rate
If we assume that the amount of information jurors accumulate, D, follows a random walk with a reflective boundary at D = 0 (people cannot have negative information), and that the amount of time jurors deliberate scales as T_delib ∼ D, then α^{−1} ∼ D, where α is proportional to the quitting rate in the influence model. To better understand how D affects the dynamics, recall that

$$\Pr(D \mid T) \propto \binom{T+1}{(T+D-1)/2},$$

where T is the number of timesteps. Taking T to be large, using Stirling's formula, and dropping nonleading terms, […]. This immediately implies that D ∼ T^{1/2}. T is not, as of yet, explicitly defined, because T is still the number of timesteps and not an actual time. We can, however, set T ∼ T_trial, and, because T_delib ∼ D, T_delib ∼ T^{1/2} ∼ (T_trial)^{1/2}.
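The square-root scaling can be checked with a direct simulation of a reflected random walk. This is an illustrative sketch with hypothetical names, and it assumes one particular reflection rule (the walker at 0 that steps to −1 is placed at +1, which is the same as tracking the absolute value of an unrestricted walk); other reflection rules change prefactors but not the exponent.

```python
import math
import random


def reflected_walk_endpoint(n_steps, rng):
    """Position after n_steps of a +/-1 walk reflected at 0 (assumed rule)."""
    d = 0
    for _ in range(n_steps):
        d = abs(d + rng.choice((-1, 1)))
    return d


def mean_endpoint(n_steps, n_walks, rng):
    """Average endpoint D over many independent walks of length n_steps."""
    return sum(reflected_walk_endpoint(n_steps, rng)
               for _ in range(n_walks)) / n_walks


def scaling_exponent(t_short, t_long, n_walks, rng):
    """Estimate x in D ~ T^x from the mean endpoints at two walk lengths."""
    ratio = mean_endpoint(t_long, n_walks, rng) / mean_endpoint(t_short, n_walks, rng)
    return math.log(ratio) / math.log(t_long / t_short)
```

Calling `scaling_exponent(100, 400, 2000, random.Random(0))` should return a value near 1/2, up to sampling noise.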

Alternative Jury Models
We mention in the main text that removing all hung conditions in the herding and stubbornness model will produce a poorer fit (see Fig. 4 in the main text and Fig. 7). To better understand why this is the case, we separately remove the dependence of the quitting rate, q, and stubbornness rate, µ_eff(t), on whether the jury is hung. In the former case, we see a small change in the likelihood function, but in the latter case, the likelihood function has a more significant drop. This suggests that jurors depend more on changing their stubbornness rate than changing their quitting rate when they avoid hanging.

Oregon Criminal Cases
In this section, we compare data and fits for criminal and civil cases in Oregon. We separate the data both because the requirements for a verdict are different (ten out of twelve jurors are needed to agree instead of nine out of twelve, although five out of six still need to agree in six-person juries) and because the motivations for reaching a decision may be different. Overall, we find quantitatively similar findings between criminal and civil cases.
First, we compare OR 6 and OR 12 attributes seen in Fig. 2 of the main text (Fig. 8). We find that $T_{\rm delib}$ is higher for OR 6 criminal cases compared to civil cases, but the trend is not as clear for the OR 12 cases (Fig. 8a). That said, in all cases we see that $T_{\rm delib}$ is higher when there is greater disagreement among jurors. We also find that juries are commonly found to reach a verdict and hung juries are rare (Fig. 8b). Finally, we see that $\Pr(T_{\rm delib})$ is almost exactly the same for both civil and criminal cases.
FIG. 7. A comparison between the normalized log-likelihood functions of three model variants. −1 corresponds to the full model, and the more negative the log-likelihood, the worse the fit. The no hung conditions model creates consistently poorer fits than the full model, but if $q$ is independent of whether the jury is hung and $\mu_{\rm eff}(t)$ is Eq. (2) in the main text ("No hung stop"), the fit is very close to the full model, although this assumes, unnaturally, that juries with evenly split verdicts never stop deliberating. Instead, when we let $\mu_{\rm eff}(t) = \mu$ ("No hung stubbornness"), the fit is roughly as poor as the model without any hung conditions.
Next, we compare the fits for civil and criminal cases. Overall, we find that civil and criminal cases fit each model similarly well (Fig. 9). For example, the one-mode and two-mode null models give some of the worst fits, and the two-timescale model is the best null model, although it is still worse than the full influence model. Furthermore, removing either herding or stubbornness from the full influence model produces a much worse fit, while removing the vote dependence or hung conditions has a much smaller effect. We also see the same qualitative trends when we separately remove the dependence of the stubbornness rate or quitting rate on whether the jury is hung (Fig. 10). In both the criminal and civil cases, removing the stubbornness rate's dependence on whether a jury is hung creates a significantly worse fit compared to removing the quitting rate's dependence.
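The model comparison relies on log-likelihoods normalized by $|\log(L_{\rm full})|$, so that the full model sits at −1 and worse fits fall below it. A minimal sketch of that normalization follows; the function name, the dictionary layout, and the `"full"` key are assumptions made for illustration.

```python
def normalize_log_likelihoods(log_likelihoods, full_model="full"):
    """Divide every model's log-likelihood by |log L| of the full model.

    log_likelihoods: dict mapping a model name to its log-likelihood.
    With negative log-likelihoods, the full model maps to -1, and values
    below -1 indicate a worse fit than the full model.
    """
    scale = abs(log_likelihoods[full_model])
    if scale == 0:
        raise ValueError("log-likelihood of the full model must be nonzero")
    return {name: value / scale for name, value in log_likelihoods.items()}


# Hypothetical values, chosen only to show the arithmetic (not from the data).
example = normalize_log_likelihoods({"full": -1000.0, "no herding": -1250.0})
```

Because every model is divided by the same positive constant, the ordering of models is unchanged; the normalization only makes different datasets comparable on one axis.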
where φ sets the thresholds between which juries hang (φ = 0.75 for civil cases and 0.833 for criminal cases) [27,31]. (c) The complementary cumulative distribution of deliberation times is heavy-tailed across datasets. Data are taken from [27] and error bars represent 90% confidence intervals in the mean.

Correlations Between Jury Attributes
In this section, we discuss correlations between various attributes in order to better understand how to model jury dynamics. First, we look at how the jury size affects the deliberation time (Fig. 11), and notice very little correlation between the two. This contrasts with many models of opinion dynamics in which deliberation time strongly correlates with system size [32,33]. Next, we compare how the trial time depends on the final vote (Fig. 12). Interestingly, although both the trial time and the final vote strongly affect the deliberation time (Fig. 2 in the main text), the two are not strongly correlated with each other. We use this property to find separate mechanisms for the correlation between each attribute and deliberation time.
Finally, we plot the probability that a typical juror will vote for the plaintiff (or vote guilty in criminal cases), $\Pr(\text{Outlier For Plaintiff})$, versus the vote of all the other jurors for OR 12 (Fig. 13). We find a strong correlation between the two in both civil and criminal cases. Juror opinions are therefore not independent, which gives strong evidence that herding may exist in juries. We find $\Pr(\text{Outlier For Plaintiff})$ by determining how many trials end with verdict $V^f_p = V^f_{p,N-1} + 1$, corresponding to the outlier juror voting for the plaintiff, and how many trials end with $V^f_p = V^f_{p,N-1}$, corresponding to the outlier juror voting for the defendant. The probability, $\Pr(\text{Outlier For Plaintiff})$, is then computed from these two counts.
FIG. 9. Comparison of Models for OR 6 and OR 12 civil and criminal cases. Normalized log-likelihood functions for the null models and the influence model variants to illustrate comparison with the full influence model. For each dataset indicated in the legend, log-likelihood functions for these models were normalized by $|\log(L_{\rm full})|$, the log-likelihood function of the full model, therefore models above −1 explain the data better than the full influence model, while those below −1 explain the data worse. (a) The relative fit of the one-mode, two-mode, and two-timescale null models, along with the "no herding" model, in which $p = 0.5$, "no stubbornness" model with $\mu = 0$, "no vote dependence" model, in which the model dynamics do not depend on the number of jurors voting for the plaintiff (or voting guilty in criminal cases), and the "no hung conditions" model, in which jury dynamics do not depend on whether the jury is currently hung. (b) In a zoomed-in graph, the influence model variants seen in (a) perform worse than the full model.
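The outlier-juror estimate described in this section can be sketched in a few lines. This is only an illustration under a loud assumption: the probability is taken to be the plain fraction of the two trial counts, with no additional weighting, which may differ from the exact expression used for Fig. 13. The function and argument names are hypothetical.

```python
from collections import Counter


def outlier_for_plaintiff(final_votes, others_for_plaintiff):
    """Estimate Pr(Outlier For Plaintiff) given the other N-1 jurors' vote.

    final_votes: final plaintiff (or guilty) vote counts, one per trial,
        all from juries of the same size N.
    others_for_plaintiff: number of the other N-1 jurors voting for the plaintiff.
    Returns None when no trial matches either outcome.
    """
    counts = Counter(final_votes)
    # Trials where the outlier joins the plaintiff side: final vote is one higher.
    n_plaintiff = counts[others_for_plaintiff + 1]
    # Trials where the outlier sides with the defendant: final vote is unchanged.
    n_defendant = counts[others_for_plaintiff]
    total = n_plaintiff + n_defendant
    if total == 0:
        return None
    # Assumed estimator: simple ratio of the two counts.
    return n_plaintiff / total
```

If jurors voted independently, this estimate would not depend on `others_for_plaintiff`; a strong dependence on the other jurors' vote is the signature of correlated opinions discussed above.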