Reducing Partisanship in Judicial Elections Can Improve Judge Quality: Evidence from U.S. State Supreme Courts

Should technocratic public officials be selected through politics or by merit? This paper explores how selection procedures influence the quality of selected officials in the context of U.S. state supreme courts for the years 1947-1994. In a unique set of natural experiments, state governments enacted a variety of reforms making judicial elections less partisan and establishing merit-based procedures that delegate selection to experts. We compare post-reform judges to pre-reform judges in their work quality, measured by forward citations to their opinions. In this setting we can hold constant contemporaneous incentives and the portfolio of cases, allowing us to produce causal estimates under an identification assumption of parallel trends in quality by judge starting year. We find that judges selected by nonpartisan processes (nonpartisan elections or technocratic merit commissions) produce higher-quality work than judges selected by partisan elections. These results are consistent with a representative voter model in which better technocrats are selected when the process has less partisan bias or better information regarding candidate ability.


Introduction
The decisions of public officials have a large impact upon our daily lives, yet they typically face weak incentives for good performance. In the case of judges, weak incentives are a design feature justified by the need to have them issue unbiased decisions in the public interest. 1 This is particularly important for appellate courts in common-law systems, such as the United States, where judicial decisions have the power of law. 2 In the absence of explicit pecuniary rewards, what motivates good performance for public officials? One possibility is the intrinsic reward from doing a good job, which may arise when organizations have a clear "mission" that can motivate their members (Wilson, 1989). Dewatripont et al. (1999) study the design of rewards in the presence of mission, and in particular highlight the role of "professionalism" in encouraging task-specific skills, which in turn increase the professional's intrinsic incentive for work (see White (1959) and Wilensky (1964)). 3 These ideas highlight the importance of selecting the right type of public official who values work quality, and providing incentives that do not interfere with these values. Tabellini (2007, 2008) analyze decision quality under elections or bureaucracy. Tenured "bureaucrats" are preferred for technical tasks, where organizations can evolve a mission, and for protecting the rights of minorities. In contrast, elected "politicians" are more sensitive to outcomes, and to the preferences of the median voter. The choice of one or the other system is a practical question that depends upon the empirical magnitude of these factors. 4
The purpose of this paper is to estimate the effect upon performance of changing the appointment procedure for state appellate courts. We do this by exploiting the changes in the way state supreme court judges are selected in the United States over the period 1947-1994. These courts serve as the state judiciary's analogue to the U.S. Supreme Court and have the authority to review laws produced by state legislatures and decisions produced by lower state courts. Thus state supreme court judges are some of the most powerful officials in state government. These judges are the last appeal on some of the most important features of common law, including most rules on contract, property, tort, and crimes.
The identification of the effect of appointment procedure relies on two ingredients. First, over this period of time U.S. states experimented with different methods of appointing judges. Second, the job of a supreme court judge does not change much over the course of the career, and it does not vary across states. The main component of the job is to write an opinion explaining the court's decision. This involves working (mainly with clerks, and not with other judges) to research the relevant case law, reason through the implications for present and future litigants, and introduce a precedent that future judges in the state must follow. 5 Judges do not write all the text of their opinions; however, they are responsible for each decision they sign. The quality of judges depends not only upon their native skills as writers of legal arguments, but also upon their skill in hiring and managing the law clerks who assist in this activity. Thus, if we find that some judges perform better than others, then we are providing some direct evidence on the importance of management for performance. 6
We construct two classes of performance measures. The first of these is "output". This includes number of cases heard, total words written, and citations to other cases. These roughly measure time on the job. The second class of performance measures corresponds to "quality". Given that the job of an appellate judge is to clarify and in some cases create new law, legal scholars measure quality by the frequency with which the decisions of a judge are cited in future cases, both within-state and out-of-state. 7
While the task of a judge is similar from state to state, the rules for selecting and retaining appellate judges vary across states and over time. These rules are listed in Table 1, with rule changes indicated by cell borders. A graphical representation of these changes is included in
choice of institution (removable "politician" or unaccountable "judge"); election pressure can cause officials to modify their decisions to reflect the interests of the electorate.
5 See e.g. Hanssen (1999, 2004); Helland and Tabarrok (2002); Hall and Bonneau (2006); Hall (2007); Kritzer (2011, 2015) for details on the operation of these courts.
6 Bloom and Van Reenen (2007) document that managers do matter for the performance of private firms.
7 See Posner (2008) for a seminal discussion of the work of judges and what we mean by quality. Choi et al. (2010) discuss in detail why citations are used to measure judicial performance.
In this study we are concerned with three types of appointment systems: partisan elections, non-partisan elections and appointment by the state governor. The first system, partisan elections, is used for both selection of new judges and retention of incumbent judges. For these elections, judges represent a political party, Republican or Democrat, that is clearly identified on the ballot. They must win a primary election for their party before running in a general election. Incumbent judges rarely face a credible challenge in the primary, but in the general election they usually face a challenger from the opposing political party.
The second system, nonpartisan elections, is also used for both selection and retention. In this system there are competitive elections, but there are no primaries and party affiliations are not on the ballot. There are generally two candidates, an incumbent and a challenger, but the incumbent is not identified as such.
The third major system is merit selection with uncontested retention elections, also known as the Missouri Plan. In this system, judges are nominated by a commission of experts (senior attorneys and retired judges) and confirmed by the governor. Incumbent judges face an up-or-down retention vote with no challenger. This system is designed to be more meritocratic, and to impose weaker political incentives, than electoral selection. 8 There is an extensive discussion in the political science literature of the reasons for these different systems, and the effect they have upon judicial decisions, that we discuss briefly in the next section. This work is concerned with the politics of decision-making, and with when judges make decisions that pander to one political party or the other. Our concern here is with the labor economics of this particular market for experts. By an expert we mean any person who is engaged to carry out tasks that the employer is not qualified to do. Doctors, lawyers or financial advisors are all experts whose demand for services depends upon consumers buying (or in our case voting) with their choices. The question is whether or not the non-expert decision maker is able to distinguish the quality of the expert.
In our context, the question we are interested in is whether the appointment system positively selects judges for their ability. In our earlier work, Ash and MacLeod (2015), we found that on average appellate court judges do have a taste for quality. We were able to show this by exploiting changes in their workload, which in turn led them to write longer and more influential opinions. In the case of elections, we cannot expect voters on average to be well informed regarding the quality of a judge. Moreover, since we do not have data on the output of all potential candidates, we cannot measure the quality difference between winners and losers. Rather, our approach is to ask the practical question of whether a state's choice to change from, say, a partisan election system to a non-partisan election system leads to a change in the quality of the court. Choi et al. (2010) look at this question using data on the opinions written by all state appellate court judges over a period of three years (1998-2000). They find that in the cross section states with appointed judges have higher quality decisions, but produce fewer decisions. These results are consistent with our results in Ash and MacLeod (2015) suggesting that judges face a trade-off between quality and output. Given that these results are from a cross-section, they do not answer the question of what will happen when a given state changes its appointment system.
8 There is a fourth hybrid system, where judges are initially selected through partisan elections but thereafter face uncontested retention elections. California has governor appointment but uncontested retention elections. The other states either have some combination of governor or legislative appointment, both for initial selection and for periodic retention. In Massachusetts, New Hampshire, New Jersey, and Rhode Island, judges have lifelong tenure. In Ohio and Michigan, judicial elections are difficult to classify within the partisan/nonpartisan dichotomy because they have partisan primaries and nomination processes, but the political party is not on the ballot in general elections. Following Nelson et al. (2013), we classify these states as partisan selection and nonpartisan retention. Alternative codings, or leaving them out of the analysis, do not change our results.
We have constructed a data set of all state supreme court decisions from 1947 until 1994 that is matched to a number of judge characteristics and state rules on judicial appointment. The data on decisions was provided by Bloomberg Law, which includes the number of (positive) citations to each decision. With this data we can construct a number of measures of performance and output. We find that states that move from partisan elections to either nonpartisan elections or technocratic merit commissions choose judges who produce higher-quality work. We also find evidence that judges do care about re-election and put effort into gaining re-election. This follows from the observation that in an election year a judge's work output falls. Interestingly, if a judge is up for re-election under a partisan system we also observe a fall in work quality. Moving from nonpartisan elections to uncontested re-elections increases work quality for incumbent judges, while there is no effect on incumbent judge performance when moving from a partisan to a nonpartisan re-election system, or when moving from partisan to uncontested re-election. These results are consistent with the hypothesis that partisanship affects judge performance through the selection of judges that have a lower intrinsic value for quality.
The agenda of the paper is as follows. In the next section we briefly review the relevant literature. Section 3 discusses the role of information in the judicial selection process. The data and performance measures are discussed in Section 4, followed by a discussion of our identification strategy and empirical analysis. Section 6 concludes the paper. The Appendix contains results from a number of robustness checks.

Literature
The literature can be roughly divided into two distinct questions. In economics there is a literature that is concerned with understanding the quality of civil servants, and how different appointment systems affect this quality. In our case these are judges, but similar issues apply to any individual who is chosen to make complex decisions on our behalf. For the most part voters do not have the skill set to carry out the choices that the appointed individual is carrying out, yet they must decide who they should support for appointment.
There is a literature in economics showing that electoral pressures do affect decision making. Besley and Case (1995) use the variation in term limits to show that state governors do respond to electoral pressure and reduce state spending where there is a binding term limit. Besley and Coate (2003) show that electricity prices are lower in states with elected (versus appointed) utility regulators, consistent with elected regulators being more responsive to consumers. List and Sturm (2006) show that electoral incentives drive decision-making by governors on environmental issues. Ferraz and Finan (2008) find that voters respond to information on corruption, and therefore serve to select out the worst politicians. Ferraz and Finan (2011) find that mayors who are term-limited (and therefore face no re-election incentives) are more corrupt, consistent with an incentive effect against corruption. Pande (2011) reviews the literature in development economics on political accountability and finds that voters respond to information about the political process and that politician performance improves electoral accountability. In particular, the quality of information regarding candidates plays a central role in explaining the quality of elected politicians.
Like Choi et al. (2010) discussed above, Lim and Snyder (2015a) use the cross-section variation in election systems to study the relationship with various measures of trial judge quality in a sample of 39 states from 1990-2010. They find that voters are strongly influenced by party cues in partisan elections, but add the caveat that trial court judges are less important than appellate court judges in making law, which might result in less voter scrutiny.
There is also an important and voluminous literature documenting the significant impacts of political motivations on decision-making. Hall and Bonneau (2006) look at the effect of a challenger upon the success of an incumbent judge using data from 21 states from 1990 to 2000. They find that voters do respond to challenger quality. There are also a number of papers that focus upon the politics of decision making. Specifically, does the appointment system affect the values that judges use to evaluate cases? A key finding is that stronger electoral pressures are associated with harsher criminal sentences (Huber and Gordon, 2004; Gordon and Huber, 2007; Lim, 2013; Berdejo and Yuchtman, 2013; Iaryczower et al., 2013; Canes-Wrone et al., 2014; Park, 2014; Lim et al., 2015). Looking at federal judges, Epstein et al. (2013) review the evidence that decisions tend to reflect the ideological leanings of the president that appointed them. Besley and Payne (2013) find that elected judges are more likely to support anti-discrimination law. Shepherd (2009) and Canes-Wrone et al. (2010) find that retention elections cause state appellate judges to vary their decisions in subsets of politically sensitive cases. Fouirnaies and Hall (2018) show that U.S. state legislators who are term-limited decrease effort, in terms of sponsoring bills, acting on committees, and being present for floor votes.
Finally, there is a literature that looks at information processing by voters. Early work by Rahn (1993) shows that voters use partisan identifiers as a decision-making shortcut. Using a combination of survey and experimental data, Bonneau and Cann (2015) and Lim and Snyder (2015a) also find that partisan identification plays an important role in voter decision making. Recently, Kirkland and Coppock (2017) provide some direct experimental evidence that if voters cannot observe partisan labels then they put more weight upon observable judge characteristics that reflect quality. In the next section we discuss how such a mechanism may lead to the quality of judges varying with the type of system used to appoint them.

Mechanisms
In our data we observe the change in appointment system and link these changes to measures of judge performance. The literature shows that when moving from a partisan to a nonpartisan system voters are likely to use more information regarding judge ability. In this section we show formally how this may lead to variation in the quality of selected judges. The theoretical literature on voting in political science is well-developed and sophisticated (see Ashworth and de Mesquita (2008) and Ashworth et al. (2017) for comprehensive reviews of this literature). However, empirically we only look at the most basic features of the environment: the division of appointment systems into three crude categories, and the effects that these have upon basic performance measures.
The relevant question is: How do these appointment systems change the performance of judges? To answer this question, suppose voters are faced with the choice between judge A and judge B. We follow Condorcet (1785) and view voting as an information revelation problem. The voters choose based upon some signal $s_j$ of a latent factor, $q_j$. It is also the case that voters care about the political views of judges, and hence they may be biased for one judge over another. Let $b_j$ denote the bias term. Hence we suppose: It is possible that the election system can affect the pool, both in terms of the levels and variance. One can model this by setting: where $m$ is the mean ability in the electoral pool, and $\delta$ is a measure of the variance. Note that without loss of generality judge A is assumed to have higher ability. Given that the elections are relative races, then like other parameters, bias can be viewed as a single parameter in favor of Judge A: This allows for nice closed-form solutions for the expected ability. Moreover, it is possible that the parameters vary with the appointment system and future behavior of the judge. At the time of appointment, the parameters are assumed to take on their expected values. Under the hypothesis that there is a common latent factor, and that voters prefer higher $q$ to lower $q$, then we may suppose that there is a representative voter who chooses the judge with the highest value (see Rothstein (1991); Saporiti and Tohmé (2006)). Thus, the judge selected in period $\tau$ is given by: The expected quality of the judge chosen in period $\tau$ is given by: where $\sigma^2 = \sigma_A^2 + \sigma_B^2$ is the variance of $\gamma_B - \gamma_A$, and $F()$ is the cumulative distribution function for the Normal distribution.
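To fix ideas, the setup described in this paragraph can be written out explicitly. What follows is a sketch under stated assumptions, not a definitive statement of the model: the additive form of the signal, independent Normal noise terms $\gamma_j$ with variances $\sigma_j^2$, and the symmetric parametrization with $q_A - q_B = \delta$ around the pool mean $m$ are all inferred from the definitions in the text.

```latex
\begin{align*}
s_j &= q_j + b_j + \gamma_j, \qquad \gamma_j \sim N(0, \sigma_j^2), \quad j \in \{A, B\}, \\
q_A &= m + \tfrac{\delta}{2}, \qquad q_B = m - \tfrac{\delta}{2}, \qquad \delta \ge 0, \\
b &= b_A - b_B, \\
j_\tau &= \arg\max_{j \in \{A, B\}} s_j, \\
\Pr(j_\tau = A) &= \Pr(\gamma_B - \gamma_A < \delta + b) = F\!\left(\frac{\delta + b}{\sigma}\right), \\
E[q_\tau] &= q_B + (q_A - q_B)\,\Pr(j_\tau = A) = m - \tfrac{\delta}{2} + \delta\, F\!\left(\frac{\delta + b}{\sigma}\right).
\end{align*}
```

Under these assumptions the better judge wins with probability above one half exactly when $\delta + b > 0$, which is the condition that governs the incentive and selection results discussed in the following subsections.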

Judicial Effort
Judges have the potential to affect these signals through two avenues. One is to provide more information regarding themselves so that the voters can more accurately assess their quality. Second, they can pander to the voters regarding their political views, and thus affect the bias in their favor. Let the effort of judge $j$ be given by $e_j = \{e_j^I, e_j^P\}$, and total effort be given by $e = \{e_A, e_B\}$, where $I$ indicates providing information (not disinformation!) and $P$ refers to pandering.
Suppose that bias is given by: where $P() \geq 0$ is a bounded, differentiable, strictly concave "pandering" function that satisfies $P(0) = 0$. It is also assumed that variance $\sigma_j^2(e_j^I)$ is a concave, differentiable, strictly decreasing function of effort that is bounded below by zero. Let $\sigma^2(e^I) = \sigma_A^2(e_A^I) + \sigma_B^2(e_B^I)$ be the variance of the difference in error terms as a function of judge investment into information transmission. The cost of effort for Judge $j$ is given by $C(e_j^I + e_j^P)$, where it is assumed that $C(0) = 0$, $C(e) > 0$ and there is an $\bar{e}$ such that $\lim_{e \to \bar{e}} C(e) = \infty$. Suppose that the utility that judges get from obtaining office is $U_j^* > 0$. Then the payoffs of the judges are: Notice that Judge A has an incentive to invest in providing information if and only if $\delta + b > 0$, while Judge B will invest if and only if $\delta + b < 0$. Thus, only one of the two judges has an incentive to invest in providing information, depending upon whether or not the bias is sufficiently strong to overcome the quality differences.
In contrast, both judges have an incentive to invest in manipulating the bias. Consider the case where the bias is such that Judge A chooses to provide information to voters ($e_A^{I*} > 0$). In that case the first order conditions for an effort equilibrium are given by Notice that the Normal probability density function, $f()$, is symmetric and hence the term in front of costs $C$ is the same in both expressions. Suppose that the gain from winning is the same for both judges ($U_A^* = U_B^*$), and Judge A chooses $e_A^I > 0$, then it must be the case that the level of pandering by Judge B is more than by Judge A ($e_B^{P*} > e_A^{P*}$). In other words, the less-skilled judge does not invest in providing information on his performance, but panders more to voters than the more highly skilled Judge A. This effect is reversed if the bias is sufficient to make Judge B more likely to win.
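The effort game can be sketched in symbols as follows. This is an illustration under assumptions rather than the definitive form of the payoffs: it assumes that pandering by Judge A raises the bias and pandering by Judge B lowers it around a baseline $\bar{b}$, that Judge A wins with probability $F(z)$ where $z = (\delta + b)/\sigma$ and $\sigma$ is the square root of $\sigma^2(e^I)$, and that the cost function is differentiable with $C'$ increasing.

```latex
\begin{align*}
b &= \bar{b} + P(e_A^P) - P(e_B^P), \qquad z \equiv \frac{\delta + b}{\sigma}, \\
U_A &= U_A^* \, F(z) - C(e_A^I + e_A^P), \\
U_B &= U_B^* \, \bigl[1 - F(z)\bigr] - C(e_B^I + e_B^P), \\
\text{pandering, } A:\quad & \frac{U_A^* f(z)}{\sigma}\, P'(e_A^{P*}) = C'(e_A^{I*} + e_A^{P*}), \\
\text{pandering, } B:\quad & \frac{U_B^* f(z)}{\sigma}\, P'(e_B^{P*}) = C'(e_B^{P*}), \\
\text{information, } A:\quad & -\frac{U_A^* f(z)\,(\delta + b)}{2\sigma^{3}}\, \frac{d\sigma_A^2}{de_A^I} = C'(e_A^{I*} + e_A^{P*}).
\end{align*}
```

With $U_A^* = U_B^*$, dividing the two pandering conditions gives $P'(e_A^{P*}) / P'(e_B^{P*}) = C'(e_A^{I*} + e_A^{P*}) / C'(e_B^{P*})$. If $e_A^{P*} \geq e_B^{P*}$ held with $e_A^{I*} > 0$, the right-hand side would exceed one while concavity of $P$ would keep the left-hand side at or below one, so under these assumptions $e_B^{P*} > e_A^{P*}$. The information condition has a positive left-hand side only when $\delta + b > 0$, since $\sigma_A^2$ is decreasing in effort.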

Judicial Selection
Let us now consider how election success is affected by the parameters that are exogenous from the perspective of voters. Suppose that they have rational expectations regarding the parameters of the system. In general we cannot expect voters to have a great deal of high-quality information regarding candidates, though as we discussed above there is evidence that voters do have some information regarding judges. Given that both judges have an incentive to engage in pandering, to a first order we can assume that it cancels out, and thus we can take the baseline level of bias, $\bar{b}$, as given.
Given this and the equilibrium variance of signal quality derived from the previous section, we have that the average quality of an elected judge in period $\tau$ is given by: immediately get: namely, an increase in mean quality of the pool increases expected quality, all else held fixed. Given that Judge A is assumed to be better, then bias in favor of Judge A increases the probability that he/she wins, and hence increases average quality. Hence, the effect (3.5) is of practical use if and only if we can tell ex ante who is the better judge. Next, notice that the effect of the two variance measures ($\delta$ and $\sigma^2$) also depends upon the bias term. The positive bias case occurs when $\delta + \bar{b} > 0$. This ensures that the probability of the better judge (Judge A) winning is greater than 1/2, in which case we have: Ex ante, the higher quality judge is not known by the voters, and hence it is more useful to think about bias in terms of its strength. When the size of the bias is less than the variance of the pool, $|b| < |\delta|$, we call this the weak bias case, which also implies (3.6-3.7).
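The sign patterns discussed in this subsection, including the adverse-bias case where $\delta + b < 0$, can be checked numerically. The sketch below assumes the closed form $E[q] = m - \delta/2 + \delta F((\delta + b)/\sigma)$, which is inferred from the definitions in the text rather than quoted from it. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal: .cdf plays the role of F


def expected_quality(m, delta, bias, sigma):
    """Assumed closed form: E[q] = m - delta/2 + delta * F((delta + bias) / sigma)."""
    return m - delta / 2 + delta * STD_NORMAL.cdf((delta + bias) / sigma)


def slope(param, h=1e-6, **params):
    """Central finite-difference derivative of expected_quality in one parameter."""
    up = dict(params)
    down = dict(params)
    up[param] += h
    down[param] -= h
    return (expected_quality(**up) - expected_quality(**down)) / (2 * h)


# Hypothetical parameter values, chosen only to illustrate the sign patterns.
cases = {
    "positive bias (delta + b > 0)": dict(m=0.0, delta=1.0, bias=0.2, sigma=1.0),
    "mild adverse bias (delta + b < 0)": dict(m=0.0, delta=1.0, bias=-1.1, sigma=1.0),
    "strong adverse bias (delta + b < 0)": dict(m=0.0, delta=1.0, bias=-3.0, sigma=1.0),
}

for label, params in cases.items():
    slopes = {name: round(slope(name, **params), 3) for name in params}
    print(label, slopes)
```

In this sketch, expected quality rises one-for-one with $m$ and rises with bias toward the better judge in every case. More signal noise lowers quality when $\delta + b > 0$ and raises it when $\delta + b < 0$. The slope in $\delta$ is positive in the first two cases and negative in the third, which illustrates why its sign depends on the relative magnitudes of the bias and the noise.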
In the strong bias case, the result depends upon the sign. If the bias is in favor of the better candidate, then we get the above comparative statics. However, if $\delta + b < 0$, then we get: (3.8) The effect of $\delta$ is ambiguous in this case and depends upon the relative magnitudes of the bias and the variance.

Application to Empirical Context
"If the state has a problem with judicial impartiality, it is largely one the state brought upon itself by continuing the practice of popularly electing judges."
Our goal is to explore the validity of this claim by exploring the natural experiment that arises when states change the system they use for appointing judges. Consider first a non-partisan system. In such a system judges are not identified with any political party, and hence a priori the voters may not know where they stand on certain issues that are associated with parties. In the context of our model, this is captured by setting the bias term $\bar{b} = 0$. This also implies that in an election year a non-partisan judge may have a greater incentive to pander to voters than under a partisan system, to let them know where he stands on sensitive issues. Hence, compared to a partisan system, we might expect nonpartisan judges to spend more time in election year politics, which in turn may be reflected in less effort on the bench. In our data we can observe election year performance, and we can see how it affects both total output and the quality of decision making. In the case of appointed judges, their reappointment is not determined by voters and hence we would not expect to see any performance fall during years they are reappointed. Thus, the null hypothesis is that in an election year judges work at the same level as in a non-election year. We expect this null to be rejected in the case of partisan and non-partisan elections, with possibly a larger effect in a non-partisan election.
In terms of the average quality of judges the case is far from clear. As we discussed above, there is evidence to support the hypothesis that voters are influenced by the political party. However, this does not necessarily imply less competition. For example, if party A is more favored in a state, then individuals who want to be judges would know that they should compete in A's primaries (there are cases of judges changing political affiliation for this purpose). See Hirano and Snyder Jr (2014) for some evidence on this point. In the case of a merit panel we can suppose that they have better quality information. Thus, again the null hypothesis is that the appointment system has no effect upon the quality of judges appointed. The theory suggests two alternatives to the null hypothesis:
1. Suppose that both candidates are drawn from the same distribution. In partisan elections voters have less incentive to gather information relative to the non-partisan system, and hence we expect judge quality to be highest under a merit system, followed by a non-partisan election system, and finally the partisan system would have the lowest quality judges.
2. Suppose that political parties use superior information to choose candidates, and the distribution of judges under a merit system is the same as the distribution of a randomly chosen judge under the non-partisan system. Then the quality of judges is highest under the partisan system, followed by the non-partisan system, and the merit system would have the lowest quality.
The second result follows from (3.1). Namely, if appointment systems are not competitive in the sense that the distribution of judges they select is the same as a single judge presenting themselves for office, then election systems introduce an element of competition that improves quality. Whether or not a partisan system is superior to a non-partisan system depends upon the quality of the primary system. From the results in Ash and MacLeod (2015) we know that judges face a trade-off between output and quality. Those results follow from exogenous shocks to workload generated by the addition of intermediate appellate courts, and hence identified separately from the appointment system. Left to their own devices we found that judges, like many academics, prefer quality over quantity. By this, we mean they seem to prefer to write longer opinions that are cited by their colleagues in the future. By measuring how the appointment system affects output and quality separately we should also be able to uncover some information regarding voter preferences over these different performance measures.

Data Overview
The data-set used for the empirical analysis is an update and extension of that used in Ash and MacLeod (2015). It merges information on judge biographies, state-level court institutions, and published judicial opinions. For this paper, the dataset was checked and rebuilt, with more information relevant to elections added or updated. These data allow panel estimates on the effects of court institutions on judge performance.
There are 1,628 state supreme court judges in our data. Table 2 reports summary statistics on the characteristics of judges working in one of the three selection systems discussed in the introduction. For many of the variables, the systems are comparable. Relative to the partisan judges, the nonpartisan and merit judges are more likely to be female. Merit judges are the most likely to have judicial experience, while partisan judges are the most likely to have political experience. Nonpartisan and merit judges have longer career lengths. Merit judges are the least likely to lose re-election.
Our performance measures were constructed from published state supreme court opinions

Measuring Judge Performance
An important step in this research is to provide an effective measure of judge performance. We focus on two simple metrics for judge performance, work output and work quality. The measures build off of work by previous researchers, in particular Choi et al. (2010) and Epstein et al. (2013). The baseline measure of work output is the total number of words written by a judge in opinions during a year on the job. This is a measure of the total volume of opinion-writing work that a judge is responsible for in that year. As alternatives to assess robustness, we look at number of sentences written and number of characters written.
Work quality is measured by the number of citations to a judge's opinions. Judges in a common-law system cite previous cases that are useful to their decision, and therefore citations can be seen as an expert evaluation of peer decision quality (Posner, 2008). More citations mean that a case (and the authoring judge) has a stronger influence on the path of the law.
The citations measure is per case (divided by number of cases), so it is workload-adjusted. Citations are annotated as positive, negative, or distinguishing by the data provider, so for the baseline we look only at positive citations. As alternative measures, we use all cites (including negative and distinguishing), discussion cites (where the case was discussed at length by the citing court), and out-of-state cites (only citations in other jurisdictions). Because state supreme court precedents are not binding in other states, out-of-state citations serve as an especially strong signal of legal usefulness or influence (Choi et al., 2010).
To check for the importance of caseload changes, we report the number of opinions written as an outcome. To help assess the relative importance of output and quality, we also report a measure of work impact -the total number of positive citations to a judge in a year (unadjusted for number of opinions). The appendix includes a range of other outcomes, including measures of caselaw research and number of discretionary opinions written.
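The judge-year measures described in this subsection can be illustrated with a short sketch. The records below are hypothetical, and the field layout (judge, year, words, positive citations) is an assumption made for illustration, not the schema of the underlying data.

```python
from collections import defaultdict

# Hypothetical opinion records: (judge, year, words_written, positive_cites).
opinions = [
    ("judge_a", 1980, 2500, 12),
    ("judge_a", 1980, 1500, 4),
    ("judge_b", 1980, 4000, 6),
]


def judge_year_measures(records):
    """Aggregate opinion-level records into judge-year performance measures."""
    totals = defaultdict(lambda: {"opinions": 0, "words": 0, "cites": 0})
    for judge, year, words, cites in records:
        cell = totals[(judge, year)]
        cell["opinions"] += 1
        cell["words"] += words
        cell["cites"] += cites
    return {
        key: {
            "opinions": c["opinions"],                   # caseload check
            "output_words": c["words"],                  # work output
            "quality_cites_per_case": c["cites"] / c["opinions"],  # work quality
            "impact_total_cites": c["cites"],            # work impact (unadjusted)
        }
        for key, c in totals.items()
    }


measures = judge_year_measures(opinions)
for key, row in sorted(measures.items()):
    print(key, row)
```

In this toy example both judges write the same number of words, but judge_a spreads them over two opinions and collects more positive citations per case, so the output measure ties while the quality and impact measures differ.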
This approach faces the challenge that the difficulty and importance of a ruling varies from case to case, for reasons outside a judge's control. This problem is exacerbated when trying to compare judges across different states (as done in Choi et al. 2010), since not only lower-court characteristics but also a range of court and state factors might affect number of opinions written, and citations to those opinions.
The goal of the analysis is to make two key comparisons: first, to compare a judge's performance to other judges in the same court-year; and second, to compare a judge's performance to his/her performance in other years. A challenging feature of the data is that the distributions of the outcomes are extremely variable across courts, judges, and years. This means that, when making within-court-year comparisons, for example, court-years with higher variance are upweighted in regressions using the raw data. In addition, coefficients on treatments that affect different subsets of states will not be comparable, as the different subsets will have different outcome variance.
We address this issue by normalizing outcome variables within subsets of observations. For the purposes of within-court-year comparisons (selection system and election cycle), we divide the outcomes for each judge-year by the court-year standard deviation, meaning that each court-year subsample will have variance one. In turn, for within-judge comparisons (retention system), outcomes are divided by the judge's standard deviation, meaning that each judge's sample of observations, together, will have variance one. 9 We do not de-mean the outcomes within the groups; de-meaning does not change the coefficients or standard errors, but it does reduce the adjusted R² substantially. In the replication notebook we show that the state-year fixed effects and judge fixed effects each explain, by themselves, about 60% of the variance in both outcomes (output and quality). Together, they explain about 80% of the variance in output and 70% of the variance in quality.
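As a rough illustration of the normalization just described, the following Python sketch divides each outcome by the standard deviation of its group without de-meaning. The function name `scale_by_group_sd`, the toy numbers, and the use of the sample (rather than population) standard deviation are assumptions made for illustration; they are not taken from the replication materials.

```python
from collections import defaultdict
from statistics import stdev


def scale_by_group_sd(values, groups):
    """Divide each value by the sample SD of its group (no de-meaning)."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    # Groups with fewer than two observations have no sample SD.
    sd = {g: stdev(vs) for g, vs in by_group.items() if len(vs) > 1}
    # Return None where the SD is undefined or zero.
    return [v / sd[g] if sd.get(g) else None for v, g in zip(values, groups)]


# Toy data (hypothetical): two court-years with very different spreads.
outcomes = [10.0, 14.0, 18.0, 100.0, 300.0, 500.0]
court_years = [("A", 1950)] * 3 + [("B", 1950)] * 3
scaled = scale_by_group_sd(outcomes, court_years)
print(scaled)
```

After scaling, each court-year in the toy data has unit variance, so neither group dominates a pooled regression simply because its raw outcomes are more dispersed. The same function applies to the within-judge case by passing judge identifiers as the groups.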
Citations are a joint measure of work quality and case importance. Some types of cases are more important than others. For example, cases that review the constitutionality of statutes are probably relatively important. In addition, judges have some discretion over the types of cases they are chosen to author opinions for. If we want to compare the quality of judges working on the same court at the same time, we need to try to account for these non-judge factors.
Empirically, we use the full range of dummy variables for the area of law of a case, as well as the related industries of a case. These are coded for each case by Bloomberg staff attorneys, and there may be up to three legal areas and three related industrial sectors for any particular case. Summary tabulations for the most frequent legal areas and industrial sectors are reported in table 3. The case characteristics vector includes a dummy variable for each area and sector, equaling one if the case has been assigned to that area or sector. Because there are so many of these characteristics, including separate covariates for every category would almost saturate the dataset. Instead, we include the first five principal components of this matrix of controls, which explain 65% of the variance of the matrix of case controls. 10
An important issue is that a "judge" is not really a single individual, but a team of individuals that includes clerks and secretarial staff. Judges select the clerks that are working for them, and hence our measures can be seen as composites that depend upon both the judge's legal skill when researching, reasoning, and writing, as well as managerial skill when selecting and directing clerks. As we know from Bloom et al. (2012), management quality varies across firms, and there are systematic relationships between management quality and firm performance.
9 The main implication for our results is that we can compare the election-year effect sizes for partisan elections and nonpartisan elections. Without this normalization, the non-partisan effect appears larger than the partisan effect, but that turns out to be solely due to different variances in work output for these states. The only finding that is different with variance normalization is the effect of partisan elections on quality; without the normalization, this coefficient is still negative but not statistically significant. The full set of results without variance normalization is reported in the appendix.
In our data we cannot directly disentangle managerial skill from legal skill. However, we can ask if there is variation across judges in the same court, and/or across time within-judge. Figure 1 demonstrates the variation left over in work output and work quality after residualizing out state-year fixed effects and case characteristics. This is the outcome variation left over for use in the estimation of treatment effects. As can be seen in the figure, there is significant variation in output and quality across judges, even after controlling for institutional and case-level characteristics. Next, Figure 2 shows that our performance measures are positively correlated, both within judge over time, and within court-year between judges. In the appendix we report plots of our judge quality measures over time for individual judges, to illustrate that the measures can reliably distinguish individuals over time.
To validate our outcome variables as judge-specific measures of performance, we explore the extent to which they are correlated with another performance measure previously used in the literature, the quality ratings issued by state bar associations. We were able to merge our data for a small number of judges with the data on evaluations provided by Lim and Snyder (2015b). We then regressed our performance measures on the bar evaluations with state-year fixed effects, to see whether our quality and output measures are predictive of the bar association evaluations of a "good judge," as coded by the authors. Table 4 Column 3 shows that quality, but not output, is a strong predictor of the bar association qualifications evaluation. 11
Table 4 notes: Outcome is an indicator for being a "good" judge as defined in Lim and Snyder (2015), with mean 0.86. Coefficients estimated by conditional logit. Standard errors clustered by state in parentheses.

Econometric Approach and Results
The empirical analysis follows the theoretical analysis. In turn, we look at election-year pressures, then at changes to the retention system, and finally at changes to the selection system.

Effect of Election-Cycle Pressure on Incumbent Judges
How do judges change their behavior over time in response to the election cycle? Ash and MacLeod (2015) show that contested elections reduce judicial performance. We add to that analysis by distinguishing between partisan and nonpartisan elections. In theory, if judges wish to be re-elected then they should put effort into election-year politics. This in turn would lead to a reduction in time spent on judging. The empirical strategy for examining the effects of electoral demands on judicial behavior is to exploit the staggered election cycle for identification of stronger electoral incentives. The election schedule is arbitrarily assigned by history, so it is reasonable to assume that the schedule is uncorrelated with other institutional or socioeconomic factors that might affect individual judge performance. For this analysis we used data provided by Kritzer (2011), supplemented by new data collection and checking by a team of research assistants.
The electoral cycle is represented in our regressions as a vector of dummy variables $E_{jst}$, which equals one if judge $j$ in state $s$ is up for election at year $t$. The vector includes separate indicators for partisan, nonpartisan, and uncontested retention elections. The dummy variable is coded as a one regardless of whether the judge actually ran for election; this is intended to avoid endogeneity problems from the judge's choice whether to actually run. Similar results were obtained using an intensity treatment variable giving the number of years until the next election.
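The coding rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the representation of the schedule as a set of years, and the labels in `ELECTION_TYPES` are hypothetical and are not drawn from the Kritzer (2011) data or the replication materials.

```python
ELECTION_TYPES = ("partisan", "nonpartisan", "uncontested")


def election_indicators(year, scheduled_years, system):
    """Return one 0/1 indicator per election type for a judge-year.

    The indicator depends only on the scheduled election year, not on
    whether the judge actually ran, mirroring the coding rule in the text.
    """
    up_for_election = year in scheduled_years
    return {t: int(up_for_election and system == t) for t in ELECTION_TYPES}


def years_until_next_election(year, scheduled_years):
    """Intensity variant: years remaining until the next scheduled election."""
    gaps = [s - year for s in scheduled_years if s >= year]
    return min(gaps) if gaps else None


# Hypothetical judge scheduled for election in 1954 and 1960.
schedule = {1954, 1960}
print(election_indicators(1960, schedule, "nonpartisan"))
print(years_until_next_election(1957, schedule))
```

Keying the indicator to the schedule rather than to actual candidacy keeps the treatment independent of the judge's own decision to run.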
One possible source of bias in this analysis comes from time-invariant characteristics of individual judges. Some judges may have higher or lower performance than others on average due to unobservable characteristics, and they may be up for election more often or less often for any number of reasons. To deal with this possibility, we include a full set of judge-specific fixed effects. Therefore any estimated election coefficients are relative to a judge's personal average.
A second major source of bias comes from the time-varying changes in the court work environment which may be correlated with the electoral schedule. For example, there may be campaigning demands during election years on all judges, not just those up for election, if they are asked to assist fellow members of their political party. To deal with this possibility, we include a full set of court-year fixed effects. Therefore any estimated election coefficients are also relative to the court average in each year. This means they effectively compare judges sitting on the same court, working at the same time, but who are in different stages of the electoral cycle.
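To make the role of the two sets of fixed effects concrete, the following Python sketch removes judge and court-year means by alternating de-meaning and then regresses the residualized outcome on the residualized election indicator. It is a toy illustration, not the estimation code used in the paper: the data are synthetic and noise-free, the panel is balanced, there is a single election indicator, and no standard errors are computed. All names (`demean_by`, `two_way_within`, `TRUE_RHO`) are hypothetical.

```python
from collections import defaultdict


def demean_by(values, groups):
    """Subtract the group mean from each value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def two_way_within(values, judges, court_years, n_iter=50):
    """Alternately remove judge means and court-year means."""
    out = list(values)
    for _ in range(n_iter):
        out = demean_by(out, judges)
        out = demean_by(out, court_years)
    return out


def election_year_effect(y, e, judges, court_years):
    """Slope of residualized y on residualized e (Frisch-Waugh-Lovell)."""
    y_res = two_way_within(y, judges, court_years)
    e_res = two_way_within(e, judges, court_years)
    return sum(a * b for a, b in zip(e_res, y_res)) / sum(b * b for b in e_res)


# Synthetic panel: 2 states, 3 judges each, 6 years, staggered elections.
TRUE_RHO = -0.5
judges, court_years, e, y = [], [], [], []
for state in ("A", "B"):
    for k in range(3):
        for year in range(6):
            up = 1.0 if (year + k) % 3 == 0 else 0.0
            judge_effect = 0.3 * k + (1.0 if state == "B" else 0.0)
            court_year_effect = 0.1 * year + (0.5 if state == "B" else 0.0)
            judges.append(f"{state}{k}")
            court_years.append((state, year))
            e.append(up)
            y.append(judge_effect + court_year_effect + TRUE_RHO * up)

rho_hat = election_year_effect(y, e, judges, court_years)
print(round(rho_hat, 6))
```

Because the staggered schedule leaves variation in the election indicator within each judge and within each court-year, the election-year effect remains identified after both sets of fixed effects are removed.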
Our preferred specification is
$$y_{jst} = \text{JUDGE}_j + \text{STATE}_s \times \text{TIME}_t + E_{jst}'\rho + \epsilon_{jst} \qquad (5.1)$$
where $\text{JUDGE}_j$ is a judge fixed effect, $\text{STATE}_s \times \text{TIME}_t$ is a court-year fixed effect for each state $s$ and year $t$, and $E_{jst}$ includes the election-year treatments. Standard errors are clustered by state. Note that this gives the average output deviation for the year before an election.
The coefficient estimates from Equation (5.1) are reported in Table 5. In Columns 1 through 3, we see that both partisan elections and nonpartisan elections have a negative election-year effect on work output. This effect does not change much when adding judge fixed effects, case topic controls, or judge experience controls. In all cases, there is no negative effect of uncontested retention elections on work output; if anything there is a positive effect.
We do not see much of an effect of election-year pressure on work quality (Columns 4 through 6). In nonpartisan and uncontested elections, the effects are clearly zero. In partisan elections, there is a negative coefficient but it is not significant.
The effects on work output observed under partisan and nonpartisan elections are driven at least in part by a decrease in workload. Column 7 shows that judges are ruling on significantly fewer cases than their colleagues when up for election. Due to the decrease in cases published, the election-cycle judges have significantly lower impact in terms of total citations per year (Column 8).
Table 6 extends the analysis to the alternative specifications for performance. These results are in line with what we saw in the previous table. First, from Columns 1 and 2 we learn that the electoral effect on output is not sensitive to how output is defined. Second, we see negative coefficients for the effect of partisan elections on quality, although only the coefficient for "all" citations is significant.
A visualization of the results and a range of robustness checks and alternative specifications are reported in the appendix. We got similar results after dropping short opinions (less than 3 paragraphs). We also got similar results when looking at different types of cases (criminal, civil, administrative, or constitutional), when using logs of the outcomes, including more case factor controls, or dropping the first and last year for each judge career. Without normalizing the variance by state-year, or when normalizing by judge, the effect of partisan elections on quality goes to zero. When weighting by number of opinions, the effect of non-partisan elections is larger, and the effect of partisan elections is smaller and not significant.
The election-cycle results suggest that contested elections reduce output, while uncontested elections have no effect. Since uncontested elections are non-competitive by design, the null effect here is unsurprising. 12 In addition, partisan elections tend to reduce work quality, but nonpartisan elections do not. These results are consistent with the hypothesis that partisan judges have a weaker preference for producing high-quality decisions. The fact that in election years there is a fall in performance is consistent with the hypothesis that voters do care, and that judges do need to spend time to get re-elected.

Effect of Retention Process on Incumbent Judges
This subsection reports the results on how changing the system for judge retention affects the performance of sitting judges. The theory predicts that judges will work to get re-elected, though their choice between pandering and providing information regarding their work quality depends upon which factor is more important for their re-election chances. We identify the effect of the re-election system using discrete changes in the rules for retaining state supreme court judges. Four states moved from partisan retention elections to nonpartisan retention elections: Florida, Georgia, Kentucky, and Utah. Eight states moved from partisan retention to uncontested retention elections: Colorado, Illinois, Iowa, Indiana, Kansas, Nebraska, New Mexico, and Oklahoma. Six states moved from nonpartisan retention to uncontested retention: Arizona, Florida, Maryland, South Dakota, Utah, and Wyoming. 13
The regression framework is a standard differences-in-differences approach based on Bertrand et al. (2004). To control for time-invariant judge characteristics that may be correlated with the retention system in various states, we include judge fixed effects. To control for national trends in performance, we include year fixed effects. To control for pre-existing state trends in performance that may be confounded with the reforms, we include state-specific linear trends. In the strictest specification, we include controls for a number of other rules affecting the judiciary, notably state expenditures on courts.
We measure effects in a ten-year window before and after the reforms. The regressions include an indicator equaling one for the baseline time window of ten years before and ten years after a change to the retention system. The treatment variable is a dummy for the ten years after the change. Thus, with the inclusion of the judge fixed effects, the estimates can be interpreted as the average difference in within-judge performance for the ten years after the policy change relative to the ten years before the policy change. In a handful of states, we shrank the time window if the reform occurred close to the beginning or end of the sample. 14 A graphical representation of the windows is included in Appendix Figure 3. In the appendix we include a table using other time windows; our main result on nonpartisan judges and quality is a somewhat lagged effect that is statistically significant with an effect window of at least eight years.
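The two indicators described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `window_indicators` is hypothetical, and the boundary convention (the reform year itself counts as the first post-reform year, and the window is half-open) is a guess rather than something stated in the text.

```python
def window_indicators(year, reform_year, half_width=10):
    """Return (in_window, post_reform) as 0/1 indicators for a state-year.

    in_window   : 1 within half_width years before or after the reform.
    post_reform : 1 within half_width years after the reform (treatment).
    Assumption: the reform year is treated as the first post-reform year.
    """
    in_window = reform_year - half_width <= year < reform_year + half_width
    post_reform = reform_year <= year < reform_year + half_width
    return int(in_window), int(post_reform)


# Hypothetical reform in 1970.
for yr in (1959, 1965, 1970, 1979, 1980):
    print(yr, window_indicators(yr, 1970))
```

Observations outside the window contribute to the fixed effects and trends but not to the before-and-after comparison. Passing a smaller `half_width` for a given state would correspond to shrinking the window when a reform falls near the start or end of the sample.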
Formally, we estimate
$$y_{jst} = \text{YEAR}_t + \text{JUDGE}_j + \text{STATE}_s \times t + \tilde{R}_{st}'\gamma + R_{st}'\rho + X_{jst}'\beta + \epsilon_{jst} \qquad (5.2)$$
where $\text{YEAR}_t$ is a fixed effect for the two-year period $t$, $\text{JUDGE}_j$ is a judge fixed effect, and $\text{STATE}_s \times t$ is a state-level linear time trend for state $s$. The term $\tilde{R}_{st}$ is a vector of indicators equaling one for the baseline time windows of ten years before and ten years after each of the retention reforms. $R_{st}$ is a vector of treatment indicators for the ten years after each rule change (with $\rho$ measuring the corresponding causal effects of interest). $X_{jst}$ includes other state and judge controls, namely variables for case topic and other court-related policies. Standard errors are clustered by state and year.
Table 7 reports the estimates for $\rho$ from Equation (5.2). First, in Columns 1 through 3 we look at effects of the rule changes on average judge output. There is an inconsistently significant negative effect of moving from partisan to nonpartisan elections. Effects for partisan-to-uncontested and nonpartisan-to-uncontested change a lot across specifications.
Next, Columns 4 through 6 show no effect of moving from partisan to nonpartisan elections, or moving from partisan to uncontested, on work quality. However, there is a positive and statistically significant effect when moving from a nonpartisan system to uncontested elections. This result is insensitive to the addition of state time trends, judge experience controls, and a range of state-level policy controls. In terms of caseload (Column 7) and total citations (Column 8), there are no effects for partisan to nonpartisan, nor for nonpartisan to uncontested. We see positive effects for partisan to uncontested, but these are sensitive to the inclusion of trends and controls.
Table notes: Topic controls include area of law and related industries. State policy controls include other appointment-process changes, mandatory-retirement changes, changes in the number of judges, and log state government expenditures on the judicial branch. Outcomes normalized to have variance one within judge. Standard errors, adjusted for two-way clustering by state and year, in parentheses. P-values in brackets.
Finally, Table 8 adds to the results with the alternative outcomes. In terms of output, we see more consistent negative effects for moving from partisan to nonpartisan elections. In terms of quality, there is a significant positive effect on "all" citations for the nonpartisan-to-uncontested reform (Column 3). For discussion cites (Column 4) and out-of-state cites (Column 5), the coefficients are positive but not significant at the standard thresholds.
In the appendix and replication materials we provide other robustness checks and alternative specifications. We got similar results after dropping short opinions (less than 3 paragraphs). The quality results were similar in magnitude, but more significant, when weighting by number of opinions written, using log transformations of the outcome, or adding more case factor controls. Dropping the first and last year for each judge reduced significance of the main quality regression. The retention effect is statistically significant with an effect window of at least 8 years. 15 There was no effect on probability of being overruled by the U.S. Supreme Court. Finally, we dropped each treated state one by one; the nonpartisan-to-uncontested effects on quality were significant in all regressions.
These results provide additional evidence that electoral incentives have an important impact on the performance of appellate court judges. In particular, the partisanship of elections matters. When moving from partisan to nonpartisan elections, there is a reduction in output, consistent with the hypothesis that nonpartisan elections are more competitive. When taking nonpartisan judges and giving them tenure, meanwhile, there is an increase in work quality. This is consistent with the idea from Ash and MacLeod (2015) that these judges have an intrinsic motivation to produce high-quality work, and that with weaker electoral incentives they will orient their time and decision choices away from campaigning and pleasing voters, and toward writing good decisions that please their peers.
On the other hand, when taking partisan judges and giving them tenure, there is no effect on output or quality. There are two possible interpretations. First, consistent with Judge O'Connor's view, partisan elections may be less competitive than nonpartisan elections. Because there was little electoral pressure in the first place, eliminating those elections does not have a big impact on the way partisan judges spend their time or decide their cases.
Second, these results are consistent with the hypothesis that partisan-selected judges place a lower intrinsic value on quality than nonpartisan judges. When given tenure, they do not value quality enough to invest in it with the newly available time. Because we found above that partisan election cycles reduce output (and therefore seem to involve election pressure), the evidence is more in line with this selection mechanism.

Effect of Selection Process on Selected Judges
The evidence on election-year effort shows that election pressure does count, and that potential judges do attempt to influence the vote. The evidence is consistent with the hypothesis that judges selected under a nonpartisan system place more value upon quality than quantity relative to judges selected under a partisan system. This weighting does not necessarily imply that they are worse judges. The effect of the selection system upon the overall quality of a judge can be measured in our data by exploiting changes in a state's appointment system and comparing judges selected under different systems but working on the same court.
In Table 1 we see that four states changed from partisan selection to nonpartisan selection: Florida, Georgia, Kentucky, and Utah. 16 Seven states moved from partisan selection to merit selection: Colorado, Iowa, Indiana, Kansas, Nebraska, and Oklahoma. 17 Six states moved from nonpartisan selection to merit selection: Arizona, Florida, Maryland, South Dakota, Utah, and Wyoming. 18 The goal is to compare the performance of judges selected before these reforms to the performance of judges selected after these reforms.
We control for time-varying state-specific factors by including a full set of court-year (interacted) fixed effects. This specification effectively compares the performance of judges sitting on the same court at the same time, but selected under different regimes. We include case characteristics to address the issue of endogenous selection of case types to judge types. In the strictest specification, we include a set of covariates for judge personal characteristics. These include controls for judge starting age, gender, political party, and whether they come from a top law school.
The estimating equation for performance variable $y_{jst}$ for judge $j$ in state $s$ at year $t$ is
$$y_{jst} = \text{STATE}_s \times \text{YEAR}_t + X_{jst}'\beta + S_{jst}'\rho + \epsilon_{jst} \qquad (5.3)$$
where $\text{STATE}_s \times \text{YEAR}_t$ includes the court-year fixed effects, $X_{jst}$ includes the case and judge controls, and $S_{jst}$ includes the treatment indicators equaling one for judges selected under the post-reform system. 19 Given the inclusion of the fixed effects, the coefficients $\rho$ give the average difference in performance between judges selected under the new system and judges selected under the old system, controlling for other time-varying state-level factors. Standard errors are clustered by state and year; we generally obtained smaller standard errors when clustering by state, clustering by state-year, or not clustering.
16 Florida moved from nonpartisan to merit five years after the partisan to nonpartisan reform, and only four judges were selected under the nonpartisan system. Therefore Florida is excluded from the baseline selection regressions, but including it does not change the results.
17 Tennessee moved to merit selection in 1972, but moved back to partisan selection in 1978. Only one judge was selected by the merit process so it is not included in the analysis.
18 See Footnote 16 re Florida.
19 Note that in the electoral selection systems, the judges may be initially appointed by the governor to fill a vacant seat, rather than being initially selected through a competitive electoral process. We still code the appointed judges as being selected under the electoral system, since the predecessor's choice whether to step down is endogenous to the system.
Table 9 reports the estimates from Equation (5.3) for our baseline set of outcomes. First, in Columns 1 through 3, we do not see any robust selection effect on output. Judges selected by partisan elections, non-partisan elections, and merit commissions write about the same number of words per year.
In Columns 4 through 6, however, we see significant effects on work quality. Relative to their partisan-elected colleagues, nonpartisan-elected judges write higher-quality decisions. In turn, merit-selected judges also write higher-quality decisions than partisan-elected colleagues. This effect is not driven by differences in the types of cases they are assigned (Column 5). It is also not explained by judge demographic characteristics (Column 6).
As seen in Column 7, there is no significant difference between judges in terms of the number of opinions they are assigned. As a result, the effect on total citations to a judge in a year (Column 8) is positive for both nonpartisan relative to partisan, and merit relative to partisan. Meanwhile, there is a marginally significant negative difference between merit-selected judges and nonpartisan-elected colleagues.
Table 10 reports the selection-system effects for some additional measures of output and quality. In Columns 1 and 2, we see again that there is no difference between the types of judges in terms of output. Columns 3 through 5 show that in the case of quality, there are robust differences between the systems. When looking at all citations (including negative and distinguishing, not just positive) or discussion citations (where the previous case is actively discussed), there are significant positive differences between nonpartisan and partisan, and between merit and partisan. In the case of out-of-state citations, the effect of nonpartisan (relative to partisan) is no longer significant, but the effect of merit selection (relative to partisan) is strongly significant.
We ran a number of robustness checks and alternative specifications, which are reported in the replication materials. In general, the results on partisan-to-merit are more robust than the results on partisan-to-nonpartisan. The results were not sensitive to subsetting by type of case (criminal, civil, etc.), using log transformations, dropping first and last year of each judge career, or adding more case characteristics components. Without normalizing the variance within court year, some of the results on work quality are only significant at the 10%, rather than 1% or 5%, level. The results were of similar magnitude, but more statistically significant, when weighting by number of opinions written. The results on partisan-to-nonpartisan were somewhat sensitive to the "window" before and after the reform for including judges: if you only look at judges selected 5 years either side of the reform, the effect is not significant, but the effect is seen when including judges selected at least 10 years either side of the reform. Adding state-specific trends in judge starting year shrinks the coefficients; the effects are significant only with clustering by state-year (rather than two-way clustering by state and year). When requiring that there be at least 2 judges selected from each system, the effect of partisan-to-nonpartisan shrinks and is no longer significant. Finally, we dropped each treated state one by one; the partisan-to-nonpartisan results were unaffected, while the partisan-to-merit effects shrank and became marginally insignificant when either Indiana or Oklahoma was dropped.
Overall, these results point to the importance of selection systems for the quality of judge decision-making.
Partisanship of election processes reduces decision quality, but not total output. From Kirkland and Coppock (2017) we know that partisan labels lead to less scrutiny of the candidates. Hence this finding is consistent with the hypothesis that the primary system does not select better candidates than an at-large election. In these regressions there seems to be little difference in quality between nonpartisan systems and merit systems.

Conclusion
The goal of this paper has been to contribute some evidence regarding the hypothesis that the choice between a "politician" and "bureaucrat" entails a trade-off between a sensitivity to the desires of the electorate and the execution of the mission to make high quality legal decisions (Maskin and Tirole, 2004; Tabellini, 2007, 2008). A substantial body of work documents that public officials respond to the preferences of the electorate (Ashworth, 2012). We complement this literature with evidence on the other side of the balance sheet: how electoral pressures interact with a judge's mission to provide high work quality. Sitting judges spend more time on their work in response to a weakening of re-election pressure, and they make decisions that are evaluated as higher-quality by other judges. Moreover, judges selected by a technocratic merit commission are of higher quality. The results are detailed in Table 11.
Table 11 notes: Left-most column indicates the treatment, and the other column headers indicate the outcome measure. Arrows indicate a positive or negative effect on judge performance. A tilde (~) indicates no effect. Pluses (+) and asterisks (*, **) indicate significance/robustness of the effect.
For incumbent judges, we find that contested elections reduce output in election years, but uncontested elections do not. This is consistent with a simple model in which campaign effort takes time away from judging. Moving from nonpartisan to uncontested elections increases case quality, consistent with the notion that nonpartisan contested elections are more demanding of a judge's time than uncontested elections. There is no within-judge effect of moving from partisan to uncontested elections, reflecting that nonpartisan elections are the most competitive, owing to less bias than partisan elections. Finally, the merit-based selection process selects better judges than the election systems. These results are consistent with a selection model where better-informed experts can choose higher-quality officials than voters on average.
Our evidence is broadly in line with the early rational-choice approaches of Downs (1957) and Ferejohn (1986), in which voters use their information to make the best decisions they can, conditional upon their policy preferences. But more information is not always better; more information on candidate quality can improve performance (see Pande, 2011), but more information on political affiliation can reduce performance.
Should all states immediately move to a merit system with uncontested retention elections? Our evidence would certainly strengthen arguments to do so. But there are other criteria besides judicial citations for ranking courts, and ballot referenda for the merit plan have failed many times. There may be many other social impacts of these courts, but at present there aren't data-driven ways to measure them. There is an ongoing debate on which is the superior system (e.g. Pozen, 2010); the fact that states continue to experiment with different systems suggests that it is not clear which system is optimal. If a single system were clearly optimal, then we would have expected the market to have moved in that direction quickly, consistent with Posner's (1987) view that legal institutions move in the direction of efficient exchange.
The fact that we do find a pattern of effects predicted by our simple model helps explain why there is experimentation. While the results are consistent with merit commissions selecting better judges, judging is not a purely technical activity. The political views of judges color the ideological content of their decisions (see Epstein et al., 2013), which may explain why many jurisdictions prefer to give voters a clear signal of the political views of judges. Optimizing states would change systems only if it led to an improvement; hence at any point in time there should be only small variation across states (as Choi et al. (2010) find). 20
Finally, our results highlight the fact that the American legal system is neither simple nor static. It is a complex, dynamic system consisting of a number of interlocking ingredients. Our study focuses upon one of the most important and influential ingredients of this system: the U.S. state supreme court judges who rule on all aspects of private law, including contract, tort, and property law. Our evidence is consistent with the hypothesis that these judges are professionals who are interested in enhancing the quality of the law. Hence, we have observed many states moving away from partisan political processes for selection toward nonpartisan and merit-based processes. These more "bureaucratic" systems have selected better judges and imposed incentives more aligned with the mission of increasing the quality of American law.
to measure the impact of a location on wages due to the self-selection by workers; similarly, one cannot use cross-sectional data to measure the impact of selection systems on judge performance due to self-selection of judges across states. See Heckman and Honore (1990) for details. Tennessee moved from partisan to merit-uncontested in 1972, then moved back to partisan elections in 1975. It is not included in the analysis.
Utah instituted an intermediate appellate court in 1988, two years after the reform from nonpartisan to merit-uncontested.

A.2 Appendix Tables
This section provides additional tables and empirical results. Figure 4 shows the distributions of the variances of the outcomes by court-year and by judge. This substantial variation justifies the normalization of variances within these groups of observations.
Next we look at additional outcome variables for the set of judges analyzed. Figures 7, 8, and 9 provide judge-specific plots for three potential outcome variables to measure output. Respectively, they report the number of opinions written annually, the number of words written annually, and "work output," the number of words written per two-year period after residualizing on case characteristics. As discussed in the text, work output provides the most consistent distinctions between judges of these measures. A host of other results (described briefly in the text) are shown in the replication materials.