Abstract
The number of situations that require individual judgments and evaluations, and that may be subject to different sources of conscious and unconscious bias, is endless. This paper proposes a practical score aggregation procedure that attempts to reduce and mitigate the influence of bias in subjective judgments. The argument is based on the idea that bias is associated with deviations from the panel mean and/or deviations from the judges’ grading styles. Consequently, the procedure is not specific to a particular type of bias but addresses general forms of bias. We also discuss a set of desirable properties. The proposed score aggregation procedure is then applied to a unique data set from the diving competition at the 2000 Summer Olympic Games.
Notes
In this paper, we focus mostly on sports performance, but the number of situations that require individual judgments and evaluations, and that can be affected by different sources of bias, is endless. The approach in this paper may be extended to these other dimensions of our lives (e.g., the rating of any kind of items, goods and services, such as wines, books, films, music, policies, scientific refereeing, talent competitions, tourist locations or blog comments). Nowadays, the internet is making such evaluation procedures increasingly common.
Several studies have focused on bias in evaluation contexts other than sports. For instance, to mention just a few, in musical competitions, Ginsburgh and Van Ours (2003) found that judging panel members are influenced by the order of appearance of candidates, while Tsay (2013) found that judges are influenced more by what they see than by what they hear. In the Eurovision Song Contest, Ginsburgh and Noury (2008) found that linguistic and cultural similarities between singers and judges are decisive, while in academic awards, Hamermesh and Schmidt (2003) found that affiliation is crucial in the judges’ decision. In this context, some statistically based rating procedures have shown better results than expert opinions (Dawes et al. 1989; Meehl 1954). Other inconsistencies and paradoxical observations are reported in the literature (Ashenfelter and Quandt 1999; Fritz et al. 2012; Hodgson 2008; Plessner and Haar 2006). Further development of these issues is beyond the scope of the present paper.
The grading style must not vary with the order in which the scores in the history are presented. This aspect has implications for Sect. 4, where we discuss some of the properties of the proposed aggregation procedure. It implies that the history cannot include scores entered in the present competition; otherwise, the order of the performances could interfere with the measurement of the grading style, which must remain stable throughout the competition. It also places restrictions on the use of moving averages as aggregate measurements of the history.
We are intentionally ambiguous about the length of the history of past scores. We leave this decision to the social planner or the sport’s governing body responsible for the competition. The longer the history, and the closer it is in time, the better. However, on the same panel we may have judges whose histories differ not only in length but also in quality. In that sense, homogenizing all the histories by using the length of the shortest history as a reference may not be a good idea, because it would imply a loss of data about the histories of the other judges.
Grading style can be defined in different ways. These alternatives have in common the use of information from the history of past scores. For instance, grading style could have been defined as:
$$\begin{aligned} {\overline{s}}_{i{\mathbf {h}}_{j}}={\overline{s}}_{{\mathbf {h}}_{j}}{\overline{s}}_{i}/ {\overline{s}}_{{\mathbf {h}}}, \end{aligned}$$where \({\overline{s}}_{{\mathbf {h}}}\equiv \frac{1}{n}\sum \nolimits _{j=1}^{n} {\overline{s}}_{{\mathbf {h}}_{j}}\) is the arithmetic mean of the grading histories of all the judges on the panel. The results in this paper would not change significantly if we were to consider this measure. However, grading style is more correctly defined if judge \(j^{\prime }s\) grading history is made relative to the mean score of the panels on which judge j has participated. Alternatively, grading style could have been defined as: \( {\overline{s}}_{i{\mathbf {h}}_{j}}={\overline{s}}_{i}\frac{1}{T}\sum \nolimits _{t=1}^{T}(s_{.j}^{t}/{\overline{s}}_{.(j)}^{t}).\) This case is conceptually equivalent to the one in this paper and leads to almost exactly the same results, but the approach in this paper is more intuitive and simpler to apply. Other definitions of grading style are also possible.
The history of past grades of each judge usually includes scores from different competitions, levels and stages. In this context, for instance, the average score in the early stages of a competition tends to be lower than the average score in the later stages, in which only the best competitors are left. Similarly, average scores in national competitions tend to be lower than average scores in the Olympics, because in the Olympic Games competitors tend to be better on average. The ratio \({\overline{s}}_{{\mathbf {h}}_{j}}/{\overline{s}}_{{\mathbf {h}}_{(j)}}\) corrects for this heterogeneity.
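As an illustration of how this ratio can be used, the sketch below computes a judge’s grading-style correction from a hypothetical history. The data, the variable names and the final multiplication by the current panel mean are assumptions for illustration, not the paper’s exact formulas.

```python
import numpy as np

# Hypothetical data: judge j's past scores and, for each of those
# occasions, the mean score awarded by the panel judge j sat on.
history_j = np.array([7.8, 8.9, 8.2])    # past scores of judge j
panel_means = np.array([7.5, 8.8, 8.0])  # panel mean on each occasion

# The ratio corrects for heterogeneity across competitions and stages:
# it captures whether judge j tends to grade above or below the panels
# he or she has participated in.
style_ratio = history_j.mean() / panel_means.mean()

# Expected "grading style" score of judge j for performance i, given
# the current panel mean for that performance (hypothetical value).
panel_mean_i = 8.4
expected_ij = style_ratio * panel_mean_i
```

Since this hypothetical judge grades slightly above his or her past panels, the expected score exceeds the current panel mean by the same proportion.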
We can consider alternative weight functions with similar implications. For instance, we can use different parameters, \( \beta \) and \( \gamma \), to control for deviations from judge \(j^{\prime }s\) grading style and from the panel mean, respectively. In this case, we could have:
$$\begin{aligned} w_{ij}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\equiv \frac{\alpha \sum \nolimits _{k\ne j}^{n}\left| s_{ik}-{\overline{s}}_{i {\mathbf {h}}_{k}}\right| ^{\beta }+(1-\alpha )\sum \nolimits _{k\ne j}^{n}\left| s_{ik}-{\overline{s}}_{i}\right| ^{\gamma }}{(n-1)(\alpha \sum \nolimits _{k=1}^{n}\left| s_{ik}-{\overline{s}}_{i{\mathbf {h}} _{k}}\right| ^{\beta }+(1-\alpha )\sum \nolimits _{k=1}^{n}\left| s_{ik}-{\overline{s}}_{i}\right| ^{\gamma })}. \end{aligned}$$Other formulations are also possible. For instance, we can also consider the following simplified weighted mean formulation:
$$\begin{aligned} w_{ij}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\equiv \frac{\sum \nolimits _{k\ne j}^{n}\left| s_{ik}-(\alpha {\overline{s}}_{i {\mathbf {h}}_{k}}+(1-\alpha ){\overline{s}}_{i})\right| ^{\gamma }}{ (n-1)\sum \nolimits _{k=1}^{n}\left| s_{ik}-(\alpha {\overline{s}}_{i{\mathbf {h}}_{k}}+(1-\alpha ){\overline{s}}_{i})\right| ^{\gamma }}. \end{aligned}$$These formulations may differ slightly in terms of properties, but the crucial aspect is that all of them penalize deviations from the judges’ grading styles. Other approaches, like majority judgment, which is based on the median score, have also been considered in the literature (Balinski and Laraki 2007, 2010; Bassett Jr and Persky 1994; Wu and Yang 2004).
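As a minimal sketch, the simplified weighted-mean formulation above can be implemented as follows. The scores and the grading-style estimates (the \({\overline{s}}_{i{\mathbf {h}}_{k}}\) terms, here the variable `style`) are hypothetical inputs.

```python
import numpy as np

def weights(s_i, style, alpha=0.5, gamma=2.0):
    """Weights from the simplified weighted-mean formulation.

    s_i   : scores awarded to performance i by the n judges
    style : expected score of each judge given his or her grading
            history (the s-bar_{i h_k} terms); hypothetical inputs
    """
    s_i = np.asarray(s_i, dtype=float)
    style = np.asarray(style, dtype=float)
    n = len(s_i)
    ref = alpha * style + (1 - alpha) * s_i.mean()  # blended reference
    dev = np.abs(s_i - ref) ** gamma                # penalized deviations
    # judge j's weight aggregates the deviations of everyone except j,
    # so larger own deviations yield smaller weights
    return (dev.sum() - dev) / ((n - 1) * dev.sum())

scores = [8.5, 9.0, 9.5, 7.0, 9.0]
style = [8.9, 8.8, 9.1, 8.7, 9.0]  # hypothetical grading-style estimates
w = weights(scores, style)
s_star = float(np.dot(w, scores))  # bias-mitigated aggregate score
```

By construction the weights sum to one, and the judge furthest from the blended reference (here, the one awarding 7.0) receives the smallest weight.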
In the case where \(\gamma \rightarrow 0,\) the score aggregation procedure converges to the mean \({\overline{s}}_{i}^{*}\rightarrow {\overline{s}}_{i},\) because all grades are equally weighted, while in the case where \(\gamma \rightarrow \infty ,\) the score aggregation procedure ignores the most extreme score and weights all the other scores equally (with some specificities in the case of more than one extreme score).
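These limiting cases can be checked numerically. The sketch below is a simplification that sets \(\alpha =0\), so that only deviations from the panel mean are penalized, and uses hypothetical scores.

```python
import numpy as np

def agg(s_i, gamma):
    """Aggregate score with alpha = 0: only deviations from the
    panel mean are penalized."""
    s_i = np.asarray(s_i, dtype=float)
    n = len(s_i)
    dev = np.abs(s_i - s_i.mean()) ** gamma
    w = (dev.sum() - dev) / ((n - 1) * dev.sum())
    return float(np.dot(w, s_i))

scores = [8.0, 8.5, 9.5, 6.0, 8.5]  # panel mean is 8.1

# gamma -> 0: all deviations are penalized equally, so the procedure
# converges to the plain arithmetic mean.
low = agg(scores, 1e-9)

# gamma large: the most extreme score (6.0) receives weight close to
# zero and the remaining scores are weighted almost equally.
high = agg(scores, 50.0)
```

In this example `low` is close to the panel mean of 8.1, while `high` is close to the mean of the four non-extreme scores.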
The score aggregation function can be written in more general terms as:
$$\begin{aligned} {\overline{s}}_{i}^{+}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\equiv \sum \nolimits _{j=1}^{n}f_{j}({\mathbf {s}}_{i},\{ {\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})g(s_{ij}), \end{aligned}$$where \(f_{j}(.)\) is a weight function that receives the vectors \({\mathbf {s}} _{i}\) and \(\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n}\) as inputs, and g(.) is a function that receives the grade of judge j as input. Then, if the function \(f_{j}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\) is continuous, homogeneous of degree zero on \({\mathbf {s}} _{i},\) with \(f_{j}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\in [0,1]\) and \(\sum \nolimits _{j=1}^{n}f_{j}( {\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})=1,\) and the function \(g(s_{ij})=s_{ij},\) then the properties of the general aggregation function \({\overline{s}}_{i}^{+}\) match the properties of \({\overline{s}} _{i}^{*}.\)
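A sketch of this general form, under the stated conditions and with \(g\) the identity, is given below. The scores are hypothetical; note that any weight function built as a ratio of functions of the deviations, such as the example `f_dev`, is homogeneous of degree zero in the scores, and \(f_{j}=1/n\) recovers the arithmetic mean as a special case.

```python
import numpy as np

def general_agg(s_i, f):
    """Aggregate as sum_j f_j(s_i) * g(s_ij), with g the identity."""
    s_i = np.asarray(s_i, dtype=float)
    w = f(s_i)
    # the weight function must be a proper convex combination
    assert np.isclose(w.sum(), 1.0) and np.all((w >= 0) & (w <= 1))
    return float(np.dot(w, s_i))

scores = [8.0, 8.5, 9.0, 6.5]

# f_j = 1/n recovers the arithmetic mean as a special case.
mean = general_agg(scores, lambda s: np.full(len(s), 1.0 / len(s)))

# A ratio of squared deviations is homogeneous of degree zero in the
# scores and downweights the judge furthest from the panel mean.
def f_dev(s):
    d = (s - s.mean()) ** 2
    return (d.sum() - d) / ((len(s) - 1) * d.sum())

robust = general_agg(scores, f_dev)
```

With these scores, the deviation-based weights discount the low outlier of 6.5, so the robust aggregate lies above the plain mean.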
We consider the intermediate cases \(\alpha =1/3\) and \(\alpha =2/3\) with \( \gamma =2,\) because they are sufficiently representative and informative.
The statistical method employed by Emerson et al. (2009) is particularly powerful for detecting bias and manipulation. However, since bias can be hidden in very complex and strategic ways, there is no perfect method to deal with this possibility. For instance, a judge may penalize a particular athlete in the early stages of competition, in which the qualification of that athlete is almost guaranteed (because of the athlete’s quality), in order to later benefit this same athlete in the most crucial stages of the competition. Similarly, a judge may simultaneously penalize and benefit two different athletes of the same nationality. In those cases, the aggregation of data is likely to lead to the conclusion that bias is not statistically significant because of cancellation effects.
In this context, in order to introduce bias, judges are forced to award more extreme scores than when the aggregation function is simply the arithmetic mean. Such extreme behavior exposes them to public opinion and to detection by third party monitoring. For this reason, the simultaneous use of transparency policies is important.
In some cases, scores are dissociated from the identity of the judges, which makes bias analysis extremely difficult for researchers and the general public. In other cases (e.g., online judgment of items, goods or services), the data is proprietary and not freely available.
References
Arrow KJ (1950) A difficulty in the concept of social welfare. J Polit Econ 58(4):328–346
Asch S (1951) Effects of group pressure upon the modification and distortion of judgments. In: Guetzkow H (ed) Groups, Leadership, and Men, pp 222–236
Ashenfelter O, Quandt R (1999) Analyzing a wine tasting statistically. Chance 12(3):16–20
Baker GP (1992) Incentive contracts and performance measurement. J Polit Econ 100(3):598–614
Balinski M, Laraki R (2007) A theory of measuring, electing, and ranking. Proc Nat Acad Sci 104(21):8720–8725
Balinski M, Laraki R (2010) Majority judgment: measuring, ranking, and electing. MIT Press, Cambridge
Balinski M, Laraki R (2014) Judge: don’t vote!. Oper Res 62(3):483–511
Bar-Eli M, Plessner H, Raab M (2011) Judgment, decision-making and success in sport. John Wiley & Sons, New Jersey
Baron J (2007) Thinking and deciding, 4th edn. Cambridge University Press, Cambridge
Bassett GW Jr, Persky J (1994) Rating skating. J Am Stat Assoc 89(427):1075–1079
Beliakov G, Pradera A, Calvo T (2007) Aggregation functions: a guide for practitioners, vol 221. Springer, Heidelberg
Buchanan JT, Henig EJ, Henig MI (1998) Objectivity and subjectivity in the decision making process. Ann Oper Res 80:333–345
Coupe T, Gergaud O, Noury A (2018) Biases and strategic behaviour in performance evaluation: the case of the FIFA’s best soccer player award. Oxf Bull Econ Stat 80(2):358–379
Cust EE, Sweeting AJ, Ball K, Robertson S (2019) Machine and deep learning for sport-specific movement recognition: a systematic review of model development and performance. J Sports Sci 37(5):568–600
Damisch L, Mussweiler T, Plessner H (2006) Olympic medals as fruits of comparison? assimilation and contrast in sequential performance judgments. J Exp Psychol Appl 12(3):166–178
Dawes RM, Faust D, Meehl PE (1989) Clinical versus actuarial judgment. Science 243(4899):1668–1674
Deutsch M, Gerard HB (1955) A study of normative and informational social influences upon individual judgment. J Abnormal Soc Psychol 51(3):629–636
Díaz-Pereira MP, Gomez-Conde I, Escalona M, Olivieri DN (2014) Automatic recognition and scoring of Olympic rhythmic gymnastic movements. Hum Mov Sci 34:63–80
Dohmen T, Sauermann J (2016) Referee bias. J Econ Surv 30(4):679–695
Duggan M, Levitt SD (2002) Winning isn’t everything: corruption in sumo wrestling. Am Econ Rev 92(5):1594–1605
Emerson JW, Seltzer M, Lin D (2009) Assessing judging bias: an example from the 2000 Olympic Games. Am Stat 63(2):124–131
Felsenthal DS, Machover M (2008) The majority judgement voting procedure: a critical evaluation. Homo Oeconomicus 25(3/4):319–334
Findlay LC, Ste-Marie DM (2004) A reputation bias in figure skating judging. J Sport Exerc Psychol 26(1):154–166
Frank MG, Gilovich T (1988) The dark side of self-and social perception: black uniforms and aggression in professional sports. J Pers Soc Psychol 54(1):74–85
Frey B (2017) Omnimetrics and awards. Tech. rep., Center for Research in Economics, Management and the Arts (CREMA)
Frey BS, Gallus J (2017) Towards an economics of awards. J Econ Surv 31(1):190–200
Fritz C, Curtin J, Poitevineau J, Morrel-Samuels P, Tao F-C (2012) Player preferences among new and old violins. Proc Nat Acad Sci 109(3):760–763
Garicano L, Palacios-Huerta I, Prendergast C (2005) Favoritism under social pressure. Rev Econ Stat 87(2):208–216
Gibbard A (1973) Manipulation of voting schemes: a general result. Econometrica 41(4):587–601
Ginsburgh V, Noury AG (2008) The Eurovision Song Contest. Is voting political or cultural? Eur J Polit Econ 24(1):41–52
Ginsburgh VA, Van Ours JC (2003) Expert opinion and compensation: evidence from a musical competition. Am Econ Rev 93(1):289–296
Grabisch M, Marichal J-L, Mesiar R, Pap E (2011a) Aggregation functions: construction methods, conjunctive, disjunctive and mixed classes. Inf Sci 181(1):23–43
Grabisch M, Marichal J-L, Mesiar R, Pap E (2011b) Aggregation functions: means. Inf Sci 181(1):1–22
Hamermesh DS, Schmidt P (2003) The determinants of econometric society fellows elections. Econometrica 71(1):399–407
Helsen W, Gilis B, Weston M (2006) Errors in judging offside in association football: test of the optical error versus the perceptual flash-lag hypothesis. J Sports Sci 24(5):521–528
Hilbert M (2012) Toward a synthesis of cognitive biases: how noisy information processing can bias human decision making. Psychol Bull 138(2):211–237
Hodgson RT (2008) An examination of judge reliability at a major US wine competition. J Wine Econ 3(2):105–113
Kahneman D, Tversky A (1972) Subjective probability: a judgment of representativeness. Cogn Psychol 3(3):430–454
Kahneman D, Tversky A (1996) On the reality of cognitive illusions. Psychol Rev 103(3):582–591
Keynes JM (1936) The general theory of employment, interest and money. Kessinger Publishing, Whitefish
Larsen T, Price J, Wolfers J (2008) Racial bias in the NBA: implications in betting markets. J Quant Anal Sports 4(2):1–21
Lee J (2008) Outlier aversion in subjective evaluation: Evidence from world figure skating championships. J Sports Econ 9(2):141–159
Lock R, Lock J (2003) The statistical sports fan: judging figure skating judges. STATS 36:20–24
Looney MA (2004) Evaluating judge performance in sport. J Appl Meas 5(1):31–47
Meehl PE (1954) Clinical versus statistical prediction: a theoretical analysis and a review of the evidence. University of Minnesota Press, Minneapolis
Nevill AM, Newell SM, Gale S (1996) Factors associated with home advantage in English and Scottish soccer matches. J Sports Sci 14(2):181–186
Osório A (2017) Judgement and ranking: living with hidden bias. Ann Oper Res 253(1):501–518
Page L, Page K (2007) The second leg home advantage: evidence from European football cup competitions. J Sports Sci 25(14):1547–1556
Parsons CA, Sulaeman J, Yates MC, Hamermesh DS (2011) Strike three: discrimination, incentives, and evaluation. Am Econ Rev 101(4):1410–1435
Pfister H-R, Böhm G (2008) The multiplicity of emotions: a framework of emotional functions in decision making. Judgm Decis Mak 3(1):5–17
Plessner H, Haar T (2006) Sports performance judgments from a social cognitive perspective. Psychol Sport Exerc 7(6):555–575
Popović R (2000) International bias detected in judging rhythmic gymnastics competition at Sydney-2000 Olympic Games. Phys Educ Sport 1(7):1–13
Price J, Remer M, Stone DF (2012) Subperfect game: profitable biases of NBA referees. J Econ Manage Strategy 21(1):271–300
Price J, Wolfers J (2010) Racial discrimination among NBA referees. Q J Econ 125(4):1859–1887
Satterthwaite MA (1975) Strategy-proofness and Arrow’s conditions: existence and correspondence theorems for voting procedures and social welfare functions. J Econ Theory 10(2):187–217
Shah AK, Oppenheimer DM (2008) Heuristics made easy: an effort-reduction framework. Psychol Bull 134(2):207
Simon H (1955) A behavioral model of rational choice. Q J Econ 69(1):99–118
Sutter M, Kocher MG (2004) Favoritism of agents-the case of referees’ home bias. J Econ Psychol 25(4):461–469
Tsay C-J (2013) Sight over sound in the judgment of music performance. Proc Nat Acad Sci 110(36):14580–14585
Tversky A, Kahneman D (1974) Judgment under uncertainty: heuristics and biases. Science 185(4157):1124–1131
Unkelbach C, Memmert D (2010) Crowd noise as a cue in referee decisions contributes to the home advantage. J Sport Exerc Psychol 32(4):483–498
Wang XT, Simons F, Brédart S (2001) Social cues and verbal framing in risky choice. J Behav Decis Mak 14(1):1–15
Wolfers J (2006) Point shaving: corruption in NCAA basketball. Am Econ Rev 96(2):279–283
Wu SS, Yang MCK (2004) Evaluation of the current decision rule in figure skating and possible improvements. Am Stat 58(1):46–54
Zitzewitz E (2006) Nationalism in winter sports judging and its lessons for organizational decision making. J Econ Manage Strategy 15(1):67–99
Zitzewitz E (2014) Does transparency reduce favoritism and corruption? evidence from the reform of figure skating judging. J Sports Econ 15(1):3–30
Acknowledgements
Financial support from the GRODE Universitat Rovira i Virgili and Generalitat de Catalunya under Projects 2018PFR-URV-B2-53 and 2017SGR770, and the Spanish Ministry of Science and Innovation Project RTI2018-094733-B-100 (AEI/FEDER, UE) is gratefully acknowledged. I would like to thank Jonathan Baron, Juan Pablo Rincón-Zapatero, the Editor and two anonymous Referees, as well as several seminars and congress participants for helpful comments and discussions. The usual caveat applies.
Osório, A. Performance Evaluation: Subjectivity, Bias and Judgment Style in Sport. Group Decis Negot 29, 655–678 (2020). https://doi.org/10.1007/s10726-020-09672-4