Performance Evaluation: Subjectivity, Bias and Judgment Style in Sport

Group Decision and Negotiation

Abstract

The number of situations that require individual judgments and evaluations, and that may be subject to various sources of conscious and unconscious bias, is endless. This paper proposes a practical score aggregation procedure that attempts to mitigate the influence of bias in subjective judgments. The argument is based on the idea that bias is associated with deviations from the panel mean and/or deviations from the judges’ grading style. Consequently, the procedure is not specific to a particular type of bias, but rather addresses general forms of bias. We also discuss a set of desirable properties. The proposed score aggregation procedure is then applied to a unique data set from the 2000 Summer Olympic Games diving competition.

Notes

  1. In this paper, we focus mostly on sports performance, but the number of situations that require individual judgments and evaluations, and that can be affected by different sources of bias, is endless. The approach in this paper may be extended to these other dimensions of our lives (e.g., the rating of any kind of item, good or service, such as wines, books, films, music, policies, scientific refereeing or any kind of talent competition, as well as tourist locations or blog comments). Nowadays, the internet is making these evaluation procedures increasingly common.

  2. Several studies have focused on bias in evaluation contexts other than sports. To mention just a few: in musical competitions, Ginsburgh and Van Ours (2003) found that judging panel members are influenced by the order of appearance of candidates, while Tsay (2013) found that judges are influenced more by what they see than by what they hear. In the Eurovision Song Contest, Ginsburgh and Noury (2008) found that linguistic and cultural similarities between singers and judges are decisive, while in academic awards, Hamermesh and Schmidt (2003) found that affiliation is crucial in the judges’ decision. In this context, some statistically based rating procedures have shown better results than expert opinions (Dawes et al. 1989; Meehl 1954). Other inconsistencies and paradoxical observations are reported in the literature (Ashenfelter and Quandt 1999; Fritz et al. 2012; Hodgson 2008; Plessner and Haar 2006). Further development of these issues is beyond the scope of the present paper.

  3. The grading style must not vary with the order in which the history is presented, which implies that the history cannot include scores entered in the present competition. Otherwise, the order of the performances could interfere with the measurement of the grading style, which must be stable throughout the competition. This aspect has implications for Sect. 4, where we discuss some of the properties of the proposed aggregation procedure, and it also places restrictions on the use of moving averages as aggregate measurements of the history.

  4. We are intentionally ambiguous about the length of the history of past scores. We leave this decision to the social planner or the sport’s governing body responsible for the competition. The longer and the more recent the history, the better. However, judges on the same panel may have histories that differ in length and also in quality. For that reason, homogenizing all the histories by truncating them to the length of the shortest one may not be a good idea, because it would discard data about the history of the other judges.

  5. Grading style can be defined in different ways. These alternatives have in common the use of information from the history of past scores. For instance, grading style could have been defined as:

    $$\begin{aligned} {\overline{s}}_{i{\mathbf {h}}_{j}}={\overline{s}}_{{\mathbf {h}}_{j}}{\overline{s}}_{i}/ {\overline{s}}_{{\mathbf {h}}}, \end{aligned}$$

    where \({\overline{s}}_{{\mathbf {h}}}\equiv \frac{1}{n}\sum \nolimits _{j=1}^{n} {\overline{s}}_{{\mathbf {h}}_{j}}\) is the arithmetic mean of the grading histories of all the judges on the panel. The results in this paper would not change significantly if we were to consider this measure. However, grading style is more correctly defined if judge \(j\)’s grading history is made relative to the mean score of the panels on which judge \(j\) has participated. Alternatively, grading style could have been defined as: \( {\overline{s}}_{i{\mathbf {h}}_{j}}={\overline{s}}_{i}\sum \nolimits _{t=1}^{T}(s_{.j}^{t}/{\overline{s}}_{.(j)}^{t}).\) This case is conceptually equivalent to the one in this paper, and leads to almost exactly the same results, but the approach in this paper is more intuitive and simpler to apply. Other definitions of grading style are also possible.
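
As an illustration only, the first alternative definition in this note can be sketched in Python as follows. The function name `alt_grading_style` and the toy numbers are hypothetical and are not taken from the paper or its data; the sketch simply evaluates the displayed formula for a three-judge panel.

```python
from statistics import mean

def alt_grading_style(current_scores, history_means):
    """Expected score of each judge under the alternative definition in this note.

    current_scores: scores s_i1..s_in given to performance i by the n judges.
    history_means:  each judge's historical mean score (s-bar_{h_j}).
    Returns s-bar_{i h_j} = s-bar_{h_j} * s-bar_i / s-bar_h for every judge j.
    """
    if len(current_scores) != len(history_means):
        raise ValueError("one history mean per judge is required")
    panel_mean = mean(current_scores)        # s-bar_i
    mean_of_histories = mean(history_means)  # s-bar_h
    return [h * panel_mean / mean_of_histories for h in history_means]

# Hypothetical panel: the first judge is historically strict, the third lenient.
scores = [7.0, 7.5, 8.0]
histories = [6.0, 7.0, 8.0]
styles = alt_grading_style(scores, histories)
```

By construction, the resulting grading styles average to the panel mean of the current performance, so a judge is only flagged for deviating from where their own history would place them.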

  6. The history of past grades of each judge usually includes scores from different competitions with different levels and different stages. In this context, for instance, the average score in the early stages of a competition tends to be lower than the average score in the later stages, in which only the best competitors remain. Similarly, the average scores in national competitions tend to be lower than the average scores in the Olympics, because in the Olympic Games competitors tend to be better on average. The ratio \({\overline{s}}_{{\mathbf {h}}_{j}}/{\overline{s}}_{{\mathbf {h}} _{(j)}}\) corrects for this heterogeneity.
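
The correction described in this note can be illustrated with a small sketch. It assumes, as a simplification not stated in the text, that a judge's history is stored as pairs of (judge score, panel mean for the same performance) and that both averages are plain arithmetic means over the same performances. The name `relative_history_ratio` and the data are hypothetical.

```python
from statistics import mean

def relative_history_ratio(history):
    """Ratio of a judge's historical mean to the mean of the panels they sat on.

    history: list of (judge_score, panel_mean) pairs, one per past performance.
    Assumption: both averages are arithmetic means over the same performances.
    """
    judge_mean = mean(score for score, _ in history)
    panels_mean = mean(panel for _, panel in history)
    return judge_mean / panels_mean

# Judge A sat mostly on lower-scoring panels, judge B on higher-scoring ones.
# Both grade 5% above their panels, so the ratio agrees despite different raw means.
judge_a = [(6.3, 6.0), (6.3, 6.0), (8.4, 8.0)]
judge_b = [(8.4, 8.0), (8.4, 8.0), (6.3, 6.0)]
ratio_a = relative_history_ratio(judge_a)
ratio_b = relative_history_ratio(judge_b)
```

The raw historical means of the two judges differ, but the ratio is the same for both, which is the sense in which it removes the heterogeneity across competition levels.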

  7. We can consider alternative weight functions, but with similar implications. For instance, we can consider different parameters to control for deviations from judge \(j\)’s grading style and from the panel mean, i.e., \( \beta \) and \(\gamma ,\) respectively. In this case, we could have:

    $$\begin{aligned} w_{ij}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\equiv \frac{\alpha \sum \nolimits _{k\ne j}^{n}\left| s_{ik}-{\overline{s}}_{i {\mathbf {h}}_{k}}\right| ^{\beta }+(1-\alpha )\sum \nolimits _{k\ne j}^{n}\left| s_{ik}-{\overline{s}}_{i}\right| ^{\gamma }}{(n-1)(\alpha \sum \nolimits _{k=1}^{n}\left| s_{ik}-{\overline{s}}_{i{\mathbf {h}} _{k}}\right| ^{\beta }+(1-\alpha )\sum \nolimits _{k=1}^{n}\left| s_{ik}-{\overline{s}}_{i}\right| ^{\gamma })}. \end{aligned}$$

    Other formulations are also possible. For instance, we can also consider the following simplified weighted mean formulation:

    $$\begin{aligned} w_{ij}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\equiv \frac{\sum \nolimits _{k\ne j}^{n}\left| s_{ik}-(\alpha {\overline{s}}_{i {\mathbf {h}}_{k}}+(1-\alpha ){\overline{s}}_{i})\right| ^{\gamma }}{ (n-1)\sum \nolimits _{k=1}^{n}\left| s_{ik}-(\alpha {\overline{s}}_{i{\mathbf {h}}_{k}}+(1-\alpha ){\overline{s}}_{i})\right| ^{\gamma }}. \end{aligned}$$

    These formulations may differ slightly in terms of properties, but the crucial aspect is that all of them penalize deviations from the judges’ grading style. Other approaches, like majority judgment, which is based on the median score, have also been considered in the literature (Balinski and Laraki 2007, 2010; Bassett Jr and Persky 1994; Wu and Yang 2004).
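
A minimal sketch of the two weight functions displayed in this note is given below. The grading styles \({\overline{s}}_{i{\mathbf {h}}_{k}}\) are taken as given inputs; the function names, the equal-weight fallback when no judge deviates, and the example numbers are assumptions made for illustration.

```python
def two_exponent_weights(scores, styles, alpha, beta, gamma):
    """Weights w_ij from the first alternative formulation in this note.

    scores: s_i1..s_in for one performance.
    styles: each judge's grading style s-bar_{i h_k}, supplied by the caller.
    """
    n = len(scores)
    panel_mean = sum(scores) / n
    # Per-judge penalty: alpha*|style deviation|^beta + (1-alpha)*|mean deviation|^gamma
    penalty = [
        alpha * abs(s - h) ** beta + (1 - alpha) * abs(s - panel_mean) ** gamma
        for s, h in zip(scores, styles)
    ]
    total = sum(penalty)
    if total == 0:
        return [1.0 / n] * n  # assumption: equal weights when nobody deviates
    # Judge j's numerator sums the penalties of all the OTHER judges.
    return [(total - p) / ((n - 1) * total) for p in penalty]

def blended_weights(scores, styles, alpha, gamma):
    """Weights from the simplified formulation: deviation from a blended reference."""
    n = len(scores)
    panel_mean = sum(scores) / n
    penalty = [
        abs(s - (alpha * h + (1 - alpha) * panel_mean)) ** gamma
        for s, h in zip(scores, styles)
    ]
    total = sum(penalty)
    if total == 0:
        return [1.0 / n] * n  # assumption: equal weights when nobody deviates
    return [(total - p) / ((n - 1) * total) for p in penalty]

# Hypothetical panel in which the third judge deviates most on both criteria.
scores = [7.0, 7.5, 9.5]
styles = [7.2, 7.4, 8.0]
w_two = two_exponent_weights(scores, styles, alpha=0.5, beta=2, gamma=2)
w_blend = blended_weights(scores, styles, alpha=0.5, gamma=2)
```

In both cases the weights sum to one, and the judge with the largest deviation receives the smallest weight, which is the penalizing behavior the note describes.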

  8. In the case where \(\gamma \rightarrow 0,\) the score aggregation procedure converges to the mean \({\overline{s}}_{i}^{*}\rightarrow {\overline{s}}_{i},\) because all grades are equally weighted, while in the case where \(\gamma \rightarrow \infty ,\) the score aggregation procedure ignores the most extreme score and weights all the other scores equally (with some specificities in the case of more than one extreme score).
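
    These two limits can be checked on the single-exponent formulation of Note 7, used here as a stand-in for the weight function of the main text. Write \(d_{k}\) for judge \(k\)'s absolute deviation (an abbreviation introduced only for this check), and assume that all \(d_{k}>0\) and that a single judge \(m\) has the largest deviation. Then:

    $$\begin{aligned} w_{ij}=\frac{\sum \nolimits _{k\ne j}^{n}d_{k}^{\gamma }}{(n-1)\sum \nolimits _{k=1}^{n}d_{k}^{\gamma }}=\frac{1}{n-1}\left( 1-\frac{d_{j}^{\gamma }}{\sum \nolimits _{k=1}^{n}d_{k}^{\gamma }}\right) . \end{aligned}$$

    As \(\gamma \rightarrow 0,\) every \(d_{k}^{\gamma }\rightarrow 1,\) so the ratio inside the parentheses tends to \(1/n\) and \(w_{ij}\rightarrow \frac{1}{n-1}(1-\frac{1}{n})=\frac{1}{n}\) for all \(j.\) As \(\gamma \rightarrow \infty ,\) the ratio tends to 1 for \(j=m\) and to 0 otherwise, so \(w_{im}\rightarrow 0\) and \(w_{ij}\rightarrow \frac{1}{n-1}\) for \(j\ne m.\)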

  9. The score aggregation function can be written in more general terms as:

    $$\begin{aligned} {\overline{s}}_{i}^{+}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\equiv \sum \nolimits _{j=1}^{n}f_{j}({\mathbf {s}}_{i},\{ {\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})g(s_{ij}), \end{aligned}$$

    where \(f_{j}(.)\) is a weight function that receives the vectors \({\mathbf {s}} _{i}\) and \(\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n}\) as inputs, and \(g(.)\) is a function that receives the grade of judge \(j\) as input. Then, if the function \(f_{j}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\) is continuous and homogeneous of degree zero in \({\mathbf {s}} _{i},\) with \(f_{j}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\in [0,1]\) and \(\sum \nolimits _{j=1}^{n}f_{j}( {\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})=1,\) and the function \(g(s_{ij})=s_{ij},\) then the properties of the general aggregation function \({\overline{s}}_{i}^{+}\) match the properties of \({\overline{s}} _{i}^{*}.\)
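
The general form in this note can be sketched as a function that takes the weight function and \(g\) as arguments. For brevity the histories are omitted, so the hypothetical weight function `deviation_weights` below depends on the current scores only; it is an illustrative choice, not the paper's weight function.

```python
def aggregate(scores, weight_fn, g=lambda s: s):
    """General aggregation: sum over judges of f_j(scores) * g(s_ij).

    weight_fn: callable returning one weight per judge. The weights are
    checked to lie in [0, 1] and to sum to one, as the note requires.
    """
    weights = weight_fn(scores)
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0 or w > 1 for w in weights):
        raise ValueError("weights must lie in [0, 1] and sum to one")
    return sum(w * g(s) for w, s in zip(weights, scores))

def deviation_weights(scores, gamma=2):
    """Hypothetical f_j: leave-one-out share of deviations from the panel mean."""
    n = len(scores)
    m = sum(scores) / n
    d = [abs(s - m) ** gamma for s in scores]
    total = sum(d)
    if total == 0:
        return [1.0 / n] * n  # assumption: equal weights when all scores agree
    return [(total - x) / ((n - 1) * total) for x in d]

scores = [7.0, 7.5, 9.5]
result = aggregate(scores, deviation_weights)
# Doubling every score leaves the weights unchanged (degree-zero homogeneity),
# so the aggregate simply doubles when g is the identity.
scaled = aggregate([2 * s for s in scores], deviation_weights)
```

Because the weights are nonnegative and sum to one, the aggregate with \(g(s_{ij})=s_{ij}\) always lies between the lowest and the highest score on the panel.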

  10. We consider the intermediate cases \(\alpha =1/3\) and \(\alpha =2/3\) with \( \gamma =2,\) because they are sufficiently representative and informative.

  11. The statistical method employed by Emerson et al. (2009) is particularly powerful for detecting bias and manipulation. However, since bias can be hidden in very complex and strategic ways, there is no perfect method to deal with this possibility. For instance, a judge may penalize a particular athlete in the early stages of competition, in which the qualification of that athlete is almost guaranteed (because of the athlete’s quality), in order to later benefit this same athlete in the most crucial stages of the competition. Similarly, a judge may simultaneously penalize and benefit two different athletes of the same nationality. In those cases, the aggregation of data is likely to lead to the conclusion that bias is not statistically significant because of cancellation effects.

  12. In this context, in order to introduce bias, judges are forced to award more extreme scores than when the aggregation function is simply the arithmetic mean. Such extreme behavior exposes them to public opinion and to detection by third-party monitoring. For this reason, the simultaneous use of transparency policies is important.

  13. In some cases, scores are dissociated from the identity of the judges, which makes bias analysis extremely difficult for researchers and the general public. In other cases (e.g., online judgment of items, goods or services), the data is proprietary and not freely available.

References

  • Arrow KJ (1950) A difficulty in the concept of social welfare. J Polit Econ 58(4):328–346

  • Asch S (1951) Effects of group pressure upon the modification and distortion of judgments. In: Guetzkow H (ed) Groups, Leadership, and Men, pp 222–236

  • Ashenfelter O, Quandt R (1999) Analyzing a wine tasting statistically. Chance 12(3):16–20

  • Baker GP (1992) Incentive contracts and performance measurement. J Polit Econ 100(3):598–614

  • Balinski M, Laraki R (2007) A theory of measuring, electing, and ranking. Proc Nat Acad Sci 104(21):8720–8725

  • Balinski M, Laraki R (2010) Majority judgment: measuring, ranking, and electing. MIT Press, Cambridge

  • Balinski M, Laraki R (2014) Judge: don’t vote! Oper Res 62(3):483–511

  • Bar-Eli M, Plessner H, Raab M (2011) Judgment, decision-making and success in sport. John Wiley & Sons, New Jersey

  • Baron J (2007) Thinking and deciding, 4th edn. Cambridge University Press, Cambridge

  • Bassett GW Jr, Persky J (1994) Rating skating. J Am Stat Assoc 89(427):1075–1079

  • Beliakov G, Pradera A, Calvo T (2007) Aggregation functions: a guide for practitioners, vol 221. Springer, Heidelberg

  • Buchanan JT, Henig EJ, Henig MI (1998) Objectivity and subjectivity in the decision making process. Ann Oper Res 80:333–345

  • Coupe T, Gergaud O, Noury A (2018) Biases and strategic behaviour in performance evaluation: the case of the FIFA’s best soccer player award. Oxf Bull Econ Stat 80(2):358–379

  • Cust EE, Sweeting AJ, Ball K, Robertson S (2019) Machine and deep learning for sport-specific movement recognition: a systematic review of model development and performance. J Sports Sci 37(5):568–600

  • Damisch L, Mussweiler T, Plessner H (2006) Olympic medals as fruits of comparison? Assimilation and contrast in sequential performance judgments. J Exp Psychol Appl 12(3):166–178

  • Dawes RM, Faust D, Meehl PE (1989) Clinical versus actuarial judgment. Science 243(4899):1668–1674

  • Deutsch M, Gerard HB (1955) A study of normative and informational social influences upon individual judgment. J Abnormal Soc Psychol 51(3):629–636

  • Díaz-Pereira MP, Gomez-Conde I, Escalona M, Olivieri DN (2014) Automatic recognition and scoring of Olympic rhythmic gymnastic movements. Hum Mov Sci 34:63–80

  • Dohmen T, Sauermann J (2016) Referee bias. J Econ Surv 30(4):679–695

  • Duggan M, Levitt SD (2002) Winning isn’t everything: corruption in sumo wrestling. Am Econ Rev 92(5):1594–1605

  • Emerson JW, Seltzer M, Lin D (2009) Assessing judging bias: an example from the 2000 Olympic Games. Am Stat 63(2):124–131

  • Felsenthal DS, Machover M (2008) The majority judgement voting procedure: a critical evaluation. Homo Oeconomicus 25(3/4):319–334

  • Findlay LC, Ste-Marie DM (2004) A reputation bias in figure skating judging. J Sport Exerc Psychol 26(1):154–166

  • Frank MG, Gilovich T (1988) The dark side of self- and social perception: black uniforms and aggression in professional sports. J Pers Soc Psychol 54(1):74–85

  • Frey B (2017) Omnimetrics and awards. Tech. rep., Center for Research in Economics, Management and the Arts (CREMA)

  • Frey BS, Gallus J (2017) Towards an economics of awards. J Econ Surv 31(1):190–200

  • Fritz C, Curtin J, Poitevineau J, Morrel-Samuels P, Tao F-C (2012) Player preferences among new and old violins. Proc Nat Acad Sci 109(3):760–763

  • Garicano L, Palacios-Huerta I, Prendergast C (2005) Favoritism under social pressure. Rev Econ Stat 87(2):208–216

  • Gibbard A (1973) Manipulation of voting schemes: a general result. Econometrica 41(4):587–601

  • Ginsburgh V, Noury AG (2008) The Eurovision Song Contest. Is voting political or cultural? Eur J Polit Econ 24(1):41–52

  • Ginsburgh VA, Van Ours JC (2003) Expert opinion and compensation: evidence from a musical competition. Am Econ Rev 93(1):289–296

  • Grabisch M, Marichal J-L, Mesiar R, Pap E (2011a) Aggregation functions: construction methods, conjunctive, disjunctive and mixed classes. Inf Sci 181(1):23–43

  • Grabisch M, Marichal J-L, Mesiar R, Pap E (2011b) Aggregation functions: means. Inf Sci 181(1):1–22

  • Hamermesh DS, Schmidt P (2003) The determinants of Econometric Society fellows elections. Econometrica 71(1):399–407

  • Helsen W, Gilis B, Weston M (2006) Errors in judging offside in association football: test of the optical error versus the perceptual flash-lag hypothesis. J Sports Sci 24(5):521–528

  • Hilbert M (2012) Toward a synthesis of cognitive biases: how noisy information processing can bias human decision making. Psychol Bull 138(2):211–237

  • Hodgson RT (2008) An examination of judge reliability at a major US wine competition. J Wine Econ 3(2):105–113

  • Kahneman D, Tversky A (1972) Subjective probability: a judgment of representativeness. Cogn Psychol 3(3):430–454

  • Kahneman D, Tversky A (1996) On the reality of cognitive illusions. Psychol Rev 103(3):582–591

  • Keynes JM (1936) The general theory of employment, interest and money. Kessinger Publishing, Whitefish

  • Larsen T, Price J, Wolfers J (2008) Racial bias in the NBA: implications in betting markets. J Quant Anal Sports 4(2):1–21

  • Lee J (2008) Outlier aversion in subjective evaluation: Evidence from world figure skating championships. J Sports Econ 9(2):141–159

  • Lock R, Lock J (2003) The statistical sports fan: judging figure skating judges. STATS 36:20–24

  • Looney MA (2004) Evaluating judge performance in sport. J Appl Meas 5(1):31–47

  • Meehl PE (1954) Clinical versus statistical prediction: a theoretical analysis and a review of the evidence. University of Minnesota Press, Minneapolis

  • Nevill AM, Newell SM, Gale S (1996) Factors associated with home advantage in English and Scottish soccer matches. J Sports Sci 14(2):181–186

  • Osório A (2017) Judgement and ranking: living with hidden bias. Ann Oper Res 253(1):501–518

  • Page L, Page K (2007) The second leg home advantage: evidence from European football cup competitions. J Sports Sci 25(14):1547–1556

  • Parsons CA, Sulaeman J, Yates MC, Hamermesh DS (2011) Strike three: discrimination, incentives, and evaluation. Am Econ Rev 101(4):1410–1435

  • Pfister H-R, Böhm G (2008) The multiplicity of emotions: a framework of emotional functions in decision making. Judgm Decis Mak 3(1):5–17

  • Plessner H, Haar T (2006) Sports performance judgments from a social cognitive perspective. Psychol Sport Exerc 7(6):555–575

  • Popović R (2000) International bias detected in judging rhythmic gymnastics competition at Sydney-2000 Olympic Games. Phys Educ Sport 1(7):1–13

  • Price J, Remer M, Stone DF (2012) Subperfect game: profitable biases of NBA referees. J Econ Manage Strategy 21(1):271–300

  • Price J, Wolfers J (2010) Racial discrimination among NBA referees. Q J Econ 125(4):1859–1887

  • Satterthwaite MA (1975) Strategy-proofness and Arrow’s conditions: existence and correspondence theorems for voting procedures and social welfare functions. J Econ Theory 10(2):187–217

  • Shah AK, Oppenheimer DM (2008) Heuristics made easy: an effort-reduction framework. Psychol Bull 134(2):207

  • Simon H (1955) A behavioral model of rational choice. Q J Econ 69(1):99–118

  • Sutter M, Kocher MG (2004) Favoritism of agents: the case of referees’ home bias. J Econ Psychol 25(4):461–469

  • Tsay C-J (2013) Sight over sound in the judgment of music performance. Proc Nat Acad Sci 110(36):14580–14585

  • Tversky A, Kahneman D (1974) Judgment under uncertainty: heuristics and biases. Science 185(4157):1124–1131

  • Unkelbach C, Memmert D (2010) Crowd noise as a cue in referee decisions contributes to the home advantage. J Sport Exerc Psychol 32(4):483–498

  • Wang XT, Simons F, Brédart S (2001) Social cues and verbal framing in risky choice. J Behav Decis Mak 14(1):1–15

  • Wolfers J (2006) Point shaving: corruption in NCAA basketball. Am Econ Rev 96(2):279–283

  • Wu SS, Yang MCK (2004) Evaluation of the current decision rule in figure skating and possible improvements. Am Stat 58(1):46–54

  • Zitzewitz E (2006) Nationalism in winter sports judging and its lessons for organizational decision making. J Econ Manage Strategy 15(1):67–99

  • Zitzewitz E (2014) Does transparency reduce favoritism and corruption? Evidence from the reform of figure skating judging. J Sports Econ 15(1):3–30

Acknowledgements

Financial support from the GRODE Universitat Rovira i Virgili and Generalitat de Catalunya under Projects 2018PFR-URV-B2-53 and 2017SGR770, and the Spanish Ministry of Science and Innovation Project RTI2018-094733-B-100 (AEI/FEDER, UE) is gratefully acknowledged. I would like to thank Jonathan Baron, Juan Pablo Rincón-Zapatero, the Editor and two anonymous Referees, as well as several seminar and congress participants for helpful comments and discussions. The usual caveat applies.

Author information

Corresponding author

Correspondence to António Osório.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Osório, A. Performance Evaluation: Subjectivity, Bias and Judgment Style in Sport. Group Decis Negot 29, 655–678 (2020). https://doi.org/10.1007/s10726-020-09672-4

  • DOI: https://doi.org/10.1007/s10726-020-09672-4
