Abstract
The number of situations that require individual judgments and evaluations, and that may be subject to different sources of conscious and unconscious bias, is endless. This paper proposes a practical score aggregation procedure that attempts to reduce and mitigate the influence of bias in subjective judgments. The argument is based on the idea that bias is associated with deviations from the panel mean and/or deviations from the judges’ grading styles. Consequently, the procedure is not specific to a particular type of bias but addresses general forms of bias. We also discuss a set of desirable properties. The proposed score aggregation procedure is then applied to a unique data set from the diving competition at the 2000 Summer Olympic Games.
Notes
In this paper, we focus mostly on sports performance, but the number of situations that require individual judgments and evaluations, and that can be affected by different sources of bias, is endless. The approach in this paper may be extended to these other dimensions of our lives (e.g., the rating of any kind of items, goods and services, such as wines, books, films, music, policies, scientific refereeing, talent competitions, tourist locations or blog comments). Nowadays, the internet is making such evaluation procedures increasingly common.
Several studies have focused on bias in evaluation contexts other than sports. For instance, to mention just a few, in musical competitions, Ginsburgh and Van Ours (2003) found that judging panel members are influenced by the order of appearance of candidates, while Tsay (2013) found that judges are influenced more by what they see than by what they hear. In the Eurovision Song Contest, Ginsburgh and Noury (2008) found that linguistic and cultural similarities between singers and judges are decisive, while in academic awards, Hamermesh and Schmidt (2003) found that affiliation is crucial in the judges’ decision. In this context, some statistically based rating procedures have shown better results than expert opinions (Dawes et al. 1989; Meehl 1954). Other inconsistencies and paradoxical observations are reported in the literature (Ashenfelter and Quandt 1999; Fritz et al. 2012; Hodgson 2008; Plessner and Haar 2006). Further development of these issues is beyond the scope of the present paper.
The grading style must not vary with the order in which the scores in the history are presented. This aspect has implications for Sect. 4, where we discuss some of the properties of the proposed aggregation procedure. It implies that the history cannot include scores entered in the present competition; otherwise, the order of the performances could interfere with the measurement of the grading style, which must remain stable throughout the competition. It also places restrictions on the use of moving averages as aggregate measurements of the history.
We are intentionally ambiguous about the length of the history of past scores. We leave this decision to the social planner or the sport’s governing body responsible for the competition. The longer the history, and the closer it is in time, the better. However, on the same panel we may have judges whose histories differ not only in length but also in quality. In that sense, homogenizing all the histories by using the length of the shortest history as a reference may not be a good idea, because it would imply a loss of data about the histories of the other judges.
Grading style can be defined in different ways. These alternatives have in common the use of information from the history of past scores. For instance, grading style could have been defined as:
$$\begin{aligned} {\overline{s}}_{i{\mathbf {h}}_{j}}={\overline{s}}_{{\mathbf {h}}_{j}}{\overline{s}}_{i}/ {\overline{s}}_{{\mathbf {h}}}, \end{aligned}$$where \({\overline{s}}_{{\mathbf {h}}}\equiv \frac{1}{n}\sum \nolimits _{j=1}^{n} {\overline{s}}_{{\mathbf {h}}_{j}}\) is the arithmetic mean of the grading histories of all the judges on the panel. The results in this paper would not change significantly if we were to consider this measure. However, grading style is more correctly defined if judge \(j^{\prime }s\) grading history is made relative to the mean score of the panels on which judge j has participated. Alternatively, grading style could have been defined as: \( {\overline{s}}_{i{\mathbf {h}}_{j}}={\overline{s}}_{i}\frac{1}{T}\sum \nolimits _{t=1}^{T}(s_{.j}^{t}/{\overline{s}}_{.(j)}^{t}).\) This case is conceptually equivalent to the one in this paper and leads to almost exactly the same results, but the approach in this paper is more intuitive and simpler to apply. Other definitions of grading style are also possible.
The history of past grades of each judge usually includes scores from different competitions, levels and stages. In this context, for instance, the average score in the early stages of a competition tends to be lower than the average score in the later stages, in which only the best competitors are left. Similarly, average scores in national competitions tend to be lower than average scores in the Olympics, because in the Olympic Games competitors tend to be better on average. The ratio \({\overline{s}}_{{\mathbf {h}}_{j}}/{\overline{s}}_{{\mathbf {h}}_{(j)}}\) corrects for this heterogeneity.
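As an illustration of how this ratio can be used, the sketch below computes a judge’s grading-style correction from a hypothetical history. The data, the variable names and the final multiplication by the current panel mean are assumptions for illustration, not the paper’s exact formulas.

```python
import numpy as np

# Hypothetical data: judge j's past scores and, for each of those
# occasions, the mean score awarded by the panel judge j sat on.
history_j = np.array([7.8, 8.9, 8.2])    # past scores of judge j
panel_means = np.array([7.5, 8.8, 8.0])  # panel mean on each occasion

# The ratio corrects for heterogeneity across competitions and stages:
# it captures whether judge j tends to grade above or below the panels
# he or she has participated in.
style_ratio = history_j.mean() / panel_means.mean()

# Expected "grading style" score of judge j for performance i, given
# the current panel mean for that performance (hypothetical value).
panel_mean_i = 8.4
expected_ij = style_ratio * panel_mean_i
```

Since this hypothetical judge grades slightly above his or her past panels, the expected score exceeds the current panel mean by the same proportion.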
We can consider alternative weight functions with similar implications. For instance, we can use different parameters, \( \beta \) and \( \gamma \), to control for deviations from judge \(j^{\prime }s\) grading style and from the panel mean, respectively. In this case, we could have:
$$\begin{aligned} w_{ij}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\equiv \frac{\alpha \sum \nolimits _{k\ne j}^{n}\left| s_{ik}-{\overline{s}}_{i {\mathbf {h}}_{k}}\right| ^{\beta }+(1-\alpha )\sum \nolimits _{k\ne j}^{n}\left| s_{ik}-{\overline{s}}_{i}\right| ^{\gamma }}{(n-1)(\alpha \sum \nolimits _{k=1}^{n}\left| s_{ik}-{\overline{s}}_{i{\mathbf {h}} _{k}}\right| ^{\beta }+(1-\alpha )\sum \nolimits _{k=1}^{n}\left| s_{ik}-{\overline{s}}_{i}\right| ^{\gamma })}. \end{aligned}$$Other formulations are also possible. For instance, we can also consider the following simplified weighted mean formulation:
$$\begin{aligned} w_{ij}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})\equiv \frac{\sum \nolimits _{k\ne j}^{n}\left| s_{ik}-(\alpha {\overline{s}}_{i {\mathbf {h}}_{k}}+(1-\alpha ){\overline{s}}_{i})\right| ^{\gamma }}{ (n-1)\sum \nolimits _{k=1}^{n}\left| s_{ik}-(\alpha {\overline{s}}_{i{\mathbf {h}}_{k}}+(1-\alpha ){\overline{s}}_{i})\right| ^{\gamma }}. \end{aligned}$$These formulations may differ slightly in terms of properties, but the crucial aspect is that all of them penalize deviations from the judges’ grading styles. Other approaches, like majority judgment, which is based on the median score, have also been considered in the literature (Balinski and Laraki 2007, 2010; Bassett Jr and Persky 1994; Wu and Yang 2004).
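As a minimal sketch, the simplified weighted-mean formulation above can be implemented as follows. The scores and the grading-style estimates (the \({\overline{s}}_{i{\mathbf {h}}_{k}}\) terms, here the variable `style`) are hypothetical inputs.

```python
import numpy as np

def weights(s_i, style, alpha=0.5, gamma=2.0):
    """Weights from the simplified weighted-mean formulation.

    s_i   : scores awarded to performance i by the n judges
    style : expected score of each judge given his or her grading
            history (the s-bar_{i h_k} terms); hypothetical inputs
    """
    s_i = np.asarray(s_i, dtype=float)
    style = np.asarray(style, dtype=float)
    n = len(s_i)
    ref = alpha * style + (1 - alpha) * s_i.mean()  # blended reference
    dev = np.abs(s_i - ref) ** gamma                # penalized deviations
    # judge j's weight aggregates the deviations of everyone except j,
    # so larger own deviations yield smaller weights
    return (dev.sum() - dev) / ((n - 1) * dev.sum())

scores = [8.5, 9.0, 9.5, 7.0, 9.0]
style = [8.9, 8.8, 9.1, 8.7, 9.0]  # hypothetical grading-style estimates
w = weights(scores, style)
s_star = float(np.dot(w, scores))  # bias-mitigated aggregate score
```

By construction the weights sum to one, and the judge furthest from the blended reference (here, the one awarding 7.0) receives the smallest weight.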
In the case where \(\gamma \rightarrow 0,\) the score aggregation procedure converges to the mean \({\overline{s}}_{i}^{*}\rightarrow {\overline{s}}_{i},\) because all grades are equally weighted, while in the case where \(\gamma \rightarrow \infty ,\) the score aggregation procedure ignores the most extreme score and weights all the other scores equally (with some specificities in the case of more than one extreme score).
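These limiting cases can be checked numerically. The sketch below is a simplification that sets \(\alpha =0\), so that only deviations from the panel mean are penalized, and uses hypothetical scores.

```python
import numpy as np

def agg(s_i, gamma):
    """Aggregate score with alpha = 0: only deviations from the
    panel mean are penalized."""
    s_i = np.asarray(s_i, dtype=float)
    n = len(s_i)
    dev = np.abs(s_i - s_i.mean()) ** gamma
    w = (dev.sum() - dev) / ((n - 1) * dev.sum())
    return float(np.dot(w, s_i))

scores = [8.0, 8.5, 9.5, 6.0, 8.5]  # panel mean is 8.1

# gamma -> 0: all deviations are penalized equally, so the procedure
# converges to the plain arithmetic mean.
low = agg(scores, 1e-9)

# gamma large: the most extreme score (6.0) receives weight close to
# zero and the remaining scores are weighted almost equally.
high = agg(scores, 50.0)
```

In this example `low` is close to the panel mean of 8.1, while `high` is close to the mean of the four non-extreme scores.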
The score aggregation function can be written in more general terms as:
$$\begin{aligned} {\overline{s}}_{i}^{+}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\equiv \sum \nolimits _{j=1}^{n}f_{j}({\mathbf {s}}_{i},\{ {\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})g(s_{ij}), \end{aligned}$$where \(f_{j}(.)\) is a weight function that receives the vectors \({\mathbf {s}} _{i}\) and \(\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n}\) as inputs, and g(.) is a function that receives the grade of judge j as input. Then, if the function \(f_{j}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\) is continuous, homogeneous of degree zero on \({\mathbf {s}} _{i},\) with \(f_{j}({\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}} _{(k)}\}_{k=1}^{n})\in [0,1]\) and \(\sum \nolimits _{j=1}^{n}f_{j}( {\mathbf {s}}_{i},\{{\mathbf {h}}_{k},{\mathbf {h}}_{(k)}\}_{k=1}^{n})=1,\) and the function \(g(s_{ij})=s_{ij},\) then the properties of the general aggregation function \({\overline{s}}_{i}^{+}\) match the properties of \({\overline{s}} _{i}^{*}.\)
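A sketch of this general form, under the stated conditions and with \(g\) the identity, is given below. The scores are hypothetical; note that any weight function built as a ratio of functions of the deviations, such as the example `f_dev`, is homogeneous of degree zero in the scores, and \(f_{j}=1/n\) recovers the arithmetic mean as a special case.

```python
import numpy as np

def general_agg(s_i, f):
    """Aggregate as sum_j f_j(s_i) * g(s_ij), with g the identity."""
    s_i = np.asarray(s_i, dtype=float)
    w = f(s_i)
    # the weight function must be a proper convex combination
    assert np.isclose(w.sum(), 1.0) and np.all((w >= 0) & (w <= 1))
    return float(np.dot(w, s_i))

scores = [8.0, 8.5, 9.0, 6.5]

# f_j = 1/n recovers the arithmetic mean as a special case.
mean = general_agg(scores, lambda s: np.full(len(s), 1.0 / len(s)))

# A ratio of squared deviations is homogeneous of degree zero in the
# scores and downweights the judge furthest from the panel mean.
def f_dev(s):
    d = (s - s.mean()) ** 2
    return (d.sum() - d) / ((len(s) - 1) * d.sum())

robust = general_agg(scores, f_dev)
```

With these scores, the deviation-based weights discount the low outlier of 6.5, so the robust aggregate lies above the plain mean.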
We consider the intermediate cases \(\alpha =1/3\) and \(\alpha =2/3\) with \( \gamma =2,\) because they are sufficiently representative and informative.
The statistical method employed by Emerson et al. (2009) is particularly powerful for detecting bias and manipulation. However, since bias can be hidden in very complex and strategic ways, there is no perfect method to deal with this possibility. For instance, a judge may penalize a particular athlete in the early stages of competition, in which the qualification of that athlete is almost guaranteed (because of the athlete’s quality), in order to later benefit this same athlete in the most crucial stages of the competition. Similarly, a judge may simultaneously penalize and benefit two different athletes of the same nationality. In those cases, the aggregation of data is likely to lead to the conclusion that bias is not statistically significant because of cancellation effects.
In this context, in order to introduce bias, judges are forced to award more extreme scores than when the aggregation function is simply the arithmetic mean. Such extreme behavior exposes them to public opinion and to detection by third party monitoring. For this reason, the simultaneous use of transparency policies is important.
In some cases, scores are dissociated from the identity of the judges, which makes bias analysis extremely difficult for researchers and the general public. In other cases (e.g., online judgment of items, goods or services), the data is proprietary and not freely available.
References
Arrow KJ (1950) A difficulty in the concept of social welfare. J Polit Econ 58(4):328–346
Asch S (1951) Effects of group pressure upon the modification and distortion of judgments. In: Guetzkow H (ed) Groups, Leadership, and Men, pp 222–236
Ashenfelter O, Quandt R (1999) Analyzing a wine tasting statistically. Chance 12(3):16–20
Baker GP (1992) Incentive contracts and performance measurement. J Polit Econ 100(3):598–614
Balinski M, Laraki R (2007) A theory of measuring, electing, and ranking. Proc Nat Acad Sci 104(21):8720–8725
Balinski M, Laraki R (2010) Majority judgment: measuring, ranking, and electing. MIT Press, Cambridge
Balinski M, Laraki R (2014) Judge: don’t vote!. Oper Res 62(3):483–511
Bar-Eli M, Plessner H, Raab M (2011) Judgment, decision-making and success in sport. John Wiley & Sons, New Jersey
Baron J (2007) Thinking and deciding, 4th edn. Cambridge University Press, Cambridge
Bassett GW Jr, Persky J (1994) Rating skating. J Am Stat Assoc 89(427):1075–1079
Beliakov G, Pradera A, Calvo T (2007) Aggregation functions: a guide for practitioners, vol 221. Springer, Heidelberg
Buchanan JT, Henig EJ, Henig MI (1998) Objectivity and subjectivity in the decision making process. Ann Oper Res 80:333–345
Coupe T, Gergaud O, Noury A (2018) Biases and strategic behaviour in performance evaluation: the case of the FIFA’s best soccer player award. Oxf Bull Econ Stat 80(2):358–379
Cust EE, Sweeting AJ, Ball K, Robertson S (2019) Machine and deep learning for sport-specific movement recognition: a systematic review of model development and performance. J Sports Sci 37(5):568–600
Damisch L, Mussweiler T, Plessner H (2006) Olympic medals as fruits of comparison? assimilation and contrast in sequential performance judgments. J Exp Psychol Appl 12(3):166–178
Dawes RM, Faust D, Meehl PE (1989) Clinical versus actuarial judgment. Science 243(4899):1668–1674
Deutsch M, Gerard HB (1955) A study of normative and informational social influences upon individual judgment. J Abnormal Soc Psychol 51(3):629–636
Díaz-Pereira MP, Gomez-Conde I, Escalona M, Olivieri DN (2014) Automatic recognition and scoring of Olympic rhythmic gymnastic movements. Hum Mov Sci 34:63–80
Dohmen T, Sauermann J (2016) Referee bias. J Econ Surv 30(4):679–695
Duggan M, Levitt SD (2002) Winning isn’t everything: corruption in sumo wrestling. Am Econ Rev 92(5):1594–1605
Emerson JW, Seltzer M, Lin D (2009) Assessing judging bias: an example from the 2000 Olympic Games. Am Stat 63(2):124–131
Felsenthal DS, Machover M (2008) The majority judgement voting procedure: a critical evaluation. Homo Oeconomicus 25(3/4):319–334
Findlay LC, Ste-Marie DM (2004) A reputation bias in figure skating judging. J Sport Exerc Psychol 26(1):154–166
Frank MG, Gilovich T (1988) The dark side of self-and social perception: black uniforms and aggression in professional sports. J Pers Soc Psychol 54(1):74–85
Frey B (2017) Omnimetrics and awards. Tech. rep., Center for Research in Economics, Management and the Arts (CREMA)
Frey BS, Gallus J (2017) Towards an economics of awards. J Econ Surv 31(1):190–200
Fritz C, Curtin J, Poitevineau J, Morrel-Samuels P, Tao F-C (2012) Player preferences among new and old violins. Proc Nat Acad Sci 109(3):760–763
Garicano L, Palacios-Huerta I, Prendergast C (2005) Favoritism under social pressure. Rev Econ Stat 87(2):208–216
Gibbard A (1973) Manipulation of voting schemes: a general result. Econometrica 41(4):587–601
Ginsburgh V, Noury AG (2008) The Eurovision Song Contest. Is voting political or cultural? Eur J Polit Econ 24(1):41–52
Ginsburgh VA, Van Ours JC (2003) Expert opinion and compensation: evidence from a musical competition. Am Econ Rev 93(1):289–296
Grabisch M, Marichal J-L, Mesiar R, Pap E (2011a) Aggregation functions: construction methods, conjunctive, disjunctive and mixed classes. Inf Sci 181(1):23–43
Grabisch M, Marichal J-L, Mesiar R, Pap E (2011b) Aggregation functions: means. Inf Sci 181(1):1–22
Hamermesh DS, Schmidt P (2003) The determinants of econometric society fellows elections. Econometrica 71(1):399–407
Helsen W, Gilis B, Weston M (2006) Errors in judging offside in association football: test of the optical error versus the perceptual flash-lag hypothesis. J Sports Sci 24(5):521–528
Hilbert M (2012) Toward a synthesis of cognitive biases: how noisy information processing can bias human decision making. Psychol Bull 138(2):211–237
Hodgson RT (2008) An examination of judge reliability at a major US wine competition. J Wine Econ 3(2):105–113
Kahneman D, Tversky A (1972) Subjective probability: a judgment of representativeness. Cogn Psychol 3(3):430–454
Kahneman D, Tversky A (1996) On the reality of cognitive illusions. Psychol Rev 103(3):582–591
Keynes JM (1936) The general theory of employment, interest and money. Kessinger Publishing, Whitefish
Larsen T, Price J, Wolfers J (2008) Racial bias in the NBA: implications in betting markets. J Quant Anal Sports 4(2):1–21
Lee J (2008) Outlier aversion in subjective evaluation: Evidence from world figure skating championships. J Sports Econ 9(2):141–159
Lock R, Lock J (2003) The statistical sports fan: judging figure skating judges. STATS 36:20–24
Looney MA (2004) Evaluating judge performance in sport. J Appl Meas 5(1):31–47
Meehl PE (1954) Clinical versus statistical prediction: a theoretical analysis and a review of the evidence. University of Minnesota Press, Minneapolis
Nevill AM, Newell SM, Gale S (1996) Factors associated with home advantage in English and Scottish soccer matches. J Sports Sci 14(2):181–186
Osório A (2017) Judgement and ranking: living with hidden bias. Ann Oper Res 253(1):501–518
Page L, Page K (2007) The second leg home advantage: evidence from European football cup competitions. J Sports Sci 25(14):1547–1556
Parsons CA, Sulaeman J, Yates MC, Hamermesh DS (2011) Strike three: discrimination, incentives, and evaluation. Am Econ Rev 101(4):1410–1435
Pfister H-R, Böhm G (2008) The multiplicity of emotions: a framework of emotional functions in decision making. Judgm Decis Mak 3(1):5–17
Plessner H, Haar T (2006) Sports performance judgments from a social cognitive perspective. Psychol Sport Exerc 7(6):555–575
Popović R (2000) International bias detected in judging rhythmic gymnastics competition at Sydney-2000 Olympic Games. Phys Educ Sport 1(7):1–13
Price J, Remer M, Stone DF (2012) Subperfect game: profitable biases of NBA referees. J Econ Manage Strategy 21(1):271–300
Price J, Wolfers J (2010) Racial discrimination among NBA referees. Q J Econ 125(4):1859–1887
Satterthwaite MA (1975) Strategy-proofness and Arrow’s conditions: existence and correspondence theorems for voting procedures and social welfare functions. J Econ Theory 10(2):187–217
Shah AK, Oppenheimer DM (2008) Heuristics made easy: an effort-reduction framework. Psychol Bull 134(2):207
Simon H (1955) A behavioral model of rational choice. Q J Econ 69(1):99–118
Sutter M, Kocher MG (2004) Favoritism of agents-the case of referees’ home bias. J Econ Psychol 25(4):461–469
Tsay C-J (2013) Sight over sound in the judgment of music performance. Proc Nat Acad Sci 110(36):14580–14585
Tversky A, Kahneman D (1974) Judgment under uncertainty: heuristics and biases. Science 185(4157):1124–1131
Unkelbach C, Memmert D (2010) Crowd noise as a cue in referee decisions contributes to the home advantage. J Sport Exerc Psychol 32(4):483–498
Wang XT, Simons F, Brédart S (2001) Social cues and verbal framing in risky choice. J Behav Decis Mak 14(1):1–15
Wolfers J (2006) Point shaving: corruption in NCAA basketball. Am Econ Rev 96(2):279–283
Wu SS, Yang MCK (2004) Evaluation of the current decision rule in figure skating and possible improvements. Am Stat 58(1):46–54
Zitzewitz E (2006) Nationalism in winter sports judging and its lessons for organizational decision making. J Econ Manage Strategy 15(1):67–99
Zitzewitz E (2014) Does transparency reduce favoritism and corruption? evidence from the reform of figure skating judging. J Sports Econ 15(1):3–30
Acknowledgements
Financial support from the GRODE Universitat Rovira i Virgili and Generalitat de Catalunya under Projects 2018PFR-URV-B2-53 and 2017SGR770, and the Spanish Ministry of Science and Innovation Project RTI2018-094733-B-100 (AEI/FEDER, UE) is gratefully acknowledged. I would like to thank Jonathan Baron, Juan Pablo Rincón-Zapatero, the Editor and two anonymous Referees, as well as several seminars and congress participants for helpful comments and discussions. The usual caveat applies.
Osório, A. Performance Evaluation: Subjectivity, Bias and Judgment Style in Sport. Group Decis Negot 29, 655–678 (2020). https://doi.org/10.1007/s10726-020-09672-4