Calibration with confidence: a principled method for panel assessment

Frequently, a set of objects has to be evaluated by a panel of assessors, but not every object is assessed by every assessor. A problem facing such panels is how to take into account different standards among panel members and varying levels of confidence in their scores. Here, a mathematically based algorithm is developed to calibrate the scores of such assessors, addressing both of these issues. The algorithm is based on the connectivity of the graph of assessors and objects evaluated, incorporating declared confidences as weights on its edges. If the graph is sufficiently well connected, relative standards can be inferred by comparing how assessors rate objects they assess in common, weighted by the levels of confidence of each assessment. By removing these biases, ‘true’ values are inferred for all the objects. Reliability estimates for the resulting values are obtained. The algorithm is tested in two case studies: one by computer simulation and another based on realistic evaluation data. The process is compared to the simple averaging procedure in widespread use, and to Fisher's additive incomplete block analysis. It is anticipated that the algorithm will prove useful in a wide variety of situations such as evaluation of the quality of research submitted to national assessment exercises; appraisal of grant proposals submitted to funding panels; ranking of job applicants; and judgement of performances on degree courses wherein candidates can choose from lists of options.


Introduction
We address the widespread problem of how to account for differences in standards, confidence and bias in assessment panels, such as those evaluating research quality or grant proposals, employment or promotion applications and classification of university degree courses, in situations where not every assessor can evaluate every object to be assessed. A common approach to assessment of a range of objects by such a panel is to assign to each object the average of the scores awarded by the assessors who evaluate that object, but this ignores the likely possibility that different assessors have different levels of stringency, expertise and bias [1]. Some panels shift the scores of each assessor so that each assessor's average takes a normalised value, but this ignores the possibility that the set of objects assigned to one assessor may be of a genuinely different standard from that assigned to another. For an experimental scientist, the issue is obvious: calibration.
One approach is to seek to calibrate the assessors beforehand on a common subset of objects, perhaps disjoint from the set to be evaluated [2]. This means that they each evaluate all the objects in the subset and then some form of rescaling is agreed to bring the assessors into line as far as possible. This would not work well, however, in a situation where the range of objects is broader than the expertise of a single assessor. It also ignores the possibility of bias, such as that due to familiarity effects [3]. Regardless of how well the assessors are trained, differences between individuals' assessments of objects remain in such ad hoc approaches [4].
If the expertise of two assessors overlaps on some subject, however, any discrepancy between their evaluations can be used to infer information about their relative standards. Thus, if the graph on the set of assessors, formed by linking two whenever they assess a common object, is sufficiently well connected, one can expect to be able to infer a robust calibration of the assessors and hence robust scores for the outputs.
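The connectivity requirement can be checked directly. As a minimal sketch (ours, not part of the published method), treat assessors and objects as nodes of a bipartite graph with an edge for each (assessor, object) assessment pair; this graph is connected exactly when the assessor graph described above is connected and every object receives at least one assessment.

```python
def is_connected(n_assessors, n_objects, pairs):
    """Depth-first search over the bipartite assessor-object graph.

    pairs is the set E of (assessor index, object index) assessments.
    """
    adj = {("a", i): [] for i in range(n_assessors)}
    adj.update({("o", j): [] for j in range(n_objects)})
    for a, o in pairs:
        adj[("a", a)].append(("o", o))
        adj[("o", o)].append(("a", a))
    stack, seen = [("a", 0)], {("a", 0)}
    while stack:
        for nbr in adj[stack.pop()]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return len(seen) == n_assessors + n_objects
```

For example, two assessors sharing one object form a connected graph, while two disjoint assessor-object pairs do not.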
One approach to achieving such calibration was developed by R. A. Fisher [5], in the context of trials of crop treatments. See also Tukey's median polish [16]. Fisher's approach is known as additive incomplete block analysis, and a large body of associated literature and applications has since been developed, though its use in panel assessment seems rare. It is represented in the first column of Table 1, along with the simple averaging which it ameliorates.
Another ingredient that is important in many panel assessments, however, is the varying weight that a panel may wish to put on different assessments. We refer to these weights as "confidences". If the panel expresses confidences in the assessments, for example through pre-determined weights assigned to types of assessment or through the assessors declaring confidences in each of their scores, then it is natural to replace straight averaging by confidence-weighted averaging (see the first row of Table 1). But this does not address the calibration issue.
In this paper we present and test a method to calibrate scores taking into account confidences, that is, we complete the bottom right corner of the matrix of approaches represented in Table 1, where our method is termed calibration with confidence.
We demonstrate that the method can achieve a greater degree of accuracy with fewer assessors than the other approaches, and derive robustness estimates taking the confidences into account.
We are aware of two other schemes that incorporate confidences into a calibration process. One is the abstract-review method for the SIGKDD'09 conference (section 4 of [14]). The other is the abstract-review method used for the NIPS 2013 conference (building on [15] and described in [13]). Our method has the advantages of simplicity of implementation and delivery of a straightforward robustness analysis.

The basic model
Let us suppose that each assessor is assigned a subset of the objects to evaluate. Denote the resulting set of (assessor, object) pairs by E. Let us further suppose that the score s_ao that assessor a assigns to object o is a real number related to a "true" value v_o for the object by

    s_ao = v_o + b_a + ε_ao,    (1)

where b_a can be called the bias of assessor a and the ε_ao are independent zero-mean random variables. Such a model forms the basis for additive incomplete block analysis (see e.g. ref. [6]). It was also proposed in ref. [7] (see equation (8.2b) therein), but without a method to extract the true scores. Here we will achieve this and make a significant improvement, namely the incorporation of varying confidences in the scores.
To take into account the varying expertise of the assessors, we propose that in addition to the score s_ao, each assessor is asked to specify a level of confidence for that evaluation. This could be in the form of a rating such as "high", "medium" or "low", as requested by some funding agencies, but we propose something more general and akin to experimental science. Confidence can be estimated by asking assessors to specify a standard deviation σ_ao for their score. So let us suppose that

    ε_ao = σ_ao η_ao,    (2)

with η_ao independent zero-mean random variables of common variance w. Here we set w = 1; extensions to other values of w are considered in the Supplementary Information (SI). Thus our basic model is

    s_ao = v_o + b_a + σ_ao η_ao.    (3)

                          without calibration             with calibration
    without confidences   straight averaging              Fisher calibration
    with confidences      confidence-weighted averaging   calibration with confidence

Table 1: Panel Assessment Methods: The matrix of four approaches according to use of confidences and/or calibration.
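For concreteness, synthetic data can be drawn from the basic model (3) in a few lines; the distributions and parameter values below are illustrative assumptions of ours, not prescriptions from the method.

```python
import numpy as np

# Sketch: sample scores from s_ao = v_o + b_a + sigma_ao * eta_ao with w = 1.
rng = np.random.default_rng(0)
n_assessors, n_objects = 3, 6

v = rng.normal(50, 15, size=n_objects)          # "true" values v_o
b = rng.normal(0, 5, size=n_assessors)          # assessor biases b_a
sigma = rng.choice([5.0, 10.0, 15.0],           # declared standard deviations
                   size=(n_assessors, n_objects))

eta = rng.standard_normal((n_assessors, n_objects))  # unit-variance noise
s = v[None, :] + b[:, None] + sigma * eta            # scores s_ao
c = 1.0 / sigma**2                                   # confidences c_ao = 1/sigma_ao^2
```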

Solution of the basic model
Given the data {(s_ao, σ_ao) : (a, o) ∈ E} for all assigned assessor-object pairs, we wish to extract the true values v_o and assessor biases b_a. The simplest procedure is to minimise the sum of squares

    Σ_{(a,o)∈E} c_ao (s_ao − v_o − b_a)²,    (4)

where the confidence level is defined as

    c_ao = 1/σ_ao².    (5)

This procedure can be justified if the η_ao are assumed to be normally distributed, because then it gives the maximum-likelihood values for v_o and b_a. It can also be viewed as orthogonal projection of the vector of scores to the subspace of the form s_ao = v_o + b_a in the metric given by Σ_ao c_ao s_ao². Now expression (4) is minimised with respect to v_o iff

    Σ_{a:(a,o)∈E} c_ao (s_ao − v_o − b_a) = 0,

and with respect to b_a iff

    Σ_{o:(a,o)∈E} c_ao (s_ao − v_o − b_a) = 0.

It is notationally convenient to extend the sums to all assessors (respectively objects) by assigning the value c_ao = 0 to any assessor-object pair that is not in E (i.e. for which a score was not returned). Then these conditions can be written as

    C_o v_o + Σ_a c_ao b_a = V_o,    (6)
    Σ_o c_ao v_o + C'_a b_a = B_a.    (7)

Here, V_o = Σ_a c_ao s_ao is the confidence-weighted total score for object o, B_a = Σ_o c_ao s_ao is that for assessor a,

    C_o = Σ_a c_ao    (8)

is the total confidence in the assessment of object o, and

    C'_a = Σ_o c_ao    (9)

is the total confidence expressed by assessor a. Equations (6) and (7) form a linear system of equations for the v_o and b_a. It has an obvious degeneracy in that one could add a constant k to all the v_o and subtract k from all the b_a and obtain another solution. We can remove this degeneracy by, for example, imposing the condition

    Σ_a b_a = 0.    (10)

This is the simplest possibility and corresponds to a translation that brings the average bias over assessors to zero. Alternatives are discussed in the SI. Define a graph Γ linking assessor a to object o if and only if (a, o) ∈ E. The question of whether the set of equations (6) and (7) has a unique solution after breaking the degeneracy depends on the connectivity of Γ.
Define a linear operator L by writing equations (6) and (7) as

    L (v, b) = (V, B),    (11)

where v, b, V and B denote the column vectors formed by the v_o, b_a, V_o and B_a respectively. L has null space of dimension equal to the number of connected components of Γ (this follows from Perron-Frobenius theory, see e.g. ref. [8]). Thus if Γ is connected, the null space of L has dimension one, so corresponds precisely to the null vectors v_o = k ∀o, b_a = −k ∀a, that we already noticed and dealt with. This condition ensures that if (11) has a solution then there is a unique one satisfying (10).
It remains to check that the right hand side of equation (11) lies in the range of L, thus ensuring that a solution exists. This is true if all null forms of the adjoint operator L† send the right hand side to zero. The null space of L† has the same dimension as that of L, because L is square, and an obvious non-zero null form is defined by

    α(V, B) = Σ_o V_o − Σ_a B_a.    (12)

It follows from the definitions of V and B that α(V, B) = 0. Thus, under the assumption that the assessor-object graph Γ is connected, equations (6) and (7) have a unique solution (v, b) satisfying equation (10). Note that connectedness of Γ is necessary for uniqueness: otherwise one could add and subtract constants independently in each connected component of Γ and thereby produce more solutions.
The equations have a special structure, due to the bipartite nature of Γ, that is worth exploiting. The first equation (6) can be written as

    v_o = (V_o − Σ_a c_ao b_a) / C_o.    (13)

This can be substituted into the second equation (7) to obtain

    Σ_o c_ao (V_o − Σ_{a'} c_{a'o} b_{a'}) / C_o + C'_a b_a = B_a,    (14)

so that

    C'_a b_a − Σ_o (c_ao / C_o) Σ_{a'} c_{a'o} b_{a'} = B_a − Σ_o c_ao V_o / C_o.    (15)

The dimension of the system (15) is the number of assessors (rather than the sum of the numbers of assessors and objects). Replacing one of the equations in (15) by equation (10) gives a system with a unique solution that can be solved for b by any method of numerical linear algebra, e.g. LUP decomposition [9]. Then v can be recovered from equation (13).
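The solution procedure above can be sketched compactly; the following is an illustrative implementation of ours (not the authors' MATLAB code) of equations (13) and (15) with the degeneracy-breaking condition (10), assuming confidences are held in a dense array with c_ao = 0 for unassessed pairs.

```python
import numpy as np

def calibrate(s, c):
    """Confidence-weighted calibration via equations (13) and (15).

    s, c : (n_assessors, n_objects) arrays of scores and confidences,
           with c[a, o] = 0 where assessor a did not score object o.
           Assumes every object is scored at least once and that the
           assessor-object graph is connected.
    Returns value estimates v (per object) and biases b (per assessor),
    with the degeneracy broken by sum_a b_a = 0 as in equation (10).
    """
    C = c.sum(axis=0)                 # C_o, total confidence per object
    Cp = c.sum(axis=1)                # C'_a, total confidence per assessor
    V = (c * s).sum(axis=0)           # V_o, confidence-weighted totals
    B = (c * s).sum(axis=1)           # B_a

    # Reduced system (15): C'_a b_a - sum_o (c_ao / C_o) sum_a' c_a'o b_a'
    #                      = B_a - sum_o c_ao V_o / C_o.
    A = np.diag(Cp) - c @ (c.T / C[:, None])
    rhs = B - c @ (V / C)
    # Replace one equation by the degeneracy-breaking condition (10).
    A[-1, :] = 1.0
    rhs[-1] = 0.0
    b = np.linalg.solve(A, rhs)
    v = (V - c.T @ b) / C             # back-substitution via equation (13)
    return v, b

# Noise-free sanity check: scores exactly of the form v_o + b_a are recovered.
v_true = np.array([10.0, 20.0, 30.0])
b_true = np.array([1.0, -1.0])
v_est, b_est = calibrate(v_true[None, :] + b_true[:, None], np.ones((2, 3)))
```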

Robustness
A key question with any black-box solution like the one presented here is how robust the outcome is to mistakes or anomalous judgements. For s = (s_ao)_{(a,o)∈E}, define the operator K by K s = (V, B), as a shorthand for the definitions in the line after equations (6) and (7), so that the equations can be written as

    L (v, b) = K s.

Thus, if a change δs is made to the scores, we obtain changes δv, δb of magnitude

    ‖(δv, δb)‖ ≤ ‖L⁻¹ K‖ ‖δs‖,

where L⁻¹ is defined by restricting the domain of L to (10) and its range to α(V, B) = 0, and appropriate norms are chosen. In the SI we propose that appropriate choices are

    ‖δs‖²_scores = Σ_{(a,o)∈E} c_ao δs_ao²

and

    ‖(δv, δb)‖²_results = Σ_o C_o δv_o² + Σ_a C'_a δb_a².

With a preferred degeneracy-breaking condition Σ_a C'_a b_a = 0 instead of (10) we obtain

    ‖(δv, δb)‖_results ≤ √(2/µ₂) ‖δs‖_scores,

where µ₂ is the second smallest eigenvalue of a certain matrix formed from the confidences. In particular, this gives

    |δv_o| ≤ √(2/(µ₂ C_o)) ‖δs‖_scores.

The task for the designer of E is to make none of the C_o much smaller than the others and to make µ₂ significantly larger than 0. The former is evident (no object should receive significantly less assessment or less expert assessment than the others), but the latter depends on how well connected the graph is.

Case Studies
We have tested the approach in two contexts. In the first, we use a computer-generated set of data containing values of assessed items, together with biases and confidences of the assessors. This has the advantage of allowing us to compare the values obtained by the new approach with the true underlying value of each item. In the second test, the method is applied to evaluating students' grades through examination marks. In this test, of course, there is no possibility of access to "true" values.
Case Study 1 - Simulation: In the simulation, N_O = 3000 objects are assessed by a panel of N_A = 15 assessors. (This choice was motivated as realistic by the number of outputs and reviewers in the applied mathematics unit of assessment at the UK's 2008 research assessment exercise.) The simulation was carried out using MATLAB, and the system of equations was solved using its built-in procedure, which computed the LU decomposition of L.
True values of the items v_o were assumed to be normally distributed with a mean of 50 and standard deviation of 15, but with the v_o values truncated at 0 and 100. The assessor biases b_a were assumed to be normally distributed with a mean of 0 and a standard deviation of 15. Each assessor was considered to have high, medium or low confidence in each assessment, and these were modelled using standard deviations for the awarded scores of σ_ao = 5, 10 or 15 respectively. The allocated scores follow equation (3), truncating at 0 and 100.
With r assessors per item (which we took to be the same for each item in this instance), each simulation generated rN_O object scores s_ao. From these, we generated N_O value estimates v̂_o and N_A estimates of assessor biases b̂_a using the calibration processes. We then took the mean and maximum values of the errors in the estimates,

    dv_o = |v̂_o − v_o|,   db_a = |b̂_a − b_a|.

Straight averaging also delivered a value estimate v̂_o, as well as mean and maximal values of the errors dv_o. Finally, we determined the averages of the errors dv_o and db_a over 100 simulations. The results for these averaged mean and maximal errors in the scores are denoted by ⟨dv⟩ and (dv)_max respectively, and those for the biases (for the calibrated approaches only) are denoted ⟨db⟩ and (db)_max.
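A scaled-down version of this experiment can be re-created as follows. The sizes, the absence of truncation at 0 and 100, and the direct weighted-least-squares fit (with gauge Σ_a b_a = 0 appended as an extra row) are our simplifications, so the numbers will differ from those reported for the full simulation.

```python
import numpy as np

# Miniature re-creation: 40 objects, 6 assessors, r readers per object.
rng = np.random.default_rng(1)
n_a, n_o, r = 6, 40, 3

v_true = rng.normal(50, 15, n_o)
b_true = rng.normal(0, 15, n_a)
b_true -= b_true.mean()                      # gauge: average bias zero

c = np.zeros((n_a, n_o))                     # confidences, 0 = not assessed
for o in range(n_o):
    for a in rng.choice(n_a, size=r, replace=False):
        c[a, o] = rng.choice([5.0, 10.0, 15.0]) ** -2

mask = c > 0
sigma = np.zeros_like(c)
sigma[mask] = 1.0 / np.sqrt(c[mask])
s = v_true[None, :] + b_true[:, None] + sigma * rng.standard_normal(c.shape)
s[~mask] = 0.0

# Straight averaging per object.
v_avg = np.where(mask, s, 0.0).sum(axis=0) / mask.sum(axis=0)

# Weighted least squares for (v, b): one row per assessment scaled by
# sqrt(c_ao), plus a gauge row enforcing sum_a b_a = 0.
rows, rhs = [], []
for a in range(n_a):
    for o in range(n_o):
        if mask[a, o]:
            row = np.zeros(n_o + n_a)
            row[o] = row[n_o + a] = 1.0
            w = np.sqrt(c[a, o])
            rows.append(w * row)
            rhs.append(w * s[a, o])
gauge = np.zeros(n_o + n_a)
gauge[n_o:] = 1.0
rows.append(gauge)
rhs.append(0.0)
sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
v_cal = sol[:n_o]

err_avg = np.mean(np.abs(v_avg - v_true))    # mean dv, straight averaging
err_cal = np.mean(np.abs(v_cal - v_true))    # mean dv, calibration
```

With strongly biased assessors and few readers per object, the calibrated estimates should track the true values considerably better than the straight averages.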
Results for all three methods are presented in Figs. 1-4. The mean and maximal absolute errors for the straight-averaging approach, the incomplete-block-analysis method and the calibration-with-confidences approach are given in Panels (a)-(d) of Figs. 1 and 2 for different confidence profiles. Fig. 1, panel (e) gives the improvements achieved by the calibration methods as a ratio of the mean errors coming from Fisher's incomplete-block-analysis approach to the straight-averaging approach, ⟨dv⟩_Fisher / ⟨dv⟩_averaging, and a ratio of the mean errors coming from the calibration-with-confidence approach to the straight-averaging approach, ⟨dv⟩_confidence / ⟨dv⟩_averaging. Fig. 2, panel (e) gives the analogous improvement ratios for the maximal errors.
We observe that, for each method, the scores become more accurate (errors decrease) as the number of assessors per object r increases. It is also clear from Fig. 1 that the calibration-with-confidences approach represents a significant improvement over Fisher's method for the mean errors, which itself is a strong improvement over the straight-averaging approach. In fact, Fig. 1(e) demonstrates that the Fisher approach delivers mean errors between about 60% and 80% of those coming from the averaging approach, the better improvements being associated with lower assessor numbers. This is also the most desirable configuration for realistic assessments, as it represents employment of a minimal number of assessors per object. The calibration-with-confidences approach reduces errors by about a further 10% irrespective of the number of assessors.
With only two assessors, the straight averaging method gives errors averaging about 10%, and more than r = 6 readers per object are required to bring the mean error down to 6%. Fisher's method, however, achieves this level of accuracy with only 2 or 3 readers. The calibration-with-confidences method delivers a further improvement of about 1 percentage point. One also notes that, for the calibration approaches, relatively little is gained on average by employing more than four assessors per object. Fig. 1(e) shows that the most dramatic improvements are for low numbers of assessors; as the number of readers increases, the improvement ratio worsens.
Fig. 2 shows that Fisher's approach also leads to significant improvements in the maximal error values relative to those obtained through simple averaging. With two assessors per object, maximal errors are reduced from about 45% to 30-35%. The calibration-with-confidences approach does not, however, significantly improve upon this. However, with 6 assessors per object the maximal error value of about 25% delivered by the simple averaging process is reduced to about 20% by Fisher's method and to as low as 16% when half the readers have a high degree of confidence in their scores.
Finally, in Fig. 3(c) we plot the errors of the bias estimates. Both mean and maximal errors are depicted, and neither displays a systematic dependence on the number of assessors per object r.

Case Study 2 -Examination Data:
To indicate the difference the calibration can make to real-life data, we applied the method to the examination results of 82 students across 81 examinations (wherein each student sat a subset of the examinations). For this case study, we determined all four value estimates listed in Table 1. We denote by S_c the scores obtained by simple weighted averaging and by V_c the value estimates coming from the calibration-with-confidences approach. Setting weights and confidences to one then delivers the straight averages S_1 and the Fisher-calibrated values V_1. For the new declared-confidences approach, we set the c_ao to the CATS (credit accumulation and transfer scheme) weighting for the module a. We argue that this is an appropriate surrogate for confidence because the amount of assessment in a module (a measure of stringency) is roughly proportional to its CATS weighting. If scores on component parts are independent then the final score (expressed relative to the notional maximum) will have variance inversely proportional to the CATS weighting. Alternatively, the use of CATS-weighted averages by exam boards is equivalent (under an assumption of independent Gaussian deviations) to maximum-likelihood estimation with variances inversely proportional to CATS.
In Fig. 4 the various approaches are compared. Panel (a) gives the estimates from all four approaches for each object, and significant differences are apparent. The effects of calibration are shown in Panel (b), where V_1 − S_1 and V_c − S_c are plotted for each object o. The former compares Fisher's calibration method to simple averaging, while the latter compares our approach to weighted averaging. It is clear that the calibration delivers estimates for the true values which can be substantially different from the raw average examination marks in each case. In fact, the new method is approximately to weighted averages what Fisher's method is to simple averages. The difference can be up to 7 percentage points in each case, which would be sufficient to alter the classification of examination results, though several caveats need mentioning. Firstly, the calibration methods have produced an average upward shift of about 2 percentage points, which it may be sensible to subtract off (it can be rectified by alternative degeneracy-breaking conditions, as discussed in the SI). Secondly, classification of examination results depends also on other factors taken into account by examination boards, such as mitigating circumstances and reading of scripts by external examiners. Thirdly, although the CATS weightings for the modules chosen varied over 6, 7.5, 12, 15 and 18, most counted for either 15 or 18 CATS points, so the confidence levels are too close to generate large differences in this case.
The differences between the results coming from incomplete block analysis and the new approach are displayed in Panel (c). There, the values of V_c − V_1 for each object are plotted and illustrate the difference confidences make. These differences are up to 2 percentage points, which might change degree classification in borderline cases. In other words (for case study 2), the difference between the results of the declared-confidences method and incomplete block analysis can be about a quarter of that between the latter approach and simple averaging.
Finally, the estimates for the biases of each assessor are plotted in Fig. 4(d) for both incomplete block analysis and the declared-confidences approach. The plot shows that calibration returned a broad range of values for the bias of the assessors, although declaring confidences has little effect in this instance.

Discussion
There are a number of simple refinements which one could introduce to the core method suggested here. These include how to deal with different types of bias, different scales for confidence, different ways to remove the degeneracy in the equations, and how to deal with the endpoints of a marking scale. Some suggestions are made in the SI. An advantage of any approach based upon this type of calibration is that it does not produce the artificial discontinuities across field boundaries that tend to arise if the domain is partitioned into fields and evaluation in each field is carried out separately. We therefore suggest that a method such as this, which takes into account declared confidences in each assessment, is well suited to a multitude of situations in which a number of objects is assessed by a panel.
[14] Flach PA et al., Novel tools to streamline the conference review process: experiences from SIGKDD'09. http://microsoft.com/pubs/122784/reviewercalibration.pdf

One could develop refinements to the basic model (3). For example, assessor bias might not be a purely additive effect. An assessor may have a bias for or against topics in which they have lower confidence [11]. Assessors may like to give round-number scores, and they may have different scales for confidence, so their confidences may need calibrating as well as their scores.
We can remove the degeneracy in the equations in different ways from equation (10) used here. Indeed, in case study 2, we note from Fig. 4(b) of the main text that the bulk of the estimates using the calibration approach shift upwards from the averaged scores Σ_a c_ao s_ao / C_o to the estimates v_o. We believe this is an artefact of the condition (10) we used to break the degeneracy. In the case study, there was a significant fraction of "easier" modules, but taken by few students. Condition (10) gave these equal weight, so that the remaining modules were considered by the method to be negatively biased on average. Thus, on average, the students' scores were increased by the calibration. An alternative degeneracy-breaking condition is Σ_a C'_a b_a = 0, which by (7) automatically implies Σ_o C_o v_o = Σ_ao c_ao s_ao, thus avoiding the possibility of such systematic shifts. Another will be given shortly.
Additionally, assessors might have not only an additive bias but also different scales, so that for example

    s_ao = m_a v_o + b_a + σ_ao η_ao,    (23)

with m_a a scale factor for assessor a. One problem is that often assessors are asked to assign scores in a fixed range [A, B], e.g. 1-10. Then any model for bias really ought to be nonlinear to respect the endpoints. One way to treat this is to apply a nonlinear transformation to map a slightly larger interval (a, b) onto R, e.g.

    t(s) = log((s − a)/(b − s)),    (24)
apply our method to the transformed scores, using the inverse square of the derivative of the transformation for the confidences, and then apply the inverse transformation to the "true" values. On the other hand, it may be inadvisable to specify a fixed range, because it requires an assessor to have knowledge of the range of the objects before starting scoring. Thus one could propose asking assessors to use any real numbers and then use equation (23) to extract true values v. A simpler strategy that might work nearly as well is to allow assessors to use any positive numbers, but then to take logarithms and fit equation (3) to the log-scores. The assessor biases would then be like logarithms of exchange rates. The confidences would need translating appropriately too. One might need a more subtle model if some assessors value objects nearer to their expertise more highly than those further away.
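As a sketch of the positive-scores variant (the details here are our proposal, not the paper's), the translation of a declared standard deviation into a log-scale confidence can use the delta-method approximation sd(log s) ≈ sd(s)/s:

```python
import numpy as np

# Hedged sketch: take logarithms of positive scores so that the additive
# model (3) can be fitted on the log scale, and convert declared standard
# deviations into log-scale confidences via the delta method (an assumption
# of ours): sd(log s) ~= sd(s)/s, hence c = 1/sd(log s)^2 ~= (s/sd)^2.
def to_log_scores(s, sd):
    log_s = np.log(s)
    c_log = (s / sd) ** 2
    return log_s, c_log
```

The assessor biases fitted on the log scale would then be logarithms of the "exchange rates" mentioned above.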
One issue with our method is that the effect of an assessor who assesses only one object is only to determine their own bias, apart from an overall shift along the null vector (v, b) = (1, −1) for the rest. To rectify this, one could incorporate a prior probability distribution for the biases (indeed, this was done by [15] in the form of a regulariser).

Robustness of the method
Here we present our approach to the quantification of the robustness of our method to small changes in the scores, using norms that take into account the confidences.
One can measure the size of a change ∆s_ao to a score s_ao by comparing it to the declared standard deviation σ_ao. Thus we take the size of ∆s_ao to be √c_ao |∆s_ao|. We propose to measure the size of an array ∆s of changes ∆s_ao to the scores by the square root of the sum of squares of the sizes of the changes to each score:

    ‖∆s‖_scores = ( Σ_{(a,o)∈E} c_ao ∆s_ao² )^{1/2}.    (26)

Supremum or sum-norms could also be considered, but we will stick to this choice here. It is also reasonable to measure the size of a change ∆v_o to a true value v_o by comparing it to the standard deviation implied by the sum of confidences in the scores for object o: thus the size of ∆v_o is √C_o |∆v_o|, where C_o is the total confidence in the assessment of object o. Similarly, we measure the size of a change ∆b_a in bias b_a by √C'_a |∆b_a|, where C'_a is the total confidence expressed by assessor a. Finally, we measure the size of a change (∆v, ∆b) to the vector of values and biases by the square root of the sum of squares of the individual sizes:

    ‖(∆v, ∆b)‖_results = ( Σ_o C_o ∆v_o² + Σ_a C'_a ∆b_a² )^{1/2}.    (27)
Our operator L⁻¹K is equivalent to orthogonal projection, with respect to the norm (26), from the scores to the subspace Σ of the form s_ao = v_o + b_a, with a degeneracy-breaking condition to eliminate the ambiguity in the direction of the null vector (v, b) = (1, −1). The tightest bounds are obtained by choosing the degeneracy-breaking condition perpendicular to this vector with respect to the inner product corresponding to equation (27). Thus initially we choose the degeneracy-breaking condition

    Σ_o C_o (v_o − v_ref) − Σ_a C'_a b_a = 0,    (28)

where C = Σ_ao c_ao and v_ref is a reference value for the scores chosen near the confidence-weighted average score s̄ = Σ_ao c_ao s_ao / C. The degeneracy-breaking condition can equivalently be written as

    Σ_{(a,o)∈E} c_ao (v_o − b_a − v_ref) = 0.    (29)

Choosing v_ref exactly equal to s̄ makes the confidence-weighted average bias come out to 0 and the confidence-weighted average value come out to s̄, but to consider robustness with respect to changes in the scores, we need at this stage of the discussion to keep v_ref fixed.
Theorem: For a connected graph Γ and with the degeneracy-breaking condition (28), the size of the change (∆v, ∆b) resulting from a given array of changes ∆s in scores is bounded by

    ‖(∆v, ∆b)‖_results ≤ (1/√µ₂) ‖∆s‖_scores,    (30)

where µ₂ is the second smallest eigenvalue of the matrix

    M = ( I_{N_O}   D   )
        ( Dᵀ      I_{N_A} ),    (31)

and where (Dᵀ)_ao = c_ao / √(C_o C'_a), I_N is the N × N identity matrix, and N_A, N_O are the numbers of assessors and objects respectively.
Proof: Firstly, the orthogonal projection in the metric (26) from s to the subspace Σ never increases length. Secondly, if ∆s_ao = ∆v_o + ∆b_a with Σ_ao c_ao (∆v_o − ∆b_a) = 0, then

    ‖∆s‖²_scores = gᵀ M g,

where g is the vector with components

    g_o = √C_o ∆v_o,   g_a = √C'_a ∆b_a.    (32)

Then, because we restricted to the subspace orthogonal to the null vector in the results-norm and M is non-negative and symmetric,

    gᵀ M g ≥ µ₂ Σ_i g_i² = µ₂ ‖(∆v, ∆b)‖²_results,

where the index i ranges over all objects and assessors. Positivity of µ₂ holds as soon as the graph Γ is connected, because M is a transformation of the weighted graph Laplacian to scaled variables. Dividing by µ₂ and taking the square root yields the result.
The computation of the eigenvalue µ₂ of M can be reduced from dimension N_O + N_A to dimension N_A, as follows.

Proof: The equations for an eigenvalue-eigenvector pair µ, (ṽ, b̃) of M are

    ṽ + D b̃ = µ ṽ,    (33)
    Dᵀ ṽ + b̃ = µ b̃.    (34)

Applying Dᵀ to the first equation, multiplying the second by (1 − µ), and then substituting for (1 − µ) Dᵀ ṽ in the second yields

    Dᵀ D b̃ = (1 − µ)² b̃.    (35)

A user may prefer the simpler degeneracy-breaking condition

    Σ_a C'_a b_a = 0.    (36)

We find it makes the bounds increase by only a factor of √2.
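This reduction is easy to check numerically. The sketch below (ours) builds M from an arbitrary positive confidence matrix, so the graph Γ is connected, and compares the second-smallest eigenvalue of M with 1 − √λ₂ computed from the N_A-dimensional matrix DᵀD.

```python
import numpy as np

# Numerical check: eigenvalues of M come in pairs 1 +/- sqrt(lambda) over the
# positive eigenvalues lambda of D^T D, so mu_2 = 1 - sqrt(lambda_2).
rng = np.random.default_rng(3)
n_a, n_o = 4, 7
c = rng.uniform(0.5, 2.0, size=(n_a, n_o))    # all positive: Gamma connected

C = c.sum(axis=0)                              # C_o
Cp = c.sum(axis=1)                             # C'_a
D = (c / np.sqrt(np.outer(Cp, C))).T           # D_oa = c_ao / sqrt(C_o C'_a)

M = np.block([[np.eye(n_o), D], [D.T, np.eye(n_a)]])

mu = np.sort(np.linalg.eigvalsh(M))                  # eigenvalues of M, ascending
lam = np.sort(np.linalg.eigvalsh(D.T @ D))[::-1]     # eigenvalues of D^T D, descending

mu_2_direct = mu[1]                      # second smallest eigenvalue of M
mu_2_reduced = 1.0 - np.sqrt(lam[1])     # via the N_A-dimensional problem
```

The smallest eigenvalue of M is 0 (the degeneracy) and the largest eigenvalue of DᵀD is 1, in line with the argument below.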
Proposition: For Γ connected and using the degeneracy-breaking condition (36), the size of (∆v, ∆b) resulting from changes ∆s to the scores is at most √(2/µ₂) ‖∆s‖_scores.
Proof: If the degeneracy-breaking condition (29) gives a change (∆v, ∆b) for a change ∆s to the scores, then switching to the degeneracy-breaking condition (36) just adds an amount k of the null vector n = (1, −1) to achieve Σ_a C'_a (∆b_a − k) = 0, i.e.

    k = (1/C) Σ_a C'_a ∆b_a,

where we recall that C = Σ_a C'_a. In the results metric the null vector has length √(2C), so the correction k n has length

    √(2C) |k| = √(2/C) |Σ_a C'_a ∆b_a|.

One can recognise Σ_a C'_a ∆b_a as one half of the inner product of (1, 1) with (∆v, ∆b) in the results-norm, so by the Cauchy-Schwarz inequality it is bounded by √(C/2) ‖(∆v, ∆b)‖. Thus the length of the correction vector is at most that of (∆v, ∆b). The correction is perpendicular to (∆v, ∆b), thus the vector sum has length at most √2 ‖(∆v, ∆b)‖.
One may also consider robustness with respect to changes in the confidences c_ao. If an assessor declares extra-high confidence for an evaluation, for example, that can significantly skew the resulting v and b. The analysis is more subtle, however, because of how the c_ao appear in the equations.

Scale for confidences
We motivated the model by proposing that the noise terms be of the form σ_ao η_ao with the η_ao independent zero-mean random variables with unit variance, so that the σ_ao are standard deviations. Nevertheless, multiplying all the confidences by the same number does not change the results of the least-squares fit, nor our quantification of robustness. Thus the η_ao can be taken to have any variance w, as long as it is the same for all assessments.
In case study 2, for example, we took the confidences to be the CATS weighting of the modules.It is only ratios of confidences that have significance.
If a user finds it difficult to persuade assessors to provide standard deviations, but they are willing to provide high, medium or low confidence ratings, then these could be converted to numerical values by, for example, choosing a number p near 2 and assigning confidences p², 1, p⁻² to the ratings high, medium and low respectively. The interpretation of p is the ratio of the standard deviation for a low-confidence evaluation to that for a medium one, and for a medium one to a high one.
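A hypothetical helper (the name and interface are ours) for this conversion:

```python
# Convert high/medium/low ratings into numerical confidences p^2, 1, p^-2,
# with p the assumed ratio of standard deviations between adjacent confidence
# levels; p near 2 is suggested above.
def rating_to_confidence(rating, p=2.0):
    return {"high": p ** 2, "medium": 1.0, "low": p ** -2}[rating]
```

Since only ratios of confidences matter, any common rescaling of these three values gives identical results.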
The fitting procedure can be extended to infer a best-fit value for w. Even if the assessors provide confidences based on assuming w = 1, the best fit for w is not 1 in general. Assuming independent Gaussian errors, the maximum-likelihood value for w comes out to be

    ŵ = R/N,

where

    R = Σ_{(a,o)∈E} c_ao (s_ao − v̂_o − b̂_a)²

is the residual from the least-squares fit (v̂, b̂) for (v, b) and N is the total number of assessments.
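As a sketch (function name and signature are ours), the estimate ŵ = R/N can be computed from a fitted (v̂, b̂) as follows:

```python
import numpy as np

# Maximum-likelihood noise scale w_hat = R/N for the fit described above.
def fit_noise_scale(s, c, v_hat, b_hat):
    mask = c > 0                                  # c_ao = 0 marks "not assessed"
    resid = s - v_hat[None, :] - b_hat[:, None]   # s_ao - v_hat_o - b_hat_a
    R = (c[mask] * resid[mask] ** 2).sum()        # confidence-weighted residual
    N = int(mask.sum())                           # total number of assessments
    return R / N
```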

Posterior probability distribution
Another point of view on robustness is the Bayesian one. From a prior probability on (v, b) and a model for the η_ao, one can infer a posterior probability distribution for (v, b).

Figure 1: Mean errors plotted against the numbers of readers per object r for the simple arithmetic-mean approach (upper curves, orange), the incomplete-block-analysis method (middle curves, green) and the calibration-with-confidences approach (lower curves, blue). The various panels and line types represent different confidence profiles, with probabilities for high, medium and low confidences in the ratio (a) 1:1:1, (b) 1:1:2, (c) 1:2:1, (d) 2:1:1. Panel (e) gives improvements in the form of ratios of mean errors from the calibrated approaches, the incomplete-block-analysis approach (green curves) and the calibration-with-confidences approach (blue), to the simple arithmetic approach.
Figure 4: (a) The value estimates from all four approaches for each object, including the results from additive incomplete block analysis, V_1 (green ×), in which confidences are not accounted for, and the results from the new calibration method described here, V_c (blue •), accounting for declared confidences. (b) The difference calibration makes: comparison of the results from Fisher's additive incomplete block analysis (green ×) and those from the calibrate-with-confidences approach (blue •) to the results of simple averaging and weighted averaging, respectively. V_c − S_c (or V_1 − S_1) can be up to about 7 percentage points. (c) The difference confidences make: here the calibration-with-confidence values V_c are compared with estimated values coming from Fisher's additive incomplete block analysis approach, V_1. The new approach shifts scores by up to about two percentage points for the data of case study 2. (d) How confidence affects bias: here b_a represents the estimated biases of the assessors. The blue data points (•) come from the declared-confidences method. The green data (×) represent the corresponding values with constant confidence (c_ao = 1 ∀ a, o), corresponding to additive incomplete block analysis. The biggest difference is 1.3 (for assessor a = 63).
$\sum_{a:(a,o)\in E} c_{ao}(s_{ao} - v_o - b_a) = 0$, and with respect to $b_a$ iff $\sum_{o:(a,o)\in E} c_{ao}(s_{ao} - v_o - b_a) = 0$.
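These stationarity conditions are linear in $(v, b)$, so on a toy example they can be assembled and solved directly. The following sketch uses synthetic data with illustrative variable names (not taken from the paper): it builds the symmetric linear system implied by the two families of conditions, solves it by least squares, and then breaks the constant-shift degeneracy by re-centring the biases.

```python
import numpy as np

# Toy illustration (synthetic data, illustrative names): each assessor a scores
# each object o as s_ao ~ v_o + b_a, with declared confidence c_ao > 0.
rng = np.random.default_rng(0)
NA, NO = 4, 6
c = rng.uniform(0.5, 2.0, size=(NA, NO))        # confidences c_ao
true_v = rng.normal(60.0, 10.0, NO)             # "true" values
true_b = rng.normal(0.0, 3.0, NA)               # assessor biases
s = true_v[None, :] + true_b[:, None] + rng.normal(0.0, 1.0, (NA, NO))

# Stack the unknowns x = (v_1..v_NO, b_1..b_NA) and assemble the stationarity
# conditions above as a symmetric linear system A x = y.
A = np.zeros((NO + NA, NO + NA))
y = np.zeros(NO + NA)
for a in range(NA):
    for o in range(NO):
        w = c[a, o]
        A[o, o] += w;      A[o, NO + a] += w;       y[o] += w * s[a, o]        # d/dv_o
        A[NO + a, o] += w; A[NO + a, NO + a] += w;  y[NO + a] += w * s[a, o]   # d/db_a

# The system is degenerate (v_o -> v_o + t, b_a -> b_a - t leaves the fit
# unchanged), so take the least-squares solution and re-centre the biases.
x, *_ = np.linalg.lstsq(A, y, rcond=None)
v, b = x[:NO], x[NO:]
t = np.average(b, weights=c.sum(axis=1))        # enforce sum_a C'_a b_a = 0
v, b = v + t, b - t
```

The re-centring step uses the confidence totals $C'_a$ as weights, so the shifted solution still satisfies both families of stationarity conditions.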
It is also reasonable to measure the size of a change $\Delta v_o$ to a true value $v_o$ by comparing it to the standard deviation implied by the sum of confidences in the scores for object $o$. Thus the size of $\Delta v_o$ is $\sqrt{\sum_a c_{ao}}\,|\Delta v_o| = \sqrt{C_o}\,|\Delta v_o|$, where $C_o$ is the total confidence in the assessment of object $o$. Similarly, we measure the size of a change $\Delta b_a$ in bias $b_a$ by $\sqrt{\sum_o c_{ao}}\,|\Delta b_a| = \sqrt{C'_a}\,|\Delta b_a|$, where $C'_a$ is the total confidence expressed by assessor $a$. Finally, we measure the size of a change $(\Delta v, \Delta b)$ to the vector of values and biases by the square root of the sum of squares of the individual sizes,
$$\|(\Delta v, \Delta b)\|^2 = \sum_o C_o (\Delta v_o)^2 + \sum_a C'_a (\Delta b_a)^2.$$
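As a concrete check of these definitions, the short sketch below (with made-up confidences for two assessors and three objects) computes the confidence totals $C_o$ and $C'_a$ and the size of a perturbation $(\Delta v, \Delta b)$.

```python
import numpy as np

# Made-up confidences c[a, o] for 2 assessors and 3 objects (illustrative only;
# a zero entry means that assessor did not assess that object).
c = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
C_o = c.sum(axis=0)               # total confidence C_o in each object
C_a = c.sum(axis=1)               # total confidence C'_a per assessor
dv = np.array([0.5, -0.2, 0.1])   # perturbation of the values
db = np.array([0.3, -0.3])        # perturbation of the biases
size = np.sqrt(C_o @ dv**2 + C_a @ db**2)   # ||(dv, db)||
```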
Thus either $\tilde b = 0$ or $(1-\mu)^2$ is an eigenvalue $\lambda$ of $D^T D$. In the first case, from the first equation, $\mu = 1$. Conversely, if $(\lambda, \tilde b)$ is an eigenvalue-eigenvector pair for $D^T D$ with $\lambda \neq 0$, then $\lambda > 0$ because $D^T D$ is non-negative, so put $\tilde v = \pm D \tilde b / \sqrt{\lambda}$ to see that $(\tilde v, \tilde b)$ is an eigenvector of $M$ with eigenvalue $\mu = 1 \pm \sqrt{\lambda}$. If $\lambda = 0$ and $D \tilde b = 0$, then $\mu = 1$ is an eigenvalue of $M$ with eigenvector $(\tilde v, \tilde b)$ for any $\tilde v$ with $D^T \tilde v = 0$, e.g. $\tilde v = 0$. Thus there is a two-to-one correspondence between eigenvalues $\mu \neq 1$ of $M$ and positive eigenvalues $\lambda$ of $D^T D$ (counting multiplicity): $\mu = 1 \pm \sqrt{\lambda}$. Any remaining eigenvalues are 1 for $M$ and 0 for $D^T D$. The degeneracy gives an eigenvector $\tilde v_o = \sqrt{C_o}$, $\tilde b_a = -\sqrt{C'_a}$ of $M$ with eigenvalue 0, and it corresponds to the eigenvalue 1 of $D^T D$. All other eigenvalues of $M$ are non-negative, because $M$ is positive semi-definite. All other eigenvalues of $D^T D$ are at most 1, by the Cauchy-Schwarz inequality. So if the second largest eigenvalue $\lambda_2$ of $D^T D$ (counting multiplicity) is positive, then the second smallest eigenvalue $\mu_2$ of $M$ (counting multiplicity) is $1 - \sqrt{\lambda_2}$. If $\lambda_2 = 0$, then $\mu_2 = 1$: the existence of $\lambda_2$ implies $N_A \geq 2$, so $M$ has dimension at least 3, while the simple eigenvalue 1 of $D^T D$ accounts for only the two simple eigenvalues $\mu = 0$ and $\mu = 2$; hence $M$ must have a further eigenvalue, and any value other than 1 would give a positive $\lambda_2$. So the same formula holds. If there is no second eigenvalue of $D^T D$ (because $N_A = 1$), then for $N_O \geq 2$ the second smallest eigenvalue of $M$ must be 1, by the same argument. If both $N_A$ and $N_O$ are 1, then the second smallest eigenvalue of $M$ is the other one associated with the eigenvalue 1 of $D^T D$, namely 2.
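The two-to-one correspondence can be verified numerically. The sketch below assumes explicit forms for $D$ and $M$, namely $D_{oa} = c_{ao}/\sqrt{C_o C'_a}$ and the symmetric block matrix $M = \begin{pmatrix} I & D \\ D^T & I \end{pmatrix}$; these are assumptions consistent with the identities used above (the degeneracy eigenvector and $\mu = 1 \pm \sqrt{\lambda}$), not a restatement of the paper's definitions.

```python
import numpy as np

# Numerical check of the correspondence mu = 1 +/- sqrt(lambda).
# Assumed forms (consistent with the identities above, not from the text):
#   D_oa = c_ao / sqrt(C_o * C'_a),   M = [[I, D], [D^T, I]].
rng = np.random.default_rng(1)
NA, NO = 3, 5
c = rng.uniform(0.1, 1.0, size=(NA, NO))        # positive confidences c_ao
C_o, C_a = c.sum(axis=0), c.sum(axis=1)
D = (c / np.sqrt(np.outer(C_a, C_o))).T         # shape (NO, NA)
M = np.block([[np.eye(NO), D], [D.T, np.eye(NA)]])

# Eigenvalues lambda of D^T D (clipped against tiny negative rounding).
lam = np.clip(np.linalg.eigvalsh(D.T @ D), 0.0, None)
# Predicted spectrum of M: 1 +/- sqrt(lambda), plus (NO - NA) extra 1's.
mu_pred = np.sort(np.concatenate(
    [1 - np.sqrt(lam), 1 + np.sqrt(lam), np.ones(NO - NA)]))
mu = np.sort(np.linalg.eigvalsh(M))
print(np.max(np.abs(mu - mu_pred)))             # near zero
```

Note that the largest $\lambda$ is the degenerate eigenvalue 1, reproducing the eigenvalues $\mu = 0$ and $\mu = 2$ of $M$.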
Finally, a user may prefer to use the degeneracy-breaking condition $\sum_a C'_a b_a = 0$ (36) rather than (29), perhaps out of uncertainty about what value of $v_{\mathrm{ref}}$ to use. Or a user may be happy to use (29) with $v_{\mathrm{ref}}$ equal to the confidence-weighted average score, but may want $v_{\mathrm{ref}}$ to follow this average score if changes are made to the scores. That turns out to be equivalent to using (36). So we extend our discussion of robustness to treat this case.
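This equivalence can be illustrated numerically: summing the stationarity conditions over all objects gives $\sum_o C_o v_o + \sum_a C'_a b_a = \sum_{ao} c_{ao} s_{ao}$, so once the biases are centred by (36), the confidence-weighted average of the fitted values automatically equals the confidence-weighted average score. A sketch with synthetic data (illustrative names only):

```python
import numpy as np

# Sketch (synthetic data): with the degeneracy broken by sum_a C'_a b_a = 0,
# the confidence-weighted average of the fitted values v equals the
# confidence-weighted average score, so v_ref defined that way follows the scores.
rng = np.random.default_rng(2)
NA, NO = 3, 4
c = rng.uniform(0.5, 2.0, size=(NA, NO))    # confidences c_ao
s = rng.normal(60.0, 10.0, size=(NA, NO))   # raw scores s_ao

# Normal equations for minimising sum_ao c_ao (s_ao - v_o - b_a)^2.
A = np.zeros((NO + NA, NO + NA)); y = np.zeros(NO + NA)
for a in range(NA):
    for o in range(NO):
        w = c[a, o]
        A[o, o] += w; A[o, NO + a] += w; y[o] += w * s[a, o]
        A[NO + a, o] += w; A[NO + a, NO + a] += w; y[NO + a] += w * s[a, o]
x, *_ = np.linalg.lstsq(A, y, rcond=None)
v, b = x[:NO], x[NO:]
t = np.average(b, weights=c.sum(axis=1))
v, b = v + t, b - t                          # now sum_a C'_a b_a = 0

avg_v = np.average(v, weights=c.sum(axis=0)) # confidence-weighted average value
avg_s = (c * s).sum() / c.sum()              # confidence-weighted average score
```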

Figure 1: Panels (a)-(d): Mean errors plotted against the number of readers per object $r$ for the simple arithmetic-mean approach (upper curves, orange), the incomplete-block-analysis method (middle curves, green) and the calibration-with-confidences approach (lower curves, blue). The various panels and line types represent different confidence profiles, with probabilities for high, medium and low confidence in the ratios (a) 1:1:1, (b) 1:1:2, (c) 1:2:1, (d) 2:1:1. Panel (e) gives the improvements as ratios of the mean errors of the calibrated approaches, incomplete block analysis (green curves) and calibration with confidences (blue), to that of the simple arithmetic approach.

Figure 4: Case study 2: (a) The estimated scores and values for 82 students, labelled by $o$. The various symbols represent the averaged raw scores $S_1$ (orange △); the CATS-weighted averages $S_c$ (red +); the results from additive incomplete block analysis $V_1$ (green ×), in which confidences are not accounted for; and the results from the new calibration method described here, $V_c$ (blue •), accounting for declared confidences. (b) The difference calibration makes: comparison of the results from Fisher's additive incomplete block analysis (green ×) and from the calibration-with-confidences approach (blue •) with the results of simple averaging and weighted averaging, respectively. $V_c - S_c$ (or $V_1 - S_1$) can be up to about 7 percentage points. (c) The difference confidences make: the calibration-with-confidence values $V_c$ are compared with the estimated values from Fisher's additive incomplete block analysis, $V_1$. The new approach shifts scores by up to about two percentage points for the data of case study 2. (d) How confidence affects bias: $b_a$ represents the estimated biases of the assessors. The blue data points (•) come from the declared-confidences method. The green data points (×) represent the corresponding values with constant confidence ($c_{ao} = 1$ for all $a, o$), corresponding to additive incomplete block analysis. The biggest difference is 1.3 (for assessor $a = 63$).