Modelling intransitivity in pairwise comparisons with application to baseball data

The seminal Bradley-Terry model exhibits transitivity, i.e., the property that the probabilities of player A beating B and B beating C give the probability of A beating C, with these probabilities determined by a skill parameter for each player. Such transitive models do not account for different strategies of play between each pair of players, which gives rise to {\it intransitivity}. Various intransitive parametric models have been proposed but they lack the flexibility to cover the different strategies across $n$ players, with the $O(n^2)$ values of intransitivity modelled using O(n) parameters, whilst they are not parsimonious when the intransitivity is simple. We overcome their lack of adaptability by allocating each pair of players to one of a random number of $K$ intransitivity levels, each level representing a different strategy. Our novel approach for the skill parameters involves having the $n$ players allocated to a random number of $A<n$ distinct skill levels, to improve efficiency and avoid false rankings. Although we may have to estimate up to $O(n^2)$ unknown parameters for $(A,K)$ we anticipate that in many practical contexts $A+K<n$. Using a Bayesian hierarchical model, $(A,K)$ are treated as unknown, and inference is conducted via a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm. Our semi-parametric model, which gives the Bradley-Terry model when $(A=n-1, K=0)$, is shown to have an improved fit relative to the Bradley-Terry, and the existing intransitivity models, in out-of-sample testing when applied to simulated and American League baseball data. Supplementary materials for the article are available online.


Introduction
The seminal Bradley-Terry model (Bradley and Terry, 1952) is commonly used to rank objects from paired comparison data. Given a set $I$ of $n$ objects, with each object $i \in I$ having skill $r_i \in \mathbb{R}$, the Bradley-Terry model gives, for $i \neq j \in I$,
$$p^{(BT)}_{ij} := \Pr\{i \succ j\} = \frac{\exp(r_i)}{\exp(r_i) + \exp(r_j)}, \tag{1}$$
where $a \succ b$ denotes preference for object $a$ over $b$, and $r_1 = 0$ to avoid identifiability issues.
A ranking of the objects is given by sorting estimates of $r := \{r_i \in \mathbb{R} : i \in I\}$. This model is transitive, i.e., $p^{(BT)}_{jk}$ is given by $p^{(BT)}_{ij}$ and $p^{(BT)}_{ik}$, for all distinct $i, j, k \in I$, see Section 3. Now consider the game of Rock-Paper-Scissors, a zero-sum game in which Rock beats Scissors, Scissors beats Paper, and Paper beats Rock, and specifically consider the deterministic scenario where players (r,p,s) always pick (Rock, Paper, Scissors) respectively. In this scenario, all win probabilities in a game are either 0 or 1 depending on the opponent, and each player wins their next game with probability 1/2 if their next opponent is selected at random. Whatever way the skill of a player is defined, the symmetry of this game set-up unquestionably leads to the conclusion that the three players have equal skill levels.
Conclusions drawn from a Bradley-Terry model fitted to data from this simple game are surprisingly poor. Given a round-robin tournament, where each player plays all other players an equal number of times, the model will correctly estimate that all players are equally ranked in terms of skills; however, it would also estimate all pairwise win probabilities to be 1/2, which could not be more wrong. Even worse, an illusory ranking can result when the tournament is not round-robin, e.g., if the most common pairing of players is (r,s) and the other two pairings occur equally often then the Bradley-Terry model will rank player r as top. The key reason for the failure of the Bradley-Terry model is its transitive nature, a trait shared by almost all commonly used ranking systems.
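The Rock-Paper-Scissors failure described above can be reproduced numerically. The following is a minimal sketch, assuming the Bradley-Terry skills are fitted by maximum likelihood using the classical minorise-maximise (MM) iteration; the function names `fit_bradley_terry`, `bt_prob` and `rps_wins`, and the game counts, are illustrative choices and not taken from the paper.

```python
import math

def fit_bradley_terry(wins, n_iter=500):
    """Maximum likelihood Bradley-Terry skills via cyclic MM updates.

    wins[i][j] is the number of times object i beat object j.
    Returns log-strengths r with r[0] = 0 (the identifiability constraint).
    """
    n = len(wins)
    s = [1.0] * n  # strengths s_i = exp(r_i)
    for _ in range(n_iter):
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            s[i] = total_wins / denom
        s = [v / s[0] for v in s]  # fix the first object's strength at 1
    return [math.log(v) for v in s]

def bt_prob(r, i, j):
    """Bradley-Terry probability that i beats j."""
    return 1.0 / (1.0 + math.exp(-(r[i] - r[j])))

def rps_wins(n_rs, n_sp, n_pr):
    """Deterministic Rock-Paper-Scissors results; players ordered (r, p, s) = (0, 1, 2)."""
    wins = [[0] * 3 for _ in range(3)]
    wins[0][2] = n_rs  # Rock always beats Scissors
    wins[2][1] = n_sp  # Scissors always beats Paper
    wins[1][0] = n_pr  # Paper always beats Rock
    return wins

# Round-robin: equal skills, so every fitted win probability is 1/2.
balanced = fit_bradley_terry(rps_wins(10, 10, 10))
# (r, s) is the most common pairing: player r is ranked top.
unbalanced = fit_bradley_terry(rps_wins(30, 10, 10))
print(balanced, bt_prob(balanced, 0, 2))
print(unbalanced)
```

In the balanced case the fitted skills are all equal although the true win probabilities are 0 or 1; in the unbalanced case the fitted skills order the players as r, then p, then s, even though the game is symmetric.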
Here we develop a novel pairwise comparison model, and an associated ranking system, which accounts for intransitivity. Thus, it describes how specific pairwise probabilities differ from probabilities given by overall skill levels alone, i.e., how probabilities differ from those given by the Bradley-Terry model. The Rock-Paper-Scissors game also illustrates that ranking can involve ties, where subsets of players can have equal skill levels, and that tournament structure can affect the subsequent inference. We also address some aspects associated with these issues.
The concept and associated modelling of intransitivity is not new. Makowski and Piotrowski (2006) present many examples of competitions exhibiting intransitivity and argue that it can occur whenever the best strategy in a given comparison depends on the strategy of the opponent, and Smead (2019) provides a philosophical argument as to why intransitivity is particularly likely to occur in sports. Given this, it is not surprising to find cases of intransitivity in esports (Makhijani and Ugander, 2019; Chen and Joachims, 2016; Duan et al., 2017). Other applications include social choice, real sensory analysis, and election data-sets.
With $n$ competitors there are $n(n-1)/2$ interactions, or intransitivities, so even in round-robin competitions, with $m$ rounds, there are too many terms to estimate efficiently using empirical methods, unless $m/n$ is large. Causeur and Husson (2005) proposed an $O(n^2)$ parameter extension of the Bradley-Terry model to address intransitivity. Subsequently $O(nd)$ parametric models have been studied for some fixed $d \in \mathbb{N}$ ($d \ll n$), see all the models in Section 2, but they lack the flexibility to cover the potentially $O(n^2)$ different intransitivities across $n$ players, leading to bias; whilst they are not parsimonious when the intransitivity is simple, leading to inefficiency.
Although the concept of intransitivity is quite clear, there is no established measure of the amount of intransitivity in a dataset. In this work, we propose a definition of intransitivity through a distance metric between the assumed probability of paired comparisons under a Bradley-Terry model, and the empirical or model-based probability estimate, such that for any given dataset the magnitude of the intransitivity present is unambiguous. A flexible model, then, is one which is capable of exploring the space of all possible combinations of intransitivity, as defined by this measure. Any parametric model is restricted to a subset of this space by definition, with this restriction being most obviously revealed when assessing predictive performance.
We then develop a novel semi-parametric extension of the Bradley-Terry model, allocating the $n(n-1)/2$ pairs of objects to a random number $K$, with $0 \leq K \leq n(n-1)/2$, of distinct intransitivity levels, each level representing a different strategy. We term this model the Intransitive Clustered Bradley-Terry (ICBT) model. Relative to the aforementioned parametric models, this ICBT model provides greater flexibility to enable the incorporation of varying structures, and degrees, of intransitivity. As many of these strategies will have similar effects, we anticipate that $K$ should be small, yet the random property of $K$ provides the potential for it to be large when required. This flexibility ensures that our model is parsimonious, whatever the complexity of the data. For our Rock-Paper-Scissors illustration $K = 1$.
Moreover, our novel approach for the objects' skills is to allocate the n objects into a random number of A + 1 ≤ n distinct skill levels, to improve efficiency and avoid false rankings.
This constraint recognises that from paired comparison data there will be objects that are indistinguishable as having statistically significantly different skill levels, e.g., for our Rock-Paper-Scissors illustration $A = 0$. So clustering skills avoids over-interpretation of misinformed rankings, a feature Masarotto and Varin (2012) address by clustering skills via a lasso procedure.
The basis of our model is the belief that in practice there are likely to be small subsets of skill and intransitivity levels, namely $A \leq n - 1$ and $K \ll n(n-1)/2$ respectively. As we have little prior knowledge about the skills of the objects or the intransitivities of the pairs of objects, we allow the clustering of objects into different skill levels, and of the pairs of objects into separate intransitivity levels, to be determined entirely through a Bayesian hierarchical model. We take each of $(A, K)$, the allocations of objects to skill levels, and the allocations of the pairs of objects to intransitivity levels as unknown, with inference being conducted via a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm. This formulation does offer computational challenges; however, we anticipate that typically the posterior will give a high probability that $A + K < n$ and that many of the cluster allocations also will be strongly identified. Our inference framework offers the opportunity to select a highly simplified model, with the values of $A$, $K$ and allocations fixed at values given by posterior means/modes if these are found to align with known structure about the paired comparison. In the absence of such knowledge our results allow for the full uncertainty of these features to be accounted for.
In certain circumstances our model has the potential to identify and correct for imbalanced tournament structure on overall rankings, since teams are not penalised if they (unfairly) compete most frequently against those against whom they perform systematically worse relative to what is expected based on respective skills alone.
We use American League Baseball data to illustrate the performance of our methods in comparison to existing models for a range of reasons. Firstly, each game results in a win or a loss for a team. Secondly, it is known to be a highly strategic game, see Section 5, so we anticipate that the level of intransitivity will be high. Finally, although the tournament structure is not round-robin, each team plays each other team often, and so the existence of intransitivity should become apparent in inference. Indeed this is found in Section 5, where our model is shown to have an improved fit over the Bradley-Terry model and existing parametric intransitivity models in out-of-sample testing for each of the nine seasons we study.
The layout is as follows. Section 2 introduces other approaches to modelling intransitivity. Section 3 then introduces our novel measure of intransitivity, the ICBT model, and the ranking formulation. Section 4 contains details of the inference, including prior specification, our full Bayesian hierarchical modelling strategy, an overview of the RJMCMC algorithm and its novel features, and an overview of a simulation study. Section 5 compares this model with the Bradley-Terry model and other competitor models, using American League baseball data. Section 6 is a discussion. Full details of the RJMCMC algorithm, simulation study, and extended analysis of the baseball application are in the supplementary material.

Literature on Intransitive Modelling
The blade-chest model of Chen and Joachims (2016) extends the Bradley-Terry model into $d$ dimensions by incorporating so-called blade and chest vectors $b_i, c_i \in \mathbb{R}^d$ for each object $i \in I$. There are two versions: the -dist and -inner variants, given respectively by
$$\operatorname{logit} p_{ij} = \|b_j - c_i\|_2^2 - \|b_i - c_j\|_2^2 + r_i - r_j$$
and
$$\operatorname{logit} p_{ij} = b_i \cdot c_j - b_j \cdot c_i + r_i - r_j.$$
The blade and chest parameters of all the objects can be viewed as features on a $d$-dimensional latent space. Then, if an object $i$'s blade is close to object $j$'s chest, and simultaneously object $i$'s chest is far from object $j$'s blade, then object $i$ has an additional advantage over object $j$.
If [...] The degrees of freedom are restricted by regularization, using an $L_2$ norm for the object strength vectors and Frobenius norm for both transitivity matrices. The tuning parameters are selected via cross-validation.
Makhijani and Ugander (2019) introduced a majority vote model with object $i$ having a vector of $d$ skill attributes, $(\mu_{i,1}, \dots, \mu_{i,d})$, where $d$ is odd. Then, given a suitable choice of mapping function $f$, e.g., logistic or Gaussian, define $q^l_{ij} = f(\mu_{i,l} - \mu_{j,l})$, $\forall l \in \{1, \dots, d\}$, to be the probability of $i$ beating $j$ based only on their $l$th attribute. Then, the majority vote model says that the probability of $i$ being preferred to $j$ overall is the probability that it wins across the majority of attributes. For $d = 1$ the model is linearly transitive, but not when $d = 3$.

3 Modelling

Measure of Intransitivity
Under the Bradley-Terry model (1), for any distinct $i, j, k \in I$,
$$\operatorname{logit} p^{(BT)}_{ik} = \operatorname{logit} p^{(BT)}_{ij} + \operatorname{logit} p^{(BT)}_{jk} = r_i - r_k,$$
noting it is independent of the choice of bridge object $j$. Therefore, there can be no interaction that is specific to the pair $\{i,k\}$ that is not already captured between all other pairs. Including intransitivity, however, allows for some pairwise probabilities to depart from those assumed by the Bradley-Terry model. This can be formalised by supposing that for all $i \neq k \in I$ the true probability of preference $i \succ k$ is given as some function $f$ of the Bradley-Terry probability and the intransitivity, $\theta_{ik}$, of the pair $\{i,k\}$, so that we can write
$$p_{ik} = f\left(p^{(BT)}_{ik}, \theta_{ik}\right), \tag{2}$$
where we identify the form of $f$ in Section 3.2. We define the intransitivity to be the displacement of the true probabilities from the Bradley-Terry probabilities on the log-odds scale, so that
$$\theta_{ik} := \operatorname{logit} p_{ik} - \operatorname{logit} p^{(BT)}_{ik} \tag{3}$$
is the amount of intransitivity between the pair of objects $\{i,k\}$. A value of $\theta_{ik} = 0$ indicates that the comparison is transitive, i.e., the pairwise probabilities could be modelled by the Bradley-Terry model. As a consequence we require $f(x, 0) = x$, $x \in [0,1]$. The choice of log-odds ratio in equation (3) reflects the non-linearity of probabilities. For example, if $\epsilon = 0.099$, then a small linear shift in probability from 0.5 to $0.5 + \epsilon$ has little impact on the odds, which remain at approximately 1:2. However, a linear shift in probability from 0.9 to $0.9 + \epsilon$ has a huge impact on the odds, which move from 1:10 to 1:1000. Moreover, our definition (3) for the intransitivity $\theta_{ik}$ also imposes rotational symmetry for pairs of objects, that is
$$\theta_{ik} = -\theta_{ki}, \tag{4}$$
so we need to find $\{\theta_{ik}, \forall i > k \in I\}$ only, in order to completely define $\{\theta_{ik}, \forall i \neq k \in I\}$.
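The log-odds definition of intransitivity can be checked numerically. The sketch below assumes that the intransitivity of a pair is the difference between the logit of the true probability and the logit of the Bradley-Terry probability, as described in this section; the function names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def intransitivity(p_true, p_bt):
    """Displacement of the true probability from the Bradley-Terry
    probability on the log-odds scale (assumed form of the measure)."""
    return logit(p_true) - logit(p_bt)

eps = 0.099
shift_mid = logit(0.5 + eps) - logit(0.5)   # the same linear shift, mid-range
shift_tail = logit(0.9 + eps) - logit(0.9)  # the same linear shift, near 1
print(shift_mid, shift_tail)

# Rotational symmetry: swapping the roles of i and k flips the sign.
theta_ik = intransitivity(0.8, 0.6)
theta_ki = intransitivity(1 - 0.8, 1 - 0.6)
print(theta_ik, theta_ki)
```

The same linear shift of 0.099 moves the log-odds more than ten times further when it starts from 0.9 than when it starts from 0.5, which is the non-linearity that the log-odds scale is intended to capture.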

Model formulation
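To find the function $f$ in equation (2), equation (3) can be simply rearranged, which gives
$$p_{ik} = f\left(p^{(BT)}_{ik}, \theta_{ik}\right) = \frac{p^{(BT)}_{ik} \exp(\theta_{ik})}{1 - p^{(BT)}_{ik} + p^{(BT)}_{ik} \exp(\theta_{ik})}, \tag{5}$$
and so for any pair $\{i,k\}$, equation (5) can be re-written as
$$\operatorname{logit} p_{ik} = r_i - r_k + \theta_{ik}. \tag{6}$$
Here the effect of $\theta_{ik}$ is clear: positive (negative) $\theta_{ik}$ increases (decreases) the probability of team $i$ beating team $k$ relative to their skills alone, i.e., relative to the Bradley-Terry model.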
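As a small illustration of how the intransitivity term acts on a pairwise probability, the following sketch assumes the log-odds form in which the logit of the win probability is the skill difference plus the pair's intransitivity. The function names `bt_prob`, `shift_by_intransitivity` and `icbt_prob` are illustrative only.

```python
import math

def bt_prob(r_i, r_k):
    """Bradley-Terry probability that i beats k, from skills alone."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_k)))

def shift_by_intransitivity(p_bt, theta):
    """Move a Bradley-Terry probability by theta on the log-odds scale."""
    odds = p_bt / (1.0 - p_bt) * math.exp(theta)
    return odds / (1.0 + odds)

def icbt_prob(r_i, r_k, theta_ik):
    """Win probability with logit equal to r_i - r_k + theta_ik (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_k + theta_ik)))

p_bt = bt_prob(1.0, 0.2)
p_up = icbt_prob(1.0, 0.2, 0.7)     # positive intransitivity helps i
p_down = icbt_prob(1.0, 0.2, -0.7)  # negative intransitivity hurts i
print(p_down, p_bt, p_up)
print(shift_by_intransitivity(p_bt, 0.7))  # agrees with p_up
```

Setting the intransitivity to zero returns the Bradley-Terry probability unchanged, and using the opposite sign for the reversed pair keeps the two win probabilities summing to one.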
Thus far, the model contains the flexibility to describe the set of pairwise probabilities $P := \{p_{ik} : i \neq k \in I\}$ completely. Here $P$ contains $n(n-1)/2$ degrees of freedom, because $p_{ki} = 1 - p_{ik}$; however, the model contains $n(n-1)/2 + n$ parameters: $n(n-1)/2$ values of intransitivity between pairs, and $n$ skill parameters from $r$, and thus the model parameters are not identifiable. One way of ensuring identifiability in the standard Bradley-Terry model is to fix one object's skill level, and here it is chosen that $r_1 = 0$. As well as this constraint on the objects' skill parameters, the intransitivity parameters require constraints for parameter identifiability. The minimal set of required constraints is identified in Proposition 1, see the Appendix for the proof.
Proposition 1 Consider a round-robin tournament with pairs of objects $(i, j)$ being compared, with $i, j \in I$, $i \neq j$, where $|I| = n$. Suppose that the probabilities of $i$ beating $j$ are $p_{ij}$, where these probabilities are given by expression (6), with $r_1 = 0$ and intransitivity values $\theta_{ij}$. If a set of $n - 1$ pairs of objects, indexed by $J_{n-1}$, have their intransitivity values set to arbitrary specified values, then all the rest of the $\{r_i\}$ and $\{\theta_{ij}\}$ parameters in expression (6) are identifiable if $J_{n-1}$ forms a connected graph over $I$. Furthermore, if fewer than $n - 1$ pairs' intransitivity values are specified, or if $J_{n-1}$ is not a connected graph over $I$, then identifiability is not achievable.
We choose the $n - 1$ constraints to be $\theta_{ij} = 0$, $\forall (i, j) : i = 1, j \in I \setminus \{1\}$, that is, all pairs involving object 1 have intransitivity set to 0. Proposition 1 gives that if any further constraints are imposed on the intransitivity values the flexibility of model (6) will be compromised.
With the above constraints, the minimal conditions for parameter identifiability are satisfied, but the model is still likely to overfit with so many parameters. To rectify this we restrict the total number of degrees of freedom, by restricting the number of intransitivity values to only $K \leq n(n-1)/2$ unique values and the number of unknown skill values to $A < n$, where $A + K \leq n(n-1)/2$ and ideally $A + K \ll n(n-1)/2$. In this fashion our ICBT model embraces intransitivity in a parsimonious way.
Firstly, consider the $A + 1$ unique skill values, which ensure parsimony in the model by clustering the objects' skills $r$ into distinct values which are sufficiently statistically significantly different. Since $r_1 = 0$ is fixed, there are only $A$ unknown skill levels, $\phi \in \mathbb{R}^A$. By defining the labels of the set of skill levels to be $\mathcal{A} := \{-A^-, \dots, 0, \dots, A^+\}$, with $A^+$ being the number of skill levels which are greater than 0 and $A^-$ the number of skill levels less than 0, such that $A^- + A^+ = A$, we impose the equivalent condition in our model by fixing the skill level with label $\{0\}$ to be $\phi_0 = 0$, and fixing object 1 to always be allocated to this cluster.
The possible skill values an object can take are therefore defined as
$$\phi_{-A^-} < \dots < \phi_{-1} < \phi_0 = 0 < \phi_1 < \dots < \phi_{A^+},$$
where $\phi$ are the unknown skill levels, and the ordering helps with label switching problems in the inference. The skill cluster allocation of object $i$, denoted $y_{\{i\}} \in \{0,1\}^{A+1}$, is a binary vector which takes the value 1 at position $s \in \mathcal{A}$ and 0 everywhere else, if object $i \in I$ belongs to cluster $s$. The set $Y := \{y_{\{i\}} : i \in I \setminus \{1\}\}$ then contains all the objects' skill cluster allocations except that of object 1, which has a fixed cluster allocation. Therefore, by defining $S_{\{i\}}(Y) := \arg\max_s y_{\{i\},s}$, $\forall i \in I \setminus \{1\}$, the objects' skills can be written as
$$r_i = \phi_{S_{\{i\}}(Y)}, \quad i \in I \setminus \{1\}, \qquad r_1 = \phi_0 = 0. \tag{7}$$
Now consider the $K$ unique values of intransitivity to describe the different inter-object strategies. Of the $n(n-1)/2$ pairs of objects, many will adopt similar strategies depending on their opponents. These similar strategies are translated by the model as having similar departures from transitivity, and are thus clustered together. For example, suppose some group of objects $V \subset I$, with $j \notin V$, competed against object $j$ in the same way. Then it would be reasonable to assume that $\theta_{ij}$ is the same for all $i \in V$. This creates clusters of pairs of objects, such that the pairs are clustered according to them having identical intransitivity.
In order to measure the departure from a Bradley-Terry model, a linearly transitive cluster is imposed, which contains the set of pairs $J_T \subseteq \{\{i,k\} : i \neq k \in I\}$ which have an intransitivity level $\theta_0 = 0$. Thus, the Bradley-Terry modelling assumption (1) holds for these pairs, such that $p_{ik} = p^{(BT)}_{ik}$, for all $\{i,k\} \in J_T$. Given the existence of this cluster, there must be strong evidence from the data to produce an additional cluster with an intransitivity level close to 0. This choice does not impose transitivity of pairs, as the linearly transitive cluster $J_T$ may be empty, except for all pairs with object 1, which are classified as transitive due to our imposition of constraints for identifiability from Proposition 1. Let the distinct set of intransitivity levels be
$$\theta := \{\theta_k : k \in \mathcal{K}\}, \quad 0 < \theta_1 < \dots < \theta_K,$$
where $\mathcal{K} = \{1, \dots, K\}$ and the levels of intransitivity are ordered from smallest to largest.
The levels of intransitivity, $\theta$, contain the set of positive values of intransitivity which, due to symmetry and the completely transitive cluster with intransitivity value $\theta_0$, then define the full $2K + 1$ possible values of intransitivity between any pair of objects.
We define the intransitivity cluster allocation of a given pair $\{i,k\}$ to be another binary vector $z_{\{i,k\}}$, which takes the value 1 at position $s \in \{-K, \dots, K\}$ and 0 everywhere else, if the pair $\{i,k\}$ belongs to cluster $s$. The clusters are therefore labelled from $-K$ to $K$, where a cluster labelled $k \in \{1, \dots, K\}$ has cluster level $\theta_k$, a cluster labelled $k \in \{-K, \dots, -1\}$ has cluster level $-\theta_{-k}$, and a cluster with label 0 has cluster level $\theta_0 = 0$. The set $Z := \{z_{\{i,k\}}, \forall i > k \in I \setminus \{1\}\}$ then defines all the cluster allocations for all the free pairs $i \neq k \in I \setminus \{1\}$, because of the rotational symmetry. For example, if the $K$th index of $z_{\{i,k\}}$ has value $z_{\{i,k\},K} = 1$, then this indicates that the pair $\{i,k\}$ belongs to the cluster with label $K$, whose cluster level is the largest level of intransitivity $\theta_K$, and this enforces that the pair $\{k,i\}$ belongs to cluster $-K$ and has the smallest level of intransitivity $-\theta_K$. Given the cluster allocation $z_{\{i,k\}}$ of the pair $\{i,k\}$, with $i \neq k \in I \setminus \{1\}$, the level of intransitivity for the pair, $\theta_{ik}$, can be redefined as
$$\theta_{ik} = \sum_{s \in \mathcal{K}} \theta_s \left( \mathbb{1}\left[z_{\{i,k\},s} = 1\right] - \mathbb{1}\left[z_{\{i,k\},-s} = 1\right] \right), \tag{9}$$
where $\mathbb{1}$ is the indicator function, and remembering that $\theta_0 = 0$.
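The mapping from a cluster label to an intransitivity value can be sketched in a few lines. This is an illustration under the labelling described above (labels from −K to K, with label 0 the transitive cluster); the helper names `one_hot`, `label_from_one_hot` and `pair_intransitivity` are hypothetical.

```python
def one_hot(label, K):
    """Binary allocation vector over the labels -K, ..., K."""
    return [1 if s == label else 0 for s in range(-K, K + 1)]

def label_from_one_hot(z):
    """Recover the cluster label from a binary allocation vector."""
    K = (len(z) - 1) // 2
    return z.index(1) - K

def pair_intransitivity(label, levels):
    """Intransitivity of a pair allocated to `label`.

    levels = [theta_1, ..., theta_K], positive and increasing.
    Label 0 is the transitive cluster; negative labels mirror positive ones.
    """
    K = len(levels)
    if not -K <= label <= K:
        raise ValueError("label must lie in {-K, ..., K}")
    if label == 0:
        return 0.0
    return levels[label - 1] if label > 0 else -levels[-label - 1]

levels = [0.4, 1.5]                      # K = 2 positive levels
z_ik = one_hot(2, K=2)                   # pair {i, k} in the top cluster
label = label_from_one_hot(z_ik)
theta_ik = pair_intransitivity(label, levels)
theta_ki = pair_intransitivity(-label, levels)  # reversed pair, by symmetry
print(theta_ik, theta_ki)
```

Only the K positive levels and one label per free pair need to be stored; the value for the reversed pair follows from the sign flip.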
The full model can be written either in terms of equation (6), noting that the parameters will be clustered, or in terms of the levels and the cluster allocations. So the ICBT model is defined by
$$\operatorname{logit} p_{ik} = \phi_{S_{\{i\}}(Y)} - \phi_{S_{\{k\}}(Y)} + \theta_{ik}, \tag{10}$$
with $S_{\{1\}}(Y) := 0$ and $\theta_{ik}$ given in terms of $(\theta, Z)$ by equation (9). Due to the intransitivity levels being fixed to 0 for all pairs of objects involving object 1, an adjustment is required to get a more interpretable value of intransitivity between the pairs.
We define the adjusted intransitivity $\theta^*_{ij}$ (definition (11)) to be the difference between the logits of the pairwise probability under our ICBT model and under the Bradley-Terry model. Note that the rotational symmetry of $\{\theta_{ij}\}$ (4) also imposes $\theta^*_{ij} = -\theta^*_{ji}$. To help see the value of this reparametrisation, consider the earlier example of a deterministic game of Rock-Paper-Scissors. Take Rock as the constrained object; then Rock has fixed skill level $r_r = 0$, and pairs involving Rock have intransitivity 0, that is $\theta_{rp} = \theta_{rs} = 0$, where the $p$ and $s$ subscripts denote Paper and Scissors. To maintain that Rock always beats Scissors, $p_{rs} = 1$, then from the constraints, we get an excellent approximation from the ICBT model when $r_s = -M$ for some large $M$, with the approximation improving as $M \to \infty$.
Likewise $r_p = M$, and $\theta_{ps} = -3M$. With this model there is only one skill level $M$, and one non-zero intransitivity level $-3M$. This parametrisation somewhat hides the symmetry of the intransitivity over pairs. However, with definition (11), $\theta^*_{rs} = \theta^*_{sp} = \theta^*_{pr} = M$, resulting in an intuitive and easy interpretation of the intransitivity, reflecting the symmetry of the game, no matter the choice of the fixed parameters.
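The Rock-Paper-Scissors parametrisation above can be verified directly. The sketch below plugs the stated values (Rock skill 0, Paper skill M, Scissors skill −M, and intransitivity −3M for Paper against Scissors) into the assumed log-odds form, in which the logit of the win probability is the skill difference plus the pair's intransitivity. The value M = 20 and all names are illustrative.

```python
import math

M = 20.0  # a "large" value; the approximation improves as M grows

skill = {"rock": 0.0, "paper": M, "scissors": -M}
# Only one free pair has non-zero intransitivity; pairs with Rock are fixed at 0.
theta = {("paper", "scissors"): -3.0 * M}

def pair_theta(i, k):
    """Intransitivity of the ordered pair (i, k), using the sign symmetry."""
    if (i, k) in theta:
        return theta[(i, k)]
    if (k, i) in theta:
        return -theta[(k, i)]
    return 0.0

def win_prob(i, k):
    """Probability that i beats k under the assumed log-odds form."""
    eta = skill[i] - skill[k] + pair_theta(i, k)
    return 1.0 / (1.0 + math.exp(-eta))

print(win_prob("rock", "scissors"))   # close to 1
print(win_prob("scissors", "paper"))  # close to 1
print(win_prob("paper", "rock"))      # close to 1
```

All three log-odds equal M, so each deterministic outcome is recovered to within exp(−M), even though the skills and the single non-zero intransitivity look asymmetric in this parametrisation.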

Model Ranking
In the Bradley-Terry model, the skill parameters can simply be ordered to give a rank since a greater skill always results in higher win probabilities against all other objects. In our ICBT model this is not the case, because both the intransitivity parameters of each pair and the skill parameters of the objects impact the win probability between any pair. However, below we present two intuitive methods for determining overall ability, and therefore ranking.
Firstly, if $p_{ij} = \Pr\{i \succ j\}$ is the probability of an object $i$ beating object $j$ according to our model, then we can rank the objects by ordering
$$p_{i\cdot} := \frac{1}{n-1} \sum_{j \in I \setminus \{i\}} p_{ij}, \tag{12}$$
that is, $p_{i\cdot}$ is the average probability of object $i$ beating any other object $j \neq i \in I$.
Secondly, if we consider the intransitivity between an object $i$ and an opposing object $j \neq i$ as some "boost" which contributes to the overall ability (which could be negative), then the overall ability $a_i$ of object $i$ could be defined by
$$a_i := r_i + \frac{1}{n} \sum_{j \in I} \theta_{ij}, \quad \theta_{ii} := 0, \tag{13}$$
that is, the object skill plus its average intransitivity level. Definition (13) is equivalent to the Bradley-Terry definition of 'ability'. Defining $\operatorname{logit} p^{(BT)}_{ii} = 0$, $\forall i \in I$, the Bradley-Terry model gives
$$\frac{1}{n} \sum_{j \in I} \operatorname{logit} p^{(BT)}_{ij} = r_i - \frac{1}{n} \sum_{j \in I} r_j, \tag{14}$$
where the sum on the right hand side does not depend on $i$, so the skill of object $i$ is entirely determined by $r_i$. Similarly, in our model, defining $\operatorname{logit} p_{ii} = 0$ and given definition (13),
$$\frac{1}{n} \sum_{j \in I} \operatorname{logit} p_{ij} = a_i - \frac{1}{n} \sum_{j \in I} r_j, \tag{15}$$
so (14) and (15) have the same form but with $a_i$ replacing $r_i$.
Then a ranking can be formed by ordering the set of abilities $a := \{a_i : i \in I\}$. We argue that the first method, using the probabilities $p_{i\cdot}$ to rank the objects, is more meaningful since it is directly associated with the pairwise probabilities, the modelling of which is our ultimate aim.
The application to baseball data of both methods is discussed in the supplementary material.
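The first ranking method, ordering objects by their average probability of beating any other object, can be sketched as follows. The matrix of pairwise probabilities used here is made up for illustration, and the function names are hypothetical.

```python
def average_win_probability(P):
    """Average probability of each object beating any other object.

    P[i][j] is the model probability that i beats j; diagonal entries are ignored.
    """
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

def rank_objects(P):
    """Object indices sorted from highest to lowest average win probability."""
    averages = average_win_probability(P)
    return sorted(range(len(P)), key=lambda i: averages[i], reverse=True)

# Illustrative probabilities for three objects, with P[j][i] = 1 - P[i][j].
P = [
    [0.0, 0.7, 0.6],
    [0.3, 0.0, 0.8],
    [0.4, 0.2, 0.0],
]
print(average_win_probability(P))
print(rank_objects(P))
```

Because the ranking is built from the pairwise probabilities themselves, it uses both the skills and the intransitivities without needing to separate them.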

Likelihood
The data, $x := \{x_c : c \in C\}$, are binary, and $i \succ j$ denotes that $i$ is preferred to $j$. Then $x_c = 1$ if $i_c \succ j_c$, and $x_c = 0$ otherwise, where $i_c, j_c \in I$ are the objects being compared in comparison $c$. Then, the log likelihood for the ICBT model is
$$\ell(\phi, Y, A, \theta, Z, K; x) = \sum_{c \in C} \left[ x_c \log p_{i_c j_c} + (1 - x_c) \log\left(1 - p_{i_c j_c}\right) \right], \tag{16}$$
where $p_{i_c j_c}$ is given by the ICBT model for all $c \in C$ and is calculated from the set of parameters $(\phi, Y, A, \theta, Z, K)$. All pairs' intransitivities $\{\theta_{ik} : i \neq k \in I\}$ can be found from the intransitivity levels $\theta$ and the cluster allocations $Z$, using equation (9), so it is only necessary to do inference on these parameters, rather than the full $2K + 1$ separate clusters. Therefore from here onwards the term intransitivity levels refers only to those $K$ values which have positive intransitivity.
Similarly, any individual object's skill $r_i$, $\forall i \in I$, can be found from knowing the ability levels $\phi$ and the cluster allocations $Y$, using equation (7). We formulate a Bayesian hierarchical model, which treats both $K$ and $A$ as unknown parameters, thus accounting for uncertainty in the number of clusters. The posterior is therefore written as
$$\pi(\phi, Y, A, \theta, Z, K \mid x) \propto L(x \mid \phi, Y, A, \theta, Z, K)\, \pi(\phi, Y, A, \theta, Z, K),$$
where $L(\cdot) = \exp[\ell(\cdot)]$ is the likelihood and $\pi(\phi, Y, A, \theta, Z, K)$ is the prior.
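A minimal sketch of the log likelihood for binary paired comparisons is given below. It assumes each comparison contributes a Bernoulli term with success probability given by the log-odds form (skill difference plus pair intransitivity), and that skills and intransitivities have already been expanded from their levels and allocations. The data layout and names are assumptions for illustration.

```python
import math

def icbt_log_likelihood(comparisons, skills, theta):
    """Bernoulli log likelihood of binary paired comparisons.

    comparisons: iterable of (i, j, x) with x = 1 if i was preferred to j, else 0.
    skills:      mapping from object to its skill.
    theta:       mapping from ordered pair (i, j) to its intransitivity; the
                 reversed pair takes the opposite sign, missing pairs are 0.
    """
    total = 0.0
    for i, j, x in comparisons:
        pair_theta = theta.get((i, j), -theta.get((j, i), 0.0))
        eta = skills[i] - skills[j] + pair_theta  # log-odds that i beats j
        # log p = -log(1 + exp(-eta)),  log(1 - p) = -log(1 + exp(eta))
        total += -math.log1p(math.exp(-eta)) if x else -math.log1p(math.exp(eta))
    return total

games = [(0, 1, 1), (1, 0, 0), (0, 2, 1)]
skills = {0: 0.0, 1: 0.0, 2: 0.0}
print(icbt_log_likelihood(games, skills, {}))             # equal skills, no intransitivity
print(icbt_log_likelihood(games, skills, {(0, 1): 1.0}))  # object 0 boosted against object 1
```

Writing the two log terms with `log1p` avoids forming the probability explicitly, which keeps the calculation stable when the log-odds are moderately large in magnitude.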

Prior Specification
Formulating the prior, we make the assumption that $Z \perp\!\!\!\perp \theta \mid K$, that is, the intransitivity level allocations and intransitivity levels are independent from one another given the number of intransitivity levels $K$. Likewise, it is assumed that $Y \perp\!\!\!\perp \phi \mid A$. Furthermore, we assume that the clustering of the objects' skills and the clustering of the pairs' intransitivities are independent systems, that is, $A \perp\!\!\!\perp K$, $\phi \perp\!\!\!\perp \theta$, and $Y \perp\!\!\!\perp Z$. This means that the prior specification for the two features we are clustering, skills and intransitivities, can be approached separately.
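Consider first the prior specification for the clustering of the intransitivity values of the pairs. Remember that labels $z_{\{i,j\}}$, $\forall i > j \in I \setminus \{1\}$, have domain $\{-K, \dots, K\}$, that is, $z_{\{i,j\}} : \{-K, \dots, K\} \to \{0, 1\}$, $\forall i > j \in I \setminus \{1\}$, and also that $z_{\{i,1\}}$, $\forall i \in I \setminus \{1\}$ (and by symmetry $z_{\{1,j\}}$, $\forall j \in I \setminus \{1\}$ too) are fixed in the transitive level $\{0\}$ for identifiability purposes, see Section 3.2. Let the prior on the cluster allocation be
$$z_{\{i,j\}} \mid \omega_K, K \sim \text{multinomial}(1, \omega_K), \quad \forall i > j \in I \setminus \{1\},$$
where $\omega_K = (\omega_{K,-K}, \dots, \omega_{K,K})$ is on $\{-K, \dots, K\}$ such that $\sum_{k=-K}^{K} \omega_{K,k} = 1$. The distribution of $Z \mid (\omega_K, K)$ is assumed independent over all pairs $i > j \in I \setminus \{1\}$, i.e.,
$$\pi(Z \mid \omega_K, K) = \prod_{k=-K}^{K} \omega_{K,k}^{|J_k|},$$
where $J_k$, $k \in \{-K, \dots, K\}$, is the set of allocated pairs of objects belonging to cluster $k$. We set $\omega_K \mid K \sim \text{Dirichlet}(\boldsymbol{\gamma}_K)$ to come from a $2K + 1$ dimensional Dirichlet prior distribution, where $\boldsymbol{\gamma}_K \in \mathbb{R}_+^{2K+1}$ is the hyper-parameter vector. We use an uninformative prior, setting $\boldsymbol{\gamma}_K = \gamma_K \mathbf{1}_{2K+1}$, where $\mathbf{1}_{2K+1}$ is a vector of ones of length $2K + 1$ and $\gamma_K \in \mathbb{R}_+$. In this case, the $\omega_K$ parameter can be marginalised out, by
$$\pi(Z \mid K) = \int \pi(Z \mid \omega_K, K)\, \pi(\omega_K \mid K)\, d\omega_K = \frac{\Gamma\left((2K+1)\gamma_K\right)}{\Gamma(\gamma_K)^{2K+1}} \cdot \frac{\prod_{k=-K}^{K} \Gamma\left(|J_k| + \gamma_K\right)}{\Gamma\left(\sum_{k=-K}^{K} |J_k| + (2K+1)\gamma_K\right)}, \tag{17}$$
where integration on $\omega_K$ is taken over the $2K + 1$ simplex. This is referred to as a Dirichlet-multinomial allocation prior. The prior for $K$ is a Poisson($\lambda_K$) distribution with probability mass function denoted $g_0(k \mid \lambda_K)$ so that $E[K \mid \lambda_K] = \lambda_K$, with $\lambda_K > 0$. Note that $K = 0$ is feasible, as this corresponds to the Bradley-Terry model since $\theta \mid (K = 0) = \emptyset$ and so only the transitive cluster exists, that is $\theta_K \mid (K = 0) = \theta_0$, and $\{i, j\} \in J_T$, $\forall i \neq j \in I$, so all pairs belong to the transitive cluster $J_T$. Formally $K < n(n-1)/2 - n$ but as this is large relative to our prior beliefs on $K$, for simplicity we ignore this constraint in the inference.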
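The marginalised allocation prior can be evaluated on the log scale using only cluster sizes. The sketch below assumes the standard Dirichlet-multinomial form with a symmetric Dirichlet parameter, applied to a specific allocation (so no multinomial coefficient); the function name and example counts are illustrative.

```python
from math import lgamma, log

def log_dirichlet_multinomial(counts, gamma):
    """Log prior probability of a specific allocation with the given cluster sizes.

    counts: number of items (pairs or objects) in each cluster.
    gamma:  common Dirichlet hyper-parameter, gamma > 0.
    """
    m = len(counts)     # number of clusters, e.g. 2K + 1
    total = sum(counts)
    value = lgamma(m * gamma) - lgamma(total + m * gamma)
    for c in counts:
        value += lgamma(c + gamma) - lgamma(gamma)
    return value

# One pair and three clusters with gamma = 1: each cluster has prior probability 1/3.
print(log_dirichlet_multinomial([1, 0, 0], 1.0), -log(3))
# K = 1 (labels -1, 0, 1) with six free pairs split 1 / 4 / 1.
print(log_dirichlet_multinomial([1, 4, 1], 1.0))
```

Working with `lgamma` rather than the gamma function itself avoids overflow when the number of pairs is large.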
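As the $\theta$ elements are ordered in increasing order and are positive, the prior on the $\theta$ parameters is taken to be the joint distribution of $K$ order statistics drawn from independent gamma random variables, such that
$$\pi(\theta \mid K) = K! \prod_{k=1}^{K} h_0(\theta_k \mid \alpha, \beta), \quad 0 < \theta_1 < \dots < \theta_K,$$
where $h_0(x \mid \alpha, \beta)$ is the Gamma($\alpha, \beta$) density with shape and scale $\alpha, \beta > 0$ respectively.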
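The order-statistics prior on the intransitivity levels can be sketched as a log density. This assumes the usual joint density of the order statistics of K independent and identically distributed draws (K! times the product of the marginal densities on the ordered region), with the gamma density in its shape and scale parametrisation; the function names are hypothetical.

```python
import math

def log_gamma_density(x, shape, scale):
    """Log density of a Gamma(shape, scale) variable at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def log_prior_levels(levels, shape, scale):
    """Log joint density of ordered, positive intransitivity levels.

    Returns -inf if the levels are not strictly increasing and positive.
    An empty list (K = 0) has log density 0.
    """
    if any(x <= 0 for x in levels):
        return -math.inf
    if any(a >= b for a, b in zip(levels, levels[1:])):
        return -math.inf
    log_k_factorial = math.lgamma(len(levels) + 1)
    return log_k_factorial + sum(log_gamma_density(x, shape, scale) for x in levels)

print(log_prior_levels([1.0, 2.0], shape=1.0, scale=1.0))  # log(2) - 3
print(log_prior_levels([2.0, 1.0], shape=1.0, scale=1.0))  # -inf: wrong order
```

Returning minus infinity outside the ordered region means a sampler using this function rejects any proposal that would break the ordering.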
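Consider the prior for the skill levels clustering. The skill cluster allocations have distribution $y_{\{i\}} \mid \omega_A, A \sim \text{multinomial}(1, \omega_A)$, $\forall i \in \{2, \dots, n\}$, where $\omega_A$ has domain on $\{-A^-, \dots, A^+\}$. The distribution of $y_{\{i\}} \mid (\omega_A, A)$ is assumed independent over all objects, where $\omega_A \mid A \sim \text{Dirichlet}(\boldsymbol{\gamma}_A)$ is modelled to come from an $A + 1$ dimensional Dirichlet prior distribution, with $\boldsymbol{\gamma}_A = \gamma_A \mathbf{1}_{A+1}$ where $\gamma_A \in \mathbb{R}_+$. Marginalising out as in derivation (17), another Dirichlet-multinomial allocation prior is obtained by integrating $\omega_A$ over the $A + 1$ dimensional simplex. The prior density for the skill allocations is therefore given as
$$\pi(Y \mid A) = \frac{\Gamma\left((A+1)\gamma_A\right)}{\Gamma(\gamma_A)^{A+1}} \cdot \frac{\prod_{a \in \mathcal{A}} \Gamma\left(m_a + \gamma_A\right)}{\Gamma\left(n - 1 + (A+1)\gamma_A\right)},$$
where $m_a$ is the number of objects in $I \setminus \{1\}$ allocated to skill level $a$. The prior for the number of unknown skill levels $A$ is taken to be a truncated Poisson distribution with parameter $\lambda_A > 0$, with probability mass function
$$\pi(A = a \mid \lambda_A) = \frac{g_0(a \mid \lambda_A)}{\sum_{a'=0}^{n-1} g_0(a' \mid \lambda_A)}, \quad a \in \{0, \dots, n-1\}.$$
Similarly to $\theta$, the prior choice for $\phi$ is taken to be the joint distribution of order statistics of independent and identically distributed $A + 1$ Gaussian random variables such that
$$\pi(\phi \mid A) = (A+1)! \prod_{a \in \mathcal{A} \setminus \{0\}} \frac{1}{\sqrt{2\pi\nu_A^2}} \exp\left(-\frac{\phi_a^2}{2\nu_A^2}\right),$$
where $\phi_a \sim N(0, \nu_A^2)$, $\forall a \in \mathcal{A} \setminus \{0\}$, and with $\nu_A \in \mathbb{R}_+$. The $(A + 1)!$ term arises as $\phi_0$ can occur anywhere in the sequence of $\phi$.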

Reversible jump Markov chain Monte Carlo sampler
Inference is made via a reversible jump Markov chain Monte Carlo sampler (Green, 1995), which provides samples from the posterior distribution $\pi(\phi, Y, A, \theta, Z, K \mid x)$, that is, the intransitivity and skill levels, the allocations to these levels, and the number of levels. Since the number of skill and intransitivity levels $(A, K)$ are assumed to be unknown, the uncertainty in these parameters must be accounted for, thus motivating the use of a reversible jump sampler. In a sense, the reversible jump sampler mixes over models as well as parameters, and thus fully accounts for this uncertainty in the final inference.
The ICBT model is structured to try to favour the Bradley-Terry model as a special case, and this is reflected in our sampler, by explicitly incorporating the completely transitive cluster θ 0 as an ever present cluster, even if no pairs are allocated to this cluster at a given iteration of the sampler.To ensure the skill and intransitivity levels both remain ordered, the updates to these levels occur in a transformed space such that no update can lead to a change in order.
The reversible jump algorithm used is a split-merge sampler (Green and Richardson, 2001), which is adapted from the work of Ludkin (2020).The sampler comprises three separate moves: a standard Markov chain Monte Carlo Metropolis-Hastings move, which samples parameters φ, θ, and reallocates clusters Y, Z; splitting or merging clusters; and adding or deleting empty clusters.For the construction of the algorithm and how it is implemented, see the supplementary material.

Model assessment
The inferences produced by any model are only meaningful if the model itself is accurate. This accuracy is measured here by how well the model fits out of sample. If $C$ is the set of total observed pairwise comparisons, then let $C_s$ be the set of comparisons on which the model is fitted and $C_t$ be the set of comparisons on which the model performance is analysed, such that $C_s \cup C_t = C$ and $C_s \cap C_t = \emptyset$. We use log-loss $l(x^*)$ of the test dataset $x^*$ to measure model performance, which we take to be the average negative log-likelihood per observation in $x^*$, i.e.,
$$l(x^*) = -\frac{1}{|C_t|} \sum_{c \in C_t} \left[ x_c \log \hat{p}_{i_c j_c} + (1 - x_c) \log\left(1 - \hat{p}_{i_c j_c}\right) \right],$$
where $\hat{p}_{ij}$ is the point estimate of $p_{ij}$ based on the training dataset of comparisons $C_s$ and $x^* := \{x_c : c \in C_t\}$ is the set of test data, where the notation is as used in the expression for the likelihood (16).
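The out-of-sample log-loss can be computed as below. This is a sketch assuming point estimates of the pairwise probabilities are available through a callable; the names `log_loss` and `p_hat`, and the toy test set, are illustrative.

```python
import math

def log_loss(test_comparisons, p_hat):
    """Average negative log-likelihood per held-out comparison.

    test_comparisons: list of (i, j, x) with x = 1 if i was preferred to j, else 0.
    p_hat:            callable returning the fitted probability that i beats j;
                      it must lie strictly between 0 and 1.
    """
    total = 0.0
    for i, j, x in test_comparisons:
        p = p_hat(i, j)
        total -= x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total / len(test_comparisons)

held_out = [(0, 1, 1), (0, 1, 1), (1, 2, 0)]
print(log_loss(held_out, lambda i, j: 0.5))  # log(2): an uninformative model
print(log_loss(held_out, lambda i, j: 0.8 if (i, j) == (0, 1) else 0.3))
```

A model that always predicts 1/2 scores log(2), roughly 0.693, which gives a natural baseline against which fitted models can be compared.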

Simulation study
The model was tested using simulated datasets where the number of objects, the number of round-robin tournaments, and the amount of intransitivity varied between the datasets.The sensitivity of our model to these parameters was then tested by comparing out of sample prediction accuracy with a standard Bradley-Terry model.This provided insights into the amount of data, and the amount of intransitivity, required for our more complex model to outperform the Bradley-Terry model.A full analysis is provided in the supplementary material.
5 Baseball Data

Data
Baseball was chosen to illustrate the methodology due to the high frequency of games, with accessible data for American League baseball obtained from www.retrosheet.org. The data are from the 2010-2018 seasons, with the 2010-2012 seasons involving 14 teams, and the 2013-2018 seasons involving 15 teams due to the Houston Astros moving from the National League to the American League. We analyse each season's data separately here, and jointly over years in the supplementary material.
The tournament structure is not as simple as the round-robin tournament we considered in the simulation study. The American League is split into three divisions based on location: East, Central and West, with five teams in each (since 2013). Within the same division, pairs of teams play each other approximately 20 times, and pairs of teams from different divisions play each other around 5-7 times, as well as any Playoffs and World Series matchups, totalling around 140-160 matches per team every season, depending on the season and the team. Baseball is known to be a highly strategic game, with issues such as player selection, handedness of the batters, strength and speed of players, and tactics such as "small ball" vs "long ball" all considered of great importance. We therefore anticipate that the level of intransitivity will be high.
The vast majority (at least 99.5%) of all matches are played at the home of one of the two teams competing in the game, with the rest played at neutral venues. Playing at home is well known to have the potential to increase the probability of the home team winning the match across a range of sports (Dixon and Coles, 1997). Although prediction and model interpretation could be improved by incorporating this effect, we decided not to address home advantage here.
Our reason was that none of the existing intransitivity models have such a feature, as they were developed for applications devoid of home advantage, such as e-sports, so a comparison of the different models would only be fair if we did not include this property. However, in Section 6 we formulate the home advantage adaptation given its potential interest.
If pairs of teams do not play equally home and away, then ignoring home advantage could lead to misinterpretation of the estimated ICBT model parameters, e.g., if team $i$ mostly played team $k$ with team $i$ at home, the home advantage would feed into $\theta_{ik}$. We do not believe this is problematic here, owing to the near-perfect balance of home and away matches per team: the maximum home percentage within pairs of teams is 70% for 2010-11 and only 57% subsequently.

Inference
The baseball data are analysed using the ICBT model, and its results are compared with those of the Bradley-Terry model and with the existing models of Section 2 except for the model of Duan et al. (2017) due to the subjective choices required for some parameters.
The ICBT model incorporates uncertainty in the choice of model itself, that is, in the number of clusters and therefore the number of parameters. Our prior distributions for the number of intransitivity levels $K$ and skill levels $A$ are shown in Figure 1. We used a Poisson($\lambda_K = 2$) prior for $K$, with the hyper-parameter chosen to give a 95% prior chance that $K \in [0, 5]$, as it was thought that there would only be a few different pairwise strategies. Similarly, the prior for $A$ was taken to be Poisson($\lambda_A = 7$) as this hyper-parameter choice gave a 95% prior chance that $A \in [2, 13]$.
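The prior coverage statements above can be checked with a few lines of code. This is a sketch only, assuming plain (untruncated) Poisson priors; any truncation implied by the model constraints is ignored, and the helper names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability mass of an untruncated Poisson(lam) at the integer k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_interval_mass(lo, hi, lam):
    """Prior probability that the count lies in {lo, ..., hi}, inclusive."""
    return sum(poisson_pmf(k, lam) for k in range(lo, hi + 1))


# Prior mass on K in [0, 5] with lambda_K = 2, and on A in [2, 13] with lambda_A = 7.
mass_K = poisson_interval_mass(0, 5, 2.0)
mass_A = poisson_interval_mass(2, 13, 7.0)
print(f"P(0 <= K <= 5)  = {mass_K:.3f}")
print(f"P(2 <= A <= 13) = {mass_A:.3f}")
```

Under this assumption both intervals carry at least 95% of the prior mass, in line with the stated hyper-parameter choices.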
The justification for our choice of the other hyper-parameters $(\gamma_K, \gamma_A, \alpha, \beta, \nu_A)$ and a sensitivity analysis to hyper-parameter choice are reported in the supplementary material.
Now consider the pairwise interactions between teams. These interactions could be inferred from either the intransitivity of the posterior mean $\theta^*_{ij}$, $\forall i \neq j$, or the posterior mean of the intransitivity parameter $\theta_{ij}$, $\forall i \neq j$. The supplementary material contains a comparison of both and concludes that $\theta^*_{ij}$ is more meaningful and interpretable here, so we focus on that.
For the 2018 season, Figure 2 (left) shows $\theta^*_{ij}$ for each pair of teams $i > j \in I$: recall that intransitivity has rotational symmetry, i.e., $\theta^*_{ij} = -\theta^*_{ji}$, $\forall i \neq j$. The teams are sorted by their rank according to $p_{.}$, given by definition (12), see Figure 2 (right). Reading from the teams on the y-axis to the x-axis, there is a large positive value of intransitivity from Baltimore (BAL) to Tampa Bay (TBA) of 0.78 with 95% credible interval (0.36, 1.22), indicating that Baltimore played better against Tampa Bay than expected, given their overall abilities. This is consistent with the data, with Baltimore winning 11 out of 19 matches between the two teams, despite being ranked lower.
The analysis of these intransitivities between pairs, and that of the skills of each team, can be combined to produce an overall ranking of the teams. As discussed in Section 3.3, with further details in the supplementary material, $p_{.}$ provides a suitable ranking of the teams. For
this inducing parsimony and avoiding obtaining distinct rankings for some items when there is no evidence from the data that they are not equally good. We have shown evidence from American League baseball that our model provides a distinct improvement on existing models.

Model Performance
The ICBT model has complete flexibility, in the sense that cluster allocation to skill and intransitivity levels is not predetermined. In order that the data identify the appropriate structure of clustering, and for the inference to account for the uncertainty in this choice, the model is fitted via RJMCMC.
Based on the clusters with the highest posterior probabilities, we anticipate that experts in the particular sport may be able to identify some patterns of clustering that are interpretable, e.g., associated with different styles of play. In such cases, these clustering features could be hard-wired into the model as the only options, resulting in more efficient inference. A referee made the helpful suggestion that if accounting for clustering uncertainty was not an issue then the inference could be simplified by estimating the ICBT model with group lasso penalties to induce clusters. We feel that our model works sufficiently well for the current applications but agree that it presents an exciting springboard for the consideration of various extensions to the model and its inference. We finish by illustrating a few such possible extensions.
In Section 5 we did not attempt to account for home advantage, which is widely recognised as an important feature in sport, e.g., Cattelan et al. (2013) incorporate it in a Bradley-Terry model, though to the best of our knowledge it has not been accounted for in the existing intransitivity models. The most natural way to achieve this is to change $p_{ik}$ given by expression (6) to a probability of the home team $i$ beating the away team $k$ in which the term $\theta_{ik} + r_i - r_k$ is replaced by $\theta_{ik} + \gamma + r_i - r_k$ (expression (22)), where $\gamma \in \mathbb{R}$ determines the effect of playing at home, which here is common over all pairs of teams. If $\gamma > 0$ ($\gamma < 0$) then the probability of a home win is increased (decreased) relative to the other factors of skill and intransitivity. This effect can be extended to vary over teams by replacing $\gamma$ by $\gamma_i$ in expression (22). To ensure that these $\gamma_i$ parameters are all identifiable, we fix $\gamma_1 = 0$; no such constraint is needed if there is a common $\gamma$. Beyond the conditions of Proposition 1 on the other parameters, nothing more is required, as we are able to exploit data that distinguish which team is at home.
This article only considered win-loss scenarios. Extensions of the Bradley-Terry model have been proposed for handling draws. Two distinct methods for handling draws are given by Cattelan et al. (2013) and Hankin (2020). The former use ordinal logistic regression, treating win, loss, draw as outcomes of an ordered multinomial random variable, which can then be analysed via an ordered link model. In contrast, the latter treats the problem as a competition between the two teams and a third theoretical team, such that when the theoretical team wins the outcome of the match corresponds to a draw between the two actual teams. The ICBT model can be adapted similarly, with the use of the clustering strategy extended to pooling teams to account for their similar cautiousness, leading to them drawing more often than would be expected.
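The home advantage adaptation discussed in this section can be illustrated with a short sketch. It assumes a logistic link, so that the home win probability is the inverse logit of the sum of the intransitivity, home and skill-difference terms; the function name `home_win_prob` and the numerical values are hypothetical, and the exact form of expressions (6) and (22) should be taken from the article itself.

```python
import math


def home_win_prob(theta_ik, r_i, r_k, gamma=0.0):
    """Probability that home team i beats away team k.

    Assumes a logistic link (an assumption of this sketch):
    theta_ik is the pairwise intransitivity term, r_i and r_k are the
    team skills, and gamma is the common home effect.
    """
    linear_term = theta_ik + gamma + r_i - r_k
    return 1.0 / (1.0 + math.exp(-linear_term))


# Hypothetical values: evenly matched teams with no intransitivity.
no_home_effect = home_win_prob(theta_ik=0.0, r_i=0.3, r_k=0.3, gamma=0.0)
with_home_effect = home_win_prob(theta_ik=0.0, r_i=0.3, r_k=0.3, gamma=0.2)
```

With a positive home effect the home win probability rises above the value obtained from skill and intransitivity alone, and a team-specific effect would simply replace `gamma` by a per-team value.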
We have assumed that all teams play each other. If this is not the case we cannot improve on the prior inference for the $\theta_{ik}$ parameters for pairs $(i, k)$ that do not play each other. This is not a restriction for Bradley-Terry or the existing intransitivity models, where the associated $p_{ik}$ are determined by the observed pairs. This raises issues about identifiability of the ICBT model parameters. Our approach, through Proposition 1, is no longer sufficient, leaving the open problem of which parameters to fix in order to give the most efficient inference.

With $d = 2$, this model can represent a deterministic Rock-Paper-Scissors game, by placing the blade of Rock at the chest of Scissors, the blade of Scissors at the chest of Paper, and the blade of Paper at the chest of Rock. By increasing $d$, ever more complex relationships can be captured between the pairs of objects. Given $n$ objects and $b_i, c_i \in \mathbb{R}^d$ for each object $i$, the model contains $2d(n - 1)$ identifiable parameters. The $r$ parameters can be absorbed into the blade and chest parameters; however, the above parametrisation makes it clear that the Bradley-Terry model is a special case of the blade-chest model, when $b_i = b_j = c_i = c_j$, $\forall i, j \in I$. Duan et al. (2017) introduce a generalised model for intransitivity, with logit p
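The Rock-Paper-Scissors construction for the blade-chest model can be sketched numerically. The sketch assumes the distance-based form of the model, in which object $i$ is favoured over object $j$ when the blade of $i$ is closer to the chest of $j$ than the blade of $j$ is to the chest of $i$; this form, the coordinates, the omission of the skill terms and the `scale` argument are all assumptions made for illustration and may differ from the parametrisation used in the article.

```python
import math

# Chest vectors in d = 2: the vertices of an equilateral triangle (hypothetical).
chest = {
    "rock": (0.0, 0.0),
    "paper": (1.0, 0.0),
    "scissors": (0.5, math.sqrt(3) / 2),
}
# Each object's blade is placed exactly at the chest of the object it beats.
beats = {"rock": "scissors", "scissors": "paper", "paper": "rock"}
blade = {name: chest[victim] for name, victim in beats.items()}


def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def matchup(i, j):
    """Positive when i's blade is nearer j's chest than j's blade is to i's chest."""
    return sq_dist(blade[j], chest[i]) - sq_dist(blade[i], chest[j])


def win_prob(i, j, scale=1.0):
    """Probability that i beats j under an assumed logistic link."""
    return 1.0 / (1.0 + math.exp(-scale * matchup(i, j)))
```

Each object is favoured against the one it beats, and increasing `scale` pushes the probabilities towards 0 or 1, which recovers the deterministic game in the limit.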

From the model definition (1), the Bradley-Terry model assumes linear transitivity. This assumption constrains the pairwise probabilities of the model such that, given $p^{(BT)}_{ij}$ and $p^{(BT)}_{jk}$ for any $i \neq j \neq k \in I$, the probability $p^{(BT)}_{ik}$ is completely determined. It is straightforward to show that the form of $p^{(BT)}_{ik}$ is given as $p^{(BT)}_{ik} = p^{(BT)}_{ij} p^{(BT)}_{jk} / \{p^{(BT)}_{ij} p^{(BT)}_{jk} + (1 - p^{(BT)}_{ij})(1 - p^{(BT)}_{jk})\}$.
because $\sum_{a=-A^-}^{A^+} |c_a| = n$, where $c_a := \{i, \forall i \in I \setminus \{1\} : y_{\{i\},a} = 1\}$ is the set of objects belonging to skill cluster $a \in A$.
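The transitivity constraint of the Bradley-Terry model can be checked numerically. The sketch below assumes the standard logistic form, in which the log-odds of $i$ beating $k$ equal the skill difference, so that the log-odds for the pairs $(i, j)$ and $(j, k)$ add to give those for $(i, k)$; the skill values and function names are hypothetical.

```python
import math


def bt_prob(r_i, r_j):
    """Bradley-Terry probability that i beats j (standard logistic form)."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))


def implied_prob(p_ij, p_jk):
    """Probability of i beating k implied by p_ij and p_jk under transitivity."""
    numerator = p_ij * p_jk
    return numerator / (numerator + (1.0 - p_ij) * (1.0 - p_jk))


# Hypothetical skills for three objects.
skills = {"i": 0.8, "j": 0.1, "k": -0.5}
direct = bt_prob(skills["i"], skills["k"])
implied = implied_prob(
    bt_prob(skills["i"], skills["j"]),
    bt_prob(skills["j"], skills["k"]),
)
```

The direct and implied probabilities agree up to floating-point error, which is exactly the restriction that an intransitivity term is designed to relax.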

Figure 1: Posterior distributions of the K intransitivity levels (left) and the A skill levels (right) for the 2018 season: with the associated prior distributions in a lighter colour.

Figure 2: Analysis of 2018 season: the posterior mean of the intransitivity parameter, $\theta^*_{ij}$, across all pairs of teams $i > j \in I$ (left); ranking according to definition (12) (black) and Bradley-Terry model (red) for all teams $i \in I$ (right).
Expression (22): probability of the home team $i$ beating the away team $k$, involving the term $-(\theta_{ik} + \gamma + r_i - r_k)$.

To test the model performance, 70% of games from each season are randomly selected to be training data, on which the model is fitted, with the remaining 30% used as test data, on which
Table 1: Negative relative log-loss $\times 10^3$ (compared to a coin-tossing model) for each year of baseball data for the ICBT, Bradley-Terry, blade-chest and majority vote models. 95% confidence intervals, in parentheses, come from random training-test splits of the data.