Faster identification of faster Formula 1 drivers via time-rank duality

Two natural ways of modelling Formula 1 race outcomes are a probabilistic approach, based on the exponential distribution, and econometric modelling of the ranks. Both approaches lead to exactly soluble race-winning probabilities. Equating race-winning probabilities leads to a set of equivalent parametrisations. This time-rank duality is attractive theoretically and leads to quicker ways of disentangling driver-level and car-level effects.


Introduction
Modelling Formula 1 races is an interesting econometric problem (Bell et al., 2016; van Kesteren and Bergkamp, 2023) of significant wider interest (Maurya, 2021). It is of interest to separate out driver-level and car-level effects. Previously, such an analysis has only been possible over longer time periods (Bell et al., 2016; Eichenberger and Stadelmann, 2009; van Kesteren and Bergkamp, 2023). Here, we present a solution that requires only one season of previous data. Formula 1 races are most easily modelled assuming car finishing times are independently and exponentially distributed random variables. Under this assumption race-winning probabilities can be written down in closed form. This tractability also enables relatively easy model calibration via bookmakers' odds. However, this approach is at odds with much of the publicly available race data. Final race-finishing times are typically unavailable, as lapped cars do not typically finish the full race distance. In contrast, the most convenient way of modelling publicly available race data is regression modelling of the final finishing position (Eichenberger and Stadelmann, 2009). Thus, in this paper, we combine both modes of analysis, an approach we term time-rank duality.
The layout of this paper is as follows. Section 2 outlines a probabilistic approach to modelling race-finishing times and model calibration via bookmakers' odds. Section 3 establishes theoretical duality between this probabilistic approach and regression modelling of the final rank. Firstly, we show that a regression model for ranks can be used to estimate race-winning probabilities, which can then be converted to an equivalent exponential-distribution parameterisation using the method in Section 2. Secondly, we show that, under the simplifying assumption of homoscedasticity, regression parameters can be reverse-engineered from a set of race-winning probabilities, e.g. those corresponding to a given exponential-distribution parameterisation or a set of bookmakers' odds. Section 4 discusses empirical regression modelling of historical results. Section 5 combines both approaches to enable quicker identification of individual driver-level effects. Section 6 concludes. A mathematical appendix is contained at the end of the paper.

2 Probabilistic approach to modelling finishing times
Models based around the exponential distribution are amongst the most convenient ways to model Formula 1 races. This is due to its tractability alongside its usage in classical applied probability models. A related formulation based on the Weibull distribution is explored in the Appendix. Whilst its non-constant hazard function may be more physically realistic, the Weibull distribution may be more cumbersome in applications due to its additional shape parameter.
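Suppose, for the sake of simplicity, that a race consists of $n$ cars whose finishing times $T_1, T_2, \dots, T_n$ are independent exponentially distributed random variables with parameters $\lambda_1, \lambda_2, \dots, \lambda_n$. Independence is a common simplifying assumption in sports models (Scarf et al., 2019) but may be difficult to justify empirically. A standard result in probability theory (Grimmett and Stirzaker, 2020) gives:

Proposition 1
i. If $T_1, \dots, T_n$ are independent and exponentially distributed with parameters $\lambda_1, \dots, \lambda_n$ then $\min\{T_1, \dots, T_n\}$ is exponentially distributed with parameter $\lambda_1 + \dots + \lambda_n$.
ii. If $X$ and $Y$ are independent exponentially distributed random variables with parameters $\lambda_X$ and $\lambda_Y$ then
$$\Pr(X < Y) = \frac{\lambda_X}{\lambda_X + \lambda_Y}.$$
iii. Consider the Formula 1 race with independent and exponentially distributed finishing times as outlined above. Then
$$\Pr(\text{Car } j \text{ wins}) = \frac{\lambda_j}{\sum_{i=1}^{n} \lambda_i}.$$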
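As a minimal sketch of the closed-form win probability $\lambda_j / \sum_i \lambda_i$, the Python snippet below computes it directly and compares it with a simple Monte Carlo simulation of independent exponential finishing times. The rate values, the function names and the simulation size are hypothetical choices made purely for illustration; they are not taken from the paper's R code or data.

```python
import random


def win_probabilities(rates):
    """Closed-form Pr(car j wins) = rate_j / sum(rates) for independent
    exponential finishing times (rate parameterisation)."""
    total = sum(rates)
    return [rate / total for rate in rates]


def simulate_win_frequencies(rates, n_races=100_000, seed=0):
    """Estimate win probabilities by simulating exponential finishing times
    and recording which car has the smallest time in each race."""
    rng = random.Random(seed)
    wins = [0] * len(rates)
    for _ in range(n_races):
        times = [rng.expovariate(rate) for rate in rates]
        wins[times.index(min(times))] += 1
    return [w / n_races for w in wins]


# Hypothetical rates for three cars (illustrative values only).
rates = [3.0, 2.0, 1.0]
exact = win_probabilities(rates)
approx = simulate_win_frequencies(rates)

for j, (e, a) in enumerate(zip(exact, approx), start=1):
    print(f"Car {j}: closed form {e:.4f}, simulated {a:.4f}")
```

A larger rate corresponds to a shorter expected finishing time, so the car with the largest $\lambda_j$ has the highest win probability.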
Proposition 1 shows that, given a sequence of win probabilities $p_1, p_2, \dots, p_n$, calculated e.g. from bookmakers' odds, we can estimate the parameters $\lambda_i$. This can be done by minimising the Residual Sum of Squares (RSS):
$$\mathrm{RSS} = \sum_{j=1}^{n} \left( p_j - \frac{\lambda_j}{\sum_{i=1}^{n} \lambda_i} \right)^2. \tag{1}$$
The minimisation in (1) can be done numerically. Results of the procedure applied to bookmakers' data are shown in Table 1. The R code and data to reproduce these results are openly available on GitHub. In Table 1 odds can be converted to probabilities as follows. The win probability corresponding to odds of 25/1 for a Lewis Hamilton victory can be calculated via
$$\frac{1}{25 + 1} = \frac{1}{26} \approx 0.0385.$$
Win probabilities for the remaining drivers are calculated similarly, and then renormalised (Strumbelj, 2014) so that they sum to 1. These renormalised win probabilities are given in the fourth column of Table 1. Estimated $\lambda$ values from the minimisation in (1) are in the fifth column.
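The calibration steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' R implementation: the odds list and all function names are hypothetical, and the objective is taken to be the squared difference between each renormalised win probability and $\lambda_j / \sum_i \lambda_i$. Because that ratio depends only on the relative sizes of the $\lambda_j$, any vector proportional to the renormalised probabilities attains an RSS of zero, so a numerical routine needs some normalisation to pin down the overall scale.

```python
from fractions import Fraction


def implied_probability(fractional_odds):
    """Convert fractional odds such as '25/1' into the raw implied
    win probability 1 / (odds + 1)."""
    odds = Fraction(fractional_odds)
    return float(1 / (odds + 1))


def renormalise(probabilities):
    """Rescale raw implied probabilities so that they sum to 1."""
    total = sum(probabilities)
    return [p / total for p in probabilities]


def rss(rates, probabilities):
    """Residual sum of squares between the target win probabilities and
    the exponential-model win probabilities rate_j / sum(rates)."""
    total = sum(rates)
    return sum((p - rate / total) ** 2 for p, rate in zip(probabilities, rates))


# Hypothetical odds for a three-car race (illustrative values only).
odds = ["1/4", "4/1", "25/1"]
raw = [implied_probability(o) for o in odds]
probs = renormalise(raw)

# Any rates proportional to the renormalised probabilities give RSS = 0.
rates = list(probs)
print("raw implied probabilities:", [round(p, 4) for p in raw])
print("renormalised probabilities:", [round(p, 4) for p in probs])
print("RSS at rates = probs:", rss(rates, probs))
```

In this toy example the raw implied probabilities sum to more than 1, which is why the renormalisation step is needed before fitting.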

Econometric modelling of the final race ranking
Empirical Formula 1 data are most commonly listed in terms of the rank rather than the strict finishing times.The analysis of historical race data is therefore most easily accomplished by regression modelling of the final rank obtained (Eichenberger and Stadelmann, 2009).This implicitly assumes a Gaussian model for sporting outcomes (Scarf et al., 2019).
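Consider two related problems. Firstly, suppose that there are $n$ cars in the race and the final ranking $r_i$ of car $i$ can be approximated by a normal distribution: $r_i \sim N(\mu_i, \sigma_i^2)$. The approximate probability that car $i$ wins the race is given by
[...] (2)
where $\Phi(\cdot)$ denotes the standard normal CDF. Secondly, suppose we are given a sequence of win probabilities $p_1, p_2, \dots, p_n$ for Cars $1, 2, \dots, n$. Under the simplifying assumption of $\sigma_i^2 = \sigma^2$, equivalent to the classical normal linear regression model (Fry and Burke, 2022), from equation (2)
[...] (3)
Since the sum of the ranks is equal to $n(n+1)/2$, summing equation (3) over $i$ gives
[...] (4)
Combining equations (3-4) therefore gives the estimated $\mu_i$ values corresponding to the given win probabilities $p_i$. Table 2 applies this approach to estimate a set of $\hat{\mu}_i$ and $\hat{\sigma}^2$ regression parameters for the bookmakers' data in Table 1.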
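The reverse-engineering step can be illustrated with a short Python sketch. It rests on an assumption that is not confirmed by the text available here: that the win probability of car $i$ takes the form $\Phi((c - \mu_i)/\sigma)$ for some cut-off $c$, with $c = 1.5$ used below as a hypothetical continuity-corrected choice. Under that assumption, inverting gives $\mu_i = c - \sigma \Phi^{-1}(p_i)$, and imposing $\sum_i \mu_i = n(n+1)/2$ determines $\sigma$. The function name and the example probabilities are likewise illustrative.

```python
from statistics import NormalDist


def rank_parameters(win_probs, threshold=1.5):
    """Recover common-variance rank parameters (mu_i, sigma) from win
    probabilities, ASSUMING Pr(car i wins) = Phi((threshold - mu_i) / sigma).

    The threshold is a hypothetical cut-off, not taken from the source.
    """
    n = len(win_probs)
    standard_normal = NormalDist()
    z = [standard_normal.inv_cdf(p) for p in win_probs]
    # The ranks 1..n sum to n(n+1)/2, which pins down sigma.
    sigma = (n * threshold - n * (n + 1) / 2) / sum(z)
    mus = [threshold - sigma * z_i for z_i in z]
    return mus, sigma


# Hypothetical renormalised win probabilities for a four-car race.
win_probs = [0.5, 0.3, 0.15, 0.05]
mus, sigma = rank_parameters(win_probs)

# Check that the fitted parameters reproduce the input probabilities.
recovered = [NormalDist(mu, sigma).cdf(1.5) for mu in mus]
print("mu:", [round(m, 3) for m in mus])
print("sigma:", round(sigma, 3))
print("recovered win probabilities:", [round(p, 3) for p in recovered])
```

The sketch only makes sense when the resulting $\sigma$ is positive, which holds here because most win probabilities lie below one half and the race has more than two cars.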

Regression modelling of historical results
In this section we calibrate the model to historical results (observed race rankings) from the 2022 season, which was the last fully completed season at the time of writing. This follows a similar approach to modelling historical results in Fry et al. (2021). Following Eichenberger and Stadelmann (2009) we regress the finishing position against dummy variables corresponding to each of the constructors. We then use stepwise regression (Fry and Burke, 2022) to automatically choose the best model. We constrain all fitted models to include a dummy variable indicating each team's second (less-favoured) driver. Forward and stepwise selection choose the same model, indicated below in Table 3. In contrast, backward selection suggests a more complex model. However, an F-test, not reported, is non-significant, suggesting the simpler model in

Appendix: Mathematical proofs
In Proposition 2 we consider race times to be independent Weibull distributions with common shape parameter $k$. This is a small technical generalisation of Proposition 1, where finishing times are exponential. We present the proof for Proposition 2 below, noting that Proposition 1 is the special case of $k = 1$ in Proposition 2.

Proposition 2
i. If $T_1, \dots, T_n$ are independent and Weibull distributed with parameters $(\lambda_1, k), \dots, (\lambda_n, k)$ [...]
ii. If $X \sim \mathrm{Weibull}(\lambda_X, k)$ and $Y \sim \mathrm{Weibull}(\lambda_Y, k)$ and $X$ and $Y$ are independent then [...]
iii. Consider the Formula 1 race with independent and Weibull distributed finishing times as outlined above. Then [...]

[...]
iii. For the sake of argument suppose $j = 1$. Then $\Pr(\text{Car 1 wins}) = \Pr(T_1 < \min\{T_2, \dots, T_n\})$. Hence the result follows from part ii.
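For orientation, the steps of such an argument can be sketched under one specific convention. The derivation below assumes the parameterisation $\Pr(T > t) = \exp(-\lambda t^k)$ for $t \ge 0$; this is an assumption made for illustration, since the convention used in the paper is not visible here, and a different convention for $\lambda$ would change the form of the results.

```latex
% Assumed survival function (illustrative convention):
%   \Pr(T_i > t) = \exp(-\lambda_i t^k), \quad t \ge 0.

% i. Minimum of independent Weibull times with common shape k:
\Pr\Bigl(\min_{1 \le i \le n} T_i > t\Bigr)
  = \prod_{i=1}^{n} \exp(-\lambda_i t^k)
  = \exp\Bigl(-\Bigl(\sum_{i=1}^{n} \lambda_i\Bigr) t^k\Bigr),
% so the minimum is again Weibull with parameters (\sum_i \lambda_i, k).

% ii. Pairwise comparison, with density f_X(t) = \lambda_X k t^{k-1} e^{-\lambda_X t^k}:
\Pr(X < Y)
  = \int_0^\infty \lambda_X k t^{k-1} e^{-\lambda_X t^k} \, e^{-\lambda_Y t^k} \, dt
  = \frac{\lambda_X}{\lambda_X + \lambda_Y}.

% iii. Apply ii with X = T_1 and Y = \min\{T_2, \dots, T_n\}, using i for Y:
\Pr(\text{Car 1 wins})
  = \Pr\bigl(T_1 < \min\{T_2, \dots, T_n\}\bigr)
  = \frac{\lambda_1}{\sum_{i=1}^{n} \lambda_i}.
```

Setting $k = 1$ in this sketch recovers the exponential case, which matches the win probability $\lambda_j / \sum_{i} \lambda_i$ stated for exponentially distributed finishing times.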

Table 1: Results of the model applied to betting data for the 2023 Qatar Grand Prix. (Source: www.bet365.com.)

Table 3 should suffice. Negative and significant parameters in Table 3 indicate more successful constructors with lower expected final finishing positions.
[...] and Fernando Alonso out-perform the level of the car that they drive. Results match previous suggestions that Verstappen's performance level is historically significant (van Kesteren and Bergkamp, 2023). Future work will adjust the above models to account for cars that fail to finish races. Substantial interest in the analytical modelling of sports remains (Singh et al.,