Gaussian field on the symmetric group: Prediction and learning

In the framework of supervised learning of a real function defined on an abstract space $X$, Gaussian processes are widely used. The Euclidean case for $X$ is well known and has been widely studied. In this paper, we explore the less classical case where $X$ is the non-commutative finite group of permutations (namely the so-called symmetric group $S_N$). We provide an application to the Gaussian process based optimization of Latin Hypercube Designs. We also extend our results to the case of partial rankings.


1 Introduction
The problem of ranking a set of items is a fundamental task in today's data driven world. Analysing observations which are not quantitative variables but rankings has long been studied in social sciences. It has also become a popular problem in statistical learning, owing to the generalized use of automatic recommendation systems. Rankings are labels that model an order over a finite set $E_N := \{1, \ldots, N\}$. Hence, an observation is a set of preferences between these $N$ points. It is thus a one to one relation $\sigma$ acting from $E_N$ onto $E_N$. In other words, $\sigma$ lies in the finite symmetric group $S_N$ of all permutations of $E_N$. More precisely, assume that we have a finite set $X = \{x_1, \cdots, x_N\}$ and we have to order the elements of $X$. A ranking on $X$ is a statement of the form
$$x_{i_1} \succ x_{i_2} \succ \cdots \succ x_{i_N}, \qquad (1)$$
where all the $i_j$, $j = 1, \cdots, N$, are different. We can associate to this ranking the permutation $\sigma$ defined by $\sigma(i_k) = k$. Conversely, to a permutation $\sigma$, we can associate the ranking $x_{\sigma^{-1}(1)} \succ x_{\sigma^{-1}(2)} \succ \cdots \succ x_{\sigma^{-1}(N)}$. We refer to the works of Douglas E. Critchlow (see for example [19, 16, 18]) for an introduction to rankings, together with various results.
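As an illustration, here is a minimal Python sketch of the correspondence between rankings and permutations described above (the function names are ours):

import itertools

# The ranking x_{i_1} > x_{i_2} > ... > x_{i_N} is mapped to sigma with sigma(i_k) = k.
def ranking_to_permutation(items_in_order):
    """Return sigma as a dict {item: rank}."""
    return {item: rank for rank, item in enumerate(items_in_order, start=1)}

def permutation_to_ranking(sigma):
    """Inverse map: list the items by increasing rank sigma(i)."""
    return sorted(sigma, key=sigma.get)

sigma = ranking_to_permutation([3, 1, 2])   # x_3 preferred to x_1 preferred to x_2
assert sigma == {3: 1, 1: 2, 2: 3}          # sigma(3) = 1, sigma(1) = 2, sigma(2) = 3
assert permutation_to_ranking(sigma) == [3, 1, 2]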
Our aim is to predict outputs corresponding to permutation inputs. For instance, in applications, the permutation input can correspond to an ordering of tasks. In a workflow management system, there may be a large number of tasks that can be done in different orders but are all necessary to achieve the goal. Workflow prediction or optimization problems currently occur in fields such as grid computing [44] and logistics [11].
Another example of application is given by the maintenance of machines in a supply line. Machines in a supply line need to be tuned or monitored in order to optimize the production of a good. The machines can be tuned in different orders, each corresponding to a permutation, and these choices have an impact on the quality of the production, measured by a quantitative variable $Y$, for instance the amount of defects in the produced goods. The objective of the model is thus to forecast the outcome of a specific maintenance order for the machines, so as to optimize the production.
Another interesting case of output corresponding to a permutation input is of the form $\max_{x \in X} f(\sigma, x)$, where $f$ is a function acting both on the permutation $\sigma$ and on some external variable $x$. This output corresponds to a worst case for the performance or the cost associated with the permutation $\sigma$. Classical examples of this kind of output are the max distance criterion for Latin Hypercube Designs [35, 40] and the robust deviation for a tour in the robust traveling salesman problem [37]. In Section 3.4, we discuss and address the example of the max distance criterion.
In this paper, we work in the framework of Gaussian processes indexed by $S_N$. Gaussian process models rely on the definition of a covariance function that characterizes the correlations between values of the process at different observation points. As the notion of similarity between data points is crucial, i.e. inputs at close locations are likely to have similar target values, covariance functions (symmetric positive definite kernels) are the key ingredient in using Gaussian processes for prediction. Indeed, the covariance operator encodes nearness or similarity information. In order to obtain a satisfying model, one needs to choose a covariance function (i.e. a symmetric positive definite kernel) that respects the structure of the index space of the dataset.
A large number of applications have given rise to recent research on rankings, including ranking aggregation [29], clustering rankings (see [12]) and kernels on rankings for supervised learning. Constructing kernels over the set of permutations has been studied in several different ways. In [27], Kondor provides results about kernels on non-commutative finite groups and constructs diffusion kernels (which are positive definite) on $S_N$. These diffusion kernels are based on a discrete notion of neighbourhood. Notice that the kernels considered therein are quite different from those considered in this paper. Furthermore, the diffusion kernels are not in general covariance functions because of their intricate dependency on permutations. The recent reference [25] proves that the Kendall and Mallows kernels are positive definite. Further, [32] extends this study by characterizing both the feature spaces and the spectral properties associated with these two kernels. A real data set [10] on rankings is studied in [32]. The authors used a kernel regression to predict the age of a participant from his/her order of preference of six sources of news regarding scientific developments: TV, radio, newspapers and magazines, scientific magazines, the internet, school/university. There are applications where not all of the items in (1) are ranked. Rather, a partial ranking is given (see for example the "sushi" dataset available at http://www.kamishima.net or movie datasets). The books [17] and [33] provide metrics on partial rankings, and the papers [28] and [25] provide kernels on partial rankings and deal with the reduction of their computational complexity.
The goal of this paper is threefold. First, we define Gaussian processes indexed by $S_N$ by providing a wide class of covariance kernels; this generalizes previous results on the Mallows kernel (see [25]). Second, we consider Kriging models (see for instance [41]), which consist in inferring the values of a Gaussian random field given observations at a finite set of observation points; here, the observation points are permutations. We study the asymptotic properties of the maximum likelihood estimator of the parameters of the covariance function, and we prove the asymptotic accuracy of the Kriging prediction under the estimated covariance parameters. We provide simulations that illustrate the very good performance of the proposed kernels, together with an application to the Gaussian process based optimization of Latin Hypercube Designs. Third, we show that the Gaussian process framework can be adapted to the case of learning with partially observed rankings. We define a class of covariance kernels on partial rankings, for which we show how to reduce the computational complexity. In simulations, we show that our suggested kernels yield more efficient Gaussian process predictions than the kernels given in [25].
The paper falls into the following parts. In Section 2, we recall some facts on $S_N$ and provide covariance kernels on this set. Asymptotic results on the estimation of the covariance function are presented in Section 3, which also contains an application to the optimization of Latin Hypercube Designs. Section 4 provides new covariance kernels for partial rankings, with a numerical comparison to the ones given in [25]. Section 5 concludes the paper. The proofs are all postponed to the appendix.

2 Covariance model for rankings
Recall that we define $S_N$ as the set of all permutations of $E_N := \{1, \ldots, N\}$. An element $\sigma$ of $S_N$ is a bijection from $E_N$ to $E_N$. We aim at constructing kernels, or covariance functions, on $S_N$. We will base these kernels on the three following distances on $S_N$ (see [21]). For any permutations $\pi$ and $\sigma$ of $S_N$:

• The Kendall's tau distance is defined by
$$d_\tau(\pi, \sigma) = \sum_{1 \le i < j \le N} \mathbf{1}_{(\pi(i) - \pi(j))(\sigma(i) - \sigma(j)) < 0}. \qquad (3)$$
This distance counts the number of pairs on which the permutations disagree in ranking.

• The Hamming distance is defined by
$$d_H(\pi, \sigma) = \sum_{i=1}^{N} \mathbf{1}_{\pi(i) \neq \sigma(i)}. \qquad (4)$$
This distance counts the number of points on which the permutations disagree.

• The Spearman's footrule distance is defined by
$$d_S(\pi, \sigma) = \sum_{i=1}^{N} |\pi(i) - \sigma(i)|. \qquad (5)$$
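As a reference, these three distances can be implemented directly from their definitions. A minimal Python sketch (the function names are ours), with a permutation encoded as a tuple $p$ such that $p[i-1] = \sigma(i)$:

import itertools

def d_kendall(p, q):
    """Kendall's tau (3): number of pairs ranked in opposite order by p and q."""
    n = len(p)
    return sum(1 for i, j in itertools.combinations(range(n), 2)
               if (p[i] - p[j]) * (q[i] - q[j]) < 0)

def d_hamming(p, q):
    """Hamming (4): number of points on which p and q disagree."""
    return sum(a != b for a, b in zip(p, q))

def d_spearman(p, q):
    """Spearman's footrule (5): l1 distance between the two rank vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1, 3, 2), (2, 1, 3)
print(d_kendall(p, q), d_hamming(p, q), d_spearman(p, q))  # 2 3 4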
We aim at defining a Gaussian process indexed by permutations. Notice that, generally speaking, using the abstract Kolmogorov construction (see for example [20], Chapter 0), the law of a Gaussian random process $(Y_x)_{x \in E}$ indexed by an abstract set $E$ is entirely characterized by its mean and covariance functions. Of course, here the framework is much simpler, as $S_N$ is finite ($|S_N| = N!$), and the Gaussian distribution is completely determined by its mean and covariance matrix. Hence, if we assume that the process is centered, we only have to build a covariance function on $S_N$. First, we recall the definition of a positive definite kernel on an abstract space $E$. A symmetric map $K : E \times E \to \mathbb{R}$ is called a positive definite kernel if for all $n \in \mathbb{N}$ and for all $(x_1, \cdots, x_n) \in E^n$, the matrix $(K(x_i, x_j))_{i,j}$ is positive semi-definite. In this paper, we say that $K$ is a strictly positive definite kernel if $K$ is symmetric and, for all $n \in \mathbb{N}$ and for all pairwise distinct $(x_1, \cdots, x_n) \in E^n$, the matrix $(K(x_i, x_j))_{i,j}$ is positive definite. These notions are particularly interesting for $S_N$ (and any finite set). Indeed, if $K$ is a strictly positive definite kernel, then any function $f : S_N \to \mathbb{R}$ can be written in the form
$$f(\cdot) = \sum_{i} a_i K(\cdot, \sigma_i), \qquad (6)$$
for some coefficients $a_i \in \mathbb{R}$ and permutations $\sigma_i \in S_N$, and $K$ is of course a universal kernel (see [36]).
Remark 1. Since $S_N$ is a finite discrete space, remark that the Reproducing Kernel Hilbert Space (RKHS) of a kernel $K$ is the set of the functions of the form (6), and the universality of the kernel $K$ is equivalent to the equality of its RKHS with the set of all functions from $S_N$ to $\mathbb{R}$. This is, in turn, equivalent to the fact that $K$ is strictly positive definite.
We now provide two different parametric families of covariance kernels. The members of these families have the general forms
$$K_{\theta_1, \theta_2}(\sigma, \sigma') = \theta_2 \exp\left(-\theta_1 d(\sigma, \sigma')\right) \qquad (7)$$
and
$$K_{\theta_1, \theta_2, \theta_3}(\sigma, \sigma') = \theta_2 \exp\left(-\theta_1 d(\sigma, \sigma')^{\theta_3}\right). \qquad (8)$$
Here, $d$ is one of the three distances defined in (3), (4) and (5). More precisely, for the Kendall's (resp. Hamming and Spearman's footrule) distance, let $K^\tau_{\theta_1,\theta_2(,\theta_3)}$ (resp. $K^H_{\theta_1,\theta_2(,\theta_3)}$ and $K^S_{\theta_1,\theta_2(,\theta_3)}$) be the corresponding covariance function. For concision, we will sometimes write $K_{\theta_1,\theta_2(,\theta_3)}$ (resp. $d$) for any one of these three kernels (resp. distances).
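As a sketch, the Gram matrix of family (8) can be assembled as follows, reusing d_kendall from the previous snippet (for $\theta_3 = 1$ one recovers family (7)):

import itertools
import numpy as np

def kernel_matrix(perms, theta1, theta2, theta3, dist):
    """Gram matrix of K(sigma, sigma') = theta2 * exp(-theta1 * d(sigma, sigma')**theta3)."""
    n = len(perms)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = theta2 * np.exp(-theta1 * dist(perms[i], perms[j]) ** theta3)
    return K

perms = list(itertools.permutations(range(1, 5)))   # all of S_4
K = kernel_matrix(perms, theta1=0.5, theta2=1.0, theta3=1.0, dist=d_kendall)
print(np.linalg.eigvalsh(K).min() > 0)  # strict positive definiteness (Proposition 1)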
We show in the next proposition that $K_{\theta_1, \theta_2}$ is strictly positive definite.

Proposition 1. For all $\theta_1 > 0$ and $\theta_2 > 0$, the kernel $K_{\theta_1, \theta_2}$ is strictly positive definite, for the Kendall's tau distance, the Hamming distance and the Spearman's footrule distance.

Proposition 2. For all $\theta_1 > 0$, $\theta_2 > 0$ and $\theta_3 \in [0, 1]$, the kernel $K_{\theta_1, \theta_2, \theta_3}$ is a positive definite kernel, for the Kendall's tau distance, the Hamming distance and the Spearman's footrule distance.
Propositions 1 and 2 enable us to define Gaussian processes indexed by permutations.
Remark 3. The authors of [2] define strictly positive definite kernels on graphs with Euclidean edges, with two different metrics: the geodesic metric and the "resistance metric". The kernels are obtained by applying completely monotone functions to these metrics (distances). They provide different classes of such functions: the power exponential functions (which are considered in our work, see (8)), the Matérn functions (with a smoothness parameter $0 < \nu \le 1/2$), the generalized Cauchy functions and the Dagum functions. One can show that Proposition 2 remains valid for all these kernels, by remarking, as in [2], that these kernels are based on completely monotone functions. Some of the proofs of [2] are based on techniques similar to the proof of Proposition 2, using Schoenberg's theorems.
We remark that the finite set of permutations $S_N$ is a graph, when two permutations $\sigma_1$ and $\sigma_2$ are connected if there exists a transposition $\pi$ such that $\sigma_1 = \sigma_2 \pi$. Hence, it is natural to ask if the results of [2] can imply or extend some of the results in this paper. The answer however appears to be negative. Indeed, the distances considered in [2] are the geodesic or the "resistance" distances, and the distances in (3), (4) and (5) do not fall into this category.
One could also consider the set of permutations as a fully connected weighted graph, where the weight of the edge between $\sigma_1$ and $\sigma_2$ is $d(\sigma_1, \sigma_2)$, with $d$ equal to $d_\tau$, $d_H$ or $d_S$. Nevertheless, also with this graph, the results of [2] do not apply, since the graphs addressed by this reference have a particular structure (finite sequential 1-sums of Euclidean cycles and trees).
We finally remark that [2] constructs covariance functions not only on the vertices of finite graphs, but also between points on the connecting edges. In contrast, the covariance functions constructed here are defined only on the finite set $S_N$.
3 Gaussian fields on the symmetric group

3.1 Maximum likelihood
Let us consider a Gaussian process $Y$ indexed by $\sigma \in S_N$, with zero mean and covariance function $K^*$. In a parametric setting, a classical assumption is that the covariance function $K^*$ belongs to a parametric set of the form $\{K_\theta;\ \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^p$ is given and, for all $\theta \in \Theta$, $K_\theta$ is a covariance function. The parameter $\theta$ is generally called the covariance parameter. In this framework, $K^* = K_{\theta^*}$ for some parameter $\theta^* \in \Theta$.
The parameter $\theta^*$ is estimated from noisy observations of the values of the Gaussian process at several inputs. Namely, to the observation point $\sigma_i$, we associate the observation $Y(\sigma_i) + \epsilon_i$, for $i = 1, \ldots, n$, where $(\epsilon_i)_i$ is an independent Gaussian white noise. Let us consider a sample of random permutations $\Sigma = (\sigma_1, \cdots, \sigma_n)$. Assume that we observe $\Sigma$ and a random vector
$$y = (y_1, \cdots, y_n)^\top, \qquad y_i = Y(\sigma_i) + \epsilon_i.$$
Here, $Y$ is a Gaussian process indexed by $S_N$ and independent of $\Sigma$. We assume that $Y$ is centered with covariance function $K_{\theta_1^*, \theta_2^*}$ (see (7) in Section 2); $Y$ is the unknown process to predict and $\epsilon$ is an additive white noise with variance $\theta_3^*$. Notice that $\theta_3$ denotes here the variance of the nugget effect, while it is a power in Section 2 (see (8)); we keep the same name in order to use the compact notation $\theta$ for the parameter of the model. The Gaussian process $Y$ is stationary in the sense that, for all $\sigma_1, \cdots, \sigma_n \in S_N$ and for all $\tau \in S_N$, the finite-dimensional distribution of $Y$ at $\sigma_1, \cdots, \sigma_n$ is the same as the finite-dimensional distribution at $\sigma_1 \tau, \cdots, \sigma_n \tau$ (this follows from the right-invariance of the three distances above). Several techniques have been proposed for constructing an estimator $\hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3)$: maximum likelihood estimation [43], restricted maximum likelihood [14], leave-one-out estimation [13, 3], leave-one-out log probability [42]... Here, we shall focus on the maximum likelihood method, which is widely used in practice and has received a lot of theoretical attention. Assume that $\Theta$ is compact and that $\theta^* \in \Theta$. The maximum likelihood estimator is defined as
$$\hat{\theta}_{ML} \in \operatorname*{argmin}_{\theta \in \Theta} L_\theta, \qquad (11)$$
with
$$L_\theta = \frac{1}{n} \left( \log \det R_\theta + y^\top R_\theta^{-1} y \right),$$
where
$$R_\theta = \left( K_{\theta_1, \theta_2}(\sigma_i, \sigma_j) + \theta_3 \mathbf{1}_{i = j} \right)_{1 \le i, j \le n}.$$
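A minimal sketch of the criterion (11), assuming a precomputed matrix D of pairwise distances between the observed permutations (the helper names are ours):

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, D, y):
    """L_theta of (11), with R_theta = theta2 * exp(-theta1 * D) + theta3 * I."""
    theta1, theta2, theta3 = theta
    n = len(y)
    R = theta2 * np.exp(-theta1 * D) + theta3 * np.eye(n)
    _, logdet = np.linalg.slogdet(R)
    return (logdet + y @ np.linalg.solve(R, y)) / n

def fit_maximum_likelihood(D, y, theta0=(0.5, 1.0, 0.5)):
    # box constraints play the role of the compact parameter set Theta
    res = minimize(neg_log_likelihood, theta0, args=(D, y),
                   bounds=[(1e-3, 10.0)] * 3, method="L-BFGS-B")
    return res.x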

3.2 Asymptotic results
When considering the asymptotic behaviour of the maximum likelihood estimator, two different frameworks can be studied: fixed-domain and increasing-domain asymptotics [41]. Under increasing-domain asymptotics, as $n \to \infty$, the observation points $\sigma_1, \cdots, \sigma_n$ are such that $\min_{i \neq j} d(\sigma_i, \sigma_j)$ is lower bounded and $d(\sigma_i, \sigma_j)$ becomes large as $|i - j|$ grows (thus we cannot keep $N$ fixed as $n \to +\infty$).
Under fixed-domain asymptotics, the sequence (or triangular array) of observation points becomes dense in a fixed bounded domain. For a Gaussian field on $\mathbb{R}^d$, under increasing-domain asymptotics, the true covariance parameter $\theta^*$ can be estimated consistently by maximum likelihood. Furthermore, the maximum likelihood estimator is asymptotically normal [34, 14, 15, 4]. Moreover, prediction performed using the estimated covariance parameter $\hat{\theta}_n$ is asymptotically as good as the one computed with $\theta^*$, as pointed out in [4]. Finally, note that in the symmetric group the fixed-domain framework cannot be considered (contrary to the input space $\mathbb{R}^d$), since $S_N$ is a finite space.
We will consider hereafter the increasing-domain framework. We thus consider a number of observations $n$ that goes to infinity. Hence, the size $N$ of the permutations cannot be fixed, as pointed out above. We thus let the size of the permutations be a function of $n$, written $N_n$, with $N_n \to \infty$ as $n \to \infty$.
To summarize, we consider a sequence of Gaussian processes $Y_n$ on $S_{N_n}$, with $N_n \to +\infty$ as $n \to +\infty$, and a triangular array $(\sigma_i^{(n)})_{i \le n} \subset S_{N_n}$ of observation points. However, for the sake of simplicity, we only write $Y$ and $(\sigma_i)_{i \le n}$, the dependency on $n$ being implicit. We observe values of the Gaussian process at the permutations $\Sigma = (\sigma_1, \cdots, \sigma_n)$, which are assumed to fulfill Conditions 1 and 2 below, stated in terms of constants $\beta > 0$ and $c < \infty$. Here, we recall that $d_\tau$, $d_H$ and $d_S$ are defined in Section 2. Notice that $\beta$ and $c$ are assumed to be independent of $n$.
These conditions are natural under increasing-domain asymptotics. Indeed, Condition 1 provides asymptotic independence for pairs of observations with asymptotically distant indices. It makes it possible to show that the variance of $L_\theta$ and of its gradient converges to 0. Condition 2 ensures the asymptotic discrimination of the covariance parameters (see Lemma 4 in the appendix). These conditions can be ensured with particular choices of sampling schemes for $(\sigma_1, \cdots, \sigma_n)$ (using the distances previously discussed).
As an example, consider the following setting. We fix $k \in \mathbb{N}$.
For $i = 1, \ldots, n$, let $\tau_i \in S_k \setminus \{\mathrm{id}\}$ be a random permutation, such that the $(\tau_i)_i$ are independent (we do not make further assumptions on the law of the $\tau_i$); the observation point $\sigma_i$ then acts as $\tau_i$ on the block $\{(i-1)k + 1, \ldots, ik\}$ of $E_{N_n}$ and as the identity elsewhere.

The following theorems give both the consistency and the asymptotic normality of the estimator when the number of observations increases.

Theorem 1. Let $\hat{\theta}_{ML}$ be defined as in (11), where the distance $d$ used to define the set $\{K_\theta;\ \theta \in \Theta\}$ is $d_\tau$, $d_H$ or $d_S$. Assume that Conditions 1 and 2 hold with the same choice of the distance $d$. Then,
$$\hat{\theta}_{ML} \xrightarrow[n \to \infty]{\mathbb{P}} \theta^*.$$

Theorem 2. Under the assumptions of Theorem 1, let $M_{ML}$ be the $3 \times 3$ matrix defined by
$$(M_{ML})_{i,j} = \frac{1}{2n} \operatorname{Tr}\left( R_{\theta^*}^{-1} \frac{\partial R_{\theta^*}}{\partial \theta_i} R_{\theta^*}^{-1} \frac{\partial R_{\theta^*}}{\partial \theta_j} \right).$$
Then $\liminf_{n \to \infty} \lambda_{\min}(M_{ML}) > 0$, where $\lambda_{\min}$ denotes the smallest eigenvalue. Furthermore,
$$\sqrt{n}\, M_{ML}^{1/2} \left( \hat{\theta}_{ML} - \theta^* \right) \xrightarrow[n \to \infty]{\mathcal{L}} \mathcal{N}(0, I_3).$$

Given the maximum likelihood estimator $\hat{\theta}_n = \hat{\theta}_{ML}$, the value $Y(\sigma_n)$, for any input $\sigma_n \in S_{N_n}$, can be forecasted by plugging the estimated parameter into the conditional expectation expression for Gaussian processes. Hence $Y(\sigma_n)$ is predicted by
$$\hat{Y}_{\hat{\theta}_n}(\sigma_n) = r_{\hat{\theta}_n}(\sigma_n)^\top R_{\hat{\theta}_n}^{-1} y, \qquad (17)$$
with
$$r_\theta(\sigma) = \left( K_{\theta_1, \theta_2}(\sigma, \sigma_i) \right)_{1 \le i \le n}.$$
We point out that $\hat{Y}_{\hat{\theta}_n}(\sigma_n)$ is the conditional expectation of $Y(\sigma_n)$ given $y_1, \cdots, y_n$, when assuming that $Y$ is a centered Gaussian process with covariance function $K_{\hat{\theta}_n}$.
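The plug-in predictor (17) only involves a linear solve. A sketch (with our own helper names), also returning the conditional variance used later for confidence intervals:

import numpy as np

def krige(R_hat, r_hat, k_self, y):
    """Conditional mean (17) and variance at a new permutation, given
    R_hat = (K(sigma_i, sigma_j) + theta3 * 1_{i=j})_{i,j},
    r_hat = (K(sigma_new, sigma_i))_i and k_self = K(sigma_new, sigma_new)."""
    alpha = np.linalg.solve(R_hat, r_hat)
    mean = alpha @ y
    var = max(k_self - r_hat @ alpha, 0.0)  # clip tiny negative round-off values
    return mean, var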
The following theorem shows that the forecast with the estimated parameter behaves asymptotically as if the true covariance parameter were known.

Theorem 3. Under the assumptions of Theorem 1, for any fixed sequence $(\sigma_n)_{n \in \mathbb{N}}$, with $\sigma_n \in S_{N_n}$ for $n \in \mathbb{N}$, we have
$$\hat{Y}_{\hat{\theta}_n}(\sigma_n) - \hat{Y}_{\theta^*}(\sigma_n) \xrightarrow[n \to \infty]{\mathbb{P}} 0.$$

Remark 4. Theorem 3 does not imply that
$$\hat{Y}_{\hat{\theta}_n}(\sigma_n) - \hat{Y}_{\theta^*}(\sigma_n) \xrightarrow[n \to \infty]{\mathbb{P}} 0 \quad \text{when } \sigma_n \text{ is random}, \qquad (19)$$
for instance when $\sigma_n$ is selected from the observations. Indeed, Theorem 3 holds for deterministic sequences $(\sigma_n)_{n \in \mathbb{N}}$ only. It would be interesting, in future work, to extend Theorem 3 to show (19).
The proofs of Theorems 1, 2 and 3 are given in the appendix, in Sections B.2, B.3 and B.4 respectively. They are based on lemmas stated and proved in Section B.1. In [4] and [5], similar results for maximum likelihood are given for Gaussian fields indexed by $\mathbb{R}^d$ and by the set of probability measures on $\mathbb{R}$ (see also [7]). At the beginning of Appendix B, we also discuss the similarities and differences between the proofs of Theorems 1, 2 and 3 and those given in [4] and [5].

3.3 Numerical experiments
As an illustration of Theorem 1, we provide a numerical experiment showing that the maximum likelihood estimator is consistent. We generated the observations as discussed in Section 3, with $k = 3$. For each value of $n$, we estimate the probability $\mathbb{P}(\|\hat{\theta}_n - \theta^*\| > \epsilon)$ by a Monte Carlo method, using a sample of 1000 values of $\mathbf{1}_{\|\hat{\theta}_n - \theta^*\| > \epsilon}$. Figure 1 depicts these estimates for $\epsilon = 0.5$ and $\theta^* = (0.1, 0.8, 0.3)$. In Figure 2, we display the density of the coordinates of the maximum likelihood estimator for $n = 20$, $60$ and $150$. These densities have been estimated with a sample of 1000 values of the maximum likelihood estimator. We observe that the densities can be far from the true parameter for $n = 20$ or $n = 60$, but are quite close to it for $n = 150$. Further, we see that for $n = 150$, the Kendall's tau distance seems to give better estimates of $\theta_3^*$. However, the computation of the distance matrix is much longer with the Kendall's tau distance than with the other distances.
In Figure 3, for a given $\sigma_n$, we display estimates of the probability that the deviation between the prediction of $Y(\sigma_n)$ given in (17) with the parameter $\hat{\theta}_n$ and the prediction of $Y(\sigma_n)$ with the parameter $\theta^*$ exceeds 0.3. Theorem 3 ensures that this probability converges to 0 as $n \to +\infty$.

3.4 Application to the optimization of Latin Hypercube Designs
We consider here an application of Proposition 2 to the search for an optimal Latin Hypercube Design (LHD). A LHD is a design of experiments $(X_j)_{j \le N} \subset [0, 1]^d$ where, for each component $i \in [1 : d]$, the projections of $X_1, \ldots, X_N$ on the component $i$ are equispaced in $[0, 1]$ (see [35]). We will thus consider that each component of each $X_j$ is equal to $k/(N-1)$ for some $k \in [0 : N-1]$. We also remark that we can always permute the variables so that the first component of $X_j$ is equal to $(j-1)/(N-1)$. So, for each LHD $(X_j)_{j \le N}$, there exist $\sigma_2, \ldots, \sigma_d \in S_N$ such that, for all $j \in [1 : N]$, we have
$$X_j = \left( \frac{j-1}{N-1},\ \frac{\sigma_2(j) - 1}{N-1},\ \ldots,\ \frac{\sigma_d(j) - 1}{N-1} \right).$$
Hence, there is a bijection between the set of LHDs with $N$ points and the set $S_N^{d-1}$.
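A sketch of this bijection (the function name is ours): given $d-1$ permutations, the corresponding LHD is assembled column by column.

import numpy as np

def lhd_from_permutations(sigmas, N):
    """sigmas: d-1 permutations of {1, ..., N}; returns the (N, d) design matrix
    with X_j = ((j-1)/(N-1), (sigma_2(j)-1)/(N-1), ..., (sigma_d(j)-1)/(N-1))."""
    cols = [np.arange(N) / (N - 1)]                       # first coordinate
    cols += [(np.asarray(s) - 1.0) / (N - 1) for s in sigmas]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
N, d = 5, 3
X = lhd_from_permutations([rng.permutation(N) + 1 for _ in range(d - 1)], N)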
Now, if $(X_j)_{j \le N}$ is a LHD, we can define its measure of space filling quality as
$$f\left( (X_j)_{j \le N} \right) = \sup_{x \in [0,1]^d} \min_{j \le N} \| x - X_j \|,$$
that is, the largest distance from a point of $[0, 1]^d$ to $(X_j)_{j \le N}$. We remark that LHDs minimizing $f$ are called minimax [40]. Our aim is to find a minimax LHD $(X_j^*)_{j \le N}$. However, given a LHD $(X_j)_{j \le N}$, its quality $f((X_j)_{j \le N})$ is not an explicit quantity and its computation is expensive.
To estimate this quantity, we suggest to generate $N_{tot}$ random points $(x_l)_{l \le N_{tot}}$ uniformly on $[0, 1]^d$, to compute their distances to the LHD and to take the maximum value. This estimation is costly (because of the large number $N_{tot}$) and noisy (because of the randomness of the points $(x_l)_{l \le N_{tot}}$). Thus, we suggest to use a Gaussian process model on $f$ and to apply the Expected Improvement (EI) strategy [26]. Nevertheless, remark that $f$ is a positive function, whereas a Gaussian process realization can take negative values. In this case, different options are possible: firstly, we can ignore the information given by the inequality constraint; secondly, we can use Gaussian processes under inequality constraints (see [6]); thirdly, we can use a transformation of the function to remove the inequality constraint. We choose here the third strategy and we model $\log(f)$ by a Gaussian process realization. We remark that $\log(f)$ can take both positive and negative values.
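For instance, the Monte Carlo estimate of $f$ can be sketched as follows, reusing the design X built above ($N_{tot}$ controls the cost/noise trade-off):

import numpy as np

def estimate_f(X, N_tot=100_000, rng=None):
    """Noisy Monte Carlo estimate of f((X_j)) = sup_x min_j ||x - X_j||."""
    rng = rng or np.random.default_rng()
    x = rng.uniform(size=(N_tot, X.shape[1]))
    nearest = np.sqrt(((x[:, None, :] - X[None, :, :]) ** 2).sum(-1)).min(axis=1)
    return nearest.max()

log_f = np.log(estimate_f(X))   # the quantity modeled by the Gaussian process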
We thus assume that the unknown function $\log(f)$ to minimize is a realization of a Gaussian process. We have to find a positive definite kernel on $S_N^{d-1}$. Thanks to Proposition 2, we have three positive definite kernels on $S_N$, and thus on $S_N^{d-1}$ (taking the tensor product of these kernels). We then apply the EI strategy with these three kernels to find the best LHD within $N_{max}$ calls to the function $f$. The first $N_{max}/2$ LHDs are generated uniformly on $S_N^{d-1}$ and the other ones are generated sequentially by following the EI strategy.
More precisely, for $i \in [N_{max}/2 - 1 : N_{max} - 1]$, let us explain how to choose the $(i+1)$-th observation, when we have observed
$$\left( \sigma_2^{(l)}, \ldots, \sigma_d^{(l)},\ \log f(\sigma_2^{(l)}, \ldots, \sigma_d^{(l)}) \right), \quad l = 1, \ldots, i \qquad (20)$$
(we remark that $f$ can be defined equivalently as a function $f(\sigma_2, \ldots, \sigma_d)$ of $d - 1$ permutations or as a function $f((X_j)_{j \le N})$ of a LHD). We model $\log(f)$ by a realization of a Gaussian process $Z$, with conditional mean $\hat{Z}_i(\sigma)$ and conditional variance $\hat{s}_i^2(\sigma)$ given the observations (20). Then, we let
$$(\sigma_2^{(i+1)}, \ldots, \sigma_d^{(i+1)}) \in \operatorname*{argmax}_{\sigma} EI_i(\sigma),$$
where
$$EI_i(\sigma) = E_i\left[ \left( m_i - Z(\sigma) \right)_+ \right],$$
where $m_i = \min_{l \le i} \log f(\sigma_2^{(l)}, \ldots, \sigma_d^{(l)})$, $(\cdot)_+ = \max(\cdot, 0)$, and $E_i$ is the expectation conditionally to the observations (20). We have an explicit expression of $EI$,
$$EI_i(\sigma) = \left( m_i - \hat{Z}_i(\sigma) \right) \Phi\!\left( \frac{m_i - \hat{Z}_i(\sigma)}{\hat{s}_i(\sigma)} \right) + \hat{s}_i(\sigma)\, \phi\!\left( \frac{m_i - \hat{Z}_i(\sigma)}{\hat{s}_i(\sigma)} \right),$$
where $\phi$ and $\Phi$ are the standard normal density and distribution functions. To choose $(\sigma_2^{(i+1)}, \ldots, \sigma_d^{(i+1)})$, we thus solve an optimization problem for $EI$, which has a very small cost compared to evaluating $f$, since the computation of $EI$ is instantaneous. We choose the set of permutations that maximizes $EI$ over 2000 sets of uniformly distributed permutations.
We refer to [26] for more details on EI. The parameters of the covariance functions are estimated by maximum likelihood at each step.
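A sketch of the closed-form EI above, for the minimization of $\log f$ (the function name is ours):

import numpy as np
from scipy.stats import norm

def expected_improvement(z_hat, s_hat, m_i):
    """EI_i at a candidate: z_hat, s_hat are the Kriging mean and standard
    deviation of Z there, m_i the smallest observed value so far."""
    if s_hat <= 0.0:
        return max(m_i - z_hat, 0.0)
    u = (m_i - z_hat) / s_hat
    return (m_i - z_hat) * norm.cdf(u) + s_hat * norm.pdf(u)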
We run an experiment where we compare the performance of the five following methods:

• Random sampling: generate $N_{max}$ LHDs of the form $\{(X_j^{(i)})_{j \le N};\ i \le N_{max}\}$ by sampling $\sigma_2, \ldots, \sigma_d$ uniformly and independently;

• Simulated annealing, where two LHDs $(\sigma_j)_{2 \le j \le d}$ and $(\sigma'_j)_{2 \le j \le d}$ are neighbours if there exist transpositions $\tau_2, \ldots, \tau_d$ such that, for all $j \in [2 : d]$, we have $\sigma'_j = \sigma_j \tau_j$;

• EI with the Kendall distance;

• EI with the Hamming distance;

• EI with the Spearman distance.
We can see in Figure 4 that the best LHDs are found by EI, particularly with the Spearman distance. Simulated annealing is slightly better than random sampling.
We display in Figure 5 the distributions of the qualities $\{f((X_j^{(i)})_{j \le N});\ i \le N_{max}\}$ for the five methods. We can notice that simulated annealing does not explore the whole set of LHDs and does not find the best minimum. EI performs both minimization and exploration to find better minima. We can then return the best LHD found by EI with the Spearman distance. To conclude, the kernels on permutations provided in Section 2 enable us to use EI, which gives much better results than simulated annealing or random sampling for finding the best LHD.

4 Covariance model for partial ranking

4.1 A new kernel on partial rankings
In applications, it can happen that partial rankings rather than complete rankings are observed. A partial ranking aims at giving an order of preference between different elements of $X$ without comparing all the pairs in $X$. Hence, a partial ranking $R$ is a statement of the form
$$R : X_1 \succ X_2 \succ \cdots \succ X_m,$$
where $m \le N$ and $X_1, \ldots, X_m$ are non-empty disjoint subsets of $X$. The partial ranking means that any element of $X_j$ is preferred to any element of $X_{j+1}$, but the elements of a given $X_j$ cannot be ordered. Given a partial ranking $R$, we consider the following subset of $S_N$:
$$E_R := \left\{ \sigma \in S_N \mid \sigma(x) < \sigma(y) \text{ for any } x \in X_j,\ y \in X_{j'},\ j < j' \right\},$$
that is, the set of complete rankings compatible with $R$. In the statistical literature, there is a natural way to extend a positive definite kernel $K$ on $S_N$ to the set of partial rankings (see [28], [25]). To do so, one considers, for two partial rankings $R$ and $R'$, the following averaged kernel
$$\bar{K}(R, R') := \frac{1}{|E_R| |E_{R'}|} \sum_{\sigma \in E_R} \sum_{\sigma' \in E_{R'}} K(\sigma, \sigma').$$
Here, $|E_R|$ denotes the cardinality of the set $E_R$. Notice that, if $K$ is a positive definite kernel on permutations, then $\bar{K}$ is also a positive definite kernel [24]: indeed, $\bar{K}(R, R')$ is the scalar product of the averaged feature maps $\frac{1}{|E_R|} \sum_{\sigma \in E_R} \Phi(\sigma)$, where $\Phi$ is a feature map of $K$. Observe that the computation of $\bar{K}$ is very costly: we have to sum over $|E_R| |E_{R'}|$ permutations. Several works aim at reducing the computational cost of this kernel (see [28, 30, 31]). However, its efficient computation remains an issue.
In the following, we provide another way to extend the kernels $K_{\theta_1, \theta_2, \theta_3}$ to partial rankings, for which we will provide computational simplifications. First, define the measure of dissimilarity $d_{avg}$ on partial rankings as the mean of the distances $d(\sigma, \sigma')$:
$$d_{avg}(R, R') := \frac{1}{|E_R| |E_{R'}|} \sum_{\sigma \in E_R} \sum_{\sigma' \in E_{R'}} d(\sigma, \sigma'). \qquad (26)$$
Since $d_{avg}(R, R) \neq 0$ in general, we need to define $d_{partial}$ as follows:
$$d_{partial}(R, R') := d_{avg}(R, R') - \frac{1}{2} d_{avg}(R, R) - \frac{1}{2} d_{avg}(R', R'). \qquad (27)$$
Then $d_{partial}$ is a pseudometric on partial rankings (i.e. it satisfies the positivity, the symmetry and the triangular inequality, and vanishes on the diagonal $\{(R, R),\ R \text{ is a partial ranking}\}$).
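For small $N$, (26) and (27) can be evaluated by brute force, which is useful as a reference implementation. A sketch, with our own encoding of partial rankings as ordered lists of disjoint groups (d is any distance of Section 2, e.g. d_kendall above); this enumeration is exactly the cost that Section 4.2 avoids:

import itertools

def compatible_permutations(R, N):
    """E_R: all rank vectors (sigma(1), ..., sigma(N)) compatible with R,
    where R is e.g. [{3}, {1, 2}] for x_3 preferred to the unordered {x_1, x_2}."""
    out = []
    for p in itertools.permutations(range(1, N + 1)):
        blocks = [[p[i - 1] for i in group] for group in R]
        if all(max(a) < min(b) for a, b in zip(blocks, blocks[1:])):
            out.append(p)
    return out

def d_avg(R1, R2, N, d):
    E1, E2 = compatible_permutations(R1, N), compatible_permutations(R2, N)
    return sum(d(s1, s2) for s1 in E1 for s2 in E2) / (len(E1) * len(E2))

def d_partial(R1, R2, N, d):
    # definition (27)
    return (d_avg(R1, R2, N, d)
            - 0.5 * d_avg(R1, R1, N, d) - 0.5 * d_avg(R2, R2, N, d))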
We remark that other metrics on partial rankings are defined in [17], in particular the Hausdorff metrics and the fixed vector metrics (based on the group representation of $S_N$). These two metrics are different from the one defined in (27). Our suggested metric $d_{partial}$ will enable us to define positive definite kernels in Proposition 4. In future work, it would be interesting to study the construction of positive definite kernels based on the Hausdorff and fixed vector metrics.
We further define
$$\bar{K}_{\theta_1, \theta_2, \theta_3}(R, R') := \theta_2 \exp\left( -\theta_1\, d_{partial}(R, R')^{\theta_3} \right). \qquad (28)$$
The next proposition warrants that this last function is indeed a covariance kernel, which will later enable us to define Gaussian processes on partial rankings.
Proposition 4. $\bar{K}_{\theta_1, \theta_2, \theta_3}$ is a positive definite kernel for the Kendall's tau distance, the Hamming distance and the Spearman's footrule distance.

4.2 Kernel computation in partial ranking
At first glance, the computation of the kernel $\bar{K}_{\theta_1, \theta_2, \theta_3}(R, R')$ on partial rankings may still appear very costly, due to the evaluation of $d_{partial}$: a naive evaluation requires averaging $d$ over $|E_R| |E_{R'}|$ pairs of permutations. However, this computation can be simplified considerably. As we will show in this subsection, the mean of the distances is much easier to compute than the mean of exponentials of distances. We write $d_{\tau, avg}$ (resp. $d_{H, avg}$ and $d_{S, avg}$) for the average distance in (26) when the distance on the permutations is $d_\tau$ (resp. $d_H$ and $d_S$).
To begin with, let us consider the case of top-$k$ partial rankings. A top-$k$ partial ranking (or top-$k$ list) is a partial ranking of the form
$$x_{i_1} \succ x_{i_2} \succ \cdots \succ x_{i_k} \succ \{\text{the remaining items}\},$$
where $i_1, \ldots, i_k$ are pairwise distinct. It can be seen as the "highest rankings". In order to alleviate the notation, we simply write $I = (i_1, \ldots, i_k)$. The following proposition shows that the computational cost of evaluating $d_{avg}$ (and so the kernel values) can be reduced when the partial rankings are in fact top-$k$ partial rankings. Before stating this proposition, let us define some more mathematical objects. Let $\{j_1, \ldots, j_p\} := I \cap I'$ be the set of items ranked by both $I$ and $I'$, where $j_1 < j_2 < \cdots < j_p$ and $p$ is an integer no larger than $k$. For $l = 1, \cdots, p$, let $c_{j_l}$ (resp. $c'_{j_l}$) denote the rank of $j_l$ in $I$ (resp. in $I'$). Further, let $r := k - p$ and define $\tilde{I}$ (resp. $\tilde{I}'$) as the complementary set of $\{j_1, \ldots, j_p\}$ in $I$ (resp. in $I'$). Writing these two sets in ascending order, we may finally define, for $j = 1, \cdots, r$, $u_j$ (resp. $u'_j$) as the rank in $I$ (resp. $I'$) of the $j$-th element of $\tilde{I}$ (resp. $\tilde{I}'$).
Example. Assume that $N = 7$, $I = (3, 2, 1, 4, 5)$ and $I' = (3, 5, 1, 6, 2)$. We have $(j_1, j_2, j_3, j_4) = (1, 2, 3, 5)$ (the items ranked by both $I$ and $I'$, in increasing order). Thus, $(c_{j_1}, c_{j_2}, c_{j_3}, c_{j_4}) = (3, 2, 1, 5)$ and $(c'_{j_1}, c'_{j_2}, c'_{j_3}, c'_{j_4}) = (3, 5, 1, 2)$. Moreover, $\tilde{I} = \{4\}$ and $\tilde{I}' = \{6\}$, so that $r = 1$, $u_1 = 4$ and $u'_1 = 4$. Notice that the sequences $(c_{j_l})$, $(c'_{j_l})$ and $(u_j)$, $(u'_j)$ are easily computable, and so is $d_{avg}(I, I')$. To compute the pseudometric $d_{partial}$ defined in (27), we also need to compute $d_{\tau, avg}$ on the diagonal $\{(I, I) \mid I \text{ is a top-}k \text{ partial ranking}\}$. The following corollary gives this computation.
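These sequences are straightforward to compute; the following sketch (the function name is ours) reproduces the example above:

def topk_sequences(I, J):
    """Return (c_{j_l}), (c'_{j_l}), (u_j), (u'_j) for two top-k lists I, J."""
    rank_I = {item: pos + 1 for pos, item in enumerate(I)}
    rank_J = {item: pos + 1 for pos, item in enumerate(J)}
    common = sorted(set(I) & set(J))                     # j_1 < ... < j_p
    c  = [rank_I[j] for j in common]
    cp = [rank_J[j] for j in common]
    u  = [rank_I[i] for i in sorted(set(I) - set(J))]    # I-tilde in ascending order
    up = [rank_J[i] for i in sorted(set(J) - set(I))]    # I'-tilde in ascending order
    return c, cp, u, up

c, cp, u, up = topk_sequences((3, 2, 1, 4, 5), (3, 5, 1, 6, 2))
assert (c, cp, u, up) == ([3, 2, 1, 5], [3, 5, 1, 2], [4], [4])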
Corollary 1. Let $I$ be a top-$k$ partial ranking. Then,
$$d_{\tau, avg}(I, I) = \frac{(N - k)(N - k - 1)}{4}.$$
(Indeed, two independent uniform elements of $E_I$ agree on the ranked items, and each of the $\binom{N-k}{2}$ pairs of unranked items is discordant with probability $1/2$.)

Remark 5. Results similar to Proposition 5 are stated in Sections III.B and III.C of [17], for the Hausdorff metrics and the fixed vector metrics respectively.
In the case of the Hamming distance, we may go one step further and provide a simpler computational formula for the average distance between two partial rankings, whenever their associated partitions share the same numbers of members (see Proposition 6 below). More precisely, let $R_1 : X_1 \succ \cdots \succ X_k$ and $R_2 : X'_1 \succ \cdots \succ X'_k$ be two partial rankings with the same number $k$ of groups. Assume also that, for $j = 1, \cdots, k$, $|X_j| = |X'_j|$, and denote by $\gamma_j$ this integer. Obviously, $N = \sum_{j=1}^k \gamma_j$, so that $\gamma := (\gamma_j)_j$ is an integer partition of $N$. For $j = 1, \cdots, k$, let $\Gamma_j$ be the set of all integers lying in $\left[ \sum_{l=1}^{j-1} \gamma_l + 1,\ \sum_{l=1}^{j} \gamma_l \right]$. Set further
$$S_\gamma := S_{\Gamma_1} \times \cdots \times S_{\Gamma_k},$$
where $S_{\Gamma_i}$ is the set of permutations of $\Gamma_i$. Notice that $S_\gamma$ is nothing more than the subgroup of $S_N$ letting the sets $\Gamma_1, \ldots, \Gamma_k$ invariant. With these extra notations and definitions, we are now able to compute $d_{H, avg}(R_1, R_2)$.

Proposition 6. In the previous setting, we have
$$d_{H, avg}(R_1, R_2) = N - \sum_{l=1}^{N} \frac{\mathbf{1}_{\Gamma(\sigma_1(l)) = \Gamma(\sigma_2(l))}}{\gamma_{\Gamma(\sigma_1(l))}},$$
for arbitrary $\sigma_1 \in E_{R_1}$ and $\sigma_2 \in E_{R_2}$, where, for $1 \le l \le N$, $\Gamma(l)$ is the integer $j$ such that $l \in \Gamma_j$.
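As a numerical sanity check of Proposition 6, $d_{H,avg}$ can also be estimated by Monte Carlo: sampling uniformly from $E_R$ amounts to shuffling each group $X_j$ onto the position block $\Gamma_j$ independently. A sketch (helper names are ours):

import random

def sample_from_E(R, blocks, rng):
    """Uniform draw from E_R: R is a list of groups (lists of items), blocks
    the corresponding position blocks Gamma_j; returns sigma as {item: rank}."""
    sigma = {}
    for group, block in zip(R, blocks):
        positions = list(block)
        rng.shuffle(positions)
        sigma.update(zip(group, positions))
    return sigma

def mc_d_hamming_avg(R1, R2, blocks, n_samples=20_000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        s1, s2 = sample_from_E(R1, blocks, rng), sample_from_E(R2, blocks, rng)
        total += sum(s1[i] != s2[i] for i in s1)
    return total / n_samples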

4.3 Numerical experiments
We have proposed in Section 4.1 a new kernel $\bar{K}_{\theta_1, \theta_2, \theta_3}$, defined by (28), on partial rankings. We showed in Section 4.2 that in several cases (for example with top-$k$ partial rankings), the computation of this kernel can be reduced drastically. Another direction is given in [25], which considers the averaged Kendall kernel and reduces its computation on top-$k$ partial rankings. This kernel is available in the R package kernrank. We write $\bar{K}$ for the averaged Kendall kernel, and we define $\bar{K}_{\theta_1} := \theta_1 \bar{K}$.
In this section, we compare our new kernel $\bar{K}_{\theta_1, \theta_2, \theta_3}$ with the averaged Kendall kernel $\bar{K}_{\theta_1}$ in a numerical experiment where an objective function indexed by top-$k$ partial rankings is predicted by Kriging. We take $N = 10$ and, for simplicity, the same value $k = 4$ for all the top-$k$ partial rankings. For a top-$k$ partial ranking $I = (i_1, i_2, i_3, i_4)$, the objective function to predict is $f(I) := 2 i_1 + i_2 - i_3 - 2 i_4$. We make 500 noisy observations $(y_i)_{i \le 500}$ with $y_i = f(I_i) + \epsilon_i$, where $(I_i)_{i \le 500}$ are i.i.d. uniformly distributed top-$k$ partial rankings and $(\epsilon_i)_{i \le 500}$ are i.i.d. $\mathcal{N}(0, \lambda^2)$, with $\lambda = 1/2$. As in Section 3, we estimate $(\theta, \lambda)$ by maximum likelihood. Then, we compute the predictions $(\hat{y}'_i)_{i \le 500}$ of $y' = (y'_i)_{i \le 500}$, the observations corresponding to 500 other test points $(I'_i)_{i \le 500}$ that are i.i.d. uniform top-$k$ partial rankings. For the four kernels (our kernel $\bar{K}_{\theta_1, \theta_2, \theta_3}$ with the three distances, and the averaged Kendall kernel $\bar{K}_{\theta_1}$), we provide the rate of test points that fall in the 90% confidence interval, together with the coefficient of determination $R^2$ of the predictions at the test points. Recall that
$$R^2 = 1 - \frac{\sum_{i=1}^{500} (y'_i - \hat{y}'_i)^2}{\sum_{i=1}^{500} (y'_i - \bar{y}')^2},$$
where $\bar{y}'$ is the average of $y'$. The results are provided in Table 1.
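The two reported scores can be computed as follows (a sketch, assuming arrays of test values, Kriging means and Kriging standard deviations):

import numpy as np

def r2_score(y_true, y_pred):
    return 1 - ((y_true - y_pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

def coverage_90(y_true, y_pred, s_pred):
    """Fraction of test points inside the 90% Gaussian confidence interval."""
    return np.mean(np.abs(y_true - y_pred) <= 1.645 * s_pred)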
The rate of test points that fall in the 90% confidence interval is close to 90% for the four kernels. We can deduce that the parameters $(\theta, \lambda)$ are well estimated by maximum likelihood, even for the averaged Kendall kernel $\bar{K}_{\theta_1}$.
However, we can see that the coefficient of determination of the averaged Kendall kernel $\bar{K}_{\theta_1}$ is close to 0: the predictions given by this kernel are nearly as bad as predicting with the empirical mean. By contrast, the coefficient of determination of our kernels is larger than 0.9 for the Kendall distance, and larger than 0.99 for the Hamming and Spearman distances. This means that the predictions given by our kernels are much better than the empirical mean.
To conclude, we provide a class of positive definite kernels $\bar{K}_{\theta_1, \theta_2, \theta_3}$ which appears to be significantly more efficient than the averaged Kendall kernel $\bar{K}_{\theta_1}$ for Gaussian process models on partial rankings.

5 Conclusion
In this paper, we provide a Gaussian process model for permutations. Following the recent works [25] and [32], we propose kernels to model the covariance of such processes and show the relevance of these choices. Based on three distances on the set of permutations (Kendall's tau, the Hamming distance and Spearman's footrule distance), we obtain parametric families of relevant covariance models. To show the practical efficiency of these parametric families, we apply them to the optimization of Latin Hypercube Designs. In this framework, we prove, under some assumptions on the set of observations, that the parameters of the model can be estimated and that the process can be forecasted using linear combinations of the observations, with asymptotic efficiency. Such results extend the well-known properties of Kriging methods to the case where the process is indexed by ranks, and make it possible to tackle a large variety of problems. We remark that our asymptotic setting corresponds to the increasing-domain asymptotic framework for Gaussian processes on the Euclidean space. It would be interesting to extend our results to more general sets of permutations, under designs that do not necessarily satisfy Conditions 1 and 2.
We also show that the Gaussian process framework can be extended to the case of partially observed ranks, which corresponds to many practical situations. We provide new kernels on partial rankings, together with results that significantly simplify their computation, and we show the efficiency of these kernels in simulations. We leave a specific asymptotic study of Gaussian processes indexed by partial rankings open for further research.
As highlighted in [33], data consisting of rankings arise in many different fields. Our suggested kernels on total and partial rankings could lead to various applications to real ranking data. We treated the case of regression in Sections 3.3 and 4.3. In Section 3.4, we used these kernels for an optimization problem. One could also use our suggested kernels in classification, as is done in [25], [32] or [28], and also in Gaussian process based classification [39] with ranking inputs.
Case of the other distances. For the Hamming distance and the Spearman's footrule distance, we show that the kernel $K$ is strictly positive definite on the set $F$ of the functions from $[1 : N]$ to $[1 : N]$. Indeed, if "for all $n \in \mathbb{N}$ and all pairwise distinct $f_1, \cdots, f_n \in F$, $(K(f_i, f_j))_{1 \le i,j \le n}$ is a symmetric positive definite matrix", then "for all $n \in \mathbb{N}$ and all $\sigma_1, \cdots, \sigma_n \in S_N \subset F$ such that $\sigma_i \neq \sigma_j$ if $i \neq j$, $(K(\sigma_i, \sigma_j))_{1 \le i,j \le n}$ is a symmetric positive definite matrix". Now, to prove the strict positive definiteness of $K$ on $F$, it suffices to index the elements of $F$ as $f_1, \cdots, f_{N^N}$ and to prove that the matrix $M := (K(f_i, f_j))_{1 \le i,j \le N^N}$ is symmetric positive definite. We index the elements of $F$ using a bijective map $J_N : F \to [1 : N^N]$. Thus, it suffices to show that the $N^N \times N^N$ matrix $M$ defined by $M_{i,j} := K(J_N^{-1}(i), J_N^{-1}(j))$ is positive definite for these distances. Straightforward computations show that:

• For the Hamming distance, $M$ is the Kronecker product of $N$ matrices, all equal to $(\exp(-\nu \mathbf{1}_{i \neq j}))_{i,j \in [1:N]}$.

• For the Spearman's footrule distance, $M$ is the Kronecker product of $N$ matrices, all equal to $(\exp(-\nu |i - j|))_{i,j \in [1:N]}$.
In all cases, $M$ is a Kronecker product of positive definite matrices, and is thus itself a positive definite matrix.

Proof of Proposition 2
Proof. Let us prove that $d$ is a negative definite kernel, that is, for all $\sigma_1, \ldots, \sigma_k \in S_N$ and all $c_1, \ldots, c_k \in \mathbb{R}$ such that $\sum_{i=1}^k c_i = 0$, we have $\sum_{i,j=1}^k c_i c_j d(\sigma_i, \sigma_j) \le 0$. Let $c_1, \ldots, c_k \in \mathbb{R}$ with $\sum_{i=1}^k c_i = 0$ and let $\sigma_1, \ldots, \sigma_k \in S_N$. By Lemma 1, we have
$$\sum_{i,j=1}^k c_i c_j d(\sigma_i, \sigma_j) = C_N \left( \sum_{i=1}^k c_i \right)^2 - \left\| \sum_{i=1}^k c_i \Phi(\sigma_i) \right\|^2 = - \left\| \sum_{i=1}^k c_i \Phi(\sigma_i) \right\|^2 \le 0,$$
since $\sum_{i=1}^k c_i$ is equal to 0. So, $d$ is a negative definite kernel. Hence $d^{\theta_3}$ is a negative definite kernel for all $\theta_3 \in [0, 1]$ (see for example Property 21.5.9 in [38]). The function $F : t \mapsto \theta_2 \exp(-\theta_1 t)$ is completely monotone; thus, using Schoenberg's theorem (see [8] for the definitions of these notions and for Schoenberg's theorem), $K_{\theta_1, \theta_2, \theta_3}$ is a positive definite kernel.

Proof of Proposition 3
Proof. Let us write, with the notation of Lemma 1, ... Then, ...

Proof of Proposition 4
Proof. Let us prove that $d_{partial}$ is a negative definite kernel. With the notation of Lemma 1, we define
$$\Psi(R) := \frac{1}{|E_R|} \sum_{\sigma \in E_R} \Phi(\sigma),$$
so that, by Lemma 1, $d_{avg}(R, R') = C_N - \langle \Psi(R), \Psi(R') \rangle$, and thus $d_{partial}(R, R') = \frac{1}{2} \| \Psi(R) - \Psi(R') \|^2$. Let $(c_1, \ldots, c_k) \in \mathbb{R}^k$ such that $\sum_{i=1}^k c_i = 0$. We have
$$\sum_{i,j=1}^k c_i c_j d_{partial}(R_i, R_j) = - \left\| \sum_{i=1}^k c_i \Psi(R_i) \right\|^2 \le 0.$$
So, $d_{partial}$ is a negative definite kernel, and we may conclude as in the proof of Proposition 2.

Proof of Proposition 5
Proof. Assume that $\sigma$ (resp. $\sigma'$) is a uniform random variable on $E_I$ (resp. $E_{I'}$).
1. Consider first the case where both items of the pair are ranked by the two top-$k$ partial rankings, i.e. the pair is $(j_l, j_{l'})$ with $1 \le l < l' \le p$. Such a pair is discordant if and only if $(c_{j_l} < c_{j_{l'}}$ and $c'_{j_l} > c'_{j_{l'}})$ or $(c_{j_l} > c_{j_{l'}}$ and $c'_{j_l} < c'_{j_{l'}})$. Thus, the total contribution of the pairs in this case is
$$\sum_{1 \le l < l' \le p} \mathbf{1}_{(c_{j_l} < c_{j_{l'}},\ c'_{j_l} > c'_{j_{l'}}) \text{ or } (c_{j_l} > c_{j_{l'}},\ c'_{j_l} < c'_{j_{l'}})}.$$
2. Consider the case where $a$ and $b$ both appear in one top-$k$ partial ranking (say $I$) and exactly one of $a$ or $b$, say $a$, appears in the other. Let us call $P_2$ the set of pairs $(a, b)$ such that $a < b$ and $(a, b)$ is in this case. We have ... Let us compute the first sum. Recall that $\tilde{I} = \{i_{u_1}, \ldots, i_{u_r}\}$.
Proof. Let $N$ be the norm on $\mathbb{R}^3$ defined, with $c$ as in Condition 2, by ... Let $\alpha > 0$. We want to find a positive lower bound, over $\theta \in \Theta \setminus B_N(\theta^*, \alpha)$, of ..., where $B_N(\theta^*, \alpha)$ is the ball of center $\theta^*$ and radius $\alpha$ for the norm $N$. Let $\theta \in \Theta \setminus B_N(\theta^*, \alpha)$.
Finally, with probability at least $\alpha$, ..., which is contradicted using (47) and recalling Lemma 4. In the above display, we recall that the norm $|\cdot|$ for matrices is defined at the beginning of Appendix B. It remains to prove (47) and (48).
Step 2: we prove (47). For all $\sigma \in (S_{N_n})^n$ satisfying Conditions 1 and 2, recalling that $\|\cdot\|_F^2$ and $\|\cdot\|$ are defined at the beginning of Appendix B, we have ... The previous display holds true because, with $R_{\theta^*}^{1/2}$ the unique matrix square root of $R_{\theta^*}$, we have ... Then, we have the relation $\|AB\|_F^2 \le \|A\|^2 \|B\|_F^2$. Thus, we have ... Here, we have used $z^\top A z \le \|z\|^2 \|A\|$ for a symmetric positive definite matrix $A$, the fact that $\|AB\| \le \|A\| \|B\|$ for matrices $A$ and $B$, and the fact that, by Cauchy-Schwarz, ... Hence, $\sup_{\theta \in \Theta} \left| \partial L_\theta / \partial \theta_i \right|$ is bounded in probability conditionally to $\Sigma = \sigma$, uniformly in $\sigma$. Indeed, $z \sim \mathcal{N}(0, I_n)$, so that $\frac{1}{n} \|z\|^2$ is bounded in probability, conditionally to $\Sigma$ and uniformly in $\Sigma$.
By contradiction, let us now assume that $\sqrt{n}\, M_{ML}^{1/2}(\hat{\theta}_{ML} - \theta^*)$ does not converge in distribution to $\mathcal{N}(0, I_3)$. Then, there exist a bounded measurable function $g : \mathbb{R}^3 \to \mathbb{R}$ and $\xi > 0$ such that, up to extracting a subsequence, we have ... Here, the vector involved is a function of $\partial r_\theta(\sigma_n) / \partial \theta_i$ and $\partial^2 r_\theta(\sigma_n) / \partial \theta_j \partial \theta_i$, and $W_\theta$ is a product of the matrices $R_\theta^{-1}$, $\partial R_\theta / \partial \theta_i$ and $\partial^2 R_\theta / \partial \theta_j \partial \theta_i$. It is sufficient to show that ... From the Fubini-Tonelli theorem (see [9]), we have ... There exists a constant $c$ such that, for $X$ a centred Gaussian random variable, ... From Lemma 3, there exists $B < \infty$ such that, a.s., ... Thus ... Finally, for some $\alpha \in [0 : 2]^3$ such that $|\alpha| \le 2$, we have ... Thus, it suffices to bound this term, using the proof of Lemma ... That concludes the proof.
This shows that Conditions 1 and 2 are satisfied with $\beta = 1$ and $c = 1 + k(k-1)/2$ for the Kendall's tau distance, $c = 2 + k$ for the Hamming distance, and $c = 2 + k^2$ for the Spearman's footrule distance. Indeed, the three distances on $S_k$ are upper-bounded by $k(k-1)/2$, $k$ and $k^2$, respectively.

Figure 1: Monte Carlo estimates of $\mathbb{P}(\|\hat{\theta}_n - \theta^*\| > 0.5)$ for different values of $n$, the number of observations, with $\theta^* = (0.1, 0.8, 0.3)$, for the Kendall's tau distance, the Hamming distance and the Spearman's footrule distance (from left to right).

Figure 2: Density of the coordinates of $\hat{\theta}_n$ for $n = 20$ (in red), $n = 60$ (in blue) and $n = 150$ (in green) observations, with $\theta^* = (0.1, 0.8, 0.3)$ (represented by the red vertical line). We used the Kendall's tau distance, the Hamming distance and the Spearman's footrule distance (from left to right).

Figure 4: Minimal quality of LHD found by the five methods.

Figure 5: Distributions of the quality of LHDs for the five methods.

Proposition 5. Let $I$ and $I'$ be two top-$k$ partial rankings. Set $N' := N - k - 1$ and $m := N - |I \cup I'|$. Then,
$$d_{\tau, avg}(I, I') = \sum_{1 \le l < l' \le p} \mathbf{1}_{(c_{j_l} < c_{j_{l'}},\ c'_{j_l} > c'_{j_{l'}}) \text{ or } (c_{j_l} > c_{j_{l'}},\ c'_{j_l} < c'_{j_{l'}})} + \cdots
$$

Lemma 1. For all three distances, there exist constants $d_N \in \mathbb{N}^*$, $C_N \in \mathbb{R}$ and a function $\Phi : S_N \to \mathbb{R}^{d_N}$ such that $d(\sigma, \sigma') = C_N - \langle \Phi(\sigma), \Phi(\sigma') \rangle$. Here $\langle \cdot, \cdot \rangle$ denotes the standard scalar product on $\mathbb{R}^{d_N}$.

Let us define $f(t) := -\ln(t) + t - 1$. The function $f$ is minimal at $1$, with $f(1) = 0$, $f'(1) = 0$ and $f''(1) = 1$. So there exists $A > 0$ such that, for all $t \in [a, b]$, $f(t) \ge A(t-1)^2$. The trace term $\operatorname{Tr}(I_n - R \ldots)$ is then lower-bounded by Lemma 2, writing $a = A \theta_{3,\max}^{-2}$, and recalling that $|A|^2 = \frac{1}{n} \|A\|_F^2$ for a matrix $A$.

using Lemma 2, and where we recall that $|\cdot|^2 = \frac{1}{n} \|\cdot\|_F^2$ (see the beginning of Appendix B). Hence, from Lemma 5, there exists $C_{min} > 0$ such that $\liminf_{n \to \infty} \lambda_{\min}(M_{ML}) \ge C_{min}$.

Table 1: Rate of test points that are in the 90% confidence interval, and coefficient of determination, for the four kernels.