On kernel methods for covariates that are rankings

Permutation-valued features arise in a variety of applications, either directly, when preferences are elicited over a collection of items, or indirectly, when numerical ratings are converted to a ranking. To date, there has been relatively limited study of regression, classification, and testing problems based on permutation-valued features, as opposed to permutation-valued responses. This paper studies the use of reproducing kernel Hilbert space methods for learning from permutation-valued features. These methods embed the rankings into an implicitly defined function space, and allow for efficient estimation of regression and test functions in this richer space. Our first contribution is to characterize both the feature spaces and spectral properties associated with two kernels for rankings, the Kendall and Mallows kernels. Using tools from representation theory, we explain the limited expressive power of the Kendall kernel by characterizing its degenerate spectrum and, in sharp contrast, we prove that the Mallows kernel is universal and characteristic. We also introduce families of polynomial kernels that interpolate between the Kendall (degree one) and Mallows (infinite degree) kernels. We show the practical effectiveness of our methods via applications to Eurobarometer survey data as well as a MovieLens ratings dataset.


Introduction
Ranking data are ubiquitous, arising in any context in which preferences are expressed over a collection of alternatives. Familiar examples include voting for candidates, rating consumer items, and expressing preferences over web search listings. A theoretical understanding of how to aggregate and learn based on people's preferences is of widespread scientific and economic interest.
There are many ways in which to analyze and model ranking data. While classical approaches have generally been based on parametric models (including the Bradley-Terry, Plackett-Luce, and Thurstone models [1, 2, 3, 4, 5]), recent work has focused on nonparametric modeling ideas. Such is the case in the current paper, where our nonparametric framework is based on reproducing kernel Hilbert spaces. We view kernel-based methodology as particularly appropriate for ranking problems: kernels allow us to take learning problems involving ranking data from the cumbersome setting of the non-Abelian symmetric group to the familiar setting of Hilbert spaces, allowing direct access to the extensive suite of regression, classification, testing, and clustering techniques available for Hilbert spaces.
In the setting of Euclidean spaces there are a number of standard kernels that have proved useful in practice (including linear, polynomial, and radial basis function kernels), and for which there is a rich theoretical understanding. In the case of the symmetric group there are also a number of standard kernels, including the Kendall, Mallows, and diffusion kernels. While aspects of these kernels have been analyzed theoretically [see, e.g., 6, 7], our theoretical understanding of them is currently incomplete. In particular, the understanding of these kernels in terms of their feature maps and spectral properties is quite limited.
The goal of the current paper is to analyze the spectral properties of Kendall and Mallows kernels, as well as a new class of polynomial kernels. All of these kernels have the desirable property of being right-invariant and thus (as we discuss below) are amenable to spectral analysis via a non-Abelian variant of Bochner's theorem. Such a spectral analysis (e.g., an understanding of the extent to which these kernels are bandlimited) directly informs analyses of the statistical properties of these kernels; for example, providing insight into the ability of kernels on rankings to discriminate between distributions over rankings, and permitting analyses of statistical convergence rates.
In brief, our main contributions are the following: 1. We fully characterize the Fourier spectrum of the Kendall kernel. In particular, we show that it has only two nonzero irreducible representations, both of which turn out to be exactly rank-one matrices; this degeneracy suggests that the kernel is useful only in a limited range of problems. 2. We prove that the Mallows kernel is both universal and characteristic, so that the MMD it induces is a metric on probability distributions over rankings. 3. We introduce families of polynomial kernels that interpolate between the Kendall (degree one) and Mallows (infinite degree) kernels.
Permutations and rankings
Think of the set [d] := {1, 2, . . . , d} as the set of labels for some collection of d objects. For any permutation σ : [d] → [d], we can view σ(i) as the rank of object i. The set of all permutations forms a group under the standard function composition σ ∘ σ′. This group is known as the symmetric group on d elements, and it is denoted by S_d.
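To make these definitions concrete, here is a minimal sketch (ours, not from the paper) that represents a permutation as a tuple whose i-th entry is the rank assigned to object i+1, and checks the group structure:

```python
from itertools import permutations

d = 4
# sigma[i] is the rank sigma(i+1) assigned to object i+1 (0-indexed internally).
S_d = list(permutations(range(d)))
assert len(S_d) == 24  # |S_4| = 4!

def compose(sigma, tau):
    """Function composition: (sigma o tau)(i) = sigma(tau(i))."""
    return tuple(sigma[tau[i]] for i in range(len(sigma)))

def inverse(sigma):
    inv = [0] * len(sigma)
    for i, s in enumerate(sigma):
        inv[s] = i
    return tuple(inv)

identity = tuple(range(d))
sigma, tau = (2, 0, 3, 1), (1, 0, 2, 3)
assert compose(sigma, inverse(sigma)) == identity
# The symmetric group is non-abelian: composition order matters.
assert compose(sigma, tau) != compose(tau, sigma)
```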

Kernels depending on discordant pairs
A kernel k : S_d × S_d → R is right-invariant if k(σ, σ′) = k(σ ∘ π, σ′ ∘ π) for all permutations σ, σ′, π ∈ S_d. This property holds if and only if k(σ, σ′) = κ(σ′ ∘ σ^{−1}) for some function κ : S_d → R. We overload notation for simplicity by referring to κ as k (usage will be clear from context). Right-invariance of kernels is desirable for applications involving rankings, since it corresponds to invariance under a relabeling of the objects being ranked. Furthermore, as we discuss later, we can use Fourier analysis to study such kernels. The kernels of interest in this paper all measure the similarity between two rankings through the number of pairs of objects that they order in the same way or in opposite ways. More precisely, letting n_d(σ, σ′) and n_c(σ, σ′) denote (respectively) the number of discordant and concordant pairs between permutations σ and σ′, we have the relations

k_τ(σ, σ′) = (n_c(σ, σ′) − n_d(σ, σ′)) / \binom{d}{2},    (3a)

k_m^ν(σ, σ′) = exp(−ν n_d(σ, σ′)),  ν > 0,    (3b)

defining the Kendall and Mallows kernels, respectively. Since n_c(σ, σ′) + n_d(σ, σ′) = \binom{d}{2}, the Kendall kernel can equivalently be written as k_τ(σ, σ′) = 1 − 2 \binom{d}{2}^{−1} n_d(σ, σ′).
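The Kendall and Mallows kernels, and their right-invariance, can be checked directly in code; the following sketch (the function names are ours) counts discordant pairs explicitly:

```python
import math
from itertools import combinations

def n_discordant(sigma, tau):
    """Number of pairs {i, j} that sigma and tau order in opposite ways."""
    return sum(
        1
        for i, j in combinations(range(len(sigma)), 2)
        if (sigma[i] - sigma[j]) * (tau[i] - tau[j]) < 0
    )

def kendall_kernel(sigma, tau):
    """k_tau = (n_c - n_d) / C(d,2) = 1 - 2 n_d / C(d,2)."""
    pairs = math.comb(len(sigma), 2)
    return 1.0 - 2.0 * n_discordant(sigma, tau) / pairs

def mallows_kernel(sigma, tau, nu=1.0):
    """k_m^nu = exp(-nu * n_d)."""
    return math.exp(-nu * n_discordant(sigma, tau))

# Right-invariance: relabeling the objects by pi leaves both kernels unchanged,
# since sigma o pi and tau o pi disagree on exactly the same relabeled pairs.
def relabel(sigma, pi):
    return tuple(sigma[pi[i]] for i in range(len(sigma)))

sigma, tau, pi = (0, 2, 1, 3), (3, 1, 0, 2), (1, 3, 2, 0)
assert kendall_kernel(sigma, tau) == kendall_kernel(relabel(sigma, pi), relabel(tau, pi))
assert mallows_kernel(sigma, tau) == mallows_kernel(relabel(sigma, pi), relabel(tau, pi))
```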

Fourier analysis on right-invariant kernels
In this section, we set out the basic definitions and concepts that will allow us to state our theorems; these concepts are described at an elementary level by [8] and [9], with a concise summary given by [10]. The proofs of our results require more extensive machinery from the theory of Fourier analysis on groups, as found, for example, in [11], [12], or [13]. We introduce these more advanced concepts on an as-needed basis throughout the paper and in Appendix A.
The Fourier transform of a function f : S_d → C takes the form

f̂(ρ_λ) = Σ_{σ ∈ S_d} f(σ) ρ_λ(σ).    (4)

In contrast with the Fourier transform for real-valued functions, instead of being indexed by a frequency vector ξ, this Fourier transform is indexed by λ, a partition of d, that is, a nonincreasing sequence of positive integers that sum to d. Furthermore, instead of the standard exponential basis functions e^{iξ^⊤x} in C, the terms ρ_λ are matrix-valued functions in C^{d_λ × d_λ}. Let us now make these ideas more precise. A representation of the symmetric group is a matrix-valued function ρ : S_d → C^{d_ρ × d_ρ} such that ρ(σ) is invertible and ρ(σσ′) = ρ(σ)ρ(σ′) for all permutations σ, σ′ ∈ S_d. The integer d_ρ is called the dimension of the representation. As an immediate consequence of the definitions, it follows that ρ(e) is the identity matrix and ρ(σ^{−1}) = ρ(σ)^{−1}. A representation ρ is reducible if it is equivalent to the direct sum of two representations. To be more explicit, a representation ρ is reducible if there exist two representations ρ_1 and ρ_2 and an invertible matrix C ∈ C^{d_ρ × d_ρ} such that

ρ(σ) = C^{−1} (ρ_1(σ) ⊕ ρ_2(σ)) C    for all σ ∈ S_d.

A representation that is not reducible is called irreducible; for brevity, we refer to irreducible representations as irreps. The symmetric group has a finite number of distinct irreps (an explanation of the meaning of "distinct" is provided in Appendix A), and these irreps have a standard indexing by finite sequences of positive integers λ = (λ_1, λ_2, . . . , λ_r) such that λ_1 ≥ λ_2 ≥ · · · ≥ λ_r and Σ_{i=1}^r λ_i = d. Such sequences are called partitions of d, and λ ⊢ d means that λ is a partition of d.
Returning to equation (4) and using the terminology just introduced, the Fourier transform of a function on the symmetric group can be described as a mapping from the irreps ρ_λ to matrices in C^{d_λ × d_λ}. This version of the Fourier transform shares many properties with its counterpart over the real numbers, including the Fourier inversion formula and the Plancherel formula. For more on these properties and other results needed in this paper, we refer the reader to Appendix A.
Before turning to our main results, it is convenient to introduce some notation for a partial ordering of the partitions of d. Given any two partitions λ = (λ_1, λ_2, . . . , λ_r) and µ = (µ_1, µ_2, . . . , µ_l), we say that λ dominates µ, written λ ⊵ µ, if Σ_{i=1}^k λ_i ≥ Σ_{i=1}^k µ_i for all k ≥ 1, where the shorter partition is padded with zeros. We say λ ⊳ µ whenever it is not true that λ ⊵ µ. The irreps of the symmetric group inherit the same partial ordering. For future reference, we also state a version of Bochner's theorem for right-invariant kernels on the symmetric group.
Theorem (Bochner's theorem for the symmetric group). A right-invariant kernel k : S_d × S_d → C is positive definite if and only if the matrix k̂(ρ_λ) is positive semi-definite for every partition λ ⊢ d.

Main results
Equipped with this background, we now turn to statements of our main results, as well as discussion of some of their consequences.

Fourier analysis of the Kendall and Mallows kernels
We begin with a theorem that characterizes the spectrum of the Kendall kernel.
Theorem 2. The Kendall kernel has the following properties: (a) When d = 2, the Fourier transform of the Kendall kernel is equal to 0 at ρ_(2) and equal to 2 at ρ_(1,1).
(b) When d ≥ 3, the Fourier transform k τ of the Kendall kernel is zero at all irreducible representations except for ρ (d−1,1) and ρ (d−2,1,1) . Furthermore, at both of the latter two representations, the Fourier transform k τ has rank one.
Since the Fourier spectrum of a kernel determines its "richness," this theorem has a number of important consequences. Recall that any kernel on a domain X induces a semi-metric on the set of probability distributions on X, known as the maximum mean discrepancy (MMD); for instance, see the papers [14, 15, 16] for more details. When X = S_d, this semi-metric is given by

MMD_k(P, Q) = sup_{f ∈ F_k} | E_{σ∼P}[f(σ)] − E_{σ′∼Q}[f(σ′)] |,

where F_k denotes the unit ball of the reproducing kernel Hilbert space (RKHS) defined by the kernel k. The kernel k is said to be characteristic if MMD_k actually defines a metric on the set of probability distributions, that is, if MMD_k(P, Q) = 0 implies that P = Q. With this context, Theorem 2 shows that the MMD semi-metric induced by the Kendall kernel is not characteristic; in fact, it is very weak, as made precise in the following:
Corollary 3. When d ≥ 3, the MMD semi-metric for the Kendall kernel depends on P and Q only through their Fourier transforms at ρ_(d−1,1) and ρ_(d−2,1,1).
This result follows by combining the Fourier-analytic characterization of Theorem 2 with a more general expression of MMD_k in the Fourier domain, as presented in Appendix D.2. Corollary 3 shows that most differences between P and Q do not contribute to MMD_k(P, Q): the only differences that contribute are those between the (d − 1)-dimensional matrices P̂(ρ_(d−1,1)) and Q̂(ρ_(d−1,1)), and between the corresponding matrices at ρ_(d−2,1,1). To be more precise, the Kendall kernel can differentiate between P and Q if and only if their Fourier transforms at ρ_(d−1,1) or ρ_(d−2,1,1) differ along the single direction given by the unique eigenvector with a nonzero eigenvalue of k̂_τ(ρ_(d−1,1)) or k̂_τ(ρ_(d−2,1,1)), respectively.
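To make the weakness of the Kendall kernel concrete, the following sketch (our own construction, for d = 3) builds two distinct distributions P ≠ Q whose difference lies in the null space of the Kendall Gram matrix, so the Kendall MMD vanishes while the Mallows MMD does not:

```python
import math
import numpy as np
from itertools import combinations, permutations

d = 3
perms = list(permutations(range(d)))

def n_disc(s, t):
    return sum(1 for i, j in combinations(range(d), 2)
               if (s[i] - s[j]) * (t[i] - t[j]) < 0)

pairs = math.comb(d, 2)
K_kendall = np.array([[1 - 2 * n_disc(s, t) / pairs for t in perms] for s in perms])
K_mallows = np.array([[math.exp(-n_disc(s, t)) for t in perms] for s in perms])

# The Kendall Gram matrix is rank deficient (rank 3 for d = 3), so its null
# space contains signed measures; K_kendall @ ones = 0 as well, since the
# transform at the trivial representation vanishes.
rank = np.linalg.matrix_rank(K_kendall)
assert rank < len(perms)

# Pick a null direction orthogonal to the all-ones vector and perturb the
# uniform distribution in opposite directions to build P != Q.
_, _, Vt = np.linalg.svd(K_kendall)
null_basis = Vt[rank:]
ones = np.ones(len(perms)) / np.sqrt(len(perms))
cand = null_basis - np.outer(null_basis @ ones, ones)
z = cand[np.argmax(np.linalg.norm(cand, axis=1))]
z /= np.abs(z).max()

eps = 1.0 / 24
P = np.full(len(perms), 1 / len(perms)) + eps * z
Q = np.full(len(perms), 1 / len(perms)) - eps * z
assert (P >= 0).all() and (Q >= 0).all() and not np.allclose(P, Q)

def mmd_sq(K, P, Q):
    diff = P - Q
    return diff @ K @ diff

assert abs(mmd_sq(K_kendall, P, Q)) < 1e-10  # Kendall MMD cannot tell P from Q
assert mmd_sq(K_mallows, P, Q) > 1e-8        # Mallows MMD can
```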
We now turn to Fourier analysis of the Mallows kernel (3b). Despite its superficial similarity to the Kendall kernel, it has very different properties.
Theorem 4. For any ν > 0, the Fourier transform k̂_m^ν of the Mallows kernel is strictly positive definite at every irrep ρ_λ, λ ⊢ d.
Note that Theorem 4 corrects an assertion in the paper by [7]: those authors suggested that, since the Mallows kernel depends only on the relative rankings of pairs of objects, the Fourier transform k̂_m^ν should be expected to be zero at all irreps λ ⊳ (d − 2, 1, 1). Theorem 4 shows that this natural intuition does not actually hold.
Theorem 4 also has implications for the universality of the Mallows kernel. Recall that a continuous kernel k on a compact metric space X is called universal (in the sense of [17]) if the RKHS defined by it is dense, in the L^∞ norm, in the space of continuous functions. When X = S_d, this condition is satisfied if every function on S_d can be written as a linear combination of the functions k(π, ·), with π ∈ S_d. [16] show that a universal and continuous kernel on a compact metric space is characteristic; on S_d, a kernel is universal if and only if it is characteristic. As with Theorem 2, Theorem 4 has implications for the kernel MMD induced by the Mallows kernel. In particular, it shows that the Mallows kernel is both characteristic and universal, and hence the MMD is a metric on probability distributions over S_d.

A family of polynomial-type kernels
Based on our previous results, it is natural to suspect that there exists a family of kernels interpolating between the relative simplicity of the Kendall kernel, which is analogous to a linear kernel on R^d, and the richness of the Mallows kernel. Based on this intuition, let us introduce three families of polynomial-type kernels on the symmetric group, defined as follows:

k_p(σ, σ′) = (1 + k_τ(σ, σ′))^p,    k̄_p(σ, σ′) = (1 + k_τ(σ, σ′)/p)^p,    k_{p,ν}(σ, σ′) = (1 − ν n_d(σ, σ′)/p)^p.

We refer to these three kernels, respectively, as the polynomial kernel, the normalized polynomial kernel, and the ν-normalized polynomial kernel of degree p. Since each depends only on the number of discordant pairs, each kernel is right-invariant by Proposition 1. Moreover, each kernel is positive semidefinite, since each one can be written as a polynomial function of the Kendall kernel (itself a PSD kernel) with non-negative coefficients.
Theorem 5. (a) The Fourier transform of each of the three polynomial kernels of degree p is zero at every irrep ρ_λ with λ ⊳ (max{d − 2p, 1}, 1, . . . , 1). (b) When p ≥ d − 1, each of the three polynomial kernels is universal and characteristic.
The first part of the theorem shows that the polynomial kernels of degree p do not detect differences between distributions at irreps ρ_λ with λ not higher in the partial ordering than the partition (max{d − 2p, 1}, 1, . . . , 1). Intuitively, as the degree of the polynomial kernels increases, they are able to detect more differences between probability distributions. The second part of the theorem shows that the polynomial kernels of degree at least d − 1 detect all differences between probability distributions.
The appeal of defining the second and third kernels, k̄_p and k_{p,ν}, in addition to the first one is two-fold. On the one hand, in practice, the kernel k_p becomes difficult to evaluate when p is large because k_p(σ, σ) = 2^p. On the other hand, the two normalized kernels satisfy the relations

lim_{p→∞} k̄_p(σ, σ′) = e · exp(−2 n_d(σ, σ′) / \binom{d}{2})    and    lim_{p→∞} k_{p,ν}(σ, σ′) = exp(−ν n_d(σ, σ′)).

The first limit is a constant times the Mallows kernel with the parameter ν = 2\binom{d}{2}^{−1}, while the second limit is the Mallows kernel k_m^ν itself. This observation suggests that we can infer properties of the Mallows kernel by working with the ν-normalized polynomial kernel; indeed, our proof of Theorem 4 makes use of this fact.
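Assuming the closed forms k_p = (1 + k_τ)^p, k̄_p = (1 + k_τ/p)^p, and k_{p,ν} = (1 − ν n_d/p)^p (our reading of the definitions above; treat them as an assumption), the diagonal growth and both limits can be checked numerically:

```python
import math
from itertools import combinations

def n_disc(s, t):
    return sum(1 for i, j in combinations(range(len(s)), 2)
               if (s[i] - s[j]) * (t[i] - t[j]) < 0)

def kendall(s, t):
    return 1 - 2 * n_disc(s, t) / math.comb(len(s), 2)

# Assumed closed forms (see the lead-in above):
def k_poly(s, t, p):
    return (1 + kendall(s, t)) ** p

def k_poly_norm(s, t, p):
    return (1 + kendall(s, t) / p) ** p

def k_poly_nu(s, t, p, nu):
    return (1 - nu * n_disc(s, t) / p) ** p

s, t = (0, 2, 1, 3), (1, 0, 2, 3)
d, nu = 4, 0.5
assert k_poly(s, s, 10) == 2 ** 10       # unnormalized diagonal blows up as 2^p
assert k_poly_norm(s, s, 10) <= math.e   # normalized diagonal stays bounded

# Large-p limits: the nu-normalized kernel approaches exp(-nu * n_d) (Mallows),
# while the normalized kernel approaches e * exp(-2 n_d / C(d,2)).
mallows = math.exp(-nu * n_disc(s, t))
assert abs(k_poly_nu(s, t, 10 ** 6, nu) - mallows) < 1e-4
scaled = math.e * math.exp(-2 * n_disc(s, t) / math.comb(d, 2))
assert abs(k_poly_norm(s, t, 10 ** 6) - scaled) < 1e-4
```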

Proofs of main results
In this section, we provide the proofs of our main results, with the more technical aspects deferred to the appendices. We first need to introduce some additional background on overcomplete representations, which we do in the following section.

Overcomplete Representations and the James Submodule Theorem
Thus far, we have defined the Fourier transform at irreducible representations, but its definition makes sense for reducible representations as well. Although the Fourier transform at the irreps fully determines a function on the symmetric group (see Appendix A), it is sometimes useful to consider the Fourier transform evaluated at reducible representations, for instance for computational reasons. To this end, we introduce the overcomplete permutation representations τ_λ, where λ ⊢ d (see Appendix A for further details).
The representation τ_(d) is the trivial representation: τ_(d)(σ) = 1 for all σ ∈ S_d. The representation τ_(d−1,1) takes values in R^{d×d}; its entries are indexed by the elements of [d], and it is defined by [τ_(d−1,1)(σ)]_{t_1 t_2} = 1_{σ(t_2) = t_1}. The representation τ_(d−2,2) takes values in R^{\binom{d}{2} × \binom{d}{2}}; its entries are indexed by the \binom{d}{2} subsets of two elements of [d], and it is defined by [τ_(d−2,2)(σ)]_{t_1 t_2} = 1_{σ(t_2) = t_1}, where σ(t) denotes the image of the set t under σ. The last representation of immediate interest is τ_(d−2,1,1), which takes values in R^{d(d−1)×d(d−1)}; its entries are indexed by ordered tuples of two distinct elements of [d], and it is defined by [τ_(d−2,1,1)(σ)]_{t_1 t_2} = 1_{σ(t_2) = t_1}, where σ is applied to the tuple elementwise. It is easy to check that these four maps are indeed representations. These representations are easy to work with because they map permutations to matrices with 0 or 1 entries, in which each row and column contains exactly one 1. How do the representations τ_λ relate to the irreps ρ_λ? James' submodule theorem, as presented in full detail in Appendix A, implies that there exist orthogonal matrices C_λ such that

C_(d−1,1) τ_(d−1,1)(σ) C_(d−1,1)^⊤ = ρ_(d)(σ) ⊕ ρ_(d−1,1)(σ),
C_(d−2,2) τ_(d−2,2)(σ) C_(d−2,2)^⊤ = ρ_(d)(σ) ⊕ ρ_(d−1,1)(σ) ⊕ ρ_(d−2,2)(σ),
C_(d−2,1,1) τ_(d−2,1,1)(σ) C_(d−2,1,1)^⊤ = ρ_(d)(σ) ⊕ ρ_(d−1,1)(σ) ⊕ ρ_(d−1,1)(σ) ⊕ ρ_(d−2,2)(σ) ⊕ ρ_(d−2,1,1)(σ).

The fact that τ_(d−2,1,1) contains two copies of ρ_(d−1,1) is of importance later. This decomposition allows us to infer properties of the Fourier transform of functions at the irreps ρ_λ by working with the representations τ_λ, which are easier to handle.
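The defining property of these matrices and the homomorphism requirement are easy to verify numerically for τ_(d−1,1); the sketch below is our own (it uses the convention that entry (a, b) is 1 iff σ sends index b to a):

```python
import numpy as np
from itertools import permutations

d = 4

def tau_first_order(sigma):
    """tau_(d-1,1): entry (a, b) is 1 iff sigma(b) = a; each matrix is a
    0/1 matrix with exactly one 1 in every row and column."""
    M = np.zeros((d, d))
    for b in range(d):
        M[sigma[b], b] = 1.0
    return M

def compose(sigma, pi):
    return tuple(sigma[pi[i]] for i in range(len(sigma)))

# Homomorphism property: tau(sigma o pi) = tau(sigma) tau(pi).
for sigma in permutations(range(d)):
    for pi in [(1, 0, 2, 3), (2, 3, 0, 1)]:
        lhs = tau_first_order(compose(sigma, pi))
        rhs = tau_first_order(sigma) @ tau_first_order(pi)
        assert np.allclose(lhs, rhs)

M = tau_first_order((2, 0, 3, 1))
assert (M.sum(axis=0) == 1).all() and (M.sum(axis=1) == 1).all()
```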
We are now ready to present the proofs of our main theorems. We present them slightly out of order, starting with Theorem 2 (Kendall), then Theorem 5 (polynomial), and lastly Theorem 4 (Mallows). This ordering makes the most sense in terms of flow, since the proof for the Mallows kernel builds on the proof for the polynomial kernels, which in turn builds on the proof for the Kendall kernel.

Proof of Theorem 2
For d = 2 and d = 3, the irreps ρ_λ are easy to describe in closed form [11], and the claims of Theorem 2 can be verified by direct computation. Accordingly, it remains to prove Theorem 2 when d ≥ 4. Each representation ρ : S_d → C^{d_ρ × d_ρ} defines a collection of d_ρ^2 functions σ ↦ ρ(σ)_{ij} on the symmetric group. An important result in representation theory states that the functions defined by the irreps ρ_λ form a basis for the space of functions over the symmetric group. The next lemma allows us to exploit this fact.
Lemma 6. The Kendall function σ ↦ k_τ(σ) is a linear combination of the functions defined by the representation τ_(d−2,1,1).

Proof of Theorem 5
We use an approach similar to the proof of Theorem 2: we first express each polynomial kernel of degree p as a linear combination of the functions defined by the representation τ_(max{d−2p,1},1,...,1). By the James submodule theorem together with the linear independence of the functions defined by the irreps ρ_λ, we then find that the Fourier transforms of the three polynomial kernels are zero at all irreps ρ_λ with λ ⊳ (max{d − 2p, 1}, 1, . . . , 1). The first part of Theorem 5 is now proved.
To prove the second part of Theorem 5 we make use of feature maps of the three polynomial kernels. In general, a feature map of a kernel k is an embedding of the symmetric group into a Hilbert space that makes the kernel linear. [7] present such a feature map for the Kendall kernel.
Note that it satisfies the relation k_τ(σ, σ′) = Φ(σ)^⊤ Φ(σ′). We need such feature maps for the three polynomial kernels as well. Up to constants, the feature maps for the three kernels k_p, k̄_p, and k_{p,ν} are the same; for simplicity, we work with the kernel k_p. All the arguments presented here extend to the other two polynomial kernels as well.
We now give a recursive construction of feature maps Φ_p for the polynomial kernels. The coordinates of Φ_1(σ) are indexed by the unordered pair t_0 = {−1, 0} and the \binom{d}{2} unordered pairs t_r = {i_r, j_r} with i_r, j_r ∈ [d] and i_r < j_r; we denote the set of these unordered pairs, together with t_0, by T. The feature map Φ_1 clearly satisfies k_1(σ, σ′) = Φ_1(σ)^⊤ Φ_1(σ′). Now we use the map Φ_{p−1} to construct a feature map Φ_p for p ≥ 2. By definition, we have

k_p(σ, σ′) = k_1(σ, σ′) · k_{p−1}(σ, σ′) = (Φ_1(σ)^⊤ Φ_1(σ′)) (Φ_{p−1}(σ)^⊤ Φ_{p−1}(σ′)).

Therefore, the polynomial kernel of degree p between σ and σ′ is equal to the inner product of the matrices Φ_1(σ)Φ_{p−1}(σ)^⊤ and Φ_1(σ′)Φ_{p−1}(σ′)^⊤. By induction, we see that Φ_p can be obtained from Φ_1 by taking the outer product with itself p times, meaning that the embedding Φ_p can be expressed coordinatewise as

Φ_p(σ)_{(s_1, . . . , s_p)} = Π_{r=1}^p Φ_1(σ)_{s_r},

where s_1, s_2, . . . , s_p is a sequence of elements of T. The following lemma is the key result that allows us to show that the three polynomial kernels of degree greater than or equal to d − 1 are characteristic.
Lemma 9. The vectors {Φ_{d−1}(σ) : σ ∈ S_d} are linearly independent.
We prove this claim in Appendix C.2; here we provide some intuition for the argument. By construction, each entry of Φ_{d−1} is equal to a product of up to d − 1 terms of the form 2·1_{σ(i)<σ(j)} − 1, times a constant. The key property that makes the result true is that each indicator function 1_{σ=σ_0} of a single permutation σ_0 can be expressed as a product of d − 1 indicator functions 1_{σ(i)<σ(j)}. For example, when d = 3, the product 1_{σ(1)<σ(3)} 1_{σ(3)<σ(2)} is equal to the indicator function of the permutation [1, 3, 2]. Moreover, the degree d − 1 is the smallest with this property. Although we do not prove that the vectors Φ_p(σ) fail to be linearly independent for p < d − 1, we believe that d − 1 is indeed the smallest degree for which independence holds.
As mentioned previously, a universal kernel on the symmetric group is also characteristic. Hence, it suffices to show that the polynomial kernel k_{d−1} is universal; for this, it is enough to check that the Gram matrix M_τ = [k_{d−1}(σ_i, σ_j)] is invertible, where σ_1, σ_2, . . . , σ_{d!} enumerate all the elements of S_d. The Gram matrix can be written as M_τ = Φ^⊤ Φ, where Φ is the matrix whose columns are the vectors Φ_{d−1}(σ_i). From Lemma 9, we know that these columns are linearly independent, and hence the Gram matrix M_τ is full rank, which completes the proof of Theorem 5.
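The recursive feature-map construction, and the linear independence at degree d − 1, can be checked numerically. The sketch below is our own; it assumes the form k_p = (1 + k_τ)^p for the degree-p polynomial kernel and builds Φ_p by repeated Kronecker products of Φ_1:

```python
import math
import numpy as np
from itertools import combinations, permutations

def kendall(s, t):
    d = len(s)
    nd = sum(1 for i, j in combinations(range(d), 2)
             if (s[i] - s[j]) * (t[i] - t[j]) < 0)
    return 1 - 2 * nd / math.comb(d, 2)

def phi1(sigma):
    """Feature map for k_1 = 1 + k_tau: one constant coordinate (indexed by
    t_0) plus one coordinate per unordered pair {i, j}, scaled so that
    phi1(s) . phi1(t) = 1 + kendall(s, t)."""
    pairs = math.comb(len(sigma), 2)
    coords = [1.0]
    for i, j in combinations(range(len(sigma)), 2):
        coords.append((1 - 2 * (sigma[i] > sigma[j])) / math.sqrt(pairs))
    return np.array(coords)

def phi_p(sigma, p):
    """Degree-p feature map: the p-fold tensor (Kronecker) product of phi1."""
    v = phi1(sigma)
    out = v
    for _ in range(p - 1):
        out = np.kron(out, v)
    return out

s, t = (0, 2, 1, 3), (3, 1, 0, 2)
for p in (1, 2, 3):
    assert abs(phi_p(s, p) @ phi_p(t, p) - (1 + kendall(s, t)) ** p) < 1e-12

# Linear independence at degree d - 1 (the content of the lemma), here d = 3:
d = 3
feats = np.array([phi_p(sigma, d - 1) for sigma in permutations(range(d))])
assert np.linalg.matrix_rank(feats) == math.factorial(d)
```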

Proof of Theorem 4
We first give a direct proof that the Mallows kernel is universal and characteristic. Theorem 5 shows that the ν-normalized polynomial kernel k_{p,ν} is characteristic when the degree p is at least d − 1. Moreover, we saw that as the degree p increases to infinity, the kernel k_{p,ν} converges to the Mallows kernel k_m^ν. Therefore, it is not surprising that the Mallows kernel is characteristic, since it is the limit of characteristic kernels.
Let us now make this rough argument precise. In order to show that a kernel on S d is characteristic, it suffices to check that it is universal [16]. To check the latter condition, we need to show that the Gram matrix M m = [k ν m (σ i , σ j )] is strictly positive definite; here the permutations σ 1 , σ 2 , . . . , σ d! enumerate the elements of S d .
Recall that the Hadamard product of two matrices A and B of the same dimensions, denoted A ∘ B, is formed by taking the elementwise product of the entries; we use A^{∘p} to denote the Hadamard product of the matrix A with itself p times. By Schur's theorem, the Hadamard product A ∘ B of any two PSD matrices is also PSD.
Performing a Taylor series expansion of the exponential function expresses M_m as an entrywise absolutely convergent series of Hadamard powers of PSD matrices. Re-arranging terms (with suitable coefficients 0 ≤ α_i ≤ 1) yields a decomposition of M_m into two terms, as in equation (14): the first term on the right-hand side of (14) is a positive multiple of the Gram matrix of the ν-normalized polynomial kernel of degree d − 1, and thus is a strictly positive definite matrix; the second term is a positive semi-definite matrix by Schur's theorem. Hence M_m is strictly positive definite, and Theorem 4 is now proved. For completeness, we also note that the result of Theorem 4 can be obtained via a more abstract argument, using the results of [18]. Given a compact metric space X and a separable Hilbert space H, let Ψ : X → H be a continuous and injective map. The authors show that the kernel k on X × X given by

k(x, x′) = exp(−‖Ψ(x) − Ψ(x′)‖_H^2)    (15)

is universal. Since [16] showed that continuous and universal kernels are also characteristic on a compact metric space, it follows that the kernel k is characteristic as well. The symmetric group is a compact metric space, and we can choose Ψ = Φ, the feature map of the Kendall kernel. We can thus conclude that the kernel defined in equation (15) is universal and characteristic; since ‖Φ(σ) − Φ(σ′)‖^2 is proportional to n_d(σ, σ′), this kernel is equal to the Mallows kernel up to constants, and the claim of Theorem 4 follows.
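Strict positive definiteness of the Mallows Gram matrix can be confirmed numerically for small d (a sanity check we added; it is not part of the proof):

```python
import math
import numpy as np
from itertools import combinations, permutations

def n_disc(s, t):
    return sum(1 for i, j in combinations(range(len(s)), 2)
               if (s[i] - s[j]) * (t[i] - t[j]) < 0)

for d in (3, 4):
    perms = list(permutations(range(d)))
    for nu in (0.5, 1.0):
        K = np.array([[math.exp(-nu * n_disc(s, t)) for t in perms]
                      for s in perms])
        # Strict positive definiteness of the Mallows Gram matrix.
        assert np.linalg.eigvalsh(K).min() > 1e-8
```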

Conclusion and open questions
In this paper, we provided a Fourier-analytic characterization of the following right-invariant kernels: the Kendall kernel, the Mallows kernel, and some new families of polynomial kernels. We showed that the Kendall kernel is nearly degenerate, in the sense that it has only two nonzero Fourier matrices, both of which have rank one. We also showed that the Mallows kernel lies at the other extreme, being both universal and characteristic. These results reveal that the Kendall and Mallows kernels are extremely different, even though both of them depend only on counting discordant pairs between rankings; in this sense, they form a natural analog in the space of permutations to the linear and Gaussian kernels in Euclidean space. Our proposed families of polynomial kernels smoothly interpolate between these two extremes, yielding a hierarchy of kernels that are sensitive to differences between distributions at an increasing set of frequencies, and can model increasingly complex functions.
Many properties of the Fourier transform of the Mallows and polynomial kernels are still not understood. For example, unlike the case of the Kendall kernel, we do not know closed-form descriptions of the Fourier matrices for these kernels. Such concise expressions would not only be of mathematical interest, but could also be useful for computing in the spectral domain. It would also be interesting to understand the properties of these kernels when applied to partial rankings (top-k or random-k), which is even harder because partial rankings do not jointly form a group. We view the current results on kernels for full rankings as an important step towards developing and rigorously analyzing flexible kernel methods for partial rankings.
Finally, an important direction for future work is on finite sample rates for different learning methods that use these kernels. For example, it would be interesting to characterize the power of a two sample test using the Kendall or Mallows kernel MMD as a function of the number of samples and d and an appropriate signal-to-noise ratio. Similarly, characterizing exactly how hard it is to learn functions on the symmetric group seems like an interesting open problem.

A Background in Representation Theory
In this section, we present further notions and results about the representation theory for the symmetric group. Our exposition is brief and covers only the essential results needed in our work. For a more detailed introduction good resources include the thesis of [8] and the appendices by [9], with a concise summary also given by [10]. More detailed presentations can be found in [11], [12], or [13], ordered according to increasing levels of abstraction.

Groups
A group (G, ·) is a set G endowed with a multiplicative operation · : G × G → G such that (a) the operation is associative, meaning that (g_1 · g_2) · g_3 = g_1 · (g_2 · g_3) for all g_1, g_2, g_3 ∈ G; (b) there exists an element e ∈ G, called the identity element, such that e · g = g · e = g for all g ∈ G; and (c) for every g ∈ G there exists an inverse g^{−1} ∈ G such that g · g^{−1} = g^{−1} · g = e.
It is easy to check that (R, +) and (R \ {0}, ·) are examples of groups. It is also straightforward to check that the set of permutations together with the operation of composition forms a group, called the symmetric group. Notice that we do not require g_1 · g_2 = g_2 · g_1; a group with this property is called commutative or abelian. Abelian groups are easier to study than non-abelian ones. Unfortunately, the symmetric group is not abelian.

Equivalent Representations
Two representations ρ_1 and ρ_2 are equivalent if they have the same dimension and if there exists an invertible matrix C such that ρ_1(σ) = C^{−1} ρ_2(σ) C for all σ ∈ S_d. In other words, two representations are equivalent if there exists a change of basis that makes one of them equal to the other. We use ρ_1 ≡ ρ_2 to denote the equivalence of the representations ρ_1 and ρ_2.
For any representation ρ_1, there exists an equivalent representation ρ_2 such that each matrix ρ_2(σ) is unitary (i.e., ρ_2(σ)* := \overline{ρ_2(σ)}^⊤ = ρ_2(σ)^{−1} = ρ_2(σ^{−1})). Therefore, we can always assume that the representations we are working with are unitary. Furthermore, in the case of the irreps of the symmetric group, there exist bases in which each representation ρ_λ is real, and hence orthogonal. The irreps in these bases are known as Young's orthogonal representations, and throughout this paper we work with these forms of ρ_λ.

Irreps
We already said that an irreducible representation is a representation that is not equivalent to a direct sum of representations. The symmetric group, in fact any finite group, has a finite number of pairwise inequivalent irreps. Let us consider a maximal set of pairwise inequivalent irreps. There can be multiple such sets, but they are the same up to equivalence. To be more precise, between two maximal sets of irreps there exists a bijection such that an irrep in the first set is mapped to an equivalent irrep in the other set.
A fundamental result in representation theory states that any representation is equivalent to a direct sum of irreps. That is, each representation ρ can be decomposed into the direct sum of some irreducible representations ρ_1, ρ_2, . . . , ρ_k with some multiplicities m_1, m_2, . . . , m_k:

ρ ≡ m_1 ρ_1 ⊕ m_2 ρ_2 ⊕ · · · ⊕ m_k ρ_k.

Let us recall that the entries of each representation ρ : S_d → C^{d_ρ × d_ρ} define d_ρ^2 functions σ ↦ ρ(σ)_{ij} on the symmetric group. The functions defined by Young's orthogonal representations form a basis for the space of functions f : S_d → C. This result is important, and we exploit it extensively in this work.

The Fourier Transform
We saw that the Fourier transform of a function f : S_d → C is a map from representations to matrices, given by

f̂(ρ) = Σ_{σ ∈ S_d} f(σ) ρ(σ),

where ρ is a representation of the symmetric group.
This Fourier transform has properties similar to those of its counterpart over the real numbers. First of all, there is a Fourier inversion formula, which takes the form

f(σ) = (1/d!) Σ_{λ ⊢ d} d_λ Tr[ f̂(ρ_λ) ρ_λ(σ^{−1}) ].

The Fourier transform on the symmetric group also satisfies the Plancherel formula:

Σ_{σ ∈ S_d} f(σ) \overline{g(σ)} = (1/d!) Σ_{λ ⊢ d} d_λ Tr[ f̂(ρ_λ) ĝ(ρ_λ)* ].

A third familiar property is that the Fourier transform of the convolution of two functions is the product of the Fourier transforms of the individual functions, where the convolution of two functions f, g : S_d → C is defined by (f * g)(π) = Σ_{σ ∈ S_d} f(πσ^{−1}) g(σ).
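The convolution property holds at any representation, not only at the irreps, which makes it easy to check numerically; the sketch below (our own test harness) uses the permutation matrices as a reducible representation:

```python
import numpy as np
from itertools import permutations

d = 4
perms = list(permutations(range(d)))
index = {s: i for i, s in enumerate(perms)}

def compose(s, t):
    return tuple(s[t[i]] for i in range(d))

def inverse(s):
    inv = [0] * d
    for i, v in enumerate(s):
        inv[v] = i
    return tuple(inv)

def rho(sigma):
    """A (reducible) matrix representation: the permutation matrices."""
    M = np.zeros((d, d))
    for b in range(d):
        M[sigma[b], b] = 1.0
    return M

def fourier(f):
    """hat f(rho) = sum_sigma f(sigma) rho(sigma)."""
    return sum(f[index[s]] * rho(s) for s in perms)

rng = np.random.default_rng(0)
f = rng.standard_normal(len(perms))
g = rng.standard_normal(len(perms))

# Convolution: (f * g)(pi) = sum_sigma f(pi sigma^{-1}) g(sigma).
conv = np.array([
    sum(f[index[compose(pi, inverse(s))]] * g[index[s]] for s in perms)
    for pi in perms
])
# The convolution theorem: hat(f * g) = hat(f) hat(g).
assert np.allclose(fourier(conv), fourier(f) @ fourier(g))
```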

Ferrers diagrams, Young tableaux, and Young tabloids
As mentioned in Section 2.2, it is natural to index the irreps of S d by partitions λ of d. The exact correspondence is not easy to describe, but it is useful to understand how to visualize the partitions λ and the corresponding irrep ρ λ .
The partitions λ ⊢ d are represented graphically in the form of Ferrers diagrams. The diagram of a partition λ = (λ_1, . . . , λ_r) is formed from boxes placed in rows, such that row i contains λ_i boxes. For example, the partitions of 4 are (4), (3, 1), (2, 2), (2, 1, 1), and (1, 1, 1, 1). In this graphical representation, a wider partition is higher in the partial ordering, while a taller partition is lower in the partial ordering.
A Ferrers diagram with the elements of the set {1, 2, . . . , d} placed in its boxes is called a Young tableau. Young tableaux in which the rows are viewed as sets are called Young tabloids; to emphasize that the rows of a Young tabloid are not ordered, we drop the vertical lines in the graphical representation. For example, the partition (2, 1) has three Young tabloids, each determined by which element of {1, 2, 3} occupies the second row.

Overcomplete Representations and James' Submodule Theorem
In studying irreps or the Fourier transforms of functions it is often useful to consider reducible representations that have an easy to understand interpretation and contain copies of the irreps. We have seen in Section 4.2 that the representations τ λ play such a role. We now define these representations for a general partition λ.
Let {t_1}, {t_2}, . . . , {t_l} be an enumeration of all Young tabloids of some partition λ ⊢ d. The representation τ_λ takes values in R^{l×l} and is defined by

[τ_λ(σ)]_{t_i t_j} = 1_{σ{t_j} = {t_i}},

where σ{t} denotes the tabloid obtained by applying σ to each entry of {t}. We note that the Fourier transform of a probability measure P at the representation τ_λ encodes marginal probabilities:

P̂(τ_λ)_{t_i t_j} = P(σ : σ{t_j} = {t_i}).

Therefore, the Fourier transform at this representation has a concrete interpretation in the "time domain." Nonetheless, because of the Fourier inversion formula, we want to understand the properties of the kernel functions at the irreps. James' submodule theorem gives a decomposition of τ_λ into irreps; we state the form of the theorem presented by [9].
Theorem (James' submodule theorem). For each partition λ ⊢ d, there exists an orthogonal matrix C_λ such that C_λ τ_λ(σ) C_λ^⊤ = ⊕_{µ ⊵ λ} K_{µ,λ} ρ_µ(σ) for all σ ∈ S_d, where the K_{µ,λ} are nonnegative integers and K_{λ,λ} = 1.
The integers K_{µ,λ} are known as Kostka numbers, and there are standard methods to compute them. For example, we have already mentioned in Section 4.2 that τ_(d−2,1,1) contains two copies of ρ_(d−1,1), corresponding to K_{(d−1,1),(d−2,1,1)} = 2.
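The marginal-probability interpretation of P̂(τ_λ) is easy to see in code for λ = (d − 1, 1); the sketch below (ours) checks it against directly computed first-order marginals:

```python
import numpy as np
from itertools import permutations

d = 4
perms = list(permutations(range(d)))

def tau_first_order(sigma):
    M = np.zeros((d, d))
    for b in range(d):
        M[sigma[b], b] = 1.0
    return M

rng = np.random.default_rng(1)
w = rng.random(len(perms))
P = w / w.sum()  # a random probability distribution on S_4

P_hat = sum(p * tau_first_order(s) for p, s in zip(P, perms))

# Entry (a, b) of P_hat is the first-order marginal P(sigma(b) = a).
b = 2
marginal = np.array([
    sum(p for p, s in zip(P, perms) if s[b] == a) for a in range(d)
])
assert np.allclose(P_hat[:, b], marginal)
assert np.allclose(P_hat.sum(axis=0), 1.0)  # each column is a distribution
```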

B Proof of Theorem 2
Based on our discussion in Section 4.2, the only remaining details in the proof of Theorem 2 are the proofs of Lemmas 6 and 7, which we provide here.

B.1 Proof of Lemma 6
The functions defined by the irreps ρ_λ form a basis for the space of functions on the symmetric group. Hence, to prove the claim it suffices to express the function k_τ as a linear combination of the functions defined by ρ_(d), ρ_(d−1,1), ρ_(d−2,2), and ρ_(d−2,1,1). James' submodule theorem tells us that τ_(d−2,1,1) ≡ ρ_(d) ⊕ ρ_(d−1,1) ⊕ ρ_(d−1,1) ⊕ ρ_(d−2,2) ⊕ ρ_(d−2,1,1).
Therefore, we just have to show that the Kendall function is a linear combination of the functions defined by τ_(d−2,1,1). From Proposition 1 we have

k_τ(σ) = 1 − 2 \binom{d}{2}^{−1} Σ_{i<j} 1_{σ(i)>σ(j)},

and each indicator can be expanded as 1_{σ(i)>σ(j)} = Σ_{r>l} 1_{σ(i)=r, σ(j)=l}. The functions 1_{σ(i)=r, σ(j)=l} are defined by τ_(d−2,1,1) by construction, so the conclusion follows.
Computing k̂_τ(τ^{(d)}). We first show that k̂_τ(τ^{(d)}) = 0. Recall that τ^{(d)} is the trivial representation, equal to 1 at all permutations, so we need to check that
\[ \sum_{\sigma \in S_d} \Big(1 - 2\binom{d}{2}^{-1} i(\sigma)\Big) = 0. \]
Pairing each permutation σ with its composition with the order-reversing permutation shows that the two inversion counts sum to \binom{d}{2}, so that
\[ \sum_{\sigma \in S_d} i(\sigma) = \frac{d!}{2}\binom{d}{2}, \]
and the conclusion follows.
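Both the pairing identity and the vanishing sum can be confirmed numerically:

```python
from itertools import permutations

def inversions(p):
    d = len(p)
    return sum(p[i] > p[j] for i in range(d) for j in range(i + 1, d))

# Pairing used in the proof: i(sigma) + i(sigma reversed) = binom(d, 2).
for p in permutations(range(1, 5)):
    assert inversions(p) + inversions(p[::-1]) == 6

# The full sum of Kendall kernel values over S_d vanishes.
for d in range(2, 7):
    C = d * (d - 1) // 2
    total = sum(1 - 2 * inversions(p) / C
                for p in permutations(range(1, d + 1)))
    print(d, abs(total) < 1e-9)  # True for each d
```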
Computing k̂_τ(τ^{(d−1,1)}). In this case, we show that k̂_τ(τ^{(d−1,1)}) is proportional to vv^⊤, where the vector v ∈ R^d has components v_r = d − 2r + 1. Consider the functions g_{ij} on S_d defined by g_{ij}(σ) = 1 − 2·1_{\{σ(i)>σ(j)\}}, for all i < j. Then
\[ \widehat{k}_\tau(\rho) = \binom{d}{2}^{-1} \sum_{i<j} \widehat{g}_{ij}(\rho) \]
for any representation ρ. We compute ĝ_{ij}(τ^{(d−1,1)}) for each pair i < j and then sum up the results. The rows of τ^{(d−1,1)} are indexed by tabloids of shape (d − 1, 1). Each of these tabloids is fully specified by the index contained in its second row, and we identify the tabloids of shape (d − 1, 1) with those indices. Let t_1 and t_2 be two indices in [d]. Then
\[ \widehat{g}_{ij}(\tau^{(d-1,1)})_{t_1 t_2} = \sum_{\sigma :\, \sigma(t_1) = t_2} g_{ij}(\sigma). \]
There are three cases to consider. First, suppose that t_1 is distinct from both i and j. There are (d − 1)! permutations σ that satisfy σ(t_1) = t_2, out of which exactly half satisfy g_{ij}(σ) = 1 and the other half satisfy g_{ij}(σ) = −1. Therefore, ĝ_{ij}(τ^{(d−1,1)})_{t_1 t_2} = 0 whenever t_1 ∉ {i, j}. Second, if t_1 = i, then among the (d − 1)! permutations with σ(i) = t_2, exactly (t_2 − 1)(d − 2)! place σ(j) below t_2, so that ĝ_{ij}(τ^{(d−1,1)})_{t_1 t_2} = (d − 1)! − 2(t_2 − 1)(d − 2)! = (d − 2)!(d − 2t_2 + 1). Third, by the symmetric argument, if t_1 = j then ĝ_{ij}(τ^{(d−1,1)})_{t_1 t_2} = −(d − 2)!(d − 2t_2 + 1). Summing over pairs i < j, the entry (t_1, t_2) receives d − t_1 contributions of the second type and t_1 − 1 of the third, so it equals (d − 2)!(d − 2t_1 + 1)(d − 2t_2 + 1), which is proportional to v_{t_1} v_{t_2} as claimed.
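The rank-one structure can be confirmed by brute force. Note that the overall scaling of k̂_τ(τ^{(d−1,1)}) depends on the normalization convention for the Fourier transform, so the sketch below only checks proportionality to vv^⊤.

```python
import numpy as np
from itertools import permutations

d = 4
perms = list(permutations(range(1, d + 1)))
C = d * (d - 1) // 2

def inversions(p):
    return sum(p[i] > p[j] for i in range(d) for j in range(i + 1, d))

# Entry (t1, t2) sums the Kendall kernel values over permutations with
# sigma(t1) = t2, matching the tabloid indexing used in the proof.
M = np.zeros((d, d))
for p in perms:
    k = 1.0 - 2.0 * inversions(p) / C
    for t1 in range(1, d + 1):
        M[t1 - 1, p[t1 - 1] - 1] += k

v = np.array([d - 2 * r + 1 for r in range(1, d + 1)], dtype=float)  # (3,1,-1,-3)
outer = np.outer(v, v)
scale = M[0, 0] / outer[0, 0]
print(np.allclose(M, scale * outer))  # True: M is proportional to v v^T
```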
Computing k̂_τ(τ^{(d−2,2)}). The entries of k̂_τ(τ^{(d−2,2)}) are indexed by tabloids of shape (d − 2, 2), and each such tabloid is fully specified by the set of two indices contained in its second row; we therefore identify the tabloids of shape (d − 2, 2) with sets of two indices. Fix two such sets t_1 = {t_11, t_12} and t_2 = {t_21, t_22}. Once again we use the functions g_{ij}(σ) := 1 − 2·1_{\{σ(i)>σ(j)\}}, for which
\[ \widehat{g}_{ij}(\tau^{(d-2,2)})_{t_1 t_2} = \sum_{\sigma :\, \sigma(t_1) = t_2} g_{ij}(\sigma), \]
where σ(t_1) = t_2 means that σ maps the set t_1 onto the set t_2. By breaking into four cases according to the size of the intersection t_1 ∩ {i, j}, similar to the computation of k̂_τ(τ^{(d−1,1)}), we obtain each entry ĝ_{ij}(τ^{(d−2,2)})_{t_1 t_2}. Summing the terms ĝ_{ij}(τ^{(d−2,2)}) over pairs i < j yields the result.
Computing k̂_τ(τ^{(d−2,1,1)}). We show that k̂_τ(τ^{(d−2,1,1)}) can be expressed in terms of three vectors v_1, v_2, v_3 ∈ R^{d(d−1)}. The same ideas used in the computation of k̂_τ(τ^{(d−2,2)}) apply here as well; however, the analysis is a bit more detailed because there are more cases to consider. The entries of k̂_τ(τ^{(d−2,1,1)}) are indexed by tabloids of shape (d − 2, 1, 1). These tabloids are completely specified by the entries contained in their second and third rows; hence, we can identify them with ordered tuples in [d]^2 with distinct entries. Fix two such tuples t_1 = (t_11, t_12) and t_2 = (t_21, t_22), with t_11 ≠ t_12 and t_21 ≠ t_22. Arguments similar to the ones used in the computation of ĝ_{ij}(τ^{(d−1,1)}) enable us to compute ĝ_{ij}(τ^{(d−2,1,1)}) as well. To make the result more readable, we split it into two cases: t_21 < t_22 and t_21 > t_22. The conclusion then follows by computing the sum Σ_{i<j} ĝ_{ij}(τ^{(d−2,1,1)})_{t_1 t_2} in the four possible cases obtained from the orderings of t_11 and t_12, and of t_21 and t_22.

C Proof of Theorem 5
Based on our discussion in Section 4.3 we are left to prove two lemmas. Lemma 8 states that polynomial kernels of degree p can be expressed as a linear combination of the functions defined by the representation τ (max{d−2p,1},1,...,1) . We prove this in Section C.1. Lemma 9 states that the vectors {Φ d−1 (σ)} σ∈S d are linearly independent. We prove this in Section C.2. This settles the proof of Theorem 5.

C.1 Proof of Lemma 8
We express the function σ → k p (σ) as a linear combination of the functions defined by the representation τ (max{d−2p,1},1,...,1) . The same property can be proved for k p and k p,ν analogously.
We first analyze the case 2p < d. By definition, k_p(σ) is a polynomial of degree p in k_τ(σ); expanding this polynomial shows that k_p is a linear combination of products of the indicator functions 1_{\{σ(i)>σ(j)\}}. Each such product contains at most p factors, so there are at most 2p values σ(i_1), σ(i_2), ..., σ(i_{2p}) on which the product depends. But the indicator functions of events of the form {σ(i_1) = j_1, . . . , σ(i_{2p}) = j_{2p}} form a basis for the functions that depend only on the values σ(i_1), σ(i_2), . . . , σ(i_{2p}). The conclusion follows for the case 2p < d because these indicator functions are exactly the functions defined by the representation τ^{(max{d−2p,1},1,...,1)}. The case 2p ≥ d follows analogously once we observe that any product of the indicator functions 1_{\{σ(i)>σ(j)\}} is determined by d − 1 values σ(i_1), . . . , σ(i_{d−1}). (To be clear, d − 1 values suffice because, given d − 1 of the values of σ, the remaining value is fixed.)
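The containment argument can be checked numerically for d = 5 and p = 2: a degree-2 polynomial in the Kendall kernel lies in the span of the indicators of full assignments on 2p = 4 positions. (The code takes k_p = k_τ^p for concreteness; the paper's polynomial kernels may differ by lower-order terms, which does not affect membership in the span.)

```python
import numpy as np
from itertools import permutations, combinations

d, p = 5, 2  # 2p = 4 < d
perms = list(permutations(range(1, d + 1)))
C = d * (d - 1) // 2

def inversions(s):
    return sum(s[i] > s[j] for i in range(d) for j in range(i + 1, d))

k_tau = np.array([1.0 - 2.0 * inversions(s) / C for s in perms])
k_p = k_tau ** p  # a degree-p polynomial in the Kendall kernel

# Columns: indicators of full assignments on 2p positions,
# 1{sigma(i1)=j1, ..., sigma(i4)=j4}.
cols = []
for pos in combinations(range(d), 2 * p):
    for vals in permutations(range(1, d + 1), 2 * p):
        cols.append([float(all(s[i] == v for i, v in zip(pos, vals)))
                     for s in perms])
A = np.array(cols).T  # shape (d!, number of indicators)

coef, _, _, _ = np.linalg.lstsq(A, k_p, rcond=None)
print(np.linalg.norm(A @ coef - k_p) < 1e-8)  # True: k_p is in the span
```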

C.2 Proof of Lemma 9
Recall from equation (12) that for i_r, j_r ∈ [d] with i_r < j_r, we use t_r = {i_r, j_r} to denote unordered pairs, with an additional t_0 = {−1, 0} for convenience, and T_un to denote the set of all \binom{d}{2} + 1 such unordered pairs. By equation (4.3), the definition of the feature map Φ_{d−1} implies that
\[ [\Phi_{d-1}(\sigma)]_{s_1 s_2 \ldots s_{d-1}} = C_{s_1 s_2 \ldots s_{d-1}} \prod_{r :\, s_r \neq t_0} \big(1 - 2\,\mathbf{1}\{\sigma(i_r) > \sigma(j_r)\}\big), \]
where the product runs only over s_r ≠ t_0 since [Φ_1(σ)]_{t_0} = 1, and C_{s_1 s_2 ... s_{d−1}} = \binom{d}{2}^{-|\{r : s_r \neq t_0\}|/2} is a positive constant independent of σ. We use the convention that an empty product evaluates to 1.
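Lemma 9 asserts that the vectors Φ_{d−1}(σ) are linearly independent. Assuming Φ_1 has one constant coordinate (for t_0) and one normalized coordinate per unordered pair, as in the display above, this can be confirmed numerically for d = 4:

```python
import numpy as np
from itertools import permutations, combinations

d = 4
perms = list(permutations(range(1, d + 1)))
C = d * (d - 1) // 2

def phi1(s):
    """Phi_1(sigma): one constant entry (for t0), then one entry per
    unordered pair {i, j} with i < j, normalized by sqrt(binom(d, 2))."""
    feats = [1.0]
    for i, j in combinations(range(d), 2):
        feats.append((1.0 - 2.0 * (s[i] > s[j])) / np.sqrt(C))
    return np.array(feats)

def phi(s, power):
    """(power)-fold tensor power of Phi_1, flattened."""
    out = phi1(s)
    for _ in range(power - 1):
        out = np.kron(out, phi1(s))
    return out

F = np.array([phi(s, d - 1) for s in perms])  # shape d! x (C+1)^(d-1)
print(np.linalg.matrix_rank(F))  # 24 = d!, so the vectors are independent
```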
From now on, for a given unordered pair s = {i, j} ∈ T_un, we introduce the shorthand notation +s := {σ : σ(i) < σ(j)}, with −s denoting its complement. Moreover, when we write several such signed unordered pairs next to each other, we mean the intersection of the corresponding sets; for example, +s_1 − s_2 means (+s_1) ∩ (−s_2). Suppose that Σ_{σ∈S_d} α(σ) Φ_{d−1}(σ) = 0; we use induction to show that each coefficient α(σ) is zero. Assume that for some fixed integer p, for all choices of p unordered pairs s_1, . . . , s_p ∈ T_un, and for all possible binary signs ε_1, . . . , ε_p ∈ {+1, −1}, the following holds:
\[ \sum_{\sigma \in \epsilon_1 s_1 \cap \cdots \cap \epsilon_p s_p} \alpha(\sigma) = 0. \]
We show that this property holds for all choices of p + 1 unordered pairs and binary signs. The base case p = 1 has been shown in Equation (22).
Fix a sequence of p + 1 distinct pairs s_1, s_2, . . . , s_{p+1}, all distinct from t_0. Each choice of signs ε_1, . . . , ε_{p+1} for these pairs can be encoded by a sign vector ε ∈ {−1, +1}^{p+1}. For a given sign vector ε ∈ {−1, +1}^{p+1}, let sign(ε) be the product of the entries of ε, so that sign(ε) is +1 if ε contains an even number of −1 entries, and −1 otherwise. Then we have
\[ \sum_{\sigma \in +s_1 \epsilon_2 s_2 \cdots \epsilon_{p+1} s_{p+1}} \alpha(\sigma) + \sum_{\sigma \in -s_1 \epsilon_2 s_2 \cdots \epsilon_{p+1} s_{p+1}} \alpha(\sigma) = \sum_{\sigma \in \epsilon_2 s_2 \cdots \epsilon_{p+1} s_{p+1}} \alpha(\sigma). \]
This property holds for all pairs s_j, not just for s_1. Furthermore, by the induction hypothesis we know that the right-hand side of the above equation equals zero. More generally, if ε and ξ are two sign vectors in {−1, +1}^{p+1} that differ in only one coordinate, we have
\[ \sum_{\sigma \in \epsilon_1 s_1 \cdots \epsilon_{p+1} s_{p+1}} \alpha(\sigma) = -\sum_{\sigma \in \xi_1 s_1 \cdots \xi_{p+1} s_{p+1}} \alpha(\sigma). \]
Let G be the standard graph on the hypercube {−1, +1}^{p+1}, i.e., the graph with node set {−1, +1}^{p+1} that connects two nodes by an edge if and only if they differ in a single coordinate. Then observation (24) immediately implies that for any two sign vectors ε and ξ at distance two of each other, the corresponding sums of α coincide; since G is connected, the quantities A_ε := Σ_{σ ∈ ε_1 s_1 ⋯ ε_{p+1} s_{p+1}} α(σ) satisfy A_ε = sign(ε) A_+ for all ε, where A_+ is the sum for the all-(+1) sign vector. On the other hand, the coordinate of Σ_σ α(σ) Φ_{d−1}(σ) = 0 indexed by (s_1, . . . , s_{p+1}, t_0, . . . , t_0) reads Σ_ε sign(ε) A_ε = 0, which forces A_+ = 0 and hence A_ε = 0 for every ε. This completes the induction step.

D Other Proofs
In this appendix, we collect the proofs of various other results.

D.2 MMD k in Fourier Domain
We show that for any right-invariant kernel k(σ, π) = k(σπ^{−1}) on S_d, the maximum mean discrepancy satisfies the identity
\[ \mathrm{MMD}_k(P, Q)^2 = \frac{1}{d!} \sum_{\lambda \vdash d} d_\lambda \,\mathrm{Tr}\Big[ \widehat{k}(\rho^\lambda)\big(\widehat{P}(\rho^\lambda) - \widehat{Q}(\rho^\lambda)\big)\big(\widehat{P}(\rho^\lambda) - \widehat{Q}(\rho^\lambda)\big)^\top \Big]. \]
Let α_1, α_2 be two independent random permutations sampled according to the probability distribution P, and similarly let β_1 and β_2 be independent and sampled according to Q. Then
\[ \mathrm{MMD}_k(P, Q)^2 = \mathbb{E}[k(\alpha_1 \alpha_2^{-1})] - 2\,\mathbb{E}[k(\alpha_1 \beta_1^{-1})] + \mathbb{E}[k(\beta_1 \beta_2^{-1})] = \sum_{\sigma, \pi \in S_d} k(\sigma\pi^{-1})\,(P - Q)(\sigma)\,(P - Q)(\pi). \]
The Fourier inversion formula ensures that
\[ \mathrm{MMD}_k(P, Q)^2 = \frac{1}{d!} \sum_{\lambda \vdash d} d_\lambda \sum_{\sigma, \pi} (P - Q)(\sigma)(P - Q)(\pi)\, \mathrm{Tr}\big[\widehat{k}(\rho^\lambda)\rho^\lambda(\pi)\rho^\lambda(\sigma)^\top\big] = \frac{1}{d!} \sum_{\lambda \vdash d} d_\lambda \,\mathrm{Tr}\Big[\widehat{k}(\rho^\lambda)(\widehat{P} - \widehat{Q})(\rho^\lambda)(\widehat{P} - \widehat{Q})(\rho^\lambda)^\top\Big], \]
where the last equality follows because the irrep ρ^λ is one of Young's orthogonal representations, so that ρ^λ(σ^{−1}) = ρ^λ(σ)^⊤.
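The first step of the argument, expressing MMD² through independent copies, can be checked numerically for the Mallows kernel (the choice of kernel parameter and the random P, Q below are ours, purely for illustration):

```python
import numpy as np
from itertools import permutations

d, nu = 4, 0.5
perms = list(permutations(range(d)))
n = len(perms)

def compose_inv(s, t):
    """Return sigma * tau^{-1} as a tuple (0-indexed permutations)."""
    t_inv = [0] * d
    for i, v in enumerate(t):
        t_inv[v] = i
    return tuple(s[t_inv[i]] for i in range(d))

def inversions(s):
    return sum(s[i] > s[j] for i in range(d) for j in range(i + 1, d))

def mallows(s, t):
    return np.exp(-nu * inversions(compose_inv(s, t)))

K = np.array([[mallows(s, t) for t in perms] for s in perms])

# Two arbitrary distributions P and Q on S_d.
rng = np.random.default_rng(0)
P = rng.random(n); P /= P.sum()
Q = rng.random(n); Q /= Q.sum()

# MMD_k(P,Q)^2 = E k(a1 a2^{-1}) - 2 E k(a1 b1^{-1}) + E k(b1 b2^{-1}) ...
mmd_sq = P @ K @ P - 2 * P @ K @ Q + Q @ K @ Q
# ... equals the quadratic form in the signed measure P - Q.
mmd_sq2 = (P - Q) @ K @ (P - Q)
print(abs(mmd_sq - mmd_sq2) < 1e-12, mmd_sq > 0)
```

Positivity of the result for P ≠ Q is consistent with the Mallows kernel being characteristic.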
In particular, see the paper [16] for a proof of this last identity.