The convex distance inequality for dependent random variables, with applications to the stochastic travelling salesman and other problems

We prove concentration inequalities for general functions of weakly dependent random variables satisfying the Dobrushin condition. In particular, we show Talagrand's convex distance inequality for this type of dependence. We apply our bounds to a version of the stochastic travelling salesman problem, the Steiner tree problem, the total magnetisation of the Curie-Weiss model with external field, and exponential random graph models. Our proof uses the exchangeable pair method for proving concentration inequalities introduced by Chatterjee (2005). Another key ingredient of the proof is a subclass of $(a,b)$-self-bounding functions, introduced by Boucheron, Lugosi and Massart (2009).


Introduction
The theory of concentration of measure for functions of independent random variables has seen major development since the groundbreaking work of Talagrand (1995) (see the books Ledoux (2001), Dubhashi and Panconesi (2009), and Boucheron, Lugosi and Massart (2013)). These inequalities are very useful for obtaining non-asymptotic bounds on various quantities arising from models that are based on collections of independent random variables.
However, for many applications it may be difficult, if not impossible, to describe the model by means of a collection of independent random variables, whereas simpler descriptions based on dependent random variables may be readily available. Such models arise, for example, in statistical physics, where certain distributions can be described as stationary distributions of appropriate Markov chains. Therefore, it is important to have concentration inequalities that are applicable beyond the independent setting.
In this paper, we will prove such inequalities for a certain type of dependence, namely for random variables satisfying the so-called Dobrushin condition (however, we believe that the methods presented here can also be adapted to other settings). This condition is satisfied, in particular, in certain statistical physics models at sufficiently high temperature, and for sampling without replacement.
In order to get sharper bounds, it is natural to impose stronger conditions on the function f . In this article, we will do this by using the general formalism of (a, b)-self-bounding functions, introduced for independent random variables by Boucheron, Lugosi and Massart (2009).
Our main contribution in this paper is the following. We will prove concentration inequalities for a slightly restricted subclass of $(a,b)$-self-bounding functions, which we call $(a,b)$-*-self-bounding (the reason for using the symbol *, instead of a letter, is to make it clear that we have two parameters, $a$ and $b$). We show that our result implies a version of Talagrand's convex distance inequality for dependent random variables satisfying the Dobrushin condition.
Our approach in this paper is based on Stein's method of exchangeable pairs, as introduced in Chatterjee (2007). Recently, other variants of Stein's method, size-biasing and zero-biasing, have been adapted to prove concentration inequalities, see Ghosh and Goldstein (2011), and Goldstein and Islak (2013).
It is important to note that for certain types of dependence, such as uniform permutations (Talagrand (1995)) and Markov chains (Marton (1996), Samson (2000), Marton (2003), and Paulin (2014)), Talagrand's convex distance inequality was shown to hold. However, these approaches do not seem to generalise easily to dependent random variables satisfying the Dobrushin condition.
The rest of this article is organised as follows. In Section 2, we introduce the main definitions used in the article. In Section 3, we present our main results. In Section 4, we discuss applications to the stochastic travelling salesman problem, the Steiner tree problem, the total magnetisation of the Curie-Weiss model with external field, and exponential random graph models. In Section 5, we prove some preliminary results, and in Section 6, we prove our main results. Finally, the Appendix includes a version of Talagrand's convex distance inequality for sampling without replacement.

Preliminaries
We start by introducing some notation. Let $X := (X_1, \dots, X_n)$ be a vector of random variables, where each $X_i$ takes values in a Polish space $\Lambda_i$. Let $\Lambda := \Lambda_1 \times \Lambda_2 \times \dots \times \Lambda_n$, and let $\mathcal{F}$ be the Borel sigma-algebra on $\Lambda$.
We are going to use matrix norms. For an $n \times n$ matrix $A = (a_{ij})_{1\le i,j\le n}$, we denote its operator norms by $\|A\|_1$, $\|A\|_\infty$ and $\|A\|_2$, respectively. Note that, in particular, $\|A\|_1 = \max_{1\le j\le n} \sum_{i=1}^n |a_{ij}|$ and $\|A\|_\infty = \max_{1\le i\le n} \sum_{j=1}^n |a_{ij}|$. Let $g : \Lambda \to \mathbb{R}_+$ be a non-negative function. We will be interested in the concentration properties of $g(X)$, and we will denote its centered version by $f(x) := g(x) - \mathbb{E}(g(X))$.
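As a quick numerical sanity check of these norm conventions (our illustration, not part of the paper), the three norms can be computed with numpy:

```python
import numpy as np

# For an n x n matrix A, ||A||_1 is the maximum absolute column sum,
# ||A||_inf the maximum absolute row sum, and ||A||_2 the spectral norm.
A = np.array([[0.0, 0.3, 0.1],
              [0.2, 0.0, 0.4],
              [0.1, 0.1, 0.0]])

norm_1 = np.abs(A).sum(axis=0).max()    # max column sum -> 0.5
norm_inf = np.abs(A).sum(axis=1).max()  # max row sum    -> 0.6
norm_2 = np.linalg.norm(A, 2)           # largest singular value

# numpy computes the same induced norms directly:
assert np.isclose(norm_1, np.linalg.norm(A, 1))
assert np.isclose(norm_inf, np.linalg.norm(A, np.inf))
# the standard interpolation bound ||A||_2^2 <= ||A||_1 * ||A||_inf:
assert norm_2 ** 2 <= norm_1 * norm_inf + 1e-12
```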
The following definition of self-bounding functions is essentially that of Boucheron, Lugosi and Massart (2009).
Definition 2.1. Let $a, b > 0$. A function $g : \Lambda \to \mathbb{R}_+$ is called $(a,b)$-self-bounding if there exist measurable functions $g_i : \Lambda_{-i} \to \mathbb{R}$, $i = 1, \dots, n$, such that for every $x \in \Lambda$,
(i) $0 \le g(x) - g_i(x_{-i}) \le 1$ for every $1 \le i \le n$, and
(ii) $\sum_{i=1}^n (g(x) - g_i(x_{-i})) \le a\,g(x) + b$.
A function $g : \Lambda \to \mathbb{R}_+$ is called weakly $(a,b)$-self-bounding if there exist measurable functions $g_i : \Lambda_{-i} \to \mathbb{R}$ such that for every $x \in \Lambda$, $\sum_{i=1}^n (g(x) - g_i(x_{-i}))^2 \le a\,g(x) + b$; note that (i) is not required in this case.
Remark 2.2. If g is (a, b)-self-bounding, then it is also weakly (a, b)-self-bounding.
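As a toy illustration (ours, not from the paper), the Hamming weight $g(x) = \sum_i x_i$ of a binary vector is $(1,0)$-self-bounding in the sense of Boucheron, Lugosi and Massart (2009), taking $g_i(x_{-i}) = \sum_{j\ne i} x_j$; this can be checked exhaustively:

```python
from itertools import product

def g(x):
    # Hamming weight of a binary vector
    return sum(x)

def g_i(x, i):
    # g computed with coordinate i removed
    return sum(x) - x[i]

a, b = 1.0, 0.0
for x in product([0, 1], repeat=6):
    diffs = [g(x) - g_i(x, i) for i in range(6)]
    assert all(0 <= d <= 1 for d in diffs)            # condition (i)
    assert sum(diffs) <= a * g(x) + b                 # condition (ii)
    assert sum(d * d for d in diffs) <= a * g(x) + b  # weakly self-bounding
```

Here the differences $g(x) - g_i(x_{-i}) = x_i$ lie in $\{0,1\}$, so both conditions, and the weak variant of Remark 2.2, hold with equality structure $\sum_i x_i = g(x)$.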
Similarly, a function $g : \Lambda \to \mathbb{R}$ is called weakly $(a,b)$-*-self-bounding if there exist functions $\alpha_1, \dots, \alpha_n : \Lambda \to \mathbb{R}_+$ such that condition (ii) above holds; note that, again, condition (i) is not required in this case.
Remark 2.4. For each a, b ≥ 0, the following relations hold.
The reverse implications are false in general.
The following definition allows us to quantify the dependence between the random variables.
Definition 2.5 (Dobrushin's interdependence matrix). Suppose $A = (a_{ij})$ is an $n \times n$ matrix with non-negative entries and zeroes on the diagonal such that for every $i \in [n]$, and every $x, y \in \Lambda$,
$$d_{TV}\big(\mu_i(\cdot|x_{-i}), \mu_i(\cdot|y_{-i})\big) \le \sum_{j\in[n]} a_{ij}\, 1[x_j \ne y_j], \quad (2.2)$$
where $d_{TV}$ denotes the total variational distance (see Section 5.1), $[n] := \{1, \dots, n\}$, and $\mu_i(\cdot|x_{-i}) = \mathbb{P}(X_i \in \cdot\,|\,X_{-i} = x_{-i})$ denotes the conditional distribution of $X_i$ given $X_{-i} = x_{-i}$. We call such an $A$ a Dobrushin interdependence matrix for the random vector $X$ (or, equivalently, for the measure $\mu$).
Remark 2.6. The condition $\|A\|_1 < 1$ is commonly called the Dobrushin condition in the literature. However, some authors use $\|A\|_2 < 1$ or $\|A\|_\infty < 1$ instead. The definition implicitly requires that $\mu_i(\cdot|x_{-i})$ exists for every $x_{-i}$. This may only be true in some of our applications in an almost sure sense. However, because we are going to assume that our random variables take values in a Polish space, we may use regular conditional probabilities, and change $\mu$ on a set of zero probability such that (2.2) becomes true everywhere, not just in an almost sure sense (see Faden (1985) for more details on the existence of regular conditional probabilities).
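For sampling without replacement, a Dobrushin interdependence matrix can be computed explicitly (a sketch we add for illustration): when $n$ values are sampled without replacement from $[N]$, the conditional law of $X_i$ given $X_{-i} = x_{-i}$ is uniform on the $N-n+1$ values not appearing in $x_{-i}$, and changing one other coordinate moves this law in total variation by exactly $1/(N-n+1)$, so one may take $a_{ij} = 1/(N-n+1)$ for $i \ne j$:

```python
from fractions import Fraction

def tv_uniform(support_a, support_b):
    """Total variation distance between uniform laws on two finite sets."""
    pts = set(support_a) | set(support_b)
    pa = Fraction(1, len(support_a))
    pb = Fraction(1, len(support_b))
    return sum(abs((pa if p in support_a else 0) - (pb if p in support_b else 0))
               for p in pts) / 2

N, n = 10, 4
# conditional laws given x_{-i} = (1, 2, 3) versus y_{-i} = (1, 2, 4):
law_x = [v for v in range(1, N + 1) if v not in (1, 2, 3)]
law_y = [v for v in range(1, N + 1) if v not in (1, 2, 4)]
assert tv_uniform(law_x, law_y) == Fraction(1, N - n + 1)
# the row sums are then (n-1)/(N-n+1), which is < 1 roughly when n < N/2:
assert Fraction(n - 1, N - n + 1) < 1
```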

Main results
In this section, we state our main results regarding concentration for $(a,b)$-*-self-bounding functions and Talagrand's convex distance inequality. The results apply to weakly dependent random variables satisfying the Dobrushin condition.

A new concentration inequality for $(a,b)$-*-self-bounding functions
Our main result is a bound on the moment generating function (mgf) of functions of random variables satisfying the Dobrushin condition.
Theorem 3.1. Let $X = (X_1, \dots, X_n)$ be a vector of random variables taking values in $\Lambda$. Let $A$ be a Dobrushin interdependence matrix for $X$, and suppose that $\|A\|_1 < 1$ and $\|A\|_\infty \le 1$. Let $g : \Lambda \to \mathbb{R}_+$ be a non-negative measurable function such that $g(X)$ has finite mean, denoted by $\mathbb{E}(g)$, and let $a, b \ge 0$. If $g$ is $(a,b)$-*-self-bounding, then the moment generating function of $g(X)$ satisfies the bound (3.1). If $g$ is weakly $(a,b)$-*-self-bounding, then, in addition, for every $\theta$ in the admissible range, the bound (3.2) holds.
The proof is deferred to Section 6. As a corollary, we obtain concentration inequalities. To state them, we will use a constant defined as follows: let $a_c$ be the unique positive solution of an explicit equation; note that $0.285 < a_c < 0.286$.
Corollary 3.2. Under the conditions of Theorem 3.1, we have the following.

The convex distance inequality for dependent random variables
Recently, Talagrand's convex distance inequality was proven using the weakly self-bounding property in Section 2 of Boucheron, Lugosi and Massart (2009) (the original proof in Talagrand (1995) was based on mathematical induction). We are going to use similar ideas to prove a version of Talagrand's convex distance inequality based on Theorem 3.1, and hence applicable to dependent random variables satisfying the Dobrushin condition. The result is stated in terms of Talagrand's convex distance, which is defined as follows. For $c \in \mathbb{R}_+^n$ and $x, y \in \Lambda$, we define $d_c(x, y) := \sum_{i=1}^n c_i 1[x_i \ne y_i]$. For a point $x \in \Lambda$ and a set $S \subset \Lambda$, we let $d_c(x, S) := \min_{y \in S} d_c(x, y)$ and $d_T(x, S) := \sup_{c \in \mathbb{R}_+^n,\, \|c\|_2 \le 1} d_c(x, S)$, which we call Talagrand's convex distance between a point $x$ and a set $S$.
Theorem 3.3. Let $X := (X_1, \dots, X_n)$ be a vector of random variables, taking values in a Polish space $\Lambda = \Lambda_1 \times \dots \times \Lambda_n$, equipped with the Borel $\sigma$-algebra $\mathcal{F}$. Let $\mu$ be the probability measure on $\Lambda$ induced by $X$. Let $A$ be a Dobrushin interdependence matrix for $X$, and suppose that $\|A\|_1 < 1$ and $\|A\|_\infty \le 1$. Then for any $S \in \mathcal{F}$,
$$\mathbb{P}(X \in S) \cdot \mathbb{E}\left(\exp\left(\frac{(1 - \|A\|_1)\, d_T(X, S)^2}{26.1}\right)\right) \le 1. \quad (3.5)$$
Remark 3.4. Inequality (3.5) has the same form as Talagrand's original convex distance inequality in the independent case, where it holds with the constant $(1 - \|A\|_1)/26.1$ replaced by $1/4$. Our bound takes into account the strength of dependence between the random variables. The following corollary of the above result generalises the so-called "method of non-uniformly bounded differences" to dependent random variables satisfying the Dobrushin condition.
Corollary 3.5. Let $X = (X_1, \dots, X_n)$ be a vector of random variables, taking values in $\Lambda$, equipped with the Borel $\sigma$-algebra $\mathcal{F}$. Let $\mu$ be the probability measure on $\Lambda$ induced by $X$. Let $A$ be a Dobrushin interdependence matrix for $X$, and suppose that $\|A\|_1 < 1$ and $\|A\|_\infty \le 1$. Let $g : \Lambda \to \mathbb{R}$ be a function satisfying, for some positive functions $c_1, \dots, c_n : \Lambda \to \mathbb{R}_+$,
$$g(x) - g(y) \le \sum_{i:\, x_i \ne y_i} c_i(x) \quad \text{for every } x = (x_1, \dots, x_n),\ y = (y_1, \dots, y_n) \text{ in } \Lambda,$$
and $\sum_{i=1}^n c_i(x)^2 \le C$ uniformly for every $x$ in $\Lambda$. Then for any $t \ge 0$,
$$\mathbb{P}\big(|g(X) - M(g)| \ge t\big) \le 4\exp\left(-\frac{t^2 (1 - \|A\|_1)}{26.1\, C}\right),$$
where $M(g)$ denotes the median of $g(X)$ (if the median is not unique, then the result holds for all of them).
Proof. The proof is along the same lines as the proof of Lemma 6.2.1 on page 122 of Steele (1997), except that the constant $4$ is replaced by $26.1/(1 - \|A\|_1)$.

Applications
In this section, we apply our results to a variant of the stochastic travelling salesman problem, Steiner trees, the Curie-Weiss model, and exponential random graphs.

Stochastic travelling salesman problem
One important and well-studied problem in combinatorial optimisation is the travelling salesman problem (TSP). In the simplest, and most studied, case we are given $n$ points $x_1, \dots, x_n$ in the unit square $[0,1]^2$, and we are required to find the shortest tour, that is, to find the permutation $\sigma \in S_n$ ($S_n$ denoting the symmetric group) that minimises $\sum_{i=1}^{n} |x_{\sigma(i)} - x_{\sigma(i+1)}|$ (with $\sigma(n+1) := \sigma(1)$), where $|x - y|$ denotes the Euclidean distance between $x$ and $y$. Let us denote the length of the minimal tour by $T(x_1, \dots, x_n)$. There has been much effort to find efficient algorithms to compute the minimal tour (in general, this is a difficult, NP-complete problem, but there are fast algorithms that find a tour that is at most a fixed constant times longer than the optimal tour; see Applegate et al. (2011) for a recent book on this topic).
From a probabilistic point of view, it is of interest to look at the concentration properties of $T(X_1, \dots, X_n)$, where $X_1, \dots, X_n$ is a random sample from $[0,1]^2$. One of the classical applications of Talagrand's convex distance inequality is to show that, if $X_1, \dots, X_n$ are i.i.d. uniformly distributed in $[0,1]^2$, then $T(X_1, \dots, X_n)$ is very sharply concentrated around its median (or, equivalently, its expected value), with typical deviations of order 1. We are going to study a modified version of the travelling salesman problem. Let $A := \{a_1, \dots, a_N\}$ be a fixed set of distinct points in $[0,1]^2$. Let $L(x, y) : A^2 \to \mathbb{R}$ be the cost function, satisfying, for some constant $C$,
$$|x - y| \le L(x, y) \le C|x - y| \quad \text{for every } x, y \in A, \quad (4.1)$$
where $|x - y|$ denotes the Euclidean distance of $x$ and $y$. Note that the cost function does not need to be a metric, and we do not even assume that it is symmetric. A non-symmetric cost function may be used to model the time taken for driving between two locations in a city that are at different elevations, since going uphill can take longer than going downhill. For any set of distinct points $\{x_1, \dots, x_n\} \subset A$, we let $T(x_1, \dots, x_n)$ be the length of the shortest tour through all the points, that is, the minimum of the sum $L(x_{\sigma(1)}, x_{\sigma(2)}) + \dots + L(x_{\sigma(n)}, x_{\sigma(1)})$ over $\sigma \in S_n$. Since $T$ is invariant under permutations of the points, we will also use the notation $T(\{x_1, \dots, x_n\})$.
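To make the definition of $T$ concrete, here is a brute-force computation of the minimal tour for a small instance (our illustration; it enumerates all permutations, so it is feasible only for small $n$), with an asymmetric cost $L$ satisfying $|x-y| \le L(x,y) \le C|x-y|$ for $C = 1.5$:

```python
from itertools import permutations
from math import dist

def tour_cost(points, order, L):
    n = len(order)
    return sum(L(points[order[i]], points[order[(i + 1) % n]]) for i in range(n))

def T(points, L):
    """Length of the shortest tour through all points under cost L."""
    first, rest = 0, range(1, len(points))
    return min(tour_cost(points, (first,) + p, L) for p in permutations(rest))

# Asymmetric cost: going "uphill" (increasing y) is 50% more expensive.
def L(x, y):
    base = dist(x, y)
    return 1.5 * base if y[1] > x[1] else base  # |x-y| <= L(x,y) <= 1.5 |x-y|

pts = [(0.1, 0.2), (0.8, 0.3), (0.5, 0.9), (0.2, 0.7), (0.9, 0.8)]
best = T(pts, L)
euclid = T(pts, dist)
# T under L is sandwiched between the Euclidean optimum and C times it:
assert euclid <= best <= 1.5 * euclid
```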
Assume that a set of $n$ distinct points is chosen from $A$ according to some distribution $\mu$ on the subsets of size $n$ of $A$. Let $r_{n,1}(\mu)$ and $r_{n,2}(\mu)$ be the coefficients defined above, and define the inhomogeneity coefficient of this distribution $\mu$ as
$$\rho_n(\mu) := n\,\big(r_{n,1}(\mu) + (N - n)\cdot r_{n,2}(\mu)\big). \quad (4.2)$$
This coefficient is related to the distance of the distribution µ from the uniform distribution on all sets of size n, corresponding to sampling without replacement.
The following theorem is the main result of this section.
Theorem 4.1 (Stochastic TSP for random subsets). Let X be a random subset of size n of A, chosen according to a distribution µ, with inhomogeneity coefficient ρ n (µ) < 1. Then for any t ≥ 0, where M(T ) denotes the median of T .
Remark 4.2. The inequality has the same form as the original result in the independent case (there, the bound is of the form $4\exp(-t^2/64)$).
Example 4.3. We now give a simple example of a distribution $\mu$ on $A$, which we call weighted sampling without replacement. Let $p$ be a probability distribution on $[N]$ such that $p(i)$ is strictly positive for every $i \in [N]$. Let us choose a random subset $X \subset A$ as follows. Initially, $X$ is empty. First, we pick an index from $[N]$ according to $p$, and put the point in $A$ corresponding to this index into $X$. Then, we pick another index from $[N]$, according to $p$ conditioned on not choosing the first index. We obtain $X$ by iterating this procedure $n$ times in total. If we have picked the indices $I_1, \dots, I_k \in [N]$ in the first $k$ steps, then $\mathbb{P}(\text{the } (k+1)\text{-th index is } i) = p(i) / \sum_{j \in [N]\setminus\{I_1, \dots, I_k\}} p(j)$ for $0 \le k < n$. Based on this, for a set of $n$ distinct points $\{a_{i_1}, \dots, a_{i_n}\} \subset A$, we define $\mu(\{a_{i_1}, \dots, a_{i_n}\})$ by averaging over all the possible ways the random variables $I_1, \dots, I_n$ can take the values $i_1, \dots, i_n$, that is, with the summation over all $n!$ enumerations $j_1, \dots, j_n$ of $i_1, \dots, i_n$. Note that this sampling scheme can be equivalently formulated using independent exponentially distributed random variables with parameters $p_1, \dots, p_N$ (exponential clocks), where we choose the points corresponding to the indices of the smallest $n$ of these exponential variables (the first $n$ clocks that ring). Let $p_{\max} := \max_{i\in[N]} p(i)$ and $p_{\min} := \min_{i\in[N]} p(i)$; then an elementary computation shows that, for the weighted sampling without replacement scheme, $\rho_n(\mu)$ is smaller than 1 if $n < N/\big(1 + p_{\max}/p_{\min} + (p_{\max}/p_{\min})^2/2\big)$. Sampling without replacement corresponds to the case when $p(i) = 1/N$ for every $i \in [N]$. In this case, the condition of our theorem, $\rho_n(\mu) < 1$, is satisfied if $n < N/2$. In this particular case, using a theorem of Talagrand, we can show that the convex distance inequality holds for any $n \le N$, which implies that Theorem 4.1 also holds for any $n \le N$.
See the Appendix for more details.
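The equivalence between sequential weighted sampling and the exponential-clocks formulation in Example 4.3 can be checked empirically (a simulation sketch we add; the two subset distributions should agree up to Monte Carlo error):

```python
import random
from collections import Counter

random.seed(0)
N, n, trials = 5, 2, 100_000
p = [0.1, 0.1, 0.2, 0.3, 0.3]

def sequential():
    # pick n indices one at a time, each proportional to p on the remaining set
    remaining, chosen = list(range(N)), []
    for _ in range(n):
        i = random.choices(remaining, weights=[p[j] for j in remaining])[0]
        remaining.remove(i)
        chosen.append(i)
    return frozenset(chosen)

def clocks():
    # independent exponential "clocks" with rates p_i; keep the first n to ring
    times = [random.expovariate(p[i]) for i in range(N)]
    return frozenset(sorted(range(N), key=lambda i: times[i])[:n])

c1 = Counter(sequential() for _ in range(trials))
c2 = Counter(clocks() for _ in range(trials))
for s in c1:
    assert abs(c1[s] - c2[s]) / trials < 0.015  # agree up to Monte Carlo error
```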
Note that it does not seem to be possible to deduce Theorem 4.1 from the results of Samson (2000). In the special case when $X_1, \dots, X_n$ are $n$ samples taken without replacement out of $N$ possibilities, the total variational distance of the distributions $\mathcal{L}(X_l | X_1 = x_1, \dots, X_k = x_k)$ and $\mathcal{L}(X_l | X_1 = x_1, \dots, X_k = y_k)$ is at least $1/N$. This means that the above-diagonal elements of the mixing matrix are greater than $1/N$, and the matrix created by taking the square root of every element has $L^2$ norm of order $1 + n/\sqrt{N}$. Therefore we would need $n$ to be $O(\sqrt{N})$ to obtain concentration results that are only a constant factor worse than in the independent case, whereas with our method, this is true for any $n < N/2$.
Now we turn to the proof of Theorem 4.1. The proof consists of two parts. Firstly, we compute the coefficients of the Dobrushin interdependence matrix and verify the Dobrushin condition. Secondly, we check that the function $T$ satisfies the conditions of Corollary 3.5.
The Dobrushin interdependence matrix is estimated in the following Lemma.
Lemma 4.4. Let $\mu$ be a distribution on the subsets of size $n$ of $A$. Let $X_1, \dots, X_n$ be random variables taking values in $A$, distributed as a uniformly random ordering of a subset drawn from $\mu$. Then there is a Dobrushin interdependence matrix for $X_1, \dots, X_n$ such that
Proof. By the definition of the Dobrushin interdependence matrix, using the triangle inequality for the total variational distance, we can set $a_{n(n-1)}$ as a supremum of a sum of total variational distance terms. This sum has two types of terms: the first type is when $d$ equals $b$ or $c$, and the second type is when $d$ equals something else in $A \setminus B$. Terms of the first type are less than or equal to $r_{n,1}(\mu)$, and terms of the second type are bounded by $r_{n,2}(\mu)$; thus $a_{n(n-1)} \le \rho_n(\mu)/n$. Because of the symmetry of the distribution of $X_1, \dots, X_n$, the same bound holds for every $a_{ij}$, and thus the claim of the lemma follows.
The following lemma will be used to verify the properties of the function T .
Proposition 4.5 (Proposition 11.1 of Dubhashi and Panconesi (2009)). There is a constant $c > 0$ such that, for any set of points $x_1, \dots, x_n \in [0,1]^2$, there is a permutation $\sigma \in S_n$ satisfying (4.5). That is, there is a tour going through all the points such that the sum of the squares of the lengths of all edges in the tour is bounded by an absolute constant $c$. By the argument outlined in Problem 11.6 of Dubhashi and Panconesi (2009), the above holds with $c = 4$.
The following lemma summarises the properties of the function T required for our proof.
Proof. For any $x_1, \dots, x_n \in A$, let $\hat\sigma$ be the permutation in $S_n$ that satisfies (4.5). If there are several such permutations, we choose the one that is smallest in the lexicographic ordering of permutations, ranging from $(1, 2, \dots, n)$ to $(n, n-1, \dots, 1)$.
with $i - 1$ and $i + 1$ taken modulo $n$. With this choice, inequality (4.6) is proven on page 125 of Steele (1997); see also page 144 of Dubhashi and Panconesi (2009). Inequality (4.7) follows from Proposition 4.5 and the condition $|x - y| \le L(x, y) \le C|x - y|$.
Now we are ready to prove our concentration result.

Steiner trees
Suppose that H = {x 1 , . . . , x n } is a set of n distinct points on the unit square [0, 1] 2 . Then the minimal spanning tree (MST) of H is a connected graph with vertex set H such that the sum of the edge length is minimal (in Euclidean distance).
The minimal Steiner tree of $H$ is the shortest connected graph whose vertex set contains $H$, where additional vertices are allowed. By definition, the sum of its edge lengths is less than or equal to the sum of the edge lengths of the minimal spanning tree, since we may add vertices and edges to the graph (an example where they differ is the equilateral triangle, where the minimal Steiner tree adds the centre of mass of the triangle to the graph, thus reducing the total edge length). We denote the sum of the edge lengths of the minimal Steiner tree by $S(x_1, \dots, x_n)$. Note that this is invariant under permutations of $x_1, \dots, x_n$, so we can equivalently denote it by $S(\{x_1, \dots, x_n\})$. This is a quantity of great practical importance, since it expresses the minimal amount of interconnect needed between the points $x_1, \dots, x_n$, and it has found numerous applications in circuit and network design. Hwang, Richards and Winter (1992) is a popular book on this subject.
From a probabilistic perspective, a problem of interest is to quantify the behaviour of S(X 1 , . . . , X n ), where X 1 , . . . , X n are random variables that are i.i.d. uniformly distributed on [0, 1] 2 . Steele (1997) has proven that the total length of the minimal Steiner tree, S(X 1 , . . . , X n ), is sharply concentrated around its median, with typical deviations of order 1.
Here we study a modified version of this problem, when we choose a random subset of size n from a set of points A := {a 1 , . . . , a N } in [0, 1] 2 . Let µ be a probability measure on such subsets, and denote its inhomogeneity coefficient defined in (4.2) by ρ n (µ). Using our version of Talagrand's convex distance inequality for dependent random variables, we obtain the following concentration bound.
Theorem 4.7 (Minimal Steiner tree for random subsets). Let X be a random subset of size n of A, chosen according to a distribution µ, with inhomogeneity coefficient ρ n (µ) < 1. Then for any t ≥ 0, where M(S) denotes the median of S.
The proof consists, again, of two parts. First, we bound the Dobrushin interdependence matrix, then show that the function S satisfies the conditions of our version of the method of non-uniformly bounded differences for dependent random variables (Corollary 3.5). The first part is proven in Lemma 4.4. For the second part, we are going to use the following lemma.
Lemma 4.8 (Steele (1997), page 107, equation (5.26)). Let us denote the edge lengths of the minimal spanning tree of $x_1, \dots, x_n \in [0,1]^2$ by $e_1, \dots, e_{n-1}$. Then for some universal constant $c$,
$$\sum_{i=1}^{n-1} e_i^2 \le c;$$
in particular, we can choose $c = 410$ (see page 108 of Steele (1997)). If there are multiple minimal spanning trees, then this holds for each of them.
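Lemma 4.8 is easy to probe numerically (our illustration; this checks the bound with $c = 410$ on random instances, it does not prove it; in practice the sum of squares is far below the constant):

```python
import random
from math import dist

def mst_edges(points):
    """Prim's algorithm; returns the edge lengths of a minimal spanning tree."""
    n = len(points)
    best = {i: dist(points[0], points[i]) for i in range(1, n)}  # cheapest link to tree
    edges = []
    while best:
        j = min(best, key=best.get)   # attach the closest outside vertex
        edges.append(best.pop(j))
        for k in best:
            best[k] = min(best[k], dist(points[j], points[k]))
    return edges

random.seed(1)
for n in (10, 100, 1000):
    pts = [(random.random(), random.random()) for _ in range(n)]
    edges = mst_edges(pts)
    assert len(edges) == n - 1
    assert sum(e * e for e in edges) <= 410  # Lemma 4.8 with c = 410
```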
The conditions on S are verified in the following lemma.
Lemma 4.9. For any $x_1, \dots, x_n \in [0,1]^2$, write $x = (x_1, \dots, x_n)$, and for $1 \le i \le n$, define $\alpha_i(x)$ as twice the total length of the edges incident to $x_i$ in the minimal spanning tree of $x_1, \dots, x_n$. Then for any $x, y \in ([0,1]^2)^n$, we have $S(x) - S(y) \le \sum_{i:\, x_i \ne y_i} \alpha_i(x)$. Moreover, for any $x$, $\sum_{i=1}^n \alpha_i(x)^2 \le 19680$.
Proof. The first claim is proven on pages 123-124 of Steele (1997). For the second claim, first notice that the vertices of a Euclidean minimal spanning tree can have degree at most 6. Now for any 6 reals $z_1, \dots, z_6$, we have $(z_1 + \dots + z_6)^2 \le 6(z_1^2 + \dots + z_6^2)$, so $\alpha_i(x)^2 \le 24 \sum_{e \ni x_i} e^2$; since every edge is incident to two vertices, it is counted twice, and thus by Lemma 4.8, $\sum_{i=1}^n \alpha_i(x)^2 \le 48 \sum_i e_i^2 \le 48 \cdot 410 = 19680$.
Proof of Theorem 4.7. Using Lemma 4.4 and Lemma 4.9, the statement of the theorem follows by applying Corollary 3.5 with $\|A\|_1 = \|A\|_\infty = \rho_n(\mu)$ and $C = 19680$.

Curie-Weiss model
The Curie-Weiss model of ferromagnetic interaction is the following. Consider the state space $\Lambda = \{-1, 1\}^n$, and denote an element of the state space (a configuration) by $\sigma = (\sigma_1, \dots, \sigma_n)$. Define the Hamiltonian of the system as
$$H(\sigma) := \frac{1}{n}\sum_{1\le i<j\le n} \sigma_i\sigma_j + h\sum_{i=1}^n \sigma_i,$$
and the probability density
$$\mu(\sigma) := \frac{\exp(\beta H(\sigma))}{Z(\beta, h)},$$
where $Z(\beta, h) := \sum_{\sigma\in\Lambda} \exp(\beta H(\sigma))$ is the normalising constant. The following proposition gives bounds on the Dobrushin interdependence matrix for this model.
Proof. We will now calculate the Dobrushin interdependence matrix for this system. Suppose first that $h = 0$. Let $x$ and $y$ be two configurations; we want to bound $d_{TV}(\mu_i(\cdot|x_{-i}), \mu_i(\cdot|y_{-i}))$. Since $\sigma_i$ can only take the values $1$ or $-1$, the total variation distance is simply $|\mathbb{P}(\sigma_i = 1 | x_{-i}) - \mathbb{P}(\sigma_i = 1 | y_{-i})|$. Now, writing $m_i(x) := \frac{1}{n}\sum_{j: j\ne i} x_j$ and $m_i(y) := \frac{1}{n}\sum_{j: j\ne i} y_j$, and denoting
$$r(t) := \frac{e^{t}}{e^{t} + e^{-t}}, \quad (4.10)$$
we can write $\mathbb{P}(\sigma_i = 1 | x_{-i}) = r(\beta m_i(x))$. Now it is easy to check that $|r'(t)| \le \frac{1}{2}$, and changing one spin in $x$ can change $m_i$ by at most $2/n$. From this, we obtain a Dobrushin interdependence matrix $A$ with $a_{ij} = \beta/n$ for $i \ne j$. For this $A$, it is easy to see that $\|A\|_1 = \|A\|_\infty = \beta(n-1)/n < \beta$. Thus in the high temperature case $0 \le \beta < 1$, we can apply Corollary 3.2 to obtain concentration inequalities.
In the case $h \ne 0$, writing out the conditional probabilities, one can show that in the above argument, $r(t)$ in (4.10) is replaced by $r(t, h) := \frac{\exp(t+h)}{\exp(t+h) + \exp(-t-h)}$. This function still satisfies $\left|\frac{\partial}{\partial t} r(t, h)\right| \le 1/2$, and thus $A$ as defined above is a Dobrushin interdependence matrix in this case as well.
Now we are going to show a concentration inequality for the average magnetisation of the Curie-Weiss model. Let us denote the average magnetisation by $m := \frac{1}{n}\sum_{i=1}^n \sigma_i$. We have the following proposition. Proposition 4.11. For the above model, when $0 \le \beta < 1$ and $h \ge 0$, we have (4.12). Here $n_-(\sigma)$, the number of spins equal to $-1$, is a sum of non-negative variables, so one can easily see that it is $(1, 0)$-*-self-bounding, and thus, by Theorem 3.1, we have for every $t \ge 0$, (4.14). In order to apply this bound, we will need to estimate $\mathbb{E}(n_-(\sigma)) = n(1 - \mathbb{E}(m))/2$. For this, we are going to use Proposition 1.3 of Chatterjee (2007), stating a tail bound for any $t \ge 0$; the same inequality holds for the lower tail as well, but with $m(\sigma) - m_*$ replaced by $m_* - m(\sigma)$. From this, using integration by parts, we obtain a bound on $\mathbb{E}(m)$. Now the results follow by combining this with equations (4.11), (4.12), (4.13) and (4.14).
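The concentration of the magnetisation in the high temperature regime can be observed in simulation. The following Glauber/Gibbs sampler is our sketch, under our reading of the model, with conditional spin probability $r(\beta(m_i + h))$ as in (4.10):

```python
import random
from math import exp

random.seed(42)
n, beta, h = 200, 0.5, 0.0  # high temperature: 0 <= beta < 1

def r(t):
    # conditional probability of a +1 spin: r(t) = e^t / (e^t + e^{-t})
    return 1.0 / (1.0 + exp(-2.0 * t))

def gibbs_sweep(sigma):
    total = sum(sigma)
    for i in range(n):
        m_i = (total - sigma[i]) / n  # (1/n) * sum_{j != i} sigma_j
        new = 1 if random.random() < r(beta * (m_i + h)) else -1
        total += new - sigma[i]
        sigma[i] = new

sigma = [random.choice([-1, 1]) for _ in range(n)]
for _ in range(200):  # burn-in
    gibbs_sweep(sigma)

samples = []
for _ in range(500):
    gibbs_sweep(sigma)
    samples.append(sum(sigma) / n)  # average magnetisation m

# at h = 0 and beta < 1, m concentrates around 0 with O(1/sqrt(n)) fluctuations
mean_m = sum(samples) / len(samples)
assert abs(mean_m) < 0.2
```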

Exponential random graphs
Exponential random graph models are increasingly popular for modelling network data (see Chatterjee and Diaconis (2013)). For a graph with $n$ vertices, the edges are distributed according to a probability distribution of the form (4.16), where $\beta = (\beta_1, \dots, \beta_k)$ is a vector of real parameters, $T_1, \dots, T_k$ are functions on the space of graphs ($T_1$ is usually the number of edges, while the rest can be the numbers of triangles, cycles, etc.), and $\psi(\beta)$ is the normalising constant.
The simplest special case of this model is the Erdős-Rényi graph. Let $E$ be the number of edges of the graph, and let $0 < p < 1$ be a parameter; in this case, the edges are i.i.d. random variables distributed according to the Bernoulli distribution with parameter $p$. A more complex model, which was analysed in Chatterjee and Diaconis (2013), has a distribution in which $E$ denotes the number of edges, $\Delta$ denotes the number of triangles, and $\psi_n(\beta_1, \beta_2)$ is the normalising constant. Note that in this case, the edges are no longer independent, because the number of triangles introduces a form of dependence into the model. In general, for any model of the type (4.16), there is a certain set $D \subset \mathbb{R}^k$ of non-zero volume such that when $\beta \in D$, the edges, as random variables, satisfy the Dobrushin condition (that is, there is an interdependence matrix $A$ with $\|A\|_1 < 1$ and $\|A\|_\infty < 1$). This fact can be shown by a simple continuity argument, since the random variables are independent when $\beta = 0$. The set $D$ is analogous to the high-temperature phase of statistical physics models.
The following theorem, based on our new concentration inequality for (a, b)-*-self-bounding functions, establishes concentration inequalities for subgraph counts in exponential random graph models in the high temperature phase.
Theorem 4.13. Let $S$ be a fixed graph with $n_S$ vertices and $e_S$ edges, and let $N_S$ denote the number of copies of $S$ in our exponential random graph. Then for any $t \ge 0$, the tail bound (4.18) holds.
Remark 4.14. By the number of copies of $S$, we mean the number of subsets of size $n_S$ of the set of $n$ vertices of our graph such that the corresponding subgraph contains $S$. A similar concentration inequality can be shown to hold for the maximal degree among all the vertices (see Example 6.13 of Boucheron, Lugosi and Massart (2013)), which can be shown to be $(1,0)$-*-self-bounding. Our results are sharper than what we could obtain using Theorem 4.3 of Chatterjee (2005) (McDiarmid's bounded differences inequality for dependent random variables satisfying the Dobrushin condition).
Proof of Theorem 4.13. The proof is based on the *-self-bounding property of $N_S$. If we add an edge to $X$, then $N_S$ increases or stays the same, while if we erase an edge from $X$, then $N_S$ decreases or stays the same. For $x \in \Lambda$, $1 \le i < j \le n$, let $\alpha_{i,j}(x)$ be the number of copies of $S$ in $x$ that contain the edge $(i, j)$. Then $0 \le \alpha_{i,j}(x) \le \binom{n-2}{n_S-2}$, and we can see that for any $x, y \in \Lambda$, Moreover, since $S$ contains $e_S$ edges, we have This means that $N_S(x)/\binom{n-2}{n_S-2}$ is $(e_S, 0)$-*-self-bounding, and the result follows by Corollary 3.2.
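The quantities in this proof can be verified exhaustively on a small random graph (our sanity check): taking $S$ to be a triangle ($n_S = 3$, $e_S = 3$), $\alpha_{i,j}(x)$ never exceeds $\binom{n-2}{n_S-2} = n-2$, and the $\alpha_{i,j}$ sum to $e_S N_S = 3 N_S$:

```python
from itertools import combinations
import random

# A small Erdos-Renyi graph; S is a triangle, so n_S = 3 and e_S = 3.
random.seed(7)
n = 8
vertices = range(n)
edges = {e: random.random() < 0.5 for e in combinations(vertices, 2)}

def n_triangles(g):
    return sum(all(g[tuple(sorted(pair))] for pair in combinations(t, 2))
               for t in combinations(vertices, 3))

def alpha(g, e):
    """Number of triangles containing the edge e (zero if e itself is absent)."""
    i, j = e
    return sum(g[e] and g[tuple(sorted((i, k)))] and g[tuple(sorted((j, k)))]
               for k in vertices if k not in e)

N_S = n_triangles(edges)
alphas = {e: alpha(edges, e) for e in edges}
assert all(a <= n - 2 for a in alphas.values())  # alpha_{i,j} <= binom(n-2, 1) = n-2
assert sum(alphas.values()) == 3 * N_S           # each triangle contributes e_S = 3 edges
```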

Preliminary results
In this section, we will prove some preliminary results needed for proving our main results from Section 3. First, we prove a lemma about the total variational distance. After this, we review the basics of proving concentration inequalities via Stein's method of exchangeable pairs. Finally, we prove some lemmas about bounding moment generating functions.

Basic properties of the total variational distance
The total variational distance of two probability distributions $\mu_1$ and $\mu_2$ defined on the same measurable space $(\mathcal{X}, \mathcal{F})$ is defined as
$$d_{TV}(\mu_1, \mu_2) = \sup_{S \in \mathcal{F}} |\mu_1(S) - \mu_2(S)|. \quad (5.1)$$
The following lemma gives a coupling related to the total variational distance that we are going to use.
Lemma 5.1. Let $\mu_1$ and $\mu_2$ be two probability measures on a Polish space $(\mathcal{X}, \mathcal{F})$. Then for any fixed $q$ with $d_{TV}(\mu_1, \mu_2) \le q \le 1$, we can define independent random variables $\chi, B, C, D$ such that $\chi$ has Bernoulli distribution with parameter $q$, and the random variables $(1-\chi)B + \chi C$ and $(1-\chi)B + \chi D$ have distributions $\mu_1$ and $\mu_2$, respectively.
Proof. The proof is similar to Problem 7.11.16 of Grimmett and Stirzaker (2001). We define the measure $\mu_{12}$ on $(\mathcal{X}, \mathcal{F})$ as $\mu_{12}(S) = \frac{\mu_1(S) + \mu_2(S)}{2}$.
The densities of the random variables $B$, $C$ and $D$ with respect to $\mu_{12}$ can be defined in terms of the densities $f := d\mu_1/d\mu_{12}$ and $g := d\mu_2/d\mu_{12}$ as follows. Let us define $h : \mathcal{X} \to \mathbb{R}$ as $h(x) := \min(f(x), g(x))$, and let $p := d_{TV}(\mu_1, \mu_2)$. For any $S \in \mathcal{F}$, we define the measures $\mu_B$, $\mu_C$ and $\mu_D$ accordingly, and we let $\chi \sim \mathrm{Bernoulli}(q)$, $B \sim \mu_B$, $C \sim \mu_C$, $D \sim \mu_D$ be independent random variables. With this choice, it is straightforward to check that the conditions of the lemma are satisfied.
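For discrete $\mu_1, \mu_2$, this construction can be implemented directly (our sketch, taking $q = p = d_{TV}(\mu_1, \mu_2)$): $B$ is drawn from the normalised common part $h = \min(f, g)$, and $C$, $D$ from the normalised residuals, so that, as one can check, $(1-\chi)B + \chi C \sim \mu_1$ and $(1-\chi)B + \chi D \sim \mu_2$:

```python
import random
from collections import Counter

random.seed(3)
mu1 = {"a": 0.5, "b": 0.3, "c": 0.2}
mu2 = {"a": 0.2, "b": 0.3, "c": 0.5}

support = sorted(set(mu1) | set(mu2))
p = sum(abs(mu1.get(s, 0.0) - mu2.get(s, 0.0)) for s in support) / 2  # d_TV
common = {s: min(mu1.get(s, 0.0), mu2.get(s, 0.0)) for s in support}  # overlap part
res1 = {s: mu1[s] - common[s] for s in mu1 if mu1[s] > common[s]}
res2 = {s: mu2[s] - common[s] for s in mu2 if mu2[s] > common[s]}

def sample(weights):
    keys = list(weights)
    return random.choices(keys, weights=[weights[k] for k in keys])[0]

def coupled_pair():
    if random.random() >= p:              # chi = 0: draw the shared value B
        b = sample(common)
        return b, b
    return sample(res1), sample(res2)     # chi = 1: independent residual draws C, D

trials = 100_000
draws = [coupled_pair() for _ in range(trials)]
freq1 = Counter(x for x, _ in draws)
assert all(abs(freq1[s] / trials - mu1[s]) < 0.01 for s in mu1)
# the residual parts have disjoint supports, so P(X1 != X2) = d_TV here
mismatch = sum(x != y for x, y in draws) / trials
assert abs(mismatch - p) < 0.01
```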

Concentration by Stein's method of exchangeable pairs
Let $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is a Polish space, and let $X$ be a random variable taking values in $\mathcal{X}$. We are interested in the concentration properties of $f(X)$. Suppose that $\mathbb{E}(f(X)) = 0$. Let $(X, X')$ be an exchangeable pair, and let $m(\theta) := \mathbb{E}(e^{\theta f(X)})$. Suppose that $F(x, y) : \mathcal{X}^2 \to \mathbb{R}$ is an antisymmetric function satisfying
$$\mathbb{E}(F(X, X') \mid X) = f(X) \text{ almost surely}. \quad (5.3)$$
Then for any $\theta \in \mathbb{R}$,
$$m'(\theta) = \mathbb{E}\big(f(X)e^{\theta f(X)}\big) = \frac{1}{2}\,\mathbb{E}\Big(\big(e^{\theta f(X)} - e^{\theta f(X')}\big)F(X, X')\Big). \quad (5.4)$$
By Chatterjee (2005), this can be further bounded using the quantity $\Delta(X) := \frac{1}{2}\mathbb{E}\big(|(f(X) - f(X'))F(X, X')| \,\big|\, X\big)$ and conditions on $\Delta(X)$. In this paper, we are also going to use (5.4), but instead of taking absolute values, we consider positive and negative parts.
In order to apply the approach for some function f , we need to find the antisymmetric function F (x, y) such that (5.3) is satisfied. Chapter 4 of Chatterjee (2005) finds such an antisymmetric function by a method using a Markov chain, we give a summary below.
An exchangeable pair $(X, X')$ automatically defines a reversible Markov kernel $P$ via
$$(Pf)(x) := \mathbb{E}(f(X') \mid X = x), \quad (5.5)$$
where $f$ is any function such that $\mathbb{E}|f(X)| < \infty$. Let $\{X(k)\}_{k\ge 0}$ and $\{X'(k)\}_{k\ge 0}$ be two chains with Markov kernel $P$, having arbitrary initial values, and coupled according to some coupling scheme which satisfies the following property.
P For every initial value (x, y) of the joint chain {X(k)} k≥0 , {X ′ (k)} k≥0 , and every k, the marginal distribution of X(k) depends only on x and the marginal distribution of X ′ (k) depends only on y.
Under this assumption, the following lemma holds.
Lemma 5.2 (Lemma 4.2 of Chatterjee (2005)). Suppose the chains $\{X(k)\}$ and $\{X'(k)\}$ satisfy the property P described above. Let $f : \mathcal{X} \to \mathbb{R}$ be a function such that $\mathbb{E} f(X) = 0$. Suppose there exists a finite constant $L$ such that for every $(x, y) \in \mathcal{X}^2$,
$$\sum_{k=0}^{\infty} \Big|\mathbb{E}\big(f(X(k)) - f(X'(k)) \,\big|\, X(0) = x, X'(0) = y\big)\Big| \le L. \quad (5.6)$$
Then the function $F$, defined as
$$F(x, y) := \sum_{k=0}^{\infty} \mathbb{E}\big(f(X(k)) - f(X'(k)) \,\big|\, X(0) = x, X'(0) = y\big),$$
is well defined, satisfies $|F(x, y)| \le L$, and satisfies (5.3).
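On a finite state space with independently run chains, $\mathbb{E}(f(X(k)) \mid X(0) = x) = (P^k f)(x)$, so $F$ reduces to $F(x,y) = \sum_{k\ge 0} \big((P^k f)(x) - (P^k f)(y)\big)$, and the identity $\mathbb{E}(F(X, X') \mid X) = f(X)$ can be verified numerically (our sketch, using an arbitrary small reversible kernel):

```python
import numpy as np

# A small reversible (birth-death) chain with stationary law pi = (1/4, 1/2, 1/4).
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()                      # stationary distribution

f = np.array([1.0, 0.0, -1.0])
f = f - pi @ f                      # centre so that E_pi f = 0

# S(x) = sum_k (P^k f)(x), truncated once the geometric tail is negligible
terms = [f.copy()]
for _ in range(200):
    terms.append(P @ terms[-1])
S = np.sum(terms, axis=0)
F = S[:, None] - S[None, :]         # F(x, y) = S(x) - S(y); antisymmetric

# E(F(X, X') | X = x) = sum_y P(x, y) F(x, y) = S(x) - (P S)(x) = f(x):
assert np.allclose(P @ S, S - f, atol=1e-8)
assert np.allclose(F, -F.T)
```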

Additional lemmas
The following lemma proves concentration in the case when $\Delta(X)$ is not bounded almost surely, but is itself concentrated (a reformulation of Lemma 11 of Massart (2000)). Since the proof is short, we include it for completeness (it is based on part of the proof of Theorem 3.13 of Chatterjee (2005)).
Lemma 5.3. Let $m(\theta) = \mathbb{E}(e^{\theta f(X)})$. For any random variable $V$ and any $L > 0$, the stated bound holds for every $\theta \in \mathbb{R}$, provided that the expectations on both sides exist.
Proof. Let $u(X) := e^{\theta f(X)}/m(\theta)$. Let $A, B \ge 0$ be two random variables with finite variance and $E(A) = 1$; then $E(A \log B) \le \log E(AB)$, which can be shown by changing the measure and applying Jensen's inequality. Applying this inequality with $A = u(X)$ and $B = e^{LV}/u(X)$ yields
\[
L \, E\big(V u(X)\big) - E\big(u(X) \log u(X)\big) \le \log E\big(e^{LV}\big).
\]
Now using the fact that $\log u(X) = \theta f(X) - \log m(\theta)$, and that $E(u(X) f(X)) = m'(\theta)/m(\theta)$, we obtain the result after rearranging.
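The change-of-measure inequality $E(A \log B) \le \log E(AB)$ used in this proof is easy to check numerically on a discrete space; the values below are arbitrary (chosen for illustration), with $E(A) = 1$.

```python
import math

p = [0.25, 0.25, 0.5]          # probabilities on a 3-point space
A = [0.8, 1.6, 0.8]            # E(A) = 0.2 + 0.4 + 0.4 = 1
B = [3.0, 0.2, 1.7]            # any nonnegative values

lhs = sum(pi * a * math.log(b) for pi, a, b in zip(p, A, B))   # E(A log B)
rhs = math.log(sum(pi * a * b for pi, a, b in zip(p, A, B)))   # log E(AB)
```

Here $A \, dP$ is a probability measure, and the inequality is Jensen's inequality for $\log$ under that measure.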
We will use the following well-known result many times in our proofs.
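The statement itself is not reproduced here; presumably the result referred to (Lemma 5.4) is the standard conversion of mgf bounds into tail bounds via Markov's inequality. A typical instance of the argument, for the reader's convenience:

```latex
% Markov's inequality applied to e^{\theta f(X)}, for any \theta > 0:
\[
  P(f(X) \ge t)
  = P\big(e^{\theta f(X)} \ge e^{\theta t}\big)
  \le e^{-\theta t} m(\theta).
\]
% A Bernstein-type mgf bound gives a Bernstein-type tail bound: if
% m(\theta) \le \exp\!\big(\tfrac{\theta^2 v}{2(1 - c\theta)}\big)
% for 0 < \theta < 1/c, then the choice \theta = t/(v + ct) yields
\[
  P(f(X) \ge t) \le \exp\!\Big(-\frac{t^2}{2(v + ct)}\Big).
\]
```

The lower tail is handled symmetrically, applying the same argument to $-f$ with $\theta < 0$.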

Proofs of the main results
In this section, we are going to prove our main results, Theorem 3.1 and Corollary 3.2. The theorem concerns dependent random variables, and we need to introduce a certain amount of notation to handle them, which makes the proof rather technical. In order to help the reader digest this proof, we first prove the theorem in the independent case, where we are free of the notational burden required for dependent random variables. Before starting the proof in the independent case, we introduce some notation and two lemmas that are going to be used in both the independent and the dependent cases.
Let $X = (X_1, \ldots, X_n)$ be a vector of random variables taking values in $\Lambda$. Let $f : \Lambda \to \mathbb{R}$ be the centered version of $g$, defined as
\[
f(x) = g(x) - E(g(X)) \quad \text{for every } x \in \Lambda. \tag{6.1}
\]
Let $\alpha_1, \ldots, \alpha_n : \Lambda \to \mathbb{R}_+$ be functions such that for any $x, y \in \Lambda$,
\[
f(x) - f(y) \le \sum_{i : x_i \ne y_i} \alpha_i(x), \tag{6.2}
\]
and let $\alpha(x) := (\alpha_1(x), \ldots, \alpha_n(x))$. Note that at this point we do not yet make any specific self-bounding-type assumptions on $\alpha(x)$. Let $I$ be uniformly distributed on $[n]$. Suppose that $(X, X')$ is an exchangeable pair such that $X_i = X'_i$ for every $i \in [n] \setminus \{I\}$. Suppose that for $k \ge 0$, $X(k)$ and $X'(k)$ are Markov chains with kernel defined as in (5.5), satisfying Property P and (5.6). For $k \ge 0$, define the random vector $L(k) \in \mathbb{R}^n_+$ as

The following two lemmas bound the moment generating function of $f$ in terms of the vectors $L(k)$ and $\alpha(x)$.
Lemma 6.1. Under the above assumptions, for $\theta > 0$, if $m(\theta) < \infty$, then we have

Proof. Note that

Using (6.2), we have

thus the result follows.

Independent case
In this section, we are going to prove Theorem 3.1 and Corollary 3.2 under the additional assumption that $X = (X_1, \ldots, X_n)$ is a vector of independent random variables. First, we construct a valid coupling of $(X(k), X'(k))_{k \ge 0}$ satisfying Property P and (5.6). After this, we use Lemmas 6.1 and 6.2 to obtain the mgf bounds of Theorem 3.1. The construction of $(X(k), X'(k))_{k \ge 0}$ is the same as in the example on page 73 of Chatterjee (2005), sketched here for the sake of completeness. It is a version of the Glauber dynamics. First, we set $X(0) = x$ and $X'(0) = y$ for some $x, y \in \Lambda$. Then we let $I(1), I(2), \ldots$ be independent random variables uniformly distributed on $[n]$, and $X^*(1), X^*(2), \ldots$ be independent copies of $X$. In the first step, we define the vectors $X(1)$ and $X'(1)$ as equal to $X(0)$ and $X'(0)$, respectively, except in coordinate $I(1)$, where we set $X_{I(1)}(1) = X'_{I(1)}(1) = X^*_{I(1)}(1)$. We define $X(k), X'(k)$ in the same way, by starting from $X(k-1), X'(k-1)$ and changing their coordinate $I(k)$ to $X^*_{I(k)}(k)$. This coupling has been shown to satisfy Property P and (5.6) in Chatterjee (2005) (via the coupon collector's problem). Finally, we note that $X'$ is defined as one step of the dynamics; that is, we let $X^*$ be an independent copy of $X$, let $I$ be uniformly distributed on $[n]$, independently of $X$ and $X^*$, and let $X'$ equal $X$ except in coordinate $I$, where it equals $X^*_I$. Now we are ready to prove Theorem 3.1 and Corollary 3.2 under the independence assumption.
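A minimal simulation of this coupling (an illustration with binary coordinates, not part of the proof) confirms the coupon-collector behaviour: the two chains share the indices $I(k)$ and the replacement values $X^*_{I(k)}(k)$, so they coalesce as soon as every coordinate has been refreshed at least once.

```python
import random

rng = random.Random(0)
n = 8
x = tuple(rng.randint(0, 1) for _ in range(n))
X = list(x)                    # chain started from x
Y = [1 - b for b in x]         # chain started far away, from the flip of x

refreshed = set()
steps = 0
while len(refreshed) < n:
    i = rng.randrange(n)       # shared index I(k)
    b = rng.randint(0, 1)      # shared fresh value X*_{I(k)}(k)
    X[i] = b
    Y[i] = b
    refreshed.add(i)
    steps += 1

coalesced = (X == Y)           # guaranteed once every coordinate is refreshed
```

The coalescence time is exactly the coupon-collector time, whose expectation is $n \sum_{i=1}^n 1/i \approx n \log n$.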
Proof of Part 1 of Theorem 3.1 and Corollary 3.2 assuming independence. By Lemma 6.1, using the fact that $f$ is bounded under our assumptions, we have for $\theta > 0$,

Now by our assumption, $\alpha_i(X(k)) \le 1$, and using that $g$ is $(a, b)$-$*$-self-bounding,

The mgf bound now follows by rearrangement and integration, and applying Lemma 5.4 proves the concentration bound of Corollary 3.2.
Proof of Part 2 of Theorem 3.1 and Corollary 3.2 assuming independence. By Lemma 6.1, we have for $\theta > 0$,

Now by the fact that $g$ is weakly $(a, b)$-$*$-self-bounding, we have
\[
\sum_{i=1}^n \alpha_i(X)^2 \le a g(X) + b, \quad \text{and} \quad \sum_{i=1}^n \alpha_i(X(k))^2 \le a g(X(k)) + b.
\]
We will use the following conditional version of the Cauchy–Schwarz inequality: if $A_i, B_i$ are random variables for $1 \le i \le n$, and $\mathcal{G}$ is a $\sigma$-field, then
\[
\Big(\sum_{i=1}^n E(A_i B_i \mid \mathcal{G})\Big)^2 \le \Big(\sum_{i=1}^n E(A_i^2 \mid \mathcal{G})\Big) \Big(\sum_{i=1}^n E(B_i^2 \mid \mathcal{G})\Big).
\]
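A quick discrete sanity check of this inequality, on an arbitrary four-point space with the $\sigma$-field generated by the partition $\{0,1\} \cup \{2,3\}$ (all values chosen purely for illustration):

```python
# Four-point probability space; G is generated by the partition {0,1} | {2,3}.
p = [0.1, 0.2, 0.3, 0.4]
A = [[1.0, -2.0, 0.5, 3.0], [0.7, 1.1, -0.4, 2.0]]   # A_1, A_2 (pointwise values)
B = [[0.3, 4.0, -1.0, 0.2], [2.5, -0.6, 1.3, 0.9]]   # B_1, B_2

def cond_exp(vals, atom):
    # Conditional expectation E(vals | G) evaluated on the given atom.
    z = sum(p[w] for w in atom)
    return sum(p[w] * vals[w] for w in atom) / z

worst_gap = float("inf")
for atom in ([0, 1], [2, 3]):
    lhs = sum(cond_exp([a * b for a, b in zip(A[i], B[i])], atom)
              for i in range(2)) ** 2
    r1 = sum(cond_exp([a * a for a in A[i]], atom) for i in range(2))
    r2 = sum(cond_exp([b * b for b in B[i]], atom) for i in range(2))
    worst_gap = min(worst_gap, r1 * r2 - lhs)
```

The inequality is Cauchy–Schwarz for the inner product $(A, B) \mapsto \sum_i E(A_i B_i \mid \mathcal{G})$, checked here atom by atom.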

Now writing
Substituting this into (6.3), we obtain

Here we have used the fact that for $\theta > 0$,
\[
E\big(e^{\theta f(X)} f(X(k))\big) \le E\big(e^{\theta f(X)} f(X)\big), \tag{6.4}
\]
which follows from the exchangeability of $X$ and $X(k)$: since $e^{\theta f(X)} - e^{\theta f(X(k))}$ and $f(X) - f(X(k))$ always have the same sign,
\[
0 \le E\big((e^{\theta f(X)} - e^{\theta f(X(k))})(f(X) - f(X(k)))\big) = 2 E\big(e^{\theta f(X)} f(X)\big) - 2 E\big(e^{\theta f(X)} f(X(k))\big),
\]
where the equality uses exchangeability. We conclude by applying Lemma 5.4.
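The sign observation underlying (6.4), namely that $e^{\theta a} - e^{\theta b}$ and $a - b$ share a sign when $\theta > 0$, is just monotonicity of the exponential; a grid check:

```python
import math

theta = 0.7                                 # any theta > 0
grid = [i / 10 - 2.0 for i in range(41)]    # a, b in [-2, 2]
# Product of the two differences is nonnegative for every pair (a, b).
worst = min((math.exp(theta * a) - math.exp(theta * b)) * (a - b)
            for a in grid for b in grid)
```

Equality holds only when $a = b$, where both factors vanish.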
The terms involving $f(X(k))$ cause some difficulty. Although we can show, in the same way as in Part 2, that

for us the reverse inequality would be more convenient. Nevertheless, we can use the concentration properties of $f(X(k))$ from Part 2 to bound this term. By Lemma 5.3, for any $L > 0$,

Now by exchangeability, $E(e^{L f(X(k))}) = E(e^{L f(X)}) = m(L)$, and we can use the bound from Part 2 to obtain that for $0 < L < 1/(2a)$,

Substituting this back into (6.5), and summing over $k$ as previously, we obtain

A convenient choice for $L$, which makes the inequality tractable, is $L = -\theta$.
Finally, we need to tackle the case when $a < a_c$. Going back to equation (6.6), we can write that for $0 > \theta > -\frac{1}{4a}$,

Let us write $C := \frac{5}{2}(a E g(X) + b)$; then by Markov's inequality, we have that for $0 > \theta > -\frac{1}{4a}$ and $0 < t < E g(X)$,

The minimum of the right-hand side is attained at

which satisfies $0 > \theta_{\min} > -\frac{1}{4a}$ whenever $a < a_c$. Thus, in this case we have

Now consider the function $x - (1 + x)\log(1 + x)$ for positive $x$; one can easily check that it is negative, and
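The negativity of $x - (1+x)\log(1+x)$ for $x > 0$ (which follows from $\log(1+x) > x/(1+x)$) can be spot-checked numerically:

```python
import math

# h(x) = x - (1 + x) log(1 + x) is 0 at x = 0 and strictly decreasing,
# since h'(x) = -log(1 + x) < 0 for x > 0.
h = lambda x: x - (1 + x) * math.log(1 + x)
max_h = max(h(k / 100) for k in range(1, 1001))   # x in (0, 10]
```

The maximum over the grid is attained near $x = 0$ and is already strictly negative there.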

Discussion
Compared to the original proof of Theorem 4.3 of Chatterjee (2005), we have introduced several new ideas. Firstly, instead of bounding $|F(X, X')(f(X) - f(X'))|$, we use the one-sided version $(F(X, X'))_+ (f(X) - f(X'))_+$. Moreover, we have not taken the expectation of this quantity with respect to $X$, but instead used a tricky symmetrisation argument in (6.12). Finally, we have also used Lemma 5.3, which was not needed for the original proof. In an upcoming paper, we are going to show that these techniques are powerful enough to imply the exponential and polynomial Efron–Stein inequalities for independent random variables, due to Boucheron, Lugosi and Massart (2003) and Boucheron et al. (2005). The dependent case remains an open problem.
Note that in Theorem 3.1, in each of the three cases, $g$ is always bounded, and thus $f$ is also bounded. This means that $|f(x)| \le C$ for some absolute constant $C$ for every $x \in \Lambda$. Using this and (6.11), we have

so by summing the resulting geometric series, we obtain that (5.6) holds with $L = 2C\big/\big(1 - \big(1 - \tfrac{1}{n} + \tfrac{1}{n}\|A\|_1\big)\big) = 2nC/(1 - \|A\|_1)$. Now we are ready to prove Theorem 3.1 and Corollary 3.2.
We obtain the mgf bound in Theorem 3.1 by integrating this inequality, and the concentration bound in Corollary 3.2 from Lemma 5.4.
Proof of Part 3 of Theorem 3.1 and Corollary 3.2. Now we bound the lower tail, so suppose that $\theta < 0$. By Lemma 6.2,

In Part 2, we proved that

By summing over $k$, we obtain a bound of the form
\[
\cdots \times E\Big(\frac{a f(X(k)) + a f(X) + 2b + 2a E(g)}{2}\, e^{\theta f(X)}\Big).
\]
By Lemma 5.3, since $m(\theta) \ge 1$, for any $L > 0$,
\[
E\big(e^{\theta f(X)} f(X(k))\big) \le L^{-1} \log\big(E(e^{L f(X(k))})\big) m(\theta) + L^{-1} \theta m'(\theta),
\]
and by Part 2, for $0 < L < 1/(2a)$,

By the convenient choice of $L = -\theta$, we obtain that for $0 > \theta > -1/(2a)$,

which implies our mgf bound (3.2) in Theorem 3.1. We split the argument for obtaining the tail inequalities in Corollary 3.2 into two parts, depending on the size of $a$.

The convex distance inequality for dependent random variables
In this section, we prove Theorem 3.3. Before turning to the proof, we state some auxiliary results. We will use Sion's minimax theorem, which states the following (Sion (1958); see also Komiya (1988)).
Theorem 6.4. Let $f(x, y)$ denote a function $X \times Y \to \mathbb{R}$ that is convex and lower-semicontinuous with respect to $x$, and concave and upper-semicontinuous with respect to $y$. If $X$ is convex and compact, then
\[
\min_{x \in X} \sup_{y \in Y} f(x, y) = \sup_{y \in Y} \min_{x \in X} f(x, y).
\]
The following lemma is the $*$-self-bounding analogue of Lemma 1 of Boucheron, Lugosi and Massart (2009).
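Sion's theorem can be illustrated on a bilinear payoff over probability simplices (the classical matrix-game setting, chosen here purely for illustration): $f(x, y) = x^{\top} M y$ is linear in each argument, and both simplices are convex and compact, so the two optimisation orders agree.

```python
# Zero-sum matrix game: f(x, y) = x^T M y with x, y on the 1-simplex.
M = [[3.0, 0.0],
     [1.0, 2.0]]

def payoff(p, q):
    # x = (p, 1-p), y = (q, 1-q): linear (hence convex/concave) in each.
    x, y = (p, 1 - p), (q, 1 - q)
    return sum(x[i] * M[i][j] * y[j] for i in range(2) for j in range(2))

grid = [k / 200 for k in range(201)]
min_max = min(max(payoff(p, q) for q in grid) for p in grid)
max_min = max(min(payoff(p, q) for q in grid) for p in grid)
```

For this $M$, the saddle point is at $x = (1/4, 3/4)$, $y = (1/2, 1/2)$, with common value $3/2$; both grid searches recover it since the grid contains those points.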
Proof. The second claim is proven in Lemma 1 of Boucheron, Lugosi and Massart (2009). The proof of the first claim is similar to the proof of Lemma 1 of Boucheron, Lugosi and Massart (2009) (see also Proposition 13 of Boucheron, Lugosi and Massart (2003)). We recall some of their argument here.
Let $\mathcal{M}(S)$ denote the set of probability measures on $S$. Then, using Sion's minimax theorem, we may rewrite $d_T$ as
\[
d_T(x, A) = \sup_{\alpha : \|\alpha\|_2 \le 1} \min_{\nu \in \mathcal{M}(A)} \sum_{i=1}^n \alpha_i \, \nu(\{y : y_i \ne x_i\}) = \min_{\nu \in \mathcal{M}(A)} \sup_{\alpha : \|\alpha\|_2 \le 1} \sum_{i=1}^n \alpha_i \, \nu(\{y : y_i \ne x_i\}).
\]
Denote the pair $(\nu, \alpha)$ at which the saddle point is achieved by $(\hat{\nu}, \hat{\alpha})$.
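For a concrete feel for this saddle point, both sides of the minimax identity can be evaluated numerically on a toy instance (an illustration, not part of the proof): $x = (0,0,0)$ and $A = \{(1,1,0), (0,1,1)\}$. On the measure side, the inner supremum over $\alpha$ is attained at $\alpha$ proportional to the vector $(\nu(\{y : y_i \ne x_i\}))_i$, so that side reduces to minimising its Euclidean norm.

```python
import math

# Toy instance: x = (0, 0, 0), A = {(1, 1, 0), (0, 1, 1)}.
# c[j][i] = 1[x_i != y_i] for the j-th point of A.
c = [(1, 1, 0), (0, 1, 1)]

# Measure side: nu = (t, 1 - t) on A; minimise || (nu{y : y_i != x_i})_i ||_2.
def norm_side(t):
    v = [t * c[0][i] + (1 - t) * c[1][i] for i in range(3)]
    return math.sqrt(sum(u * u for u in v))

min_nu = min(norm_side(k / 1000) for k in range(1001))

# Alpha side: sup over ||alpha||_2 <= 1 of min_j <alpha, c_j>; since c_j >= 0
# it suffices to search the positive octant of the unit sphere.
best = 0.0
steps = 300
for a in range(steps + 1):
    for b in range(steps + 1):
        phi = (math.pi / 2) * a / steps
        psi = (math.pi / 2) * b / steps
        alpha = (math.sin(phi) * math.cos(psi),
                 math.sin(phi) * math.sin(psi),
                 math.cos(phi))
        best = max(best, min(sum(al * ci for al, ci in zip(alpha, cj))
                             for cj in c))
```

Weak duality ($\sup\min \le \min\sup$) holds exactly, and the two grid values agree up to discretisation error, matching $d_T = \sqrt{3/2}$ at $\hat{\nu} = (1/2, 1/2)$.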
Note that, strictly speaking, the conditions of Sion's minimax theorem ($X$ should be convex and compact) are not satisfied; however, this problem can be dealt with in the same way as in Boucheron, Lugosi and Massart (2003) (by mapping the large space $\mathcal{M}(S)$ onto the convex compact set of probability measures on $\{0, 1\}^n$).
Now we are ready to prove the main result of this section.
As a consequence of these results, we obtain a version of Theorem 4.1 for sampling without replacement.