Distributional Replication

A function which transforms a continuous random variable such that it has a specified distribution is called a replicating function. We suppose that functions may be assigned a price, and study an optimization problem in which the cheapest approximation to a replicating function is sought. Under suitable regularity conditions, including a bound on the entropy of the set of candidate approximations, we show that the optimal approximation comes close to achieving distributional replication, and close to achieving the minimum cost among replicating functions. We discuss the relevance of our results to the financial literature on hedge fund replication; in this case, the optimal approximation corresponds to the cheapest portfolio of market index options which delivers the hedge fund return distribution.


Introduction
Suppose that X and Y are random variables. In this paper we consider estimating a function θ such that θ(X) and Y have the same distribution. Such a function is said to be a replicating function. Typically, there are many different replicating functions for a given pair of random variables X and Y. We suppose that to each function θ there corresponds a "price", denoted p(θ), and we seek to estimate the replicating function θ for which p(θ) is as small as possible. That is, we seek to estimate the cheapest replicating function for a given X and Y. To estimate this function from a sample of realizations of X and Y, we first obtain an estimate of the set of all replicating functions. The estimated set is formed by choosing a rich but manageable class of functions (i.e., a sieve space) and taking all those functions θ in that class for which the distance between the empirical distributions of θ(X) and Y is small. Our estimate of the cheapest replicating function is then obtained by minimizing p over the estimated set of replicating functions.
Our research is motivated by a literature in applied finance on "hedge fund replication". The hedge fund replication literature is concerned with the possibility of achieving financial returns that resemble those of a particular hedge fund, fund of hedge funds, or index of hedge funds, by engaging in an investment strategy that does not involve a direct investment in the fund or funds in question. Ideally, the replicating strategy should involve trading assets that are highly liquid, thereby avoiding the barriers to entry, lock-in periods and high fees that are characteristic of hedge fund investments. Several major investment banks have launched hedge fund replication products, including Goldman Sachs and Merrill Lynch in 2006 and J.P. Morgan in 2007 [1]. Hedge fund replication strategies have also attracted the attention of the popular press, with articles appearing in The Wall Street Journal [2] and The New Yorker [3], among other outlets. Simonian and Wu [4] have recently described the proliferation of hedge fund replication strategies in investing as a "cottage industry".
There are two broad streams of the hedge fund replication literature. In one stream, researchers have considered the direct approximation of hedge fund returns by investing in a portfolio of other assets. By direct approximation, we mean that the returns from the selected portfolio should be close to the hedge fund returns with high probability. Typically, the replicating strategy amounts to estimating a factor model for hedge fund returns, and then investing directly in the factors rather than in the hedge fund. Hasanhodzic and Lo [5] and Simonian and Wu [4] are representative of this stream of research. The second stream of the hedge fund replication literature is concerned with the distributional approximation of hedge fund returns, rather than their direct approximation. The aim here is to create a trading strategy that generates returns with the same statistical distribution as the hedge fund returns. This is a more modest goal than direct approximation, because in any given period the return generated by the replicating strategy need not resemble the return from the hedge fund. Key papers in this stream of the hedge fund replication literature include Amin and Kat [6], Kat and Palaro [7,8], and Kat [1]. The results in this paper concern the approach taken by these authors. Our aim is not to provide statistical methods ready to be applied to data, but rather to develop a mathematical framework for thinking about distributional replication.
Suppose that X represents the payoff after one month from a $1 investment in a market index, while Y represents the payoff after one month from a $1 investment in a hedge fund. Amin and Kat [6] propose to estimate a function θ such that θ(X) and Y have the same distribution function. Given a sample of n realizations of X and Y, their estimated replicating function is θ̂_n = Q̂_Y,n • F̂_X,n, where Q̂_Y,n is an estimate of Q_Y, the quantile function of Y, and F̂_X,n is an estimate of F_X, the distribution function of X. Assuming continuity of F_X, the random variable Q_Y(F_X(X)) has the same distribution as Y, implying that Q_Y • F_X is a replicating function. We might therefore expect θ̂_n(X) and Y to have similar distributions in large samples. The estimated function θ̂_n can be thought of as describing the payoff after one month of a derivative security written on the market index. Under suitable conditions, this payoff can be achieved using a continuously rebalanced self-financing portfolio of market shares and cash, as in the hedging strategy used to justify the celebrated Black-Scholes-Merton option pricing formula [9,10]. We let p(θ) denote the start-up cost of a hedging strategy with payoff θ(X), and refer to this quantity as the price of θ.
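To make the estimator concrete, the composition Q̂_Y,n • F̂_X,n can be computed directly from the two samples: pass each new value of X through the empirical distribution function of the X sample, then through the empirical quantile function of the Y sample. The Python sketch below is our own illustration; the function names and the lognormal toy distributions are hypothetical, not taken from [6].

```python
import numpy as np

def replicating_function_estimate(x_sample, y_sample):
    """Amin-Kat style estimate of Q_Y composed with F_X, built from samples."""
    x_sorted = np.sort(x_sample)
    y_sorted = np.sort(y_sample)

    def theta_hat(x):
        # Empirical distribution function of X, evaluated at x.
        u = np.searchsorted(x_sorted, x, side="right") / len(x_sorted)
        # Empirical quantile function of Y, evaluated at level u.
        idx = np.clip(np.ceil(u * len(y_sorted)).astype(int) - 1,
                      0, len(y_sorted) - 1)
        return y_sorted[idx]

    return theta_hat

# Toy example: X and Y lognormal with different parameters.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.2, size=5000)
y = rng.lognormal(mean=0.05, sigma=0.4, size=5000)
theta = replicating_function_estimate(x, y)

# On fresh draws of X, theta(X) should be distributed approximately like Y.
x_new = rng.lognormal(mean=0.0, sigma=0.2, size=5000)
replicated = theta(x_new)
```

Note that θ̂_n constructed this way is monotone by construction, which is exactly the restriction that the approach developed in Sections 2 and 3 relaxes.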
It need not be the case that p(θ) = 1 when θ is a replicating function. This is because the distributional equivalence of θ(X) and Y does not imply the existence of an arbitrage opportunity when their initial investment costs differ. Indeed, two replicating functions need not have the same price. Amin and Kat [6] aim to estimate the particular replicating function Q Y • F X because it is an increasing function of the market payoff X. In Dybvig [11,12] and Beare [13], it is shown under very general conditions that, given a collection of payoff functions that all achieve the same payoff distribution, the cheapest such function must allocate payoffs to states as a nonincreasing function of the state prices. Amin and Kat [6] observe that in a Black-Scholes world, the state price density (with respect to the true probability measure over states) is inversely related to X. Thus, the cheapest replicating function must be a nondecreasing function of X.
A key difference between the approach to distributional replication proposed in this paper, and the approach taken by Amin and Kat [6], is that we do not assume that the cheapest replicating function is nondecreasing. Instead, we search for the cheapest replicating function over a large space of functions, many of which are not monotone. Empirically, there is good reason to believe that the cheapest replicating function will not be monotone. Jackwerth [14] and Brown and Jackwerth [15] argue that the state price density (in their terminology, pricing kernel) implied by S&P500 options with one month to expiry changed dramatically after the stock market crash of 1987, becoming nonmonotone with respect to the return on the S&P500 index. See, in particular, Figure 2 in [15], in which the state price density is an increasing function of the market return for monthly return levels between approximately −3% and 3%, and decreasing elsewhere. Other empirical studies of the relationship between the state price density and market returns have largely confirmed that it is often nonmonotone [16][17][18][19][20][21][22]. See also [23] for a discussion of the relevance of such nonmonotonicity for constructing density forecasts of market returns. If the relationship between the state price density and the market return is not monotone, then the results of Dybvig [11,12] and Beare [13] imply that the cheapest replicating function θ will not be monotone. In this case, the approach to distributional replication taken here is advantageous.
There is a second major conceptual difference between the approach to distributional replication taken here, and the approach taken by Amin and Kat [6]. Amin and Kat propose to implement the desired payoff function θ by engaging in a continuous time hedging strategy, trading market shares and cash. In this paper, we propose to approximate θ by investing in a portfolio of European put and call options written on the market index at various strike prices. The portfolio may also include the market index itself, and risk-free zero-coupon bonds. A key advantage of our approach is that the price of the payoff function θ corresponding to such a portfolio may be calculated directly from observed option and bond prices. By comparison, Amin and Kat price θ by taking the risk neutral expected payoff of θ(X) under Black-Scholes conditions, and they require Black-Scholes conditions to hold in order for their hedging strategy to achieve the desired payoff. The empirical limitations of the Black-Scholes pricing model have been extensively documented. We avoid these difficulties by confining ourselves to functions θ for which the market price is directly observable, and which may be implemented in practice by investing directly in a portfolio of actively traded securities.
We embed our approach in the statistical framework of sieve estimation by assuming that the set of strike prices at which options may be traded becomes more dense as the sample size n increases, at a controlled rate. The payoff functions achievable using portfolios of this kind are continuous piecewise linear functions, with kinks at the allowable strike prices. We control the entropy (complexity) of this class of functions using the notion of VC-dimension [24], and are thereby able to bring the machinery of empirical process theory to bear in analyzing the asymptotic properties of our technique. The use of option payoff functions to form the basis for a sieve space is not entirely without precedent. Option payoffs appear as activation functions in the regularized neural network model studied by Corradi and White [25]: take m = 2 in their Equation (4.1). Those authors do not, however, explicitly discuss the connection to option payoffs and portfolio choice.
The approach taken by Amin and Kat [6], and in this paper, aims to replicate the univariate distribution of Y. Typically, the joint distribution of θ(X) with any other asset payoff will differ from the joint distribution of Y and that asset payoff. In particular, the joint distribution of θ(X) and the market payoff X will differ from the joint distribution of Y and X, and for this reason we cannot expect investors to find θ(X) to be a perfect substitute for Y in general. Intuitively, if the correlation between X and Y is lower than the correlation between X and θ(X), risk-averse investors may prefer a balanced portfolio formed from X and Y to a similar portfolio formed from X and θ(X). In response to this issue, Kat and Palaro [7,8] extend the approach of Amin and Kat [6] to the replication of bivariate distributions. They introduce a "reserve asset" with payoff Z, and seek to find a bivariate function θ such that the joint distribution of θ(X, Z) and X is the same as the joint distribution of Y and X. This replicating payoff function is implemented in practice using a continuously rebalanced portfolio formed by trading market shares, cash, and the reserve asset. We do not follow that approach in this paper, in part because it is generally not feasible to approximate a wide class of bivariate functions using a portfolio formed from options written on individual assets. Confining ourselves to the replication of univariate distributions may not seem unreasonable if we modify our interpretation of the random variable Y. Rather than representing the payoff from a $1 investment in a hedge fund, Y could represent the payoff from a $1 investment in a portfolio partly invested in the hedge fund and partly in the market index. Amin and Kat [6] take this approach in their empirical study of hedge fund efficiency. More generally, Y could be the payoff from a $1 investment in a portfolio formed from any number of arbitrary assets. If θ(X) has the same distribution as Y, and the price of θ is less than $1, an investor should prefer to invest in the replicating portfolio.
The remainder of this paper is structured as follows. In Sections 2 and 3 we develop a general approach to the estimation of replicating functions, without explicit reference to the financial application that serves as our motivation. In Section 2 we provide some basic mathematical tools for dealing with the notion of replicating functions, while in Section 3 we discuss the statistical estimation of replicating functions using the method of sieves. In Section 4 we explain how the mathematical material in Sections 2 and 3 can be applied to the problem of hedge fund replication. Section 5 outlines some areas for future research, and concludes. Throughout the paper, there are several numbered assumptions and propositions. In the statement of each proposition, it should be understood that all assumptions introduced prior to the proposition hold. Proofs of all numbered propositions may be found in Appendix A.

Replicating Functions
In this section we formally introduce the notion of a replicating function. We construct a pseudometric on the set of Borel measurable functions mapping the support of one random variable to the support of another, and we define a criterion function that identifies the set of replicating functions. Some useful results relating to these objects are given.
Let X and Y be real valued random variables, and let P_X : B(R) → [0, 1] and P_Y : B(R) → [0, 1] denote the probability measures corresponding to X and Y, where B(R) denotes the usual Borel σ-field on R. Let F_X : R → [0, 1] and F_Y : R → [0, 1] denote the distribution functions of X and Y. Let R_X = cl({x ∈ R : 0 < F_X(x) < 1}), and let R_Y = cl({y ∈ R : 0 < F_Y(y) < 1}); here, cl(A) denotes the Euclidean closure of a set A ⊆ R. The sets R_X and R_Y are intervals of the form [a, b], [a, ∞), (−∞, b], or R itself, with a, b ∈ R. We place the following condition on F_X and F_Y.

Assumption 1. F_X and F_Y are continuous and strictly increasing on R_X and R_Y respectively.

Assumption 1 is stronger than is required to establish all of the results in this paper, but it will be convenient for us to maintain Assumption 1 throughout. Under Assumption 1, the restriction of F_X to R_X is a continuous and strictly increasing function, and therefore uniquely defines a continuous and strictly increasing inverse function Q_X : F_X(R_X) → R_X. We refer to this function as the quantile function of X. The quantile function of Y, denoted Q_Y : F_Y(R_Y) → R_Y, is defined in the same way. Note that F_X(R_X) and F_Y(R_Y) are equal to (0, 1], [0, 1), [0, 1] or (0, 1), depending on whether X and Y are almost surely bounded above, below, both, or neither.
Let Θ denote the set of all Borel measurable functions θ : R X → R Y . Though Θ depends on X and Y, we do not make this dependence explicit in our notation. We are interested in those functions θ ∈ Θ for which θ(X) and Y have the same distribution.

Definition 1.
A function θ ∈ Θ is called a replicating function for X and Y, or simply a replicating function or replicator, if P_X θ^{-1} B = P_Y B for all B ∈ B(R).
Note that a replicating function does not describe a relationship between X and Y in the usual sense. θ(X) and Y may be perfectly correlated, or independent. All that matters is that they have the same marginal distribution. We will let Θ * denote the set of all replicating functions for X and Y. Again, the dependence of Θ * on X and Y is not made explicit in our notation.
Our first result concerns the cardinality of Θ*.

Proposition 1. Θ* is uncountably infinite. Moreover, there exists an uncountable subset of Θ* in which no two functions are equal on a set of positive P_X-measure.

Remark 1.
One example of a replicating function is the composition Q Y • F X , restricted to R X . Clearly, if F X is not continuous and F Y is continuous, so that Assumption 1 is violated, Θ * is empty.
We will sometimes find it helpful to consider the special case where X and Y are both distributed uniformly on the unit interval. In this case, the composition θ = Q_Y • F_X restricted to R_X = [0, 1] is the identity function, θ(x) = x. Another simple example of a replicating function is θ(x) = 1 − x, restricted to [0, 1]. Graphs of these functions, and of four other replicating functions, are provided in Figure 1. We will let Θ̃ denote the set of all Borel measurable functions θ : [0, 1] → [0, 1], and let Θ̃* denote the set of functions in Θ̃ that are replicators when X, Y ∼ U(0, 1). As an aid to visualizing the functions in Θ̃*, a reader familiar with the concept of local time may find it helpful to think of each function θ ∈ Θ̃ as a (nonrandom) stochastic process on the unit interval. The functions θ ∈ Θ̃* are precisely those for which the local time at y is equal to one for each y ∈ (0, 1). That is, for θ ∈ Θ̃, we have θ ∈ Θ̃* if and only if

lim_{ε↓0} (1/2ε) Leb({x ∈ [0, 1] : y − ε < θ(x) ≤ y + ε}) = 1

for each y ∈ (0, 1), where Leb denotes Lebesgue measure. This can be shown by observing that the above limit is equal to the derivative of the distribution function of θ(X) at y when X ∼ U(0, 1).
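The local time characterization can be checked by simulation. Besides the identity and the reflection θ(x) = 1 − x, consider the sawtooth map θ(x) = 2x mod 1: it has two branches of slope 2, each contributing local time 1/2 at every y ∈ (0, 1), so it too is a replicator of U(0, 1). The sketch below is our own illustration, not code from the paper; it compares the empirical distribution of each transformed sample with the uniform distribution.

```python
import numpy as np

def ks_to_uniform(sample):
    """Kolmogorov-Smirnov distance between the sample's ECDF and the U(0,1) CDF."""
    s = np.sort(sample)
    n = len(s)
    upper = np.arange(1, n + 1) / n  # ECDF value at each order statistic
    lower = np.arange(0, n) / n      # ECDF value just below it
    return max(np.max(upper - s), np.max(s - lower))

rng = np.random.default_rng(1)
x = rng.uniform(size=100_000)

# Three replicating functions for X, Y ~ U(0,1).
identity = x
reflection = 1.0 - x
sawtooth = (2.0 * x) % 1.0  # two branches of slope 2: local time 1/2 + 1/2 = 1

for sample in (identity, reflection, sawtooth):
    assert ks_to_uniform(sample) < 0.01  # all three are close to uniform
```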
We now introduce a pseudometric d on Θ. For θ_0, θ_1 ∈ Θ, let

d(θ_0, θ_1) = ∫_{R_X} |θ_0(x) − θ_1(x)| dP_X(x) = E|θ_0(X) − θ_1(X)|.

It is obvious that d satisfies the four axioms for a pseudometric: nonnegativity, symmetry, the triangle inequality, and the requirement that d(θ, θ) = 0 for all θ ∈ Θ. d is not a metric because we will have d(θ_0, θ_1) = 0 when θ_0 and θ_1 are equal on a set of P_X-measure one, even if the two functions are distinct. Note that when X, Y ∼ U(0, 1), d corresponds to the usual L_1-seminorm for functions on [0, 1]. When X and Y are not uniform, d(θ_0, θ_1) is equal to the L_1 distance between the deformed functions θ_0 • Q_X and θ_1 • Q_X on F_X(R_X).

We now introduce a nonnegative function M : Θ → R that is intended to quantify the extent to which a function θ ∈ Θ achieves distributional replication. For θ ∈ Θ, let F_X(·; θ) denote the distribution function of θ(X); that is, for y ∈ R and θ ∈ Θ, let

F_X(y; θ) = P_X {x ∈ R_X : θ(x) ≤ y}.

We then take M(θ) to be the L_1 distance between F_X(·; θ) and F_Y:

M(θ) = ∫_R |F_X(y; θ) − F_Y(y)| dy.

Our pseudometric d endows M with a convenient smoothness condition. Specifically, M is Lipschitz continuous with respect to d, with Lipschitz constant no greater than one.

Proposition 2. For all θ_0, θ_1 ∈ Θ, we have |M(θ_0) − M(θ_1)| ≤ d(θ_0, θ_1).
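With d taken to be the L_1 pseudometric d(θ_0, θ_1) = E|θ_0(X) − θ_1(X)| and M the integrated absolute difference between the distribution functions of θ(X) and Y, the Lipschitz property can be verified numerically. The sketch below is our own illustration, with arbitrary toy choices of θ_0, θ_1 and of the distribution of Y.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(size=200_000)          # X ~ U(0,1)
y = rng.beta(2.0, 5.0, size=200_000)   # toy target Y

grid = np.linspace(0.0, 1.0, 2001)
dt = grid[1] - grid[0]

def M_approx(theta_of_x):
    """Monte Carlo approximation of M(theta) = integral of |F_X(.; theta) - F_Y|."""
    F_theta = np.searchsorted(np.sort(theta_of_x), grid, side="right") / len(theta_of_x)
    F_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.sum(np.abs(F_theta - F_y)) * dt

theta0_of_x = x ** 2.0   # theta_0(x) = x^2
theta1_of_x = x ** 2.2   # theta_1(x) = x^2.2, a nearby function

d01 = np.mean(np.abs(theta0_of_x - theta1_of_x))       # d(theta_0, theta_1)
gap = abs(M_approx(theta0_of_x) - M_approx(theta1_of_x))
assert gap <= d01 + 1e-3  # Lipschitz bound, up to discretization error
```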
Our next result concerns the identification of the set of replicators Θ* using the criterion function M. It states that the set of replicators Θ* consists of precisely those functions θ ∈ Θ for which M(θ) = 0.

Proposition 3. For θ ∈ Θ, we have θ ∈ Θ* if and only if M(θ) = 0.
Propositions 2 and 3 jointly imply that, if θ 1 , θ 2 , . . . is a sequence of elements of Θ converging to some θ * ∈ Θ * in the pseudometric d, then M(θ n ) → 0 as n → ∞. We would like to interpret this to mean that θ n gets arbitrarily close to achieving distributional replication as n becomes larger. The next result makes this notion precise.
Proposition 4. Let θ_1, θ_2, . . . be a sequence of elements of Θ such that M(θ_n) → 0 as n → ∞. Then P_X θ_n^{-1} ⇒ P_Y, where "⇒" denotes weak convergence of probability measures (see e.g., [26]), and P_X θ_n^{-1} is the measure on B(R) given by P_X θ_n^{-1} B = P_X {x ∈ R_X : θ_n(x) ∈ B} for each B ∈ B(R). We could also write this statement as θ_n(X) →_d Y, where "→_d" denotes convergence in distribution in the usual sense.
Our final result of this section is a modification of Proposition 4 that allows θ 1 , θ 2 , . . . to be random elements. An obvious first step towards defining such random elements would be to introduce a σ-field on Θ; however, such an approach leads to complications relating to the measurability of θ(X) when θ and X are both random. We will need to require each of the random elements θ n , n ∈ N, to be a random element of some subspace Θ n ⊂ Θ. Each subspace Θ n will be equipped with a σ-field T n that is well behaved in the following sense.

Definition 2. Given a collection of functions Θ′ ⊆ Θ, a σ-field T on Θ′ is said to be an admissible structure for Θ′ if the evaluation map (θ, x) ↦ θ(x), from Θ′ × R_X to R, is measurable with respect to the product σ-field T ⊗ B(R_X).
Definition 2 is a version of a definition of admissibility given in Section 5.2 of [27]. B(R_X) denotes the Borel σ-field on R_X, while the notation T ⊗ B(R_X) refers to the product σ-field on Θ′ × R_X; that is, the σ-field on Θ′ × R_X generated by sets of the form A × B, with A ∈ T and B ∈ B(R_X). With Definition 2 in hand, we are now in a position to state the final result of this section.

Proposition 5. Let Θ_1, Θ_2, . . . be a sequence of subsets of Θ, and for each n ∈ N let T_n be an admissible structure for Θ_n and let P^θ_n be a probability measure on T_n. Let P^{θ(X)}_n be the probability measure on B(R) given by P^{θ(X)}_n B = P^θ_n ⊗ P_X {(θ, x) ∈ Θ_n × R_X : θ(x) ∈ B} for each B ∈ B(R), where P^θ_n ⊗ P_X is the product measure on T_n ⊗ B(R_X). Then, as n → ∞, if ∫_{Θ_n} M dP^θ_n → 0 then also P^{θ(X)}_n ⇒ P_Y.

Remark 3.
It is possible to rephrase Proposition 5 in a somewhat less precise fashion that may be easier to interpret. For each n ∈ N, we can think of the measure P^θ_n as corresponding to a random function θ_n taking values in Θ_n. The measure P^{θ(X)}_n describes the distribution of θ_n(X) when θ_n and X are both random, and θ_n is independent of X. The statement ∫_{Θ_n} M dP^θ_n → 0 can be written as EM(θ_n) → 0. Thus, the final statement of Proposition 5 could be written as follows: as n → ∞, if EM(θ_n) → 0 then also θ_n(X) →_d Y.

Sieve Estimation of Replicating Functions
In this section we turn our attention to the statistical estimation of a replicating function using a sample of observations {(X_i, Y_i) : 1 ≤ i ≤ n}.

Assumption 2. {X_i : i ∈ N} and {Y_i : i ∈ N} are iid collections of real valued random variables defined on a complete probability space (Ω, F, P). Each X_i has distribution function F_X, and each Y_i has distribution function F_Y.

Remark 4.
The iid condition in Assumption 2 refers to the independence of X_i and X_j, and of Y_i and Y_j, when i ≠ j. X_i and Y_j may be dependent for any i, j.

Remark 5.
The assumption that (Ω, F , P) is complete will be useful later when we employ a result due to Stinchcombe and White [28] that provides conditions under which certain real valued functions on Ω are analytic (in the measure-theoretic sense). The interested reader may refer to that paper for the definition of an analytic function. When (Ω, F , P) is complete, real valued functions on Ω are analytic if and only if they are measurable.
We wish to use our observed sample {(X_i, Y_i) : 1 ≤ i ≤ n} to construct an estimate of a replicating function that has good properties when n is large. As was made clear in Proposition 1, the set of replicating functions is uncountably infinite in a nontrivial sense. We are thus confronted with the problem of partial identification: the distributional replication property does not uniquely identify the function we are seeking to estimate. The first step in our estimation procedure is to empirically discriminate between those functions that come close to achieving distributional replication, and those that do not. In the previous section, the function M : Θ → R was used to quantify the extent to which a function θ ∈ Θ achieves distributional replication. We will construct an empirical analogue to M. Given a sample of size n, let F_Y,n : R → [0, 1] denote the empirical distribution function of Y, and for θ ∈ Θ let F_X,n(·; θ) : R → [0, 1] denote the empirical distribution function of θ(X). That is, for y ∈ R and θ ∈ Θ, let

F_Y,n(y) = n^{-1} Σ_{i=1}^n 1(Y_i ≤ y) and F_X,n(y; θ) = n^{-1} Σ_{i=1}^n 1(θ(X_i) ≤ y).

Let the function M_n : Θ → R be defined by

M_n(θ) = ∫_R |F_X,n(y; θ) − F_Y,n(y)| dy.

M_n will serve as our empirical analogue to M. Note that we have suppressed the dependence of F_Y,n, F_X,n and M_n on ω ∈ Ω in our notation. We would like M_n to serve as a good approximation to M when n is large. Unfortunately, the space Θ is too rich for us to expect M_n to be close to M uniformly over Θ. We shall instead consider the approximation of M by M_n over a more manageable subset of the functions in Θ. We will consider a sequence of such subsets Θ_1 ⊆ Θ_2 ⊆ · · · , with Θ_n becoming more complex as n grows, but at a slow enough rate to allow the uniform approximation error sup_{θ∈Θ_n} |M_n(θ) − M(θ)| to decay to zero in a suitable sense. Our approach may be regarded as a version of the method of sieve estimation. See [29] for a general discussion of sieve estimation in econometrics.
To control the entropy (complexity) of Θ_n, we shall employ the notion of VC-major dimension. VC-major dimension is a characterization of complexity for classes of functions that is related to the notion of VC-dimension for classes of sets.

Definition 3. Let C be a collection of subsets of R. C is said to shatter a set of points D = {x_1, . . . , x_d} ⊂ R, d ∈ N, if all 2^d subsets of D can be written as the intersection of D with some set in C. C is said to be a VC-class if, for some d ∈ N, C cannot shatter any set of size d. If C is a VC-class then the VC-dimension of C, written V(C), is defined to be the smallest d ∈ N for which no set of size d is shattered by C. If C is not a VC-class, we set V(C) = ∞.
Definition 3 is standard in the literature on empirical processes; see e.g., Section 2.6.1 in [30]. Building on Definition 3, we define the VC-major dimension of a subset of Θ as follows.
Definition 4. Consider a collection of functions Θ′ ⊆ Θ. A subset of R is said to be majorized by Θ′ if it can be written as {x ∈ R_X : θ(x) > c} for some θ ∈ Θ′ and some c ∈ R. Let C denote the collection of all sets majorized by Θ′. We say that Θ′ is a VC-major class if C is a VC-class. The VC-major dimension of Θ′, written V(Θ′), is defined to be the VC-dimension of C.

Remark 6. The definition of VC-major dimension should not be confused with that of VC-subgraph dimension, which also appears frequently in the empirical process literature; in general, the two are different. When Θ′ is the set of indicator functions of a collection of sets C, the VC-major dimension and VC-subgraph dimension of Θ′ are both equal to the VC-dimension of C. Sections 2.6.2 and 2.6.4 in [30] provide discussions of VC-subgraph and VC-major classes respectively.
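To make Definitions 3 and 4 concrete, consider the class of nondecreasing functions on R_X. Every set majorized by this class, {x ∈ R_X : θ(x) > c}, is a half-line extending to the right. Half-lines shatter any single point, but no two-point set: the subset containing only the smaller point cannot be cut out. Under the convention of Definition 3 (the smallest cardinality at which shattering fails), the VC-dimension of the half-lines is therefore 2, so classes of nondecreasing functions are VC-major. The brute-force check below is our own illustration.

```python
def halfline_subsets(points):
    """All subsets of `points` of the form {x in points : x > c} for some real c."""
    thresholds = [min(points) - 1.0] + sorted(points)
    return {frozenset(x for x in points if x > c) for c in thresholds}

def shattered_by_halflines(points):
    """True if half-lines pick out all 2^d subsets of `points`."""
    return len(halfline_subsets(points)) == 2 ** len(points)

assert shattered_by_halflines([0.5])            # any single point is shattered
assert not shattered_by_halflines([0.3, 0.7])   # {0.3} alone is unobtainable
```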
We will control the entropy of the spaces Θ n by bounding the growth rate of their VC-major dimension. In addition, we will need to introduce some additional technical conditions to ensure the measurability of certain real valued functions on Ω. For Θ ⊆ Θ, let B(Θ ) denote the Borel σ-field on Θ induced by the pseudometric d.
Assumption 3. For each n ∈ N, Θ n ⊆ Θ is a nonempty VC-major class. Further, B(Θ n ) is an admissible structure on Θ n , and (Θ n , B(Θ n )) is a Souslin measurable space.

Remark 7.
Refer to Stinchcombe and White [28] for the definition of a Souslin measurable space, and further discussion. Here, we note only that for (Θ n , B(Θ n )) to be a Souslin measurable space, it suffices that (Θ n , d) is a Polish metric space; that is, (Θ n , d) is a metric space that is topologically isomorphic to a complete separable metric space.

Remark 8.
In the proof of Proposition 6 it is established that sup θ∈Θ n |M n (θ) − M(θ)| is a measurable function from (Ω, F ) to (R, B(R)). Thus, our statement of Proposition 6 uses the ordinary expectation operator. It is common in the empirical process literature to see results of this kind expressed in terms of outer expectation; see e.g., Section 1.2 in [30].
When the uniform approximation error sup_{θ∈Θ_n} |M_n(θ) − M(θ)| is small, we can use the empirical criterion function M_n to distinguish between those functions in Θ_n that are close to achieving distributional replication, and those that are not. We have yet to address the issue of partial identification: there may be many functions in Θ_n that are close to achieving distributional replication. We wish to entertain the possibility that not all replicating functions are created equal. Let p : Θ → R be a function describing the "price" of each function θ ∈ Θ. Rather than seeking to estimate an arbitrary replicating function, we will seek to estimate a replicating function θ for which p(θ) is as small as possible.
Assumption 4. The function p : Θ → R is nonnegative, and continuous with respect to d.
Loosely speaking, we seek to estimate the cheapest, or optimal, replicating function. The following result concerns the selection of our estimated function θ̂_n. In it, we make the random nature of M_n explicit by writing M_n as a function of both ω ∈ Ω and θ ∈ Θ.

Proposition 7. Let ε_1, ε_2, . . . and λ_1, λ_2, . . . be sequences of positive real numbers. For each n ∈ N, there exists a measurable function θ̂_n from (Ω, F) to (Θ_n, B(Θ_n)) that satisfies θ̂_n(ω) ∈ Θ̂*_n(ω) and p(θ̂_n(ω)) ≤ inf_{θ∈Θ̂*_n(ω)} p(θ) + ε_n for all ω ∈ Ω, where

Θ̂*_n(ω) = {θ ∈ Θ_n : M_n(ω, θ) ≤ inf_{ϑ∈Θ_n} M_n(ω, ϑ) + λ_n}.

Remark 9. The mathematical content of Proposition 7 is the existence of a random function θ̂_n satisfying the stated conditions. The proof applies the Sainte-Beuve measurable selection theorem (see Corollary 5.3.2 in [27]) and Theorem 2.17 of Stinchcombe and White [28], which concerns the measurability of the suprema of random functions over random sets. Proposition 7 also serves to define our estimated replicating function θ̂_n. That is, we take θ̂_n to be any random function satisfying the conditions given in Proposition 7.

Remark 10.
The random set Θ̂*_n can be viewed as our estimate of the set of replicators Θ*. It consists of all those functions θ ∈ Θ_n such that M_n(θ) comes close to achieving its infimum over Θ_n. Note that this infimum is not necessarily achieved by any θ ∈ Θ_n. The tuning parameter λ_n governs how close M_n(θ) must be to inf_{ϑ∈Θ_n} M_n(ϑ) before θ is admitted into the set Θ̂*_n. We will require that λ_n → 0 as n → ∞, but at a rate that is not too fast. θ̂_n is chosen such that θ̂_n(ω) ∈ Θ̂*_n(ω) for each ω ∈ Ω. Thus, if Θ̂*_n is an effective estimator of Θ*, we can expect θ̂_n to come close to achieving distributional replication.

Remark 11. The sequence ε_1, ε_2, . . . should be thought of as converging to zero very quickly. We would like to choose θ̂_n such that p(θ̂_n(ω)) is equal to the infimum of p over Θ̂*_n(ω) for each ω ∈ Ω, but in general this is not possible because the set Θ̂*_n(ω) need not be compact. So instead, we choose θ̂_n such that p(θ̂_n(ω)) is very close to the infimum of p over Θ̂*_n(ω), with arbitrarily small approximation error ε_n. This technical argument relates closely to what Chen [29] (p. 5561) refers to as an approximate sieve extremum estimate. Though λ_n and ε_n appear to play similar roles in Proposition 7, from a more substantive perspective we wish ε_n to be as small as possible, while λ_n plays a more involved role in the asymptotic results to follow, and must be chosen to converge to zero at a suitable rate.

Remark 12.
If there is no relevant notion of "price" over the space of functions Θ, we may simply take p to be constant over Θ. In this case, the sequence ε_1, ε_2, . . . and the function p play no role in Proposition 7. Instead, Proposition 7 merely asserts the existence of a measurable function θ̂_n from (Ω, F) to (Θ_n, B(Θ_n)) that satisfies θ̂_n(ω) ∈ Θ̂*_n(ω) for each ω ∈ Ω.
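The selection rule in Proposition 7 can be sketched over a toy finite sieve: compute M_n for each candidate function, form the estimated set of candidates whose criterion value is within λ_n of the minimum, and take the cheapest member of that set. Everything below — the candidate functions, their "prices", and the value of λ_n — is hypothetical, chosen only to illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.uniform(size=n)
y = rng.uniform(size=n)  # target distribution: U(0,1)

grid = np.linspace(0.0, 1.0, 1001)
dt = grid[1] - grid[0]

def M_n(theta_of_x):
    """Empirical criterion: integrated absolute difference of the two ECDFs."""
    F_theta = np.searchsorted(np.sort(theta_of_x), grid, side="right") / n
    F_y = np.searchsorted(np.sort(y), grid, side="right") / n
    return np.sum(np.abs(F_theta - F_y)) * dt

# Candidate sieve: (name, theta(X) values, hypothetical price p(theta)).
candidates = [
    ("identity", x, 1.00),
    ("reflection", 1.0 - x, 0.97),        # replicator, cheaper
    ("sawtooth", (2.0 * x) % 1.0, 0.95),  # replicator, cheapest
    ("squash", 0.5 * x + 0.25, 0.90),     # cheapest of all, but not a replicator
]

lam_n = 0.03  # tuning parameter lambda_n
scores = {name: M_n(vals) for name, vals, _ in candidates}
best = min(scores.values())
feasible = [(name, price) for name, _, price in candidates
            if scores[name] <= best + lam_n]  # estimated set of replicators
name_hat, price_hat = min(feasible, key=lambda pair: pair[1])
print(name_hat, price_hat)
```

The non-replicating "squash" candidate is the cheapest overall but is screened out by the criterion; the estimator then selects the cheapest function among those that pass.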
It remains to show that our estimator θ̂_n has desirable asymptotic properties. To ensure that θ̂_n is well-behaved, the rate at which the sieve space Θ_n expands, and at which the tuning parameter λ_n decays, must be suitably controlled. The following assumption provides a sufficient condition of this kind.

Assumption 5. As n → ∞, we have λ_n → 0, n^{-1} λ_n^{-2} V(Θ_n) → 0, and λ_n^{-1} inf_{θ∈Θ_n} d(θ, θ†) → 0 for each θ† ∈ Θ†.

Remark 13.
The requirement that n^{-1} λ_n^{-2} V(Θ_n) → 0 and λ_n^{-1} inf_{θ∈Θ_n} d(θ, θ†) → 0 for each θ† ∈ Θ† places opposing constraints on the rate of expansion of Θ_n as n → ∞. The complexity of Θ_n must increase sufficiently fast for the sieve approximation error inf_{θ∈Θ_n} d(θ, θ†) to tend to zero faster than λ_n for each θ† ∈ Θ†, but not so fast that V(Θ_n) increases faster than nλ_n^2. On the other hand, the rate of decay of λ_n may be arbitrarily slow, provided that λ_n → 0.
Our final result of this section indicates that, when the above assumptions are satisfied, in large samples we can expect our estimated function to be close to achieving distributional replication, and close to achieving the minimum cost among replicators. We first require some additional notation. Let P^{θ̂}_n be the probability measure on B(Θ_n) given by P^{θ̂}_n B = P θ̂_n^{-1} B for each B ∈ B(Θ_n), and let P^{θ̂(X)}_n be the probability measure on B(R) given by P^{θ̂(X)}_n B = P^{θ̂}_n ⊗ P_X {(θ, x) ∈ Θ_n × R_X : θ(x) ∈ B} for each B ∈ B(R). Note that for P^{θ̂(X)}_n to be well defined we need B(Θ_n) to be an admissible structure for Θ_n; this condition was given in Assumption 3. We can think of P^{θ̂(X)}_n as the probability distribution of θ̂_n(X) when θ̂_n and X are distributed independently of one another.

Remark 14.
Proposition 8 indicates that θ̂ n can be expected to perform well with respect to the dual goals of distributional replication and cost minimization in large samples. This duality complicates any discussion of the optimal selection of the tuning parameter λ n . When λ n is large, we include functions in our estimated set Θ̂ * n for which the empirical evidence for distributional replication is weaker, but we also minimize the function p over a larger set. In applications, the best choice of λ n would depend on an individual's relative preference for distributional replication, quantified by M(θ), and cost minimization, quantified by p(θ).

Distributional Replication Using Options
In this section we consider the problem of choosing a portfolio of options on some financial asset such that the payoff from our portfolio after a specified period of time has approximately the same statistical distribution as the payoff from a $1 investment in some other asset over the same time period. We would like to find the cheapest portfolio of options such that distributional replication is achieved; in particular, we would like the cost of the portfolio to be $1 or less. We will show how this problem of portfolio selection can be interpreted and solved using the machinery developed in the previous two sections.
We suppose that the random variables X and Y represent the dollar-denominated payoffs after one period from a $1 investment in each of two assets. The asset with payoff X will be referred to as the base asset, and the asset with payoff Y will be referred to as the target asset. The price of a one-share investment in either asset is taken to be $1. We assume that X and Y are nonnegative and may be arbitrarily large with nonzero probability, so that R X = R Y = [0, ∞) under Assumption 1. We may thus replace Assumption 1 with the following more restrictive condition.

We find the payoff distribution of the target asset to be desirable, but we seek to achieve this distribution by investing in a portfolio composed of the base asset itself and a basket of European put and call options written on the base asset, with the options expiring after one period. The payoff of such a portfolio after one period is a nonrandom function of X; for instance, the payoff from a European call option with strike price s after one period is given by max{0, X − s}, while the payoff from a European put option with strike price s after one period is given by max{0, s − X}. We also allow our portfolio to include an investment in risk-free zero-coupon bonds with $1 par value, expiring after one period. The payoff from such a bond after one period is simply $1. We allow our portfolio to include long or short positions in each of the component assets, but the payoff from the complete portfolio must be nonnegative.
The payoff from a portfolio of options and bonds after one period is a nonrandom function of X. Thus, we can think of a portfolio as a function θ ∈ Θ, and write the payoff from the portfolio as θ(X). Suppose our portfolio includes options at m different strike prices s 1 , . . . , s m , with 0 < s 1 < · · · < s m < ∞. Without loss of generality, we may consider all options to be call options, since the payoff function for a put option with strike price s i can be replicated by selling one share of the base asset, purchasing a call option with strike price s i , and purchasing s i zero-coupon bonds. Suppose we form a portfolio by purchasing β 1 bonds, β 2 shares in the base asset, and β i+2 call options at strike price s i , for i = 1, . . . , m. Writing s = (s 1 , . . . , s m ) and β = (β 1 , . . . , β m+2 ), the payoff function corresponding to our portfolio is then given by θ(x; β, s) = β 1 + β 2 x + ∑ m i=1 β i+2 max{0, x − s i }. To ensure that the payoff from our portfolio is nonnegative, we require that β lies in a suitable subset of R m+2 . We will let Ψ m (s) denote the collection of all continuous functions from [0, ∞) to [0, ∞) that are linear on each of the m + 1 subintervals (0, s 1 ), (s 1 , s 2 ), . . . , (s m , ∞), and let B(Ψ m (s)) denote the Borel σ-field on Ψ m (s) generated by d.
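The payoff arithmetic above is easy to make concrete. The following sketch (with illustrative function names not taken from the paper) encodes the bond, share and call components of a portfolio payoff; the put replication identity max{0, s − x} = s · 1 − x + max{0, x − s} then corresponds to the weight vector (s, −1, 1) on bonds, shares and one call.

```python
import numpy as np

def call_payoff(x, s):
    """European call with strike s: max{0, x - s}."""
    return np.maximum(0.0, x - s)

def put_payoff(x, s):
    """European put with strike s: max{0, s - x}."""
    return np.maximum(0.0, s - x)

def portfolio_payoff(x, beta, strikes):
    """theta(x; beta, s) = beta_1 + beta_2 * x + sum_i beta_{i+2} * max{0, x - s_i}:
    beta[0] bonds paying $1, beta[1] shares paying x, and beta[i+2] calls."""
    out = beta[0] + beta[1] * x
    for b, s in zip(beta[2:], strikes):
        out = out + b * call_payoff(x, s)
    return out
```

Evaluating `portfolio_payoff(x, [s, -1.0, 1.0], [s])` reproduces `put_payoff(x, s)` at every x, which is the without-loss-of-generality reduction to call options used in the text.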
We can see from Proposition 9 and Remark 7 that Ψ m (s) satisfies the conditions placed on Θ n in Assumption 3. The main idea behind the application discussed in this section is that Ψ m (s), the space of nonnegative payoff functions achievable using strike prices s, can be used to play the role of the sieve space Θ n described in the previous section. We obtain an expanding sequence of sieve spaces by assuming that the collection of strike prices s varies with the sample size n, becoming more dense (in a sense soon to be made precise) as n increases. Suppose that m 1 , m 2 , . . . is a nondecreasing sequence of natural numbers with m n → ∞ and m n /n → 0 as n → ∞. Let {s i,n : i = 1, . . . , m n ; n ∈ N} be a triangular array of positive real numbers satisfying (i) 0 < s 1,n < · · · < s m n ,n for each n ∈ N, and (ii) {s 1,n , . . . , s m n ,n } ⊆ {s 1,n+1 , . . . , s m n+1 ,n+1 } for each n ∈ N. We define our expanding sequence of sieve spaces by setting Θ n = Ψ m n (s 1,n , . . . , s m n ,n ). Proposition 9 implies that this choice of Θ n satisfies Assumption 3, with V (Θ n ) = m n + 3.
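One simple way to construct a triangular array of strike prices satisfying the nesting condition (ii) is to place strikes at dyadic quantiles of a reference distribution, so that each grid contains the previous one and adjacent strikes bracket equal probability mass. This is a hedged sketch: the quantile function and naming are illustrative, not prescribed by the paper.

```python
import numpy as np

def dyadic_strikes(quantile_fn, level):
    """Strike prices at the k/2^level quantiles of a reference
    distribution, k = 1, ..., 2^level - 1.  The dyadic rationals at
    level j are contained in those at level j + 1, so the strike
    grids are nested, and each pair of adjacent strikes brackets
    probability mass exactly 1/2^level."""
    m = 2 ** level
    return [quantile_fn(k / m) for k in range(1, m)]
```

Here m n = 2^level − 1, and letting the level grow slowly with n gives m n → ∞ with m n /n → 0 as required.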
In the context of the present application, the function p introduced in the previous section describes, literally, the price of each payoff function θ ∈ Θ. For a payoff function θ ∈ Θ n , we can calculate the price p(θ) directly from the prices of bonds and options. Consider the payoff function θ(·; β, s) introduced above. Let p 1 denote the price of a bond, p 2 denote the price of a share in the base asset, and p i+2 denote the price of a call option with strike price s i , for i = 1, . . . , m. Note that p 2 = 1 by assumption. The price of θ(·; β, s) is simply ∑ m+2 i=1 p i β i . In this way we can calculate p(θ) for any θ ∈ Θ n , provided we observe the bond price p 1 and the prices of call options at strike prices s 1,n , . . . , s m n ,n .
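Since p(θ) is a linear form in the portfolio weights, pricing is a one-line computation once bond and option prices are observed. A minimal sketch, assuming the same (β, s) parameterization as above and purely illustrative price inputs:

```python
def portfolio_price(beta, bond_price, asset_price, call_prices):
    """p(theta) = sum_i p_i * beta_i, with p_1 the bond price, p_2 the
    base-asset price (equal to 1 by assumption in the text), and
    p_{i+2} the observed call prices at the strikes s_i."""
    prices = [bond_price, asset_price] + list(call_prices)
    return sum(p * b for p, b in zip(prices, beta))
```

For example, the synthetic put at strike s built from (s, −1, 1) has price s · p 1 − p 2 + p 3 , mirroring the replication argument above.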
Assumption 5 imposes a condition on the rate of decay of the sieve approximation error: we require that λ −1 n inf θ∈Θ n d(θ, θ † ) → 0 for each θ † ∈ Θ † , where Θ † is some dense subset of Θ * under d. The following result shows how Θ † may be chosen such that this condition is satisfied when our sieve space corresponds to portfolios of options.
Proposition 10 reveals that our sequence of sieve spaces constructed using option payoffs can approximate replicating functions satisfying a deformed Lipschitz condition, provided that sup 0≤i≤m n P X (s i,n , s i+1,n ) decays to zero at a suitable rate. Further, that set of deformed Lipschitz continuous replicating functions is dense in the set of all replicating functions. If we could choose our strike prices such that P X (s i,n , s i+1,n ) was constant across i = 0, . . . , m n , we would have inf θ∈Θ n d(θ, θ † ) = O(m −1 n ) for each θ † ∈ Θ † . Proposition 10 and part (i) of Proposition 9 show how the choice of strike prices is constrained by Assumption 5. Specifically, the conditions on Θ n imposed by Assumption 5 may be rewritten as follows: n −1 λ −2 n m n → 0 and λ −1 n m n sup 0≤i≤m n P X (s i,n , s i+1,n ) 2 → 0 as n → ∞. If our strike prices are chosen such that P X (s i,n , s i+1,n ) is constant across i = 0, . . . , m n , Assumption 5 will be satisfied provided that λ n = o(1), m n = o(nλ 2 n ) and m −1 n = o(λ n ). For instance, we could choose m n ∼ n a and λ n ∼ n −b , with 0 < b < a < 1 − 2b. As noted in Remark 14, it is difficult to see how an optimal choice of m n and λ n could be made in practice, because the two parameters may have different effects on the twin criterion functions M(θ) and p(θ), and one's relative preference for optimizing with respect to those two functions may be idiosyncratic. It is perhaps best to experiment with a range of different values for m n and λ n . Further, the choice of strike prices is likely to be constrained by the strike prices being actively traded on the market.
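The admissible rates can be checked numerically. The following sketch verifies, for the example m n ∼ n^a and λ n ∼ n^(−b) with a = 0.4 and b = 0.1 (so that 0 < b < a < 1 − 2b holds), that both quantities controlled by Assumption 5 decay as n grows:

```python
import numpy as np

# Sieve growth m_n ~ n^a and tuning decay lam_n ~ n^(-b); the values
# a = 0.4, b = 0.1 satisfy 0 < b < a < 1 - 2b from the text.
a, b = 0.4, 0.1
n = np.array([1e3, 1e4, 1e5, 1e6])
m_n = n ** a
lam_n = n ** (-b)

ratio1 = m_n / (n * lam_n ** 2)  # m_n = o(n * lam_n^2): decays like n^(a - 1 + 2b)
ratio2 = 1.0 / (m_n * lam_n)     # 1/m_n = o(lam_n): decays like n^(b - a)
```

Both ratios decay polynomially, here like n^(−0.4) and n^(−0.3) respectively, which is what the rate conditions require.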

Conclusions
In this paper we have developed a mathematical framework for thinking about the estimation of a function θ such that θ(X) has the same distribution as Y. We have discussed the relevance of our results to financial applications in which one seeks to find the cheapest way to achieve a desired payoff distribution by trading liquid assets. We now briefly discuss two possible extensions of our results that may prove fruitful.
In terms of the relevance of our technical conditions in financial applications, the elephant in the room is clearly Assumption 2, which imposes an iid condition on the random variables (X i , Y i ), i = 1, . . . , n. It is certainly the case that time series of financial returns typically do not behave as though they were distributed independently over time, as is clear from the voluminous literature on stochastic volatility. The iid condition comes into play in the proof of Proposition 6, in which results in empirical process theory are used to establish a uniform bound on the error in the approximation of M by M n over our sieve space Θ n . The results we apply are based on iid conditions, but generalizations suitable for dependent data are available [31][32][33]. It seems likely that, with some strengthening of the rate conditions in Assumption 5, the results in this paper could be adapted to allow for dependent data. However, allowing for the possibility of serial dependence raises a further question. The methods we have proposed are designed such that the unconditional distribution of θ(X) is approximately equal to the unconditional distribution of Y. If data are serially dependent, the more relevant objects may be distributions that are conditional on past information. Though we acknowledge the importance of this issue, it goes beyond the scope of this paper.
A second potential extension of our results would be to consider the replication of multivariate distributions. As discussed in the introduction, Kat and Palaro [7,8] consider estimating a transformation θ of a pair of random variables X and Z such that X and θ(X, Z) have the same joint distribution as X and Y. The difficulty with adapting our own method to this approach is that the class of bivariate functions that can be approximated by portfolios of options written on individual assets is rather small. One possible solution would be to consider portfolios formed from derivative securities that are written on multiple underlying assets; another would be to forgo exact distributional replication, and seek the closest distributional match from a smaller class of multivariate payoff functions that is approximable using portfolios of simple options. We leave these possibilities for future research.
Funding: This research received no external funding.
Data Availability Statement: Data is contained within the article.

Conflicts of Interest: The authors declare no conflicts of interest.

Appendix A. Proofs
Proof of Proposition 1. Choose a point c ∈ (0, 1), and let θ̂ c : [0, 1) → [0, 1) denote translation by c modulo one, θ̂ c (u) = u + c mod 1. F X is continuous under Assumption 1, so F X (X) ∼ U(0, 1). Hence, for any a ∈ (0, 1), we have P(θ̂ c (F X (X)) ≤ a) = a, implying that θ̂ c • F X (X) ∼ U(0, 1). Therefore, setting θ c = Q Y • θ̂ c • F X , we obtain θ c (X) ∼ Y, and so θ c is a replicating function for X and Y. If we choose c 0 , c 1 ∈ (0, 1) with c 0 ≠ c 1 , then θ̂ c 0 and θ̂ c 1 differ everywhere on [0, 1), and so continuity of F X implies that θ c 0 (x) ≠ θ c 1 (x) for all x in a set of P X -measure one. Thus, by allowing c to vary over (0, 1), we obtain an uncountable collection of replicating functions, no two of which are equal on a set of positive P X -measure.
Proof of Proposition 2. For θ 0 , θ 1 ∈ Θ we can use the triangle inequality to show that and so applying the triangle inequality again we obtain Tonelli's theorem implies that The distribution function F Y is continuous under Assumption 1, and so we have for each x ∈ R X . Therefore, Similarly, we have and so

Proof of Proposition 3. It is obvious that M(θ) = 0 if θ ∈ Θ * . We will prove the reverse implication. Suppose M(θ) = 0. Then F X (y; θ) = F Y (y) for all y in a set of P Y -measure one. Suppose F X (c 0 ; θ) ≠ F Y (c 0 ) for some c 0 in the interior of R Y . Right continuity of F X (·; θ) and F Y ensures that F X (y; θ) ≠ F Y (y) for all y in some open interval (c 0 , c 1 ). Since F Y is strictly increasing on R Y under Assumption 1, (c 0 , c 1 ) must have strictly positive P Y -measure, leading to a contradiction. Thus it must be the case that F X (y; θ) = F Y (y) for all y in the interior of R Y . Since F X (·; θ) is nondecreasing and takes values between zero and one, and F Y increases continuously from zero to one over R Y , it follows that F X (y; θ) = F Y (y) for all y ∈ R, so that θ ∈ Θ * .

Proof of Proposition 4. If the claimed convergence fails, then there must be an increasing sequence of natural numbers n 1 , n 2 , . . . , a point c 0 , and a real number ε > 0 (or ε < 0, which is handled symmetrically) such that F X (c 0 ; θ n k ) ≥ F Y (c 0 ) + ε for all k sufficiently large. Suppose ε > 0. Since F Y is continuous under Assumption 1, we may choose c 1 > c 0 such that F Y (c 1 ) = F Y (c 0 ) + ε/2. Monotonicity of F X (·; θ n k ) and F Y then ensures that F X (y; θ n k ) ≥ F Y (y) + ε/2 for all y ∈ [c 0 , c 1 ], for all k sufficiently large. Consequently, M(θ n k ) is bounded away from zero for all k sufficiently large, implying that M(θ n ) ↛ 0.
Proof of Proposition 5. Since (θ, x) → θ(x) is T n ⊗ B(R X )-measurable, it follows that (θ, x, y) → 1{θ(x) ≤ y} is T n ⊗ B(R X ) ⊗ B(R)-measurable. Tonelli's theorem thus implies that (θ, y) → ∫ R X 1{θ(x) ≤ y} dP X (x) = F X (y; θ) is T n ⊗ B(R)-measurable, justifying the following interchange of integrals: We thus have Again using the T n ⊗ B(R)-measurability of (θ, y) → F X (y; θ), Tonelli's theorem implies that ∫ Θ n F X (y; θ) dP θ n (θ) = ∫ Θ n ∫ R X 1{θ(x) ≤ y} dP X (x) dP θ n (θ) = P θ n (X) (−∞, y] for each y ∈ R. Letting F θ n (X) denote the cdf corresponding to P θ n (X) , we now have Arguing as in the proof of Proposition 4, we can show that

Proof of Proposition 6. Elementary arguments can be used to show that, for all θ ∈ Θ, We will establish a uniform bound on the order of each of the three terms on the right-hand side of (A1). These bounds will be expressed in terms of the outer expectation operator E * , denoting outer integration of nonnegative functions defined on the underlying probability space (Ω, F , P); see e.g., Section 1.2 in [30]. Obtaining a bound for the first term is simple as it does not depend on θ: Donsker's theorem yields For the second term on the right-hand side of (A1), we have where G n is the class of indicator functions of sets of the form {x ∈ R X : θ(x) ≤ y} with θ ∈ Θ n and y ∈ R. Note that G n is the collection of indicators of all complements of sets majorized by Θ n . Since Θ n is VC-major with dimension V (Θ n ), G n must be VC-subgraph with dimension V (Θ n ). Hence Theorem 2.6.7 in [30] implies that, for any ε ∈ (0, 1) and any probability measure Q on B(R), there exists K < ∞ such that we have the uniform entropy bound N(ε, G n , L 2 (Q)) ≤ KV (Θ n )(16e) V (Θ n ) ε −2(V (Θ n )−1) . Theorem 2.14.1 in [30] thus gives E * sup g∈G n |P X n g − P X g| = O(√(V (Θ n )/n)), implying that For the third term on the right-hand side of (A1), we have where H n is the class of functions {|F X (·; θ) − F Y (·)| : θ ∈ Θ n }.
Consider the simpler class of functions H 0 n = {F X (·; θ) : θ ∈ Θ n }. Since H 0 n is a subset of the collection of monotone increasing functions from R to [0, 1], Theorem 2.7.5 in [30] implies the existence of K < ∞ such that we have the uniform bracketing entropy bound N [] (ε, H 0 n , L 2 (P Y )) ≤ Kε −1 for all ε ∈ (0, 1). It is straightforward to show that N [] (ε, H n , L 2 (P Y )) ≤ N [] (ε, H 0 n , L 2 (P Y )), and so Theorem 2.14.2 in [30] gives E * sup h∈H n |P Y n h − P Y h| = O(n −1/2 ), implying that Collecting together these bounds on the order of the terms on the right-hand side of (A1), we obtain It remains only to show that ω → sup θ∈Θ n |M n (ω, θ) − M(θ)| is F -measurable. Since (Ω, F , P) is complete under Assumption 2, universal measurability implies F -measurability, and so it suffices to establish universal measurability, which follows from Corollary 5.3.5 in [27] provided the relevant joint measurability conditions hold. Assumption 3 states that (Θ n , B(Θ n )) is a Souslin measurable space, and θ → M(θ) is continuous and hence B(Θ n )-measurable by Proposition 3, so it suffices for us to show that (ω, θ) → M n (ω, θ) is F ⊗ B(Θ n )-measurable, which reduces to the joint measurability of (ω, θ) → θ(X j (ω)) for each j. This condition is satisfied for each j since B(Θ n ) is an admissible structure for Θ n under Assumption 3, and each X j is F -measurable.
Proof of Proposition 9. Since Ψ m (s) ⊂ Λ m (s), it must be the case that V (Ψ m (s)) ≤ V (Λ m (s)). Moreover, given any θ 0 ∈ Λ m (s) and any interval [a, b] ⊂ R, we can find θ 1 ∈ Ψ m (s) and c ∈ R such that θ 1 (x) = θ 0 (x) + c for all x ∈ [a, b]. It is easy to see that this implies V (Ψ m (s)) ≥ V (Λ m (s)). This proves (i).

We next prove (ii). First, observe that d is a metric (rather than merely a pseudometric) on Ψ m (s), because any two distinct functions θ 0 , θ 1 ∈ Ψ m (s) must differ everywhere on some open interval, which must be of positive P X -measure under Assumption 6. Since F Y is strictly increasing on [0, ∞) under Assumption 6, we will thus have F Y • θ 0 and F Y • θ 1 differing everywhere on the interval in question, forcing d(θ 0 , θ 1 ) to be nonzero. It remains to show that the metric space (Ψ m (s), d) is topologically isomorphic to a complete separable metric space. Each function θ ∈ Ψ m (s) can be written in the form θ(x) = ∑ m+2 i=1 β i f i (x), where f 1 (x) = 1, f 2 (x) = x, f i+2 (x) = max{0, x − s i } for i = 1, . . . , m, and β belongs to the set Ψ̃ m (s) ⊆ R m+2 of coefficient vectors delivering nonnegative payoffs. Similarly, each β ∈ Ψ̃ m (s) uniquely identifies a function θ ∈ Ψ m (s). We will denote this bijection between Ψ m (s) and Ψ̃ m (s) by S : Ψ m (s) → Ψ̃ m (s). It is easy to see that Ψ̃ m (s) is a complete separable metric space. We will show that S defines a topological isomorphism between (Ψ m (s), d) and (Ψ̃ m (s), d̃), where d̃ is the usual Euclidean metric on Ψ̃ m (s). That is, we will show that S and S −1 are continuous. Suppose β 1 , β 2 , . . . is a sequence in Ψ̃ m (s) converging to some β * ∈ Ψ̃ m (s), and let θ * = S −1 β * and θ n = S −1 β n for each n ∈ N. For x ∈ [0, ∞), Cauchy's inequality gives |θ n (x) − θ * (x)| ≤ d̃(β n , β * )(∑ m+2 i=1 f i (x) 2 ) 1/2 , and hence θ n converges to θ * pointwise. It then follows from dominated convergence that d(θ n , θ * ) → 0, which proves that S −1 is continuous. Suppose now that β 1 , β 2 , . . . does not converge to β * ∈ Ψ̃ m (s). Then we can choose a subsequence β n 1 , β n 2 , . . . and a constant ε > 0 such that d̃(β n k , β * ) > ε for all k. For x ∈ R X and all k, we have θ n k (x) − θ * (x) = d̃(β n k , β * ) ∑ m+2 i=1 γ n k ,i f i (x), where γ n k = d̃(β n k , β * ) −1 (β n k − β * ). The subsequence γ n 1 , γ n 2 , . . . takes values in the unit sphere in R m+2 , which is compact, and so we have a further subsequence γ n k 1 , γ n k 2 , . . . that converges to some γ * in the unit sphere. Therefore, arguing as we did above with Cauchy's inequality, we have ∑ m+2 i=1 γ n k j ,i f i (x) → ∑ m+2 i=1 γ * i f i (x) pointwise in x as j → ∞. Noting that ∑ m+2 i=1 γ * i f i (x) ≠ 0 on a set of positive P X -measure, we conclude that the subsequence θ n k cannot contain a further subsequence that converges to θ * pointwise on a set of P X -measure one. Recall (see e.g., Theorem 9.2.1 in [35]) that a sequence of random variables converges in probability if and only if every subsequence contains a further subsequence that is almost surely convergent. It must therefore be the case that θ n (X) ↛ p θ * (X). Since F Y has a continuous inverse, this implies that F Y (θ n (X)) ↛ p F Y (θ * (X)), which implies that E|F Y (θ n (X)) − F Y (θ * (X))| ↛ 0. That is, d(θ n , θ * ) ↛ 0. This establishes that S is continuous, which proves (ii).
Proof of Proposition 10. We first show that Θ † is dense in Θ * under d. Fix a function θ * ∈ Θ * . The function θ̃ * = F Y • θ * • Q X is a Borel measurable mapping from [0, 1) to [0, 1); we extend the domain and range of θ̃ * to [0, 1] by setting θ̃ * (1) = 1. A well known consequence of Urysohn's lemma (see e.g., Lemma 2.6.3 in [35]) is that the continuous functions on [0, 1] form a dense subset of the Lebesgue integrable functions on [0, 1] under the L 1 -seminorm. It is also well known (see e.g., Theorem 11.2.4 and the following example in [35]) that any continuous function on [0, 1] can be approximated arbitrarily well by a continuous piecewise linear function on [0, 1], with finitely many kinks. Such a function can in turn be approximated arbitrarily well by a continuous piecewise linear function on [0, 1], with finitely many kinks, for which the slope of the function is nonzero wherever it is defined. We will let θ̃ 1 , θ̃ 2 , . . . be a sequence of such functions, chosen such that lim k→∞ ∫ 0 1 |θ̃ * (u) − θ̃ k (u)| du = 0 and 0 ≤ θ̃ k ≤ 1 for each k ∈ N.
Next we show that inf θ∈Θ n d(θ, θ † ) = O(m n sup 0≤i≤m n P X (s i,n , s i+1,n ) 2 ) for each θ † ∈ Θ † . Choose θ n ∈ Θ n such that θ n (s i,n ) = θ † (s i,n ) for i = 0, . . . , m n , and such that θ n (x) is constant for x ≥ s m n ,n . Then we have where F X (∞) = 1. Our choice of θ n and the Lipschitz property of F Y • θ † • Q X ensure that sup F X (s i,n )<u<F X (s i+1,n ) P X (s i,n , s i+1,n ) 2 ≤ Km n sup 0≤i≤m n P X (s i,n , s i+1,n ) 2 , giving the desired result.