Misspecified diffusion models with high-frequency observations and an application to neural networks

We study the asymptotic theory of misspecified models for diffusion processes with noisy nonsynchronous observations. Unlike with correctly specified models, the original maximum-likelihood-type estimator has an asymptotic bias under the misspecified setting and fails to achieve the optimal rate of convergence. To address this, we consider a new quasi-likelihood function that allows constructing a maximum-likelihood-type estimator that achieves the optimal rate of convergence. Study of misspecified models enables us to apply machine-learning techniques to the maximum-likelihood approach. With these techniques, we can efficiently study the microstructure of a stock market by using the rich information contained in high-frequency data. Neural networks have particularly good compatibility with the maximum-likelihood approach, so we consider an example of using a neural network for simulation studies and for empirical analysis of high-frequency data from the Tokyo Stock Exchange. We demonstrate that the neural network outperforms polynomial models in volatility predictions for major stocks on the Tokyo Stock Exchange.


Introduction
High-frequency financial data, such as data on all intraday transactions from a stock market, are increasingly available. These data contain a lot of information about the intraday stock market, so they are expected to contribute to the analysis of stock microstructures. More complicated structures exist in high-frequency data than in low-frequency data (e.g. daily or weekly time series of stock transactions), which increases the difficulty of statistical analysis. One problem is that observation noise is intensified with frequency. To explain the empirical evidence, when we model stock price data as a continuous stochastic process, we must assume that the observations contain additional noise. Another significant problem with analysis of high-frequency data is that nonsynchronous observation occurs; namely, we observe the prices of different securities at different time points. Nonsynchronous observations make it more difficult to construct estimators of the covariation of pairs of stocks. Covariation estimation in the presence of both noise and nonsynchronicity has been studied in many papers, and various consistent estimators have been proposed. See, for example, Barndorff-Nielsen et al. [3], Christensen, Kinnebrock, and Podolskij [5], and Bibinger et al. [4].
These problems are related to the complex structure of the observations. A model of (efficient) stock price dynamics also has a complex structure that includes intraday periodicity, volatility clustering, asymmetry of return distributions, and other complications. Though these individual complications have been investigated in several papers, a comprehensive model that explains all of them simultaneously has not yet been found. On the other hand, statistical machine learning has shown great success in the analysis of complicated nonlinear structures, with the structure being learned from training with rich data. Image recognition is the most notable success in this field. The task of detecting an object in a picture was long considered difficult for computers because parametric models would require infeasibly complicated nonlinear analysis to construct. However, deep neural networks have achieved object recognition by training with huge amounts of data.
Toward progress on analysis of market microstructures, we consider nonsynchronous observations contaminated by market microstructure noise. For this, let (Y_t)_{t≥0} be a multi-dimensional stochastic process satisfying the following equation: (1.1), where (W_t)_{t≥0} is a multi-dimensional standard Wiener process and where µ_t and b_{t,†} are stochastic processes.
We consider a statistical model of (Y_t)_t with nonsynchronous observations contaminated by market microstructure noise. Ogihara [12] studied maximum-likelihood- and Bayes-type estimation and showed asymptotic mixed normality of the estimators when b_{t,†} satisfies b_{t,†} = b(t, X_t, σ*) (t ∈ [0, T]) a.s. (1.2), where b(t, x, σ) is a given function, σ* is an unknown parameter, and X_t is an explanatory process (in practice, for our problem, the stock prices of other stocks, accumulated trading volume, and so on). Asymptotic efficiency of the estimators was also proved, by showing local asymptotic normality, when the diffusion coefficients are deterministic and each noise term follows a normal distribution. However, as mentioned above, it is difficult to find a parametric model satisfying (1.2) in the practice of high-frequency data analysis. Models for which this assumption is unsatisfied are called misspecified models. To apply a neural network to the above situation, we approximate Σ_{t,†} = b_{t,†} b_{t,†}^⊤ by Σ(t, X_t, β), where ⊤ denotes the transpose operator and the function Σ is given by a neural network with a parameter β (see Section 2.2 for the precise definition). In this case, it is natural to consider misspecified models. Study of misspecified models is important in general situations because there is always a gap between a parametric model chosen by a statistician and the model of the real data. Misspecified models have not been well studied for diffusion-type processes with high-frequency observations in a fixed interval, even for models in which neither nonsynchronicity nor market microstructure noise is included. For the case with the end time T of observations approaching infinity, Uchida and Yoshida [17] studied an ergodic diffusion process (X_t)_{t≥0} with observations (X_{kh_n})_{k=0}^n, where h_n → 0, nh_n → ∞, and nh_n^2 → 0.
In that case, the rate of convergence of the maximum-likelihood-type estimator for a parameter in the diffusion coefficients is √(nh_n), which is different from the rate √n seen in correctly specified cases.
In this paper, we study the asymptotic properties of a maximum-likelihood-type estimator σ̂_n with a misspecified model containing noisy, nonsynchronous observations. In contrast with the results for the correctly specified model of Ogihara [12], it is shown that the original maximum-likelihood-type estimator has an asymptotic bias for a misspecified model, resulting in a failure to achieve the optimal rate of convergence. Identifiability does not hold in general, and therefore we cannot ensure convergence of the maximum-likelihood-type estimator in the parameter space. However, if we consider a certain 'distance' D in the functional space of diffusion coefficients, we can show convergence of the value of D between Σ(t, X_t, σ̂_n) and Σ_{t,†} to the minimum value of D between Σ(t, X_t, σ) and Σ_{t,†} in the parametric family. If we assume uniqueness of the parameter that minimizes D, then we obtain asymptotic mixed normality of the estimator. In this case, the limit of σ̂_n is a random variable, which prevents directly using martingale central limit theorems (Theorems 2.1 and 3.2 in Jacod [8]), which are typically used to show the asymptotic mixed normality of estimators. To deal with this problem, we develop techniques (notably, Proposition 3.2) using series expansions of the inverse co-volatility matrix to obtain asymptotic mixed normality. This result is general and seems useful for showing the asymptotic mixed normality of estimators with random limits in many situations.
We can apply the maximum-likelihood approach to any machine-learning methodology that constructs a parametric model (Σ(t, X_t, σ))_σ. Additionally, the maximum-likelihood approach has good compatibility with neural networks in several respects.
1. A neural network can be expressed as a parametric family {Σ(t, X_t, β)}_β, so construction of a maximum-likelihood-type estimator is immediate, as described in Ogihara [12].
2. Optimization methods for neural networks can still be applied by setting the negative quasi-likelihood function as the loss function. In particular, backpropagation still works in this setting.
3. The increase in optimization cost is relatively mild when increasing the dimensionality of the parameters because, by virtue of backpropagation, we need only the gradient of the loss function with respect to the variables at the output layer (see Section 2.2 for details).
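Point 2 can be sketched in code. The following is a deliberately simplified scalar toy, not the paper's H_n: it treats the local increments as independent Gaussians with per-increment variance Σ(σ)Δ + 2v (ignoring the MA(1) correlation that observation noise induces), and minimizes the negative quasi-log-likelihood over a grid in place of a gradient-based optimizer. All names (`quasi_neg_loglik`, `Sigma_fn`) are illustrative.

```python
import numpy as np

def quasi_neg_loglik(sigma_param, z, dt, Sigma_fn, noise_var):
    """Negative Gaussian quasi-log-likelihood for increments z.

    Simplified sketch: each increment is modeled as N(0, s) with
    s = Sigma(sigma)*dt + 2*noise_var (signal variance plus the variance
    contributed by differencing i.i.d. observation noise).
    """
    s = Sigma_fn(sigma_param) * dt + 2.0 * noise_var
    return 0.5 * np.sum(np.log(s) + z**2 / s)

# toy usage: recover a constant volatility parameter by scanning the loss
rng = np.random.default_rng(0)
true_var, dt, noise_var = 0.5, 0.01, 1e-4
z = rng.normal(0.0, np.sqrt(true_var * dt + 2 * noise_var), size=5000)
grid = np.linspace(0.1, 1.0, 91)
losses = [quasi_neg_loglik(v, z, dt, lambda s: s, noise_var) for v in grid]
sigma_hat = grid[int(np.argmin(losses))]
```

Any gradient-based optimizer (and, for a neural-network parametrization, backpropagation) could replace the grid scan; the only ingredient needed is the loss itself.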
It is noteworthy that this study enables us to apply machine-learning methodologies to intraday stock high-frequency data analysis under conditions of both nonsynchronicity and market microstructure noise. By training the structure using high-frequency data, we can forecast stock price volatilities and covariations. We apply the proposed method by training with data from major stocks on the Tokyo Stock Exchange. By training the model on three months of data for each stock with the proposed method, we can forecast the volatility function Σ_{t,†} of each day for the next month.
The remainder of this paper is organized as follows. In Section 2, we explain our settings and provide an example of a neural network using the quasi-likelihood function. A general asymptotic theory of misspecified models is discussed in Section 3. We specify the limit of the estimator by using a distance D, which is related in some sense to the Kullback-Leibler divergence. The maximum-likelihood-type estimator has an asymptotic bias and so does not achieve the optimal rate of convergence. We propose a modified estimator that does achieve the optimal rate. Section 4 presents simulation studies of neural networks for the case in which the latent diffusion process is a one-dimensional Cox-Ingersoll-Ross process, and the case in which the latent process is a two-dimensional CIR-type process with intraday periodicity. Using a neural network, we learn the function Σ_{t,†} without any information about the parametric model. We study Japanese stock high-frequency data in Section 5. Proofs are provided in Section 6.

Settings and an example neural network

Parametric estimation under misspecified settings
Let γ, γ_W ∈ N and let (Ω, F, P) be a probability space with a filtration F = {F_t}_{0≤t≤T} for some T > 0. We consider a γ-dimensional F-adapted process Y = {Y_t}_{0≤t≤T} satisfying the integral equation (1.1), where {W_t}_{0≤t≤T} is a γ_W-dimensional standard F-Wiener process and where {µ_t}_{0≤t≤T} and b_† = {b_{t,†}}_{0≤t≤T} are R^γ- and R^γ ⊗ R^{γ_W}-valued F-progressively measurable processes, respectively.
We assume that the observations of the processes occur in a nonsynchronous manner and are contaminated by market microstructure noise. That is, we observe the sequence ..., where ... are random times, {ǫ^{n,k}_i}_{i∈Z_+, 1≤k≤2} is an independent, identically distributed random sequence, and ... Let γ_X ∈ N and let ⊤ denote the transpose operator for matrices (including vectors). Let Ē denote the closure of a subset E of a Euclidean space. We consider estimation of the co-volatility matrix Σ_{t,†} = b_{t,†} b_{t,†}^⊤ by using a functional Σ(t, X_t, σ) with a γ_X-dimensional càdlàg stochastic process X = (X_t)_{t∈[0,T]} and a parameter σ. We observe possibly noisy data: X̃^l_j = X^l_{T^{n,l}_j} + η^{n,l}_j for 0 ≤ j ≤ K_{l,n} and 1 ≤ l ≤ γ_X, where {K_{l,n}}_{1≤l≤γ_X, n∈N} are positive integer-valued random variables, {T^{n,l}_j}_{j=0}^{K_{l,n}} are random times, and {η^{n,l}_j}_{j=0}^{K_{l,n}} are random variables which may be identically equal to zero.
We can arbitrarily set the explanatory variable X and the function Σ(t, x, σ). For example, we can use other stock price processes, a price process of a stock index, the accumulated volume of stock trades, Y itself, or some combination of these. Let Σ(t, x, σ): ..., where B(•) denotes the minimal σ-field related to the measurable sets or random variables in the parentheses, and H_1 ∨ H_2 denotes the minimal σ-field that contains the σ-fields H_1 and H_2. Moreover, we assume that ..., where 1_A is the indicator function for a set A and v_{k,*} is a positive constant for 1 ≤ k ≤ γ.
We consider a maximum-likelihood-type estimator of σ based on a quasi-likelihood function. The construction is based on that described in Ogihara [12]. Let {b_n}_{n∈N} and {ℓ_n}_{n∈N} be sequences of positive numbers satisfying ... For technical reasons, we construct the quasi-log-likelihood function by partitioning the whole observation interval [0, T] into distinct local intervals {[s_{m−1}, s_m)}_{m=1}^{ℓ_n}. Here the sequence {b_n}_{n∈N} is the order of the sampling frequency; that is, ... We use several notations for clarity: ... and ..., where δ_{ij} is Kronecker's delta. For an interval J = [a, b), we write |J| = b − a. E_l denotes the unit matrix of size l. For a matrix A, we denote its (i, j) element by [A]_{ij}, where L_0 = 0 and ...
Then, roughly speaking, we obtain the approximation ... for k = l. Therefore, setting ..., we define the quasi-log-likelihood function ... The function H_n takes two arguments: the first is σ, the parameter of the estimating function for Σ_† and the parameter of interest; the second is v, the parameter for the noise variance.
Though we can consider simultaneous maximization of σ and v, we fix an estimate of v in advance.With this approach, we can apply our results to the case of non-Gaussian noise.
[V] There exist estimators {v̂_n}_{n∈N} of v_* such that v̂_n ≥ 0 almost surely and ... It is easy to obtain a suitable v̂_n. For example, let v̂_n be given by

Then v̂_n satisfies
We fix a v̂_n satisfying [V]. We then define the maximum-likelihood-type estimator σ̂_n = argmax_σ H_n(σ, v̂_n).
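One standard way to obtain such a noise-variance estimator (assumed here to be of the kind condition [V] refers to; the paper's exact v̂_n may differ in normalization) uses the fact that, at high frequency, each observed increment is a small signal increment plus a difference ǫ_i − ǫ_{i−1} of i.i.d. noise, so its second moment is dominated by 2v:

```python
import numpy as np

def noise_variance_estimate(y_obs):
    """Estimate the observation-noise variance v from noisy observations.

    Each increment dy_i = (signal increment) + eps_i - eps_{i-1} has second
    moment approximately 2v at high frequency, so sum(dy^2) / (2 J) is a
    consistent estimator of v (J = number of increments).
    """
    dy = np.diff(y_obs)
    return np.sum(dy**2) / (2.0 * len(dy))

# toy check in a noise-dominated regime: latent diffusion plus i.i.d. noise
rng = np.random.default_rng(4)
n, v_true = 100_000, 1e-4
x = np.cumsum(rng.normal(0.0, np.sqrt(0.2 / n), n))  # latent diffusion part
y = x + rng.normal(0.0, np.sqrt(v_true), n)          # noisy observations
v_hat = noise_variance_estimate(y)
```

The signal contributes a bias of order (integrated variance)/(2J), which vanishes as the sampling frequency grows.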
H_n is constructed on the basis of the local Gaussian approximation of Z^k_{m,i}. This approximation is typically considered valid only when the observation noise ǫ^{n,k}_i follows a normal distribution. However, we see in the proof that the approximation remains valid and σ̂_n still works in the case of non-Gaussian noise.

An example of a neural network
In Section 2.1, we constructed the maximum-likelihood-type estimator for the parametric model (Σ(t, X_t, σ))_σ. If we can find a parametric model that contains a good approximation of Σ_{t,†} with a low-dimensional parameter, then we can estimate Σ_{t,†} by using the above method. However, as indicated in the introduction, it is not an easy task to find a good parametric model of this type. In contrast, it seems effective to learn Σ_{t,†} by machine-learning methods, since typical high-frequency data sets contain a high volume of data. In particular, a neural network is strongly compatible with the quasi-likelihood approach and is useful in practice.
We start from input data ... The neural network here consists of three types of layers: an input layer, hidden layers, and an output layer. The hidden layers consist of elements (u^k_j), which are inductively defined as ..., where ... are elements of R, and h: R → R is a continuous function (the so-called activation function). In our simulation, we use Swish, proposed in Ramachandran, Zoph, and Le [15] and defined as h(x) = x/(1 + e^{−x}).
The output layer is given by ... with some small constant ǫ ≥ 0. Then, choosing an activation function h and the structure of the hidden layers defines a parametric family of non-linear functions. As we will see in Remark 3.1, this family approximates any continuous function for sufficiently large L_1 with K = 2. In addition, this family expresses exponentially large complexity in the number K of layers, as studied in Montufar et al. [11].
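The architecture above can be sketched as follows. The output parametrization Σ = b b^⊤ + ǫI is an assumption here (one natural way, consistent with the ǫ ≥ 0 in the text, to guarantee a symmetric nonnegative-definite output); the input dimension and initialization are likewise illustrative.

```python
import numpy as np

def swish(x):
    """Swish activation h(x) = x / (1 + exp(-x)), as used in the paper."""
    return x / (1.0 + np.exp(-x))

def init_params(dims, rng):
    """Random weights and biases for a fully connected network.

    dims = [input_dim, L_1, ..., L_{K-1}, output_dim].
    """
    return [(rng.normal(0, 0.5, (m, n)), rng.normal(0, 0.5, n))
            for m, n in zip(dims[:-1], dims[1:])]

def sigma_net(x, params, gamma, eps=1e-4):
    """Map an input vector (e.g. (t, X_t)) to a gamma x gamma matrix.

    The last layer outputs a gamma x gamma matrix b; returning b b^T + eps*I
    makes the output symmetric and positive definite by construction.
    """
    u = np.asarray(x, dtype=float)
    for W, c in params[:-1]:
        u = swish(u @ W + c)          # hidden layers with Swish activation
    W, c = params[-1]
    b = (u @ W + c).reshape(gamma, gamma)
    return b @ b.T + eps * np.eye(gamma)

# illustrative sizes matching the simulation section: K = 3, L_1 = L_2 = 10
rng = np.random.default_rng(1)
gamma = 2
params = init_params([3, 10, 10, gamma * gamma], rng)
Sigma = sigma_net([0.5, 1.2, -0.3], params, gamma)
```

Whatever the weights, the output is a valid co-volatility matrix, which is exactly what the quasi-likelihood of Section 2.1 requires.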
Let H_n(β, v) be the quasi-log-likelihood function constructed from the above Σ(t, x, β). From this, we can construct a maximum-likelihood-type estimator by letting β̂_n = argmax_β H_n(β, v̂_n). Calculation of β̂_n is not easy. However, several optimization techniques have been proposed for use with neural networks, and some of them can be applied to our case. In particular, the use of backpropagation to obtain the gradient ∂H_n/∂β^k_{i,j} is valid for our model, and we can quickly compute gradients by this method. Let h ∈ C^1(R). The chain rule yields ... Then, setting ∆_{k,j} = ∂H_n/∂u^k_j and applying the chain rule again yields ... Therefore, we can inductively calculate ∆_{k,j}; that is, fast calculation of the gradients is possible.
Calculation of H_n or its derivatives requires ℓ_n computations of inverse matrices (S_m)_m, with matrix sizes of order k_n. Backpropagation offers the advantage that we need to calculate only the derivatives ∂H_n/∂[b]_{ij} for the output layer.

Asymptotic theory for misspecified models of diffusion-type processes
Ogihara [12] studied a correctly specified model, given by Σ_{t,†} ≡ Σ(t, X_t, σ*) for some nonrandom σ* ∈ Λ, (3.1) and showed the asymptotic mixed normality of σ̂_n and local asymptotic normality in a special case. In machine-learning theory, which includes neural networks, we typically consider models that do not necessarily satisfy (3.1); as mentioned earlier, we call these misspecified models.
In the study of misspecified models, we sometimes see different asymptotic behavior than with correctly specified models. For example, Uchida and Yoshida [17] studied an ergodic diffusion X = (X_t)_{t≥0} with observations (X_{kh_n})_{k=0}^n, where h_n → 0, nh_n → ∞, and nh_n^2 → 0 as n → ∞. They showed that the rate of convergence of a maximum-likelihood-type estimator for parameters in the diffusion coefficients is √(nh_n), which differs from the rate √n found for correctly specified models. In our case, we also observe phenomena different from the correctly specified case. In particular, the maximum-likelihood-type estimator σ̂_n cannot attain the optimal rate of convergence, due to the existence of an asymptotic bias. Despite this, we can construct an estimator that attains the optimal rate by correcting the asymptotic bias.
When we consider a misspecified parametric model, including a neural network, the limit of a parameter maximizing H_n is not guaranteed to be unique in general, meaning that we cannot ensure convergence of the maximum-likelihood-type estimator. Consistency should therefore be studied not in the parameter space but in the space of co-volatility functions. If we define a function D(Σ_1, Σ_2) over the space of co-volatility matrices by (3.5) below, we obtain ... Intuitively, D is a kind of extension of the Kullback-Leibler divergence (see (3.6)), and we can obtain equivalence between D and the L^2([0, T] × Ω) norm in the sense of (3.7).

Consistency
In this section, we present results related to the consistency of σ̂_n. Since convergence of σ̂_n is not guaranteed, we characterize the convergence by means of a function D(Σ_1, Σ_2).
Here, we make some assumptions about the latent stochastic processes X, Y and the market microstructure noise. Let ... and let ∥A∥ be the operator norm for a matrix A. For random variables {X_n}_{n∈N} and a sequence {c_n}_{n∈N} of positive numbers, we write ... We assume that Λ ⊂ R^d and that Λ satisfies Sobolev's inequality; that is, for any p > d, there exists some C > 0 such that sup_{σ∈Λ} |u(σ)| ≤ C Σ_{k=0,1} (∫_Λ |∂^k_σ u(σ)|^p dσ)^{1/p} for any u ∈ C^1(Λ). Notably, this holds when Λ has a Lipschitz boundary. See Adams and Fournier [1] for more details.
Moreover, we assume the following conditions.
... is tight for any q > 0. 6. There exist progressively measurable processes ... for 0 ≤ j ≤ 1 and any q > 0, and ... Some of these conditions are standard and easy to check. The market microstructure noise η^{n,k}_j for X requires point 5 of (A1). Roughly speaking, this condition is satisfied when the sum of the η^{n,k}_j has order equivalent to the square root of the number of the T^{n,k}_j in [s_{m−1}, s_m). This holds if the sampling frequency of {T^{n,k}_j} is of order b_n and η^{n,k}_j satisfies certain independence, martingale, or mixing conditions. The decomposition of X in point 6 of (A1) is used when we estimate the difference between E_m[Z_m Z_m^⊤] and Σ_{s_{m−1},†}, which appears in the asymptotic representation of H_n. To satisfy the condition ∥Σ'^{−1/2} Abs(Σ − Σ')Σ'^{−1/2}∥ < 1, it is sufficient that Σ be symmetric and positive definite and γ ≤ 2. This condition is restrictive when the dimension γ of Y is large. We need it to obtain the expansion of S_m^{−1} in Lemma 6.3, which is repeatedly used in the proofs of the main results.
We assume some additional conditions for the sampling scheme. Let ... [A2] There exist η ∈ (0, 1/2), κ > 0, η ∈ (0, 1], and positive-valued stochastic processes ... converges to zero in probability as ... This condition ensures that the intensity of the observation count converges to some intensity a^j_t with order (k_n b ... For covariation estimators in the presence of market microstructure noise, Christensen, Kinnebrock, and Podolskij [5] studied a pre-averaging method, and Barndorff-Nielsen et al. [3] studied a kernel-based method. Though our quasi-likelihood approach does not use such data-averaging methods, the proofs of Lemmas 5.1 and 5.2 in Ogihara [12] show that we can replace the lengths of the observation intervals in the covariance matrix with their averages. Doing so makes it important to identify the limit of the local sampling counts. One case where [A2] holds is when the observation times are generated by mixing processes, as seen below.
Example 3.1. Let {N^k_t}_{t≥0} be an exponentially α-mixing point process with stationary increments for ... → 0 for some ǫ > 0. See Example 2.1 in Ogihara [12] for the details.
The function D is deduced as a limit of the quasi-log-likelihood function. It is noteworthy that D is related to the Kullback-Leibler divergence in the following sense. Let ℓ_n(θ) be the log-likelihood function of a sequence (X_i)_{i=1}^n of independent, identically distributed random variables, where the probability density function of X_1 is p(x, θ_0) for a parameter θ_0. Then, under suitable regularity conditions, we obtain ..., where KL(p, q) is the Kullback-Leibler divergence of densities p and q. In contrast, (6.14) below yields ... We can therefore regard D as an extension of the Kullback-Leibler divergence in some sense. Moreover, under boundedness of a^j_t + (a^j_t)^{−1}, ∥Σ_t(σ)∥, ∥Σ_{t,†}∥, and ∥Σ_t^{−1}(σ)∥, there exist positive constants C_1 and C_2 such that ..., where C_1 and C_2 depend only on the upper bounds of a^j_t + (a^j_t)^{−1} and ∥Σ^{−1}(σ)∥. A proof is given in the appendix. From the above, D is equivalent to an L^2 norm.
Remark 3.1. In the setting of Section 2.2, assume the conditions of Theorem 3.1 and (3.7), and further assume that Σ_{t,†} ≡ Σ_†(t, X_t) for some X = (X_t)_{t∈[0,T]} and some continuous function Σ_†(t, x). Let ǫ, δ > 0. For any compact set S ⊂ O, we have ... for K = 2 and sufficiently large L_1 (K and L_k are defined in Section 2.2). This property is called the universal approximation property. See Mhaskar and Micchelli [10], Pinkus [14], and Sonoda and Murata [16] for the details. Therefore, applying Theorem 3.1 and (3.7) yields ... for sufficiently large n.

Optimal rate of convergence
In this section, we study the optimal rate of convergence. Ogihara [12] showed that local asymptotic normality holds for a correctly specified model having nonrandom diffusion coefficients, with the optimal rate of convergence equal to b_n^{1/4} for estimators of the parameter in the diffusion coefficients, and further that the maximum-likelihood-type estimator σ̂_n attains this optimal rate. In contrast, we will see that σ̂_n cannot attain the rate b_n^{1/4} in the misspecified setting, due to an asymptotic bias term. We can attain the optimal convergence rate if we construct a maximum-likelihood-type estimator σ̌_n by using a bias-corrected quasi-log-likelihood function.
To obtain the optimal convergence rate, we need stronger versions of [A1] and [A2].We give those here, calling them [B1] and [B2], respectively.
[B2] There exist positive-valued stochastic processes {a^j_t}_{t∈[0,T], 1≤j≤γ} such that for any q > 0 and ǫ > 0, we have ... is finite for any 1 ≤ j ≤ γ, where the second supremum is taken over all sequences ... We can see that [A2] is satisfied whenever [B2] is.
Here, we examine the bias of the quasi-log-likelihood function. The third term on the right-hand side of (3.9) is a bias term, which does not appear in the correctly specified model. For the correctly specified case (3.1), Ogihara [12] showed b_n^{1/4}(σ̂_n − σ*) = O_p(1) under suitable conditions. However, in a general misspecified model, we cannot obtain this relation, due to the bias term G_m. For example, let γ = 1, S^{n,j}_i ≡ i/n for 0 ≤ i ≤ n, t_k = kπ/(k^j_m + 1), let Σ be smooth, and let σ* be a Λ̄-valued random variable such that D(Σ(σ*), Σ_†) = min_{σ∈Λ̄} D(Σ(σ), Σ_†) and σ̂_n →^P σ*. Then, by differentiating both sides of the equation in the above proposition, we obtain, roughly, ... is tight, as seen in (6.21). On the other hand, we also obtain ... by (6.46), where ... by a calculation in Section A.3. Since sin y ≥ 2y/π for 0 ≤ y ≤ π/2 and 1 − cos x ≤ x^2/2 for x ≥ 0, we obtain ... Because of this, we cannot ensure that ... As seen above, we cannot obtain the optimal rate of convergence of the maximum-likelihood-type estimator σ̂_n, due to the bias term of H_n. In the following, we consider how to remove the bias. First, we construct an estimator (B_{m,n}) of Σ_{s_{m−1},†}, using the function g from Jacod et al. [9]. The function g: [0, 1] → R is continuous and piecewise C^1, with g(0) = g(1) = 0, and ... We next define a bias-corrected quasi-log-likelihood function Ȟ_n(σ) as ... Letting σ̌_n = argmax_σ Ȟ_n(σ), we obtain the optimal rate of convergence for σ̌_n. Then ...
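To illustrate the role of a weight function g with g(0) = g(1) = 0, the following sketch computes a local volatility estimate from noisy observations by pre-averaging with the standard choice g(x) = min(x, 1 − x), which satisfies the stated conditions. This is an illustration of the pre-averaging idea from Jacod et al. [9], not the exact B_{m,n} of the text, whose normalization may differ; all function names are hypothetical.

```python
import numpy as np

def preaveraged_spot_vol(y, delta, kn, noise_var):
    """Local volatility estimate from noisy observations via pre-averaging.

    Weights w_j = g(j/kn) with g(x) = min(x, 1 - x) smooth out the noise;
    the bias from the remaining noise is removed using its known variance.
    """
    j = np.arange(1, kn)
    w = np.minimum(j / kn, 1.0 - j / kn)
    dy = np.diff(y)
    # pre-averaged increments over overlapping windows of length kn
    bars = np.array([w @ dy[i:i + kn - 1] for i in range(len(dy) - kn + 2)])
    psi2 = np.sum(w**2) / kn                                # ~ int g^2 = 1/12
    psi1 = kn * np.sum(np.diff(np.r_[0.0, w, 0.0])**2)      # ~ int (g')^2 = 1
    raw = np.mean(bars**2)                                  # signal + noise part
    return (raw - noise_var * psi1 / kn) / (psi2 * kn * delta)

# toy usage: constant-volatility Brownian path plus i.i.d. Gaussian noise
rng = np.random.default_rng(2)
n, sig2, nv = 20_000, 0.3, 1e-6
x = np.cumsum(rng.normal(0.0, np.sqrt(sig2 / n), n))
y = x + rng.normal(0.0, np.sqrt(nv), n)
est = preaveraged_spot_vol(y, 1.0 / n, 140, nv)
```

The window length k_n of order √n balances the signal term (of order k_n Δ) against the noise term (of order 1/k_n), which is the reason pre-averaging removes the noise bias at the optimal rate.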

Fast calculation of the estimator
The estimator B_{m,n} of Σ_{s_{m−1},†} is constructed with the aim of bias correction. However, it is also useful for fast calculation of a parametric estimator. We define ... and a new parametric estimator σ̇_n = argmax_σ Ḣ_n(σ).
In the calculation of σ̂_n and σ̌_n, we must repeatedly calculate ∂_σ H_n or ∂_σ Ȟ_n. At each step, we must compute the inverse of the matrix S_m(σ, v̂_n), which has size of order k_n. However, we do not need to invert a large matrix to find ∂_σ Ḣ_n. Instead, we calculate B_{m,n} only once, taking k^j_m terms. This drastically shortens the calculation time.
Unfortunately, σ̇_n does not attain the optimal rate of convergence in general, although we obtain the following result.
... we can obtain ∂_σ Ḣ_n by calculating ∂_σ D^{1/2}_{m,n}. Let U(m) be an orthogonal matrix and ...

Asymptotic mixed normality
In this section, we discuss the asymptotic mixed normality of σ̌_n. In a correctly specified model, the maximum-likelihood-type estimator σ̂_n satisfies asymptotic mixed normality, as seen in Theorem 2.1 of [12]. In our misspecified model, we need further assumptions related to the smoothness of Σ(t, x, σ) and the convergence of σ̂_n in the parameter space Λ̄. Let N_k denote a general k-dimensional standard normal random variable on an extension of some probability space, and let it be independent of any random variables in the original probability space.
[C] 1. There exists a Λ̄-valued random variable σ* such that ... 2. ∂_σ Σ exists and is continuous in (t, x, σ), and there exists a locally bounded function L(x, y) such that ... By simple calculation, we obtain ..., where ... and where, for a symmetric, positive definite matrix B and matrices A and C, ϕ_B(A) is defined as in Lemma A.5. Because of [C], for any ǫ > 0, there exists a positive integer N such that ... In Proposition 7.2 in [12], the martingale central limit theorem of Jacod [8] is used to show the result corresponding to (3.12). However, σ* is random in our model, meaning that ∂_σ Ȟ_n(σ*) is not a martingale. To deal with this problem, we prepare the following scheme to show stable convergence for the case of a random parameter. This scheme ought to be useful in many situations involving random parameters.
Let (A, G, P) be a probability space. Let X be a complete, separable metric space, let {Z_n(x)}_{n∈N, x∈X} be random fields on (A, G), and let {Z(x)}_{x∈X} be a continuous random field on an extension (A', G') of (A, G). Let →_{G-L} denote G-stable convergence. Proposition 3.2. Let (Z_{n,k})_{n,k∈N} and (V_{k,k'})_{k,k'∈N} be random variables on (A, G), let (f_k)_{k∈N} be continuous functions on X, let K_0 = 0, and let (K_m)_{m∈N} be a monotonically increasing sequence of positive integers satisfying K_m → ∞ as m → ∞. Assume the following.
1. For any m, (V_{k,k'})_{1≤k,k'≤K_m} is a symmetric, nonnegative definite matrix almost surely, and ... 2. For any x ∈ X, there exists an open set U_x satisfying the following property: given any ǫ > 0 and δ > 0, there exists a positive integer M such that ... for any M' > M and n ∈ N, where ... The proof is given in Section A.3. When we have the stable convergence of ∂_σ Ȟ_n(σ) for a deterministic parameter σ, this proposition helps us extend the convergence to the case of a random parameter. Indeed, by using Proposition 3.2, we obtain the asymptotic mixed normality of our estimator.
where N is a γ-dimensional standard normal random variable on an extension of (Ω, F , P ) and independent of F .

Simulation studies of neural networks
In this section, we simulate some diffusion processes Y with b_{t,†} = b_†(t, Y_t) for some function b_†, and calculate the proposed estimators via the neural network modeling of Section 2.2. We verify whether the function b_{t,†} is well approximated.

One-dimensional Cox-Ingersoll-Ross processes
First, we consider the case where the process Y is the Cox-Ingersoll-Ross (CIR) process derived in [6]; that is, Y is a one-dimensional diffusion process satisfying a stochastic differential equation ..., where σ* > 0 is the parameter. We assume that 2α_1 > σ*^2 to ensure inf_{t∈[0,T]} Y_t > 0 almost surely. Let (ǫ^n_i)_{i∈Z_+} be independent, identically distributed Gaussian random variables with variance v* > 0. Sampling times are given by S^n_i = inf{t ≥ 0 : N_{nt} ≥ i} ∧ T with a Poisson process {N_t}_{0≤t≤T} with parameter λ. We set X_t = (t, Y_t), h(x) = x/(1 + e^{−x}), K = 3, L_1 = L_2 = 10, and ǫ = 0.0001 in the setting of Section 2.2. We generate 100 sample paths of Y, set the number of epochs to 3,000, and randomly pick a simulated path at each optimization step. We use ADADELTA, proposed in Zeiler [18], with weight decay parameter 0.005 for optimization. We set ℓ_n = [n^{0.45}] and v̂_n = (2J_{1,n} ... We also calculate the maximum-likelihood-type estimator σ̂^model_n for the parametric model Σ(t, x, σ) = σ^2 x, and the associated quasi-log-likelihood H^model_n.
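A minimal data-generating sketch for this experiment is below. The drift parametrization dY = (α_1 − α_2 Y)dt + σ√Y dW is an assumption chosen to be consistent with the Feller-type condition 2α_1 > σ*^2 stated above; the Euler scheme, parameter values, and variable names are illustrative, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 1.0, 10_000
alpha1, alpha2, sigma_star = 1.0, 1.0, 0.5   # 2*alpha1 > sigma_star**2
lam, v_star = 5000.0, 1e-4                    # Poisson intensity, noise variance

# Euler scheme for a CIR-type path dY = (alpha1 - alpha2*Y)dt + sigma*sqrt(Y)dW
dt = T / n
y = np.empty(n + 1)
y[0] = alpha1 / alpha2                        # start at the stationary mean
for i in range(n):
    dw = rng.normal(0.0, np.sqrt(dt))
    y[i + 1] = (y[i] + (alpha1 - alpha2 * y[i]) * dt
                + sigma_star * np.sqrt(max(y[i], 0.0)) * dw)

# Poisson sampling times S_i on [0, T] and noisy observations of Y
num_obs = rng.poisson(lam * T)
times = np.sort(rng.uniform(0.0, T, num_obs))
idx = np.minimum((times / dt).astype(int), n)  # nearest grid point below S_i
obs = y[idx] + rng.normal(0.0, np.sqrt(v_star), num_obs)
```

Pairs `(times, obs)` then play the role of the noisy, randomly sampled observations fed to the quasi-likelihood; a second asset with its own Poisson clock would make the sampling nonsynchronous.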
Figure 1 shows the levels of the loss functions H^model_n ... for 100 trials, where x^(1)_i = 0.1i and x^(2)_i = 0.1 + 0.1i. We also calculate the similar quantities MSE^model_k for σ̂^model_n. The performance of σ̂^model_n is better, since it is calculated by using the structure of the parametric model. Though β̂_n is calculated without the structure of the parametric model, we can see that β̂_n achieves good performance for a large number of epochs.
Figure 3 shows the quartiles of H^model_n ...; the level of H_n(β̌_n) is the same as that of H_n(β̂_n) or even worse. This seems to happen because the estimation accuracy of the estimator B_{m,n} for Σ_{s_{m−1},†} is not very good, as seen in Lemma 6.6. We also calculated H^model_n − H_n(β̇_n) and the MSE for β̇_n in Figure 4 and Table 3, where β̇_n is the estimator of Section 3.3 for the neural network model of Section 2.2. In Figure 4, the performance of β̇_n is poor compared to β̂_n for some initial values. In contrast, the time needed for calculation was greatly reduced: 3,000 training runs took 8.33 s for β̇_n, in contrast with 333.01 s for β̂_n (CPU: Intel Xeon Processor E5-2680 v4, 35M cache, 2.40 GHz). So β̇_n is useful for reducing computational cost.

Two-dimensional Cox-Ingersoll-Ross with intraday seasonality
We also simulate sample paths when Y is a two-dimensional CIR-type process with intraday seasonality: ..., where W_t = (W^1_t, W^2_t) is a two-dimensional standard Wiener process, and we define ǫ^{n,j}_i, S^{n,j}_i, λ_j, v̂_{j,n}, and v_{j,*} for j ∈ {1, 2} in a way similar to the first example. We set α_1 = α_2 = α ... We can see that the calculation time of β̇_n is much shorter than that of β̂_n: training 3,000 times takes 12.74 s for β̇_n but 1217.41 s for β̂_n. This tendency becomes stronger as the dimensionality of Y increases. ... and we adopt the estimator that attains the best value of H_n over three optimization trials, randomly changing the initial value. We use high-frequency data of the volatility index (VI) as the explanatory process X_t; this is the implied volatility calculated by using the option prices of the Nikkei 225 index.
For comparison, we also calculate an estimator obtained from the polynomial functions with β = (β_j)_{j=0}^{2p}. We vary p = 1, 2, and 3 (Poly1, Poly2, and Poly3, respectively) and compare the performance with that of the neural network (NeNet). Since the optimization of polynomial models is unstable compared to the neural network, we calculate βn ten times and select the best result.
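The two function classes being compared can be sketched as follows. The polynomial parametrization matches β = (β_j)_{j=0}^{2p} above; the network is a one-hidden-layer toy stand-in, not the exact architecture, and both function names are hypothetical.

```python
import numpy as np

def sigma_poly(x, beta):
    """Polynomial volatility function sum_{j=0}^{2p} beta_j * x**j,
    matching the parametrization beta = (beta_j)_{j=0}^{2p}."""
    return float(sum(b * x ** j for j, b in enumerate(beta)))

def sigma_nn(x, params):
    """One-hidden-layer tanh network (a toy stand-in for the paper's
    architecture; `params` and its layout are illustrative)."""
    W1, b1, w2, b2 = params
    h = np.tanh(W1 * x + b1)       # hidden units, elementwise
    return float(w2 @ h + b2)

# p = 1 (Poly1), coefficients beta_0..beta_2, evaluated at x = 2:
print(sigma_poly(2.0, [1.0, 0.0, 1.0]))   # prints 5.0

# A 4-unit toy network with zero weights returns its output bias:
params = (np.zeros(4), np.zeros(4), np.zeros(4), 0.5)
print(sigma_nn(2.0, params))               # prints 0.5
```

Both maps take the explanatory value x to a volatility level, so they can be plugged into the same quasi-likelihood objective and compared on equal footing.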
Table 5 shows the average values of −H_n(βn)/10000 for each day from April 1, 2016 to December 30, 2016. We can see that the neural network outperforms the polynomial models in each case (smaller is better). The neural network can express complex models because its parameters can be stably optimized by backpropagation.
We also trained Σ_{t,†} for the two-dimensional process Y corresponding to each pair of the above five stocks. X is the same as above; we let K = 3 and L_1 = L_2 = 30, and we set weight decay with parameter 0.01. Table 6 shows the average values of −H_n(βn)/10000 under these settings.

Proofs
This section contains proofs of the main results. Some preliminary lemmas are proved in Section 6.1. Section 6.2 addresses the consistency result, Theorem 3.1. The results related to the optimal rate of convergence (Proposition 3.1 and Theorem 3.2) are shown in Section 6.3. Theorem 3.3 is proved in Section 6.4. In Section 6.5, the results on asymptotic mixed normality (Theorem 3.4) are discussed.

Preliminary results
In this subsection, we will prove Lemma 6.2, which enables us to replace H n with the more tractable Hn in the proof of the main results.
For random variables (X_n)_{n∈N} and (Y_n)_{n∈N}, we write X_n ≈ Y_n to mean X_n − Y_n →^P 0 as n → ∞. We use the symbols C and C_q for generic positive constants that can vary from line to line (C_q depends on q, with q > 0).
t, and b^(j)_t are bounded, and there exists a positive constant L such that and [B1] are satisfied and there exists a positive constant For a sequence c_n of positive numbers, let us denote by {Rn(c_n)}_{n∈N} a sequence of random variables (which may depend on 1 ≤ m ≤ ℓ_n and σ ∈ Λ) satisfying as n → ∞ for any q, q′, δ > 0 and some constants p_1, …, p_4 ≥ 0, where k̄n = max_{j,m} k^j_m and k_n = min_{j,m} k^j_m. Let In the proofs of the main results, we will replace H_n with Hn. In the following, we introduce some lemmas to show that this replacement is acceptable.
Then there exists a random sequence {Q n,q } q≥2 , not depending on S, such that Q n,q = Rn (1) for q ≥ 2, and the following hold.
This lemma is obtained similarly to Lemma 4.3 in [12]. In contrast with Lemma 4.3 of [12], the smoothness of b is not assumed here, and the assumptions on b† differ from those in [12]. However, these differences do not affect the conclusion.

Remark 6.1. For random matrices S_1 and S_2, we have Then, if both S_1 and S_2 satisfy the conditions on S in Lemma 6.1, we obtain where Qn,2 = Rn(1).
Remark 6.2. Suppose that the assumptions in Lemma 6.1 are satisfied. Then for q > 4, Lemma 6.1, the Burkholder–Davis–Gundy inequality, and Jensen's inequality yield

Proof. We obtain by an argument similar to that in the proof of Lemma 4.4 in [12]. Moreover, we can make the following decomposition: where Ψj,n is defined similarly to that in Lemma 4.4 in [12] for 1 ≤ j ≤ 3. Sobolev's inequality yields for sufficiently large q > 0, and then Point 2 of Lemma 6. → ∞ as n → ∞ for any ǫ > 0, we can similarly obtain (6.3).
The following lemma gives a useful expansion of S^{-1}_m(σ); this expansion will be used repeatedly.
) with R some positive constant.Then (6.4) for any nonnegative integer r.
Proof. For p ≥ 2 and 1 ≤ i_0, i_p ≤ γ, we obtain From this, we have. Then we have and To obtain this, we used the fact that D ≤ 1 for 1 ≤ i, j ≤ γ, by Lemma 2 in Ogihara and Yoshida [13].
Using the above, we obtain Moreover, applying (6.6) and Lemma A.3 of [12] yields To find the limit of Hn, we first consider the limit of tr(S^{-1}

Proof. Lemma 6.3 yields Here, we let Then, thanks to Lemma 5.2 of [12], we obtain (6.7). Therefore, we obtain (6.8) by Lemma 6.3. Moreover, if [B1′] and [B2] are also satisfied, then by Remark 5.2 in [12] and following the method of the proof of Lemma 5.2 in [12], we obtain. By an argument similar to the above, the right-hand side of (6.8) is equal to Hence, by setting Additionally, we can extend this to max_{i,j,m} |∂^l_σ Φ_{i,j,m}(σ_{1,n})| = Rn(b We write [12] and Lemmas A.2 and A.3 together yield By an argument similar to the above, (5.20) in [12] and the arguments after it yield for any 1 ≤ l ≤ γ. Therefore, we obtain We then have max ] and [B2] are satisfied.

Proof of Theorem 3.1
Using localization techniques similar to those in Lemma 4.1 of Gobet [7], we may assume [A1′] instead of [A1] without weakening the result.

Proof of optimal rate of convergence
In this subsection, we prove the results of Section 3.2. We first prove Proposition 3.1.

Proof of Proposition 3.1.
By localization techniques, we can assume By the definitions of Hn and ∆_n, we have Then, Lemma 6.2, the Hölder continuity of a_t, and [B1′] yield the desired result.
In the following, we check that Ȟn − H_n cancels the bias term of H_n. First, we show that Bm,n is a good approximation of Σm,†.
Let Bm,n be a γ × γ matrix defined by , and Here, Furthermore, we have If i ≠ j, then the independence of ǫ^i_{l1} and ǫ^j_{l2} implies that the right-hand side is equal to Rn(k^{-1}_n). If i = j, then that right-hand side is equal to Therefore, we obtain by [B2]. Therefore, we have For the last equation, we used the fact

Proof. We first show that The left-hand side of (6.16) can be rewritten as where C^(1) ), v_*)dα. Then, using Lemma 6.6, Sobolev's inequality, and an estimate similar to (6.11) yields where by using the equation: We also obtain estimates for the limit functions by Sobolev's inequality and similar arguments.
Proof. Let Σm,j = Σm(σ_{j,n}) and From this, we obtain Let v_s = (1 − s)v_* + s v_n for s ∈ [0, 1]; then the first and second terms on the right-hand side of (6.17) become The third term on the right-hand side of (6.17) is estimated similarly, giving Therefore, it is sufficient to show that Since there exists some ǫ > 0 such that P[min_j v_{j,n} ≥ ǫ] → 1 as n → ∞ by [V], Lemmas 6.6 and A.3 yield Together with Lemmas 6.7 and 6.6, it is sufficient to show that Σ_m E_m(ã_{sm−1}, Σm(σ_{j,n}), Bm,n, v_*) (6.20) The left-hand side of (6.20) can be rewritten as Lemmas 6.6 and 6.4 yield for any σ and l ∈ {0, 1}. Sobolev's inequality then gives the desired results.
Hence we have for sufficiently large K and n.
Then, (6. Moreover, an argument similar to that for (6.39) and Sobolev's inequality yield, similarly to (6.14). Then, Proposition 6.1 and the discussion in Section 3.4 complete the proof.
Therefore, we have

A.2 Some auxiliary lemmas
Lemma A.7. Let (Ω, F, P) be a probability space, let G be a σ-subfield of F, let X_1, X_2 be continuous random fields on a separable metric space X, and let Y be a G-measurable, X-valued random variable. Assume that (X_1(x), U) =^d (X_2(x), U) for any x ∈ X and any G-measurable random variable U. Then (X_1(Y), U) =^d (X_2(Y), U) for any bounded G-measurable random variable U.
Proof. Let d be the metric on X. Since X is a separable metric space, there exist a countable set X_0 = (y_k)_k ⊂ X and Borel functions F_n : X → X_0 such that d(F_n(y), y) < 1/n for all y ∈ X and each n ∈ N.
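The concluding step can be sketched as the standard approximation argument below; it uses only the lemma's hypotheses and the properties of F_n just stated.

```latex
% For bounded continuous f, bounded G-measurable U, and each n:
\begin{align*}
E\bigl[f\bigl(X_1(F_n(Y)), U\bigr)\bigr]
  &= \sum_{k} E\bigl[f\bigl(X_1(y_k), U\bigr)\,
       \mathbf{1}_{\{F_n(Y) = y_k\}}\bigr] \\
  &= \sum_{k} E\bigl[f\bigl(X_2(y_k), U\bigr)\,
       \mathbf{1}_{\{F_n(Y) = y_k\}}\bigr]
   = E\bigl[f\bigl(X_2(F_n(Y)), U\bigr)\bigr],
\end{align*}
% where the middle equality applies the hypothesis at the point
% x = y_k with the G-measurable pair (U, 1_{F_n(Y) = y_k}).
% Since d(F_n(Y), Y) < 1/n and X_1, X_2 have continuous sample
% paths, X_i(F_n(Y)) -> X_i(Y) almost surely, so the bounded
% convergence theorem gives (X_1(Y), U) =^d (X_2(Y), U).
```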

Figure 1: Transition of H^model_n − H_n(βn) for ten training results with different initial values. The parameters are set as T = 1, n = 5000, λ = 1, α_1 = α_2 = 1, and σ_* = 1. For each trial, the loss function seems to converge to a specific value as the number of epochs increases. Since σ^model_n is calculated on the basis of a parametric model that includes the true model, the average value of −H^model_n is less than −H_n(βn), as expected. However, −H_n(βn) reaches a value close to the average of −H^model_n even though βn is calculated without any information about the parametric model.

Figure 2: Trained function √Σ(0, x, βn). The horizontal axis shows the value of x, and the vertical axis shows the value of √Σ.

Figure 2 shows quartiles of the trained functions √Σ(0, x, βn) for 100 training results. The true function √(Σ^†(t, x)) ≡ √x is indicated by the dotted line. We can see that the true function is well approximated except near the origin. Since inf_t Y_t > 0 almost surely, few observations fall near the origin, so the errors of the trained functions are relatively large there.
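The situation in Figure 2 — a small network learning √x — can be imitated with a toy numpy training loop. This is not the paper's quasi-likelihood H_n: it fits noiseless (x, √x) pairs by plain least squares with hand-written backpropagation, and every hyperparameter below is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noiseless training pairs (x, sqrt(x)) on a grid away from the
# origin, where Figure 2 indicates the fit is most reliable.
x = np.linspace(0.1, 2.0, 200)
y = np.sqrt(x)

L = 30                               # hidden units, mirroring L = 30
W1 = 0.5 * rng.standard_normal(L)
b1 = np.zeros(L)
w2 = 0.5 * rng.standard_normal(L)
b2 = 0.0

lr = 0.05
losses = []
for epoch in range(2000):
    h = np.tanh(np.outer(x, W1) + b1)   # hidden activations, (200, L)
    pred = h @ w2 + b2
    err = pred - y
    losses.append(float(np.mean(err ** 2)))
    # Hand-written backpropagation for the mean-squared loss.
    g_pred = 2.0 * err / x.size
    g_w2 = h.T @ g_pred
    g_b2 = g_pred.sum()
    g_h = np.outer(g_pred, w2) * (1.0 - h ** 2)   # tanh' = 1 - tanh^2
    g_W1 = x @ g_h
    g_b1 = g_h.sum(axis=0)
    W1 -= lr * g_W1; b1 -= lr * g_b1
    w2 -= lr * g_w2; b2 -= lr * g_b2

print(losses[0], losses[-1])    # the loss should drop substantially
```

Because no training pairs lie near x = 0, the fitted function is unconstrained there, which reproduces the qualitative behavior seen in Figure 2.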

H^model_n − H_n(βn) for 100 training results with different initial values, where βn is the bias-corrected estimator in Section 3.2 for the neural network model in Section 2.2. Table 2 shows the numerical values for certain numbers of epochs. Though βn achieves the optimal rate b_n^{-1/4}

Figure 3: Transition of the quartiles of H^model_n

Figure 5: Transition of H^model_n

Figure 6: H^model_n

Table 1: MSE median of βn for each number of epochs

Table 2: Average of H^model_n − H_n(β) for each number of epochs

Table 3: MSE median of the estimator βn

Table 4: MSE median for each estimator