# Convergence of the deep BSDE method for coupled FBSDEs

## Abstract

The recently proposed numerical algorithm, the deep BSDE method, has shown remarkable performance in solving high-dimensional forward-backward stochastic differential equations (FBSDEs) and parabolic partial differential equations (PDEs). This article lays a theoretical foundation for the deep BSDE method in the general case of coupled FBSDEs. In particular, an a posteriori error estimate of the solution is provided, and it is proved that the error converges to zero given the universal approximation capability of neural networks. Numerical results are presented to demonstrate the accuracy of the analyzed algorithm in solving high-dimensional coupled FBSDEs.

## Introduction

Forward-backward stochastic differential equations (FBSDEs) and partial differential equations (PDEs) of parabolic type have found numerous applications in stochastic control, finance, physics, etc., as a ubiquitous modeling tool. In most situations encountered in practice the equations cannot be solved analytically but require certain numerical algorithms to provide approximate solutions. On the one hand, the dominant choices of numerical algorithms for PDEs are mesh-based methods, such as finite differences, finite elements, etc. On the other hand, FBSDEs can be tackled directly through probabilistic means, with appropriate methods for the approximation of conditional expectation. Since these two kinds of equations are intimately connected through the nonlinear Feynman–Kac formula (Pardoux and Peng 1992), the algorithms designed for one kind of equation can often be used to solve another one.

However, the aforementioned numerical algorithms become more and more difficult, if not impossible, to apply as the dimension increases. They are doomed to run into the so-called “curse of dimensionality” (Bellman 1957) when the dimension is high, namely, the computational complexity grows exponentially as the dimension grows. The classical mesh-based algorithms for PDEs require a mesh of size $$O(N^{d})$$. The simulation of FBSDEs faces a similar difficulty in the general nonlinear cases, due to the need to compute conditional expectations in high dimension. The conventional methods, including least squares regression (Bender and Steiner 2012), the Malliavin approach (Bouchard et al. 2004), and kernel regression (Bouchard and Touzi 2004), are all of exponential complexity. There are a limited number of cases where practical high-dimensional algorithms are available. For example, in the linear case, the Feynman–Kac formula and Monte Carlo simulation together provide an efficient approach to solving PDEs and associated BSDEs numerically. In addition, methods based on the branching diffusion process (Henry-Labordere 2012; Henry-Labordere et al. 2019) and multilevel Picard iteration (Hutzenthaler et al. 2016, 2018; E et al. 2019) overcome the curse of dimensionality in their considered settings. We refer to E et al. (2019) for a detailed discussion of the complexity of the algorithms mentioned above. Overall, no numerical algorithm in the literature has so far been proved to overcome the curse of dimensionality for general quasilinear parabolic PDEs and the corresponding FBSDEs.

A recently developed algorithm, called the deep BSDE method (Han et al. 2018; E et al. 2017), has shown astonishing power in solving general high-dimensional FBSDEs and parabolic PDEs (Beck et al. 2017; Han and Hu 2019; Han et al. 2020). In contrast to conventional methods, the deep BSDE method employs neural networks to approximate unknown gradients and reformulates the original equation-solving problem into a stochastic optimization problem. Thanks to the universal approximation capability and parsimonious parameterization of neural networks, in practice the objective function can be effectively optimized in high-dimensional cases, and the function values of interest are obtained quite accurately.

The deep BSDE method was initially proposed for decoupled FBSDEs. In this paper, we extend the method to deal with coupled FBSDEs and a broader class of quasilinear parabolic PDEs. Furthermore, we present an error analysis of the proposed scheme, including decoupled FBSDEs as a special case. Our theoretical result consists of two theorems. Theorem 1 provides a posteriori error estimation of the deep BSDE method. As long as the objective function is optimized to be close to zero under fine time discretization, the approximate solution is close to the true solution. In other words, in practice, the accuracy of the numerical solution is effectively indicated by the value of the objective function. Theorem 2 shows that such a situation is attainable, by relating the infimum of the objective function to the expression ability of neural networks. As an implication of the universal approximation property (in the L2 sense), there exist neural networks with suitable parameters such that the obtained numerical solution is approximately accurate. To the best of our knowledge, this is the first theoretical result of the deep BSDE method for solving FBSDEs and parabolic PDEs. Although our numerical algorithm is based on neural networks, the theoretical result provided here is equally applicable to the algorithms based on other forms of function approximations.

The article is organized as follows. In Section 2, we precisely state our numerical scheme for coupled FBSDEs and quasilinear parabolic PDEs and give the main theoretical results of the proposed numerical scheme. In Section 3, the basic assumptions and some useful results from the literature are given for later use. The proofs of the two main theorems are provided in Sections 4 and 5, respectively. Some numerical experiments with the proposed scheme are presented in Section 6.

## A numerical scheme for coupled FBSDEs and main results

Let $$T \in (0,+\infty)$$ be the terminal time, and let $$\left (\Omega,\mathbb {F},\left \{\mathcal {F}_{t}\right \}_{0 \le t \le T}, \mathbb {P}\right)$$ be a filtered probability space equipped with a d-dimensional standard Brownian motion $$\{W_{t}\}_{0 \le t \le T}$$ starting from 0. Let $$\xi$$ be a square-integrable random variable independent of $$\{W_{t}\}_{0 \le t \le T}$$. We use the same notation $$\left (\Omega,\mathbb {F},\left \{\mathcal {F}_{t}\right \}_{0 \le t \le T}, \mathbb {P}\right)$$ to denote the filtered probability space generated by $$\{W_{t}+\xi\}_{0 \le t \le T}$$. The notation |x| denotes the Euclidean norm of a vector x and $$\left \|A\right \| = \sqrt {\text {trace}\left (A^{\mathrm {T}} A \right)}$$ denotes the Frobenius norm of a matrix A.

Consider the following coupled FBSDEs

\begin{array}{@{}rcl@{}} \left\{\begin{aligned} & X_{t} = \xi + \int_{0}^{t}b(s,X_s,Y_s)\, \mathrm{d}s + \int_{0}^{t}\sigma(s,X_s,Y_s)\, \mathrm{d}W_s, \qquad\qquad\qquad\quad \quad\quad \text{(2.1)} \\ & Y_t = g(X_T) + \int_{t}^{T}f(s,X_s,Y_s,Z_s)\, \mathrm{d}s - \int_{t}^{T}(Z_s)^{\mathrm{T}}\, \mathrm{d}W_s, \qquad\qquad\qquad \quad\, \, \text{(2.2)} \end{aligned}\right. \end{array}

in which $$X_{t}$$ takes values in $$\mathbb {R}^{m}$$, $$Y_{t}$$ takes values in $$\mathbb {R}$$, and $$Z_{t}$$ takes values in $$\mathbb {R}^{d}$$. Here we assume $$Y_{t}$$ to be one-dimensional to simplify the presentation. The result can be extended without any difficulty to the case where $$Y_{t}$$ is multi-dimensional. We say $$(X_{t},Y_{t},Z_{t})$$ is a solution of the above FBSDEs if all its components are $$\mathcal {F}_{t}$$-adapted and square-integrable, and together satisfy Eqs. (2.1) and (2.2).

Solving coupled FBSDEs numerically is more difficult than solving decoupled FBSDEs. Except for the Picard iteration method developed in Bender and Zhang (2008), most methods exploit the relation to quasilinear parabolic PDEs via the four-time-step-scheme in Ma et al. (1994). Methods of this type suffer from high dimensionality due to the spatial discretization of PDEs. In contrast, our strategy, which starts from simulating the coupled FBSDEs directly, is a new purely probabilistic scheme. To state the numerical algorithm precisely, we consider a partition of the time interval $$[0,T]$$, $$\pi : 0 = t_{0} < t_{1} < \dots < t_{N} = T$$, with $$h = T/N$$ and $$t_{i} = ih$$. Let $$\Delta W_{i} \mathrel {\mathop :}= W_{t_{i+1}} - W_{t_{i}}$$ for $$i = 0, 1, \dots, N-1$$. Inspired by the nonlinear Feynman–Kac formula that will be introduced below, we view $$Y_{0}$$ as a function of $$X_{0}$$ and view $$Z_{t}$$ as a function of $$X_{t}$$ and $$Y_{t}$$. Equipped with this viewpoint, our goal becomes finding appropriate functions $$\mu _{0}^{\pi }: \mathbb {R}^{m} \rightarrow \mathbb {R}$$ and $$\phi _{i}^{\pi }:\mathbb {R}^{m}\times \mathbb {R} \rightarrow \mathbb {R}^{d}$$ for $$i = 0, 1,\dots,N-1$$ such that $$\mu _{0}^{\pi }(\xi)$$ and $$\phi _{i}^{\pi }\left (X_{t_{i}}^{\pi }, Y_{t_{i}}^{\pi }\right)$$ can serve as good surrogates of $$Y_{0}$$ and $$Z_{t_{i}}$$, respectively. To this end, we consider the classical Euler scheme

$$\left\{\begin{array}{ll} X_{0}^{\pi} = \xi, \quad Y_{0}^{\pi} = \mu_{0}^{\pi}(\xi), \\ X_{t_{i+1}}^{\pi} = X_{t_{i}}^{\pi} + b\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right)h + \sigma\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right)\Delta W_{i}, \\ Z_{t_{i}}^{\pi} = \phi_{i}^{\pi}\left(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right), \\ Y_{t_{i+1}}^{\pi} = Y_{t_{i}}^{\pi} - f\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi}\right)h + \left(Z_{t_{i}}^{\pi}\right)^{\mathrm{T}}\Delta W_{i}. \end{array}\right.$$
(2.3)

Without loss of clarity, here we write $$X_{0}^{\pi }$$ for $$X_{t_{0}}^{\pi }$$, $$X_{T}^{\pi }$$ for $$X_{t_{N}}^{\pi }$$, etc.
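As a minimal sketch of how scheme (2.3) is rolled out along one Brownian path, the following Python function treats the scalar case $$m = d = 1$$. The function name `euler_rollout`, its argument names, and the toy coefficients in the usage example are illustrative assumptions; in the deep BSDE method the callables `mu0` and `phis[i]` would be neural networks.

```python
import math
import random


def euler_rollout(xi, mu0, phis, b, sigma, f, h, dWs):
    """Roll out scheme (2.3) along one path in the scalar case m = d = 1.

    mu0 and phis[i] stand in for the parametric functions mu_0^pi and phi_i^pi;
    dWs[i] is the Brownian increment over [t_i, t_{i+1}].
    """
    X, Y, Z = [xi], [mu0(xi)], []
    for i, dW in enumerate(dWs):
        t, x, y = i * h, X[-1], Y[-1]
        z = phis[i](x, y)  # Z_{t_i} = phi_i(X_{t_i}, Y_{t_i})
        X.append(x + b(t, x, y) * h + sigma(t, x, y) * dW)
        Y.append(y - f(t, x, y, z) * h + z * dW)
        Z.append(z)
    return X, Y, Z


# Usage with hypothetical coefficients: b = y, sigma = 1, f = 0.
rng = random.Random(0)
N, T = 4, 1.0
h = T / N
dWs = [rng.gauss(0.0, math.sqrt(h)) for _ in range(N)]
X, Y, Z = euler_rollout(
    xi=1.0,
    mu0=lambda x: 2.0 * x,
    phis=[lambda x, y: 0.0] * N,
    b=lambda t, x, y: y,
    sigma=lambda t, x, y: 1.0,
    f=lambda t, x, y, z: 0.0,
    h=h,
    dWs=dWs,
)
print(len(X), len(Y), len(Z))
```

Note that the forward update uses $$Y_{t_{i}}^{\pi }$$, which is what makes the discrete system coupled: X and Y have to be advanced together.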

Following the spirit of the deep BSDE method, we employ a stochastic optimizer to solve the following stochastic optimization problem

\begin{aligned} &\inf_{\mu_{0}^{\pi} \in \mathcal{N}^{\prime}_{0}, \phi_{i}^{\pi} \in \mathcal{N}_{i}} F\left(\mu_{0}^{\pi},\phi_{0}^{\pi},\dots,\phi_{N-1}^{\pi}\right) \mathrel{\mathop:}= E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}, \end{aligned}
(2.4)

where $$\mathcal {N}'_{0}$$ and $$\mathcal {N}_{i}~(0\leq i \leq {N-1})$$ are parametric function spaces generated by neural networks. To see intuitively where the objective function (2.4) comes from, we consider the following variational problem:

\begin{aligned} &\inf_{Y_{0},\{Z_{t}\}_{0\le t \le T}} E|g(X_{T}) - Y_{T}|^{2}, \\ &s.t.~~ X_{t} = \xi + \int_{0}^{t}b(s,X_{s},Y_{s})\, \mathrm{d}s + \int_{0}^{t}\sigma(s,X_{s},Y_{s})\, \mathrm{d}W_{s}, \\ & \qquad Y_{t} = Y_{0} - \int_{0}^{t}f(s,X_{s},Y_{s},Z_{s})\, \mathrm{d}s + \int_{0}^{t}(Z_{s})^{\mathrm{T}}\, \mathrm{d}W_{s}, \end{aligned}

where $$Y_{0}$$ is $$\mathcal {F}_{0}$$-measurable and square-integrable, and $$Z_{t}$$ is an $$\mathcal {F}_{t}$$-adapted square-integrable process. The solution of the FBSDEs (2.1) and (2.2) is a minimizer of the above problem since the loss function attains zero when it is evaluated at the solution. In addition, the wellposedness of the FBSDEs (under some regularity conditions) ensures the existence and uniqueness of the minimizer. Therefore, we expect that (2.4), as a discretized counterpart of the variational problem above, defines a benign optimization problem and that the associated near-optimal solution provides a good approximate solution of the original FBSDEs. The reason we do not represent $$Z_{t_{i}}$$ as a function of $$X_{t_{i}}$$ only is that the process $$\{X_{t_{i}}^{\pi }\}_{0\le i \le N}$$ is not Markovian, while the process $$\left \{X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right \}_{0 \le i \le N}$$ is Markovian, which facilitates our analysis considerably. If b and σ are both independent of Y, then the FBSDEs (2.1) and (2.2) are decoupled, and we can take $$\phi _{i}^{\pi }$$ as a function of $$X_{t_{i}}^{\pi }$$ only, as in the numerical scheme introduced in Han et al. (2018) and E et al. (2017).
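To make the objective (2.4) concrete, the sketch below estimates $$E|g(X_{T}^{\pi }) - Y_{T}^{\pi }|^{2}$$ by Monte Carlo for a toy coupled system chosen purely for illustration (an assumption, not an example from this article): $$m = d = 1$$, $$b = \cos y$$, $$\sigma = 1$$, $$f = -z\cos y$$, $$g(x) = x$$. For these coefficients $$u(t,x) = x$$ formally solves the associated quasilinear PDE, so that $$Y_{t} = X_{t}$$ and $$Z_{t} = 1$$. The toy coefficients are not checked against the assumptions used later in the article, and the function name `estimate_loss` is hypothetical.

```python
import math
import random


def estimate_loss(mu0, phi, N=20, T=1.0, xi=0.0, n_paths=2000, seed=0):
    """Monte Carlo estimate of E|g(X_T) - Y_T|^2 for scheme (2.3), scalar case."""
    b = lambda t, x, y: math.cos(y)          # toy drift, coupled through y
    sigma = lambda t, x, y: 1.0              # toy diffusion
    f = lambda t, x, y, z: -z * math.cos(y)  # toy generator
    g = lambda x: x                          # toy terminal condition
    rng = random.Random(seed)
    h = T / N
    total = 0.0
    for _ in range(n_paths):
        x, y = xi, mu0(xi)
        for i in range(N):
            t = i * h
            dW = rng.gauss(0.0, math.sqrt(h))
            z = phi(x, y)
            # Tuple assignment: both updates use the values at time t_i.
            x, y = (x + b(t, x, y) * h + sigma(t, x, y) * dW,
                    y - f(t, x, y, z) * h + z * dW)
        total += (g(x) - y) ** 2
    return total / n_paths


# Exact surrogates (mu0(x) = x, phi = 1) versus a deliberately wrong phi = 0.
loss_exact = estimate_loss(mu0=lambda x: x, phi=lambda x, y: 1.0)
loss_wrong = estimate_loss(mu0=lambda x: x, phi=lambda x, y: 0.0)
print(loss_exact, loss_wrong)
```

In this toy case the exact surrogates drive the estimated loss to (numerically) zero, while the wrong choice of `phi` leaves a loss of order one, which is the behavior one hopes the objective function captures.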

Our two main theorems regarding the deep BSDE method are the following, mainly on the justification and property of the objective function (2.4) in the general coupled case, regardless of the specific choice of parametric function spaces. An important assumption for the two theorems is the so-called weak coupling or monotonicity condition, which will be explained in detail in Section 3. The precise statement of the theorems can be found in Theorem 1’ (Section 4) and Theorem 2’ (Section 5), respectively.

### Theorem 1

Under some assumptions, there exists a constant C, independent of h, d, and m, such that for sufficiently small h,

\begin{aligned} \sup_{t \in [0,T]} \left(E|X_{t} - \hat{X}_{t}^{\pi}|^{2} + E|Y_{t} - \hat{Y}_{t}^{\pi}|^{2}\right) + \int_{0 }^{T} E|Z_{t} - \hat{Z}_{t}^{\pi}|^{2} \, \mathrm{d}t \le C\left[h + E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}\right], \end{aligned}
(2.5)

where $$\hat {X}_{t}^{\pi } = X_{t_{i}}^{\pi }$$, $$\hat {Y}_{t}^{\pi } = Y_{t_{i}}^{\pi }$$, $$\hat {Z}_{t}^{\pi } = Z_{t_{i}}^{\pi }$$ for $$t \in [t_{i}, t_{i+1})$$.

### Theorem 2

Under some assumptions, there exists a constant C, independent of h, d and m, such that for sufficiently small h,

\begin{aligned} &\inf_{\mu_{0}^{\pi} \in \mathcal{N}^{\prime}_{0}, \phi_{i}^{\pi} \in \mathcal{N}_{i}} E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \\ &\quad \le~ C\left\{h +\inf_{\mu_{0}^{\pi} \in \mathcal{N}^{\prime}_{0}, \phi_{i}^{\pi} \in \mathcal{N}_{i}} \left[{\vphantom{\inf_{\mu_{0}^{\pi} \in \mathcal{N}^{\prime}_{0}, \phi_{i}^{\pi} \in \mathcal{N}_{i}}}} E|Y_{0} - \mu_{0}^{\pi}(\xi)|^{2} \right.\right.\\ &\quad +\left.\left. \sum_{i = 0}^{N-1}E|E\left[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right] - \phi_{i}^{\pi}\left(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right)|^{2}h\right] \right\}, \end{aligned}

where $$\tilde {Z}_{t_{i}} =h^{-1}E\left [\int _{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm {d} t|\mathcal {F}_{t_{i}}\right ]$$. If b and σ are independent of Y, the term $$E[\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }]$$ can be replaced with $$E[\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi }]$$.

Briefly speaking, Theorem 1 states that the simulation error (left side of Eq. (2.5)) can be bounded through the value of the objective function (2.4). To the best of our knowledge, this is the first result for the error estimation of the coupled FBSDEs, concerning both time discretization error and terminal distance. Theorem 2 states that the optimal value of the objective function can be small if the approximation capability of the parametric function spaces ($$\mathcal {N}'_{0}$$ and $$\mathcal {N}_{i}$$ above) is high. Neural networks are a promising candidate for such a requirement, especially in high-dimensional problems. There are numerous results, dating back to the late 1980s (see, e.g., Cybenko (1989); Funahashi (1989); Hornik et al. (1989); Barron (1993); Arora et al. (2018); Eldan and Shamir (2016); Cohen et al. (2016); Mhaskar and Poggio (2016); Bölcskei et al. (2017); Liang and Srikant (2017)), in regard to the universal approximation and complexity of neural networks. There are also some recent analyses (Grohs et al. 2018; Jentzen et al. 2018; Berner et al. 2018; Hutzenthaler et al. 2020) on approximating the solutions of certain parabolic partial differential equations with neural networks. However, the problem is still far from resolved. Theorem 2 implies that if the involved conditional expectations can be approximated by neural networks whose numbers of parameters grow at most polynomially both in the dimension and in the reciprocal of the required accuracy, then the solutions of the considered FBSDEs can be represented in practice without the curse of dimensionality. Under what conditions this assumption is true is beyond the scope of this work and remains for further investigation.

The above-mentioned scheme in (2.3) and (2.4) is for solving FBSDEs. The so-called nonlinear Feynman–Kac formula, connecting FBSDEs with the quasilinear parabolic PDEs, provides an approach to numerically solve quasilinear parabolic PDEs (2.6) below through the same scheme. We recall a concrete version of the nonlinear Feynman–Kac formula in Theorem 3 below and refer interested readers to, e.g., Ma and Yong (2007) for more details. According to this formula, the term $$E|Y_{0} - Y_{0}^{\pi }|^{2}$$ can be interpreted as $$E|u(0,\xi) - \mu _{0}^{\pi }(\xi)|^{2}$$. Therefore, we can choose the random variable ξ with a delta distribution, a uniform distribution in a bounded region, or any other distribution we are interested in. After solving the optimization problem, we obtain $$\mu _{0}^{\pi }(\xi)$$ as an approximation of $$u(0,\xi)$$. See Han et al. (2018) and E et al. (2017) for more details.

### Theorem 3

Assume

1. $$m = d$$ and $$b(t,x,y)$$, $$\sigma(t,x,y)$$, $$f(t,x,y,z)$$ are smooth functions with bounded first-order derivatives with respect to $$x, y, z$$.

2. There exist a positive continuous function $$\nu$$ and a constant $$\mu$$ such that

\begin{aligned} &\nu(|y|)\mathbf{I} \le \sigma\sigma^{\mathrm{T}}(t,x,y) \le \mu \mathbf{I}, \\ &|b(t,x,0)| + |f(t,x,0,z)| \le \mu. \end{aligned}

3. There exists a constant $$\alpha \in (0,1)$$ such that g is bounded in the Hölder space $$C^{2,\alpha }(\mathbb {R}^{m})$$.

Then the following quasilinear PDE has a unique classical solution $$u(t,x)$$ that is bounded with bounded $$u_{t}$$, $$\nabla_{x} u$$, and $$\nabla ^{2}_{x} u$$,

\left\{\begin{aligned}&u_{t} + \frac{1}{2}\text{trace}\left(\sigma\sigma^{\mathrm{T}}(t,x,u)\nabla^{2}_{x} u\right) \\ &\quad+~b^{\mathrm{T}}(t,x,u) \nabla_{x} u + f\left(t,x,u,\sigma^{\mathrm{T}}(t,x,u) \nabla_{x} u\right) = 0, \\ &u(T,x) = g(x). \end{aligned}\right.
(2.6)

The associated FBSDEs (2.1) and (2.2) have a unique solution $$(X_{t},Y_{t},Z_{t})$$ with $$Y_{t} = u(t,X_{t})$$, $$Z_{t} = \sigma^{\mathrm{T}}(t,X_{t},u(t,X_{t}))\nabla_{x} u(t,X_{t})$$, and $$X_{t}$$ is the solution of the following SDE

$$X_{t} = \xi + \int_{0}^{t}b(s,X_{s},u(s,X_{s}))\, \mathrm{d}s + \int_{0}^{t}\sigma(s,X_{s},u(s,X_{s}))\, \mathrm{d}W_{s}.$$
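As a sanity check of the representation $$Y_{t} = u(t,X_{t})$$, consider the simplest decoupled, linear instance (an illustrative choice, not taken from this article): $$m = d = 1$$, $$b = 0$$, $$\sigma = 1$$, $$f = 0$$, $$g(x) = \sin x$$. Then (2.6) reduces to $$u_{t} + \frac{1}{2}u_{xx} = 0$$ with $$u(T,x) = \sin x$$, whose solution is $$u(t,x) = e^{-(T-t)/2}\sin x$$, and for a deterministic $$\xi$$ one has $$Y_{0} = u(0,\xi) = E[g(\xi + W_{T})]$$. The Python sketch below compares the closed form with a Monte Carlo average; the function names are hypothetical.

```python
import math
import random


def u_exact(t, x, T=1.0):
    """Closed-form solution of u_t + 0.5 * u_xx = 0 with u(T, x) = sin(x)."""
    return math.exp(-(T - t) / 2.0) * math.sin(x)


def y0_monte_carlo(xi, T=1.0, n_paths=20000, seed=0):
    """Estimate Y_0 = E[g(xi + W_T)] with g = sin, using W_T ~ N(0, T)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += math.sin(xi + rng.gauss(0.0, math.sqrt(T)))
    return total / n_paths


xi = 0.8
print(u_exact(0.0, xi), y0_monte_carlo(xi))
```

The two printed numbers should agree up to Monte Carlo error. In this example $$Z_{t} = e^{-(T-t)/2}\cos X_{t}$$, in line with the formula $$Z_{t} = \sigma^{\mathrm{T}}\nabla_{x} u$$.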

### Remark 1

The statement regarding FBSDEs (2.1) and (2.2) in Theorem 3 is developed through a PDE-based argument, which requires $$m = d$$, uniform ellipticity of $$\sigma$$, and high-order smoothness of $$b, \sigma, f$$, and g. An analogous result through a probabilistic argument is given below in Theorem 4 (point 4). In that case, we only need the Lipschitz condition for all of the involved functions, in addition to some weak coupling or monotonicity conditions demonstrated in Assumption 3. Note that the Lipschitz condition alone does not guarantee the existence of a solution to the coupled FBSDEs, even in the situation when $$b, f, \sigma$$ are linear (see Bender and Zhang (2008); Ma and Yong (2007) for a concrete counterexample).

### Remark 2

Theorem 3 also implies that the assumption that the drift function b only depends on x,y is general. If b depends on z as well, one can move the associated term in (2.6) into the nonlinearity f and apply the nonlinear Feynman–Kac formula back to obtain an equivalent system of coupled FBSDEs, in which the new drift function is independent of z.
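A minimal sketch of this reduction, under the additional assumption (made here for illustration) that $$m = d$$ and $$\sigma$$ is invertible as in Theorem 3: write the drift as $$b(t,x,y,z)$$ and use $$z = \sigma^{\mathrm{T}}(t,x,u)\nabla_{x} u$$, so that $$\nabla_{x} u = \left(\sigma^{\mathrm{T}}\right)^{-1} z$$. The drift term of (2.6) can then be absorbed into a modified nonlinearity $$\tilde{f}$$, leaving a modified drift $$\tilde{b}$$ that no longer depends on z. The choice $$\tilde{b} = 0$$ below is one possible splitting, not necessarily the one the authors have in mind.

\begin{aligned} b^{\mathrm{T}}(t,x,u,z)\nabla_{x} u &= b^{\mathrm{T}}(t,x,u,z)\left(\sigma^{\mathrm{T}}(t,x,u)\right)^{-1} z, \qquad z = \sigma^{\mathrm{T}}(t,x,u)\nabla_{x} u, \\ \tilde{f}(t,x,y,z) &\mathrel{\mathop:}= f(t,x,y,z) + b^{\mathrm{T}}(t,x,y,z)\left(\sigma^{\mathrm{T}}(t,x,y)\right)^{-1} z, \qquad \tilde{b}(t,x,y) \mathrel{\mathop:}= 0, \\ 0 &= u_{t} + \frac{1}{2}\text{trace}\left(\sigma\sigma^{\mathrm{T}}(t,x,u)\nabla^{2}_{x} u\right) + \tilde{b}^{\mathrm{T}}(t,x,u)\nabla_{x} u + \tilde{f}\left(t,x,u,\sigma^{\mathrm{T}}(t,x,u)\nabla_{x} u\right). \end{aligned}

Whether $$\tilde{f}$$ still satisfies the regularity required of f depends on b and $$\sigma$$ and has to be checked case by case.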

## Preliminaries

In this section, we introduce our assumptions and two useful results in Bender and Zhang (2008). We use the notation $$\Delta x = x_{1} - x_{2}$$, $$\Delta y = y_{1} - y_{2}$$, $$\Delta z = z_{1} - z_{2}$$.

### Assumption 1

1. There exist (possibly negative) constants $$k_{b}, k_{f}$$ such that

\begin{aligned} \left[b(t,x_{1},y) - b(t,x_{2},y)\right]^{\mathrm{T}}\Delta x &\le k_{b} |\Delta x|^{2},\\ [f(t,x,y_{1},z) - f(t,x,y_{2},z)]\Delta y &\le k_{f} |\Delta y|^{2}. \end{aligned}

2. $$b, \sigma, f, g$$ are uniformly Lipschitz continuous with respect to $$(x,y,z)$$. In particular, there are non-negative constants $$K, b_{y}, \sigma_{x}, \sigma_{y}, f_{x}, f_{z}$$, and $$g_{x}$$ such that

\begin{aligned} |b(t,x_{1},y_{1}) - b(t,x_{2},y_{2})|^{2} &\le K|\Delta x|^{2} + b_{y}|\Delta y|^{2}, \\ \|\sigma(t,x_{1},y_{1}) - \sigma(t,x_{2},y_{2})\|^{2} &\le \sigma_{x}|\Delta x|^{2} + \sigma_{y}|\Delta y|^{2}, \\ |f(t,x_{1},y_{1},z_{1}) - f(t,x_{2},y_{2},z_{2})|^{2} &\le f_{x}|\Delta x|^{2} + K|\Delta y|^{2} + f_{z}|\Delta z|^{2}, \\ |g(x_{1}) - g(x_{2})|^{2} &\le g_{x}|\Delta x|^{2}. \end{aligned}

3. $$b(t,0,0)$$, $$f(t,0,0,0)$$, and $$\sigma(t,0,0)$$ are bounded. In particular, there are constants $$b_{0}, \sigma_{0}, f_{0}$$, and $$g_{0}$$ such that

\begin{aligned} |b(t,x,y)|^{2} &\le b_{0} + K|x|^{2} + b_{y}|y|^{2}, \\ \|\sigma(t,x,y)\|^{2} &\le \sigma_{0} + \sigma_{x}|x|^{2} + \sigma_{y}|y|^{2}, \\ |f(t,x,y,z)|^{2} &\le f_{0} + f_{x}|x|^{2} + K|y|^{2} + f_{z}|z|^{2}, \\ |g(x)|^{2} &\le g_{0} + g_{x}|x|^{2}. \end{aligned}

We note here that $$b_{y}$$, etc., are all constants, not partial derivatives. For convenience, we use $$\mathscr {L}$$ to denote the set of all the constants mentioned above and assume K is the upper bound of $$\mathscr {L}$$.

### Assumption 2

b,σ,f are uniformly Hölder-$$\frac {1}{2}$$ continuous with respect to t. We assume the same constant K to be the upper bound of the square of the Hölder constants as well.

### Assumption 3

One of the following five cases holds:

1. Small time duration, that is, T is small.

2. Weak coupling of Y into the forward SDE (2.1), that is, $$b_{y}$$ and $$\sigma_{y}$$ are small. In particular, if $$b_{y} = \sigma_{y} = 0$$, then the forward equation does not depend on the backward one and, thus, Eqs. (2.1) and (2.2) are decoupled.

3. Weak coupling of X into the backward SDE (2.2), that is, $$f_{x}$$ and $$g_{x}$$ are small. In particular, if $$f_{x} = g_{x} = 0$$, then the backward equation does not depend on the forward one and, thus, Eqs. (2.1) and (2.2) are also decoupled. In fact, in this case, $$Z = 0$$ and (2.2) reduces to an ODE.

4. f is strongly decreasing in y, that is, $$k_{f}$$ is very negative.

5. b is strongly decreasing in x, that is, $$k_{b}$$ is very negative.

The assumptions stated above are usually called weak coupling and monotonicity conditions in literature (Bender and Zhang 2008; Antonelli 1993; Pardoux and Tang 1999). To make it more precise, we define

\begin{aligned} L_{0} &= [b_{y} +\sigma_{y} ][g_{x} +f_{x} T]Te^{[b_{y} +\sigma_{y}][g_{x} +f_{x} T]T+[2k_{b} +2k_{f} +2+\sigma_{x} +f_{z} ]T}, \\ L_{1} &= [g_{x} +f_{x} T][e^{[b_{y} +\sigma_{y}][g_{x} +f_{x} T]T+[2k_{b} +2k_{f} +2+\sigma_{x} +f_{z} ]T + 1} \vee 1], \\ \Gamma_{0}(x) &= \frac{e^{x} - 1}{x}, ~~(x>0), \\ \Gamma_{1}(x,y) &= \sup_{0 < \theta < 1}\theta e^{\theta x}\Gamma_{0}(y), \\ c &= \inf_{\lambda_{1} > 0}\left\{\left[ e^{[2k_{b} +1+\sigma_{x} +[b_{y} +\sigma_{y} ]L_{1}]T} \vee 1\right]\left(1 + \lambda_{1}^{-1}\right)[b_{y}+\sigma_{y}]T \right. \\ &\quad~ \times\left[g_{x}\Gamma_{1}([2k_{f}+1 + f_{z}]T, [2k_{b}+1+\sigma_{x}+(1+\lambda_{1})[b_{y}+\sigma_{y}]L_{1}]T) \right. \\ &\quad~ + f_{x} T\Gamma_{0}([2k_{f}+1+f_{z}]T) \\ &\quad~ \times \left.\left.\Gamma_{0}([2k_{b} +1+\sigma_{x} +(1+\lambda_{1})[b_{y} +\sigma_{y}]L_{1}]T)\right]\right\}. \end{aligned}

Then, a specific quantitative form of the above five conditions can be summarized as:

$$L_{0} < e^{-1} \text{~~and~~} c < 1.$$
(3.1)

In other words, if any of the five weak coupling and monotonicity conditions holds to a certain extent, the two inequalities in (3.1) hold. Below, we refer to (3.1) as Assumption 3 and the five general qualitative conditions described above as the weak coupling and monotonicity conditions.
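For a given set of constants, the first inequality in (3.1) is easy to evaluate numerically. The Python sketch below computes $$L_{0}$$ and $$L_{1}$$ as defined above; it does not evaluate c, which involves an additional optimization over $$\lambda_{1}$$. The function name and the sample constants are illustrative assumptions.

```python
import math


def coupling_constants(T, k_b, k_f, b_y, sigma_y, sigma_x, f_x, f_z, g_x):
    """Return (L0, L1) from the definitions preceding (3.1)."""
    a = (b_y + sigma_y) * (g_x + f_x * T) * T
    r = (2 * k_b + 2 * k_f + 2 + sigma_x + f_z) * T
    L0 = a * math.exp(a + r)
    L1 = (g_x + f_x * T) * max(math.exp(a + r + 1), 1.0)
    return L0, L1


# Hypothetical constants: moderate coupling over a short horizon.
L0_short, L1_short = coupling_constants(
    T=0.1, k_b=0.0, k_f=0.0, b_y=0.5, sigma_y=0.5,
    sigma_x=0.0, f_x=0.0, f_z=0.0, g_x=1.0)

# Same constants over a longer horizon.
L0_long, L1_long = coupling_constants(
    T=1.0, k_b=0.0, k_f=0.0, b_y=0.5, sigma_y=0.5,
    sigma_x=0.0, f_x=0.0, f_z=0.0, g_x=1.0)

print(L0_short < math.exp(-1), L0_long < math.exp(-1))
```

With these sample values the inequality $$L_{0} < e^{-1}$$ holds for the short horizon and fails for the longer one, which illustrates the first of the five qualitative conditions (small time duration).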

The above three assumptions are basic assumptions in Bender and Zhang (2008), which we need in order to use the results from Bender and Zhang (2008), as stated in Theorems 4 and 5 below. Theorem 4 gives the connections between coupled FBSDEs and quasilinear parabolic PDEs under weaker conditions. Theorem 5 provides the convergence of the implicit scheme for coupled FBSDEs. Our work primarily uses the same set of assumptions, except that we assume some further quantitative restrictions related to the weak coupling and monotonicity conditions, which will be transparent through the extra constants we define in the proofs. Our aim is to provide explicit conditions under which our results hold and to present more clearly the relationship between these constants and the error estimates. As will be seen in the proofs, roughly speaking, the weaker the coupling (resp., the stronger the monotonicity, the smaller the time horizon) is, the more easily the condition is satisfied, and the smaller the constant C in the error estimates is.

### Theorem 4

Under Assumptions 1, 2, and 3, there exists a function u: $$\mathbb {R} \times \mathbb {R}^{m} \rightarrow \mathbb {R}$$ that satisfies the following statements.

1. $$|u(t,x_{1}) - u(t,x_{2})|^{2} \le L_{1}|x_{1} - x_{2}|^{2}$$.

2. $$|u(s,x) - u(t,x)|^{2} \le C(1+|x|^{2})|s - t|$$ with some constant C depending on $$\mathscr {L}$$ and T.

3. u is a viscosity solution of the PDE (2.6).

4. The FBSDEs (2.1) and (2.2) have a unique solution $$(X_{t},Y_{t},Z_{t})$$ and $$Y_{t} = u(t,X_{t})$$. Thus, $$(X_{t},Y_{t},Z_{t})$$ satisfies decoupled FBSDEs

\left\{\begin{aligned} X_t &= \xi + \int_{0}^{t}b(s,X_s,u(s,X_s))\, \mathrm{d}s + \int_{0}^{t}\sigma(s,X_s,u(s,X_s))\, \mathrm{d}W_s, \\ Y_t &= g(X_T) + \int_{t}^{T}f(s,X_s,Y_s,Z_s) \, \mathrm{d}s - \int_{t}^{T}(Z_s)^{\mathrm{T}} \, \mathrm{d}W_s. \end{aligned}\right.

Furthermore, the solution of the FBSDEs satisfies the path regularity with some constant C depending on $$\mathscr {L}$$ and T

$$\sup_{t\in[0,T]} \left(E|X_{t} - \tilde{X}_{t}|^{2} + E|Y_{t} - \tilde{Y}_{t}|^{2}\right) + \int_{0}^{T} E|Z_{t} - \tilde{Z}_{t}|^{2}\,\mathrm{d} t \le C\left(1 + E|\xi|^{2}\right)h,$$
(3.2)

in which $$\tilde {X}_{t} = X_{t_{i}}, \tilde {Y}_{t} = Y_{t_{i}}, \tilde {Z}_{t} =h^{-1}E\left [\int _{t_{i}}^{t_{i+1}}Z_{t} \,\mathrm {d} t|\mathcal {F}_{t_{i}}\right ]$$ for $$t \in [t_{i}, t_{i+1})$$. If $$Z_{t}$$ is càdlàg, we can replace $$h^{-1}E\left [\int _{t_{i}}^{t_{i+1}}Z_{t}\,\mathrm {d} t|\mathcal {F}_{t_{i}}\right ]$$ with $$Z_{t_{i}}$$.

### Remark 3

Several conditions can guarantee that $$Z_{t}$$ admits a càdlàg version, such as $$m = d$$ and $$\sigma\sigma^{\mathrm{T}} \ge \delta \mathbf{I}$$ with some $$\delta > 0$$; see, e.g., Zhang (2004).

### Theorem 5

Under Assumptions 1, 2, and 3, for sufficiently small h, the following discrete-time equation ($$0 \le i \le N-1$$)

\left\{\begin{aligned} &\overline{X}_{0}^{\pi} = \xi, \\ &\overline{X}_{t_{i+1}}^{\pi} = \overline{X}_{t_{i}}^{\pi} + b\left(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi}\right)h + \sigma\left(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi}\right)\Delta W_{i}, \\ &\overline{Y}_{T}^{\pi} = g\left(\overline{X}_{T}^{\pi}\right), \\ &\overline{Z}_{t_{i}}^{\pi} = \frac{1}{h}E\left[\overline{Y}_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}\right], \\ &\overline{Y}_{t_{i}}^{\pi} = E\left[\overline{Y}_{t_{i+1}}^{\pi} + f\left(t_{i},\overline{X}_{t_{i}}^{\pi},\overline{Y}_{t_{i}}^{\pi},\overline{Z}_{t_{i}}^{\pi}\right)h|\mathcal{F}_{t_{i}}\right], \end{aligned}\right.
(3.3)

has a solution $$\left (\overline {X}_{t_{i}}^{\pi }, \overline {Y}_{t_{i}}^{\pi }, \overline {Z}_{t_{i}}^{\pi }\right)$$ such that $$\overline {X}_{t_{i}}^{\pi } \in L^{2}(\Omega,\mathcal {F}_{t_{i}},\mathbb {P})$$ and

$$\sup_{t\in[0,T]} \left(E|X_{t} - \overline{X}_{t}^{\pi}|^{2} + E|Y_{t} - \overline{Y}_{t}^{\pi}|^{2}\right) + \int_{0}^{T} E|Z_{t} - \overline{Z}_{t}^{\pi}|^{2}\, \mathrm{d}t \le C\left(1 + E|\xi|^{2}\right)h,$$
(3.4)

where $$\overline {X}_{t}^{\pi } = \overline {X}_{t_{i}}^{\pi }, \overline {Y}_{t}^{\pi }= \overline {Y}_{t_{i}}^{\pi }, \overline {Z}_{t}^{\pi } = \overline {Z}_{t_{i}}^{\pi }$$ for $$t \in [t_{i}, t_{i+1})$$, and C is a constant depending on $$\mathscr {L}$$ and T.

### Remark 4

In Bender and Zhang (2008), the above result (existence and convergence) is proved for the explicit scheme, which is formulated by replacing $$f\left (t_{i},\overline {X}_{t_{i}}^{\pi },\overline {Y}_{t_{i}}^{\pi },\overline {Z}_{t_{i}}^{\pi }\right)$$ with $$f\left (t_{i},\overline {X}_{t_{i}}^{\pi },\overline {Y}_{t_{i+1}}^{\pi },\overline {Z}_{t_{i}}^{\pi }\right)$$ in the last equation of (3.3). The same techniques can be used to prove the result for the implicit scheme, as we state in Theorem 5.

Finally, to make sure the system in (2.3) is well-defined, we restrict our parametric function spaces $$\mathcal {N}'_{0}$$ and $$\mathcal {N}_{i}$$ as in Assumption 4 below. Note that neural networks with common activation functions, including ReLU and the sigmoid function, satisfy this assumption. Under Assumptions 1 and 4, one can easily prove by induction that $$\left \{X_{t_{i}}^{\pi }\right \}_{0\le i \le N}, \left \{Y_{t_{i}}^{\pi }\right \}_{0 \le i \le N}$$ and $$\left \{Z_{t_{i}}^{\pi }\right \}_{0 \le i \le N-1}$$ defined in (2.3) are all measurable and square-integrable random variables.

### Assumption 4

$$\mathcal {N}_{0}^{'}$$ and $$\mathcal {N}_{i} (0 \le i \le N-1)$$ are subsets of measurable functions from $$\mathbb {R}^{m}$$ to $$\mathbb {R}$$ and $$\mathbb {R}^{m}\times \mathbb {R}$$ to $$\mathbb {R}^{d}$$ with linear growth, namely, $$\mu _{0}^{\pi }$$ and $$\left \{\phi _{i}^{\pi }\right \}_{0 \le i \le N-1}$$ in (2.3) satisfy $$|\mu _{0}^{\pi }(x)|^{2} \le A'_{0}+ B'_{0}|x|^{2}$$ and $$|\phi _{i}^{\pi }(x,y)|^{2} \le A_{i} + B_{i}|x|^{2} + C_{i}|y|^{2}$$ for $$0 \le i \le N-1$$.

## A posteriori estimation of the simulation error

We prove Theorem 1 in this section. Comparing the statements of Theorems 1 and 5, we wish to bound the differences between $$\left (X_{t_{i}}^{\pi }, Y_{t_{i}}^{\pi }, Z_{t_{i}}^{\pi }\right)$$ and $$\left (\overline {X}_{t_{i}}^{\pi }, \overline {Y}_{t_{i}}^{\pi }, \overline {Z}_{t_{i}}^{\pi }\right)$$ with the objective function $$E|g\left (X_{T}^{\pi }\right) - Y_{T}^{\pi }|^{2}$$. Recalling the definition of the system of Eq. (2.3), we have

\begin{array}{@{}rcl@{}} \left\{\begin{aligned} X_{t_{i+1}}^{\pi} &= X_{t_i}^{\pi} + b\left(t_i,X_{t_i}^{\pi},Y_{t_i}^{\pi}\right)h + \sigma\left(t_i,X_{t_i}^{\pi},Y_{t_i}^{\pi}\right)\Delta W_i, \qquad\qquad\qquad\quad\quad \, \, \, \,\text{(4.1)}\\ Y_{t_{i+1}}^{\pi} &= Y_{t_i}^{\pi} - f\left(t_i,X_{t_i}^{\pi},Y_{t_i}^\pi,Z_{t_i}^\pi \right)h + \left(Z_{t_i}^\pi \right)^{\mathrm{T}}\Delta W_i. \qquad\qquad\qquad\qquad\quad\quad\, \,\text{(4.2)} \end{aligned}\right. \end{array}

Taking the expectation $$E[\cdot |\mathcal {F}_{t_{i}}]$$ on both sides of (4.2), we obtain

$$Y_{t_{i}}^{\pi} = E\left[Y_{t_{i+1}}^{\pi} + f\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi}\right)h|\mathcal{F}_{t_{i}}\right].$$

Right multiplying both sides of (4.2) by $$(\Delta W_{i})^{\mathrm{T}}$$ and taking the expectation $$E[\cdot |\mathcal {F}_{t_{i}}]$$ again, we obtain

$$Z_{t_{i}}^{\pi} = \frac{1}{h}E\left[Y_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}\right].$$

The above observation motivates us to consider the following system of equations

\left\{\begin{aligned} &X_{0}^{\pi} = \xi, \\ &X_{t_{i+1}}^{\pi} = X_{t_{i}}^{\pi} + b\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right)h + \sigma\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right)\Delta W_{i}, \\ &Z_{t_{i}}^{\pi} = \frac{1}{h}E\left[Y_{t_{i+1}}^{\pi}\Delta W_{i}|\mathcal{F}_{t_{i}}\right], \\ &Y_{t_{i}}^{\pi} = E\left[{Y}_{t_{i+1}}^{\pi} + f\left(t_{i},{X}_{t_{i}}^{\pi},{Y}_{t_{i}}^{\pi},{Z}_{t_{i}}^{\pi}\right)h|\mathcal{F}_{t_{i}}\right]. \end{aligned}\right.
(4.3)
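The two conditional-expectation identities in (4.3) rest on $$E[\Delta W_{i}|\mathcal {F}_{t_{i}}] = 0$$ and $$E[\Delta W_{i}(\Delta W_{i})^{\mathrm{T}}|\mathcal {F}_{t_{i}}] = h\mathbf{I}$$. As a rough numerical illustration in the scalar case, the Python sketch below fixes hypothetical values of $$Y_{t_{i}}^{\pi }$$, $$Z_{t_{i}}^{\pi }$$, and of f at time $$t_{i}$$, samples one step of (4.2), and recovers both quantities by averaging. The function name and the numerical values are assumptions made for the example.

```python
import math
import random


def recover_y_and_z(y_i, z_i, f_i, h, n_samples=100000, seed=0):
    """Average one step of (4.2) to recover Y_{t_i} and Z_{t_i} (scalar case)."""
    rng = random.Random(seed)
    sum_y, sum_z = 0.0, 0.0
    for _ in range(n_samples):
        dW = rng.gauss(0.0, math.sqrt(h))
        y_next = y_i - f_i * h + z_i * dW  # one step of (4.2)
        sum_y += y_next + f_i * h          # integrand of the Y identity
        sum_z += y_next * dW / h           # integrand of the Z identity
    return sum_y / n_samples, sum_z / n_samples


y_hat, z_hat = recover_y_and_z(y_i=1.0, z_i=0.7, f_i=0.3, h=0.25)
print(y_hat, z_hat)
```

Up to Monte Carlo error, the two averages should be close to the prescribed values 1.0 and 0.7.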

Note that (4.3) is defined just like the FBSDEs (2.1) and (2.2), where the X component is defined forwardly and the Y,Z components are defined backwardly. However, since we do not specify the terminal condition of $$Y_{T}^{\pi }$$, the system of Eq. (4.3) has infinitely many solutions. The following lemma gives an estimate of the difference between two such solutions.

### Lemma 1

For $$j = 1, 2$$, suppose $$\left (\!\left \{X_{t_{i}}^{\pi,j}\right \}_{0\le i \le N},\! \left \{Y_{t_{i}}^{\pi,j}\right \}_{0\le i \le N},\! \left \{Z_{t_{i}}^{\pi,j}\right \}_{0\le i \le N-1}\right)$$ are two solutions of (4.3), with $$X_{t_{i}}^{\pi,j}, Y_{t_{i}}^{\pi,j} \in L^{2}(\Omega,\mathcal {F}_{t_{i}},\mathbb {P}), 0 \le i \le N$$. For any $$\lambda_{1} > 0$$, $$\lambda_{2} \ge f_{z}$$, and sufficiently small h, denote

\begin{aligned} A_{1} &\mathrel{\mathop:}= 2k_{b} + \lambda_{1}+ \sigma_{x} + Kh, \\ A_{2} &\mathrel{\mathop:}=\left(\lambda_{1}^{-1}+h\right)b_{y} + \sigma_{y}, \\ A_{3} &\mathrel{\mathop:}= -\frac{\ln[1-(2k_{f}+\lambda_{2})h]}{h}, \\ A_{4} &\mathrel{\mathop:}= \frac{f_{x}}{[1 - (2k_{f} + \lambda_{2})h]\lambda_{2}}. \end{aligned}
(4.4)

Let $$\delta X_{i} = X_{t_{i}}^{\pi,1} - X_{t_{i}}^{\pi,2}, \delta Y_{i} = Y_{t_{i}}^{\pi,1} - Y_{t_{i}}^{\pi,2}$$. Then we have, for $$0 \le n \le N$$,

\begin{aligned} E|\delta X_{n}|^{2} &\le A_{2}\sum_{i = 0}^{n-1}e^{A_{1}(n-i-1)h}E|\delta Y_{i}|^{2}h, \\ E|\delta Y_{n}|^{2} &\le e^{A_{3}(N-n)h}E|\delta Y_{N}|^{2} + A_{4}\sum_{i = n}^{N-1}e^{A_{3}(i-n)h}E|\delta X_{i}|^{2}h. \end{aligned}

To prove Lemma 1, we need the following lemma to handle the Z component.

### Lemma 2

Let $$0\le s_{1}<s_{2}$$. Given $$Q \in L^{2}(\Omega,\mathcal {F}_{s_{2}},\mathbb {P})$$, by the martingale representation theorem, there exists an $$\mathcal {F}_{t}$$-adapted process $$\{H_{s}\}_{s_{1}\le s\le s_{2}}$$ such that $$\int _{s_{1}}^{s_{2}} E|H_{s}|^{2}\,\mathrm {d} s < \infty$$ and $$Q = E[Q|\mathcal {F}_{s_{1}}] + \int _{s_{1}}^{s_{2}} H_{s} \,\mathrm {d} W_{s}$$. Then we have $$E[Q(W_{s_{2}} - W_{s_{1}})|\mathcal {F}_{s_{1}}] = E[\int _{s_{1}}^{s_{2}} H_{s} \, \mathrm {d}s|\mathcal {F}_{s_{1}}]$$.

### Proof

Consider the auxiliary stochastic process $$Q_{s} = (E[Q|\mathcal {F}_{s_{1}}] + \int _{s_{1}}^{s} H_{t} \,\mathrm {d} W_{t})(W_{s} - W_{s_{1}})$$ for $$s\in[s_{1},s_{2}]$$. By Itô’s formula,

$$\mathrm{d} Q_{s} = (W_{s} - W_{s_{1}})H_{s} \,\mathrm{d} W_{s} + \left(E[Q|\mathcal{F}_{s_{1}}] + \int_{s_{1}}^{s} H_{t} \,\mathrm{d} W_{t}\right) \,\mathrm{d} W_{s} + H_{s} \,\mathrm{d} s.$$

Noting that $$Q_{s_{1}}=0$$, we have

$$E[Q(W_{s_{2}} - W_{s_{1}})|\mathcal{F}_{s_{1}}] = E[Q_{s_{2}}|\mathcal{F}_{s_{1}}] = E\left[\int_{s_{1}}^{s_{2}}H_{s} \,\mathrm{d} s|\mathcal{F}_{s_{1}}\right].$$
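As an illustrative check of Lemma 2 (a worked example added under the simplifying assumption d=1, not part of the original argument), write $$h = s_{2}-s_{1}$$ and take $$Q = (W_{s_{2}} - W_{s_{1}})^{3}$$. By Itô’s formula, $$M_{s} = (W_{s} - W_{s_{1}})^{3} + 3(W_{s} - W_{s_{1}})(s_{2}-s)$$ is a martingale on $$[s_{1},s_{2}]$$ with $$M_{s_{1}} = 0$$ and $$M_{s_{2}} = Q$$, so that

$$E[Q|\mathcal{F}_{s_{1}}] = 0, \qquad H_{s} = 3(W_{s} - W_{s_{1}})^{2} + 3(s_{2}-s).$$

Both sides of the identity in Lemma 2 can then be computed directly and agree:

\begin{aligned} E[Q(W_{s_{2}} - W_{s_{1}})|\mathcal{F}_{s_{1}}] &= E|W_{s_{2}} - W_{s_{1}}|^{4} = 3h^{2}, \\ E\left[\int_{s_{1}}^{s_{2}}H_{s} \,\mathrm{d} s|\mathcal{F}_{s_{1}}\right] &= \int_{s_{1}}^{s_{2}}\left[3(s-s_{1}) + 3(s_{2}-s)\right] \mathrm{d} s = 3h^{2}. \end{aligned}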

### Proof of Lemma 1

Let

\begin{aligned} \delta Z_{i} &= Z_{t_{i}}^{\pi,1} - Z_{t_{i}}^{\pi,2},\\ \delta b_{i} &= b\left(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1}\right) - b\left(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}\right), \\ \delta \sigma_{i} &= \sigma\left(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1}\right) - \sigma\left(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}\right), \\ \delta f_{i} &= f\left(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1},Z_{t_{i}}^{\pi,1}\right) - f\left(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2},Z_{t_{i}}^{\pi,2}\right). \end{aligned}

Then we have

\begin{aligned} \delta X_{i+1} &= \delta X_{i} + \delta b_{i} h + \delta \sigma_{i} \Delta W_{i}, \end{aligned}
(4.5)
\begin{aligned} \delta Z_{i} &= \frac{1}{h}E[\delta {Y_{i+1}}\Delta {W_{i}}|{\mathcal{F}_{t_{i}}}], \end{aligned}
(4.6)
\begin{aligned} \delta Y_{i} &= E[\delta Y_{i+1} + \delta f_{i} h|\mathcal{F}_{t_{i}}].\\ \end{aligned}
(4.7)

By the martingale representation theorem, there exists an $$\mathscr {F}_{t}$$-adapted square-integrable process $$\{\delta Z_{t}\}_{t_{i}\le t\le t_{i+1}}$$ such that

$$\delta Y_{i+1} = E[\delta Y_{i+1}|\mathcal{F}_{t_{i}}] + \int_{t_{i}}^{t_{i+1}}(\delta Z_{t})^{\mathrm{T}}\, \mathrm{d}W_{t},$$

or, equivalently,

\begin{aligned} \delta Y_{i+1} &= \delta Y_{i} - \delta f_{i} h + \int_{t_{i}}^{t_{i+1}}(\delta Z_{t})^{\mathrm{T}} \, \mathrm{d}W_{t}. \end{aligned}
(4.8)

From Eqs. (4.5) and (4.8), noting that δXi,δYi,δbi,δσi, and δfi are all $$\mathcal {F}_{t_{i}}$$-measurable, and $$E[\Delta W_{i}|\mathcal {F}_{t_{i}}] = 0, E\left [\int _{t_{i}}^{t_{i+1}} (\delta Z_{t})^{\mathrm {T}} \, \mathrm {d}W_{t}|\mathcal {F}_{t_{i}}\right ] = 0$$, we have

\begin{aligned} E|\delta X_{i+1}|^{2} &= E|\delta X_{i} + \delta b_{i} h|^{2} + E\left[(\Delta W_{i})^{\mathrm{T}}(\delta \sigma_{i})^{\mathrm{T}}\delta \sigma_{i} \Delta W_{i}\right] \\ &= E|\delta X_{i} + \delta b_{i} h|^{2} + hE\|\delta \sigma_{i}\|^{2}, \end{aligned}
(4.9)
\begin{aligned} E|\delta Y_{i+1}|^{2} &= E|\delta Y_{i} - \delta f_{i} h|^{2} + \int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2} \, \mathrm{d}t. \end{aligned}
(4.10)

From Eq. (4.9), by Assumptions 1 and 2 and the root-mean square and geometric mean inequality (RMS-GM inequality), for any λ1>0, we have

\begin{aligned} &E|\delta X_{i+1}|^{2}\\ =~&E|\delta X_{i}|^{2} + E|\delta b_{i}|^{2} h^{2} + hE\|\delta \sigma_{i}\|^{2} \\ & + 2hE\left[\left(b\left(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1}\right) -b\left(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,1}\right)\right)^{\mathrm{T}}\delta X_{i} \right] \\ & + 2hE\left[\left(b\left(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,1}\right) -b\left(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2}\right)\right)^{\mathrm{T}}\delta X_{i} \right] \\ \le~& E|\delta X_{i}|^{2} + \left(KE|\delta X_{i}|^{2} + b_{y} E|\delta Y_{i}|^{2}\right)h^{2} + 2k_{b}h E|\delta X_{i}|^{2} \\ & +\left(\lambda_{1} E|\delta X_{i}|^{2} + \lambda_{1}^{-1}b_{y} E|\delta Y_{i}|^{2}\right)h + \left(\sigma_{x} E|\delta X_{i}|^{2} + \sigma_{y} E|\delta Y_{i}|^{2}\right)h \\ =~& [1 + (2k_{b} + \lambda_{1}+ \sigma_{x} + Kh)h]E|\delta X_{i}|^{2} + \left[\left(\lambda_{1}^{-1}+h\right)b_{y} + \sigma_{y}\right]E|\delta Y_{i}|^{2}h. \end{aligned}

Recall $$A_{1} = 2k_{b} + \lambda _{1}+ \sigma _{x} + Kh, A_{2} =(\lambda _{1}^{-1}+h)b_{y} + \sigma _{y}, E|\delta X_{0}|^{2} = 0$$. By induction we can obtain that, for $$0\le n\le N$$,

$$E|\delta X_{n}|^{2} \le A_{2}\sum_{i = 0}^{n-1}e^{A_{1}(n-i-1)h}E|\delta Y_{i}|^{2}h.$$
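The induction step behind this bound only uses the one-step estimate and the elementary inequality 0 ≤ 1 + A1h ≤ exp(A1h). The following is a minimal numerical sketch of that step, not taken from the paper: it treats the one-step estimate as an equality (the worst case), with arbitrary illustrative values of A1, A2, h and an arbitrary nonnegative sequence `y` standing in for E|δYi|². The function name is hypothetical.

```python
import numpy as np

def forward_recursion_and_bound(A1, A2, h, y):
    """Worst case of the one-step estimate,
        a_{i+1} = (1 + A1*h) * a_i + A2 * h * y_i,   a_0 = 0,
    compared with the closed-form bound
        A2 * sum_{i<n} exp(A1*(n-i-1)*h) * y_i * h.
    Here y_i plays the role of E|dY_i|^2 (illustrative input)."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    if 1.0 + A1 * h < 0.0:
        raise ValueError("need 1 + A1*h >= 0, i.e. h small enough")
    a = np.zeros(N + 1)
    for n in range(N):
        a[n + 1] = (1.0 + A1 * h) * a[n] + A2 * h * y[n]
    bound = np.zeros(N + 1)
    for n in range(1, N + 1):
        i = np.arange(n)
        bound[n] = A2 * h * np.sum(np.exp(A1 * (n - i - 1) * h) * y[i])
    return a, bound

# Illustrative values only (assumptions, not constants from the paper).
rng = np.random.default_rng(0)
y = rng.random(50)
a, bound = forward_recursion_and_bound(A1=1.3, A2=0.7, h=0.02, y=y)
ok = bool(np.all(a <= bound + 1e-12))
```

The check also passes for negative A1 as long as 1 + A1h stays nonnegative, which is one reason the lemma asks for sufficiently small h.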

Similarly, from Eq. (4.10), for any λ2>0, we have

\begin{aligned} &\qquad~~\, E|\delta Y_{i+1}|^{2} \\ &\quad\ge~ E|\delta Y_{i}|^{2} +\int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2} \, \mathrm{d}t\\ &\qquad - 2hE\left[\left(f\left(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,1},Z_{t_{i}}^{\pi,1}\right) - f\left(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,2},Z_{t_{i}}^{\pi,1}\right)\right)^{\mathrm{T}}\delta Y_{i}\right] \\ & \qquad- 2hE\left[\left(f\left(t_{i},X_{t_{i}}^{\pi,1},Y_{t_{i}}^{\pi,2},Z_{t_{i}}^{\pi,1}\right) - f\left(t_{i},X_{t_{i}}^{\pi,2},Y_{t_{i}}^{\pi,2},Z_{t_{i}}^{\pi,2}\right)\right)^{\mathrm{T}}\delta Y_{i}\right] \\ &\quad\ge~ E|\delta Y_{i}|^{2} + \int_{t_{i}}^{t_{i+1}}E|\delta Z_{t}|^{2} \, \mathrm{d}t - 2k_{f}h E|\delta Y_{i}|^{2} \\ &\qquad- \left[\lambda_{2} E|\delta Y_{i}|^{2} + \lambda_{2}^{-1}\left(f_{x}E|\delta X_{i}|^{2} + f_{z}E|\delta Z_{i}|^{2}\right)\right]h. \end{aligned}
(4.11)

To deal with the integral term in (4.11), we apply Lemma 2 to (4.6) and (4.8) and get

\begin{aligned} \delta Z_{i} &= \frac{1}{h}E\left[\int_{t_{i}}^{t_{i+1}}\delta Z_{t} \, \mathrm{d}t |\mathcal{F}_{t_{i}}\right], \end{aligned}

which implies, by the Cauchy inequality,

\begin{aligned} E|\delta Z_{i}|^{2}h &= \sum_{k=1}^{d} E|(\delta Z_{i})_{k}|^{2} h = \sum_{k=1}^{d}\frac{1}{h}E\left|E\left[\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})_{k} \,\mathrm{d}t|\mathcal{F}_{t_{i}}\right]\right|^{2} \\& \le \sum_{k=1}^{d}\frac{1}{h} E\left|\int_{t_{i}}^{t_{i+1}}(\delta Z_{t})_{k} \,\mathrm{d}t \right|^{2} \le \sum_{k=1}^{d}\int_{t_{i}}^{t_{i+1}} E|(\delta Z_{t})_{k}|^{2}\,\mathrm{d} t \\& = \int_{t_{i}}^{t_{i+1}} E|\delta Z_{t}|^{2}\,\mathrm{d} t, \end{aligned}

where (·)k denotes the k-th component of the vector. Plugging it into (4.11) gives us

\begin{aligned} E|\delta Y_{i+1}|^{2} \ge [1 - (2k_{f} + \lambda_{2})h]E|\delta Y_{i}|^{2} +\left(1-f_{z}\lambda_{2}^{-1}\right)E|\delta Z_{i}|^{2}h - f_{x}\lambda_{2}^{-1} E|\delta X_{i}|^{2}h. \end{aligned}
(4.12)

Then for any $$\lambda_{2}\ge f_{z}$$ and sufficiently small h satisfying $$(2k_{f}+\lambda_{2})h<1$$, we have

$$E|\delta Y_{i}|^{2} \le[1 - (2k_{f} + \lambda_{2})h]^{-1}\left[E|\delta Y_{i+1}|^{2} + f_{x}\lambda_{2}^{-1} E|\delta X_{i}|^{2}h\right].$$

Recall $$A_{3} = -h^{-1}\ln [1-(2k_{f}+\lambda _{2})h], A_{4} = f_{x}\lambda _{2}^{-1}[1 - (2k_{f} + \lambda _{2})h]^{-1}$$. By induction we obtain that, for $$0\le n\le N$$,

$$E|\delta Y_{n}|^{2} \le e^{A_{3}(N-n)h}E|\delta Y_{N}|^{2} + A_{4}\sum_{i = n}^{N-1}e^{A_{3}(i-n)h}E|\delta X_{i}|^{2}h.$$
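Since [1 − (2kf + λ2)h]⁻¹ equals exp(A3h) exactly, the backward one-step estimate unrolls into the stated closed form without any slack. The sketch below illustrates this under stated assumptions: the one-step estimate is taken with equality, the constants are arbitrary illustrative values, and `x` is an arbitrary nonnegative sequence standing in for E|δXi|². The function name is hypothetical.

```python
import numpy as np

def backward_recursion_and_closed_form(kf, lam2, fx, h, yN, x):
    """Unroll y_i = [1 - (2kf+lam2)h]^{-1} * (y_{i+1} + fx/lam2 * x_i * h)
    backward from y_N and compare with
        exp(A3*(N-n)*h) * y_N + A4 * sum_{i=n}^{N-1} exp(A3*(i-n)*h) * x_i * h."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    c = (2.0 * kf + lam2) * h
    if c >= 1.0:
        raise ValueError("need (2*kf + lam2) * h < 1")
    A3 = -np.log1p(-c) / h            # so that exp(A3*h) = 1/(1 - c)
    A4 = fx / ((1.0 - c) * lam2)
    y = np.empty(N + 1)
    y[N] = yN
    for i in range(N - 1, -1, -1):
        y[i] = (y[i + 1] + fx / lam2 * x[i] * h) / (1.0 - c)
    closed = np.empty(N + 1)
    for n in range(N + 1):
        i = np.arange(n, N)
        closed[n] = (np.exp(A3 * (N - n) * h) * yN
                     + A4 * h * np.sum(np.exp(A3 * (i - n) * h) * x[i]))
    return y, closed

# Illustrative values only (assumptions, not constants from the paper).
rng = np.random.default_rng(1)
x = rng.random(40)
y_rec, y_closed = backward_recursion_and_closed_form(
    kf=0.4, lam2=1.5, fx=0.8, h=0.025, yN=0.3, x=x)
```

With an inequality in the one-step estimate the same unrolling gives the upper bound of Lemma 1, because every factor involved is positive.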

Now we are ready to prove Theorem 1, whose precise statement is given below. Note that its conditions are satisfied if any of the five cases in the weak coupling and monotonicity conditions holds.

### Theorem 1 ′

Suppose Assumptions 1, 2, 3, and 4 hold true and there exist $$\lambda_{1}>0$$, $$\lambda_{2}\ge f_{z}$$ such that $$\overline {A_{0}}<1$$, where

\begin{aligned} & \overline{A_{1}} \mathrel{\mathop:}= 2k_{b} + \lambda_{1} + \sigma_{x}, \\ & \overline{A_{2}} \mathrel{\mathop:}= b_{y}\lambda_{1}^{-1} + \sigma_{y}, \\ & \overline{A_{3}} \mathrel{\mathop:}= 2k_{f}+\lambda_{2}, \\ & \overline{A_{4}} \mathrel{\mathop:}= f_{x}\lambda_{2}^{-1}, \\ & \overline{A_{0}} \mathrel{\mathop:}= \overline{A_{2}}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}} \left\{g_{x} e^{(\overline{A_{1}}+\overline{A_{3}})T} + \overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}} \right\}. \end{aligned}
(4.13)

Then there exists a constant C>0, depending on $$E|\xi |^{2}, \mathscr {L}$$, T, λ1, and λ2, such that for sufficiently small h,

\begin{aligned} \sup_{t \in [0,T]} \left(E|X_{t} - \hat{X}_{t}^{\pi}|^{2} + E|Y_{t} - \hat{Y}_{t}^{\pi}|^{2}\right) + \int_{0 }^{T} E|Z_{t} - \hat{Z}_{t}^{\pi}|^{2} \, \mathrm{d}t \le C\left[h + E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}\right], \end{aligned}
(4.14)

where $$\hat {X}_{t}^{\pi } = X_{t_{i}}^{\pi }, \hat {Y}_{t}^{\pi } = Y_{t_{i}}^{\pi }, \hat {Z}_{t}^{\pi } = Z_{t_{i}}^{\pi }$$ for $$t\in[t_{i},t_{i+1})$$.
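For given problem constants, the condition in (4.13) is straightforward to evaluate. The following is a minimal sketch, assuming the constants from Assumptions 1 to 3 are available as plain floats; the function names and the numerical values are illustrative and not part of the paper. The helper handles the removable singularity when the sum of the first and third constants vanishes.

```python
import math

def _ratio(a, T):
    """(exp(a*T) - 1) / a, extended by its limit T at a = 0."""
    return T if abs(a) < 1e-12 else math.expm1(a * T) / a

def a0_bar(kb, kf, sigma_x, sigma_y, b_y, f_x, f_z, g_x, T, lam1, lam2):
    """Evaluate the constant A0-bar of (4.13) for given lam1 > 0, lam2 >= f_z."""
    if lam1 <= 0 or lam2 < f_z or lam2 <= 0:
        raise ValueError("need lam1 > 0 and lam2 >= f_z (and lam2 > 0)")
    A1 = 2.0 * kb + lam1 + sigma_x
    A2 = b_y / lam1 + sigma_y
    A3 = 2.0 * kf + lam2
    A4 = f_x / lam2
    a = A1 + A3
    first = _ratio(-a, T)                  # (1 - exp(-a*T)) / a
    second = g_x * math.exp(a * T) + A4 * _ratio(a, T)
    return A2 * first * second

# Illustrative constants (assumptions): a short horizon satisfies the
# condition, a longer one with the same constants does not.
consts = dict(kb=1.0, kf=1.0, sigma_x=1.0, sigma_y=1.0, b_y=1.0,
              f_x=1.0, f_z=1.0, g_x=1.0, lam1=1.0, lam2=1.0)
short_T = a0_bar(T=0.01, **consts)
long_T = a0_bar(T=1.0, **consts)
```

In practice one would search over λ1 and λ2 to make the value as small as possible; the sketch only evaluates the expression for one choice.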

### Remark 5

The above theorem also implies the coercivity of the objective function (2.4) used in the deep BSDE method. Formally speaking, the coercivity means that if $$\sum _{i = 0}^{N-1}E|Z_{t_{i}}^{\pi }|^{2} + E|Y_{0}^{\pi }|^{2} \rightarrow +\infty$$, we have $$E|g\left (X_{T}^{\pi }\right) - Y_{T}^{\pi }|^{2} \rightarrow + \infty$$, which is a direct result from Theorem 1’.

### Remark 6

If any of the weak coupling and monotonicity conditions introduced in Assumption 3 holds to a sufficient extent, there must exist λ1,λ2 satisfying the conditions in Theorem 1’. We discuss the five cases in what follows.

1. Suppose all other constants and $$\lambda_{1}>0$$, $$\lambda_{2}\ge f_{z}$$ are fixed. If T>0 is sufficiently small, then the second factor of $$\overline {A_{0}}$$ could be sufficiently close to 0 such that $$\overline {A_{0}} < 1$$.

2. Suppose all other constants and $$\lambda_{1}>0$$, $$\lambda_{2}\ge f_{z}$$ are fixed. If by≥0 and σy≥0 are sufficiently small, then $$\overline {A_{2}} \geq 0$$ could be sufficiently small such that $$\overline {A_{0}} < 1$$.

3. Suppose all other constants and $$\lambda_{1}>0$$, $$\lambda_{2}\ge f_{z}$$ are fixed. If fx≥0 and gx≥0 are sufficiently small, then $$\overline {A_{4}}$$ and thus the last factor in $$\overline {A_{0}}$$ could be sufficiently close to 0 such that $$\overline {A_{0}}<1$$.

4. Suppose all constants except kf and λ2>0 are fixed. Let $$\overline {A_{1}}' \mathrel {\mathop :}= \overline {A_{1}} + \overline {A_{3}} = 2k_{b} + 2k_{f} + \sigma _{x} + \lambda _{1} + \lambda _{2}$$ and rewrite $$\overline {A_{0}}$$ as

\begin{aligned} \overline{A_{0}} = \overline{A_{2}}\left\{g_{x} \frac{e^{\overline{A_{1}}^{\prime}T} - 1}{\overline{A_{1}}^{\prime}} + \overline{A_{4}}\frac{e^{\overline{A_{1}}^{\prime}T}+e^{-\overline{A_{1}}^{\prime}T}-2}{\left(\overline{A_{1}}^{\prime}\right)^{2}} \right\}. \end{aligned}

It is straightforward to check that there exists a negative constant C1 such that when $$\overline {A_{1}}' \leq C_{1}, \left (e^{\overline {A_{1}}'T}-1\right)/\overline {A_{1}}' < 1/\left (2\overline {A_{2}}g_{x}\right)$$. By the definition of $$\overline {A_{1}}'$$, if kf is sufficiently negative, there exists $$\lambda_{2}\ge f_{x}$$ such that $$\overline {A_{1}}'=C_{1}$$ and λ2 is sufficiently large to ensure

\begin{aligned} \overline{A_{2}}\,\overline{A_{4}}\frac{e^{C_{1}T} + e^{-C_{1}T}-2}{C_{1}^{2}}=\frac{f_{x}\overline{A_{2}}\left(e^{C_{1}T} + e^{-C_{1}T}-2\right)}{\lambda_{2}C_{1}^{2}} < \frac12. \end{aligned}

Combining these two estimates gives $$\overline {A_{0}}<1$$.

5. Noting that kb and kf play the same role in $$\overline {A_{1}}'$$, we use the same argument as above to show that when kb is sufficiently negative, there exists $$\lambda_{2}\ge f_{x}$$ such that $$\overline {A_{0}}<1$$.
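The construction in case 4 can be checked numerically. The sketch below uses purely illustrative constants (none taken from the paper) and hypothetical function names: it confirms that the rewritten form of the constant agrees with its definition in (4.13), picks a negative C1 and a large λ2 as in the remark, and recovers the value of kf that realizes this C1.

```python
import math

def a0_definition(A2, A4, g_x, a, T):
    """A0-bar as defined in (4.13), with a = A1-bar + A3-bar (a != 0)."""
    return (A2 * (1.0 - math.exp(-a * T)) / a
            * (g_x * math.exp(a * T) + A4 * (math.exp(a * T) - 1.0) / a))

def a0_rewritten(A2, A4, g_x, a, T):
    """The equivalent form of A0-bar used in case 4 of Remark 6."""
    return A2 * (g_x * (math.exp(a * T) - 1.0) / a
                 + A4 * (math.exp(a * T) + math.exp(-a * T) - 2.0) / a ** 2)

def kf_for_target(kb, sigma_x, lam1, lam2, C1):
    """k_f such that 2*kb + 2*kf + sigma_x + lam1 + lam2 equals C1."""
    return 0.5 * (C1 - 2.0 * kb - sigma_x - lam1 - lam2)

# Illustrative constants (assumptions).
kb, sigma_x, sigma_y, b_y, f_x, g_x, T = 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0
lam1, lam2, C1 = 1.0, 30.0, -5.0
A2 = b_y / lam1 + sigma_y
A4 = f_x / lam2

first_term = A2 * g_x * (math.exp(C1 * T) - 1.0) / C1
second_term = A2 * A4 * (math.exp(C1 * T) + math.exp(-C1 * T) - 2.0) / C1 ** 2
a0 = a0_rewritten(A2, A4, g_x, C1, T)
kf = kf_for_target(kb, sigma_x, lam1, lam2, C1)
```

With these values each of the two terms is below 1/2, so their sum is below 1, while the required kf is strongly negative. The choice λ2 = 30 is also compatible with the requirement of Theorem 1’ for any fz up to 30.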

### Proof

From the proof of this theorem and throughout the remainder of the paper, we use C to generally denote a constant that only depends on $$E|\xi |^{2}, \mathscr {L}$$, and T, whose value may change from line to line when there is no need to distinguish. We also use C(·) to generally denote a constant depending on $$E|\xi |^{2}, \mathscr {L}$$, T and the constants represented by ·.

We use the same notations as Lemma 1. Let $$X_{t_{i}}^{\pi,1} = X_{t_{i}}^{\pi }, Y_{t_{i}}^{\pi,1} = Y_{t_{i}}^{\pi }, Z_{t_{i}}^{\pi,1} = Z_{t_{i}}^{\pi }$$ (defined in system (2.3)) and $$X_{t_{i}}^{\pi,2} = \overline {X}_{t_{i}}^{\pi }, Y_{t_{i}}^{\pi,2} = \overline {Y}_{t_{i}}^{\pi }, Z_{t_{i}}^{\pi,2} = \overline {Z}_{t_{i}}^{\pi }$$ (defined in system (3.3)). It can be easily checked that both $$\left (\left \{X_{t_{i}}^{\pi,j}\right \}_{0\le i \le N}, \left \{Y_{t_{i}}^{\pi,j}\right \}_{0\le i \le N}, \left \{Z_{t_{i}}^{\pi,j}\right \}_{0\le i \le N-1}\right), j=1,2$$ satisfy the system of Eq. (4.3). Our proof strategy is to use Lemma 1 to bound the difference between two solutions through the objective function $$E|g\left (X_{T}^{\pi }\right) - Y_{T}^{\pi }|^{2}$$. This allows us to apply Theorem 5 to derive the desired estimates.

To begin with, note that for any λ3>0, the RMS-GM inequality yields

$$E|\delta Y_{N}|^{2} = E|g\left(\overline{X}_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \le \left(1 + \lambda_{3}^{-1}\right)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} + g_{x}(1 + \lambda_{3})E|\delta X_{N}|^{2}.$$

Let

\begin{aligned} P = {\max_{0\le n \le N}e^{-A_{1}nh}E|\delta X_{n}|^{2}}, \quad S = {\max_{0\le n \le N} e^{A_{3}nh}E|\delta Y_{n}|^{2}}. \end{aligned}

Lemma 1 tells us

$$e^{-A_{1}nh}E|\delta X_{n}|^{2} \le A_{2}\sum_{i=0}^{n-1}e^{-A_{1}(i+1)h}E|\delta Y_{i}|^{2}h \le A_{2}S\sum_{i=0}^{n-1}e^{-A_{1}(i+1)h-A_{3}ih}h,$$

and

\begin{aligned} &e^{A_{3}nh}E|\delta Y_{n}|^{2} \\ \le & e^{A_{3}T}E|\delta Y_{N}|^{2} + A_{4}\sum_{i = n}^{N-1}e^{A_{3}ih}E|\delta X_{i}|^{2}h \\ \le &e^{A_{3}T}\left[\left(1 + \lambda_{3}^{-1}\right)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} + g_{x}(1 + \lambda_{3})E|\delta X_{N}|^{2}\right] + A_{4}\sum_{i = n}^{N-1}e^{A_{3}ih}E|\delta X_{i}|^{2}h \\ \le &e^{A_{3}T}\left(1 + \lambda_{3}^{-1}\right)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} + \left[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T} + A_{4}\sum_{i=n}^{N-1}e^{(A_{1} + A_{3})ih}h\right]P. \end{aligned}

Therefore by definition of P and S, we have

\begin{aligned} P &\le A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T} - 1}{e^{-(A_{1}+A_{3})h}-1}S, \\ S &\le e^{A_{3}T}\left(1+\lambda_{3}^{-1}\right)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} + \left[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+ A_{4}h\frac{e^{(A_{1}+A_{3})T}-1}{e^{(A_{1}+A_{3})h }-1}\right]P. \end{aligned}

Consider the function

\begin{aligned} A(h) & = A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T} - 1}{e^{-(A_{1}+A_{3})h}-1}\left[g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+ A_{4}h\frac{e^{(A_{1}+A_{3})T}-1}{e^{(A_{1}+A_{3})h }-1}\right]. \end{aligned}

When A(h)<1, we have

\begin{aligned} P &\le [1 - A(h)]^{-1}e^{A_{3}T}\left(1+\lambda_{3}^{-1}\right)A_{2}h e^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T} - 1}{e^{-(A_{1}+A_{3})h}-1}E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}, \\ S &\le [1 - A(h)]^{-1}e^{A_{3}T}\left(1+\lambda_{3}^{-1}\right)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}. \end{aligned}
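For completeness, the step from the coupled inequalities to these bounds can be spelled out as follows. The shorthand α(h), β, γ(h) is introduced here only for this explanation: write the two preceding inequalities as $$P \le \alpha(h) S$$ and $$S \le \beta + \gamma(h) P$$ with

\begin{aligned} \alpha(h) &= A_{2}he^{-A_{1}h}\frac{e^{-(A_{1}+A_{3})T} - 1}{e^{-(A_{1}+A_{3})h}-1}, \qquad \beta = e^{A_{3}T}\left(1+\lambda_{3}^{-1}\right)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}, \\ \gamma(h) &= g_{x}(1+\lambda_{3})e^{(A_{1}+A_{3})T}+ A_{4}h\frac{e^{(A_{1}+A_{3})T}-1}{e^{(A_{1}+A_{3})h }-1}, \end{aligned}

so that $$A(h) = \alpha(h)\gamma(h)$$ and all three quantities are nonnegative. Substituting the first inequality into the second gives $$S \le \beta + A(h)S$$; since S is finite, $$A(h)<1$$ yields

$$S \le [1-A(h)]^{-1}\beta, \qquad P \le \alpha(h)S \le [1-A(h)]^{-1}\alpha(h)\beta.$$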

Let

$$\overline{P} = \max_{0\le n\le N}e^{-\overline{A_{1}}nh}E|\delta X_{n}|^{2}, \quad \overline{S} = \max_{0\le n \le N}e^{\overline{A_{3}}nh}E|\delta Y_{n}|^{2}.$$
(4.15)

Recall

\begin{aligned} {\lim}_{h\rightarrow 0}A_{i} = \overline{A_{i}}, \quad i=1,2,3,4, \end{aligned}

and note that

\begin{aligned} {\lim}_{h\rightarrow 0}A(h) = \overline{A_{2}}\frac{1-e^{-\left(\overline{A_{1}}+\overline{A_{3}}\right)T}}{\overline{A_{1}}+\overline{A_{3}}}\left[g_{x}(1+\lambda_{3})e^{(\overline{A_{1}}+\overline{A_{3}})T} + \overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}}\right]. \end{aligned}

When $$\overline {A_{0}} < 1$$, comparing $${\lim }_{h\rightarrow 0}A(h)$$ and $$\overline {A_{0}}$$, we know that, for any ε>0, there exist λ3>0 and sufficiently small h such that

\begin{aligned} \overline{P} &\le (1+\epsilon)\left[1 - \overline{A_{0}}\right]^{-1} \overline{A_{2}} e^{\overline{A_{3}}T}\left(1+\lambda_{3}^{-1}\right)\frac{ 1- e^{-(\overline{A_{1}}+\overline{A_{3}})T} }{\overline{A_{1}}+\overline{A_{3}}}E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}, \end{aligned}
(4.16)
\begin{aligned} \overline{S} &\le (1+\epsilon)\left[1 - \overline{A_{0}}\right]^{-1}e^{\overline{A_{3}}T}\left(1+\lambda_{3}^{-1}\right)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}. \end{aligned}
(4.17)

By fixing ε=1 and choosing suitable λ3, we obtain our error estimates of $$E|\delta X_{n}|^{2}$$ and $$E|\delta Y_{n}|^{2}$$ as

\begin{aligned} \max_{0\le n \le N}E|\delta X_{n}|^{2} &\le e^{\overline{A_{1}}T\vee 0} \overline{P} \le C(\lambda_{1},\lambda_{2}) E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}, \end{aligned}
(4.18)
\begin{aligned} \max_{0\le n \le N}E|\delta Y_{n}|^{2} &\le e^{(-\overline{A_{3}}T)\vee 0}\overline{S} \le C(\lambda_{1},\lambda_{2}) E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}. \end{aligned}
(4.19)

To estimate $$E|\delta Z_{n}|^{2}$$, we consider estimate (4.12), in which λ2 can take any value no smaller than fz. If fz≠0, we choose λ2=2fz and obtain

$$\frac{1}{2}E|\delta Z_{i}|^{2} h \le \frac{f_{x}}{2f_{z}}E|\delta X_{i}|^{2}h + E|\delta Y_{i+1}|^{2} - [1-(2k_{f}+2f_{z})h]E|\delta Y_{i}|^{2}.$$

Summing from 0 to N−1 gives us

\begin{aligned} \sum_{i = 0}^{N-1}E|\delta Z_{i}|^{2}h \le~ & \frac{f_{x}T}{f_{z}}\max_{0\le n\le N}E|\delta X_{n}|^{2} + [4(k_{f}+f_{z})T \vee 0 + 2]\max_{0\le n\le N}E|\delta Y_{n}|^{2} \\ \le~ & C(\lambda_{1},\lambda_{2}) E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2}. \end{aligned}
(4.20)

The case fz=0 can be dealt with similarly by choosing λ2=1 and the same type of estimate can be derived. Finally, combining estimates (4.18), (4.19) and (4.20) with Theorem 5, we prove the statement in Theorem 1’. □

## An upper bound for the minimized objective function

We prove Theorem 2 in this section. We first state three useful lemmas. Theorem 2’, as a detailed statement of Theorem 2, and Theorem 6, as a variation of Theorem 2’ under stronger conditions, are then provided, followed by their proofs. The proofs of the three lemmas are given at the end of the section.

The main process we analyze is (2.3). Lemma 3 gives an estimate of the final distance $$E|g\left (X_{T}^{\pi }\right) - Y_{T}^{\pi }|^{2}$$ provided by (2.3) in terms of the deviation between the approximated variables $$Y_{0}^{\pi }, Z_{t_{i}}^{\pi }$$ and the true solutions.

### Lemma 3

Suppose Assumptions 1, 2, and 3 hold true. Let $$X_{T}^{\pi }, Y_{0}^{\pi }, Y_{T}^{\pi }, \{Z_{t_{i}}^{\pi }\}_{0\le i \le N-1}$$ be defined as in system (2.3) and $$\tilde {Z}_{t_{i}} =h^{-1}E\left [\int _{t_{i}}^{t_{i+1}}Z_{t} \,\mathrm {d} t|\mathcal {F}_{t_{i}}\right ]$$. Given λ4>0, there exists a constant C>0 depending on $$E|\xi |^{2}, \mathscr {L}$$, T, and λ4, such that for sufficiently small h,

\begin{aligned} E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \le (1 + \lambda_{4})H_{\text{min}}\sum_{i=0}^{N-1}E|\delta \tilde{Z}_{t_{i}}|^{2}h + C\left[h + E|Y_{0} - Y_{0}^{\pi}|^{2}\right], \end{aligned}

where $$\delta \tilde {Z}_{t_{i}} = \tilde {Z}_{t_{i}} - Z_{t_{i}}^{\pi }, H(x) =(1 + \sqrt {g_{x}})^{2}e^{(2K + 2Kx^{-1} + x)T}\left (1 + f_{z}x^{-1}\right)$$, and $$H_{\text {min}} = \min _{x \in \mathbb {R}^{+}} H(x)$$.
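The constant Hmin has no closed form in general, but it is easy to compute. The following is a minimal sketch, assuming K, fz, gx and T are given as floats and that K or fz is positive (otherwise the minimum is not attained). It relies on the observation, not stated in the paper, that the derivative of log H(x) is increasing in x, so its unique root can be found by bisection. The function names are illustrative.

```python
import math

def H(x, K, f_z, g_x, T):
    """H(x) = (1 + sqrt(g_x))^2 * exp((2K + 2K/x + x) T) * (1 + f_z/x)."""
    return ((1.0 + math.sqrt(g_x)) ** 2
            * math.exp((2.0 * K + 2.0 * K / x + x) * T)
            * (1.0 + f_z / x))

def h_min(K, f_z, g_x, T, iters=200):
    """Minimize H over x > 0 by bisection on d/dx log H(x)."""
    if T <= 0 or (K <= 0 and f_z <= 0):
        raise ValueError("need T > 0 and at least one of K, f_z positive")

    def dlog_h(x):
        # Derivative of log H: T*(1 - 2K/x^2) - f_z / (x*(x + f_z)).
        return T * (1.0 - 2.0 * K / x ** 2) - f_z / (x * (x + f_z))

    lo, hi = 1e-8, 1.0          # derivative is negative near 0
    while dlog_h(hi) < 0.0:     # enlarge until the derivative is positive
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dlog_h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    x_star = 0.5 * (lo + hi)
    return x_star, H(x_star, K, f_z, g_x, T)

# Special case f_z = 0: the minimizer is sqrt(2K) in closed form.
x_star, value = h_min(K=0.5, f_z=0.0, g_x=1.0, T=1.0)
```

For fz = 0 the minimizer reduces to the square root of 2K, which gives a convenient sanity check of the routine.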

Lemma 3 is close to Theorem 2, except that $$\tilde {Z}_{t_{i}}$$ is not a function of $$X_{t_{i}}^{\pi }$$ and $$Y_{t_{i}}^{\pi }$$ defined in (2.3). To bridge this gap, we need the following two lemmas. First, similar to the proof of Theorem 1’, an estimate of the distance between the process defined in (2.3) and the process defined in (3.3) is also needed here. Lemma 4 is a general result to serve this purpose, providing an estimate of the difference between two backward processes driven by different forward processes.

### Lemma 4

Let $$X_{t_{i}}^{\pi,j} \in L^{2}(\Omega,\mathcal {F}_{t_{i}},\mathbb {P})$$ for $$0\le i\le N$$, j=1,2. Suppose $$\left \{Y_{t_{i}}^{\pi,j}\right \}_{0\le i \le N}$$ and $$\left \{Z_{t_{i}}^{\pi,j}\right \}_{0\le i \le N-1}$$ satisfy

\left\{\begin{aligned} Y_{T}^{\pi,j} &= g\left(X_{T}^{\pi,j}\right), \\ Z_{t_{i}}^{\pi,j} &= \frac{1}{h}E\left[Y_{t_{i+1}}^{\pi,j}\Delta W_{i}|\mathcal{F}_{t_{i}}\right], \\ Y_{t_{i}}^{\pi,j} &= E\left[{Y}_{t_{i+1}}^{\pi,j} + f\left(t_{i},{X}_{t_{i}}^{\pi,j},{Y}_{t_{i}}^{\pi,j},{Z}_{t_{i}}^{\pi,j}\right)h|\mathcal{F}_{t_{i}}\right], \end{aligned}\right.
(5.1)

for $$0\le i\le N-1$$, j=1,2. Let $$\delta X_{i} = X_{t_{i}}^{\pi,1} - X_{t_{i}}^{\pi,2}, \delta Z_{i} = Z_{t_{i}}^{\pi,1} - Z_{t_{i}}^{\pi,2}$$. Then for any λ7>fz and sufficiently small h, we have

\begin{aligned} \sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h\! \le\! \frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7} - f_{z}}\left\{g_{x} e^{A_{5}T - A_{5}h}E|\delta X_{N}|^{2} + \frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E|\delta X_{i}|^{2}h\right\}\!, \end{aligned}

where $$A_{5} \mathrel {\mathop :}= -h^{-1}\ln [1 -(2k_{f}+\lambda _{7})h]$$.
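The constant A5 depends on h, but it converges to λ7 + 2kf as h tends to 0, which is the limiting constant used later in Theorem 2’. A minimal numerical sketch of this limit follows; the parameter values are arbitrary illustrative assumptions and the function name is hypothetical.

```python
import math

def a5(h, kf, lam7):
    """A5 = -ln(1 - (2*kf + lam7)*h) / h, defined when (2*kf + lam7)*h < 1."""
    c = (2.0 * kf + lam7) * h
    if c >= 1.0:
        raise ValueError("need (2*kf + lam7) * h < 1")
    return -math.log1p(-c) / h   # log1p keeps accuracy for small h

# Illustrative values: 2*kf + lam7 = 3.
kf, lam7 = 0.5, 2.0
values = [a5(h, kf, lam7) for h in (0.1, 0.01, 1e-6)]
```

For positive 2kf + λ7 the values decrease toward the limit as h shrinks, with a gap of order h.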

Lemma 5 shows that, similar to the nonlinear Feynman–Kac formula, the discrete stochastic process defined in (2.3) can also be linked to some deterministic functions.

### Lemma 5

Let $$\left \{X_{t_{i}}^{\pi }\right \}_{0\le i\le N}, \left \{Y_{t_{i}}^{\pi }\right \}_{0\le i\le N}$$ be defined in (2.3). When $$h < 1/\sqrt {K}$$, there exist deterministic functions $$U_{i}^{\pi }: \mathbb {R}^{m} \times \mathbb {R} \rightarrow \mathbb {R}, V_{i}^{\pi }: \mathbb {R}^{m} \times \mathbb {R} \rightarrow \mathbb {R}^{d}$$ for $$0\le i\le N$$ such that $$Y_{t_{i}}^{\pi, '}=U_{i}^{\pi }\left (X_{t_{i}}^{\pi }, Y_{t_{i}}^{\pi }\right), Z_{t_{i}}^{\pi, '}=V_{i}^{\pi }\left (X_{t_{i}}^{\pi }, Y_{t_{i}}^{\pi }\right)$$ satisfy

\left\{\begin{aligned} Y_{t_{N}}^{\pi,^{\prime}} &= g\left(X_{t_{N}}^{\pi}\right), \\ Z_{t_{i}}^{\pi,^{\prime}} &= \frac{1}{h}E\left[Y_{t_{i+1}}^{\pi,^{\prime}}\Delta W_{i}|\mathcal{F}_{t_{i}}\right],\\ Y_{t_{i}}^{\pi,^{\prime}} &= E\left[{Y}_{t_{i+1}}^{\pi,^{\prime}} + f\left(t_{i},{X}_{t_{i}}^{\pi},{Y}_{t_{i}}^{\pi,^{\prime}},{Z}_{t_{i}}^{\pi,^{\prime}}\right)h|\mathcal{F}_{t_{i}}\right], \end{aligned}\right.
(5.2)

for $$0\le i\le N-1$$. If b and σ are independent of y, then there exist deterministic functions $$U_{i}^{\pi }: \mathbb {R}^{m} \rightarrow \mathbb {R}, V_{i}^{\pi }: \mathbb {R}^{m} \rightarrow \mathbb {R}^{d}$$ for $$0\le i\le N$$ such that $$Y_{t_{i}}^{\pi, '}=U_{i}^{\pi }\left (X_{t_{i}}^{\pi }\right), Z_{t_{i}}^{\pi, '}=V_{i}^{\pi }\left (X_{t_{i}}^{\pi }\right)$$ satisfy (5.2).

Now we are ready to prove Theorem 2, with a precise statement given below. As with Theorem 1’, the conditions below are satisfied if any of the five cases of the weak coupling and monotonicity conditions holds to a certain extent.

### Theorem 2 ′

Suppose Assumptions 1, 2, 3, and 4 hold true. Given any $$\lambda_{1},\lambda_{3}>0$$, $$\lambda_{2}\ge f_{z}$$, and $$\lambda_{7}>f_{z}$$, let $$\overline {A_{i}}, (i=1,2,3,4)$$ be defined in (4.13) and

\begin{aligned} \overline{A_{5}} \mathrel{\mathop:}= &~ \lambda_{7} + 2k_{f}, \\ \overline{A_{0}}^{\prime} \mathrel{\mathop:}= &~ \overline{A_{2}}\frac{1-e^{-(\overline{A_{1}}+\overline{A_{3}})T}}{\overline{A_{1}}+\overline{A_{3}}} \left\{g_{x}(1+\lambda_{3})e^{(\overline{A_{1}}+\overline{A_{3}})T} + \overline{A_{4}}\frac{e^{(\overline{A_{1}}+\overline{A_{3}})T}-1}{\overline{A_{1}}+\overline{A_{3}}} \right\}, \\ \overline{B_{0}} \mathrel{\mathop:}= &~ H_{\text{min}}\overline{A_{2}}e^{\overline{A_{3}}T} \frac{ 1- e^{-(\overline{A_{1}}+\overline{A_{3}})T} }{\overline{A_{1}}+\overline{A_{3}}}\left[1 - \overline{A_{0}}^{\prime}\right]^{-1}\left(1+\lambda_{3}^{-1}\right) \\ &~\times \frac{\lambda_{7}\left(e^{-\overline{A_{5}}T}\vee 1\right)}{\lambda_{7} - f_{z}}\left\{g_{x}e^{(\overline{A_{1}}+\overline{A_{5}})T} + \frac{f_{x}}{\lambda_{7}}\frac{e^{(\overline{A_{1}}+\overline{A_{5}})T}-1}{\overline{A_{1}}+\overline{A_{5}}} \right\}. \end{aligned}
(5.3)

If there exist λ1,λ2,λ3,λ7 satisfying $$\overline {A_{0}}' < 1$$ and $$\overline {B_{0}} < 1$$, then there exists a constant C depending on $$E|\xi |^{2}, \mathscr {L}$$, T, λ1,λ2,λ3, and λ7, such that for sufficiently small h,

$$E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \le C\left\{h + E|Y_{0} - Y_{0}^{\pi}|^{2} + \sum_{i=0}^{N-1}E|E\left[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right] - Z_{t_{i}}^{\pi}|^{2}h \right\},$$
(5.4)

where $$\tilde {Z}_{t_{i}} = h^{-1}E\left [\int _{t_{i}}^{t_{i+1}}Z_{t} \,\mathrm {d} t|\mathcal {F}_{t_{i}}\right ]$$. If Zt is càdlàg, we can replace $$\tilde {Z}_{t_{i}}$$ with $$Z_{t_{i}}$$. If b and σ are independent of y, we can replace $$E\left [\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right ]$$ with $$E\left [\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi }\right ]$$.

### Remark 7

If we take the infimum within the domains of $$Y_{0}^{\pi }$$ and $$Z_{t_{i}}^{\pi }$$ on both sides, we recover the original statement in Theorem 2.

### Remark 8

If any of the weak coupling and monotonicity conditions introduced in Assumption 3 holds to a sufficient extent, there must exist λ1,λ2,λ3,λ7 satisfying the conditions in Theorem 2’. The arguments are very similar to those provided in Remark 6. Hence, we omit the details here for the sake of brevity.

### Proof

Using Lemma 3 with λ4>0, we obtain

\begin{aligned} E|g(X_{T}^{\pi}) - Y_{T}^{\pi}|^{2} \le(1 + \lambda_{4}) H_{\text{min}} \sum_{i=0}^{N-1}E|\delta \tilde{Z}_{t_{i}}|^{2}h + C(\lambda_{4})\left[h+E|Y_{0} - Y_{0}^{\pi}|^{2}\right]. \end{aligned}
(5.5)

Splitting the term $$\delta \tilde {Z}_{t_{i}} = \tilde {Z}_{t_{i}} - Z_{t_{i}}^{\pi }$$ and applying the generalized mean inequality, we have (recall $$\overline {Z}_{t_{i}}^{\pi }$$ is defined in Theorem 5)

\begin{aligned} &~ E|\delta \tilde{Z}_{t_{i}}|^{2} \\ \le &~(1 + \lambda_{4})E|\overline{Z}_{t_{i}}^{\pi} - E\left[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]|^{2}\\ &~ + (1+\lambda_{4}^{-1})\left\{E|\left(\tilde{Z}_{t_{i}} -\overline{Z}_{t_{i}}^{\pi}\right) - E\left[\left(\tilde{Z}_{t_{i}} -\overline{Z}_{t_{i}}^{\pi}\right)|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]\right.\\ &~ \qquad\qquad\qquad +\left. (E[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}] - Z_{t_{i}}^{\pi})|^{2}\right\}\\ \le &~(1 + \lambda_{4})E|\overline{Z}_{t_{i}}^{\pi} - E\left[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]|^{2}\\ &~ + 3\left(1+\lambda_{4}^{-1}\right)\left\{E|\tilde{Z}_{t_{i}} - \overline{Z}_{t_{i}}^{\pi}|^{2} + E|E\left[\left(\tilde{Z}_{t_{i}} - \overline{Z}_{t_{i}}^{\pi}\right)|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]|^{2}\right.\\ &~ \qquad\qquad\qquad~ + \left.E|E\left[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right] - Z_{t_{i}}^{\pi}|^{2}\right\}\\ \le &~(1 + \lambda_{4})E|\overline{Z}_{t_{i}}^{\pi} - E\left[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]|^{2}\\ &~ + 3\left(1+\lambda_{4}^{-1}\right)\left\{2E|\tilde{Z}_{t_{i}} - \overline{Z}_{t_{i}}^{\pi}|^{2} + E|E\left[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right] - Z_{t_{i}}^{\pi}|^{2}\right\}. \end{aligned}
(5.6)
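The three inequalities in (5.6) rest on elementary facts, recorded here for the reader's convenience (this summary is an added explanation, not part of the original text). For vectors a, b, c and any λ>0,

$$|a+b|^{2} \le (1+\lambda)|a|^{2} + \left(1+\lambda^{-1}\right)|b|^{2}, \qquad |a+b+c|^{2} \le 3\left(|a|^{2}+|b|^{2}+|c|^{2}\right),$$

and for any square-integrable random variable U and sub-σ-algebra $$\mathcal{G}$$, Jensen's inequality gives

$$E|E[U|\mathcal{G}]|^{2} \le E|U|^{2}.$$

The first step uses the first inequality with λ=λ4 after writing $$\delta \tilde{Z}_{t_{i}}$$ as the sum of $$\overline{Z}_{t_{i}}^{\pi} - E\left[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]$$ and the remaining three terms, the second step uses the three-term inequality, and the third step applies Jensen's inequality with $$U = \tilde{Z}_{t_{i}} - \overline{Z}_{t_{i}}^{\pi}$$.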

From Eqs. (3.2) and (3.4), we know that

\begin{aligned} \sum_{i=0}^{N-1}E|\tilde{Z}_{t_{i}} - \overline{Z}_{t_{i}}^{\pi}|^{2}h &\leq 2\sum_{i=0}^{N-1}\int_{t_{i}}^{t_{i+1}}\left[E|Z_{t} - \tilde{Z}_{t_{i}}|^{2} + E|Z_{t} - \overline{Z}_{t_{i}}^{\pi}|^{2}\right] \mathrm{d} t \\ &= 2\int_{0}^{T}\left[E|Z_{t} - \tilde{Z}_{t_{i}}|^{2} + E|Z_{t} - \overline{Z}_{t_{i}}^{\pi}|^{2}\right] \mathrm{d} t \\ &\leq C\left(1+E|\xi|^{2}\right)h. \end{aligned}
(5.7)

Plugging estimates (5.6) and (5.7) into (5.5) gives us

\begin{aligned} &~E|g(X_{T}^{\pi}) - Y_{T}^{\pi}|^{2}\\ \le &~ (1 + \lambda_{4})^{2}H_{\text{min}} \sum_{i=0}^{N-1} E|\overline{Z}_{t_{i}}^{\pi} - E\left[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]|^{2} h \\ &~ + C(\lambda_{4})\left\{h + E|Y_{0} - Y_{0}^{\pi}|^{2} + \sum_{i=0}^{N-1} E|E\left[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right] - Z_{t_{i}}^{\pi}|^{2}h \right\}. \end{aligned}
(5.8)

It remains to estimate the term $$\sum _{i=0}^{N-1} E|\overline {Z}_{t_{i}}^{\pi } - E\left [\overline {Z}_{t_{i}}^{\pi }|X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right ]|^{2} h$$, to which we intend to apply Lemma 4. Let $$X_{t_{i}}^{\pi,1} = X_{t_{i}}^{\pi }$$ and $$X_{t_{i}}^{\pi,2} = \overline {X}_{t_{i}}^{\pi }$$. The associated $$Z_{t_{i}}^{\pi,1}$$ and $$Z_{t_{i}}^{\pi,2}$$ are then defined according to Eq. (5.1). Note that $$Z_{t_{i}}^{\pi,2} = \overline {Z}_{t_{i}}^{\pi }$$ but $$Z_{t_{i}}^{\pi,1}$$ is not necessarily equal to $$Z_{t_{i}}^{\pi }$$, due to the possible violation of the terminal condition. From Lemma 5, we know $$Z_{t_{i}}^{\pi,1}$$ can be represented as $$V_{i}^{\pi }\left (X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right)$$ with $$V_{i}^{\pi }$$ being a deterministic function. By the property of conditional expectation, we have

\begin{aligned} E|\overline{Z}_{t_{i}}^{\pi} - E\left[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]|^{2} \leq E|\overline{Z}_{t_{i}}^{\pi} - V_{i}\left(X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right)|^{2}, \end{aligned}

for any Vi. Therefore we have the estimate

\begin{aligned} &~\sum_{i=0}^{N-1}E|\overline{Z}_{t_{i}}^{\pi} - E\left[\overline{Z}_{t_{i}}^{\pi}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right]|^{2}h \le \sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h \\ \le &~ \frac{\lambda_{7}(e^{-A_{5}T}\vee 1)}{\lambda_{7} - f_{z}} \left\{g_{x}e^{A_{5}T -A_{5}h}E|\delta X_{N}|^{2}+ \frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E|\delta X_{i}|^{2}h \right\}. \end{aligned}
(5.9)
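The first inequality in (5.9) uses the fact that a conditional expectation is the best mean-square approximation among all functions of the conditioning variables. The sketch below illustrates this projection property on an empirical measure with a discrete conditioning variable, where the conditional expectation is simply a group mean and the inequality holds exactly. The data and the competing functions are arbitrary illustrative choices, not objects from the paper.

```python
import numpy as np

def conditional_mean(z, labels):
    """E[z | labels] under the empirical measure: the mean of z within
    each group of samples sharing the same label."""
    out = np.empty(len(z), dtype=float)
    for lab in np.unique(labels):
        mask = labels == lab
        out[mask] = z[mask].mean()
    return out

def mse(u, v):
    """Empirical mean-square distance between two sample vectors."""
    return float(np.mean((u - v) ** 2))

# Synthetic data: z depends on the label plus independent noise.
rng = np.random.default_rng(2)
labels = rng.integers(0, 5, size=1000)
z = labels.astype(float) ** 2 + rng.normal(size=1000)

best = mse(z, conditional_mean(z, labels))
# Any other function of the labels does no better in mean square.
competitors = [labels.astype(float) ** 2,
               2.0 * labels,
               np.zeros(len(z))]
errors = [mse(z, v) for v in competitors]
```

In the proof the role of the labels is played by the pair of discretized forward and backward variables, and the competing function is the deterministic map provided by Lemma 5.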

Recall that $$\delta X_{i}=X_{t_{i}}^{\pi } - \overline {X}_{t_{i}}^{\pi }, \delta Z_{i}=Z_{t_{i}}^{\pi,1} - \overline {Z}_{t_{i}}^{\pi }$$. Similar to the derivation of estimate (4.16) (using a given λ3>0 without final specification) in the proof of Theorem 1’, when $$\overline {A_{0}}' < 1$$, we have

\begin{aligned} \overline{P} &\le (1+\lambda_{4})\overline{A_{2}}e^{\overline{A_{3}}T}\frac{ 1- e^{-(\overline{A_{1}}+\overline{A_{3}})T} }{\overline{A_{1}}+\overline{A_{3}}}[1 - \overline{A_{0}}^{\prime}]^{-1}\left(1+\lambda_{3}^{-1}\right)E|Y_{T}^{\pi}-g(X_{T}^{\pi})|^{2}, \end{aligned}
(5.10)

in which $$\overline {P} = \max _{0\le n \le N}e^{-\overline {A_{1}}nh}E|\delta X_{n}|^{2}$$. Plugging (5.10) into (5.9), and then into (5.8), we get

\begin{aligned} \sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h&\le \frac{\lambda_{7}\left(e^{-A_{5}T}\vee1\right)\overline{P}}{\lambda_{7}-f_{z}}\left\{g_{x}e^{(\overline{A_{1}}+A_{5})T - A_{5}h} + \frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{(\overline{A_{1}}+A_{5})ih}h\right\}, \end{aligned}

and

\begin{aligned} &~E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \\ \le &~ (1 + \lambda_{4})^{3} B(h)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \\ &~ + C(\lambda_{4})\left\{h + E|Y_{0} - Y_{0}^{\pi}|^{2} + \sum_{i=0}^{N-1}E|E\left[\tilde{Z}_{t_{i}}|X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right] - Z_{t_{i}}^{\pi}|^{2}h \right\}, \end{aligned}
(5.11)

for sufficiently small h. Here B(h) is defined as

\begin{aligned} B(h) =&~ H_{\text{min}}\overline{A_{2}}e^{\overline{A_{3}}T} \frac{ 1- e^{-(\overline{A_{1}}+\overline{A_{3}})T} }{\overline{A_{1}}+\overline{A_{3}}}\left[1 - \overline{A_{0}}^{\prime}\right]^{-1}\left(1+\lambda_{3}^{-1}\right) \\ &~\times \frac{\lambda_{7}\left(e^{-A_{5}T}\vee 1\right)}{\lambda_{7} - f_{z}}\left\{g_{x}e^{(\overline{A_{1}}+A_{5})T-A_{5}h} + \frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{(\overline{A_{1}}+A_{5})ih}h \right\}. \end{aligned}

The forms of inequalities (5.4) and (5.11) are already very close. When $${{\lim }_{h\rightarrow 0}B(h)}=\overline {B_{0}} < 1$$, there exists λ4>0 such that for sufficiently small h, we have $$1 - (1+\lambda _{4})^{3}B(h) > \frac {1}{2}(1 - \overline {B_{0}})$$. Rearranging the term $$E|g\left (X_{T}^{\pi }\right) - Y_{T}^{\pi }|^{2}$$ in inequality (5.11) yields our final estimate. □

We shall briefly discuss how the universal approximation theorem can be applied based on Theorem 2’. For instance, Theorem 2.1 in Arora et al. (2018) states that every continuous and piecewise linear function with m-dimensional input can be represented by a deep neural network with rectified linear units and at most $$\lceil \log_{2}(m+1)\rceil + 1$$ depth. Now we view Y0 as a target function with input ξ and $$E\left [\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right ]$$ as another target function with input $$\left (X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right)$$. Since $$E|Y_{0}|^{2}<+\infty$$ and $$E|E\left [\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right ]|^{2} \le E|\tilde {Z}_{t_{i}}|^{2} < + \infty$$, we know that both target functions can be approximated in the L2 sense by continuous and piecewise linear functions with arbitrary accuracy. Then the aforementioned statement implies that the two target functions can be approximated by two neural networks with rectified linear units and at most $$\lceil \log_{2}(m+1)\rceil + 1$$ depth, although the width might go to infinity as the approximation error decreases to 0. Therefore, according to Theorem 2’, there exist neural networks for which the value of the objective function is small.
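To make the representation statement concrete, the following sketch writes one simple continuous piecewise linear function, the maximum of two inputs, exactly as a network with rectified linear units and a single hidden layer. The weights are hand-picked for this illustration and are not taken from Arora et al. (2018) or from the deep BSDE implementation.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(x, 0.0)

# Hidden layer computes relu(x1 - x2), relu(x2), relu(-x2).
W1 = np.array([[1.0, -1.0],
               [0.0, 1.0],
               [0.0, -1.0]])
# Output layer: relu(x1 - x2) + relu(x2) - relu(-x2) = relu(x1 - x2) + x2.
w2 = np.array([1.0, 1.0, -1.0])

def max_net(x):
    """x has shape (n, 2); returns max(x[:, 0], x[:, 1]) for each row."""
    hidden = relu(x @ W1.T)
    return hidden @ w2

rng = np.random.default_rng(3)
x = rng.normal(size=(100, 2))
out = max_net(x)
```

The identity behind the weights is that the maximum of two numbers equals the second number plus the positive part of their difference, and that any number is the difference of its positive and negative parts.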

Note that there still exist some concerns about the result in Theorem 2’. First, the function $$E\left [\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right ]$$ changes when $$Z_{t_{j}}^{\pi }$$ changes for j<i. Second, the function may depend on $$Y_{t_{i}}^{\pi }$$. Even if the FBSDEs are decoupled so that the above two concerns do not exist, we know nothing a priori about the property of $$E\left [\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right ]$$. In the next theorem, we replace $$E\left [\tilde {Z}_{t_{i}}|X_{t_{i}}^{\pi },Y_{t_{i}}^{\pi }\right ]$$ with $$\sigma ^{\mathrm {T}}\left (t_{i},X_{t_{i}}^{\pi },u\left (t_{i},X_{t_{i}}^{\pi }\right)\right)\nabla _{x} u\left (t_{i},X_{t_{i}}^{\pi }\right)$$, which can resolve these problems. However, meanwhile we require more regularity for the coefficients of the FBSDEs.

### Theorem 6

Suppose Assumptions 1, 2, 3 and 4 and the assumptions in Theorem 3 hold true. Let u be the solution of the corresponding quasilinear PDEs (2.6) and L be the squared Lipschitz constant of $$\sigma^{\mathrm{T}}(t,x,u(t,x))\nabla_{x} u(t,x)$$ with respect to x. With the same notation as in Theorem 2’, when $$\overline {A_{0}}' <1$$ and

\begin{aligned} \overline{B_{0}}^{\prime} \mathrel{\mathop:}= H_{\text{min}}L\overline{A_{2}}e^{\overline{A_{3}} T}\frac{\left(e^{\overline{A_{1}}T} -1\right)\left(1- e^{-(\overline{A_{1}}+\overline{A_{3}})T}\right)}{\overline{A_{1}}(\overline{A_{1}}+\overline{A_{3}})}\left[1 - \overline{A_{0}}^{\prime}\right]^{-1}\left(1+\lambda_{3}^{-1}\right) < 1, \end{aligned}

there exists a constant C>0 depending on $$E|\xi|^{2}$$, T, $$\mathscr {L}$$, L, λ1, λ2, and λ3, such that for sufficiently small h,

$$E|g(X_{T}^{\pi}) - Y_{T}^{\pi}|^{2} \le C\left\{ h + E|Y_{0} - Y_{0}^{\pi}|^{2} + \sum_{i=0}^{N-1}E|f_{i}\left(X_{t_{i}}^{\pi}\right) - Z_{t_{i}}^{\pi}|^{2}h \right\},$$
(5.12)

where $$f_{i}(x)=\sigma^{\mathrm{T}}(t_{i},x,u(t_{i},x))\nabla_{x} u(t_{i},x)$$.

### Proof

By Theorem 3, we have $$Z_{t_{i}} = f_{i}(X_{t_{i}})$$, in which Xt is the solution of

$$X_{t} = \xi + \int_{0}^{t}b(s,X_{s},u(s,X_{s}))\, \mathrm{d}s + \int_{0}^{t}\sigma(s,X_{s},u(s,X_{s}))\, \mathrm{d}W_{s}.$$

Using Lemma 3 again with λ4>0 gives us

\begin{aligned} E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \le(1 + \lambda_{4}) H_{\text{min}} \sum_{i=0}^{N-1}E|\delta \tilde{Z}_{t_{i}}|^{2}h + C(\lambda_{4})\left[h+E|Y_{0} - Y_{0}^{\pi}|^{2}\right]. \end{aligned}

Given the continuity of $$\sigma^{\mathrm{T}}(t,x,u(t,x))\nabla_{x} u(t,x)$$, we know Zt admits a continuous version. Hence the term $$\tilde {Z}_{t_{i}}$$ in $$\delta \tilde {Z}_{t_{i}} = \tilde {Z}_{t_{i}} -Z_{t_{i}}^{\pi }$$ can be replaced with $$Z_{t_{i}}$$, i.e.,

\begin{aligned} E|g\left(X_{T}^{\pi}\right) \,-\, Y_{T}^{\pi}|^{2} \!\le(1 + \lambda_{4}) H_{\text{min}} \sum_{i=0}^{N-1}E|Z_{t_{i}} \!- Z_{t_{i}}^{\pi}|^{2}h + C(\lambda_{4})\left[h+E|Y_{0} - Y_{0}^{\pi}|^{2}\right]. \end{aligned}
(5.13)

Similar to the arguments in inequalities (5.6) and (5.7), we have

\begin{aligned} &~ E|Z_{t_{i}} - Z_{t_{i}}^{\pi}|^{2} \\ \leq&~ \left(1 + \lambda_{4}^{-1}\right)E|f_{i}\left(X_{t_{i}}^{\pi}\right) - Z_{t_{i}}^{\pi}|^{2} + (1 + \lambda_{4})E|Z_{t_{i}} - f_{i}(X_{t_{i}}^{\pi})|^{2} \\ \leq&~ \left(1 + \lambda_{4}^{-1}\right)E|f_{i}\left(X_{t_{i}}^{\pi}\right) - Z_{t_{i}}^{\pi}|^{2} + (1 + \lambda_{4})LE|X_{t_{i}} - X_{t_{i}}^{\pi}|^{2} \\ \leq&~ \left(1 + \lambda_{4}^{-1}\right)E|f_{i}\left(X_{t_{i}}^{\pi}\right) - Z_{t_{i}}^{\pi}|^{2} \\ &~ + (1 + \lambda_{4})L \left[(1 + \lambda_{4})E|X_{t_{i}}^{\pi}- \overline{X}_{t_{i}}^{\pi}|^{2} + \left(1 + \lambda_{4}^{-1}\right)E|X_{t_{i}} - \overline{X}_{t_{i}}^{\pi}|^{2} \right] \\ \leq&~ (1 + \lambda_{4})^{2}LE|X_{t_{i}}^{\pi}- \overline{X}_{t_{i}}^{\pi}|^{2} + C(L,\lambda_{4})\left\{E|f_{i}\left(X_{t_{i}}^{\pi}\right) - Z_{t_{i}}^{\pi}|^{2} + h \right\}, \end{aligned}

where the last inequality uses the convergence result (3.4). Plugging it into (5.13), we have

\begin{aligned} E|g(X_{T}^{\pi}) - Y_{T}^{\pi}|^{2} \le &~ (1 + \lambda_{4})^{3} H_{\text{min}} L \sum_{i=0}^{N-1}E|X_{t_{i}}^{\pi} - \overline{X}_{t_{i}}^{\pi}|^{2}h \\ &~ + C(L,\lambda_{4})\left\{h + E|Y_{0} - Y_{0}^{\pi}|^{2} + \sum_{i=0}^{N-1}E|f_{i}\left(X_{t_{i}}^{\pi}\right) - Z_{t_{i}}^{\pi}|^{2}h \right\} \end{aligned}
(5.14)

for sufficiently small h.

We employ the estimate (5.10) again to rewrite inequality (5.14) as

\begin{aligned} E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \le &~ (1 + \lambda_{4})^{4} \widetilde{B}(h)E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \\ &~ + C(L,\lambda_{4})\left\{h + E|Y_{0} - Y_{0}^{\pi}|^{2} + \sum_{i=0}^{N-1}E|f_{i}\left(X_{t_{i}}^{\pi}\right) - Z_{t_{i}}^{\pi}|^{2}h \right\}, \end{aligned}
(5.15)

where

\begin{aligned} \widetilde{B}(h) = H_{\text{min}}L\overline{A_{2}}e^{\overline{A_{3}} T}\frac{ 1- e^{-(\overline{A_{1}}+\overline{A_{3}})T} }{\overline{A_{1}}+\overline{A_{3}}}\left[1 - \overline{A_{0}}^{\prime}\right]^{-1}\left(1+\lambda_{3}^{-1}\right)\sum_{i=0}^{N-1}e^{i\overline{A_{1}}h}h. \end{aligned}

Arguing in the same way as that in the proof of Theorem 2’, when $$\tilde {B}(h)$$ is strictly bounded above by 1 for sufficiently small h, we can choose λ4 small enough and rearrange the terms in inequality (5.15) to obtain the result in inequality (5.12). □
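The constant $$\widetilde{B}(h)$$ differs from $$\overline{B_{0}}^{\prime}$$ only through the factor $$\sum_{i=0}^{N-1}e^{i\overline{A_{1}}h}h$$, which is a left Riemann sum of $$\int_{0}^{T}e^{\overline{A_{1}}s}\,\mathrm{d}s = (e^{\overline{A_{1}}T}-1)/\overline{A_{1}}$$. The following sketch checks this convergence numerically. The values of `a1` and `T` are arbitrary illustrative choices (with `a1` assumed positive), not constants taken from the theorems.

```python
import math


def left_riemann_sum(a1, T, N):
    """Sum_{i=0}^{N-1} exp(i * a1 * h) * h with h = T / N."""
    h = T / N
    return sum(math.exp(i * a1 * h) * h for i in range(N))


def limit_value(a1, T):
    """Closed form of the integral of exp(a1 * s) over [0, T], a1 != 0."""
    return (math.exp(a1 * T) - 1.0) / a1


if __name__ == "__main__":
    a1, T = 1.5, 1.0  # illustrative values only
    for N in (10, 100, 1000):
        print(N, left_riemann_sum(a1, T, N), limit_value(a1, T))
```

For positive `a1` the sum increases to its limit as h decreases, so in that case $$\widetilde{B}(h)$$ stays below $$\overline{B_{0}}^{\prime}$$ for every h.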

### Remark 9

The Lipschitz constant used in Theorem 6 may be further estimated a priori. Denote the Lipschitz constant of function f with respect to x as Lx(f), and the bound of function f as M(f). Loosely speaking, we have

$$L_{x}(\sigma^{\mathrm{T}}(t,x,u(t,x))\nabla_{x} u(t,x)) \le M(\sigma)L_{x}(\nabla_{x} u) + M(\nabla_{x} u) [L_{x}(\sigma)+ L_{y}(\sigma)L_{x}(u)].$$

Here $$L_{x}(u)=M(\nabla_{x} u(t,x))$$ can be estimated from the first point of Theorem 4 and $$L_{x}(\nabla_{x} u(t,x))=M(\nabla_{xx} u)$$ can be estimated through the Schauder estimate (see, e.g., Ma and Yong (2007), Chapter 4, Lemma 2.1). Note that the resulting estimate may depend on the dimension d.

### Proof of Lemmas

Proof of Lemma 3 We construct continuous processes $$X_{t}^{\pi }, Y_{t}^{\pi }$$ as follows. For $$t\in[t_{i},t_{i+1})$$, let

\begin{aligned} X_{t}^{\pi} &= X_{t_{i}}^{\pi} + b\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right) (t-t_{i}) + \sigma\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right)(W_{t} - W_{t_{i}}), \\ Y_{t}^{\pi} &= Y_{t_{i}}^{\pi} - f\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi}\right)(t - t_{i}) +\left(Z_{t_{i}}^{\pi}\right)^{\mathrm{T}}(W_{t} - W_{t_{i}}). \end{aligned}

From system (2.3), we see this definition also works at $$t_{i+1}$$. We are again interested in estimates of the following terms

\begin{aligned} \delta X_{t} &= X_{t} - X_{t}^{\pi}, \quad \delta Y_{t} = Y_{t} - Y_{t}^{\pi}, \quad \delta Z_{t} = Z_{t} - Z_{t_{i}}^{\pi},~~~t\in[t_{i},t_{i+1}). \end{aligned}

For $$t\in[t_{i},t_{i+1})$$, let

\begin{aligned} \delta b_{t} &= b(t,X_{t},Y_{t}) - b\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right), \\ \delta \sigma_{t} &= \sigma(t,X_{t},Y_{t}) - \sigma\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi}\right), \\ \delta f_{t} &= f(t,X_{t},Y_{t},Z_{t}) - f\left(t_{i},X_{t_{i}}^{\pi},Y_{t_{i}}^{\pi},Z_{t_{i}}^{\pi}\right). \end{aligned}

By definition,

\begin{aligned} \mathrm{d}(\delta X_{t}) &= \delta b_{t}\,\mathrm{d}t + \delta \sigma_{t}\,\mathrm{d}W_{t}, \\ \mathrm{d}(\delta Y_{t}) &= -\delta f_{t} \,\mathrm{d}t + (\delta Z_{t})^{\mathrm{T}}\,\mathrm{d}W_{t}. \end{aligned}

Then by Itô’s formula, we have

\begin{aligned} \mathrm{d}|\delta X_{t}|^{2} &= [2(\delta b_{t})^{T}\delta X_{t} + \|\delta \sigma_{t}\|^{2}]\mathrm{d}t + 2(\delta X_{t})^{\mathrm{T}} \delta \sigma_{t} \,\mathrm{d}W_{t}, \\ \mathrm{d}|\delta Y_{t}|^{2} &= [-2(\delta f_{t})^{\mathrm{T}} \delta Y_{t} + |\delta Z_{t}|^{2}] \,\mathrm{d}t + 2\delta Y_{t} (\delta Z_{t})^{\mathrm{T}} \,\mathrm{d}W_{t}. \end{aligned}

Thus,

\begin{aligned} E|\delta X_{t}|^{2} &= E|\delta X_{t_{i}}|^{2} + \int_{t_{i}}^{t}E\left[2 (\delta b_{s})^{\mathrm{T}} \delta X_{s} + \|\delta \sigma_{s}\|^{2}\right] \,\mathrm{d} s, \\ E|\delta Y_{t}|^{2} &= E|\delta Y_{t_{i}}|^{2} + \int_{t_{i}}^{t} E\left[ -2 (\delta f_{s})^{\mathrm{T}} \delta Y_{s} + |\delta Z_{s}|^{2} \right]\, \mathrm{d} s. \end{aligned}

For any λ5,λ6>0, using Assumptions 1 and 2 and the RMS-GM inequality, we have

\begin{aligned} &~ E|\delta X_{t}|^{2} \\ \le &~E|\delta X_{t_{i}}|^{2} + \int_{t_{i}}^{t}[\lambda_{5}E|\delta X_{s}|^{2} + \lambda_{5}^{-1}E|\delta b_{s}|^{2} + E\|\delta \sigma_{s}\|^{2}] \,\mathrm{d}s \\ \le &~ E|\delta X_{t_{i}}|^{2} + \lambda_{5} \int_{t_{i}}^{t}E|\delta X_{s}|^{2} \,\mathrm{d}s + \int_{t_{i}}^{t}K\left(\lambda_{5}^{-1} + 1\right)|s - t_{i}|\,\mathrm{d}s \\ &~ + \int_{t_{i}}^{t}\left[(K\lambda_{5}^{-1} + \sigma_{x})E|X_{s} - X_{t_{i}}^{\pi}|^{2} + \left(b_{y}\lambda_{5}^{-1} + \sigma_{y}\right)E|Y_{s} - Y_{t_{i}}^{\pi}|^{2}\right]\,\mathrm{d}s. \end{aligned}
(5.16)

By the RMS-GM inequality, we also have

\begin{aligned} E|X_{s} - X_{t_{i}}^{\pi}|^{2} &\le (1 + \epsilon_{1})E|\delta X_{t_{i}}|^{2} + \left(1 + \epsilon_{1}^{-1}\right)E|X_{s} - X_{t_{i}}|^{2}, \\ \end{aligned}
(5.17)
\begin{aligned} E|Y_{s} - Y_{t_{i}}^{\pi}|^{2} &\le (1 + \epsilon_{2})E|\delta Y_{t_{i}}|^{2} + \left(1 + \epsilon_{2}^{-1}\right)E|Y_{s} - Y_{t_{i}}|^{2}, \end{aligned}
(5.18)

in which we choose $$\epsilon _{1} = \lambda _{6}\left (K\lambda _{5}^{-1}+\sigma _{x}\right)^{-1}$$ and $$\epsilon _{2} = \lambda _{6}\left (b_{y}\lambda _{5}^{-1} + \sigma _{y}\right)^{-1}$$. The path regularity in Theorem 4 tells us

\begin{aligned} \sup_{s\in[t_{i},t_{i+1}]} \left(E|X_{s} - X_{t_{i}}|^{2} + E|Y_{s} - Y_{t_{i}}|^{2}\right) \le Ch. \end{aligned}
(5.19)

Plugging inequalities (5.17), (5.18), (5.19) into (5.16) with simplification, we obtain

\begin{aligned} E|\delta X_{t}|^{2} &\le [1 + \left(K\lambda_{5}^{-1} + \sigma_{x} + \lambda_{6}\right)h]E|\delta X_{t_{i}}|^{2} + \lambda_{5} \int_{t_{i}}^{t}E|\delta X_{s}|^{2} \,\mathrm{d}s \\ &\hphantom{=~} +\left(b_{y}\lambda_{5}^{-1} + \sigma_{y} + \lambda_{6}\right) E|\delta Y_{t_{i}}|^{2} h + C(\lambda_{5},\lambda_{6})h^{2}. \end{aligned}
(5.20)

Then, by the Grönwall inequality, we have

\begin{aligned} & E|\delta X_{t_{i+1}}|^{2} \\ \le~& e^{\lambda_{5} h}\left\{\left[1 + \left(K\lambda_{5}^{-1} + \sigma_{x} +\lambda_{6}\right)h\right]E|\delta X_{t_{i}}|^{2}\right. \\ & \quad~~~~+ \left.\left(b_{y}\lambda_{5}^{-1} + \sigma_{y} + \lambda_{6}\right)E|\delta Y_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})h^{2}\right\} \\ \le~& e^{A_{6}h}E|\delta X_{t_{i}}|^{2} + e^{\lambda_{5}h}\left(b_{y}\lambda_{5}^{-1}+\sigma_{y} + \lambda_{6}\right)E|\delta Y_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})h^{2}\\ \le~& e^{A_{6}h}E|\delta X_{t_{i}}|^{2} + A_{7}E|\delta Y_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})h^{2}, \end{aligned}
(5.21)

where $$A_{6} \mathrel {\mathop :}= K\lambda _{5}^{-1} + \sigma _{x} + \lambda _{5} + \lambda _{6}, A_{7} \mathrel {\mathop :}= b_{y}\lambda _{5}^{-1} + \sigma _{y} + 2\lambda _{6}$$, and h is sufficiently small.

Similarly, with the same type of estimates in (5.16) and (5.20), for any λ5,λ6>0, we have

\begin{aligned} &~E|\delta Y_{t}|^{2} \\ \le &~ E|\delta Y_{t_{i}}|^{2} + \int_{t_{i}}^{t}\left[\lambda_{5}E|\delta Y_{s}|^{2} + \lambda_{5}^{-1}E|\delta f_{s}|^{2} + E|\delta Z_{s}|^{2}\right] \,\mathrm{d}s \\ \le &~E|\delta Y_{t_{i}}|^{2} + \lambda_{5} \int_{t_{i}}^{t}E|\delta Y_{s}|^{2}\,\mathrm{d}s + \int_{t_{i}}^{t} K\lambda_{5}^{-1}|s - t_{i}|\,\mathrm{d}s \\ & + \int_{t_{i}}^{t}\lambda_{5}^{-1}\left[f_{x}E|X_{s} - X_{t_{i}}^{\pi}|^{2} + K E|Y_{s} - Y_{t_{i}}^{\pi}|^{2}\right]\,\mathrm{d}s + \left(1 + f_{z}\lambda_{5}^{-1}\right)\int_{t_{i}}^{t}E|\delta Z_{s}|^{2} \,\mathrm{d}s \\ \le &~\left[1 + \left(K\lambda_{5}^{-1} + \lambda_{6}\right)h\right]E|\delta Y_{t_{i}}|^{2} + \lambda_{5}\int_{t_{i}}^{t}E|\delta Y_{s}|^{2}\,\mathrm{d}s + \left(f_{x}\lambda_{5}^{-1} + \lambda_{6}\right)E|\delta X_{t_{i}}|^{2}h \\ &~+\left(1 + f_{z}\lambda_{5}^{-1}\right)\int_{t_{i}}^{t}E|\delta Z_{s}|^{2}\,\mathrm{d}s + C(\lambda_{5},\lambda_{6})h^{2}. \end{aligned}

Arguing in the same way as for (5.21), by the Grönwall inequality, for sufficiently small h, we have

\begin{aligned} &E|\delta Y_{t_{i+1}}|^{2} \\ \le~&e^{A_{8}h}E|\delta Y_{t_{i}}|^{2} + A_{9}E|\delta X_{t_{i}}|^{2} h + (1 + f_{z}\lambda_{5}^{-1} + \lambda_{6})\int_{t_{i}}^{t_{i+1}}E|\delta Z_{s}|^{2}\,\mathrm{d}s + C(\lambda_{5},\lambda_{6})h^{2}, \end{aligned}

with $$A_{8} \mathrel {\mathop :}= K\lambda _{5}^{-1} + \lambda _{5} + \lambda _{6}, A_{9} \mathrel {\mathop :}= f_{x}\lambda _{5}^{-1} + 2\lambda _{6}$$. Choosing $$\epsilon _{3} = \left (1 + f_{z}\lambda _{5}^{-1} + \lambda _{6}\right)^{-1}\lambda _{6}$$ and using

$$\int_{t_{i}}^{t_{i+1}} E|\delta Z_{t}|^{2} \,\mathrm{d}t \le (1 + \epsilon_{3})E|\delta \tilde{Z}_{t_{i}}|^{2} h + \left(1 + \epsilon_{3}^{-1}\right)E_{z}^{i},$$

where $$\delta \tilde {Z}_{t_{i}} = \tilde {Z}_{t_{i}} - Z_{t_{i}}^{\pi }$$ and $$E_{z}^{i} = \int _{t_{i}}^{t_{i+1}}E|Z_{t} - \tilde {Z}_{t_{i}}|^{2}\,\mathrm {d}t$$, we furthermore obtain

$$E|\delta Y_{t_{i+1}}|^{2} \le e^{A_{8}h}E|\delta Y_{t_{i}}|^{2} + A_{9}E|\delta X_{t_{i}}|^{2} h + A_{10} E|\delta \tilde{Z}_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})\left(h^{2}+E_{z}^{i}\right),$$
(5.22)

with $$A_{10} \mathrel {\mathop :}= 1 + f_{z}\lambda _{5}^{-1} + 2\lambda _{6}$$.

Define

\begin{aligned} M_{i} = \max\left\{E|\delta X_{i}|^{2}, E|\delta Y_{i}|^{2}\right\}, \quad 0 \le i \le N. \end{aligned}

Combining inequalities (5.21) and (5.22) together yields

\begin{aligned} &M_{i+1} \\ \le~& \left(e^{\max\{A_{6}, A_{8}\}h}+ \max\{A_{7}, A_{9}\}h\right)M_{i} + A_{10}E|\delta \tilde{Z}_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})\left(h^{2}+E_{z}^{i}\right) \\ \le~& e^{(\max\{A_{6}, A_{8}\} + \max\{A_{7}, A_{9}\})h}M_{i} + A_{10}E|\delta \tilde{Z}_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})\left(h^{2}+E_{z}^{i}\right). \end{aligned}

Letting $$A_{11} \mathrel {\mathop :}= \max \{A_{6},A_{8}\} + \max \{A_{7},A_{9}\}$$, we have

$$M_{i+1} \le e^{A_{11}h}M_{i} + A_{10}E|\delta \tilde{Z}_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})\left(h^{2} + E_{z}^{i}\right).$$
(5.23)

We start from $$M_{0} = E|Y_{0} - Y_{0}^{\pi }|^{2}$$ and apply inequality (5.23) repeatedly to obtain

$$M_{N} \le A_{10}e^{A_{11}T}\sum_{i=0}^{N-1}E|\delta \tilde{Z}_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})\left[h + E|Y_{0} - Y_{0}^{\pi}|^{2}\right],$$
(5.24)

in which for the last term we use the fact $$\sum _{i=0}^{N-1}E_{z}^{i} \le Ch$$ from inequality (3.2). Note that

\begin{aligned} A_{10} &= 1 + f_{z}\lambda_{5}^{-1} + 2\lambda_{6}, \\ A_{11} &\le 2K + 2K\lambda_{5}^{-1} + \lambda_{5} + 3\lambda_{6}. \end{aligned}

Given any λ4>0, we can choose λ6 small enough such that

\begin{aligned} \left(1 + f_{z}\lambda_{5}^{-1} + 2\lambda_{6}\right)e^{A_{11}T} &\le (1+\lambda_{4})\left(1+f_{z}\lambda_{5}^{-1}\right)e^{\left(2K + 2K\lambda_{5}^{-1} + \lambda_{5}\right)T}. \end{aligned}

This condition and inequality (5.24) together give us

\begin{aligned} M_{N} \le~& (1+\lambda_{4})\left(1+f_{z}\lambda_{5}^{-1}\right)e^{\left(2K + 2K\lambda_{5}^{-1} + \lambda_{5}\right)T}\sum_{i=0}^{N-1}E|\delta \tilde{Z}_{t_{i}}|^{2}h \\ &+ C(\lambda_{4},\lambda_{5})\left[h+E|Y_{0} - Y_{0}^{\pi}|^{2}\right]. \end{aligned}
(5.25)
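The passage from the one-step recursion (5.23) to the global bound (5.24) is a discrete Grönwall argument: unrolling $$M_{i+1} \le e^{Ah}M_{i} + c_{i}$$ over N steps gives $$M_{N} \le e^{AT}\left(M_{0} + \sum_{i}c_{i}\right)$$ when A≥0 and $$c_{i}\ge 0$$. The sketch below checks this numerically for the worst case in which the recursion holds with equality. The function names and the random increments are illustrative assumptions, not quantities from the proof.

```python
import math
import random


def unroll(m0, increments, a, h):
    """Iterate M_{i+1} = exp(a * h) * M_i + c_i starting from M_0 = m0."""
    m = m0
    for c in increments:
        m = math.exp(a * h) * m + c
    return m


def gronwall_bound(m0, increments, a, h):
    """Upper bound exp(a * T) * (M_0 + sum_i c_i) with T = N * h, a >= 0."""
    T = h * len(increments)
    return math.exp(a * T) * (m0 + sum(increments))


if __name__ == "__main__":
    random.seed(0)
    N, T, a = 100, 1.0, 2.0  # illustrative values only
    h = T / N
    increments = [random.random() * h for _ in range(N)]
    print(unroll(0.3, increments, a, h), gronwall_bound(0.3, increments, a, h))
```

In the proof, the increments play the role of $$A_{10}E|\delta \tilde{Z}_{t_{i}}|^{2}h + C(\lambda_{5},\lambda_{6})\left(h^{2}+E_{z}^{i}\right)$$ and a plays the role of $$A_{11}$$.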

Finally, by decomposing the objective function, we have

\begin{aligned} &~E|g\left(X_{T}^{\pi}\right) - Y_{T}^{\pi}|^{2} \\ = &~ E|g\left(X_{T}^{\pi}\right) - g(X_{T}) + Y_{T} - Y_{T}^{\pi}|^{2} \\ \le &~ \left(1+(\sqrt{g_{x}})^{-1}\right) E|g\left(X_{T}^{\pi}\right) - g(X_{T})|^{2} + (1+\sqrt{g_{x}}) E|\delta Y_{N}|^{2} \\ \le &~ (g_{x} + \sqrt{g_{x}})E|\delta X_{N}|^{2} + (1 + \sqrt{g_{x}})E|\delta Y_{N}|^{2} \\ \le &~ (1 + \sqrt{g_{x}})^{2}M_{N}. \end{aligned}
(5.26)

We complete our proof by combining inequalities (5.25), (5.26) and choosing $$\lambda _{5} = \text {argmin}_{x \in \mathbb {R}^{+}} H(x)$$. □

Proof of Lemma 4 We use the same notations as in the proof of Lemma 1. As derived in (4.12), for any λ7>fz≥0, we have

$$E|\delta Y_{i+1}|^{2} \ge [1 - (2k_{f}+\lambda_{7})h]E|\delta Y_{i}|^{2} + \left(1 - f_{z}\lambda_{7}^{-1}\right)E|\delta Z_{i}|^{2}h - f_{x}\lambda_{7}^{-1}E|\delta X_{i}|^{2}h.$$
(5.27)

Multiplying both sides of (5.27) by $$e^{A_{5}ih}(e^{-A_{5}T}\vee 1)/\left (1 - f_{z}\lambda _{7}^{-1}\right)$$ gives us

\begin{aligned} &~\frac{\lambda_{7}\left(e^{-A_{5}T}\vee 1\right)}{\lambda_{7} - f_{z}}\left\{e^{A_{5}ih} E|\delta Y_{i+1}|^{2} - e^{A_{5}(i-1)h}E|\delta Y_{i}|^{2} + e^{A_{5}ih} \frac{f_{x}}{\lambda_{7}}E|\delta X_{i}|^{2}h\right\} \\ \ge &~e^{A_{5}ih}\left(e^{-A_{5}T}\vee 1\right)E|\delta Z_{i}|^{2}h \\ \ge &E|\delta Z_{i}|^{2}h. \end{aligned}
(5.28)

Summing (5.28) up from i=0 to N−1, we obtain

\begin{aligned} \sum_{i=0}^{N-1}E|\delta Z_{i}|^{2}h \le \frac{\lambda_{7}\left(e^{-A_{5}T}\vee 1\right)}{\lambda_{7} - f_{z}} \left\{e^{A_{5}T - A_{5}h}E|\delta Y_{N}|^{2} + \frac{f_{x}}{\lambda_{7}}\sum_{i=0}^{N-1}e^{A_{5}ih}E|\delta X_{i}|^{2}h \right\}. \end{aligned}
(5.29)

Note that $$E|\delta Y_{N}|^{2} \le g_{x}E|\delta X_{N}|^{2}$$ by Assumption 1. Plugging it into (5.29), we arrive at the desired result. □

Proof of Lemma 5 We prove the statement by backward induction. Let $$Z_{t_{N}}^{\pi,'} = 0$$ for convenience. It is straightforward to see that the statement holds for i=N. Assume the statement holds for i=k+1. For i=k, we know $$Y_{t_{k+1}}^{\pi,'} = U_{k+1}^{\pi}\left (X_{t_{k+1}}^{\pi }, Y_{t_{k+1}}^{\pi }\right)$$. Recalling the definition of $$\left \{X_{t_{i}}^{\pi }\right \}_{0\le i\le N}, \left \{Y_{t_{i}}^{\pi }\right \}_{0\le i\le N}$$ in (2.3), we can rewrite $$Y_{t_{k+1}}^{\pi,'} = \overline {U}_{{k}}\left (X_{t_{k}}^{\pi }, Y_{t_{k}}^{\pi },\Delta W_{k}\right)$$, with $$\overline {U}_{k}: \mathbb {R}^{m} \times \mathbb {R} \times \mathbb {R}^{d} \rightarrow \mathbb {R}$$ being a deterministic function. Note $$Z_{t_{k}}^{\pi,'} = h^{-1}E\left [\overline {U}_{k}\left (X_{t_{k}}^{\pi }, Y_{t_{k}}^{\pi },\Delta W_{k}\right)\Delta W_{k}|\mathcal {F}_{t_{k}}\right ]$$. Since ΔWk is independent of $$\mathcal {F}_{t_{k}}$$, there exists a deterministic function $$V_{k}^{\pi }: \mathbb {R}^{m} \times \mathbb {R} \rightarrow \mathbb {R}^{d}$$ such that $$Z_{t_{k}}^{\pi, '}=V_{k}^{\pi }\left (X_{t_{k}}^{\pi }, Y_{t_{k}}^{\pi }\right)$$.

Next we consider $$Y_{t_{k}}^{\pi,'}$$. Let $$H_{k} = L^{2}(\Omega, \sigma (X_{t_{k}}^{\pi }, Y_{t_{k}}^{\pi }), \mathbb {P})$$, where $$\sigma \left (X_{t_{k}}^{\pi }, Y_{t_{k}}^{\pi }\right)$$ denotes the σ-algebra generated by $$X_{t_{k}}^{\pi }, Y_{t_{k}}^{\pi }$$. We know Hk is a Banach space and another equivalent representation is

$$H_{k}=\left\{Y = \phi\left(X_{t_{k}}^{\pi}, Y_{t_{k}}^{\pi}\right)~|~\phi \text{ is measurable and}\ E|Y|^{2} < \infty \right\}.$$

Consider the following map defined on Hk:

$$\Phi_{k}(Y) = E\left[{Y}_{t_{k+1}}^{\pi,^{\prime}} + f\left(t_{k},{X}_{t_{k}}^{\pi},Y,{Z}_{t_{k}}^{\pi,^{\prime}}\right)h|\mathcal{F}_{t_{k}}\right].$$

By Assumption 3, $$\Phi_{k}(Y)$$ is square-integrable. Furthermore, following the same argument as for $$Z_{t_{k}}^{\pi,'}$$, $$\Phi _{k}(Y)$$ can also be represented as a deterministic function of $$X_{t_{k}}^{\pi }, Y_{t_{k}}^{\pi }$$. Hence, $$\Phi_{k}(Y)\in H_{k}$$. Note that Assumption 1 implies $$E|\Phi_{k}(Y_{1})-\Phi_{k}(Y_{2})|^{2} \le Kh^{2}E|Y_{1}-Y_{2}|^{2}$$. Therefore Φk is a contraction map on Hk when $$h < 1/\sqrt {K}$$. By the Banach fixed-point theorem, there exists a unique fixed point $$Y^{*}=\phi _{k}^{*}\left (X_{t_{k}}^{\pi }, Y_{t_{k}}^{\pi }\right) \in H_{k}$$ satisfying $$Y^{*}=\Phi_{k}(Y^{*})$$. We choose $$U_{k}^{\pi }=\phi _{k}^{*}$$ to validate the statement for $$Y_{t_{k}}^{\pi,'}$$.
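The contraction argument can be illustrated pointwise by a scalar fixed-point iteration for an implicit step of the form y = c + h f(y), where f is Lipschitz in y with constant $$\sqrt{K}$$ and $$h\sqrt{K}<1$$. This is only a sketch under simplifying assumptions: the constant c stands in for the conditional expectation of the next-step value, and the driver $$f(y)=\sqrt{K}\sin(y)$$ is a hypothetical choice, not one of the paper's examples.

```python
import math


def fixed_point(phi, y0, tol=1e-12, max_iter=1000):
    """Picard iteration y <- phi(y); returns the limit and the iteration count."""
    y = y0
    for n in range(max_iter):
        y_next = phi(y)
        if abs(y_next - y) < tol:
            return y_next, n + 1
        y = y_next
    raise RuntimeError("fixed-point iteration did not converge")


K = 4.0   # squared Lipschitz constant of the hypothetical driver
h = 0.1   # step size, chosen so that h * sqrt(K) = 0.2 < 1
c = 0.7   # stand-in for the conditional expectation of the next-step value


def phi(y):
    """Scalar analogue of the map Phi_k with a sine driver."""
    return c + h * math.sqrt(K) * math.sin(y)


y_star, n_iter = fixed_point(phi, 0.0)
```

Since the contraction factor is $$h\sqrt{K}=0.2$$ here, the iteration converges geometrically and the limit does not depend on the starting point.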

When b and σ are independent of y, all of the arguments above can be made similarly with $$U_{i}^{\pi }, V_{i}^{\pi }$$ also being independent of Y. □

## Numerical examples

### General setting

In this section, we illustrate the proposed numerical scheme by solving two high-dimensional coupled FBSDEs adapted from the literature. The common setting for these two numerical examples is as follows. We assume d=m=100, that is, $$X_{t}, Z_{t}, W_{t} \in \mathbb {R}^{100}$$. Assume ξ is deterministic and we are interested in the approximation error of Y0, which is also a deterministic scalar.

We use N−1 fully-connected feedforward neural networks to parameterize $$\phi _{i}^{\pi }, i=0,1,\dots,N-1$$. Each of the neural networks has 2 hidden layers with dimension d+10. The input has dimension $$d+1~(X_{i}\in \mathbb {R}^{d}, Y_{i} \in \mathbb {R})$$ and the output has dimension d. In practice, one can of course choose Xi only as the input. We additionally test this input for the two examples and find no difference in terms of the relative error of Y0 (up to second decimal place). We use the rectifier function (ReLU) as the activation function and adopt batch normalization (Ioffe and Szegedy 2015) right after each matrix multiplication and before activation. We employ the Adam optimizer (Kingma and Ba 2015) to optimize the parameters with batch-size being 64. The loss function is computed based on 256 validation sample paths. We initialize all the parameters using a uniform or normal distribution and run each experiment 5 times to report the average result.
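To make the layer sizes concrete, the following NumPy sketch runs a forward pass through one such subnetwork with input dimension d+1, two hidden layers of width d+10, and output dimension d. It is not the training code used for the experiments. The omission of bias terms, the use of plain batch statistics without learnable scale and shift, the linear output layer without normalization, and the scaled normal initialization are all simplifying assumptions of this sketch.

```python
import numpy as np


def batch_norm(x, eps=1e-6):
    """Normalize each feature over the batch (no learnable scale or shift)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_subnet(d, rng):
    """Weight matrices for layer sizes d+1 -> d+10 -> d+10 -> d."""
    sizes = [d + 1, d + 10, d + 10, d]
    return [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def subnet_forward(weights, x, y):
    """Map a batch of (X_i, Y_i) to an approximation of Z_i."""
    a = np.concatenate([x, y[:, None]], axis=1)  # shape (batch, d + 1)
    for w in weights[:-1]:
        # matrix multiplication, then batch normalization, then ReLU
        a = np.maximum(batch_norm(a @ w), 0.0)
    return a @ weights[-1]                       # shape (batch, d)


rng = np.random.default_rng(0)
d, batch = 100, 64
weights = init_subnet(d, rng)
z = subnet_forward(weights, rng.normal(size=(batch, d)), rng.normal(size=batch))
```

Dropping Y from the input, as mentioned above, would only change the first layer size from d+1 to d.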

### Example 1

The first problem is adapted from (Milstein and Tretyakov 2006; Huijskens et al. 2016), in which the original spatial dimension of the problem is 1. We consider the following coupled FBSDEs

\left\{\begin{aligned} &X_{j,t} = x_{j,0} + \int_{0}^{t}\frac{X_{j,s}\left(1+X_{j,s}^{2}\right)}{\left(2+X_{j,s}^{2}\right)^{3}}\, \mathrm{d}s \\ & \qquad+ \int_{0}^{t} \frac{1+X_{j,s}^{2}}{2+X_{j,s}^{2}}\sqrt{\frac{1+2Y_{s}^{2}}{1+Y_{s}^{2}+\exp{\left(-\frac{2|X_{s}|^{2}}{d(s+5)}\right)}}}\, \mathrm{d}W_{j,s}, \quad j=1,\dots,d,\\ &Y_{t} = \exp{\left(-\frac{|X_{T}|^{2}}{d(T+5)}\right)} \\ & \qquad~ + \int_{t}^{T}\left[a(s,X_{s},Y_{s})+\sum_{j=1}^{d}b(s,X_{j,s},Y_{s})Z_{j,s}\right]\, \mathrm{d}s - \int_{t}^{T}(Z_{s})^{\mathrm{T}}\, \mathrm{d}W_{s}, \end{aligned}\right.
(6.1)

where $$X_{j,t}, Z_{j,t}, W_{j,t}$$ denote the j-th components of $$X_{t}, Z_{t}, W_{t}$$, and the coefficient functions are given as

\begin{aligned} a(t,x,u) = &~ \frac{1}{d(t+5)}\exp{\left(-\frac{|x|^{2}}{d(t+5)}\right)} \\ &~ \times \sum_{j=1}^{d} \left\{\frac{4x_{j}^{2}\left(1+x_{j}^{2}\right)}{\left(2+x_{j}^{2}\right)^{3}} + \frac{\left(1+x_{j}^{2}\right)^{2}}{\left(2+x_{j}^{2}\right)^{2}} - \frac{2x_{j}^{2}\left(1+x_{j}^{2}\right)^{2}}{d(t+5)\left(2+x_{j}^{2}\right)^{2}} - \frac{x_{j}^{2}}{t+5}\right\}, \\ b(t,x_{j},u) = &~ \frac{x_{j}}{\left(2+x_{j}^{2}\right)^{2}}\sqrt{\frac{1+u^{2}+\exp{\left(-\frac{2|x|^{2}}{d(t+5)}\right)}}{1+2u^{2}}}. \end{aligned}

It can be verified by Itô’s formula that the Y part of the solution of (6.1) is given by

\begin{aligned} Y_{t} = \exp{\left(-\frac{|X_{t}|^{2}}{d(t+5)}\right)}. \end{aligned}

Let $$\xi =(1,1,\dots,1)$$ (100-dimensional), T=5,N=160. The initial guess of Y0 is generated from a uniform distribution on the interval [2,4] while the true value of Y0≈0.81873. We train 25000 steps with an exponential decay learning rate that decays every 100 steps, with the starting learning rate being 1e-2 and ending learning rate being 1e-5. Figure 1 illustrates the mean of the loss function and relative approximation error of Y0 against the number of iteration steps. All runs converged and the average final relative error of Y0 is 0.39%.
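The reference value of Y0 and the learning-rate range quoted above can be reproduced with a few lines of Python. The closed form is the one stated for the Y part of the solution. The staircase schedule with a constant decay factor per 100 steps is our assumption, since only the starting rate, the ending rate, and the decay interval are specified; the function names are likewise ours.

```python
import math


def exact_y(t, x, d):
    """Closed-form Y part of Example 1 evaluated at time t and state x."""
    return math.exp(-sum(xj * xj for xj in x) / (d * (t + 5.0)))


def staircase_lr(step, lr_start=1e-2, lr_end=1e-5,
                 total_steps=25000, decay_every=100):
    """Piecewise-constant exponential decay from lr_start to lr_end (assumed form)."""
    n_decays = total_steps // decay_every
    rate = (lr_end / lr_start) ** (1.0 / n_decays)
    return lr_start * rate ** (step // decay_every)


d = 100
y0 = exact_y(0.0, [1.0] * d, d)  # equals exp(-0.2)
```

Under this assumed schedule the rate is multiplied by the same factor every 100 steps, and 250 such decays bring it from 1e-2 down to 1e-5.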

### Example 2

The second problem is adapted from (Bender and Zhang 2008), in which the spatial dimension is originally tested up to 10. The coupled FBSDEs are given by

\left\{\begin{aligned} &X_{j,t} = x_{j,0} + \int_{0}^{t} \sigma Y_{s}\, \mathrm{d}W_{j,s}, \quad j=1,\dots,d,\\ &Y_{t} = D\sum_{j=1}^{d} \sin(X_{j,T}) \\ &\qquad~ + \int_{t}^{T} \left[-rY_{s} + \frac12 e^{-3r(T-s)}\sigma^{2} \left(D\sum_{j=1}^{d} \sin(X_{j,s})\right)^{3}\right]\, \mathrm{d}s - \int_{t}^{T}(Z_{s})^{\mathrm{T}}\, \mathrm{d}W_{s}, \end{aligned}\right.
(6.2)

where σ>0,r,D are constants. One can easily check by Itô’s formula that the Y part of the solution of (6.2) is

$$Y_{t} = e^{-r(T-t)}D\sum_{j=1}^{d} \sin(X_{j,t}).$$

Let $$\xi =(\pi /2,\pi /2,\dots,\pi /2)$$ (100-dimensional), T=1,r=0.1,σ=0.3,D=0.1. The initial guess of Y0 is generated from a uniform distribution on the interval [0,1] while the true value of Y0≈9.04837. We train 5000 steps with an exponential decay learning rate that decays every 100 steps, with the starting learning rate being 1e-2 and the ending learning rate being 1e-3. When h=0.005 (N=200), the relative approximation error of Y0 is 0.09%. Furthermore, we test the influence of the time partition by choosing different values of N. In all cases, the training converged, and we plot in Fig. 2 the mean of relative error of Y0 against the number of time steps N. It is clearly shown that the error decreases as N increases (h decreases).
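The dependence on the time partition can also be explored without any neural network. The sketch below simulates the discretized system (2.3) for this example with the exact initial value Y0 and the exact feedback $$Z_{t_{i}}^{\pi} = f_{i}(X_{t_{i}}^{\pi})$$ from Theorem 6, so that only the h term on the right-hand side of (5.12) remains, and returns a Monte Carlo estimate of the terminal mismatch $$E|g(X_{T}^{\pi}) - Y_{T}^{\pi}|^{2}$$. It is a sketch under stated assumptions and not the authors' code: the function name, the number of sample paths, and the seed are our choices, and the driver is written in the form $$-ry + \frac12 e^{-3r(T-t)}\sigma^{2}(D\sum_{j}\sin x_{j})^{3}$$, which is the form consistent with the stated exact solution.

```python
import numpy as np


def terminal_loss(N, d=100, T=1.0, r=0.1, sigma=0.3, D=0.1,
                  n_paths=1024, seed=0):
    """Monte Carlo estimate of E|g(X_T) - Y_T|^2 for the Euler scheme of Example 2."""
    rng = np.random.default_rng(seed)
    h = T / N
    x = np.full((n_paths, d), np.pi / 2)
    y = np.full(n_paths, np.exp(-r * T) * D * d)  # exact Y_0 for xi = (pi/2, ..., pi/2)
    for i in range(N):
        t = i * h
        dw = rng.normal(0.0, np.sqrt(h), size=(n_paths, d))
        s = D * np.sin(x).sum(axis=1)
        u = np.exp(-r * (T - t)) * s              # u(t_i, X_i)
        # exact feedback f_i(x) = sigma^T(t, x, u) grad_x u, with diffusion sigma * u * I_d
        z = (sigma * u * np.exp(-r * (T - t)) * D)[:, None] * np.cos(x)
        f = -r * y + 0.5 * np.exp(-3.0 * r * (T - t)) * sigma**2 * s**3
        x_next = x + sigma * y[:, None] * dw      # forward step uses Y_i
        y = y - f * h + (z * dw).sum(axis=1)      # backward component run forward
        x = x_next
    g = D * np.sin(x).sum(axis=1)
    return float(np.mean((g - y) ** 2))


if __name__ == "__main__":
    for N in (10, 50, 200):
        print(N, terminal_loss(N))
```

Comparing the returned values for a coarse and a fine partition gives a quick empirical check of the h term in (5.12), independently of any training error.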

## Availability of data and materials

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

## Abbreviations

PDE: Partial differential equation

SDE: Stochastic differential equation

BSDE: Backward stochastic differential equation

FBSDE: Forward-backward stochastic differential equation

## References

1. Antonelli, F.: Backward-forward stochastic differential equations. Ann. Appl. Probab. 3, 777–793 (1993).

2. Arora, R., Basu, A., Mianjy, P., Mukherjee, A.: Understanding deep neural networks with rectified linear units (2018). In: Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=B1J_rgWRW.

3. Barron, A. R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory. 39(3), 930–945 (1993).

4. Beck, C., E, W., Jentzen, A.: Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations (2017). arXiv preprint arXiv:1709.05963.

5. Bellman, R. E.: Dynamic Programming. Princeton University Press, USA (1957).

6. Bender, C., Steiner, J.: Least-squares Monte Carlo for backward SDEs. In: Carmona, R., Del Moral, P., Hu, P., Oudjane, N. (eds.) Numerical Methods in Finance. Springer Proceedings in Mathematics, vol. 12, pp. 257–289. Springer, Berlin (2012).

7. Bender, C., Zhang, J.: Time discretization and Markovian iteration for coupled FBSDEs. Ann. Appl. Probab. 18(1), 143–177 (2008).

8. Berner, J., Grohs, P., Jentzen, A.: Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations (2018). arXiv preprint arXiv:1809.03062.

9. Bölcskei, H., Grohs, P., Kutyniok, G., Petersen, P.: Optimal approximation with sparsely connected deep neural networks (2017). arXiv preprint arXiv:1705.01714.

10. Bouchard, B., Ekeland, I., Touzi, N.: On the Malliavin approach to Monte Carlo approximation of conditional expectations. Finance Stoch. 8(1), 45–71 (2004).

11. Bouchard, B., Touzi, N.: Discrete-time approximation and Monte-Carlo simulation of backward stochastic differential equations. Stoch. Process. Appl. 111(2), 175–206 (2004).

12. Cohen, N., Sharir, O., Shashua, A.: On the expressive power of deep learning: A tensor analysis. In: Feldman, V., Rakhlin, A., Shamir, O. (eds.) 29th Annual Conference on Learning Theory, vol. 49, pp. 698–728. PMLR, Columbia University, New York (2016). http://proceedings.mlr.press/v49/cohen16.html.

13. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989).

14. E, W., Han, J., Jentzen, A.: Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat. 5(4), 349–380 (2017).

15. E, W., Hutzenthaler, M., Jentzen, A., Kruse, T.: On multilevel Picard numerical approximations for high-dimensional nonlinear parabolic partial differential equations and high-dimensional nonlinear backward stochastic differential equations. J. Sci. Comput. 79(3), 1534–1571 (2019).

16. Eldan, R., Shamir, O.: The power of depth for feedforward neural networks. In: Feldman, V., Rakhlin, A., Shamir, O. (eds.) 29th Annual Conference on Learning Theory, vol. 49, pp. 907–940. PMLR, Columbia University, New York (2016). http://proceedings.mlr.press/v49/eldan16.html.

17. Funahashi, K. I.: On the approximate realization of continuous mappings by neural networks. Neural Netw. 2(3), 183–192 (1989).

18. Grohs, P., Hornung, F., Jentzen, A., von Wurstemberger, P.: A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations (2018). arXiv preprint arXiv:1809.02362.

19. Han, J., Hu, R.: Deep fictitious play for finding Markovian Nash equilibrium in multi-agent games (2019). arXiv preprint arXiv:1912.01809.

20. Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. 115(34), 8505–8510 (2018).

21. Han, J., Lu, J., Zhou, M.: Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion Monte Carlo like approach (2020). arXiv preprint arXiv:2002.02600.

22. Henry-Labordere, P.: Counterparty risk valuation: A marked branching diffusion approach (2012). Available at SSRN 1995503. https://arxiv.org/abs/1203.2369.

23. Henry-Labordere, P., Oudjane, N., Tan, X., Touzi, N., Warin, X.: Branching diffusion representation of semilinear PDEs and Monte Carlo approximation. In: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 55, pp. 184–210. Institut Henri Poincaré, Paris (2019). https://projecteuclid.org/euclid.aihp/1547802399.

24. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989).

25. Huijskens, T., Ruijter, M., Oosterlee, C.: Efficient numerical Fourier methods for coupled forward–backward SDEs. J. Comput. Appl. Math. 296, 593–612 (2016).

26. Hutzenthaler, M., Jentzen, A., Kruse, T., et al.: Multilevel Picard iterations for solving smooth semilinear parabolic heat equations (2016). arXiv preprint arXiv:1607.03295.

27. Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T. A.: A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations (2020). arXiv preprint arXiv:1901.10854.

28. Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T. A., von Wurstemberger, P.: Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations (2018). arXiv preprint arXiv:1807.01212.

29. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 448–456. JMLR.org, Lille (2015).

30. Jentzen, A., Salimova, D., Welti, T.: A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients (2018). arXiv preprint arXiv:1809.07321.

31. Kingma, D., Ba, J.: Adam: a method for stochastic optimization (2015). In: Proceedings of the International Conference on Learning Representations (ICLR).

32. Liang, S., Srikant, R.: Why deep neural networks for function approximation? (2017). In: Proceedings of the International Conference on Learning Representations (ICLR).

33. Ma, J., Protter, P., Yong, J.: Solving forward-backward stochastic differential equations explicitly–a four step scheme. Probab. Theory Relat. Fields. 98(3), 339–359 (1994).

34. Ma, J., Yong, J.: Forward-Backward Stochastic Differential Equations and their Applications. Springer, Berlin Heidelberg (2007).

35. Mhaskar, H. N., Poggio, T.: Deep vs. shallow networks: An approximation theory perspective. Anal. Appl. 14(06), 829–848 (2016).

36. Milstein, G., Tretyakov, M.: Numerical algorithms for forward-backward stochastic differential equations. SIAM J. Sci. Comput. 28(2), 561–582 (2006).

37. Pardoux, E., Peng, S.: Backward stochastic differential equations and quasilinear parabolic partial differential equations. Springer, Berlin (1992).

38. Pardoux, E., Tang, S.: Forward-backward stochastic differential equations and quasilinear parabolic PDEs. Prob. Theory Relat. Fields. 114(2), 123–150 (1999).

39. Zhang, J.: A numerical scheme for BSDEs. Ann. Appl. Prob. 14(1), 459–488 (2004).

## Acknowledgments

We thank Professor Weinan E for his valuable comments and suggestions in the preparation of this work.

## Funding

The authors received no specific funding for this work.

## Author information

### Contributions

Both authors read and approved the final manuscript.

### Corresponding author

Correspondence to Jiequn Han.

## Ethics declarations

### Competing interests

The authors declare that they have no competing interests.

## Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
