Convergence of the Deep BSDE Method for Coupled FBSDEs

The recently proposed numerical algorithm, deep BSDE method, has shown remarkable performance in solving high-dimensional forward-backward stochastic differential equations (FBSDEs) and parabolic partial differential equations (PDEs). This article lays a theoretical foundation for the deep BSDE method in the general case of coupled FBSDEs. In particular, a posteriori error estimation of the solution is provided and it is proved that the error converges to zero given the universal approximation capability of neural networks. Numerical results are presented to demonstrate the accuracy of the analyzed algorithm in solving high-dimensional coupled FBSDEs.


Introduction
Forward-backward stochastic differential equations (FBSDEs) and partial differential equations (PDEs) of parabolic type are ubiquitous modeling tools that have found numerous applications in stochastic control, finance, physics, and other fields. In most situations encountered in practice the equations cannot be solved analytically and require numerical algorithms to provide approximate solutions. On the one hand, the dominant choices of numerical algorithms for PDEs are mesh-based methods, such as finite differences, finite elements, etc. On the other hand, FBSDEs can be tackled directly through probabilistic means, with appropriate methods for the approximation of conditional expectations. Since these two kinds of equations are intimately connected through the nonlinear Feynman-Kac formula [1], the algorithms designed for one kind of equation can often be used to solve the other.
However, the aforementioned numerical algorithms become increasingly difficult, if not infeasible, as the dimension increases. They are doomed to run into the so-called "curse of dimensionality" [2] when the dimension is high, namely, the computational complexity grows exponentially as the dimension grows. The classical mesh-based algorithms for PDEs require a mesh of size O(N^d). The simulation of FBSDEs faces a similar difficulty in the general nonlinear case, due to the need to compute conditional expectations in high dimension. The conventional methods, including least squares regression [3], the Malliavin approach [4], and kernel regression [5], are all of exponential complexity. There are a limited number of cases where practical high-dimensional algorithms are available. For example, in the linear case, the Feynman-Kac formula and Monte Carlo simulation together provide an efficient approach to solving PDEs and the associated BSDEs numerically.
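To illustrate why the linear case escapes the curse of dimensionality, here is a minimal Monte Carlo sketch (illustrative only, not part of the original text) for the 100-dimensional heat equation, whose linear Feynman-Kac representation is u(0, x) = E[g(x + W_T)]:

```python
import numpy as np

def mc_heat_solution(g, x, T, n_samples=100_000, seed=0):
    """Estimate u(0, x) = E[g(x + W_T)] for the d-dimensional heat equation
    u_t + 0.5 * Laplace(u) = 0 with terminal condition u(T, .) = g, via the
    linear Feynman-Kac formula and plain Monte Carlo (names are illustrative)."""
    rng = np.random.default_rng(seed)
    # W_T ~ N(0, T * I_d); one row per sample path
    w_T = rng.normal(scale=np.sqrt(T), size=(n_samples, x.shape[0]))
    return g(x + w_T).mean()

# cost per sample grows linearly in the dimension d, not exponentially
d, T = 100, 1.0
u_mc = mc_heat_solution(lambda v: np.sum(v**2, axis=-1), np.zeros(d), T)
# exact value for this g: E|W_T|^2 = d * T = 100
```

The number of samples controls the statistical error independently of d, so the total cost is linear in the dimension per sample; a nonlinear driver f breaks this simple representation, which is the difficulty the deep BSDE method addresses.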
In addition, methods based on the branching diffusion process [6,7] and multilevel Picard iteration [8,9,10] overcome the curse of dimensionality in their respective settings. We refer to [9] for a detailed discussion of the complexity of the algorithms mentioned above. Overall, there is so far no numerical algorithm in the literature proved to overcome the curse of dimensionality for general quasilinear parabolic PDEs and the corresponding FBSDEs.
A recently developed algorithm, called the deep BSDE method [11,12], has shown astonishing power in solving general high-dimensional FBSDEs and parabolic PDEs [13,14,15]. In contrast to conventional methods, the deep BSDE method employs neural networks to approximate the unknown gradients and reformulates the original equation-solving problem into a stochastic optimization problem. Thanks to the universal approximation capability and parsimonious parameterization of neural networks, in practice the objective function can be effectively optimized in high-dimensional cases, and the function values of interest are obtained quite accurately.
The deep BSDE method was initially proposed for decoupled FBSDEs. In this paper, we extend the method to deal with coupled FBSDEs and a broader class of quasilinear parabolic PDEs. Furthermore, we present an error analysis of the proposed scheme, which includes decoupled FBSDEs as a special case. Our theoretical result consists of two theorems. Theorem 1 provides a posteriori error estimation of the deep BSDE method: as long as the objective function is optimized to be close to zero under fine time discretization, the approximate solution is close to the true solution. In other words, in practice, the accuracy of the numerical solution is effectively indicated by the value of the objective function. Theorem 2 shows that such a situation is attainable by relating the infimum of the objective function to the expressive power of neural networks. As an implication of the universal approximation property (in the L^2 sense), there exist neural networks with suitable parameters such that the obtained numerical solution approximates the true solution well. To the best of our knowledge, this is the first theoretical result for the deep BSDE method for solving FBSDEs and parabolic PDEs. Although our numerical algorithm is based on neural networks, the theoretical result provided here is equally applicable to algorithms based on other forms of function approximation.
The article is organized as follows. In section 2, we precisely state our numerical scheme for coupled FBSDEs and quasilinear parabolic PDEs and give the main theoretical results of the proposed numerical scheme. In section 3, the basic assumptions and some useful results from the literature are given for later use. The proofs of the two main theorems are provided in section 4 and section 5, respectively. Some numerical experiments with the proposed scheme are presented in section 6.

A Numerical Scheme for Coupled FBSDEs and Main Results
Let T ∈ (0, +∞) be the terminal time, (Ω, F, {F_t}_{0≤t≤T}, P) be a filtered probability space equipped with a d-dimensional standard Brownian motion {W_t}_{0≤t≤T} starting from 0, and ξ be a square-integrable random variable independent of {W_t}_{0≤t≤T}. We use the same notation (Ω, F, {F_t}_{0≤t≤T}, P) to denote the filtered probability space generated by {W_t + ξ}_{0≤t≤T}. The notation |x| denotes the Euclidean norm of a vector x, and ‖A‖ = [trace(A^T A)]^{1/2} denotes the Frobenius norm of a matrix A.

Consider the following coupled FBSDEs

X_t = ξ + ∫_0^t b(s, X_s, Y_s) ds + ∫_0^t σ(s, X_s, Y_s) dW_s,   (2.1)
Y_t = g(X_T) + ∫_t^T f(s, X_s, Y_s, Z_s) ds − ∫_t^T (Z_s)^T dW_s,   (2.2)

in which X_t takes values in R^m, Y_t takes values in R, and Z_t takes values in R^d. Here we assume Y_t to be one-dimensional to simplify the presentation; the result can be extended without any difficulty to the case where Y_t is multi-dimensional. We say (X_t, Y_t, Z_t) is a solution of the above FBSDEs if all its components are F_t-adapted and square-integrable and together they satisfy equations (2.1)(2.2). Solving coupled FBSDEs numerically is more difficult than solving decoupled FBSDEs. Except for the Picard iteration method developed in [16], most methods exploit the relation to quasilinear parabolic PDEs via the four-step scheme in [17]. This type of method suffers from high dimensionality due to the spatial discretization of PDEs. In contrast, our strategy, which starts from simulating the coupled FBSDEs directly, is a new purely probabilistic scheme. To state the numerical algorithm precisely, we consider a partition of the time interval [0, T], π: 0 = t_0 < t_1 < · · · < t_N = T, with h := max_{0≤i≤N−1} (t_{i+1} − t_i). Inspired by the nonlinear Feynman-Kac formula that will be introduced below, we view Y_0 as a function of X_0 and view Z_t as a function of X_t and Y_t. Equipped with this viewpoint, our goal becomes finding appropriate functions µ^π_0 and φ^π_i such that µ^π_0(X^π_0) and φ^π_i(X^π_{t_i}, Y^π_{t_i}) can serve as good surrogates of Y_0 and Z_{t_i}, respectively. To this end, we consider the classical Euler scheme

X^π_0 = ξ,   Y^π_0 = µ^π_0(X^π_0),
Z^π_{t_i} = φ^π_i(X^π_{t_i}, Y^π_{t_i}),
X^π_{t_{i+1}} = X^π_{t_i} + b(t_i, X^π_{t_i}, Y^π_{t_i}) ∆t_i + σ(t_i, X^π_{t_i}, Y^π_{t_i}) ∆W_i,
Y^π_{t_{i+1}} = Y^π_{t_i} − f(t_i, X^π_{t_i}, Y^π_{t_i}, Z^π_{t_i}) ∆t_i + (Z^π_{t_i})^T ∆W_i,   (2.3)

where ∆t_i := t_{i+1} − t_i and ∆W_i := W_{t_{i+1}} − W_{t_i}. Without loss of clarity, here we use the notation X^π_0 for X^π_{t_0}, X^π_T for X^π_{t_N}, etc. Following the spirit of the deep BSDE method, we employ a stochastic optimizer to solve the following stochastic optimization problem

inf_{µ^π_0 ∈ N'_0, φ^π_i ∈ N_i} E|g(X^π_T) − Y^π_T|^2,   (2.4)

where N'_0 and N_i (0 ≤ i ≤ N − 1) are parametric function spaces generated by neural networks. To see intuitively where the objective function (2.4) comes from, we consider the following variational problem:

inf_{Y_0, {Z_t}_{0≤t≤T}} E|g(X_T) − Y_T|^2,
subject to X_t = ξ + ∫_0^t b(s, X_s, Y_s) ds + ∫_0^t σ(s, X_s, Y_s) dW_s,
Y_t = Y_0 − ∫_0^t f(s, X_s, Y_s, Z_s) ds + ∫_0^t (Z_s)^T dW_s,   (2.5)

where Y_0 is F_0-measurable and square-integrable and Z_t is an F_t-adapted square-integrable process. The solution of the FBSDEs (2.1)(2.2) is a minimizer of the above problem since the loss function attains zero when it is evaluated at the solution.
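To make the discretized objective concrete, the following numpy sketch (illustrative only, not the paper's implementation) rolls the Euler scheme forward on a batch of sample paths and evaluates the loss E|g(X^π_T) − Y^π_T|^2. Here µ^π_0 and φ^π_i are plain callables, whereas in the deep BSDE method they are neural networks trained by a stochastic optimizer; for simplicity the diffusion σ is taken scalar-valued (diagonal).

```python
import numpy as np

def deep_bsde_objective(b, sigma, f, g, mu0, phis, xi, T, N, batch, rng):
    """Monte Carlo evaluation of the objective E|g(X^pi_T) - Y^pi_T|^2
    for the Euler scheme of the coupled FBSDEs.
    mu0 plays the role of mu^pi_0 (surrogate of Y_0); phis[i] plays the
    role of phi^pi_i (surrogate of Z_{t_i}), a function of (X, Y)."""
    h = T / N
    x = np.tile(xi, (batch, 1))             # X^pi_0 = xi
    y = mu0(x)                              # Y^pi_0 = mu^pi_0(X^pi_0)
    for i in range(N):
        t = i * h
        z = phis[i](x, y)                   # Z^pi_{t_i}, shape (batch, d)
        dw = rng.normal(scale=np.sqrt(h), size=x.shape)
        x_next = x + b(t, x, y) * h + sigma(t, x, y) * dw
        y = y - f(t, x, y, z) * h + np.sum(z * dw, axis=-1, keepdims=True)
        x = x_next
    return np.mean((g(x) - y) ** 2)         # the quantity to be minimized

# toy check with b = f = 0, sigma = 1, Y_0 = 0, Z = 0, g(x) = sum(x), xi = 0:
# the loss is then E|sum_j W_{j,T}|^2 = d * T
rng = np.random.default_rng(0)
d, T, N, batch = 2, 1.0, 20, 20000
loss = deep_bsde_objective(
    b=lambda t, x, y: 0.0,
    sigma=lambda t, x, y: 1.0,
    f=lambda t, x, y, z: 0.0,
    g=lambda x: np.sum(x, axis=-1, keepdims=True),
    mu0=lambda x: np.zeros((x.shape[0], 1)),
    phis=[lambda x, y: np.zeros_like(x)] * N,
    xi=np.zeros(d), T=T, N=N, batch=batch, rng=rng)
# loss should be close to d * T = 2
```

Minimizing this quantity over µ^π_0 and the φ^π_i (with automatic differentiation and a stochastic optimizer) is exactly the optimization the scheme poses.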
In addition, the wellposedness of the FBSDEs (under some regularity conditions) ensures the existence and uniqueness of the minimizer. Therefore, we expect that (2.4), as a discretized counterpart of (2.5), defines a benign optimization problem and that the associated near-optimal solution provides us with a good approximate solution of the original FBSDEs. The reason we do not represent Z_{t_i} as a function of X_{t_i} only is that the process {X^π_{t_i}}_{0≤i≤N} is not Markovian, while the process {(X^π_{t_i}, Y^π_{t_i})}_{0≤i≤N} is Markovian, which facilitates our analysis considerably. If b and σ are both independent of Y, then the FBSDEs (2.1)(2.2) are decoupled, and we can take φ^π_i as a function of X^π_{t_i} only, as in the numerical scheme introduced in [11,12]. Our two main theorems regarding the deep BSDE method are the following, mainly on the justification and properties of the objective function (2.4) in the general coupled case, regardless of the specific choice of parametric function spaces. An important assumption for the two theorems is the so-called weak coupling or monotonicity condition, which will be explained in detail in section 3. The precise statements of the theorems can be found in Theorem 1′ (section 4) and Theorem 2′ (section 5), respectively.
Theorem 1. Under some assumptions, there exists a constant C, independent of h, d, and m, such that for sufficiently small h,

sup_{t∈[0,T]} ( E|X_t − X̂^π_t|^2 + E|Y_t − Ŷ^π_t|^2 ) + ∫_0^T E|Z_t − Ẑ^π_t|^2 dt ≤ C[ h + E|g(X^π_T) − Y^π_T|^2 ],   (2.6)

where (X̂^π_t, Ŷ^π_t, Ẑ^π_t) denotes the piecewise constant interpolation of the discrete process defined in (2.3).

Theorem 2. Under some assumptions, there exists a constant C, independent of h, d, and m, such that for sufficiently small h, the infimum of the objective function (2.4) is bounded by C times the sum of h and the errors of best approximation, by the function spaces N'_0 and N_i, of Y_0 and of the relevant conditional expectations of Z.

Briefly speaking, Theorem 1 states that the simulation error (the left side of equation (2.6)) can be bounded through the value of the objective function (2.4). To the best of our knowledge, this is the first result on the error estimation of numerical schemes for coupled FBSDEs concerning both the time discretization error and the terminal distance. Theorem 2 states that the optimal value of the objective function can be small if the approximation capability of the parametric function spaces (N'_0 and N_i above) is high. Neural networks are a promising candidate for such a requirement, especially in high-dimensional problems. There are numerous results, dating back to the 90s (see, e.g., [18,19,20,21,22,23,24,25,26,27]), on the universal approximation capability and complexity of neural networks. There are also some recent analyses [28,29,30,31] of approximating the solutions of certain parabolic partial differential equations with neural networks. However, the problem is still far from resolved. Theorem 2 implies that if the involved conditional expectations can be approximated by neural networks whose number of parameters grows at most polynomially in both the dimension and the reciprocal of the required accuracy, then the solutions of the considered FBSDEs can be approximated in practice without the curse of dimensionality. Under what conditions this assumption is true is beyond the scope of this work and remains for further investigation.
The above-mentioned scheme (2.3)(2.4) is for solving FBSDEs. The so-called nonlinear Feynman-Kac formula, connecting FBSDEs with quasilinear parabolic PDEs, provides an approach to numerically solving the quasilinear parabolic PDE (2.7) below through the same scheme. We recall a concrete version of the nonlinear Feynman-Kac formula in Theorem 3 below and refer interested readers to, e.g., [32] for more details. According to this formula, the term E|Y_0 − Y^π_0|^2 can be interpreted as E|u(0, ξ) − µ^π_0(ξ)|^2. Therefore, we can choose the random variable ξ with a delta distribution, a uniform distribution on a bounded region, or any other distribution we are interested in. After solving the optimization problem, we obtain µ^π_0(ξ) as an approximation of u(0, ξ). See [11,12] for more details.
Theorem 3. Assume
1. m = d and b(t, x, y), σ(t, x, y), f(t, x, y, z) are smooth functions with bounded first-order derivatives with respect to x, y, z.
2. There exist a positive continuous function ν and a constant µ such that, for all η ∈ R^d,
ν(|y|)|η|^2 ≤ η^T σσ^T(t, x, y) η ≤ µ|η|^2.
3. There exists a constant α ∈ (0, 1) such that g is bounded in the Hölder space C^{2+α}(R^d).
Then the following quasilinear PDE has a unique classical solution u(t, x) that is bounded with bounded u_t, ∇_x u, and ∇²_x u:

u_t + (1/2) trace(σσ^T(t, x, u) ∇²_x u) + b^T(t, x, u) ∇_x u + f(t, x, u, σ^T(t, x, u) ∇_x u) = 0,   u(T, x) = g(x).   (2.7)

The associated FBSDEs (2.1)(2.2) then have a unique solution given by Y_t = u(t, X_t), Z_t = σ^T(t, X_t, u(t, X_t)) ∇_x u(t, X_t), where X_t is the solution of the following SDE:

X_t = ξ + ∫_0^t b(s, X_s, u(s, X_s)) ds + ∫_0^t σ(s, X_s, u(s, X_s)) dW_s.

Remark. The statement regarding the FBSDEs (2.1)(2.2) in Theorem 3 is developed through a PDE-based argument, which requires m = d, uniform ellipticity of σ, and high-order smoothness of b, σ, f, and g. An analogous result through a probabilistic argument is given below in Theorem 4 (point 4). In that case, we only need the Lipschitz condition for all of the involved functions, in addition to the weak coupling or monotonicity conditions stated in Assumption 3. Note that the Lipschitz condition alone does not guarantee the existence of a solution to the coupled FBSDEs, even in the situation where b, f, σ are linear (see [16,32] for a concrete counterexample).
Remark. Theorem 3 also implies that the assumption that the drift function b depends only on x, y entails no loss of generality. If b depends on z as well, one can move the associated term in (2.7) into the nonlinearity f and apply the nonlinear Feynman-Kac formula back to obtain an equivalent system of coupled FBSDEs in which the new drift function is independent of z.
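To make this rewriting explicit (a small derivation not spelled out in the text, valid under the assumptions of Theorem 3, where m = d and σ is invertible, so that ∇_x u = σ^{-T}(t, x, u) z):

```latex
b(t,x,u,\sigma^{\mathrm T}\nabla_x u)^{\mathrm T}\nabla_x u
  = \underbrace{b(t,x,u,0)^{\mathrm T}\nabla_x u}_{\text{new drift term}}
  + \underbrace{\bigl[b(t,x,u,z)-b(t,x,u,0)\bigr]^{\mathrm T}
      \sigma^{-\mathrm T}(t,x,u)\,z}_{\text{absorbed into }\tilde f},
  \qquad z=\sigma^{\mathrm T}(t,x,u)\,\nabla_x u,
```

so the PDE (2.7) is unchanged with the new drift b̃(t, x, y) := b(t, x, y, 0) and the new nonlinearity f̃(t, x, y, z) := f(t, x, y, z) + [b(t, x, y, z) − b(t, x, y, 0)]^T σ^{-T}(t, x, y) z.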

Preliminaries
In this section, we introduce our assumptions and two useful results from [16].

Assumption 1. (i) There exist (possibly negative) constants k_b and k_f such that
[b(t, x_1, y) − b(t, x_2, y)]^T (x_1 − x_2) ≤ k_b |x_1 − x_2|^2,
[f(t, x, y_1, z) − f(t, x, y_2, z)] (y_1 − y_2) ≤ k_f |y_1 − y_2|^2.
(ii) b, σ, f, g are uniformly Lipschitz continuous with respect to (x, y, z). In particular, there are non-negative constants K, b_y, σ_x, σ_y, f_x, f_z, and g_x such that
|b(t, x_1, y_1) − b(t, x_2, y_2)|^2 ≤ K|x_1 − x_2|^2 + b_y |y_1 − y_2|^2,
‖σ(t, x_1, y_1) − σ(t, x_2, y_2)‖^2 ≤ σ_x |x_1 − x_2|^2 + σ_y |y_1 − y_2|^2,
|f(t, x_1, y_1, z_1) − f(t, x_2, y_2, z_2)|^2 ≤ f_x |x_1 − x_2|^2 + K|y_1 − y_2|^2 + f_z |z_1 − z_2|^2,
|g(x_1) − g(x_2)|^2 ≤ g_x |x_1 − x_2|^2.
(iii) b(t, 0, 0), f(t, 0, 0, 0), and σ(t, 0, 0) are bounded. In particular, there are constants b_0, σ_0, f_0, and g_0 such that |b(t, 0, 0)| ≤ b_0, ‖σ(t, 0, 0)‖ ≤ σ_0, |f(t, 0, 0, 0)| ≤ f_0, and |g(0)| ≤ g_0.
We note here that b_y et al. are all constants, not partial derivatives. For convenience, we use L to denote the set of all the constants mentioned above and assume K is an upper bound of L.
Assumption 2. b, σ, f are uniformly Hölder-1/2 continuous with respect to t. We assume the same constant K to be an upper bound of the square of the Hölder constants as well.
Assumption 3. One of the following five cases holds: 1. Small time duration, that is, T is small.
2. Weak coupling of Y into the forward SDE (2.1), that is, b y and σ y are small. In particular, if b y = σ y = 0, then the forward equation does not depend on the backward one and, thus, equations (2.1)(2.2) are decoupled.
3. Weak coupling of X into the backward SDE (2.2), that is, f x and g x are small.
In particular, if f x = g x = 0, then the backward equation does not depend on the forward one and, thus, equations (2.1)(2.2) are also decoupled. In fact, in this case, Z = 0 and (2.2) reduces to an ODE.
4. f is strongly decreasing in y, that is, k f is very negative.
5. b is strongly decreasing in x, that is, k b is very negative.
The assumptions stated above are usually called weak coupling and monotonicity conditions in the literature [16,33,34]. To make this precise, we define several derived constants. Then, a specific quantitative form of the above five conditions can be summarized as the two inequalities in (3.1). In other words, if any of the five weak coupling and monotonicity conditions holds to a certain extent, the two inequalities in (3.1) hold. Below, we refer to (3.1) as Assumption 3 and to the five general qualitative conditions described above as the weak coupling and monotonicity conditions. The above three assumptions are the basic assumptions in [16], which we need in order to use the results from [16], as stated in Theorems 4 and 5 below. Theorem 4 gives the connection between coupled FBSDEs and quasilinear parabolic PDEs under weaker conditions. Theorem 5 provides the convergence of the implicit scheme for coupled FBSDEs. Our work primarily uses the same set of assumptions, except that we assume some further quantitative restrictions related to the weak coupling and monotonicity conditions, which will be transparent through the extra constants we define in the proofs. Our aim is to provide explicit conditions under which our results hold and to present more clearly the relationship between these constants and the error estimates. As will be seen in the proofs, roughly speaking, the weaker the coupling (resp., the stronger the monotonicity, the smaller the time horizon), the easier the condition is satisfied and the smaller the constants C in the error estimates. Theorem 4. Under Assumptions 1, 2, and 3, there exists a function u: [0, T] × R^m → R that satisfies the following statements.
3. u is a viscosity solution of the PDE (2.7).

4. The FBSDEs (2.1)(2.2) admit a unique solution (X_t, Y_t, Z_t), which satisfies Y_t = u(t, X_t).
Furthermore, the solution of the FBSDEs satisfies the path regularity estimate with some constant C depending on L and T. Remark. Several conditions can guarantee that Z_t admits a càdlàg version, such as m = d and σσ^T ≥ δI for some δ > 0; see, e.g., [35].
Theorem 5. Under Assumptions 1, 2, and 3, for sufficiently small h, the discrete-time equation (3.3) admits a solution that converges to the solution of the FBSDEs. Finally, to make sure the system in (2.3) is well-defined, we restrict our parametric function spaces N'_0 and N_i as in Assumption 4 below. Note that neural networks with common activation functions, including ReLU and the sigmoid function, satisfy this assumption. Under Assumptions 1 and 4, one can easily prove by induction that the random variables defined in (2.3) are all measurable and square-integrable.
Recalling the definition of the system of equations (2.3), we can take the conditional expectation E[·|F_{t_i}] of both sides. Right-multiplying both sides of (4.2) by (∆W_i)^T and taking the expectation E[·|F_{t_i}] again, we obtain an expression for the Z component. The above observation motivates us to consider the following system of equations (4.3), where the X component is defined forwardly and the Y, Z components are defined backwardly. However, since we do not specify the terminal condition of Y^π_T, the system of equations (4.3) has infinitely many solutions. The following lemma gives an estimate of the difference between two such solutions.
Lemma 1. For any λ_1 > 0, λ_2 ≥ f_z, and sufficiently small h, denote the quantities in (4.4). To prove Lemma 1, we need the following lemma to handle the Z component.

Lemma 2.
Let 0 ≤ s 1 < s 2 , given Q ∈ L 2 (Ω, F s 2 , P), by the martingale representation theorem, there exists an F t -adapted process {H s } s 1 ≤s≤s 2 such that Noting that Q s 1 = 0, we have Proof of Lemma 1. Let By the martingale representation theorem, there exists an F t -adapted square-integrable process {δZ t } t i ≤t≤t i+1 such that or, equivalently, From equations (4.5) and (4.8), noting that From equation (4.9), by Assumptions 1, 2 and the root-mean square and geometric mean inequality (RMS-GM inequality), for any λ 1 > 0, we have By induction we can obtain that, for 0 ≤ n ≤ N , Similarly, from equation (4.10), for any λ 2 > 0, we have To deal with the integral term in (4.11), we apply Lemma 2 to (4.6)(4.8) and get which implies, by the Cauchy inequality, where (·) k denotes the k-th component of the vector. Plugging it into (4.11) gives us Then for any λ 2 ≥ f z and sufficiently small h satisfying (2k f + λ 2 )h < 1, we have By induction we obtain that, for 0 ≤ n ≤ N , Now we are ready to prove Theorem 1, whose precise statement is given below. Note that its conditions are satisfied if any of the five cases in the weak coupling and monotonicity conditions holds.
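For reference, the RMS-GM inequality invoked repeatedly in the proof above is the weighted form of Young's inequality: for vectors a, b and any λ > 0,

```latex
2\,a^{\mathrm T} b \;\le\; \lambda\,|a|^2 \;+\; \lambda^{-1}\,|b|^2 ,
```

which follows from expanding |λ^{1/2} a − λ^{−1/2} b|^2 ≥ 0; the free weight λ is what becomes the tunable constants λ_1 and λ_2 in the estimates.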

(4.13)
Then there exists a constant C > 0, depending on E|ξ|^2, L, T, λ_1, and λ_2, such that for sufficiently small h, the stated estimate holds.

4. Suppose all constants except k_f and λ_2 > 0 are fixed. Let A_1′ := A_1 + A_3 = 2k_b + 2k_f + σ_x + λ_1 + λ_2 and rewrite A_0 accordingly. It is straightforward to check that there exists a negative constant C_1 such that, when k_f ≤ C_1, combining these two estimates gives A_0 < 1.

5.
Noting that k_b and k_f play the same role in A_1′, we use the same argument as above to show that when k_b is sufficiently negative, there exists λ_2 ≥ f_z such that A_0 < 1.
Proof of Theorem 1′. In this proof and throughout the remainder of the paper, we use C to denote a generic constant that depends only on E|ξ|^2, L, and T, whose value may change from line to line when there is no need to distinguish. We also use C(·) to denote a generic constant depending on E|ξ|^2, L, T, and the constants represented by the arguments.
We use the same notations as in Lemma 1. It can be easily checked that both processes ({X^{π,j}_{t_i}}_{0≤i≤N}, {Y^{π,j}_{t_i}}_{0≤i≤N}, {Z^{π,j}_{t_i}}_{0≤i≤N−1}), j = 1, 2, satisfy the system of equations (4.3). Our proof strategy is to use Lemma 1 to bound the difference between the two solutions through the objective function E|g(X^π_T) − Y^π_T|^2. This allows us to apply Theorem 5 to derive the desired estimates.
To begin with, note that for any λ 3 > 0, the RMS-GM inequality yields

Lemma 1 tells us
Therefore by definition of P and S, we have

Consider the function
When A(h) < 1, we have and note that When A_0 < 1, comparing lim_{h→0} A(h) and A_0, we know that, for any ε > 0, there exist λ_3 > 0 and sufficiently small h such that (4.17) holds. By fixing ε = 1 and choosing a suitable λ_3, we obtain our error estimates of E|δX_n|^2 and E|δY_n|^2 as To estimate E|δZ_n|^2, we consider estimate (4.12), in which λ_2 can take any value no smaller than f_z. If f_z ≠ 0, we choose λ_2 = 2f_z and obtain Summing from 0 to N − 1 gives us (4.20). The case f_z = 0 can be dealt with similarly by choosing λ_2 = 1, and the same type of estimate can be derived. Finally, combining estimates (4.18)(4.19)(4.20) with Theorem 5, we prove the statement in Theorem 1′.

An Upper Bound for the Minimized Objective Function
We prove Theorem 2 in this section. We first state three useful lemmas. Theorem 2′, as a detailed statement of Theorem 2, and Theorem 6, as a variation of Theorem 2′ under stronger conditions, are then provided, followed by their proofs. The proofs of the three lemmas are given at the end of the section. The main process we analyze is (2.3). Lemma 3 gives an estimate of the terminal distance E|g(X^π_T) − Y^π_T|^2 produced by (2.3) in terms of the deviation between the approximate variables Y^π_0, Z^π_{t_i} and the true solution. Lemma 3. Suppose Assumptions 1, 2, and 3 hold true. Let Given λ_4 > 0, there exists a constant C > 0, depending on E|ξ|^2, L, T, and λ_4, such that for sufficiently small h, Lemma 3 is close to Theorem 2, except that Z̃_{t_i} is not a function of X^π_{t_i} and Y^π_{t_i} as defined in (2.3). To bridge this gap, we need the following two lemmas. First, similar to the proof of Theorem 1′, an estimate of the distance between the process defined in (2.3) and the process defined in (3.3) is also needed here. Lemma 4 is a general result serving this purpose, providing an estimate of the difference between two backward processes driven by different forward processes.
t i , then for any λ 7 > f z , and sufficiently small h, we have Lemma 5 shows that, similar to the nonlinear Feynman-Kac formula, the discrete stochastic process defined in (2.3) can also be linked to some deterministic functions.
for 0 ≤ i ≤ N − 1. If b and σ are independent of y, then there exist deterministic functions with the analogous property. Now we are ready to prove Theorem 2, with a precise statement given below. Like Theorem 1′, the conditions below are satisfied if any of the five cases of the weak coupling and monotonicity conditions holds to a certain extent.
Remark. If any of the weak coupling and monotonicity conditions introduced in Assumption 3 holds to a sufficient extent, there must exist λ 1 , λ 2 , λ 3 , λ 7 satisfying the conditions in Theorem 2 ′ . The arguments are very similar to those provided in Remark 4. Hence, we omit the details here for the sake of brevity.
Proof of Theorem 2′. Using Lemma 3 with λ_4 > 0, we obtain Splitting the term δZ_{t_i} = Z̃_{t_i} − Z^π_{t_i} and applying the generalized mean inequality, we have for any V_i. Therefore we have the estimate Similar to the derivation of estimate (4.16) (using a given λ_3 > 0 without final specification) in the proof of Theorem 1′, when A_0′ < 1, we have in which P = max_{0≤n≤N} e^{−A_1 nh} E|δX_n|^2. Plugging (5.10) into (5.9), and then into (5.8), we get and for sufficiently small h. Here B(h) is defined analogously to A(h), with B_0 := lim_{h→0} B(h). Rearranging the term E|g(X^π_T) − Y^π_T|^2 in inequality (5.11) yields our final estimate. We shall briefly discuss how the universal approximation theorem can be applied based on Theorem 2′. For instance, Theorem 2.1 in [22] states that every continuous and piecewise linear function with m-dimensional input can be represented by a deep neural network with rectified linear units and depth at most ⌈1 + log_2(m + 1)⌉. Now we view Y_0 as a target function with input ξ and E[Z̃_{t_i} | X^π_{t_i}, Y^π_{t_i}] as a target function with input (X^π_{t_i}, Y^π_{t_i}).

Theorem 6. Suppose Assumptions 1, 2, 3, 4 and the assumptions in Theorem 3 hold true. Let u be the solution of the corresponding quasilinear PDE (2.7) and L be the squared Lipschitz constant of σ^T(t, x, u(t, x)) ∇_x u(t, x) with respect to x. With the same notations as in Theorem 2′, when A_0′ < 1, there exists a constant C > 0, depending on E|ξ|^2, T, the constants in L, L, λ_1, λ_2, and λ_3, such that for sufficiently small h, the estimate (5.12) holds.

Proof. Using Lemma 3 again with λ_4 > 0 gives us Given the continuity of σ^T(t, x, u(t, x)) ∇_x u(t, x), we know Z_t admits a continuous version. Hence the term Z̃_{t_i} in δZ_{t_i} = Z̃_{t_i} − Z^π_{t_i} can be replaced with Z_{t_i}. Similar to the arguments in inequalities (5.6)(5.7), we have where the last equality uses the convergence result (3.4). Plugging it into (5.13), we have for sufficiently small h.
We employ the estimate (5.10) again to rewrite inequality (5.14) as (5.15). Arguing in the same way as in the proof of Theorem 2′, when B̃(h) is strictly bounded above by 1 for sufficiently small h, we can choose λ_4 small enough and rearrange the terms in inequality (5.15) to obtain the result in inequality (5.12).
Remark. The Lipschitz constant used in Theorem 6 may be further estimated a priori. Denote the Lipschitz constant of function f with respect to x as L x (f ), and the bound of function f as M (f ). Loosely speaking, we have Here L x (u) = M (∇ x u(t, x)) can be estimated from the first point of Theorem 4 and L(∇ x u(t, x)) = M (∇ xx u) can be estimated through the Schauder estimate (see, e.g., [32, Chapter 4, Lemma 2.1]). Note that the resulting estimate may depend on the dimension d.
Similarly, with the same type of estimates as in (5.16)(5.20), for any λ_5, λ_6 > 0, we have Arguing in the same way as in (5.21), by the Grönwall inequality, for sufficiently small h, we have with A_10 := 1 + f_z λ_5^{−1} + 2λ_6. Define Combining inequalities (5.21)(5.22) together yields Letting A_11 := max{A_6, A_8} + max{A_7, A_9}, we have We start from M_0 = E|Y_0 − Y^π_0|^2 and apply inequality (5.23) repeatedly to obtain (5.24), in which for the last term we use the fact that Σ_{i=0}^{N−1} E^z_i ≤ Ch from inequality (3.2). Note that Given any λ_4 > 0, we can choose λ_6 small enough such that This condition and inequality (5.24) together give us Finally, by decomposing the objective function, we have We complete our proof by combining inequalities (5.25)(5.26) and choosing λ_5 = argmin_{x∈R_+} H(x).
Proof of Lemma 4. We use the same notations as in the proof of Lemma 1. As derived in (4.12), for any λ_7 > f_z ≥ 0, we have Multiplying both sides of (5.27) by e^{A_5 ih}(e^{−A_5 T} ∨ 1)/(1 − f_z λ_7^{−1}) gives us Note that E|δY_N|^2 ≤ g_x E|δX_N|^2 by Assumption 1. Plugging it into (5.29), we arrive at the desired result.
Proof of Lemma 5. We prove by backward induction. Let Z^{π,′}_{t_N} = 0 for convenience. It is straightforward to see that the statement holds for i = N. Assume the statement holds for i = k + 1; the case i = k then follows.

Numerical Examples

General Setting
In this section, we illustrate the proposed numerical scheme by solving two high-dimensional coupled FBSDEs adapted from the literature. The common setting for the two numerical examples is as follows. We assume d = m = 100, that is, X_t, Z_t, W_t ∈ R^100. We assume ξ is deterministic and we are interested in the approximation error of Y_0, which is then also a deterministic scalar.
We use N − 1 fully-connected feedforward neural networks to parameterize φ^π_i, i = 1, . . . , N − 1; since ξ is deterministic, Y^π_0 and Z^π_0 are directly treated as trainable variables. Each of the neural networks has 2 hidden layers of dimension d + 10. The input has dimension d + 1 (X_i ∈ R^d, Y_i ∈ R) and the output has dimension d. In practice, one can of course choose X_i only as the input. We additionally test this input choice for the two examples and find no difference in terms of the relative error of Y_0 (up to the second decimal place). We use the rectifier function (ReLU) as the activation function and adopt batch normalization [36] right after each matrix multiplication and before activation. We employ the Adam optimizer [37] to optimize the parameters with a batch size of 64. The loss function is computed based on 256 validation sample paths. We initialize all the parameters using a uniform or normal distribution and run each experiment 5 times to report the average result.
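For concreteness, one subnetwork of the architecture just described can be sketched as follows in numpy (a simplified illustration with untrained random weights and a batch normalization without the learned scale and shift of [36]; the actual experiments train such networks with Adam):

```python
import numpy as np

def init_subnet(d, rng, hidden=None):
    """Weights for one subnetwork phi_i: input (X_i, Y_i) in R^{d+1},
    two hidden layers of width d + 10, output in R^d (the Z_i surrogate)."""
    h = d + 10 if hidden is None else hidden
    dims = [d + 1, h, h, d]
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def batch_norm(a, eps=1e-6):
    # normalize each feature over the batch (no learned scale/shift here)
    return (a - a.mean(0)) / np.sqrt(a.var(0) + eps)

def subnet_forward(params, x, y):
    a = np.concatenate([x, y], axis=-1)        # input: (X_i, Y_i), dim d + 1
    for w, bias in params[:-1]:
        # batch norm right after the matrix multiplication, then ReLU
        a = np.maximum(batch_norm(a @ w + bias), 0.0)
    w, bias = params[-1]
    return a @ w + bias                        # linear output layer, dim d

rng = np.random.default_rng(0)
d, batch = 100, 64
params = init_subnet(d, rng)
z = subnet_forward(params, rng.normal(size=(batch, d)),
                   rng.normal(size=(batch, 1)))   # shape (64, 100)
```

One such parameter set per time step i gives the function spaces N_i over which the objective is minimized.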

Example 1
The first problem is adapted from [38,39], in which the original spatial dimension of the problem is 1. We consider the coupled FBSDEs (6.1), in which the drift of the j-th forward component is given by X_{j,s}(1 + X_{j,s}^2)/(2 + X_{j,s})^3, X_{j,t}, Z_{j,t}, W_{j,t} denote the j-th components of X_t, Z_t, W_t, and the remaining coefficient functions are given accordingly. It can be verified by Itô's formula that the Y part of the solution of (6.1) admits an explicit expression.
Let ξ = (1, 1, . . . , 1) (100-dimensional), T = 5, and N = 160. The initial guess of Y_0 is generated from a uniform distribution on the interval [2, 4], while the true value is Y_0 ≈ 0.81873. We train for 25000 steps with an exponentially decaying learning rate that decays every 100 steps, starting at 1e-2 and ending at 1e-5. Figure 1 illustrates the mean of the loss function and the relative approximation error of Y_0 against the number of iteration steps. All runs converged, and the average final relative error of Y_0 is 0.39%.

Example 2
The second problem is adapted from [16], in which the spatial dimension is originally tested up to 10. We consider the coupled FBSDEs (6.2), where σ > 0 and r, D are constants. One can easily check by Itô's formula that the Y part of the solution of (6.2) is given in terms of sin(X_{j,t}).
Let ξ = (π/2, π/2, . . . , π/2) (100-dimensional), T = 1, r = 0.1, σ = 0.3, and D = 0.1. The initial guess of Y_0 is generated from a uniform distribution on the interval [0, 1], while the true value is Y_0 ≈ 9.04837. We train for 5000 steps with an exponentially decaying learning rate that decays every 100 steps, starting at 1e-2 and ending at 1e-3. When h = 0.005 (N = 200), the relative approximation error of Y_0 is 0.09%. Furthermore, we test the influence of the time partition by choosing different values of N. In all cases the training converged, and we plot in Figure 2 the mean relative error of Y_0 against the number of time steps N. The error clearly decreases as N increases (h decreases).