
Uncertainty and filtering of hidden Markov models in discrete time


We consider the problem of filtering an unseen Markov chain from noisy observations, in the presence of uncertainty regarding the parameters of the processes involved. Using the theory of nonlinear expectations, we describe the uncertainty in terms of a penalty function, which can be propagated forward in time in the place of the filter. We also investigate a simple control problem in this context.


  • BSDE: Backward stochastic difference equation

  • i.i.d.: Independent and identically distributed

  • MAP estimator: Maximum a posteriori estimator

  • DR-expectation: Either “data-driven robust expectation” or “divergence robust expectation” (the acronym is deliberately ambiguous)

  • StaticUP: Static generators, uncertain prior framework

  • DynamicUP: Dynamic generators, uncertain prior framework

  • StaticDR: Static generators, DR-expectation framework

  • DynamicDR: Dynamic generators, DR-expectation framework


Filtering is a common problem in many applications. The essential concept is that there is an unseen Markov process, which influences the state of some observed process, and our task is to approximate the state of the unseen process using a form of Bayes’ theorem. Many results have been obtained in this direction, most famously the Kalman filter (Kalman 1960; Kalman and Bucy 1961), which assumes the underlying processes considered are Gaussian, and gives explicit formulae accordingly. Similarly, under the assumption that the underlying process is a finite-state Markov chain, a general formula to calculate the filter can be obtained (the Wonham filter; see Wonham (1965)). These results are well known, in both discrete and continuous time (see Bain and Crisan (2009) or Cohen and Elliott (2015), Chapter 21, for further general discussion).

In this paper, we consider a simple setting in discrete time, where the underlying process is a finite-state Markov chain. Our concern is to study uncertainty in the dynamics of the underlying processes, in particular, its effect on the behaviour of the corresponding filter. That is, we assume that the observer has only imperfect knowledge of the dynamics of the underlying process and of their relationship with the observation process, and wishes to incorporate this uncertainty in their estimates of the unseen state. We are particularly interested in allowing the level of uncertainty in the filtered state to be endogenous to the filtering problem, arising from the uncertainty in parameter estimates and process dynamics.

We model this uncertainty in a general manner, using the theory of nonlinear expectations, and concern ourselves with a description of uncertainty for which explicit calculations can be carried out, and which can be motivated by considering statistical estimation of parameters. We then apply this to building a dynamically consistent expectation for random variables based on future states, and to a general control problem, with learning, under uncertainty.

Basic filtering

Consider two stochastic processes, X={Xt}t≥0 and Y={Yt}t≥0. Let Ω be the space of paths of (X,Y) and \(\mathbb {P}\) be a probability measure on Ω. We denote by \(\{\mathcal {F}_{t}\}_{t\ge 0}\) the (completed) filtration generated by X and Y and denote by \(\mathcal {Y}=\{\mathcal {Y}_{t}\}_{t\ge 0}\) the (completed) filtration generated by Y. The key problem of filtering is to determine estimates of ϕ(Xt) given \(\mathcal {Y}_{t}\), that is, \(\mathbb {E}_{\mathbb {P}}[\phi (X_{t})|\mathcal {Y}_{t}]\), where ϕ is an arbitrary Borel function.

Suppose that X is a Markov chain with (possibly time-dependent) transition matrix \(A_{t}^{\top }\) under \(\mathbb {P}\) (the transpose here saves notational complexity later). Without loss of generality, we assume that X takes values in the standard basis vectors \(\{e_{i}\}_{i=1}^{N}\) of \(\mathbb {R}^{N}\) (where N is the number of states of X), and so we write

$$X_{t} = A_{t} X_{t-1} + M_{t},$$

where \(\mathbb {E}_{\mathbb {P}}[M_{t+1}|\mathcal {F}_{t}] = 0\), so \(\mathbb {E}_{\mathbb {P}}[X_{t}|\mathcal {F}_{t-1}] = A_{t} X_{t-1}\).

We suppose the process Y is multivariate real-valued. The law of Y depends on X; in particular, the \(\mathbb {P}\)-distribution of Yt given \(\{X_{s}\}_{s\le t}\cup \{Y_{s}\}_{s<t}\) (that is, given all past observations of X and Y and the current state of X) is

$$Y_{t} \sim c(y;t, X_{t})d\mu(y)$$

for μ a reference measure on \(\left (\mathbb {R}^{d}, \mathcal {B}\left (\mathbb {R}^{d}\right)\right)\), where \(\sim \) is used to indicate the density of the distribution of a random variable.

For simplicity, we assume that Y0≡0, so no information is revealed about X0 at time 0. It is convenient to write Ct(y)=C(y;t) for the diagonal matrix with entries c(y;t,ei), so that

$$C_{t}(y) X_{t} = c(y;t, X_{t})X_{t}.$$

Note that these assumptions, in particular the values of A and C, depend on the choice of probability measure \(\mathbb {P}\). Conversely, as our space Ω is the space of paths of (X,Y), the measure \(\mathbb {P}\) is determined by A and C. We call A and C the generators of our probability measure.

As we have assumed Xt takes values in the standard basis in \(\mathbb {R}^{N}\), the expectation \(\mathbb {E}_{\mathbb {P}}[X_{t}|\mathcal {Y}_{t}]\) determines the entire conditional distribution of Xt given \(\mathcal {Y}_{t}\). In this discrete time context, the filtering problem can be solved in a fairly simple manner: Suppose we have already calculated \(p_{t-1}:=\mathbb {E}_{\mathbb {P}}[X_{t-1}|\mathcal {Y}_{t-1}]\). Then, by linearity and the dynamics of X, using the fact

$$\mathbb{E}_{\mathbb{P}}\left[M_{t}\left|\mathcal{Y}_{t-1}\right.\right] = \mathbb{E}_{\mathbb{P}}\left[\left.\mathbb{E}_{\mathbb{P}}\left[M_{t}\left|\mathcal{F}_{t-1}\right.\right]\right|\mathcal{Y}_{t-1}\right]=0,$$

we can calculate

$$\mathbb{E}_{\mathbb{P}}\left[X_{t}\left|\mathcal{Y}_{t-1}\right.\right] = \mathbb{E}_{\mathbb{P}}\left[\left.A_{t} X_{t-1}+ M_{t}\right|\mathcal{Y}_{t-1}\right] = A_{t} p_{t-1}.$$

Bayes’ theorem then states that, with probability one, with \(\propto \) denoting equality up to proportionality,

$$\mathbb{P}\left(\left.X_{t}=e_{i}\right|\mathcal{Y}_{t}\right) = \mathbb{P}\left(X_{t}=e_{i}\left|\{Y_{s}\}_{s< t}, Y_{t}\right.\right) \propto c(Y_{t};t, e_{i}) \mathbb{P}(X_{t}=e_{i}|\mathcal{Y}_{t-1}),$$

which can be written in a simple matrix form, as given in the following theorem which summarizes the classical filter.

Theorem 1

For X a hidden Markov chain with transition matrix \(A^{\top }_{t}\), and Y an observation process with conditional density (given Xt) given by

$$Y_{t}|X_{t} \sim c(y;t,X_{t})d\mu(y) = \mathbf{1}^{\top} C_{t}(y)X_{t} d\mu(y),$$

the conditional distribution \(\mathbb {E}[X_{t}|\mathcal {Y}_{t}] = p_{t}\) satisfies the recursion

$$ p_{t} = \mathbb{G}(p_{t-1}, A_{t}, C_{t}, Y_{t}) := \frac{C_{t}(Y_{t}) A_{t} p_{t-1}}{\mathbf{1}^{\top} C_{t}(Y_{t}) A_{t} p_{t-1}}, \tag{1} $$

where 1 denotes a vector with all components 1.

We call pt the “filter state” at time t. Note that, if we assume the density c is positive, At is irreducible and pt−1 has all entries positive, then pt will also have all entries positive.
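As an illustration of the recursion in Theorem 1, the following is a minimal sketch of one filter step in Python. The two-state matrix `A`, the Gaussian observation density `gaussian_density`, and the observation values are assumptions chosen for the example; none of them come from the text.

```python
import numpy as np

def filter_step(p_prev, A, c_vals):
    """One step of the recursion in Theorem 1.

    p_prev : filter state p_{t-1}, a probability vector of length N
    A      : matrix A_t whose columns sum to one (the transition matrix is A_t^T)
    c_vals : vector (c(Y_t; t, e_i))_i, i.e. the diagonal of C_t(Y_t)
    """
    unnormalised = c_vals * (A @ p_prev)        # C_t(Y_t) A_t p_{t-1}
    return unnormalised / unnormalised.sum()    # divide by 1^T C_t(Y_t) A_t p_{t-1}

def gaussian_density(y, means, sd=1.0):
    """Assumed observation density: Y_t | X_t = e_i ~ N(means[i], sd^2)."""
    return np.exp(-0.5 * ((y - means) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])            # columns sum to one
means = np.array([0.0, 2.0])
p = np.array([0.5, 0.5])              # p_0
for y in [0.1, 1.9, 2.3]:             # hypothetical observations Y_1, Y_2, Y_3
    p = filter_step(p, A, gaussian_density(y, means))
```

Because the density is positive and `A` has positive entries here, every filter state in this run keeps strictly positive entries, in line with the note above.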

In practice, the key problem with implementing these methods is the requirement that we know the underlying transition matrix A and the density C. These are generally not known perfectly, but need to be estimated prior to the implementation of the filter. Uncertainty in the choice of these parameters will lead to uncertainty in the estimates of the filtered state, and the aim of this paper is to derive useful representations of that uncertainty.

As variation in the choice of A and C corresponds to a different choice of measure \(\mathbb {P}\), we see that using an uncertain collection of generators corresponds naturally to uncertainty regarding \(\mathbb {P}\). This type of uncertainty, where the probability measure is not known, is commonly referred to as “Knightian” uncertainty (with reference to Knight (1921); related ideas are also discussed by Keynes (1921) and Wald (1945)).

Effectively, we wish to consider the propagation of uncertainty in Bayesian updating (as the filter is simply a special case of this). Huber and Ronchetti (2009, p. 331) briefly touch on this but, based on earlier work by Kong, argue that this propagation is computationally infeasible. Their approach, however, was based on Choquet integrals, rather than nonlinear (convex) expectations in the style of Peng (2010) and others. In the coming sections, we see how the structure of nonlinear expectations allows us to derive comparatively simple rules for updating.

Remark 1

While we will present our theory in the context where X is a finite state Markov chain, our approach does not depend in any significant way on this assumption. In particular, it would be equally valid, mutatis mutandis, if we supposed that X followed the dynamics of the Kalman filter, and our uncertainty was on the coefficients of the filter. We specialize to the Markov chain case purely for the sake of concreteness.

The aim of this paper is to provide, with a minimum of technical complexity, the basic structures which underlie this approach to filtering with a nonlinear expectation. It proceeds as follows: In Section 2 we give some key facts about the measures which naturally appear in a filtering context. In Section 3, we introduce the theory of nonlinear expectations, and a means of connecting these with statistical estimation from Cohen (2017). Section 4 unites these expectations with filtering, giving recursive equations which replace the filtering equation in Theorem 1; it also outlines some concrete simplifications of this general structure, depending on whether the underlying parameters can vary through time and how new information is to be incorporated. In Section 5 we consider dynamic properties of this nonlinear expectation when looking at future events, and some connections with the theory of (discrete time) BSDEs. Finally, Section 6 considers a generic control problem in this context.

Conditionally Markov measures

In order to incorporate learning in our nonlinear expectations and filtering, it is useful to extend slightly from the family of measures previously described. In particular, we wish to allow the dynamics to depend on past observations, while preserving enough Markov structure to enable filtering. The following classes of probability measures will be of interest.

Definition 1

We write \(\mathcal {M}_{1}\) for the space of probability measures equivalent to a reference measure \(\mathbb {P}\).

Let \(\mathcal {M}_{M}\subset \mathcal {M}_{1}\) denote the probability measures under which

  • X is a Markov chain, that is, for all t, Xt+1 is independent of \(\mathcal {F}_{t}\) given Xt;

  • \(\{Y_{s}\}_{s\ge t+1}\) is independent of \(\mathcal {F}_{t}\) given Xt+1;

  • both X and Y are time homogeneous, that is, the conditional distributions of Xt+1|Xt and Yt|Xt do not depend on t.

Let \(\mathcal {M}_{M|\mathcal {Y}}\subset \mathcal {M}_{1}\) denote the probability measures under which

  • X is a conditional Markov chain, that is, for all t, Xt+1 is independent of \(\mathcal {F}_{t}\) given Xt and \(\{Y_{s}\}_{s\le t}\); and

  • \(\{Y_{s}\}_{s\ge t+1}\) is independent of \(\mathcal {F}_{t}\) given \(\{X_{t+1}\}\cup \{Y_{s}\}_{s\le t}\).

We note that, if we consider a measure in \(\mathcal {M}_{M|\mathcal {Y}}\), there is a natural notion of the generators A and C. In particular, \(\mathcal {M}_{M}\) corresponds to those measures under which the generators A and C are constant, while \(\mathcal {M}_{M|\mathcal {Y}}\) corresponds to those measures under which the generators A and C are functions of time and \(\{Y_{s}\}_{s\le t}\) (i.e. \(\{\mathcal {Y}_{t}\}_{t \ge 0}\)-predictable processes).

For each t, these generators determine the measure on \(\mathcal {F}_{t}\) given \(\mathcal {F}_{t-1}\), and (together with the distribution of X0) this determines the measure at all times. It is straightforward to verify that our filtering equations hold for all measures in \(\mathcal {M}_{M|\mathcal {Y}}\), with the appropriate modification of the generators.

Definition 2

For a measure \(\mathbb {Q}\in \mathcal {M}_{M|\mathcal {Y}}\), we write \(\left (A^{\mathbb {Q}},C^{\mathbb {Q}}(\cdot)\right)\) for the generator of (X,Y) under \(\mathbb {Q}\), recalling that \(C^{\mathbb {Q}}_{t}(y) = \text {diag}\left (\left \{c^{\mathbb {Q}}_{t}(y; e_{i})\right \}_{i=1}^{N}\right)\), and that \(A^{\mathbb {Q}}_{t}\) and \(C^{\mathbb {Q}}_{t}\) are now allowed to depend on {Ys}s<t. For notational convenience, we shall typically not write the dependence on {Ys}s<t explicitly.

Similarly, for a \(\{\mathcal {Y}_{t}\}_{t\ge 0}\)-predictable process (At,Ct(·))t≥0 taking values in the product of the space of transition matrices and the space of diagonal matrix-valued functions, where each diagonal element is a probability density on \(\mathbb {R}^{d}\), and p0 a probability vector in \(\mathbb {R}^{N}\), we write \(\mathbb {Q}(A,C, p_{0})\) for the measure with generator (At,Ct(·))t≥0 and initial distribution \(\mathbb {E}_{\mathbb {Q}}[X_{0}]=p_{0}\).

In what follows, we will be variously wishing to restrict a measure \(\mathbb {Q}\) to a σ-algebra, and to condition a measure on a σ-algebra. To prevent notational confusion, we shall write \(\mathbb {Q}\|_{\mathcal {F}}\) for the restriction of \(\mathbb {Q}\) to \(\mathcal {F}\), and \(\mathbb {Q}|_{\mathcal {F}}\) for \(\mathbb {Q}\) conditioned on \(\mathcal {F}\).

In our setting, our fundamental problem is that we do not know what measure is “true”, and so work instead under a family of measures. In this context, measure changes can be described as follows.

Proposition 1

Let \(\bar {\mathbb {P}}\) be a reference probability measure under which X is a sequence of i.i.d. uniform random variables from the basis vectors \(\{e_{1},\ldots ,e_{N}\}\subset \mathbb {R}^{N}\) and {Yt}t≥0 is independent of X, with i.i.d. distribution \(Y_{t}\sim d\mu \). The measure \(\mathbb {Q}(A,C, p_{0})\in \mathcal {M}_{M|\mathcal {Y}}\) has Radon–Nikodym derivative (or likelihood)

$$\frac{d\mathbb{Q}(A,C, p_{0})\|_{\mathcal{F}_{T}}}{d\bar{\mathbb{P}}\|_{\mathcal{F}_{T}}} = N\left(X_{0}^{\top} p_{0}\right) \prod_{t=1}^{T} \left(\left(X_{t}^{\top} A_{t-1} X_{t-1}\right)\left(\mathbf{1}^{\top} C(Y_{t}) X_{t}\right)\right). $$

Remark 2

The requirement that \(\bar {\mathbb {P}}\) is a probability measure is unnecessary (i.e., in Bayesian parlance, the reference distribution may be improper). For example, we can use Lebesgue measure on \(\mathbb {R}\) as the marginal reference measure μ for Yt without difficulty, in which case ct(y) is the usual (Lebesgue) density of the distribution of Yt.


Proof

A simple verification of this result is possible by factoring the proposed Radon–Nikodym density as the product of three terms:

$$\frac{d\mathbb{Q}(A,C, p_{0})\|_{\mathcal{F}_{T}}}{d\bar{\mathbb{P}}\|_{\mathcal{F}_{T}}} = \frac{X_{0}^{\top} p_{0}}{1/N} \cdot \prod_{t=1}^{T} \left(X_{t}^{\top} A_{t-1} X_{t-1}\right)\cdot \prod_{t=1}^{T} \left(\mathbf{1}^{\top} C(Y_{t}) X_{t}\right).$$

The first term, \((X_{0}^{\top } p_{0})/(1/N)\), changes the distribution of X0 from uniform to p0, as is seen from the calculation

$$E_{\bar{\mathbb{P}}}\left[X_{0}\left(X_{0}^{\top} p_{0}\right)/(1/N)\right] = N E_{\bar{\mathbb{P}}}\left[X_{0} X_{0}^{\top}\right] p_{0} = N N^{-1} I_{N} p_{0} = p_{0}.$$

This is clearly a probability density with respect to \(\bar {\mathbb {P}}\).

The second term, \(\prod _{t=1}^{T} \left (X_{t}^{\top } A_{t-1} X_{t-1}\right)\), changes the conditional distribution of Xt|{Xs}s<t (for each t), from a uniform distribution to the probability vector At−1Xt−1; as can be demonstrated by a calculation similar to that for X0 in the first term. As the columns of At−1 sum to one, it is easy to verify that this product has expectation 1 (conditional on X0) and is nonnegative, that is, it is a probability density which does not modify the distribution of X0.

The third term changes the conditional distribution of \(Y_{t}|\{X_{t}, X_{s}, Y_{s}\}_{s\neq t}\) from μ to \(\mathbf {1}^{\top } C(y)X_{t}\,d\mu (y)=c(y;X_{t})\,d\mu (y)\). This is most easily seen by calculating

$$E_{\mathbb{Q}}[ g(Y_{t})|\{X_{t}, X_{s}, Y_{s}\}_{s\neq t}] = \int g(y) \mathbf{1}^{\top} C(y) X_{t} d\mu(y) = \int g(y)c(y; X_{t})d\mu(y) $$

for a general bounded Borel function g. As c is defined to be a density, it is again easy to verify that the product \(\prod \left (\mathbf {1}^{\top } C(Y_{t}) X_{t}\right)\) is a probability density with respect to \(\bar {\mathbb {P}}\), and that it does not modify the distribution of X.

As we are on a canonical space, the measure \(\mathbb {Q}\) is determined by the laws of X and Y, and so we have the result. □

The above proposition gives a Radon–Nikodym derivative adapted to the full filtration \(\{\mathcal {F}_{t}\}_{t\ge 0}\). In practice, it is also useful to consider the corresponding Radon–Nikodym derivative adapted to the observation filtration \(\{\mathcal {Y}_{t}\}_{t\ge 0}\). As this filtration is generated by the process Y, it is enough to multiply together the conditional distributions of \(Y_{t}|\mathcal {Y}_{t-1}\), leading to the following convenient representation. Recall that

$$\sum_{i} p_{i} c_{t}(y; e_{i}) = \mathbf{1}^{\top} C_{t}(y) p.$$

Proposition 2

For \(\mathbb {Q}(A,C, p_{0})\in \mathcal {M}_{M|\mathcal {Y}}\), the Radon–Nikodym derivative restricted to \(\mathcal {Y}_{T}\) is given by

$$L^{\text{obs}}(\mathbb{Q}(A,C, p_{0})|\mathbf{y}):=\frac{d\mathbb{Q}(A,C, {p_{0}})\|_{\mathcal{Y}_{T}}}{d{\bar{\mathbb{P}}}\|_{\mathcal{Y}_{T}}} = \prod_{t=1}^{T} \mathbf{1}^{\top} C_{t}(Y_{t}) A_{t} p_{t-1}^{(A,C), p_{0}}, $$

where \(p_{t-1}^{(A,C), p_{0}}\) is the solution to the filtering problem in the measure \(\mathbb {Q}(A,C,p_{0})\), as determined by (1) (and so includes further dependence on {Ys}s<t).


Proof

The distribution of \(Y_{t}|\mathcal {Y}_{t-1}\) is determined by (for Borel sets B)

$$\begin{aligned} E_{\mathbb{Q}}[1_{Y_{t}\in B}|\mathcal{Y}_{t-1}] &= E_{\mathbb{Q}}\left[E_{\mathbb{Q}}[1_{Y_{t}\in B}|\mathcal{Y}_{t-1}\vee \sigma(X_{t})] \Big|\mathcal{Y}_{t-1}\right] \\ &= E_{\mathbb{Q}}\left[\int_{B} \mathbf{1}^{\top} C_{t}(y)X_{t} d\mu(y)\Big|\mathcal{Y}_{t-1}\right]\\ &= \int_{B} \mathbf{1}^{\top} C_{t}(y)E_{\mathbb{Q}}[X_{t}|\mathcal{Y}_{t-1}] d\mu(y) \\&= \int_{B} \mathbf{1}^{\top} C_{t}(y)A_{t} p_{t-1}^{(A,C),p_{0}} d\mu(y). \end{aligned} $$

As \(Y_{t}|\mathcal {Y}_{t-1}\) has distribution μ under \(\bar {\mathbb {P}}\), it follows that \(\mathbf {1}^{\top } C_{t}(Y_{t})A_{t} p_{t-1}^{(A,C),p_{0}}\) is the Radon–Nikodym density of the conditional law of \(Y_{t}|\mathcal {Y}_{t-1}\) under \(\mathbb {Q}\) with respect to its law under \(\bar {\mathbb {P}}\). As \(\{Y_{s}\}_{s\le T}\) generates \(\mathcal {Y}_{T}\), the result follows by induction. □
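The representation in Proposition 2 can be checked numerically on a small example. The sketch below computes the observed likelihood by running the filter, and compares it with a brute-force sum of the joint density of the hidden path and the observations over all hidden paths. The generator is taken constant in time, and the matrix `A`, the vector `p0` and the density values in `c_seq` are made-up numbers for illustration only.

```python
import itertools
import numpy as np

def observed_likelihood(p0, A, c_seq):
    """Product over t of 1^T C(Y_t) A p_{t-1}, with p updated by the filter.

    c_seq[t-1] holds the vector (c(Y_t; e_i))_i; A is assumed constant in time.
    """
    p, lik = p0, 1.0
    for c_vals in c_seq:
        unnormalised = c_vals * (A @ p)
        step = unnormalised.sum()      # density of Y_t given Y_1, ..., Y_{t-1}
        lik *= step
        p = unnormalised / step        # filter state p_t
    return lik

def brute_force_likelihood(p0, A, c_seq):
    """Sum the joint density of (X_0..X_T, Y_1..Y_T) over all hidden paths."""
    N, T = len(p0), len(c_seq)
    total = 0.0
    for path in itertools.product(range(N), repeat=T + 1):
        weight = p0[path[0]]
        for t in range(1, T + 1):
            # A[i, j] is the probability of moving from state j to state i
            weight *= A[path[t], path[t - 1]] * c_seq[t - 1][path[t]]
        total += weight
    return total

A = np.array([[0.9, 0.2], [0.1, 0.8]])
p0 = np.array([0.3, 0.7])
c_seq = [np.array([0.40, 0.07]), np.array([0.16, 0.39]), np.array([0.05, 0.30])]
lik_filter = observed_likelihood(p0, A, c_seq)
lik_brute = brute_force_likelihood(p0, A, c_seq)
```

The two values agree up to floating-point error, while the filter-based computation costs a number of operations linear in T rather than exponential.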

In order to apply classical statistical methods, we take a (generic) parameterization of this family of measures. This will also allow us to encode which parts of the generators we believe are static (and so can be learnt from observations), and which are dynamic (and so will violate stationarity).

Assumption 1

For fixed m>0, we assume we are given a (Borel measurable) function \(\Phi :\mathbb {N}\times \mathbb {R}^{m}\times \mathbb {R}^{m} \to \mathbb {A}\) such that the generators satisfy

$$\Phi(t, \mathfrak{S}, \mathfrak{D}_{t})=(A_{t},C_{t}),$$

where \(\mathfrak {S}\) is a static parameter, and \(\mathfrak {D}_{t}\) is a parameter which may vary at each point in time. We write \(\mathcal {Q}\) for the family of measures in \(\mathcal {M}_{M|\mathcal {Y}}\) induced by this parameterization, and typically omit to write the argument t of Φ.

With a slight abuse of notation, if \((A_{t},C_{t})=\Phi (t, \mathfrak {S}, \mathfrak {D}_{t})\), we write \(\mathbb {Q}(\mathfrak {S}, \mathfrak {D}, p_{0})\) as an alias for the measure \(\mathbb {Q}(A, C, p_{0})\), and \(\mathbb {G}(t, \mathfrak {S}, \mathfrak {D}_{t}, p)\) as an alias for the function \(\mathbb {G}(t, A, C, p)\) defined in (1).
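To make Assumption 1 concrete, here is one possible parameterization, written as a sketch. Everything in it is hypothetical: the static parameter is a single logit controlling the persistence of a two-state chain, and the dynamic parameter is a level shift applied to the means of an assumed Gaussian observation density (so m = 1). The text itself leaves Φ generic.

```python
import numpy as np

def Phi(t, S, D_t):
    """Hypothetical parameterisation: static S gives A, dynamic D_t shifts the means.

    The argument t is unused here, i.e. the map itself does not depend on time.
    """
    stay = 1.0 / (1.0 + np.exp(-S[0]))           # persistence probability in (0, 1)
    A = np.array([[stay, 1.0 - stay],
                  [1.0 - stay, stay]])           # columns sum to one
    means = np.array([0.0, 2.0]) + D_t[0]        # time-varying level shift

    def C(y):
        # diagonal matrix with entries c(y; e_i), here unit-variance Gaussian densities
        return np.diag(np.exp(-0.5 * (y - means) ** 2) / np.sqrt(2 * np.pi))

    return A, C

A_t, C_t = Phi(0, np.array([1.0]), np.array([0.5]))
C_at_y = C_t(0.3)
```

In this sketch the persistence is treated as something that could be learnt from data, while the level shift is allowed to change at every step.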

Nonlinear expectations

In this section, we introduce the concepts of nonlinear expectations and convex risk measures, and discuss their connection with penalty functions on the space of measures. These objects provide a technical foundation with which to model the presence of uncertainty in a random setting. This theory is explored in some detail in Föllmer and Schied (2002b). Other key works which have used or contributed to this theory, in no particular order, are Hansen and Sargent (2008) (see also Hansen and Sargent (2005, 2007) for work related to what we present here); Huber and Ronchetti (2009); Peng (2010); El Karoui et al. (1997); Delbaen et al. (2010); Duffie and Epstein (1992); Rockafellar et al. (2006); Riedel (2004) and Epstein and Schneider (2003). We base our terminology on that used in Föllmer and Schied (2002b) and Delbaen et al. (2010).

We here present, without proof, the key details of this theory as needed for our analysis.

Definition 3

For a σ-algebra \(\mathcal {G}\) on Ω, let \(L^{\infty }(\mathcal {G})\) denote the space of essentially bounded \(\mathcal {G}\)-measurable random variables. A nonlinear expectation on \(L^{\infty }(\mathcal {G})\) is a mapping

$$\mathcal{E}:L^{\infty}(\mathcal{G}) \to \mathbb{R}$$

satisfying the assumptions

  • Strict Monotonicity: for any \(\xi _{1}, \xi _{2}\in L^{\infty }(\mathcal {G})\), if ξ1≥ξ2 a.s., then \(\mathcal {E}(\xi _{1}) \geq \mathcal {E}(\xi _{2})\) and, if in addition \(\mathcal {E}(\xi _{1})=\mathcal {E}(\xi _{2})\), then ξ1=ξ2 a.s.;

  • Constant triviality: for any constant k, \(\mathcal {E}(k)=k\);

  • Translation equivariance: for any \(k\in \mathbb {R}\), \(\xi \in L^{\infty }(\mathcal {G})\), \(\mathcal {E}(\xi +k)= \mathcal {E}(\xi)+k\).

A “convex” expectation also satisfies

  • Convexity: for any λ∈[0,1], \(\xi _{1}, \xi _{2}\in L^{\infty }(\mathcal {G})\),

    $$\mathcal{E}(\lambda \xi_{1}+ (1-\lambda) \xi_{2}) \leq \lambda \mathcal{E}(\xi_{1})+ (1-\lambda) \mathcal{E}(\xi_{2}).$$

If \(\mathcal {E}\) is a convex expectation, then the operator defined by \(\rho (\xi) = \mathcal {E}(-\xi)\) is called a convex risk measure. A particularly nice class of convex expectations is those which satisfy

  • Lower semicontinuity: For a sequence \(\{\xi _{n} \}_{n\in \mathbb {N}}\subset L^{\infty }(\mathcal {G})\) with ξn↑ξ pointwise (and \(\xi \in L^{\infty }(\mathcal {G})\)), we have \(\mathcal {E}(\xi _{n}) \uparrow \mathcal {E}(\xi)\).

The following theorem (which was expressed in the language of risk measures) is due to Föllmer and Schied (2002a) and Frittelli and Rosazza Gianin (2002).

Theorem 2

Suppose \(\mathcal {E}\) is a lower semicontinuous convex expectation. Then there exists a “penalty” function \(\mathcal {R}: \mathcal {M}_{1}\to [0,\infty ]\) such that

$$\mathcal{E}(\xi) = \sup_{\mathbb{Q}\in \mathcal{M}_{1}} \left\{\mathbb{E}_{\mathbb{Q}}[\xi] -\mathcal{R}(\mathbb{Q})\right\}.$$

Provided \(\mathcal {R}(\mathbb {Q})<\infty \) for some \(\mathbb {Q}\) equivalent to \(\mathbb {P}\), we can restrict our attention to measures in \(\mathcal {M}_{1}\) equivalent to \(\mathbb {P}\) without loss of generality.

Remark 3

This result gives some intuition as to how a convex expectation can model “Knightian” uncertainty. One considers all the possible probability measures on the space, and then selects the maximal expectation among all measures, penalizing each measure depending on its plausibility, as measured by \(\mathcal {R}(\cdot)\). As convexity of \(\mathcal {E}\) is a natural requirement of an “uncertainty averse” assessment of outcomes, Theorem 2 shows that this is the only way to construct an “expectation” \(\mathcal {E}\) which penalizes uncertainty, while preserving monotonicity, translation equivariance, and constant triviality (and lower semicontinuity).
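The penalized-supremum form in Theorem 2 is easy to experiment with when the family of measures is finite. The sketch below uses three made-up measures on a four-point sample space, with made-up penalties normalized so that the smallest is zero. It is an illustration of the representation, not of any specific model in this paper.

```python
import numpy as np

# Three candidate measures on a four-point sample space (rows are probability vectors).
Q = np.array([[0.25, 0.25, 0.25, 0.25],
              [0.40, 0.30, 0.20, 0.10],
              [0.10, 0.20, 0.30, 0.40]])
R = np.array([0.0, 0.3, 0.5])        # penalties, with min R = 0

def nonlinear_expectation(xi):
    """sup over the family of E_Q[xi] - R(Q); here a max over three measures."""
    return np.max(Q @ xi - R)

xi1 = np.array([1.0, 2.0, 0.0, 4.0])
xi2 = np.array([3.0, -1.0, 2.0, 0.5])
lam = 0.3

value_constant = nonlinear_expectation(np.full(4, 7.0))     # constant triviality
value_shifted = nonlinear_expectation(xi1 + 2.0)            # translation equivariance
value_mixture = nonlinear_expectation(lam * xi1 + (1 - lam) * xi2)
convex_bound = lam * nonlinear_expectation(xi1) + (1 - lam) * nonlinear_expectation(xi2)
```

Constant triviality relies on the normalization min R = 0, and convexity holds because a maximum of affine functions of ξ is convex.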

In order to relate our penalty function with the temporal structure of filtering, we focus our attention on measures in our parametric family \(\mathcal {Q}\), defined in Assumption 1. We also allow the penalty \(\mathcal {R}\) to depend on time. In the analysis of this paper, the following definition allows us to obtain a (forward) recursive structure in our nonlinear expectations, as one might expect in a filtering context.

Definition 4

We say a family of penalty functions \(\{\mathcal {R}_{t}\}_{t\ge 0}\), is additive if it can be written in the form

$$\mathcal{R}_{t}(\mathbb{Q}) = \left\{\begin{array}{cc} \left(\frac{1}{k}\alpha_{t}(\mathbb{Q}, \{Y_{s}\}_{s\le t})\right)^{k'}& \text{if }\mathbb{Q}\in \mathcal{Q}\\+\infty & \text{otherwise,} \end{array}\right. $$

where, if \(\mathbb {Q} = \mathbb {Q}(\mathfrak {S}, \mathfrak {D}, p_{0})\) and \(p^{\mathbb {Q}}_{t}\) is the solution of the filtering Eq. 1 under \(\mathbb {Q}\), the function αt is of the form

$$\alpha_{t}\left(\mathbb{Q}, \{Y_{s}\}_{s\le t}\right) = \kappa_{\text{prior}}(p_{0}, \mathfrak{S}) + \sum_{s\le t} \gamma_{s}\left(\mathfrak{S}, \mathfrak{D}_{s}, \{Y_{n}\}_{n\le s}, p_{s-1}^{\mathbb{Q}}\right)+m_{t}.$$

Here k and k′ are positive constants, κprior and {γt}t≥0 are known real functions bounded below, and mt is a \(\mathcal {Y}_{t}\)-measurable scalar random variable which ensures the normalization condition \(\inf _{\mathbb {Q}} \alpha _{t}\left (\mathbb {Q}, \{Y_{s}\}_{s\le t}\right)= 0\) holds for almost all observation sequences {Ys}s≥0.


From the discussion above, it is apparent that we can focus our attention on calculating the penalty function \(\mathcal {R}\), rather than working with the nonlinear expectation directly. This penalty function is meant to encode how “unreasonable” a probability measure \(\mathbb {Q}\) is as a model for our outcomes.

In Cohen (2017), we have considered a framework which links the choice of the penalty function to statistical estimation of a model. The key idea of Cohen (2017) is to use the negative log-likelihood function for this purpose, where the likelihood is taken against an arbitrary reference measure, and evaluated using the observed data. This directly uses the statistical information from observations to quantify our uncertainty.

In this paper, we make a slight extension of this idea, to explicitly incorporate prior beliefs. In particular, we replace the log-likelihood with the log-posterior density, which in turn gives an additional term in the penalty.

Definition 5

For \(\mathbb {Q}\in \mathcal {Q}\), the observed likelihood \(L^{\text {obs}}(\mathbb {Q}|\mathbf {y})\) is given in Proposition 2. Inspired by a “Bayesian” approach, we augment this by the addition of a prior distribution over \(\mathcal {Q}\). Suppose a (possibly improper) prior is given, with density in terms of the parameters \((\mathfrak {S}, \{\mathfrak {D}_{t}\}_{t\ge 0})\)

$$\exp\left(-\kappa_{\text{prior}}(\mathfrak{S}, p_{0}) - \sum_{t} \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{t})\right).$$

The posterior relative density is given by the product

$$L(\mathbb{Q}|\mathbf{y}) = L^{\text{obs}}(\mathbb{Q}|\mathbf{y})\exp\left(-\kappa_{\text{prior}}(\mathfrak{S}, p_{0}) - \sum_{t} \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{t})\right).$$

The “\(\mathcal {Q}|\mathbf {y}_{t}\)-divergence” is defined to be the normalized negative log-posterior relative density

$$ \alpha_{\mathbf{y}}(\mathbb{Q}):= -\log\left(L(\mathbb{Q}|\mathbf{y})\right) + \sup_{\tilde{\mathbb{Q}}\in \mathcal{Q}}\left\{\log\left(L\left(\tilde{\mathbb{Q}}|\mathbf{y}\right)\right)\right\}. \tag{2} $$

Remark 4

The right-hand side of (2) is well defined whether or not a maximum a posteriori (MAP) estimator exists. Given a MAP estimate \(\hat {\mathbb {Q}}\in \mathcal {Q}\), we would have the simpler representation

$$\alpha_{\mathbf{y}}(\mathbb{Q}):= -\log\left(\frac{L(\mathbb{Q}|\mathbf{y})}{L(\hat{\mathbb{Q}}|\mathbf{y})}\right).$$

Definition 6

For fixed observations yt=(Y1,Y2,...,Yt), for uncertainty aversion parameters k>0 and k′∈[1,∞], we define the convex expectation

$$ \mathcal{E}_{\mathbf{y}_{t}}^{k,k'}(\xi):= \sup_{\mathbb{Q}\in \mathcal{Q}}\left\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathbf{y}_{t}] -\left(\frac{1}{k}\alpha_{\mathbf{y}_{t}}(\mathbb{Q})\right)^{k'}\right\}, \tag{3} $$

where we adopt the convention \(x^{\infty }=0\) for x∈[0,1] and +∞ otherwise.

We call \(\mathcal {E}_{\mathbf {y}_{t}}^{k,k'}\) the “DR-expectation” (with parameters k,k′). We may omit to write k,k′ for notational simplicity.

With deliberate ambiguity, the acronym “DR” can either stand for “divergence robust” or “data-driven robust”.

Remark 5

By construction, \(\mathcal {Q}\) is parameterized by \(\mathfrak {S}\) and \(\{\mathfrak {D}_{t}\}_{t\ge 0}\), which lie in \(\mathbb {R}^{m}\) for some m. The divergence and conditional expectations given yt are continuous with respect to this parameterization and can be constructed to be Borel measurable with respect to yt. Consequently, measure theoretic concerns which arise from taking the supremum will not cause difficulty, in particular, the DR-expectation defined in (3) is guaranteed to be a Borel measurable function of yt for every ξ. (This follows from Filippov’s implicit function theorem, see, for example, Cohen and Elliott (2015) Appendix 10.)

Remark 6

The choice of parameters k and k′ determines much of the behaviour of the nonlinear expectation. The role of k is simple, as it acts to scale the uncertainty aversion: a higher value of k results in smaller penalties, and hence the DR-expectation will lie further above the MAP expectation. The parameter k′ determines the “curvature” of the uncertainty aversion. Taking k′=∞ results in the DR-expectation being positively homogeneous, that is, it is a coherent expectation in the sense of Artzner et al. (1999). In Cohen (2017), the asymptotic behaviour of the DR-expectation is studied, under the assumption of i.i.d. observations. For k′=1, the DR-expectation corresponds (for large samples) to the expected value under the maximum likelihood model plus k/2 times the sampling variance of the expectation, while for k′=∞, the DR-expectation corresponds to the expected value under the maximum likelihood model plus \(\sqrt {2k}\) times the sampling standard error of the expectation. In this paper, we will not be considering such an asymptotic result, so the values of k and k′ will not play a significant role. Their presence nevertheless gives a more general class of penalty functions in Definition 4, and they are kept for notational consistency with other papers considering the DR-expectation.

Remark 7

In principle, we could now apply the DR-expectation framework to a filtering context as follows: Take a collection of models \(\mathcal {Q}\). For a random variable ξ, and for each measure \(\mathbb {Q}\in \mathcal {Q}\), compute \(E_{\mathbb {Q}}[\xi |\mathbf {y}_{t}]\) and \(\alpha _{\mathbf {y}_{t}}(\mathbb {Q})\). Taking a supremum as in (3), we obtain the DR-expectation. However, this is generally not computationally tractable in this form.
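The brute-force procedure described in this remark can be sketched on a toy problem. The code below assumes a two-state chain whose only unknown is a static persistence parameter `a`, a flat (improper) prior so that the prior terms vanish, a fixed initial distribution, an assumed Gaussian observation density, and a finite grid in place of the full family of models. All of these are simplifications made for the example, so the result is only a lower bound on the supremum over a continuous family.

```python
import numpy as np

def run_filter(a, p0, c_seq):
    """Return (p_T, log L^obs) for the two-state model with A = [[a, 1-a], [1-a, a]]."""
    A = np.array([[a, 1.0 - a], [1.0 - a, a]])
    p, loglik = p0, 0.0
    for c_vals in c_seq:
        u = c_vals * (A @ p)
        loglik += np.log(u.sum())
        p = u / u.sum()
    return p, loglik

def dr_expectation(values, alpha, k=1.0, k_prime=1.0):
    """Max over the grid of E_Q[xi | y] - (alpha(Q)/k)^{k'}."""
    if np.isinf(k_prime):
        # convention: x^infinity = 0 on [0, 1] and +infinity otherwise
        penalty = np.where(alpha / k <= 1.0, 0.0, np.inf)
    else:
        penalty = (alpha / k) ** k_prime
    return np.max(values - penalty)

means = np.array([0.0, 2.0])
ys = np.array([0.1, 1.9, 2.3, 0.4, 1.7])        # hypothetical observations
c_seq = [np.exp(-0.5 * (y - means) ** 2) / np.sqrt(2 * np.pi) for y in ys]
p0 = np.array([0.5, 0.5])

grid = np.linspace(0.05, 0.95, 19)              # candidate values of the static parameter
results = [run_filter(a, p0, c_seq) for a in grid]
values = np.array([p[1] for p, _ in results])   # E_Q[xi | y_t] with xi = 1{X_t = e_2}
logliks = np.array([ll for _, ll in results])
alpha = logliks.max() - logliks                 # normalised so that min alpha = 0

map_value = values[np.argmax(logliks)]
upper = dr_expectation(values, alpha, k=1.0, k_prime=1.0)
upper_coherent = dr_expectation(values, alpha, k=1.0, k_prime=np.inf)
```

Each grid point requires a full pass of the filter over the data, which is why this direct approach scales poorly as the parameter space grows.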

Lemma 1

Let \(\{\mathcal {F}_{t}\}_{t\ge 0}\) be a filtration such that Y is adapted. For \(\mathcal {F}_{t}\)-measurable random variables, the choice of horizon T≥t in the definition of the penalty function α is irrelevant. In particular, for \(\mathcal {F}_{t}\)-measurable ξ, if \(\mathcal {Q}\|_{\mathcal {F}_{t}}=\{\mathbb {Q}\|_{\mathcal {F}_{t}}: \mathbb {Q}\in \mathcal {Q}\}\), we know

$$\mathcal{E}_{\mathbf{y}_{t}}(\xi) = \sup_{\mathbb{Q}\in \mathcal{Q}\|_{\mathcal{F}_{t}}}\left\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathbf{y}_{t}] -\left(\frac{1}{k}\alpha_{\mathbf{y}_{t}}(\mathbb{Q}\|_{\mathcal{Y}_{t}})\right)^{k'}\right\}, $$

where \(\alpha _{\mathbf {y}_{t}}(\mathbb {Q}\|_{\mathcal {Y}_{t}})\) is defined as above, in terms of the restricted measure \(\mathbb {Q}\|_{\mathcal {Y}_{t}}\).


Proof

By construction, \(\alpha _{\mathbf {y}_{t}}\) is obtained from the posterior relative density, which is determined by the restriction of \(\mathbb {Q}\) to \(\mathcal {Y}_{t}\subseteq \mathcal {F}_{t}\), while the expectation depends only on the restriction of \(\mathbb {Q}\) to \(\mathcal {F}_{t}\). As these are the only terms needed to compute the DR-expectation, the result follows. □

Theorem 3

The penalty in the DR-expectation is additive, in the sense of Definition 4.


From Proposition 2, we have the likelihood

$$L^{\text{obs}}(\mathbb{Q}(A, C, p_{0})|\mathbf{y}_{t}) = \prod_{s=1}^{t}\mathbf{1}^{\top} C_{s}(Y_{s}) A_{s} p_{s-1}^{(A,C), p_{0}},$$

where \(p_{s}^{(A,C), p_{0}}\) is the solution to the filtering Eq. 1. By Lemma 1, the penalty in the DR-expectation is given by

$$\begin{aligned} &\alpha_{\mathbf{y}_{t}}(\mathbb{Q}\|_{\mathcal{Y}_{t}})\\ &= - \log\left(L^{\text{obs}}(\mathbb{Q}(A, C, p_{0})|\mathbf{y}_{t}) \cdot e^{-\kappa_{\text{prior}}(\mathfrak{S}, p_{0}) - \sum_{s\le t} \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{s})}\right)-m_{t}\\ &= \kappa_{\text{prior}}(\mathfrak{S}, p_{0}) + \sum_{s\le t}\left[ \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{s}) -\log\left(\mathbf{1}^{\top} C_{s}(Y_{s}) A_{s} p_{s-1}^{(A,C), p_{0}}\right)\right]-m_{t}, \end{aligned} $$

where \(m_{t}\) is chosen to ensure \(\inf _{\mathbb {Q}}\alpha _{\mathbf {y}_{t}}(\mathbb {Q}) \equiv 0\). As \((A_{t}, C_{t}(\cdot)) = \Phi (\mathfrak {S}, \mathfrak {D}_{t})\), we obtain the desired form by setting

$$\gamma_{t}(\mathfrak{S}, \mathfrak{D}_{t}, \{Y_{s}\}_{s\le t}, p) = \gamma_{\text{prior}}(\mathfrak{S},\mathfrak{D}_{t}) -\log\left(\mathbf{1}^{\top} C_{t}(Y_{t}) A_{t} p\right).$$
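The one-step term above can be evaluated alongside the usual filter update. The following is a minimal sketch, assuming that A is column-stochastic (so that it acts on the column vector p) and that \(C(Y_{t})\) is diagonal with entries given by the observation densities; the function names and the list-based representation are hypothetical choices made for illustration.

```python
import math


def filter_step(p_prev, A, c_y):
    """One step of the filter p_t ∝ C(Y_t) A p_{t-1}.

    A[i][j] is taken to be P(X_t = e_i | X_{t-1} = e_j) (assumption:
    columns sum to one); c_y[i] is the density of the observed Y_t given
    X_t = e_i, i.e. the diagonal of C(Y_t).
    Returns the normalized filter and the likelihood 1^T C(Y_t) A p_{t-1}.
    """
    n = len(p_prev)
    predicted = [sum(A[i][j] * p_prev[j] for j in range(n)) for i in range(n)]
    unnormalized = [c_y[i] * predicted[i] for i in range(n)]
    likelihood = sum(unnormalized)
    return [u / likelihood for u in unnormalized], likelihood


def one_step_penalty(p_prev, A, c_y, gamma_prior=0.0):
    """gamma_t = gamma_prior - log(1^T C(Y_t) A p_{t-1})."""
    _, likelihood = filter_step(p_prev, A, c_y)
    return gamma_prior - math.log(likelihood)
```

Summing `one_step_penalty` along the filter path, and adding the prior penalty, reproduces the additive structure of the penalty up to the normalizing constant.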

Remark 8

The purpose of the nonlinear expectation is to give an “upper” estimate of a random variable, accounting for uncertainty in the underlying probabilities. This is closely related to robust estimation in the sense of Wald (1945). In particular, one can consider the robust estimator given by

$$\mathrm{arg\,inf}_{\hat\xi\in\mathbb{R}^{N}} \mathcal{E}_{\mathbf{y}_{t}}\left(\left\|\xi -\hat \xi\right\|^{2}\right),$$

which gives a “minimax” estimate of ξ, given the observations \(\mathbf{y}_{t}\) and a quadratic loss function. The advantage of the nonlinear expectation approach is that it allows one to construct such an estimate for every random variable/loss function, giving a cost-specific quantification of uncertainty in each case.

We can also see a connection with the theory of \(H_{\infty}\) filtering (see, for example, Grimble and El Sayed (1990) or more recently Zhang et al. (2009) and references therein, or the more general \(H_{\infty}\)-control theory in Başar and Bernhard (1991)). In this setting, we look for estimates which perform best in the worst situation, where “worst” is usually defined in terms of a perturbation to the input signal or coefficients. In our setting, we focus not on the estimation problem directly, but on the “dual” problem of building an upper expectation, i.e. calculating the “worst” expectation in terms of a class of perturbations to the coefficients (our setting is general enough that perturbation to the signal can also be included through shifting the coefficients).
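To illustrate the minimax estimate in this remark, the sketch below treats a scalar ξ with finite support and a finite list of candidate models, and minimizes the penalized worst-case quadratic loss by grid search. The scalar setting, the finite model list, the grid of candidates and all function names are simplifying assumptions made for this example.

```python
def upper_expected_loss(x_hat, models, k=1.0, k_prime=1.0):
    """Penalized worst-case quadratic loss for a scalar xi.

    Each model is (values, probs, alpha): the support of xi, its
    conditional probabilities under Q, and the normalized penalty alpha(Q).
    """
    best = float("-inf")
    for values, probs, alpha in models:
        loss = sum(q * (v - x_hat) ** 2 for v, q in zip(values, probs))
        best = max(best, loss - (alpha / k) ** k_prime)
    return best


def minimax_estimate(models, candidates, k=1.0, k_prime=1.0):
    """Grid search for the candidate minimizing the upper expected loss."""
    return min(candidates,
               key=lambda x: upper_expected_loss(x, models, k, k_prime))
```

With two equally plausible models that place ξ at 0 and at 1 respectively, the estimate sits midway between them; penalizing one model heavily moves the estimate towards the other.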

Remark 9

There are also connections between our approach and what is called “risk-sensitive filtering”, see, for example, James et al. (1994); Dey and Moore (1995); or the review of Boel et al. (2002) and references therein (from an engineering perspective); or Hansen and Sargent (2007); Hansen and Sargent (2008) (from an economic perspective). In their setting, one uses the nonlinear expectation defined by

$$\mathcal{E}(\xi|\mathcal{Y}_{t}) = -k \log \mathbb{E}_{\mathbb{P}}\big[\exp(- \xi/k)\big|\mathcal{Y}_{t}\big],$$

for some choice of robustness parameter 1/k>0. This leads to significant simplification, as dynamic consistency and recursivity are guaranteed in every filtration (see Graf (1980) and Kupper and Schachermayer (2009), and further discussion in Section 5). The corresponding penalty function is given by the conditional relative entropy,

$$\mathcal{R}_{t}(\mathbb{Q}) = k \mathbb{E}_{\mathbb{Q}}[\log(d\mathbb{Q}/d\mathbb{P})|\mathcal{Y}_{t}],$$

which is additive (Definition 4) and the one-step penalty can be calculated accordingly. In this case, the optimization defining the nonlinear expectation could also be taken over \(\mathcal {M}_{1}\), so this approach has a claim to be including “nonparametric” uncertainty, as all measures are considered, rather than purely Markov measures or measures in a parametric family (however, the optimization can be taken over conditionally Markov measures, and one will obtain an identical result!).

The difficulty with this approach is that it does not allow for easy incorporation of knowledge of the error of estimation of the generators (A,C) in the level of robustness—the only parameter available to choose is k, which multiplies the relative entropy. A small choice of k corresponds to a small penalty, hence a very robust expectation, but this robustness is not directly linked to the estimation of the generators (A,C). Therefore, the impact of statistical estimation error remains obscure, as k is chosen largely exogenously of this error. For this reason, our approach, which directly allows for the penalty to be based on the statistical estimation of the generators, has advantages over this simpler method.

Recursive penalties

The DR-expectation provides us with an approach to including statistical estimation in our valuations. However, the calculations suggested by Remark 7 are generally intractable in their stated form. In this section, we shall see how the assumption that the penalty is additive (Definition 4) can be used to simplify our calculations.

Our arguments will be based on dynamic programming techniques. For the sake of precision and brevity, we here state a (forward in time) abstract “dynamic programming principle” which we can call on in later arguments.

Theorem 4

Let U be a topological space which is equal to the countable union of compact metrizable subsets of itself. For some m>0, let \(g_{t}:\mathbb {R}^{m}\times U\to \mathbb {R}^{m}\) be a sequence of Borel measurable functions. For any sequence \(\mathbf{u}=(u_{1},\dots,u_{T})\) in U, let the sequence \(Z_{t}^{\mathbf {u},z_{0}}\) be defined by the recursion

$$Z_{t}^{\mathbf{u},z_{0}} = g_{t}\left(Z_{t-1}^{\mathbf{u},z_{0}}, u_{t}\right) \qquad \text{and}\qquad Z_{0}^{\mathbf{u},z_{0}} = z_{0}.$$

For each \(u\in U\), we write \(g_{t}^{-1}(\cdot,u)\) for the (set-valued) inverse of \(g_{t}(\cdot,u)\).

Suppose we have a sequence of Borel measurable maps \(\mathcal {A}_{t}:\mathbb {R}\times U\times \mathbb {R}^{m}\to \mathbb {R}\) such that \(v\mapsto \mathcal {A}_{t}(v,u,z)\) is nondecreasing and continuous (uniformly in u,z) and \(\mathcal {A}_{t}(v, u,z) \to \infty \) as \(v\to\infty\). For each \(\mathbf{u}\) and \(z_{0}\), we define the sequence of values at each time t by

$$V_{t}(\mathbf{u}, z_{0}) = \mathcal{A}_{t}\left[V_{t-1}(\mathbf{u},z_{0}), u_{t}, Z^{\mathbf{u},z_{0}}_{t-1}\right]\qquad \text{and}\qquad V_{0}(\mathbf{u},z_{0}) = v_{0}(z_{0}).$$

Then, the minimal value,

$$V^{*}_{t}(z) := \inf_{\left\{\mathbf{u}, z_{0}: Z^{\mathbf{u},z_{0}}_{t}=z\right\}} V_{t}(\mathbf{u},z_{0}),$$

satisfies the recursion

$$V^{*}_{t}(z) = \inf_{u\in U} \inf_{y\in g^{-1}_{t}(z,u)}\left\{\mathcal{A}_{t}\left[V^{*}_{t-1}(y), u, y\right]\right\} \qquad \text{and}\qquad V^{*}_{0}(z_{0})=v_{0}(z_{0}), $$

(with the convention that the infimum of the empty set is +∞).


We proceed by induction. Clearly, the result holds at t=0, as does the (at t=0 empty) statement

$$V^{*}_{t}(z) = +\infty \quad \text{for all }z\not\in\bigcup_{\mathbf{u}, z_{0}}\left\{ Z^{\mathbf{u},z_{0}}_{t}\right\}. $$

Suppose then that (4) and (5) hold at t=n−1. For every ε>0, there exists \((\mathbf{u},z_{0})\) such that

$$\begin{aligned} V_{n}(\mathbf{u}, z_{0}) &= \mathcal{A}_{n}\left[V_{n-1}(\mathbf{u},z_{0}), u_{n}, Z^{\mathbf{u}}_{n-1}\right]\\ &\leq \mathcal{A}_{n}\left[V_{n-1}^{*}\left(Z^{\mathbf{u}}_{n-1}\right)+\epsilon, u_{n}, Z^{\mathbf{u}}_{n-1}\right]. \end{aligned} $$

Taking the infimum over \(\left \{\mathbf {u}, z_{0}:Z_{n}^{\mathbf {u}} = z\right \}\) (which can be done measurably with respect to z, given Filippov’s implicit function theorem; see, for example, Cohen and Elliott (2015) Appendix 10) and sending ε→0 gives

$$\begin{aligned} \inf_{\left\{\mathbf{u}, z_{0}:Z_{n}^{\mathbf{u}} = z\right\}}V_{n}(\mathbf{u}, z_{0}) & = \inf_{\left\{\mathbf{u}, z_{0}:Z_{n}^{\mathbf{u}} = z\right\}} \mathcal{A}_{n}\left[V_{n-1}^{*}\left(Z^{\mathbf{u}}_{n-1}\right), u_{n}, Z^{\mathbf{u}}_{n-1}\right]. \end{aligned} $$

From the definition of g, we know that

$$\left\{\mathbf{u},z_{0}: Z^{\mathbf{u},z_{0}}_{n} = z\right\}=\left\{\mathbf{u},z_{0}: Z^{\mathbf{u}, z_{0}}_{n-1}\in g^{-1}_{n}(z, u_{n})\right\}$$

from which we derive

$$\begin{aligned} V_{n}^{*}(z) & = \inf_{\left\{\mathbf{u},z_{0}: Z^{\mathbf{u}, z_{0}}_{n-1}\in g^{-1}_{n}(z, u_{n})\right\}} \mathcal{A}_{n}\left[V_{n-1}^{*}\left(Z^{\mathbf{u},z_{0}}_{n-1}\right), u_{n}, Z^{\mathbf{u},z_{0}}_{n-1}\right]. \end{aligned} $$

The right side of this equation depends on \((\mathbf{u},z_{0})\) only through the values of \(Z^{\mathbf{u},z_{0}}_{n-1}\) and \(u_{n}\). In particular, considering the set of attainable y, that is, \(y\in \bigcup _{\mathbf {u}, z_{0}}\left \{ Z^{\mathbf {u},z_{0}}_{n-1}\right \}\), we change variables to write

$$\begin{aligned} V_{n}^{*}(z) &=\inf_{u_{n}\in U}\inf_{\substack{y\in g^{-1}_{n}(z, u_{n})\\y\in \bigcup_{\mathbf{u}, z_{0}}\left\{ Z^{\mathbf{u},z_{0}}_{n-1}\right\}}} \mathcal{A}_{n}\left[V_{n-1}^{*}(y), u_{n}, y\right]. \end{aligned} $$

As the infimum on the empty set is +∞, we also obtain (5) at time n, and simplify to give (4) for t=n, completing the inductive proof. □
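The content of Theorem 4 can be checked numerically on a small finite problem, by comparing a brute-force minimization over all control sequences and initial states with the one-step recursion. In the sketch below, the state space, the control set, the dynamics `g`, the operator `step_cost` and the initial value `v0` are all hypothetical toy choices (and are taken to be time-homogeneous for brevity); they are not objects from the paper.

```python
import itertools
import math

STATES = range(4)    # hypothetical finite state space for Z
CONTROLS = range(3)  # hypothetical finite control set U


def g(z, u):
    """Toy forward dynamics; not injective in z because of the cap at 3."""
    return min(z + u, 3)


def step_cost(v, u, z):
    """Toy operator A[v, u, z]; nondecreasing in v, as the theorem requires."""
    return v + (u - 1) ** 2 + 0.5 * z


def v0(z0):
    return float((z0 - 1) ** 2)


def brute_force(T):
    """V*_T(z): minimize V_T(u, z0) over all (u, z0) with Z_T = z."""
    best = {z: math.inf for z in STATES}
    for z0 in STATES:
        for u_seq in itertools.product(CONTROLS, repeat=T):
            z, v = z0, v0(z0)
            for u in u_seq:
                v = step_cost(v, u, z)  # uses Z_{t-1}, as in the theorem
                z = g(z, u)
            best[z] = min(best[z], v)
    return best


def recursion(T):
    """V*_T(z) via the one-step recursion over the set-valued inverse of g."""
    value = {z: v0(z) for z in STATES}
    for _ in range(T):
        new = {z: math.inf for z in STATES}  # inf over the empty set is +inf
        for y in STATES:
            for u in CONTROLS:
                # Visiting every (y, u) and updating z = g(y, u) is the same
                # as taking the infimum over u and over y in g^{-1}(z, u).
                z = g(y, u)
                new[z] = min(new[z], step_cost(value[y], u, y))
        value = new
    return value
```

The recursion visits each pair (y, u) once per time step, whereas the brute-force search grows exponentially in the horizon; this is the computational gain exploited in the sequel.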

Corollary 1

Suppose, instead of Z being defined by a forward recursion, for some Borel measurable function \(\tilde g\) we had the backward recursion

$$\tilde g_{t}\left(Z_{t}^{\mathbf{u},z_{0}}, u_{t}\right) = Z_{t-1}^{\mathbf{u},z_{0}}.$$

The result of Theorem 4 still holds (with effectively the same proof), where we write \(\tilde g_{t}\) instead of \(g_{t}^{-1}\), so the second infimum in (4) is unnecessary (as \(\tilde g(z,u)\) is single-valued).

For practical purposes, it is critical that we refine our approach to provide a recursive construction of our nonlinear expectation. In classical filtering, one obtains a recursion for expectations \(\mathbb {E}[\phi (X_{t})|\mathcal {Y}_{t}]\), for Borel functions ϕ; one does not typically consider the expectations of general random variables. Similarly, we will consider the expectations of random variables ϕ(Xt).

Proposition 3

For each t, there exists a \(\mathcal {Y}_{t}\otimes \mathcal {B}(\mathbb {R})\)-measurable function \(\kappa_{t}\) such that, for every Borel function ϕ,

$$ \begin{aligned} \mathcal{E}_{\mathbf{y}_{t}}(\phi(X_{t})) &:= \sup_{\mathbb{Q}\in\mathcal{Q}} \left\{\mathbb{E}_{\mathbb{Q}}[\phi(X_{t})|\mathbf{y}_{t}]-\mathcal{R}_{t}(\mathbb{Q})\right\}\\ &= \sup_{q\in S_{N}^{+}}\left\{\sum_{i} q_{i} \phi(e_{i}) - \left(\frac{1}{k}\kappa_{t}(\omega, q)\right)^{k'}\right\}, \end{aligned} $$

where \(S_{N}^{+}\) denotes the probability simplex in \(\mathbb {R}^{N}\), that is, \(S_{N}^{+}=\{x\in \mathbb {R}^{N}: \sum _{i} x_{i} = 1,\quad x_{i}\geq 0 \quad \forall i\}\).


Fix the observations \(\mathbf{y}_{t}\). Taking \(\Omega_{X}\) to be the possible states of \(X_{t}\), that is, the basis vectors in \(\mathbb {R}^{N}\) with the discrete topology and corresponding Borel σ-algebra, the space of measures on \(\Omega_{X}\) is represented by the probability simplex \(S_{N}^{+}\). We consider the map

$$\phi\mapsto\mathcal{E}'(\phi(\omega_{X})):= \mathcal{E}_{\mathbf{y}_{t}}(\phi(X_{t}))$$

as a nonlinear expectation with underlying space \(\Omega_{X}\). By Theorem 2, it follows that there exists a penalty function \(\kappa _{\mathbf {y}_{t}}(q)\) such that

$$\mathcal{E}'(\phi(\omega_{X})) = \sup_{q\in S_{N}^{+}}\left\{\sum_{i} q_{i} \phi(e_{i}) - \left(\frac{1}{k}\kappa_{\mathbf{y}_{t}}(q)\right)^{k'}\right\}.$$

Taking a regular version of this penalty function (which by convex duality exists as \(\mathcal {E}\) is measurable in \(\mathbf{y}_{t}\)), we can write \(\kappa _{t}(\omega, q) = \kappa _{\mathbf {y}_{t}}(q)\) as desired. □

Our aim is to find a recursion for κt, for various choices of \(\mathcal {R}\). Our constructions will depend on the following object.

Definition 7

Recall from Theorem 1 and Assumption 1 that, given a generator \((A_{t},C_{t}(\cdot))=\Phi (\mathfrak {S}, \mathfrak {D}_{t}) \in \mathbb {A}\) at time t, our filter dynamics are described by the recursion (up to proportionality)

$$p_{t} = \mathbb{G}(p_{t-1}, \mathfrak{S}, \mathfrak{D}_{t}, Y_{t}) \propto C(Y_{t})A p_{t-1}.$$

We correspondingly define the (set-valued) inverse

$$\mathbb{G}^{-1}(p; \mathfrak{S}, \mathfrak{D}_{t}, Y_{t}) = \left\{p'\in S_{N}^{+} : \mathbb{G}\left(p', \mathfrak{S}, \mathfrak{D}_{t}, Y_{t}\right) = p\right\}.$$

For notational simplicity, we will omit the argument Yt when this does not lead to confusion.

The set \(\mathbb {G}^{-1}(p; \mathfrak {S}, \mathfrak {D}_{t},Y_{t})\) represents the filter states at time t−1 which evolve to p at time t, assuming the generator of our process (at time t) is given by \(\Phi (\mathfrak {S}, \mathfrak {D}_{t})\) and we observe Yt. This set may be empty, if no such filter states exist. As the matrix A is generally not invertible (even accounting for the restriction to \(S_{N}^{+}\)), the set \(\mathbb {G}^{-1}(p; \mathfrak {S}, \mathfrak {D}_{t}, Y_{t})\) is not generally a singleton.
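For N=2, the set-valued inverse can be computed explicitly in the special case where A is invertible, in which case it contains at most one point. The sketch below covers only this case and assumes a column-stochastic A and strictly positive observation densities; the function names are hypothetical. When A is singular, the preimage may be empty or a whole segment of the simplex, and the sketch raises an error rather than attempting to describe it.

```python
def forward_filter(p_prev, A, c_y):
    """Normalized filter step p ∝ C(Y) A p_prev for N = 2."""
    u = [c_y[i] * (A[i][0] * p_prev[0] + A[i][1] * p_prev[1]) for i in range(2)]
    total = u[0] + u[1]
    return [u[0] / total, u[1] / total]


def inverse_filter(p, A, c_y, tol=1e-12):
    """Preimage of p under forward_filter, for invertible A and c_y > 0.

    Returns a list holding the unique preimage, or an empty list when the
    candidate leaves the simplex (no filter state evolves to p).
    """
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if abs(det) < tol:
        raise ValueError("singular A: preimage may be empty or a segment")
    # Undo the observation update: the predicted state is ∝ C(Y)^{-1} p.
    w = [p[0] / c_y[0], p[1] / c_y[1]]
    # Undo the transition: q ∝ A^{-1} w (explicit 2x2 inverse).
    q = [(A[1][1] * w[0] - A[0][1] * w[1]) / det,
         (A[0][0] * w[1] - A[1][0] * w[0]) / det]
    total = q[0] + q[1]  # positive when the columns of A sum to one
    q = [q[0] / total, q[1] / total]
    return [q] if min(q) >= -tol else []
```

The empty return value corresponds to the case noted above in which no filter state at time t−1 evolves to p.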

Filtering with uncertainty

We now show that if we assume our penalty is additive, then the function κ appearing in (6) can be obtained in a recursive manner.

Theorem 5

Suppose \(\mathcal {R}_{t}\) is additive, in the sense of Definition 4. Then, a function κ satisfying (6) is given by

$$\kappa_{t}(p) = \inf_{\mathfrak{S}}K_{t}(p, \mathfrak{S}),$$

where Kt satisfies the recursion

$$K_{t}(p,\mathfrak{S}) = \inf_{\mathfrak{D}_{t}} \left\{\inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{S}, \mathfrak{D}_{t}, Y_{t})} \left\{K_{t-1}\left(p', \mathfrak{S}\right) + \gamma_{t}\left(\mathfrak{S}, \mathfrak{D}_{t}, \{Y_{s}\}_{s\le t}, p'\right)\right\}\right\}-m'_{t}, $$

with initial value \(K_{0}(p_{0}, \mathfrak {S}) = \kappa _{\text {prior}}(p_{0}, \mathfrak {S})\), where \(m'_{t}\) is chosen to ensure we have the normalization \(\inf _{p, \mathfrak {S}}K_{t}(p, \mathfrak {S}) \equiv 0\).


As we know that \(\mathcal {R}_{t}\) is additive, we have the representation \(\mathcal {R}_{t}(\mathbb {Q}) = \left (k^{-1} \alpha _{t}(\mathbb {Q}, \{Y_{s}\}_{s\le t})\right)^{k'}\), where

$$\alpha_{t}(\mathbb{Q}, \{Y_{s}\}_{s\le t}) = \kappa_{\text{prior}}(p_{0}, \mathfrak{S}) + \sum_{s\le t} \gamma_{s}\left(\mathfrak{S}, \mathfrak{D}_{s}, \{Y_{n}\}_{n\le s}, p_{s-1}^{\mathbb{Q}}\right)+m_{t}.$$

As \(\mathbb {E}_{\mathbb {Q}}[\phi (X_{t})|\mathbf {y}_{t}]\) depends only on the conditional law of \(X_{t}|\mathcal {Y}_{t}\) under \(\mathbb {Q}\), it is easy to see that (6) is satisfied when

$$ \left(\frac{1}{k}\kappa_{t}(p)\right)^{k'} = \inf_{\{\mathbb{Q}:\mathbb{E}_{\mathbb{Q}}[X_{t}|\mathcal{Y}_{t}]= p\}}\mathcal{R}_{t}(\mathbb{Q}) = \inf_{\{\mathbb{Q}:\mathbb{E}_{\mathbb{Q}}[X_{t}|\mathcal{Y}_{t}]= p\}}\left(\frac{1}{k}\alpha_{t}(\mathbb{Q})\right)^{k'}. $$

We wish to write the minimization in (7) as a recursive control problem, to which we can apply Theorem 4. Given \(p_{0}\), \(\mathfrak {S}\), and \(\{\mathfrak {D}_{s}\}_{s\le t}\), the law of \(X_{t}|\mathcal {Y}_{t}\) is given by the solution to the filtering Eq. 1. Write \(Z_{t} = (p_{t}, \mathfrak {S})\), and \(u_{t} = \mathfrak {D}_{t}\), so that Z is a state process defined by \(Z_{0} = z_{0} = (p_{0}, \mathfrak {S})\) and the recursion (controlled by \(\mathbf {u} =\{u_{s}\}_{s\le t} = \{\mathfrak {D}_{s}\}_{s\le t}\))

$$Z_{t} = \hat{\mathbb{G}}(Z_{t-1}, \mathfrak{D}_{t}) := \left(\mathbb{G}\left(p_{t-1}, \mathfrak{S}, \mathfrak{D}_{t}, Y_{t}\right),\, \mathfrak{S}\right).$$

Omitting the constant mt from the definition of αt, we define

$$V_{t}(z_{0}, \mathbf{u}) = \kappa_{\text{prior}}(p_{0}, \mathfrak{S}) + \sum_{s\le t} \gamma_{s}\left(\mathfrak{S}, \mathfrak{D}_{s}, \{Y_{n}\}_{n\le s}, p_{s-1}\right).$$

Taking \(\mathcal {A}_{t}\) to be the operator

$$\mathcal{A}_{t}\left[V_{t-1}, u_{t}, Z_{t-1}^{\mathbf{u}, z_{0}}\right]= V_{t-1} + \gamma_{t}(\mathfrak{S}, u_{t}, \{Y_{s}\}_{s\le t}, p_{t-1}),$$

we see that V satisfies the structure assumed in Theorem 4. Therefore, its minimal value satisfies

$$\begin{aligned} V^{*}_{t}(z) &= \inf_{\left\{z_{0}, \mathbf{u}:Z_{t}^{z_{0}, \mathbf{u}} = z\right\}} V_{t}(z_{0}, \mathbf{u}) \\ &= \inf_{\mathfrak{D}_{t}} \left\{\inf_{(p', \mathfrak{S})= z'\in \hat{\mathbb{G}}^{-1}(z, \mathfrak{D}_{t})} \left\{V^{*}_{t-1}(z') + \gamma_{t}\left(\mathfrak{S}, \mathfrak{D}_{t}, \{Y_{s}\}_{s\le t}, p'\right)\right\}\right\} \end{aligned} $$

with initial value \(V^{*}_{0}(z) = \kappa _{\text {prior}}(p_{0}, \mathfrak {S})\). We renormalize this by setting \(m^{\prime }_{t} = \inf _{z} V^{*}_{t}(z)\) and \(K_{t}(p, \mathfrak {S}) := V^{*}_{t}(z) -m'_{t}\), and so obtain the stated dynamics for K. By construction, we know

$$K_{t}(p, \mathfrak{S}) = \inf_{\{\mathfrak{D}_{s}\}_{s\le t}}\left\{\alpha_{t}\left(\mathbb{Q}, \{Y_{s}\}_{s\leq t}\right): \mathbb{E}_{\mathbb{Q}}[X_{t}|\mathcal{Y}_{t}]= p, \mathbb{Q} = \mathbb{Q}\left(p_{0}, \mathfrak{S}, \{\mathfrak{D}_{s}\}_{s\le t}\right)\right\}.$$

It follows that (7), and hence (6), are satisfied by taking \(\kappa _{t}(p) = \inf _{\mathfrak {S}} K_{t}(p, \mathfrak {S})\), as desired. □


In this section, we will seek to outline a few key settings where this theory can be applied.

Static generators, uncertain prior (StaticUP)

We first consider the case where uncertainty is given over the prior inputs to the filter. In particular, this “prior uncertainty” is not updated given new observations, and \(\mathcal {R}\) will not change through time.

Framework 1

(StaticUP) In a StaticUP setting, the inputs to the filtering problem are the initial filter state \(p_{0}\) and the generator (A,C(·)), which we parameterize solely using the static parameter \(\mathfrak {S}\). In particular, we exclude dependence on the “dynamic” parameters \(\{\mathfrak {D}_{t}\}_{t\ge 0}\). To represent our uncertain prior, we take a penalty

$$\mathcal{R}(\mathbb{Q}) = \left(\frac{1}{k} \alpha(\mathbb{Q})\right)^{k'} = \left(\frac{1}{k}\kappa_{\text{prior}}(p_{0}, \mathfrak{S})\right)^{k'}$$

for some prescribed penalty \(\kappa_{\text{prior}}\).

We now apply Theorem 5 (omitting dependence on \(\mathfrak {D}_{t}\), as we are in a purely static setting) to see that a dynamic version κt of the penalty function, satisfying (6), can be computed as

$$\kappa_{t}(p) = \inf_{\mathfrak{S}}K_{t}(p, \mathfrak{S}),$$

where Kt satisfies the recursion

$$K_{t}(p,\mathfrak{S}) = \inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{S}, Y_{t})} \left\{K_{t-1}(p', \mathfrak{S})\right\}-m'_{t}. $$

Assuming \(\inf _{(p_{0}, \mathfrak {S})}\kappa _{\text {prior}}(p_{0}, \mathfrak {S})=0\), we further compute \(m^{\prime }_{t} \equiv 0\). This completely characterizes the penalty function, and hence the nonlinear expectation.

Remark 10

Inspired by the DR-expectation, a possible choice of penalty function \(\kappa_{\text{prior}}\) would be the negative log-density of a prior distribution for the inputs \((p_{0}, \mathfrak {S})\), shifted to have minimal value zero. Alternatively, taking an empirical Bayesian perspective, \(\kappa_{\text{prior}}\) could be the log-likelihood from a prior calibration process. In this case, we are incorporating our prior statistical uncertainty regarding the parameters in the filtering problem.

Remark 11

We emphasize that there is no learning of the generator being done in this framework—the penalty applied at time t=0 is simply propagated forward; our observations do not affect our opinion of the plausible generators. In particular, if we assume no knowledge of the initial state (i.e., a zero penalty), then we will have no knowledge of the state at time t (unless the observations cause the filter to degenerate).

Example 1

For a concrete example of the StaticUP framework, we take the class of models in \(\mathcal {M}_{M}\) where A and C are perfectly known and A=I, so Xt=X0 is constant (but X0 is unknown). We take N=2, so X takes only one of two values. For the observation distribution C, we assume that

$$Y_{t}|(X_{t}=e_{1}) \sim \text{Bernoulli}(a), \qquad Y_{t}|(X_{t}=e_{2}) \sim \text{Bernoulli}(b),$$

where \(a,b\in(0,1)\) are fixed constants. Effectively, in this example we are using filtering to determine which of two possible parameters is the correct mean for our observation sequence. It is worth emphasising that the filter process p corresponds to the posterior probabilities, in a Bayesian setting, of the events that our Bernoulli process has parameter a or b.

It is useful to note that, from classical Bayesian statistical calculations, for a given \(p_{0}\), one can see that the corresponding value of \(p_{t}\) is determined from the log-odds ratio,

$$\log\left(\frac{p_{t}^{1}}{p_{t}^{2}}\right) = \log\left(\frac{p_{0}^{1}}{p_{0}^{2}}\right)+ {t\bar{Y}_{t}}\log\left(\frac{a}{b}\right)+t(1-\bar Y_{t})\log\left(\frac{1-a}{1-b}\right),$$

where \(\bar Y_{t} = t^{-1} \sum _{s\leq t} Y_{s}\) is the average number of successes observed up to time t.
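The calculation behind this log-odds formula can be spelled out as follows, writing \(c_{1}(y)=a^{y}(1-a)^{1-y}\) and \(c_{2}(y)=b^{y}(1-b)^{1-y}\) for the two Bernoulli observation densities (this shorthand is introduced here only for the derivation). As A=I, the filter recursion reduces to \(p_{t}^{i}\propto c_{i}(Y_{t})\,p_{t-1}^{i}\), so the normalizing constant cancels in the ratio:

$$\frac{p_{t}^{1}}{p_{t}^{2}} = \frac{p_{t-1}^{1}}{p_{t-1}^{2}} \left(\frac{a}{b}\right)^{Y_{t}}\left(\frac{1-a}{1-b}\right)^{1-Y_{t}}.$$

Taking logarithms and iterating from s=1 to t, with \(\sum_{s\le t}Y_{s} = t\bar Y_{t}\) and \(\sum_{s\le t}(1-Y_{s}) = t(1-\bar Y_{t})\), gives the stated expression.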

To write down the StaticUP penalty function, let the (known) dynamics be described by \(\mathfrak {S}^{*}\). Consequently, we can write \(K(p, \mathfrak {S})=\infty \) for all \(\mathfrak {S}\neq \mathfrak {S}^{*}\). We initialize with a known penalty \(\kappa _{\text {prior}}(p, \mathfrak {S}^{*})=\kappa _{0}(p)\) for all \(p\in S_{N}^{+}\). As \(\mathfrak {S}^{*}\) is known, there is no distinction between K and κ, that is,

$$\kappa_{t}(p) =\inf_{\mathfrak{S}} K_{t}(p, \mathfrak{S}) = K_{t}(p, \mathfrak{S}^{*}) = \inf_{\left\{p_{0}: \mathbb{E}_{\mathbb{Q}(\mathfrak{S}^{*}, p_{0})}[X_{t}|\mathcal{Y}_{t}] = p\right\}}\left\{\kappa_{0}(p_{0})\right\}.$$

In this example, we can express our penalty in terms of the log-odds, for the sake of notational simplicity given the closed-form solution to the filtering problem, and hence can explicitly calculate the (unique) initial distribution p0 which would evolve to a given p at time t. In particular, the time-t penalty is given by a shift of the initial penalty:

$$\kappa_{t}\bigg(\log\left(\frac{p_{t}^{1}}{p_{t}^{2}}\right)\bigg) = \kappa_{0}\bigg(\log\left(\frac{p_{t}^{1}}{p_{t}^{2}}\right)-{t\bar Y_{t}}\log\left(\frac{a}{b}\right)-t(1-\bar Y_{t})\log\left(\frac{1-a}{1-b}\right) \bigg).$$
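The shift of the penalty in log-odds coordinates can be sketched in a few lines. The function names below, and the representation of the penalty as a Python callable of the log-odds, are assumptions made for illustration only.

```python
import math


def log_odds_shift(ys, a, b):
    """Total Bayesian update of log(p^1 / p^2) after observing ys (0/1 values)."""
    successes = sum(ys)
    failures = len(ys) - successes
    return (successes * math.log(a / b)
            + failures * math.log((1 - a) / (1 - b)))


def static_up_penalty(kappa0, ys, a, b):
    """Return kappa_t as a function of the time-t log-odds.

    kappa0 is the initial penalty, expressed as a function of the
    initial log-odds log(p_0^1 / p_0^2).
    """
    shift = log_odds_shift(ys, a, b)
    return lambda log_odds: kappa0(log_odds - shift)
```

For instance, with a hypothetical quadratic initial penalty `kappa0 = lambda l: l ** 2`, the returned function is the same parabola translated by the accumulated log-likelihood ratio, and a zero initial penalty remains zero for every sequence of observations.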

Remark 12

This example demonstrates the following behaviour:

  • If the initial penalty is zero, then the penalty at time t is also zero—there is no learning of which state we are in.

  • When parameterized by the log-odds ratio, there is no variation in the curvature of the penalty (and so no change in our “uncertainty”); we simply shift the penalty around, corresponding to our changing posterior probabilities.

  • The update of κ is done purely using the tools of Bayesian statistics, rather than having any direct incorporation of our uncertainty.

Remark 13

We point out that this is, effectively, the model of uncertainty proposed by Walley (1991) (see, in particular, Walley (1991) section 5.3, although there he takes a model where the unknown parameter is Beta distributed). See also Fagin and Halpern (1990).

Dynamic generators, uncertain prior (DynamicUP)

If we model the generator (A,C) as fixed and unknown (i.e., it depends only on \(\mathfrak {S}\)), calculation of \(K_{t}(p, \mathfrak {S})\) suffers from a curse of dimensionality—the dimension of \(\mathfrak {S}\) determines the size of the domain of Kt. On the other hand, if we suppose the generator at time t depends only on the dynamic parameters \(\mathfrak {D}_{t}\), we can use dynamic programming to obtain a lower-dimensional problem.

Framework 2

(DynamicUP) In the DynamicUP setting, for an initial penalty on the initial hidden state, \(\kappa_{\text{prior}}(p_{0})\), and a penalty on the time-t generator, \(\gamma _{t}(\mathfrak {D}_{t})\), our total penalty is given by \(\mathcal {R}(\mathbb {Q}) = \left (\frac {1}{k} \alpha (\mathbb {Q})\right)^{k'}\), where we now have

$$\alpha(\mathbb{Q}) = \kappa_{\text{prior}}(p_{0}) + \sum_{s=1}^{\infty} \gamma_{s}(\mathfrak{D}_{s}).$$

In this case, as we ignore the static parameter \(\mathfrak {S}\), we simplify Theorem 5 through the identity \(\kappa _{t}(p) = K_{t}(p, \mathfrak {S})\). This yields the recursion

$$\kappa_{t}(p) = \inf_{\mathfrak{D}_{t}} \left\{\inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{D}_{t}, Y_{t})} \left\{\kappa_{t-1}(p') + \gamma_{t}(\mathfrak{D}_{t})\right\}\right\}-m'_{t} $$

and again, if we assume \(\inf _{\mathbb {Q}}\alpha (\mathbb {Q}) \equiv 0\), we then conclude \(m^{\prime }_{t} \equiv 0\).

This formulation of the uncertain filter allows us to use dynamic programming to solve our problem forward in time. In the setting of Example 1, as the generator is perfectly known, there is no distinction between the dynamic and static cases.
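A crude numerical version of the DynamicUP recursion, for N=2, can be obtained by discretizing the simplex and pushing each grid point forward under each candidate generator, rather than inverting the filter map. The sketch below is an approximation under several assumptions that are not part of the paper: a uniform grid for \(p^{1}\), a finite list of candidate generators, rounding of the updated filter state to the nearest grid point, and hypothetical function names.

```python
import math


def bernoulli_density(a, b):
    """Observation densities [c_1(y), c_2(y)] for Bernoulli(a) / Bernoulli(b)."""
    return lambda y: [a if y == 1 else 1 - a, b if y == 1 else 1 - b]


def dynamic_up_step(kappa_prev, grid_size, generators, y):
    """One step of the DynamicUP recursion on the grid p^1 = i / grid_size.

    kappa_prev: penalties kappa_{t-1} on the grid (list of length grid_size + 1).
    generators: list of (A, c, gamma), with A a 2x2 column-stochastic matrix,
                c(y) the observation densities and gamma the penalty gamma_t(D).
    """
    new = [math.inf] * (grid_size + 1)  # unreachable cells keep penalty +inf
    for i, k_prev in enumerate(kappa_prev):
        p = [i / grid_size, 1 - i / grid_size]
        for A, c, gamma in generators:
            cy = c(y)
            u1 = cy[0] * (A[0][0] * p[0] + A[0][1] * p[1])
            u2 = cy[1] * (A[1][0] * p[0] + A[1][1] * p[1])
            if u1 + u2 <= 0:
                continue  # observation impossible under this generator
            j = round(grid_size * u1 / (u1 + u2))  # nearest grid cell
            new[j] = min(new[j], k_prev + gamma)
    m = min(new)  # renormalize so that the minimal penalty is zero
    return [v - m for v in new]
```

Pushing forward and taking the minimum in each cell plays the role of the infimum over the inverse image in the recursion; the rounding step introduces a discretization error that this sketch does not attempt to control.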

A continuous-time version of this setting (for a Kalman–Bucy filter) is considered in detail in Allan and Cohen (2019a).

Static generators, DR-expectation (StaticDR)

In the above examples, we have regarded the prior as uncertain and used this to penalize over models. We did not use the data to modify our penalty function \(\mathcal {R}\). The DR-expectation gives us an alternative approach in which the data guides our model choice more directly. In what follows, we apply the DR-expectation in our filtering context and observe that it gives a slightly different recursion for the penalty function. Again, we can consider models where our generator is constant (i.e., depends only on \(\mathfrak {S}\)) or changes dynamically (i.e., depends only on \(\mathfrak {D}_{t}\)).

Framework 3

(StaticDR) As in the StaticUP framework, we assume that the generator (A,C) is determined by the static parameter \(\mathfrak {S}\). For each \(\mathfrak {S}\), with \(\mathbb {Q}= \mathbb {Q}(A,C, p_{0})\) and \((A,C)=\Phi (\mathfrak {S})\), we have a penalty \(\mathcal {R}(\mathbb {Q}) = \left (\frac {1}{k} \alpha (\mathbb {Q})\right)^{k'}\) given by the log-posterior density

$$\alpha(\mathbb{Q}\|_{\mathcal{Y}_{t}}) = \kappa_{\text{prior}}(p_{0}, \mathfrak{S}) - \log L^{\text{obs}}\left(\mathbb{Q}(A,C, p_{0})\big|\mathbf{y}_{t}\right)+m_{t}$$

which is additive, as shown in Theorem 3. Applying Theorem 5, we see that the penalty can be written \(\kappa _{t}(p) = \inf _{\mathfrak {S}} K_{t}(p, \mathfrak {S})\), where \(K_{0}(p, \mathfrak {S}) = \kappa _{\text {prior}}(p, \mathfrak {S})\) and K satisfies the recursion

$$K_{t}(p,\mathfrak{S}) = \inf_{p'\in \mathbb{G}^{-1}(p,\mathfrak{S}, Y_{t})} \left\{K_{t-1}(p', \mathfrak{S}) - \log c^{\mathfrak{S}}\left(Y_{t};A^{\mathfrak{S}} p'\right)\right\}-m'_{t}. $$

Unlike in the uncertain prior cases, we cannot typically claim that \(m^{\prime }_{t} \equiv 0\); instead, it is a random process dependent on our observations.

Remark 14

Comparing Framework 1 (StaticUP) with Framework 3 (StaticDR), we see that the key distinction is the presence of the log-likelihood term. This term implies that observations of Y will affect our quantification of uncertainty, rather than purely updating each model.

Example 2

In the setting of Example 1, recall that X is constant, so we know (A,C(·)). One can calculate the StaticDR penalty either directly, or through solving the stated recursion using the dynamics of p. As in the StaticUP case, the result is most simply expressed by first calculating p0 from pt through

$$\log\left(\frac{p_{0}^{1}}{p_{0}^{2}}\right) = \log\left(\frac{p_{t}^{1}}{p_{t}^{2}}\right)-{t\bar Y_{t}}\log\left(\frac{a}{b}\right)-t(1-\bar Y_{t})\log\left(\frac{1-a}{1-b}\right)$$

and then

$$\kappa_{t}(p_{t}) = \kappa_{0}(p_{0})- \log\left(p_{0}^{1} a^{t\bar Y_{t}}(1-a)^{t\left(1-\bar Y_{t}\right)} + p_{0}^{2} b^{t\bar Y_{t}}(1-b)^{t\left(1-\bar Y_{t}\right)}\right)-m_{t}, $$

where \(m_{t}\) is chosen to ensure \(\inf_{p}\kappa_{t}(p)=0\). From this, we see that the likelihood modifies our uncertainty directly, rather than us simply propagating each model via Bayes’ rule. A consequence of this is that if we start with extreme uncertainty (\(\kappa_{0}\equiv 0\)), then our observations teach us what models are reasonable, thereby reducing our uncertainty (i.e., we will find \(\kappa_{t}(p)>0\) for \(p\in(0,1)\) when t>0).
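The StaticDR penalty of this example can be tabulated on a grid of initial probabilities. The sketch below follows the two displayed formulas, parameterizing by \(p_{0}^{1}\) and reporting the corresponding \(p_{t}^{1}\); the grid, the function name and the representation of \(\kappa_{0}\) as a callable of \(p_{0}^{1}\) are assumptions made for illustration.

```python
import math


def static_dr_penalty(kappa0, ys, a, b, grid):
    """Tabulate kappa_t over a grid of initial probabilities p_0^1.

    Returns a list of (p_t^1, kappa_t) pairs; kappa0 takes p_0^1 and the
    normalization m_t is computed over the grid only.
    """
    successes = sum(ys)
    failures = len(ys) - successes
    like_a = a ** successes * (1 - a) ** failures  # likelihood if X = e_1
    like_b = b ** successes * (1 - b) ** failures  # likelihood if X = e_2
    raw = []
    for p0 in grid:
        mix = p0 * like_a + (1 - p0) * like_b      # observed-data likelihood
        p_t = p0 * like_a / mix                    # posterior of state e_1
        raw.append((p_t, kappa0(p0) - math.log(mix)))
    m_t = min(value for _, value in raw)
    return [(p_t, value - m_t) for p_t, value in raw]
```

Starting from \(\kappa_{0}\equiv 0\) with observations favouring the parameter a, the tabulated penalty vanishes only at the vertex associated with a and is strictly positive elsewhere on the grid, which illustrates the learning effect described above.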

Remark 15

It is interesting to ask what the long-term behaviour of these uncertain filters will be. In Cohen (2017), the long run behaviour of the DR-expectation based on i.i.d. observations is derived and, in principle, a similar analysis is possible here. Using the asymptotic analysis of maximum likelihood estimation for hidden Markov models in Leroux (1992) or Douc et al. (2011), we know that the MLE will converge with probability one to the true parameter, under appropriate regularity conditions. Here, the presence of the prior influences this slightly; however, this impact vanishes as t→∞. With further regularity assumptions, one can also show that the log-likelihood function, divided by the number of observations, almost surely converges to the relative entropy between a proposed model and the true model (see, for example, Leroux (1992) section 5). If one also knew that the relative entropy is smooth and convex, the analysis of Cohen (2017) Theorems 4 and 5 is possible, showing that the DR-expectation corresponds to adding a term related to the sampling variance of the hidden state. In particular, as the number of observations increases, the DR-expectation will converge to the expected value under the filter with the true parameters.

Dynamic generators, DR-expectation (DynamicDR)

As in the uncertain prior case, it is often impractical to calculate a recursion for \(K(p, \mathfrak {S})\) given the high dimension of \(\mathfrak {S}\). We therefore consider the case when (A,C) depends only on the dynamic parameters \(\mathfrak {D}_{t}\).

Framework 4

(DynamicDR) As before, for each \(\{\mathfrak {D}_{s}\}_{s\ge 0}\), with \(\mathbb {Q}= \mathbb {Q}(A,C, p_{0})\) and \((A_{t},C_{t}) = \Phi (\mathfrak {D}_{t})\), we have a penalty \(\mathcal {R}(\mathbb {Q}) = \left (\frac {1}{k} \alpha (\mathbb {Q})\right)^{k'}\) given by the log-posterior density

$$\alpha(\mathbb{Q}\|_{\mathcal{Y}_{t}}) = \kappa_{\text{prior}}(p_{0}) - \log L^{\text{obs}}\left(\mathbb{Q}(A,C, p_{0})\big|\mathbf{y}_{t}\right) +m_{t}.$$

From Theorem 3, we know that the log-posterior density is additive. Applying Theorem 5, and the identity \(\kappa _{t}(p) = K_{t}(p, \mathfrak {S})\), we conclude that the penalty κt(p) in (6) can be computed from the recursion

$$\begin{aligned} \kappa_{t}(p) &= \inf_{\mathfrak{D}_{t}}\left\{\inf_{p^{\prime}\in \mathbb{G}^{-1}(p,\mathfrak{D}_{t}, Y_{t})}\left\{\kappa_{t-1}(p^{\prime}) + \gamma_{\text{prior}}(t,\mathfrak{D}_{t}; \{Y_{s}\}_{s< t})\right.\right. \\ &\quad\qquad-\left.\left.\log\left(c^{\mathfrak{D}_{t}} (Y_{t}; A^{\mathfrak{D}_{t}} p^{\prime})\right)\right\}\right\}-m_{t}', \end{aligned} $$

with initial value \(\kappa_{0}(p)=\pi(p)\), where \(m'_{t}\) is chosen to ensure \(\inf _{p\in S_{N}^{+}}\kappa _{t}(p)=0\) for all t.

Remark 16

We expect that there will be less difference between the dynamic uncertain prior and dynamic DR-expectation settings than between the static uncertain prior and static DR-expectation settings. This is because there is only limited learning possible in the dynamic DR-expectation, as \(\mathfrak {D}_{t}\) may vary independently at every time, so the DR-expectation has only one value with which to infer the value of each \(\mathfrak {D}_{t}\). This increases the relative importance of the prior term \(\gamma_{\text{prior}}\), which describes our understanding of typical values of the generator. In practice, the key distinction between the dynamic DR-expectation and uncertain prior models appears to be when the initial penalty is near zero: in this case, the DR-expectation regularizes the initial state quickly, while the uncertain prior model may remain near zero indefinitely.

Example 3

In the setting of Example 2, as the dynamics are perfectly known, there is again no difference between the dynamic and static generator DR-expectation cases.

A continuous-time version of this setting (for a Kalman–Bucy filter) is considered in Allan and Cohen (2020).

Expectations of the future

The nonlinear expectations considered above do not consider how the future will evolve. In particular, we have focussed our attention on calculating \(\mathcal {E}_{\mathbf {y}_{t}}(\phi (X_{t}))\), that is, on the expectation of functions of the current hidden state. In other words, we can consider our nonlinear expectation as a mapping

$$\mathcal{E}_{\mathbf{y}_{t}}: L^{\infty}(\sigma(X_{t})\otimes \mathcal{Y}_{t}) \to L^{\infty}(\mathcal{Y}_{t}).$$

If we wish to calculate expectations of future states, then we may wish to consider doing so in a filtration-consistent manner. This is of particular importance when considering optimal control problems.

Definition 8

For a fixed horizon T>0, suppose that for each t<T we have a mapping \(\mathcal {E}(\cdot |\mathcal {Y}_{t}):L^{\infty }(\mathcal {Y}_{T}) \to L^{\infty }(\mathcal {Y}_{t})\). We say that \(\mathcal {E}\) is a \(\mathcal {Y}\)-consistent convex expectation if \(\mathcal {E}(\cdot |\mathcal {Y}_{t})\) satisfies the following assumptions, analogous to those above:

  • Strict Monotonicity: for any \(\xi _{1}, \xi _{2}\in L^{\infty }(\mathcal {Y}_{T})\), if \(\xi _{1}\geq \xi _{2}\) a.s., then \(\mathcal {E}(\xi _{1}|\mathcal {Y}_{t}) \geq \mathcal {E}(\xi _{2}|\mathcal {Y}_{t})\) a.s., and if, in addition, \(\mathcal {E}(\xi _{1}|\mathcal {Y}_{t})=\mathcal {E}(\xi _{2}|\mathcal {Y}_{t})\) then \(\xi _{1}=\xi _{2}\) a.s.;

  • Constant triviality: for \(b\in L^{\infty }(\mathcal {Y}_{t})\), \(\mathcal {E}(b|\mathcal {Y}_{t})=b\);

  • Translation equivariance: for any \(b\in L^{\infty }(\mathcal {Y}_{t})\), \(\xi \in L^{\infty }(\mathcal {Y}_{T})\), \(\mathcal {E}(\xi +b|\mathcal {Y}_{t})= \mathcal {E}(\xi |\mathcal {Y}_{t})+b\);

  • Convexity: for any \(\lambda \in [0,1]\), \(\xi _{1}, \xi _{2}\in L^{\infty }(\mathcal {Y}_{T})\),

    $$\mathcal{E}(\lambda \xi_{1}+ (1-\lambda) \xi_{2}|\mathcal{Y}_{t}) \leq \lambda \mathcal{E}(\xi_{1}|\mathcal{Y}_{t})+ (1-\lambda) \mathcal{E}(\xi_{2}|\mathcal{Y}_{t});$$
  • Lower semicontinuity: for a sequence \(\{\xi _{n} \}_{n\in \mathbb {N}}\subset L^{\infty }(\mathcal {Y}_{T})\) with \(\xi _{n} \uparrow \xi \in L^{\infty }(\mathcal {Y}_{T})\) pointwise, \(\mathcal {E}(\xi _{n}|\mathcal {Y}_{t}) \uparrow \mathcal {E}(\xi |\mathcal {Y}_{t})\) pointwise for every t<T;

and the additional assumptions

  • \(\{\mathcal {Y}_{t}\}_{t\ge 0}\)-consistency: for any s<t<T, any \(\xi \in L^{\infty }(\mathcal {Y}_{T})\),

    $$\mathcal{E}(\xi|\mathcal{Y}_{s}) = \mathcal{E}(\mathcal{E}(\xi|\mathcal{Y}_{t})|\mathcal{Y}_{s});$$
  • Relevance: for any t<T, any \(A\in \mathcal {Y}_{t}\), \(\mathcal {E}(I_{A}\xi |\mathcal {Y}_{t}) = I_{A} \mathcal {E}(\xi |\mathcal {Y}_{t})\).

The assumption of \(\mathcal {Y}\)-consistency is sometimes simply called recursivity, time consistency, or dynamic consistency (and is closely related to the validity of the dynamic programming principle). However, it is important to note that this property depends on the choice of filtration. In our context, consistency with the observation filtration \(\mathcal {Y}\) is natural, as this describes the information available for us to make decisions.
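To make the consistency requirement concrete, the following sketch checks it numerically for one standard example of a convex expectation, the entropic expectation \(\xi \mapsto \log \mathbb{E}[e^{\xi}]\), on a small tree of i.i.d. uniform observations. The choice of the entropic expectation, the binary observations, the horizon and the payoffs are all assumptions for illustration; this is not the DR-expectation of the text.

```python
import math
from itertools import product

D = 2   # number of observation values per step (assumption)
T = 3   # horizon (assumption)


def entropic(values):
    """One-step entropic expectation log E[exp(xi)] under a uniform law."""
    return math.log(sum(math.exp(v) for v in values) / len(values))


def static(xi):
    """Time-zero entropic expectation computed in a single step."""
    return entropic(list(xi.values()))


def stepwise(xi, horizon):
    """Time-zero value obtained by composing one-step expectations backwards.

    xi maps observation paths (tuples of length `horizon`) to real numbers.
    """
    current = dict(xi)
    for t in range(horizon, 0, -1):
        current = {
            path: entropic([current[path + (y,)] for y in range(D)])
            for path in product(range(D), repeat=t - 1)
        }
    return current[()]


# Two made-up terminal payoffs depending on the whole observation path.
paths = list(product(range(D), repeat=T))
xi_1 = {p: sum((i + 1) * y for i, y in enumerate(p)) - 1.5 for p in paths}
xi_2 = {p: 2.0 * p[0] - p[2] for p in paths}

lam = 0.3
mix = {p: lam * xi_1[p] + (1 - lam) * xi_2[p] for p in paths}
shifted = {p: xi_1[p] + 4.0 for p in paths}
```

Here `static(xi_1)` and `stepwise(xi_1, T)` agree up to rounding, which is the tower property in Definition 8 at s=0, while `mix` and `shifted` can be used to check convexity and translation equivariance.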

Remark 17

Definition 8 is equivalent to considering a lower semicontinuous convex expectation, as in Definition 3, and assuming that for any \(\xi \in L^{\infty }(\mathcal {Y}_{T})\) and any t<T, there exists a random variable \(\xi _{t}\in L^{\infty }(\mathcal {Y}_{t})\) such that \(\mathcal {E}(I_{A} \xi) = \mathcal {E}(I_{A} \xi _{t})\) for all \(A\in \mathcal {Y}_{t}\). In this case, one can define \(\mathcal {E}(\xi |\mathcal {Y}_{t}) = \xi _{t}\) and verify that the definition given is satisfied (see Föllmer and Schied (2002b); Cohen and Elliott (2010)).

Much work has been done on the construction of dynamic nonlinear expectations (see, for example, Epstein and Schneider (2003); Duffie and Epstein (1992); El Karoui et al. (1997); Cohen and Elliott (2010); and references therein). In particular, there have been close relations drawn between these operators and the theory of BSDEs (for a setting covering the discrete-time examples we consider here, see Cohen and Elliott (2010); Cohen and Elliott (2011)).

Remark 18

The importance of \(\mathcal {Y}\)-consistency is twofold: First, it guarantees that, when using a nonlinear expectation to construct the value function for a control problem, an optimal policy will be consistent in the sense that (assuming an optimal policy exists) a policy which is optimal at time zero will remain optimal in the future. Second, \(\{\mathcal {Y}_{t}\}_{t\ge 0}\)-consistency allows the nonlinear expectation to be calculated recursively, working backwards from a terminal time. This leads to a considerable simplification numerically, as it avoids a curse of dimensionality in intertemporal control problems.

Remark 19

One issue in our setting is that our lack of knowledge does not simply line up with the arrow of time—we are unaware of events which occurred in the past, as well as those which are in the future. This leads to delicacies in questions of dynamic consistency. Conventionally, this has often been considered in a setting of “partially observed control”, and these issues are resolved by taking the filter state pt to play the role of a state variable, and solving the corresponding “fully observed control problem” with pt as underlying. In our context, we do not know the value of pt, instead we have the (even higher dimensional) penalty function Kt as a state variable.

In the following sections, we will outline how our earlier approach can be extended to provide a dynamically consistent expectation, and how enforcing dynamic consistency will modify our perception of risk.

Asynchronous expectations

We will focus our attention on constructing a dynamically consistent nonlinear expectation for random variables in \(L^{\infty }(\sigma (X_{T})\otimes \mathcal {Y}_{T})\), given observations up to times t<T. Throughout this section, we will use the following construction:

Definition 9

Suppose we have a nonlinear expectation

$$\mathcal{E}_{\mathbf{y}_{T}}: L^{\infty}(\sigma(X_{T})\otimes \mathcal{Y}_{T}) \to L^{\infty}(\mathcal{Y}_{T})$$

constructed for our nonlinear filtering problem, as above, and are given a \(\mathcal {Y}\)-consistent family of maps

$$\overleftarrow{\mathcal{E}}(\cdot|\mathcal{Y}_{t}): L^{\infty}(\mathcal{Y}_{T}) \to L^{\infty}(\mathcal{Y}_{t}).$$

We then extend \(\overleftarrow{\mathcal {E}}\) to variables in \(L^{\infty }(\sigma (X_{T})\otimes \mathcal {Y}_{T})\) by the composition

$$\overleftarrow{\mathcal{E}}(\cdot|\mathcal{Y}_{t}) := \overleftarrow{\mathcal{E}}(\mathcal{E}_{\mathbf{y}_{T}}(\cdot)|\mathcal{Y}_{t}).$$

Given this definition, our key aim is to construct the \(\mathcal {Y}\)-consistent family \(\overleftarrow{\mathcal {E}}(\cdot |\mathcal {Y}_{t})\), in a way which “agrees” with our uncertainty in the underlying filter. As we are in discrete time, we can construct a \(\mathcal {Y}\)-consistent family through recursion, if we have its definition over each single step. The definition of the DR-expectation can be applied to generate these one-step expectations in a natural way.

Definition 10

For \(\mathcal {R}\) an additive penalty function (Definition 4), we define the one-step expectation, for \(\xi \in L^{\infty }(\mathcal {Y}_{t+1})\), by

$$\overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) = \mathrm{ess\,sup}_{\mathbb{Q}\in \mathcal{Q}} \left\{\mathbb{E}_{\mathbb{Q}}\left[\xi- \mathcal{R}_{t+1}(\mathbb{Q})|\mathcal{Y}_{t}\right] \right\}, $$

where the essential supremum is taken among the bounded \(\mathcal {Y}_{t}\)-measurable random variables. Using this, we define a \(\mathcal {Y}\)-consistent expectation \(L^{\infty }(\sigma (X_{T})\otimes \mathcal {Y}_{T}) \to L^{\infty }(\mathcal {Y}_{t})\) by recursion,

$$\overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) = \overleftarrow{\mathcal{E}}(\overleftarrow{\mathcal{E}}\left(\xi|\mathcal{Y}_{t+1}\right)|\mathcal{Y}_{t}) = \overleftarrow{\mathcal{E}}(\cdots \overleftarrow{\mathcal{E}}(\mathcal{E}_{\mathbf{y}_{T}}(\xi)|\mathcal{Y}_{T-1})\cdots |\mathcal{Y}_{t}).$$

Remark 20

It is necessary to use the penalty \(\mathcal {R}_{t+1}\) in this definition, as our penalty should include the behaviour of the generator Ct+1(·), which determines the distribution of Yt+1|Xt+1.

Recall that, as \(\mathcal {Y}\) is generated by Y, the Doob–Dynkin lemma states that any \(\mathcal {Y}_{t+1}\)-measurable function ξ is simply a function of \(\{Y_{s}\}_{s\le t+1}\), so we can write

$$ \xi(\omega) = \hat\xi(Y_{t+1}, \{Y_{s}\}_{s\leq t}). $$

For any conditionally Markov measure \(\mathbb {Q}\), if \(\mathbb {Q}\) has generator (At,Ct(·))t≥0, it follows that

$$\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]= \int_{\mathbb{R}^{d}} \hat{\xi}(y, \{Y_{s}\}_{s\leq t}) \left(\mathbf{1}^{\top} C_{{t+1}}(y)A_{t+1} p_{t}\right) d\mu(y).$$
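The weight \(\mathbf{1}^{\top} C(y) A p\) in this integral is the predictive density of the next observation. A minimal sketch for finitely many observation values follows, assuming that \(C(y)\) is the diagonal matrix of observation likelihoods and that \(\mu\) is counting measure; the function name `predictive_probs` and all numbers are hypothetical.

```python
def predictive_probs(C, A, p):
    """Return [1^T C(y) A p for each y], where C(y) = diag(C[y])."""
    n = len(p)
    Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
    return [sum(C_y[i] * Ap[i] for i in range(n)) for C_y in C]


A = [[0.9, 0.2], [0.1, 0.8]]   # column-stochastic transition matrix
C = [[0.8, 0.3], [0.2, 0.7]]   # C[y][i] = P(Y = y | X = e_i)
p = [0.5, 0.5]                 # current filter state

probs = predictive_probs(C, A, p)

# Conditional expectation of a function of the next observation only.
xi_hat = [1.0, -2.0]
expectation = sum(x * q for x, q in zip(xi_hat, probs))
```

With these numbers the predictive probabilities are 0.575 and 0.425, so the conditional expectation of `xi_hat` is -0.275.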

In particular, we apply this to our penalty function to define the function \(\hat {\mathcal {R}}\) such that

$$ \hat{\mathcal{R}}_{t+1}\!(Y_{t+1}, p, \mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\le t}) \,=\, \left(\!\frac{1}{k}\!\left(K_{t}(p_{t}, \mathfrak{S}) \,+\, \gamma_{t+1}\left(\mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\le t+1}, p_{t}\right)\!\right)\!\right)^{k'}. $$

Applying this to our definition of \(\overleftarrow{\mathcal {E}}\), we obtain the following representation.

Lemma 2

The one-step expectation \(\overleftarrow{\mathcal {E}}\) can be written

$$\begin{aligned} \overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) &= \underset{\mathbb{Q}\in \mathcal{Q}}{\mathrm{ess\,sup}} \left\{\mathbb{E}_{\mathbb{Q}}[\xi- \mathcal{R}_{t+1}(\mathbb{Q})|\mathcal{Y}_{t}] \right\}\\ &= \underset{\mathfrak{S}, \mathfrak{D}_{t+1}, p}{\mathrm{ess\,sup}} \left\{\int_{\mathbb{R}^{d}}\left(\hat{\xi}\left(y, \{Y_{s}\}_{s\leq t}\right) - \hat{\mathcal{R}}_{t+1}\left(y,p, \mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\leq t}\right)\right)\right. \\ &\qquad\qquad\qquad \left.\left(\mathbf{1}^{\top} C_{{t+1}}(y)A_{t+1} p\right) d\mu(y) \right\}, \end{aligned} $$

where K is the dynamic penalty constructed in Theorem 5 and \((A_{t+1}, C_{t+1}(\cdot))\equiv \Phi (\mathfrak {S}, \mathfrak {D}_{t+1})\).
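The representation in Lemma 2 can be sketched for a finite family of candidate models and finitely many observation values. In the code below each model is a tuple `(p, A, C, K, gamma)`, where `K` stands in for \(K_{t}(p, \mathfrak{S})\) and `gamma[y]` for the generator penalty evaluated at the next observation; the finite family, the tuple layout and all numbers are assumptions for illustration only.

```python
import math


def predictive_probs(A, C, p):
    """Return [1^T C(y) A p for each y], where C(y) = diag(C[y])."""
    n = len(p)
    Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
    return [sum(C_y[i] * Ap[i] for i in range(n)) for C_y in C]


def one_step_expectation(xi_hat, models, k=1.0, k_prime=1.0):
    """Supremum over models of sum_y (xi_hat[y] - R_hat(y)) * q(y | model)."""
    best = -math.inf
    for p, A, C, K, gamma in models:
        q = predictive_probs(A, C, p)
        value = sum(
            (xi_hat[y] - ((K + gamma[y]) / k) ** k_prime) * q[y]
            for y in range(len(q))
        )
        best = max(best, value)
    return best


A = [[0.9, 0.2], [0.1, 0.8]]
C = [[0.8, 0.3], [0.2, 0.7]]
models = [
    ([0.5, 0.5], A, C, 0.0, [0.0, 0.0]),     # unpenalised reference model
    ([0.9, 0.1], A, C, 0.1, [0.05, 0.05]),   # penalised alternative
]

xi_hat = [1.0, -2.0]
result = one_step_expectation(xi_hat, models)
```

In this toy case the penalised alternative attains the supremum: it yields 0.145 - 0.15 = -0.005, compared with -0.275 under the reference model.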



Proof

By the definition of the one-step expectation and the linearity of \(\mathbb{E}_{\mathbb{Q}}\), we have

$$\overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) = \underset{\mathbb{Q}\in \mathcal{Q}}{\mathrm{ess\,sup}} \left\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]- \mathbb{E}_{\mathbb{Q}}[\mathcal{R}_{t+1}(\mathbb{Q})|\mathcal{Y}_{t}] \right\}.$$

We know that

$$\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]= \int_{\mathbb{R}^{d}} \hat{\xi}(y, \{Y_{s}\}_{s\leq t}) \left(\mathbf{1}^{\top} C_{{t+1}}(y)A_{t+1} p_{t}\right) d\mu(y),$$

which depends on \(\mathbb {Q}\) only through At+1, Ct+1 and pt, or equivalently, through the parameters \(\mathfrak {S}\), \(\mathfrak {D}_{t+1}\) and pt. In particular, as \(\mathcal {R}\) is additive, we can substitute in its structure and simplify using the definition of K in Theorem 5 to obtain

$$\begin{aligned} \overleftarrow{\mathcal{E}}(\xi|\mathcal{Y}_{t}) &=\underset{\mathbb{Q}\in \mathcal{Q}}{\mathrm{ess\,sup}}\left\{{\vphantom{\frac{1}{2_{\frac{1}{2}}}}}\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]\right.\\ &\qquad\quad - \left.\mathbb{E}_{\mathbb{Q}}\left[\left(\frac{1}{k}\left(K(p_{t}, \mathfrak{S}) \,+\, \gamma_{t+1}(\mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\le t+1}, p_{t})\right)\right)^{k'}\Big|\mathcal{Y}_{t}\right] \right\}. \end{aligned} $$

Using the definition of \(\hat {\mathcal {R}}\), we change these conditional expectations to integrals, and obtain the desired representation. □

Remark 21

There is a surprising form of double-counting of the penalty here. To see this, let’s assume ϕ does not depend on Y. If we consider \(\xi _{t+1} = \mathcal {E}_{\mathbf {y}_{t+1}}(\phi (X_{t+1}))\), then we have included a penalty for the proposed model at t+1, that is,

$$\xi_{t+1} = \mathcal{E}_{\mathbf{y}_{t+1}}(\phi(X_{t+1})) = \sup_{p\in S_{N}^{+}}\left\{\sum_{i} p^{i}\phi(e_{i})-\left(\frac{1}{k}K_{t+1}(p)\right)^{k'}\right\},$$

where Kt+1(p) is the penalty associated with the filter state at time t+1, which includes the penalty γt+1 on the parameters \(\mathfrak {S}\) and \(\mathfrak {D}_{t+1}\).
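For two hidden states, a supremum of this form can be approximated on a grid. The sketch below does so for a hypothetical quadratic penalty centred at a filter estimate of 0.7; the penalty `K`, the grid size and the values of `k` and `k_prime` are assumptions chosen only to show the shape of the calculation.

```python
import math


def dr_expectation(phi, K, k=1.0, k_prime=1.0, n_grid=1001):
    """Grid approximation of sup_p { sum_i p^i phi(e_i) - (K(p)/k)^k' }.

    Two hidden states are assumed, so p = (q, 1 - q) with q in [0, 1].
    """
    best = -math.inf
    for i in range(n_grid):
        q = i / (n_grid - 1)
        value = q * phi[0] + (1 - q) * phi[1] - (K(q) / k) ** k_prime
        best = max(best, value)
    return best


def K(q):
    """Hypothetical penalty: zero at q = 0.7, growing quadratically."""
    return 8.0 * (q - 0.7) ** 2


phi = [1.0, 0.0]                                  # indicator of state e_1
upper = dr_expectation(phi, K)                    # upper expectation of phi
lower = -dr_expectation([-v for v in phi], K)     # lower expectation of phi
```

The pair `(lower, upper)`, roughly (0.669, 0.731) here, brackets the point estimate 0.7 and gives one way to read the penalty as an interval of plausible probabilities.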

When we calculate \(\overleftarrow{\mathcal {E}}(\xi _{t+1}|\mathcal {Y}_{t})\), we do so by using the penalty \(K(p_{t}, \mathfrak {S}) + \gamma _{t+1}(\mathfrak {S}, \mathfrak {D}_{t+1}, \{Y_{s}\}_{s\le t+1}, p_{t})\), which again includes the term γt+1 which penalizes unreasonable values of the parameters \(\mathfrak {S}\) and \(\mathfrak {D}_{t+1}\). This “double counting” of the penalty corresponds to us including both our “uncertainty at time t+1” (in \(\mathcal {E}_{\mathbf {y}_{t+1}}\)), and also our “uncertainty at t about our uncertainty at t+1” (in \(\overleftarrow{\mathcal {E}}(\cdot |\mathcal {Y}_{t})\)).

Remark 22

One should be careful in this setting, as the recursively-defined nonlinear expectation will be optimized for a different value of \(\mathfrak {S}\) at every time. As \(\mathfrak {S}\) is considered to be a static penalty, this is an internal inconsistency in the modelling of our uncertainty—we always estimate assuming that \(\mathfrak {S}\) has never changed, but evaluate the future by considering our possible future opinions of the value of \(\mathfrak {S}\).

Review of BSDE theory

While it is useful to give a recursive definition of our nonlinear expectation, a better understanding of its dynamics is of practical importance. In what follows, for the dynamic generator case, we consider the corresponding BSDE theory, assuming that Yt can take only finitely many values, as in Cohen and Elliott (2010). We now present the key results of Cohen and Elliott (2010), in a simplified setting.

In what follows, we suppose that Y takes d values, which we associate with the standard basis vectors in \(\mathbb {R}^{d}\). For simplicity, we write 1 for the vector in \(\mathbb {R}^{d}\) with all components 1.

Definition 11

Write \(\bar {\mathbb {P}}\) for a probability measure such that {Yt}t≥0 is an i.i.d. sequence, uniformly distributed over the d states, and M for the \(\bar {\mathbb {P}}\)-martingale difference process \(M_{t} = Y_{t} - d^{-1}\mathbf{1}\). As in Cohen and Elliott (2010), M has the property that any \(\mathcal {Y}\)-adapted \(\bar {\mathbb {P}}\)-martingale L can be represented by \(L_{t}= L_{0} + \sum _{0\le s< t}Z_{s} M_{s+1}\) for some Z (and Z is unique up to addition of a multiple of 1).

Remark 23

The construction of Z in fact also shows that, if L is written \(L_{t}= \tilde L(Y_{1},...,Y_{t-1}, Y_{t})\), then \(e_{i}^{\top } Z_{t} = \tilde L(Y_{1},..., Y_{t-1}, e_{i})\) for every i (up to addition of a multiple of 1).
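The one-step content of this representation can be checked directly. In the sketch below the next value of the martingale is listed by the value of the next observation, Z is taken to be that list, and the increment \(Z M\) recovers the difference between the next value and its conditional mean. The number of states and the values in `L_next` are made up for illustration.

```python
D = 3  # number of observation states (assumption)


def basis(i):
    """Standard basis vector e_i in R^D."""
    return [1.0 if j == i else 0.0 for j in range(D)]


def martingale_increment(y):
    """M = Y - d^{-1} 1 when Y equals the basis vector e_y."""
    return [e - 1.0 / D for e in basis(y)]


def dot(z, m):
    return sum(a * b for a, b in zip(z, m))


# Made-up next-step values of L, listed by the value of the next observation.
L_next = [2.0, -1.0, 0.5]
L_now = sum(L_next) / D          # conditional mean under the uniform measure
Z = L_next                       # i-th component of Z is the value at e_i

increments = [dot(Z, martingale_increment(y)) for y in range(D)]

# Adding a multiple of the vector 1 to Z leaves every increment unchanged,
# because 1 . M = 1 - d * (1/d) = 0.
Z_shifted = [z + 7.0 for z in Z]
```

Each `L_now + increments[y]` equals `L_next[y]`, and `Z_shifted` produces the same increments, which is the non-uniqueness up to multiples of 1 noted in Definition 11.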

We can then define a BSDE (Backward Stochastic Difference Equation) with solution (ξ,Z):

$$ \xi_{t}(\omega) - \sum_{t\leq u< T} f(\omega, u, \xi_{u}(\omega), Z_{u}(\omega)) + \sum_{t\leq u< T} Z_{u}(\omega) M_{u+1}(\omega) = \xi_{T}(\omega), \tag{10} $$

where T is a finite deterministic terminal time, f is a \(\mathcal {Y}\)-adapted map \(f:\Omega \times \{0,...,T\} \times \mathbb {R} \times \mathbb {R}^{d}\rightarrow \mathbb {R}\), and ξT is a given \(\mathbb {R}\)-valued \(\mathcal {Y}_{T}\)-measurable terminal condition. For simplicity, we henceforth omit the ω argument of ξ, Z, and M.
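When the driver does not depend on ξ, an equation of this type can be solved by backward induction over the observation tree: at each node, Z collects the next-step values and the current value is their mean plus the driver. The sketch below assumes binary observations, a short horizon and an entropic driver, all chosen purely for illustration.

```python
import math
from itertools import product

D, T = 2, 3   # observation states and horizon (assumptions)


def driver(z):
    """Illustrative entropic driver: log-mean-exp of z minus the mean of z."""
    mean = sum(z) / len(z)
    return math.log(sum(math.exp(v) for v in z) / len(z)) - mean


def solve_bsde(xi_T):
    """Backward induction; returns ({t: {path: xi_t}}, {t: {path: Z_t}})."""
    xi, Z = {T: dict(xi_T)}, {}
    for t in range(T - 1, -1, -1):
        xi[t], Z[t] = {}, {}
        for path in product(range(D), repeat=t):
            z = [xi[t + 1][path + (y,)] for y in range(D)]
            Z[t][path] = z
            # One-step equation: xi_t - f(Z_t) equals the mean of xi_{t+1}.
            xi[t][path] = sum(z) / D + driver(z)
    return xi, Z


# Made-up terminal condition: the number of ones along the observation path.
xi_T = {path: float(sum(path)) for path in product(range(D), repeat=T)}
xi, Z = solve_bsde(xi_T)
```

With this driver, `xi[0][()]` coincides with the entropic expectation of the terminal value, \(3\log ((1+e)/2)\), and summing the one-step equations along any path recovers the terminal condition.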

The general existence and uniqueness result for BSDEs in this context is as follows:

Theorem 6

Suppose f is such that the following two assumptions hold:

  1. (i)

    For any ξ, if Z1=Z2+k1 for some k, then \(f\left (\omega,t, \xi _{t}, Z^{1}_{t}\right) = f\left (\omega, t, \xi _{t}, Z^{2}_{t}\right)\), \(\bar {\mathbb {P}}\)-a.s. for all t.

  2. (ii)

    For any \(z\in \mathbb {R}^{d}\), for all t, for \(\bar {\mathbb {P}}\)-almost all ω, the map

    $$\xi\mapsto \xi-f(\omega, t, \xi, z)$$

    is a bijection \(\mathbb {R}\rightarrow \mathbb {R}\).

Then, for any terminal condition ξT essentially bounded, \(\mathcal {Y}_{T}\)-measurable, and with values in \(\mathbb {R}\), the BSDE (10) has a \(\mathcal {Y}\)-adapted solution (ξ,Z). Moreover, this solution is unique up to indistinguishability for ξ and indistinguishability up to addition of multiples of 1 for Z.

In this setting, we also have a comparison theorem:

Theorem 7

Consider two discrete-time BSDEs as in (10), corresponding to coefficients f1,f2, and terminal values \(\xi ^{1}_{T}, \xi ^{2}_{T}\). Suppose the conditions of Theorem 6 are satisfied for both equations, let (ξ1,Z1) and (ξ2,Z2) be the associated solutions. Suppose the following conditions hold:

  1. (i)

    \(\xi ^{1}_{T}\geq \xi ^{2}_{T}\), \(\bar {\mathbb {P}}\)-a.s.

  2. (ii)

    \(\bar {\mathbb {P}}\)-a.s., for all times t and every \(\xi \in \mathbb {R}\) and \(z\in \mathbb {R}^{d}\),

    $$f^{1}(\omega, t, \xi, z) \geq f^{2}(\omega, t, \xi, z).$$
  3. (iii)

    \(\bar {\mathbb {P}}\)-a.s., for all t, f1 satisfies

    $$f^{1}\left(\omega, t, \xi_{t}^{2}, Z_{t}^{1}\right) - f^{1}\left(\omega, t, \xi_{t}^{2}, Z_{t}^{2}\right)\geq\min_{j\in \mathbb{J}_{t}}\left\{\left(Z^{1}_{t}-Z^{2}_{t}\right)\left(e_{j}-d^{-1}\boldsymbol{1}\right)\right\},$$

    where \(\mathbb {J}_{t} :=\{i:\bar {\mathbb {P}}(Y_{t+1}=e_{i} | \mathcal {Y}_{t})>0\}\).

  4. (iv)

    \(\bar {\mathbb {P}}\)-a.s., for all t and all \(z\in \mathbb {R}^{d}\), the map \(\xi \mapsto \xi - f^{1}(\omega ,t,\xi ,z)\) is an increasing function.

It is then true that \(\xi ^{1}\geq \xi ^{2}\), \(\bar {\mathbb {P}}\)-a.s. A driver \(f^{1}\) satisfying (iii) and (iv) will be called “balanced”.

Finally, we also know that all dynamically consistent nonlinear expectations can be represented through BSDEs:

Theorem 8

The following two statements are equivalent.

  1. (i)

    \(\overleftarrow{\mathcal {E}}(\cdot |\mathcal {Y}_{t})\) is a \(\mathcal {Y}_{t}\)-consistent, dynamically translation invariant, nonlinear expectation.

  2. (ii)

    There exists a driver f which is balanced, independent of ξ, and satisfies the normalisation condition f(ω,t,ξt,0)=0, such that, for all ξT, the value of \(\xi _{t} = \overleftarrow{\mathcal {E}}(\xi _{T}|\mathcal {Y}_{t})\) is the solution to a BSDE with terminal condition ξT and driver f.

Furthermore, these two statements are related by the equation

$$f(\omega, t, \xi, z) =\overleftarrow{\mathcal{E}}(z M_{t+1}|\mathcal{Y}_{t}).$$
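The relation between a one-step expectation and its driver can be illustrated with a penalised supremum over a small family of laws for the next observation. The family `MODELS`, its penalties and the test vector are hypothetical, and the sketch makes no attempt to verify the balanced condition of Theorem 7.

```python
D = 3  # number of observation states (assumption)

# Hypothetical laws for the next observation, each with a penalty.
# The smallest penalty is zero, so constants are mapped to themselves.
MODELS = [
    ([1 / 3, 1 / 3, 1 / 3], 0.0),
    ([0.5, 0.3, 0.2], 0.1),
    ([0.2, 0.2, 0.6], 0.25),
]


def one_step(eta):
    """Penalised one-step expectation of a function eta of the next state."""
    return max(
        sum(q_i * e for q_i, e in zip(q, eta)) - pen for q, pen in MODELS
    )


def driver(z):
    """f(z) = E(z M | Y_t), where z M takes the value z_i - mean(z) at e_i."""
    mean = sum(z) / D
    return one_step([v - mean for v in z])


z = [1.0, -2.0, 0.5]
value_direct = one_step(z)                   # one-step expectation of xi_{t+1}
value_bsde = sum(z) / D + driver(z)          # mean under P-bar plus driver
```

The two values agree by translation equivariance, `driver([0.0] * D)` is zero, and adding a constant to every component of `z` leaves `driver(z)` unchanged, in line with condition (i) of Theorem 6.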

BSDEs for future expectations

By applying the above general theory, we can easily see that our nonlinear expectation has a representation as the solution to a particular BSDE.

Theorem 9


Let

$$\xi_{t}:=\overleftarrow{\mathcal{E}}\left(\phi\left(X_{T}, \{Y_{s}\}_{s\le T}\right)\big|\mathcal{Y}_{t}\right).$$

The dynamically consistent expectation satisfies the BSDE

$$\xi_{t+1} = \xi_{t}-f(Z_{t}; \kappa_{t}) + Z_{t} M_{t+1}$$

with driver

$$\begin{aligned} & f\left(Z_{t}; \hat{\mathcal{R}}_{t+1}\right)\\ &:= \underset{p, \mathfrak{S}, \mathfrak{D}_{t+1}}{\mathrm{ess\,sup}}\sum_{i}\left\{ \left(Z^{i}-\hat{\mathcal{R}}_{t+1}(e_{i}, p, \mathfrak{S}, \mathfrak{D}_{t+1}, \{Y_{s}\}_{s\le t})\right) \left(\mathbf{1}^{\top} C_{t+1}(e_{i})A_{t+1} p\right)\right.\\ &\qquad\qquad\qquad - \left.d^{-1}Z^{i}\right\}, \end{aligned} $$

where \((A_{t+1}, C_{t+1}(\cdot)) \equiv \Phi (\mathfrak {S}, \mathfrak {D}_{t+1})\).


Proof

As ξt+1 is \(\mathcal {Y}_{t+1}\)-measurable, by the Doob–Dynkin lemma there exists a Borel measurable function \(\hat {\xi }_{t+1}\) such that \(\xi _{t+1}=\hat {\xi }_{t+1}(Y_{t+1})\) (omitting to write \(\{Y_{s}\}_{s\le t}\) as an argument). We write Zt for the vector containing each of the values of this function. From the definition of M, as in the proof of the martingale representation theorem in Cohen and Elliott (2010), it follows that

$$\xi_{t+1} - \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}] = Z_{t} M_{t+1}\qquad \text{and}\qquad \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}] = \sum_{i} d^{-1} Z^{i}.$$

We then calculate, using Lemma 2 (simplified to our finite-state setting and omitting \(\{Y_{s}\}_{s\le t}\) as an argument),

$$\begin{aligned} & \xi_{t} - \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]\\ &= \overleftarrow{\mathcal{E}}(\xi_{t+1}|\mathcal{Y}_{t}) - \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]\\ &= \underset{p, \mathfrak{S}, \mathfrak{D}_{t+1}}{\mathrm{ess\,sup}}\left\{\!\sum_{i} \left(Z^{i}-\hat{\mathcal{R}}_{t+1}(e_{i}, p,\mathfrak{S}, \mathfrak{D}_{t+1})\right)\!\! \left(\mathbf{1}^{\top} C_{t+1}(e_{i})A_{t+1} p\right) \,-\, \mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]\right\}\\ &= f(Z_{t}; \kappa_{t}). \end{aligned} $$

The answer follows by rearrangement. □

A control problem with uncertain filtering

In this final section, we consider the solution of a simple control problem under uncertainty, using the formal structures previously developed. In some ways, this approach is similar to those considered by Bielecki et al. (2017), where the DR-expectation is replaced by an approximate confidence interval. (Taking k=∞ in a StaticDR model would give a very similar problem to the one they consider.) A key complexity in doing this is that our uncertainty does not agree with the arrow of time: at time t, we do not know the future values of \(\{Y_{s},X_{s}\}_{s>t}\) (as is typical for stochastic control), but we also do not know the values of \(\{X_{s}\}_{s\le t}\), even though these have an indirect impact on our costs.

Suppose a controller selects a control u from a set U, which we assume is a countable union of compact metrizable sets. Controls are required to be \(\mathcal {Y}\)-predictable (i.e., ut is \(\mathcal {Y}_{t-1}\)-measurable), and we write \(\mathcal {U}\) for the space of such controls, and u=(u1,...,uT) for the vector of controls at every time.

A control has an impact on the generator of X,Y, through modifying the penalty function γ, which describes the “reasonable” models for the transition matrix A and the distribution of observations C. In particular, for a given u the term γt in the additive structure of \(\mathcal {R}_{t}\) is permitted to depend on ut. We assume γt(·;ut) is continuous in ut for every value of its other arguments. This is a variant on a standard weak formulation of the control problem: our agent no longer selects the generators (A,C(·)) directly, but instead modifies the penalty determining which values of (A,C(·)) are “reasonable models”.

In order to separate the effects of past controls (which determine the agent’s understanding of the present hidden state), and future controls (which modify the future dynamics), we write \(\overleftarrow{\mathcal {E}}_{\mathbf {u},K}(\cdot |\mathcal {Y}_{t})\) for the \(\mathcal {Y}\)-consistent expectation generated by the single-step expectations (omitting \(\{Y_{s}\}_{s\le t}\) as an argument)

$$\begin{aligned} &\overleftarrow{\mathcal{E}}_{u_{t+1},K_{t}}(\xi|\mathcal{Y}_{t}) \\ &= \underset{p, \mathfrak{S}, \mathfrak{D}_{t+1}}{\mathrm{ess\,sup}} \int_{\mathbb{R}^{d}}\left(\hat{\xi}(y) - \hat{\mathcal{R}}_{t+1}(y,p, \mathfrak{S}, \mathfrak{D}_{t+1};u_{t+1}, K_{t})\right)\\ &\qquad\qquad\qquad\qquad \cdot \left(\mathbf{1}^{\top} C_{{t+1}}(y)A_{t+1} p\right) d\mu(y), \end{aligned} $$


where

$$ \hat{\mathcal{R}}_{t+1}\left(Y_{t+1},p, \mathfrak{S}, \mathfrak{D}_{t+1};u_{t+1}, K_{t}\right) = \left(\frac{1}{k}\left(K_{t}(p_{t}, \mathfrak{S}) + \gamma_{t+1}\left(Y_{t+1}, \mathfrak{S}, \mathfrak{D}_{t+1}, p_{t}; u_{t+1}\right)\right)\right)^{k'}. $$

We then observe that \(\overleftarrow{\mathcal {E}}_{\mathbf {u}, K}(\cdot |\mathcal {Y}_{t})\) is formally independent of (u1,...,ut−1). Nevertheless, the effective value of Kt depends on (u1,...,ut−1), as these now appear in the γ terms appearing in Theorem 5.

The controller wishes to minimize an expected cost

$$\overleftarrow{\mathcal{E}}_{\mathbf{u}, K_{0}}\left(\mathfrak{C}(X_{T}, \{Y_{s}\}_{s\leq T}) + \sum_{t< T} \mathfrak{L}_{t}(\{Y_{s}\}_{s\le t}, u_{t+1})\right),$$

where K0=κprior is the uncertainty before the control problem begins. Here \(\mathfrak {C}\) is a terminal cost, which may depend on the hidden state XT, and \(\mathfrak {L}\) is a running cost, which depends on the control ut+1 used at time t. We assume \(\mathfrak {C}\) and \(\mathfrak {L}\) are continuous in u (almost surely). We think of the cost \(\mathfrak {L}_{t}\) as being paid at time t, depending on the choice of control ut+1 (which will affect the generator at time t+1). For notational simplicity, we omit Y as an argument when unnecessary.

Remark 24

We do not allow \(\mathfrak {L}_{t}\) to depend on Xt, as this may lead to paradoxes (as the agent could learn information about the hidden state by observing their running costs).

For a given control process u, we define the remaining cost

$$J(\omega, t, K, \mathbf{u}) = \overleftarrow{\mathcal{E}}_{\mathbf{u},K}\left(\mathfrak{C}(X_{T}) + \sum_{t\le s < T} \mathfrak{L}_{s}(u_{s+1})\Big|\mathcal{Y}_{t}\right)$$

and hence the value function

$$V(\omega, t, K) = \inf_{u\in \mathcal{U}}\overleftarrow{\mathcal{E}}_{\mathbf{u}, K}\left(\mathfrak{C}(X_{T}) + \sum_{t\le s < T} \mathfrak{L}_{s}(u_{s+1})\Big|\mathcal{Y}_{t}\right).$$

Remark 25

We define our expected cost using the \(\mathcal {Y}\)-consistent expectation \(\overleftarrow{\mathcal {E}}_{\mathbf {u}, K}\), rather than the (inconsistent) DR-expectation \(\mathcal {E}_{\mathbf {y}_{t}}\), as this leads to time-consistency in the choice of controls.

Remark 26

We see that the calculation of the value function is a “minimax” problem, in that V minimizes the cost, which we evaluate using a maximum over a set of models. However, given the potential for learning, the requirement for time consistency, and the uncertainties involved, it is not clear that one can write V explicitly in terms of a single minimization and maximization of a given function.

Remark 27

As the filter-state penalty K is a general function depending on the control, and Y only takes finitely many states, it is not generally possible to express the effect (on K) of a control through a change of measure relative to some reference dynamics. In particular, we face the problem that controls \(u_{s}\) for times s<T have an impact on the terminal cost \(V_{T}= \overleftarrow{\mathcal {E}}_{\mathbf {u}, K_{T}}(\mathfrak {C}(X_{T})|\mathcal {Y}_{T})\), through their impact on the uncertainty KT. Unlike in a traditional control problem, VT is not independent of u given \(\mathcal {Y}_{t}\); this is the problem of the ‘arrow of time’ mentioned at the start of this section. For this reason, even though we model the impact of a control through its effect on the generator, we cannot give a fully “weak” formulation of our control problem, and are restricted to a “Markovian” setting with K as a state variable.

Theorem 10

The value function satisfies a dynamic programming principle. In particular, if an optimal control \(\mathbf{u}^{*}\) exists, then for every \(t\le T\),

$$\begin{aligned} V(\omega, t-1, K_{t-1}) &= \overleftarrow{\mathcal{E}}_{\mathbf{u}^{*}, K_{t-1}}\left(V\left(\omega, t, K^{\left(u^{*}, K_{t-1}\right)}_{t}\right)\big|\mathcal{Y}_{t-1}\right) +\mathfrak{L}_{t-1}\left(u_{t}^{*}\right), \end{aligned} $$

where \(K^{(u^{*}, K_{t-1})}_{t}\) is the one-step solution of the recursion of Theorem 5 using the control \(u^{*}\).

A similar result also holds if we only assume an ε-optimal control exists for every ε>0.


Proof

This effectively falls into the setting of our abstract dynamic programming principle (in particular, Corollary 1), with a time reversal. For any control u, using the recursivity of \(\overleftarrow{\mathcal {E}}\) and writing κt=κt(u,κt−1) for simplicity, we have

$$\begin{aligned} &J(\omega,t-1,K_{t-1},\mathbf{u}) \\ &= \overleftarrow{\mathcal{E}}_{\mathbf{u},K_{t-1}}\left(\mathfrak{C}(X_{T}) + \sum_{t-1\le s < T} \mathfrak{L}_{s}(u_{s+1})\Big|\mathcal{Y}_{t-1}\right)\\ &=\overleftarrow{\mathcal{E}}_{\mathbf{u},K_{t-1}}\left(J(\omega, t, \kappa_{t}, \mathbf{u})\big|\mathcal{Y}_{t-1}\right)+ \mathfrak{L}_{t-1}(u_{t}). \end{aligned} $$

In reversed time \(\tau = T-t\), we have the state variable \(z_{\tau } = K_{T-\tau }\), which is defined using a backward recursion (in terms of τ) by Theorem 5. The operator \(\mathcal {A}_{\tau }\) is then given by

$$\begin{aligned} &\mathcal{A}_{\tau}[J(\omega, T-\tau-1, K_{T-\tau-1}, \mathbf{u}), u, K_{T-\tau-1}]\\ &:= \overleftarrow{\mathcal{E}}_{\mathbf{u},K_{T-\tau-1}}\left(J(\omega, T-\tau, K_{T-\tau}, u)\big|\mathcal{Y}_{t-1}\right)+ \mathfrak{L}_{T-\tau-1}(u_{T-\tau}), \end{aligned} $$

which is monotone and continuous in its first argument. The result then follows from Corollary 1. □

Remark 28

The appeal to Corollary 1 is slightly complicated by the fact that V is a random variable, rather than simply a scalar value. The reader can verify that this does not affect the proof of Theorem 4 significantly, as we are in a finitely generated probability space (so measurable selection arguments remain straightforward), and the operator \(\mathcal {A}\) forces the solution to have the desired measurability through time.

By combining this dynamic programming property with the definition of the one-step expectation, we can write down a difference equation which V must solve.

Theorem 11

The value function of the control problem satisfies the recursion

$$\begin{aligned} &V(\omega, t, K_{t}) \\ &=\inf_{u \in U}\mathrm{ess\,sup}_{p, \mathfrak{S}, \mathfrak{D}_{t+1}} \left\{\int_{\mathbb{R}^{d}}\left(V\left(\omega, t, K_{t}^{u, K_{t-1}}\right)+\mathfrak{L}_{t+1}(y,u) \right.\right.\\ &\qquad\qquad - \left.\left.\hat{\mathcal{R}}_{t+1}\left(y,p, \mathfrak{S}, \mathfrak{D}_{t+1};u_{t+1}, K_{t}\right)\right)\left(\mathbf{1}^{\top} C_{{t+1}}(y)A_{t+1} p\right) d\mu(y) \right\} \end{aligned} $$

and terminal value

$$V(\omega, T, K_{T}) := \mathcal{E}_{\mathbf{y}_{T}}\left(\mathfrak{C}\left(X_{T}, \{Y_{s}\}_{s\le T}\right)\right).$$

Here, \(K_{t}^{u, K_{t-1}}\) is the value of Kt starting at Kt−1, with control u at time t, and evolving following Theorem 5 and \(\hat {\mathcal {R}}\) is as in (11).

Corollary 2

A control is optimal if and only if it achieves the infimum in the formula for V above.
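A single step of such a min-max recursion can be sketched for finite control and model sets. The code below is a deliberately simplified toy: it treats the next-step value as a known function of the next observation only (so the dependence of the penalty state K on the control is ignored), and the models, penalties, running cost and values are all hypothetical.

```python
import math


def one_step_value(V_next, controls, models, running_cost):
    """Return (argmin control, value) of min over u of max over models.

    Each model is (q, pen) with q the law of the next observation and
    pen(u) the penalty of that model when control u is applied.
    """
    best_u, best_val = None, math.inf
    for u in controls:
        worst = max(
            sum((V_next[y] + running_cost(u) - pen(u)) * q[y]
                for y in range(len(q)))
            for q, pen in models
        )
        if worst < best_val:
            best_u, best_val = u, worst
    return best_u, best_val


# Hypothetical models: the control shifts which model is considered plausible.
models = [
    ([0.5, 0.5], lambda u: 0.0 if u == 0 else 0.3),
    ([0.2, 0.8], lambda u: 0.4 if u == 0 else 0.0),
]

V_next = [0.0, 1.0]                       # made-up next-step values by y
best_u, best_val = one_step_value(
    V_next, controls=[0, 1], models=models, running_cost=lambda u: 0.1 * u
)
```

In this example control 0 has worst-case value 0.5 and control 1 has worst-case value 0.9, so the minimising control is 0; a control attaining the minimum at every step is what Corollary 2 characterises as optimal.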

Remark 29

If we assume that the terminal cost depends only on XT (and not on Y), and the running cost does not depend on Y, then one can observe a Markov property to the control problem, that is, Vs is conditionally independent of Y given Ks. The corresponding optimal controls can then also be taken only to depend on Ks.

Availability of data and materials

Not applicable.


  1. 1.

    This assumption can easily be relaxed, to allow Y to take values in an appropriate Polish or Blackwell space. We restrict to the real setting purely for simplicity.

  2. 2.

    The convention 1=0 simply ensures that x is lower semicontinuous, and also that xx is the convex conjugate of xx1.

  3. 3.

    One can derive the stated formula using the filtering equations, for the vector \(p_{t}=\left (p_{t}^{1}, p_{t}^{2}\right)^{\top }\). However, the closed-form solution given here is more easily obtained using alternative methods for Bayesian hypothesis testing (which is effectively what this problem encodes).

  4. 4.

    One difficulty is that the analysis of Cohen (2017) considers the divergence to the MLE model, rather than the true model. This allows a slightly finer control over the asymptotic behaviour, which would need to be replicated in the filtering setting.


  1. Allan, A. L., Cohen, S. N.: Parameter uncertainty in the Kalman–Bucy filter. SIAM J. Control Optim. 57(3), 1646–1671 (2019a).

  2. Allan, A. L., Cohen, S. N.: Pathwise Stochastic Control with Applications to Robust Filtering. Ann. Appl. Prob. (2020). arXiv:1902.05434.

  3. Artzner, P., Delbaen, F., Eber, J. -M., Heath, D.: Coherent measures of risk. Math. Finan. 9(3), 203–228 (1999).

  4. Başar, T., Bernhard, P.: H∞-Optimal Control and Related Minimax Design Problems, A Dynamic Game Approach. Birkhäuser, Basel (1991).

  5. Bain, A., Crisan, D.: Fundamentals of Stochastic Filtering. Springer, Berlin–Heidelberg–New York (2009).

  6. Bielecki, T. R., Chen, T., Cialenco, I.: Recursive construction of confidence regions. Electron. J. Stat. 11(2), 4674–4700 (2017).

    MathSciNet  MATH  Google Scholar 

  7. Boel, R. K., James, M. R., Petersen, I. R.: Robustness and risk-sensitive filtering. IEEE Trans. Autom. Control. 47(3), 451–461 (2002).

    MathSciNet  MATH  Google Scholar 

  8. Cohen, S. N., Elliott, R. J.: A general theory of finite state backward stochastic difference equations. Stoch. Process. Appl. 120(4), 442–466 (2010).

    MathSciNet  MATH  Google Scholar 

  9. Cohen, S. N., Elliott, R. J.: Backward stochastic difference equations and nearly-time-consistent nonlinear expectations. SIAM J. Control Optim. 49(1), 125–139 (2011).

    MathSciNet  MATH  Google Scholar 

  10. Cohen, S. N., Elliott, R. J.: Stochastic Calculus and Applications. 2nd ed. Birkhäuser, New York (2015).

    Google Scholar 

  11. Cohen, S. N.: Data-driven nonlinear expectations for statistical uncertainty in decisions. Electron. J. Stat. 11(1), 1858–1889 (2017).

    MathSciNet  MATH  Google Scholar 

  12. Delbaen, F., Peng, S., Rosazza Gianin, E.: Representation of the penalty term of dynamic concave utilities. Finan. Stochast. 14(3), 449–472 (2010).

    MathSciNet  MATH  Google Scholar 

  13. Dey, S., Moore, J. B.: Risk-sensitive filtering and smoothing for hidden Markov models. Syst. Control Lett. 25, 361–366 (1995).

    MathSciNet  MATH  Google Scholar 

  14. Douc, R., Moulines, E., Olsson, J., van Handel, R.: Consistency of the maximum likelihood estimator for general hidden Markov models. Ann. Stat. 39(1), 474–513 (2011).

    MathSciNet  MATH  Google Scholar 

  15. Duffie, D., Epstein, L. G.: Asset pricing with stochastic differential utility. Rev. Finan. Stud. 5(3), 411–436 (1992).

    Google Scholar 

  16. El Karoui, N., Peng, S., Quenez, M. C.: Backward stochastic differential equations in finance. Math. Finan. 7(1), 1–71 (1997).

    MathSciNet  MATH  Google Scholar 

  17. Epstein, L. G., Schneider, M.: Recursive multiple-priors. J. Econ. Theory. 113, 1–31 (2003).

    MathSciNet  MATH  Google Scholar 

  18. Fagin, R., Halpern, J.: A new approach to updating beliefs. In: Proceedings of the Sixth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-90), pp. 317–325. AUAI Press, Corvallis (1990).

    Google Scholar 

  19. Föllmer, H., Schied, A.: Convex measures of risk and trading constraints. Finan. Stochast. 6, 429–447 (2002a).

  20. Föllmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time. Studies in Mathematics 27. de Gruyter, Berlin-New York (2002b).

  21. Frittelli, M., Rosazza Gianin, E.: Putting order in risk measures. J. Bank. Financ. 26(7), 1473–1486 (2002).

    Google Scholar 

  22. Graf, S.: A Radon–Nikodym theorem for capacities. J. für die reine und angewandte Mathematik. 320, 192–214 (1980).

  23. Grimble, M. J., El Sayed, A.: Solution of the H optimal linear filtering problem for discrete-time systems. Trans. Acoust. Speech Sig. Process. IEEE. 38(7) (1990).

  24. Hansen, L. P., Sargent, T. J.: Robust estimation and control under commitment. J. Econ. Theory. 124, 258–301 (2005).

    MathSciNet  MATH  Google Scholar 

  25. Hansen, L. P., Sargent, T. J.: Recursive robust estimation and control without commitment. J. Econ. Theory. 136(1), 1–27 (2007).

    MathSciNet  MATH  Google Scholar 

  26. Hansen, L. P., Sargent, T. J.: Robustness. Princeton University Press, Princeton (2008).

    Google Scholar 

  27. Huber, P. J., Roncetti, E. M.: Robust Statistics, 2nd edn.Wiley, Hoboken (2009).

    Google Scholar 

  28. James, M. R., Baras, J. S., Elliott, R. J.: Risk-sensitive control and dynamic games for partially observed discrete-time nonlinear systems. Trans. Autom. Control IEEE. 39(4), 780–792 (1994).

    MathSciNet  MATH  Google Scholar 

  29. Kalman, R. E.: A new approach to linear filtering and prediction problems. J. Basic Eng. ASME. 82, 33–45 (1960).

    MathSciNet  Google Scholar 

  30. Kalman, R. E., Bucy, R. S.: New results in linear filtering and prediction theory. J. Basic Eng. ASME. 83, 95–108 (1961).

    MathSciNet  Google Scholar 

  31. Keynes, J. M.: A Treatise on Probability. Macmillan and Co., New York (1921). Reprint BN Publishing, 2008.

    Google Scholar 

  32. Knight, F. H.: Risk, Uncertainty and Profit. Houghton Mifflin, Boston (1921). reprint Dover 2006.

    Google Scholar 

  33. Kupper, M., Schachermayer, W.: Representation results for law invariant time consistent functions. Math. Financ. Econ. 2(3), 189–210 (2009).

    MathSciNet  MATH  Google Scholar 

  34. Leroux, B. G.: Maximum-likelihood estimation for hidden Markov models. Stoch. Process. Appl. 40, 127–143 (1992).

    MathSciNet  MATH  Google Scholar 

  35. Peng, S.: Nonlinear Expectations and Stochastic Calculus under Uncertainty. arxiv::1002.4546v1 (2010).

  36. Riedel, F.: Dynamic coherent risk measures. Stochast. Process. Appl. 112(2), 185–200 (2004).

    MathSciNet  MATH  Google Scholar 

  37. Rockafellar, R. T., Uryasev, S., Zabarankin, M.: Generalized deviations in risk analysis. Finan. Stochast. 10, 51–74 (2006).

    MathSciNet  MATH  Google Scholar 

  38. Wald, A.: Statistical decision functions which minimize the maximum risk. Ann. Math. 46(2), 265–280 (1945).

    MathSciNet  MATH  Google Scholar 

  39. Walley, P.: Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London (1991).

    Google Scholar 

  40. Wonham, W. N.: Some applications of stochastic differential equations to optimal nonlinear filtering. SIAM J. Control. 2, 347–369 (1965).

    MathSciNet  MATH  Google Scholar 

  41. Zhang, J., Xia, Y., Shi, P.: Parameter-dependent robust H filtering for uncertain discrete-time systems. Automatica. 45, 560–565 (2009).

    MathSciNet  MATH  Google Scholar 

Download references


Thanks to Ramon van Handel, Michael Monoyios, Sergey Nadtochiy, Andrew Allan, Gonçalo Simões, and Robert Elliott for useful conversations, and to two anonymous referees for careful comments.


Research supported by the Oxford–Man Institute for Quantitative Finance, the Oxford–Nie Financial Data Laboratory and The Alan Turing Institute under the Engineering and Physical Sciences Research Council grant EP/N510129/1.

Author information




The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Samuel N. Cohen.

Ethics declarations

Competing interests

The author declares that he has no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Cohen, S.N. Uncertainty and filtering of hidden Markov models in discrete time. Probab Uncertain Quant Risk 5, 4 (2020).


  • Filtering
  • Optimal control
  • Robustness
  • Nonlinear expectation

AMS Subject Classification

  • 62M20
  • 60G35
  • 93E11