Uncertainty and filtering of hidden Markov models in discrete time

We consider the problem of filtering an unseen Markov chain from noisy observations, in the presence of uncertainty regarding the parameters of the processes involved. Using the theory of nonlinear expectations, we describe the uncertainty in terms of a penalty function, which can be propagated forward in time in place of the filter.


Introduction
Filtering is a common problem in many applications. The essential concept is that there is an unseen Markov process, which influences the state of some observed process, and our task is to estimate the state of the unseen process using a form of Bayes' theorem. Many results have been obtained in this direction, most famously the Kalman filter (Kalman 1960; Kalman and Bucy 1961), which assumes the underlying processes considered are Gaussian, and gives explicit formulae accordingly. Similarly, under the assumption that the underlying process is a finite-state Markov chain, a general formula to calculate the filter can be obtained (the Wonham filter; Wonham 1965). These results are well known, in both discrete and continuous time (see Bain and Crisan (2009) or Cohen and Elliott (2015) Chapter 21 for further general discussion).
In this paper, we consider a simple setting in discrete time, where the underlying process is a finite-state Markov chain. Our concern is to study uncertainty in the dynamics of the underlying processes, in particular, its effect on the behaviour of the corresponding filter. That is, we assume that the observer has only imperfect knowledge of the dynamics of the underlying process and of their relationship with the observation process, and wishes to incorporate this uncertainty in their estimates of the unseen state. We are particularly interested in allowing the level of uncertainty in the filtered state to be endogenous to the filtering problem, arising from the uncertainty in parameter estimates and process dynamics.
We model this uncertainty in a general manner, using the theory of nonlinear expectations, and concern ourselves with a description of uncertainty for which explicit calculations can be carried out, and which can be motivated by considering statistical estimation of parameters. We then apply this to building a dynamically consistent expectation for random variables based on future states, and to a general control problem, with learning, under uncertainty.

Basic filtering
Consider two stochastic processes, X = {X_t}_{t≥0} and Y = {Y_t}_{t≥0}. Let Ω be the space of paths of (X, Y) and P be a probability measure on Ω. We denote by {F_t}_{t≥0} the (completed) filtration generated by X and Y, and by Y = {Y_t}_{t≥0} the (completed) filtration generated by Y. The key problem of filtering is to determine estimates of φ(X_t) given Y_t, that is, E_P[φ(X_t) | Y_t], where φ is an arbitrary Borel function.
Suppose that X is a Markov chain with (possibly time-dependent) transition matrix A_t^⊤ under P (the transpose here saves notational complexity later). Without loss of generality, we assume that X takes values in the standard basis vectors {e_i}_{i=1}^N of R^N (where N is the number of states of X), and so we can write

X_t = A_t X_{t−1} + M_t,

where E_P[M_t | F_{t−1}] = 0, so E_P[X_t | F_{t−1}] = A_t X_{t−1}.
We suppose the process Y is multivariate real-valued. The law of Y depends on X; in particular, the P-distribution of Y_t given {X_s}_{s≤t} ∪ {Y_s}_{s<t} (that is, given all past observations of X and Y and the current state of X) is Y_t ∼ c(y; t, X_t) dμ(y), for μ a reference measure on (R^d, B(R^d)), where ∼ is used to indicate the density of the distribution of a random variable.
For simplicity, we assume that Y_0 ≡ 0, so no information is revealed about X_0 at time 0. It is convenient to write C_t(y) = C(y; t) for the diagonal matrix with entries c(y; t, e_i), so that C_t(y) X_t = c(y; t, X_t) X_t.
Note that these assumptions, in particular the values of A and C, depend on the choice of probability measure P. Conversely, as our space is the space of paths of (X, Y), the measure P is determined by A and C. We call A and C the generators of our probability measure. As we have assumed X_t takes values in the standard basis in R^N, the expectation E_P[X_t | Y_t] determines the entire conditional distribution of X_t given Y_t. In this discrete time context, the filtering problem can be solved in a fairly simple manner. Suppose we have already calculated p_{t−1} := E_P[X_{t−1} | Y_{t−1}]. Then, by linearity and the dynamics of X, E_P[X_t | Y_{t−1}] = A_t p_{t−1}. Using this fact, Bayes' theorem then states that, with probability one,

P(X_t = e_i | Y_t) = P(X_t = e_i | {Y_s}_{s<t}, Y_t) ∝ c(Y_t; t, e_i) P(X_t = e_i | Y_{t−1}),

with ∝ denoting equality up to proportionality; this can be written in the simple matrix form given in the following theorem, which summarizes the classical filter.
Theorem 1 For X a hidden Markov chain with transition matrix A_t, and Y an observation process with conditional density (given X_t) given by Y_t | X_t ∼ c(y; t, X_t) dμ(y) = (1^⊤ C_t(y) X_t) dμ(y), the conditional distribution E[X_t | Y_t] = p_t satisfies the recursion

p_t = G(t, A, C, p_{t−1}) := ( C_t(Y_t) A_t p_{t−1} ) / ( 1^⊤ C_t(Y_t) A_t p_{t−1} ),   (1)

where 1 denotes a vector with all components 1.
We call p t the "filter state" at time t. Note that, if we assume the density c is positive, A t is irreducible and p t−1 has all entries positive, then p t will also have all entries positive.
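For concreteness, the recursion in Theorem 1 is easy to implement directly. The following is a minimal sketch (the two-state chain, transition matrix, and Bernoulli observation densities here are invented for illustration) computing p_t ∝ C_t(Y_t) A_t p_{t−1}:

```python
def filter_step(A, c, p_prev, y):
    """One step of the discrete-time HMM filter: p_t ∝ C_t(y) A_t p_{t-1}."""
    N = len(p_prev)
    # predict: (A p_prev)_i, with the columns of A summing to one
    predicted = [sum(A[i][j] * p_prev[j] for j in range(N)) for i in range(N)]
    # correct: weight each state by its observation density c(y; e_i)
    weighted = [c(y, i) * predicted[i] for i in range(N)]
    total = sum(weighted)  # equals 1^T C(y) A p_prev, the normalizing constant
    return [w / total for w in weighted]

# Example: two states; state 0 emits 1 with prob. 0.8, state 1 with prob. 0.3.
A = [[0.9, 0.2],
     [0.1, 0.8]]  # columns sum to one
c = lambda y, i: (0.8, 0.3)[i] if y == 1 else (0.2, 0.7)[i]
p = [0.5, 0.5]
for y in (1, 1, 0):
    p = filter_step(A, c, p, y)
```

Each output vector sums to one and, consistent with the remark above, remains strictly positive when the densities are positive and the filter state starts positive.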
In practice, the key problem with implementing these methods is the requirement that we know the underlying transition matrix A and the density C. These are generally not known perfectly, but need to be estimated prior to the implementation of the filter. Uncertainty in the choice of these parameters will lead to uncertainty in the estimates of the filtered state, and the aim of this paper is to derive useful representations of that uncertainty.
As variation in the choice of A and C corresponds to a different choice of measure P, we see that using an uncertain collection of generators corresponds naturally to uncertainty regarding P. This type of uncertainty, where the probability measure is not known, is commonly referred to as "Knightian" uncertainty (with reference to Knight (1921); related ideas are also discussed by Keynes (1921) and Wald (1945)).
Effectively, we wish to consider the propagation of uncertainty in Bayesian updating (as the filter is simply a special case of this). Huber and Ronchetti (2009, p. 331) briefly touch on this, but argue (based on earlier work by Kong) that this propagation is computationally infeasible. However, their approach was based on Choquet integrals, rather than on nonlinear (convex) expectations in the style of Peng (2010) and others. In the coming sections, we see how the structure of nonlinear expectations allows us to derive comparatively simple rules for updating.
Remark 1 While we will present our theory in the context where X is a finite state Markov chain, our approach does not depend in any significant way on this assumption. In particular, it would be equally valid, mutatis mutandis, if we supposed that X followed the dynamics of the Kalman filter, and our uncertainty was on the coefficients of the filter. We specialize to the Markov chain case purely for the sake of concreteness.
The aim of this paper is to provide, with a minimum of technical complexity, the basic structures which underlie this approach to filtering with a nonlinear expectation. It proceeds as follows: In Section 2 we give some key facts about the measures which naturally appear in a filtering context. In Section 3, we introduce the theory of nonlinear expectations, and a means of connecting these with statistical estimation from Cohen (2017). Section 4 unites these expectations with filtering, giving recursive equations which replace the filtering equation in Theorem 1; it also outlines some concrete simplifications of this general structure, depending on whether the underlying parameters can vary through time and how new information is to be incorporated. In Section 5 we consider dynamic properties of this nonlinear expectation when looking at future events, and some connections with the theory of (discrete time) BSDEs. Finally, Section 6 considers a generic control problem in this context.

Conditionally Markov measures
In order to incorporate learning in our nonlinear expectations and filtering, it is useful to extend slightly from the family of measures previously described. In particular, we wish to allow the dynamics to depend on past observations, while preserving enough Markov structure to enable filtering. The following classes of probability measures will be of interest.

Definition 1
We write M 1 for the space of probability measures equivalent to a reference measure P.
Let M_M ⊂ M_1 denote the probability measures under which
• X is a Markov chain, that is, for all t, X_{t+1} is independent of F_t given X_t;
• both X and Y are time homogeneous, that is, the conditional distributions of X_{t+1} | X_t and Y_t | X_t do not depend on t.
Let M_{M|Y} ⊃ M_M denote the probability measures under which the same structure holds conditionally on the observations, that is, the one-step transition and observation distributions at each time t may depend on t and on {Y_s}_{s<t}, but not otherwise on the past.
We note that, if we consider a measure in M_{M|Y}, there is a natural notion of the generators A and C. In particular, M_M corresponds to those measures under which the generators A and C are constant, while M_{M|Y} corresponds to those measures under which the generators A and C are functions of time and of {Y_s}_{s<t} (i.e. {Y_t}_{t≥0}-predictable processes).
For each t, these generators determine the measure on F t given F t−1 , and (together with the distribution of X 0 ) this determines the measure at all times. It is straightforward to verify that our filtering equations hold for all measures in M M|Y , with the appropriate modification of the generators.

Definition 2 For a measure Q ∈ M_{M|Y}, we write (A^Q_t, C^Q_t) for its generators. These are defined exactly as for measures in M_M, except that A^Q_t and C^Q_t are now allowed to depend on {Y_s}_{s<t}. For notational convenience, we shall typically not write the dependence on {Y_s}_{s<t} explicitly.
Similarly, for a {Y_t}_{t≥0}-predictable process (A_t, C_t(·))_{t≥0} taking values in the product of the space of transition matrices and the space of diagonal matrix-valued functions, where each diagonal element is a probability density on R^d, and for p_0 a probability vector in R^N, we write Q(A, C, p_0) for the measure with generators (A, C) and initial distribution E[X_0] = p_0.
In what follows, we will variously wish to restrict a measure Q to a σ-algebra, and to condition a measure on a σ-algebra. To prevent notational confusion, we shall write Q_F for the restriction of Q to F, and Q|_F for Q conditioned on F.
In our setting, our fundamental problem is that we do not know what measure is "true", and so work instead under a family of measures. In this context, measure changes can be described as follows.

Remark 2
The requirement that P̄ is a probability measure is unnecessary (i.e., in Bayesian parlance, the reference distribution may be improper). For example, we can use Lebesgue measure on R as the marginal reference measure μ for Y_t without difficulty, in which case c_t(y) is the usual (Lebesgue) density of the distribution of Y_t.
Proof Here P̄ denotes the reference measure under which X_0 and the conditional laws X_t | F_{t−1} are uniform on {e_i}, and the Y_t have law μ, independently of everything else; the proposed Radon–Nikodym density is

dQ/dP̄ |_{F_T} = ( (X_0^⊤ p_0)/(1/N) ) ∏_{t=1}^T ( X_t^⊤ A_t X_{t−1} )( 1^⊤ C_t(Y_t) X_t ).

A simple verification of this result is possible by factoring this density into three terms. The first term, (X_0^⊤ p_0)/(1/N), changes the distribution of X_0 from uniform to p_0, as is seen from the calculation E_P̄[ (X_0^⊤ p_0)/(1/N) 1_{X_0 = e_i} ] = N (p_0)_i (1/N) = (p_0)_i. This is clearly a probability density with respect to P̄. The second term, ∏_{t=1}^T X_t^⊤ A_t X_{t−1}, changes the conditional distribution of X_t | {X_s}_{s<t} (for each t) from a uniform distribution to the probability vector A_t X_{t−1}, as can be demonstrated by a calculation similar to that for X_0 in the first term. As the columns of A_t sum to one, it is easy to verify that this product has expectation 1 (conditional on X_0) and is nonnegative; that is, it is a probability density which does not modify the distribution of X_0.
The third term changes the conditional distribution of Y_t | ({X_s}_{s≥0}, {Y_s}_{s≠t}) from μ to (1^⊤ C_t(y) X_t) dμ(y) = c(y; t, X_t) dμ(y). This is most easily seen by calculating E_P̄[ (1^⊤ C_t(Y_t) X_t) g(Y_t) | X, {Y_s}_{s≠t} ] = ∫ c(y; t, X_t) g(y) dμ(y) for a general bounded Borel function g. As c is defined to be a density, it is again easy to verify that the product of the terms 1^⊤ C_t(Y_t) X_t is a probability density with respect to P̄, and that it does not modify the distribution of X.
As we are on a canonical space, the measure Q is determined by the laws of X and Y, and so we have the result.
The above proposition gives a Radon–Nikodym derivative adapted to the full filtration {F_t}_{t≥0}. In practice, it is also useful to consider the corresponding Radon–Nikodym derivative adapted to the observation filtration {Y_t}_{t≥0}. As this filtration is generated by the process Y, it is enough to multiply together the conditional densities of Y_t | Y_{t−1}, leading to the following convenient representation. Recall that Σ_i p_i c_t(y; e_i) = 1^⊤ C_t(y) p.
Proposition 2 The restriction of Q = Q(A, C, p_0) to the observation filtration has density

dQ/dP̄ |_{Y_T} = ∏_{t=1}^T 1^⊤ C_t(Y_t) A_t p^{(A,C),p_0}_{t−1},

where p^{(A,C),p_0} is the solution to the filtering problem in the measure Q(A, C, p_0), as determined by (1) (and so includes further dependence on {Y_s}_{s<t}).
Proof As Y_t | Y_{t−1} has distribution μ under P̄, it follows that 1^⊤ C_t(Y_t) A_t p^{(A,C),p_0}_{t−1} is the Radon–Nikodym density of the conditional law of Y_t | Y_{t−1} under Q with respect to its law under P̄. As {Y_s}_{s≤T} generates Y_T, the result follows by induction.
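The product form of this density gives a practical way to evaluate the observed log-likelihood of a candidate generator: accumulate log(1^⊤ C_t(Y_t) A_t p_{t−1}) alongside the filter recursion. A minimal sketch with an invented two-state Bernoulli model (the helper names are ours):

```python
import math

def log_likelihood(A, c, p0, ys):
    """Log of the observed likelihood: sum_t log(1^T C_t(y_t) A p_{t-1})."""
    N = len(p0)
    p, loglik = list(p0), 0.0
    for y in ys:
        predicted = [sum(A[i][j] * p[j] for j in range(N)) for i in range(N)]
        weighted = [c(y, i) * predicted[i] for i in range(N)]
        total = sum(weighted)          # 1^T C(y) A p_{t-1}
        loglik += math.log(total)
        p = [w / total for w in weighted]  # filter update, as in (1)
    return loglik

A = [[0.9, 0.2], [0.1, 0.8]]           # columns sum to one
c = lambda y, i: (0.8, 0.3)[i] if y == 1 else (0.2, 0.7)[i]
ll = log_likelihood(A, c, [0.5, 0.5], [1, 1, 0])
```

The same pass therefore yields both the filter state and the likelihood of the chosen generator, which is what the penalties of the later sections are built from.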
In order to apply classical statistical methods, we take a (generic) parameterization of this family of measures. This will also allow us to encode which parts of the generators we believe are static (and so can be learnt from observations), and which are dynamic (and so will violate stationarity).
Assumption 1 For fixed m > 0, we assume we are given a (Borel measurable) map Θ : (t, S, D_t) ↦ (A_t, C_t(·)), defined for parameters (S, D_t) ranging over R^m, where S is a static parameter, and D_t is a parameter which may vary at each point in time. We write Q for the family of measures in M_{M|Y} induced by this parameterization, and typically omit to write the argument t of Θ.
With a slight abuse of notation, if (A_t, C_t) = Θ(t, S, D_t), we write Q(S, D, p_0) as an alias for the measure Q(A, C, p_0), and G(t, S, D_t, p) as an alias for the function G(t, A, C, p) defined in (1).

Nonlinear expectations
In this section, we introduce the concepts of nonlinear expectations and convex risk measures, and discuss their connection with penalty functions on the space of measures. These objects provide a technical foundation with which to model the presence of uncertainty in a random setting. This theory is explored in some detail in Föllmer and Schied (2002b). Other key works which have used or contributed to this theory, in no particular order, are Hansen and Sargent (2008) (see also Hansen and Sargent (2005, 2007) for work related to what we present here); Huber and Ronchetti (2009); Peng (2010); El Karoui et al. (1997); Delbaen et al. (2010); Duffie and Epstein (1992); Rockafellar et al. (2006); Riedel (2004); and Epstein and Schneider (2003). We base our terminology on that used in Föllmer and Schied (2002b) and Delbaen et al. (2010).
We here present, without proof, the key details of this theory as needed for our analysis.
Definition 3 A nonlinear expectation on L^∞(G) is a map E : L^∞(G) → R satisfying
• Monotonicity: if ξ_1 ≥ ξ_2 then E(ξ_1) ≥ E(ξ_2);
• Constant triviality: E(c) = c for all constants c;
• Translation equivariance: E(ξ + c) = E(ξ) + c for all ξ ∈ L^∞(G) and constants c.
A "convex" expectation also satisfies
• Convexity: for any λ ∈ [0, 1], ξ_1, ξ_2 ∈ L^∞(G),

E( λξ_1 + (1 − λ)ξ_2 ) ≤ λE(ξ_1) + (1 − λ)E(ξ_2).

If E is a convex expectation, then the operator defined by ρ(ξ) = E(−ξ) is called a convex risk measure. A particularly nice class of convex expectations is those which are, in addition, lower semicontinuous (with respect to bounded pointwise convergence). The following theorem (originally expressed in the language of risk measures) is due to Föllmer and Schied (2002a) and Frittelli and Rosazza Gianin (2002).

Theorem 2 Suppose E is a lower semicontinuous convex expectation. Then there exists a "penalty" function R : M_1 → [0, ∞] such that

E(ξ) = sup_{Q ∈ M_1} { E_Q[ξ] − R(Q) }.

Provided R(Q) < ∞ for some Q equivalent to P, we can restrict our attention to measures in M_1 equivalent to P without loss of generality.
Remark 3 This result gives some intuition as to how a convex expectation can model "Knightian" uncertainty. One considers all the possible probability measures on the space, and then selects the maximal expectation among all measures, penalizing each measure depending on its plausibility, as measured by R(·). As convexity of E is a natural requirement of an "uncertainty averse" assessment of outcomes, Theorem 2 shows that this is the only way to construct an "expectation" E which penalizes uncertainty, while preserving monotonicity, translation equivariance, and constant triviality (and lower semicontinuity).
In order to relate our penalty function with the temporal structure of filtering, we focus our attention on measures in our parametric family Q, defined in Assumption 1. We also allow the penalty R to depend on time. In the analysis of this paper, the following definition allows us to obtain a (forward) recursive structure in our nonlinear expectations, as one might expect in a filtering context.

Definition 4 We say a family of penalty functions {R_t}_{t≥0} is additive if it can be written in the form

R_t(Q) = α_t(Q, {Y_s}_{s≤t}),

where, if Q = Q(S, D, p_0) and p^Q_t is the solution of the filtering Eq. (1) under Q, the function α_t is of the form

α_t(Q, {Y_s}_{s≤t}) = κ_prior(p_0, S) + Σ_{s≤t} γ_s(S, D_s, {Y_n}_{n≤s}, p^Q_{s−1}) − m_t.

Here k and k′ are positive constants (on which κ_prior and the γ_t may depend), κ_prior and {γ_t}_{t≥0} are known real functions bounded below, and m_t is a Y_t-measurable scalar random variable which ensures the normalization condition inf_Q α_t(Q, {Y_s}_{s≤t}) = 0 holds for almost all observation sequences {Y_s}_{s≥0}.

DR-expectations
From the discussion above, it is apparent that we can focus our attention on calculating the penalty function R, rather than working with the nonlinear expectation directly. This penalty function is meant to encode how "unreasonable" a probability measure Q is as a model for our outcomes. In Cohen (2017), we have considered a framework which links the choice of the penalty function to statistical estimation of a model. The key idea of Cohen (2017) is to use the negative log-likelihood function for this purpose, where the likelihood is taken against an arbitrary reference measure, and evaluated using the observed data. This directly uses the statistical information from observations to quantify our uncertainty.
In this paper, we make a slight extension of this idea, to explicitly incorporate prior beliefs. In particular, we replace the log-likelihood with the log-posterior density, which in turn gives an additional term in the penalty.
Definition 5 For Q ∈ Q, the observed likelihood L_obs(Q | y^t) is given in Proposition 2. Inspired by a "Bayesian" approach, we augment this by the addition of a prior distribution over Q. Suppose a (possibly improper) prior is given, with density π in terms of the parameters (S, {D_s}_{s≤t}, p_0). The posterior relative density is given by the product π(S, {D_s}_{s≤t}, p_0) L_obs(Q | y^t). The "Q|y^t-divergence" is defined to be the normalized negative log-posterior relative density

α_{y^t}(Q) = −(1/k) log( π(S, {D_s}_{s≤t}, p_0) L_obs(Q | y^t) ) − m_t,   (2)

where m_t is the Y_t-measurable constant ensuring the normalization inf_{Q ∈ Q} α_{y^t}(Q) = 0 (we display the case k′ = 1; general k′ reshapes this penalty, as discussed in Remark 6).

Remark 4
The right-hand side of (2) is well defined whether or not a maximum a posteriori (MAP) estimator exists. Given a MAP estimate Q̂ ∈ Q, we would have the simpler representation

α_{y^t}(Q) = (1/k) log( π(Q̂) L_obs(Q̂ | y^t) / ( π(Q) L_obs(Q | y^t) ) ).
Definition 6 The DR-expectation is defined, for bounded F_t-measurable ξ, by

E^{k,k′}_{y^t}(ξ) = sup_{Q ∈ Q} { E_Q[ξ | y^t] − α_{y^t}(Q) }.   (3)

We call E^{k,k′}_{y^t} the "DR-expectation" (with parameters k, k′). We may omit to write k, k′ for notational simplicity.
With deliberate ambiguity, the acronym "DR" can either stand for "divergence robust" or "data-driven robust".
Remark 5 By construction, Q is parameterized by S and {D t } t≥0 , which lie in R m for some m. The divergence and conditional expectations given y t are continuous with respect to this parameterization and can be constructed to be Borel measurable with respect to y t . Consequently, measure theoretic concerns which arise from taking the supremum will not cause difficulty, in particular, the DR-expectation defined in (3) is guaranteed to be a Borel measurable function of y t for every ξ . (This follows from Filippov's implicit function theorem, see, for example, Cohen and Elliott (2015) Appendix 10.)

Remark 6
The choice of parameters k and k′ determines much of the behaviour of the nonlinear expectation. The role of k is simple, as it acts to scale the uncertainty aversion: a higher value of k results in smaller penalties, and hence the DR-expectation will lie further above the MAP expectation. The parameter k′ determines the "curvature" of the uncertainty aversion. Taking k′ = ∞ results in the DR-expectation being positively homogeneous, that is, a coherent expectation in the sense of Artzner et al. (1999). In Cohen (2017), the asymptotic behaviour of the DR-expectation is studied, under the assumption of iid observations. For k′ = 1, the DR-expectation corresponds (for large samples) to the expected value under the maximum likelihood model plus k/2 times the sampling variance of the expectation, while for k′ = ∞, the DR-expectation corresponds to the expected value under the maximum likelihood model plus √(2k) times the sampling standard error of the expectation. In this paper, we will not be considering such an asymptotic result, so the values of k and k′ will not play a significant role. Their presence nevertheless gives a more general class of penalty functions in Definition 4, and they are kept for notational consistency with other papers considering the DR-expectation.

Remark 7
In principle, we could now apply the DR-expectation framework to a filtering context as follows: take a collection of models Q; for a random variable ξ, and for each measure Q ∈ Q, compute E_Q[ξ | y^t] and α_{y^t}(Q); taking a supremum as in (3), we obtain the DR-expectation. However, this is generally not computationally tractable in this form.
Lemma 1 For ξ a bounded F_t-measurable random variable, the value of E_{y^t}(ξ) depends on each measure Q ∈ Q only through its restriction Q_{F_t}.
Proof By construction, α_{y^t} is obtained from the posterior relative density, which is determined by the restriction of Q to Y_t ⊆ F_t, while the conditional expectation depends only on the restriction of Q to F_t. As these are the only terms needed to compute the DR-expectation, the result follows.

Theorem 3 The penalty in the DR-expectation is additive, in the sense of Definition 4.
Proof From Proposition 2, the likelihood is

L_obs(Q | y^t) = ∏_{s≤t} 1^⊤ C_s(Y_s) A_s p^Q_{s−1},

where p^Q is the solution to the filtering Eq. (1). By Lemma 1, the penalty in the DR-expectation is given by

α_{y^t}(Q) = −(1/k) log π(S, {D_s}_{s≤t}, p_0) − (1/k) Σ_{s≤t} log( 1^⊤ C_s(Y_s) A_s p^Q_{s−1} ) − m_t,

where m_t is chosen to ensure inf_Q α_{y^t}(Q) = 0. As (A_s, C_s(·)) = Θ(s, S, D_s), and writing the prior density in the factorized form π(S, {D_s}_{s≤t}, p_0) = π_0(p_0, S) ∏_{s≤t} π_s(D_s | S), we obtain the desired form by setting κ_prior(p_0, S) = −(1/k) log π_0(p_0, S) and

γ_s(S, D_s, {Y_n}_{n≤s}, p_{s−1}) = −(1/k) log( π_s(D_s | S) 1^⊤ C_s(Y_s) A_s p_{s−1} ).

Remark 8 The purpose of the nonlinear expectation is to give an "upper" estimate of a random variable, accounting for uncertainty in the underlying probabilities. This is closely related to robust estimation in the sense of Wald (1945). In particular, one can consider the robust estimator

arg min_x E_{y^t}( (ξ − x)^2 ),

which gives a "minimax" estimate of ξ, given the observations y^t and a quadratic loss function. The advantage of the nonlinear expectation approach is that it allows one to construct such an estimate for every random variable/loss function, giving a cost-specific quantification of uncertainty in each case. We can also see a connection with the theory of H^∞ filtering (see, for example, Grimble and El Sayed (1990), or more recently Zhang et al. (2009) and references therein, or the more general H^∞-control theory in Başar and Bernhard (1991)). In this setting, we look for estimates which perform best in the worst-case situation, where "worst" is usually defined in terms of a perturbation to the input signal or coefficients. In our setting, we focus not on the estimation problem directly, but on the "dual" problem of building an upper expectation, i.e. calculating the "worst" expectation in terms of a class of perturbations to the coefficients (our setting is general enough that perturbation to the signal can also be included through shifting the coefficients).
Remark 9 There are also connections between our approach and what is called "risk-sensitive filtering"; see, for example, James et al. (1994) and Dey and Moore (1995); the review of Boel et al. (2002) and references therein (from an engineering perspective); or Hansen and Sargent (2007) and Hansen and Sargent (2008) (from an economic perspective). In their setting, one uses the nonlinear expectation defined by

E(ξ | Y_t) = k log E_P[ exp(ξ/k) | Y_t ],

for some choice of robustness parameter 1/k > 0. This leads to significant simplification, as dynamic consistency and recursivity are guaranteed in every filtration (see Graf (1980) and Kupper and Schachermayer (2009), and further discussion in Section 5). The corresponding penalty function is given by the conditional relative entropy, which is additive (Definition 4), and the one-step penalty can be calculated accordingly. In this case, the optimization defining the nonlinear expectation could also be taken over M_1, so this approach has a claim to be including "nonparametric" uncertainty, as all measures are considered, rather than purely Markov measures or measures in a parametric family (however, the optimization can be taken over conditionally Markov measures, and one will obtain an identical result!). The difficulty with this approach is that it does not allow for easy incorporation of knowledge of the error of estimation of the generators (A, C) in the level of robustness: the only parameter available to choose is k, which multiplies the relative entropy. A small choice of k corresponds to a small penalty, hence a very robust expectation, but this robustness is not directly linked to the estimation of the generators (A, C). Therefore, the impact of statistical estimation error remains obscure, as k is chosen largely exogenously of this error. For this reason, our approach, which directly allows for the penalty to be based on the statistical estimation of the generators, has advantages over this simpler method.
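For a finite-state ξ, the entropic expectation of the risk-sensitive approach is straightforward to compute. The following small sketch (probabilities and outcomes invented) illustrates that it lies above the ordinary mean, and decreases toward it as k grows (i.e. as the robustness 1/k shrinks):

```python
import math

def entropic_expectation(probs, xi, k):
    """Risk-sensitive (entropic) expectation: k * log E[exp(xi / k)]."""
    return k * math.log(sum(p * math.exp(x / k) for p, x in zip(probs, xi)))

probs = [0.5, 0.5]
xi = [0.0, 1.0]
e_small_k = entropic_expectation(probs, xi, 0.5)   # small k: more robust, larger value
e_large_k = entropic_expectation(probs, xi, 50.0)  # large k: close to the ordinary mean
mean = sum(p * x for p, x in zip(probs, xi))
```

By Jensen's inequality the entropic expectation always dominates the mean, mirroring the fact that a smaller relative-entropy penalty gives a more conservative upper expectation.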

Recursive penalties
The DR-expectation provides us with an approach to including statistical estimation in our valuations. However, the calculations suggested by Remark 7 are generally intractable in their stated form. In this section, we shall see how the assumption that the penalty is additive (Definition 4) can be used to simplify our calculations.
Our arguments will be based on dynamic programming techniques. For the sake of precision and brevity, we here state a (forward in time) abstract "dynamic programming principle" which we can call on in later arguments.

Theorem 4 Let Z (a space of states) and U (a space of controls) be complete separable metric spaces. Suppose we have a sequence of Borel measurable maps g_t : Z × U → Z; for each u ∈ U, we write g_t^{−1}(·, u) for the (set-valued) inverse of g_t(·, u). For each control sequence u = (u_t)_{t≥1} and initial state z_0, define the controlled states

Z^{u,z_0}_t = g_t( Z^{u,z_0}_{t−1}, u_t ),  Z^{u,z_0}_0 = z_0.

Suppose also we have a sequence of Borel measurable aggregators A_t : R × U × Z → R, with v ↦ A_t(v, u, z) nondecreasing and continuous (uniformly in u, z) and A_t(v, u, z) → ∞ as v → ∞. For each u and z_0, we define the sequence of values at each time t by

V_t(z_0, u) = A_t( V_{t−1}(z_0, u), u_t, Z^{u,z_0}_{t−1} ),  with V_0(z_0, u) = v_0(z_0)

for a given Borel measurable function v_0. Then, the minimal value V*_t(z) := inf{ V_t(z_0, u) : Z^{u,z_0}_t = z } satisfies the recursion

V*_t(z) = inf_{u ∈ U} inf_{z′ ∈ g_t^{−1}(z, u)} A_t( V*_{t−1}(z′), u, z′ )   (4)

(with the convention that the infimum of the empty set is +∞).
Proof We proceed by induction. Clearly, the result holds at t = 0, as does the (at t = 0 empty) statement

V*_t(z) = +∞ for z ∉ ∪_{u,z_0} { Z^{u,z_0}_t }.   (5)

Suppose then that (4) and (5) hold at t = n − 1. For every ε > 0, there exists (u, z_0) such that V_{n−1}(z_0, u) ≤ V*_{n−1}( Z^{u,z_0}_{n−1} ) + ε. Taking the infimum over { (u, z_0) : Z^u_t = z } (which can be done measurably with respect to z, given Filippov's implicit function theorem, see, for example, Cohen and Elliott (2015) Appendix 10), and sending ε → 0, gives the corresponding approximation of the minimal value. From the definition of g, we know that

V_n(z_0, u) = A_n( V_{n−1}(z_0, u), u_n, Z^{u,z_0}_{n−1} ).

The right side of this equation depends on (u, z_0) only through the values of Z^{u,z_0}_{n−1} and u_n. In particular, considering the set of attainable states, that is, y ∈ ∪_{u,z_0} { Z^{u,z_0}_n }, we change variables to write the infimum defining V*_n(y) as the iterated infimum over u_n and z′ ∈ g_n^{−1}(y, u_n). As the infimum on the empty set is +∞, we also obtain (5) at time n, and simplify to give (4) for t = n, completing the inductive proof.

Corollary 1 Suppose, instead of Z being defined by a forward recursion, for some Borel measurable function g̃ we had the backward recursion Z^{u,z_0}_{t−1} = g̃_t( Z^{u,z_0}_t, u_t ). The result of Theorem 4 still holds (with effectively the same proof), where we write g̃_t instead of g_t^{−1}, so the second infimum in (4) is unnecessary (as g̃_t(z, u) is single-valued).
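On finite state and control sets, the forward recursion of Theorem 4 can be checked directly against brute-force enumeration over initial states and control sequences. A toy sketch with additive aggregators, A_t(v, u, z) = v + cost(u, z) (the dynamics and costs here are invented):

```python
from itertools import product

Z = [0, 1, 2]                      # states
U = [0, 1]                         # controls
g = lambda z, u: (z + u + 1) % 3   # dynamics z_t = g(z_{t-1}, u_t)
cost = lambda u, z: (z - u) ** 2   # per-step cost, charged at the departing state

def forward_dpp(T):
    """V*_t(z) = min_u min_{z': g(z',u)=z} V*_{t-1}(z') + cost(u, z'), as in (4)."""
    V = {z: 0.0 for z in Z}        # V*_0 = 0: any initial state is free
    for _ in range(T):
        V = {z: min((V[zp] + cost(u, zp)
                     for u in U for zp in Z if g(zp, u) == z),
                    default=float("inf"))
             for z in Z}
    return V

def brute_force(T):
    """Enumerate every initial state and control sequence."""
    V = {z: float("inf") for z in Z}
    for z0, us in product(Z, product(U, repeat=T)):
        z, total = z0, 0.0
        for u in us:
            total += cost(u, z)
            z = g(z, u)
        V[z] = min(V[z], total)
    return V
```

The forward pass visits each (state, control) pair once per step, while the enumeration grows exponentially in the horizon; both agree on the minimal values.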
For practical purposes, it is critical that we refine our approach to provide a recursive construction of our nonlinear expectation. In classical filtering, one obtains a recursion for expectations E[φ(X t )|Y t ], for Borel functions φ; one does not typically consider the expectations of general random variables. Similarly, we will consider the expectations of random variables φ(X t ).

Proposition 3 For each t, there exists a Y_t ⊗ B(R^N)-measurable function κ_t such that, for every Borel function φ,

E_{y^t}( φ(X_t) ) = sup_{q ∈ S^+_N} { Σ_i φ(e_i) q_i − κ_t(q) },   (6)

where S^+_N denotes the probability simplex in R^N, that is, S^+_N = { q ∈ R^N : q_i ≥ 0 for all i, Σ_i q_i = 1 }.

Proof Fix the observations y^t. Taking X to be the set of possible states of X_t, that is, the basis vectors in R^N with the discrete topology and corresponding Borel σ-algebra, the space of probability measures on X is represented by the probability simplex S^+_N. We consider the map φ ↦ E_{y^t}(φ(X_t)) as a nonlinear expectation with underlying space X. By Theorem 2, it follows that there exists a penalty function κ_{y^t}(q) such that (6) holds. Taking a regular version of this penalty function (which by convex duality exists, as E is measurable in y^t), we can write κ_t(ω, q) = κ_{y^t}(q) as desired.
Our aim is to find a recursion for κ t , for various choices of R. Our constructions will depend on the following object.
Definition 7 Recall from Theorem 1 and Assumption 1 that, given a generator (A_t, C_t(·)) = Θ(t, S, D_t) at time t, our filter dynamics are described by the recursion (up to proportionality)

p_t = G(t, S, D_t, p_{t−1}) ∝ C_t(Y_t) A_t p_{t−1},

where G also depends on the observation Y_t. We correspondingly define the (set-valued) inverse

G^{−1}(p; S, D_t, Y_t) = { q ∈ S^+_N : G(t, S, D_t, q) = p }.

For notational simplicity, we will omit the argument Y_t when this does not lead to confusion.
The set G −1 (p; S, D t , Y t ) represents the filter states at time t − 1 which evolve to p at time t, assuming the generator of our process (at time t) is given by (S, D t ) and we observe Y t . This set may be empty, if no such filter states exist. As the matrix A is generally not invertible (even accounting for the restriction to S + N ), the set G −1 (p; S, D t , Y t ) is not generally a singleton.

Filtering with uncertainty
We now show that if we assume our penalty is additive, then the function κ appearing in (6) can be obtained in a recursive manner.

Theorem 5 Suppose R_t is additive, in the sense of Definition 4. Then, a function κ satisfying (6) is given by

κ_t(p) = inf_S K_t(p, S),

where K_t satisfies the recursion

K_t(p, S) = inf_{D_t} inf_{q ∈ G^{−1}(p; S, D_t, Y_t)} { K_{t−1}(q, S) + γ_t(S, D_t, {Y_s}_{s≤t}, q) } − m_t,

with initial value K_0(p_0, S) = κ_prior(p_0, S), where m_t is chosen to ensure we have the normalization inf_{p,S} K_t(p, S) ≡ 0.
Proof As we know that R_t is additive, we have

R_t(Q) = κ_prior(p_0, S) + Σ_{s≤t} γ_s(S, D_s, {Y_n}_{n≤s}, p^Q_{s−1}) − m_t.

As E_Q[φ(X_t) | y^t] depends only on the conditional law of X_t | Y_t under Q, it is easy to see that (6) is satisfied when

κ_t(p) = inf { R_t(Q) : Q ∈ Q, p^Q_t = p }.   (7)

We wish to write the minimization in (7) as a recursive control problem, to which we can apply Theorem 4. Given p_0, S, and {D_s}_{s≤t}, the law of X_t | Y_t is given by the solution to the filtering Eq. (1). Write Z_t = (p_t, S) and u_t = D_t, so that Z is a state process defined by Z_0 = z_0 = (p_0, S) and the recursion (controlled by u)

Z_t = ( G(t, S, D_t, p_{t−1}), S ).

Omitting the constant m_t from the definition of α_t, we define V_t(z_0, u) = κ_prior(p_0, S) + Σ_{s≤t} γ_s(S, D_s, {Y_n}_{n≤s}, p_{s−1}).
Taking A_t to be the operator A_t(v, u, z) = v + γ_t(S, D_t, {Y_n}_{n≤t}, p), for z = (p, S) and u = D_t, we see that V satisfies the structure assumed in Theorem 4. Therefore, its minimal value V*_t(z) := inf{ V_t(z_0, u) : Z^{u,z_0}_t = z } satisfies

V*_t(p, S) = inf_{D_t} inf_{q ∈ G^{−1}(p; S, D_t, Y_t)} { V*_{t−1}(q, S) + γ_t(S, D_t, {Y_n}_{n≤t}, q) },

with initial value V*_0(z) = κ_prior(p_0, S). We renormalize this by setting m_t = inf_z V*_t(z) and K_t(p, S) := V*_t(p, S) − m_t, and so obtain the stated dynamics for K. By construction, we know inf{ R_t(Q) : p^Q_t = p } = inf_S K_t(p, S). It follows that (7), and hence (6), are satisfied by taking κ_t(p) = inf_S K_t(p, S), as desired.

Examples
In this section, we will seek to outline a few key settings where this theory can be applied.

Static generators, uncertain prior (StaticUP)
We first consider the case where uncertainty is given over the prior inputs to the filter. In particular, this "prior uncertainty" is not updated given new observations, and R will not change through time.
Framework 1 (StaticUP) In a StaticUP setting, the inputs to the filtering problem are the initial filter state p_0 and the generator (A, C(·)), which we parameterize solely using the static parameter S; in particular, we exclude dependence on the "dynamic" parameters {D_t}_{t≥0}. To represent our uncertain prior, we take a penalty of the form

R_t(Q) = κ_prior(p_0, S),

for some prescribed penalty κ_prior, that is, we take γ_t ≡ 0 in Definition 4. We now apply Theorem 5 (omitting dependence on D_t, as we are in a purely static setting) to see that a dynamic version κ_t of the penalty function, satisfying (6), can be computed as

κ_t(p) = inf_S K_t(p, S),

where K_t satisfies the recursion

K_t(p, S) = inf { K_{t−1}(q, S) : q ∈ G^{−1}(p; S, Y_t) } − m_t.

Assuming inf_{(p_0,S)} κ_prior(p_0, S) = 0, we further compute m_t ≡ 0. This completely characterizes the penalty function, and hence the nonlinear expectation.
Remark 10 Inspired by the DR-expectation, a possible choice of penalty function κ prior would be the negative log-density of a prior distribution for the inputs (p 0 , S), shifted to have minimal value zero. Alternatively, taking an empirical Bayesian perspective, κ prior could be the log-likelihood from a prior calibration process. In this case, we are incorporating our prior statistical uncertainty regarding the parameters in the filtering problem.
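When κ_prior is supported on finitely many candidate inputs, the StaticUP penalty is straightforward to propagate: each candidate (p_0, S) is run through its own filter, carrying its prior penalty unchanged, and the upper expectation is a penalized maximum over candidates. A sketch for a static two-state chain with Bernoulli observations (the candidates, penalty values, and observations are invented for illustration):

```python
def static_step(a, p, y):
    """Bayes update for a static two-state chain (A = I): a[i] is the
    success probability of the Bernoulli observation in state i."""
    w = [(a[i] if y == 1 else 1 - a[i]) * p[i] for i in (0, 1)]
    total = sum(w)
    return [x / total for x in w]

# Finite family of candidate inputs (p_0, S), each with a prior penalty.
candidates = [
    (([0.5, 0.5], (0.8, 0.3)), 0.0),  # reference model: zero penalty
    (([0.3, 0.7], (0.7, 0.4)), 0.2),  # less plausible alternatives
    (([0.5, 0.5], (0.9, 0.2)), 0.5),
]

ys = [1, 1, 0, 1]
posteriors = []
for (p0, a), kappa in candidates:
    p = list(p0)
    for y in ys:
        p = static_step(a, p, y)
    posteriors.append((p, kappa))  # the penalty is NOT updated: no learning of S

def upper_expectation(phi):
    """max over candidates of E_Q[phi(X_t) | y^t] minus the prior penalty."""
    return max(sum(q * f for q, f in zip(p, phi)) - kappa
               for p, kappa in posteriors)
```

Note that only the filter states evolve; the penalties are fixed at time zero, which is exactly the "no learning" behaviour discussed in the next remark.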

Remark 11 We emphasize that there is no learning of the generator being done in this framework-the penalty applied at time t = 0 is simply propagated forward; our observations do not affect our opinion of the plausible generators. In particular, if we assume no knowledge of the initial state (i.e., a zero penalty), then we will have no knowledge of the state at time t (unless the observations cause the filter to degenerate).
Example 1 For a concrete example of the StaticUP framework, we take the class of models in M M where A and C are perfectly known and A = I , so X t = X 0 is constant (but X 0 is unknown). We take N = 2, so X takes only one of two values. For the observation distribution C, we assume the observations are Bernoulli, taking the value 1 with probability a in the first state and probability b in the second, where a, b ∈ (0, 1) are fixed constants. Effectively, in this example we are using filtering to determine which of two possible parameters is the correct mean for our observation sequence. It is worth emphasising that the filter process p corresponds to the posterior probabilities, in a Bayesian setting, of the events that our Bernoulli process has parameter a or b. It is useful to note that, from classical Bayesian statistical calculations 3 , for a given p 0 , one can see that the corresponding value of p t is determined from the log-odds ratio,

log(p t (1)/p t (2)) = log(p 0 (1)/p 0 (2)) + t ( s̄ t log(a/b) + (1 − s̄ t ) log((1 − a)/(1 − b)) ), where s̄ t is the average number of successes observed by time t.
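This log-odds update is elementary to compute. The sketch below assumes the standard Bernoulli likelihood described above (function names are illustrative): each success adds log(a/b) to the log-odds, each failure adds log((1−a)/(1−b)).

```python
import math

def log_odds_posterior(log_odds_prior, observations, a, b):
    """Posterior log-odds log(p_t(1)/p_t(2)) that the Bernoulli
    parameter is a (rather than b), after a 0/1 observation sequence.
    Standard Bayesian updating, as in Example 1."""
    lo = log_odds_prior
    for y in observations:
        lo += math.log(a / b) if y == 1 else math.log((1 - a) / (1 - b))
    return lo

def posterior_prob(lo):
    """Recover p_t(1) from the log-odds."""
    return 1.0 / (1.0 + math.exp(-lo))
```

Note that with symmetric parameters (b = 1 − a), equal numbers of successes and failures leave the log-odds, and hence the filter state, unchanged.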
To write down the StaticUP penalty function, let the (known) dynamics be described by S * . Consequently, we can write K(p, S) = ∞ for all S ≠ S * . We initialize with a known penalty κ prior (p, S * ) = κ 0 (p) for all p ∈ S + N . As S * is known, there is no distinction between K and κ. In this example, we can express our penalty in terms of the log-odds, for the sake of notational simplicity given the closed-form solution to the filtering problem, and hence can explicitly calculate the (unique) initial distribution p 0 which would evolve to a given p at time t. In particular, the time-t penalty is given by a shift of the initial penalty.

Remark 12 This example demonstrates the following behaviour:
• If the initial penalty is zero, then the penalty at time t is also zero: there is no learning of which state we are in.
• When parameterized by the log-odds ratio, there is no variation in the curvature of the penalty (and so no change in our "uncertainty"); we simply shift the penalty around, corresponding to our changing posterior probabilities.
• The update of κ is done purely using the tools of Bayesian statistics, rather than having any direct incorporation of our uncertainty.

Remark 13
We point out that this is, effectively, the model of uncertainty proposed by Walley (1991) (see, in particular, Walley (1991) and Fagin and Halpern (1990)).

Dynamic generators, uncertain prior (DynamicUP)
If we model the generator (A, C) as fixed and unknown (i.e., depending only on S), calculation of K t (p, S) suffers from a curse of dimensionality: the dimension of S determines the size of the domain of K t . On the other hand, if we suppose the generator at time t depends only on the dynamic parameters D t , we can use dynamic programming to obtain a lower-dimensional problem. In this case, as we ignore the static parameter S, we simplify Theorem 5 through the identity κ t (p) = K t (p, S). This yields a recursion for κ t directly; again, if we assume inf Q α(Q) ≡ 0, we then conclude m t ≡ 0.
This formulation of the uncertain filter allows us to use dynamic programming to solve our problem forward in time. In the setting of Example 1, as the generator is perfectly known, there is no distinction between the dynamic and static cases.
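As a rough illustration of how such a penalty recursion might be propagated forward numerically, one could discretize the filter state and take infima over dynamic parameters at each step. All names below are hypothetical, and the nearest-grid-point projection is a crude stand-in for the exact recursion of Theorem 5; it shows only the "propagate, penalize, renormalize" structure.

```python
import numpy as np

def propagate_penalty(K, grid, models, filter_step, gamma, y):
    """One forward step of a penalty recursion, sketched on a grid of
    scalar filter states.  For each reachable grid point, take the
    infimum of K_t(p) + gamma(D) over prior states p and parameters D
    whose filter update under the observation y lands nearest it, then
    renormalize so that the infimum of the penalty is zero."""
    K_next = np.full(len(grid), np.inf)
    g = np.asarray(grid, dtype=float)
    for i, p in enumerate(grid):
        for D in models:
            p_new = filter_step(p, D, y)
            j = int(np.argmin(np.abs(g - p_new)))  # project onto the grid
            K_next[j] = min(K_next[j], K[i] + gamma(D))
    return K_next - K_next[np.isfinite(K_next)].min()
```

Unreachable grid points retain an infinite penalty, matching the convention that implausible filter states are excluded.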
A continuous-time version of this setting (for a Kalman-Bucy filter) is considered in detail in Allan and Cohen (2019a).

Static generators, DR-expectation (StaticDR)
In the above examples, we have regarded the prior as uncertain and used this to penalize over models. We did not use the data to modify our penalty function R. The DR-expectation gives us an alternative approach in which the data guides our model choice more directly. In what follows, we apply the DR-expectation in our filtering context and observe that it gives a slightly different recursion for the penalty function. Again, we can consider models where our generator is constant (i.e., depends only on S) or changes dynamically (i.e., depends only on D t ).

Framework 3 (StaticDR) In a StaticDR setting, the penalty is given by the log-posterior density α(Q|Y t ) = κ prior (p 0 , S) − L obs Q (A, C, p 0 |y t ) + m t , which is additive, as shown in Theorem 3. Applying Theorem 5, we see that the penalty can be written κ t (p) = inf S K t (p, S), where K 0 (p, S) = κ prior (p, S) and K satisfies the recursion of Theorem 5. Unlike in the uncertain prior cases, we cannot typically claim that m t ≡ 0; instead it is a random process dependent on our observations.

Remark 14
Comparing Framework 1 (StaticUP) with Framework 3 (StaticDR), we see that the key distinction is the presence of the log-likelihood term. This term implies that observations of Y will affect our quantification of uncertainty, rather than purely updating each model.

Example 2
In the setting of Example 1, recall that X is constant, so we know (A, C(·)). One can calculate the StaticDR penalty either directly, or through solving the stated recursion using the dynamics of p. As in the StaticUP case, the result is most simply expressed by first calculating p 0 from p t , where m t is chosen to ensure inf p κ t (p) = 0. From this, we see that the likelihood modifies our uncertainty directly, rather than us simply propagating each model via Bayes' rule. A consequence of this is that if we start with extreme uncertainty (κ 0 ≡ 0), then our observations teach us what models are reasonable, thereby reducing our uncertainty (i.e., we will find κ t (p) > 0 for p ∈ (0, 1) when t > 0).

Remark 15
It is interesting to ask what the long-term behaviour of these uncertain filters will be. In Cohen (2017), the long-run behaviour of the DR-expectation based on i.i.d. observations is derived and, in principle, a similar analysis is possible here. Using the asymptotic analysis of maximum likelihood estimation for hidden Markov models in Leroux (1992) or Douc et al. (2011), we know that the MLE will converge with probability one to the true parameter, under appropriate regularity conditions. Here, the presence of the prior influences this slightly; however, this impact vanishes as t → ∞. With further regularity assumptions, one can also show that the log-likelihood function, divided by the number of observations, almost surely converges to the relative entropy between a proposed model and the true model (see, for example, Leroux (1992) section 5). If one also knew that the relative entropy is smooth and convex, the analysis of Theorems 4 and 5 of Cohen (2017) would be possible, showing that the DR-expectation corresponds to adding a term related to the sampling variance of the hidden state 4 . In particular, as the number of observations increases, the DR-expectation will converge to the expected value under the filter with the true parameters.
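The convergence of the normalized log-likelihood to a relative entropy is easy to illustrate numerically in the Bernoulli setting of Example 1. The Monte Carlo sketch below (function names illustrative) generates data from the true parameter a and evaluates the log-likelihood ratio of a wrong model b; the per-observation average converges almost surely to −KL(a‖b).

```python
import math
import random

def kl_bernoulli(a, b):
    """Relative entropy between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def avg_loglik_ratio(a, b, t, seed=0):
    """Average per-observation log-likelihood of model b relative to the
    true model a, with data drawn from Bernoulli(a).  By the law of
    large numbers this tends to -KL(a || b)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(t):
        y = 1 if rng.random() < a else 0
        total += math.log(b if y else 1 - b) - math.log(a if y else 1 - a)
    return total / t
```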

Dynamic generators, DR-expectation (DynamicDR)
As in the uncertain prior case, it is often impractical to calculate a recursion for K(p, S) given the high dimension of S. We therefore consider the case when (A, C) depends only on the dynamic parameters D t .

Framework 4 (DynamicDR) As before, the penalty for each model is given by the log-posterior density. From Theorem 3, we know that the log-posterior density is additive. Applying Theorem 5, and the identity κ t (p) = K t (p, S), we conclude that the penalty κ t (p) in (6) can be computed from the stated recursion with initial value κ 0 (p) = π(p), where m t is chosen to ensure inf p∈S + N κ t (p) = 0 for all t.

Remark 16
We expect that there will be less difference between the dynamic uncertain prior and dynamic DR-expectation settings than between the static uncertain prior and static DR-expectation settings. This is because only limited learning is possible in the dynamic DR-expectation: as D t may vary independently at every time, the DR-expectation has only one observation with which to infer the value of each D t . This increases the relative importance of the prior term γ prior , which describes our understanding of typical values of the generator. In practice, the key distinction between the dynamic DR-expectation and uncertain prior models appears when the initial penalty is near zero: in this case, the DR-expectation regularizes the initial state quickly, while the uncertain prior model may remain near zero indefinitely.
Example 3 In the setting of Example 2, as the dynamics are perfectly known, there is again no difference between the dynamic and static generator DR-expectation cases.
A continuous-time version of this setting (for a Kalman-Bucy filter) is considered in Allan and Cohen (2020).

Expectations of the future
The nonlinear expectations considered above do not consider how the future will evolve. In particular, we have focussed our attention on calculating E y t (φ(X t )), that is, on the expectation of functions of the current hidden state. In other words, we can consider our nonlinear expectation as a mapping on functions of the current state. If we wish to calculate expectations of future states, then we may wish to do so in a filtration-consistent manner. This is of particular importance when considering optimal control problems.
Definition 8 For a fixed horizon T > 0, suppose that for each t < T we have a mapping E(·|Y t ) : L ∞ (Y T ) → L ∞ (Y t ). We say that E is a Y-consistent convex expectation if E(·|Y t ) satisfies the following assumptions, analogous to those above:
• Strict Monotonicity: for any ξ 1 , ξ 2 ∈ L ∞ (Y T ), if ξ 1 ≥ ξ 2 a.s., then E(ξ 1 |Y t ) ≥ E(ξ 2 |Y t ) a.s., and if, in addition, E(ξ 1 |Y t ) = E(ξ 2 |Y t ) then ξ 1 = ξ 2 a.s.;
together with further assumptions analogous to those given previously. The assumption of Y-consistency is sometimes simply called recursivity, time consistency, or dynamic consistency (and is closely related to the validity of the dynamic programming principle); however, it is important to note that this depends on the choice of filtration. In our context, consistency with the observation filtration Y is natural, as this describes the information available for us to make decisions.

Remark 17
Definition 8 is equivalent to considering a lower semicontinuous convex expectation, as in Definition 3, and assuming that for any ξ ∈ L ∞ (Y T ) and any t < T , there exists a random variable ξ t ∈ L ∞ (Y t ) such that E(I A ξ) = E(I A ξ t ) for all A ∈ Y t . In this case, one can define E(ξ |Y t ) = ξ t and verify that the definition given is satisfied (see Föllmer and Schied (2002b) and Cohen and Elliott (2010)).
Much work has been done on the construction of dynamic nonlinear expectations (see, for example, Epstein and Schneider (2003), Duffie and Epstein (1992), El Karoui et al. (1997), and Cohen and Elliott (2010), and references therein). In particular, close relations have been drawn between these operators and the theory of BSDEs (for a setting covering the discrete-time examples we consider here, see Cohen and Elliott (2010) and Cohen and Elliott (2011)).

Remark 18
The importance of Y-consistency is twofold: First, it guarantees that, when using a nonlinear expectation to construct the value function for a control problem, an optimal policy will be consistent in the sense that (assuming an optimal policy exists) a policy which is optimal at time zero will remain optimal in the future. Second, {Y t } t≥0 -consistency allows the nonlinear expectation to be calculated recursively, working backwards from a terminal time. This leads to a considerable simplification numerically, as it avoids a curse of dimensionality in intertemporal control problems.

Remark 19
One issue in our setting is that our lack of knowledge does not simply line up with the arrow of time: we are unaware of events which occurred in the past, as well as those which are in the future. This leads to delicacies in questions of dynamic consistency. Conventionally, this has often been considered in a setting of "partially observed control", and these issues are resolved by taking the filter state p t to play the role of a state variable, and solving the corresponding "fully observed control problem" with p t as the underlying state. In our context, we do not know the value of p t ; instead we have the (even higher-dimensional) penalty function K t as a state variable.
In the following sections, we will outline how our earlier approach can be extended to provide a dynamically consistent expectation, and how enforcing dynamic consistency will modify our perception of risk.

Asynchronous expectations
We will focus our attention on constructing a dynamically consistent nonlinear expectation for random variables in L ∞ (σ (X T ) ⊗ Y T ), given observations up to times t < T . Throughout this section, we will use the following construction:

Definition 9 Suppose we have a nonlinear expectation constructed for our nonlinear filtering problem, as above, and are given a Y-consistent family of maps E(·|Y t ). We then extend E to variables in L ∞ (σ (X T ) ⊗ Y T ) by composition. Given this definition, our key aim is to construct the Y-consistent family E(·|Y t ) in a way which "agrees" with our uncertainty in the underlying filter. As we are in discrete time, we can construct a Y-consistent family through recursion, if we have its definition over each single step. The definition of the DR-expectation can be applied to generate these one-step expectations in a natural way.
Definition 10 For R an additive penalty function (Definition 4), we define the one-step expectation, for ξ ∈ L ∞ (Y t+1 ), as a supremum of penalized conditional expectations, where the essential supremum is taken among the bounded Y t -measurable random variables. Using this, we define a Y-consistent expectation on L ∞ (σ (X T ) ⊗ Y T ) by recursion.

Remark 20 It is necessary to use the penalty R t+1 in this definition, as our penalty should include the behaviour of the generator C t+1 (·), which determines the distribution of Y t+1 |X t+1 .
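In the simplest discretization, with finitely many candidate models and a scalar penalty attached to each, such a penalized supremum collapses to a pointwise maximum. The following toy sketch (names illustrative) shows the structure of the one-step expectation: each model contributes its expectation minus its penalty, and implausible models are excluded by an infinite penalty.

```python
def one_step_expectation(xi_values, model_probs, penalties):
    """Convex (upper) expectation over a finite set of candidate models:
    the maximum over models Q of E_Q[xi] - penalty(Q).  A toy,
    fully discrete version of the one-step expectation; the paper works
    with general penalty functions and essential suprema."""
    best = -float("inf")
    for probs, pen in zip(model_probs, penalties):
        e = sum(p * x for p, x in zip(probs, xi_values)) - pen
        best = max(best, e)
    return best
```

Setting all penalties to zero recovers the worst-case (sublinear) expectation over the model set; a large penalty on a model effectively removes it from consideration.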
Recall that, as Y is generated by Y, the Doob–Dynkin lemma states that any Y t+1 -measurable function ξ is simply a function of {Y s } s≤t+1 . For any conditionally Markov measure Q with generator (A t , C t (·)) t≥0 , the conditional expectation of ξ given Y t can be written in terms of the filter state. In particular, we apply this to our penalty function to define the function R̃ such that (9) holds. Applying this to our definition of E, we obtain the following representation.

Lemma 2 The one-step expectation E can be written
where K is the dynamic penalty constructed in Theorem 5 and (A t+1 , C t+1 (·)) ≡ (S, D t+1 ).
Proof We know that E Q [ξ |Y t ] depends on Q only through A t+1 , C t+1 and p t , or equivalently, through the parameters S, D t+1 and p t . In particular, as R is additive, we can substitute in its structure and simplify using the definition of K in Theorem 5. Using the definition of R̃, we change these conditional expectations to integrals, and obtain the desired representation.

Remark 21
There is a surprising form of double-counting of the penalty here. To see this, let us assume φ does not depend on Y. If we consider ξ t+1 = E y t+1 (φ(X t+1 )), then we have included a penalty for the proposed model at t + 1, through K t+1 (p), the penalty associated with the filter state at time t + 1, which includes the penalty γ t+1 on the parameters S and D t+1 . When we calculate E(ξ t+1 |Y t ), we do so by using the penalty K(p t , S) + γ t+1 (S, D t+1 , {Y s } s≤t+1 , p t ), which again includes the term γ t+1 penalizing unreasonable values of the parameters S and D t+1 . This "double counting" of the penalty corresponds to us including both our "uncertainty at time t + 1" (in E y t+1 ), and also our "uncertainty at t about our uncertainty at t + 1" (in E(·|Y t )).

Remark 22
One should be careful in this setting, as the recursively-defined nonlinear expectation will be optimized for a different value of S at every time. As S is considered to be a static penalty, this is an internal inconsistency in the modelling of our uncertainty-we always estimate assuming that S has never changed, but evaluate the future by considering our possible future opinions of the value of S.

Review of BSDE theory
While it is useful to give a recursive definition of our nonlinear expectation, a better understanding of its dynamics is of practical importance. In what follows, for the dynamic generator case, we consider the corresponding BSDE theory, assuming that Y t can take only finitely many values, as in Cohen and Elliott (2010). We now present the key results of Cohen and Elliott (2010), in a simplified setting.
In what follows, we suppose that Y takes d values, which we associate with the standard basis vectors in R d . For simplicity, we write 1 for the vector in R d with all components 1.

Definition 11
Write P̄ for a probability measure under which {Y t } t≥0 is an i.i.d. sequence, uniformly distributed over the d states, and M for the P̄-martingale difference process M t = Y t − d −1 1. As in Cohen and Elliott (2010), M has the property that any Y-adapted P̄-martingale L can be represented by L t = L 0 + Σ 0≤s<t Z s M s+1 for some Z (and Z is unique up to addition of a multiple of 1).

Remark 23
The construction of Z in fact also shows that, if L is written L t = L(Y 1 , ..., Y t−1 , Y t ), then e i Z t = L(Y 1 , ..., Y t−1 , e i ) for every i (up to addition of a multiple of 1).
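This representation property can be checked directly in the finite-state setting: taking Z t to be the vector with entries L(Y 1 , ..., Y t−1 , e i ), the product of Z t with M t+1 = e Y t+1 − d −1 1 recovers the martingale increment under the uniform reference measure. A minimal sketch (function name illustrative):

```python
import numpy as np

def representation_increment(L_values, y_index):
    """Martingale representation check: with M_{t+1} = e_{Y_{t+1}} - (1/d)1
    and Z_t the vector with entries Z_i = L(Y_1, ..., Y_{t-1}, e_i), the
    inner product Z_t . M_{t+1} equals L_{t+1} minus the uniform average
    of the possible values, i.e. the martingale increment under the
    uniform reference measure.  L_values[i] = L(..., e_i)."""
    d = len(L_values)
    Z = np.asarray(L_values, dtype=float)
    M = -np.full(d, 1.0 / d)   # martingale difference: e_{y} - (1/d) 1
    M[y_index] += 1.0
    return Z @ M               # = L_values[y_index] - mean(L_values)
```

Averaging the increment uniformly over the d outcomes gives zero, confirming the martingale difference property.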

Theorem 8
The following two statements are equivalent.
There exists a driver f which is balanced, independent of ξ , and satisfies the normalisation condition f (ω, t, ξ t , 0) = 0, such that, for all ξ T , the value of ξ t = E(ξ T |Y t ) is the solution to a BSDE with terminal condition ξ T and driver f.
Furthermore, these two statements are related by the equation

BSDEs for future expectations
By applying the above general theory, we can easily see that our nonlinear expectation has a representation as the solution to a particular BSDE.

Theorem 9
The dynamically consistent expectation satisfies a BSDE with an explicitly computable driver.

Proof As ξ t+1 is Y t+1 -measurable, by the Doob–Dynkin lemma there exists a Borel measurable function ξ̃ t+1 such that ξ t+1 = ξ̃ t+1 (Y t+1 ) (omitting to write {Y s } s≤t as an argument). We write Z t for the vector containing each of the values of this function. From the definition of M, as in the proof of the martingale representation theorem in Cohen and Elliott (2010), the martingale increment can be written in terms of Z t . We then calculate, using Lemma 2 (simplified to our finite-state setting and omitting {Y s } s≤t as an argument), and the result follows by rearrangement.

A control problem with uncertain filtering
In this final section, we consider the solution of a simple control problem under uncertainty, using the formal structures previously developed. In some ways, this approach is similar to those considered by Bielecki et al. (2017), where the DR-expectation is replaced by an approximate confidence interval. (Taking k = ∞ in a StaticDR model would give a very similar problem to the one they consider.) A key complexity in doing this is that our uncertainty does not agree with the arrow of time: at time t, we do not know the future values of {Y s , X s } s>t (as is typical for stochastic control), but we also do not know the values of {X s } s≤t , even though these have an indirect impact on our costs. Suppose a controller selects a control u from a set U, which we assume is a countable union of compact metrizable sets. Controls are required to be Y-predictable (i.e., u t is Y t−1 -measurable), and we write U for the space of such controls, and u = (u 1 , ..., u T ) for the vector of controls at every time.
A control has an impact on the generator of X, Y , through modifying the penalty function γ , which describes the "reasonable" models for the transition matrix A and the distribution of observations C. In particular, for a given u the term γ t in the additive structure of R t is permitted to depend on u t . We assume γ t (· · · ; u t ) is continuous in u t for every value of its other arguments. This is a variant on a standard weak formulation of the control problem-our agent no longer selects the generators (A, C(·)) directly, but instead modifies the penalty determining which values of (A, C(·)) are 'reasonable models'.
(11) We then observe that E u,K (·|Y t ) is formally independent of (u 1 , ..., u t−1 ). Nevertheless, the effective value of K t depends on (u 1 , ..., u t−1 ), as these now appear in the γ terms appearing in Theorem 5.
The controller wishes to minimize an expected cost, where K 0 = κ prior is the uncertainty before the control problem begins. Here C is a terminal cost, which may depend on the hidden state X T , and L is a running cost, which depends on the control u t+1 used at time t. We assume C and L are continuous in u (almost surely). We think of the cost L t as being paid at time t, depending on the choice of control u t+1 (which will affect the generator at time t + 1). For notational simplicity, we omit Y as an argument when unnecessary.

Remark 24
We do not allow L t to depend on X t , as this may lead to paradoxes (as the agent could learn information about the hidden state by observing their running costs).

Remark 25
We define our expected cost using the Y-consistent expectation E u,K , rather than the (inconsistent) DR-expectation E y t , as this leads to time-consistency in the choice of controls.

Remark 26
We see that the calculation of the value function is a "minimax" problem, in that V minimizes the cost, which we evaluate using a maximum over a set of models. However, given the potential for learning, the requirement for time consistency, and the uncertainties involved, it is not clear that one can write V explicitly in terms of a single minimization and maximization of a given function.

Remark 27
As the filter-state penalty K is a general function depending on the control, and Y only takes finitely many states, it is not generally possible to express the effect (on K) of a control through a change of measure relative to some reference dynamics. In particular, we face the problem that controls u s for times s < T have an impact on the terminal cost V T = E u,K T (C(X T )|Y T ), through their impact on the uncertainty K T . Unlike in a traditional control problem, V T is not independent of u given Y T ; this is the problem of the 'arrow of time' mentioned at the start of this section. For this reason, even though we model the impact of a control through its effect on the generator, we cannot give a fully "weak" formulation of our control problem, and are restricted to a "Markovian" setting with K as a state variable.
Theorem 10 The value function satisfies a dynamic programming principle; in particular, if an optimal control u * exists, then the stated recursion holds for every t ≤ T , where K (u * ,K t−1 ) t is the one-step solution of the recursion of Theorem 5 using the control u * .
A similar result also holds if we only assume an ε-optimal control exists for every ε > 0.
Proof This effectively falls into the setting of our abstract dynamic programming principle (in particular, Corollary 1), with a time reversal. For any control u, using the recursivity of E and writing κ t = κ (u,κ t−1 ) t for simplicity, we obtain a recursion for J (ω, t − 1, K t−1 , u). In reversed time τ = T − t, we have the state variable z τ = K T −t , which is defined using a backward recursion (in terms of τ ) by Theorem 5. The operator A τ is then monotone and continuous in its first argument. The result then follows from Corollary 1.

Remark 28
The appeal to Corollary 1 is slightly complicated by the fact that V is a random variable, rather than simply a scalar value. The reader can verify that this does not affect the proof of Theorem 4 significantly, as we are in a finitely generated probability space (so measurable selection arguments remain straightforward), and the operator A forces the solution to have the desired measurability through time.
By combining this dynamic programming property with the definition of the one-step expectation, we can write down a difference equation which V must solve.
Here, K (u,K t−1 ) t is the value of K t starting at K t−1 , with control u at time t, evolving following Theorem 5, and R̃ is as in (11).
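The backward minimax recursion underlying this value function can be sketched in a toy, fully discretized setting as follows. All names are hypothetical: the penalty-state K is replaced here by a generic finite set of models with scalar penalties, so this illustrates only the min-max structure V_t = inf over controls of [running cost + sup over models of (expected continuation value minus penalty)], not the paper's exact recursion.

```python
def robust_value(t, T, state, controls, models, step, gamma, running, terminal):
    """Backward minimax value recursion (a sketch):
    V_T = terminal cost; for t < T,
    V_t = min_u [ running(state, u)
                  + max_Q ( E_Q[V_{t+1}] - gamma(Q, u) ) ].
    `step(state, u, Q)` returns a list of (next_state, probability)
    pairs under model Q and control u."""
    if t == T:
        return terminal(state)
    best = float("inf")
    for u in controls:
        worst = -float("inf")
        for Q in models:
            ev = sum(p * robust_value(t + 1, T, s, controls, models,
                                      step, gamma, running, terminal)
                     for s, p in step(state, u, Q))
            worst = max(worst, ev - gamma(Q, u))
        best = min(best, running(state, u) + worst)
    return best
```

The inner maximum plays the role of the penalized nonlinear expectation, and the outer minimum is the controller's choice; the recursion mirrors the dynamic programming principle of Theorem 10, with the penalty state suppressed for simplicity.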

Corollary 2 A control is optimal if and only if it achieves the infimum in the formula for V above.
Remark 29 If we assume that the terminal cost depends only on X T (and not on Y), and the running cost does not depend on Y, then one can observe a Markov property to the control problem, that is, V s is conditionally independent of Y given K s . The corresponding optimal controls can then also be taken only to depend on K s .