## Abstract

Optimal control theory and machine learning techniques are combined to formulate and solve in closed form an optimal control formulation of online learning from supervised examples with regularization of the updates. The connections with the classical linear quadratic gaussian (LQG) optimal control problem, of which the proposed learning paradigm is a nontrivial variation because it involves random matrices, are investigated. The obtained optimal solutions are compared with the Kalman filter estimate of the parameter vector to be learned. It is shown that the proposed algorithm is less sensitive to outliers than the Kalman estimate (thanks to the presence of the regularization term), thus providing estimates that are smoother with respect to time. The basic formulation of the proposed online learning framework refers to a discrete-time setting with a finite learning horizon and a linear model. Various extensions are investigated, including the infinite learning horizon and, via the so-called kernel trick, the case of nonlinear models.

## 1 Introduction

In recent years, the combination of techniques from the fields of optimization/control and machine learning has led to a successful interaction between the two disciplines; reinforcement learning is a prominent example (Bertsekas, 1996b; Sutton & Barto, 1998). The cross-fertilization between these two fields goes in both directions.

### 1.1 Application of Machine Learning Techniques to Optimization/Control

Sparsity-inducing regularization techniques from machine learning have often been exploited to find suboptimal solutions to an initially unregularized optimization problem, having at the same time a sufficiently small number of nonzero arguments. For instance, the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) was applied in Gnecco, Morisi, and Bemporad (2015) to consensus problems and in Gallieri and Maciejowski (2012) to model predictive control (MPC).

Applications of machine learning techniques to control can be found in Suykens, Vandewalle, and De Moor (2001) and in the series of papers Gnecco and Sanguineti (2010) and Gaggero, Gnecco, and Sanguineti (2013, 2014), where least squares support vector machines (LS-SVMs) and one-hidden-layer perceptron neural networks, respectively, were applied to find suboptimal solutions to optimal control problems. Mesbahi and Egerstedt (2010) applied spectral graph theory methods, already exploited successfully in machine learning problems (Belkin & Niyogi, 2006), to the control of multiagent dynamical systems.

### 1.2 Application of Optimization/Control Techniques to Machine Learning

This is the direction we follow in this letter: we develop an approach to machine learning that exploits techniques from optimization and control (particularly, optimal control). We focus on the problem of estimating online an unknown parameter vector modeling a noisy linear relationship between input examples and their output labels. The emphasis is on linear regression, but an extension to nonlinear regression is also detailed. Moreover, the model can be applied to linear (respectively, nonlinear) binary classification by interpreting a suitable nonlinear transformation (e.g., a sigmoidal mapping) of the output as the a posteriori probability (given the input example) of belonging to one of the two classes.

Specifically, we propose and solve in closed form an optimal control formulation of online learning with supervised examples and a regularization of the updates. In the online framework, the examples become available one by one as time passes, and the training of the learning machine is performed continuously. Online learning problems have been investigated in, for example, Zinkevich (2003), Ralaivola and d’Alché Buc (2005), Smale and Tao (2006), Ying and Pontil (2008), Liu, Park, Wang, and Príncipe (2009), and Shalev-Shwartz (2012), but without using an approach based on optimal control theory. As suggested by the preliminary results that we obtained in Gnecco, Bemporad, Gori, Morisi, and Sanguineti (2015), such an approach can provide a strong theoretical foundation for the choice of a specific online learning algorithm by selecting the updates of the parameter estimate as the outputs of a sequence of control laws that solve a suitable optimal control problem modeling online learning itself.^{1} A distinguishing feature of our study is that we derive online learning algorithms as closed-form optimal solutions to suitable online learning problems. In contrast, work in the literature typically proposes a certain algorithm and then investigates its properties, but does not analyze the optimality of such an algorithm with respect to a suitable online learning problem. A remarkable exception is Bertsekas (1996a), but it refers to a deterministic optimization problem and, different from our approach, does not contain any regularization of the updates of the parameter estimate.

In a nutshell, our contributions are the following:

- •
We make the machine learning community aware of a point of view that until now might have been overlooked.

- •
By exploiting such a viewpoint, we develop a novel machine learning paradigm, for which we provide closed-form solutions.

- •
We make connections between our results and other machine learning algorithms.

### 1.3 The Adopted Learning Model

The learning model that we adopt can be considered a nontrivial variation (due to the presence of suitable random matrices) of the linear quadratic (LQ) and linear quadratic gaussian (LQG) optimal control problems, which we briefly summarize. The LQ problem (Bertsekas, 1995) consists of the minimization—with respect to a set of control laws, one for each decision stage—of a convex quadratic cost related to the control of a linear dynamical system, which is decomposed into the summation of several convex quadratic per stage costs, associated with a priori given cost matrices. At each stage, a control law is applied. It is a function of an information vector, which collects all the information available to the controller up to that stage. More precisely, the information vector is formed by the sequence of controls applied to the dynamical system up to the previous stage and by the sequence of measures of the state of the dynamical system itself, acquired up to the current stage. A peculiarity of the LQ problem is that such measures are linearly related to the state, again through suitable a priori given measurement matrices. The measures may be corrupted by additive noise, with given covariance matrices. When all the noise vectors are gaussian, one obtains the LQG problem, for which closed-form optimal control laws in feedback form are known.^{2} They are computed by solving recursively suitable Riccati equations and applying the Kalman filter (Sutton, 1992) to estimate the current state of the dynamical system.
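For readers who prefer code to equations, the backward Riccati recursion underlying the classical finite-horizon LQR solution summarized above can be sketched as follows. This is a minimal sketch with generic, deterministic matrices `A`, `B`, `Q`, `R` (all names here are illustrative); the letter's variant instead replaces the deterministic cost matrices with expectations of random ones.

```python
import numpy as np

def lqr_backward(A, B, Q, R, Q_terminal, N):
    """Backward Riccati recursion for the finite-horizon LQR problem.

    Returns the gain matrices L_0, ..., L_{N-1}; in the deterministic,
    fully observed case the optimal control is u_k = -L_k x_k.
    """
    K = Q_terminal                      # cost-to-go matrix at the final stage
    gains = []
    for _ in range(N):
        # Gain minimizing the quadratic cost-to-go at this stage.
        L = np.linalg.solve(R + B.T @ K @ B, B.T @ K @ A)
        # Riccati update for the cost-to-go matrix.
        K = Q + A.T @ K @ (A - B @ L)
        gains.append(L)
    gains.reverse()                     # computed from stage N-1 down to 0
    return gains

# Double-integrator example with hypothetical weight matrices.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
gains = lqr_backward(A, B, np.eye(2), np.eye(1), np.eye(2), 20)
```

Since the recursion runs backward from the terminal cost, the whole sequence of gains can be computed before any state is observed, which is the sense in which the LQR part of the problem is solvable "offline."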

The main difference between the LQ/LQG problems and the proposed formulation of online learning with supervised examples and regularization is the following. In our approach, both the cost and measurement matrices are random, being associated with the input examples, which become available as time goes on. It is worth mentioning that the randomness of some matrices in the context of the LQ optimal control problem was also considered in Bertsekas (1995), but in a way not directly applicable to the online learning problem investigated in this letter (see remark 16 in section 3.2 for further details). First, we consider a linear relationship between the input examples and their labels, possibly corrupted by additive noise, and collect into the state vector both the current estimate of the parameter vector modeling the input-output relationship and the parameter vector itself, which is unknown. Then we relax the linearity assumption and address a more general nonlinear context. The goal consists in finding an optimal online estimate of the parameter vector, on the basis of the information associated with the incoming examples, modeled in the simplest case as independent and identically distributed random vectors.^{3}

Each decision stage corresponds to the presentation of one example to the learning machine, and one part of the convex per stage cost quadratically penalizes the difference between the observed output and the one predicted by the learning machine using the current estimate of the parameter vector. Causality in the updates (i.e., their independence of future examples, which is important for a truly “online” framework) is preserved by constraining the updates to depend only on the history of the observations and updates up to the decision time, as in the LQ/LQG problems.

At each stage, the error on future examples is taken into account through the conditional expectation of the summation of the associated per stage costs, conditioned on the current information vector. In the stationary case, the link between the examples used for training and the future examples is only in their common generation model. In order to reduce the influence of outliers on the online estimate of the parameter vector, its smoothness with respect to time is also enforced through the presence of a suitable regularization term in the functional to be optimized, weighted by a positive regularization parameter.^{4} This represents the second part of the convex and quadratic per stage cost. The optimal solution is obtained by applying dynamic programming (DP) and requires the solution of suitable Riccati equations. The difference already noted between the classical LQ/LQG problems and the proposed online learning framework (i.e., the random nature of some matrices) determines two different forms for such equations—for the backward and forward phases of DP, respectively. When the optimization horizon is infinite, it is necessary to take into account the random nature of the matrices to perform a convergence analysis of the online estimate of the unknown parameter vector.
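Schematically, the two forms just mentioned can be contrasted as follows. The sketch below is written under assumptions introduced purely for illustration (error dynamics $e_{k+1} = e_k - u_k$ for the estimation error, per stage cost matrix $Q_k = x_k x_k^\top$, measurement matrix $C_k = x_k^\top$ associated with the input example $x_k$, regularization parameter $\lambda$, and noise variance $\sigma^2$); the exact statements are the equations derived in section 3.

```latex
% Average Riccati equation (ARE): backward in time, deterministic,
% driven by the *expectation* of the random cost matrix Q_k.
K_k = \mathbb{E}[Q_k] + K_{k+1} - K_{k+1}\left(\lambda I + K_{k+1}\right)^{-1} K_{k+1}

% Stochastic Riccati equation (SRE): forward in time, data driven,
% driven by the *realization* C_k = x_k^{\top} of the measurement matrix.
P_{k+1} = P_k - P_k C_k^{\top}\left(C_k P_k C_k^{\top} + \sigma^{2}\right)^{-1} C_k P_k
```

The first recursion can be run offline, before any example arrives; the second must be run online, as each step consumes the realization of a new input example.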

Table 1 provides the correspondence between the notations used for optimal control and the ones used for the proposed online learning framework.

| Online Learning | Optimal Control |
| --- | --- |
| Update | Control |
| Updating function | Control function |
| Problem OLL (Online Learning) | Optimal control problem |
| Learning horizon | Optimization horizon |
| Learning functional | Cost functional |
| Average learning functional | Average cost functional |


### 1.4 Relationships with Other Machine Learning Techniques

The approaches to online learning most closely related to this work are Kalman filtering (Sutton, 1992; see also Bertsekas, 1995, appendix E) and its kernel version (Ralaivola & d’Alché Buc, 2005; Liu et al., 2009), in which, however, no penalization is imposed directly on the control (updating) variables. Indeed, one of our contributions consists in developing a theoretical framework in which such a penalization is taken into account and in providing closed-form results in most cases. Interestingly, the obtained solutions can be interpreted as smoothed versions (with respect to time) of the solution obtained by applying the Kalman filter alone. Most important, we show, both theoretically and numerically, that our solutions are less sensitive to outliers than the Kalman filter estimates. This is very useful to obtain estimates that change more smoothly with respect to time, for example, when one wants to give more importance to a whole set of recently presented examples than to the current example alone (smoothness of an estimate is a desirable property, e.g., in applications to online system identification and control, in which one also has to control the system just identified). A smaller sensitivity to outliers is especially important in the case of deviations from the gaussianity assumptions (which cannot be a priori excluded in practical applications) or of a mismatch between the true covariance matrices of the noise and the ones used to update the Kalman filter estimates. Other motivations for the presence in our formulation of a penalization of the updating variables are the numerical benefits due to the regularization term (a typical way of dealing with ill-conditioning through regularization approaches) and a better behavior when the online learning problem is extended to the case of a time-varying parameter vector. All of these issues are discussed later in the letter.
It is also worth mentioning that various forms of regularization of the Kalman filter (and, in an offline context, of the Kalman smoother) were also proposed in the recent literature (e.g., Jay, Duvaut, Darolles, & Gouriéroux, 2011; Wang, Li, & Rizos, 2012; Li, Gui, Gu, Han, & Du, 2014; Liu & Chang, 2016; Khan, Bouaynaya, & Fathallah-Shaykh, 2014), where, however, an optimal control approach was not used.
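As a stylized numerical illustration of the smoothing effect discussed above, one can compare a plain Kalman filter step with a shrunken update on a data stream containing an outlier. This is not the letter's closed-form optimal solution: the shrinkage factor `alpha` is a hypothetical stand-in for the effect of the regularization term on the updates.

```python
def kf_step(w_hat, P, x, y, sigma2):
    """One scalar Kalman filter step for a static (constant) parameter."""
    s = x * P * x + sigma2              # innovation variance
    K = P * x / s                       # Kalman gain
    return w_hat + K * (y - x * w_hat), P - K * x * P

# Stylized stream: two ordinary examples, then an outlier label.
sigma2, alpha = 1.0, 0.3                # alpha < 1: hypothetical shrinkage
data = [(1.0, 0.9), (1.0, 1.1), (1.0, 10.0)]

w_kf, P_kf = 0.0, 1.0                   # plain Kalman filter
w_sm, P_sm = 0.0, 1.0                   # "smoothed" (regularized-like) variant
for x, y in data:
    w_new, P_kf = kf_step(w_kf, P_kf, x, y, sigma2)
    jump_kf, w_kf = abs(w_new - w_kf), w_new
    w_prop, P_sm = kf_step(w_sm, P_sm, x, y, sigma2)
    jump_sm = alpha * abs(w_prop - w_sm)    # the correction is shrunk
    w_sm += alpha * (w_prop - w_sm)

# On the outlier, the shrunken update moves far less than the KF one.
assert jump_sm < jump_kf
```

The price of the smaller jump on the outlier is, of course, slower tracking on legitimate data, which is exactly the trade-off governed by the regularization parameter in the letter's formulation.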

For each decision stage, the updating formula that provides the solution to the proposed learning paradigm is similar to the ones of other online estimates obtained through various machine learning techniques such as stochastic gradient descent. However, there is a substantial difference: we derive it as the optimal solution of an optimization problem modeling online learning, and this allows us to prove various interesting properties. We believe that this approach could also be fruitfully applied to other machine learning techniques used in online learning.

A number of extensions of the basic problem formulation are also described in some detail at the end of the letter, showing the generality of our theoretical framework, highlighting the large potential of applying optimal control theory to machine learning, and providing hints to stimulate further interdisciplinary research on this topic. A connection with neuroscience is discussed, too.

### 1.5 Organization of the Letter

Section 2 is a nontechnical overview of the main results derived in the letter, written to allow readers who are not familiar with optimal control, but work in the field of machine learning, to appreciate the nature of our approach and its contributions. At the same time, it provides a summary of the main results of the letter. Section 3 introduces and solves the proposed model of online learning as an LQ optimal control problem with random matrices and finite optimization horizon and provides closed-form expressions for the optimal solutions in the LQG case. Section 4 extends the analysis to the infinite-horizon framework. Section 5 investigates convergence properties of the algorithm, whereas section 6 compares the proposed online approach with average regret minimization and the Kalman filter estimates, both theoretically and numerically. Section 7 extends the analysis to nonlinear models (kernel methods). Other extensions are described in section 8. Section 9 is a concluding discussion. To improve readability, most technical proofs are collected in the appendix.

## 2 Overview of the Main Results

In the following, we summarize the main results with links to the parts of the letter in which they are presented:

• *We derive closed-form optimal solutions* for the proposed optimal control formulation of online learning and for some of its extensions. They are expressed in terms of two Riccati equations (see section 3), associated, respectively, with the backward phase of DP (to determine the gain matrix of the optimal controller) and with the determination of the gain matrix in the Kalman filter (in the case of gaussian random vectors). Different from the LQG problem, the two Riccati equations have different natures: one involves expectations of random matrices (so it may be called an average Riccati equation), whereas the other involves realizations of random matrices (so it may be called a stochastic Riccati equation). As a consequence, a specific analysis, detailed in the letter, is needed to study the properties of their solutions; this confirms that the proposed problem is not a trivial application of the LQG problem to an online learning framework.

• *We analyze both theoretically and numerically the role of the regularization parameter* (see section 3.3).

• *In the infinite-horizon case, we investigate the existence of a stationary (and linear) optimal updating function* (see section 4), and *stability issues, that is, the convergence to 0 of the mean square error between the parameter vector and its online estimate* when the number of examples goes to infinity (see section 5). In this context, another nontrivial difference with respect to the classical LQG problem arises. When computing certain expectations conditioned on the current information vector, one has to take into account that the information vector at a generic stage has additional components deriving from the realizations of the output measurement matrices up to the stage itself (as these are random matrices, associated with the input examples). As a consequence, the Kalman gain matrix, which is shown in the letter to be embedded in the optimal solution, is not only stage dependent but also a random matrix (although it becomes deterministic when conditioned on the input examples already presented to the learning machine). This motivates the investigation of issues such as its convergence in probability and the convergence of its expectation when the number of examples goes to infinity.

• *We discuss the connection of the proposed online learning framework with average regret minimization*. We prove that the sequence of our estimates minimizes the average regret functional (see section 6.1).

• *We investigate the connections among our solution, the Kalman filter estimate, and stochastic gradient descent* (see remark 9 and sections 6.2–6.4). We prove that our solution can be interpreted as a smoothed Kalman filter estimate with a time-varying gain matrix, and we show that it outperforms the latter in terms of its larger smoothness (section 6.3; see also section 8*e*) and its smaller sensitivity to outliers (section 6.4).

• *We discuss cases in which the proposed solution can be computed efficiently* (see, e.g., the comments presented after proposition 15, remark 24, and the numerical results reported in section 6.3).

• *We address the case of nonlinear input-output relationships, modeled using the kernel trick of kernel machines* (see section 7). As is well known, the kernel trick is based on a preliminary (in general nonlinear) mapping of the input space to a higher-dimensional feature space, to which the original linear model is applied in a second step. The kernel trick, which consists of computing certain inner products in the auxiliary feature space through a suitable function called a “kernel,” can be applied in our context since we show that the optimal solution can be expressed in terms of inner products in the feature space that can be computed through a kernel.

• *We describe various other extensions* (see section 8), such as the case of a time-varying parameter vector to be learned, the introduction of a discount factor, the inclusion of additional regularization terms, a continuous-time extension framework, and a possible extension of the problem formulation through techniques from robust estimation and control.
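To make the kernel trick mentioned in the items above concrete, here is a minimal check that a kernel evaluation reproduces an inner product in feature space. The degree-2 polynomial kernel used here is a hypothetical example chosen only because its feature map can be written out explicitly; any kernel admitting a feature map works the same way.

```python
import numpy as np

def poly2_features(x):
    """Explicit degree-2 polynomial feature map phi(x) for a 2-D input.

    phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1),
    chosen so that <phi(x), phi(z)> = (x . z + 1)^2.
    """
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly2_kernel(x, z):
    """Kernel evaluation replacing the explicit feature-space inner product."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])
# The kernel computes the feature-space inner product without forming phi.
assert np.isclose(np.dot(poly2_features(x), poly2_features(z)),
                  poly2_kernel(x, z))
```

The point exploited in section 7 is precisely this: whenever an algorithm touches the feature space only through such inner products, the explicit map never needs to be built.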

Table 2 collects some acronyms of frequent use in the letter.

| Acronym | Meaning |
| --- | --- |
| Problem OLL | Online learning problem over finite horizon *N* and with a given regularization parameter |
| Problem OLL_∞ | Online learning problem over infinite horizon and with a given regularization parameter |
| OLL estimate | Estimate obtained solving problem OLL or problem OLL_∞ |
| LQ | Linear quadratic |
| LQG | Linear quadratic gaussian |
| LQR | Linear quadratic regulator |
| ARE | Average Riccati equation |
| SRE | Stochastic Riccati equation |
| KF | Kalman filter |
| MSE | Mean square error |


## 3 The Basic Case: Discrete Time, Finite Horizon, and Linear Model

For simplicity, we first consider a discrete-time setting with a finite learning horizon and a linear model. Then we address the extensions to an infinite learning horizon and to nonlinear models.

### 3.1 Problem Formulation

At each time instant $k = 0, 1, \ldots$, the output label $y_k$ is generated from the input $x_k$ according to the following linear model:

$$y_k = x_k^\top w + \varepsilon_k, \qquad (3.1)$$

where $\varepsilon_k$ is a measurement noise and $w$ is a random column vector, unknown to the learning machine, and to be estimated by the learning machine itself by using the sequence of examples $(x_k, y_k)$ as they become available.

The random variables $w$, $x_k$, and $\varepsilon_k$, $k = 0, 1, \ldots$, are mutually independent^{5} and (only for simplicity of notation and without any loss of generality) have mean 0. The random variables $\varepsilon_k$ have the same finite variance $\sigma^2$, and each $x_k$ has a finite covariance matrix.

It is important to observe that model 3.1 is time invariant in the sense that the same $w$ is used to generate every $y_k$, starting from every $x_k$ and every $\varepsilon_k$.^{6} So once a realization of the random vector $w$ has been generated, it can be interpreted as a fixed vector, to be estimated by the learning machine using the online supervised examples.

Hence, the update $u_k$ depends only on the sequence of examples observed up to the current stage $k$ and on the sequence of previous updates $u_j$, $j < k$ (or, equivalently, since each update determines the next estimate of $w$, on the sequence of previous estimates of $w$).

In our model, the updating functions $u_k$ are chosen in order to minimize a learning functional over a finite learning horizon $N$, defined as follows:

$$\mathbb{E}\left[\sum_{k=0}^{N-1}\left(\left(x_k^\top \hat{w}_k - x_k^\top w\right)^2 + \lambda \left\|u_k\right\|_2^2\right)\right], \qquad (3.5)$$

where $\hat{w}_k$ denotes the estimate of $w$ at stage $k$ and $\lambda > 0$ is a regularization parameter.

We state the following online learning problem (in this letter, the superscript “$\ast$” is used to denote optimality):

**Problem OLL** (online learning over a finite horizon). Given the finite learning horizon $N$, the examples generated at each time instant according to the model defined by assumptions 1 and 2, and the learning machine defined by assumption 3, find the finite sequence of optimal updating functions $u_0^\ast, \ldots, u_{N-1}^\ast$, with the structure defined by assumption 5, that minimizes the learning functional 3.5.

Problem OLL can be considered a parameter identification problem or an optimal estimation problem, as the final goal consists of estimating optimally the parameter vector *w* relating the input examples to their outputs, given the current subsequence of examples and the adopted optimality criterion. It can also be considered an optimal control problem, interpreting the updating function *u _{k}* as a control function for the dynamical system 3.3. Although this last interpretation may seem less natural than the first two, it is motivated by the fact that problem OLL and its variations presented later in the letter can be investigated using optimal control techniques, as is done in the following.

For every $k$, we shall call the estimate $\hat{w}_k$ generated by solving problem OLL the *online estimate* (*OLL estimate*, for short).

The term $\left(x_k^\top \hat{w}_k - x_k^\top w\right)^2$ in the learning functional, equation 3.5, penalizes a large deviation of the learning machine estimate $x_k^\top \hat{w}_k$ of the label $y_k$ from its best estimate (in a mean square sense) $x_k^\top w$, which would have been obtained if $w$ were known, whereas the term $\lambda \left\|u_k\right\|_2^2$ penalizes a large squared norm of the update $u_k$ of the estimate of $w$; the regularization parameter $\lambda > 0$ measures the trade-off between the two terms.

In the limit case of a vanishing regularization parameter, the optimal estimate reduces to the conditional expectation of $w$ given the available information, that is, to the Kalman filter estimate of $w$ at time $k$ (see, e.g., Bertsekas, 1995).^{7} It is worth mentioning that since the parameter vector to be learned is constant and the data generation model is described by equation 3.1, this specific Kalman estimation problem is equivalent to recursive least squares (see Sayed, 2003, for a proof of this equivalence).
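The equivalence between the Kalman filter for a constant parameter and recursive least squares is easy to check numerically. The sketch below uses hypothetical choices (zero initial estimate, identity initial covariance, unit noise variance) and verifies that the two recursions produce identical estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2 = 3, 50, 1.0          # dimensions and noise variance (illustrative)
w = rng.normal(size=d)             # fixed parameter vector to be learned
X = rng.normal(size=(n, d))
y = X @ w + rng.normal(scale=np.sqrt(sigma2), size=n)

# Kalman filter for the static state w (identity dynamics, no process noise).
w_kf, P = np.zeros(d), np.eye(d)
for x, yk in zip(X, y):
    s = x @ P @ x + sigma2               # scalar innovation variance
    K = P @ x / s                        # Kalman gain (a vector here)
    w_kf = w_kf + K * (yk - x @ w_kf)    # measurement update
    P = P - np.outer(K, x) @ P           # covariance update

# Recursive least squares with the same initialization P_0 = I.
w_rls, Pr = np.zeros(d), np.eye(d)
for x, yk in zip(X, y):
    g = Pr @ x / (sigma2 + x @ Pr @ x)   # RLS gain
    w_rls = w_rls + g * (yk - x @ w_rls)
    Pr = Pr - np.outer(g, x) @ Pr

assert np.allclose(w_kf, w_rls)          # the two recursions coincide
```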

In sections 6.3 and 6.4, we discuss some relationships of the proposed learning framework with the classical Kalman filter (Sutton, 1992). As shown by proposition 28 and by the numerical results in Figure 1, the presence of the regularization term in the learning functional 3.5 can make the resulting sequence of optimal estimates of $w$ smoother with respect to the time index, and less sensitive to outliers, than the sequence of estimates obtained by using the classical Kalman filter, under gaussian assumptions on the random variables $w$ and $\varepsilon_k$.

The constraint that each update $u_k$ (hence, also each optimal updating function $u_k^\ast$) depends only on the sequence of examples observed up to the current stage $k$ and on the sequence of previous updates $u_j$, $j < k$, implies that no future examples are taken into account to update the current estimate of $w$. Hence, the proposed solution is actually a model of online learning. Instead, batch learning corresponds to the case where one assumes that the whole sequence of examples is known to the learning machine from the initial time.

A possible variation of the learning functional penalizes, instead, the deviation between the prediction $x_k^\top \hat{w}_k$ of the label, made without knowing $y_k$ (but knowing $x_k$), and the label $y_k$ generated by model 3.1 at the time $k$ (note that, different from the term $\left(x_k^\top \hat{w}_k - x_k^\top w\right)^2$, both are observable at time $k$). However, by taking expectations and recalling that $\varepsilon_k$ has mean 0 and is mutually independent from $x_k$, $w$, and $\hat{w}_k$, one obtains

$$\mathbb{E}\left[\sum_{k=0}^{N-1}\left(\left(x_k^\top \hat{w}_k - y_k\right)^2 + \lambda \left\|u_k\right\|_2^2\right)\right] = \mathbb{E}\left[\sum_{k=0}^{N-1}\left(\left(x_k^\top \hat{w}_k - x_k^\top w\right)^2 + \lambda \left\|u_k\right\|_2^2\right)\right] + N\sigma^2. \qquad (3.6)$$

Hence, since the last term in equation 3.6 is a constant, the learning functionals 3.5 and 3.6 have the same sequence of optimal updating functions. It is worth noting that in both formulas 3.5 and 3.6, in order to generate the estimates $\hat{w}_k$, one uses only the probability distribution of $w$ conditioned on the already available observations.
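This equivalence of criteria can be verified numerically. The scalar instance below (with arbitrary hypothetical values for the parameter, the estimate, and the noise level) checks by Monte Carlo sampling that the two expected losses differ by the constant noise variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 200_000, 0.5
w = 1.3                                  # true scalar parameter (illustrative)
w_hat = 0.8                              # any fixed estimate
x = rng.normal(size=n)
y = x * w + rng.normal(scale=sigma, size=n)

# Observable criterion (as in 3.6) versus unobservable criterion plus constant.
lhs = np.mean((x * w_hat - y) ** 2)
rhs = np.mean((x * w_hat - x * w) ** 2) + sigma ** 2
assert abs(lhs - rhs) < 0.01             # equal up to Monte Carlo error
```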

Since the inputs $x_k$ are known at time $k$, one can replace the measures $y_k$ by $z_k := y_k - x_k^\top \hat{w}_k$, hence obtaining the measurement equation

$$z_k = C_k e_k + \varepsilon_k, \qquad (3.8)$$

where $C_k := x_k^\top$, $e_k := w - \hat{w}_k$ is the error vector, and the measurement noise has the same variance $\sigma^2$ as before. In this case, the “history” of the learning machine, measures, and past updates up to time $k$ is collected in the new information vector $\tilde{I}_k$, which gathers the measures $z_j$ and the inputs $x_j$ up to stage $k$ and the updates $u_j$ up to stage $k-1$. There is a one-to-one correspondence between the information vectors $I_k$ and $\tilde{I}_k$. So, the optimization of the learning functional 3.5, assuming that the dynamical system evolves according to equation 3.3, the sequence of measures is provided by equation 3.4, and the updating functions $u_k$ are functions of $I_k$, is equivalent to the optimization of the following learning functional:

$$\mathbb{E}\left[\sum_{k=0}^{N-1}\left(e_k^\top Q_k e_k + \lambda \left\|u_k\right\|_2^2\right)\right], \qquad Q_k := x_k x_k^\top,$$

assuming that the error vector evolves according to equation 3.7, the sequence of measures is provided by equation 3.8, and the update $u_k$ is now a function of the information vector $\tilde{I}_k$. Such a problem is a nontrivial variation of the classical LQ problem (Bertsekas, 1995). Whereas in the latter the matrices $C_k$ and $Q_k$ are deterministic, in the proposed formulation of online learning they are random, since they depend on the input examples $x_k$. Another difference is that, for every stage $j \le k$, the information vector $\tilde{I}_k$ includes the realizations of the inputs $x_j$, hence also of the matrices $C_j$ and $Q_j$.

_{j}### 3.2 Solution of the Finite Horizon Online Learning Problem

Under mild regularity conditions (see remark 12), the cost-to-go functions of the problem can be determined recursively by solving the Bellman equations 3.11 and 3.12, for $k = N-1, \ldots, 0$.

The regularity conditions mentioned above are satisfied, for example, in the case (studied in this letter) where the random vectors $w$ and $\varepsilon_k$ are gaussian. Indeed, in such a context, the optimal updating functions that will be provided by equation 3.13 are linear with respect to the information vector (Bertsekas, 1995; Bertsekas & Shreve, 1978). In the following, these mild regularity conditions will be implicitly assumed throughout the letter.

Equations 3.11 and 3.12 are similar to those for the cost-to-go functions in the LQ problem (see, e.g., Bertsekas, 1995), with the difference that in the present context, the matrices $Q_k$ and $C_k$ are random. Moreover, both matrices become known to the learning machine at time $k$, as they can be derived from the information vector $\tilde{I}_k$. In the following, we sometimes use the superscript “$\ast$” not only for the optimal updating functions but also to denote vectors (e.g., $\hat{w}_k^\ast$ and $e_k^\ast$) evaluated when the sequence of optimal updating functions, equation 3.13, is applied.

_{k}^{1}–

^{5}be satisfied. Then the updating functions that solve problem OLL are given, for , by where and the matrices and are symmetric positive-semidefinite. The recursions above are initialized by and

Equation 3.17 can be called an average Riccati equation (*ARE*) since it contains the expectation term $\mathbb{E}\left[Q_k\right]$. In practice, it can be solved offline, as can the classical deterministic Riccati equation of the linear quadratic regulator (LQR) subproblem (Bertsekas, 1995), simply by replacing $Q_k$ (which is deterministic in the LQ problem) by $\mathbb{E}\left[Q_k\right]$. It is worth remarking that solving the ARE, equation 3.17, does not require the knowledge of future input examples and that all the matrices $L_k$ in equation 3.14 have spectral radius^{8} strictly smaller than 1. Finally, the matrices $F_k$ are reported in formula 3.16 because they are used to express the optimal updates $u_k^\ast$ (see formula A.18 in the appendix). They are also used in the infinite-horizon version of problem OLL (problem OLL_∞), to reduce one part of the proof of proposition 23 in section 4 to the finite-horizon case.

Due to equation 3.13, in order to generate the optimal update at time $k$, one has to compute the conditional expectation $\mathbb{E}\left[e_k \mid \tilde{I}_k\right]$ of the error vector. Let us now make the following additional assumption:

**Assumption 14** (gaussian random variables). The random variables $w$ and $\varepsilon_k$, $k = 0, 1, \ldots$, are gaussian.

When assumption 14 is satisfied, the optimal estimate of the proposed framework tracks the (usually time-varying) Kalman filter estimate. Indeed, inspection of the proof shows that $\mathbb{E}\left[e_k \mid \tilde{I}_k\right]$ is the Kalman filter estimate (KF estimate) of the error vector at the time $k$, based on the information vector $\tilde{I}_k$, thus getting a Kalman filter recursion scheme.

In the following, we denote by $\hat{e}_k$ the KF estimate of the error vector at time $k$, based on the information vector $I_k$ (or, equivalently, on the corresponding information vector $\tilde{I}_k$). Moreover, for $k = 0, 1, \ldots$, let $P_k$ be the (conditional) covariance matrix^{9} of $e_k$, conditioned on the information vector available at that time, and $P_0$ the (unconditional) covariance matrix^{10} of $e_0$, which is equal to the (unconditional) covariance matrix of $w$.

Equation 3.25 has the form of the Riccati equation of the well-known Kalman filter. Due to the stochastic nature of $C_k$, it can be called a stochastic Riccati equation (*SRE*). From a computational point of view, solving equation 3.25 is easy even in a high-dimensional setting (i.e., when the dimension $d$ of the input space is large). Indeed, the innovation variance $C_k P_k C_k^\top + \sigma^2$ (with $P_k$ the conditional error covariance; this is the quantity that needs to be inverted in equation 3.25) is a scalar, since the output label is scalar. Similarly, in formula 3.24, one has to invert a scalar. In other applications of the Kalman filter, one has to invert matrices. Moreover, differently from the average Riccati equation, the stochastic Riccati equation has to be solved online.
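The computational remark above can be made concrete: one forward SRE step for the static-parameter case costs only $O(d^2)$ operations and inverts nothing but a scalar. A minimal sketch, assuming the standard static-state Kalman filter measurement update:

```python
import numpy as np

def sre_update(P, x, sigma2):
    """One forward step of the covariance recursion for the static-parameter
    Kalman filter: only the scalar innovation variance is 'inverted'."""
    s = x @ P @ x + sigma2          # scalar, since the output label is scalar
    Px = P @ x
    return P - np.outer(Px, Px) / s # O(d^2) work, no matrix inversion

rng = np.random.default_rng(1)
d = 200                             # a moderately high-dimensional input space
P = np.eye(d)
for _ in range(10):
    P = sre_update(P, rng.normal(size=d), sigma2=1.0)

assert np.allclose(P, P.T)                    # symmetry is preserved
assert np.linalg.eigvalsh(P).min() > -1e-9    # positive semidefiniteness
```

Each step subtracts a positive-semidefinite rank-1 term, so the trace of the covariance decreases monotonically, consistent with the generalized monotonicity stated in the proposition below.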

It is worth mentioning that Bertsekas (1995) also investigates an LQ optimal control problem with random matrices. In that case, however, there is perfect information on the state, and the randomness is limited to the system dynamics. For that problem, a suitable average Riccati equation is obtained, but no stochastic Riccati equation. Hence, that formulation, though inspiring for the work presented in this letter, cannot be applied directly to our online learning framework.

Summing up, the optimal solution of problem OLL decomposes into the computation of the gain matrices $L_k$ (solution of the LQR subproblem) and the determination of the Kalman gain matrices $H_k$ (solution of the Kalman filter estimation subproblem). One might wonder why, in problem OLL, one gets, instead of the classical Riccati equation, two different kinds of equations for the two subproblems, the ARE, equation 3.17, and the SRE, equation 3.25, in spite of the well-known duality between the LQR subproblem and the Kalman filter estimation problem (Söderström, 2002). The reason is that when moving from the LQR subproblem to the Kalman filter estimation subproblem, the roles of the state, control, cost, and control-weight matrices of the primal problem (i.e., the LQR subproblem) are played, respectively, by the transposed state matrix, the transposed measurement matrix $C_k^\top$, the covariance matrix of the system noise, and the variance of the measurement noise of the dual problem (i.e., the Kalman filter estimation problem); since a system noise is not present in model 3.7, its covariance matrix is an all-0s matrix. Now, in our specific context, in the primal problem, the matrix $Q_k$ is random, whereas the corresponding matrix of the dual problem (the system noise covariance) is deterministic. Similarly, in the primal problem, the matrix $B_k$ is deterministic, whereas the corresponding matrix of the dual problem ($C_k^\top$) is random. This lack of symmetry is the reason that the two Riccati equations, 3.17 and 3.25, have different forms.

The next proposition states some properties of the solution to the SRE. For two symmetric square matrices *S*_{1} and *S*_{2} of the same dimension, *S*_{1} ⪰ *S*_{2} means that *S*_{1} − *S*_{2} is symmetric and positive-semidefinite. When it is evident from the context, we use the symbol 0 to denote a matrix whose elements are all equal to 0.

(properties of the solution to the SRE). Let assumptions ^{1}–^{14} be satisfied. Then

(i.e., the sequence is "nonnegative" and monotonically "nonincreasing" in a generalized sense, according to ⪰) for all the realizations of the random matrices involved.

- If all the input examples *x*_{k} have a common probability distribution with bounded support and the same positive-definite covariance matrix for all *k*, then, with a priori probability 1, one has equation 3.34. When equation 3.34 holds, equation 3.35 follows as well.

An intuitive explanation of the second bound in equation 3.30 is the following. When the time index moves from *k* to *k* + 1, the new information acquired at the new time cannot deteriorate, on average, the quality of the KF estimate, in accordance with its optimality properties (Bertsekas, 1995). Equations 3.29, 3.30, and 3.36 will be used, together with equations 3.34 and 3.35, in the convergence analysis of the proposed method over the infinite horizon (see section 4 and the appendix).

An important assumption needed in the proof of proposition 3.iv is that the common covariance matrix of the input examples is positive-definite. When this is not the case, with probability 1 all the input examples belong to a lower-dimensional subspace of the input space; hence, with probability 1, it is not possible to extract from the input-output pairs any information about the component of *w* that is orthogonal to that subspace, unless such a component is correlated with the projection of *w* on the subspace. However, one still has the convergence of both the KF estimate and the OLL estimate of *w* to the projection of *w* on the subspace, as can be shown by setting the problem directly on it. Moreover, the possible absence of information in the data about the component of *w* orthogonal to the subspace has no negative consequences on the estimation process, in the sense that in order to compute the predicted output for a possibly unseen input *x*, one needs, with probability 1, to know only the component of *w* that belongs to the subspace.

### 3.3 Role of the Regularization Parameter

#### 3.3.1 The Case of a Vanishing Regularization Parameter

In this case, the penalization of the updates *u*_{k} in the learning functional, equation 3.9, becomes negligible, and simplified expressions follow from equations 3.14 and 3.13. Combining them with equations 3.7 and 3.37, one obtains an equivalent recursion in terms of the unknown vector *w* and its optimal estimates at the times *k* and *k* + 1; the resulting estimate is just the KF estimate of *w* at the time *k*, based on the information vector *I*_{k}.

#### 3.3.2 The Case of a Large Regularization Parameter

#### 3.3.3 Intermediate Values of the Regularization Parameter

In this case, the estimate enjoys convergence properties similar to those of the KF estimate, as illustrated numerically in Figure 1. Moreover, it is a smoothed version of the KF estimate: the sequence of OLL estimates is smoother and less sensitive to outliers than the sequence of KF estimates, as a large change in the estimate when moving from one time to the next is penalized by the presence of the regularization term in the cost functional 3.9. This can also be seen from formula 3.22, as equation 3.14 implies that all the eigenvalues of the symmetric matrix *L*_{k} are inside the unit circle. A deeper investigation of these three issues (convergence, smoothness, and sensitivity to outliers) is made in sections 5, 6.3, and 6.4, respectively.
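The damping effect of eigenvalues inside the unit circle can be illustrated with a toy one-dimensional smoother (a qualitative sketch of our own, not the letter's exact recursion 3.22): each new estimate moves only a fraction `L` toward the latest KF estimate.

```python
import numpy as np

def smooth(kf_seq, L):
    """Move each new estimate a fraction L (0 < L < 1) toward the latest
    KF estimate, damping large jumps such as those caused by outliers."""
    est = [kf_seq[0]]
    for z in kf_seq[1:]:
        est.append(est[-1] + L * (z - est[-1]))
    return np.array(est)
```

On a KF trajectory containing a single outlier, the smoothed sequence exhibits much smaller step-to-step changes, mirroring the behavior discussed above.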

## 4 LQG Learning Over an Infinite Horizon

To address the infinite-horizon case, we remove the final-stage cost (or, equivalently, we assume that it vanishes with probability 1) and let the horizon tend to ∞ (the precise formulation is provided later in this section).

Under assumption^{20}, the analysis has some similarities with the one for the optimal solution to the LQG problem performed, for example, in Bertsekas (1995). We introduce a symmetric and positive-definite square root of the covariance matrix of the inputs. As one can check directly from the definitions of reachability and observability^{11} (Antsaklis & Michel, 2007), one pair of the matrices involved is reachable,^{12} whereas the other is observable. Hence, one can apply proposition 4.1 of Bertsekas (1995), from which it follows that the ARE, equation 3.17, admits a stationary solution, associated with the two stationary matrices of equation 3.16. Moreover, by reversing the time indices in equation 3.17, the solution of the resulting equivalent ARE converges to the stationary solution for any initialization of the positive-semidefinite matrix *P*_{0}, still by Bertsekas (1995, proposition 4.1).

The (average) learning functional over infinite horizon is defined as follows:

(online learning over infinite horizon). Given the examples generated at each time instant according to the model defined by assumptions ^{1} and ^{2}, and the learning machine defined by assumption ^{3}, find the infinite sequence of *optimal updating functions* with the structure defined by assumption ^{5} that minimizes the average learning functional 4.4.

As with problem OLL, for every *k* we shall call the resulting estimate the *online estimate* (*OLL estimate*, for short). In the following, we consider directly the LQG case, with identical distributions for the input examples.

Let assumptions ^{1}–^{20} be satisfied. Then updating functions that solve problem OLL are given, for every *k*, by a formula in which the stationary matrix *L* is defined in equation 4.2.^{13}

Let *U* be a basis of orthogonal unit-norm eigenvectors of the covariance matrix of the inputs (hence, *U* is an orthogonal matrix), and let the associated diagonal matrix collect the corresponding positive eigenvalues. We recall that the stationary solution of the ARE satisfies equation 4.6. Now, the square root of the covariance matrix has the same eigenvectors as the covariance matrix itself. Hence, the stationary solution and the covariance matrix also have the same eigenvectors, so the stationary solution is expressed through a suitable diagonal matrix with positive eigenvalues. Due to equation 4.6, the corresponding diagonal elements are related by a scalar equation; hence, by their positiveness, one gets formula 4.7. Similarly, the stationary matrix *L* can be expressed through a diagonal matrix whose elements are given by formula 4.8. A particularly simple case occurs when the covariance matrix is diagonal; then one can choose the matrix *U* as the identity matrix *I*, so the stationary solution and *L* are also diagonal, by formulas 4.7 and 4.8. Moreover, if the covariance matrix is proportional to the identity matrix *I*, the stationary solution and *L* are also proportional to *I*. Finally, similar remarks hold in the finite-horizon case for the corresponding matrices and *L*_{k}, in case the covariance matrices commute (this happens, for example, when they are all equal).

## 5 Convergence Properties of the Online Learning Estimates in Terms of Mean Square Errors

We study the mean square errors of the KF and OLL estimates at each time *k*. Differently from the LQG case detailed in Bertsekas (1995), the expectation of the solution of the SRE is needed in the following, as that solution depends on the sequence of random matrices. Under the assumptions of proposition 3.iv, this expectation converges to the 0 matrix as *k* tends to ∞ (see formula 3.34).

(convergence of the MSE of the KF estimate). Let assumptions ^{1}–^{20} be satisfied. Then the following hold:

Note that the upper bound in proposition 5.ii provides a rate of convergence to 0 for the mean square error of the KF estimate of *w* at the time *k*. As for proposition 5.iv, an intuitive explanation is the following: as the parameter *w* to be learned does not change in time, after a sufficiently large number of "good" examples, the learning machine has practically learned *w*, and future examples are practically not needed (of course, this holds in the case, considered so far, in which the parameter *w* does not change with time; see section 8 for a relaxation of this assumption).

We denote analogously the mean square error of the OLL estimate at time *k*. The next proposition provides a recursion to compute and upper-bound this quantity and states its convergence to 0. We refer to remark ^{41} in the appendix for a possible way to derive estimates of the associated rate of convergence.

The statement also involves the (unconditional) covariance matrix of the OLL estimate.

(convergence of the MSE of the OLL estimate). Let assumptions ^{1}–^{20} be satisfied. Then the following hold:

- Under the assumptions made in section 4, one has

## 6 Comparisons with Other Machine Learning Techniques

### 6.1 Connections with Average Regret Minimization

Under the assumptions of proposition ^{23} (with a vanishing regularization parameter), the sequence of KF estimates minimizes the average regret functional, equation 6.4. Moreover, the minimum average regret is 0 because, by proposition 5.ii, the mean square error of the KF estimate converges to 0, from which the vanishing of the average regret follows.

Nevertheless, the next proposition shows that, for any positive value of the regularization parameter, the sequence of OLL estimates also minimizes the average regret functional 6.4.

For any positive value of the regularization parameter, under the assumptions of proposition ^{23}, the sequence of OLL estimates minimizes the average regret functional, equation 6.4.

### 6.2 Connections with KF and Stochastic Gradient Descent

The OLL estimate of the parameter vector *w* associated with the data generation model has the recursive form of equation 6.7 (see, e.g., the statement of proposition ^{15}), where *L*_{k} is a suitable square matrix and the other term is the Kalman filter estimate of *w* at time *k*, based on the vector *I*_{k} that collects all the information available to the learning machine up to time *k*. Hence, it follows from equation 6.7 that our estimates are obtained from the Kalman filter estimates through an additional smoothing step. The form of equation 6.7 is similar to those of other online estimates obtained through various machine learning techniques, such as stochastic gradient descent (Spall, 2003). However, there is a substantial difference: we derive equation 6.7 as the optimal solution of a suitable optimal control/estimation problem, showing in the letter various interesting consequences of this fact, made possible by our use of optimal control/estimation techniques. We believe that this approach could also be fruitfully applied to other machine learning techniques used in online learning. Moreover, we offer a principled way to construct the matrix *L*_{k} in equation 6.7, as the solution of a suitable Riccati equation.
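The structural similarity with stochastic gradient descent can be made concrete with a minimal sketch (a generic example of our own, not code from the letter): the SGD step below has the same "current estimate plus gain times innovation" shape as equation 6.7, but its gain is fixed a priori rather than derived from a Riccati equation.

```python
import numpy as np

def sgd_step(w, x, y, gamma):
    """Stochastic-gradient step for the squared loss (y - x^T w)^2 / 2:
    same 'estimate + gain * innovation' structure as equation 6.7, but
    with the a priori gain gamma * x instead of a Riccati-derived matrix."""
    return w + gamma * x * (y - x @ w)
```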

### 6.3 Outperformance with Respect to KF in Terms of Smoothness

Finally, Figures 3 and 4 show that both approaches are also suitable for parameter vectors of much larger dimension; they refer to a case with a much larger input dimension and number of online examples. Figure 3 reports the *l*_{2}-norm of the error vector associated with the KF estimate and with the OLL estimate, respectively, at the generic stage *k*, for a single realization of the sequence of input-output examples. The running time of this simulation (whose code was written in Matlab R2013, as for all the other simulations) was about 29 seconds on a notebook with a 1.40 GHz CPU and 4 GB of RAM. Figure 4 reports a similar comparison for the *l*_{2}-norm of the update vector, for the same realization. It is worth noting that, for a better comparison of the two figures, *l*_{2}-norms and not squared *l*_{2}-norms have been considered therein (in Figure 2, the squared *l*_{2}-norm of the update vector was considered, for its relationship with proposition ^{28}). To compare the two approaches, various choices of the regularization parameter have been considered in Figures 3 and 4 (see the captions for details). In more detail, Figure 3 shows that, in this case of a time-invariant parameter vector, the KF estimate and the OLL estimates have a similar error (in a root mean square sense, slightly smaller for the KF estimate, which is, however, much less smooth with respect to the time index *k*). Moreover, the OLL estimates have much smaller norms of the update vectors (in Figure 4, they are practically negligible compared to those associated with the KF estimate, for all the values considered for the regularization parameter). In item e.3 of section 8, we show that in the case of a slowly time-varying parameter vector, the OLL estimate can achieve an even much smaller error than the KF estimate, under a suitable periodic reinitialization of the matrices (see Figure 8).

### 6.4 Outperformance with Respect to KF in Terms of Sensitivity to Outliers

Here we further compare numerically the KF and OLL estimates, now in terms of their different sensitivity to outliers. To this end, we periodically alter the output data perturbation model, choosing the disturbance to be equal to a positive constant *z*_{1} when *k* is a multiple of some positive integer *Z*, and to a negative constant *z*_{2} otherwise. The two constants *z*_{1} and *z*_{2} are chosen in such a way that the empirical mean (over any time window of duration *Z*) of the disturbances is 0, and their empirical variance (over the same time window) equals the nominal noise variance; these two conditions determine *z*_{1} and *z*_{2}. Moreover, for a fair comparison with the KF estimate, this modified assumption on the output data-perturbation model is not included in the optimization problem producing the OLL estimates.^{14} In other words, that knowledge is not provided to the learning machine.
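The two stated conditions determine the constants in closed form. The helper below (our own derivation from the conditions as stated, with *z*_{1} occurring once per window of length *Z* and *z*_{2} on the remaining *Z* − 1 steps) returns the unique solution with *z*_{1} > 0 > *z*_{2}.

```python
import math

def outlier_constants(sigma, Z):
    """Unique solution with z1 > 0 > z2 of the two conditions:
       z1 + (Z - 1) * z2 = 0                      (zero empirical mean)
       (z1**2 + (Z - 1) * z2**2) / Z = sigma**2   (empirical variance)"""
    z1 = sigma * math.sqrt(Z - 1)
    z2 = -sigma / math.sqrt(Z - 1)
    return z1, z2
```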

The numerical results reported in Figure 5 show clearly the much smaller sensitivity to outliers of the OLL estimates with respect to the KF ones (details about the parameter choices are reported in the figure caption). This is ultimately due to the larger smoothness of the OLL estimates with respect to the time index.

An additional theoretical motivation for the smaller sensitivity to outliers of our OLL estimates is obtained by an inspection of formula 3.22 in proposition ^{15}. Restricting the analysis, for simplicity, to the first OLL updates of the parameter vector, it follows from that formula that the first OLL update is a contraction of the first KF update, which is influenced only by the first example presented to the learning machine. Since the eigenvalues of the matrix involved lie inside the unit circle (see formula 3.14), it is evident that the OLL estimate is less influenced by the presence of a possible outlier. Moreover, such an influence decreases when increasing the regularization parameter since, by formula 3.14, the larger that parameter is, the smaller the matrix is. Figure 6 confirms this advantage of the OLL estimates, showing that the *l*_{2}-norms of the vectors of updates used to generate the OLL estimates are typically much smaller than the *l*_{2}-norms of the vectors of updates used to generate the KF estimates. An additional significant advantage of the OLL estimates in the presence of time-varying parameter vectors is detailed in item e of section 8.

## 7 Nonlinear Models of Data Generation and Application of Kernel Methods

The extension to nonlinear models is obtained by mapping the inputs *x*_{k} preliminarily to another Euclidean space, then applying the model in the new input space. More precisely, one introduces a (possibly nonlinear) mapping into a Euclidean space *E* of dimension *d*_{E}, possibly larger than *d* (or even infinite). Then equation 3.1 becomes a linear model in the new space, where the parameter vector *w* now belongs to *E*. In this case, one can still apply all the techniques described in the letter, taking *E* as the new input space. Of course, in doing this, the dimensions of some matrices would in general change (typically, they would increase); for instance, in the case of a finite *d*_{E}, the matrices *L*_{k} and the solutions of the Riccati equations would become *d*_{E} × *d*_{E} matrices, whereas the Kalman gain matrix *H*_{k} would become a *d*_{E} × 1 matrix. In the case of an infinite-dimensional Hilbert space *E*, they would be replaced by suitable infinite-dimensional linear operators.

Interestingly, as we show below, when performing such an extension, one can apply the so-called kernel trick of kernel machines (Cristianini & Shawe-Taylor, 2000). More precisely, we show some circumstances under which, for every (possibly unseen) input *x*, one can express both the KF and the OLL estimates in terms of inner products between the mapped input *x* and the mapped input examples *x*_{j} already seen by the learning machine. Hence, if one is able to express such inner products in a simple way (e.g., through a symmetric kernel function), one can compute both estimates even without knowing explicitly the expression of the mapping.

As an example, consider the mapping whose components are all the pairwise products of the components of *x*. Then, given any two input vectors, the inner product of their images equals the square of their inner product in the original space, which is the so-called homogeneous polynomial kernel (Cristianini & Shawe-Taylor, 2000) of order 2, whose evaluation involves only computations to be performed in the original input space.
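This identity can be checked directly (a standard construction, sketched in our own notation): the explicit feature map lists all pairwise products of components, and its inner products reduce to squared inner products in the original space.

```python
import numpy as np

def phi(x):
    """Explicit feature map: all pairwise products x_i * x_j
    (the entries of the outer product x x^T, flattened)."""
    return np.outer(x, x).ravel()

def poly2_kernel(x, xp):
    """Homogeneous polynomial kernel of order 2, evaluated with O(d)
    work in the original input space: <phi(x), phi(xp)> = (x . xp)^2."""
    return float(x @ xp) ** 2
```

The feature space has dimension *d*^2 here, yet the kernel evaluation never leaves the original *d*-dimensional space.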

We first consider how to compute using kernels, then how to compute , too. We make the following assumption.

The results presented in the next proposition for the kernel version of the KF estimate are essentially the same as those obtained in Liu et al. (2009, theorems 2 and 3), which also shows how to express the linear combinations inside such equations through an application of the matrix inversion lemma (see, e.g., Sayed, 2003). However, their extension to the kernel version of the OLL estimate, provided in proposition ^{34}, is novel. In order to improve their readability, in the next formulas—equations 7.3–7.5 and 7.9–7.12—only the functional form of the right-hand side is provided.

In the case of a finite-dimensional space *E*, the convergence analysis is exactly the same as the one in proposition ^{25}, and a similar (even though more technical) analysis is expected to hold in the infinite-dimensional case. Finally, in case the (matrix associated with the) covariance operator is only positive-semidefinite but not positive-definite, one could still follow remark ^{19} to prove the convergence of the estimate on the subspace on which the input data lie with probability 1. Such a subspace could be estimated by, for example, an application of kernel principal component analysis (KPCA) (Schölkopf, Smola, & Müller, 1998). Moreover, one could even redefine the problem taking that subspace as the new input space, making the operator positive-definite when restricted to it.

After dealing with the kernel version of the KF estimate of *w*, we now investigate the kernel version of its OLL estimate. We make the following assumption:

Assumption 8.i refers to a particularly simple model, which is relaxed in assumption 8.ii to the case in which the relevant quantity is modeled by an empirical estimate obtained using the unsupervised examples.

More generally, the data in assumption 8.ii could be previously seen input data, preferably not used by the learning machine in combination with labels, to help avoid overtraining; their number could then grow as the learning machine acquires examples. Of course, after adding new empirical data to the estimate 7.8, one could also accordingly update the matrix *L*_{k} (or, in the infinite-horizon case, the stationary matrix *L*), as in item d.1 of section 8.

We conclude this section by mentioning that in the nonlinear case, differently from techniques such as the extended KF (Grewal & Andrews, 2001), the kernel version of the OLL estimate has the advantage of solving an optimal control (or optimal estimation) problem. Other approaches to online learning with kernels are described, for example, in Diethe and Girolami (2013).

## 8 Extensions

In this section, we illustrate some other extensions of the proposed OLL estimation scheme investigated in the letter.

### 8.1 Extension a: Nonzero Mean of *x*_{k}


In this case, no significant change in the analysis is required. The only difference is that the second-moment matrices involved are now correlation matrices instead of covariance matrices.

### 8.2 Extension b: Nonzero Mean of *w*

Propositions ^{25} and ^{26} still hold in this case, provided the KF estimate and the OLL estimate are suitably initialized (notice that two different initialization indices have been used for the two estimates: one subscript denotes the a priori KF estimate—the one obtained before the presentation of the first example—whereas the subscript 0 has been used for the initialization of the OLL estimate).

^{15}Indeed, in such a case, one obtains expressions similar to those in the appendix for the matrices in equation A.48 and for the matrices in equations 5.4, 5.5, A.65, and A.66, respectively, and the same equation A.73, which is used there to obtain the convergence result, equation 5.6, through an analysis of convergence as *k* tends to ∞.

This case is important in practice; among others, it models the situation in which, after some number of measures, the time index *k* is shifted to the left (equivalently, one reformulates problem OLL using a positive index instead of 0 as the initial index in the summation of its objective, equation 3.5), and the knowledge derived from the previous estimates (i.e., those up to the initial time) is used to generate the a priori term (this is actually a posteriori knowledge, since it summarizes the knowledge deriving from the previous estimates, but it becomes the new a priori knowledge for the new problem with modified starting index in the summation).

The convergence still holds as *k* tends to ∞, due to equation A.64 in the appendix with *L*_{j} replaced by *L*, since the matrix *I* + *L* has spectral radius smaller than 1.

### 8.3 Extension c: Introduction of a Bias in the Model

In this extension, a bias *b* is an additional parameter to be learned using the sequence of examples available online. This case can be reduced to equation 3.1 by replacing the input vector *x*_{k} by its augmented version with a constant last component, and the parameter vector *w* to be learned by its augmentation with *b*. As the last component of the new input vector has nonzero mean, one is also reduced to case a above. Moreover, the assumption of positive-definiteness of the correlation matrix of the augmented input is satisfied automatically if it is satisfied by the covariance matrix of *x*_{k}.
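This reduction can be sketched as follows (a minimal illustration of our own; the helper appends the constant component so that the affine model becomes linear in the augmented parameter vector):

```python
import numpy as np

def augment(x):
    """Append a constant 1 to the input, so the affine model
    x^T w + b becomes the linear model [x, 1]^T [w, b]."""
    return np.append(x, 1.0)
```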

### 8.4 Extension d: More Complex Models for the Measurement Errors

The measurement errors could have nonzero means, nonidentical distributions, and/or not be mutually independent. The first two cases can be dealt with in a straightforward way. Indeed, in the first case, one has only to subtract the expectation of the measurement error from the measure *y*_{k} before presenting it as an input to the KF,^{16} while in the second case, one has to insert an additional index *k* in the noise variance, using time-varying variances in the Kalman filter recursion scheme 3.25 and in the Kalman gain matrix 3.24. Finally, in the correlated case, one could model the measurement noise as the output of an auxiliary uncontrolled linear dynamical system, which receives mutually independent noises as inputs. In this case, when the horizon tends to ∞, the convergence of the solution of the ARE to a stationary solution could be more difficult to prove (or such a convergence could even not hold at all), since the reachability condition needed for the application of Bertsekas (1995, proposition 4.1) would be violated.

### 8.5 Extension e: Time-Varying Models for Some Parameters

When solving the "shifted version" of problem OLL that uses a positive initial index (see remark ^{37}), one could exploit, for some of its parameters, time-varying models (which could also be estimated online), including the cases of:

- (e.1) Slowly time-varying covariance matrices of the input examples *x*_{k}
- (e.2) Slowly time-varying variances of the measurement noises
- (e.3) A slowly time-varying parameter vector *w* in the data-generation model, equation 3.1

About issue e.1, we notice that the covariance matrices may be estimated online from the already observed input data,^{17} not only when such covariance matrices do not depend on the time index (as assumed in the basic infinite-horizon version of the problem), but also if one assumes instead that they are slowly time varying (in such a case, one could give more weight in such estimates to the most recent input data using, for example, a forgetting factor). Now suppose that one is solving the infinite-horizon version of problem OLL under the assumptions of section 4, replacing the "common"^{18} covariance matrix of the future input data by an initial estimate. Suppose also, however, that the estimate derived at a later time from the previous input data is significantly different from the initial one (due, e.g., to slow changes with respect to time of the covariance matrices, or simply to a possibly bad initial estimate). Then, using the updated estimate as the model for the "common" covariance matrix of the future input data, one could also update the stationary matrix *L* of the proposed optimal controller, replacing the old estimate by the updated one in the average Riccati equation, equation 3.17, looking for its new "stationary" solution, and hence deriving the new "stationary" matrix *L* from equation 4.2. Such new solutions will be "stationary" as long as the future estimates of the "common" covariance matrix of the future input data do not change significantly.

About issue e.2, one could collect pairs of input-output examples characterized by similar values of *x*_{k}, or even several groups of such pairs, each characterized by similar values of *x*_{k}. In this way, one could generate, inside the *i*th such group, an auxiliary data matrix whose generic element is a difference of outputs and has a very small dependence on *w*, since the corresponding inputs nearly coincide. Then one could use such matrices to estimate the variance of the measurement noise, giving more importance or weight to the most recent among such measures. The obtained estimate would then be used as the common variance in the shifted version of problem OLL that uses a positive initial index, presented in section 4.
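The idea can be sketched as follows (our own minimal illustration, not the letter's exact construction): for exactly repeated inputs, the output difference equals the difference of the two noises, whose variance is twice the noise variance.

```python
import numpy as np

def noise_variance_estimate(y_pairs):
    """For pairs (y_j, y_l) with (nearly) identical inputs,
    y_j - y_l ~= eps_j - eps_l has variance 2 * sigma^2, so
    sigma^2 is estimated by mean((y_j - y_l)^2) / 2."""
    diffs = np.array([yj - yl for yj, yl in y_pairs])
    return float(np.mean(diffs ** 2) / 2.0)
```

A forgetting factor, as mentioned above, could be introduced by replacing the plain mean with an exponentially weighted one.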

Finally, about issue e.3, we show that a minor modification of the proposed OLL model can learn even a time-varying parameter vector *w*. We first focus on the case in which, at some time, the parameter vector *w* changes, then remains fixed. In this case, the mean square error of the estimate of the new parameter vector still converges to 0 when *k* tends to ∞ (see formula 5.6, together with item b in this section). However, the trace of the covariance matrix may be extremely small at the time of the change (see equation 5.1, combined with Markov's inequality), and it remains extremely small afterward. Since such matrices are used to construct the Kalman gain matrices *H*_{k}, the convergence to 0 of the error with respect to the new parameter vector may be extremely slow, for both the KF and the OLL estimates. This issue could be solved in both cases by a reinitialization, at the time of the change, of the covariance matrix to its a priori value. More generally, a periodic reinitialization could be used to track a periodically (or continuously) changing parameter vector *w*. In this case, the OLL estimate would have the advantage, with respect to the KF estimate, of changing more slowly in time, making it more suitable for a slowly time-varying parameter vector. The next figures provide more insight into this last issue.

Figures 7 and 8 refer to a parameter vector that changes periodically, the change in its components being small and random. In Figure 7, there is no reinitialization of any of the matrices, and the convergence to the new parameter vector of both the KF estimate and the OLL estimate is slow. Figure 8 instead refers to the case in which—the change in the parameter vector being the same—there is a periodic reinitialization of the matrices to their a priori values (this second period has been chosen as different from the one according to which the parameter vector changes, just to avoid giving the learning machine the advantage of knowing when the parameter vector changes, in order to model the more realistic situation in which this knowledge is not available to the machine). In this case, both estimates are able to track the time-varying parameter vector *w* in a better way, but the OLL estimate is smoother than the KF estimate, due to the presence of the regularization parameter. So in this context, the OLL estimate is preferable to the KF estimate if the parameter vector changes slowly with time. Figures 9 and 10 show the reason this happens: when there is no reinitialization of the matrices, the Frobenius norm^{19} of the Kalman gain matrix *H*_{k} is expected to be small for *k* large (see formulas 3.24 and 5.1), which is confirmed by Figure 9. Hence, although the norm of the error tends to increase when the parameter vector changes, the KF estimate of *w* at time *k* is not affected much by this change (see formula 3.23); consequently, the OLL estimate also does not change much (see formula 3.22). Instead, a reinitialization tends to make the Frobenius norm of the Kalman gain matrix *H*_{k} much larger (which is confirmed by Figure 10), and this amplifies the effect of the larger error norm (due to the change in the parameter vector *w*) on the KF estimate of *w* at time *k*. For the OLL estimate, the change in the estimate is expected to be smaller, due to the smoothing effect of formula 3.22. So, like the KF estimate, the OLL estimate is also able to track the change in the parameter vector, but with smoother behavior.

Figures 11 and 12 also show that a periodic reinitialization of the matrices to their a priori values does not significantly worsen the tracking of a time-invariant parameter vector *w* in the case of the OLL estimate, whereas the KF estimate is more negatively affected. Again, this is due to the smoothing effect of formula 3.22.
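The effect of the periodic reinitialization can be reproduced in a small simulation of our own (not the letter's experiment): a static-model Kalman filter tracks a parameter vector that changes once, with and without periodically resetting the covariance to its a priori value.

```python
import numpy as np

def kf_track(ws, sigma2, reinit_every=None, seed=0):
    """Run the static-model Kalman filter on targets ws[k], optionally
    reinitializing the covariance P to the a priori identity every
    `reinit_every` steps; returns the per-step estimation errors."""
    rng = np.random.default_rng(seed)
    d = ws.shape[1]
    w_hat, P = np.zeros(d), np.eye(d)
    errs = []
    for k, w in enumerate(ws):
        if reinit_every and k > 0 and k % reinit_every == 0:
            P = np.eye(d)                      # periodic reinitialization
        x = rng.normal(size=d)
        y = x @ w + np.sqrt(sigma2) * rng.normal()
        s = sigma2 + x @ P @ x                 # scalar to invert
        w_hat = w_hat + (P @ x) * (y - x @ w_hat) / s
        P = P - np.outer(P @ x, P @ x) / s
        errs.append(np.linalg.norm(w_hat - w))
    return np.array(errs)
```

Without the reset, the gain shrinks over time and the filter adapts very slowly after the parameter change; with it, tracking resumes quickly, matching the qualitative behavior described above.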

Finally, various approaches to learning time-varying parameters online were presented in McGough (2003), Liu et al. (2009), and Chang (2014), in some cases in the context of state estimation of dynamical systems in the presence of outliers (Alessandri & Awawdeh, 2016; De Palma & Indiveri, 2016). As a possible extension, one could combine those approaches with the regularization of the updates included in our model, whose beneficial effects have also been demonstrated in this time-varying case.

### 8.6 Extension f: Insertion of a Discount Factor in the Problem

Then the resulting problem is just a variation, with random matrices, of the discounted LQ/LQG problem (see Davis & Vinter, 1985, sec. 6.3, for the version with deterministic matrices).

One can then extend proposition ^{26} to the discounted case.

Regarding modification 8.5, one can also observe that, as in the basic version of problem OLL, the past updates of the estimate of the parameter vector *w* are not modified when a new example arrives and that, in any case, such updates have no influence on the cost-to-go functions at later stages. Moreover, the discount factor should not be confused with a forgetting factor^{20} (which gives less importance to the past errors in determining the current estimate).

### 8.7 Extension g: Introduction of Additional Regularizations on the Estimates of *w*

The matrices *L*_{k} would have, in this case, a different size. Instead, the SRE would be exactly the same as equation 3.25. A suitable choice for the sequence of the regularization parameters