Kalman Filter, a Least Squares Problem [work in progress]

———-work in progress————

Prerequisite

  • Familiarity with the Kalman filter is necessary.
  • Familiarity with matrix computation.

What’s in this article?

This article shows that the Kalman filter is a least squares problem.

Using the least squares interpretation, we can answer some questions easily, e.g. how to tune the covariance of a Kalman filter, or how to learn the prediction function.

1. Kalman Filter Refresher

Assume the current state is given by \(x_i\) and its covariance is given by \(P_i\).

The state evolves as a linear system, i.e. the next state is given by \(x_{i+1} = F x_i\). I will ignore process noise for now.

We have an observation function of the current system, \(z_i = H x_i\), with observation covariance \(S_i\).

Using probability as a tool, the Kalman filter consists of 2 parts. (I ignore the prediction noise for now. It will become clear later.)

Prediction

Next state prediction: \(\hat{x}_{i} = F x_{i-1}\).
Next Covariance prediction: \(\hat{P}_{i} = F P_{i-1} F^T\)

Update

Compute Kalman gain: \(K_i = \hat{P}_i H^T( H \hat{P}_i H^T + S_i)^{-1}\).
Update state: \(x_i = \hat{x}_i + K_i ( z_i - H \hat{x}_i )\).
Update state covariance: \(P_i = (I - K_i H) \hat{P}_i\)
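
As a minimal sketch of the two steps above (a hypothetical 2-state example in numpy, no process noise):

import numpy as np

# One predict/update cycle of the linear Kalman filter described above.
F = np.array([[1.0, 0.1],
              [0.0, 1.0]])              # state transition
H = np.array([[1.0, 0.0]])              # observation function
S = np.array([[0.1]])                   # observation covariance

x = np.array([0.0, 1.0])                # previous state x_{i-1}
P = np.eye(2)                           # previous covariance P_{i-1}
z = np.array([0.12])                    # measurement z_i

# Prediction
x_hat = F @ x
P_hat = F @ P @ F.T

# Update
K = P_hat @ H.T @ np.linalg.inv(H @ P_hat @ H.T + S)   # Kalman gain
x_new = x_hat + K @ (z - H @ x_hat)
P_new = (np.eye(2) - K @ H) @ P_hat
print(x_new, P_new)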

2. Fixed-Lag Smoothing Refresher and new interpretation

In the fixed-lag smoothing article, assuming the cost is quadratic, I showed that the cost function for the fixed-lag smoothing is,

\(\text{total_cost}({x_0, x_1, … x_n}) = \sum\limits_{i=0}^{n-1} \text{cost_term}_i (x_i, x_{i+1})\)

The fixed-lag smoothing solves the problem iteratively by breaking the \(\text{total_cost}\) into small optimization problems.

\(\text{total_cost}({x_0, x_1, … x_n})
\\ = \sum\limits_{i=0}^{n-1} \text{cost_term}_i (x_i, x_{i+1})
\\ \Leftrightarrow \text{min}_{x_0} (\text{cost_term}_0 (x_0, x_{1}) + \text{cost_term}_1 (x_1, x_{2}) ) + \sum\limits_{i=2}^{n-1} \text{cost_term}_i (x_i, x_{i+1})
\\ \Leftrightarrow \text{min}_{x_0, x_1} (\text{cost_term}_0 (x_0, x_{1}) + \text{cost_term}_1 (x_1, x_{2}) + \text{cost_term}_2 (x_2, x_{3}) ) + \sum\limits_{i=3}^{n-1} \text{cost_term}_i (x_i, x_{i+1})
\\ \Leftrightarrow \text{min}_{x_0, x_1, … x_{k-1}}\sum\limits_{i=0}^{k-1} \text{cost_term}_i (x_i, x_{i+1}) + \sum\limits_{i=k}^{n-1} \text{cost_term}_i (x_i, x_{i+1})\)

(The \(\Leftrightarrow\) means equivalence with respect to the argmin. Please check this post for details.)

We can define the first part of the above equation as the cost-to-go.

\(\text{cost_to_go}_k(x_k) := min_{x_0, x_1, … x_{k-1}}\sum\limits_{i=0}^{k-1} \text{cost_term}_i (x_i, x_{i+1})\)

If all costs are quadratic, following the derivation we did in this article, we can show that the cost-to-go is a quadratic form.

The total cost can be written as,

\(\text{total_cost}({x_k, … x_n}) \Leftrightarrow \text{cost_to_go}_k(x_k) + \sum\limits_{i=k}^{n-1} \text{cost_term}_i (x_i, x_{i+1}) \; \; \; (2.1)\)

Since cost terms are quadratic, we can compute cost-to-go iteratively,

\(\text{cost_to_go}_i(x_i) = \text{min}_{x_{i-1}} ( \text{cost_to_go}_{i-1}(x_{i-1}) + \text{cost_term}_{i-1} (x_{i-1}, x_{i}) ) \; \; \; (2.2)\)

With the initial value for the cost-to-go \(\text{cost_to_go}_{0}(x_{0}) =0\)

Because the total-cost (2.1) can be expressed by the cost-to-go, we can use (2.2) to compute the total-cost iteratively.

For example, the final cost can be expressed as,

\(\text{total_cost}({x_k, … x_n}) \Leftrightarrow \text{cost_to_go}_{n}(x_{n})\)

The \(\text{cost_to_go}_{n}(x_{n}) \) can be computed as,

\(\text{cost_to_go}_{n}(x_{n}) = \text{min}_{x_{n-1}}( \text{cost_to_go}_{n-1}(x_{n-1}) + \text{cost_term}_{n-1} (x_{n-1}, x_{n}) ) \; \; \; (2.3)\)

This means that to compute \(x_n^*\), we just need to keep track of \(\text{cost_to_go}_{n-1}(x_{n-1})\) and solve (2.3).

That’s what fixed-lag smoothing does.

It’s also what Kalman Filter does.
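
To make this concrete, here is a small scalar sketch (hypothetical numbers, with cost terms of the quadratic form used in the next section): eliminating one past variable at a time via (2.2) gives the same \(x_n^*\) as minimizing the full total cost jointly.

import numpy as np

# Scalar sketch of the cost-to-go recursion (2.2), with hypothetical cost terms
# cost_term_i(x_i, x_{i+1}) = Q (x_{i+1} - F x_i)^2 + R (H x_{i+1} - z_i)^2.
rng = np.random.default_rng(0)
n = 5                                   # number of cost terms
F, H, Q, R = 0.9, 1.0, 2.0, 4.0
z = rng.standard_normal(n)              # measurements

# 1) Joint solution: total_cost = x^T G x - 2 g^T x + const, minimized at G^{-1} g.
G = np.zeros((n + 1, n + 1))
g = np.zeros(n + 1)
for i in range(n):
    G[i, i] += Q * F ** 2               # Q (x_{i+1} - F x_i)^2
    G[i + 1, i + 1] += Q
    G[i, i + 1] -= Q * F
    G[i + 1, i] -= Q * F
    G[i + 1, i + 1] += R * H ** 2       # R (H x_{i+1} - z_i)^2
    g[i + 1] += R * H * z[i]
x_joint = np.linalg.solve(G, g)

# 2) Cost-to-go recursion: keep cost_to_go_i(x) = a x^2 + b x + c and
#    eliminate the previous variable in closed form at every step.
a, b, c = 0.0, 0.0, 0.0                 # cost_to_go_0 = 0
for i in range(n):
    A = a + Q * F ** 2                  # curvature in the eliminated variable
    a, b, c = (Q - (Q * F) ** 2 / A + R * H ** 2,
               b * Q * F / A - 2 * R * H * z[i],
               c - b ** 2 / (4 * A) + R * z[i] ** 2)
x_n_recursive = -b / (2 * a)            # argmin of the final cost-to-go

print(np.isclose(x_joint[-1], x_n_recursive))   # True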

3. The big picture of Kalman Filter

The above derivation is exactly the same as for the Kalman Filter.

The cost function for the full Kalman Filter is,

\(\text{full_kalman_filter}({x_0, x_1, … x_n}) = \text{cost_prior}(x_0) + \sum\limits_{i=0}^{n-1} \text{cost_term}_i (x_i, x_{i+1})\)

For the Kalman Filter, the \(\text{cost_term}_i (x_i, x_{i+1})\) is,

\( \text{cost_term}_i (x_i, x_{i+1}) = ||x_{i+1} - F x_i||^2_Q + ||H x_{i+1} - h||^2_R\)

\(||x_{i+1} - F x_i||^2_Q\) is the processing cost. It soft-constrains the predicted state \(\hat{x}_{i+1} = F x_i\) to be close to the next state \(x_{i+1}\).

\(||H x_{i+1} - h||^2_R\) is the measurement cost. It penalizes the distance between the predicted measurement \(H x_{i+1}\) and the true measurement \(h\).

To estimate the optimal current state, we need to solve (2.3). In the context of Kalman Filter, it is,

\(\text{cost_to_go_kalman_filter}_{n}(x_{n})
\\ = \text{min}_{x_{n-1}}( \text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) + \text{cost_term}_{n-1} (x_{n-1}, x_{n}) ) \\ = \text{min}_{x_{n-1}}( \text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) + ||x_{n} - F x_{n-1}||^2_Q + ||H x_{n} - h||^2_R ) \)

Since \(\text{cost_to_go_kalman_filter}_{n-1}(x_{n-1})\) is quadratic, let \(\text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) = ||x_{n-1} - x_{n-1}^*||^2_M\). The above equation becomes,

\(\text{cost_to_go_kalman_filter}_{n}(x_{n}) = \text{min}_{x_{n-1}}( ||x_{n-1} - x_{n-1}^*||^2_M + ||x_{n} - F x_{n-1}||^2_Q + ||H x_{n} - h||^2_R ) \; \; \; (3.1)\)

The optimal solution of \(\text{cost_to_go_kalman_filter}_{n}(x_{n})\) is the best estimate of the current state,

\(x_n^* = \text{argmin } \text{cost_to_go_kalman_filter}_{n}(x_{n})\).

This is equivalent to the predict/update step for the Kalman Filter.

Let me prove it in the next section.

Kalman Filter as a least squares problem

The classical Kalman filter is given by the predict and the update steps.


Prediction

Next state prediction: \(\hat{x}_{i} = F x_{i-1}\).
Next Covariance prediction: \(\hat{P}_{i} = F P_{i-1} F^T\)

Update

Compute Kalman gain: \(K_i = \hat{P}_i H^T( H \hat{P}_i H^T + S_i)^{-1}\).
Update state: \(x_i = \hat{x}_i + K_i ( z_i - H \hat{x}_i )\).
Update state covariance: \(P_i = (I - K_i H) \hat{P}_i\)


In this section, I will prove that the prediction step and the update step are equivalent to a least squares problem.

Definition

Given the previous state \(x_{i-1}\) and previous covariance \(P_{i-1}\), the cost function for Kalman filter is,

\( \text{kalman-filter-cost}(x_i) = \text{observation-cost}(x_i) + \text{prior-cost}(x_i) \),

where,

\(\text{observation-cost}(x_i) = ||z_i - H x_i ||^2_{{S_i}^{-1}} \),

\( \text{prior-cost}(x_i) = ||F^{-1}x_i - x_{i-1} ||^2_{{P_{i-1}}^{-1}} \),

\(\text{kalman-filter-cost}(x_i) = ||z_i - H x_i ||^2_{{S_i}^{-1}} + ||F^{-1}x_i - x_{i-1} ||^2_{{P_{i-1}}^{-1}} \)

In English, we want,

  1. The state observation \(H x_i\) to be close to the true observation \(z_i\), weighted by the inverse of the observation covariance.
  2. The “predicted previous state” \(F^{-1}x_i \) to be close to the previous state \(x_{i-1}\), weighted by the inverse of the previous state covariance.

Let’s show the cost function is equivalent to the Kalman Filter.

1. Prove the Prediction

(This may be confusing at first, but it will become clear soon.)

Assume all matrices are invertible. (A non-invertible matrix usually means you are doing something weird.)

Goal: We want to show,

\(||F^{-1}x_i - x_{i-1} ||^2_{(P_{i-1})^{-1}} \stackrel{?}{=} || x_i - F x_{i-1}||^2_{(F P_{i-1} F^T)^{-1}} \; \; \; (1)\)

In English, the prior cost is equivalent to a soft constraint which wants the predicted current state \(F x_{i-1}\) to be close to the current state \(x_i\), weighted by \({(F P_{i-1} F^T)^{-1}}\).

Please compare it with the prediction step of classical Kalman Filter.

Prediction

Next state prediction: \(\hat{x}_{i} = F x_{i-1}\).
Next Covariance prediction: \(\hat{P}_{i} = F P_{i-1} F^T\)


Let’s start the proof!

Let \(W_{i-1} = {P_{i-1}}^{-1}\), and start with the left-hand side of (1),

\(||F^{-1}x_i - x_{i-1} ||^2_{W_{i-1}} \)

\(= (F^{-1}x_i - x_{i-1})^T W_{i-1} (F^{-1}x_i - x_{i-1}) \)

Multiply everything inside the quadratic form by \( F^{-1} F \) (the identity).

\(= ( (F^{-1} F) (F^{-1}x_i - x_{i-1}))^T W_{i-1} ( (F^{-1} F) (F^{-1}x_i - x_{i-1})) \\ = ( F^{-1} (x_i - F x_{i-1}))^T W_{i-1} ( F^{-1} (x_i - F x_{i-1})) \\ = (x_i - F x_{i-1})^T (F^{-1})^T W_{i-1} F^{-1} (x_i - F x_{i-1}) \; \; \; (2)\)

Let’s prove an identity first,

\((F^{-1})^T W_{i-1} (F^{-1}) \stackrel{?}{=} (F P_{i-1} F^T)^{-1} \; \; \; (3) \\ (F^{-1})^T W_{i-1} (F^{-1}) (F P_{i-1} F^T) \stackrel{?}{=} \text{Identity} \)

Since \(W_{i-1} = {P_{i-1}}^{-1}\), everything on the left-hand side cancels out, and (3) is proved.

Plug (3) into (2)

\((2) = (x_i - F x_{i-1})^T (F P_{i-1} F^T)^{-1} (x_i - F x_{i-1}) = \text{RHS of (1)}\)

We proved (1).

What does equation (1) mean?

\(\text{prior-cost}(x_i) = || x_i - F x_{i-1}||^2_{(F P_{i-1} F^T)^{-1}} \)

In English, it means the current state \(x_i\) should be close to the prediction from the previous state, \(F x_{i-1}\), weighted by the inverse of \(F P_{i-1} F^T \).

It is the prediction step of the Kalman filter in the context of least squares.
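
A quick numerical sanity check of identity (1), using hypothetical small matrices:

import numpy as np

rng = np.random.default_rng(0)
n = 3
F = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # assumed invertible
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)                         # a valid covariance P_{i-1}
x = rng.standard_normal(n)                          # x_i
x_prev = rng.standard_normal(n)                     # x_{i-1}

def sq_norm(r, W):
    # ||r||^2_W = r^T W r
    return r @ W @ r

lhs = sq_norm(np.linalg.solve(F, x) - x_prev, np.linalg.inv(P))
rhs = sq_norm(x - F @ x_prev, np.linalg.inv(F @ P @ F.T))
print(np.isclose(lhs, rhs))   # True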

2. Prove the Update, the solution of the least squares problem

Using equation (1), the cost function for the Kalman Filter is,

\(\text{kalman-filter-cost}(x_i) = ||z_i - H x_i ||^2_{(S_i)^{-1}} + || x_i - F x_{i-1}||^2_{(F P_{i-1} F^T)^{-1}} \)

We want to prove:

  1. The \(x_i^* = \text{argmin} (\text{kalman-filter-cost}(x_i)) \) is the Kalman Filter state update for state \(x_i\), i.e. \(x_i^* = \hat{x}_i + K_i ( z_i - H \hat{x}_i )\).
  2. The inverse Hessian of the cost function evaluated at \(x_i^*\) is the Kalman filter covariance update, i.e. \(\text{The-Inverse-Hessian} (\text{kalman-filter-cost})( x_i^* ) = (I - K_i H) \hat{P}_i\).

Let’s start the proof!

To prove them, we just need to solve the least squares problem.

Start with the optimality condition of the quadratic cost,

\(\frac{ \partial{\text{kalman-filter-cost}}} {\partial{x_i}} =\boldsymbol{0}\)

Let \(A = {S_i}^{-1}\), and \(B = (F P_{i-1} F^T)^{-1} \),

The optimality condition for the quadratic cost is,

\(\frac{ \partial{\text{kalman-filter-cost}}} {\partial{x_i}} = \boldsymbol{0}\)

\(\frac{ \partial{ ||z_i - H x_i ||^2_{A} }} {\partial{x_i}} + \frac{ \partial{|| x_i - F x_{i-1}||^2_{B}}} {\partial{x_i}} = \boldsymbol{0}\)

\( -(z_i - H x_i)^T A^T H + (x_i - F x_{i-1})^T B^T = \boldsymbol{0}\)

(The common factor of 2 from the derivatives has been dropped.)

Let \(\hat{x_i} = F x_{i-1}\), and take the transpose.

We get the system of equations for the optimality condition,

\(-H^T A z_i + H^T A H x_i + B x_i - B \hat{x_i} = \boldsymbol{0} \)

\( (H^T A H + B) x_i = B \hat{x_i} + H^T A z_i \)

The optimal variable is,

\(x_i^* = \text{argmin}(\text{kalman-filter-cost}(x_i)) = (H^T A H + B)^{-1} (B \hat{x_i} + H^T A z_i) \; \; \; (6) \)

The Covariance (the inverse hessian) of \(x_i^*\) is,

\(P_i = (H^T A H + B)^{-1} \; \; \; (5) \)

Now, let’s show (5) and (6) are equivalent to the Kalman Filter update step.

Prove that the covariance (5) equals the Kalman filter covariance update

This section is inspired by https://users.ece.cmu.edu/~byronyu/papers/derive_ks.pdf,
where Byron Yu derives the optimality condition by matching Gaussians.
I remember Simo Särkkä has a very good proof, but I can’t find it…

We want to show,

\(P_i = (H^T A H + B)^{-1}\) is equivalent to the Kalman filter covariance update \(P_i = (I - K_i H) \hat{P}_i\).

Start with,

\((H^T A H + B)^{-1}\)

By the Woodbury matrix identity

( https://en.wikipedia.org/wiki/Woodbury_matrix_identity ),

\(= B^{-1} - B^{-1}H^T(A^{-1} + HB^{-1} H^T)^{-1} H B^{-1}\)

\(= (I - B^{-1}H^T(A^{-1} + HB^{-1}H^T)^{-1} H) B^{-1}\)

Note: \(A = {S_i}^{-1}\), \(B = (F P_{i-1} F^T)^{-1} = \hat{P}_i^{-1} \).

\(= (I - \hat{P}_i H^T({S_i} + H \hat{P}_i H^T)^{-1} H) \hat{P}_i\)

Let, \(K = \hat{P}_i H^T({S_i} + H \hat{P}_i H^T)^{-1} \)

\(= (I - K H) \hat{P}_i\)

where \(K\) is exactly the Kalman gain. This is exactly the covariance update step.

We proved that the inverse Hessian of the Kalman cost function is the updated covariance.

Prove the state update (6) equals the Kalman filter state update

Here is the full proof. I don’t want to skip a single step, since that’s where confusion arises.

I got an iPad Pro, so let me do it on my new device! The resolution decreased when I converted my notes into images… but they are still readable.

We proved that the solution of the Kalman filter cost optimization is the Kalman state update, \(\text{argmin} (\text{kalman-filter-cost}(x_i)) = \hat{x}_i + K_i ( z_i - H \hat{x}_i )\).
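
Here is a numerical sanity check (hypothetical small matrices) that the closed-form solution (6) and the inverse Hessian (5) match the classical predict/update result for both the state and the covariance:

import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2
F = np.eye(n) + 0.1 * rng.standard_normal((n, n))
H = rng.standard_normal((m, n))
G = rng.standard_normal((n, n))
P_prev = G @ G.T + n * np.eye(n)        # previous covariance (SPD)
S = np.diag([0.5, 0.8])                 # observation covariance
x_prev = rng.standard_normal(n)
z = rng.standard_normal(m)

# Classical predict + update (no process noise, as in the text).
x_hat = F @ x_prev
P_hat = F @ P_prev @ F.T
K = P_hat @ H.T @ np.linalg.inv(H @ P_hat @ H.T + S)
x_kf = x_hat + K @ (z - H @ x_hat)
P_kf = (np.eye(n) - K @ H) @ P_hat

# Least squares closed form: A = S^{-1}, B = (F P_prev F^T)^{-1}.
A = np.linalg.inv(S)
B = np.linalg.inv(P_hat)
P_ls = np.linalg.inv(H.T @ A @ H + B)            # equation (5)
x_ls = P_ls @ (B @ x_hat + H.T @ A @ z)          # equation (6)

print(np.allclose(x_kf, x_ls), np.allclose(P_kf, P_ls))   # True True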

Summary

The Kalman filter,

Predict
Next state prediction: \(\hat{x}_{i} = F x_{i-1}\).
Next Covariance prediction: \(\hat{P}_{i} = F P_{i-1} F^T\)
Update
Compute Kalman gain: \(K_i = \hat{P}_i H^T (H \hat{P}_i H^T + S_i)^{-1}\).
Update state: \(x_i = \hat{x}_i + K_i ( z_i - H \hat{x}_i )\).
Update state covariance: \(P_i = (I - K_i H) \hat{P}_i\)

is equivalent to solving this optimization problem,

\(\text{kalman-filter-cost}(x_i) = ||z_i - H x_i ||^2_{{S_i}^{-1}} + ||F^{-1}x_i - x_{i-1} ||^2_{{P_{i-1}}^{-1}} \\ = ||z_i - H x_i ||^2_{(S_i)^{-1}} + || x_i - F x_{i-1}||^2_{(F P_{i-1} F^T)^{-1}}\)

The solution can be computed in closed form,

\(x_i^* = (H^T A H + B)^{-1} (B \hat{x_i} + H^T A z_i) \)

\(\text{inv}(cov) = \text{hessian}(\text{kalman-filter-cost}(x_i^*)) = H^T A H + B\)

where \(A = {S_i}^{-1}\), \(B = (F P_{i-1} F^T)^{-1} \).

Implementation

To show that the Kalman optimization problem is equivalent to the classical Kalman update, I implemented both and compared them.

The Model

The implementation of the prediction and the observation model.

import math
import numpy as np

# Flag for the observation model; the experiment below uses the linear case (False).
NON_LINEAR_EXPERIMENT = False

class PointModel:
    """
    x = f(x)

    y = h(x)

    """
    DT = 0.1
    NUM_STATES = 5
    NUM_OBSERVATION = 2
    STATE_NAME = ['x', 'y', 'v', 'theta', 'theta_dot']

    @staticmethod
    def unpack_state(state):
        x, y, v, theta, theta_dot = state
        return x, y, v, theta, theta_dot

    def f(self, state):
        """
            x_next = f(x)
        """
        x, y, v, theta, theta_dot = self.unpack_state(state)
        return np.array([
            x + self.DT * math.cos(theta) * v,
            y + self.DT * math.sin(theta) * v,
            v,
            theta + self.DT * theta_dot,
            theta_dot
        ])
    
    def f_jacobian(self, state):
        """
            df/dx
        """
        x, y, v, theta, theta_dot = self.unpack_state(state)
        return np.array([
            [1, 0, self.DT * math.cos(theta), - self.DT * math.sin(theta) * v, 0],
            [0, 1, self.DT * math.sin(theta), self.DT * math.cos(theta) * v, 0],
            [0, 0, 1, 0, 0],
            [0, 0, 0, 1, self.DT],
            [0, 0, 0, 0, 1.]
        ])

    def f_weight(self):
        return np.diag([1, 1, 10, 10, 10])

    def f_cov(self):
        return np.linalg.inv(self.f_weight())

    def h(self, state):
        """
          z = h(x)
        """
        x, y, v, theta, theta_dot = self.unpack_state(state)
        if NON_LINEAR_EXPERIMENT:
            return np.array([x * x, y * y])
        else:
            return np.array([x, y])

    def h_jacobian(self, state):
        """
            dh/dx
        """
        x, y, v, theta, theta_dot = self.unpack_state(state)
        if NON_LINEAR_EXPERIMENT:
            return np.array([
                [2 * x, 0, 0, 0, 0],
                [0, 2 * y, 0, 0, 0],
            ])
        else:
            return np.array([
                [1, 0, 0, 0, 0],
                [0, 1, 0, 0, 0],
            ])

    def h_weight(self):
        return np.array([
            [5., 0],
            [0, 5.]
        ])

    def h_cov(self):
        return np.linalg.inv(self.h_weight())


def get_model():
    # Helper used by the filters below (assumed to return the shared model).
    return PointModel()

The model is a point kinematic model with position observations.

Note the NON_LINEAR_EXPERIMENT is False in this experiment.

The classical Kalman Filter

class ExtentedKalmanFilter:

    def __init__(self, prior_state, prior_cov):
        self.state = prior_state
        self.cov = prior_cov

    def predict(self):
        model = get_model()

        # Linearize at the previous state, then propagate the state and covariance.
        f_jacobian = model.f_jacobian(self.state)
        self.state = model.f(self.state)
        self.cov = f_jacobian @ self.cov @ f_jacobian.T + model.f_cov()

    def update(self, measurement):
        model = get_model()
        from numpy.linalg import inv

        innovation = measurement - model.h(self.state)
        h_jacobian = model.h_jacobian(self.state)
        Sk = h_jacobian @ self.cov @ h_jacobian.T + model.h_cov()
        K = self.cov @ h_jacobian.T @ inv(Sk)

        self.state = self.state + K @ innovation
        self.cov = (np.identity(model.NUM_STATES) - K @ h_jacobian) @ self.cov

    def filter(self, measurement):
        self.predict()
        self.update(measurement)

Nothing special. I just followed the Kalman update and observation equations. The implementation works for both the nonlinear and the linear case.

The Kalman filter as an Optimization

class KalmanLeastSqaures:
    """
    cost(x) = ||x - f(x_prev)||^2_{(F P F^T + Q)^{-1}} + ||h(x) - z||^2_{R^{-1}}
    """
    def __init__(self, prior_state, prior_cov, max_iter = 10):
        self.state = prior_state
        self.cov = prior_cov
        self.max_iter = max_iter

    def filter(self, measurement):
        prior_state = self.state
        prior_cov = self.cov

        self.least_sqr_optimization(prior_state, prior_cov, measurement)

    def compute_jacobian_and_residual(self, current_varialbes,
                                      prior_state, prior_cov, measurement):
        model = get_model()
        num_state = model.NUM_STATES
        num_variables = num_state
        num_measurements = model.NUM_OBSERVATION

        num_prior_equations = model.NUM_STATES
        num_observation_equations = model.NUM_OBSERVATION
        num_equations = num_prior_equations + num_observation_equations

        jacobian = np.zeros([num_equations, num_variables])
        weight = np.zeros([num_equations, num_equations])
        residual = np.zeros([num_equations, 1]).squeeze()

        n = 0
        f_jacobian_prev = model.f_jacobian(prior_state)
        # prior
        jacobian[n: num_prior_equations, :] = np.identity(num_state)
        residual[n:num_prior_equations] = current_varialbes - model.f(prior_state)
        weight[n:num_prior_equations, n:num_prior_equations] \
             = np.linalg.inv(f_jacobian_prev @ prior_cov @ f_jacobian_prev.T + model.f_cov())
        n += num_prior_equations

        # observation
        jacobian[n: n + num_observation_equations, :] = model.h_jacobian(current_varialbes)
        residual[n: n + num_observation_equations] = model.h(current_varialbes) - measurement
        weight[n: n + num_observation_equations, n: n + num_observation_equations] = model.h_weight()
        n += num_observation_equations

        assert n  == num_equations

        return jacobian, residual, weight
    
    def construct_normal_equation(self, jacobian, residual, weight):
        J_transpose_dot_weight = np.dot(jacobian.T, weight)
        norm_equation_left = np.dot(J_transpose_dot_weight, jacobian)
        norm_equation_right = - np.dot(J_transpose_dot_weight, residual)
        return norm_equation_left, norm_equation_right


    def least_sqr_optimization(self, prior_state, prior_cov, measurement):
        model = get_model()
        variables = model.f(prior_state)

        norm_equation_left = prior_cov.copy()
        for iter in range(self.max_iter):
            # Linearize the cost around the current variables.
            jacobian, residual, weight \
                = self.compute_jacobian_and_residual(variables, prior_state, prior_cov, measurement)

            # Gauss-Newton normal equations: (J^T W J) dx = -J^T W r
            norm_equation_left, norm_equation_right \
                = self.construct_normal_equation(jacobian, residual, weight)

            dx = np.linalg.solve(norm_equation_left, norm_equation_right)

            LAMBDA = 1.  # full Gauss-Newton step

            variables += LAMBDA * dx


        self.state = variables
        self.cov = np.linalg.inv(norm_equation_left)

The implementation is longer, but it is clear and straightforward. We simply solve a least squares problem in the Gauss-Newton framework.

The Comparison

Result of the classical Kalman Filter
Result of the Kalman Filter Optimization problem

The classical Kalman Filter is equivalent to the Kalman filter optimization problem (with 1 iteration).

Why 1 iteration? I will prove it in the next post and show that the Extended Kalman Filter is sub-optimal.

The Benefits of Optimization

Viewing the Kalman filter as an optimization problem simplifies many problems.

Nonlinear function

The EKF is sub-optimal since it only does the Gauss-Newton update for a single iteration.

Constraints

There is no simple theory for handling constraints in the classical Kalman Filter framework. In the least squares convex optimization framework, I can use any convex optimization method to handle constraints.
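
As one possible sketch (not the post's implementation): the two quadratic terms of the single-step cost can be stacked into one \(||Ax - b||^2\) using Cholesky factors of the weights, and handed to a bounded least squares solver such as scipy.optimize.lsq_linear. The numbers below are hypothetical.

import numpy as np
from scipy.optimize import lsq_linear

# Single-step Kalman least squares with a box constraint on the state.
H = np.array([[1.0, 0.0]])
S = np.array([[0.25]])                 # observation covariance
x_hat = np.array([0.9, 0.1])           # predicted state F x_{i-1}
P_hat = np.diag([0.5, 0.5])            # predicted covariance F P F^T
z = np.array([1.4])                    # measurement

# Weight square roots: r^T W r = ||L^T r||^2 with W = L L^T.
L_s = np.linalg.cholesky(np.linalg.inv(S))
L_p = np.linalg.cholesky(np.linalg.inv(P_hat))

A = np.vstack([L_s.T @ H, L_p.T])
b = np.concatenate([L_s.T @ z, L_p.T @ x_hat])

# Constrain the first state component to stay in [0, 1].
res = lsq_linear(A, b, bounds=([0.0, -np.inf], [1.0, np.inf]))
print(res.x)     # the constrained state estimate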

Covariance tuning

Covariance tuning is simply tuning the weights of the quadratic cost terms. One can argue that the exact numbers are not important, but we need to make sure the relative scales are consistent.

Covariance learning

Covariance learning becomes clear. All we need to do is fit a quadratic (a Gaussian) to the residual data \((z_i - H x_i)\). By maximum likelihood, \(W^{*} = \text{argmax}_W \sum_i \left( \log\det W - ||z_i - H x_i||^2_W \right)\), whose solution is the inverse of the sample covariance of the residuals.
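
A minimal sketch with hypothetical data: for Gaussian residuals, the maximum-likelihood weight is the inverse of the sample covariance of the residuals \(z_i - H x_i\).

import numpy as np

rng = np.random.default_rng(0)
H = np.array([[1., 0., 0.],
              [0., 1., 0.]])
true_cov = np.array([[0.2, 0.05],
                     [0.05, 0.1]])
x = rng.standard_normal((5000, 3))                              # states (assumed known)
noise = rng.multivariate_normal(np.zeros(2), true_cov, 5000)    # measurement noise
z = x @ H.T + noise                                             # measurements

residuals = z - x @ H.T
S_learned = residuals.T @ residuals / len(residuals)            # sample covariance
W_learned = np.linalg.inv(S_learned)                            # learned weight
print(S_learned)   # close to true_cov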

Prediction function learning

Same as covariance learning. We first choose some parameterization of the prediction function \(f(x, \theta)\). Then we learn it from data: \(\theta^{*} = \text{argmin}_{\theta} \sum_i ||f(x_{i-1}, \theta) - x_i||^2_W \).
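
A minimal sketch with hypothetical data, using a linear parameterization \(f(x, \theta) = F x\): ordinary least squares recovers the transition matrix from consecutive state pairs.

import numpy as np

rng = np.random.default_rng(0)
F_true = np.array([[1.0, 0.1],
                   [0.0, 0.95]])
x = np.zeros((500, 2))
x[0] = [1.0, 0.5]
for i in range(1, 500):
    x[i] = F_true @ x[i - 1] + 0.05 * rng.standard_normal(2)    # noisy dynamics

X_prev, X_next = x[:-1], x[1:]
# Solve min_F sum_i ||F x_{i-1} - x_i||^2 (row form: x_{i-1}^T F^T ≈ x_i^T).
F_learned = np.linalg.lstsq(X_prev, X_next, rcond=None)[0].T
print(F_learned)   # approximately F_true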

I learned the Kalman Filter in the probabilistic framework. The classical Kalman Filter is beautiful, but the probabilistic framework is rigid and hard to apply to real-world problems. (In fact, people like to write the theory probabilistically, but the implementation is usually done by optimization.)


Optimization is a better framework to work with.

The prediction noise

I ignored the prediction noise in the proof. Here is a supplement proving the prediction step with noise and a prior (or cost-to-go).

We showed that \(\text{min}_{x_{n-1}} \left( ||x_{n-1} - x_{n-1}^*||^2_{P^{-1}} + ||x_{n} - F x_{n-1}||^2_{Q^{-1}} \right)\) equals \(||x_{n} - F x_{n-1}^* ||^2_{({FPF^T + Q})^{-1}}\).
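
A numerical check of this identity with hypothetical matrices (writing \(m = x_{n-1}^*\)):

import numpy as np

rng = np.random.default_rng(2)
n = 3
F = np.eye(n) + 0.1 * rng.standard_normal((n, n))
P = np.diag([1.0, 2.0, 0.5])
Q = 0.3 * np.eye(n)
m = rng.standard_normal(n)      # x_{n-1}^*, center of the previous cost-to-go
x = rng.standard_normal(n)      # current state x_n

# Minimize the left-hand side over y = x_{n-1} via its normal equations.
Pi, Qi = np.linalg.inv(P), np.linalg.inv(Q)
y = np.linalg.solve(Pi + F.T @ Qi @ F, Pi @ m + F.T @ Qi @ x)
lhs = (y - m) @ Pi @ (y - m) + (x - F @ y) @ Qi @ (x - F @ y)

rhs = (x - F @ m) @ np.linalg.inv(F @ P @ F.T + Q) @ (x - F @ m)
print(np.isclose(lhs, rhs))     # True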

Why did I ignore the prediction noise?

The Kalman Filter without prediction noise is equivalent to a famous motion planning algorithm: Differential Dynamic Programming.

The big picture of Kalman Filter

This section is optional. The goal is to build the connection between the Kalman Filter and a planning algorithm called Differential Dynamic Programming.

Using cost-to-go to solve quadratic optimization

In the fixed-lag smoothing article, I showed that for a quadratic cost function,

\(\text{total_cost}({x_0, x_1, … x_n}) = \sum\limits_{i=0}^{n-1} \text{cost_term}_i (x_i, x_{i+1})\)

We can solve the optimization efficiently by tracking a cost-to-go.


A quick refresher for the problem. Read this article for a detailed explanation.

The quadratic cost function is equivalent to,

\(\text{total_cost}({x_0, x_1, … x_n}) = \sum\limits_{i=0}^{n-1} \text{cost_term}_i (x_i, x_{i+1})
\\ \Leftrightarrow \text{min}_{x_0, x_1, … x_{k-1}}(\sum\limits_{i=0}^{k-1} \text{cost_term}_i (x_i, x_{i+1})) + \sum\limits_{i=k}^{n-1} \text{cost_term}_i (x_i, x_{i+1})\)

(The \(\Leftrightarrow\) means equivalence with respect to the argmin. Please check this post for details.)

We can define the first part of the above equation as the cost-to-go. The cost-to-go is a function of a single variable. It is basically a cost function with older variables eliminated.

\(\text{cost_to_go}_k(x_k) := min_{x_0, x_1, … x_{k-1}}\sum\limits_{i=0}^{k-1} ( \text{cost_term}_i (x_i, x_{i+1}))\)

The total cost can be written as,

\(\text{total_cost}({x_k, … x_n}) \Leftrightarrow \text{cost_to_go}_k(x_k) + \sum\limits_{i=k}^{n-1} \text{cost_term}_i (x_i, x_{i+1})\)

When \(k = n\), for the last cost-to-go,

\(\text{total_cost}({x_k, … x_n}) \Leftrightarrow \text{cost_to_go}_{n}(x_{n}) \; \; \; (2.1)\)

The key observation was, we can compute cost-to-go recursively,

\(\text{cost_to_go}_i(x_i) = \text{min}_{x_{i-1}} ( \text{cost_to_go}_{i-1}(x_{i-1}) + \text{cost_term}_{i-1} (x_{i-1}, x_{i}) ) \; \; \; (2.2)\)

With the initial value for the cost-to-go \(\text{cost_to_go}_{0}(x_{0}) =0\)

Because the total-cost (2.1) can be expressed by a cost-to-go, we can use (2.2) to compute the total-cost recursively. In the end, we solve (2.2) to get the \(x_n^*\).

Compared to solving the whole problem (2.1) at once, solving (2.2) n times is much faster. In fact, we can solve the (block-tridiagonal) problem in \(O(n)\) time instead of \(O(n^3)\) time.

The Kalman filter does exactly this.

Kalman Filter as cost-to-go optimization

The above derivation is exactly the same as for the Kalman Filter.

The cost function for the full Kalman Filter is,

\(\text{full_kalman_filter}({x_0, x_1, … x_n}) = \text{cost_prior}(x_0) + \sum\limits_{i=0}^{n-1} \text{cost_term}_i (x_i, x_{i+1})\)

For Kalman Filter the \(\text{cost_term}_i (x_i, x_{i+1})\) is,

\( \text{cost_term}_i (x_i, x_{i+1}) = ||x_{i+1} - F x_i||^2_Q + ||H x_{i+1} - h||^2_R\)

\(||x_{i+1} - F x_i||^2_Q\) is the processing cost. It soft-constrains the predicted state \(\hat{x}_{i+1} = F x_i\) to be close to the next \(x_{i+1}\).

\(||H x_{i+1} - h||^2_R\) is the measurement cost. It wants the predicted measurement \(H x_{i+1}\) to be close to the true measurement \(h\).

To estimate the optimal current state, we need to solve (2.2).

\(\text{cost_to_go_kalman_filter}_{n}(x_{n})
\\ = \text{min}_{x_{n-1}}( \text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) + \text{cost_term}_{n-1} (x_{n-1}, x_{n}) ) \\ = \text{min}_{x_{n-1}}( \text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) + ||x_{n} - F x_{n-1}||^2_Q + ||H x_{n} - h||^2_R ) \)

Since \(\text{cost_to_go_kalman_filter}_{n-1}(x_{n-1})\) is quadratic, let \(\text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) = ||x_{n-1} - x_{n-1}^*||^2_M\), where \(x_{n-1}^*\) is the center of the quadratic and \(M\) is its Hessian. It’s just a Gaussian with a different name.

The above equation becomes,

\(\text{cost_to_go_kalman_filter}_{n}(x_{n}) = \text{min}_{x_{n-1}}( ||x_{n-1} - x_{n-1}^*||^2_M + ||x_{n} - F x_{n-1}||^2_Q + ||H x_{n} - h||^2_R ) \; \; \; (3.1)\)

The optimal solution is given by,

\(x_n^* = \text{argmin }\text{cost_to_go_kalman_filter}_{n}(x_{n})\).

This is equivalent to the prediction and update step for the Kalman Filter.

What’s Next?

In this post, we showed that the Kalman Filter is an optimization problem. In the next posts I will show,

  1. The Extended Kalman Filter is sub-optimal.
  2. Differential Dynamic Programming.
