———-work in process————
Prerequisite
- Familiar with Kalman Filter is nessesary.
- FamiliarMatrix computation.
What’s in this article?
This article shows Kalman Filter is a Least Squares problem.
Using the least squares interpretation, we can answer some questions easily. e.g. How to tune the covariance of a Kalman filter? How to learn the prediction function?
1. Kalman Filter Refresher
Assuming, the current state is given by \(x_i\), It’s current covariance is given by \(P_i\).
The state runs as a linear system. e.g. the next state is given by \(x_{i+1} = F x_i\). I will Ignore process noise for now.
We have a observation function of current system, \(z_i = H x_i\) , with observation covariance \(S_i\)
Using probability as a tool, the Kalman filter consists of 2 parts. (I ignored the prediction noise for now. I will become clear later.)
Prediction
Next state prediction: \(\hat{x}_{i} = F x_{i-1}\).
Next Covariance prediction: \(\hat{P}_{i} = F x_{i-1} F^T\)
Update
Compute Kalman gain: \(K_i = \hat{P}_i H^T( H \hat{P}_i H^T + S_i)^{-1}\).
Update state: \(x_i = \hat{x}_i + K_i ( z_i – H \hat{x}_i )\).
Update state covariance: \(P_i = (I – K_i H) \hat{P}_i\)
2. Fixed-Lag Smoothing Refresher and new interpretation
In the fixed-lag smoothing article, assuming the cost is quadratic, I showed that the cost function for the fixed-lag smoothing is,
\(\text{total_cost}({x_0, x_1, … x_n}) = \sum\limits_{i=0}^n \text{cost_term}_i (x_i, x_{i+1})\)
The fixed-lag smoothing solves the problem iteratively by breaking the \(\text{total_cost}\) into small optimization problem.
\(\text{total_cost}({x_1, … x_n})
\\ = \sum\limits_{i=0}^n \text{cost_term}_i (x_i, x_{i+1})
\\ \Leftrightarrow \text{min}_{x_0} (\text{cost_term}_0 (x_0, x_{1}) + \text{cost_term}_1 (x_1, x_{2}) ) + \sum\limits_{i=2}^n \text{cost_term}_i (x_i, x_{i+1})
\\ \Leftrightarrow \text{min}_{x_0, x_1} (\text{cost_term}_0 (x_0, x_{1}) + \text{cost_term}_1 (x_1, x_{2}) + \text{cost_term}_2 (x_2, x_{3}) ) + \sum\limits_{i=3}^n \text{cost_term}_i (x_i, x_{i+1})
\\ \Leftrightarrow min_{x_0, x_1, … x_{k-1}}\sum\limits_{i=0}^{k-1} \text{cost_term}_i (x_i, x_{i+1}) + \sum\limits_{i=k}^n \text{cost_term}_i (x_i, x_{i+1})\)
(The \(\Leftrightarrow\) mean equivalence in the respect of argmin. Please check this post for detail.)
We can define the first part of the above equation as the cost-to-go.
\(\text{cost_to_go}_k(x_k) := min_{x_0, x_1, … x_{k-1}}\sum\limits_{i=0}^{k-1} \text{cost_term}_i (x_i, x_{i+1})\)
If all costs are quadratic, follow the derivation we did in this article, we can show the cost-to-go is a quadratic form.
The total cost can be written as,
\(\text{total_cost}({x_k, … x_n}) \Leftrightarrow \text{cost_to_go}_k(x_k) + \sum\limits_{i={k+1}}^{n-1} \text{cost_term}_i (x_i, x_{i+1}) \; \; \; (2.1)\)
Since cost terms are quadratic, we can compute cost-to-go iteratively,
\(\text{cost_to_go}_i(x_i) = \text{min}_{x_{i-1}} ( \text{cost_to_go}_{i-1}(x_{i-1}) + \text{cost_term}_{i-1} (x_{i-1}, x_{i}) ) \; \; \; (2.2)\)
With the initial value for the cost-to-go \(\text{cost_to_go}_{0}(x_{0}) =0\)
Because the total-cost (2,1) can be expressed by cost-to-go, we can use (2.2) to compute the total-cost iteratively.
For example, the final cost can be expressed as,
\(\text{total_cost}({x_k, … x_n}) \Leftrightarrow \text{cost_to_go}_{n}(x_{n})\)
The \(\text{cost_to_go}_{n}(x_{n}) \) can be computed as,
\(\text{cost_to_go}_{n}(x_{n}) = \text{min}_{x_{n-1}}( \text{cost_to_go}_{n-1}(x_{n-1}) + \text{cost_term}_{n-1} (x_{n-1}, x_{n}) ) \; \; \; (2.3)\)
This mean to compute \(x_n^*\), we just to need keep track of the \(\text{cost_to_go}_{n-1}(x_{n-1})\), and solve (2.3).
That’s what fixed-lag smoothing does.
It’s also what Kalman Filter does.
3. The big picture of Kalman Filter
The above derivation is exactly the same as for Kalman Filter
The cost function for the full Kalman Filter is,
\(\text{full_kalman_filter}({x_0, x_1, … x_n}) = \text{cost_prior}(x_0) \sum\limits_{i=0}^n \text{cost_term}_i (x_i, x_{i+1})\)
For Kalman Filter the \(cost_term_i (x_i, x_{i+1})\) is,
\( cost_term_i (x_i, x_{i+1}) = ||x_{i+1} – F x_i||^2_Q + ||H x_{i+1} – h||^2_R\)
\(||x_{i+1} – F x_i||^2_Q\) is the processing cost. It soft-constrains the predicted state \(\hat{x_{i+1}} = F x_i\) to be close to the next \(x_{i+1}\).
\(||H x_{i+1} – h||^2_R\) is the measurement cost. It penalties the predicted measurement \(\hat{H x_{i+1}\) to be close to the true measurement \(h\).
To estimate the optimal current state, we need to solve (2.3). In the context of Kalman Filter, it is,
\(\text{cost_to_go_kalman_filter}_{n}(x_{n})
\\ = \text{min}_{x_{n-1}}( \text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) + \text{cost_term}_{n-1} (x_{n-1}, x_{n}) ) \\ = \text{min}_{x_{n-1}}( \text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) + ||x_{i+1} – F x_i||^2_Q + ||H x_{i+1} – h||^2_R ) \)
Since, \(\text{cost_to_go_kalman_filter}(n-1)\) is a quadratic. Let \(\text{cost_to_go_kalman_filter}(n-1) = ||x_{n-1} – x_{n-1}^*||_M\). The above equation becomes,
\(\text{cost_to_go_kalman_filter}_{n}(x_{n}) = \text{min}_{x_{n-1}}( ||x_{n-1} – x_{n-1}^*||_M + ||x_{i+1} – F x_i||^2_Q + ||H x_{i+1} – h||^2_R ) \; \; \; (3.1)\)
The optimal solution of \(\text{cost_to_go_kalman_filter}_{n}(x_{n})\) is the best estimation of the current state.
\(x_n^* = \text{argmin}\text{cost_to_go_kalman_filter}_{n}(x_{n})\).
This is equivalent to the predict/update step for the Kalman Filter.
Let me prove it in the next section.
Kalman Filter as a least squares problem
The classical Kalman filter is given by the predict and the update steps.
Prediction
Next state prediction: \(\hat{x}_{i} = F x_{i-1}\).
Next Covariance prediction: \(\hat{P}_{i} = F x_{i-1} F^T\)
Update
Compute Kalman gain: \(K_i = \hat{P}_i H^T( H \hat{P}_i H^T + S_i)^{-1}\).
Update state: \(x_i = \hat{x}_i + K_i ( z_i – H \hat{x}_i )\).
Update state covariance: \(P_i = (I – K_i H) \hat{P}_i\)
In this section, I will prove the predict step and the update step is equivalent to a least squares problem.
Definition
Given the previous state \(x_{i-1}\) and previous covariance \(P_{i-1}\), the cost function for Kalman filter is,
\( \text{kalman-filter-cost}(x_i) = \text{observation-cost}(x_i) + \text{prior-cost}(x_i) \),
where,
\(\text{observation-cost}(x_i) = ||z_i – H x_i ||^2_{{S_i}^{-1}} \),
\( \text{prior-cost}(x_i) = ||F^{-1}x_i – x_{i-1} ||^2_{{P_{i-1}}^{-1}} \),
\(\text{kalman-filter-cost}(x_i) = ||z_i – H x_i ||^2_{{S_i}^{-1}} + ||F^{-1}x_i – x_{i-1} ||^2_{{P_{i-1}}^{-1}} \)
In English, we want,
- State observation \(H x_i\) close to true observation \(z_i\).
- “predicted previous state” \(F^{-1}x_i \) close to previous state \(x_{i-1}\) weighted by inverse of previous state covariance.
Let’s show the cost function is equivalent to the Kalman Filter.
1. Prove the Prediction
(This may be confusing at first, but It will be clear soon)
Assuming Matrices are invertable. (non-invertable matrix usually means you are doing something weird)
Goal: We want to show,
\(||F^{-1}x_i – x_{i-1} ||^2_{(P_{i-1})^-1} \stackrel{?}{=} || x_i – F x_{i-1}||^2_{(F P_{i-1} F^T)^{-1}} \; \; \; (1)\)
In English, the prior cost is equivalent to a soft constraint which want the predicted current state \(F x_{i-1}\) close to the current state \(x_i\) weighted by \({(F P_{i-1} F^T)^{-1}}\).
Please compare it with the prediction step of classical Kalman Filter.
Prediction
Next state prediction: \(\hat{x}_{i} = F x_{i-1}\).
Next Covariance prediction: \(\hat{P}_{i} = F x_{i-1} F^T\)
Let’s start the prove!
Let \(W_{i-1} = {P_{i-1}}^{-1}\) , Start with left-hand-side of (1),
\(||F^{-1}x_i – x_{i-1} ||^2_{W_{i-1}} \)
\(= (F^{-1}x_i – x_{i-1})^T W_{i-1} (F^{-1}x_i – x_{i-1}) \)
Times \( (F^{-1} F) \) with everything inside the quadratic.
\(= ( (F^{-1} F) (F^{-1}x_i – x_{i-1}))^T W_{i-1} ( (F^{-1} F) (F^{-1}x_i – x_{i-1})) \\ = ( (F^{-1}) (x_i – F x_{i-1}))^T W_{i-1} ( (F^{-1}) (x_i – F x_{i-1})) \\ = (x_i – F x_{i-1}))^T ( (F^{-1})^T W_{i-1} (F^{-1}) (x_i – F x_{i-1}) \; \; \; (2)\)
Let’s prove a identity first,
\((F^{-1})^T W_{i-1} (F^{-1}) \stackrel{?}{=} (F P_{i-1} F^T)^{-1} \; \; \; (3) \\ (F^{-1})^T W_{i-1} (F^{-1}) (F P_{i-1} F^T) \stackrel{?}{=} \text{Identity} \)
Since \(W_{i-1} = {P_{i-1}}^{-1}\), everything in the LHS canceled out. (3) is proved.
Plug (3) into (2)
\((2) = (x_i – F x_{i-1}))^T (F P_{i-1} F^T)^{-1} (x_i – F x_{i-1}) = \text{rhs of} (1)\)
We proved (1).
What does equation (1) mean?
\(\text{prior-cost}(x_i) = || x_i – F x_{i-1}||^2_{(F P_{i-1} F^T)^{-1}} \)
In English, it means the current state \(x_i\) should be close to the prediction of previous state \(F x_{i-1}\) weighted by the inverse of \(F P_{i-1} F^T \).
It is the prediction step of the Kalman filter in the context of least squares.
2. Prove the Observation, the solution of the least squares problem
Use equation (1), the cost function for Kalman Filter is,
\(\text{kalman-filter-cost}(x_i) = ||z_i – H x_i ||^2_{(S_i)^{-1}} + || x_i – F x_{i-1}||^2_{(F P_{i-1} F^T)^{-1}} \)
We want to prove:
- The \(x_i^* = \text{argmin} (\text{kalman-filter-cost}(x_i)) \) is the Kalman Filter state update for state \(x_i\). i.e. \(x_i^* = \hat{x}_i + K_i ( z_i – H \hat{x}_i )\)
- The \(\text{Hessian} (\text{kalman-filter-cost})( x_i^* ) \) (The hessian of the cost function evaluated at \(x_i^*\)) is the Kalman filter covariance update step. i.e. \(\text{The-Inverse-Hessian} (\text{kalman-filter-cost})( x_i^* ) = (I – K_i H) \hat{P}_i\)
Let’s start the prove!
The prove them, we just need to solve the least squares problem.
Start with the optimality condition of the quadratic cost,
\(\frac{ \partial{\text{kalman-filter-cost}}} {\partial{x_i}} =\boldsymbol{0}\)
Let \(A = {S_i}^{-1}\), and \(B = (F P_{i-1} F^T)^{-1} \),
The optimality condition for the quadratic cost is,
\(\frac{ \partial{\text{kalman-filter-cost}}} {\partial{x_i}} = \boldsymbol{0}\)
\(\frac{ \partial{ ||z_i – H x_i ||^2_{A} }} {\partial{x_i}} + \frac{ \partial{|| x_i – F x_{i-1}||^2_{B}}} {\partial{x_i}} = \boldsymbol{0}\)
\( -(z_i – H x_i)^T A^T H + (x_i – F x_{i-1})^T B^T = \boldsymbol{0}\)
Let \(\hat{x_i} = F x_{i-1}\) , and take the transpose.
We got the system of equation for the optimality condition,
\(-H^T A z_i + H^T A H x_i + B x_i – B \hat{x_i} = \boldsymbol{0} \)
\( (H^T A H + B) x_i = B \hat{x_i} + H^T A z_i \)
The optimal variable is,
\(x^* = \text{argmin}(\text{cost-kalman-filter}(x_i)) = (H^T A H + B)^{-1} (B \hat{x_i} + H^T A z_i) \; \; \; (6) \)
The Covariance (the inverse hessian) of \(x_i^*\) is,
\(P_i = (H^T A H + B)^{-1} \; \; \; (5) \)
Now, let’s show (5) and (6) are equivalent to the Kalman Filter update step.
Prove the covariance (5) equals to Kalman filter covariance update
Where Byron Yu derive the optimality condition by matching Gaussians.
We want to show,
\(P_i = (H^T A H + B)^{-1}\) is equvelent to Kalman filter covariance update \(P_i = (I – K_i H) \hat{P}_i\)
Start with,
\((H^T A H + B)^{-1}\)
By the Woodbury matrix identity
( https://en.wikipedia.org/wiki/Woodbury_matrix_identity ),
\(= B^{-1} – B^{-1}H^T(A^{-1} + HB^{-1} H^T)^{-1} H B^{-1}\)
\(= (I – B^{-1}H^T(A^{-1} + HB^{-1}H^T)^{-1} H) B^{-1}\)
Note: \(A = {S_i}^{-1}\), \(B = (F P_{i-1} F^T)^{-1} = \hat{P}_i^{-1} \).
\(= (I – \hat{P}_i H^T({S_i} + H \hat{P}_i H^T)^{-1} H) \hat{P}_i\)
Let, \(K = \hat{P}_i H^T({S_i} + H \hat{P}_i H^T)^{-1} \)
\(= (1 – K H) \hat{P}_i\)
Where \(K\) is exact the Kalman gain. This is exact the update step for Covariance.
We proved the inverse hessian of kalman cost function is the covariance.
Prove the covariance (5) equals to Kalman filter covariance update
Here is the full prove. I don’t want to ignore a single step since that’s where confusions arise.
I got an iPad Pro, so let me do it on my new device! The resolution gets decreased when I convert my note into images… But it is still readable.
We proved the the solution of kalman filter cost optimization is the kalman update \(\text{argmin} (\text{kalman-filter-cost}(x_i)) = \hat{x}_i + K_i ( z_i – H \hat{x}_i )\)
Summary
The kalman filter
Predict
Next state prediction: \(\hat{x}_{i} = F x_{i-1}\).
Next Covariance prediction: \(\hat{P}_{i} = F x_{i-1} F^T\)
Update
Compute Kalman gain: \(K_i = \hat{P}_i H^T (H \hat{P}_i H^T + S_i)^{-1}\).
Update state: \(x_i = \hat{x}_i + K_i ( z_i – H \hat{x}_i )\).
Update state covariance: \(P_i = (I – K_i H) \hat{P}_i\)
Is equalvelent to solve this optimization problem,
\(\text{kalman-filter-cost}(x_i) = ||z_i – H x_i ||^2_{{S_i}^{-1}} + ||F^{-1}x_i – x_{i-1} ||^2_{{P_{i-1}}^{-1}} \\ = ||z_i – H x_i ||^2_{(S_i)^{-1}} + || x_i – F x_{i-1}||^2_{(F P_{i-1} F^T)^{-1}}\)
The solution can be compute is closed-form,
\(x_i^* = (H^T A H + B)^{-1} (B \hat{x_i} – H^T A z_i) \)
\(\text{inv}(cov) = \text{hessian}(\text{kalman-filter-cost}(x_i^*)) = H^T A H + B\)
Where \(A = S^{-1}\), \(B = (F P_{i-1} F^T)^{-1} \)
Implementation
To show Kalman Optimization problem is equivalent to the classical Kalman update. I implemented both of them and did a comparison.
The Model
The implementation of the prediction and the observation model.
class PointModel: """ x = f(x) y = h(x) """ DT = 0.1 NUM_STATES = 5 NUM_OBSERVATION = 2 STATE_NAME = ['x', 'y', 'v', 'theta', 'theta_dot'] @staticmethod def unpack_state(state): x, y, v, theta, theta_dot = state return x, y, v, theta, theta_dot def f(self, state): """ x_next = f(x) """ x, y, v, theta, theta_dot = self.unpack_state(state) return np.array([ x + self.DT * math.cos(theta) * v, y + self.DT * math.sin(theta) * v, v, theta + self.DT * theta_dot, theta_dot ]) def f_jacobian(self, state): """ df/dx """ x, y, v, theta, theta_dot = self.unpack_state(state) return np.array([ [1, 0, self.DT * math.cos(theta), - self.DT * math.sin(theta) * v, 0], [0, 1, self.DT * math.sin(theta), self.DT * math.cos(theta) * v, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, self.DT], [0, 0, 0, 0, 1.] ]) def f_weight(self): return np.diag([1, 1, 10, 10, 10]) def f_cov(self): return np.linalg.inv(self.f_weight()) def h(self, state): """ z = h(x) """ x, y, v, theta, theta_dot = self.unpack_state(state) if NON_LINEAR_EXPERIMENT: return np.array([x * x, y * y]) else: return np.array([x, y]) def h_jacobian(self, state): """ dh/dx """ x, y, v, theta, theta_dot = self.unpack_state(state) if NON_LINEAR_EXPERIMENT: return np.array([ [2 * x, 0, 0, 0, 0], [0, 2 * y, 0, 0, 0], ]) else: return np.array([ [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], ]) def h_weight(self): return np.array([ [5., 0], [0, 5.] ]) def h_cov(self): return np.linalg.inv(self.h_weight())
The model is a point kinematic model with position observations.
Note the NON_LINEAR_EXPERIMENT is False in this experiment.
The classical Kalman Filter
class ExtentedKalmanFilter: def __init__(self, prior_state, prior_cov): self.state = prior_state self.cov = prior_cov def predict(self): model = get_model() self.state = model.f(self.state) f_jacobian = model.f_jacobian(self.state) self.cov = f_jacobian @ self.cov @ f_jacobian.T + model.f_cov() def update(self, measurement): model = get_model() from numpy.linalg import inv innovation = measurement - model.h(self.state) h_jacobian = model.h_jacobian(self.state) Sk = h_jacobian @ self.cov @ h_jacobian.T + model.h_cov() K = self.cov @ h_jacobian.T @ inv(Sk) self.state = self.state + K @ innovation self.cov = (np.identity(model.NUM_STATES) - K @ h_jacobian) @ self.cov def filter(self, measurement): self.predict() self.update(measurement)
Nothing special. I just followed the Kalman update and observer equations. The implementation works for both nonlinear and linear cases.
The Kalman filter as an Optimization
class KalmanLeastSqaures: """ cost(x) = ||F^-1 x - x_pre||^2_{FWF^T + Q}^{-1} + ||z - Hx||^2_R^{-1} """ def __init__(self, prior_state, prior_cov, max_iter = 10): self.state = prior_state self.cov = prior_cov self.max_iter = max_iter def filter(self, measurement): prior_state = self.state prior_cov = self.cov self.least_sqr_optimization(prior_state, prior_cov, measurement) def compute_jacobian_and_residual(self, current_varialbes, prior_state, prior_cov, measurement): model = get_model() num_state = model.NUM_STATES num_variables = num_state num_measurements = model.NUM_OBSERVATION num_prior_equations = model.NUM_STATES num_observation_equations = model.NUM_OBSERVATION num_equations = num_prior_equations + num_observation_equations jacobian = np.zeros([num_equations, num_variables]) weight = np.zeros([num_equations, num_equations]) residual = np.zeros([num_equations, 1]).squeeze() n = 0 f_jacobian_prev = model.f_jacobian(prior_state) # prior jacobian[n: num_prior_equations, :] = np.identity(num_state) residual[n:num_prior_equations] = current_varialbes - model.f(prior_state) weight[n:num_prior_equations, n:num_prior_equations] \ = np.linalg.inv(f_jacobian_prev @ prior_cov @ f_jacobian_prev.T + model.f_cov()) n += num_prior_equations # observation jacobian[n: n + num_observation_equations, :] = model.h_jacobian(current_varialbes) residual[n: n + num_observation_equations] = model.h(current_varialbes) - measurement weight[n: n + num_observation_equations, n: n + num_observation_equations] = model.h_weight() n += num_observation_equations assert n == num_equations return jacobian, residual, weight def construct_normal_equation(self, jacobian, residual, weight): J_transpose_dot_weight = np.dot(jacobian.T, weight) norm_equation_left = np.dot(J_transpose_dot_weight, jacobian) norm_equation_right = - np.dot(J_transpose_dot_weight, residual) return norm_equation_left, norm_equation_right def least_sqr_optimization(self, prior_state, prior_cov, measurement): model = get_model() variables = model.f(prior_state) norm_equation_left = prior_cov.copy() for iter in range(self.max_iter): jacobian, residual, weight \ = self.compute_jacobian_and_residual(variables, prior_state, prior_cov, measurement) norm_equation_left, norm_equation_right \ = self.construct_normal_equation(jacobian, residual, weight) dx = np.linalg.solve(norm_equation_left, norm_equation_right) LAMBDA = 1. variables += LAMBDA * dx self.state = variables self.cov = np.linalg.inv(norm_equation_left)
The implementation is longer but the implementation is clear and straightforward. We simply solved a least square problem in the Gaussian-Newton framework.
The Comparison
The classical Kalman Filter is equivalent to the Kalman filter optimization problem (with 1 iteration).
Why 1 iteration? Yes, I will prove it in the next post and show Extent Kalman Filter is sub-optimal.
The Benefits of Optimization
Viewing the kalman filter as an optimization problem simplified many problems.
Nonlinear function
EKF is sub-optimal since it only does the newton update for a single iteration.
Constraints
There is no simple theory to handle constraints in the classical Kalman Filter framework. In the least squares convex optimization framework, I can use whatever convex optimization method to handle constraints.
Covariance tuning
Covariance tuning is simply tuning the weight for the quadratic cost function. I can argue that the numbers are not important. But we need to make sure the scales are consistent.
Covariance learning
Covariance learning become clear. All we need to do is fit a quadratic function to data (\(z_i, x_i\)). \(\text{argmin} cost(W) = \sum ||z_i – H x_i||_W\).
Prediction function learning
Same as the covariance learning. We first choose a whatever parameterization of the prediction function \(f(x, \theta)\). Then, we learning it from data. \(\theta^{*} = \text{argmin} cost(\theta) = \sum ||f(x_{i-1}) – x_i||_W \).
I learned Kalman Filter in the probabilistic framework. The classical Kalman Filter is beautiful but the probabilistic framework is rigid and hard to apply to real-world problems. (In fact, people like to write probabilistic in theory but Implementation is by optimization).
Optimization is a better framework to work with.
The prediction noise
I ignored prediction noise in the prove. Here is a supplement to prove the prediction step with noise and prior (or cost-to-go).
We showed \(||x_{n-1} – x_{n-1}^*||^2_{P^{-1}} + ||x_{i+1} – F x_i||^2_{Q^{-1}}\) is equivalent to \(||F x_{n-1}^* – x_{n} ||^2_{({FPF^T + Q})^{-1}}\)
Why I ignore the prediction noise?
Kalman Filter without prediction noise is equivalent to a famous motion planning algorithm. The Differential Dynamic Programming.
The big picture of Kalman Filter
This section is optional. The goal is to build to connection between Kalman Filter and a planning algorithm called Differential Dynamic Programming.
Using cost-to-go to solve quadratic optimization
In the fixed-lag smoothing article, I showed that for a quadratic cost function,
\(\text{total_cost}({x_0, x_1, … x_n}) = \sum\limits_{i=0}^n \text{cost_term}_i (x_i, x_{i+1})\)
We can solve the optimization efficiently by tracking a cost-to-go.
A quick refresher for the problem. Read this article for a detailed explaination.
The quadratic cost function is equavelent to,
\(\text{total_cost}({x_1, … x_n}) = \sum\limits_{i=0}^n \text{cost_term}_i (x_i, x_{i+1})
\\ \Leftrightarrow min_{x_0, x_1, … x_{k-1}}(\sum\limits_{i=0}^{k-1} \text{cost_term}_i (x_i, x_{i+1})) + \sum\limits_{i=k}^n \text{cost_term}_i (x_i, x_{i+1})\)
(The \(\Leftrightarrow\) mean equivalence in the respect of argmin. Please check this post for detail.)
We can define the first part of the above equation as the cost-to-go. The cost-to-go is a function of a single variable. It is basically a cost function with older variables eliminated.
\(\text{cost_to_go}_k(x_k) := min_{x_0, x_1, … x_{k-1}}\sum\limits_{i=0}^{k-1} ( \text{cost_term}_i (x_i, x_{i+1}))\)
The total cost can be written as,
\(\text{total_cost}({x_k, … x_n}) \Leftrightarrow \text{cost_to_go}_k(x_k) + \sum\limits_{i={k+1}}^{n-1} \text{cost_term}_i (x_i, x_{i+1})\)
When \(k = n\), For the last cost-to-go,
\(\text{total_cost}({x_k, … x_n}) \Leftrightarrow \text{cost_to_go}_{n}(x_{n}) \; \; \; (2.1)\)
The key observation was, we can compute cost-to-go recursively,
\(\text{cost_to_go}_i(x_i) = \text{min}_{x_{i-1}} ( \text{cost_to_go}_{i-1}(x_{i-1}) + \text{cost_term}_{i-1} (x_{i-1}, x_{i}) ) \; \; \; (2.2)\)
With the initial value for the cost-to-go \(\text{cost_to_go}_{0}(x_{0}) =0\)
Because the total-cost (2,1) can be expressed by a cost-to-go, we can use (2.2) to compute the total-cost recursively. In the end, we solve (2,2) to get the \(x_n^*\).
Comparing to solving the whole problem (2,1), solving (2,2) n times to much faster. In fact, we can solve the (block-diagonal) problem in \(O(n)\) time instead of \(O(n^3)\) time.
Kalmin filter did exactly this.
Kalman Filter as cost-to-go optimization
The above derivation is exactly the same as for Kalman Filter
The cost function for the full Kalman Filter is,
\(\text{full_kalman_filter}({x_0, x_1, … x_n}) = \text{cost_prior}(x_0) + \sum\limits_{i=0}^n \text{cost_term}_i (x_i, x_{i+1})\)
For Kalman Filter the \(\text{cost_term}_i (x_i, x_{i+1})\) is,
\( \text{cost_term}_i (x_i, x_{i+1}) = ||x_{i+1} – F x_i||^2_Q + ||H x_{i+1} – h||^2_R\)
\(||x_{i+1} – F x_i||^2_Q\) is the processing cost. It soft-constrains the predicted state \(\hat{x_{i+1}} = F x_i\) to be close to the next \(x_{i+1}\).
\(||H x_{i+1} – h||^2_R\) is the measurement cost. It wants the predicted measurement \(H x_{i+1}\) to be close to the true measurement \(h\).
To estimate the optimal current state, we need to solve (2.2).
\(\text{cost_to_go_kalman_filter}_{n}(x_{n})
\\ = \text{min}_{x_{n-1}}( \text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) + \text{cost_term}_{n-1} (x_{n-1}, x_{n}) ) \\ = \text{min}_{x_{n-1}}( \text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) + ||x_{n+1} – F x_n||^2_Q + ||H x_{n+1} – h||^2_R ) \)
Since, \(\text{cost_to_go_kalman_filter}_{n-1}(x_{n-1})\) is a quadratic. Let \(\text{cost_to_go_kalman_filter}_{n-1}(x_{n-1}) = ||x_{n-1} – x_{n-1}^*||_M\). Where \(x_{n-1}^*\) is the center of the quadratic and \(M\) is the Hessian of the quadratic. It’s just Gaussian with a different name.
The above equation becomes,
\(\text{cost_to_go_kalman_filter}_{n}(x_{n}) = \text{min}_{x_{n-1}}( ||x_{n-1} – x_{n-1}^*||_M + ||x_{n+1} – F x_n||^2_Q + ||H x_{n+1} – h||^2_R ) \; \; \; (3.1)\)
The optimal solution is given by,
\(x_n^* = \text{argmin }\text{cost_to_go_kalman_filter}_{n}(x_{n})\).
This is equivalent to the prediction and update step for the Kalman Filter.
What’s Next?
In this post, we showed the Kalman Filter is an optimization problem. In the next posts I will sho,
- The Extended Kalman Filter is sub-optimal.
- The Differential Dynamic Programming.
3 thoughts on “Kalman Filter, a Least Squares Problem [work in process]”