Why writing is post?
Optimization and probability (maximum likelihood estimation) are dual problem of each other.
I’d like to do some math to show the duality and derive the covariance of optimization problems. And I want to apply the theory to implement some cool programs.
What are in this post?
- Show the relationship between the optimization and the probability.
- Derive the uncertainty for optimization problems.
- Implement the uncertainty for 2D GPS in Python.
- Implement the uncertainty for Lucas-Kanade tracker in C++.
The Relationship
By definition, linear least squares optimization problems are maximum likelihood estimation(MLE) with Gaussian distribution.
Prove
Definition of Gaussian,
\(p(x) = \text{scale} \exp(- (x – \mu)^T \Sigma^{-1} (x – \mu)) \)
Where \(\mu\) is the mean of the Gaussian, and the \(\Sigma\) is the covariance of the Gaussian.
Start with the definition of maximum likelihood estimation with Gaussian distribution,
\(\text{maximize} \; l(x) \\ = \text{maximize} \; p(x) \\ = \text{maximize} \; \text{scale} \exp(- (x – \mu)^T \Sigma^{-1} (x – \mu)) \)
Take the log, drop constants and take the negative,
\( \text{maximize} \; – log(l(x)) \\ = \text{minimize} \; log(l(x)) \\ = \text{minimize} \; (x – \mu)^T \Sigma_i^{-1} (x – \mu)\)
Let’s call \( log(l(x)) = (x – \mu)^T \Sigma_i^{-1} (x – \mu)\) the cost function.
\(= \text{minimize} \; cost(x) \)
Since the log function is monotonic,
\(\text{argmin} \; cost(x) = \text{argmax} \; l(x) \)
In sum, linear least squares problems are equivalent to maximum likelihood estimation with Gaussian distribution.
The probability dual problem of it is: fit a Gaussian to a arbitrary probability distribution and update by “reading” the means of the Gaussian.
Take-away
It was a sloppy prove. But it is enough for some interesting intuitions.
- The Hessian (the second-order derivative) of a cost function is the inverse covariance of its dual MLE problem.
The Covariance and the Hessian
Recall that for a nonlinear least squares problem, the Hessian of a cost function evaluated at current variable \(x\) is \(J(x)^TJ(x)\), which is the left-land-side of the normal equation.
Least Squares Refresher, https://wang-yimu.com/least-squares-refresher/
So the Covariance (the inverse Hessian) of a least squares problem is,
\(\text{covariance}(x) ={\text{hessian}(cost) (x) }^{-1} = {(J(x)^T J(x))}^{-1}\)
Answers (not perfect though) are in this tutorial:
Factor Graphs for Robot Perception,
Frank Dellaert, Michael Kaess
http://www.cs.cmu.edu/~kaess/pub/Dellaert17fnt.pdf
Theory needs support from facts.
Since we can plot the covariance (at least in 2d), let’s write some programs to show the uncertainty for 2 optimization problems.
Problem 1, The Covariance of GPS
For the Global Positioning System(GPS) system,
- We know
- where satellites are.
- the distance from satellites to a receiver.
- We want to know the location of the receiver.
We can formulate this problem as a optimization problem,
\(cost(x, y) = \\ \\ \sum\limits_{\text{satellites}} ||\text{dist}((x,y), \text{satellite-i}) – \text{measured-distance-to-satellite-i}||^2 \)
Where,
\(\text{dist}((x,y), \text{satellite}) = \sqrt{(x – \text{satellite}.x)^2 + (y – \text{satellite} .y)^2) } \)
In English, we want to find a receiver location \((x,y)\) to satisfy all the physical distance measurements from satellites.
I solved this problem by least squares. I ignored the Jacobian derivation since today is Christmas. Read the code for details.
From previous discussion, the covariance of this problem is the inverse Hessian of the cost function. In the case of least-squares, the covariance is the inverse of the left-hand-side of the normal equation.
It is a very non-convex function. E.g. To avoid a obstacle, you can go left or go right.
For more info:
CHOMP: Gradient Optimization Techniques for Efficient Motion Planning
Nathan Ratliff, Matt Zucker, J. Andrew Bagnell, Siddhartha Srinivasa
https://www.ri.cmu.edu/pub_files/2009/5/icra09-chomp.pdf
GPS with Covariance Implementation
The code: https://github.com/yimuw/yimu-blog/tree/master/least_squares/covariance1_gps
Case 1, Good covariance
Satellites are at:
[20, 20], [30, -30], [-40, 0]
The ground-truth receiver position is:
[-10, 0]
The result receiver position is:
[-1.00393270e+01 7.46063839e-04] with good covariance
Case 2, Bad covariance
Satellites are at:
[-2, 20], [2, -20].
The groud-truth receiver position is:
[-10, 0]
The result receiver position is
[-7.68252229 -0.58229322] with bad covariance
I would call this situation singularity. e.g. You can move the (x,y) along the circle in the cost field and get almost the same minimum cost.
Read this paper for Singularity/Degeneracy:
On Degeneracy of Optimization-based State Estimation,
Ji Zhang, Michael Kaess, and Sanjiv Singh Problems
https://www.ri.cmu.edu/pub_files/2016/5/ICRA_2016.pdf
Let’s go through some key functions.
Least Square Core
def least_squares(self): """ The nonlinear least squares iteration """ for iteration in range(self.max_iteration): jacobian = self.compute_jacobian(self.satelite_measurements) b = self.compute_b(self.satelite_measurements) W = self.compute_weights(self.satelite_measurements) delta = self.solve_normal_equation(jacobian, b, W) self.variables_xy += delta
Compute the jacobian, the current residual and solve the normal equation.
Compute the Jacobian
def compute_jacobian(self, satelite_measurements): """ Compute jacobian of residual function analytically """ num_residuals = len(satelite_measurements) jacobian = np.zeros([num_residuals, 2]) for i, s in enumerate(satelite_measurements): f = self.variables_xy - s.position jacobian[i, :] = f / np.linalg.norm(f) np.testing.assert_allclose( jacobian[i, :], self.gradient_checking_simple(satelite_measurements[i]), 1e-4) return jacobian
Try to derive the jacobian of the distance function yourself. I also did a jacobian checking. Read the function for details.
Plot the Covariance
def plot_hessian_as_covarinace_2d(ax, xy, hessian, satelite_measurements): """ plot 2d hessian as cov ... """ x, y = xy cov = np.linalg.inv(hessian) value, v = np.linalg.eig(cov) value = np.sqrt(value) for j in range(1, 4): SCALE = 3 ell = Ellipse(xy=(np.mean(x), np.mean(y)), width=value[0] * j * SCALE, height=value[1] * j * SCALE, angle=np.rad2deg(np.arctan2(v[0, 1], v[1, 0])), facecolor='none', edgecolor='red') ax.add_artist(ell) ...
I plotted ellipses defined by the eigenvalues and eigenvectors of the inverse hessian.
Play with the code for more intuition.
Problem 2, The Covariance of the Lucas-Kanade tracker.
Prerequisite: Lucas-Kanade tracker: https://wang-yimu.com/lucas-kanade-tracker/
From my previous post, we know the cost function for Lucus-Kanade is,
\(\text{minimize} \; \text{cost}(dx, dy) = \sum_{x,y}||A(x,y) – B(x+dx, y + dy) ||^2\)
Where \(A\), \(B\) are two image patches.
The Jacobian of the residual function (things inside the L2 norm) is,
\(J = \begin{bmatrix}\frac{\partial{r_0} }{\partial{dx} } (cur) & \frac{\partial{r_0} }{\partial{dy}}(cur)) \\ … & …\\ \frac{\partial{r_i} }{\partial{dx} } (cur) & \frac{\partial{r_i} }{\partial{dy}}(cur)) \\ … & …\\ \frac{\partial{r_n} }{\partial{dx} } (cur) & \frac{\partial{r_n} }{\partial{dy}}(cur))\end{bmatrix} \)
So the Hessian of the cost function is,
\(\text{Hessian} = J^T J\)
By definition, the Covariance of Lucas-Kanade is,
\(\text{Cov} = (J^TJ)^{-1}\)
We can plot the covariance of the Lucas-Kanade tracker.
Here we go.
LK with Covariance Implementation
Code: https://github.com/yimuw/yimu-blog/tree/master/least_squares/lucas_kanade
The code the same as normal Lucas-Kanade tracker. except we need to,
- Output the Hessian.
- Plot the Covariance.
Output the Hessian
cv::Point2f solve_normal_equation(const cv::Mat& jacobian, const cv::Mat& b, Hessian* const hessian_ptr) { cv::Mat_<float> delta; // J^T J delta = -J^T b cv::solve(jacobian, -b, delta, cv::DECOMP_CHOLESKY | cv::DECOMP_NORMAL); // std::optional if (hessian_ptr != nullptr) { *hessian_ptr = jacobian.t() * jacobian; } return { delta.at<float>(0, 0), delta.at<float>(1, 0) }; }
We pass the hessian output by a pointer.
Plot the Covariance
cv::Mat show_features_with_covariance(const std::string wname = "features", const size_t dt = 1) { cv::Mat im_debug = last_color_im_.clone(); assert(cur_features_locations_.size() == hessians_.size()); for (size_t i = 0; i < hessians_.size(); ++i) { auto& p = cur_features_locations_[i]; auto& hessian = hessians_[i]; cv::Mat eigenvalues; cv::Mat eigenvectors; cv::eigen(hessian.inv(), eigenvalues, eigenvectors); const float theta = std::atan2(eigenvectors.at<float>(0, 1), eigenvectors.at<float>(0, 0)) / M_PI * 180.; constexpr float SCALE = 10; const int l1 = static_cast<int>(SCALE * std::sqrt(eigenvalues.at<float>(0, 0))); const int l2 = static_cast<int>(SCALE * std::sqrt(eigenvalues.at<float>(1, 0))); constexpr double MAX_STD = 50; if (l1 > MAX_STD) { // Bad tracks cv::ellipse(im_debug, p, { static_cast<int>(MAX_STD), static_cast<int>(l2 * MAX_STD / l1) }, theta, 0, 360, { 100, 0, 0 }, 1); } else { // TODO: opencv bug if thinkness is not 1? cv::ellipse(im_debug, p, { l1, l2 }, theta, 0, 360, { 0, 255, 0 }, 1); } } cv::imshow(wname, im_debug); cv::waitKey(dt); return im_debug; }
We compute the covariance and plot the ellipses defined by the covariance.
We draw tracks with high uncertainty in black.
Discussion
I personally think the least-squares/optimization framework is much easier to work with.
Many authors modeled a problem by probability but implemented the solution by least squares.
Some people think Probabilistic Robotics is the bible of Robotics.
I prefer Introduction to Applied linear Algebra, Convex Optimization and Linear Matrix Inequalities in System and Control Theory.
Introduction to Applied linear Algebra
Stephen Boyd, Lieven Vandenberghe
http://vmls-book.stanford.edu/vmls.pdf
The book of Ecclesiastes,
Convex Optimization
Stephen Boyd, Lieven Vandenberghe
https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
The book of Job,
Linear Matrix Inequalities in System and Control Theory
Stephen Boyd, Laurent El Ghaoui, Eric Feron, and V. Balakrishnan
http://web.stanford.edu/~boyd/lmibook/lmibook.pdf
Paper Unlock
On Degeneracy of Optimization-based State Estimation Problems
Ji Zhang, Michael Kaess, and Sanjiv Singh
The Iterative-Closest-Point (ICP) algorithm is a method to align 2 point clouds.
In a tunnel, the ICP can have uncertainty along the tunnel direction. As a result the estimated movement along the tunnel direction can be a arbitrary scalar.
In this paper, the author find the degenerated direction from the Hessian matrix \(A^T A\). And they don’t update estimated movement along the degenerated direction.
Degeneracy-Aware Factors with Applications to Underwater SLAM
Akshay Hinduja1, Bing-Jui Ho, and Michael Kaess
This paper first did the same PCA style ICP as the previous paper.
The ICP links 2 poses in the SLAM problem. To make sure SLAM problem also knows there is a degenerate direction, the author updated the ICP cost (loop closure cost) in the SLAM problem.
The update for weight can be confusing. Let me do a example.
\(cost(x) = x^T \Lambda x\)
Let’s project vector x into a whatever space by a projection matrix \(V\). In math, it is simply \(y = Vx\).
\(cost(y) = (Vx)^T \Lambda Vx = x (V^T \Lambda V) x = ||x||^2_{V^T \Lambda V}\)
(I am not sure why the weight update was \(V_f \Lambda_{ik} V_f^T\). Probably just need some detail derivations)
Alternative
Personally, I don’t think doing \(V_f \Lambda_{ik} V_f^T\) is necessary. I would argue the Hessian matrix \(A^T \Lambda A\) (in my annotation \(J^T W J\)) already captured the degenerated direction.
In other word, forcing the weight for the degenerated direction to be 0 is not necessary because \(A^T \Lambda A\) already give the degenerated direction very small cost.
So, for this loop closure cost, we can do,
\(\text{loop-closure-cost}(x1, x2) = ||\Delta_{icp} \ominus (x1 \ominus x2) ||^2_{A^T \Lambda A}\)
Where \(A^T \Lambda A\) define the weight for the cost. \(A^T \Lambda A\) is a Positive-semi definite matrix. It can be understood as a uncertainty ellipsoid (please see the GPS example above).
Motion planning under uncertainty using iterative local optimization in belief space
Jur van den Berg, Sachin Patil and Ron Alterovitz
Motion planning is usually modeled as a least squares problem. The author used iLQG ( some people call it DDP) to solve the least squares efficiently.
The paper discussed how to solve a least squares problem by iLQG efficiently in details. I will skip this part.
The interesting part is the objective function for the motion planning least squares problem.
As we discussed in previous section, the Hessian \(J^T W J\) of a least squares problem encode the uncertainty. The inverse of the Hessian is the Covariance. The eigenvalues of the Covariance quantities the uncertainty. For example, if eigenvalues of the Covariance is large, then the uncertainty ellipsoid is large.
There is a math fact, \(\text{trace(M)} = \text{sum of eigenvalues of M}\) .
In this paper, the author wanted to minimize the uncertainty by minimizing the trace of covariance.
(I am not a expert on Planning though. I am only 90% sure about the above explanations)
7 thoughts on “The Covariance of Optimization Problems”