The Covariance of Optimization Problems

Why writing is post?

Optimization and probability (maximum likelihood estimation) are dual problem of each other.

I’d like to do some math to show the duality and derive the covariance of optimization problems. And I want to apply the theory to implement some cool programs.

What are in this post?

  • Show the relationship between the optimization and the probability.
  • Derive the uncertainty for optimization problems.
  • Implement the uncertainty for 2D GPS in Python.
  • Implement the uncertainty for Lucas-Kanade tracker in C++.

The Relationship

By definition, linear least squares optimization problems are maximum likelihood estimation(MLE) with Gaussian distribution.

Prove

Definition of Gaussian,

\(p(x) = \text{scale} \exp(- (x – \mu)^T \Sigma^{-1} (x – \mu)) \)

Where \(\mu\) is the mean of the Gaussian, and the \(\Sigma\) is the covariance of the Gaussian.


Start with the definition of maximum likelihood estimation with Gaussian distribution,

\(\text{maximize} \; l(x) \\ = \text{maximize} \; p(x) \\ = \text{maximize} \; \text{scale} \exp(- (x – \mu)^T \Sigma^{-1} (x – \mu)) \)

Take the log, drop constants and take the negative,

\( \text{maximize} \; – log(l(x)) \\ = \text{minimize} \; log(l(x)) \\ = \text{minimize} \; (x – \mu)^T \Sigma_i^{-1} (x – \mu)\)

Let’s call \( log(l(x)) = (x – \mu)^T \Sigma_i^{-1} (x – \mu)\) the cost function.

\(= \text{minimize} \; cost(x) \)

Since the log function is monotonic,

\(\text{argmin} \; cost(x) = \text{argmax} \; l(x) \)

In sum, linear least squares problems are equivalent to maximum likelihood estimation with Gaussian distribution.

It is the standard procedure for (at least robotics) papers with keyword probability in it.
One way to understand the least squares (or Newton) optimization is: fit a quadratic to a arbitrary cost function, and update by analytically minimize the quadratic.
The probability dual problem of it is: fit a Gaussian to a arbitrary probability distribution and update by “reading” the means of the Gaussian.

Take-away

It was a sloppy prove. But it is enough for some interesting intuitions.

  • The Hessian (the second-order derivative) of a cost function is the inverse covariance of its dual MLE problem.

The Covariance and the Hessian

Recall that for a nonlinear least squares problem, the Hessian of a cost function evaluated at current variable \(x\) is \(J(x)^TJ(x)\), which is the left-land-side of the normal equation.

Check this post for details:
Least Squares Refresher, https://wang-yimu.com/least-squares-refresher/

So the Covariance (the inverse Hessian) of a least squares problem is,

\(\text{covariance}(x) ={\text{hessian}(cost) (x) }^{-1} = {(J(x)^T J(x))}^{-1}\)

Question for you: If I only want to compute the covariance for a subset of variables, how to do it?
Answers (not perfect though) are in this tutorial:
Factor Graphs for Robot Perception,
Frank Dellaert, Michael Kaess
http://www.cs.cmu.edu/~kaess/pub/Dellaert17fnt.pdf

Theory needs support from facts.

Since we can plot the covariance (at least in 2d), let’s write some programs to show the uncertainty for 2 optimization problems.

Problem 1, The Covariance of GPS

For the Global Positioning System(GPS) system,

  • We know
    • where satellites are.
    • the distance from satellites to a receiver.
  • We want to know the location of the receiver.

We can formulate this problem as a optimization problem,

\(cost(x, y) = \\ \\ \sum\limits_{\text{satellites}} ||\text{dist}((x,y), \text{satellite-i}) – \text{measured-distance-to-satellite-i}||^2 \)

Where,

\(\text{dist}((x,y), \text{satellite}) = \sqrt{(x – \text{satellite}.x)^2 + (y – \text{satellite} .y)^2) } \)

In English, we want to find a receiver location \((x,y)\) to satisfy all the physical distance measurements from satellites.

I solved this problem by least squares. I ignored the Jacobian derivation since today is Christmas. Read the code for details.

From previous discussion, the covariance of this problem is the inverse Hessian of the cost function. In the case of least-squares, the covariance is the inverse of the left-hand-side of the normal equation.

Distance function is a very important function in robotics. For example, to avoid a obstacle, we need to compute the distance from the robot to the obstacle.
It is a very non-convex function. E.g. To avoid a obstacle, you can go left or go right.
For more info:
CHOMP: Gradient Optimization Techniques for Efficient Motion Planning
Nathan Ratliff, Matt Zucker, J. Andrew Bagnell, Siddhartha Srinivasa

https://www.ri.cmu.edu/pub_files/2009/5/icra09-chomp.pdf

GPS with Covariance Implementation

The code: https://github.com/yimuw/yimu-blog/tree/master/least_squares/covariance1_gps

Case 1, Good covariance

Satellites are at:

[20, 20], [30, -30], [-40, 0]

The ground-truth receiver position is:

[-10, 0]

The result receiver position is:

[-1.00393270e+01 7.46063839e-04] with good covariance

left: satellites and the result position with covariance.
right: the cost function field.

Case 2, Bad covariance

Satellites are at:

[-2, 20], [2, -20].

The groud-truth receiver position is:

[-10, 0]

The result receiver position is

[-7.68252229 -0.58229322] with bad covariance

left: satellites and the result position with covariance.
right: the cost function field.
There are only 2 satellites. But the linear system is deterministic!
I would call this situation singularity. e.g. You can move the (x,y) along the circle in the cost field and get almost the same minimum cost.
Read this paper for Singularity/Degeneracy:
On Degeneracy of Optimization-based State Estimation,
Ji Zhang, Michael Kaess, and Sanjiv Singh Problems
https://www.ri.cmu.edu/pub_files/2016/5/ICRA_2016.pdf

Let’s go through some key functions.

Least Square Core

def least_squares(self):
    """
        The nonlinear least squares iteration
    """
    for iteration in range(self.max_iteration):
        jacobian = self.compute_jacobian(self.satelite_measurements)
        b = self.compute_b(self.satelite_measurements)
        W = self.compute_weights(self.satelite_measurements)

        delta = self.solve_normal_equation(jacobian, b, W)

        self.variables_xy += delta

Compute the jacobian, the current residual and solve the normal equation.

Compute the Jacobian

def compute_jacobian(self, satelite_measurements):
    """
        Compute jacobian of residual function analytically
    """
    num_residuals = len(satelite_measurements)
   
    jacobian = np.zeros([num_residuals, 2])

    for i, s in enumerate(satelite_measurements):
        f = self.variables_xy - s.position
        jacobian[i, :] = f / np.linalg.norm(f)
        
        np.testing.assert_allclose(
            jacobian[i, :],
            self.gradient_checking_simple(satelite_measurements[i]), 1e-4)

    return jacobian

Try to derive the jacobian of the distance function yourself. I also did a jacobian checking. Read the function for details.

Plot the Covariance

def plot_hessian_as_covarinace_2d(ax, xy, hessian, satelite_measurements):
    """
        plot 2d hessian as cov
        ...
    """
    x, y = xy
    cov = np.linalg.inv(hessian)
    value, v = np.linalg.eig(cov)
    value = np.sqrt(value)
    for j in range(1, 4):
        SCALE = 3
        ell = Ellipse(xy=(np.mean(x), np.mean(y)),
                      width=value[0] * j * SCALE,
                      height=value[1] * j * SCALE,
                      angle=np.rad2deg(np.arctan2(v[0, 1], v[1, 0])),
                      facecolor='none',
                      edgecolor='red')
        ax.add_artist(ell)

    ...

I plotted ellipses defined by the eigenvalues and eigenvectors of the inverse hessian.

Play with the code for more intuition.

Problem 2, The Covariance of the Lucas-Kanade tracker.

Prerequisite: Lucas-Kanade tracker: https://wang-yimu.com/lucas-kanade-tracker/


From my previous post, we know the cost function for Lucus-Kanade is,

\(\text{minimize} \; \text{cost}(dx, dy) = \sum_{x,y}||A(x,y) – B(x+dx, y + dy) ||^2\)

Where \(A\), \(B\) are two image patches.

The Jacobian of the residual function (things inside the L2 norm) is,

\(J = \begin{bmatrix}\frac{\partial{r_0} }{\partial{dx} } (cur) & \frac{\partial{r_0} }{\partial{dy}}(cur)) \\ … & …\\ \frac{\partial{r_i} }{\partial{dx} } (cur) & \frac{\partial{r_i} }{\partial{dy}}(cur)) \\ … & …\\ \frac{\partial{r_n} }{\partial{dx} } (cur) & \frac{\partial{r_n} }{\partial{dy}}(cur))\end{bmatrix} \)

So the Hessian of the cost function is,

\(\text{Hessian} = J^T J\)

By definition, the Covariance of Lucas-Kanade is,

\(\text{Cov} = (J^TJ)^{-1}\)

We can plot the covariance of the Lucas-Kanade tracker.

Here we go.

LK with covariance.
The size of the eclipse is the uncertainty.
We classify tracks to bad or good by a threshold.
Good tracks are green. Bad tracker are black.

LK with Covariance Implementation

Code: https://github.com/yimuw/yimu-blog/tree/master/least_squares/lucas_kanade

The code the same as normal Lucas-Kanade tracker. except we need to,

  1. Output the Hessian.
  2. Plot the Covariance.

Output the Hessian

    cv::Point2f solve_normal_equation(const cv::Mat& jacobian,
                                      const cv::Mat& b,
                                      Hessian* const hessian_ptr)
    {
        cv::Mat_<float> delta;

        // J^T J delta = -J^T b
        cv::solve(jacobian, -b, delta, cv::DECOMP_CHOLESKY | cv::DECOMP_NORMAL);

        // std::optional
        if (hessian_ptr != nullptr) {
            *hessian_ptr = jacobian.t() * jacobian;
        }

        return { delta.at<float>(0, 0), delta.at<float>(1, 0) };
    }
    

We pass the hessian output by a pointer.

Plot the Covariance

cv::Mat show_features_with_covariance(const std::string wname = "features",
                                      const size_t dt = 1)
{
    cv::Mat im_debug = last_color_im_.clone();
    assert(cur_features_locations_.size() == hessians_.size());
    for (size_t i = 0; i < hessians_.size(); ++i) {
        auto& p = cur_features_locations_[i];
        auto& hessian = hessians_[i];

        cv::Mat eigenvalues;
        cv::Mat eigenvectors;
        cv::eigen(hessian.inv(), eigenvalues, eigenvectors);

        const float theta = std::atan2(eigenvectors.at<float>(0, 1), eigenvectors.at<float>(0, 0)) / M_PI * 180.;
        constexpr float SCALE = 10;
        const int l1 = static_cast<int>(SCALE * std::sqrt(eigenvalues.at<float>(0, 0)));
        const int l2 = static_cast<int>(SCALE * std::sqrt(eigenvalues.at<float>(1, 0)));

        constexpr double MAX_STD = 50;
        if (l1 > MAX_STD) {
            // Bad tracks
            cv::ellipse(im_debug, p, { static_cast<int>(MAX_STD), static_cast<int>(l2 * MAX_STD / l1) },
                theta, 0, 360, { 100, 0, 0 }, 1);
        } else {
            // TODO: opencv bug if thinkness is not 1?
            cv::ellipse(im_debug, p, { l1, l2 }, theta, 0, 360, { 0, 255, 0 }, 1);
        }
    }
    cv::imshow(wname, im_debug);
    cv::waitKey(dt);

    return im_debug;
}

We compute the covariance and plot the ellipses defined by the covariance.

We draw tracks with high uncertainty in black.

Discussion

I personally think the least-squares/optimization framework is much easier to work with.

Many authors modeled a problem by probability but implemented the solution by least squares.

Some people think Probabilistic Robotics is the bible of Robotics.

I prefer Introduction to Applied linear Algebra, Convex Optimization and Linear Matrix Inequalities in System and Control Theory.

The book of Proverbs,
Introduction to Applied linear Algebra
Stephen Boyd, Lieven Vandenberghe
http://vmls-book.stanford.edu/vmls.pdf

The book of Ecclesiastes,
Convex Optimization
Stephen Boyd, Lieven Vandenberghe
https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf


The book of Job,
Linear Matrix Inequalities in System and Control Theory
Stephen Boyd, Laurent El Ghaoui, Eric Feron, and V. Balakrishnan
http://web.stanford.edu/~boyd/lmibook/lmibook.pdf

Paper Unlock

On Degeneracy of Optimization-based State Estimation Problems
Ji Zhang, Michael Kaess, and Sanjiv Singh

The Iterative-Closest-Point (ICP) algorithm is a method to align 2 point clouds.

In a tunnel, the ICP can have uncertainty along the tunnel direction. As a result the estimated movement along the tunnel direction can be a arbitrary scalar.

(b) is the tunnel scenario where the direction along the tunnel (blue arrow) is uncertain (degenerated).

In this paper, the author find the degenerated direction from the Hessian matrix \(A^T A\). And they don’t update estimated movement along the degenerated direction.

It is basically Principal component analysis(PCA).
\(V_p \Delta x_u\) project the update vector \( \Delta x_u\) into the principal axis frame and remove elements with small eigenvalues.
\(V_f^{-1}\) project the \(V_p \Delta x_u\) back to the original frame.

Degeneracy-Aware Factors with Applications to Underwater SLAM
Akshay Hinduja1, Bing-Jui Ho, and Michael Kaess

This paper first did the same PCA style ICP as the previous paper.

3. Compute Hessian
4. Eigenvalue & Eigenvector defines the covariance ellipsoid. Small eigenvalue mean uncertainty.
5~8. Remove update in degenerated direction.

The ICP links 2 poses in the SLAM problem. To make sure SLAM problem also knows there is a degenerate direction, the author updated the ICP cost (loop closure cost) in the SLAM problem.

1. The measurement is projected to \(V_f\).
2. The weight for the quadratic term \(\Lambda_{ik}\) is updated by \(V_f \Lambda_{ik} V_f^T\)

The update for weight can be confusing. Let me do a example.

\(cost(x) = x^T \Lambda x\)

Let’s project vector x into a whatever space by a projection matrix \(V\). In math, it is simply \(y = Vx\).

\(cost(y) = (Vx)^T \Lambda Vx = x (V^T \Lambda V) x = ||x||^2_{V^T \Lambda V}\)

(I am not sure why the weight update was \(V_f \Lambda_{ik} V_f^T\). Probably just need some detail derivations)

Alternative

Personally, I don’t think doing \(V_f \Lambda_{ik} V_f^T\) is necessary. I would argue the Hessian matrix \(A^T \Lambda A\) (in my annotation \(J^T W J\)) already captured the degenerated direction.

In other word, forcing the weight for the degenerated direction to be 0 is not necessary because \(A^T \Lambda A\) already give the degenerated direction very small cost.

So, for this loop closure cost, we can do,

\(\text{loop-closure-cost}(x1, x2) = ||\Delta_{icp} \ominus (x1 \ominus x2) ||^2_{A^T \Lambda A}\)


Where \(A^T \Lambda A\) define the weight for the cost. \(A^T \Lambda A\) is a Positive-semi definite matrix. It can be understood as a uncertainty ellipsoid (please see the GPS example above).


Motion planning under uncertainty using iterative local optimization in belief space
Jur van den Berg, Sachin Patil and Ron Alterovitz

Motion planning is usually modeled as a least squares problem. The author used iLQG ( some people call it DDP) to solve the least squares efficiently.

The paper discussed how to solve a least squares problem by iLQG efficiently in details. I will skip this part.

The interesting part is the objective function for the motion planning least squares problem.

The cost function has trace of \(J^T W J\)

As we discussed in previous section, the Hessian \(J^T W J\) of a least squares problem encode the uncertainty. The inverse of the Hessian is the Covariance. The eigenvalues of the Covariance quantities the uncertainty. For example, if eigenvalues of the Covariance is large, then the uncertainty ellipsoid is large.

There is a math fact, \(\text{trace(M)} = \text{sum of eigenvalues of M}\) .

In this paper, the author wanted to minimize the uncertainty by minimizing the trace of covariance.

(I am not a expert on Planning though. I am only 90% sure about the above explanations)

7 thoughts on “The Covariance of Optimization Problems

  1. Pingback: Muha med carts
  2. Pingback: agen slot
  3. Pingback: yehyeh com

Leave a Reply