Implement Ceres Solver from Scratch

Why write this post?

The Ceres nonlinear least squares solver is my favorite library!

The Ceres library is very flexible and easy to use. It makes the development of least squares problems super simple. Without it, many algorithmic breakthroughs wouldn’t have happened. Good tools are extremely important!

It’s an honor and super fun to reinvent the wheel and go through the thinking process again.

What’s in this post?

The Ceres solver provides a simple framework for solving least squares problems, e.g.

\(\text{cost}(x_1, x_2, x_3) = ||\text{residual}_1(x_1)||^2 + ||\text{residual}_2(x_2)||^2 + ||\text{residual}_3(x_1, x_2, x_3)||^2\)

Specifically, Ceres

  1. groups parameters into blocks,
  2. models each residual term separately,
  3. can apply a robust loss to a cost term,
  4. computes the Jacobian of each residual by forward auto differentiation,
  5. builds a sparse normal equation and solves it with a linear solver,
  6. and updates the variables using a Trust Region method.

In this post, I will implement the coolest part of Ceres: the forward auto differentiation. I also implement the robust loss function for fun.

1. The Forward Auto Differentiation

There are two common auto differentiation (AD) methods: the forward AD and the backward AD.

Both the forward AD and the backward AD use the chain rule to compute derivatives. The difference between them is purely algebraic.

That algebraic difference leads to a performance difference.

The backward AD is good for computing the derivative of \(y = f(x), y \in R^m, \; x \in R^n\) where \(n >> m\). For example, we use backward AD for deep learning because the function to minimize maps a high-dimensional input to a scalar. (Read this for details.)

Besides, the backward AD needs to traverse the computational graph twice, whereas the forward AD only needs to do it once. For \(y = f(x), y \in R^m, \; x \in R^n\) where \(n \approx m\), we can argue the forward AD is faster.

The forward AD is good for computing the derivative of \(y = f(x), y \in R^m, \; x \in R^n\) where \(n \approx m\) or \(n << m\). It’s perfect for real-time least squares problems, where residual functions typically have similar numbers of inputs and outputs.

Let me briefly review the chain rule and dig into the forward AD.

1.1 The Chain Rule

Given a function,

\(cost(x) = f_3(f_2(f_1(x)))\)

Assume the current value of the variable is \(x = \bar{x}\).

How to compute the gradient?

Let \(z_1 = f_1(x), z_2 = f_2(z_1)\). So \(\bar{z_1} = f_1(\bar{x}), \bar{z_2} = f_2(\bar{z_1})\)

The chain rule says,

\(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}} \)

How to compute the expression?

The forward auto differentiation computes the derivative in a forward pass.

  1. Compute \(\frac{\partial z_1}{\partial x}|_{\bar{x}}\).
  2. Compute \(\frac{\partial z_2}{\partial x}|_{\bar{x}} = \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}}\).
  3. Compute \(\frac{\partial cost}{\partial x}|_{\bar{x}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial x}|_{\bar{x}}\).

A picture is worth a thousand words.

The forward diff. I ignored the “evaluated at” for simplicity.

This gives us a clear picture of how to implement the forward auto diff.

  1. At each (intermediate) variable \(z_i\), maintain its derivative w.r.t. the optimization variable \(x\), i.e. \(\frac{\partial z_{i}}{\partial x}\).
  2. For each intermediate function \(z_{i+1} = f_i(z_i)\), implement \(\frac{\partial z_{i+1}}{\partial z_i}\).
  3. Whenever a variable \(z_i\) and its derivative \( \frac{\partial z_{i}}{\partial x}\) pass through a function \(z_{i+1} = f_i(z_i)\), compute the next value \(z_{i+1}\) and use the chain rule to get \( \frac{\partial z_{i+1}}{\partial x} = \frac{\partial z_{i+1}}{\partial z_i} \frac{\partial z_{i}}{\partial x}\).
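
Here is a tiny sketch of these three steps with a single optimization variable (my own toy Python with made-up names like Dual; this is not Ceres code and not the library built later in this post):

# A minimal single-variable forward-AD sketch: each Dual carries
# (value, derivative w.r.t. the single optimization variable x).
import math

class Dual:
    def __init__(self, value, deriv=0.0):
        self.value = value   # z_i
        self.deriv = deriv   # d z_i / d x

def dual_cos(z):
    # chain rule: d cos(z) / dx = -sin(z) * dz/dx
    return Dual(math.cos(z.value), -math.sin(z.value) * z.deriv)

def dual_mul(a, b):
    # product rule: d(ab)/dx = a * db/dx + b * da/dx
    return Dual(a.value * b.value, a.value * b.deriv + b.value * a.deriv)

x = Dual(math.pi / 4, 1.0)       # seed dx/dx = 1
f = dual_mul(dual_cos(x), x)     # f(x) = cos(x) * x
print(f.value, f.deriv)          # f' = cos(x) - x*sin(x) ~= 0.1517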

1.2 The derivative of primitive functions

As we discussed in the previous section, we need \(\frac{\partial z_{i+1}}{\partial z_i}\) for every intermediate function.

Here are some examples.

Example 1.

\(f(x, y) = x y\)

\(\frac{\partial f}{\partial x} = y\)

\(\frac{\partial f}{\partial y} = x\)

The Ceres implementation (in a Ceres Jet, f.a is the value and f.v is the vector of partial derivatives):

// Binary *
template <typename T, int N>
inline Jet<T, N> operator*(const Jet<T, N>& f, const Jet<T, N>& g) {
  return Jet<T, N>(f.a * g.a, f.a * g.v + f.v * g.a);
}

Example 2. \(f(x) = cos(x)\)

\(\frac{\partial f}{\partial x} = -sin(x)\)

// cos(a + h) ~= cos(a) - sin(a) h
template <typename T, int N>
inline Jet<T, N> cos(const Jet<T, N>& f) {
  return Jet<T, N>(cos(f.a), -sin(f.a) * f.v);
}

In sum, we simply need to do high school calculus to compute \(f'(x)\) for each primitive function.

1.3 Forward diff example

Consider this function,

\(f(x) = cos(x) x\)

Let’s say we want to compute \(\frac{\partial{f}}{\partial x}|_{x = \pi / 4}\).


To make our life easier, define

\(z_1 = f_1(x) = cos(x)\)

\(f(a, b) = a b\)

where \(a = z_1, b = x\).
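
Carrying each value together with its derivative, the forward pass at \(x = \pi / 4\) works out to:

\(x = \frac{\pi}{4}, \;\; \frac{\partial x}{\partial x} = 1\)

\(z_1 = cos(x) = \frac{\sqrt{2}}{2}, \;\; \frac{\partial z_1}{\partial x} = -sin(x) \cdot 1 = -\frac{\sqrt{2}}{2}\)

\(f = z_1 x = \frac{\sqrt{2}}{2} \frac{\pi}{4}, \;\; \frac{\partial f}{\partial x} = x \frac{\partial z_1}{\partial x} + z_1 \frac{\partial x}{\partial x} = \frac{\sqrt{2}}{2} (1 - \frac{\pi}{4}) \approx 0.152\)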

Compared with the backward auto diff, the forward auto diff only needs one forward pass to compute the derivative.

1.4 Multi-variable inputs function

Given,

\(cost(x) = f_3(f_2(f_1(x)))\)

We compute it’s derivative by,

\(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}} \)

How about a multi-variable function \(f(x, y)\)?

\(cost(x, y) = f_3(f_2(f_1(x, y)))\)

We have to use the chain rule twice!

\(\frac{\partial cost}{\partial x}|_{\bar{x}, \bar{y}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}, \bar{y}} \)

\(\frac{\partial cost}{\partial y}|_{\bar{x}, \bar{y}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial y}|_{\bar{x}, \bar{y}} \)

That’s a caveat for the forward diff.

For \(y = f(x), y \in R^m \; x \in R^n\) where \(n >> m\), backward auto diff is a better choice.

1.5 Forward auto diff vs Backward auto diff

Both the forward auto diff and the backward auto diff use the chain rule to compute the derivative. There are only technical differences.

For forward auto diff, we go through the expression tree from every variable node (or leaf node) to the root. We need to do the pass n times where n is the number of variables.

Forward Diff: we need to do n passes for n variables.
In this case, we need to do 2 passes for x and y.

For backward auto diff, we first traverse the tree forward to compute the values. Then we traverse the tree from the root back to the leaves.

Backward diff: we need to traverse this tree 2 times.

2 The Ceres implementation and the Jet

2.1 The Jet

For functions with multi-variable inputs, Ceres uses a concept called the Jet.

Given a function \(y = f(x)\), where \(x \in R^n, y \in R^m\), a Jet carries a value together with its derivatives w.r.t. \(x\); stacking the Jets of \(y\) gives the Jacobian \(\frac{\partial f}{\partial x}\).

Using Jets, we basically run the forward diff passes “in parallel” (vectorized with Eigen/SIMD).

The Ceres documentation didn’t make this clear (at least to me). However, the comments in this source file are very good.

// All inputs and outputs may be vector-valued.
//
// To understand how jets are used to compute the jacobian, a
// picture may help. Consider a vector-valued function, F, returning 3
// dimensions and taking a vector-valued parameter of 4 dimensions:
//
//     y            x
//   [ * ]    F   [ * ]
//   [ * ]  <---  [ * ]
//   [ * ]        [ * ]
//                [ * ]
//
// Similar to the 2-parameter example for f described in jet.h, computing the
// jacobian dy/dx is done by substutiting a suitable jet object for x and all
// intermediate steps of the computation of F. Since x is has 4 dimensions, use
// a Jet<double, 4>.
//
// Before substituting a jet object for x, the dual components are set
// appropriately for each dimension of x:
//
//          y                       x
//   [ * | * * * * ]    f   [ * | 1 0 0 0 ]   x0
//   [ * | * * * * ]  <---  [ * | 0 1 0 0 ]   x1
//   [ * | * * * * ]        [ * | 0 0 1 0 ]   x2
//         ---+---          [ * | 0 0 0 1 ]   x3
//            |                   ^ ^ ^ ^
//          dy/dx                 | | | +----- infinitesimal for x3
//                                | | +------- infinitesimal for x2
//                                | +--------- infinitesimal for x1
//                                +----------- infinitesimal for x0
//
// The reason to set the internal 4x4 submatrix to the identity is that we wish
// to take the derivative of y separately with respect to each dimension of x.
// Each column of the 4x4 identity is therefore for a single component of the
// independent variable x.
//
// Then the jacobian of the mapping, dy/dx, is the 3x4 sub-matrix of the
// extended y vector, indicated in the above diagram.
//
// Functors with multiple parameters
// ---------------------------------
// In practice, it is often convenient to use a function f of two or more
// vector-valued parameters, for example, x[3] and z[6]. Unfortunately, the jet
// framework is designed for a single-parameter vector-valued input. The wrapper
// in this file addresses this issue adding support for functions with one or
// more parameter vectors.
//
// To support multiple parameters, all the parameter vectors are concatenated
// into one and treated as a single parameter vector, except that since the
// functor expects different inputs, we need to construct the jets as if they
// were part of a single parameter vector. The extended jets are passed
// separately for each parameter.
//
// For example, consider a functor F taking two vector parameters, p[2] and
// q[3], and producing an output y[4]:
//
//   struct F {
//     template<typename T>
//     bool operator(const T *p, const T *q, T *z) {
//       // ...
//     }
//   };
//
// In this case, the necessary jet type is Jet<double, 5>. Here is a
// visualization of the jet objects in this case:
//
//          Dual components for p ----+
//                                    |
//                                   -+-
//           y                 [ * | 1 0 | 0 0 0 ]    --- p[0]
//                             [ * | 0 1 | 0 0 0 ]    --- p[1]
//   [ * | . . | + + + ]         |
//   [ * | . . | + + + ]         v
//   [ * | . . | + + + ]  <--- F(p, q)
//   [ * | . . | + + + ]            ^
//         ^^^   ^^^^^              |
//        dy/dp  dy/dq            [ * | 0 0 | 1 0 0 ] --- q[0]
//                                [ * | 0 0 | 0 1 0 ] --- q[1]
//                                [ * | 0 0 | 0 0 1 ] --- q[2]
//                                            --+--
//                                              |
//          Dual components for q --------------+
//
// where the 4x2 submatrix (marked with ".") and 4x3 submatrix (marked with "+"
// of y in the above diagram are the derivatives of y with respect to p and q
// respectively. This is how autodiff works for functors taking multiple vector
// valued arguments (up to 6).

For example, in this case, a function \(F(p,q)\) is given.

// In this case, the necessary jet type is Jet<double, 5>. Here is a
// visualization of the jet objects in this case:
//
//          Dual components for p ----+
//                                    |
//                                   -+-
//           y                 [ * | 1 0 | 0 0 0 ]    --- p[0]
//                             [ * | 0 1 | 0 0 0 ]    --- p[1]
//   [ * | . . | + + + ]         |
//   [ * | . . | + + + ]         v
//   [ * | . . | + + + ]  <--- F(p, q)
//   [ * | . . | + + + ]            ^
//         ^^^   ^^^^^              |
//        dy/dp  dy/dq            [ * | 0 0 | 1 0 0 ] --- q[0]
//                                [ * | 0 0 | 0 1 0 ] --- q[1]
//                                [ * | 0 0 | 0 0 1 ] --- q[2]

The initial Jet for p is \([\frac{\partial p}{\partial p}, \frac{\partial p}{\partial q}] = [I, 0]\). The initial Jet for q is \([\frac{\partial q}{\partial p}, \frac{\partial q}{\partial q}] = [0, I]\).

After the forward diff (section 1), we get \(y\) and \(y\)’s Jet, which holds \([\frac{\partial y}{\partial p}, \frac{\partial y}{\partial q}]\).
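
Here is a small numpy sketch of that seeding (my own hypothetical helper, not Ceres code; each “jet” is just a value plus a length-5 dual vector, as in Jet<double, 5>):

import numpy as np

# Seed jets for F(p, q) with p in R^2 and q in R^3: the dual vector spans the
# concatenated parameter vector [p0, p1, q0, q1, q2].
def make_seed_jets(p, q):
    n = len(p) + len(q)
    jets_p, jets_q = [], []
    for i, v in enumerate(p):
        dual = np.zeros(n)
        dual[i] = 1.0              # dp_i / dp_i = 1, everything else 0
        jets_p.append((v, dual))
    for j, v in enumerate(q):
        dual = np.zeros(n)
        dual[len(p) + j] = 1.0     # dq_j / dq_j = 1
        jets_q.append((v, dual))
    return jets_p, jets_q

p_jets, q_jets = make_seed_jets([1.0, 2.0], [3.0, 4.0, 5.0])
# Stacking the dual rows gives [I, 0] for p and [0, I] for q, matching the diagram.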

2.2 A question…

To be honest, I am not sure why people are using Jets.

Jets make some math simple, but they introduce unnecessary computation, i.e. a Jet tracks the derivative of the current variable w.r.t. ALL optimization variables.

Personally, I think using the expression tree topology to track derivatives is more intuitive and more efficient. My implementation uses a hashmap to track the derivatives w.r.t. the variables seen so far.

My guesses why Ceres uses Jets are:

  1. The theory of forward AD is usually written in terms of Jets.
  2. The performance gain from a hashmap may not be significant. A hashmap is not memory friendly, while Jet operations are just small vector operations that are very fast and easy to optimize with SIMD. (But can we mark part of an “unmet” Jet as void and skip the unnecessary computation?)

3 My implementation

3.1 Jet

import math
import numbers
import numpy as np
from collections import defaultdict


class JetMapImpl:
    def __init__(self, value, var_id=None):
        self.value = value
        self.derivative_wrt_variables = defaultdict(float)
        self.var_id = var_id
        if var_id is not None:
            self.derivative_wrt_variables[var_id] = 1.

The Jet class tracks a variable \(z\) and its derivatives w.r.t. the optimization variables. When initializing an optimization variable (a Jet with an ID), the derivative w.r.t. itself is 1.
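
A small usage sketch of the class above (my own toy example):

x = JetMapImpl(2.0, var_id='x')   # an optimization variable
c = JetMapImpl(3.0)               # a constant: no var_id, so no derivative entries

print(x.value)                            # 2.0
print(dict(x.derivative_wrt_variables))   # {'x': 1.0}, i.e. dx/dx = 1
print(dict(c.derivative_wrt_variables))   # {}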

3.2 Primitive Functions

My implementation of the multiplication and cosine functions.

class JetMapImpl:

    ... 
    
    def __mul__(self, other):
        if isinstance(other, numbers.Number):
            other = JetMapImpl(other)

        ret = JetMapImpl(value=self.value * other.value)
        ret.derivative_wrt_variables = defaultdict(float)

        for var in self.derivative_wrt_variables:
            ret.derivative_wrt_variables[var] += other.value * self.derivative_wrt_variables[var]

        for var in other.derivative_wrt_variables:
            ret.derivative_wrt_variables[var] += self.value * other.derivative_wrt_variables[var]
        return ret

    def __rmul__(self, other):
        if isinstance(other, numbers.Number):
            other = JetMapImpl(other)

        return other * self
        
    ...
    
def cosine(num):
    if isinstance(num, numbers.Number):
        return math.cos(num)
    else:
        ret = JetMapImpl(value=math.cos(num.value))
        ret.derivative_wrt_variables = defaultdict(float)

        for var in num.derivative_wrt_variables:
            ret.derivative_wrt_variables[var] += - math.sin(num.value) * num.derivative_wrt_variables[var]

        return ret

For multiplication, I overload the Python multiplication operator for the Jet class.

For more complex functions like cosine, I implement them as free functions.

The loops that accumulate ret.derivative_wrt_variables (in __mul__ and cosine) are the (forward) chain rule.
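
The SLAM example in section 4.2 also calls a sine helper that isn’t shown in the post; a sketch mirroring cosine above (assuming the same JetMapImpl interface) could be:

def sine(num):
    # d sin(z) / dx = cos(z) * dz/dx
    if isinstance(num, numbers.Number):
        return math.sin(num)
    ret = JetMapImpl(value=math.sin(num.value))
    ret.derivative_wrt_variables = defaultdict(float)
    for var in num.derivative_wrt_variables:
        ret.derivative_wrt_variables[var] += math.cos(num.value) * num.derivative_wrt_variables[var]
    return ret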

3.3 The Residual Block

The residual block takes a residual function \(f(x)\). It replaces x with Jets and does the forward auto diff. After the evaluation, we can read the derivatives out of the result.

class ResidualBlock:
    def __init__(self, residual_function, init_vars):
        self.residual_function = residual_function
        self.jets = np.array([JetMapImpl(v, var_id='var{}'.format(i))
                              for i, v in enumerate(init_vars)])

    def evaluate(self):
        jet_residual = self.residual_function(self.jets)

        jacobian = np.zeros([len(jet_residual), len(self.jets)])
        for ridx in range(len(jet_residual)):
            for vidx in range(len(self.jets)):
                jacobian[ridx, vidx] = jet_residual[ridx].derivative_wrt_variables[self.jets[vidx].var_id]
        residual = np.array([j.value for j in jet_residual])
        return residual, jacobian

The residual block takes a residual function and initial values.

In the evaluate function, the Jets pass through the residual function. Via the forward diff, the derivatives w.r.t. the variables are stored in jet_residual.

In the end, we match indices to construct the full derivative matrix (the Jacobian).
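
As a quick sanity check, here is a toy usage (my own example, reusing \(f(x) = cos(x) x\) from section 1.3):

# f(x) = cos(x) * x at x = pi/4; expect df/dx = cos(x) - x*sin(x) ~= 0.1517
def toy_residual(vars):
    x = vars[0]
    return [cosine(x) * x]

r, J = ResidualBlock(toy_residual, [math.pi / 4]).evaluate()
print(r)   # approximately [0.5554]
print(J)   # approximately [[0.1517]]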


That’s it! The whole library is 121 lines.

Well… I didn’t put the solver and the full jacobian construction into the library. But they are simple to implement. You know, we have to do something for the marketing people (me) so that they can use the title.

To honestly show what the library is doing, I called the library number_forward_flow.

4 Experiments (and the robust loss)

Let’s do some cool stuff!

Well… it’s better to start with something simple.

4.1 Linear regression

Problem:

\(cost(x) = ||Ax - b||^2\)

from number_forward_flow import *


def linear_case():
    print('=============== linear_case ==============')
    A = np.random.rand(6, 5)
    x_gt = np.ones(5, dtype='float64')

    b = A @ x_gt

    def residual_test(vars):
        """
        r = Ax - b
        """
        ret = A @ vars - b
        return ret

    x0 = np.random.rand(5, 1)
    r, J = ResidualBlock(residual_test, x0).evaluate()
    print('r:', r)
    print('J:', J)
    print('A:', A)

    J = np.array(J)
    r = np.array(r)
    dx = np.linalg.solve(J.T @ J, -J.T @ r)
    print('x0:', x0.T)
    print('dx:', dx.T)
    print('solver res:', (x0 + dx).T)
    print('x_gt:', x_gt.T)


if __name__ == "__main__":
    linear_case()

result:

(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/least_squares/ceres-from-scratch/linear_regression.py
=============== linear_case ==============
r: [[-1.30688531]
 [-1.03153702]
 [-1.1555956 ]
 [-1.22759951]
 [-0.74989858]
 [-0.70513636]]
J: [[0.27403841 0.77745345 0.827085   0.1841853  0.75343071]
 [0.30750629 0.79677311 0.60875236 0.103416   0.49716925]
 [0.12260936 0.62014886 0.00796148 0.725355   0.71716332]
 [0.3284037  0.39432525 0.77510378 0.11730527 0.91617197]
 [0.74388594 0.14046224 0.08931841 0.96549898 0.33602619]
 [0.50720116 0.31832743 0.4707162  0.96679627 0.11728903]]
A: [[0.27403841 0.77745345 0.827085   0.1841853  0.75343071]
 [0.30750629 0.79677311 0.60875236 0.103416   0.49716925]
 [0.12260936 0.62014886 0.00796148 0.725355   0.71716332]
 [0.3284037  0.39432525 0.77510378 0.11730527 0.91617197]
 [0.74388594 0.14046224 0.08931841 0.96549898 0.33602619]
 [0.50720116 0.31832743 0.4707162  0.96679627 0.11728903]]
x0: [[0.87244477 0.49009767 0.78562987 0.71376338 0.14327615]]
dx: [[0.12755523 0.50990233 0.21437013 0.28623662 0.85672385]]
solver res: [[1. 1. 1. 1. 1.]]
x_gt: [1. 1. 1. 1. 1.]

4.2 SLAM

A simple SLAM,

\(cost(s_0, s_1) = ||\text{motion}(s_0) - s_1||^2 + ||\text{observation}(s_0) - o_0||^2 + ||\text{observation}(s_1) - o_1||^2\)

from number_forward_flow import *


def motion_case():
    print('=============== motion (nonlinear) case ==============')

    def motion_func(state):
        x, y, v, theta = state

        xnext = x + cosine(theta) * v
        ynext = y + sine(theta) * v

        return [xnext, ynext, v, theta]

    def observation_func(state):
        x, y, v, theta = state
        return x, y

    s0 = [0., 0., 0.5, math.pi / 9]
    s1 = motion_func(s0)

    o0 = observation_func(s0)
    o1 = observation_func(s1)

    v0 = [0., 0., 1., math.pi / 4]
    v1 = [0., 0., 0., 0.]
    states = v0 + v1

    def motion_residaul(state0andstate1):
        state0 = state0andstate1[:4]
        state1 = state0andstate1[4:]
        pred = motion_func(state0)
        return [b - a for a, b in zip(pred, state1)]

    def observation_residual0(state0andstate1):
        state0 = state0andstate1[:4]
        pred = observation_func(state0)
        return [a - b for a, b in zip(pred, o0)]

    def observation_residual1(state0andstate1):
        state1 = state0andstate1[4:]
        pred = observation_func(state1)
        return [a - b for a, b in zip(pred, o1)]

    for iter in range(10):
        full_jacobian = np.zeros([0, 8])
        residaul = np.zeros([0])

        r, J = ResidualBlock(observation_residual0, states).evaluate()
        r, J = np.array(r).T, np.array(J)
        residaul = np.hstack([residaul, r])
        full_jacobian = np.vstack([full_jacobian, J])

        r, J = ResidualBlock(observation_residual1, states).evaluate()
        r, J = np.array(r).T, np.array(J)
        residaul = np.hstack([residaul, r])
        full_jacobian = np.vstack([full_jacobian, J])

        r, J = ResidualBlock(motion_residaul, states).evaluate()
        r, J = np.array(r).T, np.array(J)
        residaul = np.hstack([residaul, r])
        full_jacobian = np.vstack([full_jacobian, J])

        delta = np.linalg.solve(
            full_jacobian.T @ full_jacobian, - full_jacobian.T @ residaul)
        states = [s + 0.8 * d for s, d in zip(states, delta)]
        print('cost:', residaul.T @ residaul)

    print('result s0:', states[:4])
    print('gt     s0:', s0)
    print('result s1:', states[4:])
    print('gt     s1:', s1)


if __name__ == "__main__":
    motion_case()

Note that I have to match indices for the Normal equation since I didn’t implement the index matching internally.

result:

(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/least_squares/ceres-from-scratch/mpc.py
=============== motion (nonlinear) case ==============
cost: 2.866850275068085
cost: 0.1170672477967987
cost: 0.004831987209573185
cost: 0.00018651715814521183
cost: 7.375151884934139e-06
cost: 2.94287403335809e-07
cost: 1.17656996267606e-08
cost: 4.705815402071198e-10
cost: 1.8822889933427077e-11
cost: 7.529126237782477e-13
result s0: [-1.9058241313235495e-22, -9.529120656487484e-23, 0.49999997554265163, 0.3490660486509205]
gt     s0: [0.0, 0.0, 0.5, 0.3490658503988659]
result s1: [0.46984626228069204, 0.17101005415140302, 0.49999987314265165, 0.34906596822614855]
gt     s1: [0.4698463103929542, 0.17101007166283436, 0.5, 0.3490658503988659]

4.3 Local parameterization SO(3)

Please read this tutorial if you are not familiar with the problem.

Problem:

\(cost(R, t) = \sum_i ||(R p_i + t) - q_i||^2\)

Where \(R \in SO(3), t \in R^3\).

Solve as,

\(cost(\epsilon, t) = \sum_i ||(exp(\epsilon) R p_i + t) - q_i||^2\)

\(\text{Update:} R^+ = exp(\epsilon^*) R\)

from number_forward_flow import *


def euler_angle_to_rotation(yaw, pitch, roll):
    from math import cos, sin
    Rz = np.array([
        [cos(yaw), -sin(yaw), 0.],
        [sin(yaw), cos(yaw), 0.],
        [0, 0, 1.],
    ])
    Ry = np.array([
        [cos(pitch), -0., sin(pitch)],
        [0., 1., 0.],
        [-sin(pitch), 0, cos(pitch)],
    ])
    Rx = np.array([
        [1., 0., 0.],
        [0., cos(roll), -sin(roll)],
        [0, sin(roll), cos(roll)],
    ])

    return Rz @ Ry @ Rx


def skew(w):
    wx, wy, wz = w
    return np.array([
        [0, -wz, wy],
        [wz, 0, -wx],
        [-wy, wx, 0.],
    ])


def so3_exp(w):
    from math import cos, sin
    theta = np.linalg.norm(w)
    # Approximation when theta is small
    if(abs(theta) < 1e-8):
        return np.identity(3) + skew(w)

    normalized_w = w / theta
    K = skew(normalized_w)
    # Rodrigues
    R = np.identity(3) + sin(theta) * K + (1 - cos(theta)) * K @ K
    np.testing.assert_almost_equal(R @ R.transpose(), np.identity(3))

    return R


def icp_so3():
    APPLY_CAUCHY_LOSS = False
    ADD_WRONG_ASSOCIATION = False

    print('=============== icp_so3 ==============')
    R = euler_angle_to_rotation(0.2, 0.1, 0.3)
    assert(abs(np.linalg.det(R) - 1) < 1e-4)
    t = np.array([1., 2., 3.]).T

    src = np.array([
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0],
        [1, 1, 0],
        [0, 1, 1],
        [1, 0, 1],
        [0, 2, -1],
    ])

    def transform(p, R, t):
        return R @ p + t

    def transform_all(src, R, t):
        return (R @ src.T + t[:, np.newaxis]).T

    target = transform_all(src, R, t)

    if ADD_WRONG_ASSOCIATION:
        src = np.vstack([src, np.array([1, 2, 3])])
        target = np.vstack([target, np.array([10, 20, 30])])

    def icp_residual_i(epsilonWithT, Rot, src_i, target_i):
        E = skew(epsilonWithT[:3])
        t = epsilonWithT[3:]

        # this is the true function
        #  return transform(src_i, so3_exp(epsilonWithT[:3]) @ R, t) - target_i
        # but need to implement a couple of operators for so3_exp
        return transform(src_i, (E + np.identity(3)) @ Rot, t) - target_i

    epsilonWithT_var = np.array([0., 0., 0., 0., 0., 0.]).T
    R_var = np.identity(3)


    for iter in range(20):
        lhs = np.zeros([6, 6])
        rhs = np.zeros(6)
        cost = 0
        for src_i, target_i in zip(src, target):
            r, J = ResidualBlock(lambda param: icp_residual_i(
                param, R_var, src_i, target_i), epsilonWithT_var).evaluate()
            if APPLY_CAUCHY_LOSS:
                def cauchy_dot(s):
                    return 1 / (1 + s)
                cauchy_weigth = cauchy_dot(r.T @ r)
                lhs += cauchy_weigth * J.T @ J
                rhs -= cauchy_weigth * J.T @ r
                cost += math.log(1 + np.linalg.norm(r))
            else:
                lhs += J.T @ J
                rhs -= J.T @ r
                cost += np.linalg.norm(r)
        print('iter:', iter, 'cost:', cost)
        delta = 0.8 * np.linalg.solve(lhs, rhs)
        R_var = so3_exp(delta[:3]) @ R_var
        epsilonWithT_var[3:] += delta[3:]
    print('t_est:', epsilonWithT_var[3:])
    print('t_gt: ', t)
    print('R_var:', R_var)
    print('R_gt: ', R)


if __name__ == "__main__":
    icp_so3()

Result:

(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/least_squares/ceres-from-scratch/icp_so3.py
=============== icp_so3 ==============
iter: 0 cost: 26.884512073938634
iter: 1 cost: 5.419576314487301
iter: 2 cost: 1.0879076340845337
iter: 3 cost: 0.21775315837565062
iter: 4 cost: 0.04355755850429881
iter: 5 cost: 0.008711789207605125
iter: 6 cost: 0.0017423689451638363
iter: 7 cost: 0.0003484742332049086
iter: 8 cost: 6.969486440709107e-05
iter: 9 cost: 1.3938973593755753e-05
iter: 10 cost: 2.7877947472706096e-06
iter: 11 cost: 5.57558952314859e-07
iter: 12 cost: 1.1151179136412616e-07
iter: 13 cost: 2.2302356268680893e-08
iter: 14 cost: 4.4604727227661115e-09
iter: 15 cost: 8.920953211635121e-10
iter: 16 cost: 1.784177261299197e-10
iter: 17 cost: 3.568323659195618e-11
iter: 18 cost: 7.137834990046151e-12
iter: 19 cost: 1.4267434495726705e-12
t_est: [1. 2. 3.]
t_gt:  [1. 2. 3.]
R_var: [[ 0.97517033 -0.16088136  0.15218417]
 [ 0.19767681  0.94215466 -0.27068149]
 [-0.09983342  0.29404384  0.95056379]]
R_gt:  [[ 0.97517033 -0.16088136  0.15218417]
 [ 0.19767681  0.94215466 -0.27068149]
 [-0.09983342  0.29404384  0.95056379]]

4.4 Robust Loss

Did you notice the robust loss in the previous implementation?

Here are some robust loss functions (copied from the Ceres doc).

A robust loss function is a map \(\rho: R^1 \to R^1\).

To use it in the least squares framework,

\(cost(x) = \sum_i \rho(||r_i(x)||^2)\)

The logic behind the robust loss is: if a cost term is high, there is probably something wrong with the data! As a result, we limit the contribution of that cost term.

A good robust loss can’t pass the behavior interview.

How does the loss function affect the normal equation?

Let’s derive it! We want to take the derivative with respect to x and solve the (local) optimality condition.

\(\frac{\partial cost}{\partial x} = \sum_i \frac{\partial \rho(||r_i(x)||^2)}{\partial ||r_i(x)||^2} \frac{\partial ||r_i(x)||^2}{\partial x} \; \; \; (4.1)\)

First, consider the second factor on the RHS of (4.1). \(\frac{\partial ||r_i(x)||^2}{\partial x}\) is what produces the usual Normal equation. Let’s derive it.

Let’s linearize the residual: \(r_i(x) \approx J_i x + b_i\). Plugging in,

\(\frac{\partial ||J_i x + b_i||^2}{\partial x} = 2 J_i^T J_i x + 2 J_i^T b_i\)

Second, consider the first factor \(\frac{\partial \rho(||r_i(x)||^2)}{\partial ||r_i(x)||^2}\) in (4.1). It can be written as \(\rho'(||r_i(x)||^2)\) because \(\rho: R^1 \to R^1\).

Let \(c_i = ||r_i(x)||^2\).

Putting them together, the local optimality condition (gradient == 0) becomes,

\(\frac{\partial cost}{\partial x} = \sum_i \rho'(c_i) (2 J_i^T J_i x + 2 J_i^T b_i) = 0\)

\(\sum_i \rho'(c_i) J_i^T J_i x = - \sum_i \rho'(c_i) J_i^T b_i\)

Compare it with the “normal” Normal Equation,

\(\sum_i J_i^T J_i x = - \sum_i J_i^T b_i\)

To apply a robust loss, we simply need to weight each residual term’s contribution to the Normal equation by \(\rho'(c_i)\).
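
For example, the Cauchy loss used in the ICP experiment is

\(\rho(s) = log(1 + s), \quad \rho'(s) = \frac{1}{1 + s},\)

so the weight \(\rho'(c_i)\) is exactly the cauchy_dot factor in the excerpt below.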

    ...
    for iter in range(20):
        lhs = np.zeros([6, 6])
        rhs = np.zeros(6)
        cost = 0
        for src_i, target_i in zip(src, target):
            r, J = ResidualBlock(lambda param: icp_residual_i(
                param, R_var, src_i, target_i), epsilonWithT_var).evaluate()
            if APPLY_CAUCHY_LOSS:
                def cauchy_dot(s):
                    return 1 / (1 + s)
                cauchy_weigth = cauchy_dot(r.T @ r)
                lhs += cauchy_weigth * J.T @ J
                rhs -= cauchy_weigth * J.T @ r
                cost += math.log(1 + np.linalg.norm(r))
            else:
        ...

Result when I enable the wrong association without enabling the Cauchy loss.

APPLY_CAUCHY_LOSS = False
ADD_WRONG_ASSOCIATION = True
(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/least_squares/ceres-from-scratch/icp_so3.py
=============== icp_so3 ==============
iter: 0 cost: 60.5594285549041
iter: 1 cost: 43.30780430671041
iter: 2 cost: 53.554583003105876
iter: 3 cost: 58.16457480658197
iter: 4 cost: 47.261017105176116
iter: 5 cost: 55.82294904408156
iter: 6 cost: 69.23781198574048
iter: 7 cost: 63.85039428878929
iter: 8 cost: 58.253117204681786
iter: 9 cost: 58.494428329036374
iter: 10 cost: 70.38793814727589
iter: 11 cost: 72.49860034582484
iter: 12 cost: 73.39338170895616
iter: 13 cost: 72.27579193738345
iter: 14 cost: 74.2428946732636
iter: 15 cost: 71.66874982062299
iter: 16 cost: 73.23984346786979
iter: 17 cost: 72.38554941504754
iter: 18 cost: 73.89789524152792
iter: 19 cost: 71.57678077744617
t_est: [-1.78556957 -1.06431437  0.54212178]
t_gt:  [1. 2. 3.]
R_var: [[ 0.4413223   0.09248707  0.89256975]
 [-0.26768531  0.96295561  0.03257399]
 [-0.85649238 -0.25330344  0.44973122]]
R_gt:  [[ 0.97517033 -0.16088136  0.15218417]
 [ 0.19767681  0.94215466 -0.27068149]
 [-0.09983342  0.29404384  0.95056379]]

Result when I enable the Cauchy loss.

APPLY_CAUCHY_LOSS = True
ADD_WRONG_ASSOCIATION = True
(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/least_squares/ceres-from-scratch/icp_so3.py
=============== icp_so3 ==============
iter: 0 cost: 14.579071608386414
iter: 1 cost: 7.274149623826291
iter: 2 cost: 4.3649747185988454
iter: 3 cost: 3.606929402577699
iter: 4 cost: 3.4792882066666486
iter: 5 cost: 3.4809811124850443
iter: 6 cost: 3.482993355129383
iter: 7 cost: 3.48346479535868
iter: 8 cost: 3.4835635397001203
iter: 9 cost: 3.4835836971716323
iter: 10 cost: 3.48358778731986
iter: 11 cost: 3.483588615613597
iter: 12 cost: 3.483588783169081
iter: 13 cost: 3.4835888170356775
iter: 14 cost: 3.4835888238759565
iter: 15 cost: 3.483588825256659
iter: 16 cost: 3.4835888255351937
iter: 17 cost: 3.4835888255913563
iter: 18 cost: 3.483588825602674
iter: 19 cost: 3.483588825604953
t_est: [0.99941573 1.99951618 3.0101074 ]
t_gt:  [1. 2. 3.]
R_var: [[ 0.97475665 -0.15963402  0.15609761]
 [ 0.19688822  0.94426422 -0.26381834]
 [-0.10528301  0.28789246  0.95185788]]
R_gt:  [[ 0.97517033 -0.16088136  0.15218417]
 [ 0.19767681  0.94215466 -0.27068149]
 [-0.09983342  0.29404384  0.95056379]]

Much better!


Ceres, maybe you can do better?

This is the doc from Ceres for Robust Loss functions.

I believe the author did a second-order Taylor expansion of \(\rho(f^2(x))\) and found its minimum.

Personally, I don’t think it’s necessary, since we can solve the optimality condition \(\nabla \rho(f^2(x)) = 0\) directly with a first-order Taylor expansion, which is what least squares does. Besides, the \(2 \rho'' f(x) f^T(x)\) term looks like a dense matrix; removing it would make the linear system faster to solve.

Please try it! Or let me know if I am wrong.

If it works, maybe I can raise a PR and call myself “a significant contributor to the Ceres lib”?

A significant contributor to Ceres Library

4.5 Local parameterization SE(3)

Please read this tutorial if you are not familiar with the problem.

Problem:

\(cost(T) = \sum_i ||T p_i - q_i||^2\)

Where \(T \in SE(3)\).

Solve as,

\(cost(\epsilon) = \sum_i || exp(\epsilon) T p_i - q_i ||^2\)

\(\text{Update:} T^+ = exp(\epsilon^*) T\)

from number_forward_flow import *


def euler_angle_to_rotation(yaw, pitch, roll):
    ...


def skew(w):
    ...

def so3_exp(w):
    ...

def V_operator(w):
    from math import cos, sin
    w_skew = skew(w)
    theta = np.linalg.norm(w)
    if(abs(theta) < 1e-7):
        return np.identity(3) + w_skew / 2.
    
    V = np.identity(3) + (1. - cos(theta)) / (theta * theta) * w_skew + \
        (theta - sin(theta)) / (theta * theta * theta) * w_skew @ w_skew
    return V

def se3_exp(se3):
    w = se3[:3]
    t = se3[3:].reshape([3,1])

    SE3_mat = np.identity(4)
    SE3_mat[:3, :3] = so3_exp(w)
    # I hate numpy
    SE3_mat[:3, 3] = (V_operator(w) @ t).flatten()

    return SE3_mat

def icp_se3():
    print('=============== icp_se3 ==============')

    T = np.identity(4)
    T[:3,:3] = euler_angle_to_rotation(0.2, 0.1, 0.3)
    T[:3, 3] = np.array([1., 2., 3.])

    src = np.array([
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0],
        [1, 1, 0],
        [0, 1, 1],
        [1, 0, 1],
        [0, 2, -1],
    ])
    # homogenous coordinate
    src = np.hstack([src, np.ones([src.shape[0],1])]).T
    
    target = T @ src

    def icp_residual_i(se3_para, T, src_i, target_i):
        # this is the true function
        #  return exp(se3_para) @ T @ src_i - target_i
        # but need to implement a couple of operators for so3_exp
        wx,wy,wz,tx,ty,tz = se3_para
        SE3_approx = np.array([
            [1., -wz, wy, tx],
            [wz, 1., -wx, ty],
            [-wy, wx, 1., tz],
            [0., 0., 0., 1.],
        ])

        return (SE3_approx @ T @ src_i - target_i)[:3]

    T_var = np.identity(4)

    for iter in range(20):
        local_se3 = np.array([0., 0., 0., 0., 0., 0.]).T
        lhs = np.zeros([6, 6])
        rhs = np.zeros(6)

        cost = 0
        for i in range(src.shape[1]):
            src_i = src[:, i]
            target_i = target[:, i]
            r, J = ResidualBlock(lambda param: icp_residual_i(
            param, T_var, src_i, target_i), local_se3).evaluate()
            lhs += J.T @ J
            rhs -= J.T @ r
            cost += np.linalg.norm(r)
        print('iter', iter, 'cost:', cost)
        local_se3_delta = 0.8 * np.linalg.solve(lhs, rhs)
        T_var = se3_exp(local_se3_delta) @ T_var

    print('T_var:', T_var)
    print('T_gt: ', T)


if __name__ == "__main__":
    icp_se3()

Results.

(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/least_squares/ceres-from-scratch/icp_se3.py
=============== icp_se3 ==============
iter 0 cost: 26.884512073938634
iter 1 cost: 5.772819496034312
iter 2 cost: 1.1612456277943384
iter 3 cost: 0.23231098216681542
iter 4 cost: 0.046462990229864484
iter 5 cost: 0.009292616417063856
iter 6 cost: 0.0018585239113496584
iter 7 cost: 0.00037170480653331354
iter 8 cost: 7.434096227232765e-05
iter 9 cost: 1.4868192493123225e-05
iter 10 cost: 2.9736385004466533e-06
iter 11 cost: 5.947276998528477e-07
iter 12 cost: 1.1894554141255272e-07
iter 13 cost: 2.3789105864585908e-08
iter 14 cost: 4.757822697514017e-09
iter 15 cost: 9.515621324543842e-10
iter 16 cost: 1.9031257564598513e-10
iter 17 cost: 3.806277286344445e-11
iter 18 cost: 7.613182718185125e-12
iter 19 cost: 1.5202378638153743e-12
T_var: [[ 0.97517033 -0.16088136  0.15218417  1.        ]
 [ 0.19767681  0.94215466 -0.27068149  2.        ]
 [-0.09983342  0.29404384  0.95056379  3.        ]
 [ 0.          0.          0.          1.        ]]
T_gt:  [[ 0.97517033 -0.16088136  0.15218417  1.        ]
 [ 0.19767681  0.94215466 -0.27068149  2.        ]
 [-0.09983342  0.29404384  0.95056379  3.        ]
 [ 0.          0.          0.          1.        ]]

Fun fact: the cost for \(SE(3)\) is a little bit higher than for \((SO(3), R^3)\). I believe it’s due to the “coupled” update for \(SE(3)\). I guess someone could write a paper to analyze it.

5 Forward diff in PyTorch

In some situations, the forward diff is faster than the backward diff.

This feature request asked for adding the forward diff to PyTorch.

However, from section 1.5, we can see that the forward diff and the backward diff don’t mix well: the graph traversal and the values to track are different for the two methods.

Mixing complex things together is a dangerous move in practice. The feature request may be rejected.

But we can use the forward diff within the PyTorch framework without adding too much complexity.


We can integrate the forward diff as a “module” in the backward diff framework.

For example, consider this paper.

Digging Into Self-Supervised Monocular Depth Estimation
Clement Godard, Oisin Mac Aodha, Michael Firman, Gabriel Brostow

The goal of the paper is,

  • Given two images, A and B.
  • The authors wanted to find a depth estimation function \(p^A = \text{depth} = f_1(A)\) and a transformation estimation function \(T_A^B = f_2(A,B)\),
  • such that the reprojection error between the image B predicted from the depth (in math, \(\text{pred-pixel-coord-in-B} = \pi (T_A^B p^A)\)) and the actual image B is minimized.

In a picture,

PyTorch does the backward diff and finds the optimal \(f_1(A)\) and \(f_2(A, B)\).

It works. But maybe we can do better.

The transformation function and the camera projection function map a point (cloud) \(p^A\) to a pixel coordinate \(z = (u,v)\).

We want to compute \(\frac{\partial z}{\partial p^A}\). Using the forward diff is (likely) more efficient than the backward diff for this function because,

  1. The forward diff doesn’t need to do a topological sort.
  2. The size of the input is similar to the size of the output.
  3. (Or just look at the experimental results; the benefit of C++ templates, a.k.a. hard coding with compiler support 🙂 )

Instead of doing the forward diff for the whole problem, we can group the two functions together as a “module” and use the forward diff to compute \(\frac{\partial z}{\partial p^A}\).

(A more interesting experiment is to use the forward diff for \(f_1(A)\), which is a U-Net. My feeling is we can get a little bit of speed-up.)

The whole problem is still solved by the backward diff; we only use the forward diff to compute the derivative of a “module”.

All you need to do is implement a mini forward diff lib. With the existing PyTorch support, I feel it’s not a hard task.

The syntax could be something like,

class Network(nn.ForwardModule):
    def __init__(self):
        ...
    
    def forward(self, x):
        ...
    
        return x

Internally, the ForwardModule uses the forward diff to compute its derivative. The PyTorch computational graph doesn’t need to be updated since the ForwardModule is simply a module in the eyes of PyTorch.
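
As a rough sketch of how this could be wired up today (my own hypothetical code built on torch.autograd.Function, not an existing nn.ForwardModule API; the per-point Jacobian is hard-coded here as a stand-in for a forward-diff pass like the jets above):

import torch

def project_with_jacobian(p):
    # Pinhole-style projection z = (X/Z, Y/Z) and its 2x3 Jacobian dz/dp.
    # In a real implementation this pair would come from a forward-diff pass.
    X, Y, Z = p.tolist()
    z = torch.tensor([X / Z, Y / Z])
    J = torch.tensor([[1 / Z, 0.0, -X / Z ** 2],
                      [0.0, 1 / Z, -Y / Z ** 2]])
    return z, J

class ForwardDiffProjection(torch.autograd.Function):
    @staticmethod
    def forward(ctx, p):
        z, J = project_with_jacobian(p)
        ctx.save_for_backward(J)
        return z

    @staticmethod
    def backward(ctx, grad_output):
        (J,) = ctx.saved_tensors
        # Chain rule: dL/dp = dL/dz @ dz/dp; autograd never looks inside this "module".
        return grad_output @ J

p = torch.tensor([1.0, 2.0, 4.0], requires_grad=True)
loss = ForwardDiffProjection.apply(p).sum()
loss.backward()
print(p.grad)   # tensor([ 0.2500,  0.2500, -0.1875])

To PyTorch, ForwardDiffProjection is just another differentiable op, so the rest of the network keeps using the backward diff as usual.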

Please let me know if you think it’s beneficial and feasible. I’d like to submit a PR and call myself a significant contributor to PyTorch.

The End

It was a pleasure to go through the development process for the Ceres library! I really appreciate your hard work!

You saved my hair!

Me after Ceres

The code link.

And I got a couple of questions for Ceres developers.

  1. Why use Jets instead of tracking derivatives directly? (Section 2.2)
  2. Is there a simpler robust loss formula? (Section 4.4)
  3. Why use Trust Region instead of line search? I believe they are equivalent, and line search is easier.
  4. Please add an API for adding a quadratic cost directly. (Otherwise, I need to do a Cholesky and add the “sqrt” terms as residuals.)
  5. Please add \(SO(3)\) support! I hate quaternions! (This is personal.)
  6. Why not add constrained least squares? It’s not too hard to implement. A POC.

And a question for PyTorch/TensorFlow developers.

  1. Can we add a forward diff module into TensorFlow? (Section 5)

Thanks!
