Variable Elimination, Smoothing, and Marginalization Explained

Why write this article?

Variable elimination is the fundamental algorithm for solving a linear system of equations.

For least-squares problems, the Normal Equation is a linear system, and variable elimination is a method to solve it.

In previous articles, we used matrix decomposition methods to solve the linear system. In how-to-land-a-rocket-section-1, we used the Sparse Cholesky Decomposition to solve the Normal Equation. In how-to-land-a-rocket-section-2, we used the Sparse LU Decomposition to solve the relaxed linearized KKT system of equations for the constrained optimization problem.

LU decomposition and Cholesky decomposition are simply variable elimination.

Smoothing (or fixed-lag smoothing) and marginalization are common techniques in least-squares applications. They are simply variable elimination too.

I'd like to explain variable elimination first. Then I will explain smoothing and marginalization from the perspective of variable elimination.

Also, understanding variable elimination allows us to exploit the structure of a linear system of equations. As a result, we can solve a structured linear system of equations faster.

Variable Elimination

Variable elimination is exactly what we learned at the beginning of linear algebra. It is super easy.

Consider an example. Given a 2-by-2 linear system,

\(a x_1 + b x_2 = e\\ c x_1 + d x_2 = f\)

In vector form,

\(\begin{bmatrix} a & b\\ c & d \end{bmatrix} \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} e\\ f \end{bmatrix}\)

We can eliminate c by subtracting \(a^{-1} c\) times the first row from the second row. It doesn’t change the solution x of the linear system.

\(\left(\begin{bmatrix} a & b\\ c & d \end{bmatrix} - \begin{bmatrix} 0 & 0\\ a a^{-1} c & b a^{-1} c \end{bmatrix}\right) \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} e\\ f \end{bmatrix} - \begin{bmatrix} 0\\ e a^{-1} c \end{bmatrix} \)

\(\begin{bmatrix} a & b\\ 0 & d - b a^{-1} c \end{bmatrix} \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} e\\ f - e a^{-1} c \end{bmatrix}\)

We eliminated \(x_1\) in the second row of the system.

Marginalization

Now consider the second row of the eliminated equation,

\(\begin{bmatrix} 0 & d - b a^{-1} c \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} f - e a^{-1} c \end{bmatrix}\)

Because the coefficient for \(x_1\) is 0, we can ignore the first column. We get,

\(\begin{bmatrix} d - b a^{-1} c \end{bmatrix} \begin{bmatrix} x_2 \end{bmatrix} = \begin{bmatrix} f - e a^{-1} c \end{bmatrix}\)

We can solve for \(x_2\) directly.

This is Marginalization. We eliminated \(x_1\), and the new linear system only has \(x_2\).

To solve the original system for \(x_1, x_2\), we can first use the second row,

\(\begin{bmatrix} d - b a^{-1} c \end{bmatrix} \begin{bmatrix} x_2 \end{bmatrix} = \begin{bmatrix} f - e a^{-1} c \end{bmatrix}\)

to compute \(x_2^*\).

Then, consider the first row of the original equation,

\(\begin{bmatrix} a & b\end{bmatrix} \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} e \end{bmatrix} \)

Since \(x_2^*\) is given,

\(\begin{bmatrix} a \end{bmatrix} \begin{bmatrix} x_1 \end{bmatrix} = \begin{bmatrix} e - b x_2^* \end{bmatrix} \)

We can compute \(x_1^*\) easily.
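To make this concrete, here is a minimal numerical sketch of the elimination, marginalization, and back-substitution steps above. The numbers are arbitrary and purely for illustration.

import numpy as np

# An arbitrary 2-by-2 system [a b; c d] [x1; x2] = [e; f].
a, b, c, d = 2.0, 1.0, 3.0, 5.0
e, f = 4.0, 7.0

# Eliminate x1 from the second row: subtract (a^-1 c) times row 1.
marginal_lhs = d - b * (1.0 / a) * c   # coefficient of x2 after elimination
marginal_rhs = f - e * (1.0 / a) * c   # right-hand side after elimination

# Marginalization: solve for x2 alone.
x2 = marginal_rhs / marginal_lhs

# Back-substitution: recover x1 from the first row, a * x1 + b * x2 = e.
x1 = (e - b * x2) / a

# Check against a direct solve of the full system.
x_full = np.linalg.solve(np.array([[a, b], [c, d]]), np.array([e, f]))
assert np.allclose([x1, x2], x_full)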

Schur Complement

The above computation also applies to matrices. That's the Schur Complement.

Consider an example. Given a 2-by-2 block linear system,

\(\begin{bmatrix} A & B\\ C & D \end{bmatrix} \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} e\\ f \end{bmatrix} \; \; \;(1.1)\)

where \(A, B, C, D\) are matrices and \(e, f\) are vectors.

We can eliminate the block \(C\) by subtracting \(C A^{-1}\) times the first row from the second row. It doesn't change the solution \(x\) of the linear system.

\(\left(\begin{bmatrix} A & B\\ C & D \end{bmatrix} - \begin{bmatrix} 0 & 0\\ C A^{-1} A & C A^{-1} B \end{bmatrix}\right) \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} e\\ f \end{bmatrix} - \begin{bmatrix} 0\\ C A^{-1} e \end{bmatrix} \)

\(\begin{bmatrix} A & B\\ 0 & D - C A^{-1} B \end{bmatrix} \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} e\\ f - C A^{-1} e \end{bmatrix}\)

To solve for \(x_2\), we can solve the second row,

\((D - C A^{-1} B) x_2 = f - C A^{-1} e \; \; \;(1.2)\)

Given \(x_2\), we can solve for \(x_1\) using the first row,

\(A x_1 + B x_2 = e\)

\(A x_1 = e - B x_2\)

It's back-substitution.

Note that the solution \(x_2\) of (1.2) is the same as the solution \(x_2\) of (1.1). In other words, (1.2) contains the same information about \(x_2\) as (1.1) does.

Equation (1.2) is the Schur Complement of the linear system of equations (1.1).
We also call equation (1.2) the Marginalization of (1.1) with \(x_1\) marginalized.
They are just fancy names for variable elimination.
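Here is a minimal numpy sketch of the block version, using arbitrary random blocks. It checks that the \(x_2\) from the Schur complement (1.2) matches the \(x_2\) from solving the full system (1.1), and that back-substitution recovers \(x_1\).

import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 2

# Random, well-conditioned blocks for the system (1.1).
A = rng.normal(size=(n1, n1)) + 5.0 * np.eye(n1)
B = rng.normal(size=(n1, n2))
C = rng.normal(size=(n2, n1))
D = rng.normal(size=(n2, n2)) + 5.0 * np.eye(n2)
e = rng.normal(size=n1)
f = rng.normal(size=n2)

# Schur complement (1.2): eliminate x1.
A_inv = np.linalg.inv(A)
x2 = np.linalg.solve(D - C @ A_inv @ B, f - C @ A_inv @ e)

# Back-substitution: A x1 = e - B x2.
x1 = np.linalg.solve(A, e - B @ x2)

# Compare with a direct solve of the full block system.
x_full = np.linalg.solve(np.block([[A, B], [C, D]]), np.concatenate([e, f]))
assert np.allclose(np.concatenate([x1, x2]), x_full)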

Using variable elimination, we can solve a linear system of equations.

Consider solving a linear system in \((x_1, x_2, x_3, x_4, \ldots, x_n)\). Using variable elimination, we can eliminate the variables \(x_i\) in index order, e.g. first eliminate \(x_1\), then \(x_2\), and so on, and do back-substitution at the end.

Now suppose that, for whatever reason, we receive the \(x_i\) (and the related data and costs) in sequence: we first get \(x_1\); later we get \(x_2\), and so on. While waiting for new variables, we can still do the above variable elimination. When we get the last variable \(x_n\), we only need to eliminate \(x_{n-1}\) and do the back-substitution at the end.

The total computation for the two methods is the same. But the latency of the second approach is only one variable elimination once we receive the last variable, whereas the first approach needs to eliminate all variables after receiving all the data. Doing the variable elimination in advance is basically fixed-lag smoothing.

Fixed-Lag Smoothing: An Example

It is always nice to start with an example. Consider a least squares quadratic cost term,

\(c_i(x_i, x_{i+1}) := \begin{bmatrix} x_i & x_{i+1} \end{bmatrix} \begin{bmatrix} A_i & B_i \\ C_i & D_i \end{bmatrix} \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} - 2 \begin{bmatrix} e_i & f_i \end{bmatrix} \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} \)

\(:=\) means definition. It is the expanded least squares cost function \(||J x + r||^2\) (up to a constant), with the symmetric Hessian block \(\begin{bmatrix} A_i & B_i \\ C_i & D_i \end{bmatrix}\) playing the role of \(J^T J\).

The optimality condition \(\nabla c_i(x_i, x_{i+1}) = 0\) of the quadratic cost is

\( \begin{bmatrix} A_i & B_i \\ C_i & D_i \end{bmatrix} \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} = \begin{bmatrix} e_i \\ f_i \end{bmatrix} \; \; \; (2.1) \)

Now consider an optimization problem. The cost function is the sum of the above cost terms.

\(\text{cost1}(x_1, x_2, x_3) := c_1(x_1, x_2) + c_2(x_2, x_3) \; \; \; (2.2)\)

Because the \(c_i(x_i, x_{i+1})\) are quadratic cost functions, we can solve the optimization problem by solving the optimality condition \(\nabla \text{cost1}(x_1, x_2, x_3) = 0\).

Since the gradient operator is linear, \(\nabla \text{cost1}(x_1, x_2, x_3) = 0\) can be written as,

\( \nabla c_1(x_1, x_2) + \nabla c_2(x_2, x_3) = 0\)

Using (2.1), the gradients of the \(c_i\) give,

\(\left(\begin{bmatrix} A_1 & B_1 & 0\\ C_1 & D_1 & 0 \\ 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0\\ 0 & A_2 & B_2 \\ 0 & C_2 & D_2 \end{bmatrix}\right) \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} e_1\\ f_1 \\ 0 \end{bmatrix} + \begin{bmatrix} 0\\ e_2 \\ f_2 \end{bmatrix} \)

\(\begin{bmatrix} A_1 & B_1 & 0\\ C_1 & D_1 + A_2 & B_2 \\ 0 & C_2 & D_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} e_1\\ f_1 + e_2 \\ f_2 \end{bmatrix} \; \; (2.3)\)

We can solve for \((x_1^*, x_2^*, x_3^*)\) by solving this \(Ax = b\) system directly. But to illustrate the smoothing algorithm, consider eliminating \((x_1, x_2)\) by the Schur complement (I will skip the algebra for simplicity). We get,

\(M_1 x_3 = m_1\)

where \(M_1 x_3 = m_1\) is (2.3) with \((x_1, x_2)\) eliminated.

Solving it, we get \(x_3^* = M_1^{-1} m_1\).

We can argue that, from the perspective of \(x_3\), the cost function (2.2),

\(\text{cost1}(x_1, x_2, x_3) = c_1(x_1, x_2) + c_2(x_2, x_3)\; \; \; (2.2)\)

is equivalent to this cost function,

\(\text{cost1-marginal}(x_3) := ||x_3 - x_3^*||^2_{M_1} = (x_3 - x_3^*)^T M_1 (x_3 - x_3^*) \; \; \; (2.4) \)

Proof:

For cost function (2.2), the optimal \(x_3\) is given by solving \(M_1 x_3 = m_1 \).

For (2.4), by construction, \(x_3^* = M_1^{-1} m_1\). Plugging this into (2.4),

\(\text{cost1-marginal}(x_3) = ||x_3 - M_1^{-1} m_1 ||^2_{M_1} \)

\(\text{cost1-marginal}(x_3) = x_3^T M_1 x_3 - 2 x_3^T M_1 M_1^{-1} m_1 + || M_1^{-1} m_1 ||^2_{M_1} \)

The optimality condition for cost (2.4) is,

\(\nabla \text{cost1-marginal}(x_3) = 0 \)

Do the math,

\(M_1 x_3 = m_1\)

which is the same as the optimality condition for cost (2.2).

We showed that the optimality condition for \(x_3\) is the same for (2.2) and (2.4).

Let me define a notation \(\Leftrightarrow\) for this kind of cost equivalence,

\(\text{cost1}(x_1, x_2, x_3) = c_1(x_1, x_2) + c_2(x_2, x_3) \\ \Leftrightarrow_{x_3} \\ \text{cost1-marginal}(x_3) = ||x_3 - x_3^*||^2_{M_1} \)

It means \(\text{cost1}(x_1, x_2, x_3)\) is equivalent to \(\text{cost1-marginal}(x_3)\) from the perspective of \(x_3\).
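As a sanity check, here is a small sketch (with random symmetric positive-definite blocks, purely illustrative) that builds a system of the form (2.3), marginalizes \((x_1, x_2)\) onto \(x_3\) with the Schur complement, and verifies that the resulting \(x_3^*\) matches the \(x_3\) component of the full solution.

import numpy as np

rng = np.random.default_rng(1)
d = 2  # size of each variable block

def random_spd(n):
    # Random symmetric positive-definite block, standing in for J^T J.
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

# Assemble a 3-block system like (2.3) from two overlapping cost terms.
H = np.zeros((3 * d, 3 * d))
b = rng.normal(size=3 * d)
for i in range(2):  # cost terms c_1(x_1, x_2) and c_2(x_2, x_3)
    H[i * d:(i + 2) * d, i * d:(i + 2) * d] += random_spd(2 * d)

# Full solve of (2.3).
x_full = np.linalg.solve(H, b)

# Marginalize (x_1, x_2): Schur complement onto the x_3 block.
keep, drop = slice(2 * d, 3 * d), slice(0, 2 * d)
M1 = H[keep, keep] - H[keep, drop] @ np.linalg.solve(H[drop, drop], H[drop, keep])
m1 = b[keep] - H[keep, drop] @ np.linalg.solve(H[drop, drop], b[drop])
x3_star = np.linalg.solve(M1, m1)

assert np.allclose(x3_star, x_full[keep])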

The Updated Cost

Now the cost gets updated. A realistic example: we received new sensor data.

We get a new cost term \(c_3(x_3, x_4)\) with a new variable \(x_4\). The new cost \(\text{cost2}(x_1, x_2, x_3, x_4)\) is the sum of the previous cost \(\text{cost1}(x_1, x_2, x_3)\) and the new cost term \(c_3(x_3, x_4)\).

\(\text{cost2}(x_1, x_2, x_3, x_4) := \text{cost1}(x_1, x_2, x_3) + c_3(x_3, x_4) = c_1(x_1, x_2) + c_2(x_2, x_3) + c_3(x_3, x_4) \)

Since,

\(\text{cost1}(x_1, x_2, x_3) = c_1(x_1, x_2) + c_2(x_2, x_3) \\ \Leftrightarrow_{x_3} \\ \text{cost1-marginal}(x_3) = ||x_3 - x_3^*||^2_{M_1} \)

from the perspective of the optimality condition for \((x_3, x_4)\), cost2 is equivalent to,

\(\text{cost2}(x_1, x_2, x_3, x_4) = \text{cost1}(x_1, x_2, x_3) + c_3(x_3, x_4) \\ \Leftrightarrow_{x_3, x_4} \\ \text{cost2-new}(x_3, x_4) = ||x_3 - x_3^*||^2_{M_1} + c_3(x_3, x_4) \)

In other words, the solution of \(\text{cost2-new}(x_3, x_4)\) is equal to the solution of \(\text{cost2}(x_1, x_2, x_3, x_4)\) in terms of \((x_3, x_4)\).

So we can optimize \(\text{cost2-new}(x_3, x_4)\) instead if we are only interested in \((x_3, x_4)\),

\(\text{cost2-new}(x_3, x_4) = ||x_3 - x_3^*||^2_{M_1} + c_3(x_3, x_4) \)


By (2.1), the optimality condition of \(\text{cost2-new}(x_3, x_4)\) is,

\(\begin{bmatrix} M_1 + A_3 & B_3\\ C_3 & D_3 \end{bmatrix} \begin{bmatrix} x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} m_1 + e_3 \\ f_3 \end{bmatrix} \)

We can marginalize \(x_3\) and get a new system of equations (again skipping the algebra),

\(M_2 x_4 = m_2\)

It is equivalent to the cost function,

\(\text{cost2-marginal}(x_4) := ||x_4 – x_4^*||^2_{M_2}\)

where \(x_4^* = M_2^{-1} m_2\).

We can prove that, from the perspective of \(x_4\),

\(\text{cost2-marginal}(x_4) \Leftrightarrow_{x_4} \text{cost2-new}(x_3, x_4) \Leftrightarrow_{x_4} \text{cost2}(x_1, x_2, x_3, x_4) \)

Fixed-Lag Smoothing

By now, I guess you see the pattern.

For a quadratic cost function,

\(\text{cost}_n(x_0, x_1, \ldots, x_n) = \sum_i c_i(x_i, x_{i+1}) \)

where each \(c_i(x_i, x_{i+1})\) is a quadratic cost term.

To optimize it, we can start with \(\text{cost1}(x_1, x_2) = c_1(x_1, x_2)\). We eliminate \(x_1\), compute \(\hat x_2^*\), and create the marginal cost \(\text{cost1-marginal}(x_2)\).

Then for,

\(\text{cost2}(x_1, x_2, x_3) = c_1(x_1, x_2) + c_2(x_2, x_3)\),

if we are only interested in \((x_2, x_3)\), we can solve

\(\text{cost2-updated}(x_2, x_3) = \text{cost1-marginal}(x_2) + c_2(x_2, x_3)\).

And we eliminate \(x_2\) to get \(\text{cost2-marginal}(x_3)\).

In general, by induction, at iteration \(k\) we have the marginal cost,

\(\text{cost}_{k-1}\text{-marginal}(x_{k})\)

To solve the cost at iteration \(k\),

\(\text{cost}_k(x_0, x_1, \ldots, x_{k+1}) = \sum_{i=0}^{k} c_i(x_i, x_{i+1}) \),

we can solve

\(\text{cost}_k\text{-updated}(x_{k}, x_{k+1}) = \text{cost}_{k-1}\text{-marginal}(x_{k}) + c_k(x_k, x_{k+1}) \).

And we eliminate \(x_{k}\) to get \(\text{cost}_k\text{-marginal}(x_{k+1})\).

We stop when we reach the last variable, i.e. when we have the marginal cost \(\text{cost}_n\text{-marginal}(x_{n+1})\), from which we can solve for \(x_{n+1}^*\).

To recover the full optimal solution \((x_1^*, x_2^*, \ldots, x_{n}^*)\), we do back-substitution. For example, given \(x_{k+1}^*\), we use \(\text{cost}_k\text{-updated}(x_{k}, x_{k+1})\) to compute \(x_{k}^*\).
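Here is a minimal sketch of this recursion for a chain of quadratic cost terms. It assumes each term contributes a symmetric positive-definite \(2d \times 2d\) block (a stand-in for \(J^T J\)) and a right-hand-side vector; the prior on \(x_0\) and the random terms are purely illustrative. The smoother keeps only the current marginal \((M_k, m_k)\), folds in each new term, eliminates the old variable, and finally back-substitutes to recover all states.

import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 6  # block size, number of cost terms

def random_spd(k):
    M = rng.normal(size=(k, k))
    return M @ M.T + k * np.eye(k)

# Each cost term c_i(x_i, x_{i+1}) contributes a (2d x 2d) block and a rhs.
terms = [(random_spd(2 * d), rng.normal(size=2 * d)) for _ in range(n)]

# A small prior on x_0 so the first elimination is well posed.
M, m = np.eye(d), np.zeros(d)
back_sub = []  # what we need to recover eliminated variables later

for H_i, b_i in terms:
    # Fold the marginal of x_k into the joint system over (x_k, x_{k+1}).
    H, b = H_i.copy(), b_i.copy()
    H[:d, :d] += M
    b[:d] += m
    # Eliminate x_k: Schur complement onto x_{k+1}.
    A, B, C, D = H[:d, :d], H[:d, d:], H[d:, :d], H[d:, d:]
    A_inv = np.linalg.inv(A)
    M = D - C @ A_inv @ B
    m = b[d:] - C @ A_inv @ b[:d]
    back_sub.append((A, B, b[:d]))  # first block row: A x_k + B x_{k+1} = b_k

# Solve for the last state, then back-substitute through the chain.
x = np.linalg.solve(M, m)
states = [x]
for A, B, b_k in reversed(back_sub):
    x = np.linalg.solve(A, b_k - B @ x)
    states.append(x)
states.reverse()  # states[0] is x_0, states[-1] is x_n

# Cross-check against a batch solve of the full chain system.
N = (n + 1) * d
H_full, b_full = np.zeros((N, N)), np.zeros(N)
H_full[:d, :d] += np.eye(d)
for i, (H_i, b_i) in enumerate(terms):
    H_full[i * d:(i + 2) * d, i * d:(i + 2) * d] += H_i
    b_full[i * d:(i + 2) * d] += b_i
assert np.allclose(np.concatenate(states), np.linalg.solve(H_full, b_full))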

If we are only interested in the optimal value of the most recent variable, we don't even need back-substitution: \(\hat x_{k+1}^*\), obtained from the marginal cost, is already the optimal solution for the most recent variable of the "current cost" at iteration \(k\), \(\text{cost}_k(x_0, x_1, \ldots, x_{k+1}) = \sum_{i=0}^{k} c_i(x_i, x_{i+1})\).


This is Fixed-Lag Smoothing. To solve a least squares problem,

\(\text{cost}_n(x_0, x_1, \ldots, x_{n+1}) = \sum_{i=0}^{n} c_i(x_i, x_{i+1}) \)

we keep track of a marginal cost,

\(\text{cost}_{k-1}\text{-marginal}(x_k) \)

To solve \(\text{cost}_k(x_0, x_1, \ldots, x_{k+1})\), we solve this smaller problem instead,

\(\text{cost}_k\text{-updated}(x_k, x_{k+1}) = \text{cost}_{k-1}\text{-marginal}(x_k) + c_k(x_k, x_{k+1})\)

Fixed-lag smoothing is an efficient way to solve a linear system of equations online.

Consider cost terms that are added online. Fixed-lag smoothing keeps track of an "almost solved" system of equations. When the cost is updated, we only need to marginalize one more variable, compute the new variable, and do the back-substitution.

Discussion

Quadratic function?

If the cost function is quadratic, variable elimination (the Schur complement) "preserves" the information of the quadratic function.

But cost functions usually aren't quadratic. In Newton's method, we fit a quadratic at the current point, solve the optimality condition, and update the linearization point.

There is a problem. After variable elimination, the old variables are gone. We can't update the old variables and re-fit a quadratic at their updated values.

A simple mitigation is to increase the lag window.

Fixed-Lag?

In the example, the lag window size was 1 because we only kept 1 state from the past. We can increase the lag window size. The benefit is that we can re-fit the quadratic within the window and mitigate the vanishing-linearization-point problem discussed above.

For example, when the lag is 3, the fixed-lag smoothing cost is,

\(\text{cost}_n\text{-updated}(x_{n-3}, x_{n-2}, x_{n-1}, x_{n}) = \text{cost}_{n-3}\text{-marginal}(x_{n-3}) + \\ c_{n-2}(x_{n-3}, x_{n-2}) + c_{n-1}(x_{n-2}, x_{n-1}) + c_n(x_{n-1}, x_{n})\)

For nonlinear least squares, a bigger lag makes the fixed-lag cost a better approximation of the full cost, but it needs more computation.

Running time?

When the linear system is block-tridiagonal (a chain), we are solving a small system repeatedly. The time to solve one small system is \(O(d^3)\), where \(d\) is the size of \(x_i\). The running time to solve the whole linear system is \(O(d^3 n)\), where \(n\) is the number of states. Since \(d\) is constant, the running time is \(O(n)\).

Prior Cost? Initialization?

For fixed-lag smoothing, we don't need a prior at initialization as long as the cost is non-singular, i.e. \(J^T W J\) is invertible. Please read this article for details.

But we do need a good initial guess for all the variables.

General Case?

In general, the cost function does not have this chain structure. For a general system, fixed-lag smoothing may not be a good method. E.g. we may receive a cost term involving old variables, but the old variables have already been eliminated. Fixed-lag smoothing can't handle this case elegantly.

Another problem for general systems is "fill-in". Consider the variable elimination,

\(\begin{bmatrix} 0 & d - b a^{-1} c \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} f - e a^{-1} c \end{bmatrix}\)

In general, for \(d - b a^{-1} c\), if \(d\) is zero, this operation makes a zero element non-zero. This is called "fill-in".

Fill-in is a big consideration for a sparse linear system solver. Back-substitution performs drastically differently depending on the fill-in generated during the elimination process.

As a result, people try to find an elimination order that keeps the amount of fill-in small. Although finding the optimal ordering is NP-hard, in practice good heuristics solve the problem very well.

Please read this pdf for details about “fill-in”.
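To see why ordering matters, here is a tiny sketch with an arrowhead matrix (a made-up example, not from this article's problem): if the dense "hub" variable is eliminated first, the whole trailing block fills in; if it is eliminated last, no fill-in is created.

import numpy as np
import scipy.linalg as la

n = 6
# Arrowhead pattern: dense first row/column, diagonal elsewhere.
A = np.eye(n) * 4.0
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = 4.0

def fill_in(M):
    # Count entries that are zero in M but non-zero in its LU factors.
    _, L, U = la.lu(M)
    factors_nonzero = (np.abs(L) > 1e-12) | (np.abs(U) > 1e-12)
    return int(np.sum(factors_nonzero & (np.abs(M) < 1e-12)))

print(fill_in(A))               # hub eliminated first: lots of fill-in
perm = list(range(1, n)) + [0]  # reorder so the hub is eliminated last
print(fill_in(A[np.ix_(perm, perm)]))  # no fill-in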

Implementation

Let's implement fixed-lag smoothing and compare it with batch least squares.

The Problem

Consider a simple linear problem.

The state is

\(s_i = (x_i, y_i)\)

We have two types of costs.

The prior cost,

\(p(s_i) = ||s_i - m_i||^2\)

It directly constrains the state \(s_i\).

The odometry cost,

\(o_i(s_{i-1}, s_{i}) = ||s_i - (s_{i-1} + d_i) ||^2\)

It constrains two connected states \(s_i, s_{i-1}\) by an odometry measurement \(d_i\).

class State:
    def __init__(self, variables):
        self.variables = variables

    def unpack_state(self):
        x, y = self.variables
        return x, y


...


class DistanceMeasurement:
    def __init__(self, state1, state2, distance):
        ...

    def residual(self):
        x1, y1 = self.state1.unpack_state()
        x2, y2 = self.state2.unpack_state()
        dx, dy = self.distance
        return np.array([
            [x2 - x1 - dx],
            [y2 - y1 - dy],
        ])

    def jacobi_wrt_state1(self):
        return -np.identity(2)

    def jacobi_wrt_state2(self):
        return np.identity(2)

    ...


class PriorMeasurement:
    def __init__(self, state, prior):
        self.state = state
        self.prior = prior.copy()

    def residual(self):
        x, y = self.state.unpack_state()
        px, py = self.prior
        return np.array([
            [x - px],
            [y - py],
        ])

    def jacobi(self):
        return np.identity(2)

    ...

This is the model for the problem. Note that the Jacobians are easy to compute.

The Batch Least Squares

The batch cost is defined as,

\(\text{cost}(s_0, s_1, s_2, … s_{20}) := p(s_0) + \sum^{20}_{i=1} o_i(s_{i-1}, s_{i})\)

class BatchOptimization:
    def __init__(self, states, distances):
        assert(len(states) - 1 == len(distances))
        self.states = states
        self.distances = distances
        self.num_states = len(states)

    @profiler.time_it
    def optimize(self):
        jacobi, r = self.construct_linear_system()
        # one step for linear system
        x = np.linalg.solve(jacobi.T @ jacobi, - jacobi.T @ r)

        return x

    def construct_linear_system(self):
        size_prior_residual = 2
        size_distance_residual = 2 * (self.num_states - 1)
        size_residual = size_distance_residual + size_prior_residual

        size_variables = 2 * self.num_states

        jacobi = np.zeros([size_residual, size_variables])
        residual = np.zeros([size_residual, 1])

        residual_index = 0
        # Add prior to first state. Not a good design.
        pm = t.PriorMeasurement(self.states[0], self.states[0].variables)
        jacobi[residual_index:residual_index +
               pm.residual_size(), 0:0 + pm.variable_size()] = pm.jacobi()
        residual[residual_index:residual_index +
                 pm.residual_size()] = pm.residual()
        residual_index += pm.residual_size()

        for distance_between in self.distances:

            state1_idx = distance_between.state1_index
            state1_var_idx = 2 * state1_idx
            state1 = self.states[state1_idx]

            state2_idx = distance_between.state2_index
            state2_var_idx = 2 * state2_idx
            state2 = self.states[state2_idx]

            distance = distance_between.distance

            dm = t.DistanceMeasurement(state1, state2, distance)

            jacobi[residual_index:residual_index + dm.residual_size(),
                   state1_var_idx:state1_var_idx + dm.variable_size()] = dm.jacobi_wrt_state1()
            jacobi[residual_index:residual_index + dm.residual_size(),
                   state2_var_idx:state2_var_idx + dm.variable_size()] = dm.jacobi_wrt_state2()
            residual[residual_index:residual_index +
                     dm.residual_size()] = dm.residual()
            residual_index += dm.residual_size()

        assert(residual_index == self.num_states * 2)
        return jacobi, residual

I constructed the Normal Equation and solved it in one step.

The Fixed-Lag Smoothing

I applied fixed-lag smoothing to solve the same cost function.

class FixLagSmoother:
    def __init__(self, prior):
        self.state = prior.variables.copy()
        # I didn't specify weights, so the weights are identity
        self.state_hessian = np.identity(2)
        # Should be J * W * prior, but J = W = I(2)
        self.state_b = prior.variables.copy().reshape([2, 1])

        self.all_states = [self.state]

    def get_all_states(self):
        return np.vstack(self.all_states)

    def construct_linear_system(self, distance_measurement):
        # x_i, x_i+1
        size_variables = 4
        # Hessian: A.T * W * A
        lhs = np.zeros([size_variables, size_variables])
        # A.T * W * r
        rhs = np.zeros([size_variables, 1])

        lhs[0:2, 0:2] += self.state_hessian
        rhs[0:2] += self.state_b

        assert(distance_measurement.state1_index +
               1 == distance_measurement.state2_index)
        dm = t.DistanceMeasurement(t.State(self.state.copy()), t.State(
            self.state.copy()), distance_measurement.distance)
        jacobi_wrt_s1 = dm.jacobi_wrt_state1()
        jacobi_wrt_s2 = dm.jacobi_wrt_state2()
        jacobi_dm = np.hstack([jacobi_wrt_s1, jacobi_wrt_s2])
        lhs[0:4, 0:4] += jacobi_dm.T @ jacobi_dm
        rhs[0:4] += jacobi_dm.T @ dm.residual()

        return lhs, rhs

    @profiler.time_it
    def optimize_for_new_measurement(self, distance_measurement):
        if False:
            return self.optimize_for_new_measurement_gaussian_impl(distance_measurement)
        else:
            return self.optimize_for_new_measurement_schur_impl(distance_measurement)

    @profiler.time_it
    def optimize_for_new_measurement_gaussian_impl(self, distance_measurement):
        lhs, rhs = self.construct_linear_system(distance_measurement)
        # one step for linear system
        lhs_LU = linalg.lu_factor(lhs)
        x = linalg.lu_solve(lhs_LU, -rhs)

        self.state = x[2:4].reshape([2])
        self.all_states.append(self.state)

        self.marginalization_guassian_impl(lhs_LU, rhs)

        return x

    def marginalization_guassian_impl(self, lhs_LU, rhs):
        """
            factor-graph-for-robot-perception, page 68
        """
        hessian_LU = lhs_LU
        # By definition, covariance = inv(hessian)
        # A * A-1 = I, reuse LU
        covariance = linalg.lu_solve(hessian_LU, np.identity(4))
        mean_covariance_form = covariance @ rhs

        self.state_hessian = np.linalg.inv(covariance[2:4, 2:4])
        # This is tricky
        # get it by equating ||Ax + b||^2 = ||x + mu ||^2_W
        self.state_b = self.state_hessian @ mean_covariance_form[2:4]

        x_test = np.linalg.solve(self.state_hessian, -self.state_b)
        np.testing.assert_array_almost_equal(x_test.reshape([2]), self.state)

    @profiler.time_it
    def optimize_for_new_measurement_schur_impl(self, distance_measurement):
        """
            Force schur ordering.
        """
        lhs, rhs = self.construct_linear_system(distance_measurement)
        # one step for linear system
        # lhs = [A, B; C, D]
        A = lhs[:-2, :-2]
        B = lhs[:-2, -2:]
        C = lhs[-2:, :-2]
        D = lhs[-2:, -2:]
        # rhs = [b1, b2]
        b1 = rhs[:-2]
        b2 = rhs[-2:]
        A_inv = np.linalg.inv(A)

        self.state_hessian = D - C @ A_inv @ B
        self.state_b = b2 - C @ A_inv @ b1

        x = np.linalg.solve(self.state_hessian, -self.state_b)

        self.state = x.reshape([2])
        self.all_states.append(self.state)

        return x

I tried two ways to do the marginalization.

  1. Schur complement.
  2. Gaussian distribution.

The Gaussian-distribution implementation looks simple, but it needs to invert the Hessian \(J^T J\). The Schur complement implementation is more efficient.

Note: I didn’t do the back-substitution because, in this problem, odometry cost doesn’t change past states. (The real reason was I wanted to play Civilization 6).

Output

yimu@yimu-mate:~/Desktop/blog$ /usr/bin/python3 yimu-blog/least_squares/fix_lag_smoothing/main.py
batch :
 [[-9.76996262e-15 -6.66133815e-15]
 [ 6.40942707e+00  5.40942707e+00]
 [ 8.58468945e+00  6.58468945e+00]
 [ 1.53039805e+01  1.23039805e+01]
 [ 2.01210836e+01  1.61210836e+01]
 [ 2.27561377e+01  1.77561377e+01]
 [ 2.63916016e+01  2.03916016e+01]
 [ 2.92273220e+01  2.22273220e+01]
 [ 3.51904914e+01  2.71904914e+01]
 [ 4.10854687e+01  3.20854687e+01]
 [ 4.52505194e+01  3.52505194e+01]
 [ 4.94839571e+01  3.84839571e+01]
 [ 5.51286650e+01  4.31286650e+01]
 [ 5.80325840e+01  4.50325840e+01]
 [ 6.40624253e+01  5.00624253e+01]
 [ 6.81428818e+01  5.31428818e+01]
 [ 7.07973654e+01  5.47973654e+01]
 [ 7.66264477e+01  5.96264477e+01]
 [ 8.26776289e+01  6.46776289e+01]
 [ 8.93678689e+01  7.03678689e+01]]
fixed lag :
 [[ 0.          0.        ]
 [ 6.40942707  5.40942707]
 [ 8.58468945  6.58468945]
 [15.30398052 12.30398052]
 [20.12108358 16.12108358]
 [22.7561377  17.7561377 ]
 [26.39160158 20.39160158]
 [29.22732198 22.22732198]
 [35.1904914  27.1904914 ]
 [41.08546867 32.08546867]
 [45.25051939 35.25051939]
 [49.48395709 38.48395709]
 [55.128665   43.128665  ]
 [58.032584   45.032584  ]
 [64.06242526 50.06242526]
 [68.14288184 53.14288184]
 [70.79736536 54.79736536]
 [76.62644773 59.62644773]
 [82.67762895 64.67762895]
 [89.3678689  70.3678689 ]]
diff :
 [[9.76996262e-15 6.66133815e-15]
 [1.95399252e-14 1.33226763e-14]
 [2.84217094e-14 1.86517468e-14]
 [3.37507799e-14 2.13162821e-14]
 [5.68434189e-14 3.90798505e-14]
 [7.10542736e-14 4.97379915e-14]
 [9.59232693e-14 6.75015599e-14]
 [9.94759830e-14 7.10542736e-14]
 [9.94759830e-14 7.10542736e-14]
 [9.23705556e-14 6.39488462e-14]
 [1.20792265e-13 8.52651283e-14]
 [1.91846539e-13 1.49213975e-13]
 [1.98951966e-13 1.63424829e-13]
 [2.20268248e-13 1.63424829e-13]
 [2.13162821e-13 1.70530257e-13]
 [2.13162821e-13 1.70530257e-13]
 [1.98951966e-13 1.70530257e-13]
 [2.13162821e-13 1.70530257e-13]
 [1.84741111e-13 1.42108547e-13]
 [0.00000000e+00 0.00000000e+00]]

I printed the result of the batch least squares, the fixed-lag smoothing, and their difference. The estimated states are the same (up to floating-point error) for batch least squares and fixed-lag smoothing.

Paper Unlocked

IROS 2018 best paper candidate.

Information Sparsification in Visual-Inertial Odometry
Jerry Hsiung, Ming Hsiao, Eric Westman, Rafael Valencia, Michael Kaess

This paper tries to solve the problem of too much "fill-in" after marginalization.

Fixed-lag smoothing is a standard technique for Visual Odometry. It gives a good trade-off between accuracy and running time.

(Figure from the paper: (a) the Visual Odometry cost represented as a graph; (b), (c) different choices of which variables to eliminate.)

For simplicity, the authors considered a simple case of the problem.

Given a cost function,

(Figure from the paper. Left: a dot is a cost term, a circle is a variable, and a link from a circle to a dot means the variable appears in that cost term. Right: the non-zero pattern of the Hessian for this cost structure, with variable order \((l_1, l_2, l_3, \xi_1, v_1, b_1, \xi_2, v_2, b_2)\).)

The cost function for the above factor graph is,

\(\text{cost}(l_1, l_2, l_3, \xi_1, v_1, b_1, \xi_2, v_2, b_2) = c_1(\xi_1, v_1, b_1, \xi_2, v_2, b_2) + c_2(\xi_1) + c_3(v_1) + c_4(b_1) \\ + c_5(\xi_1, l_1) + c_6(\xi_1, l_2) + c_7(\xi_1, l_3) \)

Let \(x = (l_1, l_2, l_3, \xi_1, v_1, b_1, \xi_2, v_2, b_2)\). We can rewrite the cost as,

\(\text{cost}(x) = x^T \Lambda_{MB} x + \text{lower-order-terms}\)

The Hessian of the cost, \(\Lambda_{MB}\), is sparse. This is due to the cost function structure. For example, there is no cost term coupling \(l_1\) and \(l_2\), so \(\frac{\partial^2 \text{cost}}{\partial l_1 \partial l_2}\) is zero.

Now, \( \xi_1, v_1, b_1 \) are eliminated. Because of the cost structure, eliminating \( \xi_1, v_1, b_1 \) creates fill-in for the Hessian (LHS of the optimality linear system).

In the graph, because \(\xi_1\) “links” \( (\xi_2, v_2, b_2) \) and \((l_1, l_2, l_3)\) together, marginalizing \(\xi_1\) makes \( (\xi_2, v_2, b_2) \) and \((l_1, l_2, l_3)\) coupled.

The marginal cost is,

\(\text{cost-marginal}(l_1, l_2, l_3, \xi_2, v_2, b_2) = c_8(l_1, l_2, l_3, \xi_2, v_2, b_2) \)

(Figure from the paper: the cost structure with \(\xi_1, v_1, b_1\) eliminated. The marginal Hessian \(\Lambda_t\) is a dense matrix; its variable order is \((l_1, l_2, l_3, \xi_2, v_2, b_2)\).)

Let \(y = (l_1, l_2, l_3, \xi_2, v_2, b_2)\). We can rewrite it as,

\(\text{cost-marginal}(y) = y^T \Lambda_t y + \text{lower-order-terms}\)

The Hessian of cost-marginal, \(\Lambda_t\), is a dense matrix.

A dense matrix is a problem for a linear system solver. In general, the sparser the matrix, the faster the solver can solve \(Ax = b\).

So the authors wanted to make the Hessian sparse. Some people do this by setting small values in the Hessian to zero or by discarding information. It works, but there is room for research and improvement.

OKVIS and VINS-MONO discard measurements to maintain sparsity.

The authors did the sparsification by fitting a sparse cost to the dense marginal cost.

To best approximate the dense Hessian \(\Lambda_t\), the authors designed a sparse cost structure,

\(\text{cost-marginal-sparse}(l_1, l_2, l_3, \xi_2, v_2, b_2) = cc_1(v_2) + cc_2(b_2) + cc_3(\xi_2, l_1, l_2, l_3)\)

By construction of this cost structure, \(\Lambda_s\) is sparse.

Let \(y = (l_1, l_2, l_3, \xi_2, v_2, b_2)\). We can rewrite it as,

\(\text{cost-marginal-sparse}(y) = y^T \Lambda_s y + \text{lower-order-terms}\)

The Hessian of the cost, \(\Lambda_s\), is sparse because of the cost structure. For example, there is no cost term coupling \(v_2\) and \(l_1\), so \(\frac{\partial^2 \text{cost-marginal-sparse}}{\partial l_1 \partial v_2}\) is zero.

The next step is to approximate the \(\text{cost-marginal}(y)\) by \( \text{cost-marginal-sparse}(y)\).

The authors did this by minimizing the KL-divergence between the two quadratic costs (viewed as Gaussian distributions), i.e. maximizing their similarity.

This is a convex optimization problem. It can be solved in closed form.

Alternative

The log-determinant convex problem is hard to explain. There is another way to understand the above approximation.

We can approximate \(\text{cost-marginal}(y)\) by \(\text{cost-marginal-sparse}(y)\) by direct optimization,

\(\text{minimize}_{\Lambda_s} \; \text{cost}(\Lambda_s) = \int_y || \text{cost-marginal}(y) - \text{cost-marginal-sparse}(y, \Lambda_s) ||^2 \, dy\)

Basically, we want to make the shape of \( \text{cost-marginal-sparse}(y, \Lambda_s ) \) similar to \( \text{cost-marginal}(y)\) by optimizing \( \Lambda_s \).

This integral can be replaced by sampling around the mean of \(\text{cost-marginal}(y)\). For example, we can use sigma-point sampling.

It becomes,

\(\text{minimize}_{\Lambda_s} \; \text{cost}(\Lambda_s) = \sum_{y \in \text{sigma-points}} || \text{cost-marginal}(y) - \text{cost-marginal-sparse}(y, \Lambda_s) ||^2\)

Note that \(\Lambda_s\) is just a set of parameters (symmetric, with a fixed zero pattern) for a quadratic function. The optimization problem tries to fit a quadratic to sampled data. It is a linear regression, so we should be able to solve it in one step.

If cost-marginal is bounded below, then the fitted \(\Lambda_s\) should also be bounded below, which would mean \(\Lambda_s\) is PSD. But it's just a guess; someone please prove it.
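Here is a rough sketch of this fitting idea (my own illustration, not the paper's algorithm). It treats the free entries of a sparse symmetric \(\Lambda_s\) as regression parameters, samples points \(y\) around the mean of the dense marginal (taken to be zero here for simplicity), and fits \(y^T \Lambda_s y\) to \(y^T \Lambda_t y\) by linear least squares. The sparsity pattern is made up for the example.

import numpy as np

rng = np.random.default_rng(3)
n = 4

# A dense "marginal" Hessian Lambda_t (SPD), purely illustrative.
T = rng.normal(size=(n, n))
Lambda_t = T @ T.T + n * np.eye(n)

# Allowed sparsity pattern for Lambda_s: the diagonal plus one coupling (0, 1).
pattern = [(i, i) for i in range(n)] + [(0, 1)]

# Sample points y around the mean.
samples = rng.normal(size=(200, n))

# Linear regression: the feature for entry (i, j) is y_i * y_j
# (off-diagonal entries appear twice in y^T Lambda y, hence the factor 2).
features = np.stack(
    [samples[:, i] * samples[:, j] * (1.0 if i == j else 2.0) for i, j in pattern],
    axis=1)
targets = np.einsum('ki,ij,kj->k', samples, Lambda_t, samples)
coeffs, *_ = np.linalg.lstsq(features, targets, rcond=None)

# Assemble the fitted sparse Lambda_s.
Lambda_s = np.zeros((n, n))
for (i, j), v in zip(pattern, coeffs):
    Lambda_s[i, j] = Lambda_s[j, i] = v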

Further reading

iSAM2

The example of solving a chain-structured (block-tridiagonal) system is a special case of variable elimination. We eliminated variables in index order. During back-substitution, the dependency is a chain: we solve \(x_n\) first, then recover \(x_{n-1}\), then \(x_{n-2}\), and so on until \(x_1\).

In the general case of solving \(J^T J x = b\), where the matrix \(J^T J\) is not chain-structured, we can eliminate variables in an arbitrary order. The dependency is a tree, and back-substitution traverses this variable-dependency tree.

Please read the sparse linear solver article for an introduction to the elimination tree, and read An Approximate Minimum Degree Ordering Algorithm for details.

1. A PSD matrix can be modeled as a graph.
2. Variable elimination is like finding a spanning tree for the graph (however, the graph changes whenever we eliminate a variable, so technically variable elimination is not finding a spanning tree of a fixed graph).
3. Back-substitution means traversing the spanning tree.

(Figure from An Approximate Minimum Degree Ordering Algorithm: it illustrates the graph, and how the graph is updated, during variable elimination.)

(Figure: the elimination tree. A tree node can hold more than one variable because the implementation is cache-friendly.)

What happens if a node of the tree changes, i.e. a value in the original matrix changes? We only need to update the sub-tree that contains the changed value and redo the back-substitution. If the changed value is in a leaf node, we only need to update that node.

Now you can try to read this paper. (Make sure you have read the sparse linear solver article first, and feel free to ignore everything related to Probabilistic Graphical Models.)

iSAM2: Incremental Smoothing and Mapping Using the Bayes Tree
Michael Kaess, Hordur Johannsson, Richard Roberts, Viorela Ila, John Leonard, and Frank Dellaert

Semidefinite Programming

Semidefinite Programming is a class of optimization problems.

The linear matrix inequality (LMI) constraints can be confusing at this point. Interested readers, please read this tutorial.

Semidefinite Programming
Lieven Vandenberghe and Stephen Boyd

For our purpose, we don’t need to fully understand the linear matrix inequality constraints.

The authors want to show that problem (3) is a Semidefinite Programming problem.

First, they introduce an auxiliary (slack) variable so that (4) is equivalent to (3).

(6) is the interesting part. The authors applied the Schur complement in reverse to convert \(\frac{(c^T x)^2}{d^T x} \leq t\) in (4) into (6); i.e., \(\frac{(c^T x)^2}{d^T x} \leq t\) is the result of applying the Schur complement to (6).
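As a sketch of the standard construction (this is the well-known Schur complement trick for SDPs; the exact form of (6) in the paper may differ slightly): for \(d^T x > 0\), the constraint \(\frac{(c^T x)^2}{d^T x} \leq t\) is equivalent to the linear matrix inequality

\(\begin{bmatrix} t & c^T x \\ c^T x & d^T x \end{bmatrix} \succeq 0\)

because taking the Schur complement of the \(d^T x\) entry recovers \(t - \frac{(c^T x)^2}{d^T x} \geq 0\), and the matrix above is affine in \((x, t)\), which is exactly the LMI form that Semidefinite Programming requires.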

Putting (6) and \(Ax + b \geq 0\) together, the authors get (5), which is equivalent to (4) and is in the standard Semidefinite Programming form (1).

What’s next?

As we saw in the example, for a chain-structured (block-tridiagonal) linear system of equations, eliminating variables in index order solves the system in \(O(n)\) time, where \(n\) is the number of variables.

In the next article, I will formally prove that the time complexity of solving such a system is \(O(n)\), and I will formulate the linearized KKT system of equations of the constrained rocket landing problem as a block-tridiagonal system and solve it in \(O(n)\) time.

Least Squares Robotics

This article is part of the Least Squares Robotics series.
