Least Squares Refresher

Why Least Squares?

Least squares is a special optimization problem. Because of the structure of the problem, we get the Hessian (second-order derivatives) essentially for free. Thus, we can solve the optimization problem with Newton's method, which converges fundamentally faster than gradient descent.

The combination of a cheap Hessian and Newton's method makes least squares one of the most practical tools for real-time optimization.

Some applications:

* Planning and control
* Visual tracking
* Kalman filtering
* Calibration
* Point cloud matching
* Making maps locally consistent (SLAM)
* Inverse kinematics
* State estimation with an IMU
* …

What is in the article?

A derivation of the least squares problem with just enough detail.

  1. I will first give some basic math definitions and conventions.
  2. Then we will derive the algorithm for solving least squares problems, starting from the linear case, then the nonlinear case, and finishing with the weighted nonlinear case.
  3. In the end, I will work through a warm-up problem as a least squares example.

This article provides supporting materials for later posts.

0. Definitions & Conventions

Matrix:
Matrix means a 2D matrix. \(M \in R^{(n,m)}\) means a matrix with n rows and m columns.

Vector:
Vector always means a column vector. e.g. \(v \in R^{(n,1)}\) is a size n vector.

Function:
A function defines a mapping from input to output.
e.g. \(f(x) = Mx \;, \; M \in R^{(m,n)} \;, x \in R^{(n,1)}\) is a function that maps \(f(x) : R^{(n,1)} \rightarrow R^{(m,1)}\).

L2 norm:
The squared L2 norm of a vector is its inner product with itself. It is always nonnegative.
e.g. for \(x = [x_1, x_2]^T\), the squared L2 norm of x is \(x^T x = ||x||^2 = x_1^2 + x_2^2\)

Gradient
The gradient is basically the derivative for multi-variable functions.
e.g. For a function \(f(x): R^{(2,1)} \to R\), the gradient of the function w.r.t. x is defined as,

\(\nabla f=\begin{bmatrix}\frac{\partial f(x)}{\partial x_1} \\ \\ \frac{\partial f(x)}{\partial x_2}\end{bmatrix}\)
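
For example, for the (made-up) function \(f(x) = x_1^2 + 3 x_1 x_2\), the gradient is

\(\nabla f=\begin{bmatrix}2 x_1 + 3 x_2 \\ \\ 3 x_1\end{bmatrix}\)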

Quadratic form:
A quadratic form is the multi-variable version of a quadratic function.
Compare with the quadratic function \(f(x) = ax^2 + bx + c\) (with \(a > 0\)); its argmin is \(x^* = \text{argmin} \; f(x) = -0.5 \, b / a\).
For a quadratic form \(g(x) = x^T A x + b^T x + c\), where \(x \in R^{(n,1)}\), \(A \in R^{(n,n)}\) is symmetric positive definite, and \(b \in R^{(n,1)}\), the argmin is \(x^* = \text{argmin} \; g(x) = -0.5 \, A^{-1} b\).

The Optimality Condition of a Quadratic Form
The optimality condition for a quadratic form is that the gradient equals zero,

\(\nabla (x^T A x + b^T x + c) = 0\)

which is,

\(2A x^* = -b \Rightarrow x^* = -0.5 A^{-1}b\)
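
As a quick numerical sanity check, here is a minimal numpy sketch (the symmetric positive definite \(A\) and the \(b\) are made up for illustration) that solves the optimality condition and verifies that nearby points do not have a lower cost:

import numpy as np

# a small symmetric positive definite A and an arbitrary b, made up for illustration
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def g(x):
    # the quadratic form g(x) = x^T A x + b^T x (constant c omitted)
    return x @ A @ x + b @ x

# optimality condition: 2 A x* = -b  =>  x* = -0.5 * A^{-1} b
x_star = np.linalg.solve(2.0 * A, -b)

# any small perturbation should have a cost that is no lower
print(g(x_star), g(x_star + np.array([0.01, 0.0])), g(x_star + np.array([0.0, 0.01])))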

Linearization
Given a function \(f(x): R^{(n,1)} \to R^{(m,1)}\), its linearization at \(x_0\) is the first two terms of the Taylor expansion.

\(f(x) \approx \nabla f(x_0)^T(x – x_0) + f(x_0)\)

Where \(\nabla f(x_0)\) is the gradient of \(f(x)\) at \(x_0\).

Jacobian
The Jacobian is the transpose of the gradient. The benefit is that linearization looks cleaner,

\(f(x) \approx J(x_0) (x - x_0) + f(x_0) = \nabla f(x_0)^T(x - x_0) + f(x_0)\)

It looks trivial, but consider doing the chain rule with all those transposes.
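
To make the convention concrete, here is a minimal numpy sketch (the function f and the point x0 are made up; a finite-difference Jacobian stands in for an analytic one) that checks the linearization above:

import numpy as np

def f(x):
    # a made-up map from R^2 to R^2
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

def numerical_jacobian(func, x0, eps=1e-6):
    # one column of the Jacobian per input coordinate (central differences)
    y0 = func(x0)
    J = np.zeros((len(y0), len(x0)))
    for i in range(len(x0)):
        step = np.zeros_like(x0)
        step[i] = eps
        J[:, i] = (func(x0 + step) - func(x0 - step)) / (2.0 * eps)
    return J

x0 = np.array([1.0, 2.0])
J = numerical_jacobian(f, x0)
x = x0 + np.array([0.05, -0.03])
print(f(x))                      # exact value
print(J @ (x - x0) + f(x0))      # linearization, close to the exact value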

Cost function
A cost function usually maps a vector to a scalar

\(\text{cost}(x): R^{(n,1)} \to R\).

A cost function usually encodes a penalty for some kind of error that we want to minimize. So the cost function should,

  1. Make sense: it should encode what you want to penalize.
  2. Be bounded below: otherwise we could decrease it forever.
  3. Preferably be differentiable: good optimization methods are based on the derivatives of the cost function. For simple problems, though, brute force search (nested for loops) is just fine.

Hessian

The Hessian is the matrix of second derivatives.
e.g. Consider a quadratic form \(g(x) = x^T A x + b^T x + c\) with symmetric \(A\); its Hessian is \(\text{Hessian}(g(x)) = 2A\)

Gradient Descent
Gradient descent is a method to minimize a cost function by stepping in the negative gradient direction,

\(x_{new} = x - k \nabla f(x)\)
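
A minimal sketch of gradient descent on a made-up quadratic cost (the step size k and the stopping rule are chosen arbitrarily for illustration):

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def cost(x):
    # made-up cost: x^T A x + b^T x
    return x @ A @ x + b @ x

def gradient(x):
    # gradient of the cost above (A is symmetric)
    return 2.0 * A @ x + b

x = np.zeros(2)
k = 0.1                       # fixed step size
for _ in range(100):
    g = gradient(x)
    if np.linalg.norm(g) < 1e-8:
        break
    x = x - k * g             # x_new = x - k * grad f(x)
print(x, cost(x))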

Newton’s Method
Newton’s Method is another way to minimize a cost function.
Consider a cost function,

\(cost(x) : R^{(n,1)} \to R^1\)

We can fit a quadratic form to it at the current variable,

\(g(x) = x^T A x - 2b^T x + c\)

where \(2A\) is the Hessian of the cost function, and \(b\) is chosen so that the gradient of \(g\) matches the gradient of the cost at the current variable.

For this quadratic form, we minimize it by solving the optimality condition, \(x^* = \text{argmin} \; g(x) = A^{-1} b\), and we keep fitting and solving the optimality condition until convergence.
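
A minimal sketch of Newton's method on a made-up smooth cost, with the gradient and Hessian written out by hand:

import numpy as np

def cost(x):
    # a made-up smooth, bounded-below cost
    return np.exp(x[0]) + x[0] ** 2 + x[1] ** 2

def gradient(x):
    return np.array([np.exp(x[0]) + 2.0 * x[0], 2.0 * x[1]])

def hessian(x):
    return np.array([[np.exp(x[0]) + 2.0, 0.0],
                     [0.0, 2.0]])

x = np.array([2.0, 2.0])
for i in range(20):
    g = gradient(x)
    H = hessian(x)
    # fit a quadratic at x (using H and g) and jump to its minimizer
    step = np.linalg.solve(H, -g)
    x = x + step
    if np.linalg.norm(step) < 1e-10:
        break
print(x, cost(x))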

For an awesome math refresher, please read the Appendix of
Convex Optimization
Stephen Boyd, Lieven Vandenberghe


For a fantastic introduction, please read:
Introduction to Applied Linear Algebra
Stephen Boyd, Lieven Vandenberghe

1. Linear Least Squares

Problem definition

Given: a linear residual function

\(r(x) = Ax + b \)

where \(x \in R^{n \times 1}\), \(A \in R^{m \times n}\), \(b \in R^{m \times 1}\)

We want: to minimize a cost function, the squared norm of the residual function.


\(\text{cost} (x) = r(x)^T r(x)\)

How to solve?

Plug in \(r(x) = Ax + b\),


\(\text{cost} = (Ax + b)^T (Ax + b) \)


\( \text{cost} = x^TA^TAx + 2b^TAx + b^Tb \)


It is a quadratic form in \(x\). And it is bounded below because \(A^T A\) is a positive semi-definite matrix.

Recall that the minimum condition for a quadratic form is that the gradient equals zero.

So,

\( \text{minimize cost}(x) \Rightarrow \text{solve} \; \frac{\partial cost }{\partial x} = 0 \)

Now, find the solution for,

\(\frac{\partial cost }{\partial x} = 0 \)

Take the derivative of the cost w.r.t. x,

\(\frac{\partial cost }{\partial x} = \frac{\partial{x^TA^TAx + 2b^TAx + b^Tb}}{\partial{x}} \)

\(\frac{\partial cost }{\partial x} = 2A^TAx + 2A^Tb \)

In sum, the solution for \(\text{minimize cost}(x) \), i.e. \(x^* = \text{argmin cost}(x)\), is the solution of,

\(2A^TAx^* + 2A^Tb = 0\)
\(A^TA x^* = - A^Tb\)

This is called the normal equation.

By the structure of the problem, \(A^T A\) is (up to a constant factor of 2) the Hessian of the cost function. So without deriving the Hessian analytically, we can do the Newton optimization.
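
A minimal numpy sketch of the normal equation (A and b are random, just for illustration), cross-checked against np.linalg.lstsq, which solves the same minimization:

import numpy as np

np.random.seed(0)
A = np.random.randn(20, 3)     # m = 20 residuals, n = 3 variables
b = np.random.randn(20)

# normal equation: A^T A x* = -A^T b   (for the residual r(x) = A x + b)
x_star = np.linalg.solve(A.T @ A, -A.T @ b)

# cross-check: lstsq minimizes ||A x - (-b)||^2, which is the same problem
x_lstsq, *_ = np.linalg.lstsq(A, -b, rcond=None)
print(x_star)
print(x_lstsq)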

2. Nonlinear Least Squares

Problem definition

Given: a residual function

\(r(x) \)

where \(x \in R^{n \times 1}\), \(r\) is a function (mapping) \(r : R^{n \times 1} \rightarrow R^{m \times 1}\).

We want: to minimize a cost function, the squared norm of the residual function.


\(\text{cost} (x) = r(x)^T r(x)\)

How to solve (minimize) this problem?

The idea is to linearly approximate the nonlinear residual function, then use the linear least squares solution above to solve the nonlinear problem iteratively.

Let’s start by doing a linear approximation of \(r(x)\) at \(x_0\),

\(r(x) \approx J(x_0) (x – x_0) + r(x_0)\)

Let \(\Delta x = x – x_0\),

\(r(\Delta x) \approx J(x_0) \Delta x + r(x_0)\)

Go back to minimizing the cost. Plug in the approximation \(r(\Delta x) \approx J(x_0) \Delta x + r(x_0)\) into the cost function,

\(\text{cost}(x) \approx \text{approx-cost}(x) = ( J(x_0) \Delta x + r(x_0) )^T (J(x_0) \Delta x + r(x_0)) \)

It is exactly the linear least squares problem we discussed above.

The argmin of the approximated cost is given by the normal equation,

\(J(x_0)^T J(x_0) \Delta{x}^* = – J(x_0)^T r(x_0)\)

Where \(\Delta{x}^* \; = \; \text{argmin} \; \; \text{approx-cost}(x) \).

Because of the change of variables, the true solution is,

\(x = x_0 + \Delta{x}^* \)

It is the update rule for nonlinear least squares.

Since we approximated the cost locally, we need to repeat this procedure until it converges.

In sum, the algorithm to minimize a nonlinear least squares problem is (a minimal code sketch follows the list),

  1. Linearly approximate the residual function: \(r(x) \approx J(x_0) (x - x_0) + r(x_0)\)
  2. Solve the normal equation of the approximated cost: \(J(x_0)^T J(x_0) \Delta{x}^* = - J(x_0)^T r(x_0)\)
  3. Update the variables by \(x = x_0 + \Delta{x}^* \).
  4. Go back to 1. if not converged.
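
The four steps translate almost line by line into code. Below is a minimal Gauss-Newton sketch with a made-up residual function and a finite-difference Jacobian; a production solver would add damping (e.g. Levenberg-Marquardt) and a better convergence test:

import numpy as np

def residual(x):
    # made-up nonlinear residual r : R^2 -> R^3
    return np.array([x[0] * x[1] - 2.0,
                     x[0] ** 2 + x[1] - 3.0,
                     x[0] - x[1]])

def numerical_jacobian(r, x0, eps=1e-6):
    J = np.zeros((len(r(x0)), len(x0)))
    for i in range(len(x0)):
        step = np.zeros_like(x0)
        step[i] = eps
        J[:, i] = (r(x0 + step) - r(x0 - step)) / (2.0 * eps)
    return J

x = np.array([1.0, 1.0])
for i in range(50):
    J = numerical_jacobian(residual, x)         # 1. linearize at the current x
    r = residual(x)
    delta = np.linalg.solve(J.T @ J, -J.T @ r)  # 2. solve the normal equation
    x = x + delta                               # 3. update
    if np.linalg.norm(delta) < 1e-10:           # 4. stop when converged
        break
print(x, residual(x))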

3. Weighted Nonlinear Least Squares

In practice, we give weights to different terms in the optimization problem.

Problem definition

Given: a residual function

\(r(x) \)

where \(x \in R^{n \times 1}\), \(r\) is a function (mapping) \(r : R^{n \times 1} \rightarrow R^{m \times 1}\).

We want: minimize a cost function,

\(\text{cost}(x) = r(x)^T W r(x)\)

Where W is a constant positive definite matrix.

The derivation is almost identical to the nonlinear case, except a \(W\) is carried along.

How to solve it?

Since the derivation is so similar to the nonlinear case, I will simply state the result here (a sketch of the update step follows the list),

  1. Linearly approximate the residual function: \(r(x) \approx J(x_0) (x - x_0) + r(x_0)\)
  2. Solve the weighted normal equation of the approximated cost: \(J(x_0)^T W J(x_0) \Delta{x}^* = - J(x_0)^T W r(x_0)\)
  3. Update the variables by \(x = x_0 + \Delta{x}^* \).
  4. Go back to 1. if not converged.
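
The only change from the unweighted case is the \(W\) in the normal equation. A minimal sketch of a single update step, where J, r, and W are placeholder values just for illustration:

import numpy as np

# placeholders: J is the Jacobian at x0, r the residual at x0,
# W a positive definite weight matrix (here: per-residual weights on the diagonal)
J = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
r = np.array([0.5, -1.0, 2.0])
W = np.diag([1.0, 1.0, 0.1])   # e.g. down-weight the third, less trusted residual

# weighted normal equation: J^T W J delta* = -J^T W r
delta = np.linalg.solve(J.T @ W @ J, -J.T @ W @ r)
print(delta)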

4. Warm-up Problem

Let’s do a warm-up problem to practice formulating and solving a least squares problem.

Problem definition

We know a set of data points \((x_1, y_1), (x_2, y_2), … , (x_n, y_n)\) sampled from a quadratic function \(f(a, b, c, x) = a x^2 + b x + c \). We want to find the coefficients \((a,b,c)\) of the quadratic function.

Mathematically,

The residual function is,

\(r_i(a, b, c) = f(a, b, c, x_i) – y_i = a x_i^2 + b x_i + c – y_i \;\;\; \forall i\)

Note: the residual function is a function of \((a,b,c)\).

The cost function is defined as,

\(cost(a, b, c) = \sum_i r_i(a, b, c)^T r_i(a, b,c)\)

Let,

\(r(a,b,c) = \begin{bmatrix} r_1(a,b,c)\\ r_2(a,b,c)\\ r_3(a,b,c)\\ \vdots \\ r_n(a,b,c) \end{bmatrix} \)

The cost function can be expressed as,

\(cost(a,b, c) = r(a,b,c)^T r(a,b,c) = ||r(a,b,c)||^2\)

The problem is a least squares problem.

To solve the least squares problem, we simply follow the same procedure,

  1. Linearly approximate the residual function: \(r(x) \approx J(x_0) (x - x_0) + r(x_0)\)
  2. Solve the normal equation of the approximated cost: \(J(x_0)^T J(x_0) \Delta{x}^* = - J(x_0)^T r(x_0)\)
  3. Update the variables by \(x = x_0 + \Delta{x}^* \).
  4. Go back to 1. if not converged.

Linearization

For an element of the residual function, \(r_i = a x_i^2 + b x_i + c - y_i \),

its Jacobian with respect to the optimization variables \((a, b, c)\) is defined as,

\(J_i=\begin{bmatrix}\frac{\partial r_i}{\partial a} & \frac{\partial r_i}{\partial b} & \frac{\partial r_i}{\partial c}\end{bmatrix}\)

Plugging in \(r_i\), we can compute its Jacobian directly,

\(J_i(a, b, c)=\begin{bmatrix} x_i^2 & x_i & 1\end{bmatrix}\)

So the linearization of \(r_i\) is,

\(r_i(\Delta a, \Delta b, \Delta c) \approx \begin{bmatrix} x_i^2 & x_i & 1\end{bmatrix} \begin{bmatrix} \Delta a \\ \Delta b\\ \Delta c\end{bmatrix}+ r_i(a,b, c)\)

That is the linearization of one element of the residual function. For the whole residual function, the linearization is,

\( r(\Delta a, \Delta b, \Delta c) = \begin{bmatrix} r_1(\Delta a, \Delta b, \Delta c)\\ r_2(\Delta a, \Delta b, \Delta c)\\ r_3(\Delta a, \Delta b, \Delta c)\\ \vdots \\ r_n(\Delta a, \Delta b, \Delta c) \end{bmatrix} \approx \begin{bmatrix} x_1^2 & x_1 & 1\\ x_2^2 & x_2 & 1\\ x_3^2 & x_3 & 1\\ \vdots & \vdots & \vdots \\ x_n^2 & x_n & 1\end{bmatrix} \begin{bmatrix} \Delta a \\ \Delta b\\ \Delta c\end{bmatrix} + \begin{bmatrix} r_1(a,b, c) \\ r_2(a,b,c)\\ r_3(a,b,c) \\ \vdots \\ r_n(a,b,c)\end{bmatrix} \)

Which is in the form of,

\(r(\Delta ) = J(a,b,c) \Delta + r(a,b,c) \)

Where \(\Delta = (\Delta a, \Delta b, \Delta c)\)

Normal Equation

By solving the Normal equation,

\(J^TJ \Delta = – J^T r\)

We get \(\Delta a, \Delta b,\Delta c\).

Update

Then we update the optimization variables by,

\(a_{new} = a + \Delta a\\ b_{new} = b + \Delta b\\ c_{new} = c + \Delta c\)

If it has not converged, we recompute the Jacobian at \(a_{new}, b_{new}, c_{new}\) and repeat the process until convergence.

Linear?

The above derivation assumes the problem is a nonlinear least squares problem.

Did you notice it is actually a linear least squares problem?

Go back to the residual function,

\(r_i(a, b, c) = f(a, b, c, x_i) – y_i = a x_i^2 + b x_i + c – y_i \;\;\;\)

\(r_i(a, b, c) = \begin{bmatrix} x_i^2 & x_i & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} - y_i\)

So,

\( r(a,b, c) = \begin{bmatrix} r_1(a,b,c)\\ r_2(a,b,c)\\ r_3(a,b,c)\\ \vdots \\ r_n(a,b,c) \end{bmatrix} =\begin{bmatrix} x_1^2 & x_1 & 1 \\x_2^2 & x_2 & 1\\x_3^2 & x_3 & 1\\ \vdots & \vdots & \vdots \\x_n^2 & x_n & 1\end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}\)

It means the residual function is linear in \((a, b, c)\), and the cost is a quadratic form, so one normal-equation solve finds the minimum.
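
So one solve gives \((a, b, c)\) directly, no iteration needed. A minimal sketch, assuming x and y are the sampled data arrays (same names as in the data generation code below):

import numpy as np

def fit_quadratic_directly(x, y):
    # rows of the data matrix are [x_i^2, x_i, 1]
    A = np.vstack([x * x, x, np.ones_like(x)]).T
    # residual r = A theta - y, so the normal equation is A^T A theta = A^T y
    return np.linalg.solve(A.T @ A, A.T @ y)

The iterative solver below still works on this problem; it simply converges after the first update, which is exactly what the result log at the end shows.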

Implementation

Least squares core

import numpy as np

class LeastSquares:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        # data matrix with rows [x_i^2, x_i, 1]
        self.data_mat = np.vstack(
            [self.x * self.x, self.x,
             np.ones_like(self.x)]).T
        # optimization variables theta = (a, b, c)
        self.theta = np.array([0., 0., 0.])

    def residual(self):
        pred_y = self.data_mat @ self.theta
        r = pred_y - self.y
        return r

    def cost(self):
        r = self.residual()
        return r.T @ r

    def compute_jacobian(self):
        # the residual is linear in theta, so the Jacobian is the data matrix
        return self.data_mat

    def least_squares_solve(self):
        for i in range(10):
            print('iteration: {} cost: {}'.format(i, self.cost()))
            J = self.compute_jacobian()
            r = self.residual()
            # normal equation: J^T J delta = -J^T r
            delta = np.linalg.solve(J.T @ J, -J.T @ r)
            self.theta += delta
            if np.linalg.norm(delta) < 1e-8:
                print('converged iteration: {} cost: {}'.format(
                    i, self.cost()))
                break
        return self.theta

In least_squares_solve, I run the core least squares iterations:

  1. Linearization.
  2. Solving the Normal Equation.
  3. Update variables.
  4. Go to 1. if not converge.

Data generation

def generate_quadratic_data():
    quadratic_a = 2.4
    quadratic_b = -2.
    quadratic_c = 1.
    num_data = 30
    noise = 2 * np.random.randn(num_data)
    sampled_x = np.linspace(-10, 10., num_data)
    sampled_y = quadratic_a * sampled_x * sampled_x \
        + quadratic_b * sampled_x + quadratic_c + noise
    return sampled_x, sampled_y

The ground truth (a, b, c) is: (2.4, -2, 1)

The Main function

import matplotlib.pyplot as plt

def main():
    x_data, y_data = generate_quadratic_data()
    solver = LeastSquares(x_data, y_data)
    theta = solver.least_squares_solve()
    print('fitted coefficient (a,b,c):', theta)
    x = np.linspace(-12, 12., 100)
    a, b, c = theta
    pred_y = a * x * x + b * x + c
    p1 = plt.plot(x_data, y_data, '*')
    p2 = plt.plot(x, pred_y, 'g')
    plt.legend((p1[0], p2[0]), ('sampled data', 'fitted function'))
    plt.title('Data points vs Fitted curve')
    plt.show()

if __name__ == '__main__':
    main()

In the main function, I

  1. generated the test data.
  2. solved the least squares problem for (a, b, c).
  3. plotted the result.

Result

yimu@yimu-mate:~/Desktop/yimu-blog$ python3 qudratic_regression.py
iteration: 0 cost: 407170.1892255854
iteration: 1 cost: 97.84215255683235
converged iteration: 1 cost: 97.84215255683235
fitted coefficient (a,b,c): [ 2.4083295  -2.11060826  1.01204246]

What’s next?

Least squares is a foundation of robotics.

Many mainstream methods for state estimation, tracking, vision, planning, control, and calibration are formulated as least squares problems.

I’d like to formulate those problems as least squares in a simple and intuitive way and implement them.
