Facial Landmarks Detection

What’s in this article?

Facial landmarks are certain interesting points on the face. For example, the nose and eyes are good facial landmarks.

Facial landmarks detection is simply finding these facial landmarks in a facial image.

The pipeline for facial landmarks detection is typically broken into two steps: first, detect faces in an image; then, localize the facial landmarks within each face.

Face detection is considered a solved problem because human faces are well defined (whereas the definition of some objects varies, e.g. what's the definition of a pedestrian? How about a baby in a stroller? A person lying down? People in costume?). Given enough training data, a neural net can reliably detect faces.
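For completeness, step 1 can be handled by any off-the-shelf detector. Here is a minimal sketch using OpenCV's cv::CascadeClassifier; the cascade file name is an assumption, point it at any frontal-face model shipped with your OpenCV install:

#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

std::vector<cv::Rect> detect_faces(const cv::Mat& im_bgr)
{
    // The cascade file path is an assumption; use your install's copy.
    cv::CascadeClassifier face_cascade("haarcascade_frontalface_default.xml");

    cv::Mat gray;
    cv::cvtColor(im_bgr, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    std::vector<cv::Rect> faces;
    face_cascade.detectMultiScale(gray, faces, /*scaleFactor=*/1.1, /*minNeighbors=*/5);
    return faces;
}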

In this article, we will focus on facial landmarks detection. We will go through a classical facial landmarks detection algorithm called Active Appearance Models (AAMs) and implement it from scratch in C++.

1. The Active Appearance Models (AAMs)

The AAMs localize facial landmarks via least-squares template matching.

Given a template image \(I_1\) with labeled landmarks,

The template image \(I_1\)

and a source image \(I_2\),

The new image \(I_2\)

We want to match the template image \(I_1\) to the source image \(I_2\).

After a transformation is estimated, we can transform the landmarks to the source image \(I_2\). Those are the "detected" facial landmarks.

2. The Least-Squares Matching

The above matching problem is formulated as a least-squares problem.

Let \((u, v)\) be image coordinates. Given a transformation function \((u', v') = W(u, v, p)\), which is parameterized by \(p\), we can define an optimization problem to estimate the transformation,

\(\text{minimize}_p \; \text{cost}(p) = \sum_{u} \sum_{v} \| I_1(u, v) - I_2(W(u, v, p)) \|^2\)

In English: we want to search for a transformation, parameterized by \(p\), such that the transformed \(I_2\) is close to \(I_1\).

In the least-squares framework, we need to derive the Jacobian of the residual function \(I_1(u, v) - I_2(W(u, v, p))\) and solve the normal equation. Please make sure you understand the basics of least squares; read this article for a refresher.
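For reference, this is the standard Gauss-Newton recipe (basic least-squares material, not specific to AAMs): stack the per-pixel residuals into \(r(p)\), form the Jacobian \(J = \frac{\partial r(p)}{\partial p}\), then each iteration solves the normal equation and updates the parameters,

\(J^T J \, \Delta p = -J^T r(p), \qquad p \leftarrow p + \Delta p\)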

The Jacobian of the residual can be broken down by the chain rule,

\(\frac{\partial \left( I_1(u, v) - I_2(W(u, v, p)) \right)}{\partial p} = - \frac{\partial I_2(W(u, v, p))}{\partial W} \frac{\partial W}{\partial p}\)

This is exactly the Lucas-Kanade optimization problem, so I will not go into the details of the Lucas-Kanade optimization here. For a detailed discussion, please read this article for an introduction and this article for a problem similar to the AAMs. Warning: if you are not familiar with Lucas-Kanade, you have to read them; otherwise you will be confused very soon.

This is an overly simplified discussion of the AAMs algorithm. There are some important implementation details, which I will discuss in the implementation section.

3. Implementation

Code link

3.1 Preprocessing

Preprocessing is important since we directly compute pixel-level differences in the optimization problem. Since I will just use two of my own pictures later, I assume my skin color doesn't change in 10 seconds. So I will not do any fancy preprocessing, except that I blurred the image to make the image gradient smoother.
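The blurring I mean is a one-liner; a minimal sketch (the kernel size and sigma are assumptions, tune them for your images):

#include <opencv2/imgproc.hpp>

// Smooth the image so that the gradients used by the optimizer are smoother.
cv::Mat preprocess(const cv::Mat& im_gray)
{
    cv::Mat blurred;
    cv::GaussianBlur(im_gray, blurred, cv::Size(5, 5), /*sigmaX=*/1.5);
    return blurred;
}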

3.2 The Affine Transformation

The abstract transformation \((u', v') = W(u, v, p)\) from the previous section is chosen to be the affine transformation.

\(\begin{bmatrix} u' \\ v' \end{bmatrix}
= \begin{bmatrix} a & b \\ c & d \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} e \\ f \end{bmatrix}\)

Where the parameters are \(p = [a, b, e, c, d, f]^T\), matching the ordering used in the code below.

It can be implemented as,

// Apply the affine transform. The six parameters are stored in a
// 6x1 column vector in the order [a, b, e, c, d, f].
cv::Point2f apply_affine(const cv::Mat& trans, const cv::Point2f& p)
{
    assert(trans.rows == 6);
    assert(trans.cols == 1);

    const float a = trans.at<float>(0, 0);
    const float b = trans.at<float>(1, 0);
    const float e = trans.at<float>(2, 0);
    const float c = trans.at<float>(3, 0);
    const float d = trans.at<float>(4, 0);
    const float f = trans.at<float>(5, 0);

    cv::Point2f p_trans;
    p_trans.x = a * p.x + b * p.y + e;
    p_trans.y = c * p.x + d * p.y + f;
    return p_trans;
}

The Jacobian of the transform w.r.t. the parameters, \(\frac{\partial W}{\partial p}\), is,

cv::Mat jacobian_xy_wrt_affine(const cv::Point2f& xy)
{
    const float x = xy.x;
    const float y = xy.y;
    // d(u', v') / d(a, b, e, c, d, f)
    const cv::Mat jacobian_wrt_affine = (cv::Mat_<float>(2, 6)
        << x, y, 1, 0, 0, 0,
           0, 0, 0, x, y, 1);
    return jacobian_wrt_affine;
}

3.3 The full Jacobian

To get the full Jacobian, \(\frac{\partial \left( I_1(u, v) - I_2(W(u, v, p)) \right)}{\partial p} = - \frac{\partial I_2(W(u, v, p))}{\partial W} \frac{\partial W}{\partial p}\), we need to compute \(\frac{\partial I_2(W(u, v, p))}{\partial W}\).

As in Lucas-Kanade tracking, we need to compute \(\frac{\partial I_2(W(u, v, p))}{\partial W}\) numerically, with backward interpolation of the image gradients.

The idea is implemented (with some details omitted) as,

...
src_im_grad_x_ = compute_derivatives(src_im_, "x");
src_im_grad_y_ = compute_derivatives(src_im_, "y");
...
// Sample the source-image gradients at the warped location p_trans.
const cv::Mat jacobian_pixel_wrt_xy = (cv::Mat_<float>(1, 2)
        << bilinear_interp(src_im_grad_x_, p_trans),
           bilinear_interp(src_im_grad_y_, p_trans));
const cv::Mat jacobian_wrt_affine = jacobian_xy_wrt_affine({ x, y });
const cv::Mat jacobian_im_wrt_affine = jacobian_pixel_wrt_xy * jacobian_wrt_affine;

Together, these lines compute \(- \frac{\partial I_2(W(u, v, p))}{\partial W} \frac{\partial W}{\partial p}\) for each pixel.
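The snippet above relies on two helpers, compute_derivatives and bilinear_interp, that are not shown. Here is a minimal sketch of how they could look (a Sobel filter for the gradients and standard bilinear interpolation; the actual repository may implement them differently):

#include <opencv2/imgproc.hpp>
#include <cmath>
#include <string>

// A possible compute_derivatives: image gradients via cv::Sobel.
// Assumes a single-channel float (CV_32F) image.
cv::Mat compute_derivatives(const cv::Mat& im, const std::string& direction)
{
    cv::Mat grad;
    if (direction == "x")
        cv::Sobel(im, grad, CV_32F, 1, 0, /*ksize=*/3);
    else
        cv::Sobel(im, grad, CV_32F, 0, 1, /*ksize=*/3);
    return grad;
}

// A possible bilinear_interp: sample a CV_32F image at a sub-pixel
// location. No bounds checking, for brevity.
float bilinear_interp(const cv::Mat& im, const cv::Point2f& pt)
{
    const int x0 = static_cast<int>(std::floor(pt.x));
    const int y0 = static_cast<int>(std::floor(pt.y));
    const float dx = pt.x - static_cast<float>(x0);
    const float dy = pt.y - static_cast<float>(y0);

    return (1.f - dx) * (1.f - dy) * im.at<float>(y0, x0)
         + dx * (1.f - dy) * im.at<float>(y0, x0 + 1)
         + (1.f - dx) * dy * im.at<float>(y0 + 1, x0)
         + dx * dy * im.at<float>(y0 + 1, x0 + 1);
}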

3.4 The Least Squares Iteration

Then I implemented the least squares main iteration.

    cv::Mat least_squares(
        const cv::Mat& affine_trans_init,
        const int x_start,
        const int x_len,
        const int y_start,
        const int y_len)
    {
        // the affine parameters are a 6x1 column vector: [a, b, e, c, d, f]
        cv::Mat affine_trans = affine_trans_init.clone();

        constexpr size_t MAX_ITERS = 100;
        for (size_t iters = 0; iters < MAX_ITERS; ++iters) {
            const cv::Mat J = compute_jacobian(affine_trans,
                src_im_grad_x_, src_im_grad_y_, x_start, x_len, y_start, y_len);

            const cv::Mat b = compute_b(affine_trans,
                x_start, x_len, y_start, y_len);

            cv::Mat delta = solve_normal_equation(J, b);
            affine_trans = affine_trans + delta;

            ...

            if (cv::norm(delta) < 1e-1) {
                std::cout << "converged" << std::endl;
                break;
            }
        }

        std::cout << "final trans: " << affine_trans << std::endl;

        return affine_trans;
    }

I computed the Jacobian, constructed the normal equation, and solved it.
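The solve_normal_equation helper is not shown above; here is a minimal sketch using cv::solve. The cv::DECOMP_NORMAL flag makes it form and solve \(J^T J \, \delta = J^T b\) internally; whether \(b\) carries the minus sign of the residual depends on how compute_b is defined:

#include <opencv2/core.hpp>

// A possible solve_normal_equation. cv::DECOMP_NORMAL tells cv::solve to
// form and solve the normal equation J^T J * delta = J^T b internally.
cv::Mat solve_normal_equation(const cv::Mat& J, const cv::Mat& b)
{
    cv::Mat delta;
    cv::solve(J, b, delta, cv::DECOMP_NORMAL | cv::DECOMP_CHOLESKY);
    return delta;
}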

3.5 The Piece-wise Affine

Using a single affine transform to model facial image changes is an oversimplification. The real sources of face image changes are camera projection changes and facial expression changes. Modeling them precisely is hard. An approximation is to use a piece-wise affine transform.

I implemented a decoupled piece-wise affine. Basically, I broke the image into pieces and estimated an affine transformation for each piece independently. Note that in the AAMs paper, the authors used a better transformation model. Please read the paper for details.

// grid is hardcoded for simplicity.
// assuming the size of im is: cv::Size im_size(500, 550);
cv::Mat init_trans = (cv::Mat_<float>(6, 1) << 1, 0, 0, 0, 1, 0);
cv::Mat affine_trans_global = least_squares(init_trans, 50, cols_ - 100, 50, rows_ - 100);

const int x_start = 100;
const int y_start = 150;
const int gridy_len = 350;
const int gridx_len = 150;

cv::Mat result = src_im_orig_.clone();
// take middle roi
for (int y_grid = 0; y_grid <= 0; ++y_grid) {
    for (int x_grid = 0; x_grid <= 1; ++x_grid) {

        cv::Mat affine_trans_patch = least_squares(
            affine_trans_global,
            x_start + x_grid * gridx_len, gridx_len,
            y_start + y_grid * gridy_len, gridy_len);

        draw_landmark_in_src(result, affine_trans_patch,
            x_start + x_grid * gridx_len, gridx_len,
            y_start + y_grid * gridy_len, gridy_len);

        cv::imshow("cur result", result);
        cv::waitKey(1000);
    }
}

cv::imshow("result", result);
cv::waitKey(0);

I computed an initial guess for the affine transform using a big image patch. Then, in the double for loop, I gridded the image and estimated an affine transform for each cell separately.

3.6 Result

The global affine

The first step is to estimate a global affine transformation.

The piece-wise affine

Since an affine transform can't capture the change of camera position and facial changes, I did a decoupled piece-wise affine.

I can't cut the image into too many small pieces because I treated each optimization independently. So, I cut the image into two pieces.

Sadly, the right side of my face is peculiar and can't be warped well by the affine transform.

My right face!

Project Landmarks

Given the piece-wise affine transform, I can project the landmarks from the labeled image frame into the source image frame.

Left: the template image. Right: the estimated facial landmarks.
Clearly, something is wrong with my face or my camera.
Update: I cropped the images with different aspect ratios… But it's fine, more challenges for the algorithm!

The matching is not perfect. How can we make it better? The AAMs use multiple templates (eigen faces) to cover different facial expressions and camera positions.

4. Paper Unlock

4.1 Active Appearance Models

Active Appearance Models
Timothy F. Cootes, Gareth J. Edwards, and
Christopher J. Taylor

This paper is the original AAMs paper. It’s a bit more complex than what we did in section 3.

4.1.1 Eigen Faces

Given labeled faces, the authors first aligned all labeled faces into a common frame.

Align labeled faces.

Then, the authors did an eigen-analysis (PCA) to extract the most significant basis vectors (eigen faces). These basis vectors are used as templates in the later optimization.

There are 55 Eigen faces.
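This step can be reproduced with OpenCV's cv::PCA. A minimal sketch, assuming the aligned faces have already been flattened into the rows of a CV_32F data matrix (the variable names are mine, not the paper's):

#include <opencv2/core.hpp>

// data: one aligned, flattened face per row (CV_32F).
// Keeps the top-k principal components; pca.eigenvectors holds one
// eigen face per row, and pca.mean holds the mean face.
cv::PCA compute_eigen_faces(const cv::Mat& data, int k)
{
    return cv::PCA(data, cv::Mat() /* mean is computed from data */,
                   cv::PCA::DATA_AS_ROW, k);
}

// For a new (aligned, flattened) face, pca.project gives the weights c_i,
// and pca.backProject reconstructs the face from those weights.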

4.1.2 The cost

Given the eigen faces, the authors defined the residual function (the term inside the L2 cost).

The above definition can be confusing. Using modern and full notation, the residual function is defined as,

\(r(u, c) = \sum_x \sum_y \left( I(W_u(x, y)) - \sum_i c_i F_i(x, y) \right)\)

Where \(u\) is the parameter of the 2D transformation \(W_u(x, y)\), and the \(F_i\) are the eigen faces; they are images. The \(c_i\) are the linear combination weights for these eigen faces.

In English, the authors wanted to minimize the difference between the warped image and a linear combination of the eigen faces.

4.1.3 The optimization

Just least squares.

(3) is the linearization (Jacobian) of the residual.
(4) is the normal equation.

The steps for the algorithm are,

4.2 One Millisecond Face Alignment with an Ensemble of Regression Trees

One Millisecond Face Alignment with an Ensemble of Regression Trees
Vahid Kazemi and Josephine Sullivan

Before deep learning, gradient boosted trees were the king of machine learning. (They still are if you don't have enough data, or for simple problems.)

In this paper, the authors used a deeply coupled gradient boosting tree to predict the landmark locations.

The high-level idea is simple. Given a set of labeled faces \(\{ I_i, S_i \}\), where \(I_i\) is the face image and \(S_i\) is a vector of landmarks, the authors want to learn a function which maps \(I_i\) to \(S_i\).

However, learning the function directly using any supervised method is hard (except for neural nets). Instead, the authors use ensemble learning with deeply coupled feature extraction.

(As an aside, OpenCV implements an AAM-based method, like the paper in section 4.1, in FacemarkAAM.)

4.2.1 The ensemble learning

The gradient boosting tree is an ensemble learning algorithm. Ensemble learning basically means iteratively using a weak learner to fix the errors of the previous prediction.

For the facial landmarks problem, there is a weak regressor at each iteration that predicts the difference between the current landmarks estimate \(\hat S^{(t)}\) and the ground truth \(S_{\pi}\).

(1) is the inference. (6) is the function we want to learn at each ensemble iteration.

Note that the input to the regressor is the image and the current landmark estimate. The motivation is that the authors wanted the features of the regressor to be sampled close to the current landmarks estimate \(\hat S^{(t)}\).

Utilizing the prior information defined by \(\hat S^{(t)}\) is a key reason why this method works well in practice.
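To make the ensemble idea concrete, here is a hypothetical sketch of the inference loop; Regressor is a stand-in for one stage of the paper's cascade (in the paper, a sum of regression trees), not an actual API:

#include <opencv2/core.hpp>
#include <functional>
#include <vector>

// One cascade stage: predicts an update for every landmark from features
// sampled near the current estimate.
using Regressor = std::function<std::vector<cv::Point2f>(
    const cv::Mat& image, const std::vector<cv::Point2f>& S_hat)>;

// Inference: start from the mean shape; each stage corrects the previous estimate.
std::vector<cv::Point2f> detect_landmarks(const cv::Mat& image,
                                          std::vector<cv::Point2f> S_hat, // init: mean shape
                                          const std::vector<Regressor>& cascade)
{
    for (const auto& stage : cascade) {
        const std::vector<cv::Point2f> delta = stage(image, S_hat);
        for (size_t i = 0; i < S_hat.size(); ++i)
            S_hat[i] += delta[i];
    }
    return S_hat;
}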

4.2.2 Feature selection

(Hand-crafted) features are the input to the regressor. At each level, the authors first align the current landmarks estimate \(\hat S^{(t)}\) to a mean shape to achieve shape invariance.

Then, the authors select differences of pixel intensities around the current landmarks estimate \(\hat S^{(t)}\) as features.

In summary, the authors deeply coupled the gradient boosting tree with the landmarks detection. The key idea is: for each weak regressor (a constant-value tree) in the ensemble learner (the gradient boosting tree), the authors did the feature extraction by sampling pixels around the current best landmarks estimate.

OpenCV implemented this paper in FacemarkKazemi.

4.3 Deep Learn it!

Given a set of labeled faces \(\{ I_i, S_i \}\) where \(I_i\) is the face image and \(S_i\) is a vector of landmarks, we can learn a function which maps \(I_i\) to \(S_i\).

Supervised learning is function approximation.

Before the age of deep learning, all available learning algorithms were "weak": they couldn't learn a complex function directly. We had to do some engineering, or feature engineering similar to what we saw in (4.2), to make the problem easier for a learning algorithm.

Then, deep learning came.

Deep learning is able to learn very complex functions (with over-fitting, sometimes). We can just learn a function that maps an image to a vector of facial landmarks.

I will review three papers on the facial landmarks detection problem. The goal is to go through the development process of how we reached the state-of-the-art methods.

4.3.1 Flowing ConvNets for Human Pose Estimation in Videos

Flowing ConvNets for Human Pose Estimation in Videos
Tomas Pfister, James Charles, Andrew Zisserman

Facial landmark detection is very similar to the human pose estimation problem.

To be consistent with the facial landmarks problem, I will refer to human pose keypoints as human landmarks.

Given a set of images and labeled human landmarks, we can learn a network which maps an image to a vector of human landmarks.

The left image is a training pair, which consists of an image and a vector of human landmarks. We can ignore the links in the image because they are implicitly defined by the landmark order.
The right image is the one-hot encoding for each landmark.

In this paper, the authors used a one-hot encoding (with a Gaussian blob rather than a single one) for each landmark, and learned a function which maps an image to k classification maps (heatmaps).

At inference time, the authors input an image into the function and get k heatmaps. Then, they find the maximum of each heatmap. The location of the maximum is the final output for the human landmark location.
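The heatmap-to-landmark step is just an argmax per channel; a minimal sketch with cv::minMaxLoc (heatmaps here is a hypothetical vector holding one response map per landmark):

#include <opencv2/core.hpp>
#include <vector>

// One landmark per heatmap: the location of the maximum response.
std::vector<cv::Point> landmarks_from_heatmaps(const std::vector<cv::Mat>& heatmaps)
{
    std::vector<cv::Point> landmarks;
    for (const auto& h : heatmaps) {
        cv::Point max_loc;
        cv::minMaxLoc(h, nullptr, nullptr, nullptr, &max_loc);
        landmarks.push_back(max_loc); // heatmap coordinates; rescale to the input image
    }
    return landmarks;
}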

There are some other ideas in the paper for video handling. Read the paper if you are interested.

We can apply the idea directly to the facial landmarks detection problem.

4.3.2 Recurrent Human Pose Estimation

Recurrent Human Pose Estimation
Vasileios Belagiannis and Andrew Zisserman

Deep learning is a very competitive field. People try anything to beat other algorithms.

To improve the accuracy of the heatmap prediction, the authors changed the network structure to a "recurrent" net.

The network from (4.3.1) is in the first row of the picture.

To boost performance, the authors concatenated the output of the first row with the output of layer 3, and fed it into a new network on the second row. The authors did this concatenation 3 times. To make the optimization converge faster, the authors used auxiliary losses for the intermediate networks.

Note that if we remove the auxiliary losses and flatten the network, it is simply a deeper network with skip connections. (I don't think the auxiliary losses are very helpful. What do you think?)

Anyway, the performance improved (by a little bit, but good enough to beat others).

In sum, the paper improved landmarks detection performance with the "recurrent" network.

4.3.3 Facial Landmark Point Localization using Coarse-to-Fine Deep Recurrent Neural Network

Facial Landmark Point Localization using
Coarse-to-Fine Deep Recurrent Neural Network
Shahar Mahpod, Rig Das, Emanuele Maiorana, Yosi Keller, and Patrizio Campisi

Having the “recurrent” network is not enough!

Convolutional nets typically output a heatmap smaller than the input because of pooling and strides. The prediction accuracy is limited by the heatmap resolution. For example, assuming the input image is (256, 256) and the output heatmap is (64, 64), the prediction resolution is limited to 4 pixels in the input image frame.

To get the last few pixels of accuracy, people add a regression net to predict the offset from the heatmap maximum to the ground-truth position.
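Combining the coarse argmax with the regressed offset is then straightforward; a hypothetical sketch (stride is the input-to-heatmap downscale factor, 4 in the example above):

#include <opencv2/core.hpp>

// Map a coarse heatmap argmax plus a regressed sub-stride offset back
// into input-image coordinates. stride = input size / heatmap size.
cv::Point2f refine_landmark(const cv::Point& argmax_hm,
                            const cv::Point2f& offset,
                            float stride)
{
    return { argmax_hm.x * stride + offset.x,
             argmax_hm.y * stride + offset.y };
}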

To boost the regression net's performance, the authors also made the regression net "recurrent".

Top row: the network in (4.3.2).
Bottom row: the current regression net for heatmap "sub-pixel" accuracy.

The authors didn't mention how to train the regression net. My guess is that they need to associate the current heatmap maximum with the ground truth and compute the difference; the difference is what the regression net needs to learn. The process has to be done on every pass of training.

5. Summary

In this article, we implemented a simple version of the AAMs facial landmarks detection algorithm. We also discussed the original AAMs algorithm, the gradient boosting algorithm, and deep learning methods for facial landmarks detection.

I hope you had fun 🙂
