Why writing this post?

Deep learning changed the world!

What makes deep learning so successful?

Here are my answers.

Simple yet power theory: By stacking linear+activation together, we can approximate any mapping!
Rich training data: Since the deep network can approximate any mapping, we need lots of data to make sure the mapping is not overfitting.
Tools: GPU makes the computation tractable. Auto differentiate tools make deep learning accessible.

I would say Simple yet power theory, Rich training data contribute to 33.3% and 33.3% to deep learning’s success. And good tools contributes to 33.4% of the success.

I’d like to implement a auto differentiation library to pay my respect.

What’s in this post?

In this post, we will implement a auto differentiation library similar to Tensorflow/Pytorch.

Specifically, we will discuss,

The Math Theory.
How to do the forward pass and backward pass.
The tree implementation.
The Directed acyclic graph implementation.

0. Deep learning refresher

Supervised learning is simply function approximation.

Given a set of input, output pairs \(\{ (x_0, y_0), (x_1, y_1), .. \}\), we want to learn a mapping f. It is usually formulated as an optimization.

\(\text{cost}_{\theta}(f) = \sum_i ||f_{\theta}(x_i) – y_i||\)

In English, we want to find a function f parameterized by \(\theta\) that best approximates the mapping between input x and output y.

There are many ways to do the function approximation. Deep learning is one of them.

Why deep learning performs well? Because a deep learning model is able to approximate any function.

The deep learning model can be viewed as a composition of simple nonlinear functions.

For example,

\(f(x) = f3_{\theta_1}(f1_{\theta_2}(f1_{\theta_1}(x)))\)

\(f(x)\) is composed of 3 nonlinear functions which are parameterized by \(\theta_i\).

To optimize the best \(f(x)\), people use the chain rule to compute the gradient w.r.t \(\theta_i\) and do the gradient descent.

1. The Chain Rule

1.1 The misleading notation

I searched the chain rule and back-propagation online. Here are what I found.

Chain rule 1.

Chain rule 2

Chain rule 3

There are correct math notation. But it’s NOT what backpropagation is doing.

1.2 The better notation

Given a function,

\(cost(x) = f_3(f_2(f_1(x)))\)

Assuming the current variable is at \(x = \bar{x}\).

We are interested in minimize the cost by updating \(x\). A obvious way is doing the gradient descent.

For the gradient descent, we want to do,

\(\text{next x} = \bar{x} – k \frac{\partial cost}{\partial x}|_{x=\bar{x}}\)

\(\frac{\partial cost}{\partial x}|_{x=\bar{x}}\) means the derivative of cost w.r.t x evaluated at the current variable \(\bar{x}\).

How to compute the gradient?

Let \(z_1 = f_1(x), z_2 = f_2(z_1)\). So \(\bar{z_1} = f_1(\bar{x}), \bar{z_2} = f_2(\bar{z_1})\)

The chain rule says,

\(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}} \)

How to compute the expression?

Backpropagation does it this way:

Given \(x = \bar{x}\), compute \(\bar{z_1} = f_1(\bar{x}), \bar{z_2} = f_2(\bar{z_1})\).
With \(\bar{z_2}, \bar{z_1}, \bar{x}\), we can evaluate \(\frac{\partial cost}{\partial z_2}|_{\bar{z_2}}\). Then \(\frac{\partial cost}{\partial z_1}|_{\bar{z_1}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \). Then \(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial cost}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}}\).

The step 1 is the forward pass. The step 2 is the backward pass.

In a picture,

1. Forward pass to compute \(z_i\). 2. Backward pass to compute \(\frac{\partial cost}{\partial z_i}|_{\bar{z_i}}\).

This lead to an implementation of the Backpropagation.

For each intermidate function \(z_i = f(z_{i-1})\). We need to track the dependencies. e.g. output \(z_i\) and input \(z_{i-1}\)
In the forward pass, given inputs \(\bar{z_{i-1}}\), we compute the outputs \(\bar{z_{i}}\).
In the backward pass, given the derivative of cost w.r.t output \(\frac{\partial cost}{\partial z_i}|_{\bar{z_i}}\). We evaluate the derivation of the cost w.r.t input by the chain rule \(\frac{\partial cost}{\partial z_{i-1}}|_{\bar{z{i-1}}} = \frac{\partial cost}{\partial z_i}|_{\bar{z_i}} \frac{\partial z_i}{\partial z{i-1}}|_{\bar{z{i-1}}}\).

To support this, we can define a type for optimization variables. In the type, we track the variable’s value and the derivative of cost w.r.t to the variable. We also need to track the variable’s dependencies.

e.g. For a variable \(z_i\), we save it’s value \(\bar{z_i}\) and \(\frac{\partial cost}{\partial z_i}|_{\bar{z_i}}\). And we tracks it’s dependencies (parents and children).

class Node:
    def __init__(self, id):
        self.id = id
        # parent operators
        self.parents = []
        # child operator
        self.children = []
        
class Variable(Node):
    def __init__(self, value=0, id='', ntype='opt-varible'):
        super().__init__(id)
        self.value = value
        self.grad = 0
        
        # We need to differentiate variable types
        # "opt-varible", "const", "intermediate"
        self.ntype = ntype
        
        ...

1.3 The Correct chain rule

In Backpropagation, we are doing

\(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}} \)

Instead of

\(\frac{\partial cost}{\partial x} = \frac{\partial cost}{\partial z_2} \frac{\partial z_2}{\partial z_1} \frac{\partial z_1}{\partial x} \)

Assuming \(x \in R^n\) and \(cost \in R^1\)

The formal expression is a value \( \frac{\partial cost}{\partial x}|_{ \bar{x}} \in R^{1 \times n}\).

The later expression is a function \( \frac{\partial cost}{\partial x} \in R^1 \to R^{1 \times n}\).

2 The Expression Tree

In the previous section, we discuss the chain rule in a list-like example. In general, a math expression can be represented as a directed acyclic graph, which is equivalent to a tree.

For example, consider this expression.

\(c = x + x y\)

We can represent the expression as a tree or a graph

Personally, I prefer to view the expression as a tree since it makes the analysis easier.

Even the graph neural net can be thought of as an expression tree!
(A one-line summary of the graph theory: \(f = AAAA…AAAx\). It’s an expression!)

For example, assume we did the forward pass (\(\bar{x}\) means the current value of \(x\)), the Backpropagation for a tree look like this,

For each intermedia function \(y_i = f(x_i, z_i)\),
The forward pass: traversal the tree from leaves to root to compute the current value \(\bar{y_i} = f(\bar{x_i}, \bar{z_i})\).
The backward pass: traversal the tree from root to leaves and use the chain rule to compute \(\frac{\partial cost}{\partial x_i}|_{\bar{x_i}} = \frac{\partial cost}{\partial y_i}|_{\bar{y_i}} \frac{\partial y_i}{\partial x_i}|_{\bar{x_i}}\) and \(\frac{\partial cost}{\partial z_i}|_{\bar{z{i}}} = \frac{\partial cost}{\partial y_i}|_{\bar{y_i}} \frac{\partial y_i}{\partial z{i}}|_{\bar{z{i}}}\)

(Note, we need to use \(+= \) to update the gradient since there can be multiple paths from the root to a variable. I ignore it for simplicity.)

Alright, now we have the big picture of how to implement the auto Backpropagation.

We want to:

compute the gradient of cost w.r.t all variables \(\frac{\partial cost}{\partial x}|_{x=\bar{x}}\) so that we can do the gradient descent.

To compute the gradient we need to,

Track the dependencies of all variables.
Two kinds of nodes: variable node and operator node.
Forward pass: traverse the tree from leaves to root. Perform expression evaluations.
Backward pass: traverse the tree from root to leaf. Using the chain rule to compute the gradients.

3. Implementation

3.1 The Tree Implementation

Node

A node is a node in a tree.

class Node:
    def __init__(self, id):
        self.id = id
        # parent operators
        self.parents = []
        # child operator
        self.children = []

We have 2 kinds of node. The variable node and the operator node.

The variable node.

class Variable(Node):
    def __init__(self, value=0, id='', ntype='opt-varible'):
        super().__init__(id)
        self.value = value
        self.grad = 0
        # "opt-varible", "const", "intermediate"
        self.ntype = ntype
        
    def __add__(self, other):
        if isinstance(other, numbers.Number):
            other = Variable(value=other, ntype='const')
        plus = Plus(self, other)
        self.parents.append(plus)
        other.parents.append(plus)
        plus.result.children = [plus]
        return plus.result
        
    def __radd__(self, other):
        ...
        
    def __sub__(self, other):
        if isinstance(other, numbers.Number):
            other = Variable(value=other, ntype='const')
        neg_other = - other
        return self + neg_other
        
    def __rsub__(self, other):
        ...
        
    def __mul__(self, other):
        if isinstance(other, numbers.Number):
            other = Variable(value=other, ntype='const')
        mul = Mul(self, other)
        self.parents.append(mul)
        other.parents.append(mul)
        mul.result.children = [mul]
        return mul.result
        
    def __rmul__(self, other):
        ...
        
    def __neg__(self):
        neg = Neg(self)
        self.parents.append(neg)
        neg.result.children = [neg]
        return neg.result

For a variable node, we implement some common operators like “+”, “-“, “*”.

And we have Operators.

class Operator(Node):
    def __init__(self, id):
        super().__init__(id)
        
    def forward(self):
        raise NotImplementedError("Should have implemented this")
    
    def backward(self):
        raise NotImplementedError("Should have implemented this")
        
class Plus(Operator):
    def __init__(self, a: Variable, b: Variable):
        super().__init__("({})+({})".format(a.id, b.id))
        self.a = a
        self.b = b
        self.result = Variable(ntype="intermediate", id="res:{}".format(self.id))
        self.children = [a, b]
        self.parents = [self.result]
    
    def forward(self):
        self.result.value = self.a.value + self.b.value
    
    def backward(self):
        if self.a is self.b:
            self.a.grad += 2 * self.result.grad
        else:
            self.a.grad += self.result.grad
            self.b.grad += self.result.grad
            
class Mul(Operator):
    def __init__(self, a: Variable, b: Variable):
        super().__init__("({})*({})".format(a.id, b.id))
        self.a = a
        self.b = b
        self.result = Variable(ntype="intermediate", id="res:{}".format(self.id))
        self.children = [a, b]
        self.parents = [self.result]
    
    def forward(self):
        self.result.value = self.a.value * self.b.value
    
    def backward(self):
        if self.a is self.b:
            self.a.grad += 2 * self.result.grad * self.a.value
        else:
            self.a.grad += self.result.grad * self.b.value
            self.b.grad += self.result.grad * self.a.value
            
class Neg(Operator):
    ...

In a forward pass, the operator takes its children’s current values, do some math, put the result in its parent variable.

In a backward pass, the operator takes its parent’s gradient, do the chain rule, and put the result gradient in its children.

Build The Tree

The implementation build the tree implicitly using python’s build-in operator.

class Variable(Node):
    
    def __init__(self, value=0, id='', ntype='opt-varible'):
        super().__init__(id)
        self.value = value
        self.grad = 0
        # "opt-varible", "const", "intermediate"
        self.ntype = ntype
        
    def __add__(self, other):
        if isinstance(other, numbers.Number):
            other = Variable(value=other, ntype='const')
        plus = Plus(self, other)
        self.parents.append(plus)
        other.parents.append(plus)
        plus.result.children = [plus]
        return plus.result
        
    ...
    
    def __mul__(self, other):

For example, we implemented the __add__ build-in function.

We connect a Plus operator and connect it to the current variable self and the other variable other. We return the result of the Plus operator.

We just build a tiny tree!

     result
       |
       +
      / \
  self   other

How to build the tree for \(c = x + xy\)?

x = Variable(value=1., id='t{}'.format('x'))
y = Variable(value=1., id='t{}'.format('y'))
c = x + x * y

x * y created a intermediate variable t. c = x + t creates the tree.

Now, we built the expression tree for the cost function expression. The next step is to evaluate the expression by the forward pass.

The Forward Pass

The forward pass can be implemented as a recursive tree traversal from leaf to root.

    # simple version
    def __forward_recursive_simple(self):
        def evaluate(node):
            for child in node.children:
                evaluate(child)
            if isinstance(node, Operator):
                node.forward()
        evaluate(self.cost_node)

When the node is an Operator, we call its forward function. The forward function reads from inputs variables and saves result to the output variable.

When the node is a variable, we do nothing.

Note the actual implementation handles cycle detection and memorization. Please read the source code for details.

If you are confused, please do this Leetcode esay question.

The Backward Pass

The backward pass can be implemented as a recursive tree traversal from root to leaf.

    def backward(self):
        def backward_dfs(node):
            if isinstance(node, Operator):
                node.backward()
            for child in node.children:
                backward_dfs(child)
        self.cost_node.grad = 1
        backward_dfs(self.cost_node)

We simply call the backward function for all operators from root to leaf.

The backward function reads the derivative of cost w.r.t to output and compute the derivative w.r.t input variables by the chain rule.

The Core

We can group operations on a tree together. Let’s call it the NumberFlowCore

The NumberFlowCore take a tree root (the cost node) as input.

The operations in the NumberFlowCore are,

The forward pass
The backward pass
Clear gradient for all variables
Gradient descent

class NumberFlowCore:
    def __init__(self, cost_node):
        
        self.cost_node = cost_node
    
    def __forward_recursive(self):
        ...
        def evaluate(node):
            ...
            for child in node.children:
                evaluate(child)
            if isinstance(node, Operator):
                node.forward()
            ...
        evaluate(self.cost_node)
        
    def backward(self):
        def backward_dfs(node):
            if isinstance(node, Operator):
                node.backward()
            for child in node.children:
                backward_dfs(child)
        self.cost_node.grad = 1
        backward_dfs(self.cost_node)
        
    def clear_grad(self):
        def clear_grad_dfs(node):
            if isinstance(node, Variable):
                node.grad = 0
            for child in node.children:
                clear_grad_dfs(child)
        clear_grad_dfs(self.cost_node)
        
    def gradient_desent(self, rate=0.01):
        for var in self.varible_nodes:
            var.value -= rate * var.grad

Alright, this it!

The library build a tree and track the value/gradient flow. To honestly capture what the lib is doing, let me call the lib variables_tree_flow.

Let’s try it!

Consider a simple optimization problem.

\(\text{cost}(x,y) = ||x + xy||^2\)

We can solve the problem using this code.

x = Variable(value=1., id='x')
y = Variable(value=1., id='y')
c = x + x * y
cost = c * c

core = NumberFlowCore(cost)
for i in range(100):
    if i % 10 == 0:
        print("cost.val:", cost.value, " iter:", i)
    core.forward()
    core.clear_grad()
    core.backward()
    core.gradient_desent(rate=0.01)
    if abs(cost.value) < 1e-8:
        print('break', cost.value)
        break
    
for var in core.varible_nodes:
    print(var.id, var.value)

line 1~6, build the expression (tree) for the cost expression.

line 9~16, the gradient descent algorithm.

Result:

(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/deep-learning/tensorflow-from-scratch/test.py
cost.val: 0  iter: 0
cost.val: 0.3720476003288574  iter: 10
cost.val: 0.0913675588756892  iter: 20
cost.val: 0.023025588582097953  iter: 30
cost.val: 0.005812055177713739  iter: 40
cost.val: 0.0014672133989577039  iter: 50
cost.val: 0.0003703903644286424  iter: 60
cost.val: 9.350315320913633e-05  iter: 70
cost.val: 2.360439345048723e-05  iter: 80
cost.val: 5.9588085747461865e-06  iter: 90
x 7.992086857341993e-07
y 0.6401459158392562

3.2 Directed Acyclic Graph Implementation

We can also view an expression tree as a directed acyclic graph.

An acyclic graph encodes the dependency for variables. The dependency is captured by a directed edge of the graph. To compute the value of a node, we need to compute all nodes “in order”.

We use an algorithm called Topological sort to find the order.

Topological Sort

Go back to the example.

\(c = x + xy\)

After the topological sort, the order for all nodes are \(x,y,*,t,+,c\). It means to compute the value of c, we need to evaluate nodes in the topological order.

In the context of the backpropogation, we follow the topological order for the forward pass and follow the reversed order for the backward pass.

I won’t discuss the detail of the algorithm. Please do this leetcode medium question or read this article for details.

The “in-degree” topological sort implementation.

(BTW, your interviewer expect you memorize the question and code it in 15 minites)

    def __topological_sort(self):
        zero_degree_nodes = []
        indegree = defaultdict(int)
        
        for node in self.all_nodes:
            indegree[node] = len(node.children)
            if len(node.children) == 0:
                zero_degree_nodes.append(node)
                
        topo_order = []
        while zero_degree_nodes:
            next_zero_degree_nodes = []
            for znode in zero_degree_nodes:
                topo_order.append(znode)
                for p in znode.parents:
                    indegree[p] -= 1
                    if indegree[p] == 0:
                        next_zero_degree_nodes.append(p)
            zero_degree_nodes = next_zero_degree_nodes
            
        if len(topo_order) != len(self.all_nodes):
            raise RuntimeError("cycle found!")
            
        self.topologic_order = topo_order

The forward & backward pass

    def __forward_iterative(self):
        for node in self.topologic_order:
            if isinstance(node, Operator):
                node.forward()
                
    def __backward_iter(self):
        self.cost_node.grad = 1.
        for node in reversed(self.topologic_order):
            if isinstance(node, Operator):
                node.backward()

The forward and backward pass simply loop through operators in order.

The graph implementation of backpropagation is faster since there is no recursion overhead. But a downside could be we have to redo the topological sort whenever the graph gets updated.

A interesting trade-off for libraray developers.

Alright. 300 lines of python code for 2 implementations. Let’s use it for some applications.

4 Applications

We can implement a couple of classical problems using the library.

4.1 Linear Regression

Let’s use the library to solve the Linear regression problem by gradient descent.

\(\text{cost}(\Theta) = ||\Theta x – b ||^2\)

This is basically the linear regression example from sklearn. There are two options for linear regression.

The sklearn build-in solver. I believe it bases on least squares (2nd order Newton).
Using the variable-tree-flow to do gradient descent.

The main function. It reads data and calls the library.

# this is basically a sklearn linear regression example.
# change the METHOD to see difference.
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import datasets, linear_model
import numpy as np
import matplotlib.pyplot as plt


if __name__ == "__main__":
    
    # 'sklearn' or 'vtf'
    METHOD = 'vtf'
    # Create linear regression object
    if METHOD == 'sklearn':
        regr = linear_model.LinearRegression()
    elif METHOD == 'vtf':
        regr = LinearRegression()
        
    # Load the diabetes dataset
    diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
    # Use only one feature
    diabetes_X = diabetes_X[:, np.newaxis, 2]
    # Split the data into training/testing sets
    diabetes_X_train = diabetes_X[:-400]
    diabetes_X_test = diabetes_X[-400:]
    # Split the targets into training/testing sets
    diabetes_y_train = diabetes_y[:-400]
    diabetes_y_test = diabetes_y[-400:]
    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)
    # Make predictions using the testing set
    diabetes_y_pred = regr.predict(diabetes_X_test)
    
    # # The coefficients
    print('Coefficients:', regr.coef_)
    print('Intercept:', regr.intercept_)
    
    # Plot outputs
    plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
    plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
    plt.xticks(())
    plt.yticks(())
    plt.show()

The implementation of the linear regression using the home made library.

import variables_tree_flow as vtf
import numpy as np


class LinearRegression:
    
    def __init__(self):
        pass
    
    def fit(self, X, y):
        assert len(X) == len(y)
        self.data_num = len(X)
        self.len_theta = X.shape[1]
        theta = np.array([vtf.Variable(value=0, id='t{}'.format(i))
                 for i in range(self.len_theta)])
        bias = vtf.Variable(value=0, id='b')
        pred = X @ theta + bias
        diff = pred - y
        cost = diff.T @ diff * (1 / len(X))
        core = vtf.NumberFlowCore(cost)
        
        for i in range(10000):
            core.forward()
            if i % 500 == 0:
                print("cost.val:", cost.value, " iter:", i)
            core.clear_grad()
            core.backward()
            core.gradient_desent(rate=0.05)
        for var in set(core.varible_nodes):
            print(var.id, var.value)
        self.coef_ = np.array([t.value for t in theta])
        self.intercept_ = bias.value
        
    def predict(self, X):
        res = X @ self.coef_ + self.intercept_
        return res

Line 11~20, build the expression tree for the cost.

Please note that Numpy supports arbitrary types as long as they support math operations. So, we can use Numpy for operations like matrix multiplications for the variables_tree_flow.Variable type.

Line 22~29, the gradient descent loop. forward pass, backward pass and gradient descent.

Result:

(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/deep-learning/tensorflow-from-scratch/linear_regression.py
cost.val: 26645.642857142855  iter: 0
cost.val: 4617.747728745511  iter: 500
cost.val: 4208.370338581641  iter: 1000
cost.val: 4033.241331959609  iter: 1500
cost.val: 3958.3222726824383  iter: 2000
cost.val: 3926.2723824258806  iter: 2500
cost.val: 3912.5616461276954  iter: 3000
cost.val: 3906.6962815871752  iter: 3500
cost.val: 3904.187116474157  iter: 4000
cost.val: 3903.113711825411  iter: 4500
cost.val: 3902.6545162412012  iter: 5000
cost.val: 3902.4580753325445  iter: 5500
cost.val: 3902.374039179059  iter: 6000
cost.val: 3902.338089054714  iter: 6500
cost.val: 3902.322709823434  iter: 7000
cost.val: 3902.3161306875027  iter: 7500
cost.val: 3902.313316175593  iter: 8000
cost.val: 3902.3121121455547  iter: 8500
cost.val: 3902.3115970692725  iter: 9000
cost.val: 3902.311376722961  iter: 9500
t0 886.9538881643437
b 148.09904663418547
Coefficients: [886.95388816]
Intercept: 148.09904663418547

4.2 Logistic Regression

The function to learn,

\(p_i(\theta) = \frac{1}{1 + e^{\theta x_i}}\)

The cost,

\(\text{cost}(\theta) = \sum_i( -y_i log(p_i(\theta)) – (1 – y_i)log(1 – p_i(\theta)) )\)

To support logistic regression, we need 2 operators. The Sigmoid and the Log.

class Sigmoid(Operator):
    def __init__(self, a: Variable):
        super().__init__("sigmoid({})".format(a.id))
        self.a = a
        self.result = Variable(ntype="intermediate", id="res:{}".format(self.id))
        self.children = [a]
        self.parents = [self.result]
        
    def forward(self):
        if abs(self.a.value) > 50:
            self.a.value = np.sign(self.a.value) * 50
        expo = math.exp(self.a.value)
        self.result.value = expo / (1 + expo)
        
    def backward(self):
        expo = math.exp(self.a.value)
        sigmoid = expo / (1 + expo)
        sigmoid_grad = sigmoid * (1 - sigmoid)
        self.a.grad += self.result.grad * sigmoid_grad
        
class Log(Operator):
    
    def __init__(self, a: Variable):
        super().__init__("log({})".format(a.id))
        self.a = a
        self.result = Variable(ntype="intermediate", id="res:{}".format(self.id))
        self.children = [a]
        self.parents = [self.result]
        
    def forward(self):
        # utils.traverse_tree(self)
        self.result.value = math.log(self.a.value)
        
    def backward(self):
        self.a.grad += self.result.grad * (1. / self.a.value)

Note that I handled the numerical issue in Sigmoid by clamping which is suboptimal (line 10).

We also need to write the “interface” function to build the tree. It’s easy, please read the code for details.

We can do the Logistic regression on the iris dataset.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
import variables_tree_flow as vtf

class MyLogisticRegression:
    def fit(self, X, y):
        return self.__fit_numpy_impl(X, y)
        
    def __fit_numpy_impl(self, X, y):
        assert len(X) == len(y)
        # bias
        ones = np.ones((X.shape[0], 1))
        X = np.hstack([X, ones])
        self.data_num = len(X)
        self.num_classes = max(y) + 1
        self.len_theta = X.shape[1]
        theta = np.array([vtf.Variable(value=0., id='t{}'.format(i))
                          for i in range(self.len_theta * self.num_classes)]).reshape([self.len_theta, self.num_classes])
        
        linear_comb = X @ theta
        pred = np.vectorize(vtf.ntf_sigmoid)(linear_comb)
        cost = 0
        
        for pred_row, gt_class in zip(pred, y):
            for class_idx, pred_class in enumerate(pred_row):
                # idx encoding
                # multiple logistic regression to handle multiclasses.
                if class_idx == gt_class:
                    cost += - vtf.ntf_log(pred_class)
                else:
                    cost += - vtf.ntf_log(1 - pred_class)
        cost = cost * (1 / self.data_num)
        
        core = vtf.NumberFlowCore(cost)
        for i in range(2000):
            core.forward()
            if i % 200 == 0:
                print("cost.val:", cost.value, " iter:", i)
            core.clear_grad()
            core.backward()
            core.gradient_desent(rate=0.05)
        self.coef_ = np.array([[theta.value for theta in row]
                               for row in theta])
                               
    def predict(self, X):
        res = X @ self.coef_[:-1, :] + self.coef_[-1, :]
        return np.argmax(res, axis=1)
        
if __name__ == "__main__":
    X, y = load_iris(return_X_y=True)
    mlg = MyLogisticRegression()
    mlg.fit(X, y)
    pred = mlg.predict(X)
    print('pred:', pred)
    print('y_gt:', y)
    print('train accuracy:', np.sum((y == pred)) / len(y))

line 21~33, build the expression for the cost.

line 35~44, gradient descent.

BTW, I ignore the training/testing split.

result

(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/deep-learning/tensorflow-from-scratch/logistic_regression.py
cost.val: 2.0794415416798406  iter: 0
cost.val: 0.9165179539616499  iter: 200
cost.val: 0.8204635597223325  iter: 400
cost.val: 0.7693105578987969  iter: 600
cost.val: 0.7356343790416037  iter: 800
cost.val: 0.7114761316284721  iter: 1000
cost.val: 0.6932389174928136  iter: 1200
cost.val: 0.6789656571543744  iter: 1400
cost.val: 0.6674786970513418  iter: 1600
cost.val: 0.658022175128815  iter: 1800
pred: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1
 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
y_gt: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
train accuracy: 0.9666666666666667

4.3 Neural Network

We can use the lib to build a Neural Net.

Here is a implementation for the MNIST dataset.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
import numpy as np
import variables_tree_flow as vtf
import matplotlib.pyplot as plt
import random
class NeuralNet:
    
    def fit_and_test(self, images, labels):
        assert len(images) == len(labels)
        num_data = len(labels)
        len0 = 64
        len1 = 10
        len2 = 10
        self.theta0 = np.array([vtf.Variable(value=random.gauss(0, 0.01), id='t{}'.format(i))
                                for i in range((len0 + 1) * len1)]).reshape([len1, (len0 + 1)])
        self.theta1 = np.array([vtf.Variable(value=random.gauss(0, 0.01), id='t{}'.format(i))
                                for i in range((len1 + 1) * len2)]).reshape([len2, (len1 + 1)])
        
        cost = 0
        for i, (im, label) in enumerate(zip(images, labels)):
            if i == 2:
                break
            pred = self.__predict_func(im)
            assert(len(pred) == 10)
            cost_this = 0
            for idx, pred_val in enumerate(pred):
                # idx encoding
                # multiple logistic regression to handle multiclasses.
                if idx == label:
                    cost_this += - vtf.ntf_log(pred_val)
                else:
                    cost_this += - vtf.ntf_log(1 - pred_val)
            cost += cost_this * (1 / num_data)
        
        core = vtf.NumberFlowCore(cost)
        for i in range(10000):
            core.forward()
            if i % 100 == 0:
                print("cost.val:", cost.value, " iter:", i)
            core.clear_grad()
            core.backward()
            core.gradient_desent(rate=0.05)
        # replace graph node type by np array
        self.theta0 = [[e.value for e in r] for r in self.theta0]
        self.theta1 = [[e.value for e in r] for r in self.theta1]
        for i in range(2):
            print(self.predict(images[i]))
            print('gt:', labels[i])
            
    def __predict_func(self, data):
        x0 = data.reshape(-1)
        x0b = np.hstack([x0, 1])
        z0 = self.theta0 @ x0b
        x1 = np.vectorize(vtf.ntf_sigmoid)(z0)
        x1b = np.hstack([x1, 1])
        z1 = self.theta1 @ x1b
        x2 = np.vectorize(vtf.ntf_sigmoid)(z1)
        return x2
        
    def predict(self, data):
        print('========================')
        pred = self.__predict_func(data)
        print('pred:', pred)
        return np.argmax(pred, axis=0)
        
class LogisticRegression:
    ...
    
if __name__ == "__main__":
    # The digits dataset
    digits = datasets.load_digits()
    # The data that we are interested in is made of 8x8 images of digits, let's
    # have a look at the first 4 images, stored in the `images` attribute of the
    # dataset.  If we were working from image files, we could load them using
    # matplotlib.pyplot.imread.  Note that each image must have the same size. For these
    # images, we know which digit they represent: it is given in the 'target' of
    # the dataset.
    _, axes = plt.subplots(2, 4)
    images_and_labels = list(zip(digits.images, digits.target))
    for ax, (image, label) in zip(axes[0, :], images_and_labels[:4]):
        ax.set_axis_off()
        ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
        ax.set_title('Training: %i' % label)
    plt.show()
    nn = NeuralNet()
    nn.fit_and_test(digits.images, digits.target)
    lg = LogisticRegression()
    lg.fit_and_test(digits.images, digits.target)

line 62, the predict function. We can use the function to predict result for both training and testing.

line 21~35, The cost function. Note: The cost is 10 binary cross entropy.

line 68, I also implemented a logistic regression for fun.

However, because I construct the expression tree for every number. The running time was very very slow. So, I took just a few samples for a demo.

The result is correct but it’s not very interesting. I will skip it.

4.4 GPU Operations

Recall that a deep learning model is a composition of functions. e.g.

\(cost(x) = f_3(f_2(f_1(x)))\)

A intermediate function (after reshape) is usually a mapping from a vector to another vector \(y = f_i(x) \in R^n \to R^m\).

For most cases, GPU evaluates such a function faster.

All you need to do is provide a GPU implementation for the forward and backward pass for a function.

For example,

class Mul(Operator):
    def __init__(self, a: Variable, b: Variable):
        ...
    
    def forward_cpu(self):
        ...
    
    def backward_cpu(self):
        ...
            
    def forward_gpu(self):
        ...
        
    def backward_gpu(self):
        ...

(But a good implementation needs a lot more. e.g. minimize the data copies between CPU and GPU.)

5 Random stuff

5.1 TensorRT, The expression tree, and The Compiler

After training, TensorFlow saves the result as a PB file. A PB file specifies an expression tree (DAG if you want) with node values.

The tensorRT parse the expression tree and generates optimized cuda code for execution.

It’s a classic compliation problem. Specifically,

A deep learning framework saves an expression tree as a PB file. The PB format can be thought of as a virtual machine language for expressions.
The TensorRT parses an expression and converts it into Cuda code. The Cuda code can be thought of as a virtual machine language for Nvidia GPUs.
(I guess) a Cuda compiler compiles the Cuda code into target GPU instructions.

Obviously, steps 2 & 3 can be replaced by other hardware. And I guess we can make cheaper and faster hardward specific for deep learning operations.

I am curious why there is no good replacement for Nvidia GPU yet. (I guess some companies are doing that. But why they are stuck?)

I am also curious why can’t we invent some kind of VM language for deep learning operations? Then, we can do optimization in the training time and decouple code with hardware.

The compiler is a very interesting topic.
Build a Modern Computer from First Principles: From Nand to Tetris is an AWESOME course about the compiler, computer, and program management.
It’s the best online course I’ve ever taken! highly highly recommended!
BTW, If you want to be COOL. Compilers by Standford is a better choice.

5.2 Forward auto diff

In some situation, the forward differential is faster than the backward differential.

This feature request asked for adding the forward diff into Pytorch.

However, the forward diff and the backward diff doesn’t work together. The graph traversal and the value to track are different for these two methods.

But we can add the forward diff into the Pytorch framework without too much disturbing complexity.

I don’t want to digress too much. Please read this post for details about the forward diff and how to put forward diff into a deep learning framework.

The End

code link

Good tools are critical! Caffe/Tensorflow/Pytorch developers! Thank you for enabling me to do crazy Alchemy !!!

Appendix1: The Forward Auto Diff

Forward diff is another way to compute gradient automatically. You can read this post if you are interested.

Implement Ceres Solver from Scratch

Appendix2: Derivation of the Chain rule

I found lots of people (including myself) were confused by the chain rule. Let’s derive the chain rule.

Given a function,

\(cost(x) = f_3(f_2(f_1(x))) \;\; (1)\)

Let \(z_1 = f_1(x), z_2 = f_2(z_1)\).

Assuming the current variable is at \(x = \bar{x}\). So \(\bar{z_1} = f_1(\bar{x}), \bar{z_2} = f_2(\bar{z_1})\)

We can linearize (doing first order Taylor expansion) \(f_1(x)\) at \(\bar{x}\).

\(f_1(x) \approx \frac{\partial f_1}{\partial x}|_{\bar{x}}(x – \bar{x}) + f_1(\bar{x}) \; \; (2)\)

And linearize \(f_2(z_1)\) at \(\bar{z_1}\).

\(f_2(z_1) \approx \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}}(z_1 – \bar{z_1}) + f_2(\bar{z_1})\; \; (3)\)

And linearize \(f_3(z_2)\) at \(\bar{z_2}\).

\(f_3(z_2) \approx \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}}(z_2 – \bar{z_2}) + f_3(\bar{z_2})\; \; (4)\)

Plug (2),(3),(4) into (1), and do some algebra.

\(cost(x) \approx \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}} \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial f_1}{\partial x}|_{\bar{x}} x + \text{constant}\)

w.l.o.g, let \(x = x + \Delta\).

\(cost(x + \Delta) \approx \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}} \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial f_1}{\partial x}|_{\bar{x}} (x + \Delta) + \text{constant}\)

\(cost(x + \Delta) \approx cost(x) + \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}} \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial f_1}{\partial x}|_{\bar{x}} \Delta \;\; (5)\)

By the definition of Taylor expasion,

\(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}} \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial f_1}{\partial x}|_{\bar{x}}\)

From (5), We can easily see the physical meaning of the chain rule. A perturbation \(\Delta\) is scaled by the “slope” of each intermediate function.

Implement Tensorflow/Pytorch from Scratch

Why writing this post?

What’s in this post?

0. Deep learning refresher

1. The Chain Rule

1.1 The misleading notation

1.2 The better notation

1.3 The Correct chain rule

2 The Expression Tree

3. Implementation

3.1 The Tree Implementation

Node

Build The Tree

The Forward Pass

The Backward Pass

The Core

Let’s try it!

3.2 Directed Acyclic Graph Implementation

Topological Sort

The forward & backward pass

4 Applications

4.1 Linear Regression

4.2 Logistic Regression

4.3 Neural Network

4.4 GPU Operations

5 Random stuff

5.1 TensorRT, The expression tree, and The Compiler

5.2 Forward auto diff

The End

Appendix1: The Forward Auto Diff

Appendix2: Derivation of the Chain rule

29 thoughts on “Implement Tensorflow/Pytorch from Scratch”

Leave a ReplyCancel reply

Why writing this post?

What’s in this post?

0. Deep learning refresher

1. The Chain Rule

1.1 The misleading notation

1.2 The better notation

1.3 The Correct chain rule

2 The Expression Tree

3. Implementation

3.1 The Tree Implementation

Node

Build The Tree

The Forward Pass

The Backward Pass

The Core

Let’s try it!

3.2 Directed Acyclic Graph Implementation

Topological Sort

The forward & backward pass

4 Applications

4.1 Linear Regression

4.2 Logistic Regression

4.3 Neural Network

4.4 GPU Operations

5 Random stuff

5.1 TensorRT, The expression tree, and The Compiler

5.2 Forward auto diff

The End

Appendix1: The Forward Auto Diff

Appendix2: Derivation of the Chain rule

Share this:

29 thoughts on “Implement Tensorflow/Pytorch from Scratch”

Leave a ReplyCancel reply