Why writing this post?
Deep learning changed the world!
What makes deep learning so successful?
Here are my answers.
- Simple yet power theory: By stacking linear+activation together, we can approximate any mapping!
- Rich training data: Since the deep network can approximate any mapping, we need lots of data to make sure the mapping is not overfitting.
- Tools: GPU makes the computation tractable. Auto differentiate tools make deep learning accessible.
I would say Simple yet power theory
, Rich training data
contribute to 33.3% and 33.3% to deep learning’s success. And good tools
contributes to 33.4% of the success.
I’d like to implement a auto differentiation library to pay my respect.
What’s in this post?
In this post, we will implement a auto differentiation library similar to Tensorflow/Pytorch.
Specifically, we will discuss,
- The Math Theory.
- How to do the forward pass and backward pass.
- The tree implementation.
- The Directed acyclic graph implementation.
0. Deep learning refresher
Supervised learning is simply function approximation.
Given a set of input, output pairs \(\{ (x_0, y_0), (x_1, y_1), .. \}\), we want to learn a mapping f. It is usually formulated as an optimization.
\(\text{cost}_{\theta}(f) = \sum_i ||f_{\theta}(x_i) – y_i||\)
In English, we want to find a function f parameterized by \(\theta\) that best approximates the mapping between input x and output y.
There are many ways to do the function approximation. Deep learning is one of them.
Why deep learning performs well? Because a deep learning model is able to approximate any function.
The deep learning model can be viewed as a composition of simple nonlinear functions.
For example,
\(f(x) = f3_{\theta_1}(f1_{\theta_2}(f1_{\theta_1}(x)))\)
\(f(x)\) is composed of 3 nonlinear functions which are parameterized by \(\theta_i\).
To optimize the best \(f(x)\), people use the chain rule to compute the gradient w.r.t \(\theta_i\) and do the gradient descent.
1. The Chain Rule
1.1 The misleading notation
I searched the chain rule and back-propagation online. Here are what I found.
Chain rule 1.
Chain rule 2
Chain rule 3
There are correct math notation. But it’s NOT what backpropagation is doing.
1.2 The better notation
Given a function,
\(cost(x) = f_3(f_2(f_1(x)))\)
Assuming the current variable is at \(x = \bar{x}\).
We are interested in minimize the cost by updating \(x\). A obvious way is doing the gradient descent.
For the gradient descent, we want to do,
\(\text{next x} = \bar{x} – k \frac{\partial cost}{\partial x}|_{x=\bar{x}}\)
\(\frac{\partial cost}{\partial x}|_{x=\bar{x}}\) means the derivative of cost w.r.t x evaluated at the current variable \(\bar{x}\).
How to compute the gradient?
Let \(z_1 = f_1(x), z_2 = f_2(z_1)\). So \(\bar{z_1} = f_1(\bar{x}), \bar{z_2} = f_2(\bar{z_1})\)
The chain rule says,
\(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}} \)
How to compute the expression?
Backpropagation does it this way:
- Given \(x = \bar{x}\), compute \(\bar{z_1} = f_1(\bar{x}), \bar{z_2} = f_2(\bar{z_1})\).
- With \(\bar{z_2}, \bar{z_1}, \bar{x}\), we can evaluate \(\frac{\partial cost}{\partial z_2}|_{\bar{z_2}}\). Then \(\frac{\partial cost}{\partial z_1}|_{\bar{z_1}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \). Then \(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial cost}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}}\).
The step 1 is the forward pass. The step 2 is the backward pass.
In a picture,
This lead to an implementation of the Backpropagation.
- For each intermidate function \(z_i = f(z_{i-1})\). We need to track the dependencies. e.g. output \(z_i\) and input \(z_{i-1}\)
- In the forward pass, given inputs \(\bar{z_{i-1}}\), we compute the outputs \(\bar{z_{i}}\).
- In the backward pass, given the derivative of cost w.r.t output \(\frac{\partial cost}{\partial z_i}|_{\bar{z_i}}\). We evaluate the derivation of the cost w.r.t input by the chain rule \(\frac{\partial cost}{\partial z_{i-1}}|_{\bar{z{i-1}}} = \frac{\partial cost}{\partial z_i}|_{\bar{z_i}} \frac{\partial z_i}{\partial z{i-1}}|_{\bar{z{i-1}}}\).
To support this, we can define a type for optimization variables. In the type, we track the variable’s value and the derivative of cost w.r.t to the variable. We also need to track the variable’s dependencies.
e.g. For a variable \(z_i\), we save it’s value \(\bar{z_i}\) and \(\frac{\partial cost}{\partial z_i}|_{\bar{z_i}}\). And we tracks it’s dependencies (parents and children).
class Node: def __init__(self, id): self.id = id # parent operators self.parents = [] # child operator self.children = [] class Variable(Node): def __init__(self, value=0, id='', ntype='opt-varible'): super().__init__(id) self.value = value self.grad = 0 # We need to differentiate variable types # "opt-varible", "const", "intermediate" self.ntype = ntype ...
1.3 The Correct chain rule
In Backpropagation, we are doing
\(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial cost}{\partial z_2}|_{\bar{z_2}} \frac{\partial z_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial z_1}{\partial x}|_{\bar{x}} \)
Instead of
\(\frac{\partial cost}{\partial x} = \frac{\partial cost}{\partial z_2} \frac{\partial z_2}{\partial z_1} \frac{\partial z_1}{\partial x} \)
Assuming \(x \in R^n\) and \(cost \in R^1\)
The formal expression is a value \( \frac{\partial cost}{\partial x}|_{ \bar{x}} \in R^{1 \times n}\).
The later expression is a function \( \frac{\partial cost}{\partial x} \in R^1 \to R^{1 \times n}\).
2 The Expression Tree
In the previous section, we discuss the chain rule in a list-like example. In general, a math expression can be represented as a directed acyclic graph, which is equivalent to a tree.
For example, consider this expression.
\(c = x + x y\)
We can represent the expression as a tree or a graph
Personally, I prefer to view the expression as a tree since it makes the analysis easier.
(A one-line summary of the graph theory: \(f = AAAA…AAAx\). It’s an expression!)
For example, assume we did the forward pass (\(\bar{x}\) means the current value of \(x\)), the Backpropagation for a tree look like this,
- For each intermedia function \(y_i = f(x_i, z_i)\),
- The forward pass: traversal the tree from leaves to root to compute the current value \(\bar{y_i} = f(\bar{x_i}, \bar{z_i})\).
- The backward pass: traversal the tree from root to leaves and use the chain rule to compute \(\frac{\partial cost}{\partial x_i}|_{\bar{x_i}} = \frac{\partial cost}{\partial y_i}|_{\bar{y_i}} \frac{\partial y_i}{\partial x_i}|_{\bar{x_i}}\) and \(\frac{\partial cost}{\partial z_i}|_{\bar{z{i}}} = \frac{\partial cost}{\partial y_i}|_{\bar{y_i}} \frac{\partial y_i}{\partial z{i}}|_{\bar{z{i}}}\)
(Note, we need to use \(+= \) to update the gradient since there can be multiple paths from the root to a variable. I ignore it for simplicity.)
Alright, now we have the big picture of how to implement the auto Backpropagation.
We want to:
compute the gradient of cost w.r.t all variables \(\frac{\partial cost}{\partial x}|_{x=\bar{x}}\) so that we can do the gradient descent.
To compute the gradient we need to,
- Track the dependencies of all variables.
- Two kinds of nodes: variable node and operator node.
- Forward pass: traverse the tree from leaves to root. Perform expression evaluations.
- Backward pass: traverse the tree from root to leaf. Using the chain rule to compute the gradients.
3. Implementation
3.1 The Tree Implementation
Node
A node is a node in a tree.
class Node: def __init__(self, id): self.id = id # parent operators self.parents = [] # child operator self.children = []
We have 2 kinds of node. The variable node and the operator node.
The variable node.
class Variable(Node): def __init__(self, value=0, id='', ntype='opt-varible'): super().__init__(id) self.value = value self.grad = 0 # "opt-varible", "const", "intermediate" self.ntype = ntype def __add__(self, other): if isinstance(other, numbers.Number): other = Variable(value=other, ntype='const') plus = Plus(self, other) self.parents.append(plus) other.parents.append(plus) plus.result.children = [plus] return plus.result def __radd__(self, other): ... def __sub__(self, other): if isinstance(other, numbers.Number): other = Variable(value=other, ntype='const') neg_other = - other return self + neg_other def __rsub__(self, other): ... def __mul__(self, other): if isinstance(other, numbers.Number): other = Variable(value=other, ntype='const') mul = Mul(self, other) self.parents.append(mul) other.parents.append(mul) mul.result.children = [mul] return mul.result def __rmul__(self, other): ... def __neg__(self): neg = Neg(self) self.parents.append(neg) neg.result.children = [neg] return neg.result
For a variable node, we implement some common operators like “+”, “-“, “*”.
And we have Operators.
class Operator(Node): def __init__(self, id): super().__init__(id) def forward(self): raise NotImplementedError("Should have implemented this") def backward(self): raise NotImplementedError("Should have implemented this") class Plus(Operator): def __init__(self, a: Variable, b: Variable): super().__init__("({})+({})".format(a.id, b.id)) self.a = a self.b = b self.result = Variable(ntype="intermediate", id="res:{}".format(self.id)) self.children = [a, b] self.parents = [self.result] def forward(self): self.result.value = self.a.value + self.b.value def backward(self): if self.a is self.b: self.a.grad += 2 * self.result.grad else: self.a.grad += self.result.grad self.b.grad += self.result.grad class Mul(Operator): def __init__(self, a: Variable, b: Variable): super().__init__("({})*({})".format(a.id, b.id)) self.a = a self.b = b self.result = Variable(ntype="intermediate", id="res:{}".format(self.id)) self.children = [a, b] self.parents = [self.result] def forward(self): self.result.value = self.a.value * self.b.value def backward(self): if self.a is self.b: self.a.grad += 2 * self.result.grad * self.a.value else: self.a.grad += self.result.grad * self.b.value self.b.grad += self.result.grad * self.a.value class Neg(Operator): ...
In a forward pass, the operator takes its children’s current values, do some math, put the result in its parent variable.
In a backward pass, the operator takes its parent’s gradient, do the chain rule, and put the result gradient in its children.
Build The Tree
The implementation build the tree implicitly using python’s build-in operator.
class Variable(Node): def __init__(self, value=0, id='', ntype='opt-varible'): super().__init__(id) self.value = value self.grad = 0 # "opt-varible", "const", "intermediate" self.ntype = ntype def __add__(self, other): if isinstance(other, numbers.Number): other = Variable(value=other, ntype='const') plus = Plus(self, other) self.parents.append(plus) other.parents.append(plus) plus.result.children = [plus] return plus.result ... def __mul__(self, other):
For example, we implemented the __add__
build-in function.
We connect a Plus
operator and connect it to the current variable self
and the other variable other
. We return the result of the Plus
operator.
We just build a tiny tree!
result | + / \ self other
How to build the tree for \(c = x + xy\)?
x = Variable(value=1., id='t{}'.format('x')) y = Variable(value=1., id='t{}'.format('y')) c = x + x * y
x * y
created a intermediate variable t
. c = x + t
creates the tree.
Now, we built the expression tree for the cost function expression. The next step is to evaluate the expression by the forward pass.
The Forward Pass
The forward pass can be implemented as a recursive tree traversal from leaf to root.
# simple version def __forward_recursive_simple(self): def evaluate(node): for child in node.children: evaluate(child) if isinstance(node, Operator): node.forward() evaluate(self.cost_node)
When the node is an Operator
, we call its forward
function. The forward function reads from inputs variables and saves result to the output variable.
When the node is a variable, we do nothing.
Note the actual implementation handles cycle detection and memorization. Please read the source code for details.
If you are confused, please do this Leetcode esay question.
The Backward Pass
The backward pass can be implemented as a recursive tree traversal from root to leaf.
def backward(self): def backward_dfs(node): if isinstance(node, Operator): node.backward() for child in node.children: backward_dfs(child) self.cost_node.grad = 1 backward_dfs(self.cost_node)
We simply call the backward
function for all operators from root to leaf.
The backward function reads the derivative of cost w.r.t to output and compute the derivative w.r.t input variables by the chain rule.
The Core
We can group operations on a tree together. Let’s call it the NumberFlowCore
The NumberFlowCore
take a tree root (the cost node) as input.
The operations in the NumberFlowCore
are,
- The forward pass
- The backward pass
- Clear gradient for all variables
- Gradient descent
class NumberFlowCore: def __init__(self, cost_node): self.cost_node = cost_node def __forward_recursive(self): ... def evaluate(node): ... for child in node.children: evaluate(child) if isinstance(node, Operator): node.forward() ... evaluate(self.cost_node) def backward(self): def backward_dfs(node): if isinstance(node, Operator): node.backward() for child in node.children: backward_dfs(child) self.cost_node.grad = 1 backward_dfs(self.cost_node) def clear_grad(self): def clear_grad_dfs(node): if isinstance(node, Variable): node.grad = 0 for child in node.children: clear_grad_dfs(child) clear_grad_dfs(self.cost_node) def gradient_desent(self, rate=0.01): for var in self.varible_nodes: var.value -= rate * var.grad
Alright, this it!
The library build a tree and track the value/gradient flow. To honestly capture what the lib is doing, let me call the lib variables_tree_flow
.
Let’s try it!
Consider a simple optimization problem.
\(\text{cost}(x,y) = ||x + xy||^2\)
We can solve the problem using this code.
x = Variable(value=1., id='x') y = Variable(value=1., id='y') c = x + x * y cost = c * c core = NumberFlowCore(cost) for i in range(100): if i % 10 == 0: print("cost.val:", cost.value, " iter:", i) core.forward() core.clear_grad() core.backward() core.gradient_desent(rate=0.01) if abs(cost.value) < 1e-8: print('break', cost.value) break for var in core.varible_nodes: print(var.id, var.value)
line 1~6, build the expression (tree) for the cost expression.
line 9~16, the gradient descent algorithm.
Result:
(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/deep-learning/tensorflow-from-scratch/test.py cost.val: 0 iter: 0 cost.val: 0.3720476003288574 iter: 10 cost.val: 0.0913675588756892 iter: 20 cost.val: 0.023025588582097953 iter: 30 cost.val: 0.005812055177713739 iter: 40 cost.val: 0.0014672133989577039 iter: 50 cost.val: 0.0003703903644286424 iter: 60 cost.val: 9.350315320913633e-05 iter: 70 cost.val: 2.360439345048723e-05 iter: 80 cost.val: 5.9588085747461865e-06 iter: 90 x 7.992086857341993e-07 y 0.6401459158392562
3.2 Directed Acyclic Graph Implementation
We can also view an expression tree as a directed acyclic graph.
An acyclic graph encodes the dependency for variables. The dependency is captured by a directed edge of the graph. To compute the value of a node, we need to compute all nodes “in order”.
We use an algorithm called Topological sort to find the order.
Topological Sort
Go back to the example.
\(c = x + xy\)
After the topological sort, the order for all nodes are \(x,y,*,t,+,c\). It means to compute the value of c, we need to evaluate nodes in the topological order.
In the context of the backpropogation, we follow the topological order for the forward pass and follow the reversed order for the backward pass.
I won’t discuss the detail of the algorithm. Please do this leetcode medium question or read this article for details.
The “in-degree” topological sort implementation.
(BTW, your interviewer expect you memorize the question and code it in 15 minites)
def __topological_sort(self): zero_degree_nodes = [] indegree = defaultdict(int) for node in self.all_nodes: indegree[node] = len(node.children) if len(node.children) == 0: zero_degree_nodes.append(node) topo_order = [] while zero_degree_nodes: next_zero_degree_nodes = [] for znode in zero_degree_nodes: topo_order.append(znode) for p in znode.parents: indegree[p] -= 1 if indegree[p] == 0: next_zero_degree_nodes.append(p) zero_degree_nodes = next_zero_degree_nodes if len(topo_order) != len(self.all_nodes): raise RuntimeError("cycle found!") self.topologic_order = topo_order
The forward & backward pass
def __forward_iterative(self): for node in self.topologic_order: if isinstance(node, Operator): node.forward() def __backward_iter(self): self.cost_node.grad = 1. for node in reversed(self.topologic_order): if isinstance(node, Operator): node.backward()
The forward and backward pass simply loop through operators in order.
The graph implementation of backpropagation is faster since there is no recursion overhead. But a downside could be we have to redo the topological sort whenever the graph gets updated.
A interesting trade-off for libraray developers.
Alright. 300 lines of python code for 2 implementations. Let’s use it for some applications.
4 Applications
We can implement a couple of classical problems using the library.
4.1 Linear Regression
Let’s use the library to solve the Linear regression problem by gradient descent.
\(\text{cost}(\Theta) = ||\Theta x – b ||^2\)
This is basically the linear regression example from sklearn. There are two options for linear regression.
- The sklearn build-in solver. I believe it bases on least squares (2nd order Newton).
- Using the
variable-tree-flow
to do gradient descent.
The main function. It reads data and calls the library.
# this is basically a sklearn linear regression example. # change the METHOD to see difference. from sklearn.metrics import mean_squared_error, r2_score from sklearn import datasets, linear_model import numpy as np import matplotlib.pyplot as plt if __name__ == "__main__": # 'sklearn' or 'vtf' METHOD = 'vtf' # Create linear regression object if METHOD == 'sklearn': regr = linear_model.LinearRegression() elif METHOD == 'vtf': regr = LinearRegression() # Load the diabetes dataset diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True) # Use only one feature diabetes_X = diabetes_X[:, np.newaxis, 2] # Split the data into training/testing sets diabetes_X_train = diabetes_X[:-400] diabetes_X_test = diabetes_X[-400:] # Split the targets into training/testing sets diabetes_y_train = diabetes_y[:-400] diabetes_y_test = diabetes_y[-400:] # Train the model using the training sets regr.fit(diabetes_X_train, diabetes_y_train) # Make predictions using the testing set diabetes_y_pred = regr.predict(diabetes_X_test) # # The coefficients print('Coefficients:', regr.coef_) print('Intercept:', regr.intercept_) # Plot outputs plt.scatter(diabetes_X_test, diabetes_y_test, color='black') plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3) plt.xticks(()) plt.yticks(()) plt.show()
The implementation of the linear regression using the home made library.
import variables_tree_flow as vtf import numpy as np class LinearRegression: def __init__(self): pass def fit(self, X, y): assert len(X) == len(y) self.data_num = len(X) self.len_theta = X.shape[1] theta = np.array([vtf.Variable(value=0, id='t{}'.format(i)) for i in range(self.len_theta)]) bias = vtf.Variable(value=0, id='b') pred = X @ theta + bias diff = pred - y cost = diff.T @ diff * (1 / len(X)) core = vtf.NumberFlowCore(cost) for i in range(10000): core.forward() if i % 500 == 0: print("cost.val:", cost.value, " iter:", i) core.clear_grad() core.backward() core.gradient_desent(rate=0.05) for var in set(core.varible_nodes): print(var.id, var.value) self.coef_ = np.array([t.value for t in theta]) self.intercept_ = bias.value def predict(self, X): res = X @ self.coef_ + self.intercept_ return res
Line 11~20, build the expression tree for the cost.
Please note that Numpy supports arbitrary types as long as they support math operations. So, we can use Numpy for operations like matrix multiplications for the variables_tree_flow
.Variable
type.
Line 22~29, the gradient descent loop. forward pass, backward pass and gradient descent.
Result:
(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/deep-learning/tensorflow-from-scratch/linear_regression.py cost.val: 26645.642857142855 iter: 0 cost.val: 4617.747728745511 iter: 500 cost.val: 4208.370338581641 iter: 1000 cost.val: 4033.241331959609 iter: 1500 cost.val: 3958.3222726824383 iter: 2000 cost.val: 3926.2723824258806 iter: 2500 cost.val: 3912.5616461276954 iter: 3000 cost.val: 3906.6962815871752 iter: 3500 cost.val: 3904.187116474157 iter: 4000 cost.val: 3903.113711825411 iter: 4500 cost.val: 3902.6545162412012 iter: 5000 cost.val: 3902.4580753325445 iter: 5500 cost.val: 3902.374039179059 iter: 6000 cost.val: 3902.338089054714 iter: 6500 cost.val: 3902.322709823434 iter: 7000 cost.val: 3902.3161306875027 iter: 7500 cost.val: 3902.313316175593 iter: 8000 cost.val: 3902.3121121455547 iter: 8500 cost.val: 3902.3115970692725 iter: 9000 cost.val: 3902.311376722961 iter: 9500 t0 886.9538881643437 b 148.09904663418547 Coefficients: [886.95388816] Intercept: 148.09904663418547
4.2 Logistic Regression
The function to learn,
\(p_i(\theta) = \frac{1}{1 + e^{\theta x_i}}\)
The cost,
\(\text{cost}(\theta) = \sum_i( -y_i log(p_i(\theta)) – (1 – y_i)log(1 – p_i(\theta)) )\)
To support logistic regression, we need 2 operators. The Sigmoid and the Log.
class Sigmoid(Operator): def __init__(self, a: Variable): super().__init__("sigmoid({})".format(a.id)) self.a = a self.result = Variable(ntype="intermediate", id="res:{}".format(self.id)) self.children = [a] self.parents = [self.result] def forward(self): if abs(self.a.value) > 50: self.a.value = np.sign(self.a.value) * 50 expo = math.exp(self.a.value) self.result.value = expo / (1 + expo) def backward(self): expo = math.exp(self.a.value) sigmoid = expo / (1 + expo) sigmoid_grad = sigmoid * (1 - sigmoid) self.a.grad += self.result.grad * sigmoid_grad class Log(Operator): def __init__(self, a: Variable): super().__init__("log({})".format(a.id)) self.a = a self.result = Variable(ntype="intermediate", id="res:{}".format(self.id)) self.children = [a] self.parents = [self.result] def forward(self): # utils.traverse_tree(self) self.result.value = math.log(self.a.value) def backward(self): self.a.grad += self.result.grad * (1. / self.a.value)
Note that I handled the numerical issue in Sigmoid by clamping which is suboptimal (line 10).
We also need to write the “interface” function to build the tree. It’s easy, please read the code for details.
We can do the Logistic regression on the iris dataset.
from sklearn.datasets import load_iris from sklearn.linear_model import LogisticRegression import numpy as np import variables_tree_flow as vtf class MyLogisticRegression: def fit(self, X, y): return self.__fit_numpy_impl(X, y) def __fit_numpy_impl(self, X, y): assert len(X) == len(y) # bias ones = np.ones((X.shape[0], 1)) X = np.hstack([X, ones]) self.data_num = len(X) self.num_classes = max(y) + 1 self.len_theta = X.shape[1] theta = np.array([vtf.Variable(value=0., id='t{}'.format(i)) for i in range(self.len_theta * self.num_classes)]).reshape([self.len_theta, self.num_classes]) linear_comb = X @ theta pred = np.vectorize(vtf.ntf_sigmoid)(linear_comb) cost = 0 for pred_row, gt_class in zip(pred, y): for class_idx, pred_class in enumerate(pred_row): # idx encoding # multiple logistic regression to handle multiclasses. if class_idx == gt_class: cost += - vtf.ntf_log(pred_class) else: cost += - vtf.ntf_log(1 - pred_class) cost = cost * (1 / self.data_num) core = vtf.NumberFlowCore(cost) for i in range(2000): core.forward() if i % 200 == 0: print("cost.val:", cost.value, " iter:", i) core.clear_grad() core.backward() core.gradient_desent(rate=0.05) self.coef_ = np.array([[theta.value for theta in row] for row in theta]) def predict(self, X): res = X @ self.coef_[:-1, :] + self.coef_[-1, :] return np.argmax(res, axis=1) if __name__ == "__main__": X, y = load_iris(return_X_y=True) mlg = MyLogisticRegression() mlg.fit(X, y) pred = mlg.predict(X) print('pred:', pred) print('y_gt:', y) print('train accuracy:', np.sum((y == pred)) / len(y))
line 21~33, build the expression for the cost.
line 35~44, gradient descent.
BTW, I ignore the training/testing split.
result
(base) yimu@yimu-dell:~/Desktop/yimu-blog$ /usr/bin/python3 /home/yimu/Desktop/yimu-blog/deep-learning/tensorflow-from-scratch/logistic_regression.py cost.val: 2.0794415416798406 iter: 0 cost.val: 0.9165179539616499 iter: 200 cost.val: 0.8204635597223325 iter: 400 cost.val: 0.7693105578987969 iter: 600 cost.val: 0.7356343790416037 iter: 800 cost.val: 0.7114761316284721 iter: 1000 cost.val: 0.6932389174928136 iter: 1200 cost.val: 0.6789656571543744 iter: 1400 cost.val: 0.6674786970513418 iter: 1600 cost.val: 0.658022175128815 iter: 1800 pred: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] y_gt: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] train accuracy: 0.9666666666666667
4.3 Neural Network
We can use the lib to build a Neural Net.
Here is a implementation for the MNIST dataset.
from sklearn.datasets import load_iris from sklearn.linear_model import LogisticRegression from sklearn import datasets import numpy as np import variables_tree_flow as vtf import matplotlib.pyplot as plt import random class NeuralNet: def fit_and_test(self, images, labels): assert len(images) == len(labels) num_data = len(labels) len0 = 64 len1 = 10 len2 = 10 self.theta0 = np.array([vtf.Variable(value=random.gauss(0, 0.01), id='t{}'.format(i)) for i in range((len0 + 1) * len1)]).reshape([len1, (len0 + 1)]) self.theta1 = np.array([vtf.Variable(value=random.gauss(0, 0.01), id='t{}'.format(i)) for i in range((len1 + 1) * len2)]).reshape([len2, (len1 + 1)]) cost = 0 for i, (im, label) in enumerate(zip(images, labels)): if i == 2: break pred = self.__predict_func(im) assert(len(pred) == 10) cost_this = 0 for idx, pred_val in enumerate(pred): # idx encoding # multiple logistic regression to handle multiclasses. if idx == label: cost_this += - vtf.ntf_log(pred_val) else: cost_this += - vtf.ntf_log(1 - pred_val) cost += cost_this * (1 / num_data) core = vtf.NumberFlowCore(cost) for i in range(10000): core.forward() if i % 100 == 0: print("cost.val:", cost.value, " iter:", i) core.clear_grad() core.backward() core.gradient_desent(rate=0.05) # replace graph node type by np array self.theta0 = [[e.value for e in r] for r in self.theta0] self.theta1 = [[e.value for e in r] for r in self.theta1] for i in range(2): print(self.predict(images[i])) print('gt:', labels[i]) def __predict_func(self, data): x0 = data.reshape(-1) x0b = np.hstack([x0, 1]) z0 = self.theta0 @ x0b x1 = np.vectorize(vtf.ntf_sigmoid)(z0) x1b = np.hstack([x1, 1]) z1 = self.theta1 @ x1b x2 = np.vectorize(vtf.ntf_sigmoid)(z1) return x2 def predict(self, data): print('========================') pred = self.__predict_func(data) print('pred:', pred) return np.argmax(pred, axis=0) class LogisticRegression: ... if __name__ == "__main__": # The digits dataset digits = datasets.load_digits() # The data that we are interested in is made of 8x8 images of digits, let's # have a look at the first 4 images, stored in the `images` attribute of the # dataset. If we were working from image files, we could load them using # matplotlib.pyplot.imread. Note that each image must have the same size. For these # images, we know which digit they represent: it is given in the 'target' of # the dataset. _, axes = plt.subplots(2, 4) images_and_labels = list(zip(digits.images, digits.target)) for ax, (image, label) in zip(axes[0, :], images_and_labels[:4]): ax.set_axis_off() ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest') ax.set_title('Training: %i' % label) plt.show() nn = NeuralNet() nn.fit_and_test(digits.images, digits.target) lg = LogisticRegression() lg.fit_and_test(digits.images, digits.target)
line 62, the predict function. We can use the function to predict result for both training and testing.
line 21~35, The cost function. Note: The cost is 10 binary cross entropy.
line 68, I also implemented a logistic regression for fun.
However, because I construct the expression tree for every number. The running time was very very slow. So, I took just a few samples for a demo.
The result is correct but it’s not very interesting. I will skip it.
4.4 GPU Operations
Recall that a deep learning model is a composition of functions. e.g.
\(cost(x) = f_3(f_2(f_1(x)))\)
A intermediate function (after reshape) is usually a mapping from a vector to another vector \(y = f_i(x) \in R^n \to R^m\).
For most cases, GPU evaluates such a function faster.
All you need to do is provide a GPU implementation for the forward and backward pass for a function.
For example,
class Mul(Operator): def __init__(self, a: Variable, b: Variable): ... def forward_cpu(self): ... def backward_cpu(self): ... def forward_gpu(self): ... def backward_gpu(self): ...
(But a good implementation needs a lot more. e.g. minimize the data copies between CPU and GPU.)
5 Random stuff
5.1 TensorRT, The expression tree, and The Compiler
After training, TensorFlow saves the result as a PB file. A PB file specifies an expression tree (DAG if you want) with node values.
The tensorRT parse the expression tree and generates optimized cuda code for execution.
It’s a classic compliation problem. Specifically,
- A deep learning framework saves an expression tree as a PB file. The PB format can be thought of as a virtual machine language for expressions.
- The TensorRT parses an expression and converts it into Cuda code. The Cuda code can be thought of as a virtual machine language for Nvidia GPUs.
- (I guess) a Cuda compiler compiles the Cuda code into target GPU instructions.
Obviously, steps 2 & 3 can be replaced by other hardware. And I guess we can make cheaper and faster hardward specific for deep learning operations.
I am curious why there is no good replacement for Nvidia GPU yet. (I guess some companies are doing that. But why they are stuck?)
I am also curious why can’t we invent some kind of VM language for deep learning operations? Then, we can do optimization in the training time and decouple code with hardware.
Build a Modern Computer from First Principles: From Nand to Tetris is an AWESOME course about the compiler, computer, and program management.
It’s the best online course I’ve ever taken! highly highly recommended!
BTW, If you want to be COOL. Compilers by Standford is a better choice.
5.2 Forward auto diff
In some situation, the forward differential is faster than the backward differential.
This feature request asked for adding the forward diff into Pytorch.
However, the forward diff and the backward diff doesn’t work together. The graph traversal and the value to track are different for these two methods.
But we can add the forward diff into the Pytorch framework without too much disturbing complexity.
I don’t want to digress too much. Please read this post for details about the forward diff and how to put forward diff into a deep learning framework.
The End
Good tools are critical! Caffe/Tensorflow/Pytorch developers! Thank you for enabling me to do crazy Alchemy !!!
Appendix1: The Forward Auto Diff
Forward diff is another way to compute gradient automatically. You can read this post if you are interested.
Appendix2: Derivation of the Chain rule
I found lots of people (including myself) were confused by the chain rule. Let’s derive the chain rule.
Given a function,
\(cost(x) = f_3(f_2(f_1(x))) \;\; (1)\)
Let \(z_1 = f_1(x), z_2 = f_2(z_1)\).
Assuming the current variable is at \(x = \bar{x}\). So \(\bar{z_1} = f_1(\bar{x}), \bar{z_2} = f_2(\bar{z_1})\)
We can linearize (doing first order Taylor expansion) \(f_1(x)\) at \(\bar{x}\).
\(f_1(x) \approx \frac{\partial f_1}{\partial x}|_{\bar{x}}(x – \bar{x}) + f_1(\bar{x}) \; \; (2)\)
And linearize \(f_2(z_1)\) at \(\bar{z_1}\).
\(f_2(z_1) \approx \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}}(z_1 – \bar{z_1}) + f_2(\bar{z_1})\; \; (3)\)
And linearize \(f_3(z_2)\) at \(\bar{z_2}\).
\(f_3(z_2) \approx \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}}(z_2 – \bar{z_2}) + f_3(\bar{z_2})\; \; (4)\)
Plug (2),(3),(4) into (1), and do some algebra.
\(cost(x) \approx \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}} \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial f_1}{\partial x}|_{\bar{x}} x + \text{constant}\)
w.l.o.g, let \(x = x + \Delta\).
\(cost(x + \Delta) \approx \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}} \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial f_1}{\partial x}|_{\bar{x}} (x + \Delta) + \text{constant}\)
\(cost(x + \Delta) \approx cost(x) + \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}} \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial f_1}{\partial x}|_{\bar{x}} \Delta \;\; (5)\)
By the definition of Taylor expasion,
\(\frac{\partial cost}{\partial x}|_{ \bar{x}} = \frac{\partial f_3}{\partial z_2}|_{\bar{z_2}} \frac{\partial f_2}{\partial z_1}|_{\bar{z_1}} \frac{\partial f_1}{\partial x}|_{\bar{x}}\)
From (5), We can easily see the physical meaning of the chain rule. A perturbation \(\Delta\) is scaled by the “slope” of each intermediate function.
29 thoughts on “Implement Tensorflow/Pytorch from Scratch”