$$\DeclareMathOperator*\argmax{argmax}$$

### Theoretical exercises

1. Consider a neural network with the synaptic weight matrix $\begin{pmatrix} 0 &1 &1 &1\cr 0 &0 &1 &0\cr 0 &0 &0 &0\cr 0 &1 &1 &0 \end{pmatrix}$ We follow the convention that $W_{ij}$ denotes the strength of the $i\leftarrow j$ connection, and $0$ strength means there is no connection. Show that this is a feedforward net.

• Show it using a diagram of the network graph.

• Show it by performing a topological sort. Your answer should include the sorting permutation, as well as the weight matrix after applying the permutation.

2. Using LT neurons, construct a multilayer perceptron (MLP) that computes the sum of three binary inputs $x_{1}$, $x_{2}$, $x_{3}$. The two MLP outputs $y_{1}$ and $y_{2}$ should represent the two bits of the sum when written as a binary number, i.e., 00, 01, 10, 11.

3. Backprop for a general feedforward net. In class, we derived the backpropagation algorithm for a multilayer perceptron. Here we will generalize to any feedforward net. Write the forward pass as the iteration of $$x_i = f\left(\sum_j W_{ij}x_j + b_i\right) \label{eq:ForwardPass}$$ for $i=1$ to $N$, where $N$ is the number of neurons. Here assume that $W_{ij}$ is topologically sorted so that it is nonzero only for $i>j$.

• Write $\vec{b}$ as a function of $\vec{x}$. (This is the inverse of the normal mode of network operation, that $\vec{x}$ is a function of $\vec{b}$. The inverse function can be interpreted in the following way. Suppose that you have complete control over $\vec{b}$, and you want to make the activity of the network take on a given value $\vec{x}$. The output of the $\vec{x}\to\vec{b}$ function tells you how you should set $\vec{b}$. The inverse function is mathematically useful for deriving backprop, even if you never use the network in this way. For simplicity assume that the activation function $f$ is monotone increasing, so that its inverse $f^{-1}$ is well-defined.)

• Define an online loss function $e(\vec{x})$. Applying the chain rule to the composition of functions $\vec{x}\to\vec{b}\to e$ yields $$\frac{\partial e}{\partial x_j} = \sum_i \frac{\partial e}{\partial b_i}\frac{\partial b_i}{\partial x_j}$$ Use this to derive the backward pass as the iteration of $$\frac{\partial e}{\partial b_j} = f’(f^{-1}(x_j)) \left[\frac{\partial e}{\partial x_j} + \sum_i \frac{\partial e}{\partial b_i} W_{ij}\right]$$ from $j=N$ to $1$.

• Suppose that the loss function is $$e(\vec{x}) = \frac{1}{2} \sum_{i\in S} (d_i - x_i)^2$$ where $S$ is a subset of $\{1,\ldots,N\}$. What is $-\partial e/\partial x_j$? (To answer correctly, you will have to think about the meaning of the chain rule for $\vec{x}\to\vec{b}\to e$. If your answer is correct, you will be able to see how backprop for a multilayer perceptron follows from the above as a special case where $S$ is the set of neurons in the output layer. Meditate on this for yourself; you don’t need to submit anything. You will need the sensitivity lemma $\partial e/\partial W_{ij} = \left(\partial e/\partial b_i\right)x_j$.)

### Programming exercises

1. Classifying MNIST images with a multilayer perceptron. If you haven’t already done so, refresh your code repo. In JuliaBox, you can do this by selecting “Console”, followed by these commands.
$cd ~/code$ git pull

Return to the Notebook Dashboard and select Backprop.ipynb.

• The sample code trains a two-layer perceptron with 10 output neurons. The code displays the first layer weight vectors, the input vector, and a learning curve. Run the code and report the final value of the time-averaged classification error.

• Modify the code to use hyperbolic tangent as the activation function. Here it may be appropriate to change the desired values for the output neurons. Run the code to show that it works.

• Modify the code to use rectification (ReLU) as the activation function. Run the code to show that it works.

• Modify the code to add a third layer. Run the code to show that it works.

For each modification of the code, you should create a separate cell in your notebook and generate a plot. You may need to fiddle with the learning rate parameter eta to get the training to converge. As an optional exercise, modify the last layer by removing the final activation function $f$ and replacing it by softmax. (Don’t submit this.)