\( \DeclareMathOperator*\argmax{argmax} \)

Due at 3pm on Monday, Feb. 20. Please submit to Blackboard and follow the general guidelines regarding homework assignments.

Theoretical exercises

  1. Consider a neural network with the synaptic weight matrix \[ \begin{pmatrix} 0 &1 &1 &1\cr 0 &0 &1 &0\cr 0 &0 &0 &0\cr 0 &1 &1 &0 \end{pmatrix} \] We follow the convention that \(W_{ij}\) denotes the strength of the connection from neuron \(j\) to neuron \(i\), and strength \(0\) means there is no connection. Show that this is a feedforward net.

    • Show it using a diagram of the network graph.

    • Show it by performing a topological sort. Your answer should include the sorting permutation, as well as the weight matrix after applying the permutation.
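
    As an optional sanity check (nothing to submit), the topological sort can also be done programmatically. The Julia sketch below uses a Kahn-style sort on the connection graph implied by the convention above (\(W_{ij} \neq 0\) means a connection from neuron \(j\) to neuron \(i\)); the function and variable names are illustrative only, not part of the assignment.
    # Optional sketch: topologically sort the neurons of a weight matrix.
    # Convention: W[i, j] != 0 means a connection from neuron j to neuron i.
    function topological_order(W)
        n = size(W, 1)
        indeg = [count(x -> x != 0, W[i, :]) for i in 1:n]   # number of inputs to neuron i
        order = Int[]
        ready = [i for i in 1:n if indeg[i] == 0]            # neurons with no remaining inputs
        while !isempty(ready)
            j = popfirst!(ready)
            push!(order, j)
            for i in 1:n                                      # remove the edges j -> i
                if W[i, j] != 0
                    indeg[i] -= 1
                    indeg[i] == 0 && push!(ready, i)
                end
            end
        end
        length(order) == n || error("the graph has a cycle, so the net is not feedforward")
        return order
    end

    W = [0 1 1 1;
         0 0 1 0;
         0 0 0 0;
         0 1 1 0]
    p = topological_order(W)   # the sorting permutation
    Wp = W[p, p]               # permuted weight matrix; strictly lower triangular for a feedforward net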

  2. Using LT neurons, construct a multilayer perceptron (MLP) that computes the sum of three binary inputs \(x_1\), \(x_2\), \(x_3\). The two MLP outputs \(y_1\) and \(y_0\) should represent the two bits of the sum when written as a binary number \(y_1 y_0\), i.e., 00, 01, 10, 11.
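
    For intuition (this is not the required construction, and the names here are illustrative): the high bit \(y_1\) equals 1 exactly when at least two of the inputs are on, which a single LT neuron with unit weights and threshold 2 computes; the low bit \(y_0\) is the parity of the inputs and is the part that needs a hidden layer. The Julia sketch below just verifies the single high-bit unit over all eight input patterns.
    # Illustration only (not the required construction): a linear threshold (LT) neuron,
    # and a check that one unit with weights (1, 1, 1) and threshold 2 computes the high bit.
    lt(w, theta, x) = sum(w .* x) >= theta ? 1 : 0

    for x1 in 0:1, x2 in 0:1, x3 in 0:1
        s = x1 + x2 + x3
        y1 = lt([1, 1, 1], 2, [x1, x2, x3])   # candidate high bit of the sum
        @assert y1 == div(s, 2)               # div(s, 2) is the true high bit
    end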

  3. Backprop for a general feedforward net. In class, we derived the backpropagation algorithm for a multilayer perceptron. Here we will generalize to any feedforward net. Write the forward pass as the iteration of \begin{equation} x_i = f\left(\sum_j W_{ij}x_j + b_i\right) \label{eq:ForwardPass} \end{equation} for \(i = 1\) to \(n\), where \(n\) is the number of neurons. Here assume that \(W\) is topologically sorted, so that \(W_{ij}\) is nonzero only for \(j < i\).

    • Write \(\vec{b}\) as a function of \(\vec{x}\). (This is the inverse of the normal mode of network operation, in which \(\vec{x}\) is a function of \(\vec{b}\). The inverse function can be interpreted in the following way. Suppose that you have complete control over \(\vec{b}\), and you want to make the activity of the network take on a given value \(\vec{x}\). The output of the function tells you how you should set \(\vec{b}\). The inverse function is mathematically useful for deriving backprop, even if you never use the network in this way. For simplicity, assume that the activation function \(f\) is monotone increasing, so that its inverse \(f^{-1}\) is well-defined.)

    • Define an online loss function \(e(\vec{x})\). Applying the chain rule to the composition of \(e\), viewed as a function of \(\vec{b}\), with the function \(\vec{b}(\vec{x})\) from the previous part yields \begin{equation} \frac{\partial e}{\partial x_j} = \sum_i \frac{\partial e}{\partial b_i}\frac{\partial b_i}{\partial x_j} \end{equation} Use this to derive the backward pass as the iteration of \begin{equation} \frac{\partial e}{\partial b_j} = f'(f^{-1}(x_j)) \left[\frac{\partial e}{\partial x_j} + \sum_i \frac{\partial e}{\partial b_i} W_{ij}\right] \end{equation} from \(j = n\) down to \(j = 1\).

    • Suppose that the loss function is \begin{equation} e(\vec{x}) = \frac{1}{2} \sum_{i\in S} (d_i - x_i)^2 \end{equation} where \(S\) is a subset of \(\{1, \ldots, n\}\). What is \(\partial e/\partial x_j\)? (To answer correctly, you will have to think about the meaning of the chain rule for \(\partial e/\partial x_j\). If your answer is correct, you will be able to see how backprop for a multilayer perceptron follows from the above as a special case where \(S\) is the set of neurons in the output layer. Meditate on this for yourself; you don't need to submit anything. You will need the sensitivity lemma, \(\partial e/\partial W_{ij} = (\partial e/\partial b_i)\,x_j\).)
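
    If you want to check your derivation numerically (optional, not for submission), the following Julia sketch implements the forward iteration above for a strictly lower-triangular \(W\), runs the backward iteration you are asked to derive, and compares one entry of the gradient with respect to \(\vec{b}\) against a finite-difference estimate. The choices \(f = \tanh\), the random test data, and the subset \(S\) are assumptions made only for this check, and the line computing dedx presupposes your answer to the last bullet.
    # Optional numerical check of the derivation (not for submission).
    # Assumes W is topologically sorted (strictly lower triangular) and f = tanh.
    using LinearAlgebra, Random

    f(u)      = tanh(u)
    fprime(u) = 1 - tanh(u)^2           # derivative of the activation function

    function forward(W, b)
        n = length(b)
        x = zeros(n)
        for i in 1:n                    # forward pass, i = 1 to n
            x[i] = f(sum(W[i, j] * x[j] for j in 1:n) + b[i])
        end
        return x
    end

    function backward(W, x, dedx)       # dedx[j] holds the direct derivative of e with respect to x_j
        n = length(x)
        dedb = zeros(n)
        for j in n:-1:1                 # backward pass, j = n down to 1
            dedb[j] = fprime(atanh(x[j])) * (dedx[j] + sum(dedb[i] * W[i, j] for i in 1:n))
        end
        return dedb
    end

    Random.seed!(0)
    n = 5
    W = tril(randn(n, n), -1)           # strictly lower-triangular weight matrix
    b = randn(n)
    S = [4, 5]                          # subset of neurons entering the loss
    d = randn(n)                        # desired values

    x = forward(W, b)
    dedx = [(i in S) ? -(d[i] - x[i]) : 0.0 for i in 1:n]
    dedb = backward(W, x, dedx)

    # finite-difference estimate of the gradient with respect to b[1]
    loss(bb) = 0.5 * sum((d[i] - forward(W, bb)[i])^2 for i in S)
    h = 1e-6
    db = zeros(n); db[1] = h
    println("analytic = ", dedb[1], "   numeric = ", (loss(b + db) - loss(b - db)) / (2h))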

Programming exercises

  1. Classifying MNIST images with a multilayer perceptron. If you haven’t already done so, refresh your code repo. In JuliaBox, you can do this by selecting “Console”, followed by these commands.
    $ cd ~/code
    $ git pull

    Return to the Notebook Dashboard and select Backprop.ipynb.

    • The sample code trains a two-layer perceptron with 10 output neurons. The code displays the first layer weight vectors, the input vector, and a learning curve. Run the code and report the final value of the time-averaged classification error.

    • Modify the code to use hyperbolic tangent as the activation function. Here it may be appropriate to change the desired values for the output neurons. Run the code to show that it works.

    • Modify the code to use rectification (ReLU) as the activation function. Run the code to show that it works.

    • Modify the code to add a third layer. Run the code to show that it works.

    For each modification of the code, you should create a separate cell in your notebook and generate a plot. You may need to fiddle with the learning rate parameter eta to get the training to converge. As an optional exercise, modify the last layer by removing the final activation function and replacing it with softmax. (Don't submit this.) A sketch of possible activation and softmax definitions appears below.
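
    Since the notebook's internal names will differ from these, the Julia sketch below only illustrates the kind of definitions the activation-function bullets call for; f and fprime are placeholder names standing in for whatever the sample code actually uses, and the softmax line corresponds to the optional exercise.
    # Placeholder names only; substitute whatever the sample notebook uses for the
    # activation function and its derivative.
    f(u)      = tanh(u)                 # hyperbolic tangent; outputs lie in (-1, 1),
    fprime(u) = 1 - tanh(u)^2           # so desired output values of +1/-1 are natural

    relu(u)      = max(0, u)            # rectification (ReLU)
    reluprime(u) = u > 0 ? 1.0 : 0.0    # with the derivative taken to be 0 at u = 0

    # Optional exercise: softmax over the output vector u (replaces the final activation)
    softmax(u) = exp.(u .- maximum(u)) ./ sum(exp.(u .- maximum(u)))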