$$\DeclareMathOperator*\argmax{argmax}$$

Theoretical exercises

1. Why gradient descent moves slowly in a canyon. To gain some intuitions, we will consider some simple quadratic cost functions. We don’t actually need to apply gradient descent to find the minimum; we know that it’s located at the origin. The purpose is theoretical analysis for the sake of insight.

• Consider the cost function $E(w_1, w_2) = (10000w_1^2 + w_2^2)/2$. Draw a sketch of the contours of constant cost. In which direction does the canyon point? Gradient descent on this cost function is $\Delta w_i = -\eta \partial E/\partial w_i$. For what value of $\eta$ will $w_1$ converge to zero immediately? For this value of $\eta$, how will $w_2$ behave as a function of time? Roughly how many time steps will it take for $w_2$ to converge to $1/e$ of its initial value?

• Diagonally scaled gradient descent means multiplying the gradient by a diagonal matrix $D$ with positive elements along the diagonal, $\Delta\vec{w} = - \eta D\partial E/\partial\vec{w}$. This is equivalent to having an independent learning rate parameter for each variable, $\Delta w_i = - \eta_i\partial E/\partial w_i$. How can we set $\eta_1$ and $\eta_2$ for the problem above to eliminate the canyon?

• The preceding example is two-dimensional, and the canyon is axis-aligned, which is why it can be eliminated by diagonal scaling. Now we generalize to higher dimensions, with a “canyon” that can point in an arbitrary direction and typically cannot be eliminated by diagonal scaling. Consider the cost function $E(\vec{w}) = \vec{w}^TH\vec{w}/2$. Assume that $H$ is a positive definite matrix (all eigenvalues are positive). That means $E$ is a strictly convex function with a strict global minimum at the origin. Assume without loss of generality that the eigenvalues are ordered such that $\lambda_1\geq\lambda_2\geq\ldots\geq\lambda_d$ where $d$ is the dimensionality. Define the condition number of $H$ as the ratio $\lambda_1/\lambda_d$. When the condition number is large, we say that $E$ has a canyon. Which direction is most canyon-like? Prove that for $t\to\infty$ and $\eta=1/\lambda_1$ the Euclidean norm of $\vec{w}$ is approximately proportional to $(1-\lambda_d/\lambda_1)^t$, which decays slowly when the condition number is very large. Numerical analysts say that such a matrix is “ill-conditioned.” (Hint: If you transform to the basis of eigenvectors of $H$, the cost function will become a weighted sum of squares, and gradient descent will take a simple form. Analysis becomes easy after that.)

Programming exercises

1. The art of backprop. In this exercise you will experiment with “tricks of the trade” discussed in class, with the goal of optimizing generalization performance on MNIST. A new file in the repo, Minibatch.ipynb, contains sample code for minibatch training. The code randomly selects a “validation set” of 10,000 examples from the 60,000 training examples. Only the remaining 50,000 examples are used for backprop, and are called the “training set”.

Run the code, which trains a 784-200-100-10 architecture. In the graph of classification error, the blue curve is for the training set, with each point an average over $t_{show}=1000$ minibatches. The red curve is for the validation set. You will see the curves track each other for a while, but eventually the training error will become lower than the validation error. Note that the vertical axis is a log scale, so differences between curves may appear compressed. If you wait longer, you will see that the validation error flattens out although the training error is still decreasing. Train for 600,000 minibatch updates.

The three histograms show the distributions of activities in the three layers (for the validation set). With proper weight initialization, the distributions should be peaked around the center of the range, but not too narrow. As training goes on, you will see the distributions shift towards the edges of the range, utilizing the nonlinearity in the activation function.

• What are the final values for validation error and training error? Save the data for the two learning curves (validation and training) for future comparison.

• The code is provided with $\tanh$ as activation function. Switch to logistic function by commenting and uncommenting the appropriate lines. (# is the comment symbol.) When you start training, you will see that the error stays almost constant. You might be tempted to give up on logistic function immediately. Instead, please fiddle with eta and/or epsinit until you can make the learning curve decrease. After you optimize these hyperparameters, plot the learning curves for logistic function on the same graph as those for $\tanh$, which you saved earlier. Is logistic worse than $\tanh$ for training? What are the final values for validation error and training error?

• Switch to rectification nonlinearity (ReLU) and train again. Try initializing all biases at one. What do you see? Can you find a better value for initial biases? What is the rationale for your choice?

• One way of improving generalization performance is to achieve lower training error. You could try increasing the size of your network (width or depth). Or you could try various tricks for improving gradient descent. Report your best validation error, and how you achieved it.

• Sometimes lower training error does not yield lower validation error, due to overfitting. Implement one trick for improving generalization performance. This could include L1 or L2 weight decay, or max norm projection. You could also try dropout, but this might be too time-consuming to train. Report your best validation error, and how you achieved it.

• Once you have found your best architecture and hyperparameters based on the validation set, add the validation set back to the training set, and train a new network on all 60,000 examples. Report the classification error on the test set.

Those with the top three results on the test set will receive a small prize! According to the rules, you are supposed to write your own code rather than use canned packages, and you shouldn’t use the test set for training networks or model selection. (I.e. it’s OK to peek at the error on the test set occasionally, but not OK to systematically optimize hyperparameters using the test set. For the latter you should use the validation set. In a real machine learning challenge, the test set would be withheld to prevent cheating.) If you feel like sharing, please post your classification error on the test set and other information on Piazza. This is helpful as it gives others an idea of how well it’s possible to do, and also ideas on how to proceed.