You can compare the results of your MNIST experiments with results reported in the literature.

  • Srivastava et al. (2014) discuss the application of dropout and other kinds of regularization to multilayer perceptrons learning MNIST.
  • Ciresan et al. (2010) showed that multilayer perceptrons could attain state-of-the-art results on the MNIST benchmark if the training set was augmented with elastic deformations.

L2 regularization means adding this penalty to the cost function for gradient descent: \[ \frac{\lambda_2}{2}\sum_{ij}W_{ij}^2 \]
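As a concrete illustration, here is a minimal NumPy sketch of the L2 penalty and the gradient it contributes to a weight update (the names `W`, `lambda2`, and `lr` are illustrative, not taken from any particular library):

```python
import numpy as np

def l2_penalty_and_grad(W, lambda2):
    """L2 penalty (lambda2/2) * sum_ij W_ij^2 and its gradient lambda2 * W."""
    penalty = 0.5 * lambda2 * np.sum(W ** 2)
    grad = lambda2 * W
    return penalty, grad

# In a gradient-descent step, the penalty gradient is added to the data-cost gradient:
#   W -= lr * (data_grad + lambda2 * W)
```

Because the penalty's gradient is simply \( \lambda_2 W_{ij} \), each update shrinks every weight toward zero in proportion to its size, which is why L2 regularization is often called weight decay.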

L1 regularization means adding this penalty: \[ \frac{\lambda_1}{2}\sum_{ij}|W_{ij}| \]
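A corresponding sketch for the L1 penalty, matching the \( \lambda_1/2 \) scaling used above (the absolute value is not differentiable at zero, so a subgradient is used):

```python
import numpy as np

def l1_penalty_and_subgrad(W, lambda1):
    """L1 penalty (lambda1/2) * sum_ij |W_ij| and a subgradient (lambda1/2) * sign(W)."""
    penalty = 0.5 * lambda1 * np.sum(np.abs(W))
    subgrad = 0.5 * lambda1 * np.sign(W)  # sign(0) = 0 serves as the subgradient at zero
    return penalty, subgrad
```

Unlike L2, the L1 penalty pushes weights toward zero by a fixed amount per step regardless of their magnitude, which tends to drive many weights exactly to zero.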

In max-norm regularization, each vector of incoming weights is constrained to lie within a ball of radius \( r \), i.e. \[ \sum_j W_{ij}^2 \leq r^2 \] for all \( i \). The constraint is enforced after each gradient update by projecting each weight vector onto the \( r \)-ball, \[ W_{ij} := W_{ij} \min\left\{\frac{r}{\sqrt{\sum_k W_{ik}^2}},\,1\right\} \] The radius \( r \) is a hyperparameter. Projection onto the \( r \)-ball is a special case of projection onto a convex set.
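A minimal NumPy sketch of this projection, assuming `W` is a matrix whose row \( i \) holds the incoming weights of unit \( i \) (the small epsilon guard against division by zero is an implementation detail, not part of the definition):

```python
import numpy as np

def max_norm_project(W, r):
    """Project each row of W (a unit's incoming weights) onto the ball of radius r."""
    norms = np.sqrt(np.sum(W ** 2, axis=1, keepdims=True))   # ||W_i.|| for each unit i
    scale = np.minimum(r / np.maximum(norms, 1e-12), 1.0)    # shrink only rows with norm > r
    return W * scale

# Applied after each gradient update:
#   W = max_norm_project(W, r)
```

Rows whose norm is already at most \( r \) are left unchanged; rows outside the ball are rescaled to lie exactly on its surface, which is the nearest point in the constraint set.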