MNIST with multilayer perceptrons

You can compare the results of your MNIST experiments with results reported in the literature.

Srivastava et al. (2014) discusses the application of dropout and other kinds of regularization to multilayer perceptrons learning MNIST.
Ciresan et al. (2010) showed that multilayer perceptrons could attain state-of-the-art results on the MNIST benchmark, if the training set were augmented using elastic deformations.

L2 regularization means adding this penalty to the cost function for gradient descent: \[ \frac{\lambda_2}{2}\sum_{ij}W_{ij}^2 \]

L1 regularization means adding this penalty: \[ \frac{\lambda_1}{2}\sum_{ij}|W_{ij}| \]

In max-norm regularization, each vector of incoming weights is constrained to lie within a ball of radius $r$ , i.e. \[ \sum_j W_{ij}^2 \leq r^2 \] for all $i$ . The constraint is enforced after each gradient update by projecting each weight vector onto the $r$ -ball, \[ W_{ij} := W_{ij} \min\left\{\frac{r}{\sqrt{\sum_k W_{ik}^2}},1\right\} \] The radius $r$ is a hyperparameter. Projection onto the $r$ -ball is a special case of projection onto a convex set.