$$\DeclareMathOperator*\argmax{argmax}$$

### Theoretical exercises

1. Let $x_{1},\ldots,x_{N}$ be $N$ Boolean variables, each taking the value 0 or 1.
• Consider the conjunction $\bar{x}_{1}\wedge\bar{x}_{2}\wedge\cdots\wedge\bar{x}_{n}\wedge x_{n+1}\wedge\cdots\wedge x_{N}$, in which the first $n$ variables are negated. Express this function as an LT neuron $H\left(\sum_{i}w_{i}x_{i}-\theta\right)$ with weights $w_{i}$ and threshold $\theta$, where $H$ is the Heaviside step function. Note that conjunction, or AND, is defined by $x_{1}\wedge\cdots\wedge x_{N}=1\text{ if and only if }x_{i}=1\text{ for all }i$.

• Do the same for the disjunction $\bar{x}_{1}\vee\bar{x}_{2}\vee\cdots\vee\bar{x}_{n}\vee x_{n+1}\vee\cdots\vee x_{N}$. Note that disjunction, or OR, is defined by $x_{1}\vee\cdots\vee x_{N}=1\text{ if and only if }x_{i}=1\text{ for some }i$.
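Once you have derived weights and a threshold, you can check them numerically. The sketch below verifies only the special case $n=0$ (plain AND, no negations) with the example choice $w_i=1$, $\theta=N-\tfrac{1}{2}$; extending the weights to handle negated inputs is the exercise.

```julia
# Sanity check for the special case n = 0 (plain AND of N variables).
# Example weights w_i = 1 with threshold N - 1/2; the negated case is
# left for the exercise.
heaviside(u) = u >= 0 ? 1 : 0
lt_neuron(w, x, theta) = heaviside(sum(w .* x) - theta)

N = 3
w = ones(N)
theta = N - 0.5
for x in Iterators.product(fill(0:1, N)...)
    xv = collect(x)
    @assert lt_neuron(w, xv, theta) == (all(xv .== 1) ? 1 : 0)
end
```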

2. Logistic regression. The delta rule was first introduced in class using the LT neuron, which has binary output. Then we learned that variants of the delta rule can be derived for a model neuron with a smooth, analog activation function $f$. The form of the learning rule depends on which loss function is used to measure the deviation between desired and actual outputs. If the desired output $y$ is binary (0 or 1), and the activation function ranges from 0 to 1, then some people like to interpret the output of the model neuron as the probability of a biased coin. These probability-lovers like to use the “log loss”, $e\left(\vec{w},\vec{x},y\right)=-y\log f\left(\vec{w}\cdot\vec{x}\right)-\left(1-y\right)\log\left[1-f\left(\vec{w}\cdot\vec{x}\right)\right]$.
• Derive the version of the delta rule $(\Delta\vec{w}=?)$ that results from online gradient descent using the above “log loss”.
• Specialize to the case where the activation function is given by the logistic function $f(u)=\frac{1}{1+e^{-u}}.$ You will get a simple result.
• How does this differ from the version of the delta rule derived in class from the squared error loss function?

You’ve derived an algorithm for logistic regression. Statisticians use logistic regression when outputs are binary, and linear regression when outputs are analog.
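Before deriving the update, it can help to evaluate the log loss numerically and see how it penalizes confident wrong answers. A small sketch (the function names are just for illustration):

```julia
# Logistic function and the log loss for a single (output, label) pair.
logistic(u) = 1 / (1 + exp(-u))
logloss(f, y) = -y * log(f) - (1 - y) * log(1 - f)

f = logistic(2.0)        # output near 1
println(logloss(f, 1))   # small: the output agrees with the label
println(logloss(f, 0))   # large: confident disagreement is penalized heavily
```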

3. Delta rule for a layer of multiple neurons. Suppose that there are $k$ LT neurons, which all share the same $n$ inputs. The $k$ weight vectors of the neurons are the rows of a $k\times n$ matrix $W$. The desired output $\vec{y}$ and bias $\vec{b}$ are $k\times 1$ vectors, and the input $\vec{x}$ is an $n\times 1$ vector. The goal of supervised learning is to find $W$ and $\vec{b}$ such that this approximation holds: $\vec{y}\approx H\left(W\vec{x}+\vec{b}\right).$ Here the Heaviside step function $H$ has been extended to take a vector argument, simply by applying it to each component of the vector. We could write $k$ separate delta rules, one for each of the LT neurons, or we could combine them into a single matrix-vector equation. Write the delta rule in matrix-vector format.
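The forward pass of the layer can be sketched as follows (the update rule itself is the exercise; the shapes below are arbitrary examples):

```julia
# Forward pass y_hat = H(W*x + b) for a layer of k LT neurons.
heaviside(u) = u >= 0 ? 1 : 0

k, n = 3, 4
W = randn(k, n)                  # one weight vector per row
b = randn(k)
x = rand(0:1, n)
y_hat = heaviside.(W * x .+ b)   # H applied elementwise to a k-vector
```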

4. Multiclass training with softmax. We could train multiple LT neurons with separate delta rules to classify digits. Alternatively, there are various multiclass training methods. For example, define the softmax function from $R^{n}$ to $R^{n}$ by $f_{i}(u_{1},\ldots,u_{n})=\frac{e^{u_{i}}}{\sum_{j=1}^{n}e^{u_{j}}}.$ If we scale the magnitude of $\vec{u}$ to infinity, i.e., replace $\vec{u}$ by $\lambda\vec{u}$ and let $\lambda\to\infty$, then $f_i\to 1$ for the $i$ such that $u_i$ is maximal, and $f_i\to 0$ for all other $i$. This is the reason for the name softmax.
• Show that the gradient of the softmax function is: $\frac{\partial f_{i}}{\partial u_{j}}=f_{i}\delta_{ij}-f_{i}f_{j}$
• If every input vector $\vec{x}$ falls into one of $n$ classes, we can represent the correct class for any input vector through an $n$-dimensional output vector with one-hot or unary encoding: $y_{i} = \begin{cases} 1, & i=\text{correct class}\cr 0, & \text{otherwise} \end{cases}$ Suppose we would like to train $n$ weight vectors $\vec{w}_{1},\ldots,\vec{w}_{n}$ so that $y_{i}\approx f_{i}(\vec{w}_1\cdot\vec{x},\ldots,\vec{w}_n\cdot\vec{x}).$ Derive online gradient descent for the cost function $e(\vec{w}_1,\ldots,\vec{w}_n,\vec{x},y_1,\ldots,y_n) = - \sum_i y_i\log f_i(\vec{w}_1\cdot\vec{x},\ldots,\vec{w}_n\cdot\vec{x}).$ This can be viewed as a generalization of logistic regression.
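The stated gradient formula can be checked numerically by finite differences before you rely on it in the derivation. A sketch:

```julia
# Numerically verify ∂f_i/∂u_j = f_i δ_ij − f_i f_j by finite differences.
softmax(u) = exp.(u) ./ sum(exp.(u))

u = [0.5, -1.0, 2.0]
f = softmax(u)
h = 1e-6
for i in 1:3, j in 1:3
    du = zeros(3); du[j] = h
    numeric = (softmax(u .+ du)[i] - f[i]) / h
    analytic = f[i] * (i == j) - f[i] * f[j]
    @assert abs(numeric - analytic) < 1e-4
end
```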

### Programming exercises

The MNIST dataset was created at AT&T Bell Labs in the 1990s by selecting from and modifying a larger dataset from NIST. It consists of images of handwritten digits, and was heavily used in machine learning research in the 1990s and 2000s. Documentation about the dataset can be found at Yann LeCun’s personal website.

If you are using your own computer, download the sample code and start the Jupyter notebook using the commands

```
$ git clone https://github.com/cos495/code.git
$ cd code
$ julia
julia> using IJulia
julia> notebook()
```

IJulia is the Julia flavor of the Jupyter notebook. If it doesn’t install properly, you may have to consult the IJulia documentation.

If you are using JuliaBox, point your browser to juliabox.com. Select “Sync” in the toolbar, type https://github.com/cos495/code.git into the box under “Git Clone URL”, and press the “+” button on the right. Now select “Jupyter” in the toolbar.

Either of these methods should open the Jupyter Notebook Dashboard in your browser. The code folder should appear in the list of files. (If it doesn’t, select the “Files” tab and/or refresh the page.) Select it, and then select MNIST-Intro.ipynb. The notebook should open in your browser.

1. Intro to MNIST
• Execute the code in the notebook cells to visualize the MNIST images and their statistical properties. (No work to submit here.)

• Write code to compute the pixelwise standard deviation of the images in each digit class. (You can start from the provided code for the pixelwise mean of the images in each digit class. Use std to compute standard deviation.)

• For each digit class, visualize the standard deviation as an image. Use the provided montage function.

• Explain why the results look the way they do. Write your answer in a notebook cell. You can use Markdown to format your answer if you select Cell > Cell Type > Markdown for the relevant cell.
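The standard-deviation computation can be sketched as follows, assuming the images for one digit class are stacked in a 28×28×m array; the provided notebook's actual variable names and array layout may differ, and older Julia versions use `std(imgs, 3)` instead of the `dims` keyword.

```julia
using Statistics   # provides mean and std

# imgs: 28×28×m array, one 28×28 image per slice (hypothetical layout)
imgs = rand(28, 28, 100)
pixel_mean = mean(imgs, dims=3)[:, :, 1]   # pixelwise mean, 28×28
pixel_std  = std(imgs, dims=3)[:, :, 1]    # pixelwise standard deviation, 28×28
```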

2. Classifying MNIST images with an LT neuron. Return to the Notebook Dashboard and select DeltaRule.ipynb.
• The sample code trains an LT neuron to be activated by images of “two” while remaining inactive for images of other digits. When you run the code, you should see time-varying images of the weight vector and the input vector. You will also see a learning curve, defined as the cumulative time average of the classification error versus time. What is the final value of this time-averaged classification error?

• In class we derived the loss function for the LT neuron delta rule. Modify the code so that it additionally plots another learning curve, the cumulative time average of the loss function versus time.

• After training a model in machine learning, you’ll want to know whether your results are good or bad. It’s often helpful to compare with the performance of simple “baseline” models. For example, a simple baseline here is the trivial recognition algorithm that always returns an output of 0. What would be the classification error for this algorithm?
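A cumulative time average of any per-step quantity (classification error or loss) can be computed in one line; a sketch with a hypothetical `errors` vector:

```julia
# errors[t] is the 0/1 classification error at step t (hypothetical data)
errors = rand(0:1, 1000)
# cumulative time average: mean of errors[1:t] for each t
avg_curve = cumsum(errors) ./ (1:length(errors))
```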

3. Train 10 model neurons to detect the 10 digit classes by applying the multiclass softmax algorithm that you derived above. The predicted class can be defined as the neuron for which the softmax output is largest. If the prediction disagrees with the true digit class, that constitutes a classification error. (Note that Julia has 1-based indexing of arrays, so you will probably want to relabel “0” as “10”. This is not a problem with Python, which has 0-based indexing.)

• Your code should plot a learning curve, the cumulative time average of the classification error versus time. What is the final value?

• Your code should visualize the ten learned weight vectors displayed as images.

• Add other visualizations that are helpful for monitoring the progress of learning and evaluating the final result.
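The prediction step described above can be sketched as follows; the weight matrix shape is a hypothetical example, and with 1-based indexing the digit "0" would occupy class 10 as noted earlier.

```julia
softmax(u) = exp.(u) ./ sum(exp.(u))

W = randn(10, 784)        # one weight vector per class (hypothetical shapes)
x = rand(784)             # a flattened 28×28 image
f = softmax(W * x)
predicted = argmax(f)     # the neuron with the largest softmax output
# With 1-based indexing, digit d corresponds to class (d == 0 ? 10 : d).
```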