<h1 id="cos-495-neural-networks-theory-and-applications">COS 495 Neural Networks: Theory and Applications</h1>
<h2 id="problem-set-9-beta-version">Problem Set 9 (beta version) (2017-05-15)</h2>
<style>
.center-image
{
margin: 0 auto;
display: block;
}
</style>
<script type="math/tex; mode=display">\DeclareMathOperator*\trace{Tr}
\DeclareMathOperator*\argmax{argmax}
\DeclareMathOperator*\argmin{argmin}
\DeclareMathOperator*\prob{\mathbb{P}}</script>
<h4 id="this-pset-is-ungraded--the-exercises-are-for-review-purposes">This pset is ungraded. The exercises are for review purposes.</h4>
<h3 id="unfolding-in-time">Unfolding in time</h3>
<p>Define a recurrent net by
<script type="math/tex">% <![CDATA[
\begin{align}
\vec{h}_{t} &= f (W \vec{h}_{t - 1} + U \vec{x}_t)\\
\vec{y}_{t} &= g (V\vec{h}_t)
\end{align} %]]></script>
for <script type="math/tex">t=1</script> to <script type="math/tex">T</script> with initial condition <script type="math/tex">\vec{h}_{0} =0</script>. At time <script type="math/tex">t</script>,</p>
<ul>
<li><script type="math/tex">\vec{x}_t</script> is the <script type="math/tex">d_x</script>-dimensional activity vector for the input neurons (input vector),</li>
<li><script type="math/tex">\vec{h}_t</script> is the <script type="math/tex">d_h</script>-dimensional activity vector for the hidden neurons (hidden state vector),</li>
<li>and <script type="math/tex">\vec{y}_t</script> is the <script type="math/tex">d_y</script>-dimensional activity vector for the output neurons (output vector).</li>
</ul>
<p>The network contains three connection matrices:</p>
<ul>
<li>The <script type="math/tex">d_h\times d_x</script> matrix <script type="math/tex">U</script> specifies how the hidden neurons are influenced by the input,</li>
<li>the <script type="math/tex">d_h\times d_h</script> “transition matrix” <script type="math/tex">W</script> governs the dynamics of the hidden neurons,</li>
<li>and the <script type="math/tex">d_y\times d_h</script> matrix <script type="math/tex">V</script> specifies how the output is “read out” from the hidden neurons.</li>
</ul>
<p>The network can be depicted as</p>
<p><img src="/assets/ElmanNetDiagram.png" alt="architecture" height="150px" /></p>
<p>where the recurrent connectivity is represented by the loop. Note that if <script type="math/tex">W</script> were zero, this would reduce to a multilayer perceptron with a single hidden layer.</p>
<p>If we want to run the recurrent net for a fixed number of time steps, we can unfold or unroll the loop in the above diagram to yield
a feedforward net with the architecture:</p>
<script type="math/tex; mode=display">\require{AMScd}
\begin{CD}
@. \vec{x}_1 @. \vec{x}_2 @. \cdots @. \vec{x}_T\\
@. @VU_1VV @VU_2VV @VVV @VU_TVV\\
\vec{h}_0 @>W_1>> \vec{h}_1 @>W_2>> \vec{h}_2 @>>> \cdots @>W_T>> \vec{h}_T\\
@. @V{V_1}VV @V{V_2}VV @VVV @V{V_T}VV\\
@. \vec{y}_1 @. \vec{y}_2 @. \cdots @. \vec{y}_T\\
\end{CD}</script>
<p>In this diagram, we temporarily imagine that the connection matrices <script type="math/tex">W_t</script>, <script type="math/tex">U_t</script>, and <script type="math/tex">V_t</script> vary with time:
<script type="math/tex">% <![CDATA[
\begin{align}
\vec{h}_{t} &= f (W_t \vec{h}_{t - 1} + U_t \vec{x}_t)\\
\vec{y}_{t} &= g (V_t\vec{h}_t)
\end{align} %]]></script>
This reduces to the recurrent net in the case that <script type="math/tex">W_t=W</script> for all <script type="math/tex">t</script> (and similarly for the other connection matrices). This is called “weight sharing through time.”</p>
<p>The diagram has the virtue of making explicit the pathways that produce temporal dependencies. The last output <script type="math/tex">\vec{y}_T</script> depends on the last input <script type="math/tex">\vec{x}_T</script> through the short pathway that goes through <script type="math/tex">\vec{h}_T</script>. The last output also depends on the first input <script type="math/tex">\vec{x}_1</script> through the long pathway from <script type="math/tex">\vec{h}_1</script> to <script type="math/tex">\vec{h}_T</script>.</p>
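<p>As a concrete illustration, the forward pass above can be sketched in a few lines of numpy (Python rather than the course's Julia; the choices <code>f = tanh</code> and a linear readout <code>g</code> are assumptions, since the text leaves the nonlinearities unspecified):</p>

```python
import numpy as np

def rnn_forward(xs, W, U, V, f=np.tanh, g=lambda z: z):
    """Unfold h_t = f(W h_{t-1} + U x_t), y_t = g(V h_t) for t = 1..T."""
    h = np.zeros(W.shape[0])          # initial condition h_0 = 0
    hs, ys = [], []
    for x in xs:                      # one iteration per unrolled time step
        h = f(W @ h + U @ x)
        hs.append(h)
        ys.append(g(V @ h))
    return hs, ys

# toy dimensions: d_x = 3, d_h = 4, d_y = 2, T = 5
rng = np.random.default_rng(0)
W, U, V = (0.1 * rng.normal(size=s) for s in [(4, 4), (4, 3), (2, 4)])
xs = [rng.normal(size=3) for _ in range(5)]
hs, ys = rnn_forward(xs, W, U, V)
```

<p>With <code>W</code> set to zero the loop contributes nothing and each <code>y_t</code> depends only on <code>x_t</code>, matching the remark above that the net then reduces to a one-hidden-layer perceptron.</p>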
<h3 id="backpropagation-through-time">Backpropagation through time</h3>
<p>The backward pass can be depicted as</p>
<script type="math/tex; mode=display">% <![CDATA[
\require{AMScd}
\begin{CD}
@. \tilde{\vec{x}}_1 @. \tilde{\vec{x}}_2 @. \cdots @. \tilde{\vec{x}}_T\\
@. @A{U_1^\top}AA @A{U_2^\top}AA @AAA @A{U_T^\top}AA\\
\tilde{\vec{h}}_0 @<{W_1^\top}<< \tilde{\vec{h}}_1 @<{W_2^\top}<< \tilde{\vec{h}}_2 @<<< \cdots @<{W_T^\top}<< \tilde{\vec{h}}_T\\
@. @A{V_1^\top}AA @A{V_2^\top}AA @AAA @A{V_T^\top}AA\\
@. \tilde{\vec{y}}_1 @. \tilde{\vec{y}}_2 @. \cdots @. \tilde{\vec{y}}_T\\
\end{CD} %]]></script>
<p>Here <script type="math/tex">\tilde{\vec{h}}_t</script> is the backward pass variable that corresponds with the forward pass preactivity <script type="math/tex">f^{-1}(\vec{h}_t)</script> (and similarly for other tilde variables).</p>
<h4 id="exercises">Exercises</h4>
<ol>
<li>
<p>Write the equations of the backward pass corresponding to the diagram. As a concrete example, use the loss
\[
E = \sum_{t=1}^T \frac{1}{2}|\vec{d}_t - \vec{y}_t|^2
\]
where <script type="math/tex">\vec{d}_t</script> is a vector of desired values for the output neurons at time <script type="math/tex">t</script>. The equations for <script type="math/tex">\tilde{\vec{x}}_t</script> and <script type="math/tex">\tilde{\vec{h}}_0</script> won’t be used in the gradient descent updates, but write them anyway. (They would be useful if <script type="math/tex">\vec{x}_t</script> were the output of other neural net layers and we wanted to continue backpropagating through those layers too.)</p>
</li>
<li>
<p>In the gradient descent updates,
<script type="math/tex">% <![CDATA[
\begin{align}
\Delta U_t &\propto \tilde{\vec{h}}_?\vec{x}_?^\top\\
\Delta W_t &\propto \tilde{\vec{h}}_?\vec{h}_?^\top\\
\Delta V_t &\propto \tilde{\vec{y}}_?\vec{h}_?^\top
\end{align} %]]></script>
supply the missing time indices where there are question marks. Note how the backward pass diagram above shows the paths by which the loss at time <script type="math/tex">t</script> can influence the gradient updates for previous times.</p>
</li>
<li>
<p>To obtain the backprop update for the recurrent net, we must enforce the constraint of weight sharing in time (<script type="math/tex">W_t=W</script> for all <script type="math/tex">t</script>, etc.). Write the gradient updates for <script type="math/tex">U</script>, <script type="math/tex">V</script>, and <script type="math/tex">W</script>.</p>
</li>
</ol>
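<p>For checking your own derivation, here is one possible arrangement of backprop through time with weight sharing, sketched in numpy. It is an illustration rather than the unique solution to the exercises: it assumes <code>f = tanh</code> with a linear readout, uses the squared loss from exercise 1, and accumulates each shared-weight gradient as a sum of the per-time-step updates.</p>

```python
import numpy as np

def bptt(xs, ds, W, U, V):
    """Forward pass, then backprop through time. The gradients for the
    shared W, U, V are sums of per-time-step contributions."""
    T = len(xs)
    hs = [np.zeros(W.shape[0])]               # h_0 = 0
    ys = []
    for x in xs:                              # forward pass
        hs.append(np.tanh(W @ hs[-1] + U @ x))
        ys.append(V @ hs[-1])
    E = 0.5 * sum(np.sum((d - y) ** 2) for d, y in zip(ds, ys))

    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    h_tilde = np.zeros(W.shape[0])            # backward variable for h
    for t in reversed(range(T)):              # backward pass
        y_tilde = ys[t] - ds[t]               # dE/dy_t for the squared loss
        dV += np.outer(y_tilde, hs[t + 1])
        # readout path plus recurrent path, through f'(a) = 1 - tanh(a)^2
        h_tilde = (V.T @ y_tilde + W.T @ h_tilde) * (1 - hs[t + 1] ** 2)
        dW += np.outer(h_tilde, hs[t])
        dU += np.outer(h_tilde, xs[t])
    return E, dW, dU, dV

rng = np.random.default_rng(0)
W, U, V = (0.5 * rng.normal(size=s) for s in [(3, 3), (3, 2), (2, 3)])
xs = [rng.normal(size=2) for _ in range(4)]
ds = [rng.normal(size=2) for _ in range(4)]
E, dW, dU, dV = bptt(xs, ds, W, U, V)
```

<p>A finite-difference check on any single weight is a quick way to validate such code against the analytic gradient.</p>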
<h3 id="backprop-with-multiplicative-interactions">Backprop with multiplicative interactions</h3>
<p>Consider the feedforward net,
<script type="math/tex">% <![CDATA[
\begin{align}
\vec{x}^1 &= f(W^{1x}\vec{x}^0 + \vec{b}^{1x})\\
\vec{y}^1 &= f(W^{1y}\vec{y}^0 + \vec{b}^{1y})\\
\vec{z}^1 &= \vec{x}^1.*\vec{y}^1\\
\vec{z}^2 &= f(W^2\vec{z}^1 + \vec{b}^2)
\end{align} %]]></script>
The vectors represent the layer activities, and superscripts indicate layer indices. The elementwise multiplication of <script type="math/tex">\vec{x}^1</script> and <script type="math/tex">\vec{y}^1</script> makes this net different from a multilayer perceptron.</p>
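<p>A minimal numpy sketch of this forward pass (with <code>f = tanh</code> as a stand-in, since the text leaves <code>f</code> unspecified) makes the elementwise product explicit; the final comment records the rule the product node contributes to the backward pass:</p>

```python
import numpy as np

f = np.tanh                       # assumed nonlinearity
rng = np.random.default_rng(1)
W1x, b1x = rng.normal(size=(4, 3)), rng.normal(size=4)
W1y, b1y = rng.normal(size=(4, 2)), rng.normal(size=4)
W2,  b2  = rng.normal(size=(2, 4)), rng.normal(size=2)
x0, y0 = rng.normal(size=3), rng.normal(size=2)

x1 = f(W1x @ x0 + b1x)
y1 = f(W1y @ y0 + b1y)
z1 = x1 * y1                      # elementwise product, written .* above
z2 = f(W2 @ z1 + b2)
# Backward through the product node: each factor receives the upstream
# gradient scaled elementwise by the OTHER factor, e.g. dE/dx1 = dE/dz1 * y1.
```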
<h4 id="exercises-1">Exercises</h4>
<ol>
<li>Write the backward pass for the net.</li>
<li>Write the gradient updates for the weights and biases.</li>
</ol>
<h3 id="reinforce">REINFORCE</h3>
<p>Suppose that the probability that a random variable <script type="math/tex">X</script> takes on the value <script type="math/tex">x</script> is given by the parametrized function <script type="math/tex">p_\theta(x)</script>,</p>
<script type="math/tex; mode=display">\prob[X = x] = p_\theta(x)</script>
<p>I am following the <a href="https://en.wikipedia.org/wiki/Notation_in_probability_and_statistics">convention</a> that random variables (e.g. <script type="math/tex">X</script>) are written in upper case, while values (e.g. <script type="math/tex">x</script>) are written in lower case.</p>
<p>Given a reward function <script type="math/tex">r(X)</script>, define its expected value by</p>
<script type="math/tex; mode=display">\begin{equation}\label{eq:ExpectedReward}
\langle r(X)\rangle = \sum_x p_\theta(x) r(x)
\end{equation}</script>
<p>where the sum is over all possible values <script type="math/tex">x</script> of the random variable <script type="math/tex">X</script>. The expected reward is a function of the parameters <script type="math/tex">\theta</script>. How can we maximize this function?</p>
<p>The REINFORCE algorithm consists of the following steps</p>
<ul>
<li>Draw a sample <script type="math/tex">x</script> of the random variable <script type="math/tex">X</script>.</li>
<li>Compute the eligibility (for reinforcement)</li>
</ul>
<script type="math/tex; mode=display">\begin{equation}\label{eq:Eligibility}
e_\theta(x) = \nabla_\theta\log p_\theta(x)
\end{equation}</script>
<ul>
<li>Observe the reward <script type="math/tex">r(x)</script>.</li>
<li>Update the parameters via</li>
</ul>
<script type="math/tex; mode=display">\begin{equation}\label{eq:ThetaUpdate}
\Delta\theta \propto r(x)e_\theta(x)
\end{equation}</script>
<p>The basic theorem is that this algorithm is stochastic gradient ascent on the expected reward of Eq. (\ref{eq:ExpectedReward}). To apply the REINFORCE algorithm,</p>
<ul>
<li>It is sufficient to observe reward from random samples; no explicit knowledge of the reward function is necessary.</li>
<li>One must have enough knowledge about <script type="math/tex">p_\theta(x)</script> to be able to compute the eligibility.</li>
</ul>
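<p>The four steps can be sketched for a softmax-parametrized categorical distribution (the softmax parametrization, learning rate, and toy reward below are assumptions made purely for illustration):</p>

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def eligibility(theta, x):
    """e_theta(x) = grad_theta log p_theta(x); for a softmax this is
    the one-hot vector for x minus the probability vector."""
    e = -softmax(theta)
    e[x] += 1.0
    return e

theta = np.zeros(3)                          # uniform initial distribution
rng = np.random.default_rng(0)
reward = lambda x: 1.0 if x == 2 else 0.0    # hypothetical reward function
for _ in range(200):
    x = rng.choice(3, p=softmax(theta))      # draw a sample of X
    theta += 0.1 * reward(x) * eligibility(theta, x)   # update: dtheta ~ r(x) e(x)
```

<p>Note that the loop only ever evaluates the reward at sampled values; the reward function is never differentiated, in line with the first bullet above.</p>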
<p>As a special case, suppose that <script type="math/tex">X=(S,A)</script> is a state-action pair taking on values <script type="math/tex">x=(s,a)</script>. The reward <script type="math/tex">r(S,A)</script> is a function of both state and action, and the probability distribution of state-action pairs is</p>
<script type="math/tex; mode=display">\prob[X=x] = \prob[S=s, A=a] = \prob[S=s]\prob[A=a|S=s]</script>
<p>We interpret the probability \(\prob[S=s]\) as describing the states of the world. The conditional probability \(\prob[A=a|S=s]\) describes a stochastic machine that takes the state \(S\) of the world as input and produces action \(A\) as output. So the eligibility is</p>
<script type="math/tex; mode=display">\nabla_\theta\log\prob[X=x] = \nabla_\theta\log\prob[S=s] + \nabla_\theta\log\prob[A=a|S=s]</script>
<p>The first term in the sum vanishes because the parameter <script type="math/tex">\theta</script> does not affect the world. Therefore we can be ignorant of <script type="math/tex">\prob[S=s]</script>, which is fortunate given that the world is complex. Only the second term in the sum matters, because <script type="math/tex">\theta</script> only affects the stochastic machine. So the eligibility reduces to</p>
<script type="math/tex; mode=display">\nabla_\theta\log\prob[X=x] = \nabla_\theta\log\prob[A=a|S=s]</script>
<p>We’ve only considered the case in which successive states of the world are drawn independently at random. It turns out that REINFORCE easily generalizes to control of a <a href="https://en.wikipedia.org/wiki/Markov_decision_process">Markov decision process</a>, in which states of the world have temporal dependences. It also generalizes to partially observable Markov decision processes, in which the stochastic machine has incomplete information about the state of the world.</p>
<h3 id="reinforce-backprop">REINFORCE-backprop</h3>
<p>Suppose that the stochastic machine is a neural net with a softmax output. The parameter vector \(\theta\) then refers to the weights and biases. Note that \(\log\prob[A=a|S=s]\) is the same (up to a minus sign) as the loss function used for neural net training, assuming that the desired output is \(a\) (see softmax exercise in Problem Set 1). So \(\nabla_\theta\log\prob[A=a|S=s]\) should be the same as the backprop updates assuming that the desired output is \(a\). The minus sign doesn’t matter because here we are performing stochastic gradient ascent rather than stochastic gradient descent. Therefore we can combine REINFORCE and backprop in the following algorithm:</p>
<ul>
<li>Draw a sample <script type="math/tex">s</script> of the random variable <script type="math/tex">S</script> (observe the world).</li>
<li>Feed <script type="math/tex">s</script> as input to a neural network with a softmax output.</li>
<li>Choose a random action <script type="math/tex">a</script> by sampling from the softmax probability distribution.</li>
<li>Pretend that the chosen action <script type="math/tex">a</script> is the correct one, and compute the regular backprop updates for the weights and biases.</li>
<li>Observe the reward <script type="math/tex">r</script>.</li>
<li>Before applying the incremental updates to the weights and biases, multiply the updates by the reward <script type="math/tex">r</script>.</li>
</ul>
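<p>Here is a sketch of those six steps for the simplest possible policy network, a single layer of weights feeding a softmax (the linear policy, fixed state, learning rate, and reward below are illustrative assumptions; with more layers, the same output delta would simply be backpropagated further):</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W, b = np.zeros((3, 4)), np.zeros(3)        # 4 inputs, 3 possible actions
s = rng.normal(size=4)                      # observed state of the world
for _ in range(200):
    p = softmax(W @ s + b)                  # feed s to the network
    a = rng.choice(3, p=p)                  # sample an action from the softmax
    delta = -p; delta[a] += 1.0             # backprop delta, pretending a is correct
    r = 1.0 if a == 0 else 0.0              # observe the (hypothetical) reward
    W += 0.1 * r * np.outer(delta, s)       # scale the updates by the reward
    b += 0.1 * r * delta
```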
<h4 id="exercises-2">Exercises</h4>
<ol>
<li>
<p>Prove that Eq. (\ref{eq:ThetaUpdate}) is stochastic gradient ascent on the expected reward of Eq. (\ref{eq:ExpectedReward}). In other words, show that the expectation value of the right hand side of Eq. (\ref{eq:ThetaUpdate}) is equal to <script type="math/tex">\nabla_\theta\langle r(X)\rangle</script>.</p>
</li>
<li>
<p>Prove that the eligibility of Eq. (\ref{eq:Eligibility}) has zero mean. This means that the eligibility is like a “noise” that is intrinsic to the system. Note that statisticians refer to the eligibility as the <a href="https://en.wikipedia.org/wiki/Score_(statistics)">score function</a>, which is used to define <a href="https://en.wikipedia.org/wiki/Fisher_information">Fisher information</a>.</p>
</li>
<li>
<p>Prove that the gradient of the expected reward is the covariance of the reward and the eligibility.</p>
</li>
<li>
<p>Consider a multilayer perceptron with a single hidden layer (two layers of connections) and a final softmax output. Write out the REINFORCE-backprop algorithm in full.</p>
</li>
</ol>
<p>Sebastian Seung</p>
<h2 id="links-to-online-resources">Links to online resources (2017-05-09)</h2>
<p>Here are some resources that cover similar material as in the lectures. (Thanks to Karan Kathpalia for finding these.)</p>
<ol>
<li><a href="http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/">autoencoders</a></li>
<li><a href="http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes5.pdf">language modeling</a>, <a href="https://www.tensorflow.org/tutorials/word2vec">word2vec</a></li>
<li><a href="http://www.deeplearningbook.org/contents/rnn.html">Backprop through time</a></li>
<li><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">RNNs and LSTMs</a></li>
<li><a href="http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf">policy gradient</a> (Chapter 13)</li>
<li><a href="https://drive.google.com/open?id=0BwWWE4p0wihleWFIT0NGcWwtMTg">AlphaGo</a></li>
</ol>
<p>Sebastian Seung</p>
<h2 id="problem-set-8-beta-version">Problem Set 8 (beta version) (2017-04-27)</h2>
<style>
.center-image
{
margin: 0 auto;
display: block;
}
</style>
<script type="math/tex; mode=display">\DeclareMathOperator*\trace{Tr}
\DeclareMathOperator*\argmax{argmax}
\DeclareMathOperator*\argmin{argmin}</script>
<h3 id="deadline-extended-due-at-3pm-on-wednesday-may-10-please-submit-to-blackboard-and-follow-the-general-guidelines-regarding-homework-assignments">DEADLINE EXTENDED: Due at 3pm on Wednesday, May 10. Please submit to Blackboard and follow the general <a href="https://cos495.github.io/general/2017/02/06/homework-guidelines.html">guidelines</a> regarding homework assignments.</h3>
<p>In this assignment, you will study the problem of predicting the next character of text. You’ll start with unigram and bigram models, and progress to recurrent neural networks.</p>
<h3 id="programming-exercises">Programming exercises</h3>
<p>Use <code class="highlighter-rouge">git pull</code> to update your cos495 sample code repo.</p>
<ol>
<li>
<p>Unigram model. The notebook <code class="highlighter-rouge">Unigram.ipynb</code> reads in the text corpus <code class="highlighter-rouge">shakespeare.txt</code> using <a href="https://en.wikibooks.org/wiki/Introducing_Julia/Working_with_text_files">text-handling functions</a>, uses the <code class="highlighter-rouge">unique</code> function to get a list of the characters in the corpus, and counts the number of times each character appears in the corpus using <a href="https://en.wikibooks.org/wiki/Introducing_Julia/Strings_and_characters">string functions</a>. Run the code and you will see a bar graph of the counts. The characters are ordered so that the counts decrease from left to right.</p>
<ul>
<li>
<p>What are the three most frequent characters? What are the three least frequent characters?</p>
</li>
<li>
<p>Divide the corpus into training and test sets, reserving the last 10% of the corpus for the test set. Repeat the counts calculation above for the training set only. Divide the counts by the length of the training set to estimate the probability distribution of the <a href="https://en.wikipedia.org/wiki/Language_model#Unigram_models">unigram model</a>. What is the <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a> per letter of your unigram model for the training and test sets? What is the cross entropy per letter (in bits) for the training and test sets? Submit your code as well as numerical answers.</p>
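<p>The quantities asked for here fit in a few lines. The sketch below is Python rather than the Julia of the sample repo, and uses a stand-in string where <code>shakespeare.txt</code> would be read in; it evaluates only on the training set (evaluating on the test set additionally requires deciding what to do with characters that have zero training count):</p>

```python
import math
from collections import Counter

corpus = "to be or not to be that is the question"  # stand-in for shakespeare.txt
split = int(0.9 * len(corpus))
train, test = corpus[:split], corpus[split:]

counts = Counter(train)
p = {c: n / len(train) for c, n in counts.items()}  # unigram probabilities

# cross entropy per letter in bits, and perplexity = 2 ** cross_entropy
xent = -sum(math.log2(p[c]) for c in train) / len(train)
perplexity = 2 ** xent
```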
</li>
<li>
<p>Generate 1000 characters from your unigram model by sampling from the unigram probability distribution. Hint: If <code class="highlighter-rouge">p</code> is an <script type="math/tex">n</script>-dimensional vector, then <code class="highlighter-rouge">rand(Categorical(p))</code> will generate a random integer <code class="highlighter-rouge">i</code> from 1 to <script type="math/tex">n</script> with probability <code class="highlighter-rouge">p[i]</code>. The function <code class="highlighter-rouge">Categorical</code> comes from the <code class="highlighter-rouge">Distributions.jl</code> package and implements a <a href="https://en.wikipedia.org/wiki/Categorical_distribution">categorical distribution</a>.</p>
</li>
</ul>
</li>
<li>
<p>Bigram model. Write code to estimate the conditional probability distributions of the <a href="https://en.wikipedia.org/wiki/Language_model#n-gram_models">bigram model</a>. For your estimation code, use <a href="https://en.wikipedia.org/wiki/Additive_smoothing">additive smoothing</a>, which is important for handling rare bigrams. For fun you can read about the <a href="https://en.wikipedia.org/wiki/Sunrise_problem">sunrise problem</a>, “What is the probability that the sun will rise tomorrow?” As a solution, Laplace proposed the <a href="https://en.wikipedia.org/wiki/Rule_of_succession">rule of succession</a>, which is the origin of “add-one” smoothing, the most basic kind of additive smoothing.</p>
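<p>The add-&alpha; estimator can be sketched as follows (Python; it adds a pseudocount &alpha; to every bigram, so unseen pairs get nonzero probability and each conditional distribution still sums to one):</p>

```python
from collections import Counter

def bigram_probs(text, alpha=1.0):
    """P(next | prev) with additive smoothing:
    (count(prev, next) + alpha) / (count(prev) + alpha * V)."""
    vocab = sorted(set(text))
    V = len(vocab)
    pair = Counter(zip(text, text[1:]))
    prev = Counter(text[:-1])        # number of bigrams starting with each char
    return {(a, b): (pair[a, b] + alpha) / (prev[a] + alpha * V)
            for a in vocab for b in vocab}

probs = bigram_probs("abracadabra")
```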
<ul>
<li>
<p>What are the three most frequent bigrams? What are the three least frequent bigrams (excluding those that don’t occur at all)?</p>
</li>
<li>
<p>What is the perplexity per letter of your bigram model? What is the cross entropy per letter (in bits)? Submit numerical answers for both training and test sets, as well as your code.</p>
</li>
<li>
<p>Generate 1000 characters from your bigram model.</p>
</li>
</ul>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Recurrent_neural_network#Elman_networks_and_Jordan_networks">Elman network</a>. Here you will use a recurrent net to predict the next character in the Shakespeare corpus. It won’t produce samples indistinguishable from Shakespeare, but some structure should emerge from training.</p>
<ul>
<li>
<p>Look at <code class="highlighter-rouge">Elman.ipynb</code>, and write the equations for the recurrent network implemented in the code, using letters that correspond to the variable names in the code. What is the input vector? What is the output vector?</p>
</li>
<li>
<p>Modify the code to train only on the training set. Run the code for at least 10,000 steps if using the default <code class="highlighter-rouge">batch_size</code> and <code class="highlighter-rouge">seq_length</code>. (Reducing <code class="highlighter-rouge">batch_size</code> might speed training if you’re training on a CPU. If you change <code class="highlighter-rouge">batch_size</code> and <code class="highlighter-rouge">seq_length</code>, set <code class="highlighter-rouge">num_steps = (10000 * 128 * 50) / (batch_size * seq_length)</code>.) Now calculate the cross entropy per letter in bits on the test set. Is this better or worse than your unigram and bigram models?</p>
</li>
<li>
<p>Experiment with sampling from the model with different temperature parameters. Print one sample with temperature very close to 0 and one with an extremely high temperature. What does temperature qualitatively do to the samples?</p>
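<p>Temperature scaling simply divides the logits before the softmax; a quick sketch of the idea (the interface below is illustrative, not the notebook's actual sampling code):</p>

```python
import numpy as np

def sample_with_temperature(logits, temp, rng):
    """Low temp -> nearly deterministic argmax sampling;
    high temp -> nearly uniform (noisy) sampling."""
    z = np.asarray(logits) / temp
    p = np.exp(z - z.max())
    return rng.choice(len(p), p=p / p.sum())
```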
</li>
<li>
<p>If you want to reduce the cross entropy per letter and/or generate better samples you could train for longer and experiment with the optimization parameters. For example, the Julia code uses <code class="highlighter-rouge">AdamOptimizer</code>, and you can improve training by adjusting the parameters. The Python code also anneals the learning rate parameters, which is not implemented in the Julia code. (Don’t submit anything for this bullet).</p>
</li>
</ul>
</li>
<li>
<p>LSTM network.</p>
<ul>
<li>
<p>Modify the code to implement the equations of the <a href="https://en.wikipedia.org/wiki/Long_short-term_memory#Traditional_LSTM">traditional LSTM</a>.</p>
</li>
<li>
<p>Hint: as well as changing the <code class="highlighter-rouge">step</code> and <code class="highlighter-rouge">init_params</code> functions to include the LSTM equations and parameters, you need to ensure the other methods can handle the cell state as well as the hidden state. One way to do this is to add a <em>c</em> everywhere there is an <em>h</em>. Another way to do this is to change the size of <em>h</em> to <script type="math/tex">2\times</script> <em>hidden_size</em> everywhere outside of <code class="highlighter-rouge">step</code>, and within <code class="highlighter-rouge">step</code> (i) use <code class="highlighter-rouge">tf.split</code> to split the input <em>h</em> (size <script type="math/tex">2\times</script> <em>hidden_size</em>) into <em>hidden</em> and <em>cell</em> (both size <script type="math/tex">1\times</script> <em>hidden_size</em>), (ii) compute the new <em>hidden</em> and <em>cell</em>, and (iii) use <code class="highlighter-rouge">tf.concat</code> to merge them into a new <em>h</em> to return from <code class="highlighter-rouge">step</code>.</p>
</li>
<li>
<p>Once training is done, calculate the cross entropy per letter in bits on the test set. Is this better or worse than the Elman net?</p>
</li>
</ul>
</li>
</ol>
<p>Alex Beatson and Sebastian Seung</p>
<h2 id="grading-note">Grading note (2017-04-13)</h2>
<p>This is the distribution of exam scores, based on the 69 undergraduates in the class:
<img src="/assets/midterm.png" alt="midterm grades" /></p>
<p>We also computed scaled grades based on the midterm and psets 1-4, weighting the pset grades by 0.125 and the midterm grade by 0.3 using the formula:</p>
<p>scaled_grade = ((mean_pset / 70) × 0.125 + (midterm / 100) × 0.30) × (100 / (0.125 + 0.30))</p>
<p>This is the distribution of scaled grades for undergraduates:
<img src="/assets/scaled.png" alt="scaled grades" /></p>
<p>If we were to assign letter grades right now, the minimum scaled grade required for A, B and C would roughly be:</p>
<ul>
<li>55: C</li>
<li>70: B</li>
<li>85+: A</li>
</ul>
<p>The median score (69 for the midterm, 75 for the scaled grade) would be just above the cutoff for B+.</p>
<p>This should serve as a rough guide to how you’re doing so far. We can’t guarantee any particular distribution of final grades: the final median could be higher or lower than B+, and the percent score required for a given letter grade might increase or decrease depending on how the class does on the remainder of assessment.</p>
<p>Alex, Ari, Donghun, Qipeng and Sebastian</p>
<h2 id="problem-set-7-beta-version">Problem Set 7 (beta version) (2017-04-10)</h2>
<style>
.center-image
{
margin: 0 auto;
display: block;
}
</style>
<script type="math/tex; mode=display">\DeclareMathOperator*\trace{Tr}
\DeclareMathOperator*\argmax{argmax}
\DeclareMathOperator*\argmin{argmin}</script>
<h3 id="due-at-3pm-on-monday-apr-17-please-submit-to-blackboard-and-follow-the-general-guidelines-regarding-homework-assignments">Due at 3pm on Monday, Apr. 17. Please submit to Blackboard and follow the general <a href="https://cos495.github.io/general/2017/02/06/homework-guidelines.html">guidelines</a> regarding homework assignments.</h3>
<h3 id="theoretical-exercises">Theoretical exercises</h3>
<p>Word embeddings are representations of words as vectors. The embeddings are obtained using unsupervised learning, and are used as input features for other language models trained via supervised learning. Word2Vec is a popular word embedding model introduced by <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Mikolov et al. 2013</a> and featured in a <a href="https://www.tensorflow.org/tutorials/word2vec">TensorFlow tutorial</a>. Word2Vec has two main variants, Continuous Bag-of-Words (CBOW) and Skip-Gram (SG).</p>
<ol>
<li>
<p>The CBOW model is trained to predict the middle word <script type="math/tex">w_0</script> from a context window of <script type="math/tex">2k</script> surrounding words <script type="math/tex">w_{-k}, \ldots, w_{-1}, w_1, \ldots, w_k</script>. Here each word <script type="math/tex">w_i</script> is represented by an integer from 1 to <script type="math/tex">n</script>, where <script type="math/tex">n</script> is the size of the vocabulary. The prediction of the model is
\[
o_{w} = \sum_{a=1}^d \sum_{i=-k\atop i\neq 0}^k U_{aw}V_{aw_i}
\]
followed by softmax.</p>
<ul>
<li>Write the CBOW model as a multilayer perceptron with softmax output. Suppose that the input is represented by word counts, <script type="math/tex">x^0_w = \sum_i \delta_{ww_i}</script>, for all words <script type="math/tex">w</script> in the vocabulary. Use the notation <script type="math/tex">x^1_a</script> for the activities of the hidden layer, and the notation <script type="math/tex">x^2_w</script> for the activities of the output layer following softmax.</li>
<li>How many neurons are in the hidden layer?</li>
<li>How many neurons are in the output layer?</li>
<li>What are the connections from the input layer to the hidden layer?</li>
<li>What are the connections from the hidden layer to the output layer?</li>
<li>Write down the log loss for this model as a function of the prediction <script type="math/tex">x^2_w</script> and the correct word <script type="math/tex">w_0</script>.</li>
</ul>
<p>We can regard the <script type="math/tex">w</script>th column of the <script type="math/tex">d\times n</script> matrix <script type="math/tex">V</script> as an embedding of the <script type="math/tex">w</script>th word of the vocabulary in <script type="math/tex">d</script>-dimensional space. The same is true of the rows of the matrix <script type="math/tex">U</script>. Therefore training the CBOW model yields two word embeddings. This is an example of dimensionality reduction, since the embedding dimension <script type="math/tex">d</script> is generally chosen much smaller than the size <script type="math/tex">n</script> of the vocabulary.</p>
<p>In the Skip-Gram model, the central word <script type="math/tex">w_0</script> is used to predict the surrounding words <script type="math/tex">w_{-k}, \ldots, w_{-1}, w_1, \ldots, w_k</script>. You can think of this as <script type="math/tex">2k</script> separate prediction problems organized into a minibatch of size <script type="math/tex">2k</script>. This can also be regarded as a multilayer perceptron with a single hidden layer.</p>
</li>
</ol>
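<p>The CBOW prediction above can be written directly in numpy. This sketch (toy dimensions and random weights, chosen only for illustration) makes explicit that the hidden layer is a sum of context-word columns of <code>V</code>:</p>

```python
import numpy as np

def cbow_forward(context, U, V):
    """o_w = sum_a sum_i U[a, w] V[a, w_i], followed by softmax.
    U and V are d x n; `context` holds the 2k surrounding word indices."""
    x1 = V[:, context].sum(axis=1)   # hidden layer: d activities
    o = U.T @ x1                     # output layer: n pre-softmax scores
    e = np.exp(o - o.max())
    return x1, e / e.sum()           # x2 = softmax output

d, n = 5, 12                         # toy embedding dim and vocabulary size
rng = np.random.default_rng(2)
U, V = rng.normal(size=(d, n)), rng.normal(size=(d, n))
x1, x2 = cbow_forward([0, 3, 7, 9], U, V)   # 2k = 4 context word indices
log_loss = -np.log(x2[5])            # log loss if the middle word were w_0 = 5
```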
<h3 id="programming-exercises">Programming exercises</h3>
<p>The theoretical exercise above gives a rough idea of how word embeddings are trained. (For other important details you’ll have to consult the original papers.) In the following exercises you will visualize and use pre-trained word embeddings. We will use an embedding model called GloVe rather than Word2Vec for convenience. Pre-trained GloVe embeddings are available for a relatively small vocabulary size, which requires less RAM and computational cycles. Download <code class="highlighter-rouge">glove.6B.zip</code>, which is the smallest of the <a href="https://nlp.stanford.edu/projects/glove/">pre-trained GloVe embeddings</a>. If you unzip this file, you will see several <code class="highlighter-rouge">txt</code> files. Use <code class="highlighter-rouge">glove.6B.200d.txt</code> for the following exercises. Each line of this file is a word followed by 200 numbers, which constitute the embedding of that word.</p>
<ol>
<li>
<p>Visualization of Word Embeddings.</p>
<ul>
<li>
<p>Consider the following 10 seed words: mathematics, rose, cherry, toy, fear, approval, insect, steel, invention, music. For each seed word, list the 9 nearest neighbors based on <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of word embeddings. You will end up with a total of 100 words.</p>
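<p>Nearest neighbors under cosine similarity can be sketched as below (Python; here <code>emb</code> stands for a dict mapping words to their 200-dimensional GloVe vectors, loaded from <code>glove.6B.200d.txt</code> however you like):</p>

```python
import numpy as np

def nearest_neighbors(word, emb, k=9):
    """The k most cosine-similar words to `word`, excluding `word` itself."""
    v = emb[word]
    sims = [(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)), w)
            for w, u in emb.items() if w != word]
    return [w for _, w in sorted(sims, reverse=True)[:k]]
```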
</li>
<li>
<p>Compute the top two principal components of the 100 word embeddings, and project the embeddings onto the 2D subspace spanned by the principal components. Use the projections to plot the 100 words as points in a 2D scatter plot. Use one plot symbol for seed words, and another for neighbors. Assign a distinct color to each seed word along with its nearest neighbors. Include a legend mapping the seed words to their colors.</p>
</li>
</ul>
</li>
<li>
<p>Solving analogy problems with vector arithmetic. In the analogy problem, you are given three words (e.g. “king”, “man”, and “woman”) as input, and your task is to generate a fourth word (“queen”) that completes the analogy (“king” is to “man” as “queen” is to “woman”). One possible solution is to use word embeddings. The idea is that the “king” to “man” relationship is represented by the difference <script type="math/tex">\vec{v}_{king}-\vec{v}_{man}</script> between their word embeddings, and that an analogous relationship should have approximately the same difference between word embeddings. In other words, it should be the case that <script type="math/tex">\vec{v}_{king} - \vec{v}_{man} \approx \vec{v}_{queen} - \vec{v}_{woman}</script>.
Therefore the minimization
<script type="math/tex">\argmin_j | \vec{v}_{king} - \vec{v}_{man} + \vec{v}_{woman} - \vec{v}_j|^2</script>
over all words <script type="math/tex">j</script> in the vocabulary ought to recover “queen.” Solve this <a href="http://download.tensorflow.org/data/questions-words.txt">list of analogies</a> using word embeddings. Your code should predict the fourth word from the first three words in each analogy. Report the accuracy of your code. You can ignore analogies that contain words that are missing from the GloVe vocabulary.</p>
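<p>The minimization itself is a one-liner over the vocabulary. A toy sketch, with made-up 2D embeddings chosen so the vector arithmetic works out exactly (real GloVe vectors only satisfy the relation approximately):</p>

```python
import numpy as np

def solve_analogy(a, b, c, emb):
    """argmin_j |v_a - v_b + v_c - v_j|^2, excluding the three given words."""
    target = emb[a] - emb[b] + emb[c]
    return min(((np.sum((target - v) ** 2), w)
                for w, v in emb.items() if w not in (a, b, c)))[1]
```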
</li>
</ol>
<p>Kiran Vodrahalli and Sebastian Seung</p>
<h2 id="problem-set-6-alpha-version">Problem Set 6 (alpha version) (2017-04-03)</h2>
<style>
.center-image
{
margin: 0 auto;
display: block;
}
</style>
<script type="math/tex; mode=display">\DeclareMathOperator*\trace{Tr}
\DeclareMathOperator*\argmax{argmax}
\DeclareMathOperator*\argmin{argmin}</script>
<h3 id="due-at-3pm-on-monday-apr-10-please-submit-to-blackboard-and-follow-the-general-guidelines-regarding-homework-assignments">Due at 3pm on Monday, Apr. 10. Please submit to Blackboard and follow the general <a href="https://cos495.github.io/general/2017/02/06/homework-guidelines.html">guidelines</a> regarding homework assignments.</h3>
<h3 id="theoretical-exercises">Theoretical exercises</h3>
<ol>
<li>
<p>In the competitive learning algorithm, a population of <script type="math/tex">k</script> neurons learns from a series of input vectors. For each input <script type="math/tex">\vec{x}</script>, the <script type="math/tex">k</script> neurons compete to be closest,
\[
\begin{equation}\label{eq:Competition}
\forall a=1..k, \quad y_a =
\begin{cases}
1, \text{ if } a = \argmin_b |\vec{x}-\vec{w}_b|\cr
0, \text{ otherwise}
\end{cases}
\end{equation}
\]
The winning neuron is active and the rest are silent. Then the weight vectors are updated via the Hebb rule and weight decay,
\[
\begin{equation}\label{eq:HebbRuleWeightDecay}
\forall b=1..k, \quad\Delta \vec{w}_b \propto y_b(\vec{x}-\vec{w}_b)
\end{equation}
\]
Note that the weight decay is activity-dependent, happening only for the neuron that is active.</p>
<ul>
<li>
<p>Show that Eqs. (\ref{eq:Competition}) and (\ref{eq:HebbRuleWeightDecay}) are equivalent to performing
$$
\begin{align*}
a &= \argmin_b |\vec{x}-\vec{w}_b|\cr
\Delta \vec{w}_a &\propto \vec{x}-\vec{w}_a
\end{align*}
$$
after each input vector \(\vec{x}\). These equations are simpler than Eqs. (\ref{eq:Competition}) and (\ref{eq:HebbRuleWeightDecay}), but postsynaptic neural activity \(y_a\) does not appear explicitly in them, so the relation to Hebbian plasticity is not manifest.</p>
</li>
<li>
<p>Show that the competitive learning update is gradient descent on
$$
e(\vec{w}_1,\ldots,\vec{w}_k,\vec{x}) = \min_{\vec{y}\in U} \frac{1}{2}\sum_b y_b |\vec{x}-\vec{w}_b|^2
$$
where \(U\) is the set of all "one-hot" binary vectors.</p>
</li>
</ul>
<p>It follows that competitive learning is online gradient descent on
$$
E = \sum_{t=1}^T e(\vec{w}_1,\ldots,\vec{w}_k,\vec{x}_t)
= \min_Y\frac{1}{2}\sum_{t=1}^T \sum_{a=1}^k Y_{ta} |\vec{x}_t-\vec{w}_a|^2
$$
where the minimum is over all assignment matrices. Inside the minimum is the \(k\)-means cost function.</p>
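<p>The simplified update derived above (find the winner, move its weight vector toward the input) can be sketched in a few lines. This is a minimal illustration on synthetic 2D data, not the MNIST exercise below; the cluster locations, learning rate, and epoch count are arbitrary choices:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def competitive_learning(X, k, lr=0.1, epochs=20):
    """Online competitive learning: for each input, move the closest
    weight vector toward it (Hebb rule with activity-dependent decay)."""
    w = X[:k].astype(float).copy()  # initialize at the first k inputs
    for _ in range(epochs):
        for x in X:
            a = np.argmin(np.sum((x - w) ** 2, axis=1))  # winning neuron
            w[a] += lr * (x - w[a])                      # Δw_a ∝ x − w_a
    return w

# Two well-separated clusters; the k = 2 weight vectors should end up
# near the two cluster centers.
X = np.vstack([rng.normal(0, 0.1, (50, 2)),
               rng.normal(5, 0.1, (50, 2))])
w = competitive_learning(X, k=2)
print(np.round(w, 1))
```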
</li>
<li>
<p>Suppose that you apply competitive learning (or <script type="math/tex">k</script>-means clustering) with <script type="math/tex">k=2</script> neurons to the 2D input vectors shown in this diagram.
<img src="/assets/EllipseClusters.svg" alt="EllipseClusters" class="center-image" />
Competitive learning can converge to different steady states, depending on the initial conditions. Identify two very different steady states that are possible. (Each neuron should have at least one input vector assigned to it.) Which one is a global minimum of the cost function, and which is only a local minimum? Justify your answer.</p>
</li>
<li>
<p>Let \(\vec{x}_1,\ldots,\vec{x}_T\) be a series of input vectors. Suppose that the input vectors have zero mean,
$$
0 = \frac{1}{T}\sum_{t=1}^T\vec{x}_t
$$
Define the covariance matrix
$$
C = \frac{1}{T}\sum_{t=1}^T \vec{x}_t\vec{x}_t^\top
$$
(Don't confuse the transpose symbol \(\top\) with the number \(T\) of input vectors.) The diagonal elements of \(C\) are equal to the variances of the input elements,
$$
\sigma_i^2 = C_{ii} = \frac{1}{T}\sum_{t=1}^T X_{it}^2
$$
where \(X_{it}\) is the \(i\)th element of the \(t\)th input vector.</p>
<ul>
<li>
<p>The trace of a square matrix is defined as the sum of its diagonal elements. Suppose that matrix product <script type="math/tex">AB</script> is a square matrix. Prove that <script type="math/tex">\trace AB = \trace BA</script>. Note that <script type="math/tex">A</script> and <script type="math/tex">B</script> need not be square; only their product is required to be square.</p>
</li>
<li>
<p>Show that the sum of the eigenvalues of <script type="math/tex">C</script> is equal to the sum of the variances <script type="math/tex">\sum_i\sigma_i^2</script>.</p>
</li>
</ul>
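<p>Both claims are easy to check numerically before proving them. A minimal sketch with random data (all shapes and sample counts are arbitrary; rows of <code class="highlighter-rouge">X</code> are components and columns are samples, matching the <script type="math/tex">X_{it}</script> notation above):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Tr(AB) = Tr(BA) for non-square A (3x5) and B (5x3).
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 3))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Sum of eigenvalues of the covariance matrix = sum of the variances.
X = rng.normal(size=(4, 200))       # X[i, t] = i-th element of t-th input
X -= X.mean(axis=1, keepdims=True)  # center so the covariance formula applies
C = (X @ X.T) / X.shape[1]          # C = (1/T) sum_t x_t x_t^T
eigvals = np.linalg.eigvalsh(C)     # eigvalsh: C is symmetric
variances = np.diag(C)
assert np.isclose(eigvals.sum(), variances.sum())
print("checks passed")
```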
</li>
<li>Convergence of Oja’s Rule in the average velocity approximation. Consider a linear neuron <script type="math/tex">y=\vec{w}\cdot\vec{x}</script>, where <script type="math/tex">\vec{w}</script> is the weight vector, <script type="math/tex">\vec{x}</script> is the input vector, and <script type="math/tex">y</script> is the output of the neuron. Oja’s Rule is
\[
\begin{equation}\label{eq:OjasRule}
\Delta\vec{w} \propto y\vec{x} - y^2\vec{w}
\end{equation}
\]
<ul>
<li>
<p>Show that the average velocity approximation for Oja’s Rule is
\[
\begin{equation}\label{eq:OjasRuleAverageVelocity}
\Delta\vec{w} \propto C\vec{w}-\left(\vec{w}^\top C\vec{w}\right)\vec{w}
\end{equation}
\]
where <script type="math/tex">C=\langle\vec{x}\vec{x}^\top\rangle</script> is the covariance of the input vectors.</p>
</li>
<li>
<p>Show that a steady state of Eq. (\ref{eq:OjasRuleAverageVelocity}) is an eigenvector of <script type="math/tex">C</script> normalized to have length one.</p>
</li>
</ul>
<p>It turns out that the principal eigenvector of <script type="math/tex">C</script> (the one with the largest eigenvalue) is a stable steady state, while the other eigenvectors are unstable steady states, but you’re not required to prove that here.</p>
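<p>A numerical sketch of the average-velocity dynamics in Eq. (\ref{eq:OjasRuleAverageVelocity}), using a synthetic covariance matrix with an arbitrary eigenvalue spectrum (the learning rate and iteration count are made-up values); the iterate should converge to the unit principal eigenvector, up to sign:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# A symmetric covariance matrix with a clear principal eigenvector.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
C = Q @ np.diag([4.0, 1.0, 0.5, 0.2, 0.1]) @ Q.T

w = rng.normal(size=5)
eta = 0.05
for _ in range(2000):
    w = w + eta * (C @ w - (w @ C @ w) * w)  # Δw ∝ Cw − (wᵀCw)w

# Compare with the principal eigenvector from a full eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(C)
v1 = eigvecs[:, -1]  # eigenvector for the largest eigenvalue
print(np.linalg.norm(w), abs(w @ v1))
```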
<!--
- The continuous time version of the average velocity approximation is
\\[
\begin{equation}\label{eq:OjasRuleDiffEq}
\frac{d\vec{w}}{dt}=C\vec{w}-\left(\vec{w}^\top C\vec{w}\right)\vec{w}
\end{equation}
\\]
This differential equation will be analyzed in the remaining exercises. Prove that a solution of this equation must be of the form
\\[
\vec{w}(t)=\phi(t)e^{Ct}\vec{w}(0)
\\]
where $$\phi$$ is some scalar-valued function of time.
- Verify that
\\[
\vec{w}(t)=\frac{e^{Ct}\vec{w}(0)}{\sqrt{|e^{Ct}\vec{w}(0)|^{2}+1-|\vec{w}(0)|^{2}}}
\\]
is a solution of the differential equation (\ref{eq:OjasRuleDiffEq}).
- Suppose that $$\left|\vec{w}\right|^{2}<2$$. Define the reconstruction
error
\\[
\begin{align}
E & = \frac{1}{2}\left\langle \left|\vec{x}-\vec{w}\vec{w}^\top\vec{x}\right|^{2}\right\rangle \cr
& = \frac{1}{2}\trace\left(I-\vec{w}\vec{w}^\top\right)C\left(I-\vec{w}\vec{w}^{T}\right)
\end{align}
\\]
where $$C=\left\langle \vec{x}\vec{x}^\top\right\rangle $$. Show that the reconstruction error
is nonincreasing ($$dE/dt \leq 0$$) under the dynamics of Eq. (\ref{eq:OjasRuleDiffEq}), and that $$dE/dt$$ vanishes only at steady states. (Hint: There are a number of ways to do this. One is to decompose
both $$\partial E/\partial\vec{w}$$ and $$d\vec{w}/dt$$ into components that are parallel and perpendicular to $$\vec{w}$$.)
It follows that Eq. (\ref{eq:OjasRuleDiffEq}) converges to a minimum of the reconstruction error, with some other weak technical assumptions. Control theorists say that the reconstruction error $$E$$ is a Lyapunov function of the dynamics. This is somewhat different from the Lyapunov function derived in class, which was the projection error $$\trace\left(I-\hat{\vec{w}}\hat{\vec{w}}^\top\right)C$$ where $$\hat{\vec{w}}$$ is the unit vector in the direction of $$\vec{w}$$.
-->
</li>
</ol>
<h3 id="programming-exercises">Programming exercises</h3>
<ol>
<li>
<p>Write code that implements the competitive learning algorithm for the MNIST dataset.</p>
<ul>
<li>
<p>Run your code on all the training images with <script type="math/tex">k=10</script>. Display the 10 weight vectors as images. Do they look like the 10 digit classes? Why or why not? Do they look like examples in the dataset? Why or why not?</p>
</li>
<li>
<p>Run your code on just the “two” images with <script type="math/tex">k=5</script>. Display the 5 weight vectors as images. Describe qualitatively what the algorithm has learned.</p>
</li>
</ul>
<p>There are two real-world situations in which competitive learning might be used instead of <script type="math/tex">k</script>-means clustering. First, competitive learning can be used to cluster datasets larger than available RAM, since the online algorithm holds only one input vector in RAM at any given time. Second, competitive learning can be useful for tracking clusters that vary slowly with time.</p>
</li>
<li>
<p>Write code that implements Oja’s Rule for the MNIST dataset. Oja’s Rule can be viewed as an efficient online method of finding the principal component of a series of data vectors. Alternatively, one could compute the covariance matrix of the data, find all its eigenvalues and eigenvectors, and discard all but the principal eigenvector. However, this approach is inefficient in both memory and time if only the principal component is desired.</p>
<ul>
<li>
<p>Apply your Oja’s Rule code to the “two” class of MNIST digits.</p>
</li>
<li>
<p>Compute the covariance matrix of the “two” digits in the MNIST dataset. The definition of the covariance matrix given above assumed that the input vectors have zero mean. Therefore you should “center” the input vectors by computing their mean, and then subtract the mean from each input vector. You should do this for Oja’s Rule also.</p>
</li>
<li>
<p>Find all eigenvalues and eigenvectors of the covariance matrix. Sort the eigenvectors by decreasing eigenvalue and display the top 10 eigenvectors as images. The top eigenvector should look the same as the result of Oja’s Rule (possibly up to a sign flip, since eigenvectors are defined only up to sign).</p>
</li>
</ul>
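<p>The centering and sorted eigendecomposition can be sketched as follows, with random correlated data standing in for the flattened “two” images (the dimensions are made up; for MNIST you would use 784-pixel vectors and reshape each eigenvector to 28×28 for display):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the flattened images: T samples, d pixels each.
T, d = 500, 20
X = rng.normal(size=(T, d)) @ rng.normal(size=(d, d))  # correlated data

# Center: subtract the mean "image" from every sample.
X = X - X.mean(axis=0)

# Covariance matrix C = (1/T) sum_t x_t x_t^T.
C = (X.T @ X) / T

# eigh returns eigenvalues in ascending order for a symmetric matrix,
# so reorder everything by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

top10 = eigvecs[:, :10]  # for MNIST: reshape each column for display
print(eigvals[:3])
```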
</li>
</ol>Sebastian SeungHow to use Amazon Web Services2017-03-27T14:18:40+00:002017-03-27T14:18:40+00:00/2017/03/27/AWS<h1 id="should-i-use-amazons-cloud-service-for-my-homework">Should I use Amazon’s cloud service for my homework?</h1>
<p>In Problem Set 5, you are going to use Google’s deep learning framework TensorFlow. You are free to install TensorFlow on your local computer and do your work there. We are also providing the option to use Amazon EC2 instances in the cloud. This is not strictly necessary, but there are several incentives to do this:</p>
<ol>
<li>We have configured <a href="https://en.wikipedia.org/wiki/Amazon_Machine_Image">Amazon Machine Images</a> (AMIs) for your use, which should obviate the need to install TensorFlow, Python, Julia, etc.</li>
<li>Once you have developed your code, you may want to run it on a GPU so it’s faster. If you do not have an NVIDIA GPU in your local computer, you can access one through EC2. For example, you can use the <code class="highlighter-rouge">p2.xlarge</code> instance, which is half a K80 GPU.</li>
<li>You may want experience in using cloud services, and Amazon is giving $100 free credit for you to learn.</li>
</ol>
<h1 id="how-to-sign-up-for-amazon-web-services-aws">How to sign up for Amazon Web Services (AWS)</h1>
<ol>
<li>Create an AWS account and claim $100 free credit. Note that you will have to provide your own credit card, which will be charged if you use up the $100 free credit. (If this worries you a lot, then perhaps you should stick with your local computer.) Therefore cost control is important, and tips about it are given in a later section.
<ul>
<li><a href="https://drive.google.com/file/d/0B-EJQbRhH_OtR0l6bm1kT2hfenc/view?usp=sharing">AWS Student Onboarding</a></li>
<li><strong>Note:</strong> AWS Account ID can be found at <a href="https://console.aws.amazon.com/billing/home?#/account">AWS Billing Console</a></li>
</ul>
</li>
<li>Create a key pair. You should store your private key in a <code class="highlighter-rouge">.pem</code> file somewhere on your computer. You will need it, for example, if you need to establish an <code class="highlighter-rouge">ssh</code> connection to an EC2 instance.
<ul>
<li><a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:sort=keyName">AWS Key Pair</a></li>
</ul>
</li>
</ol>
<h1 id="basic-information-about-costs">Basic information about costs</h1>
<p>A virtual machine running in Amazon’s cloud is called an <em>instance</em>. EC2 offers a bewildering number of <a href="https://aws.amazon.com/ec2/instance-types/">instance types</a>. The more powerful the instance, the higher the <a href="http://www.ec2instances.info/">price</a>. New AWS accounts receive <a href="https://aws.amazon.com/free/faqs/">free tier access</a> for 12 months, which means that the weakest instances are essentially free. However, these may not have enough RAM for Problem Set 5. You will probably want to use at least <code class="highlighter-rouge">t2.medium</code> or <code class="highlighter-rouge">t2.large</code>, which cost less than 10 cents per hour. A GPU instance like <code class="highlighter-rouge">p2.xlarge</code> costs closer to one dollar per hour. Note that the time unit for charges is one hour; if you use 10 minutes you are still charged for the full hour.</p>
<h1 id="run-tensorflow-in-the-cloud-from-a-jupyter-notebook">Run TensorFlow in the cloud from a Jupyter notebook</h1>
<p>We have remixed the Amazon deep learning AMI to add Julia support for TensorFlow.</p>
<ol>
<li>Launch the <a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-7c348a6a">remixed AMI</a>. The <code class="highlighter-rouge">t2.medium</code> instance should be sufficiently powerful to start, and costs less than 10 cents per hour. Later on, you may want to use the more expensive <code class="highlighter-rouge">p2.xlarge</code> instance. (That’s the cheapest <code class="highlighter-rouge">p2</code> instance, at less than one dollar per hour.)</li>
<li>Follow Amazon’s <a href="https://aws.amazon.com/blogs/ai/the-aws-deep-learning-ami-now-with-ubuntu/">instructions</a> for its original deep learning AMI to set up communication with your instance through an IJulia notebook.</li>
<li>You can ignore the “security group” instructions if you are lazy. If the instance will be active for a short time, it’s unlikely that someone will break into it.</li>
<li>Once the Jupyter dashboard is open, you can ignore the final Amazon instructions about MXNet and switch to the instructions in Problem Set 5.</li>
<li>Select <code class="highlighter-rouge">New > Terminal</code> if you want to interact with a shell. This will be helpful when you <code class="highlighter-rouge">git clone</code> the sample code for the homework assignment.</li>
</ol>
<h1 id="another-ami-not-requiring-ssh-key-but-only-for-python">Another AMI not requiring ssh key but only for Python</h1>
<p>If you don’t want to deal with ssh keys, this AMI offers a slightly less secure but slightly easier to use option for Python only.</p>
<ol>
<li><a href="http://www.bitfusion.io/2016/05/09/easy-tensorflow-model-training-aws/">Launch Bitfusion AMI</a>
<ul>
<li>Make sure not to choose an expensive instance type if you don’t need it. <code class="highlighter-rouge">t2.medium</code> should be sufficient to start.</li>
<li>After launching, it may take up to 5 minutes before the Jupyter notebook is ready.</li>
</ul>
</li>
<li>Follow the Bitfusion instructions to connect with your instance through a Python/Jupyter notebook.</li>
<li>Select <code class="highlighter-rouge">New > Terminal</code> if you want to interact with a shell. This will be helpful when you <code class="highlighter-rouge">git clone</code> the sample code for the homework assignment.</li>
</ol>
<h1 id="cost-control">Cost control</h1>
<p>There is <strong><a href="https://forums.aws.amazon.com/thread.jspa?threadID=58127">no way</a></strong> to automatically limit your costs in AWS.</p>
<p>Costs incurred above the $100 credit are <strong>automatically charged</strong> to your registered credit card!</p>
<p>Here are some tips to keep you under budget!</p>
<ul>
<li><strong>STOP</strong> your instances when you’re not using them
<ul>
<li>Do <strong>NOT Terminate</strong>; you will lose all your work.</li>
<li>Start them again when you return to work. <strong>Warning:</strong> the IP address may change after a restart!
<img src="https://cloud.githubusercontent.com/assets/1668987/24265114/c976a346-0fd8-11e7-9bde-cf18cdd0680b.png" alt="image" /></li>
</ul>
</li>
<li>TERMINATE <strong>only</strong> when all your data has been migrated off the instance and you are finished with the assignment.</li>
<li>Use <a href="http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/free-tier-alarms">Billing Alarms</a> to warn you when you are close to using up your $100 credit.</li>
<li>Create an Idle alarm to warn you if you forgot to stop the instance
<ol>
<li>Select instance and create alarm
<img src="https://cloud.githubusercontent.com/assets/1668987/24265190/0d1c6a86-0fd9-11e7-85c7-a71dd3858ee7.png" alt="image" /></li>
<li>Create a topic to send yourself an email
<img src="https://cloud.githubusercontent.com/assets/1668987/24264665/8f73565e-0fd7-11e7-9d31-4815ba0223ef.png" alt="image" /></li>
<li>Set the CPU Utilization threshold (Note: you can also configure the alarm to stop the instance automatically)
<img src="https://cloud.githubusercontent.com/assets/1668987/24265020/8a7f030e-0fd8-11e7-9fd6-490524642155.png" alt="image" /></li>
</ol>
</li>
</ul>Sebastian Seung and William WongProblem Set 5 (alpha version)2017-03-27T14:13:18+00:002017-03-27T14:13:18+00:00/2017/03/27/pset5<p>\(
\DeclareMathOperator*\argmax{argmax}
\)</p>
<h3 id="due-at-3pm-on-monday-apr-3-please-submit-to-blackboard-and-follow-the-general-guidelines-regarding-homework-assignments">Due at 3pm on Monday, Apr. 3. Please submit to Blackboard and follow the general <a href="https://cos495.github.io/general/2017/02/06/homework-guidelines.html">guidelines</a> regarding homework assignments.</h3>
<p>This assignment is based on the <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-10 dataset</a>. This consists of 32x32 RGB images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The images are rather low resolution, but still generally recognizable.</p>
<p>We assume that you have set up a Jupyter notebook that is communicating with an EC2 instance, following our AWS <a href="https://cos495.github.io/general/2017/03/27/AWS.html">instructions</a>. (Or you can do the whole thing locally; no instructions provided but you can find them online elsewhere.) If you are using Jupyter to communicate with an EC2 instance, select <code class="highlighter-rouge">New > Terminal</code> if you want to interact with a shell. (You can also ssh to the instance using your private key, as explained in the other blog post.) Type <code class="highlighter-rouge">git clone https://github.com/cos495/code.git</code>. From the Jupyter dashboard, navigate to <code class="highlighter-rouge">code</code> and then open the <code class="highlighter-rouge">CIFAR10</code> or <code class="highlighter-rouge">CIFAR10_Python</code> notebook.</p>
<p>Google supports a Python API for use of TensorFlow. The Julia API is similar, and is a community effort <a href="https://malmaud.github.io/tfdocs/index.html">documented here</a>.</p>
<ol>
<li>
<p>Understanding the architecture.</p>
<ul>
<li>
<p>How many kernels are in the first convolutional layer? TensorFlow uses a single four-dimensional “filter” tensor for each convolutional layer; however, for this and the following questions, please calculate the number of <strong>two-dimensional</strong> kernels, i.e., there is one kernel for every pair of input and output channels.</p>
</li>
<li>
<p>How many kernels are in the second convolutional layer?</p>
</li>
<li>
<p>How many kernels are in the third convolutional layer?</p>
</li>
<li>
<p>After the third convolutional layer, how many feature maps are there? How large is each feature map?</p>
</li>
<li>
<p>How many fully connected layers are there, and how large are the weight matrices?</p>
</li>
</ul>
</li>
<li>
<p>Visualization. You can convert the first layer kernels into a Julia array using <code class="highlighter-rouge">run(session, weights["wc1"])</code> or into a Numpy array using <code class="highlighter-rouge">sess.run(weights['wc1'])</code>. Display the kernels as RGB images.</p>
</li>
<li>
<p>Data augmentation. If you run for at least 2000 minibatches, you should see evidence of overfitting in the loss on the validation set. Write code that left-right reflects a random subset of images in the minibatch. Is overfitting reduced? Does the validation error improve?</p>
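<p>A minimal numpy sketch of the reflection step, assuming minibatches stored as arrays of shape (batch, height, width, channels); the flip probability and helper name are arbitrary choices, and in the actual notebook you would apply this to each minibatch before feeding it to the network:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

def random_flip_lr(batch, p=0.5):
    """Left-right reflect a random subset of images in a minibatch.
    Assumes batch shape (N, height, width, channels)."""
    out = batch.copy()
    mask = rng.random(len(batch)) < p
    out[mask] = out[mask, :, ::-1, :]  # reverse the width axis
    return out

batch = rng.normal(size=(8, 32, 32, 3))
aug = random_flip_lr(batch)
print(aug.shape)
```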
</li>
<li>
<p>Architecture modifications. Submit code and learning curves for two modified architectures of your choice. For example, you could vary the number of</p>
<ul>
<li>
<p>feature maps in a convolutional layer</p>
</li>
<li>
<p>convolutional layers</p>
</li>
<li>
<p>maxpooling layers</p>
</li>
<li>
<p>fully connected layers</p>
</li>
</ul>
</li>
<li>
<p>Open competition (extra credit). Can you beat the <a href="http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html">state-of-the-art</a> or at least your classmates? Try any trick you’ve learned from class or papers, or invent your own. For example, you could try other architectures, dropout, or other data augmentation techniques such as random translations (horizontal and vertical) and color jitter. <strong>This must be your own work</strong>, i.e., you can look up both papers and code to learn more about different tricks, but you should implement everything yourself rather than cloning or copying directly from someone else’s implementation. You’re honor-bound to not train or do model selection based on the test set. Once you’ve selected your model based on the validation set, quantify its error on the test set and then submit. The three submissions with the best test error will receive modest prizes.</p>
</li>
</ol>
<p>For the competition, please submit by doing the following:</p>
<ul>
<li>
<p>Submit entries separately from your problem set by sending an email to cos495.automated@gmail.com from your @princeton email.</p>
</li>
<li>Use the subject “CIFAR10 $SCORE”, where $SCORE is the <strong>accuracy percent</strong> (higher is better). E.g. “CIFAR10 97.1”.</li>
<li>
<p>In the body of the email, write a very brief description of what you did. One sentence is fine (e.g. “Dropout 0.5 at the final layer plus fractional max pooling after all convolutions.”)</p>
</li>
<li>
<p>Upload the output files from saving a fully trained model to a folder in Google Drive. Set the sharing settings to “anyone with the link can view”, and add a link to the folder to the body of the email.</p>
</li>
<li>
<p>Upload to the same folder in Google Drive a notebook which can either reload and test the saved model, or train a new model from scratch.</p>
</li>
<li>
<p>Put a line at the top of the notebook, “RELOAD = True”, and write some code so that RELOAD determines whether the model is reloaded or trained. If we set RELOAD to True and run the notebook, the model should be reloaded and then tested (no training). If we set RELOAD to False and run the notebook, the model should be trained from scratch.</p>
</li>
<li>Please ensure that we can reload the model or train the model from scratch just by downloading the folder from Google Drive, setting “RELOAD” to the correct value, and running all the cells in the notebook.</li>
<li>
<p>For info on how to save and reload models, see <a href="https://www.tensorflow.org/api_docs/python/tf/train/Saver">the TensorFlow Saver documentation</a> and/or look at how savers are used in some code online. It’s well worth understanding how to save / reload models.</p>
</li>
<li>If you want to see what’s possible and what tricks might help you improve your accuracy, check out the papers mentioned under CIFAR10 <a href="http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html">here</a>. Some tricks would be hard to implement, but some simple ones could help a lot. Reading these papers could also help with choosing hyperparameters.</li>
</ul>Sebastian Seung and Ari SeffConvolution animations2017-03-14T02:55:19+00:002017-03-14T02:55:19+00:00/2017/03/14/convolution-animations<p>Just in time for your midterm exam preparations: here is the site with fun <a href="https://github.com/vdumoulin/conv_arithmetic">animations</a> of convolution, strided convolution, and transposed strided convolution (also known as “deconvolution”). The animations also illustrate no/half/full padding, known in MATLAB-land as the “valid”, “same”, and “full” options for the <code class="highlighter-rouge">conv</code> function.</p>
<p>You can also download the paper that accompanies the animations (or is it the other way around?).</p>
<ul>
<li>Vincent Dumoulin and Francesco Visin, <a href="https://arxiv.org/abs/1603.07285">A guide to convolution arithmetic for deep learning</a>.</li>
</ul>Sebastian SeungConvolution and correlation2017-03-05T14:36:45+00:002017-03-05T14:36:45+00:00/2017/03/05/convolution<p>\(
\newcommand{\Conv}{\mathop{\rm Conv}}
\newcommand{\Corr}{\mathop{\rm Corr}}
\)</p>
<p>These notes contain basic information about two mathematical operations that are commonly used in signal processing, <em>convolution</em> and <em>correlation</em>. Convolution is used to linearly filter a signal.
Correlation is used to characterize the statistical dependencies between two signals.</p>
<h1 id="convolution">Convolution</h1>
<p>Consider two time series, <script type="math/tex">g_i</script> and <script type="math/tex">h_i</script>, where the index <script type="math/tex">i</script> runs from <script type="math/tex">-\infty</script> to <script type="math/tex">\infty</script>. Their convolution is defined as</p>
<script type="math/tex; mode=display">\begin{equation}
\label{eq:ConvolutionInfinite}
(\vec{g}\ast \vec{h})_i = \sum_{j=-\infty}^\infty g_{i-j} h_j
\end{equation}</script>
<p>This definition is applicable to time series of infinite length. If
<script type="math/tex">\vec{g}</script> and <script type="math/tex">\vec{h}</script> are finite, they can be extended to infinite length by
adding zeros at both ends. After this trick, called zero padding, the
definition in Eq. (\ref{eq:ConvolutionInfinite}) becomes applicable.
For example, Eq. (\ref{eq:ConvolutionInfinite}) reduces to</p>
<script type="math/tex; mode=display">\begin{equation}\label{eq:ConvolutionFinite}
(\vec{g}\ast\vec{h})_i = \sum_{j=0}^{n-1} g_{i-j}h_j
\end{equation}</script>
<p>for the finite time series <script type="math/tex">h_0,\ldots,h_{n-1}</script>.</p>
<!---
Another trick for
turning a finite time series into an infinite one is to repeat it over
and over. This is sometimes called periodic boundary conditions, and
will be encountered later in our study of Fourier analysis.
--->
<p><em>Exercise</em>. Convolution is commutative and associative. Prove that <script type="math/tex">g\ast h=h\ast
g</script> and <script type="math/tex">f\ast(g\ast h) = (f\ast g)\ast h</script>.</p>
<p><em>Exercise</em>. Convolution is distributive over addition. Prove that <script type="math/tex">(g_1+g_2)\ast h
= g_1 \ast h + g_2 \ast h</script>. This means that filtering a signal via
convolution is a linear operation.</p>
<p>Although <script type="math/tex">g</script> and <script type="math/tex">h</script> are treated symmetrically by convolution,
they usually have very different meanings. Typically, one is a
signal that goes on for a long time. The other is concentrated
near time zero, and is called a kernel or filter. The output of the convolution
is also a signal, a filtered version of the input signal.</p>
<p>In Eq. (\ref{eq:ConvolutionFinite}), we chose <script type="math/tex">h_i</script> to be zero for
all negative <script type="math/tex">i</script>. This is called a <em>causal filter</em>, because
<script type="math/tex">g\ast h</script> is affected by <script type="math/tex">h</script> in the present and past, but not in
the future. In some contexts, the causality constraint is not
important, and one can take <script type="math/tex">h_{-M},\ldots,h_M</script> to be nonzero, for
example.</p>
<p>Formulas are nice and compact, but now let’s draw some diagrams to see
how convolution works. To take a concrete example, assume a causal
filter <script type="math/tex">(h_0,\ldots,h_{n-1})</script>. Then the <script type="math/tex">i</script>th component of the
convolution <script type="math/tex">(g\ast h)_i</script> involves aligning <script type="math/tex">g</script> and <script type="math/tex">h</script> this way:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{array}{ccccccccccc}
\cdots &g_{i-n-1}&g_{i-n}&g_{i-n+1}&\cdots &g_{i-2}&g_{i-1}&g_i & g_{i+1} &g_{i+2}&\cdots\\
\cdots &0&0&h_{n-1}&\cdots&h_2 &h_1 &h_0 & 0 & 0 &\cdots\\
\end{array}
\end{equation} %]]></script>
<p>In words, <script type="math/tex">(g\ast h)_i</script> is computed by looking at the signal <script type="math/tex">g</script>
through a window of length <script type="math/tex">n</script> ending at time <script type="math/tex">i</script> and extending back
to time <script type="math/tex">i-n+1</script>. The weighted sum of the signal values in the window is
taken, using the coefficients given by <script type="math/tex">h</script>.</p>
<p>Note that <script type="math/tex">h</script> is left-right “flipped” in the diagrams. This is because of the minus sign that appears in the indices of convolution. The flipping may seem counterintuitive, but it’s necessary to make convolution commutative. It also means that the kernel can be interpreted as the impulse response function, which will be defined shortly.</p>
<p>If the signal <script type="math/tex">g</script> has a finite length, then the diagram looks
different when the window sticks out over the edges. Consider a signal
<script type="math/tex">g_0,\ldots,g_{m-1}</script>, and look at the two extreme cases where the
window includes only one time bin of the signal. One extreme is
<script type="math/tex">(g\ast h)_0</script>, which can be visualized as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{array}{cccccccccc}
\cdots &0 &\cdots &0 &g_0 & g_1 &\cdots &g_{m-1}& 0 &\cdots\\
\cdots &h_{n-1}&\cdots &h_1 &h_0 & 0 & 0 & 0 & 0 &\cdots\end{array}
\end{equation} %]]></script>
<p>The other extreme is <script type="math/tex">(g\ast h)_{m+n-2}</script>, which can be visualized as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{array}{cccccccccc}
\cdots &g_0 &g_1 &\cdots &g_{m-1}&\cdots & 0 & 0 & 0 &\cdots\\
\cdots &0 &0 &\cdots &h_{n-1}&\cdots &h_1 &h_0 & 0 &\cdots
\end{array}
\end{equation} %]]></script>
<p>The range <script type="math/tex">(g\ast h)_0\ldots (g\ast h)_{m+n-2}</script> has length <script type="math/tex">m+n-1</script>. Outside this “full” range of the convolution, the elements of <script type="math/tex">g\ast h</script> must vanish.</p>
<p>The range <script type="math/tex">(g\ast h)_{n-1},\ldots,(g\ast h)_{m-1}</script> has length <script type="math/tex">m-n+1</script>. Inside this “valid” range, the kernel does not stick out beyond the borders of the signal; the kernel never encounters the zero padding of the signal. (Here we assume that the signal <script type="math/tex">g</script> is at least as long as the kernel <script type="math/tex">h</script>, i.e., <script type="math/tex">m\geq n</script>.)</p>
<h1 id="using-the-conv-function-in-julia-and-matlab">Using the <code class="highlighter-rouge">conv</code> function in Julia and MATLAB.</h1>
<p>Suppose that <script type="math/tex">\vec{f}=\vec{g} \ast \vec{h}</script>, where <script type="math/tex">\vec{g}</script> has elements <script type="math/tex">g_0,g_1,\ldots,g_{M-1}</script> and <script type="math/tex">\vec{h}</script> has elements <script type="math/tex">h_0,h_1,\ldots,h_{N-1}</script>.</p>
<p>Then <code class="highlighter-rouge">conv(g,h)</code> returns the “full” convolution <script type="math/tex">f_0,f_1,\ldots,f_{M+N-2}</script>.
It is the only option for Julia and the default option in MATLAB.</p>
<p>Alternatively, <code class="highlighter-rouge">conv(g,h,"valid")</code> returns the “valid” convolution <script type="math/tex">f_{\min(M,N)-1},\ldots,f_{\max(M,N)-1}</script>. This extension to Julia’s built-in <code class="highlighter-rouge">conv</code> is provided in the sample code accompanying the homework. The “valid” option is available in MATLAB.</p>
<p>The NumPy function is called <code class="highlighter-rouge">convolve</code> and has both “full” and “valid” modes, like MATLAB.</p>
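<p>As a cross-check on what these library routines compute, the “full” convolution can also be coded directly from the definition. A sketch comparing a direct double loop against <code class="highlighter-rouge">np.convolve</code> (the input values are arbitrary):</p>

```python
import numpy as np

def conv_full(g, h):
    """Direct implementation of (g*h)_j = sum_k g_{j-k} h_k, with zero padding."""
    m, n = len(g), len(h)
    f = np.zeros(m + n - 1)
    for j in range(m + n - 1):
        for k in range(n):
            if 0 <= j - k < m:          # skip terms that fall in the zero padding
                f[j] += g[j - k] * h[k]
    return f

g = np.array([2.0, 0.0, -1.0, 3.0])
h = np.array([1.0, 1.0, 0.5])
assert np.allclose(conv_full(g, h), np.convolve(g, h))
```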
<p>Note that the inputs to the <code class="highlighter-rouge">conv</code> function don’t include information about absolute time.
In general, the inputs could be interpreted as <script type="math/tex">g_{M_1},\ldots,g_{M_2}</script> and <script type="math/tex">h_{N_1},\ldots,h_{N_2}</script>.
Then the output should be interpreted as <script type="math/tex">f_{M_1+N_1},\ldots,f_{M_2+N_2}</script>. In other words, shifting either <script type="math/tex">g</script> or
<script type="math/tex">h</script> in time is equivalent to shifting <script type="math/tex">g\ast h</script> in time.</p>
<p>For example, suppose that <script type="math/tex">g</script> is a signal, and <script type="math/tex">h</script> represents an
acausal filter, with <script type="math/tex">% <![CDATA[
N_1<0 %]]></script> and <script type="math/tex">N_2>0</script>. Then the first element of
<script type="math/tex">f</script> returned by <code class="highlighter-rouge">conv</code> is <script type="math/tex">f_{M_1+N_1}</script>, and the last is
<script type="math/tex">f_{M_2+N_2}</script>. So discarding the first <script type="math/tex">|N_1|</script> and last <script type="math/tex">N_2</script>
elements of <script type="math/tex">f</script> leaves us with <script type="math/tex">f_{M_1},\ldots,f_{M_2}</script>. This is
time-aligned with the signal <script type="math/tex">g_{M_1},\ldots,g_{M_2}</script>, and has the
same length.</p>
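<p>The trimming described above can be sketched as follows. The kernel here is a hypothetical acausal smoothing filter with elements <script type="math/tex">h_{-1}, h_0, h_1</script> (so <script type="math/tex">N_1=-1</script>, <script type="math/tex">N_2=1</script>); the values are illustrative only:</p>

```python
import numpy as np

h = np.array([0.25, 0.5, 0.25])       # hypothetical kernel h_{-1}, h_0, h_1
N1, N2 = -1, 1

g = np.arange(10, dtype=float)        # signal g_{M1}, ..., g_{M2}

f = np.convolve(g, h)                 # elements f_{M1+N1} .. f_{M2+N2}
f_aligned = f[abs(N1) : len(f) - N2]  # drop |N1| leading and N2 trailing elements

assert len(f_aligned) == len(g)       # time-aligned with g, same length
```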
<h1 id="impulse-response">Impulse response</h1>
<p>Consider the signal consisting of a single impulse at time zero,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\delta_j =
\left\{
\begin{array}{ll}
1,& j=0\\
0,& j\neq 0
\end{array}
\right.
\end{equation} %]]></script>
<p>The convolution of this signal with a kernel <script type="math/tex">h</script> is</p>
<script type="math/tex; mode=display">\begin{equation}
(\delta\ast h)_j = \sum_k \delta_{j-k}h_k = h_j
\end{equation}</script>
<p>which is just the kernel <script type="math/tex">h</script> again. In other words <script type="math/tex">h</script>, is the
response of the kernel to an impulse, or the <em>impulse response function</em>. If the impulse is displaced from time 0 to time <script type="math/tex">i</script>, then
the result of the convolution is the kernel <script type="math/tex">h</script>, displaced by <script type="math/tex">i</script> time
steps.</p>
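<p>Both facts are easy to verify numerically. A sketch with an arbitrary illustrative kernel:</p>

```python
import numpy as np

h = np.array([3.0, 1.0, 4.0, 1.0, 5.0])    # an arbitrary kernel

# An impulse at time zero within a length-9 signal window.
delta = np.zeros(9)
delta[0] = 1.0

response = np.convolve(delta, h)
# The convolution reproduces the kernel (followed by zero padding).
assert np.allclose(response[:len(h)], h)
assert np.allclose(response[len(h):], 0.0)

# Displacing the impulse by i steps displaces the kernel by i steps.
i = 3
shifted = np.zeros(9)
shifted[i] = 1.0
assert np.allclose(np.convolve(shifted, h)[i : i + len(h)], h)
```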
<p>Elsewhere you may have seen the “Kronecker delta” notation
<script type="math/tex">\delta_{ij}</script>, which is equivalent to <script type="math/tex">\delta_{i-j}</script>. The Kronecker
delta is just the identity matrix, since it is equal to one only for
the diagonal elements <script type="math/tex">i=j</script>. You can think about convolution with
<script type="math/tex">\delta_j</script> as multiplication by the identity matrix
<script type="math/tex">\delta_{ij}</script>. More generally, the following exercise shows that
convolution is equivalent to multiplication of a matrix and a vector.</p>
<p><em>Exercise</em>. Matrix form of convolution.
Show that the convolution of <script type="math/tex">g_0,g_1,g_2</script> and <script type="math/tex">h_0,h_1,h_2</script> can be written as</p>
<script type="math/tex; mode=display">\begin{equation}
g \ast h = G h
\end{equation}</script>
<p>where the matrix <script type="math/tex">G</script> is defined by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}\label{eq:ConvolutionMatrix}
G=\left(
\begin{array}{lll}
g_0 &0 &0\\
g_1 &g_0 &0\\
g_2 &g_1 &g_0\\
0 &g_2 &g_1\\
0 &0 &g_2
\end{array}
\right)
\end{equation} %]]></script>
<p>and <script type="math/tex">g\ast h</script> and <script type="math/tex">h</script> are treated as column vectors.</p>
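<p>This matrix form can be checked numerically. A sketch building <script type="math/tex">G</script> exactly as in the equation above, with arbitrary illustrative values for <script type="math/tex">g</script> and <script type="math/tex">h</script>:</p>

```python
import numpy as np

g = np.array([1.0, 2.0, 3.0])   # g_0, g_1, g_2
h = np.array([4.0, 5.0, 6.0])   # h_0, h_1, h_2

# Convolution matrix: column j is g shifted down by j steps, zero-padded.
G = np.array([
    [g[0], 0.0,  0.0 ],
    [g[1], g[0], 0.0 ],
    [g[2], g[1], g[0]],
    [0.0,  g[2], g[1]],
    [0.0,  0.0,  g[2]],
])

assert np.allclose(G @ h, np.convolve(g, h))
```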
<p><em>Exercise</em>. Each column of <script type="math/tex">G</script> is the same time series, but shifted by a different
amount. Use the MATLAB function <code class="highlighter-rouge">convmtx</code> to create matrices
like <script type="math/tex">G</script> from time series like <script type="math/tex">g</script>.</p>
<p>If you don’t have this toolbox installed, you can make use of the fact
that the matrix in Eq. (\ref{eq:ConvolutionMatrix}) is a Toeplitz matrix, and so can
be constructed by giving its first column and first row to the
<code class="highlighter-rouge">toeplitz</code> command in MATLAB.</p>
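<p>To illustrate the first-column/first-row construction without MATLAB, here is a small stand-in for the <code class="highlighter-rouge">toeplitz</code> command in Python (the helper function is hypothetical, written to mimic MATLAB’s <code class="highlighter-rouge">toeplitz(c,r)</code> calling convention):</p>

```python
import numpy as np

def toeplitz(c, r):
    """Toeplitz matrix with first column c and first row r (mimics MATLAB's toeplitz)."""
    c, r = np.asarray(c), np.asarray(r)
    out = np.empty((len(c), len(r)))
    for i in range(len(c)):
        for j in range(len(r)):
            out[i, j] = c[i - j] if i >= j else r[j - i]
    return out

g = np.array([1.0, 2.0, 3.0])
# First column: g padded with zeros; first row: g_0 followed by zeros.
G = toeplitz(np.concatenate([g, np.zeros(2)]), np.array([g[0], 0.0, 0.0]))

h = np.array([4.0, 5.0, 6.0])
assert np.allclose(G @ h, np.convolve(g, h))
```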
<p><em>Exercise</em>. Convolution as polynomial multiplication.
If the second-degree polynomials <script type="math/tex">g_0 + g_1 z + g_2 z^2</script> and <script type="math/tex">h_0 + h_1 z + h_2 z^2</script> are multiplied together, the result is a fourth-degree polynomial. Let’s call this polynomial <script type="math/tex">f_0 + f_1 z + f_2 z^2 + f_3 z^3 + f_4 z^4</script>. Show that this is equivalent to <script type="math/tex">f=g\ast h</script>.</p>
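<p>The equivalence in this exercise is also easy to verify numerically: expanding the product term by term (collecting <script type="math/tex">z^i \cdot z^j = z^{i+j}</script>) gives the same coefficients as the convolution. A sketch with arbitrary coefficients:</p>

```python
import numpy as np

g = np.array([1.0, 2.0, 3.0])   # g_0 + g_1 z + g_2 z^2
h = np.array([4.0, 5.0, 6.0])   # h_0 + h_1 z + h_2 z^2

# Expand the product: each pair of terms contributes g_i h_j to the z^{i+j} coefficient.
f = np.zeros(5)                 # f_0 + f_1 z + ... + f_4 z^4
for i in range(3):
    for j in range(3):
        f[i + j] += g[i] * h[j]

assert np.allclose(f, np.convolve(g, h))
```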
<!--
# Correlation
The cross-correlation of two time series is
$$
\begin{equation}\label{eq:Correlation}
\Corr[g,h]_j = \sum_{i=-\infty}^\infty g_{i+j} h_{j}
\end{equation}
$$
You can show this is equivalent to $$\vec{g}\ast\overline{\vec{h}}$$, where $$\bar{\vec{h}}$$ denotes $$\vec{h}$$ after left-right flipping, $$\bar{h}_j = h_{-j}$$.
As with the convolution, this definition can be applied to finite time
series by using zero padding. Note that
$$\Corr[g,h]_j=\Corr[h,g]_{-j}$$, so that the correlation operation is
not commutative. Typically, the correlation is applied to two
signals, while its output is concentrated near zero.
The zero lag case looks like
$$
\begin{equation}
\begin{array}{cccccccc}
\cdots &0 &g_1 &g_2 &\cdots &g_{n} &0 &\cdots\\
\cdots &0 &h_1 &h_2 &\cdots &h_{n} &0 &\cdots
\end{array}
\end{equation}
$$
and the other lags correspond to sliding $$h$$ right or left.
The autocorrelation is a special case of the correlation, with $$g=h$$.
If $$g\neq h$$, the correlation is sometimes called the crosscorrelation
to distinguish it from the autocorrelation.
*Exercise*. Prove that the autocorrelation is the same for equal and opposite time
lags $$\pm j$$.
# Using the `xcorr` function in Julia and MATLAB
If $$g$$ and $$h$$ are $$n$$-dimensional vectors, then the MATLAB command
`xcorr(g,h)` returns a $$2n-1$$ dimensional vector, corresponding
to the lags $$j=-(n-1)$$ to $$n+1$$. Lags beyond this range are not
included, as the correlation trivially vanishes. The $$n$$th element of
the result corresponds to zero lag.
One irritation is that MATLAB 5 and 6 follow different conventions.
MATLAB 5 calls Eq.\ (\ref{eq:Correlation}) \verb+xcorr(g,h)+, while
MATLAB 6 calls it \verb+xcorr(h,g)+.
A maximum lag can also be given as an argument,
\verb+xcorr(g,h,maxlag)+, to restrict the range of lags computed to
\verb+-maxlag+ to \verb+maxlag+. Then the \verb@maxlag+1@ element
corresponds to zero lag.
The default is the unnormalized correlation given above, but there is
also a normalized version that looks like
$$
\begin{equation}
Q^{xy}_j = \frac{1}{m} \sum_{i=1}^m x_i y_{i+j}
\end{equation}
$$
To compensate for boundary effects, the form
$$
\begin{equation}
Q^{xy}_j = \frac{1}{m-|j|} \sum_{i=1}^m x_i y_{i+j}
\end{equation}
$$
is sometimes preferred. Both forms can be obtained through the
appropriate options to the `xcorr` command.
\section{Two examples}
The autocorrelation of a sine wave. The period should be evident.
The autocorrelation of Gaussian random noise. There is a peak at zero,
while the rest is small. As the length of the signal goes to infinity,
the ratio of the sides to the center vanishes. If the autocorrelation
of a signal vanishes, except at lag zero, we'll call it *white
noise*. The origin of this term will become clear when we study
Fourier analysis.
-->Sebastian Seung\( \newcommand{\Conv}{\mathop{\rm Conv}} \newcommand{\Corr}{\mathop{\rm Corr}} \)