Problem Set 8 (beta version)

$\DeclareMathOperator*\trace{Tr} \DeclareMathOperator*\argmax{argmax} \DeclareMathOperator*\argmin{argmin}$

DEADLINE EXTENDED: Due at 3pm on Wednesday, May 10. Please submit to Blackboard and follow the general guidelines regarding homework assignments.

In this assignment, you will study the problem of predicting the next character of text. You’ll start with unigram and bigram models, and progress to recurrent neural networks.

Programming exercises

Use git pull to update your cos495 sample code repo.

Unigram model. The notebook Unigram.ipynb reads in the text corpus shakespeare.txt using text-handling functions, uses the unique function to get a list of the characters in the corpus, and counts the number of times each character appears in the corpus using string functions. Run the code and you will see a bar graph of the counts. The characters are ordered so that the counts decrease from left to right.
- What are the three most frequent characters? What are the three least frequent characters?
- Divide the corpus into training and test sets, reserving the last 10% of the corpus for the test set. Repeat the counts calculation above for the training set only. Divide the counts by the length of the training set to estimate the probability distribution of the unigram model. What is the perplexity per letter of your unigram model for the training and test sets? What is the cross entropy per letter (in bits) for the training and test sets? Submit your code as well as numerical answers.
- Generate 1000 characters from your unigram model by sampling from the unigram probability distribution. Hint: If p is an $n$ -dimensional vector, then rand(Categorical(p)) will generate a random integer i from 1 to $n$ with probability p[i]. The function Categorical comes from the Distributions.jl package and implements a categorical distribution.
Bigram model. Write code to estimate the conditional probability distributions of the bigram model. For your estimation code, use additive smoothing, which is important for handling rare bigrams. For fun you can read about the sunrise problem, “What is the probability that the sun will rise tomorrow?” As a solution, Laplace proposed the rule of succession, which is the origin of “add-one” smoothing, the most basic kind of additive smoothing.
- What are the three most frequent bigrams? What are the three least frequent bigrams (excluding those that don’t occur at all)?
- What is the perplexity per letter of your bigram model? What is the cross entropy per letter (in bits)? Submit numerical answers for both training and test sets, as well as your code.
- Generate 1000 characters from your bigram model.
Elman network. Here you will use a recurrent net to predict the next character in the Shakespeare corpus. It won’t produce samples indistinguishable from Shakespeare, but some structure should emerge from training.
- Look at Elman.ipynb, and write the equations for the recurrent network implemented in the code, using letters that correspond to the variable names in the code. What is the input vector? What is the output vector?
- Modify the code to train only on the training set. Run the code for at least 10,000 steps if using the default batch_size and seq_length. (Reducing batch_size might speed training if you’re training on a CPU. If you change batch_size and seq_length, set num_steps = (10,000 * 128 * 50) / (batch_size * seq_length)). Now calculate the cross entropy per letter in bits on the test set. Is this better or worse than your unigram and bigram models?
- Experiment with sampling from the model with different temperature parameters. Print one sample with temperature very close to 0 and one with an extremely high temperature. What does temperature qualitatively do to the samples?
- If you want to reduce the cross entropy per letter and/or generate better samples you could train for longer and experiment with the optimization parameters. For example, the Julia code uses AdamOptimizer, and you can improve training by adjusting the parameters. The Python code also anneals the learning rate parameters, which is not implemented in the Julia code. (Don’t submit anything for this bullet).
LSTM network.
- Modify the code to implement the equations of the traditional LSTM.
- Hint: as well as changing the step and init_params functions to include the LSTM equations and parameters, you need to ensure the other methods can handle the cell state as well as the hidden state. One way to do this is to add a c everywhere there is an h. Another way to do this is to change the size of h to $2\times$ hidden_size everywhere outside of step, and within step (i) use tf.split to split the input h (size $2\times$ hidden_size) into hidden and cell (both size $1\times$ hidden_size), (ii) compute the new hidden and cell, and (iii) use tf.concat to merge them into a new h to return from step.
- Once training is done, calculate the cross entropy per letter in bits on the test set. Is this better or worse than the Elman net?