### Theoretical exercises

Word embeddings are representations of words as vectors. The embeddings are obtained using unsupervised learning, and are used as input features for other language models trained via supervised learning. Word2Vec is a popular word embedding model introduced by Mikolov et al. 2013 and featured in a TensorFlow tutorial. Word2Vec has two main variants, Continuous Bag-of-Words (CBOW) and Skip-Gram (SG).

1. The CBOW model is trained to predict the middle word $w_0$ from a context window of $2k$ surrounding words $w_{-k}, \ldots, w_{-1}, w_1, \ldots, w_k$. Here each word $w_i$ is represented by an integer from 1 to $n$, where $n$ is the size of the vocabulary. The prediction of the model is $o_{w} = \sum_{a=1}^d \sum_{i=-k\atop i\neq 0}^k U_{aw}V_{aw_i}$ followed by softmax.

• Write the CBOW model as a multilayer perceptron with softmax output. Suppose that the input is represented by word counts, $x^0_w = \sum_i \delta_{ww_i}$, for all words $w$ in the vocabulary. Use the notation $x^1_a$ for the activities of the hidden layer, and the notation $x^2_w$ for the activities of the output layer following softmax.
• How many neurons are in the hidden layer?
• How many neurons are in the output layer?
• What are the connections from the input layer to the hidden layer?
• What are the connections from the hidden layer to the output layer?
• Write down the log loss for this model as a function of the prediction $x^2_w$ and the correct word $w_0$.

We can regard the $w$th column of the $d\times n$ matrix $V$ as an embedding of the $w$th word of the vocabulary in $d$-dimensional space. The same is true of the rows of the matrix $U$. Therefore training the CBOW model yields two word embeddings. This is an example of dimensionality reduction, since the embedding dimension $d$ is generally chosen much smaller than the size $n$ of the vocabulary.

In the Skip-Gram model, the central word $w_0$ is used to predict the surrounding words $w_{-k}, \ldots, w_{-1}, w_1, \ldots, w_k$. You can think of this as $2k$ separate prediction problems organized into a minibatch of size $2k$. This can also be regarded as a multilayer perceptron with a single hidden layer.

### Programming exercises

The theoretical exercise above gives a rough idea of how word embeddings are trained. (For other important details you’ll have to consult the original papers.) In the following exercises you will visualize and use pre-trained word embeddings. We will use an embedding model called GloVe rather than Word2Vec for convenience. Pre-trained GloVe embeddings are available for a relatively small vocabulary size, which requires less RAM and computational cycles. Download glove.6B.zip, which is the smallest of the pre-trained GloVe embeddings. If you unzip this file, you will see several txt files. Use glove.6B.200d.txt for the following exercises. Each line of this file is a word followed by 200 numbers, which constitute the embedding of that word.

1. Visualization of Word Embeddings.

• Consider the following 10 seed words: mathematics, rose, cherry, toy, fear, approval, insect, steel, invention, music. For each seed word, list the 9 nearest neighbors based on cosine similarity of word embeddings. You will end up with a total of 100 words.

• Compute the top two principal components of the 100 word embeddings, and project the embeddings onto the 2D subspace spanned by the principal components. Use the projections to plot the 100 words as points in a 2D scatter plot. Use one plot symbol for seed words, and another for neighbors. Assign a distinct color to each seed word along with its nearest neighbors. Include a legend mapping the seed words to their colors.

2. Solving analogy problems with vector arithmetic. In the analogy problem, you are given three words (e.g. “king”, “man”, and “woman”) as input, and your task is to generate a fourth word (“queen”) that completes the analogy (“king” is to “man” as “queen” is to “woman”). One possible solution is to use word embeddings. The idea is that the “king” to “man” relationship is represented by the difference $\vec{v}_{king}-\vec{v}_{man}$ between their word embeddings, and that an analogous relationship should have approximately the same difference between word embeddings. In other words, it should be the case that $\vec{v}_{king} - \vec{v}_{man} \approx \vec{v}_{queen} - \vec{v}_{woman}$. Therefore the minimization $\argmin_j | \vec{v}_{king} - \vec{v}_{man} + \vec{v}_{woman} - \vec{v}_j|^2$ over all words $j$ in the vocabulary ought to recover “queen.” Solve this list of analogies using word embeddings. Your code should predict the fourth word from the first three words in each analogy. Report the accuracy of your code. You can ignore analogies that contain words that are missing from the GloVe vocabulary.