Which statement correctly describes the main difference between the CBOW and Skip-gram models in Word2Vec?
Think about which model uses context to predict the center word and which uses the center word to predict context.
CBOW takes the surrounding words as input and tries to predict the center word. Skip-gram does the opposite: it takes the center word as input and tries to predict the surrounding words.
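The contrast is easiest to see in the training pairs each model builds from the same sentence. Below is a minimal, hypothetical sketch (the sentence, window size, and variable names are illustrative, not from any particular implementation):

```python
# Hypothetical sketch: training pairs CBOW and Skip-gram derive from one sentence.
sentence = ["the", "cat", "sat", "on", "mat"]
center_idx = 2  # center word: "sat"
window = 1      # one word of context on each side

context = [sentence[center_idx - 1], sentence[center_idx + 1]]  # ["cat", "on"]

# CBOW: (context words) -> center word
cbow_pair = (context, sentence[center_idx])

# Skip-gram: center word -> each context word separately
skipgram_pairs = [(sentence[center_idx], c) for c in context]

print(cbow_pair)       # (['cat', 'on'], 'sat')
print(skipgram_pairs)  # [('sat', 'cat'), ('sat', 'on')]
```

Note that one CBOW example corresponds to several Skip-gram examples, which is why Skip-gram performs more updates per occurrence of a word.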
Given the following simplified Skip-gram training code snippet, what will be the shape of the output vector representing the predicted context word probabilities?
import numpy as np

vocab_size = 10
embedding_dim = 5

# Random embedding matrix
embeddings = np.random.rand(vocab_size, embedding_dim)

# One-hot encoded center word (index 3)
center_word = np.zeros(vocab_size)
center_word[3] = 1

# Compute hidden layer (embedding lookup)
hidden = embeddings.T @ center_word  # shape (embedding_dim,)

# Output weights
output_weights = np.random.rand(vocab_size, embedding_dim)

# Compute output layer
output = output_weights @ hidden  # shape ?
print(output.shape)
Consider the matrix multiplication dimensions: output_weights (vocab_size x embedding_dim) times hidden (embedding_dim,).
The output_weights matrix has shape (10, 5) and the hidden vector has shape (5,). Multiplying them yields a vector of shape (10,), one score per word in the vocabulary; applying a softmax to these 10 scores would turn them into the predicted context-word probabilities.
You want to train word embeddings on a small dataset with many rare words. Which Word2Vec model is generally better at learning good embeddings for rare words?
Think about which model focuses more on individual target words and their contexts.
Skip-gram predicts context words from the target word, which helps it learn better embeddings for rare words by focusing on their contexts individually. CBOW averages context which can dilute rare word signals.
In Word2Vec training, what is the effect of increasing the window size parameter?
Window size defines how many words around the target word are used as context.
A larger window size means more surrounding words are included as context, which helps capture broader semantic relations but can introduce irrelevant words as noise.
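To make the effect concrete, here is a small hypothetical helper (the function name and sentence are illustrative) that extracts the context for a given center position at different window sizes:

```python
def context_words(tokens, center, window):
    # Collect up to `window` words on each side of the center position,
    # clipping at the sentence boundaries.
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]

# Center word "fox" (index 3):
print(context_words(tokens, 3, 1))  # ['brown', 'jumps']
print(context_words(tokens, 3, 3))  # ['the', 'quick', 'brown', 'jumps', 'over', 'the']
```

With window=1 the context is tightly syntactic; with window=3 it pulls in words like "the" and "over" that carry broader (or noisier) signal.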
After training Word2Vec embeddings, you want to evaluate them using the analogy task: "king is to queen as man is to ?". Which metric best measures the quality of the embeddings on this task?
Think about how analogy tasks use vector arithmetic and similarity measures.
The analogy task uses vector arithmetic: vector('queen') - vector('king') + vector('man'). The word whose vector has the highest cosine similarity to this result (excluding the query words themselves) is taken as the answer; for good embeddings it should be 'woman', indicating that the embeddings capture the semantic relationship.
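The procedure can be sketched with toy embeddings. These 3-dimensional vectors are made up for illustration (real Word2Vec vectors have hundreds of dimensions), chosen so the analogy resolves correctly:

```python
import numpy as np

# Toy embeddings (hypothetical values, chosen so the analogy works).
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.2, 0.8]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king : queen :: man : ?
target = vecs["queen"] - vecs["king"] + vecs["man"]

# Rank by cosine similarity, excluding the three query words.
candidates = {w: v for w, v in vecs.items() if w not in {"king", "queen", "man"}}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print(best)  # woman
```

Excluding the query words matters in practice: the nearest neighbor of the arithmetic result is often one of the input words themselves.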