Inline Question 1
In our current image captioning setup, our RNN language model produces a word at every timestep as its output. However, an alternate way to pose the problem is to train the network to operate over characters (e.g. 'a', 'b', etc.) as opposed to words, so that at every timestep, it receives the previous character as input and tries to predict the next character in the sequence. For example, the network might generate a caption like
'A', ' ', 'c', 'a', 't', ' ', 'o', 'n', ' ', 'a', ' ', 'b', 'e', 'd'
Can you describe one advantage of an image-captioning model that uses a character-level RNN? Can you also describe one disadvantage? HINT: there are several valid answers, but it might be useful to compare the parameter space of word-level and character-level models.
Your Answer:
(In this example) the word-level model works over a vocabulary of 1,004 words, so its embedding matrix needs far more parameters than a character-level model's, and it therefore requires more memory and computation.
The character-level model, on the other hand, has a much smaller parameter space but is harder to train. Because it generates a caption by repeatedly predicting the next character from the previous one, the small parameter count makes it difficult to reach high accuracy, and the characters it samples may not even form valid words.
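As a rough, back-of-the-envelope illustration of the parameter-space point: the 1,004-word vocabulary is the one used in this assignment, while the character-set size of 30 and the embedding dimension of 256 below are assumptions chosen only for illustration.

word_vocab, char_vocab, embed_dim = 1004, 30, 256   # 30 characters is an assumed figure

word_embed_params = word_vocab * embed_dim   # 257,024 entries in W_embed
char_embed_params = char_vocab * embed_dim   #   7,680 entries in W_embed
print(word_embed_params, char_embed_params)

# The hidden-to-vocab projection (W_vocab, b_vocab) shrinks by the same factor of V,
# while the RNN weights (Wx, Wh, b) stay the same size; the character-level model
# instead has to unroll over much longer sequences.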
Code
Image captioning is the task of taking an image as input and describing it in natural language. This assignment uses the COCO dataset, in which each example is an image paired with a caption. To train the RNN captioning model, the image (its features) serves as the training input and the caption serves as the target label.

First, each word of the caption is embedded. The input has shape (N, T). # (N: batch size, T: sequence length)
To embed each word into a D-dimensional vector we need a weight matrix of shape (V, D). # (V: vocabulary size, D: embedding size)
A single indexing operation, out = W[x], transforms the input as follows: (N, T) -> (N, T, D).
def word_embedding_forward(x, W):
    """Forward pass for word embeddings.

    We operate on minibatches of size N where each sequence has length T.
    We assume a vocabulary of V words, assigning each word to a vector of
    dimension D.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element
      idx of x must be in the range 0 <= idx < V.
    - W: Weight matrix of shape (V, D) giving word vectors for all words.

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out, cache = None, None
    ##############################################################################
    # TODO: Implement the forward pass for word embeddings.                      #
    #                                                                            #
    # HINT: This can be done in one line using NumPy's array indexing.           #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    out = W[x]          # integer array indexing: (N, T) indices -> (N, T, D)
    cache = (x, W)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                             END OF YOUR CODE                               #
    ##############################################################################
    return out, cache
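As a quick sanity check of the forward pass, here is a small sketch with toy dimensions (the numbers are assumptions chosen only for illustration):

import numpy as np

N, T, V, D = 2, 4, 10, 3                   # toy batch, sequence, vocab, embedding sizes
x = np.random.randint(V, size=(N, T))      # integer word indices
W = np.random.randn(V, D)                  # embedding matrix

out, cache = word_embedding_forward(x, W)
print(out.shape)                           # (2, 4, 3) == (N, T, D)
assert np.allclose(out[0, 1], W[x[0, 1]])  # each output slot is simply a row of W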
Let's also implement the backward pass for the word embedding. The assignment hints at np.add.at, which has the signature np.add.at(arr, indices, values): it adds the given values into arr at the positions selected by indices, accumulating whenever an index appears more than once.
The parameter we are learning in the word embedding layer is W, so we create dW with the same shape as W.
dW has shape (V, D), where V is the number of words the embedding has learned (is learning).
Taking a single example as an illustration, only T of the V rows of W are actually used, so the upstream gradient only needs to be accumulated into those T rows of dW.
def word_embedding_backward(dout, cache):
    """Backward pass for word embeddings.

    We cannot back-propagate into the words since they are integers, so we
    only return gradient for the word embedding matrix.

    HINT: Look up the function np.add.at

    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D)
    """
    dW = None
    ##############################################################################
    # TODO: Implement the backward pass for word embeddings.                     #
    #                                                                            #
    # Note that words can appear more than once in a sequence.                   #
    # HINT: Look up the function np.add.at                                       #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, W = cache
    dW = np.zeros_like(W)
    np.add.at(dW, x, dout)  # (arr, indices, values)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                             END OF YOUR CODE                               #
    ##############################################################################
    return dW
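The reason the assignment insists on np.add.at rather than plain fancy-indexed assignment is that a word can appear several times in one sequence, and its gradient contributions must accumulate. A tiny sketch with made-up values:

import numpy as np

dW = np.zeros((5, 3))            # toy vocabulary V=5, embedding D=3
idx = np.array([2, 2, 4])        # word 2 appears twice
dout = np.ones((3, 3))           # pretend upstream gradient, one row per index

np.add.at(dW, idx, dout)
print(dW[2])   # [2. 2. 2.]  -> both contributions for word 2 were accumulated
print(dW[4])   # [1. 1. 1.]

# In contrast, dW[idx] += dout is a buffered operation: row 2 would be written
# only once, silently dropping one of the two gradient contributions.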
Next is the CaptioningRNN model, which takes image features as input and learns to generate captions. The training-time forward pass works as follows (a shape-trace sketch follows this list):
1. affine_forward projects the image features into a latent vector, which is used as the RNN's initial hidden state.
2. The caption words are turned into embedding vectors through the word embedding layer; these serve as the RNN's input vectors.
3. (1) is used as the hidden state at the initial timestep and (2) as the RNN's input (one input vector is fed in at each timestep).
4. The hidden states produced by the RNN are mapped to score matrices over the vocabulary with a (temporal) affine forward.
5. The softmax loss between the scores and the caption labels is computed; a mask is used so that positions whose target word is <NULL> are excluded from the loss.
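To make the shapes concrete, here is a minimal trace of that forward pipeline with assumed toy sizes (symbols follow the class docstring below: N batch size, D image-feature dim, Wdim word-vector dim, H hidden dim, V vocabulary size, T sequence length); the RNN and loss steps are described only in comments since they rely on the assignment's rnn_layers.py functions:

import numpy as np

N, T, D, Wdim, H, V = 2, 5, 8, 6, 7, 10    # assumed toy sizes

features = np.random.randn(N, D)
W_proj, b_proj = np.random.randn(D, H), np.zeros(H)
h0 = features @ W_proj + b_proj            # (1) initial hidden state, shape (N, H)

captions_in = np.random.randint(V, size=(N, T))
W_embed = np.random.randn(V, Wdim)
emb = W_embed[captions_in]                 # (2) RNN inputs, shape (N, T, Wdim)

print(h0.shape, emb.shape)                 # (2, 7) (2, 5, 6)
# (3) rnn_forward(emb, h0, Wx, Wh, b)                    -> h of shape      (N, T, H)
# (4) temporal_affine_forward(h, W_vocab, b_vocab)       -> scores of shape (N, T, V)
# (5) temporal_softmax_loss(scores, captions_out, mask)  -> scalar loss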
The sampling process goes as follows:
1. Steps (1)-(3) above are carried out in the same way, except that the word embedding and the RNN are applied one timestep at a time (rnn_step_forward / lstm_step_forward).
2. The RNN's first input is the <START> token.
3. The hidden vector produced at each step is mapped to scores via affine_forward.
4. np.argmax picks the index of the maximum score and stores it in captions[:, i]. For each example the score vector has shape (V,), the size of the vocabulary; the position of the largest score corresponds to the word the model judges most likely.
5. That predicted word becomes the next input word and the loop returns to step (2), repeating for up to max_length iterations.
class CaptioningRNN:
    """
    A CaptioningRNN produces captions from image features using a recurrent
    neural network.

    The RNN receives input vectors of size D, has a vocab size of V, works on
    sequences of length T, has an RNN hidden dimension of H, uses word vectors
    of dimension W, and operates on minibatches of size N.

    Note that we don't use any regularization for the CaptioningRNN.
    """

    def __init__(
        self,
        word_to_idx,
        input_dim=512,
        wordvec_dim=128,
        hidden_dim=128,
        cell_type="rnn",
        dtype=np.float32,
    ):
        """
        Construct a new CaptioningRNN instance.

        Inputs:
        - word_to_idx: A dictionary giving the vocabulary. It contains V entries,
          and maps each string to a unique integer in the range [0, V).
        - input_dim: Dimension D of input image feature vectors.
        - wordvec_dim: Dimension W of word vectors.
        - hidden_dim: Dimension H for the hidden state of the RNN.
        - cell_type: What type of RNN to use; either 'rnn' or 'lstm'.
        - dtype: numpy datatype to use; use float32 for training and float64 for
          numeric gradient checking.
        """
        if cell_type not in {"rnn", "lstm"}:
            raise ValueError('Invalid cell_type "%s"' % cell_type)

        self.cell_type = cell_type
        self.dtype = dtype
        self.word_to_idx = word_to_idx
        self.idx_to_word = {i: w for w, i in word_to_idx.items()}
        self.params = {}

        vocab_size = len(word_to_idx)

        self._null = word_to_idx["<NULL>"]
        self._start = word_to_idx.get("<START>", None)
        self._end = word_to_idx.get("<END>", None)

        # Initialize word vectors
        self.params["W_embed"] = np.random.randn(vocab_size, wordvec_dim)
        self.params["W_embed"] /= 100

        # Initialize CNN -> hidden state projection parameters
        self.params["W_proj"] = np.random.randn(input_dim, hidden_dim)
        self.params["W_proj"] /= np.sqrt(input_dim)
        self.params["b_proj"] = np.zeros(hidden_dim)

        # Initialize parameters for the RNN
        dim_mul = {"lstm": 4, "rnn": 1}[cell_type]
        self.params["Wx"] = np.random.randn(wordvec_dim, dim_mul * hidden_dim)
        self.params["Wx"] /= np.sqrt(wordvec_dim)
        self.params["Wh"] = np.random.randn(hidden_dim, dim_mul * hidden_dim)
        self.params["Wh"] /= np.sqrt(hidden_dim)
        self.params["b"] = np.zeros(dim_mul * hidden_dim)

        # Initialize output to vocab weights
        self.params["W_vocab"] = np.random.randn(hidden_dim, vocab_size)
        self.params["W_vocab"] /= np.sqrt(hidden_dim)
        self.params["b_vocab"] = np.zeros(vocab_size)

        # Cast parameters to correct dtype
        for k, v in self.params.items():
            self.params[k] = v.astype(self.dtype)
    def loss(self, features, captions):
        """
        Compute training-time loss for the RNN. We input image features and
        ground-truth captions for those images, and use an RNN (or LSTM) to compute
        loss and gradients on all parameters.

        Inputs:
        - features: Input image features, of shape (N, D)
        - captions: Ground-truth captions; an integer array of shape (N, T + 1) where
          each element is in the range 0 <= y[i, t] < V

        Returns a tuple of:
        - loss: Scalar loss
        - grads: Dictionary of gradients parallel to self.params
        """
        # Cut captions into two pieces: captions_in has everything but the last word
        # and will be input to the RNN; captions_out has everything but the first
        # word and this is what we will expect the RNN to generate. These are offset
        # by one relative to each other because the RNN should produce word (t+1)
        # after receiving word t. The first element of captions_in will be the START
        # token, and the first element of captions_out will be the first word.
        captions_in = captions[:, :-1]
        captions_out = captions[:, 1:]

        # You'll need this
        mask = captions_out != self._null

        # Weight and bias for the affine transform from image features to initial
        # hidden state
        W_proj, b_proj = self.params["W_proj"], self.params["b_proj"]

        # Word embedding matrix
        W_embed = self.params["W_embed"]

        # Input-to-hidden, hidden-to-hidden, and biases for the RNN
        Wx, Wh, b = self.params["Wx"], self.params["Wh"], self.params["b"]

        # Weight and bias for the hidden-to-vocab transformation.
        W_vocab, b_vocab = self.params["W_vocab"], self.params["b_vocab"]

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
        # In the forward pass you will need to do the following:                   #
        # (1) Use an affine transformation to compute the initial hidden state     #
        #     from the image features. This should produce an array of shape (N, H)#
        # (2) Use a word embedding layer to transform the words in captions_in     #
        #     from indices to vectors, giving an array of shape (N, T, W).         #
        # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
        #     process the sequence of input word vectors and produce hidden state  #
        #     vectors for all timesteps, producing an array of shape (N, T, H).    #
        # (4) Use a (temporal) affine transformation to compute scores over the    #
        #     vocabulary at every timestep using the hidden states, giving an      #
        #     array of shape (N, T, V).                                            #
        # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
        #     the points where the output word is <NULL> using the mask above.     #
        #                                                                          #
        # Do not worry about regularizing the weights or their gradients!          #
        #                                                                          #
        # In the backward pass you will need to compute the gradient of the loss   #
        # with respect to all model parameters. Use the loss and grads variables   #
        # defined above to store loss and gradients; grads[k] should give the      #
        # gradients for self.params[k].                                            #
        #                                                                          #
        # Note also that you are allowed to make use of functions from layers.py   #
        # in your implementation, if needed.                                       #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Forward pass
        h_in, cache_in = affine_forward(features, W_proj, b_proj)            # (N, H)
        emb, cache_emb = word_embedding_forward(captions_in, W_embed)        # (N, T, W)
        if self.cell_type == 'rnn':
            h, cache_rnn = rnn_forward(emb, h_in, Wx, Wh, b)                 # (N, T, H)
        else:
            h, cache_lstm = lstm_forward(emb, h_in, Wx, Wh, b)
        scores, cache_scores = temporal_affine_forward(h, W_vocab, b_vocab)  # (N, T, V)

        # Loss
        loss, dout = temporal_softmax_loss(scores, captions_out, mask)

        # Backward pass
        dout, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(dout, cache_scores)
        if self.cell_type == 'rnn':
            dout, dh, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(dout, cache_rnn)
        else:
            dout, dh, grads['Wx'], grads['Wh'], grads['b'] = lstm_backward(dout, cache_lstm)
        grads['W_embed'] = word_embedding_backward(dout, cache_emb)
        dx, grads['W_proj'], grads['b_proj'] = affine_backward(dh, cache_in)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return loss, grads
    def sample(self, features, max_length=30):
        """
        Run a test-time forward pass for the model, sampling captions for input
        feature vectors.

        At each timestep, we embed the current word, pass it and the previous hidden
        state to the RNN to get the next hidden state, use the hidden state to get
        scores for all vocab words, and choose the word with the highest score as
        the next word. The initial hidden state is computed by applying an affine
        transform to the input image features, and the initial word is the <START>
        token.

        For LSTMs you will also have to keep track of the cell state; in that case
        the initial cell state should be zero.

        Inputs:
        - features: Array of input image features of shape (N, D).
        - max_length: Maximum length T of generated captions.

        Returns:
        - captions: Array of shape (N, max_length) giving sampled captions,
          where each element is an integer in the range [0, V). The first element
          of captions should be the first sampled word, not the <START> token.
        """
        N = features.shape[0]
        captions = self._null * np.ones((N, max_length), dtype=np.int32)

        # Unpack parameters
        W_proj, b_proj = self.params["W_proj"], self.params["b_proj"]
        W_embed = self.params["W_embed"]
        Wx, Wh, b = self.params["Wx"], self.params["Wh"], self.params["b"]
        W_vocab, b_vocab = self.params["W_vocab"], self.params["b_vocab"]

        ###########################################################################
        # TODO: Implement test-time sampling for the model. You will need to      #
        # initialize the hidden state of the RNN by applying the learned affine   #
        # transform to the input image features. The first word that you feed to  #
        # the RNN should be the <START> token; its value is stored in the         #
        # variable self._start. At each timestep you will need to:                #
        # (1) Embed the previous word using the learned word embeddings           #
        # (2) Make an RNN step using the previous hidden state and the embedded   #
        #     current word to get the next hidden state.                          #
        # (3) Apply the learned affine transformation to the next hidden state to #
        #     get scores for all words in the vocabulary                          #
        # (4) Select the word with the highest score as the next word, writing it #
        #     (the word index) to the appropriate slot in the captions variable   #
        #                                                                         #
        # For simplicity, you do not need to stop generating after an <END> token #
        # is sampled, but you can if you want to.                                 #
        #                                                                         #
        # HINT: You will not be able to use the rnn_forward or lstm_forward       #
        # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
        # a loop.                                                                 #
        #                                                                         #
        # NOTE: we are still working over minibatches in this function. Also if   #
        # you are using an LSTM, initialize the first cell state to zeros.        #
        ###########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        h_in, _ = affine_forward(features, W_proj, b_proj)  # initial hidden state
        c_in = np.zeros_like(h_in)                          # initial LSTM cell state
        word = self._start * np.ones(N, dtype=np.int32)     # the first word is the <START> token
        for i in range(max_length):
            emb, _ = word_embedding_forward(word, W_embed)  # (N, W)
            if self.cell_type == 'rnn':
                h_in, _ = rnn_step_forward(emb, h_in, Wx, Wh, b)
            else:
                # lstm_step_forward takes (x, prev_h, prev_c, Wx, Wh, b)
                h_in, c_in, _ = lstm_step_forward(emb, h_in, c_in, Wx, Wh, b)
            scores, _ = affine_forward(h_in, W_vocab, b_vocab)  # (N, V)
            word = np.argmax(scores, axis=1)                    # greedy choice of next word
            captions[:, i] = word

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return captions
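To tie the pieces together, here is a hedged usage sketch with toy random data; it assumes CaptioningRNN and the layer functions from the assignment's rnn_layers.py are already defined in the current namespace, as they are in the notebook:

import numpy as np

# Tiny vocabulary containing the special tokens the class expects.
word_to_idx = {"<NULL>": 0, "<START>": 1, "<END>": 2, "a": 3, "cat": 4}
N, D, T = 2, 512, 4

model = CaptioningRNN(word_to_idx, input_dim=D, wordvec_dim=8,
                      hidden_dim=16, cell_type="rnn")

features = np.random.randn(N, D)
captions = np.random.randint(len(word_to_idx), size=(N, T + 1))
captions[:, 0] = model._start                   # captions begin with <START>

loss, grads = model.loss(features, captions)    # scalar loss + dict of gradients
sampled = model.sample(features, max_length=6)  # (N, 6) array of word indices
print(loss, sampled.shape)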

On the training data, the model generates good captions from the images.

On the validation data, however, performance is poor. Hopefully the next assignment, Image Captioning with Transformers, will improve results on the validation data as well.