Inline Question 1
Several key design decisions were made in designing the scaled dot product attention we introduced above. Explain why the following choices were beneficial:
- Using multiple attention heads as opposed to one.
- Dividing by $\sqrt{d/h}$ before applying the softmax function. Recall that $d$ is the feature dimension and $h$ is the number of heads.
- Adding a linear transformation to the output of the attention operation.
Only one or two sentences per choice is necessary, but be sure to be specific in addressing what would have happened without each given implementation detail, why such a situation would be suboptimal, and how the proposed implementation improves the situation.
Your Answer:
1. Using multiple attention heads lets the model learn several different representation subspaces, so it can attend to different aspects of the input at the same time. This is the same reason a CNN uses many filters instead of one.
2. Assume every entry of the Query and Key is independent with mean 0 and variance 1. Then
$$E[Q_{1,i} K_{i,1}] = E[Q_{1,i}]\,E[K_{i,1}] = 0$$
$$\mathrm{Var}(Q_{1,i} K_{i,1}) = \bigl(\mathrm{Var}(Q_{1,i}) + E[Q_{1,i}]^2\bigr)\bigl(\mathrm{Var}(K_{i,1}) + E[K_{i,1}]^2\bigr) - E[Q_{1,i}]^2 E[K_{i,1}]^2 = 1$$
$$\mathrm{Var}\Bigl(\sum_{i=1}^{d_k} Q_{1,i} K_{i,1}\Bigr) = \sum_{i=1}^{d_k} \mathrm{Var}(Q_{1,i} K_{i,1}) = d_k$$
That is, each entry of $QK^T$ has variance $d_k$. The larger this variance, the more the logits spread apart: the softmax output for the largest logit approaches 1 while the rest approach 0, and since the softmax gradient is a product of these probabilities, the near-zero entries cause vanishing gradients. Dividing by $\sqrt{d_k} = \sqrt{d/h}$ keeps the logit variance at 1 and avoids this saturation (a small numerical check follows these answers).
3. The output of the attention operation is the concatenation of the $h$ heads' attention outputs. The final linear transformation learns how strongly each head should contribute when mixing them back into a single representation. If we instead used something like mean pooling, every head would simply be averaged, and the benefit of having multiple distinct heads would be lost.
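To make the scaling argument concrete, here is a quick numerical check I added (not part of the assignment): it estimates the variance of the dot-product logits for a hypothetical $d_k = 256$ and compares the softmax of unscaled vs. scaled scores.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_k = 256                                  # hypothetical per-head feature dimension
q = torch.randn(10000, d_k)                # entries ~ N(0, 1)
k = torch.randn(10000, d_k)

logits = (q * k).sum(dim=1)                # 10,000 independent dot products
print(logits.var().item())                 # ~ d_k = 256
print((logits / d_k ** 0.5).var().item())  # ~ 1 after dividing by sqrt(d_k)

# Without scaling, a row of scores is so spread out that softmax saturates:
# almost all mass goes to one entry, and the softmax gradient (a product of
# these probabilities) effectively vanishes.
row = torch.randn(d_k) @ torch.randn(d_k, 8)   # one row of unscaled scores
print(F.softmax(row, dim=-1))                  # nearly one-hot
print(F.softmax(row / d_k ** 0.5, dim=-1))     # much smoother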
Code
This is the implementation of Multi-Head Attention.
1. Generate Query, Key, and Value from the input (affine layers).
2. Split the embedding dimension of Query, Key, and Value across the number of heads.
3. $Attention(Q,K,V) = Softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
4. Concatenate the per-head attention outputs.
5. Apply dropout and a final affine layer to produce the output.
The figure below illustrates this flow.

import math

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """
    A model layer which implements a simplified version of masked attention, as
    introduced by "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).

    Usage:
      attn = MultiHeadAttention(embed_dim, num_heads=2)

      # self-attention
      data = torch.randn(batch_size, sequence_length, embed_dim)
      self_attn_output = attn(query=data, key=data, value=data)

      # attention using two inputs
      other_data = torch.randn(batch_size, sequence_length, embed_dim)
      attn_output = attn(query=data, key=other_data, value=other_data)
    """

    def __init__(self, embed_dim, num_heads, dropout=0.1):
        """
        Construct a new MultiHeadAttention layer.

        Inputs:
        - embed_dim: Dimension of the token embedding
        - num_heads: Number of attention heads
        - dropout: Dropout probability
        """
        super().__init__()
        assert embed_dim % num_heads == 0

        # We will initialize these layers for you, since swapping the ordering
        # would affect the random number generation (and therefore your exact
        # outputs relative to the autograder). Note that the layers use a bias
        # term, but this isn't strictly necessary (and varies by
        # implementation).
        self.key = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.attn_drop = nn.Dropout(dropout)

        self.n_head = num_heads
        self.emd_dim = embed_dim
        self.head_dim = self.emd_dim // self.n_head

    def forward(self, query, key, value, attn_mask=None):
        """
        Calculate the masked attention output for the provided data, computing
        all attention heads in parallel.

        In the shape definitions below, N is the batch size, S is the source
        sequence length, T is the target sequence length, and E is the embedding
        dimension.

        Inputs:
        - query: Input data to be used as the query, of shape (N, S, E)
        - key: Input data to be used as the key, of shape (N, T, E)
        - value: Input data to be used as the value, of shape (N, T, E)
        - attn_mask: Array of shape (S, T) where mask[i,j] == 0 indicates token
          i in the source should not influence token j in the target.

        Returns:
        - output: Tensor of shape (N, S, E) giving the weighted combination of
          data in value according to the attention weights calculated using key
          and query.
        """
        N, S, E = query.shape
        N, T, E = value.shape
        # Create a placeholder, to be overwritten by your code below.
        output = torch.empty((N, S, E))
        ############################################################################
        # TODO: Implement multiheaded attention using the equations given in       #
        # Transformer_Captioning.ipynb.                                            #
        # A few hints:                                                             #
        #  1) You'll want to split your shape from (N, T, E) into (N, T, H, E/H),  #
        #     where H is the number of heads.                                      #
        #  2) The function torch.matmul allows you to do a batched matrix multiply.#
        #     For example, you can do (N, H, T, E/H) by (N, H, E/H, T) to yield a  #
        #     shape (N, H, T, T). For more examples, see                           #
        #     https://pytorch.org/docs/stable/generated/torch.matmul.html          #
        #  3) For applying attn_mask, think how the scores should be modified to   #
        #     prevent a value from influencing output. Specifically, the PyTorch   #
        #     function masked_fill may come in handy.                              #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # Project the inputs into queries, keys, and values.
        q = self.query(query)  # (N, S, E) -> (N, S, E)
        k = self.key(key)      # (N, T, E) -> (N, T, E)
        v = self.value(value)  # (N, T, E) -> (N, T, E)

        # Split the embedding dimension across the heads and move the head
        # dimension next to the batch dimension.
        q = q.view(N, S, self.n_head, E // self.n_head)  # (N, S, E) -> (N, S, H, E//H)
        k = k.view(N, T, self.n_head, E // self.n_head)
        v = v.view(N, T, self.n_head, E // self.n_head)
        q = q.transpose(1, 2)  # (N, H, S, E//H)
        k = k.transpose(1, 2)  # (N, H, T, E//H)
        v = v.transpose(1, 2)  # (N, H, T, E//H)

        # Compute scaled dot-product attention scores.
        k_adjusted = k.transpose(2, 3)  # (N, H, E//H, T)
        product = torch.matmul(q, k_adjusted)  # (N, H, S, E//H) x (N, H, E//H, T) -> (N, H, S, T)
        if attn_mask is not None:
            product = product.masked_fill(attn_mask == 0, -float('inf'))
        product = product / math.sqrt(E // self.n_head)  # softmax(QK^T / sqrt(d_k))
        scores = F.softmax(product, dim=-1)
        scores = self.attn_drop(scores)
        scores = torch.matmul(scores, v)  # (N, H, S, T) x (N, H, T, E//H) -> (N, H, S, E//H)

        # Concatenate the heads and apply the output projection.
        concat = scores.transpose(1, 2).reshape(N, S, -1)  # (N, S, E)
        output = self.proj(concat)
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return output
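As a quick sanity check of the layer above (my own usage sketch, not part of the assignment), we can pass in a lower-triangular attn_mask built with torch.tril and confirm that masked (causal) self-attention runs and preserves the input shape; the toy sizes below are arbitrary.

torch.manual_seed(231)

N, S, E = 2, 5, 8
attn = MultiHeadAttention(embed_dim=E, num_heads=2)

data = torch.randn(N, S, E)
# causal_mask[i, j] == 0 means query position i may not attend to key position j.
causal_mask = torch.tril(torch.ones(S, S))

out = attn(query=data, key=data, value=data, attn_mask=causal_mask)
print(out.shape)  # torch.Size([2, 5, 8])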
Next up is Positional Encoding. Its basic form is:
$$PE_{(pos, 2i)} = \sin\!\left(pos/10000^{2i/d_{model}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\!\left(pos/10000^{2i/d_{model}}\right)$$
[init]
1. Create the position array # $pos$
2. Create the frequency term $t = 1/10000^{2i/d_{model}}$ # $t$
3. Store $\sin(pos \cdot t)$ at the even embedding indices of pe and $\cos(pos \cdot t)$ at the odd indices
[forward]
1. Add the input (x) and pe # slicing pe to match the input sequence length
2. Apply dropout
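Before the class itself, here is a tiny illustration I added (toy embed_dim = 6 and max_len = 4, both made up) of the frequency term t and the sin/cos interleaving described in the steps above.

embed_dim, max_len = 6, 4
pos = torch.arange(0, max_len).reshape(-1, 1)                 # (4, 1): positions 0..3
t = torch.pow(torch.tensor(1e-4), torch.arange(0, embed_dim, 2) / embed_dim)
print(t)   # 10000^(-2i/d) for 2i = 0, 2, 4

pe = torch.zeros(max_len, embed_dim)
pe[:, 0::2] = torch.sin(pos * t)   # even columns get sin
pe[:, 1::2] = torch.cos(pos * t)   # odd columns get cos
print(pe)  # each row alternates sin/cos with exponents 0, 0, 2, 2, 4, 4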
class PositionalEncoding(nn.Module):
    """
    Encodes information about the positions of the tokens in the sequence. In
    this case, the layer has no learnable parameters, since it is a simple
    function of sines and cosines.
    """

    def __init__(self, embed_dim, dropout=0.1, max_len=5000):
        """
        Construct the PositionalEncoding layer.

        Inputs:
        - embed_dim: the size of the embed dimension
        - dropout: the dropout value
        - max_len: the maximum possible length of the incoming sequence
        """
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        assert embed_dim % 2 == 0
        # Create an array with a "batch dimension" of 1 (which will broadcast
        # across all examples in the batch).
        pe = torch.zeros(1, max_len, embed_dim)
        ############################################################################
        # TODO: Construct the positional encoding array as described in            #
        # Transformer_Captioning.ipynb. The goal is for each row to alternate      #
        # sine and cosine, and have exponents of 0, 0, 2, 2, 4, 4, etc. up to      #
        # embed_dim. Of course this exact specification is somewhat arbitrary, but #
        # this is what the autograder is expecting. For reference, our solution is #
        # less than 5 lines of code.                                               #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # Column of positions: (max_len, 1)
        pos = torch.arange(0, max_len).reshape(-1, 1)
        # Frequency term t = 10000^(-2i/embed_dim), so that
        # PE(pos, 2i) = sin(pos * t) and PE(pos, 2i + 1) = cos(pos * t).
        t = torch.pow(torch.tensor([1e-4]), torch.arange(0, embed_dim, 2) / embed_dim)
        pe[:, :, ::2] = torch.sin(pos * t)   # even embedding indices
        pe[:, :, 1::2] = torch.cos(pos * t)  # odd embedding indices
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # Make sure the positional encodings will be saved with the model
        # parameters (mostly for completeness).
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Element-wise add positional embeddings to the input sequence.

        Inputs:
        - x: the sequence fed to the positional encoder model, of shape
          (N, S, D), where N is the batch size, S is the sequence length and
          D is embed dim
        Returns:
        - output: the input sequence + positional encodings, of shape (N, S, D)
        """
        N, S, D = x.shape
        # Create a placeholder, to be overwritten by your code below.
        output = torch.empty((N, S, D))
        ############################################################################
        # TODO: Index into your array of positional encodings, and add the         #
        # appropriate ones to the input sequence. Don't forget to apply dropout    #
        # afterward. This should only take a few lines of code.                    #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # Add the first S positional encodings to the input, then apply dropout.
        output = x + self.pe[:, :S, :]
        output = self.dropout(output)
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return output
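A brief usage check I added for the layer above (toy sizes are arbitrary): the output keeps the input shape, and the stored table stays within [-1, 1] because it is built from sines and cosines.

pe_layer = PositionalEncoding(embed_dim=16, dropout=0.0, max_len=100)

x = torch.randn(2, 10, 16)              # (N, S, D)
out = pe_layer(x)
print(out.shape)                        # torch.Size([2, 10, 16])
print(pe_layer.pe.abs().max().item())   # <= 1.0, since the table is all sines/cosines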
Next is the CaptioningTransformer used for Transformer-based image captioning.
The code is easier to follow alongside the architecture below. Note that in this code, the visual features are passed through a linear layer instead of going through self-attention.

1-1. Apply the embedding and positional encoding to the captions.
1-2. Apply an affine layer to the visual features.
1-3. Create tgt_mask. (torch.tril builds a lower-triangular matrix; this TxT lower-triangular matrix is later used as the mask in the masked self-attention.)
2-1. Run masked self-attention over the caption embeddings.
2-2. Perform attention with the masked self-attention output as the Query and the image features as the Key and Value.
3. Pass the result through a final linear layer to produce per-token scores (the softmax is applied later, in the loss).
class CaptioningTransformer(nn.Module):
    """
    A CaptioningTransformer produces captions from image features using a
    Transformer decoder.

    The Transformer receives input vectors of size D, has a vocab size of V,
    works on sequences of length T, uses word vectors of dimension W, and
    operates on minibatches of size N.
    """

    def __init__(self, word_to_idx, input_dim, wordvec_dim, num_heads=4,
                 num_layers=2, max_length=50):
        """
        Construct a new CaptioningTransformer instance.

        Inputs:
        - word_to_idx: A dictionary giving the vocabulary. It contains V entries,
          and maps each string to a unique integer in the range [0, V).
        - input_dim: Dimension D of input image feature vectors.
        - wordvec_dim: Dimension W of word vectors.
        - num_heads: Number of attention heads.
        - num_layers: Number of transformer layers.
        - max_length: Max possible sequence length.
        """
        super().__init__()

        vocab_size = len(word_to_idx)
        self.vocab_size = vocab_size
        self._null = word_to_idx["<NULL>"]
        self._start = word_to_idx.get("<START>", None)
        self._end = word_to_idx.get("<END>", None)

        self.visual_projection = nn.Linear(input_dim, wordvec_dim)
        self.embedding = nn.Embedding(vocab_size, wordvec_dim, padding_idx=self._null)
        self.positional_encoding = PositionalEncoding(wordvec_dim, max_len=max_length)

        decoder_layer = TransformerDecoderLayer(input_dim=wordvec_dim, num_heads=num_heads)
        self.transformer = TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.apply(self._init_weights)

        self.output = nn.Linear(wordvec_dim, vocab_size)

    def _init_weights(self, module):
        """
        Initialize the weights of the network.
        """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, features, captions):
        """
        Given image features and caption tokens, return a distribution over the
        possible tokens for each timestep. Note that since the entire sequence
        of captions is provided all at once, we mask out future timesteps.

        Inputs:
        - features: image features, of shape (N, D)
        - captions: ground truth captions, of shape (N, T)

        Returns:
        - scores: score for each token at each timestep, of shape (N, T, V)
        """
        N, T = captions.shape
        # Create a placeholder, to be overwritten by your code below.
        scores = torch.empty((N, T, self.vocab_size))
        ############################################################################
        # TODO: Implement the forward function for CaptionTransformer.             #
        # A few hints:                                                             #
        #  1) You first have to embed your caption and add positional              #
        #     encoding. You then have to project the image features into the same  #
        #     dimensions.                                                           #
        #  2) You have to prepare a mask (tgt_mask) for masking out the future     #
        #     timesteps in captions. torch.tril() function might help in preparing #
        #     this mask.                                                            #
        #  3) Finally, apply the decoder features on the text & image embeddings   #
        #     along with the tgt_mask. Project the output to scores per token      #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # Embed the captions and add positional encodings.
        caption_embeddings = self.embedding(captions)  # [N, T] -> [N, T, W]
        caption_embeddings = self.positional_encoding(caption_embeddings)

        # Project the image features into the same dimension as the word vectors.
        projected_features = self.visual_projection(features).unsqueeze(1)  # [N, D] -> [N, W] -> [N, 1, W]

        # Lower-triangular mask so each position can only attend to the past.
        tgt_mask = torch.tril(torch.ones(T, T,
                                         device=caption_embeddings.device,
                                         dtype=caption_embeddings.dtype))

        # Apply the Transformer decoder to the captions, attending to the image.
        features = self.transformer(tgt=caption_embeddings,
                                    memory=projected_features,
                                    tgt_mask=tgt_mask)

        # Project to per-token scores over the vocabulary.
        scores = self.output(features)  # [N, T, W] -> [N, T, V]
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return scores

    def sample(self, features, max_length=30):
        """
        Given image features, use greedy decoding to predict the image caption.

        Inputs:
        - features: image features, of shape (N, D)
        - max_length: maximum possible caption length

        Returns:
        - captions: captions for each example, of shape (N, max_length)
        """
        with torch.no_grad():
            features = torch.Tensor(features)
            N = features.shape[0]

            # Create an empty captions tensor (where all tokens are NULL).
            captions = self._null * np.ones((N, max_length), dtype=np.int32)

            # Create a partial caption, with only the start token.
            partial_caption = self._start * np.ones(N, dtype=np.int32)
            partial_caption = torch.LongTensor(partial_caption)
            # [N] -> [N, 1]
            partial_caption = partial_caption.unsqueeze(1)

            for t in range(max_length):
                # Predict the next token (ignoring all other time steps).
                output_logits = self.forward(features, partial_caption)
                output_logits = output_logits[:, -1, :]

                # Choose the most likely word ID from the vocabulary.
                # [N, V] -> [N]
                word = torch.argmax(output_logits, axis=1)

                # Update our overall caption and our current partial caption.
                captions[:, t] = word.numpy()
                word = word.unsqueeze(1)
                partial_caption = torch.cat([partial_caption, word], dim=1)

            return captions
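To tie everything together, here is a minimal smoke test I added, assuming the TransformerDecoderLayer and TransformerDecoder classes from the assignment are available; the toy vocabulary and dimensions below are made up, whereas the real notebook uses the COCO vocabulary and extracted image features.

# Toy vocabulary with the special tokens the constructor looks up.
word_to_idx = {"<NULL>": 0, "<START>": 1, "<END>": 2, "a": 3, "cat": 4}

model = CaptioningTransformer(word_to_idx, input_dim=32, wordvec_dim=16,
                              num_heads=2, num_layers=2, max_length=10)

features = torch.randn(4, 32)            # (N, D) fake image features
captions = torch.randint(0, 5, (4, 7))   # (N, T) fake ground-truth captions

scores = model(features, captions)
print(scores.shape)                      # torch.Size([4, 7, 5]) = (N, T, V)

sampled = model.sample(features.numpy(), max_length=8)
print(sampled.shape)                     # (4, 8) array of greedy token IDs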

When I ran captioning on the validation images, the results came out somewhat hit-or-miss.