Inline Question 1
Describe the results of this experiment. How does the weight initialization scale affect models with/without batch normalization differently, and why?
Your Answer : Models with batch normalization perform well across a much wider range of weight initialization scales than models without it. Batch normalization re-normalizes each layer's activations, mitigating the internal covariate shift problem, so training is far less sensitive to the initial weight scale.
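As an illustrative numpy sketch (not part of the assignment code; the shapes and the two scales 0.1 and 10 are arbitrary choices), batch-normalizing a linear layer's output produces essentially the same activations regardless of the weight scale:

```python
import numpy as np

# Illustrative sketch: after batch normalization, the normalized activations
# are essentially independent of the weight initialization scale.
np.random.seed(0)
x = np.random.randn(128, 50)   # a toy minibatch
w = np.random.randn(50, 20)    # toy weights
eps = 1e-8

def bn(h):
    # Normalize each column (feature) to zero mean, unit variance.
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

small = bn(x @ (w * 0.1))   # small initialization scale
large = bn(x @ (w * 10.0))  # large initialization scale
# The scale cancels out: small and large are numerically almost identical.
```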
Inline Question 2
Describe the results of this experiment. What does this imply about the relationship between batch normalization and batch size? Why is this relationship observed?
Your Answer : With batch normalization, performance improves as the batch size grows. A batch containing only a few samples cannot accurately estimate the mean and variance of the underlying data distribution and is sensitive to outliers, so the normalization statistics become noisy and training suffers.
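A quick numpy experiment (illustrative only; the population parameters and batch sizes are made up) shows why small batches hurt: the per-batch sample mean fluctuates far more when the batch is small:

```python
import numpy as np

# Illustrative sketch: per-batch statistics are much noisier for small batches.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=10000)  # "population": mean 2, std 3

def batch_mean_spread(batch_size, n_trials=500):
    # Std-dev of the sample means across many randomly drawn batches.
    means = [rng.choice(data, size=batch_size, replace=False).mean()
             for _ in range(n_trials)]
    return np.std(means)

spread_small = batch_mean_spread(4)    # tiny batches: noisy estimate
spread_large = batch_mean_spread(256)  # large batches: stable estimate
# Theory predicts the spread shrinks like 1/sqrt(batch_size).
```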
Inline Question 3
Which of these data preprocessing steps is analogous to batch normalization, and which is analogous to layer normalization?
1. Scaling each image in the dataset, so that the RGB channels for each row of pixels within an image sums up to 1.
2. Scaling each image in the dataset, so that the RGB channels for all pixels within an image sums up to 1.
3. Subtracting the mean image of the dataset from each image in the dataset.
4. Setting all RGB values to either 0 or 1 depending on a given threshold.
Your Answer :
2 : layer normalization. Normalizing each individual sample (image) on its own corresponds to layer normalization.
3 : batch normalization. Normalizing each feature (pixel position) across the whole dataset/batch corresponds to batch normalization.
Inline Question 4
When is layer normalization likely to not work well, and why?
1. Using it in a very deep network
2. Having a very small dimension of features
3. Having a high regularization term
Your Answer :
2 : In batch normalization, when the batch size is small, the in-batch mean and variance are poor estimates of the population statistics and are sensitive to outliers. For the same reason, layer normalization works poorly when the feature dimension is small: the per-sample mean and variance computed over only a few features are noisy, vulnerable to outliers, and may not represent the data well.
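The same effect can be sketched for layer normalization (illustrative numpy code; the dimensions are arbitrary): with few features, each sample's own mean is a noisy estimate of the true mean:

```python
import numpy as np

# Illustrative sketch: per-sample layer-norm statistics are noisy when the
# feature dimension is small.
rng = np.random.default_rng(0)

def per_sample_mean_spread(D, n_samples=2000):
    # How far each sample's feature mean (axis=1) strays from the true mean 0.
    x = rng.normal(size=(n_samples, D))
    return np.std(x.mean(axis=1))

spread_small = per_sample_mean_spread(4)    # few features: noisy statistics
spread_large = per_sample_mean_spread(512)  # many features: stable statistics
```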
Code
import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)

    N, D = x.shape
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == "train":
        #######################################################################
        # TODO: Implement the training-time forward pass for batch norm.      #
        # Use minibatch statistics to compute the mean and variance, use      #
        # these statistics to normalize the incoming data, and scale and      #
        # shift the normalized data using gamma and beta.                     #
        #                                                                     #
        # You should store the output in the variable out. Any intermediates  #
        # that you need for the backward pass should be stored in the cache   #
        # variable.                                                           #
        #                                                                     #
        # You should also use your computed sample mean and variance together #
        # with the momentum variable to update the running mean and running   #
        # variance, storing your result in the running_mean and running_var   #
        # variables.                                                          #
        #                                                                     #
        # Note that though you should be keeping track of the running         #
        # variance, you should normalize the data based on the standard       #
        # deviation (square root of variance) instead!                        #
        # Referencing the original paper (https://arxiv.org/abs/1502.03167)   #
        # might prove to be helpful.                                          #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # Per-feature minibatch statistics, each of shape (D,).
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)
        # Exponentially decaying running averages, used at test time.
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
        x_normal = (x - sample_mean) / np.sqrt(sample_var + eps)
        out = gamma * x_normal + beta
        cache = (x, sample_mean, sample_var, gamma)
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == "test":
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        x_normal = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_normal + beta
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param["running_mean"] = running_mean
    bn_param["running_var"] = running_var

    return out, cache
batchnorm operates in two modes: train and test.
In train mode, we compute the mean and variance of the input x. Since x is a 2-D array of shape (N, D), np.mean(x, axis=0) and np.var(x, axis=0) yield per-feature statistics of shape (D,). Subtracting the mean from x and dividing by the standard deviation normalizes the input; the learnable parameters gamma and beta then scale and shift the normalized values so the network can still represent an arbitrary output distribution. The cache stores the input x, the sample mean and variance, and gamma, which the backward pass needs.
Train mode also updates the exponentially decaying running mean and variance; test mode uses these running statistics for normalization instead of minibatch statistics.
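A minimal standalone sketch of the train-time computation (toy shapes, with gamma=1 and beta=0 for the demo) makes the effect concrete: after normalization, each feature column is approximately zero-mean and unit-variance:

```python
import numpy as np

# Standalone sketch of the train-time batchnorm computation on toy data.
np.random.seed(0)
N, D = 4, 3
x = np.random.randn(N, D) * 10 + 5     # deliberately off-center, wide input
gamma, beta = np.ones(D), np.zeros(D)  # identity scale/shift for the demo
eps = 1e-5

sample_mean = np.mean(x, axis=0)  # (D,) per-feature mean
sample_var = np.var(x, axis=0)    # (D,) per-feature (uncorrected) variance
x_normal = (x - sample_mean) / np.sqrt(sample_var + eps)
out = gamma * x_normal + beta
# Each column of out now has mean ~0 and variance ~1.
```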
def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    # Referencing the original paper (https://arxiv.org/abs/1502.03167)       #
    # might prove to be helpful.                                              #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    eps = 1e-5  # assumes the default eps used in batchnorm_forward
    N = dout.shape[0]
    x, sample_mean, sample_var, gamma = cache
    # Recompute the normalized input from the cached statistics.
    x_normal = (x - sample_mean) / np.sqrt(sample_var + eps)
    dgamma = np.sum(dout * x_normal, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx_normal = dout * gamma
    # Gradients flowing through the variance and mean nodes of the graph.
    dlvar = np.sum(dx_normal * (x - sample_mean) * -0.5 * (sample_var + eps) ** -1.5, axis=0)
    dlmean = np.sum(-dx_normal / np.sqrt(sample_var + eps), axis=0) \
        + dlvar * np.sum(-2 * (x - sample_mean), axis=0) / N
    dx = dx_normal / np.sqrt(sample_var + eps) \
        + dlvar * 2 * (x - sample_mean) / N + dlmean / N
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dgamma, dbeta
The backpropagation step is somewhat involved. We implement it by laying out the computation graph for batch normalization and propagating gradients backward through its intermediate nodes.

Starting from dout, we work back through the graph, at each node multiplying the upstream gradient by the node's local gradient.
(The key fact is that the mean gate's local gradient is 1/N; once you know that, the rest falls into place.)
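To sanity-check the analytic gradients, a self-contained numerical gradient check can be run on toy data (this re-derives the same formulas as batchnorm_backward above; the shapes, seed, and step size h are arbitrary choices):

```python
import numpy as np

# Self-contained gradient check for the batchnorm backward formulas.
np.random.seed(1)
N, D, eps = 5, 4, 1e-5
x = np.random.randn(N, D)
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

def bn_forward(x):
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Analytic backward pass (same algebra as batchnorm_backward).
mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)
dx_hat = dout * gamma
dlvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
dlmean = np.sum(-dx_hat / np.sqrt(var + eps), axis=0) \
    + dlvar * np.sum(-2 * (x - mu), axis=0) / N
dx = dx_hat / np.sqrt(var + eps) + dlvar * 2 * (x - mu) / N + dlmean / N

# Centered-difference numerical gradient of sum(out * dout) w.r.t. x.
dx_num = np.zeros_like(x)
h = 1e-5
for i in range(N):
    for j in range(D):
        old = x[i, j]
        x[i, j] = old + h
        fp = np.sum(bn_forward(x) * dout)
        x[i, j] = old - h
        fm = np.sum(bn_forward(x) * dout)
        x[i, j] = old
        dx_num[i, j] = (fp - fm) / (2 * h)
# dx and dx_num should agree to high precision.
```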
def layernorm_forward(x, gamma, beta, ln_param):
    """
    Forward pass for layer normalization.

    During both training and test-time, the incoming data is normalized per
    data-point, before being scaled by gamma and beta parameters identical to
    that of batch normalization.

    Note that in contrast to batch normalization, the behavior during train and
    test-time for layer normalization are identical, and we do not need to keep
    track of running averages of any sort.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
      - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get("eps", 1e-5)
    ###########################################################################
    # TODO: Implement the training-time forward pass for layer norm.          #
    # Normalize the incoming data, and scale and shift the normalized data    #
    # using gamma and beta.                                                   #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of batch normalization, and inserting a line or two of   #
    # well-placed code. In particular, can you think of any matrix            #
    # transformations you could perform, that would enable you to copy over   #
    # the batch norm code and leave it almost unchanged?                      #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    # Per-sample statistics over the feature axis, reshaped (N,) -> (N, 1)
    # so they broadcast against x.
    feature_mean = np.mean(x, axis=1)[:, np.newaxis]
    feature_var = np.var(x, axis=1)[:, np.newaxis]
    x_normal = (x - feature_mean) / np.sqrt(feature_var + eps)
    out = gamma * x_normal + beta
    cache = (x, feature_mean, feature_var, gamma)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return out, cache
Layer normalization normalizes each sample in the batch over its own features, rather than each feature over the batch. The forward pass therefore mirrors the batch normalization forward pass, with the statistics computed along axis=1 instead of axis=0.
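The transpose trick from the hint can be verified directly (illustrative numpy sketch with arbitrary shapes): layer norm on x gives the same result as batch-norm-style column normalization applied to x.T:

```python
import numpy as np

# Sketch of the hint's transpose trick: layer norm over features (axis=1)
# equals batch-norm-style normalization (axis=0) applied to x.T.
np.random.seed(2)
N, D, eps = 3, 6, 1e-5
x = np.random.randn(N, D)
gamma = np.random.randn(D)
beta = np.random.randn(D)

# Direct layer norm: per-sample statistics over the feature axis.
mu = x.mean(axis=1, keepdims=True)
var = x.var(axis=1, keepdims=True)
ln_out = gamma * (x - mu) / np.sqrt(var + eps) + beta

# Same result via the batch-norm code path on the transposed input.
xt = x.T
xt_hat = (xt - xt.mean(axis=0)) / np.sqrt(xt.var(axis=0) + eps)
bn_style_out = gamma * xt_hat.T + beta
```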
def layernorm_backward(dout, cache):
    """
    Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done
    already for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for layer norm.                       #
    #                                                                         #
    # HINT: this can be done by slightly modifying your training-time         #
    # implementation of batch normalization. The hints to the forward pass    #
    # still apply!                                                            #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    eps = 1e-5  # assumes the default eps used in layernorm_forward
    D = dout.shape[1]  # layer norm averages over the feature dimension
    x, feature_mean, feature_var, gamma = cache
    x_normal = (x - feature_mean) / np.sqrt(feature_var + eps)
    dgamma = np.sum(dout * x_normal, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx_normal = dout * gamma
    # Same algebra as batchnorm_backward, with sums over axis=1 instead of
    # axis=0 and D playing the role of N.
    dlvar = np.sum(dx_normal * (x - feature_mean) * -0.5 * (feature_var + eps) ** -1.5,
                   axis=1)[:, np.newaxis]
    dlmean = np.sum(-dx_normal / np.sqrt(feature_var + eps), axis=1)[:, np.newaxis] \
        + dlvar * np.sum(-2 * (x - feature_mean), axis=1)[:, np.newaxis] / D
    dx = dx_normal / np.sqrt(feature_var + eps) \
        + dlvar * 2 * (x - feature_mean) / D + dlmean / D
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dgamma, dbeta
The layernorm backward pass likewise mirrors the batchnorm backward pass, with sums taken along axis=1 (over features) instead of axis=0 (over the batch).