Gradient accumulation principle

During deep learning training, the batch size of the data is limited by GPU memory, and the batch size affects both the final accuracy of the model and the performance of the training process. As models grow larger while GPU memory stays the same, the batch size can only be reduced; this is where Gradient Accumulation offers a simple solution to the problem.

Role of Batch size

The Batch size of the training data has a key impact on the convergence of the training process and ultimately the accuracy of the trained model. Typically, there is an optimal value or range of values for the Batch size of each neural network and dataset.

Different neural networks and different datasets may have different optimal Batch sizes.

There are two main issues to consider when choosing a Batch size:
  • Generalization: a large Batch size may cause training to fall into a sharp local minimum, which usually means the neural network will not perform well on samples outside the training set. The ability to perform well on unseen samples is called generalization, and poor generalization generally indicates over-fitting.
  • Convergence speed: a small Batch size may lead to slow convergence of the learning algorithm. The update of the network model at each Batch determines the starting point for the next Batch update. Each Batch draws training samples at random from the dataset, so the resulting gradient is a noisy estimate based on only part of the data. The fewer samples used in a single Batch, the less accurate the gradient estimate will be. In other words, a smaller Batch size makes the learning process more volatile and essentially lengthens the time it takes for the algorithm to converge.

Impact of Batch size on memory

While a traditional computer has access to a large amount of RAM alongside the CPU and can also use SSDs for secondary caching or virtual memory, the memory on an AI accelerator chip such as a GPU is much smaller. This is why the Batch size of the training data has a significant impact on the GPU's memory.
To understand this further, let us first examine the contents of the memory on the AI chip during training:
  • Model parameters: the weight parameters and biases that the network model needs to use.
  • Optimizer variables: variables needed by the optimizer algorithm, e.g. momentum.
  • Intermediate computation variables: intermediate values generated by the network model computation, which are temporarily stored in the memory of the AI accelerator chip, e.g. the output of each activation layer.
  • Workspace: local variables needed by the kernel implementations on the AI accelerator chip, generated in temporary memory, e.g. the local variable produced when computing B/C in the operator D = A + B/C.
Therefore, a larger Batch size means that more samples are fed into the neural network per training step, resulting in a sharp increase in the variables that need to be stored in the AI chip's memory. In many cases the AI accelerator chip does not have enough memory, and setting the Batch size too large results in an OOM (Out Of Memory) error.
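To make this relationship concrete, the rough sketch below estimates how the memory categories above change with Batch size. The hidden size, layer count, sequence length, and data type are purely illustrative assumptions, not figures from this article; the point is that parameter and optimizer memory stay constant while activation memory grows linearly with the Batch size.

```python
# Rough, illustrative estimate of accelerator memory vs. Batch size.
# All sizes below are hypothetical and chosen only to show the trend.
def estimate_memory_gb(batch_size, hidden=4096, layers=24, seq_len=2048, dtype_bytes=2):
    params = layers * hidden * hidden                      # model parameters (weights)
    optimizer_state = params                               # e.g. one momentum buffer per weight
    activations = batch_size * seq_len * layers * hidden   # per-layer activation outputs
    total_bytes = (params + optimizer_state + activations) * dtype_bytes
    return total_bytes / 2**30

for bs in (8, 32, 128, 512):
    print(f"batch size {bs:4d} -> ~{estimate_memory_gb(bs):.1f} GB")
```

Running the sketch shows activation memory quickly dominating as the Batch size grows, which is exactly what triggers the OOM error described above.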

Ways to use large Batch sizes

One way to address the memory limitations of AI accelerator chips and still run large Batch sizes is to split a Batch of data Samples into smaller Batches, called Mini-Batches. These small Mini-Batches can run independently, and their gradients are averaged or summed while the network model is being trained. There are two main ways to implement this.
  1. Data parallelism: use multiple AI accelerator chips to train all the Mini-Batches in parallel, with one copy of the data on each AI accelerator chip. The gradients of all Mini-Batches are accumulated, and the summed result is used to update the network parameters at the end of each Epoch.

  2. Gradient accumulation: the Mini-Batches are executed sequentially while their gradients are accumulated; after the last Mini-Batch is computed, the accumulated result is averaged and used to update the model variables.

Although the two techniques are quite similar and both solve the problem of memory being unable to hold a larger Batch size, gradient accumulation can be done with a single AI accelerator chip, whereas data parallelism requires multiple AI accelerator chips. So students who only have a single 12GB card on hand should hurry up and put gradient accumulation to use.

Gradient accumulation principle

Gradient accumulation is a way of training a neural network in which the data Samples are split by Batch into several smaller Mini-Batches, which are then computed in sequence.
Before discussing gradient accumulation further, let’s look at the computational process of a neural network.
A deep learning model is made up of many interconnected neural network layers, and the sample data propagates forward through all of them. After passing through all the layers, the model outputs a prediction for each sample, and the loss function then computes the loss value (error) for each sample. The neural network is then back-propagated to compute the gradient of the loss with respect to the model parameters. Finally, this gradient information is used to update the parameters of the network model.
The optimizer updates the model's weight parameters according to a mathematical update rule. Take the simple stochastic gradient descent (SGD) algorithm as an example. Assume the loss function is $$ \operatorname{Loss}(\theta)=\frac{1}{2}\left(h\left(x^k\right)-y^k\right)^2 $$ When building the model, the optimizer is the algorithm used to minimize this loss. The SGD algorithm uses the loss to update the weight parameters as follows: $$ \theta_i=\theta_{i-1}-lr * grad_i $$ where $\theta$ denotes the trainable parameters (weights or biases) of the network model, $lr$ is the learning rate, and $grad_i$ is the gradient of the loss with respect to the model parameters.
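As a concrete illustration, the sketch below performs one SGD update by hand for the formula above, assuming $h(x)$ is a simple linear model; the tensor shapes, sample values, and learning rate are arbitrary choices for the example, not part of the original text.

```python
import torch

theta = torch.randn(3, requires_grad=True)     # trainable parameters theta
x, y = torch.randn(3), torch.tensor(2.0)       # one hypothetical sample (x^k, y^k)
lr = 0.01                                      # learning rate

loss = 0.5 * (torch.dot(theta, x) - y) ** 2    # Loss = 1/2 * (h(x^k) - y^k)^2
loss.backward()                                # computes grad = dLoss/dtheta

with torch.no_grad():
    theta -= lr * theta.grad                   # theta_i = theta_{i-1} - lr * grad_i
    theta.grad.zero_()                         # clear the gradient for the next step
```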
Gradient accumulation still runs the forward and backward computation of the network model, but does not update the model parameters immediately; instead, the gradient information obtained during each computation is accumulated, and the accumulated gradient is finally used to update the parameters in one step. $$ \text{accumulated}=\sum_{i=0}^{N} grad_i $$ While the model variables are not being updated, the original data Batch is in effect split into several smaller Mini-Batches, so the samples used in each step come from a smaller data set.
By not updating the variables for N steps, all Mini-Batches compute their gradients using the same model variables, which ensures that the gradient and weight information obtained is the same as if the original, unsplit Batch size had been used. That is: $$ \theta_i=\theta_{i-1}-lr * \sum_{k=0}^{N} grad_k $$ Accumulating the gradients over these steps ultimately produces a sum of gradients of the same magnitude as using the global Batch size.
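A minimal PyTorch-style training loop sketch of this idea is shown below; the model, loss function, data, and the choice of N = 4 accumulation steps are all hypothetical and only serve to show where the accumulation happens and where `optimizer.step()` is delayed.

```python
import torch
from torch import nn

model = nn.Linear(16, 1)                          # hypothetical tiny model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 4                            # N Mini-Batches per parameter update

# hypothetical data: Mini-Batches of 8 samples; 4 of them simulate a global Batch of 32
data_loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(32)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):
    loss = loss_fn(model(inputs), targets)
    # dividing by N keeps the accumulated gradient equal to the average over
    # the simulated large Batch instead of the raw sum
    (loss / accumulation_steps).backward()        # gradients are added into .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # update with the accumulated gradient
        optimizer.zero_grad()                     # reset for the next N Mini-Batches
```

Because the model variables are untouched between the four `backward()` calls, every Mini-Batch sees the same weights, matching the equivalence argued above.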
In practical engineering, of course, there are two points to note regarding tuning and the algorithm:
  • Learning rate: under certain conditions, the larger the Batch size, the better the training effect. Gradient accumulation simulates the effect of increasing the Batch size, so the learning rate may also need to be scaled up accordingly when it is used.
  • Batch Norm: when accumulation steps is 4, gradient accumulation only simulates a 4x larger Batch size. The data distribution seen by Batch Norm is not exactly the same as that of a real 4x Batch: the mean and variance BN computes on each Mini-Batch differ from the mean and variance of the actual 4x Batch. Some implementations therefore use Group Norm instead of Batch Norm.
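For the Batch Norm point, the swap mentioned above can be sketched as follows: Group Norm computes its statistics per sample over channel groups, so they do not depend on the Mini-Batch size used in each accumulation step. The layer sizes and group count here are illustrative only.

```python
from torch import nn

# Batch Norm: statistics depend on the Mini-Batch, so accumulation changes them.
block_bn = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())

# Group Norm: statistics are computed per sample over 32 channel groups,
# independent of the Batch size in each accumulation step.
block_gn = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.GroupNorm(32, 64), nn.ReLU())
```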
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)

Author: Jerry       Post Date: May 2, 2036