Training a language model is memory-intensive, not only because the model itself is large but also because of the long sequences in the training data batches. Training a model with limited memory is challenging. In this article, you will learn techniques that enable model training in memory-constrained environments. In particular, you will learn about:
Low-precision floating-point numbers and mixed-precision training
Using gradient checkpointing
Let's get started!
Training a Model with Limited Memory using Mixed Precision and Gradient Checkpointing
Photo by Meduana. Some rights reserved.
Overview
This article is divided into three parts; they are:
Floating-Point Numbers
Automatic Mixed Precision Training
Gradient Checkpointing
Let's get started!
Floating-Point Numbers
The default data type in PyTorch is the IEEE 754 32-bit floating-point format, also known as single precision. It is not the only floating-point type you can use. For example, most CPUs support 64-bit double-precision floating-point, and GPUs usually support half-precision floating-point as well. The table below lists some floating-point types:
| Data Type | PyTorch Type | Total Bits | Sign Bit | Exponent Bits | Mantissa Bits | Min Value | Max Value | eps |
|---|---|---|---|---|---|---|---|---|
| IEEE 754 double precision | torch.float64 | 64 | 1 | 11 | 52 | -1.79769e+308 | 1.79769e+308 | 2.22045e-16 |
| IEEE 754 single precision | torch.float32 | 32 | 1 | 8 | 23 | -3.40282e+38 | 3.40282e+38 | 1.19209e-07 |
| IEEE 754 half precision | torch.float16 | 16 | 1 | 5 | 10 | -65504 | 65504 | 0.000976562 |
| bf16 | torch.bfloat16 | 16 | 1 | 8 | 7 | -3.38953e+38 | 3.38953e+38 | 0.0078125 |
| fp8 (e4m3) | torch.float8_e4m3fn | 8 | 1 | 4 | 3 | -448 | 448 | 0.125 |
| fp8 (e5m2) | torch.float8_e5m2 | 8 | 1 | 5 | 2 | -57344 | 57344 | 0.25 |
| fp8 (e8m0) | torch.float8_e8m0fnu | 8 | 0 | 8 | 0 | 5.87747e-39 | 1.70141e+38 | 1.0 |
| fp6 (e3m2) | — | 6 | 1 | 3 | 2 | -28 | 28 | 0.25 |
| fp6 (e2m3) | — | 6 | 1 | 2 | 3 | -7.5 | 7.5 | 0.125 |
| fp4 (e2m1) | — | 4 | 1 | 2 | 1 | -6 | 6 | 0.5 |

(Note that fp8 e8m0 is an unsigned format with no sign or mantissa bits; its minimum is the smallest representable positive value.)
Floating-point numbers are binary representations of real numbers. Each consists of a sign bit, several bits for the exponent, and several bits for the mantissa, laid out as shown in the figure below. When sorted by their binary representation, floating-point numbers retain their ordering by real-number value.
Floating-point number representation. Figure from Wikimedia.
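For a concrete view of this layout, you can reinterpret the bits of a float16 value as a 16-bit integer and print them. This is a minimal sketch; the 16 bits read as 1 sign bit, 5 exponent bits, and 10 mantissa bits:

import torch

# Reinterpret the bits of float16 -1.5 as a 16-bit integer
x = torch.tensor([-1.5], dtype=torch.float16)
bits = x.view(torch.int16)[0].item() & 0xFFFF
print(f"{bits:016b}")  # 1011111000000000 -> sign=1, exponent=01111, mantissa=1000000000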
Different floating-point types have different ranges and precisions. Not all types are supported by all hardware. For example, fp4 is only supported by Nvidia's Blackwell architecture. PyTorch supports only a subset of these data types. You can run the following code to print information about various floating-point types:
import torch
from tabulate import tabulate

# float types to inspect
float_types = [
    torch.float64,
    torch.float32,
    torch.float16,
    torch.bfloat16,
    torch.float8_e4m3fn,
    torch.float8_e5m2,
    torch.float8_e8m0fnu,
]

# collect finfo for each type
table = []
for dtype in float_types:
    info = torch.finfo(dtype)
    try:
        typename = info.dtype
    except AttributeError:
        typename = str(dtype)
    table.append([typename, info.max, info.min, info.smallest_normal, info.eps])

headers = ["data type", "max", "min", "smallest normal", "eps"]
print(tabulate(table, headers=headers))
Pay attention to the min and max values for each type, as well as the eps value. The min and max values indicate the range a type can support (the dynamic range). If you train a model with such a type but the model weights exceed this range, you will get overflow or underflow, usually causing the model to output NaN or Inf. The eps value is the smallest positive number such that the type can distinguish 1+eps from 1. This is a metric for precision. If your model's gradient updates are smaller than eps, you will likely observe the vanishing gradient problem.
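You can see both failure modes directly. The snippet below is a minimal illustration: float16 overflows to inf above its max value of 65504, and an update smaller than eps is lost entirely:

import torch

# Overflow: 70000 exceeds float16's max value of 65504
print(torch.tensor(70000.0, dtype=torch.float16))  # tensor(inf, dtype=torch.float16)

# Precision loss: an update smaller than eps (~0.000977) vanishes
x = torch.tensor(1.0, dtype=torch.float16)
print(x + 1e-4 == x)                               # tensor(True)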
Therefore, float32 is a good default choice for deep learning: it has a wide dynamic range and high precision. However, each float32 number requires 4 bytes of memory. As a compromise, you can use float16 to save memory, but you are more likely to encounter overflow or underflow issues because the dynamic range is much smaller.
The Google Brain team identified this problem and proposed bfloat16, a 16-bit floating-point format with the same dynamic range as float32. As a trade-off, its precision is an order of magnitude worse than float16's. It turns out that dynamic range matters more than precision for deep learning, which makes bfloat16 highly useful.
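The following minimal comparison illustrates the trade-off: bfloat16 represents magnitudes that overflow float16, but rounds away small relative differences that float16 can still resolve:

import torch

# bfloat16 keeps float32's dynamic range; float16 overflows
print(torch.tensor(3e38, dtype=torch.bfloat16))  # finite, roughly 3e38
print(torch.tensor(3e38, dtype=torch.float16))   # tensor(inf, dtype=torch.float16)

# ...but bfloat16 is coarser (eps 0.0078125 vs 0.000976562)
print(torch.tensor(1.001, dtype=torch.bfloat16)) # rounds to 1.0
print(torch.tensor(1.001, dtype=torch.float16))  # stays close to 1.001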
When you create a tensor in PyTorch, you can specify the data type. For example:
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
print(x)
There’s a easy approach to change the default to a distinct sort, reminiscent of bfloat16. That is useful for mannequin coaching. All you should do is about the next line earlier than you create any mannequin or optimizer:
# set default dtype to bfloat16
torch.set_default_dtype(torch.bfloat16)
Just by doing this, you force all your model weights and gradients to be of bfloat16 type. This saves half of the memory. In the previous article, you were advised to set the batch size to 8 to fit a GPU with only 12GB of VRAM. With bfloat16, you should be able to set the batch size to 16.
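You can verify the effect on any model. The sketch below uses a toy nn.Linear as a stand-in for a real model and shows the per-parameter storage dropping from 4 bytes to 2:

import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)
print(layer.weight.dtype, layer.weight.element_size())  # torch.float32 4

torch.set_default_dtype(torch.bfloat16)
layer = nn.Linear(1024, 1024)
print(layer.weight.dtype, layer.weight.element_size())  # torch.bfloat16 2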
Note that attempting to use 8-bit float or lower-precision types may not work, because you need both hardware support and PyTorch implementations of the corresponding mathematical operations. You can try the following code (requires a CUDA device) and find that extra effort is needed to operate on 8-bit floats:
import torch

dtype = torch.float8_e4m3fn

# Defining a tensor directly in float8 raises
# NotImplementedError: "normal_kernel_cuda" not implemented for 'Float8_e4m3fn'
# x = torch.randn(16, 16, dtype=dtype, device="cuda")

# Creating in float32 and converting to float8 works
x = torch.randn(16, 16, device="cuda").to(dtype)

# But matmul is not supported. You will see
# NotImplementedError: "addmm_cuda" not implemented for 'Float8_e4m3fn'
# y = x @ x.T

# The correct way to run matrix multiplication on 8-bit float
y = torch._scaled_mm(x, x.T, out_dtype=dtype,
                     scale_a=torch.tensor(1.0, device="cuda"),
                     scale_b=torch.tensor(1.0, device="cuda"))
print(y)
Automatic Mixed Precision Training
Training a model with float16 may run into issues because not all operations should be performed at lower precision. For example, matrix multiplication is robust at lower precision, but reduction operations, pooling, and some activation functions require float32.
You could set the data type manually for each component of your model, but this is tedious because you would have to convert data types between components. A better solution is to use automatic mixed precision training in PyTorch.
PyTorch has a sub-library, torch.amp, that can automatically cast the data type based on the operation. Not all operations are performed in the same floating-point type. If an operation is known to be robust at lower precision, this library casts the tensors to that precision before running the operation; hence the name "mixed precision". Using lower precision may not only save memory but also speed up training. Some GPUs can run float16 operations at twice the speed of float32.
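A quick way to see this in action (a minimal sketch requiring a CUDA device): under autocast, a matrix multiplication runs in the lower-precision type, while a reduction such as sum stays in float32, per PyTorch's AMP op lists:

import torch

with torch.autocast(device_type="cuda", dtype=torch.float16):
    a = torch.randn(8, 8, device="cuda")  # created as float32
    b = a @ a                             # matmul autocasts to float16
    c = b.sum()                           # reductions autocast to float32

print(b.dtype, c.dtype)  # torch.float16 torch.float32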
When you train a model with torch.amp, all you need to do is run your forward pass under the context of torch.amp.autocast(). Usually, you will also use a GradScaler to handle gradient scaling. This is necessary because, at low precision, you may encounter vanishing gradients due to the limited precision of your floating-point type. The GradScaler scales up the loss before the backward pass to prevent loss of gradient flow, and after the backward pass the gradients must be scaled back down for correct updates. This process can be cumbersome because you need to determine the right scale factor, which the GradScaler handles for you.
Compared to the training loop from the previous article, below is how you typically use torch.amp to train a model:
...
# Check if mixed precision training is supported
assert torch.amp.autocast_mode.is_autocast_available("cuda")

# Create a GradScaler before the training loop
scaler = torch.amp.GradScaler("cuda", enabled=True)

# start training
for epoch in range(begin_epoch, epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}")
    for batch_id, batch in enumerate(pbar):
        # get batched data
        input_ids, target_ids = batch
        # create attention mask: causal mask + padding mask
        attn_mask = create_causal_mask(input_ids.shape[1], device) + \
                    create_padding_mask(input_ids, PAD_TOKEN_ID, device)
        # with autocasting to bfloat16, run the forward pass
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(input_ids, attn_mask)
            loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        # backward with loss, scaled by the GradScaler
        optimizer.zero_grad()
        scaler.scale(loss).backward()
        # step the optimizer, then check whether the scale was reduced
        scaler.step(optimizer)
        old_scale = scaler.get_scale()
        scaler.update()
        # step the scheduler only if the optimizer step was not skipped
        if scaler.get_scale() >= old_scale:
            scheduler.step()
        pbar.set_postfix(loss=loss.item())
        pbar.update(1)
    pbar.close()
Using AMP autocasting is straightforward: keep the model at its default float32 precision, then wrap the forward pass and loss computation with torch.autocast(). Under this context, all supported operations run in the specified data type.
Once you have the loss, let the GradScaler handle the backward pass. It scales up the loss before updating the model's gradients. However, this can cause problems if the scale is too large, resulting in NaN or Inf gradients. Therefore, use scaler.step(optimizer) to step the optimizer; it verifies the gradients before executing the optimizer step. If the GradScaler decides not to step the optimizer, it reduces the scale factor when update() is called. By checking whether the scale was reduced, you can tell whether the optimizer step was skipped and step the scheduler only when it was not.
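Conceptually, this is what the GradScaler automates. The following is a simplified, illustrative sketch of the idea only, not the actual implementation; model, loss, and optimizer are the objects from the loop above:

# Simplified sketch of loss scaling, for intuition only
scale = 2.0 ** 16                      # initial loss scale

(loss * scale).backward()              # backward pass on the scaled loss
grads = [p.grad for p in model.parameters() if p.grad is not None]
for g in grads:
    g /= scale                         # unscale the gradients before stepping

if all(torch.isfinite(g).all() for g in grads):
    optimizer.step()                   # gradients are finite: take the step
else:
    scale /= 2.0                       # overflow detected: skip the step, reduce the scale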
Since the backward pass uses a scaled loss, if you use gradient clipping, you must unscale the gradients before clipping. Here is how to do it:
...
# backward with loss, scaled by the GradScaler
optimizer.zero_grad()
scaler.scale(loss).backward()
# unscale the gradients and apply gradient clipping
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# step the optimizer, then check whether the scale was reduced
scaler.step(optimizer)
old_scale = scaler.get_scale()
scaler.update()
if scaler.get_scale() >= old_scale:
    scheduler.step()
Usually, you don’t have to name scaler.unscale_() manually because it’s a part of the scaler.step(optimizer) name. Nevertheless, you need to accomplish that when making use of gradient clipping in order that the clipping perform can observe the precise gradients.
Autocasting is automated, however the GradScaler maintains a state to trace the dimensions issue. Subsequently, while you checkpoint your mannequin, you must also save the scaler.state_dict(), simply as you’d save the optimizer state:
...
# Loading checkpoint
checkpoint = torch.load("training_checkpoint.pth")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])
scaler.load_state_dict(checkpoint["scaler"])

# Saving checkpoint
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "scaler": scaler.state_dict(),
}, "training_checkpoint.pth")
Gradient Checkpointing
When you train a model with half precision, you use half the memory compared to 32-bit floats. With mixed-precision training, you may use slightly more memory because not all operations run at lower precision.
If you still encounter memory issues, another technique trades time for memory: gradient checkpointing. Recall that in deep learning, for a function $y=f(\mathbf{u})$ with $\mathbf{u}=g(\mathbf{x})$, we have
$$
\frac{\partial y}{\partial \mathbf{x}} = \big(\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\big)^\prime \frac{\partial y}{\partial \mathbf{u}}
$$
where $y$ is a scalar (usually the loss metric), $\mathbf{u}$ and $\mathbf{x}$ are vectors, and the prime denotes the transpose. The term $\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ is the Jacobian matrix of $\mathbf{u}$ with respect to $\mathbf{x}$.
The gradient $\frac{\partial y}{\partial \mathbf{x}}$ is needed to update $\mathbf{x}$ but depends on $\frac{\partial y}{\partial \mathbf{u}}$. Normally, when you run the forward pass, all intermediate results such as $\mathbf{u}$ are kept in memory so that the backward pass can readily compute the gradient $\frac{\partial y}{\partial \mathbf{u}}$. However, this requires substantial memory for deep networks.
Gradient checkpointing discards some intermediate results. As long as you know $\mathbf{u}=g(\mathbf{x})$, you can recompute $\mathbf{u}$ from $\mathbf{x}$ during the backward pass. This way, you don't need to keep $\mathbf{u}$ in memory, but you must compute $\mathbf{u}$ twice: once for the forward pass and once for the backward pass.
You can decide which intermediate results to discard. Applying gradient checkpointing to every pair of operations still requires storing many intermediate results; applying it to larger blocks saves more memory, as the sketch below illustrates.
Referring to the model from the previous article, you can wrap every transformer block with gradient checkpointing:
...
class LlamaModel(nn.Module):
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.rotary_emb = RotaryPositionEncoding(
            config.hidden_size // config.num_attention_heads,
            config.max_position_embeddings,
        )
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, input_ids: Tensor, attn_mask: Tensor) -> Tensor:
        # Convert input token IDs to embeddings
        hidden_states = self.embed_tokens(input_ids)
        # Process through all transformer layers, then the final norm layer
        for layer in self.layers:
            # Previously:
            # hidden_states = layer(hidden_states, rope=self.rotary_emb, attn_mask=attn_mask)
            hidden_states = torch.utils.checkpoint.checkpoint(layer, hidden_states, self.rotary_emb, attn_mask)
        hidden_states = self.norm(hidden_states)
        # Return the final hidden states
        return hidden_states
Only one line of code needs to change: in the for-loop inside the forward() function, instead of calling the transformer block directly, use torch.utils.checkpoint.checkpoint(). This runs the forward pass with gradient checkpointing, discarding all intermediate results and retaining only the block's input and output. During the backward pass, the intermediate results are temporarily recomputed from the input.
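One practical note: recent PyTorch versions emit a warning if use_reentrant is not specified, and the non-reentrant implementation is the recommended one. You can silence the warning by passing the flag explicitly:

hidden_states = torch.utils.checkpoint.checkpoint(
    layer, hidden_states, self.rotary_emb, attn_mask,
    use_reentrant=False,  # select the recommended non-reentrant implementation
)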
Further Readings
Below are some resources that you may find useful:
Summary
In this article, you learned techniques for training a language model with limited memory. In particular, you learned that:
Several types of floating-point numbers exist, some using less memory than others.
Mixed-precision training automatically uses lower-precision floating-point numbers without sacrificing accuracy on critical operations.
Gradient checkpointing trades time for memory during training.


