Train a Model Faster with torch.compile and Gradient Accumulation

Published: January 1, 2026

Training language models with deep transformer architectures takes time. However, there are techniques you can use to speed up training. In this article, you will learn about:

  • Speeding up your model using torch.compile()
  • Training your model with a larger effective batch size using gradient accumulation

Let’s get started!

Train models faster with torch.compile and gradient accumulation.
Photo by François Genon. Some rights reserved.

Overview

This article is divided into two parts. They are:

  • Using torch.compile
  • Gradient accumulation

Using torch.compile

When you write model code and run it in PyTorch, the code runs in eager mode. This means the code is executed line by line and the results are kept in memory. This is the natural mode for Python, since Python is an interpreted language. You can tell this is the case when you make a mistake in your code: you do not see the error until you run that line of code.
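For example, a shape mismatch in eager mode only surfaces when the offending line actually runs. A minimal sketch (the function and shapes below are made up for illustration):

import torch

def buggy_step(x):
    y = x * 2                      # runs fine
    return y @ torch.randn(3, 3)   # shape mismatch: the error appears only when this line executes

buggy_step(torch.randn(5))         # RuntimeError at call time, not at definition time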

Running a model in eager mode can be slow. Starting with PyTorch 2.0, you can use torch.compile() to compile your model and improve performance. This produces a new, optimized model object. It is not the same object as the one you created with nn.Module, but it shares the same tensors as the original model. The compiled model can be used for forward passes, backward passes, and optimizer updates as usual.
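As a minimal sketch of this behavior (the toy model, optimizer, and random data below are assumptions for illustration, not from the original article), the compiled object is used exactly like the original module:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
compiled_model = torch.compile(model)   # optimized wrapper sharing the same parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # the original parameters are the compiled ones

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(compiled_model(x), y)   # first forward pass triggers compilation
loss.backward()                                        # backward pass works as usual
optimizer.step()
optimizer.zero_grad()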

Building a model and compiling it into a computational graph is how TensorFlow 1.0 originally worked. This makes debugging difficult because you cannot match the model you run, line by line, to the code you wrote. Therefore, do not compile the model until you have run a trial and confirmed there are no errors.

Not all models can be compiled. However, if your model supports compilation, you can benefit from the speedup immediately. To compile a model, simply replace the model object just before you are ready to use it:

...
model = LlamaForPretraining(model_config).to(device)
model.load_state_dict(checkpoint)
model = torch.compile(model)
...

Do not load model weights after compilation. This is because the compiled model is an object that shares the same weight tensors as the original model. During compilation, a computational graph is built that references the original model’s weight tensors. If you load weights after compilation, your model may not behave as expected.

Similarly, to save the compiled model, you need to reference the original model’s state dictionary, like this:

torch.save(getattr(model, "_orig_mod", model).state_dict(), "model.pth")

From a compiled model, the original model can be accessed as model._orig_mod. The code above uses getattr(model, "_orig_mod", model) to get the original model if it exists, otherwise it falls back to the model itself. This line of code therefore works for both compiled and original models.
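One practical consequence, shown here as a sketch rather than as part of the original article: in the PyTorch 2.x releases I am aware of, calling state_dict() on the compiled wrapper itself yields keys prefixed with "_orig_mod.", which will not match a freshly constructed model. Saving through the original module keeps the parameter names clean, and loading then happens before compilation, in line with the warning above:

# Saving through the original module keeps plain parameter names
state_dict = getattr(model, "_orig_mod", model).state_dict()
torch.save(state_dict, "model.pth")

# Later: construct, load, and only then compile
new_model = LlamaForPretraining(model_config).to(device)
new_model.load_state_dict(torch.load("model.pth", map_location=device))
new_model = torch.compile(new_model)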

Gradient accumulation

When training a model, you probably spend two to three times longer on the backward pass than on the forward pass. This is because backward passes are more computationally intensive and use more memory.

One simple trick to speed up training is to reduce the number of backward passes. This can be achieved by increasing the batch size: for the same number of data samples, a larger batch size means fewer batches to process.

However, larger batch sizes require more memory. In memory-constrained environments, a larger batch size can be mimicked by performing multiple forward passes and accumulating the gradients. This is called gradient accumulation.

This idea is easier to explain in code:

...
accumulate_steps = 4
for epoch in range(num_epochs):
    optimizer.zero_grad()
    for i, batch in enumerate(dataloader):
        # Get batched data
        input_ids, target_ids = batch
        # Create attention mask: causal mask + padding mask
        attn_mask = create_causal_mask(input_ids.shape[1], device) + \
                    create_padding_mask(input_ids, PAD_TOKEN_ID, device)
        # Extract output from the model
        logits = model(input_ids, attn_mask)
        # Calculate loss: cross-entropy between logits and targets, ignoring padding tokens
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        loss = loss / accumulate_steps
        # Run backward, but only update once every `accumulate_steps` steps
        loss.backward()
        if (i + 1) % accumulate_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
            scheduler.step()

The training loop above is an excerpt from a previous article on training Llama models on local GPUs.

Normally, when you perform a forward pass, you compute a loss. You then call loss.backward() to backpropagate the loss gradient through the model parameters. In PyTorch, the backward() method is cumulative, meaning the gradients are added together. Therefore, you must explicitly call optimizer.zero_grad() to clear the gradients before the next backward pass.
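You can see this accumulation with a single parameter. A minimal sketch, unrelated to the Llama training loop above:

import torch

w = torch.ones(1, requires_grad=True)

(w * 3).sum().backward()
print(w.grad)        # tensor([3.])

(w * 5).sum().backward()
print(w.grad)        # tensor([8.])  -- gradients are added, not replaced

w.grad.zero_()       # what optimizer.zero_grad() does for every parameter
print(w.grad)        # tensor([0.])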

In the training loop above, we deliberately do not call optimizer.zero_grad() on every iteration. Instead, we run backpropagation on the loss divided by accumulate_steps. In this way, the gradients are scaled down but accumulated over accumulate_steps iterations. Once every accumulate_steps iterations, we run the optimizer to update the model parameters.

This approach produces results comparable to using a larger batch size. However, you will perform fewer optimizer updates, and you will need to adjust your learning rate schedule accordingly. This means the scheduler must be initialized with a different number of steps:

...
num_training_steps = (len(dataloader) // accumulate_steps) * num_epochs
cosine_scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps - num_warmup_steps,
    eta_min=0
)
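The article does not show how num_warmup_steps is consumed. One common way, shown here as an assumption rather than the author's code, is to chain a linear warmup with the cosine schedule using SequentialLR; the scheduler.step() call in the training loop above then advances whichever phase is active:

from torch.optim import lr_scheduler

warmup_scheduler = lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=num_warmup_steps
)
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[num_warmup_steps],
)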

Further Reading

Below are some resources that you may find interesting.

Summary

In this article, you learned that you can use torch.compile() to compile a computational graph and speed up your model. You also learned that gradient accumulation is a technique for training with a larger effective batch size by accumulating gradients from multiple mini-batches. This method performs fewer optimizer updates, saving time on parameter updates.
