Imagine trying to renovate the foundations of a towering skyscraper without asking the residents to leave or pause their work. That is essentially what Moonshot AI's checkpoint engine does for AI models: it updates the weights of large language models in place, while they keep serving requests, so there is no downtime. This allows developers to quickly and efficiently improve AI, even for models with a trillion parameters running on thousands of GPUs. Designed to keep AI systems running smoothly while they evolve in real time, it is a key tool for cutting-edge AI applications. This article explains what it is, how it works, and why it matters for the future of large AI systems.
What is the checkpoint engine in Moonshot AI?
Moonshot AI's checkpoint engine is specialized middleware that updates the weights of large-scale language models (LLMs) in real time during inference, without interrupting ongoing operations. This capability is critical in reinforcement learning scenarios, where model weights must be updated frequently. The checkpoint engine integrates seamlessly with the vLLM inference framework, achieving optimized performance through pipelining and careful memory management. It also offers features such as reusing weights from existing instances, which reduces overhead when scaling out.
Architecture
At the checkpoint engine's core is the parameter server class, which implements the weight-update logic and orchestrates the data flow through three stages:
H2D (host to device): moves updated weights from CPU memory or storage into GPU memory through an optimized transfer pipeline.
Broadcast: efficiently distributes the weights across all inference engine instances, using CUDA IPC buffers for shared-memory communication.
Reload: each inference engine selectively reloads the relevant weight shards from the broadcast data according to its sharding pattern.
This three-stage pipeline overlaps communication with copying, keeping transfers efficient and fast. If GPU memory is limited, the system falls back to serial execution, trading speed for reliability.
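The three stages above can be pictured as a toy Python pipeline. This is purely illustrative: the names (run_pipeline, the round-robin sharding) are invented for the sketch, and the real engine moves tensors over CUDA streams and IPC buffers, not thread-safe queues.

```python
# Toy sketch of the H2D -> broadcast -> reload pipeline: buckets of
# weights stream through queues on worker threads, so stage 1 of one
# bucket overlaps stage 2/3 of earlier buckets (as the article describes).
import queue
import threading

def run_pipeline(buckets, num_engines=2):
    h2d_q, bcast_q = queue.Queue(), queue.Queue()
    reloaded = [[] for _ in range(num_engines)]  # shards each engine keeps

    def h2d():
        # Stage 1: copy each bucket from host (CPU) memory to the device.
        for b in buckets:
            h2d_q.put(("gpu", b))
        h2d_q.put(None)  # sentinel: no more buckets

    def broadcast():
        # Stage 2: fan each GPU-resident bucket out to every engine,
        # standing in for the CUDA-IPC shared-memory broadcast.
        while (item := h2d_q.get()) is not None:
            bcast_q.put(item)
        bcast_q.put(None)

    def reload():
        # Stage 3: each engine picks only the shards it owns from the
        # broadcast data (round-robin sharding here, for illustration).
        while (item := bcast_q.get()) is not None:
            _, (idx, tensor) = item
            reloaded[idx % num_engines].append(tensor)

    threads = [threading.Thread(target=f) for f in (h2d, broadcast, reload)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return reloaded

shards = run_pipeline([(i, f"w{i}") for i in range(6)], num_engines=2)
```

Because all three stages run concurrently on separate threads, a later bucket's host-to-device copy proceeds while an earlier bucket is still being broadcast, which is the overlap that gives the real pipeline its speed.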
Methods used
The checkpoint engine uses two main methods to update the model's weights during inference.
Broadcast method: the fastest approach and the default. It is ideal when many inference instances must be updated at the same time. It broadcasts the updated weights, synchronized from CPU memory, to all inference GPUs, keeping every instance in full sync with minimal delay.
P2P (peer-to-peer) method: used when inference instances are dynamically added or removed at runtime. It avoids disrupting existing inference workloads and enables smooth, flexible updates by sending weights directly from the CPU of an existing instance to the GPU of a new instance through a peer-to-peer transfer system.
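The decision rule implied above can be summed up in a few lines. This helper is hypothetical: the real engine selects the method through a command-line flag on its update script, not through a Python function like this one.

```python
# Hypothetical helper capturing the trade-off described above; not part
# of the checkpoint-engine API.
def choose_update_method(instances_change_at_runtime: bool) -> str:
    """Pick a weight-update strategy.

    Broadcast: all instances are known up front, so update them in
    lockstep with minimal delay. P2P: instances join or leave while
    serving, so stream weights point-to-point to the newcomer without
    disturbing in-flight requests on the others.
    """
    return "p2p" if instances_change_at_runtime else "broadcast"

method = choose_update_method(instances_change_at_runtime=False)
```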
How it works
The checkpoint engine coordinates the entire transfer process. First, it gathers the metadata needed to form a plan, such as deciding the appropriate bucket size for data transfers. It then executes the transfer, controlling the inference engines through a ZeroMQ socket to maximize performance. By overlapping communication with copying, it organizes data transfers into a pipeline, enabling fast and efficient weight updates even under heavy workloads.
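The bucket-planning step can be pictured with a small greedy packer. This is a sketch under assumed logic, not the engine's actual planner: the function name, the parameter list, and the byte sizes are all invented for illustration.

```python
# Toy illustration of grouping parameter tensors into fixed-size
# transfer buckets, so the H2D copy of one bucket can overlap the
# broadcast of the previous one.
def plan_buckets(param_sizes, bucket_bytes):
    """Greedily pack (name, nbytes) parameters into buckets <= bucket_bytes.

    A parameter larger than bucket_bytes gets a bucket of its own.
    """
    buckets, current, used = [], [], 0
    for name, nbytes in param_sizes:
        if current and used + nbytes > bucket_bytes:
            buckets.append(current)  # close the full bucket
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        buckets.append(current)
    return buckets

# Tiny made-up example: a budget of 8 "bytes" per bucket.
plan = plan_buckets(
    [("embed", 6), ("q_proj", 3), ("k_proj", 3), ("lm_head", 7)],
    bucket_bytes=8,
)
```

Larger buckets mean fewer, bigger transfers (better bandwidth utilization); smaller buckets mean more overlap between the pipeline stages, which is why the engine derives the bucket size from the gathered metadata rather than hard-coding it.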
Through this architecture and these methods, the checkpoint engine enables live weight updates of LLMs across thousands of GPUs with minimal latency and service disruption.
Installation and usage
Installation
To use the fastest broadcast implementation, run:
code:
pip install checkpoint-engine
To use the flexible P2P implementation, run:
code:
pip install 'checkpoint-engine[p2p]'
This also installs mooncake-transfer-engine to support RDMA transfers between different ranks.
Example use case
Step 1:
Prepare an H800 or H20 machine with 8 GPUs and the latest vLLM. The checkpoint engine uses the /collective_rpc API endpoint to update weights, so make sure your vLLM build includes the commit that adds it (available on the main branch).
Step 2:
Install the checkpoint engine.
code:
uv pip install 'checkpoint-engine[p2p]'
Step 3:
This example uses Qwen/Qwen3-235B-A22B-Instruct-2507 as the test model. Download it first:
code:
hf download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/
Step 4:
Start vLLM in development mode with --load-format dummy, and make sure to set --worker-extension-cls checkpoint_engine.worker.VllmColocateWorkerExtension:
code:
VLLM_SERVER_DEV_MODE=1 python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 19730 --trust-remote-code --tensor-parallel-size=8 --max-model-len 4096 --load-format dummy --model /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/ --worker-extension-cls checkpoint_engine.worker.VllmColocateWorkerExtension
Update the weights from the checkpoint engine side. There is no need to wait for vLLM to finish starting up; use the command below:
code:
torchrun --nproc-per-node 8 examples/update.py --update-method all --checkpoint-path /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/
Reusing weights from existing instances
New checkpoint engine instances can join existing ones and reuse their weights, using the following steps:
Step 1: Start an existing instance with --save-metas-file global_metas.pkl to save the global metas to a file.
Step 2: Add --sleep-time 300 to make sure it stays alive.
code:
torchrun --nproc-per-node 8 examples/update.py --checkpoint-path $MODEL_PATH --sleep-time 300 --save-metas-file global_metas.pkl
Step 3: After the checkpoint is registered, a new instance can obtain a copy of it by setting --load-metas-file global_metas.pkl:
code:
torchrun --nproc-per-node 8 examples/update.py --load-metas-file global_metas.pkl
FP8 quantization
Currently, FP8 quantization does not work in vLLM when weights are updated. To handle weight updates correctly, apply the simple patch in patch/vllm_fp8.patch. Note that this patch has been tested only on DeepSeek-V3.1 and Kimi-K2, so there may be compatibility issues with other models.
Testing
Run a simple correctness test for checkpoint-engine:
code:
torchrun --nproc-per-node 8 tests/test_update.py
Benchmarks
Model | Device Setup | Metadata Gather | Update (Broadcast) | Update (P2P)
GLM-4.5-Air (BF16) | 8x H800 TP8 | 0.17s | 3.94s (1.42 GiB) | 8.83s (4.77 GiB)
Qwen3-235B-A22B-Instruct-2507 (BF16) | 8x H800 TP8 | — | — (2.69 GiB) | 16.47s (4.05 GiB)
DeepSeek-V3.1 (FP8) | 16x H20 TP16 | 1.44s | 12.22s (2.38 GiB) | 25.77s (3.61 GiB)
Kimi-K2-Instruct (FP8) | 16x H20 TP16 | 1.81s | 15.45s (2.93 GiB) | — (4.46 GiB)
DeepSeek-V3.1 (FP8) | 256x H20 TP16 | 1.40s | 13.88s (2.54 GiB) | 33.30s (—)
Kimi-K2-Instruct (FP8) | 256x H20 TP16 | 1.88s | 21.50s (2.99 GiB) | 34.49s (4.57 GiB)
Insights
Here are some observations from these benchmarks:
The broadcast method generally delivers the fastest update times, since it is optimized for synchronous weight updates across many inference instances. The P2P method takes longer but enables dynamic updates when instances join or leave at runtime. These benchmarks demonstrate the checkpoint engine's scalability: it handles trillion-parameter models efficiently in clusters ranging from 8 to 256 GPUs.
Checkpoint Engine Limitations
The checkpoint engine is a powerful solution for live weight updates on LLMs, but it currently has some limitations.
It works best with vLLM for now: the engine is primarily tested with the vLLM framework, so using it with other AI frameworks or custom setups may require additional work.
The pipeline is still improving: the fully overlapped, seamless pipeline is not yet complete, which means updates could become even faster.
P2P updates could be smoother: the peer-to-peer method funnels data through one main node before sharing it with the other nodes, creating a bottleneck.
Extra GPU memory is required: the clever broadcast scheme uses additional GPU memory to speed things up, and on machines with low memory it falls back to a slower, less efficient path.
Limited support for FP8 models: new FP8-quantized models need experimental patches, which have been verified only on a few tested models and may not work for all.
Conclusion
Moonshot AI's checkpoint engine is a game changer for updating huge AI models without stopping them. It keeps everything running smoothly while the model's "brain" gets smarter in real time. There are still areas to improve, but the possibilities are enormous. If you are running a large AI system, this tool is definitely worth watching: it helps make the future of AI faster and more efficient, with no downtime.
FAQ
Q. What does the checkpoint engine do?
A. It lets large language models update their weights in real time during inference, without downtime, so AI systems stay online while improving.
Q. Which inference frameworks does it support?
A. Currently, it is primarily integrated and tested with the vLLM inference framework.
Q. Should I use the broadcast or the P2P method?
A. Broadcast is faster for synchronized updates across many GPUs, while P2P allows flexible updates when instances join or leave.


