Imagine trying to renovate the foundations of a towering skyscraper without asking the residents to leave or pause their work. That is essentially what Moonshot AI's checkpoint engine does for AI models: it updates the weights of large language models in place, while they keep serving requests, so there is no downtime. This allows developers to quickly and efficiently improve AI, even for models with a trillion parameters running on thousands of GPUs. Designed to keep AI systems running smoothly while they evolve in real time, it is a key tool for cutting-edge AI applications. This article explains what it is, how it works, and why it matters for the future of large AI systems.
What is the checkpoint engine in Moonshot AI?
Moonshot AI's checkpoint engine is specialized middleware that updates the weights of large-scale language models (LLMs) in real time during inference, without interrupting ongoing operations. This capability is critical in reinforcement learning scenarios, where model weights must be updated frequently. The checkpoint engine integrates seamlessly with the vLLM inference framework, achieving optimized performance through pipelining and careful memory management. It also offers features such as reusing weights from existing instances, which reduces overhead when scaling out.
Architecture
At the checkpoint engine's core is the parameter server class, which implements the weight-update logic and orchestrates the data flow through three stages:
H2D (host to device): moves updated weights from CPU memory or storage into GPU memory through an optimized transfer pipeline.
Broadcast: efficiently distributes the weights across all inference engine instances, using CUDA IPC buffers for shared-memory communication.
Reload: each inference engine selectively reloads the relevant weight shards from the broadcast data according to its sharding pattern.
This three-stage pipeline overlaps communication with copying, keeping transfers efficient and fast. If GPU memory is limited, the system falls back to serial execution, trading speed for reliability.
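The three stages above can be pictured as a toy Python pipeline. This is purely illustrative: the names (run_pipeline, the round-robin sharding) are invented for the sketch, and the real engine moves tensors over CUDA streams and IPC buffers, not thread-safe queues.

```python
# Toy sketch of the H2D -> broadcast -> reload pipeline: buckets of
# weights stream through queues on worker threads, so stage 1 of one
# bucket overlaps stage 2/3 of earlier buckets (as the article describes).
import queue
import threading

def run_pipeline(buckets, num_engines=2):
    h2d_q, bcast_q = queue.Queue(), queue.Queue()
    reloaded = [[] for _ in range(num_engines)]  # shards each engine keeps

    def h2d():
        # Stage 1: copy each bucket from host (CPU) memory to the device.
        for b in buckets:
            h2d_q.put(("gpu", b))
        h2d_q.put(None)  # sentinel: no more buckets

    def broadcast():
        # Stage 2: fan each GPU-resident bucket out to every engine,
        # standing in for the CUDA-IPC shared-memory broadcast.
        while (item := h2d_q.get()) is not None:
            bcast_q.put(item)
        bcast_q.put(None)

    def reload():
        # Stage 3: each engine picks only the shards it owns from the
        # broadcast data (round-robin sharding here, for illustration).
        while (item := bcast_q.get()) is not None:
            _, (idx, tensor) = item
            reloaded[idx % num_engines].append(tensor)

    threads = [threading.Thread(target=f) for f in (h2d, broadcast, reload)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return reloaded

shards = run_pipeline([(i, f"w{i}") for i in range(6)], num_engines=2)
```

Because all three stages run concurrently on separate threads, a later bucket's host-to-device copy proceeds while an earlier bucket is still being broadcast, which is the overlap that gives the real pipeline its speed.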
Methods used
The checkpoint engine uses two main methods to update the model's weights during inference.
Broadcast method: the fastest approach and the default. It is ideal when many inference instances must be updated at the same time. It broadcasts the updated weights, synchronized from CPU memory, to all inference GPUs, keeping every instance in full sync with minimal delay.
P2P (peer-to-peer) method: used when inference instances are dynamically added or removed at runtime. It avoids disrupting existing inference workloads and enables smooth, flexible updates by sending weights directly from the CPU of an existing instance to the GPU of a new instance through a peer-to-peer transfer system.
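The decision rule implied above can be summed up in a few lines. This helper is hypothetical: the real engine selects the method through a command-line flag on its update script, not through a Python function like this one.

```python
# Hypothetical helper capturing the trade-off described above; not part
# of the checkpoint-engine API.
def choose_update_method(instances_change_at_runtime: bool) -> str:
    """Pick a weight-update strategy.

    Broadcast: all instances are known up front, so update them in
    lockstep with minimal delay. P2P: instances join or leave while
    serving, so stream weights point-to-point to the newcomer without
    disturbing in-flight requests on the others.
    """
    return "p2p" if instances_change_at_runtime else "broadcast"

method = choose_update_method(instances_change_at_runtime=False)
```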
How it works
The checkpoint engine coordinates the entire transfer process. First, it gathers the metadata needed to form a plan, such as deciding the appropriate bucket size for data transfers. It then executes the transfer, controlling the inference engines through a ZeroMQ socket to maximize performance. By overlapping communication with copying, it organizes data transfers into a pipeline, enabling fast and efficient weight updates even under heavy workloads.
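The bucket-planning step can be pictured with a small greedy packer. This is a sketch under assumed logic, not the engine's actual planner: the function name, the parameter list, and the byte sizes are all invented for illustration.

```python
# Toy illustration of grouping parameter tensors into fixed-size
# transfer buckets, so the H2D copy of one bucket can overlap the
# broadcast of the previous one.
def plan_buckets(param_sizes, bucket_bytes):
    """Greedily pack (name, nbytes) parameters into buckets <= bucket_bytes.

    A parameter larger than bucket_bytes gets a bucket of its own.
    """
    buckets, current, used = [], [], 0
    for name, nbytes in param_sizes:
        if current and used + nbytes > bucket_bytes:
            buckets.append(current)  # close the full bucket
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        buckets.append(current)
    return buckets

# Tiny made-up example: a budget of 8 "bytes" per bucket.
plan = plan_buckets(
    [("embed", 6), ("q_proj", 3), ("k_proj", 3), ("lm_head", 7)],
    bucket_bytes=8,
)
```

Larger buckets mean fewer, bigger transfers (better bandwidth utilization); smaller buckets mean more overlap between the pipeline stages, which is why the engine derives the bucket size from the gathered metadata rather than hard-coding it.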
Through this architecture and these methods, the checkpoint engine enables live weight updates of LLMs across thousands of GPUs with minimal latency and service disruption.
Installation and usage
Installation
To use the fastest broadcast implementation, run:
code:
pip install checkpoint-engine
To use the flexible P2P implementation, run:
code:
pip install 'checkpoint-engine[p2p]'
This also installs mooncake-transfer-engine to support RDMA transfers between different ranks.
Example use case
Step 1:
Prepare an H800 or H20 machine with 8 GPUs and the latest vLLM. The checkpoint engine uses the /collective_rpc API endpoint to update weights, so make sure your vLLM build includes the commit that adds it (available on the main branch).
Step 2:
Install the checkpoint engine.
code:
uv pip install 'checkpoint-engine[p2p]'
Step 3:
This example uses Qwen/Qwen3-235B-A22B-Instruct-2507 as the test model. Download it first:
code:
hf download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/
Step 4:
Start vLLM in development mode with --load-format dummy, and make sure to set --worker-extension-cls checkpoint_engine.worker.VllmColocateWorkerExtension:
code:
VLLM_SERVER_DEV_MODE=1 python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 19730 --trust-remote-code --tensor-parallel-size=8 --max-model-len 4096 --load-format dummy --model /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/ --worker-extension-cls checkpoint_engine.worker.VllmColocateWorkerExtension
Update the weights from the checkpoint engine side. There is no need to wait for vLLM to finish starting up; use the command below:
code:
torchrun --nproc-per-node 8 examples/update.py --update-method all --checkpoint-path /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/
Reusing weights from existing instances
New checkpoint engine instances can join existing ones and reuse their weights, using the following steps:
Step 1: Start an existing instance with --save-metas-file global_metas.pkl to save the global metas to a file.
Step 2: Add --sleep-time 300 to make sure it stays alive.
code:
torchrun --nproc-per-node 8 examples/update.py --checkpoint-path $MODEL_PATH --sleep-time 300 --save-metas-file global_metas.pkl
Step 3: After the checkpoint is registered, a new instance can obtain a copy of it by setting --load-metas-file global_metas.pkl:
code:
torchrun --nproc-per-node 8 examples/update.py --load-metas-file global_metas.pkl
FP8 quantization
Currently, FP8 quantization does not work in vLLM when weights are updated. To handle weight updates correctly, apply the simple patch in patch/vllm_fp8.patch. Note that this patch has been tested only on DeepSeek-V3.1 and Kimi-K2, so there may be compatibility issues with other models.
Testing
Run a simple correctness test for checkpoint-engine:
code:
torchrun --nproc-per-node 8 tests/test_update.py
Benchmarks
Model | Device Setup | Metadata Gather | Update (Broadcast) | Update (P2P)
GLM-4.5-Air (BF16) | 8x H800 TP8 | 0.17s | 3.94s (1.42 GiB) | 8.83s (4.77 GiB)
Qwen3-235B-A22B-Instruct-2507 (BF16) | 8x H800 TP8 | — | — (2.69 GiB) | 16.47s (4.05 GiB)
DeepSeek-V3.1 (FP8) | 16x H20 TP16 | 1.44s | 12.22s (2.38 GiB) | 25.77s (3.61 GiB)
Kimi-K2-Instruct (FP8) | 16x H20 TP16 | 1.81s | 15.45s (2.93 GiB) | — (4.46 GiB)
DeepSeek-V3.1 (FP8) | 256x H20 TP16 | 1.40s | 13.88s (2.54 GiB) | 33.30s (—)
Kimi-K2-Instruct (FP8) | 256x H20 TP16 | 1.88s | 21.50s (2.99 GiB) | 34.49s (4.57 GiB)
Insights
Here are some observations from these benchmarks:
The broadcast method generally delivers the fastest update times, since it is optimized for synchronous weight updates across many inference instances. The P2P method takes longer but enables dynamic updates when instances join or leave at runtime. These benchmarks demonstrate the checkpoint engine's scalability: it handles trillion-parameter models efficiently in clusters ranging from 8 to 256 GPUs.
Checkpoint Engine Limitations
The checkpoint engine is a powerful solution for live weight updates on LLMs, but it currently has some limitations.
It works best with vLLM for now: the engine is primarily tested with the vLLM framework, so using it with other AI frameworks or custom setups may require additional work.
The pipeline is still improving: the fully overlapped, seamless pipeline is not yet complete, which means updates could become even faster.
P2P updates could be smoother: the peer-to-peer method funnels data through one main node before sharing it with the other nodes, creating a bottleneck.
Extra GPU memory is required: the clever broadcast scheme uses additional GPU memory to speed things up, and on machines with low memory it falls back to a slower, less efficient path.
Limited support for FP8 models: new FP8-quantized models need experimental patches, which have been verified only on a few tested models and may not work for all.
Conclusion
Moonshot AI's checkpoint engine is a game changer for updating huge AI models without stopping them. It keeps everything running smoothly while the model's "brain" gets smarter in real time. There are still areas to improve, but the possibilities are enormous. If you are running a large AI system, this tool is definitely worth watching: it helps make the future of AI faster and more efficient, with no downtime.
FAQ
Q. What does the checkpoint engine do?
A. It lets large language models update their weights in real time during inference, without downtime, so AI systems stay online while improving.
Q. Which inference frameworks does it support?
A. Currently, it is primarily integrated and tested with the vLLM inference framework.
Q. Should I use the broadcast or the P2P method?
A. Broadcast is faster for synchronized updates across many GPUs, while P2P allows flexible updates when instances join or leave.


