Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning with trillion-parameter mixture-of-experts (MoE) models. It focuses on heavy agentic workloads, such as long-running software engineering tasks.
The research team trained GLM-5 on SWE tasks with sequence lengths up to 131k. Step times stayed under 5 minutes at a batch size of 256 rollouts. The run used only 28 H200 nodes.
TL;DR
prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads. GLM-5 was trained on SWE tasks with a 131k sequence length, steps under 5 minutes, and 28 H200 nodes. Asynchronous RL separates the trainer from inference so that each can be optimized independently. Inference uses FP8, wide EP, P/D disaggregation, KV offload, and router replay. Training uses 3-D parallelism (FSDP, EP, CP) and block-scaled FP8.
What is prime-rl 0.6.0?
prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks. Version 0.6.0 extends this to trillion-parameter MoE scale.
The example model presented is zai-org/GLM-5.1. The same optimizations can be applied to other large-scale MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.
A complete GLM-5.1 run starts with a single command on your Slurm cluster.
The role of asynchronous RL
Agentic tasks have long-tail outliers. Some coding rollouts run for several hours. Waiting for them before every policy update would leave GPUs idle.
Asynchronous RL avoids this. The trainer and the inference system are separated, so they run and scale independently. The inference policy is updated as soon as an optimizer step completes.
There is a single synchronization point: the policy update. prime-rl pushes new weights as soon as they exist. Rollouts that have already been dispatched keep their active prefix cache, so tokens from several policy versions can be mixed within a single rollout.
New rollouts behave differently. They repopulate their own KV cache even when prefixes match, which a KV cache salt enforces. Requests from policies that are too old are dropped. The max_off_policy_steps value controls that threshold.
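The staleness rule and cache salting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not prime-rl's code: `Rollout`, `cache_salt`, `is_fresh`, and `filter_rollouts` are hypothetical names, and only `max_off_policy_steps` comes from the text. It assumes staleness is the gap between the trainer's current step and the policy step that started the rollout.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    rollout_id: str
    start_policy_step: int  # policy version that served the first token


def cache_salt(policy_step: int) -> str:
    # Hypothetical salt: each policy version gets a distinct salt, so a new
    # rollout never reuses KV blocks computed by older weights.
    return f"policy-{policy_step}"


def is_fresh(rollout: Rollout, trainer_step: int, max_off_policy_steps: int) -> bool:
    # Assumed rule: keep a rollout only if it lags by at most the threshold.
    return trainer_step - rollout.start_policy_step <= max_off_policy_steps


def filter_rollouts(rollouts, trainer_step, max_off_policy_steps):
    return [r for r in rollouts if is_fresh(r, trainer_step, max_off_policy_steps)]


rollouts = [Rollout("a", 10), Rollout("b", 7), Rollout("c", 3)]
kept = filter_rollouts(rollouts, trainer_step=10, max_off_policy_steps=4)
print([r.rollout_id for r in kept])
```

Here rollout "c" lags by 7 steps and is dropped, while "a" and "b" stay within the threshold of 4.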
Inference optimization
Inference is usually the throughput bottleneck in RL systems. prime-rl optimizes throughput while keeping latency bounded.
FP8 inference: Lower precision means faster prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.
Wide Expert Parallelism: Wide EP distributes experts across 32 or more GPUs. It is combined with a large data-parallel size (for example 32). Each GPU holds its own experts and acts as an endpoint. Synchronization happens layer by layer through dispatch and combine operations.
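The dispatch and combine steps can be illustrated with a toy, single-process sketch. It assumes experts are sharded contiguously across ranks and uses scalar "hidden states"; all function names are hypothetical, and real implementations such as DeepEP perform these steps as GPU all-to-all communication rather than Python loops.

```python
from collections import defaultdict


def owner_rank(expert_id: int, experts_per_rank: int) -> int:
    # Assumed contiguous sharding: experts are split evenly across ranks.
    return expert_id // experts_per_rank


def dispatch(token_routes, experts_per_rank):
    # token_routes[t] is a list of (expert_id, gate_weight) pairs for token t.
    # Returns {rank: [(token_index, expert_id, gate_weight), ...]}.
    per_rank = defaultdict(list)
    for t, routes in enumerate(token_routes):
        for expert_id, weight in routes:
            rank = owner_rank(expert_id, experts_per_rank)
            per_rank[rank].append((t, expert_id, weight))
    return dict(per_rank)


def combine(per_rank, hidden, expert_fn):
    # Each rank runs its local experts; outputs are summed back per token.
    out = [0.0] * len(hidden)
    for work in per_rank.values():
        for t, expert_id, weight in work:
            out[t] += weight * expert_fn(expert_id, hidden[t])
    return out


# Toy setup: 8 experts over 4 ranks, top-2 routing, 3 tokens.
hidden = [1.0, 2.0, 3.0]
routes = [[(0, 0.5), (5, 0.5)], [(2, 0.5), (3, 0.5)], [(7, 0.25), (1, 0.75)]]
per_rank = dispatch(routes, experts_per_rank=2)
output = combine(per_rank, hidden, lambda e, x: (e + 1) * x)
print(sorted(per_rank), output)
```

The point of the sketch is that tokens travel to the rank that owns each selected expert, and only the per-token results travel back.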
Prefill/decode disaggregation: For some model↔environment pairs, the ratio of prefill to decode tokens reaches 4:1. Sharing workers between the two phases increases end-to-end latency, which reduces the benefit of PipelineRL. P/D disaggregation separates prefill and decode workers, so decode workers no longer stall when tool outputs are long.
KV cache management: High concurrency requires a large KV cache. prime-rl supports tiered offloading to CPU and disk. vLLM's native offload creates one pool per worker. Mooncake Store instead pools RAM and disk centrally across all nodes.
Request routing: prime-rl ships a fork of vllm-router by default. NVIDIA Dynamo routers are also supported as drop-in replacements. The router scores workers using KV cache reuse, queue depth, and live load.
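A worker score built from the three signals named above might look like the following sketch. The linear form, the weights, and the names `WorkerState`, `score`, and `pick_worker` are assumptions for illustration only; the text does not say how vllm-router or Dynamo combine these signals.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    cached_prefix_tokens: int  # prompt tokens already in this worker's KV cache
    queue_depth: int           # requests waiting on this worker
    load: float                # live utilization in [0, 1]


def score(worker, prompt_tokens, w_cache=1.0, w_queue=0.1, w_load=0.5):
    # Higher is better: reward cache reuse, penalize queueing and load.
    # The weights are hypothetical and would need tuning in practice.
    reuse = worker.cached_prefix_tokens / max(prompt_tokens, 1)
    return w_cache * reuse - w_queue * worker.queue_depth - w_load * worker.load


def pick_worker(workers, prompt_tokens):
    return max(workers, key=lambda w: score(w, prompt_tokens))


workers = [
    WorkerState("A", cached_prefix_tokens=8000, queue_depth=3, load=0.9),
    WorkerState("B", cached_prefix_tokens=0, queue_depth=0, load=0.1),
    WorkerState("C", cached_prefix_tokens=6000, queue_depth=1, load=0.4),
]
best = pick_worker(workers, prompt_tokens=10_000)
print(best.name)
```

Worker A has the most cached prefix but is busy, and worker B is idle but has nothing cached, so the balanced worker C wins under these weights.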
Router Replay (R3): Training can fail silently because of trainer↔inference mismatch. Router replay captures the routing decisions made at inference and replays them directly on the trainer. This reduces the KL mismatch by roughly an order of magnitude. The routed experts have shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB, and at scale the data rate can reach tens of Gbps. prime-rl therefore treats it as an opaque payload and processes it with optimized PyTorch operations.
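To get a feel for how the routing payload scales, the arithmetic can be sketched as below. The 78 layers, 131k sequence length, and 256 rollouts are taken from elsewhere in this article; `top_k=8` and 2 bytes per expert index are assumptions, and the result is an upper bound that treats every rollout as reaching the maximum length and every layer as an MoE layer. Wider index types or larger batches scale the figure linearly.

```python
def routing_payload_bytes(num_layers, top_k, seq_len, num_rollouts, bytes_per_index):
    # One expert index per (layer, top-k slot, token), for every rollout.
    return num_layers * top_k * seq_len * num_rollouts * bytes_per_index


payload = routing_payload_bytes(
    num_layers=78,         # from the GLM-5 example in the text
    top_k=8,               # assumed, not stated in the text
    seq_len=131_072,       # "131k" read as 128 * 1024
    num_rollouts=256,      # batch size from the text
    bytes_per_index=2,     # assumed 16-bit expert indices
)
print(f"{payload / 1e9:.1f} GB per step (upper bound under these assumptions)")
```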
Training optimization
The trainer is built on torchtitan, the PyTorch-native training codebase. It relies on 3-D parallelism (FSDP, CP, EP). The GLM-5 case study uses all three.
EP exists because a layer remains huge even after FSDP. In float32, with 78 layers and 800B parameters, a full all-gather of a single layer would need roughly 40 GB. In total, a single layer adds up to nearly 80 GB. Setting EP=8 dispatches tokens instead of gathering full experts. The torch-native all2all is slightly faster within one node. When EP spans multiple nodes, DeepEP is preferred.
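The roughly 40 GB figure follows from simple arithmetic, sketched here. The only assumption is that parameters are spread evenly across layers; the function name is hypothetical.

```python
def per_layer_gather_bytes(total_params, num_layers, bytes_per_param):
    # Assumes parameters are distributed uniformly across layers.
    return total_params / num_layers * bytes_per_param


# 800B parameters, 78 layers, float32 (4 bytes per parameter).
gb = per_layer_gather_bytes(800e9, 78, 4) / 1e9
print(f"{gb:.1f} GB to all-gather one full layer")
```

That is about 10.3B parameters per layer at 4 bytes each, which is why gathering full experts on every rank is impractical at this scale.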
CP becomes necessary for sequence lengths greater than 131k. At that point activations, not parameters, dominate memory. GLM-5 uses DSA, which neither Ulysses nor Ring Attention parallelizes directly. prime-rl therefore ships a custom context-parallel implementation.
FP8 training: prime-rl uses DeepGEMM block-scaled FP8 as proposed in DeepSeek V3. This rarely improves throughput because of quantization overhead. Its real value lies in matching numerical precision between the trainer and inference. This reduces the KL mismatch and stabilizes training.
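The idea behind block scaling can be shown with a pure-Python emulation. This is a simplified sketch, not DeepGEMM: it uses 1-D blocks of 128 values with one scale per block, emulates E4M3 rounding by keeping 4 significant bits, and ignores subnormals. The function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    # Simplified emulation: keep 4 significant bits (1 implicit + 3 mantissa)
    # and clamp to the E4M3 range. Subnormals are not modeled.
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(abs(x))  # mantissa is in [0.5, 1)
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return math.copysign(min(rounded, E4M3_MAX), x)


def quantize_blocks(values, block_size=128):
    # One scale per block, chosen so the block's largest magnitude maps to
    # E4M3_MAX. An outlier only affects the precision of its own block.
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(v / scale) for v in block)
    return quantized, scales


def dequantize_blocks(quantized, scales, block_size=128):
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]


# Two blocks with very different magnitudes.
values = [0.01 * i for i in range(128)] + [100.0 * i for i in range(128)]
quantized, scales = quantize_blocks(values)
restored = dequantize_blocks(quantized, scales)
worst = max(abs(r - v) / abs(v) for r, v in zip(restored, values) if v != 0)
print(len(scales), worst)
```

With a single scale for the whole tensor, the small-magnitude block would be squeezed toward zero by the large one. Per-block scales keep the relative error of both blocks bounded by the 4-bit rounding step.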
Use cases and examples
Long-horizon SWE agents: Train models on real repository issues. Rollouts can span hundreds of turns and tool calls. Here, P/D disaggregation keeps decode latency predictable.
1T-scale post-training with fewer nodes: GLM-5 runs on 28 H200 nodes. Wide EP and KV offloading improve concurrency and throughput.
Stable large-scale agentic RL: Both router replay and FP8 training reduce the KL mismatch between trainer and inference. Less mismatch means more stable training.


