Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

GPU communication overhead is a visible bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume 43.6% of the forward pass and 32% of end-to-end training time. In a typical Mixture-of-Experts (MoE) model, device-to-device communication can account for as much as 47% of total execution time. Researchers from the UCCL project at the University of California, Berkeley, have released mKernel, a library of persistent CUDA kernels that fuses intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.

Problem: Host-driven communication

The standard model for multi-GPU communication is host-driven. The CPU runs the control path and calls libraries such as NCCL and NVSHMEM. The library issues collective operations (AllReduce, AllGather, and so on) across GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.

The research team identifies two problems with this approach.

(1) CPUs don't scale with GPU compute. The GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s of FP8/FP6 and 1.44 EFLOP/s of FP4 Tensor Core performance, along with 130 TB/s of all-to-all in-rack NVLink bandwidth. At these speeds, microsecond-scale host orchestration overhead (cudaLaunchKernel calls, CPU-side "all writes complete" checks, cross-stream events) shows up directly as pipeline bubbles.

(2) Host-driven systems overlap computation and communication only at coarse kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.
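To put the first point in numbers, here is a rough back-of-the-envelope sketch. The rack figures come from the article; the 5 microsecond host-side gap and the 50 microsecond kernel duration are assumptions chosen only for illustration, not measurements.

```python
# Figures quoted in the article for one GB300 NVL72 rack.
RACK_FP8_FLOPS = 720e15   # 720 PFLOP/s FP8
GPUS_PER_RACK = 72

# ASSUMPTION: illustrative host-side gap per kernel boundary, not a measurement.
HOST_GAP_SECONDS = 5e-6


def bubble_fraction(kernel_seconds, gap_seconds):
    """Fraction of wall time lost if every kernel is followed by one host gap."""
    return gap_seconds / (kernel_seconds + gap_seconds)


# Peak throughput of a single GPU in the rack.
per_gpu_flops = RACK_FP8_FLOPS / GPUS_PER_RACK

# Peak FLOPs one GPU could have executed during a single host-side gap.
idle_flops = per_gpu_flops * HOST_GAP_SECONDS

# ASSUMPTION: a hypothetical 50 microsecond kernel between gaps.
lost = bubble_fraction(50e-6, HOST_GAP_SECONDS)
```

Under these assumptions each GPU peaks at 10 PFLOP/s, so a single 5 microsecond gap leaves roughly 5 × 10^10 FLOPs of peak capacity unused, and a stream of 50 microsecond kernels would lose about 9% of wall time to gaps.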

The alternative is GPU-driven communication. The GPU itself triggers the transfer, and the communication is fused into the same kernel as the compute. Most existing fused-kernel libraries operate within a single node or a single GPU. mKernel targets the multi-node case.
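The difference between kernel-boundary overlap and tile-level overlap can be sketched with a toy timing model. This is a simplification for a single dependent compute-then-communicate pair, with made-up per-tile costs; it is not how mKernel schedules work, only a way to see why finer granularity helps.

```python
def host_driven_time(num_tiles, compute_per_tile, comm_per_tile):
    """Kernel-boundary model: the collective starts only after the
    whole compute kernel has finished."""
    return num_tiles * compute_per_tile + num_tiles * comm_per_tile


def fused_time(num_tiles, compute_per_tile, comm_per_tile):
    """Tile-level model: each tile is sent as soon as it is computed,
    while the next tile is already being computed."""
    compute_done = 0.0
    comm_done = 0.0
    for _ in range(num_tiles):
        compute_done += compute_per_tile
        # A transfer starts once its tile exists and the link is free.
        comm_done = max(comm_done, compute_done) + comm_per_tile
    return comm_done


# ASSUMPTION: 8 tiles with equal, arbitrary compute and transfer costs.
coarse = host_driven_time(8, 1.0, 1.0)
fine = fused_time(8, 1.0, 1.0)
```

With 8 tiles of unit compute and unit transfer cost, the coarse model takes 16 time units and the tile-level model takes 9, because only the last tile's transfer is left exposed.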

mKernel features

mKernel is a persistent CUDA kernel library. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.

Multi-GPU + multi-node in a single kernel: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.

Fine-grained intra-kernel overlap: Computation and communication overlap at tile/chunk granularity, covering both intra-node and inter-node GPU communication.

SM-specialized persistent kernel: CTAs self-assign roles such as compute, intra-node communication, inter-node send, and inter-node reduce. The number of SMs dedicated to each role is tunable per shape.

GPU-driven networking built on libibverbs: mKernel uses GPU-initiated RDMA writes without relying on NCCL or NVSHMEM. The communication backend was built from the ground up to maximize performance and support heterogeneous networking devices.
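The SM-specialized design can be pictured as each CTA looking up its own role from a per-shape SM budget. The sketch below models that idea in Python; the role names, the contiguous layout, and the example split are hypothetical and are not taken from the mKernel source.

```python
from collections import Counter

# Hypothetical role names mirroring the four roles described above.
ROLES = ("compute", "intra_comm", "inter_send", "inter_reduce")


def assign_role(cta_id, sm_split):
    """Map a CTA index to a role, given how many SMs each role gets.

    Roles occupy contiguous index ranges in the order listed in ROLES.
    """
    boundary = 0
    for role in ROLES:
        boundary += sm_split[role]
        if cta_id < boundary:
            return role
    raise ValueError(f"CTA {cta_id} exceeds the configured SM budget")


# ASSUMPTION: an illustrative split for a 132-SM Hopper GPU with one
# persistent CTA per SM. Real splits would be tuned per problem shape.
example_split = {"compute": 112, "intra_comm": 8, "inter_send": 8, "inter_reduce": 4}
role_counts = Counter(assign_role(i, example_split) for i in range(132))
```

Changing the numbers in `example_split` is the Python analogue of re-tuning the SM budget for a different shape: a communication-heavy shape would shift SMs away from compute.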

Five fused kernels

| Kernel | Fuses | Description |
| --- | --- | --- |
| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of A. The local GEMM consumes tiles as soon as they arrive, while the rank gathers shards from its peers over NVLink/RDMA. |
| GEMM + AllReduce | GEMM → AllReduce | Computes C = A @ B and reduces partial outputs across all ranks in a single launch. Output tiles are pushed into the reduction tree the moment they are produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to expert ranks (intra-node NVLink + inter-node all-to-all) and runs the expert-grouped GEMM inside the same kernel. Tokens are processed as soon as they land, with no staging-buffer round trips. |
| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes previously received chunks. Compute and ring send/receive run concurrently inside a single persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. As soon as each output tile is produced, it is reduced and sent to its owning rank. |
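As a concrete picture of the first row, the sketch below checks that multiplying shards of A in whatever order they arrive gives the same result as gathering all of A first. It assumes A is sharded by rows, which the article does not specify, and it models arrival order as a plain list rather than real NVLink/RDMA traffic.

```python
def matmul(a, b):
    """Plain matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def allgather_gemm(shards, b, arrival_order):
    """Compute C = A @ B one shard at a time.

    shards[r] holds the rows of A owned by rank r (ASSUMPTION: row sharding).
    Each shard is multiplied as soon as it "arrives" and written to its own
    row block of C, so no step waits for the full gather.
    """
    blocks = [None] * len(shards)
    for rank in arrival_order:
        blocks[rank] = matmul(shards[rank], b)
    return [row for block in blocks for row in block]


# Three ranks, one row of A each; shards arrive out of order.
shards = [[[1, 2]], [[3, 4]], [[5, 6]]]
b = [[1, 2], [3, 4]]
overlapped = allgather_gemm(shards, b, arrival_order=[2, 0, 1])
gathered_first = matmul([row for shard in shards for row in shard], b)
```

Because each row block of C depends only on its own shard of A, the arrival order does not affect the result, which is what makes consuming tiles on arrival safe.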

Evaluation setup

The research team evaluated mKernel on two 2-node × 8-H200 clusters that differ only in the inter-node fabric.

| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
| --- | --- | --- | --- | --- |
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |
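A quick calculation from the NIC column shows that the two testbeds have the same aggregate line rate per node, so they differ in transport rather than in raw bandwidth. These are nominal line rates, not measured throughput.

```python
GPUS_PER_NODE = 8


def node_gbytes_per_s(nic_count, gbits_per_nic):
    """Aggregate nominal NIC line rate per node, in gigabytes per second."""
    return nic_count * gbits_per_nic / 8


def per_gpu_gbytes_per_s(nic_count, gbits_per_nic, gpus=GPUS_PER_NODE):
    """Nominal inter-node bandwidth available to each GPU if shared evenly."""
    return node_gbytes_per_s(nic_count, gbits_per_nic) / gpus


efa_node = node_gbytes_per_s(16, 200)   # AWS EFA testbed
cx7_node = node_gbytes_per_s(8, 400)    # ConnectX-7 testbed
```

Both work out to 3,200 Gb/s, or 400 GB/s per node and 50 GB/s per GPU, assuming the NICs are shared evenly across the eight GPUs.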

mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team says more extensive benchmarking is still in progress.

Backends and requirements

mKernel supports two network backends.

| Backend | Macro | Transport | Runs on |
| --- | --- | --- | --- |
| CX7 | -DINTERNODE_BACKEND_IBVERBS | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |

Both backends share the same host-side API and the same on-GPU kernels. Only the proxy/session implementation differs (session.h for CX7 and session_efa.h for EFA). Requirements: an NVIDIA Hopper GPU (default build target sm_90a), CUDA 12.9, and Python with PyTorch. The CX7 backend requires the libibverbs development headers and library. The EFA backend requires the AWS EFA installation, with libfabric, libibverbs, efadv, and the EFA headers under EFA_HOME=/opt/amazon/efa by default.
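The backend choice above boils down to one compile-time macro and one session header. The lookup below simply restates the table as data; the `backend_config` helper and the dictionary layout are hypothetical and are not part of mKernel's build system.

```python
# Values copied from the backend table; the structure is illustrative only.
BACKENDS = {
    "cx7": {
        "define": "-DINTERNODE_BACKEND_IBVERBS",
        "session_header": "session.h",
        "transport": "libibverbs RC",
    },
    "efa": {
        "define": "-DINTERNODE_BACKEND_EFA",
        "session_header": "session_efa.h",
        "transport": "libibverbs + efadv (SRD)",
    },
}


def backend_config(name):
    """Return the macro, session header, and transport for a backend name."""
    key = name.lower()
    if key not in BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {sorted(BACKENDS)}")
    return BACKENDS[key]
```

For example, `backend_config("EFA")["session_header"]` returns `"session_efa.h"`.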

Visual explainer from Marktechpost

What is mKernel?

mKernel is an open-source library of persistent CUDA kernels from the UCCL project at the University of California, Berkeley. It fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.

Most existing fused-kernel libraries operate within a single node or a single GPU. mKernel was designed from the start to span node boundaries.

43.6%: share of the forward pass consumed by communication in production

47%: share of total execution time spent on communication in a typical MoE model

32%: share of end-to-end training time spent on communication

Why host-driven communication falls short

The standard model is host-driven. The CPU calls NCCL or NVSHMEM to issue collective operations across the GPUs. The UCCL team identified two problems.

CPUs don't scale with GPUs. The GB300 NVL72 rack delivers 720 PFLOP/s of FP8/FP6 and 1.44 EFLOP/s of FP4. At these speeds, the microsecond-scale overhead from cudaLaunchKernel, CPU-side synchronization checks, and cross-stream events shows up directly as pipeline bubbles.

The overlap is too coarse. Host-driven systems overlap compute and communication only at kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.

The answer is GPU-driven communication. The GPU itself triggers fine-grained transfers that are fused into the same kernel as the compute.

Four core design properties

Multi-GPU + multi-node in a single kernel. Intra-node NVLink and inter-node RDMA both live inside the same persistent kernel.

Fine-grained intra-kernel overlap. Computation and communication overlap at tile/chunk granularity, covering both intra-node and inter-node communication.

SM-specialized persistent kernel. CTAs self-assign roles such as compute, intra-node communication, inter-node send, and inter-node reduce. The SM split is tunable per shape.

GPU-driven networking via libibverbs. Uses GPU-initiated RDMA writes, with no dependency on NCCL or NVSHMEM. The communication backend is built from scratch.

The five fused kernels

AllGather + GEMM

AllGather → GEMM

Each rank holds a shard of A. The local GEMM consumes tiles as they arrive over NVLink/RDMA, so the matmul starts before the gather finishes.

GEMM + AllReduce

GEMM → AllReduce

Computes C = A @ B and reduces partial outputs across all ranks in a single launch. Each output tile enters the reduction tree the moment it is produced.

MoE Dispatch + GEMM

All-to-All dispatch → grouped GEMM

Routes MoE tokens to expert ranks over NVLink plus inter-node all-to-all, and runs the expert-grouped GEMM inside the same kernel. There are no staging-buffer round trips.
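The dispatch-then-grouped-GEMM pattern can be sketched on a single process as follows. The sketch assumes top-1 routing and models the all-to-all dispatch as a local gather; the function names are hypothetical and no real communication takes place.

```python
def matmul(a, b):
    """Plain matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dispatch_and_grouped_gemm(tokens, expert_ids, expert_weights):
    """Route each token to its expert, then run one GEMM per expert group.

    ASSUMPTION: top-1 routing, so expert_ids[i] is the single expert for token i.
    """
    # Dispatch: bucket token indices by destination expert.
    groups = {}
    for index, expert in enumerate(expert_ids):
        groups.setdefault(expert, []).append(index)

    # Grouped GEMM: one batched product per expert, scattered back in token order.
    outputs = [None] * len(tokens)
    for expert, indices in groups.items():
        batch = [tokens[i] for i in indices]
        for i, row in zip(indices, matmul(batch, expert_weights[expert])):
            outputs[i] = row
    return outputs


tokens = [[1, 0], [0, 1], [2, 2], [3, 1]]
expert_ids = [1, 0, 1, 0]
expert_weights = {0: [[1, 1], [0, 2]], 1: [[2, 0], [1, 1]]}
routed = dispatch_and_grouped_gemm(tokens, expert_ids, expert_weights)
```

Grouping by expert turns many single-token products into a few larger GEMMs, which is the shape of work a fused kernel can start on as soon as tokens land.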

Ring Attention

Ring KV exchange → FlashAttention

Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes previously received chunks.
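What makes chunk-by-chunk consumption possible is the online softmax used by FlashAttention-style kernels: a running maximum, running sum, and running accumulator can absorb KV chunks in any order. The single-query sketch below models ring steps as list indexing, omits the usual 1/sqrt(d) scaling, and uses hypothetical function names.

```python
import math


def attention_reference(q, keys, values):
    """Softmax attention for one query over all keys and values at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def ring_attention(q, kv_chunks, start_rank):
    """Consume KV chunks in ring order with a running softmax state.

    kv_chunks[r] is the (keys, values) pair owned by rank r. At each step the
    rank processes the chunk it would have just received from its neighbor.
    """
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(kv_chunks[0][1][0])
    world = len(kv_chunks)
    for step in range(world):
        keys, values = kv_chunks[(start_rank - step) % world]
        for k, v in zip(keys, values):
            score = sum(a * b for a, b in zip(q, k))
            new_max = max(running_max, score)
            rescale = math.exp(running_max - new_max)  # shrink the old state
            weight = math.exp(score - new_max)
            running_sum = running_sum * rescale + weight
            acc = [a * rescale + weight * x for a, x in zip(acc, v)]
            running_max = new_max
    return [a / running_sum for a in acc]
```

Every rank starts the ring at a different chunk, yet all of them reach the same result as attention over the full sequence, up to floating-point rounding.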

GEMM + ReduceScatter

GEMM → ReduceScatter

Computes C = A @ B and reduce-scatters the output. As soon as each tile is produced, it is reduced and sent to its owning rank.
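The per-tile reduce-and-send behavior can be sketched as follows. The sketch assumes each rank holds a slice of the inner (K) dimension, so every rank produces a partial version of every output tile, and that rank j owns row block j of the result. Neither assumption is stated in the article, and the function name is hypothetical.

```python
def matmul(a, b):
    """Plain matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gemm_reduce_scatter(a_parts, b_parts):
    """Fused GEMM + ReduceScatter on one process.

    a_parts[r] is rank r's column slice of A (M x k_r) and b_parts[r] the
    matching row slice of B (k_r x N). Returns owned[j], the fully reduced
    row block of C that belongs to rank j.
    """
    world = len(a_parts)
    rows = len(a_parts[0])
    if rows % world != 0:
        raise ValueError("row count must divide evenly across ranks")
    block = rows // world
    cols = len(b_parts[0][0])
    owned = [[[0] * cols for _ in range(block)] for _ in range(world)]
    for rank in range(world):
        for owner in range(world):
            # Partial output tile for the owner's row block.
            tile_rows = a_parts[rank][owner * block:(owner + 1) * block]
            tile = matmul(tile_rows, b_parts[rank])
            # Reduce into the owner's buffer as soon as the tile exists.
            for i, row in enumerate(tile):
                for j, value in enumerate(row):
                    owned[owner][i][j] += value
    return owned
```

No rank ever materializes the full C: each partial tile goes straight to its owner, which is the property the fused kernel exploits.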

Evaluation setup

Tested on two 2-node × 8-H200 clusters that differ only in the inter-node fabric.

| Testbed | Nodes × GPUs | Inter-node transport | NIC |
| --- | --- | --- | --- |
| AWS EFA | 2 × 8 H200 | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | InfiniBand | 8 × 400 Gb/s CX7 per node |

Both testbeds use NVLink within each node. Baselines: NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. More extensive benchmarking is still in progress.

Backends and requirements

| Backend | Transport | Runs on |
| --- | --- | --- |
| CX7 | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |

Requirements: NVIDIA Hopper GPU (default sm_90a), CUDA 12.9, Python with PyTorch. CX7 requires the libibverbs headers. EFA requires libfabric, libibverbs, and efadv under EFA_HOME=/opt/amazon/efa.

License and attribution: Licensed under the MIT License. MMA/compute code adapted from ThunderKittens (HazyResearch).

Roadmap and key takeaways

✅ Fused GPU-driven multi-node kernels (AG+GEMM, GEMM+AR, MoE Dispatch+GEMM, Ring Attention, GEMM+RS)

✅ ConnectX-7 and AWS EFA backends

🚧 Full support for heterogeneous accelerators/NICs with topology-aware discovery, placement, and routing

🚧 Inter-node megakernel: combines multiple fused steps into a single megakernel that spans transformer layers

🚧 Blackwell GPU support

Fuses NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel.

Five kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, GEMM+ReduceScatter

GPU-initiated RDMA via libibverbs, with no NCCL or NVSHMEM dependency

Requires a Hopper GPU (sm_90a) and ConnectX-7 or AWS EFA networking

Key takeaways

mKernel fuses intra-node NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel. Communication overhead accounts for as much as 47% of MoE model execution time, according to the cited production data. The library contains five kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter. GPU-initiated RDMA does not rely on NCCL or NVSHMEM and is implemented directly on libibverbs. mKernel currently requires a Hopper GPU (sm_90a) and ConnectX-7 or AWS EFA networking. Blackwell support is on the roadmap.
