We’re excited to announce day-one support for NVIDIA Nemotron 3 Nano Omni on Clarifai. Now available on the Clarifai Reasoning Engine, Nano Omni brings fast multimodal reasoning to developers building agentic systems, delivering throughput of over 400 tokens per second.
NVIDIA Nemotron 3 Nano Omni is a 30B A3B multimodal reasoning model built for workloads spanning documents, images, video, and audio. With a 256K context window and support for text, image, video, and audio inputs with text output, it gives developers a single model for handling rich multimodal context inside agentic workflows.
That makes it a strong fit for subagents in workflows that demand both multimodal understanding and speed.
Multimodal models for specialized subagents
As agentic systems become more capable, they also become more specialized. Different models and components are responsible for planning, execution, retrieval, and validation, each operating inside a broader workflow. In that architecture, a multimodal model must do more than process isolated inputs: it needs to interpret multiple modalities together, maintain context across steps, and respond quickly enough to stay inside the operational loop.
As a lightweight multimodal model for subagents, Nemotron 3 Nano Omni can reason over entire screens, documents, charts, audio, and video without routing each modality into a separate stack. Rather than splitting vision, speech, and language across multiple models, it gives developers a unified way to handle multimodal reasoning while keeping the overall system easier to manage.
Built for computer use, document intelligence, and audio-video reasoning
Nano Omni is particularly relevant to the kinds of workloads that are becoming central to enterprise agentic systems.
In computer use, an agent must read the interface, track UI state over time, and verify that actions completed as expected. Document intelligence requires reasoning over text, tables, charts, screenshots, scanned pages, and mixed visual structures in a single pass. Audio and video workflows require connecting what is said, what is shown, and what changes over time.
These are all cases where multimodal capability has to work reliably in production, with models that can handle multiple modalities efficiently rather than splitting the workflow across separate models.
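The computer-use pattern above follows a simple observe-decide-act-verify loop. A minimal sketch, using hypothetical `capture_screen` and `ask_model` stand-ins (not Clarifai or NVIDIA APIs), looks roughly like this:

```python
# Minimal computer-use subagent loop: on each step the agent reads the
# current screen, asks the model for the next action, and checks whether
# the goal is complete. `capture_screen` and `ask_model` are hypothetical
# stand-ins -- a real agent would take a screenshot and call Nano Omni.

def capture_screen() -> str:
    # Stand-in: a real agent would grab a screenshot of the UI here.
    return "screenshot-bytes"

def ask_model(screen: str, goal: str) -> dict:
    # Stand-in: a real agent would send the screenshot plus the goal to
    # the model and parse the proposed action from its text output.
    return {"action": "click", "target": "Submit", "done": True}

def run_subagent(goal: str, max_steps: int = 5) -> list:
    """Observe -> decide -> act -> verify, until done or out of steps."""
    history = []
    for _ in range(max_steps):
        screen = capture_screen()       # read the interface
        step = ask_model(screen, goal)  # decide the next action
        history.append(step)            # track UI state over time
        if step.get("done"):            # verify the task completed
            break
    return history

print(run_subagent("submit the form"))
```

Because a single multimodal model handles every observation, the loop stays flat: there is no separate OCR, layout, or speech stage to synchronize between steps.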
The model delivers significant capability improvements over earlier models in the Nemotron family. Notable gains on benchmarks such as OCRBenchV2, OCR_Reasoning, MathVista_MINI, and OSWorld reflect improved performance on the real-world workloads that today’s agents are expected to handle.

Nano Omni is a natural fit here, giving developers a single multimodal reasoning stream for the tasks subagents are increasingly expected to handle.
Agent-friendly token economics
In agentic systems, subagents take on repetitive tasks across documents, screens, audio, and video within a larger workflow. Every call adds to overall system cost, throughput demands, and infrastructure load. NVIDIA Nemotron 3 Nano Omni integrates vision, speech, and language into a single multimodal model, reducing inference hops, orchestration logic, and synchronization between models compared to separate perception stacks.
With time-aware perception and efficient video sampling, Nano Omni delivers roughly 2x higher throughput on average and cuts video reasoning compute by roughly 2.5x. For multimodal agent workflows, that means higher throughput and lower compute overhead without growing stack complexity.
The model uses a hybrid Mixture-of-Experts architecture with a Transformer-Mamba design, and applies 3D convolutional layers and efficient video sampling to temporal and video inputs. It can run on a single H100, H200, or B200, making it practical to deploy multimodal subagents without expanding infrastructure requirements.
High-throughput inference with Clarifai
The Clarifai Reasoning Engine runs NVIDIA Nemotron 3 Nano Omni at over 400 tokens per second, giving developers the throughput they need for production multimodal agent workflows. That matters in systems where subagents are called repeatedly to process documents, interfaces, audio, and video as part of an ongoing workflow.
The Clarifai Reasoning Engine is built to accelerate inference, combining optimized kernels, speculative decoding, and adaptive performance techniques to increase the throughput of reasoning models without sacrificing accuracy.
Get started with Clarifai
Developers can try NVIDIA Nemotron 3 Nano Omni in the Clarifai Playground, and can also access it via an OpenAI-compatible API, making it easy to integrate into existing applications, tools, and agent frameworks.
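Because the API is OpenAI-compatible, a request follows the standard chat-completions shape for mixed text and image input. In the sketch below, the base URL and model identifier are illustrative assumptions; check the Clarifai documentation for the exact values:

```python
import json

# Sketch of calling Nemotron 3 Nano Omni through Clarifai's
# OpenAI-compatible API. The endpoint and model ID below are
# illustrative assumptions -- confirm them in the Clarifai docs.
BASE_URL = "https://api.clarifai.com/v2/ext/openai/v1"  # assumed endpoint
MODEL_ID = "nvidia/nemotron-3-nano-omni"                # assumed model ID

def build_request(prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat-completion body mixing text and an image."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

# With the official OpenAI SDK this would become, e.g.:
#   client = OpenAI(base_url=BASE_URL, api_key=os.environ["CLARIFAI_PAT"])
#   client.chat.completions.create(**build_request(...))
print(json.dumps(build_request("Summarize this chart.",
                               "https://example.com/chart.png"), indent=2))
```

The same message structure extends to the other modalities Nano Omni accepts, so existing OpenAI-based tooling can be pointed at the Clarifai endpoint with minimal changes.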
For large-scale or more managed deployments, Clarifai’s compute orchestration provides a direct path to production. Developers can run Nano Omni on the Clarifai Reasoning Engine or deploy it across their own cloud, VPC, on-premises, or air-gapped environments, managing everything through a unified control plane.
NVIDIA Nemotron 3 Nano Omni is available on Clarifai today.
If you have any questions about accessing NVIDIA Nemotron 3 Nano Omni on Clarifai, join us on Discord.


