Benchmarking GPT-OSS Across H100s and B200s

AllTopicsToday
Published: August 17, 2025
Last updated: August 17, 2025 2:13 pm

This blog post focuses on new features and improvements. For a complete list that includes bug fixes, see the release notes.

Benchmarking GPT-OSS on H100s and B200s

OpenAI has released a new generation of open-weight reasoning models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license. Built for robust instruction following, powerful tool use, and advanced reasoning, these models are designed for the next generation of agentic workflows.

With mixture-of-experts (MoE) architectures, extended context lengths of 131K tokens, and quantization that allows the 120B model to run on a single 80GB GPU, GPT-OSS combines large scale with practical deployment. Developers can adjust reasoning levels from low to high to optimize for speed, cost, or accuracy, and use built-in browsing, code execution, and custom tools for complex workflows.
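The reasoning level is selected through the system prompt: in GPT-OSS's harmony chat format, a line such as `Reasoning: high` sets the effort. A minimal sketch of building such a message list; the prompt wording follows the published model card, and the user message is illustrative:

```python
# Sketch: choosing a GPT-OSS reasoning level via the system prompt, as the
# harmony chat format encodes it. The user message below is illustrative.

VALID_LEVELS = {"low", "medium", "high"}

def reasoning_system_prompt(level: str) -> str:
    """Build a system prompt line selecting the reasoning effort."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown reasoning level: {level}")
    return f"Reasoning: {level}"

messages = [
    {"role": "system", "content": reasoning_system_prompt("high")},
    {"role": "user", "content": "Plan a three-step refactor of this module."},
]
```

Lower levels trade some accuracy for latency and cost; higher levels spend more tokens thinking before answering.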

Our research team benchmarked GPT-OSS-120B on NVIDIA B200 and H100 GPUs using vLLM, SGLang, and TensorRT-LLM. The tests covered single-request scenarios and high-concurrency workloads of 50 to 100 concurrent requests. The key findings are as follows:

Single-request speed: The B200 with TensorRT-LLM delivers a 0.023-second time to first token (TTFT), surpassing the dual-H100 setup in some cases.

High concurrency: The B200 sustains 7,236 tokens/sec at maximum load with low per-token latency.

Efficiency: One B200 replaces two H100s with equal or better performance, lower power usage, and less complexity.

Performance gains: Some workloads show up to 15x faster inference compared to a single H100.
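Metrics like TTFT and decode throughput are straightforward to compute once you record a timestamp for each streamed token. A minimal sketch with hypothetical helpers, not the harness used for these benchmarks:

```python
# Hypothetical helpers for computing streaming-inference metrics from
# recorded per-token timestamps; not the harness used for the benchmarks above.

def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to first token, in seconds."""
    return token_times[0] - request_start

def tokens_per_sec(token_times: list[float]) -> float:
    """Decode throughput: tokens generated per second after the first token."""
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])
```

Under high concurrency, aggregate throughput is typically reported as the sum of per-request token counts divided by wall-clock time, which is how a figure like 7,236 tokens/sec is usually derived.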

Read our full NVIDIA B200 vs H100 blog for detailed benchmarks on throughput, latency, time to first token, and other metrics.

If you are considering deploying a GPT-OSS model on H100s, you can do that today with Clarifai across multiple clouds. Support for B200s is coming soon, with access to the latest NVIDIA GPUs for testing and production.

Developer Plan

Last month we launched Local Runners, and the response from developers has been incredible. From AI enthusiasts to production teams, many have been eager to run open-source models locally on their own hardware while using the Clarifai platform. Local Runners let you run and test your model, then access it through public APIs and integrate it into any application.

Now, with the arrival of the latest GPT-OSS models, including GPT-OSS-20B, these advanced reasoning models can be run locally with full control over compute, and agentic workflows can be deployed instantly.

To make it even easier, we are introducing a Developer Plan at a promotional monthly price, which includes everything in the Community plan.

Check out our Developer Plans and start running your own model today. If you are ready to run GPT-OSS-20B on your hardware, follow this step-by-step tutorial.

Published Models

We have expanded our model library with new open-weight and specialized models that can be used in your workflows.

The latest additions include:

GPT-OSS-120B – An open-weight language model designed for powerful reasoning, advanced tool use, and efficient on-device deployment. It supports extended context lengths and multiple reasoning levels, making it ideal for complex agentic applications.

GPT-5, GPT-5 Mini, and GPT-5 Nano – GPT-5 is the flagship model for the most demanding reasoning and generation tasks. GPT-5 Mini offers a faster, cost-effective alternative for real-time applications. GPT-5 Nano provides ultra-low-latency inference for edge and budget-sensitive deployments.

Qwen3-Coder-30B-A3B-Instruct – A high-efficiency coding model with long-context support and powerful agentic capabilities, suitable for code generation, refactoring, and development automation.

You can start exploring these models directly in the Clarifai Playground, or access them via the API to integrate them into your application.
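One common integration path for hosted models is an OpenAI-compatible chat completions endpoint. The sketch below uses only the standard library; the base URL and model identifier are assumptions, so confirm the exact values against Clarifai's API documentation:

```python
# Sketch: calling a hosted model through an OpenAI-compatible chat completions
# endpoint. BASE_URL and the model identifier are assumptions; check the
# provider's API docs for the real values.
import json
import os
import urllib.request

BASE_URL = "https://api.clarifai.com/v2/ext/openai/v1"  # assumed endpoint

def build_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str, api_key: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Model identifier is illustrative; the access token is read from the env.
    print(chat("gpt-oss-120b", "Summarize mixture-of-experts in one sentence.",
               os.environ["CLARIFAI_PAT"]))
```

Because the request and response shapes follow the OpenAI convention, the same code works with any official or third-party client that accepts a custom base URL.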

Ollama Support

Ollama lets you download and run powerful open-source models directly on your machine. Clarifai's Local Runners now let you expose locally running models through a secure public API.

You can also add the Ollama toolkit to the Clarifai CLI to download, run, and publish Ollama models with a single command.

Read our step-by-step guide to running Ollama models locally and making them accessible via APIs.

Playground Improvements

Now, instead of testing multiple models one at a time, you can compare them side by side in the Playground. Quickly spot differences in output, speed, and quality, and choose the model that works best for your use case.

We also added enhanced inference controls, Pythonic support, and model version selectors for smoother experimentation.
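The same side-by-side idea is easy to reproduce in code: run one prompt through several model callables and record each output and latency. A minimal sketch with injected callables (any function from prompt to text works):

```python
# Sketch: compare several models on one prompt, recording output and latency.
# Model callables are injected, so any prompt-to-text function can be compared.
import time
from typing import Callable

def compare(models: dict[str, Callable[[str], str]], prompt: str) -> dict:
    """Run one prompt through each model; return output and wall-clock seconds."""
    results = {}
    for name, generate in models.items():
        start = time.perf_counter()
        output = generate(prompt)
        results[name] = {"output": output,
                         "seconds": time.perf_counter() - start}
    return results
```

For example, `compare({"a": model_a, "b": model_b}, "Explain KV caching.")` returns a dict keyed by model name, ready to tabulate.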


Additional Updates

Python SDK:

Improved logging, pipeline handling, authentication, Local Runner support, and code validation.

Added live logging, verbose output, and GitHub repository integration for flexible model initialization.

Platform:

Clarifai Community:

Ready to start building?

Clarifai's compute orchestration lets you deploy GPT-OSS, Qwen3-Coder, and other open-source and custom models on dedicated GPUs such as the NVIDIA B200 and H100, on-prem or in the cloud. Serve your model, MCP server, or full agentic workflow directly from your hardware, with full control over performance, cost, and security.
