TL;DR
In this post, we use Artificial Analysis benchmarks to explore how leading inference providers perform on the GPT-OSS-120B model. Learn what matters most when evaluating inference platforms, including throughput, time to first token, and cost efficiency. Compare the performance and deployment efficiency of Vertex AI, Azure, AWS, Databricks, Clarifai, Together AI, Fireworks, Nebius, CompactifAI, and Hyperbolic.
Introduction
Large language models (LLMs) like GPT-OSS-120B (an open-weight, 120-billion-parameter mixture-of-experts model) are designed for advanced reasoning and multi-step generation. Reasoning workloads consume tokens quickly and place heavy demands on compute, so deploying these models in production requires inference infrastructure that delivers low latency, high throughput, and low cost.
Differences in hardware, software optimization, and resource allocation strategies can result in large variations in latency, efficiency, and cost. These differences have direct implications for real-world applications such as reasoning agents, document understanding systems, and copilots, where even small delays can affect overall responsiveness and throughput.
Independent benchmarking has become essential for objectively assessing these differences. Rather than relying on vendors' in-house performance claims, open, data-driven evaluations provide a more transparent way to see how different platforms perform under real-world workloads.
In this post, we compare leading GPU-based inference providers using the GPT-OSS-120B model as a reference benchmark. We explore how each platform performs on key inference metrics such as throughput, time to first token, and cost efficiency, and how these tradeoffs affect the performance and scalability of inference-heavy workloads.
Before we get into the results, let's take a quick look at how Artificial Analysis and its benchmarking framework work.
The Artificial Analysis Benchmark
Artificial Analysis (AA) is an independent benchmarking initiative that runs standardized tests across inference providers to measure how models such as GPT-OSS-120B perform under real-world conditions. Its evaluation focuses on practical workloads with long contexts, streaming output, and reasoning-heavy prompts rather than short synthetic examples.
You can explore the full GPT-OSS-120B benchmark results here.
Although Artificial Analysis evaluates a variety of performance metrics, we will focus on three key factors that matter most when choosing an inference platform for GPT-OSS-120B: time to first token, throughput, and cost per million tokens.
Time to first token (TTFT)
The time between sending the prompt and receiving the model's first token. A lower TTFT means the output starts streaming sooner, which matters for interactive applications and multi-step reasoning where delays disrupt the flow.
Throughput (tokens per second)
The rate at which tokens are generated once streaming begins. Higher throughput reduces total completion time for long outputs and allows more concurrent requests, directly affecting scalability for long-context or multi-turn workloads.
Cost per million tokens (blended price)
A composite metric that accounts for both input and output token pricing. It gives a clear view of the operational cost of long-context and streaming workloads, helping teams plan for predictable spend.
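To make these metrics concrete, here is a minimal sketch of how you could measure TTFT and throughput yourself against any OpenAI-compatible streaming endpoint, and how a blended price can be computed. The endpoint URL, model name, and the 3:1 input-to-output weighting are illustrative assumptions, not Artificial Analysis's exact setup; consult AA's methodology page for the definitions it uses.

```python
# Minimal sketch: measure TTFT and throughput against an OpenAI-compatible
# streaming endpoint. The base URL and model name are placeholders.
import os
import time

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=os.environ.get("PROVIDER_BASE_URL", "https://api.example.com/v1"),  # placeholder
    api_key=os.environ["PROVIDER_API_KEY"],
)

def measure(prompt: str, model: str = "gpt-oss-120b") -> None:
    """Stream one completion and report TTFT and post-first-token throughput."""
    start = time.perf_counter()
    first_token_time = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()  # first visible token
            chunks += 1
    elapsed = time.perf_counter() - first_token_time
    # Chunks only approximate tokens; use a tokenizer for exact counts.
    print(f"TTFT: {first_token_time - start:.2f}s, ~{chunks / elapsed:.0f} tokens/sec")

def blended_price(input_usd_per_mtok: float, output_usd_per_mtok: float,
                  input_weight: float = 3.0) -> float:
    """Blended $/Mtok under an assumed 3:1 input:output token mix."""
    return (input_weight * input_usd_per_mtok + output_usd_per_mtok) / (input_weight + 1)

measure("Explain the difference between TTFT and throughput in two sentences.")
```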
Benchmark Methodology
Prompt size: The benchmarks featured in this blog use the 1,000-token input prompts run by Artificial Analysis, reflecting typical real-world scenarios such as chatbot queries and instructions that require reasoning. A moderately long prompt benchmark is also available and can be explored here for reference.
Median measurements: The values reported represent the median (p50) over the past 72 hours, capturing sustained performance rather than single spikes or dips; the sketch after this list illustrates the aggregation. For the latest benchmark results, visit the Artificial Analysis GPT-OSS-120B model provider page here.
Metrics focus: This overview focuses on time to first token (TTFT), throughput, and blended cost to give an actionable view for workload planning. Other metrics, such as end-to-end response time, latency as a function of input length, and time to first answer token, are also measured by Artificial Analysis but are not included here.
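As a small illustration of the median (p50) reporting, the sketch below collapses a set of TTFT probes into the p50 value plus the surrounding quartiles. The sample values are invented for illustration.

```python
# Sketch: collapse repeated TTFT probes into the reported p50 value.
# The probe data below is invented for illustration.
from statistics import median, quantiles

ttft_probes_s = [0.31, 0.35, 0.29, 0.40, 0.33, 0.32, 0.30, 0.45, 0.28]

p50 = median(ttft_probes_s)
p25, _, p75 = quantiles(ttft_probes_s, n=4)  # quartile cut points
print(f"p50 TTFT: {p50:.2f}s (p25 {p25:.2f}s, p75 {p75:.2f}s)")
```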
With this methodology in mind, we can now compare how different GPU-based platforms perform on GPT-OSS-120B and what the results mean for inference-heavy workloads.
Provider Comparison (GPT-OSS-120B)
Clarifai
Time to first token: 0.32 seconds
Throughput: 544 tokens/sec
Blended price: $0.16 per million tokens
Note: Very high throughput, low latency, and highly cost effective. A strong choice for inference-heavy workloads.
Key features:
GPU fractioning and autoscaling options for efficient compute utilization
Local runners for running models on your own hardware during testing and development
On-premises, VPC, and multi-site deployment options
Control center for monitoring and managing usage and performance
Google Vertex AI
Time to first token: 0.40 seconds
Throughput: 392 tokens/sec
Blended price: $0.26 per million tokens
Note: Moderate latency and throughput. Suitable for general-purpose inference workloads.
Key features:
Integrated AI tooling (AutoML, training, deployment, monitoring)
Scalable cloud infrastructure for batch and online inference
Enterprise-grade security and compliance
Microsoft Azure
Time to first token: 0.48 seconds
Throughput: 348 tokens/sec
Blended price: $0.26 per million tokens
Note: Slightly higher time to first token. Balances performance and cost for standard workloads.
Key features:
Comprehensive AI services (ML, cognitive services, custom bots)
Deep integration with the Microsoft ecosystem
Global enterprise-grade infrastructure
Hyperbolic
Time to first token: 0.52 seconds
Throughput: 395 tokens/sec
Blended price: $0.30 per million tokens
Note: Higher cost than comparable providers, with throughput suited to reasoning-heavy tasks.
AWS
Time to first token: 0.64 seconds
Throughput: 252 tokens/sec
Blended price: $0.26 per million tokens
Note: Lower throughput and higher latency. Suitable for workloads that are not highly time-sensitive.
Key features:
Broad AI/ML service portfolio (Bedrock, SageMaker)
Global cloud infrastructure
Enterprise-grade security and compliance
Databricks
Time to first token: 0.36 seconds
Throughput: 195 tokens/sec
Blended price: $0.26 per million tokens
Note: Lower throughput with acceptable latency. Suitable for batch or background tasks.
Key features:
Unified Analytics Platform (Spark + ML + Notebooks)
Collaborative workspace for teams
Scalable compute for large-scale ML/AI workloads
Together AI
Time to first token: 0.25 seconds
Throughput: 248 tokens/sec
Blended price: $0.26 per million tokens
Note: Very low time to first token with moderate throughput. Suitable for applications that emphasize real-time responsiveness.
Key features:
Real-time inference and training
Cloud/VPC-based deployment orchestration
Flexible and secure platform
Fireworks AI
Time to first token: 0.44 seconds
Throughput: 482 tokens/sec
Blended price: $0.26 per million tokens
Note: High throughput with balanced latency. Suitable for interactive applications.
CompactifAI
Time to first token: 0.29 seconds
Throughput: 186 tokens/sec
Blended price: $0.10 per million tokens
Note: Lowest cost, but low throughput. Ideal for cost-sensitive workloads with low concurrency needs.
Key features:
Efficient model compression to reduce costs
Simplified deployment on AWS
Optimized for high-throughput batch inference
Nebius Base
Time to first token: 0.66 seconds
Throughput: 165 tokens/sec
Blended price: $0.26 per million tokens
Note: Noticeably lower throughput and higher latency. May struggle with inference-heavy or interactive workloads.
Key features:
Basic AI service endpoints
Standard cloud infrastructure
Suitable for workloads with steady demand
Best Providers by Price and Throughput
Choosing the right inference provider for GPT-OSS-120B means weighing time to first token, throughput, and cost against your workload. Platforms like Clarifai offer high throughput, low latency, and competitive pricing, making them ideal for reasoning-heavy or interactive tasks. Other providers, such as CompactifAI, prioritize lower cost at the expense of throughput, which can make them better suited to cost-sensitive or batch-oriented workloads. The right choice depends on which tradeoffs matter most for your application.
Best for price
CompactifAI: Lowest blended price at $0.10 per million tokens, well suited to cost-sensitive workloads.
Clarifai: $0.16 per million tokens combined with the highest throughput in this comparison.
Best for throughput
Clarifai: Low time to first token and the highest throughput at 544 tokens/sec.
Fireworks AI: Strong throughput at 482 tokens/sec with moderate latency.
Hyperbolic: Solid throughput at 395 tokens/sec. More expensive, but effective for heavy workloads.
Performance and Flexibility
Beyond price and throughput, flexibility also matters for real-world workloads. Teams often need to adjust scaling behavior, GPU utilization, and deployment environments to manage cost and efficiency.
For example, Clarifai supports fractional GPU usage, autoscaling, and local runners. These features improve efficiency and reduce infrastructure overhead.
These capabilities extend beyond GPT-OSS-120B. The Clarifai inference engine lets you run custom or open-weight models with consistent performance and reliability, and it adapts to workload patterns over time, gradually speeding up repetitive tasks without sacrificing accuracy.
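As a starting point, here is a minimal sketch of streaming GPT-OSS-120B through Clarifai's OpenAI-compatible endpoint. The base URL and model identifier below follow the pattern in Clarifai's documentation but should be treated as assumptions; verify both against the current docs before relying on them.

```python
# Minimal sketch: stream GPT-OSS-120B via Clarifai's OpenAI-compatible
# endpoint. base_url and model identifier are assumptions -- confirm the
# exact values in Clarifai's documentation.
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # assumed endpoint
    api_key=os.environ["CLARIFAI_PAT"],  # Clarifai personal access token
)

stream = client.chat.completions.create(
    model="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",  # assumed ID
    messages=[{"role": "user", "content": "Summarize the tradeoff between TTFT and throughput."}],
    stream=True,  # stream tokens to benefit from the low TTFT measured above
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```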
Benchmark Overview
So far, we have used the Artificial Analysis benchmark to compare providers on throughput, latency, and cost. To see how these tradeoffs play out in practice, here is a visual summary of the results across providers. The charts below are taken directly from Artificial Analysis.
The first chart plots output speed against price, and the second compares latency with output speed.
[Figure: Output speed vs. price (8 Oct 2025)]
[Figure: Latency vs. output speed (8 Oct 2025)]
Below is a detailed comparison table summarizing key metrics for GPT-OSS-120B inference across providers.
| Provider | Throughput (tokens/sec) | Time to first token (sec) | Blended price ($/million tokens) |
|---|---|---|---|
| Clarifai | 544 | 0.32 | 0.16 |
| Google Vertex AI | 392 | 0.40 | 0.26 |
| Microsoft Azure | 348 | 0.48 | 0.26 |
| Hyperbolic | 395 | 0.52 | 0.30 |
| AWS | 252 | 0.64 | 0.26 |
| Databricks | 195 | 0.36 | 0.26 |
| Together AI | 248 | 0.25 | 0.26 |
| Fireworks AI | 482 | 0.44 | 0.26 |
| CompactifAI | 186 | 0.29 | 0.10 |
| Nebius Base | 165 | 0.66 | 0.26 |
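To turn the table into rough capacity and budget estimates, a back-of-the-envelope sketch like the one below can help. The daily token volume and response length are hypothetical placeholders; the per-provider figures come straight from the table above.

```python
# Back-of-the-envelope sketch: estimate monthly spend and single-request
# generation time from the benchmark table. Workload numbers are hypothetical.
providers = {
    # name: (throughput tok/s, TTFT s, blended $/Mtok)
    "Clarifai": (544, 0.32, 0.16),
    "CompactifAI": (186, 0.29, 0.10),
    "Nebius Base": (165, 0.66, 0.26),
}

TOKENS_PER_DAY = 50_000_000  # hypothetical workload: 50M blended tokens/day
OUTPUT_TOKENS = 2_000        # hypothetical single response length

for name, (tps, ttft, price) in providers.items():
    monthly_cost = TOKENS_PER_DAY / 1e6 * price * 30
    request_time = ttft + OUTPUT_TOKENS / tps  # time to stream one response
    print(f"{name}: ~${monthly_cost:,.0f}/month, ~{request_time:.1f}s per 2k-token response")
```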
Conclusion
Choosing an inference provider for GPT-OSS-120B requires balancing throughput, latency, and cost. Each provider handles these tradeoffs differently, and the best choice depends on your specific workload and performance requirements.
Providers with high throughput are better for reasoning-heavy or interactive tasks, while providers with lower median throughput may be better suited to batch or background processing where speed matters less. Latency also plays an important role: a short time to first token improves responsiveness for real-time applications, but slightly higher latency may be acceptable for less time-sensitive tasks.
Cost remains just as important. Some providers deliver strong performance at low cost, while others trade efficiency for price. Benchmarks covering throughput, time to first token, and blended cost provide a clear foundation for understanding these tradeoffs.
Ultimately, the right provider depends on your engineering constraints, workload characteristics, and which tradeoffs matter most for your application.
Learn more about Clarifai's inference engine
The fastest AI inference and reasoning on GPUs. Verified by Artificial Analysis.