DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds

DeepReinforce has launched Ornith-1.0, an open supply mannequin household constructed for agent coding. Out there in 4 sizes, from the 9B high-density mannequin to the 397B knowledgeable blended flagship. All Checkpoints are shipped below Hugging Face’s MIT License. The mannequin is post-trained primarily based on pre-trained Gemma 4 and Qwen 3.5.

Most coding brokers mix fashions with human-designed fixation harnesses. Ornith-1.0 as a substitute learns its personal description. The DeepReinforce analysis group studies state-of-the-art outcomes on open fashions of comparable measurement.

TL;DR

Ornith-1.0 ships below MIT in 9B, 31B, 35B-MoE, and 397B-MoE sizes and is constructed on Gemma 4 and Qwen 3.5. The mannequin learns its personal scaffolding throughout RL and collectively optimizes the harness and resolution. Ornith-1.0-397B outperforms Claude Opus 4.7 in each headline benchmarks, however falls wanting Opus 4.8 and the bigger GLM-5.2-744B. Three layers (fastened belief boundaries, deterministic displays, and frozen LLM judges) forestall rewards from being hacked.

What’s Ornis-1.0?

Ornith-1.0 is a set of inference fashions tailor-made for coding brokers. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B mannequin is a mixture of consultants and prompts roughly 3B parameters per token. FP8 and GGUF builds are additionally printed to hurry up native supply.

Every mannequin is an inference mannequin. Replies begin with a block earlier than the ultimate reply. The supplied recipe permits the reasoning parser, so the hint is returned in a separate reasoning_content area. This mannequin additionally points well-formed instrument requires agent loops.

Set up is simple. The 9B mannequin is about 19GB in bf16 and runs on a single 80GB GPU. The supplied recipes goal vLLM, SGLang, and Transformers. Every mannequin exposes an OpenAI-compatible endpoint. Subsequently, the usual agent framework works with none code adjustments.

interactive explainer

self-scaffolding concepts <p class="wp-block-paragraph">Most coding brokers depend on scaffolding, often known as a harness. Scaffolds wrap fashions with reminiscence, instruments, error dealing with, and orchestration logic. AI groups usually manually design one scaffold for every job class.</p> <p class="wp-block-paragraph">Ornith-1.0 as a substitute treats scaffolds as learnable objects. Throughout reinforcement studying, scaffolding coevolves with the mannequin’s insurance policies. Every RL step is executed in two phases.</p> <p class="wp-block-paragraph">First, the mannequin reads the duty and the scaffold earlier than it. Subsequent, we suggest a classy scaffold. Then use the scaffold and duties to generate an answer rollout. Rewards from the rollout shall be returned to each phases.</p> <p class="wp-block-paragraph">Subsequently, the mannequin is optimized for creating orchestrations, not simply solutions. By coaching, larger reward scaffolds mutate and are routinely chosen. Create task-specific methods with out handbook harness design.</p> <p class="wp-block-paragraph">Coaching can also be carried out asynchronously utilizing a pipeline RL setup. Revocation weight lowers the load of previous out-of-policy tokens that cross the brink. Optimization makes use of token-level GRPO goals.</p> <h2 id="forestall-reward-hacking" class="rb-heading-index-4 wp-block-heading">Forestall reward hacking</h2> <p class="wp-block-paragraph">Reward hacking happens once you let the mannequin create its personal scaffolding. The scaffold can learn the displayed take a look at file and hardcode the anticipated output. You may also copy Oracle options inside your surroundings. The DeepReinforce group describes three layers of protection.</p> <p>The outer belief boundary is fastened and unchanging. Separation of the surroundings, instrument surfaces, and assessments are past the attain of the mannequin. The mannequin solely evolves inner coverage scaffolding. Deterministic displays flag prohibited actions. There may be zero reward for studying a shelved path or modifying a validation script. These trajectories are excluded from the calculation of benefit. Frozen LLM judges function a veto. This isn’t as the primary reward, however on prime of the verifier.</p> <h2 id="benchmark" class="rb-heading-index-5 wp-block-heading">benchmark</h2> <p class="wp-block-paragraph">DeepReinforce studies vendor counts throughout a number of agent coding benchmarks. On the flagship scale, Ornith-1.0-397B scored 77.5 in Terminal-Bench 2.1 and 82.4 in SWE-Bench Verified. In SWE-Bench Verified, its 82.4 is decrease than solely the Claude Opus 4.8 (87.6) among the many listed fashions. With Terminal-Bench 2.1, issues get much more difficult.</p> <p class="wp-block-paragraph">Ornith-1.0-397B outperforms Claude Opus 4.7 (70.3) in Terminal-Bench 2.1. Nevertheless, it falls wanting the Claude Opus 4.8 (85) and the bigger GLM-5.2-744B (81.0). Subsequently, “state-of-the-art” claims are restricted to open fashions of comparable measurement.</p> <p class="wp-block-paragraph">Smaller fashions have effectivity in thoughts. The 35B mannequin has a Terminal Bench 2.1 rating of 64.2, which is larger than the Qwen 3.5-397B’s 53.5. The 9B mannequin reaches 43.1 in Terminal-Bench 2.1 and 69.4 in SWE-Bench Verified.</p> <figure class="wp-block-table is-style-stripes">BenchmarkOrnith-1.0-397BQwen3.5-397BQwen3.7-MaxGLM-5.2-744BMinimax-M3-428BDeepSeek-V4-Professional-1.6TClaude Opus 4.7Claude Opus 4.8 Terminal Bench 2.177.553.573.581.0646470.385SWE-Bench Verified 82.476.480.4––80.680.887.6SWE-Bench Pro62.251.660.662.15955.464.369.2SWE-Bench Multilingual 78.969.378.3––76.2––NL2Repo48.236.847.248.942.1––69.7ClawEval Avg77.170.765.2––75.878.2–</figure> <h2 id="use-circumstances-and-quickstarts" class="rb-heading-index-6 wp-block-heading">Use circumstances and quickstarts</h2> <p class="wp-block-paragraph">This mannequin is meant for device-native coding brokers and repository-scale work. Sensible variations embrace multi-file refactoring, bug localization, test-driven patching, and so forth. The 9B mannequin is appropriate for edge or single GPU setups the place latency and price are essential. The 397B mannequin targets most accuracy for lengthy, multi-step duties.</p> <p class="wp-block-paragraph">For instance, builders can run the 9B mannequin regionally to prioritize failing take a look at suites. Platform groups can self-host 397B fashions for inner coding brokers.</p> <p class="wp-block-paragraph">With vLLM, it is a one-liner that gives the service.</p> <div class="dm-code-snippet dark dm-normal-version default no-background-mobile" snippet-height="" style="background-color:#abb8c3"> <div class="control-language">vllmserve deepreinforce-ai/Ornith-1.0-9B –served-model-name Ornith-1.0-9B –max-model-len 262144 –enable-auto-tool-choice –tool-call-parser qwen3_xml –reasoning-parser qwen3 –trust-remote-code</div> </div> <p class="wp-block-paragraph">Then name it with any OpenAI consumer.</p> <div class="dm-code-snippet dark dm-normal-version default no-background-mobile" snippet-height="" style="background-color:#abb8c3"> <div class="control-language">from openai import OpenAI consumer = OpenAI(base_url=”http://localhost:8000/v1″, api_key=”EMPTY”) resp = consumer.chat.completions.create(mannequin=”Ornith-1.0-9B”,messages=[{“role”: “user”, “content”: “Write a Python is_prime(n).”}]temperature=0.6, top_p=0.95, ) msg = resp.decisions[0].message print(getattr(msg, “reasoning_content”, None)) # hint print(msg.content material) # closing reply</div> </div> <p class="wp-block-paragraph">The reasoning hint is returned in reasoning_content and the reply is returned in content material. Really helpful sampling is temperature = 0.6, top_p = 0.95, top_k = 20. This mannequin additionally plugs into OpenHands, OpenClaw, and OpenCode.</p> <p class="wp-block-paragraph">Test the mannequin weight and technical particulars. Additionally, be happy to observe us on Twitter. Additionally, do not forget to hitch over 150,000 SubReddits and subscribe to our publication. hold on! Are you on telegram? Now you can additionally take part by telegram.</p> <p class="wp-block-paragraph">Must companion with us to advertise your GitHub repository, Hug Face Web page, product releases, webinars, and extra? Join with us</p>

DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds

TL;DR

What’s Ornis-1.0?

interactive explainer

Leave a Reply Cancel reply

Follow US

Popular News

The Opening Still Hits So Hard

Building a ‘Human-in-the-Loop’ Approval Gate for Autonomous Agents

Deep Agents Tutorial: LangGraph for Smarter AI

10 Best Foods to Reduce Insulin Resistance Naturally

GFN Thursday: Flight Controls on GeForce NOW

Categories

About US

Quick Links

Important Links

Subscribe US

TL;DR

What’s Ornis-1.0?

interactive explainer

Leave a Reply Cancel reply

Follow US

Weekly Newsletter

Popular News

The Opening Still Hits So Hard

Building a ‘Human-in-the-Loop’ Approval Gate for Autonomous Agents

Deep Agents Tutorial: LangGraph for Smarter AI

10 Best Foods to Reduce Insulin Resistance Naturally

GFN Thursday: Flight Controls on GeForce NOW

Categories

About US

Quick Links

Important Links

Subscribe US