Agent0: A Fully Autonomous AI Framework that Evolves High-Performing Agents without External Data through Multi-Step Co-Evolution

AllTopicsToday | Published: November 25, 2025 (last updated 2:47 pm)

Large-scale language models require huge human-curated datasets. What if a model instead had to create its own curriculum and learn to use tools on its own? A team of researchers from UNC-Chapel Hill, Salesforce Research, and Stanford University introduces Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration.

Agent0 targets mathematical and general reasoning. It shows that with careful task generation and tool-integrated rollouts, it is possible to exceed the base model's native capabilities across 10 benchmarks.

Paper: https://arxiv.org/pdf/2511.16043

Two agents from one base model

Agent0 starts from a base policy π_base (for example, Qwen3-4B-Base or Qwen3-8B-Base). This policy is cloned into two roles:

A curriculum agent πθ generates tasks, and an executor agent πϕ solves those tasks using Python tools.

Training proceeds in iterations, each with two stages:

  • Curriculum evolution: the curriculum agent generates a batch of tasks. For each task, the executor samples several responses. A composite reward measures how uncertain the executor is, how often it uses the tool, and how diverse the batch is; πθ is updated with Group Relative Policy Optimization (GRPO) on this reward.
  • Executor evolution: the trained curriculum agent is frozen and used to generate a large pool of tasks. Agent0 filters this pool to keep only tasks near the executor's capability frontier, then trains the executor on them with an ambiguity-aware RL objective called Ambiguity Dynamic Policy Optimization (ADPO).

This loop forms a feedback cycle: as the executor becomes more capable with the code interpreter, the curriculum must generate more complex, tool-dependent problems to keep its reward high.
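The two-stage loop can be condensed into a short skeleton. Everything here is an illustrative sketch under assumed interfaces: `generate_task`, `solve`, `grpo_update`, and `adpo_update` are placeholder callables (not the Agent0 API), and the uncertainty shaping is one simple choice rather than the paper's exact reward:

```python
from collections import Counter

def self_consistency(answers):
    # Fraction of sampled answers that agree with the majority vote.
    return Counter(answers).most_common(1)[0][1] / len(answers)

def agent0_iteration(generate_task, solve, grpo_update, adpo_update,
                     n_curriculum=8, pool_size=32, k=8, band=(0.3, 0.8)):
    # Stage 1: curriculum evolution. Reward peaks when the executor's
    # self-consistency on a task is near 0.5 (hard but solvable).
    tasks = [generate_task() for _ in range(n_curriculum)]
    rewards = [1.0 - abs(2.0 * self_consistency([solve(t) for _ in range(k)]) - 1.0)
               for t in tasks]
    grpo_update(tasks, rewards)

    # Stage 2: executor evolution. Freeze the curriculum, build a candidate
    # pool, keep only frontier tasks, and train the executor on them with
    # ADPO against majority-voted pseudo-labels.
    pool = [generate_task() for _ in range(pool_size)]
    frontier = [t for t in pool
                if band[0] <= self_consistency([solve(t) for _ in range(k)]) <= band[1]]
    adpo_update(frontier)
    return frontier
```

An executor whose answers alternate (self-consistency 0.5) keeps every pool task on the frontier; a fully consistent executor keeps none.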


How does the curriculum agent score tasks?

The curriculum reward combines three signals:

Uncertainty reward: for each generated task x, the executor samples k responses, and the majority answer serves as the pseudo-label. Self-consistency p̂(x) is the fraction of answers that match this majority. The reward is highest when p̂ is near 0.5 and lower when the task is too easy or too hard. This steers the curriculum toward tasks that are hard but solvable for the current executor.

Tool-use reward: the executor can emit Python code tags to trigger a sandboxed code interpreter and receives the tagged execution results as output. Agent0 counts the number of tool calls in a trajectory and gives a scaled, capped reward, with the cap C set to 4 in the experiments. This prioritizes tasks that actually require tool calls rather than pure mental arithmetic.

Repetition penalty: within each curriculum batch, Agent0 uses a BLEU-based distance to measure pairwise similarity between tasks. Tasks are clustered, and the penalty grows with cluster size. This prevents the curriculum from producing many near-duplicates.

The composite reward multiplies a format check by a weighted sum of the uncertainty and tool rewards minus the repetition penalty. This composite value feeds into GRPO to update πθ.
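A minimal sketch of that composite reward follows. The exact shaping of each signal and the weights `w_unc`/`w_tool` are illustrative assumptions (the paper defines its own functional forms); only the cap C = 4 comes from the paper:

```python
from collections import Counter

def uncertainty_reward(answers):
    # Self-consistency p_hat: fraction of sampled answers matching the majority.
    p_hat = Counter(answers).most_common(1)[0][1] / len(answers)
    # Peaks at p_hat = 0.5 (hard but solvable), falls off toward 0 and 1.
    return 1.0 - abs(2.0 * p_hat - 1.0)

def tool_reward(num_tool_calls, cap=4):
    # Scaled, capped tool-use reward; the experiments cap at C = 4.
    return min(num_tool_calls, cap) / cap

def curriculum_reward(answers, num_tool_calls, repetition_penalty,
                      format_ok, w_unc=0.7, w_tool=0.3):
    # Format check gates a weighted sum of the signals minus the penalty.
    if not format_ok:
        return 0.0
    return (w_unc * uncertainty_reward(answers)
            + w_tool * tool_reward(num_tool_calls)
            - repetition_penalty)
```

A task with split answers (p̂ = 0.5) and heavy tool use scores highest; a malformed task scores zero regardless of the other signals.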

How does the executor learn from noisy self-labels?

The executor is also trained with GRPO, but on multi-turn, tool-integrated trajectories with pseudo-labels instead of ground-truth answers.

Building a frontier dataset: after curriculum training in an iteration, the frozen curriculum generates a large pool of candidate tasks. For each task, Agent0 computes the self-consistency p̂(x) under the current executor and keeps only tasks whose p̂ falls within an informative band (e.g., 0.3 to 0.8). This defines a challenging frontier dataset that avoids both trivial and impossible problems.
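The filtering step can be sketched as follows, assuming a dict mapping each candidate task to its k sampled answers (the names and data layout are illustrative):

```python
from collections import Counter

def self_consistency(answers):
    # Fraction of sampled answers agreeing with the majority vote.
    return Counter(answers).most_common(1)[0][1] / len(answers)

def frontier_filter(task_to_answers, low=0.3, high=0.8):
    # Keep only tasks in the informative band: not already solved reliably
    # (p_hat too high) and not hopeless for the current executor (too low).
    return [task for task, answers in task_to_answers.items()
            if low <= self_consistency(answers) <= high]
```

For example, a task answered "4" eight times out of eight is discarded as trivial, a task with eight different answers is discarded as hopeless, and a 5-of-8 majority (p̂ = 0.625) survives.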

Multi-turn tool-integrated rollouts: for each frontier task, the executor generates a trajectory that can interleave natural-language reasoning tokens, Python code segments, and tool outputs.

Generation pauses when a tool call is encountered, the code executes in a sandboxed interpreter built on top of VeRL's tool infrastructure, and generation resumes conditioned on the result. The trajectory ends when the model produces a final answer inside a \boxed{...} tag.
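As a sketch, the pause/execute/resume loop might look like the following; the `<code>`/`<output>` tag names and the `generate`/`run_code` signatures are assumptions for illustration, not Agent0's actual protocol:

```python
def rollout(generate, run_code, prompt, max_turns=8):
    # Multi-turn tool-integrated rollout: generation pauses at each emitted
    # code block, the code runs in a sandbox, and its output is fed back
    # into the context before generation resumes.
    context = prompt
    for _ in range(max_turns):
        segment = generate(context)
        context += segment
        if "\\boxed{" in segment:        # final answer: trajectory ends
            break
        if segment.endswith("</code>"):  # tool call: execute and feed back
            code = segment.split("<code>")[-1].removesuffix("</code>")
            context += f"<output>{run_code(code)}</output>"
    return context
```

With a scripted two-step "model" that first emits a code call and then a boxed answer, the trajectory contains both the tool output and the final answer.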

A majority vote across the sampled trajectories defines the pseudo-label, and each trajectory's final reward is computed against it.

ADPO, ambiguity-aware RL: standard GRPO treats all samples equally, which becomes unstable when labels come from majority votes on ambiguous tasks. ADPO modifies GRPO in two ways, using p̂ as an ambiguity signal.

First, the normalized advantage is scaled by a factor that increases with self-consistency, so trajectories from unreliable tasks contribute less. Second, the upper clipping bound on the importance ratio is made dynamic, depending on self-consistency. Empirical analysis shows that a fixed upper clip mainly affects low-probability tokens; ADPO adaptively relaxes this bound, improving exploration on uncertain tasks, as visualized by the probability statistics of up-clipped tokens.
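Both modifications can be sketched in a few lines. The linear advantage scaling and the clip schedule below are illustrative assumptions rather than the paper's exact functional forms, and the function names are placeholders:

```python
def adpo_advantage(advantage, p_hat):
    # Scale the normalized advantage by a factor that grows with
    # self-consistency p_hat, so trajectories from ambiguous tasks
    # (noisy majority-vote labels) contribute less to the update.
    return p_hat * advantage

def adpo_clipped_ratio(ratio, p_hat, eps=0.2, slack=0.2):
    # Dynamic upper clip: the bound relaxes as ambiguity grows (low p_hat),
    # letting low-probability tokens on uncertain tasks keep more gradient.
    # The lower bound stays fixed, as in standard PPO/GRPO clipping.
    upper = 1.0 + eps + slack * (1.0 - p_hat)
    return min(max(ratio, 1.0 - eps), upper)
```

On a fully consistent task (p̂ = 1) this reduces to the standard 1 + ε clip; as p̂ drops, the upper bound loosens and the advantage shrinks.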


Results on mathematical and general reasoning

Agent0 is implemented on top of VeRL and evaluated with Qwen3-4B-Base and Qwen3-8B-Base, using a sandboxed Python interpreter as the single external tool.

The research team evaluated 10 benchmarks:

  • Mathematical reasoning: AMC, Minerva, MATH, GSM8K, OlympiadBench, AIME24, AIME25.
  • General reasoning: SuperGPQA, MMLU-Pro, BBEH.

Most datasets report pass@1, while the AMC and AIME tasks report mean@32.

For Qwen3-8B-Base, Agent0 reaches a math average of 58.2 vs. 49.2 for the base model, and an overall general-reasoning average of 42.1 vs. 34.5.

Agent0 also improves over strong data-free baselines such as R-Zero, Absolute Zero, SPIRAL, and Socratic-Zero, both with and without tools. On Qwen3-8B it outperforms R-Zero by 6.4 points and Absolute Zero by 10.6 points on average. It also beats Socratic-Zero, which relies on an external OpenAI API.

Over three co-evolution iterations, Qwen3-8B's average math performance increased from 55.1 to 58.2, and general reasoning also improved with each iteration. This confirms stable self-improvement rather than collapse.

Qualitative examples show that curriculum tasks evolve from basic geometry questions to complex constraint-satisfaction problems, while executor trajectories mix reasoning text and Python calls to arrive at the correct answer.

Key takeaways

  • Fully data-free co-evolution: Agent0 eliminates external datasets and human annotations. The two agents, a curriculum agent and an executor agent, are initialized from the same base LLM and co-evolve purely through reinforcement learning and Python tool use.
  • Frontier curriculum from self-uncertainty: the curriculum agent scores tasks using the executor's self-consistency and tool usage, learning to generate frontier tasks that explicitly require tool-integrated reasoning and are neither trivial nor impossible.
  • ADPO stabilizes RL on pseudo-labels: the executor is trained with Ambiguity Dynamic Policy Optimization, which down-weights highly ambiguous tasks and adjusts the clipping range based on self-consistency. This keeps GRPO-style updates stable even when rewards come from majority-voted pseudo-labels.
  • Consistent gains in math and general reasoning: on Qwen3-8B-Base, Agent0 improves the math benchmark average from 49.2 to 58.2 and general reasoning from 34.5 to 42.1, relative improvements of roughly 18% and 22%.
  • Outperforms prior zero-data frameworks: Agent0 beats earlier self-evolving methods across 10 benchmarks, including ones that already use tools and external APIs, such as R-Zero, Absolute Zero, SPIRAL, and Socratic-Zero. This shows that tool-integrated co-evolution is a meaningful step beyond earlier single-round self-play approaches.

Editorial notes

Agent0 is an important step toward practical data-free reinforcement learning for tool-integrated reasoning. It shows that a base LLM can act as both curriculum agent and executor agent, and that GRPO combined with ADPO and VeRL tooling can drive stable improvements from majority-voted pseudo-labels. The approach also shows that tool-integrated co-evolution can outperform earlier zero-data frameworks such as R-Zero and Absolute Zero on a strong Qwen3 baseline. Agent0 makes a strong case that self-evolving, tool-integrated LLM agents are becoming a viable training paradigm.

Check out the Paper and Repo. Feel free to visit our GitHub page for tutorials, code, and notebooks, follow us on Twitter, join our 100,000+ ML SubReddit, and subscribe to our newsletter. We are also on Telegram.

Michal Sutter is a data science professional with a master's degree in data science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
