There is a particular kind of grunt work every AI engineer knows: the prompt-tuning loop. Write a system prompt, run the agent against a benchmark, read the failure traces, tweak the prompt, add a tool, rerun. Repeat a few dozen times and the needle might move. It is the tedious craft of hand-editing your Python files. Now a new open-source library called AutoAgent, built by Kevin Gu of threelayer.inc, proposes an unsettling alternative: don't do that work yourself. Let an AI do it.
AutoAgent is an open-source library for autonomously improving agents on any domain. In a 24-hour run, it reached #1 on SpreadsheetBench with a score of 96.5% and #1 on Terminal-Bench with a GPT-5 score of 55.1%.
What actually is AutoAgent?
AutoAgent is described as being "like automated research, but for agent engineering." The idea is to give an AI agent a task and let it autonomously build and iterate on an agent harness overnight: edit system prompts, tools, agent configurations, and orchestration; run benchmarks; check scores; keep or discard the changes; and repeat.
To understand the analogy, Andrej Karpathy's notion of automated research does the same thing for ML training: it loops through a propose-train-evaluate cycle and keeps only the changes that improve validation loss. AutoAgent transfers that same ratchet loop from ML training to agent engineering. Instead of optimizing model weights and training hyperparameters, you optimize the harness: the system prompts, tool definitions, routing logic, and orchestration strategies that determine how the agent behaves on a task.
The harness, in this context, is the scaffolding around the LLM: which system prompt it receives, which tools it can invoke, how it routes between subagents, and how its tasks are formatted as input. Most agent engineers handcraft this scaffolding. AutoAgent automates iteration on the scaffold itself.
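To make "harness" concrete, here is a minimal sketch of the knobs such scaffolding bundles together. All names here are hypothetical illustrations, not AutoAgent's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch of what a harness groups around the LLM:
# the parts engineers normally hand-tune, and that AutoAgent edits.
@dataclass
class Harness:
    system_prompt: str
    tools: dict[str, Callable] = field(default_factory=dict)
    # Routing between subagents is itself part of the harness.
    subagent_router: Callable[[str], str] = lambda task: "default"
    input_template: str = "Task:\n{task}"

    def format_input(self, task: str) -> str:
        # How a task is rendered as model input is one more tunable knob.
        return self.input_template.format(task=task)

harness = Harness(
    system_prompt="You are a careful spreadsheet-editing agent.",
    tools={"read_cell": lambda ref: None, "write_cell": lambda ref, v: None},
)
print(harness.format_input("Sum column B"))
```

Every field above is something a human would normally tweak by hand between benchmark runs; AutoAgent's premise is that a meta-agent can tweak them instead.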
Architecture: 2 agents, 1 file, 1 directive
The GitHub repository has an intentionally simple structure. agent.py packs the entire harness under test into one file: configuration, tool definitions, agent registries, routing/orchestration, and the Harbor adapter boundary. The adapter section is explicitly marked as fixed; the rest is the meta-agent's main editing surface. program.md contains the meta-agent's instructions and directive (what kind of agent to build) and is the only file edited by humans.
Think of it as a separation of concerns between humans and machines. A human sets the direction in program.md. A meta-agent (a separate, higher-level AI) then reads that directive, inspects agent.py, runs the benchmark, diagnoses what failed, rewrites the relevant parts of agent.py, and repeats. Humans never touch agent.py directly.
The key piece of infrastructure that keeps the loop coherent across iterations is results.tsv, an experiment log automatically created and maintained by the meta-agent. It tracks every experiment run and gives the meta-agent a history to learn from when deciding what to try next. The full project structure also includes a Dockerfile.base, an optional .agent/ directory for reusable agent-workspace artifacts such as prompts and skills, a task/ folder for benchmark payloads (added per benchmark branch), and a jobs/ directory for Harbor job output.
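An experiment log like results.tsv is easy to picture. The column names and helper functions below are illustrative assumptions, not the file's actual schema:

```python
import csv
import pathlib
import tempfile

# Assumed layout for a results.tsv-style log: one row per experiment,
# tab-separated, written to a temp dir so the sketch is self-contained.
LOG = pathlib.Path(tempfile.mkdtemp()) / "results.tsv"

def log_experiment(iteration: int, change: str, score: float) -> None:
    is_new = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if is_new:
            writer.writerow(["iteration", "change", "score"])  # header once
        writer.writerow([iteration, change, f"{score:.3f}"])

def best_score() -> float:
    # The meta-agent can scan the history to see what has worked so far.
    with LOG.open() as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    return max(float(r["score"]) for r in rows) if rows else 0.0

log_experiment(1, "tightened system prompt", 0.62)
log_experiment(2, "added a retry tool", 0.71)
print(best_score())  # 0.71
```

The point of keeping this as a flat file is that the meta-agent can read it back each iteration and condition its next edit on the full history, not just the last run.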
The metric is the total score produced by the benchmark's task test suite, and the meta-agent climbs it. Every experiment yields a numerical score: if a change improves it, keep the change; if not, discard it. It is the same ratchet loop as automated research.
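The ratchet loop can be sketched in a few lines. The two functions below are toy stand-ins (hypothetical names, not AutoAgent's API): in the real system, proposing an edit means the meta-agent rewriting agent.py, and running the benchmark means executing the Harbor task suite:

```python
import random

random.seed(0)  # deterministic toy run

def propose_edit(harness: str) -> str:
    # Stand-in for the meta-agent rewriting agent.py.
    return harness + f"\n# tweak {random.randint(0, 99)}"

def run_benchmark(harness: str) -> float:
    # Stand-in for running the task suite and reading back a 0.0-1.0 score.
    return random.random()

best_harness, best = "# baseline agent.py", 0.0
history = []
for step in range(10):
    candidate = propose_edit(best_harness)
    score = run_benchmark(candidate)
    history.append((step, score, score > best))
    if score > best:                      # the ratchet: keep only improvements
        best_harness, best = candidate, score

print(f"best score after 10 steps: {best:.3f}")
```

Because rejected edits are simply discarded, the score can only go up over the run, which is what lets the loop run unattended overnight.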
Task format and Harbor integration
Benchmarks are expressed as Harbor-style tasks. Each task lives under tasks/my-task/ and contains: task.toml for settings such as timeouts and metadata; instructions.md, the prompt sent to the agent; a tests/ directory with a test.sh entry point that writes the score to /logs/reward.txt, plus a test.py for deterministic checking or LLM-as-judge validation. An environment/Dockerfile defines the task container, and a files/ directory holds reference files mounted into the container. The test writes a score between 0.0 and 1.0 to the verifier's log. That score is the mountain the meta-agent climbs.
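The layout described above can be sketched by scaffolding a toy task directory. The file paths follow the article's description; the task.toml keys and file contents are illustrative assumptions, not Harbor's exact schema:

```python
import pathlib
import tempfile
import textwrap

# Build a toy Harbor-style task skeleton in a temp dir.
root = pathlib.Path(tempfile.mkdtemp()) / "tasks" / "my-task"

files = {
    # Illustrative keys; consult Harbor's docs for the real schema.
    "task.toml": 'name = "my-task"\ntimeout_sec = 600\n',
    "instructions.md": "Sum column B of input.xlsx and write it to C1.\n",
    "tests/test.sh": textwrap.dedent("""\
        #!/bin/sh
        # Entry point: run the checker, write a 0.0-1.0 score for the verifier.
        python tests/test.py > /logs/reward.txt
    """),
    "tests/test.py": "print(1.0)  # deterministic check or LLM-as-judge\n",
    "environment/Dockerfile": "FROM python:3.12-slim\n",
}
for rel, body in files.items():
    path = root / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body)

print(sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()))
```

Anything you can express in this shape, with a container and a script that emits a single reward number, becomes a valid optimization target for the loop.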
The LLM-as-judge pattern here is worth flagging. In addition to checking the answer deterministically (as in a unit test), the test suite can use another LLM to evaluate whether the agent's output is "good enough." This is common in agent benchmarks where the correct answer is not reducible to a string match.
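The two verification styles can be contrasted in a short sketch. The judge here is stubbed with a trivial keyword heuristic so the example runs offline; a real test.py would call an actual model API at that point, and all function names are hypothetical:

```python
def deterministic_check(output: str, expected: str) -> float:
    # Unit-test style: exact match scores 1.0, anything else 0.0.
    return 1.0 if output.strip() == expected.strip() else 0.0

def llm_judge(output: str, rubric: str) -> float:
    # Stand-in for an LLM grading free-form output against a rubric.
    # A keyword heuristic keeps the sketch runnable without a model call.
    keywords = [kw.strip() for kw in rubric.split(",")]
    hits = sum(1 for kw in keywords if kw in output.lower())
    return hits / max(1, len(keywords))

def score(output: str, expected: str, rubric: str) -> float:
    # Prefer the cheap deterministic check; fall back to the judge when
    # the right answer is not reducible to a string match.
    exact = deterministic_check(output, expected)
    return exact if exact == 1.0 else llm_judge(output, rubric)

print(score("The total is 42.", "42", "total, 42"))  # 1.0 via the judge path
```

The fallback ordering matters for cost: deterministic checks are free, so the (expensive) judge only runs when string matching fails.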
Key takeaways
- Autonomous harness engineering works: AutoAgent shows that a meta-agent can fully replace the human prompt-tuning loop, iterating on agent.py overnight with no human directly touching the harness file.
- The benchmark results validate the approach: over a 24-hour run, AutoAgent reached first place on SpreadsheetBench (96.5%) and the top GPT-5 score on Terminal-Bench (55.1%), outperforming hand-engineered entries.
- "Model empathy" may be a real phenomenon: Claude meta-agents optimizing Claude task agents appear to diagnose failures more accurately than when optimizing GPT-based agents, suggesting that pairing same-family models may matter when designing AutoAgent loops.
- The human role shifts from engineer to director: you do not write or edit agent.py; you write program.md, a simple Markdown directive that steers the meta-agent. The distinction reflects a broader shift in agent engineering, from writing code to setting goals.
- Plug-and-play with any benchmark: AutoAgent is domain-agnostic because tasks follow Harbor's open format and agents run inside Docker containers. Any task that can be scored, whether spreadsheets, terminal commands, or your own custom domain, is a target for autonomous self-optimization.
Check out the repository.


