An Adaptive Harness to Self-Construct State-of-the-Art Systems

Distyl's adaptive harness achieves state-of-the-art performance on 75 benchmarks.
Given the rise of AI agents that can produce functional solutions from natural language requests, we explored an adjacent question: can agents steer themselves towards state-of-the-art performance, beyond just a functional solution, from the same natural language request?
TL;DR
We pointed our adaptive harness at 75 benchmarks spanning natural language understanding, logical reasoning, code generation, medical QA, legal analysis, agentic tool use, knowledge graphs, and more—each with an established leaderboard and published state-of-the-art results. For all 75 benchmarks, our adaptive harness produced systems outperforming the published state-of-the-art.
At the core of this adaptive harness is architectural discovery: the identification of techniques to build around the best model available for the task at hand. Instead of stopping at a functional solution, this architectural discovery actively pushes for state-of-the-art performance. By leaning on model strengths and working around model weaknesses, this agentic constructor is able to draw additional performance out of its underlying engine.
To substantiate the claim of state-of-the-art performance, we pointed the harness at 75 benchmarks spanning natural language understanding, logical reasoning, code generation, medical QA, legal analysis, agentic tool use, knowledge graphs, instruction following, and more, each with an established leaderboard and published state-of-the-art results. For each, only a benchmark description, dataset, and official evaluation script were provided, with no human guidance, domain expertise, or manual engineering.
What followed was the autonomous exploration of various system architectures and models, writing of associated code, running of evaluations on a held-out training split, analysis of failures, and iteration on techniques—often through dozens of refinement cycles—until performance converged toward an outcome deemed cutting edge.
Across all 75 benchmarks, the resulting systems beat the published state-of-the-art, including on all 15 system-first benchmarks where the competition consisted of purpose-built, expert-engineered systems.1
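The refine-until-convergence cycle described above can be pictured as a simple loop. The following is a minimal sketch, not the harness itself: `refine_until_converged`, `propose`, `evaluate`, and the `patience` stopping rule are all hypothetical names and choices introduced here for illustration, and the toy stand-ins at the bottom replace real architecture proposals and benchmark runs.

```python
from typing import Callable, List, Optional, Tuple

# (candidate, score). The candidate is an int only in this toy example;
# in a real setting it would describe a whole system architecture.
Attempt = Tuple[int, float]


def refine_until_converged(
    propose: Callable[[List[Attempt]], int],
    evaluate: Callable[[int], float],
    max_cycles: int = 50,
    patience: int = 3,
) -> Tuple[Optional[Attempt], List[Attempt]]:
    """Propose, score, and keep the best candidate until scores stop improving.

    `propose` sees the full history so it can react to earlier failures.
    The loop stops after `patience` consecutive cycles without a new best
    (an assumed convergence rule) or after `max_cycles`.
    """
    history: List[Attempt] = []
    best: Optional[Attempt] = None
    stale = 0
    for _ in range(max_cycles):
        candidate = propose(history)
        score = evaluate(candidate)
        history.append((candidate, score))
        if best is None or score > best[1]:
            best, stale = (candidate, score), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, history


# Toy stand-ins: candidate k is "use k voting samples", and the score
# saturates at k = 4, so the loop stops three cycles after that.
best, history = refine_until_converged(
    propose=lambda h: len(h) + 1,
    evaluate=lambda n: min(n, 4) / 4,
)
print(best, len(history))
```

In this toy run the score plateaus at the fourth candidate, so the loop exits after three further non-improving cycles. A real proposer would presumably use failure analysis from `history` rather than a counter.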

Results
Model-first benchmarks are conventionally reported as a single model call, with leaderboards indexed by model name. Topping these substantiates the natural claim that a system built on top of a model should outperform a model alone. System-first benchmarks require multi-step agents, tools, or APIs; they evaluate systems by design. Topping these substantiates the claim that the discovery process does not produce naive architectures, but rather can build systems that outperform expert-crafted systems for the same problem.
Model-first benchmarks
These benchmarks were originally designed as model evaluations. For these benchmarks, the systems produced wrap the model in additional structure—using techniques like voting, execution, verification, category-specific prompting, and more—to push past what a single model call could accomplish. All 60 model-first benchmarks were topped, with 21 systems seeing gains of 20+ percentage points and four systems reaching full benchmark saturation.
System-first benchmarks
System-first benchmarks demand multi-step architectures that execute code against real environments, chain tool calls across live APIs, manage stateful side effects, and recover from failures mid-run. Leaderboards are dominated by custom pipelines, RL-trained policies, and domain-specific agent frameworks, often backed by months of human engineering. Of the 15 system-first benchmarks tested, our generated systems topped all 15.
Common patterns
While each generated system employs techniques tailored to its benchmark, the following are examples of broad technique classes:
Sample and aggregate
Run the same prompt N times at varied temperatures and take the majority vote. Stochastic error becomes a resource: independent failures cancel out.
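To see why independent failures cancel out, it may help to work through an idealized case. The sketch below assumes a binary task, samples that are fully independent, and an identical per-sample accuracy `p`. Real LLM samples are correlated, so actual gains are typically smaller. The function name `majority_vote_accuracy` is introduced here for illustration only.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Each sample is assumed to be correct with probability p on a binary task,
    so the number of correct samples follows a Binomial(n, p) distribution.
    Use odd n to avoid ties.
    """
    need = n // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


# With a per-sample accuracy above 0.5, voting helps and keeps helping as n grows.
for n in (1, 5, 15):
    print(n, round(majority_vote_accuracy(0.7, n), 3))

# Below 0.5 the same arithmetic works against you: voting amplifies the error.
print(round(majority_vote_accuracy(0.4, 5), 3))
```

Under these assumptions, voting only pays off when the single-sample accuracy is better than chance on the item. That is one reason to pair voting with prompts that raise per-sample accuracy first.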
import asyncio
from collections import Counter

# N samples of the SAME prompt across VARIED decoding: temperatures,
# thinking budgets, and seeds. Each dimension contributes independent
# diversity; their combination spans more of the solution space than any
# single axis could.
SAMPLING_CONFIGS = [
    # Deterministic anchor, weighted 2× with "hybrid temperature" variant.
    dict(temperature=0.0, thinking_budget=0, seed=0, weight=2.0),
    dict(temperature=0.3, thinking_budget=2048, seed=17, weight=1.0),
    dict(temperature=0.5, thinking_budget=4096, seed=42, weight=1.0),
    dict(temperature=0.7, thinking_budget=8192, seed=1337, weight=1.0),
    dict(temperature=0.9, thinking_budget=8192, seed=2718, weight=1.0),
]

async def sample_and_aggregate(prompt):
    # "weight" is a voting parameter, not a decoding parameter, so strip it
    # before calling the model. return_exceptions=True keeps one failed call
    # from sinking the whole batch.
    results = await asyncio.gather(*[
        run_llm(prompt, **{k: v for k, v in cfg.items() if k != "weight"})
        for cfg in SAMPLING_CONFIGS
    ], return_exceptions=True)

    # Weighted vote: the anchor gets more say; extended-thinking samples
    # provide diversity but can wander on long chains.
    tally = Counter()
    for cfg, r in zip(SAMPLING_CONFIGS, results):
        if isinstance(r, Exception) or not r.ok:
            continue
        tally[r.answer] += cfg["weight"]

    # Escalation path: on empty tally or a tie, hand off to an LLM judge
    # that scores the raw candidates head-to-head against the prompt.
    top, top_weight = tally.most_common(1)[0] if tally else (None, 0)
    tied = top_weight > 0 and sum(v == top_weight for v in tally.values()) > 1
    if top is None or tied:
        valid = [r for r in results if not isinstance(r, Exception) and r.ok]
        return await llm_judge_select(prompt, valid)
    return top
Ensemble reasoning
Vary the procedure, not just the seed. A forward symbolic proof and a process-of-elimination attack fail on different items—fusing them covers both blind spots.
import asyncio
from collections import Counter

# Vary the PROCEDURE, not just the seed. N structurally different reasoning
# frameworks, each with a known failure profile, so the strategies fail on
# different slices of the input distribution.
STRATEGIES = {
    "forward_deduction": dict(
        system=FORWARD_PROMPT,  # chain premises → conclusion
        # strong on items with explicit logical structure;
        # weak when common-sense inference is required.
    ),
    "elimination": dict(
        system=ELIMINATION_PROMPT,  # rule out alternatives one by one
        # strong on MCQ-like items; weak on open-ended generation.
    ),
    "claim_decomposition": dict(
        system=DECOMPOSITION_PROMPT,  # atomic claims, verified one by one
        # strong on multi-part hypotheses; weak on holistic reasoning.
    ),
}

async def ensemble_reason(prompt):
    # Run all strategies in parallel at T=0: variance must come from the
    # framework, not from sampling noise. (If you want both, layer with
    # sample-and-aggregate.)
    outputs = await asyncio.gather(*[
        run_llm(prompt, system=cfg["system"], temperature=0.0)
        for cfg in STRATEGIES.values()
    ])

    # Strategy-aware aggregation: unanimous agreement is a strong signal,
    # return immediately. On disagreement, hand the reasoning chains to an
    # arbiter, which votes on the ARGUMENTS, not just the surface labels.
    votes = Counter(r.answer for r in outputs)
    answer, count = votes.most_common(1)[0]
    if count == len(STRATEGIES):
        return answer
    return await strategy_arbiter(
        prompt,
        reasoning_chains=[(name, r.reasoning, r.answer)
                          for name, r in zip(STRATEGIES, outputs)],
    )
Execution-guided refinement
When the task produces a runnable artifact—code, SQL, a constraint check—execution provides additional signal: cardinality, error traces, contradictions.
# The task produces a runnable artifact (code, SQL, a constraint check,
# a plan against an API). Execute it against a REAL environment and feed
# the concrete failure signal back. CoT self-reflection can't produce
# any of this: it's raw ground truth from the sandbox.
async def execution_guided_refine(prompt, sandbox):
    # Pass 0: cheap shape/precondition estimate that bounds what "correct" means.
    #   • SQL: expected columns, row-count magnitude
    #   • code: expected function signature, I/O types, exception surface
    shape = estimate_output_shape(prompt)
    artifact = generate(prompt, shape_hint=shape)
    history = []
    result = None
    for attempt in range(MAX_REFINEMENT_ATTEMPTS):
        result = execute(artifact, env=sandbox)
        if result.ok and matches_shape(result.output, shape):
            break  # fall through to the critic pass rather than returning early
        # Structure the feedback, not a generic "try again".
        feedback = dict(
            error_trace=result.stderr_tail,            # exact exception / compiler msg
            failing_tests=result.failing_test_names,   # when run against a suite
            shape_diff=diff_shape(result.output, shape),
            observed_cardinality=result.row_count,
            previous_attempts=history[-3:],            # avoid repeating the same fix
        )
        history.append(dict(artifact=artifact, feedback=feedback))
        artifact = refine(prompt, previous=artifact, feedback=feedback)

    # Final critic pass: even when execution passes, a self-critique against
    # the original intent catches silent-correctness failures (off-by-ones,
    # wrong column order, subtle semantic drift). If the attempts ran out,
    # `result` describes the run before the last refinement.
    return await self_critique(prompt, artifact, execution_result=result)
Self-reflection
A lightweight second pass that catches the failure modes the first pass may miss—a keyword check, a constraint validator, a targeted challenge to the weakest prediction class.
# The verifier is NOT a blind retry. It is ONE targeted pass whose entire
# job is to challenge the failure mode the first pass is structurally
# blind to. The whole design rests on (1) the gate, (2) the targeted prompt,
# and (3) strict override criteria: never overturn a confident first pass
# on a weak second opinion.
VERIFICATION_PROMPT = """A previous analysis concluded: {first_pass}.
CHALLENGE this finding. Apply <domain-specific checks>.
Confirm the original answer IF AND ONLY IF every more-specific alternative
can be ruled out. Otherwise, propose a revised answer with justification."""

async def self_reflect(prompt, original_inputs):
    first_pass = await run_llm(prompt)

    # (1) GATE: run the verifier only when it's likely to help.
    # Typical signals: low top-1 probability, a tied vote, or a first-pass
    # answer in the class that dominates false positives.
    if not is_suspect(first_pass):
        return first_pass

    # (2) TARGETED PROMPT: narrow scope, adversarial framing, deterministic.
    verified = await run_llm(
        VERIFICATION_PROMPT.format(first_pass=first_pass.answer),
        context=original_inputs,
        temperature=0.0,
    )

    # (3) OVERRIDE CRITERIA: default to the first pass; overturn only when
    # the verifier is both DIFFERENT and BETTER-SUPPORTED than the original.
    if verified.answer == first_pass.answer:
        return first_pass
    first_weak = first_pass.confidence < CONFIDENCE_THRESHOLD
    verifier_wins = verified.confidence > first_pass.confidence + CONFIDENCE_MARGIN
    if first_weak or verifier_wins:
        return verified
    return first_pass
Intent-driven dispatch
Intent-class detection followed by dispatch to a specialized prompt. A one-size-fits-all prompt makes compromises that category-specific prompts do not.
# Detect the sub-type of input, then route to a purpose-built solver.
# A specialist with narrow scope beats a generalist burdened by every edge case.
# Three moving parts: a handler table, a cheap classifier, and an OOD fallback.

# (1) Dispatch table. Each handler bakes in rules, few-shots, and output
# constraints tuned for its category. Swapping handlers is how the system
# iterates: add a handler, rerun on the failing cluster, measure delta.
# It is defined first because the classifier's label set is derived from it.
HANDLERS = {
    "logic_puzzle": solve_with_constraint_verification,         # state table, re-check vs every constraint
    "subjective_judgment": solve_with_folk_psychology_framing,  # "what would most people say?"
    "arithmetic": solve_with_step_by_step_computation,          # show every digit; no shortcuts
    "sequential_ops": solve_with_explicit_state_table,          # track state after every op
    "multi_doc_retrieval": solve_with_targeted_retrieval,       # query per need, not a static dump
}

async def dispatch(prompt):
    # (2) Classification: a small, fast model or deterministic rule.
    # We're NOT solving the task yet; don't spend solver-grade compute here.
    category = await classify_input(
        prompt,
        labels=list(HANDLERS.keys()) + ["unknown"],
        mode="lightweight",
    )

    # (3) OOD fallback: always handle the long tail, and log it. Persistent
    # "unknown" volume is the signal to author a new handler for that cluster.
    handler = HANDLERS.get(category, solve_with_general_reasoning)
    if category not in HANDLERS:
        log_unhandled_category(prompt)
    return await handler(prompt)
Domain-contextualization
Ship formulas, rubrics, best-practices, and domain knowledge as prompt content and tools. This shifts the frame of reference to the problem space.
# Ship expert heuristics as prompt content, NOT a raw data dump.
# On SCOTUS, pasting the full codebook alone gained 0 points. Replacing
# it with a short PRIORITIZED rule list gained +20 points.
# Two properties matter: (a) priority order, and (b) meta-rules that
# govern conflicts between numbered rules.
DOMAIN_RULES = """
=== META-RULES (govern how numbered rules compose) ===
A. SUBSTANCE beats procedural vehicle. A case "about" <topic X> stays
   <category X> even when the legal machinery (preemption, RICO, First
   Amendment overlay) would normally push it elsewhere.
B. Numbered rules CANNOT be overridden by Meta-Rule A.
C. <catch-all category> is a LAST RESORT: use only when no numbered
   rule fires. Persistent use of the catch-all is a signal to author
   a new numbered rule.
=== PRIORITY RULES (applied in order; first match wins) ===
RULE 1 - <most-specific special case>: items about <narrow topic>
   belong to <category X>, regardless of surface features that
   would normally push them elsewhere.
RULE 2 - <well-known exception>: ALL items of type <Y> belong to
   <category Z>, including sub-types that look like they fit
   <adjacent category>. Tiebreaker: <sub-rule extracted from
   failure analysis of the baseline run>.
RULE 3 - <cross-cutting classification>: when <attribute A> is present,
   the classification is <category>, not the <adjacent category>
   that the model commonly confuses it with.
# … ~10 numbered rules, each grounded in a specific failure cluster
# from a baseline run. The long tail usually collapses into this many.
"""

async def classify_with_domain_rules(user_input):
    # Adversarial few-shots: pick examples that are NEAREST to the known
    # failure modes, not the easiest or most representative. This is where
    # most of the remaining gain comes from, post-rules.
    prompt = build_prompt(
        user_input,
        domain_rules=DOMAIN_RULES,
        few_shot_examples=pick_adversarial_examples(user_input),
    )
    return await run_llm(prompt, temperature=0.0)
These patterns compose: voting layered with category routing, execution feedback chained into multi-pass refinement, domain injection fused with post-processing enforcement. A human engineer might explore a handful of compositions per benchmark. Our adaptive harness can explore dozens autonomously, then converge on the combination best suited to the task.
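One way to picture this composition is to treat each pattern as a wrapper around a plain solver function, so that patterns stack like decorators. The sketch below layers voting under category routing. It is an illustration under assumptions, not the harness's actual mechanism: `with_voting`, `with_routing`, and the toy solvers are hypothetical names, and a deterministic cycle of answers stands in for a stochastic model call.

```python
import itertools
from collections import Counter
from typing import Callable, Dict

Solver = Callable[[str], str]


def with_voting(solver: Solver, n: int) -> Solver:
    """Wrap a solver so it runs n times and the most common answer wins."""
    def voted(prompt: str) -> str:
        answers = [solver(prompt) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]
    return voted


def with_routing(
    classify: Callable[[str], str],
    handlers: Dict[str, Solver],
    fallback: Solver,
) -> Solver:
    """Wrap a handler table so each prompt is sent to its category's solver."""
    def routed(prompt: str) -> str:
        return handlers.get(classify(prompt), fallback)(prompt)
    return routed


# Toy stand-ins for model calls (hypothetical): the arithmetic solver is
# "noisy" and answers wrongly on every second call.
_noisy_answers = itertools.cycle(["5", "4", "5"])


def arithmetic_solver(prompt: str) -> str:
    return next(_noisy_answers)


def general_solver(prompt: str) -> str:
    return "unsure"


def classify(prompt: str) -> str:
    return "arithmetic" if any(ch.isdigit() for ch in prompt) else "unknown"


# Composition: routing on the outside, voting only where it is worth the cost.
system = with_routing(
    classify,
    handlers={"arithmetic": with_voting(arithmetic_solver, n=3)},
    fallback=general_solver,
)
print(system("2 + 3"))
print(system("Why is the sky blue?"))
```

Because every wrapper takes and returns the same `Solver` shape, trying a new composition is a one-line change, which makes the space of combinations cheap to enumerate.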
On the benchmark methodology
All benchmarks were evaluated using the official published evaluation harnesses. The adaptive harness received only a benchmark description, the dataset, and the evaluation script, with no access to test labels during development. The training split was used for iteration; the test split was evaluated once. Results were submitted to the respective third-party leaderboards for independent verification where possible.
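The "test split was evaluated once" discipline can be enforced mechanically rather than by convention. The following is a minimal sketch of one way to do that, assuming exact-match accuracy as the metric. `OneShotTestSet` is a hypothetical helper introduced here and is not part of any official evaluation harness.

```python
from typing import Callable, Sequence


class OneShotTestSet:
    """Holds a labeled test split and allows exactly one evaluation."""

    def __init__(self, examples: Sequence, labels: Sequence) -> None:
        if len(examples) != len(labels) or not examples:
            raise ValueError("examples and labels must be non-empty and aligned")
        self._examples = list(examples)
        self._labels = list(labels)
        self._used = False

    def evaluate(self, predict: Callable) -> float:
        """Return exact-match accuracy, then lock the split against reuse."""
        if self._used:
            raise RuntimeError("test split already evaluated; iterate on train instead")
        self._used = True
        correct = sum(predict(x) == y for x, y in zip(self._examples, self._labels))
        return correct / len(self._labels)


# Toy data: the "system" predicts parity and is wrong on the last example.
test_split = OneShotTestSet(examples=[1, 2, 3, 4], labels=[1, 0, 1, 1])
accuracy = test_split.evaluate(lambda x: x % 2)
print(accuracy)
```

A second call to `evaluate` raises `RuntimeError`, so any further iteration has to go back to the training split.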
Underlying model selection spans the frontier models available at the time of exploration, including Opus 4.6, GPT-5.4, and Gemini 3.1 Pro, depending on task characteristics.
Citation
If you’d like to cite this work:
Saluru, Durga Sandeep and Distyl AI, “An Adaptive Harness to Self-Construct State-of-the-Art Systems”, Distyl AI Blog, Apr 2026.
@article{saluru2026button,
  author  = {Durga Sandeep Saluru and Distyl AI},
  title   = {An Adaptive Harness to Self-Construct State-of-the-Art Systems},
  journal = {Distyl AI Blog},
  year    = {2026},
  month   = {Apr},
}
1 Results as of March 31, 2026.


