Research

Distyl Research: Systems to Build Systems

Distyl’s research arm is a systems shop built on the old intuition that intelligence emerges not only from the cognitive unit but also from the surrounding structure. Our work spans topics like system self-improvement, structured world-models, system self-construction, use case discovery, boundary-case synthetic data generation, and more.

Our Work

Selected research

Proxifield: Decentralized Multi-Agent Communication through a Dynamic Topology

Multi-agent communication has often adopted a centralized, static approach in which an orchestrator relays information to its sub-agents so that they can collaboratively perform a task. However, a top-down communication paradigm does not suit every deployment, particularly those with heterogeneous agents and tasks, large agent populations, or long run times. Inspired by localized coordination dynamics in bird flocks and insect swarms, we introduce Proxifield, a decentralized communication methodology that dynamically connects agents to their nearest neighbors based on embedding similarity across task status, role, and interaction history. This adaptive topology enables agents to exchange information through local neighborhoods rather than a fixed hierarchy, allowing coordination patterns to emerge from the evolving structure of the task itself.

Ongoing
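
The neighbor-selection idea described for Proxifield can be illustrated with a minimal sketch. It assumes each agent already has a single embedding vector summarizing its task status, role, and interaction history, and it uses cosine similarity with a fixed neighborhood size `k`. The function names, the toy vectors, and the choice of cosine similarity are illustrative assumptions, not details taken from the project.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbors(embeddings, k):
    """Map each agent id to the ids of its k most similar peers."""
    topology = {}
    for agent, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != agent
        ]
        # Highest similarity first; ties broken by agent id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        topology[agent] = [other for _, other in scored[:k]]
    return topology

# Hypothetical agent embeddings (in practice these would come from an embedding model).
agents = {
    "planner": [1.0, 0.0, 0.0],
    "coder": [0.9, 0.1, 0.0],
    "tester": [0.8, 0.3, 0.0],
    "writer": [0.0, 0.0, 1.0],
}
topology = nearest_neighbors(agents, k=2)
```

Recomputing `topology` whenever the embeddings change is one simple way to make the communication graph track the state of the task rather than a fixed hierarchy.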

ARES: Agentic Research Employees

ARES is a Distyl-internal initiative to build a team of always-on research agents that automates the work of an AI researcher focused on publishing. The system continuously monitors the research landscape, spanning academic literature and real-time social signals, and distills findings into a range of knowledge artifacts: executive briefings, meta-analyses, podcasts, review papers, and blog posts. Users can subscribe to research topics and receive periodic briefings and delta reports as the landscape evolves. The project is grounded in a prioritization framework that ranks hypotheses and findings by potential impact, business value, and testability, and is designed to integrate naturally with Distyl's broader agentic platform.

Ongoing

Upward Mobility: Generating Complex and Realistic Benchmarks through Evolutionary Data Degradation

Synthetic data is increasingly used to produce evaluation data in workflows where real data is restricted for privacy or cost reasons. However, naive synthetic data generation often fails in one of two ways: (a) the data is too clean and too easy for the model to solve, or (b) the data is challenging for SOTA models, but only through inorganic, manufactured complexity. We address this with an evolutionary degradation procedure that mutates initially clean synthetic data along domain-specific axes of complexity, transforming it into data that reflects the complexities of real-world enterprise workflows. The search is guided by bespoke guardrails that ensure the mutations increase difficulty while maintaining realism and the correctness of each datapoint.

Ongoing
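
A toy version of a guarded evolutionary degradation loop might look like the sketch below. Everything in it is a stand-in: the three string mutations, the guardrail (digits must survive and at most 30% of the length may be lost), and the difficulty proxy (dissimilarity to the clean seed) are hypothetical choices made for illustration, not the project's actual operators or metrics.

```python
import difflib
import random

# Hypothetical mutation operators acting along simple "axes of complexity".
def abbreviate(text, rng):
    return text.replace("Street", "St")

def lowercase(text, rng):
    return text.lower()

def drop_char(text, rng):
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

MUTATIONS = [abbreviate, lowercase, drop_char]

def digits(text):
    return [c for c in text if c.isdigit()]

def guardrail(seed, candidate):
    """Assumed correctness/realism check: keep all digits, lose at most 30% of the length."""
    return digits(seed) == digits(candidate) and len(candidate) >= 0.7 * len(seed)

def difficulty(seed, candidate):
    """Assumed difficulty proxy: how dissimilar the candidate is from the clean seed."""
    return 1.0 - difflib.SequenceMatcher(None, seed, candidate).ratio()

def evolve(seed, generations=3, population=4, rng=None):
    rng = rng or random.Random(0)
    pool = [seed]
    for _ in range(generations):
        children = [mutate(parent, rng) for parent in pool for mutate in MUTATIONS]
        # Discard children that violate the guardrail; survivors compete with their parents.
        valid = {c for c in children if guardrail(seed, c)} | set(pool)
        pool = sorted(valid, key=lambda c: (-difficulty(seed, c), c))[:population]
    return pool

seed = "12 Main Street, Apt 4"
hardened = evolve(seed)
```

In a realistic setting the mutations and guardrails would likely be model-driven and domain-specific. The point of the sketch is only the shape of the loop: mutate, filter by guardrail, then select for difficulty.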

MRKRs: Deterministic Provenance for Multi-Agent LLM Systems

As LLM workflows become longer and more agentic, source attribution often breaks down: citations drift across model calls, document-level links fail to identify the exact supporting passage, and models can fabricate plausible-looking references. MRKRs address this by injecting opaque citation markers directly into source text before it enters the model pipeline. Because models copy existing markers rather than inventing new ones, each factual claim can retain a deterministic link back to the exact source passage it came from. The canonical artifact is the MRKRized text itself, not a structured JSON citation layer; companion match-hint files exist only for rendering, lookup, and highlighting. This makes MRKRs especially useful for multi-stage research, retrieval, and graph workflows where outputs need to be rolled back up to source documents, deduplicated, and reused as focused citable context downstream.

Ongoing
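
The marker-injection idea can be sketched in a few lines. The marker syntax `[[m:xxxxxxxx]]`, the sentence-level granularity, and the helper names below are assumptions made for illustration; the description above does not specify the actual MRKR format.

```python
import hashlib
import re

# Assumed marker syntax: [[m:<8 hex chars>]]
MARKER = re.compile(r"\[\[m:([0-9a-f]{8})\]\]")

def mark_passages(doc_id, text):
    """Append an opaque marker to each sentence and return (marked_text, index)."""
    marked, index = [], {}
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for n, sentence in enumerate(sentences):
        tag = hashlib.sha256(f"{doc_id}:{n}:{sentence}".encode()).hexdigest()[:8]
        index[tag] = (doc_id, sentence)
        marked.append(f"{sentence}. [[m:{tag}]]")
    return " ".join(marked), index

def resolve(output, index):
    """Split markers found in model output into known sources and unknown tags."""
    sources, unknown = [], []
    for tag in MARKER.findall(output):
        if tag in index:
            sources.append(index[tag])
        else:
            unknown.append(tag)
    return sources, unknown

marked_text, index = mark_passages("doc1", "Revenue rose 4%. Costs fell.")
first_tag = next(iter(index))

# Simulated model output: one copied marker and one fabricated marker.
output = f"Revenue increased [[m:{first_tag}]] and margins doubled [[m:deadbeef]]."
sources, unknown = resolve(output, index)
```

Because a marker either resolves through the index or does not, a fabricated reference shows up as an unknown tag instead of passing as a plausible citation.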

An Adaptive Harness to Self-Construct State-of-the-Art Systems

Distyl’s adaptive harness automatically constructs task-specific AI systems by searching over architectures, prompts, model choices, routing, verification, and execution strategies. Given a benchmark and evaluator, it iteratively tests candidate systems, analyzes failures, and refines designs until performance improves, converting latent frontier-model capability into state-of-the-art results with less manual engineering.

Read on our blog

VoicEmu: Modeling the Tails of Human Speech

Multi-modal foundation models fail at classifying paralinguistic attributes: their predictions are biased and drift unpredictably between model checkpoints. We show that simple embedding geometry (PCA-whitened centroids) can be used to classify accents at ~93% accuracy where frontier models reach ~39%. We then use the same embedding backbone to drive a generation pipeline that stress-tests voice agents against long-tail users before they reach production. In an internal ablation study, we reproduced production findings where individual paralinguistic attributes increase conversation failure rates by 10–19pp relative to a neutral baseline; layered together, the failure rate increased by up to ~30pp.

Ongoing
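
One plausible reading of "PCA-whitened centroids" is a nearest-centroid classifier in a whitened embedding space. The sketch below assumes NumPy, Euclidean distance, and randomly generated toy vectors in place of real speech embeddings; none of these details come from the project itself.

```python
import numpy as np

def fit_whitened_centroids(X, y, eps=1e-6):
    """Fit a PCA whitening transform on X and compute one centroid per label."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each principal direction to unit variance; clip guards against tiny negative eigenvalues.
    W = eigvecs / np.sqrt(np.clip(eigvals, 0.0, None) + eps)
    Z = (X - mean) @ W
    labels = sorted(set(y))
    y_arr = np.array(y)
    centroids = np.stack([Z[y_arr == label].mean(axis=0) for label in labels])
    return mean, W, labels, centroids

def predict(X, mean, W, labels, centroids):
    """Assign each row of X to the label of the nearest whitened centroid."""
    Z = (X - mean) @ W
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return [labels[i] for i in dists.argmin(axis=1)]

# Toy stand-in for speech embeddings: two classes separated along dimension 1,
# with a high-variance nuisance direction along dimension 0.
rng = np.random.default_rng(0)
scales = np.array([5.0, 1.0, 1.0, 1.0])
class_a = rng.normal(size=(200, 4)) * scales + np.array([0.0, -3.0, 0.0, 0.0])
class_b = rng.normal(size=(200, 4)) * scales + np.array([0.0, 3.0, 0.0, 0.0])
X = np.vstack([class_a, class_b])
y = ["accent_a"] * 200 + ["accent_b"] * 200

model = fit_whitened_centroids(X, y)
predictions = predict(X, *model)
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
```

Whitening rescales every principal direction to unit variance, so a high-variance nuisance direction such as dimension 0 here does not dominate the distance to each centroid.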

PrefPO: Pairwise Preference Prompt Optimization

PrefPO is a lightweight prompt optimization framework that uses pairwise preferences to automatically improve prompts without labeled datasets. It matches or exceeds state-of-the-art performance while producing prompts that are cleaner, more readable, and easier to iterate on.
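
As a rough illustration of optimizing with pairwise preferences instead of labeled data, the sketch below runs a simple hill climb: propose a challenger prompt, ask a judge which of the two it prefers, and keep the winner. The `propose` and `prefer` functions are toy, rule-based stand-ins for what would normally be LLM calls, and the loop is a generic pattern rather than PrefPO's actual algorithm.

```python
FILLERS = ["please ", "kindly ", "very "]

def propose(prompt, step):
    """Toy proposer: remove one filler phrase per step (stand-in for an LLM rewrite)."""
    return prompt.replace(FILLERS[step % len(FILLERS)], "", 1)

def prefer(a, b):
    """Toy judge: prefer the shorter prompt that still mentions JSON; ties keep `a`."""
    candidates = [p for p in (a, b) if "JSON" in p]
    return min(candidates, key=len) if candidates else a

def optimize(prompt, propose, prefer, rounds=5):
    """Keep whichever prompt the judge prefers after each pairwise comparison."""
    best = prompt
    for step in range(rounds):
        challenger = propose(best, step)
        if prefer(best, challenger) == challenger:
            best = challenger
    return best

start = "please kindly reply in very strict JSON"
optimized = optimize(start, propose, prefer)
```

The loop needs only a relative judgment between two prompts at each step, which is why no labeled dataset appears anywhere in the sketch.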

Environment Maps: Structured Environmental Representations for Long-Horizon Agents

Environment Maps is a persistent, agent-agnostic memory framework that converts unstructured multimodal evidence of environment transitions and SME tacit knowledge into a structured graph. This gives LLM agents a human-interpretable, editable map of their operating environment, nearly doubling task success on WebArena (28.2% vs. 14.2% baseline) and outperforming agents with access to the raw unstructured evidence.

arXiv
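
To make the idea of a structured, editable environment graph concrete, the sketch below stores observed transitions as a nested dictionary and uses breadth-first search to find an action sequence between two states. The state names, action strings, and the `(state, action, next_state)` triple format are hypothetical; the actual Environment Maps schema is described in the paper.

```python
from collections import deque

def build_map(transitions):
    """Build {state: {action: next_state}} from (state, action, next_state) triples."""
    graph = {}
    for state, action, next_state in transitions:
        graph.setdefault(state, {})[action] = next_state
        graph.setdefault(next_state, {})
    return graph

def plan(graph, start, goal):
    """Return the shortest action sequence from start to goal, or None if unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, next_state in graph.get(state, {}).items():
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, actions + [action]))
    return None

# Hypothetical transitions, e.g. extracted from recordings of a web application.
env_map = build_map([
    ("home", "click Orders", "orders"),
    ("orders", "open #12", "order_detail"),
    ("home", "click Settings", "settings"),
])
route = plan(env_map, "home", "order_detail")
```

Because the map is plain data, a subject-matter expert could inspect or correct an entry directly, for example by adding a missing transition, without retraining anything.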

Lattice: Generative Guardrails for Conversational Agents

Lattice is a self-constructing, continuously improving guardrail framework that builds guardrails through iterative simulation and then autonomously adapts them via risk assessment and adversarial testing, achieving strong gains over existing methods and further improving through closed-loop optimization.

How Many Instructions Can LLMs Follow at Once?

Production-grade LLM systems require robust adherence to dozens or even hundreds of instructions simultaneously. However, the instruction-following capabilities of LLMs at high instruction densities have not yet been characterized, as existing benchmarks only evaluate models on tasks with one or a few instructions. We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business report writing task to measure how instruction-following performance degrades as instruction density increases. We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the maximum density of 500 instructions.

arXiv
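
Scoring keyword-inclusion instructions reduces to checking which required keywords appear in the generated report. The sketch below assumes case-insensitive, whole-word matching of single-word keywords; the benchmark's exact matching rules may differ, and the function name and sample text are illustrative.

```python
import re

def keyword_accuracy(report, keywords):
    """Fraction of required keywords that appear as whole words in the report."""
    words = set(re.findall(r"[a-z0-9']+", report.lower()))
    followed = [k for k in keywords if k.lower() in words]
    return len(followed) / len(keywords)

# Hypothetical report and instruction set ("include the word X").
report = "Quarterly revenue grew while churn declined."
keywords = ["revenue", "churn", "margin", "forecast"]
score = keyword_accuracy(report, keywords)
```

Raising the number of keywords while holding the writing task fixed is what lets instruction density vary on its own.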

GenEdit: Compounding Operators and Continuous Improvement to Tackle Text-to-SQL in the Enterprise

Recent advancements in Text-to-SQL, driven by large language models, are democratizing data access. Despite these advancements, enterprise deployments remain challenging due to the need to capture business-specific knowledge, handle complex queries, and meet expectations of continuous improvements. To address these issues, we designed and implemented GenEdit: our Text-to-SQL generation system that improves with user feedback. GenEdit builds and maintains a company-specific knowledge set, employs a pipeline of operators decomposing SQL generation, and uses feedback to update its knowledge set to improve future SQL generations.

arXiv

The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models

Schema linking is a crucial step in Text-to-SQL pipelines. Its goal is to retrieve the relevant tables and columns of a target database for a user's query while disregarding irrelevant ones. However, imperfect schema linking can exclude columns that are required for accurate query generation. In this work, we revisit schema linking when using the latest generation of large language models (LLMs). Our approach ranks first on the BIRD benchmark, achieving an accuracy of 71.83%.

arXiv

End-to-end Text-to-SQL Generation within an Analytics Insight Engine

Recent advancements in Text-to-SQL have pushed database management systems towards greater democratization of data access. Today's language models are at the core of these advancements. They enable impressive Text-to-SQL generation as experienced in the development of Distyl AI's Analytics Insight Engine. We give an overview of our end-to-end approach and highlight the operators generating SQL during inference.

arXiv