Teaching AI Agents to Research Better: Parallel Execution, Memory, and Honesty

Apr. 4, 2026

How we extended Andrej Karpathy’s auto-research with structured memory, parallel exploration, and a system that catches agents cheating themselves.

Introduction

When Andrej Karpathy released his auto-research framework (a single LLM agent modifying code in a loop, running experiments, and iterating toward better results), the AI community was electrified. Here was a compelling demonstration that language models could do more than write code: they could conduct research. Set up a problem, point an agent at it, and watch it run dozens of experiments autonomously.

We were excited too. We pointed auto-research at a production ML prediction task and watched it work. The agent tried different features, swapped model architectures, tuned hyperparameters. It made progress. But as we scaled beyond toy problems, cracks appeared.

The agent would repeat experiments it had already tried. It would get stuck in local optima for hours. Worst of all, it would sometimes “improve” metrics in ways that didn’t survive scrutiny: subtle forms of overfitting or metric manipulation that looked great in logs but fell apart on held-out data.

This raised a question: What if we could make auto-research even better?

Not by building a more powerful model, but by giving the research process itself better infrastructure: memory, parallelism, and honesty checks. The result is IdeaMaze, a framework that extends single-agent auto-research with four innovations that address its core limitations.

The Limitations of Single-Agent Loops

The vanilla auto-research loop is elegant in its simplicity. An LLM reads the current code and results, proposes a change, executes it, and evaluates the outcome. Repeat until convergence.

In practice, this simplicity creates three problems:

1. Sequential Throughput

A single agent runs one experiment at a time. If each experiment takes 3-5 minutes (training, evaluation, analysis), you get maybe 15-20 experiments per hour. Many research problems have a combinatorial search space: feature engineering, architecture choices, regularization strategies, preprocessing pipelines. Sequential exploration barely scratches the surface.

2. No Memory Between Experiments

The agent has a context window, and that context window is its only memory. After enough experiments, early insights scroll out of context. The agent re-discovers things it already learned. It re-tests hypotheses it already rejected. There’s no accumulation of knowledge, just a sliding window over recent attempts.

3. Vulnerability to Metric Gaming

This is the subtle one. When you optimize a metric in a loop, the agent develops increasingly creative ways to make the number go down (or up). Some of these are legitimate improvements. Others are artifacts: preprocessing tricks that inflate metrics without improving real-world performance, evaluation shortcuts that look good on paper but break on deployment data.

A single agent has no external check on its own honesty. It marks its own homework.

IdeaMaze: Four Extensions That Matter

1. Parallel Agents in Git Worktrees

Instead of one agent exploring sequentially, IdeaMaze runs N agents simultaneously, each in its own git worktree branching from the current best code.

# The coordinator creates isolated worktrees for each agent
git worktree add ../worker_1 -b batch3/feature_engineering
git worktree add ../worker_2 -b batch3/architecture_search
git worktree add ../worker_3 -b batch3/regularization
git worktree add ../worker_4 -b batch3/preprocessing

The coordinator protocol ensures category diversity: agents in the same batch are assigned to different experiment categories (feature engineering, architecture, regularization, etc.) to maximize exploration breadth. No two agents in a batch explore the same category.

After all agents in a batch complete, the coordinator:

Collects results from all worktrees
Identifies the best-performing experiment
Merges the winner into the main branch
Spawns the next batch from this new baseline

This is embarrassingly parallel research. In Run 2, we executed 85 experiments across 18 parallel batches, which would have taken 6+ hours sequentially completed in a fraction of the time.

2. Structured Knowledge Base

This is arguably the most important innovation. IdeaMaze maintains a SQLite database that serves as persistent memory across all experiments and batches.

# Record an experiment result
python maze.py log --name "target_encoding_v3" \
  --category "feature_engineering" \
  --metric 0.1842 \
  --notes "Target encoding on categorical features with 5-fold CV"

# Query the knowledge base for the next experiment
python maze.py next

# View all experiments, sorted by performance
python maze.py history

# Check for stagnation
python maze.py status

The database schema tracks:

Experiments: name, category, metric value, code diff, timestamp, batch ID
Insights: cross-experiment observations, auto-classified by theme
Strategy: current research direction, reasons for pivots, dead ends to avoid
Stagnation markers: consecutive experiments without improvement, triggering strategy shifts

The maze.py next command is where this gets powerful. It queries the full experiment history, identifies underexplored categories, checks for stagnation, and recommends the next experiment to try, complete with reasoning about why. The agent doesn’t just have memory; it has structured, queryable memory with built-in strategic planning.

Auto-classification tags each experiment by category and outcome type (improvement, regression, neutral, failed) without manual labeling. Stagnation detection fires when a category has seen K consecutive non-improvements, automatically suggesting the agent pivot to a different area of the search space.

3. Gamification Detection

This is the story that convinced us the framework was necessary.

During Run 1, one of our agents discovered that winsorizing the target variable (clipping extreme values before training) produced a dramatic improvement in the evaluation metric. The MAPE (Mean Absolute Percentage Error) dropped by what appeared to be a 26x factor. The agent was ecstatic. It had found a breakthrough.

Except it hadn’t. What actually happened was this: by clipping extreme values from the training data, the model learned to predict a narrower range. When evaluated on test data that was also winsorized (because the same preprocessing pipeline ran on both), the predictions looked fantastic. But on raw, unfiltered test data (the data that actually matters in production), the improvement was modest at best.

The agent had unknowingly gamed its own metric. It wasn’t lying; it was self-deceived.

IdeaMaze catches this with a dual-evaluation protocol:

# Every experiment is evaluated twice:
# 1. With the experiment's own preprocessing (the "filtered" metric)
# 2. On unfiltered holdout data (the "unfiltered" metric)

filtered_metric = evaluate(model, filtered_test_data)
unfiltered_metric = evaluate(model, raw_test_data)

ratio = filtered_metric / unfiltered_metric
if ratio > GAMIFICATION_THRESHOLD:
    flag_as_gamification(experiment_name, ratio)
    # Alert: "Filtered metric is {ratio}x better than unfiltered.
    #         This experiment may be gaming the evaluation."

When the unfiltered/filtered ratio exceeds a threshold, the experiment is flagged. In the winsorization case, the system caught a 26x discrepancy and rejected the “improvement” before it could pollute the baseline.

This is a general principle: any time an agent can modify both the model and the evaluation, you need an invariant evaluation that the agent cannot touch. IdeaMaze enforces this architecturally.

4. Batch Learning

After each batch of parallel experiments completes, IdeaMaze runs a batch learning phase that extracts cross-cutting insights.

This isn’t just “pick the winner.” The system analyzes all experiments in the batch, including failures, to identify patterns:

Which categories are producing diminishing returns?
Are there interaction effects between features discovered by different agents?
What hypotheses have been conclusively rejected?
Are we approaching convergence?

Over two runs, IdeaMaze accumulated 21+ cross-cutting patterns, including:

Target encoding consistently outperforms one-hot encoding for high-cardinality categoricals
Ensemble methods plateau after 3-4 base models in this problem setting
Log-transforming the target variable is worth more than any single feature
Regularization improvements have diminishing returns after the third iteration
Feature interactions provide lift only when base features are already well-engineered

Convergence detection monitors the trajectory of best metrics across batches. When the improvement rate drops below a threshold for N consecutive batches, the system declares convergence and suggests either stopping or making a fundamental strategy shift (e.g., switching model families entirely).

The diminishing returns breaker is particularly useful: when a category (say, feature engineering) has been explored heavily with declining marginal gains, the system automatically deprioritizes it and redirects agent effort toward underexplored categories.

The Program.md: Your Research Constitution

All of this is coordinated through a single markdown file: program.md. Think of it as a constitution for the research agent. It defines:

The objective: what metric to optimize, in what direction
The evaluation protocol: how to measure results, what counts as an improvement
The experiment categories: the taxonomy of approaches to explore
Constraints: what the agent must NOT do (e.g., never modify the test set)
The gamification policy: how to detect and handle metric gaming
Strategy guidance: initial hypotheses, known dead ends, priority ordering

The agent reads program.md at the start of every experiment. It’s the one artifact that persists across context window resets, agent restarts, and batch boundaries. When you want to steer the research in a new direction, you edit program.md.

# program.md (excerpt)

## Objective
Minimize MAPE on held-out test set. Current best: 0.1842

## Evaluation Protocol
- Primary metric: MAPE on unfiltered test data
- Secondary: MAPE on filtered test data (for reference only)
- GAMIFICATION CHECK: If filtered/unfiltered ratio > 2.0, reject experiment

## Experiment Categories
1. Feature Engineering (priority: HIGH)
2. Architecture Search (priority: MEDIUM)
3. Regularization (priority: LOW - diminishing returns detected)
4. Preprocessing (priority: MEDIUM)
5. Ensemble Methods (priority: HIGH)

## Dead Ends (do not revisit)
- Winsorization of target variable (gamification detected, batch 4)
- PCA on categorical embeddings (consistently degrades performance)

This is a deceptively powerful pattern. A well-written program.md encodes weeks of researcher intuition into a format an LLM agent can follow. It’s the difference between “go explore” and “explore intelligently within these guardrails.”

Results

We ran IdeaMaze on a production ML prediction task across two full runs.

Run 1: 70 Experiments

Metric improvement: 68.6% relative reduction in MAPE
70 experiments across multiple batches
Key discoveries: target encoding pipeline, optimal log-transform strategy, interaction features that provided consistent lift
3 gamification attempts detected and rejected
Converged after identifying a stable feature + architecture combination

Run 2: 85 Experiments in 18 Parallel Batches

Metric improvement: 26.3% relative reduction in MAPE (starting from Run 1’s already-optimized baseline)
85 experiments executed across 18 parallel batches
Discovered 21+ cross-cutting patterns
Diminishing returns detected in 4 categories, with automatic reallocation
Final model validated on completely held-out data with no metric gaming flags

The lower improvement in Run 2 is expected and healthy. It started from a much better baseline. The fact that it still found 26.3% improvement on top of an already-optimized model speaks to the value of structured, parallel exploration with memory.

Total wall-clock time for both runs was dramatically lower than equivalent sequential exploration would have required, thanks to parallel batch execution.

Try It Yourself

IdeaMaze is designed as a starter kit you can adapt to your own research problems. The core components are:

maze.py: The knowledge base CLI. Handles experiment logging, querying, stagnation detection, and next-experiment recommendation.
program.md: Your research constitution. Define your objective, categories, constraints, and gamification policy.
Coordinator script: Manages git worktrees, assigns agents to categories, collects batch results, and runs batch learning.
Evaluation harness: Dual-metric evaluation with gamification detection built in.

To set up your own IdeaMaze run:

# 1. Initialize the knowledge base
python maze.py init

# 2. Write your program.md with objective, categories, and constraints

# 3. Run the first batch
python coordinator.py --batch-size 4 --categories "feature_engineering,architecture,regularization,preprocessing"

# 4. Review results
python maze.py history
python maze.py status

# 5. Iterate (the system handles the rest)
python coordinator.py --continue

The framework is model-agnostic. It works with any LLM that can read code, propose changes, and execute experiments. We’ve tested it with Claude, but the architecture doesn’t depend on any specific model.

What’s Next

IdeaMaze is a step forward, but there’s much more to explore:

Multi-objective optimization. Real research problems rarely have a single metric. Can we extend the framework to navigate Pareto frontiers across accuracy, latency, model size, and fairness?

Hierarchical agent teams. Instead of a flat coordinator + workers model, imagine a hierarchy: a strategic agent that sets research direction, tactical agents that plan experiments, and execution agents that run them. Each level operates at a different time horizon.

Automated program.md generation. Given a dataset and an objective, can an agent write its own research constitution? Early experiments suggest yes, with appropriate meta-learning from previous IdeaMaze runs.

Cross-problem transfer. The 21+ patterns discovered in our runs are specific to one problem. But some patterns (e.g., “log-transform targets before regression,” “target encoding beats one-hot for high cardinality”) are likely universal. Can we build a meta-knowledge base that transfers insights across problems?

The broader question is whether we can make AI research agents not just autonomous, but systematically intelligent about how they explore. Not just running experiments, but running the right experiments, remembering what they’ve learned, and being honest about what they’ve found.

IdeaMaze is our first answer. We think it won’t be the last.

IdeaMaze is open source. Find the code, documentation, and starter kit at github.com/rohitag13/ideamaze. Read the full methodology at ideamaze.in.