Rainier — recursive research system

design discussion · drafted 2026-05-18 · iterating in this file

1. Vision anchor

The reference frame is Yuandong Tian's:

Model + Harness = Self-Improvement. Mirrors AlphaGo (policy net + MCTS), now applied to LLMs that propose research, with a harness that proves or rejects it.

The whole question of a recursive AI lab is: what does the harness do, and what does it refuse to do? Rainier's harness already has most of the moving parts: screener, evaluator, insights table, YAML mutators, weekly auto-research. The recursive evolution lies in (a) opening the hypothesis space beyond the six hard-coded checks, (b) tightening the offline-evaluation loop, and (c) keeping the live ranking gated while the lab runs unstoppably underneath.

The danger to avoid: a harness that allows the loop to mutate the parts that score truth (the evaluator, the holdout). That's how recursive systems self-deceive. The evaluator runs at repo HEAD and is never mutated by the loop.

2. The recursive-AI-lab playbook

Every system surveyed is some specialization of:

generate(hypothesis)  →  evaluate(deterministic, walk-forward, holdout-clean)
        ▲                              │
        │                              ▼
   archive (keep diverse) ◄────── score (multi-objective vector)
        │                              │
        ▼                              ▼
   bandit (which arm gets budget?)  gate (auto / shadow / human)

The five things that change between systems are:

  1. What's a hypothesis? A Python function (FunSearch, Eureka), a YAML diff (rainier today), a natural-language idea (AI Scientist), a full research plan (RD-Agent), a composed skill (Voyager).
  2. What's the evaluator? Deterministic puzzle scorer (FunSearch), GPU-sim reward (Eureka), backtest engine (RD-Agent, Alpha-GPT), human review (AI Scientist v1).
  3. How is diversity preserved? Islands (FunSearch), MAP-Elites cells (quality-diversity literature), behavioral characterization (POET), archive distillation (Voyager).
  4. Who picks the next try? Best-so-far prompt (FunSearch), curriculum + bandits (Voyager, Language Self-Play), tree search (AI Scientist v2), human (Alpha-GPT 1.0).
  5. What's allowed to change live behavior? Everything (full RSI vision), nothing (Alpha-GPT human-in-loop), threshold-gated subset (RD-Agent shadow-mode, the rainier proposal here).

3. Prior art, ranked by adoptability for rainier

| # | System | One-line | What we steal |
|---|---|---|---|
| 1 | RD-Agent(Q) (Microsoft, 2025) | Multi-agent quant R&D: hypothesis → spec → code-gen → backtest → knowledge-forest. 2× CSI300 returns w/ 70% fewer factors, <$10/run. | The structured hypothesis → task spec → code-gen → backtest → archive loop. Closest published analog to rainier's existing stack. Knowledge-forest dedup ≈ your ResearchInsight.recurrence_count. |
| 2 | FunSearch (DeepMind, Nature 2023) | Frozen LLM proposes code → deterministic evaluator scores → island GA evolves population. | Islands topology (your 6 insight kinds become 6 islands). Strict rule: only the deterministic evaluator promotes; LLM never grades itself. |
| 3 | AlphaEvolve (DeepMind, 2025) | Multi-objective code evolution with meta-evolution of the search strategy itself. | Evolve scoring modules against a metric vector (return, IC, turnover, drawdown, regime-robustness, explanation quality), not raw return. |
| 4 | AI Scientist v2 (Sakana, 2025) | Agentic tree search: ideate → code → eval → write → reviewer-LLM → next cycle. | The "research card" artifact: one durable record per hypothesis (rationale, expected failure mode, evaluator output, lineage). Replaces freeform research notes. |
| 5 | Voyager (MineDojo / NVIDIA, 2023) | Ever-growing code-skill library + automatic curriculum + iterative prompting with execution feedback. | A skill library of interpretable Python signal primitives the LLM composes, instead of writing arbitrary code. Curriculum picks next gap from archive coverage. |
| 6 | Eureka (NVIDIA / UPenn, 2023) | LLM writes reward code, GPU-sim scores it, evolutionary loop improves. Outperforms human-engineered rewards on 83% of 29 RL tasks. | The reward-function-as-evolvable-code idea. For rainier: the composite-score weighting is an evolvable Python module, not a static YAML. |
| 7 | Language Self-Play (2025) | Single LLM in two roles (Challenger ↔ Solver) generates increasingly hard tasks for itself. | Carefully: a challenger that picks adversarial regimes (drawdown windows, sector rotations, regime breaks) the solver must survive. Note the collapse-after-few-iterations warning: KL reg + self-reward needed. |
| 8 | MAP-Elites (Mouret & Clune 2015 et al.) | Quality-diversity archive: behavioral feature space discretized into cells; keep best-so-far in each cell. | The archive shape. Cells indexed by (sector, holding_horizon, signal_density, regime_sensitivity, data_family). Research archive, not live allocator. |
| 9 | Alpha-GPT / AlphaAgent / QuantaAlpha (2023–2025) | LLM-driven alpha-factor mining at scale. | Decay-resistance gate (AlphaAgent), trajectory-level self-evolution (QuantaAlpha: each run is a path, not a point), hierarchical RAG over screener history (Alpha-GPT). Your ScreenedStockRecord table is already the RAG corpus. |
| 10 | Anti-overfitting discipline (López de Prado et al.) | Deflated Sharpe, purged k-fold + embargo, walk-forward, untouchable holdout. QuantBench framing. | Non-negotiable. Every promotion clears Deflated-Sharpe ≥ threshold AND stable IC AND acceptable turnover AND multiple walk-forward folds. Otherwise the loop discovers noise on ~10K obs/year/signal. |

4. Proposed architecture

Both codex passes and I converge on this skeleton. Each layer below is annotated with which prior art it borrows from.

                       ┌─────────────────────────────────┐
                       │   SKILL LIBRARY (Voyager-style) │
                       │  ~30 interpretable Python prims │
                       │  composable by LLM via typed DSL│
                       └─────────────┬───────────────────┘
                                     │
                       ┌─────────────▼───────────────────┐
                       │ HYPOTHESIS GENERATOR             │
                       │  6 islands (one per insight kind)│
                       │  LLM composer + DEAP-style GP    │
                       │  emits ResearchHypothesis cards  │
                       │  (FunSearch + RD-Agent + AI Sci) │
                       └─────────────┬───────────────────┘
                                     │
                       ┌─────────────▼───────────────────┐
                       │  DETERMINISTIC EVALUATOR         │
                       │  walk-forward + purged CV +      │
                       │  embargo → Deflated Sharpe       │
                       │  RUNS AT REPO HEAD; NEVER        │
                       │  MUTATED BY THE LOOP             │
                       │  (AlphaEvolve metric vector)     │
                       └─────────────┬───────────────────┘
                                     │
                       ┌─────────────▼───────────────────┐
                       │ MAP-ELITES ARCHIVE              │
                       │  cells: sector × horizon ×       │
                       │  signal_density × regime × family│
                       │  best champion per cell          │
                       └─────────────┬───────────────────┘
                                     │
                       ┌─────────────▼───────────────────┐
                       │ THOMPSON BANDIT — research budget│
                       │  arms: islands, prompts, mutator │
                       │    families                      │
                       │  reward: novelty × evaluator     │
                       │          × shadow-survival       │
                       │  NOT capital allocation (yet)    │
                       └─────────────┬───────────────────┘
                                     │
              ┌──────────────────────┼──────────────────────┐
              │                      │                      │
   ┌──────────▼────────┐   ┌─────────▼────────┐   ┌─────────▼────────┐
   │ AUTO-APPLY tier   │   │ SHADOW-LIVE tier │   │ HUMAN-GATE tier  │
   │ docs, archive,    │   │ scorer runs in   │   │ live ranking,    │
   │ rejected-memory   │   │ shadow alongside │   │ sizing, exclusion│
   │                   │   │ production for   │   │ — operator       │
   │                   │   │ 4–8 weeks min    │   │ accept via       │
   │                   │   │                  │   │ research-card UI │
   └───────────────────┘   └──────────────────┘   └──────────────────┘

Layer-by-layer

L1 — Skill library borrows Voyager + Alpha-GPT

~30 small interpretable Python signal primitives written by the operator, each typed (price→float, price+volume→signal, rank→ordinal, etc.). The LLM composes them via a typed DSL; it does not write arbitrary code. New primitives are operator-added; the LLM can request a primitive but cannot install it.

Example primitives: rank_trajectory(N=5), capital_flow_streak(N), sector_relative_rank, volatility_compression(window), thesis_confidence, catalyst_proximity(days), regime_indicator, liquidity_floor(adv).
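A minimal sketch of how an operator-owned, typed primitive registry could look. Everything here is an assumption for illustration: the `Primitive` and `SkillRegistry` names, the string-valued type tags, and the body of `volatility_compression` (the document names the primitive but not its definition). The point is only that composition is type-checked and that registration is a separate, operator-side call.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Primitive:
    name: str
    in_type: str    # hypothetical type tags, e.g. "price", "float"
    out_type: str
    fn: Callable[..., float]


class SkillRegistry:
    def __init__(self) -> None:
        self._prims: Dict[str, Primitive] = {}

    def register(self, prim: Primitive) -> None:
        """Operator-side only; the LLM composer never calls this."""
        if prim.name in self._prims:
            raise ValueError(f"primitive already registered: {prim.name}")
        self._prims[prim.name] = prim

    def compose(self, outer: str, inner: str) -> Callable[..., float]:
        """Return outer(inner(x)) after checking that the types line up."""
        f, g = self._prims[outer], self._prims[inner]
        if g.out_type != f.in_type:
            raise TypeError(f"{inner} -> {g.out_type}, but {outer} needs {f.in_type}")
        return lambda *args: f.fn(g.fn(*args))


def volatility_compression(prices, window=3):
    # Assumed definition: recent stdev relative to full-history stdev.
    return statistics.pstdev(prices[-window:]) / statistics.pstdev(prices)


def clip_unit(x):
    # Clamp a float into [0, 1].
    return max(0.0, min(1.0, x))


registry = SkillRegistry()
registry.register(Primitive("volatility_compression", "price", "float", volatility_compression))
registry.register(Primitive("clip_unit", "float", "float", clip_unit))
signal = registry.compose("clip_unit", "volatility_compression")
```

An unknown primitive name raises `KeyError`, which is one way to model "the LLM can request a primitive but cannot install it".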

L2 — Hypothesis generator borrows FunSearch + RD-Agent + AI Scientist

Six islands, one per existing insight kind (signal_underperform, signal_overperform, verdict_drift, calibration_off, prompt_regression, new_pattern_discovered). Each island has its own population of ResearchHypothesis cards.

A card carries: rationale (LLM-written, why this might work), expected failure mode (the LLM names how it could be wrong), regime assumption, YAML/Python diff, evaluator metric prediction, lineage (which cards it descends from).

Generation modes: LLM composer (Claude composes from skill library), GP mutation (DEAP over signal expressions), and crossover (combine top-k from sibling cells).
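The card fields listed above could be captured in a small immutable record. This is a sketch under assumptions: the field names, the `card_id` key, and the `child` helper are hypothetical, and the real schema is deferred to §11.6. Only the six island names and the list of card contents come from this document.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple

# The six existing insight kinds, one island each.
INSIGHT_KINDS = (
    "signal_underperform",
    "signal_overperform",
    "verdict_drift",
    "calibration_off",
    "prompt_regression",
    "new_pattern_discovered",
)


@dataclass(frozen=True)
class ResearchHypothesis:
    card_id: str                        # hypothetical identifier
    island: str                         # one of INSIGHT_KINDS
    rationale: str                      # LLM-written: why this might work
    expected_failure_mode: str          # LLM-written: how it could be wrong
    regime_assumption: str
    diff: str                           # YAML/Python diff as text
    predicted_metrics: Dict[str, float]
    lineage: Tuple[str, ...] = ()       # ancestor card ids, oldest first

    def __post_init__(self) -> None:
        if self.island not in INSIGHT_KINDS:
            raise ValueError(f"unknown island: {self.island}")

    def child(self, card_id: str, diff: str, rationale: str) -> "ResearchHypothesis":
        """Derive a descendant card that records this card in its lineage."""
        return replace(
            self,
            card_id=card_id,
            diff=diff,
            rationale=rationale,
            lineage=self.lineage + (self.card_id,),
        )


parent = ResearchHypothesis(
    card_id="h-001",
    island="signal_underperform",
    rationale="example rationale",
    expected_failure_mode="example failure mode",
    regime_assumption="example regime",
    diff="example diff",
    predicted_metrics={"ic": 0.02},
)
mutated = parent.child("h-002", diff="example diff v2", rationale="tighter window")
```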

L3 — Deterministic evaluator borrows AlphaEvolve + López de Prado

The single load-bearing module. Walk-forward with purged k-fold + embargo. Returns a metric vector, not a scalar:

This module is implemented once, lives at repo HEAD, and the recursive loop never writes to it. Mutating the scorer is the path to self-deception.
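To make the purge and embargo concrete, here is a sketch of purged k-fold index generation. The assumptions are mine: samples are time-ordered integers, each label looks `horizon` steps forward, and fold sizing is naive (the real fold sizing and embargo length are deferred to §11.2). A training sample is kept only if its label window ends before the test fold starts, or if it begins after the test labels have resolved plus the embargo.

```python
from typing import Iterator, List, Tuple


def purged_kfold(
    n_samples: int, n_folds: int, horizon: int, embargo: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs over time-ordered samples.

    Sample t is assumed to carry a label spanning [t, t + horizon].
    """
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if k == n_folds - 1 else start + fold_size
        # Purge: drop training samples whose label window reaches into the test fold.
        before = range(0, max(0, start - horizon))
        # Purge + embargo: skip samples that overlap test labels, then a safety gap.
        after = range(min(n_samples, stop + horizon + embargo), n_samples)
        yield list(before) + list(after), list(range(start, stop))


folds = list(purged_kfold(100, 5, horizon=10, embargo=5))
```

A strict walk-forward variant would keep only the `before` part of each training set, so the model never trains on data later than the test fold.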

L4 — MAP-Elites archive borrows quality-diversity literature

Cells are indexed by behavioral dimensions. Initial proposal (operator can shape): sector × holding horizon × signal density × regime sensitivity × data family.

Each cell keeps the best k hypotheses by risk-adjusted, deflated-Sharpe score. Empty cells become curriculum targets.
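A sketch of the archive's core operation, assuming an in-memory dict keyed by the behavioral cell tuple. The class and method names are hypothetical, and storage and eviction policy remain open (§11.3). The `score` is taken as given; in this design it would come from the deterministic evaluator.

```python
from typing import Dict, Iterable, List, Tuple

# (sector, holding_horizon, signal_density, regime_sensitivity, data_family)
Cell = Tuple[str, str, str, str, str]


class MapElitesArchive:
    def __init__(self, k: int = 1) -> None:
        self.k = k
        self.cells: Dict[Cell, List[Tuple[float, str]]] = {}

    def offer(self, cell: Cell, card_id: str, score: float) -> bool:
        """Insert a candidate; return True if it survives among the cell's top k."""
        elites = self.cells.setdefault(cell, [])
        elites.append((score, card_id))
        elites.sort(reverse=True)   # best score first
        del elites[self.k:]         # evict everything past the top k
        return (score, card_id) in elites

    def empty_cells(self, all_cells: Iterable[Cell]) -> List[Cell]:
        """Cells with no champion yet: candidate curriculum targets."""
        return [c for c in all_cells if c not in self.cells]


tech = ("tech", "5d", "sparse", "low", "price")
energy = ("energy", "5d", "sparse", "low", "price")
archive = MapElitesArchive(k=1)
kept_first = archive.offer(tech, "h-001", 0.4)
kept_better = archive.offer(tech, "h-002", 0.9)
kept_worse = archive.offer(tech, "h-003", 0.5)
```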

L5 — Thompson bandit borrows Voyager + LSP

Allocates the research budget (LLM calls, GP generations) across islands, prompts, and mutator families.

Reward = novelty (distance from archive) × evaluator quality × shadow-survival rate. Not raw return alone (overfit attractor).

Bandit allocates research compute, never live capital. v1.x can lift to capital allocation only after the shadow tier has produced ~12 months of clean signal.
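One way to realize this is Beta-Bernoulli Thompson sampling with fractional rewards. That choice is an assumption on my part: the priors and the exact reward form are still open (§11.4), and the sketch assumes each of the three reward factors has already been normalized to [0, 1].

```python
import random
from typing import Dict, Iterable, Optional


class ThompsonBandit:
    """Allocates research budget across arms; never touches live capital."""

    def __init__(self, arms: Iterable[str], seed: Optional[int] = None) -> None:
        self.rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior over each arm's success rate.
        self.alpha: Dict[str, float] = {a: 1.0 for a in arms}
        self.beta: Dict[str, float] = {a: 1.0 for a in self.alpha}

    def pick(self) -> str:
        """Sample a plausible success rate per arm and play the best draw."""
        draws = {a: self.rng.betavariate(self.alpha[a], self.beta[a]) for a in self.alpha}
        return max(draws, key=draws.get)

    def update(self, arm: str, novelty: float, evaluator_quality: float,
               shadow_survival: float) -> None:
        """Credit an arm with reward = novelty x quality x shadow survival."""
        reward = novelty * evaluator_quality * shadow_survival
        if not 0.0 <= reward <= 1.0:
            raise ValueError("each reward factor must lie in [0, 1]")
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward


bandit = ThompsonBandit(["signal_underperform", "verdict_drift"], seed=7)
for _ in range(50):
    bandit.update("signal_underperform", 1.0, 0.9, 1.0)   # strong arm
    bandit.update("verdict_drift", 1.0, 0.1, 1.0)         # weak arm
picks = [bandit.pick() for _ in range(1000)]
```

Because the reward is a product, an arm that scores well offline but never survives shadow earns close to nothing, which matches the intent of not rewarding raw return alone.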

L6 — Three-tier gating borrows RD-Agent + Alpha-GPT

| Tier | What auto-applies | Promotion gate |
|---|---|---|
| Auto-apply | archive updates, rejected-hypothesis memory, shadow-scorer creation, research-card writes, dashboard re-render | none (these don't touch live) |
| Auto-shadow | any hypothesis passing offline evaluator sanity checks runs in shadow alongside production | passes deterministic evaluator with non-trivial lift |
| Human-gate | anything affecting live ranking, sizing, exclusion logic, sector exposure, prompt content | operator click in research-card UI (Discord button / dashboard) |

Promotion requires: ≥3 walk-forward passes clean, Deflated Sharpe ≥ threshold, stable IC across folds, acceptable turnover, no obvious regime dependency (unless explicitly labeled), AND ≥4 weeks of shadow with realized lift above noise.
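The promotion rule above is a conjunction, so it can be sketched as a checklist that reports which conditions failed. The record and field names are hypothetical, and the Deflated Sharpe threshold is left as a parameter because this document does not fix a value. Passing this check only makes a hypothesis eligible for the human gate; it does not promote anything.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PromotionEvidence:
    clean_walk_forward_passes: int
    deflated_sharpe: float
    ic_stable: bool
    turnover_ok: bool
    regime_dependent: bool
    regime_labeled: bool           # dependency is allowed if explicitly labeled
    shadow_weeks: float
    shadow_lift_above_noise: bool


def eligible_for_human_gate(e: PromotionEvidence, dsr_threshold: float) -> Tuple[bool, List[str]]:
    """Return (eligible, names of failed checks)."""
    checks = {
        "walk_forward": e.clean_walk_forward_passes >= 3,
        "deflated_sharpe": e.deflated_sharpe >= dsr_threshold,
        "ic_stability": e.ic_stable,
        "turnover": e.turnover_ok,
        "regime": (not e.regime_dependent) or e.regime_labeled,
        "shadow_tenure": e.shadow_weeks >= 4,
        "shadow_lift": e.shadow_lift_above_noise,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)


good = PromotionEvidence(3, 0.97, True, True, False, False, 6.0, True)
too_young = PromotionEvidence(3, 0.97, True, True, False, False, 3.0, True)
```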

Cadence

| Cycle | What runs |
|---|---|
| Daily | Score candidates (live + shadow scorers). Log live↔shadow deltas. Update bandit rewards from matured 1d/5d/10d returns. Heartbeat the loop. |
| Weekly | Generate new hypotheses via the bandit-allocated budget. Run offline evaluator. Update MAP-Elites archive. Promote/reject within tier rules. |
| Monthly | Promotion review queue: operator-facing batch of human-gate candidates. Stale hypothesis sweep (mark stale, archive-rejected). Holdout audit (re-confirm the holdout was untouched). |
| Event | Regime-change detector → pauses promotions, never auto-changes live behavior, raises hand. |

Observability

One operator-facing dashboard (Streamlit, building on the existing one):

5. Four highest-leverage ideas if we only do four things

  1. Skill library + LLM composer (Voyager × Alpha-GPT)

    Replaces ad-hoc YAML mutators with composable typed primitives. Highest expressivity gain per LOC. The LLM stops writing arbitrary code and starts composing operator-vetted Lego blocks.

  2. Islands archive + Deflated-Sharpe gate (FunSearch × López de Prado)

    The single biggest defense against noise discovery on small N. Islands force structural diversity; Deflated Sharpe blocks lucky outliers from promotion.

  3. Shadow tier as default promotion target (RD-Agent two-stage)

    Anything passing offline gates runs in shadow before touching live ranks. Cheap reality-check signal. 4–8 weeks minimum shadow tenure before human-gate.

  4. Research-card UI bolted onto existing flow (AI Scientist)

    One artifact per hypothesis. Operator clicks accept/reject in Discord or the dashboard. No new surface — extend the ResearchInsight table you already have. Card carries lineage, evaluator output, expected-failure-mode, regime assumption.
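For item 2, a sketch of the Deflated Sharpe Ratio as it is usually stated by Bailey and López de Prado: the observed Sharpe is compared against the Sharpe one would expect from the best of `n_trials` unskilled attempts, then adjusted for sample length, skew and kurtosis. Assumptions to flag: `sr` is per-period (not annualized), `sr_std` is the standard deviation of Sharpe ratios across the trials, and the function and parameter names are mine.

```python
import math
from statistics import NormalDist

_EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """Expected best Sharpe among n_trials strategies with no true edge."""
    if n_trials < 2:
        return 0.0   # a single trial has no selection bias to deflate
    return sr_std * (
        (1 - _EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + _EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(sr: float, n_obs: int, n_trials: int, sr_std: float,
                    skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the true Sharpe exceeds the selection-bias benchmark."""
    sr0 = expected_max_sharpe(n_trials, sr_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


single_try = deflated_sharpe(sr=0.1, n_obs=253, n_trials=1, sr_std=0.1)
hundred_tries = deflated_sharpe(sr=0.1, n_obs=253, n_trials=100, sr_std=0.1)
```

The same observed Sharpe looks convincing after one attempt and unconvincing after a hundred, which is why the number of hypotheses each island has tried needs to be tracked alongside the scores.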

6. Open questions for operator

These are the load-bearing decisions. Each has a stake-in-the-ground answer; argue with it.

Q1 — Live-ranking auto-mutation: yes or never?

Stake: Never. Anything affecting live ranks is human-gate. Pure RSI visions (the operator's link) say go further, but the cost of being wrong on a small-N quant system is months of bad signal. Auto-shadow + monthly human-gate is the rainier-shaped compromise.

Tension: If you want "truly recursive / unstoppable," human-gate is the bottleneck. Counter: human-gate is monthly, not per-hypothesis — the loop runs unstoppably underneath; only promotion pauses.

Q2 — Where does the holdout live, and how is it protected?

Stake: A frozen date-cut. The latest 90 trading days (rolling, advanced quarterly) are untouchable, enforced by an evaluator that refuses any data with scan_date > HOLDOUT_CUT. Promotion decisions therefore lag the holdout by 3 months.

Cost: ~3 months of latency on promotion decisions. Acceptable if you believe in not lying to yourself.

Tension: When the holdout cut advances, last quarter's "live" data becomes evaluator-fair-game. Need to design the advance ritual so the loop doesn't immediately p-hack the newly-released window.
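The enforcement rule in the stake can be sketched as a guard that runs before any evaluator logic. The row shape (dicts with a `scan_date` key) and the exception name are assumptions; `scan_date` and `HOLDOUT_CUT` are the names used above, and the cut date shown is a made-up example. Raising instead of silently filtering is a deliberate choice here, so that a leak shows up as an evaluator failure rather than as a quietly smaller dataset.

```python
from datetime import date
from typing import Dict, List

HOLDOUT_CUT = date(2026, 1, 15)   # example value only


class HoldoutViolation(Exception):
    """Raised when evaluation data reaches past the holdout cut."""


def guard_holdout(rows: List[Dict], holdout_cut: date = HOLDOUT_CUT) -> List[Dict]:
    """Return rows unchanged, or refuse the whole batch if any row is in the holdout."""
    leaked = [r for r in rows if r["scan_date"] > holdout_cut]
    if leaked:
        raise HoldoutViolation(
            f"{len(leaked)} row(s) have scan_date > {holdout_cut.isoformat()}"
        )
    return rows


clean_rows = [{"scan_date": date(2025, 12, 1)}, {"scan_date": date(2026, 1, 15)}]
checked = guard_holdout(clean_rows)
```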

Q3 — Compute budget per generation cycle.

Stake: ~$50/month LLM spend for hypothesis generation; ~$10/week for full archive sweep + walk-forward evaluation. Six islands × ~20 hypotheses/island/week × ~$0.05/hypothesis ≈ $6/week LLM cost (~$26/month), about half of the $50 stake. Evaluator compute dominates the workload, but it runs on local CPU at no marginal cost.

Operator decides: ceiling on monthly LLM spend. Determines island count + generation depth + how aggressive the bandit can be.
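The budget arithmetic is simple enough to keep as a function the operator can re-run when island count or depth changes. The per-hypothesis cost and counts are the estimates from the stake above; converting weeks to months with 52/12 is my assumption.

```python
import math

WEEKS_PER_MONTH = 52 / 12   # assumed conversion


def weekly_llm_cost(islands: int, hypotheses_per_island: int, cost_per_hypothesis: float) -> float:
    """Estimated LLM spend per weekly generation cycle, in dollars."""
    return islands * hypotheses_per_island * cost_per_hypothesis


def max_hypotheses_per_island(monthly_ceiling: float, islands: int, cost_per_hypothesis: float) -> int:
    """Largest weekly per-island hypothesis count that fits under the ceiling."""
    weekly_budget = monthly_ceiling / WEEKS_PER_MONTH
    return math.floor(weekly_budget / (islands * cost_per_hypothesis))


weekly = weekly_llm_cost(islands=6, hypotheses_per_island=20, cost_per_hypothesis=0.05)
monthly = weekly * WEEKS_PER_MONTH
depth_cap = max_hypotheses_per_island(monthly_ceiling=50.0, islands=6, cost_per_hypothesis=0.05)
```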

Q4 — What's the smallest end-to-end slice we ship first?

Stake (per your "ship basic first" memory): The smallest end-to-end slice is: (a) add the skill-library DSL with 6–8 starting primitives, (b) wire one Claude composer that proposes hypotheses against one island (start with signal_underperform since it has the most existing evaluator data), (c) route output to the existing ResearchInsight flow with the auto-shadow tier (shadow scorer alongside production composite_score), (d) add Deflated-Sharpe to the evaluator. No MAP-Elites yet, no bandit yet, no GP yet. All those land iteratively while the basic loop is in use.

Tension: A bandit-less, single-island system is barely recursive. Counter: it's the end-to-end slice that proves the pipes; full islands + MAP-Elites + bandit land in v0.2 once we've watched v0.1 misbehave.

Q5 — Where does this live in the rainier codebase?

Stake: A new module src/rainier/lab/ (separate from llm_thesis/research.py which keeps the existing weekly auto-research). lab/ contains: skills/ (primitives), islands/ (per-island generators), evaluator.py (walk-forward + deflated-Sharpe), archive.py (MAP-Elites), bandit.py, shadow.py. The existing research.py stays as the "v1" recommend-only flow; lab/ is the "v2" recursive flow.

Alternative: Refactor research.py into the new module rather than parallel-track. Argument against: parallel track lets the existing pipeline keep running while we build, and we can deprecate once the new one earns its trust.

Q6 — How "unstoppable" do you actually mean?

Stake interpretation: The loop runs continuously in the sense that it generates + evaluates + archives without operator input. It pauses only for: (a) human-gate promotion decisions (monthly), (b) regime-change events (raise hand, don't auto-react), (c) evaluator failures (the load-bearing module errored), (d) explicit operator stop. It does not mean: full RSI, self-modifying evaluator, or any path where the system can grade its own truth.

Question to confirm: Are we on the same page about what "unstoppable" means?

7. Current rainier — what we already have

The honest read is that you've already built ~60% of this. The recursive evolution mostly rearranges existing parts and adds three new ones.

| Recursive-system component | Today in rainier | Gap |
|---|---|---|
| Hypothesis archive | ResearchInsight table with kind/severity/evidence/action/rationale, recurrence_count dedup, status enum | Add lineage + behavioral-cell index (for MAP-Elites) |
| Outcome ground truth | ThesisEvaluation with 1d/5d/10d forward returns, nightly backfill at 17:00 PT | Add purged-CV slice index, embargo flag, holdout-cut flag |
| Generator | llm_thesis/research.py with 6 hard-coded check classes | Replace with LLM composer over skill library; check classes become islands |
| Evaluator | Mann-Whitney U on used-vs-absent forward returns; walk_forward.py exists | Add Deflated Sharpe, purged k-fold w/ embargo, multi-metric vector return |
| Mutator | YAML auto-mutators (ruamel.yaml + atomic write) gated by human accept/reject | Extend to scorer-as-code mutations; add tier classification |
| Bandit | | New; Thompson sampling over islands/prompts/mutators |
| MAP-Elites archive | | New; table or in-memory keyed by behavioral cells |
| Shadow tier | | New; shadow_composite_score column already in ScreenedStockRecord, repurpose |
| Skill library | Implicit in signals/, features/, llm_thesis/signals/ | Formalize as typed DSL; LLM composes from registry |
| Observability | Streamlit dashboard, Discord embeds, eval report | Add lineage tree + archive heatmap + bandit allocation log |

8. Risks & non-goals

Risks the architecture explicitly defends against

Non-goals (explicit)

9. Decision log

Decisions we've converged on, with date. When a Q in §6 gets answered, it moves here with the rationale.

| Date | Decision | Rationale | From |
|---|---|---|---|
| 2026-05-18 | This HTML file is the design hub for the recursive research system. Iterate here; tasks get extracted from it later. | Operator preference for "easy to link, easy to modify" single artifact over scattered markdown notes. | operator |

10. Open threads — running exploration list

Rabbit holes worth running down later. One line per thread. Strike-through (or move to §9) when resolved.

11. Deep dives (placeholders)

Sections that will get their own write-up when we converge on the v1 slice. Stubs below — fill in when the discussion reaches each.

11.1 Skill library DSL design

TBD. Types, primitive registry, LLM composition prompt, safety boundaries, versioning. Will list each primitive with signature + example.

11.2 Evaluator — walk-forward + purged CV + Deflated Sharpe

TBD. Fold sizing, embargo length, holdout cut policy, metric vector exact definitions, scoring threshold for promotion to auto-shadow vs. human-gate.

11.3 MAP-Elites archive schema

TBD. Cell definitions, storage (Postgres table vs. JSONB vs. parquet), champion-per-cell selection criteria, archive distillation, eviction policy.

11.4 Thompson bandit — arms, rewards, budget

TBD. Arm taxonomy, reward function exact form, prior distributions, budget allocation rules, exploration caps.

11.5 Three-tier gating — exact promotion rules

TBD. Per-tier acceptance gates, shadow-tenure rules, human-gate UI surface (Discord button vs. dashboard click), rollback mechanism.

11.6 Research-card schema

TBD. Field list, lineage encoding, rendering template (Discord embed + dashboard), accept/reject flow, recurrence dedup integration.

11.7 Observability — dashboards & alerts

TBD. Streamlit page layout, alert thresholds, lineage tree rendering, MAP-Elites heatmap design.

11.8 Migration path from existing llm_thesis/research.py

TBD. Parallel-track vs. refactor, deprecation plan, data migration (existing ResearchInsight rows), feature-flag rollout.

11.9 Ablations & null-result tracking

TBD. Which components are independently testable? Rejected-hypothesis archive design. How do we know the recursive loop is actually helping vs. random noise?

12. Change log

Append-only. One line per edit.

13. Sources