Rainier — recursive research system: design discussion

1. Vision anchor

The reference frame is Yuandong Tian's:

Model + Harness = Self-Improvement. Mirrors AlphaGo (policy net + MCTS), now applied to LLMs that propose research, with a harness that proves or rejects it.

The whole question of a recursive AI lab is: what does the harness do, and what does it refuse to do? Rainier's harness already has most of the moving parts: screener, evaluator, insights table, YAML mutators, weekly auto-research. The recursive step comes from (a) opening the hypothesis space beyond the six hard-coded checks, (b) tightening the offline-evaluation loop, and (c) keeping the live ranking gated while the lab runs unstoppably underneath.

The danger to avoid: a harness that allows the loop to mutate the parts that score truth (the evaluator, the holdout). That's how recursive systems self-deceive. The evaluator runs at repo HEAD and is never mutated by the loop.
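One way to make that rule mechanical is a pre-apply check on every diff the loop proposes. The sketch below is a minimal illustration, not rainier's actual guard: `PROTECTED`, `touches_protected`, and `assert_loop_diff_is_safe` are hypothetical names, the evaluator path assumes the `src/rainier/lab/` layout proposed in Q5, and the holdout path is invented for the example.

```python
from pathlib import PurePosixPath

# Assumed protected locations. The evaluator path follows the Q5 layout;
# the holdout directory is purely illustrative.
PROTECTED = (
    PurePosixPath("src/rainier/lab/evaluator.py"),
    PurePosixPath("src/rainier/lab/holdout"),
)


def touches_protected(changed_paths):
    """Return the changed paths that are, or sit under, a protected location."""
    hits = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        for guard in PROTECTED:
            if path == guard or guard in path.parents:
                hits.append(raw)
                break
    return hits


def assert_loop_diff_is_safe(changed_paths):
    """Refuse a loop-generated diff that would rewrite the parts that score truth."""
    hits = touches_protected(changed_paths)
    if hits:
        raise PermissionError(f"loop diff touches protected paths: {hits}")
```

The check compares paths as written, so a real guard would also need to normalize `..` segments and symlinks before trusting the result.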

2. The recursive-AI-lab playbook

Every system surveyed is some specialization of:

generate(hypothesis)  →  evaluate(deterministic, walk-forward, holdout-clean)
        ▲                              │
        │                              ▼
   archive (keep diverse) ◄────── score (multi-objective vector)
        │                              │
        ▼                              ▼
   bandit (which arm gets budget?)  gate (auto / shadow / human)

The five things that change between systems are:

What's a hypothesis? A Python function (FunSearch, Eureka), a YAML diff (rainier today), a natural-language idea (AI Scientist), a full research plan (RD-Agent), a composed skill (Voyager).
What's the evaluator? Deterministic puzzle scorer (FunSearch), GPU-sim reward (Eureka), backtest engine (RD-Agent, Alpha-GPT), human review (AI Scientist v1).
How is diversity preserved? Islands (FunSearch), MAP-Elites cells (quality-diversity literature), behavioral characterization (POET), archive distillation (Voyager).
Who picks the next try? Best-so-far prompt (FunSearch), curriculum + bandits (Voyager, Language Self-Play), tree search (AI Scientist v2), human (Alpha-GPT 1.0).
What's allowed to change live behavior? Everything (full RSI vision), nothing (Alpha-GPT human-in-loop), threshold-gated subset (RD-Agent shadow-mode, the rainier proposal here).

3. Prior art, ranked by adoptability for rainier

#	System	One-line	What we steal
1	RD-Agent(Q) Microsoft, 2025	Multi-agent quant R&D: hypothesis → spec → code-gen → backtest → knowledge-forest. 2× CSI300 returns w/ 70% fewer factors, <$10/run.	The structured `hypothesis → task spec → code-gen → backtest → archive` loop. Closest published analog to rainier's existing stack. Knowledge-forest dedup ≈ your `ResearchInsight.recurrence_count`.
2	FunSearch DeepMind, Nature 2023	Frozen LLM proposes code → deterministic evaluator scores → island GA evolves population.	Islands topology (your 6 insight kinds become 6 islands). Strict rule: only the deterministic evaluator promotes; LLM never grades itself.
3	AlphaEvolve DeepMind, 2025	Multi-objective code evolution with meta-evolution of the search strategy itself.	Evolve scoring modules against a metric vector — return, IC, turnover, drawdown, regime-robustness, explanation quality — not raw return.
4	AI Scientist v2 Sakana, 2025	Agentic tree search: ideate → code → eval → write → reviewer-LLM → next cycle.	The "research card" artifact — one durable record per hypothesis (rationale, expected failure mode, evaluator output, lineage). Replaces freeform research notes.
5	Voyager MineDojo / NVIDIA, 2023	Ever-growing code-skill library + automatic curriculum + iterative prompting with execution feedback.	A skill library of interpretable Python signal primitives the LLM composes, instead of writing arbitrary code. Curriculum picks next gap from archive coverage.
6	Eureka NVIDIA / UPenn, 2023	LLM writes reward code, GPU-sim scores it, evolutionary loop improves. Outperforms human-engineered rewards on 83% of 29 RL tasks.	The reward-function-as-evolvable-code idea. For rainier: the composite-score weighting is an evolvable Python module, not a static YAML.
7	Language Self-Play 2025	Single LLM in two roles (Challenger ↔ Solver) generates increasingly hard tasks for itself.	Carefully: a challenger that picks adversarial regimes (drawdown windows, sector rotations, regime breaks) the solver must survive. Note the collapse-after-few-iterations warning — KL reg + self-reward needed.
8	MAP-Elites Mouret & Clune 2015 et al.	Quality-diversity archive: behavioral feature space discretized into cells; keep best-so-far in each cell.	The archive shape. Cells indexed by `(sector, holding_horizon, signal_density, regime_sensitivity, data_family)`. Research archive, not live allocator.
9	Alpha-GPT / AlphaAgent / QuantaAlpha 2023–2025	LLM-driven alpha-factor mining at scale.	Decay-resistance gate (AlphaAgent), trajectory-level self-evolution (QuantaAlpha: each run is a path, not a point), hierarchical RAG over screener history (Alpha-GPT). Your `ScreenedStockRecord` table is already the RAG corpus.
10	Anti-overfitting discipline López de Prado et al.	Deflated Sharpe, purged k-fold + embargo, walk-forward, untouchable holdout. QuantBench framing.	Non-negotiable. Every promotion clears Deflated-Sharpe ≥ threshold AND stable IC AND acceptable turnover AND multiple walk-forward folds. Otherwise the loop discovers noise on ~10K obs/year/signal.
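For reference, the Deflated Sharpe gate in row 10 can be sketched with the standard library alone. This follows the Bailey and López de Prado formulation as I understand it: the observed per-period Sharpe ratio is compared against the Sharpe expected from the best of `n_trials` unskilled attempts, with a correction for skew and kurtosis. The function names, and the assumption that the loop tracks `trial_sharpe_std` (the spread of Sharpe ratios across everything it has tried), are mine and not existing rainier code.

```python
import math
from statistics import NormalDist, fmean, pstdev

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(trial_sharpe_std, n_trials):
    """Sharpe expected from the luckiest of n_trials zero-skill strategies."""
    if n_trials < 2:
        return 0.0
    a = _NORMAL.inv_cdf(1 - 1 / n_trials)
    b = _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    return trial_sharpe_std * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(returns, trial_sharpe_std, n_trials):
    """Probability that the true Sharpe beats the multiple-testing benchmark.

    returns: per-period returns of one candidate (not annualized).
    """
    t = len(returns)
    mu, sigma = fmean(returns), pstdev(returns)
    if t < 3 or sigma == 0:
        raise ValueError("need at least 3 non-constant returns")
    sharpe = mu / sigma
    z = [(r - mu) / sigma for r in returns]
    skew = fmean([x ** 3 for x in z])
    kurt = fmean([x ** 4 for x in z])  # plain kurtosis; a normal gives 3
    variance_term = 1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2
    if variance_term <= 0:
        raise ValueError("degenerate return distribution")
    benchmark = expected_max_sharpe(trial_sharpe_std, n_trials)
    return _NORMAL.cdf((sharpe - benchmark) * math.sqrt(t - 1) / math.sqrt(variance_term))
```

The point of the exercise is visible in the `n_trials` argument: the same return series scores lower the more hypotheses the loop has already tried, which is exactly the pressure a high-throughput generator needs.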

4. Proposed architecture

Both codex passes and I converge on this skeleton. Each layer below is annotated with which prior art it borrows from.

                       ┌──────────────────────────────────┐
                       │   SKILL LIBRARY (Voyager-style)  │
                       │  ~30 interpretable Python prims  │
                       │  composable by LLM via typed DSL │
                       └─────────────┬────────────────────┘
                                     │
                       ┌─────────────▼────────────────────┐
                       │ HYPOTHESIS GENERATOR             │
                       │  6 islands (one per insight kind)│
                       │  LLM composer + DEAP-style GP    │
                       │  emits ResearchHypothesis cards  │
                       │  (FunSearch + RD-Agent + AI Sci) │
                       └─────────────┬────────────────────┘
                                     │
                       ┌─────────────▼────────────────────┐
                       │  DETERMINISTIC EVALUATOR         │
                       │  walk-forward + purged CV +      │
                       │  embargo → Deflated Sharpe       │
                       │  RUNS AT REPO HEAD; NEVER        │
                       │  MUTATED BY THE LOOP             │
                       │  (AlphaEvolve metric vector)     │
                       └─────────────┬────────────────────┘
                                     │
                       ┌─────────────▼────────────────────┐
                       │ MAP-ELITES ARCHIVE               │
                       │  cells: sector × horizon ×       │
                       │  signal_density × regime × family│
                       │  best champion per cell          │
                       └─────────────┬────────────────────┘
                                     │
                       ┌─────────────▼────────────────────┐
                       │ THOMPSON BANDIT: research budget │
                       │  arms: islands, prompts, mutator │
                       │    families                      │
                       │  reward: novelty × evaluator     │
                       │          × shadow-survival       │
                       │  NOT capital allocation (yet)    │
                       └─────────────┬────────────────────┘
                                     │
              ┌──────────────────────┼──────────────────────┐
              │                      │                      │
   ┌──────────▼────────┐   ┌─────────▼────────┐   ┌─────────▼────────┐
   │ AUTO-APPLY tier   │   │ SHADOW-LIVE tier │   │ HUMAN-GATE tier  │
   │ docs, archive,    │   │ scorer runs in   │   │ live ranking,    │
   │ rejected-memory   │   │ shadow alongside │   │ sizing, exclusion│
   │                   │   │ production for   │   │ → operator       │
   │                   │   │ 4–8 weeks min    │   │ accept via       │
   │                   │   │                  │   │ research-card UI │
   └───────────────────┘   └──────────────────┘   └──────────────────┘

Layer-by-layer

L1: Skill library (borrows Voyager + Alpha-GPT)

~30 small interpretable Python signal primitives written by the operator, each typed (price→float, price+volume→signal, rank→ordinal, etc.). The LLM composes them via a typed DSL; it does not write arbitrary code. New primitives are operator-added; the LLM can request a primitive but cannot install it.

Example primitives: rank_trajectory(N=5), capital_flow_streak(N), sector_relative_rank, volatility_compression(window), thesis_confidence, catalyst_proximity(days), regime_indicator, liquidity_floor(adv).
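A minimal sketch of what "typed, operator-installed, LLM-composed" could look like in practice. Everything here is an assumption for illustration: the `Primitive` record, the `primitive` decorator, the string type tags, the expression shape, and the specific meaning given to `rank_trajectory` (average rank improvement per step) are not taken from the existing codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Primitive:
    name: str
    in_type: str   # type tag the primitive accepts, e.g. "rank_series"
    out_type: str  # type tag it produces, e.g. "float"
    fn: Callable


REGISTRY: Dict[str, Primitive] = {}


def primitive(name, in_type, out_type):
    """Operator-side decorator: the only way a primitive enters the registry."""
    def register(fn):
        if name in REGISTRY:
            raise ValueError(f"primitive {name!r} already registered")
        REGISTRY[name] = Primitive(name, in_type, out_type, fn)
        return fn
    return register


@primitive("rank_trajectory", in_type="rank_series", out_type="float")
def rank_trajectory(ranks, n=5):
    """Average rank improvement per step over the last n scans (positive = climbing)."""
    window = list(ranks)[-n:]
    if len(window) < 2:
        return 0.0
    return (window[0] - window[-1]) / (len(window) - 1)


def run_expression(expr, inputs):
    """Evaluate an LLM-proposed {"op", "input", "params"} dict without eval()."""
    prim = REGISTRY.get(expr["op"])
    if prim is None:
        raise KeyError(f"unknown primitive {expr['op']!r}: request it from the operator")
    tag, value = inputs[expr["input"]]
    if tag != prim.in_type:
        raise TypeError(f"{prim.name} expects {prim.in_type}, got {tag}")
    return prim.fn(value, **expr.get("params", {}))
```

Because the LLM only ever emits data (the `expr` dict), an unknown or mistyped primitive fails loudly instead of executing generated code.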

L2: Hypothesis generator (borrows FunSearch + RD-Agent + AI Scientist)

Six islands, one per existing insight kind (signal_underperform, signal_overperform, verdict_drift, calibration_off, prompt_regression, new_pattern_discovered). Each island has its own population of ResearchHypothesis cards.

A card carries: rationale (LLM-written, why this might work), expected failure mode (the LLM names how it could be wrong), regime assumption, YAML/Python diff, evaluator metric prediction, lineage (which cards it descends from).
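The card fields listed above map naturally onto a small immutable record. This is a sketch under assumptions: the class layout, the `card_id` content hash, and the validation against the six island names are illustrative choices, not the schema that 11.6 will eventually define.

```python
import hashlib
import json
from dataclasses import dataclass

ISLANDS = (
    "signal_underperform", "signal_overperform", "verdict_drift",
    "calibration_off", "prompt_regression", "new_pattern_discovered",
)


@dataclass(frozen=True)
class ResearchHypothesis:
    island: str
    rationale: str              # LLM-written: why this might work
    expected_failure_mode: str  # LLM-written: how it could be wrong
    regime_assumption: str
    diff: str                   # YAML/Python diff as text
    predicted_metrics: dict     # what the LLM expects the evaluator to report
    lineage: tuple = ()         # card_ids of parent cards

    def __post_init__(self):
        if self.island not in ISLANDS:
            raise ValueError(f"unknown island {self.island!r}")

    @property
    def card_id(self):
        """Stable id derived from what the card changes and where it came from."""
        payload = json.dumps(
            {"island": self.island, "diff": self.diff, "lineage": list(self.lineage)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Hashing the diff plus lineage means a re-proposed identical change collides with its earlier card, which is one cheap way to feed the rejected-hypothesis memory.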

Generation modes: LLM composer (Claude composes from skill library), GP mutation (DEAP over signal expressions), and crossover (combine top-k from sibling cells).

L3: Deterministic evaluator (borrows AlphaEvolve + López de Prado)

The single load-bearing module. Walk-forward with purged k-fold + embargo. Returns a metric vector, not a scalar:

Forward return at 1d / 5d / 10d / 20d
Information coefficient (IC) and IC stability across folds
Information ratio
Turnover & average holding period
Max drawdown & downside capture
Sector / industry concentration
Capacity / liquidity penalty
Regime-conditional metrics (bull / bear / high-vol / low-vol)
Hit rate, tail loss, skew
Deflated Sharpe (false-discovery adjusted)

This module is implemented once, lives at repo HEAD, and the recursive loop never writes to it. Mutating the scorer is the path to self-deception.
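To make "purged k-fold + embargo" concrete, here is a minimal index splitter. It is a simplification under stated assumptions: observations are equally spaced and time-ordered, every label looks ahead exactly `horizon` bars, and purging is applied symmetrically on both sides of the test block. The function name and default values are placeholders; fold sizing and embargo length are still TBD in 11.2.

```python
def purged_kfold(n_obs, n_folds=5, horizon=10, embargo=5):
    """Yield (train_idx, test_idx) pairs for observations 0..n_obs-1.

    horizon: bars each label looks ahead (10 for a 10d forward return).
    embargo: extra bars dropped from training right after each test block.
    """
    if not 2 <= n_folds <= n_obs:
        raise ValueError("need 2 <= n_folds <= n_obs")
    bounds = [k * n_obs // n_folds for k in range(n_folds + 1)]
    for start, stop in zip(bounds, bounds[1:]):
        test = list(range(start, stop))
        train = [
            i for i in range(n_obs)
            # Purge: drop rows whose label window overlaps the test block,
            # then embargo a few more rows after it.
            if i < start - horizon or i >= stop + horizon + embargo
        ]
        yield train, test
```

With 100 observations, 5 folds, a 5-bar horizon and a 2-bar embargo, the second fold tests on rows 20 to 39 and trains on rows 0 to 14 plus 47 to 99, so no training label can see into the test window.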

L4: MAP-Elites archive (borrows quality-diversity literature)

Cells indexed by behavioral dims. Initial proposal (operator can shape):

Dominant sector exposure: {energy, tech, consumer, financial, materials, diversified}
Holding horizon: {short ≤2d, medium 3–10d, long >10d}
Signal density: {sparse high-conviction, broad weak}
Regime sensitivity: {pro-cyclical, defensive, vol-seeking, vol-averse}
Data family: {fundamentals, price-action, thesis-text, event/catalyst, hybrid}

Each cell keeps its best k hypotheses, ranked by a risk-adjusted, Deflated-Sharpe-based score (k = 1 is the classic MAP-Elites single champion per cell). Empty cells become curriculum targets.
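A small in-memory sketch of the archive, using the five proposed dimensions verbatim. The `Archive` class and its method names are hypothetical, and storage (table vs. JSONB vs. parquet) is deliberately left out since 11.3 has not settled it.

```python
from itertools import product

DIMS = {
    "sector": ("energy", "tech", "consumer", "financial", "materials", "diversified"),
    "horizon": ("short", "medium", "long"),
    "density": ("sparse", "broad"),
    "regime": ("pro-cyclical", "defensive", "vol-seeking", "vol-averse"),
    "family": ("fundamentals", "price-action", "thesis-text", "event", "hybrid"),
}
ALL_CELLS = set(product(*DIMS.values()))  # 6 * 3 * 2 * 4 * 5 = 720 cells


class Archive:
    def __init__(self, k=1):
        self.k = k
        self.cells = {}  # cell tuple -> [(score, card_id), ...], best first

    def insert(self, cell, card_id, score):
        """Keep the candidate only if it ranks in the cell's top k."""
        if cell not in ALL_CELLS:
            raise KeyError(f"not a valid cell: {cell}")
        ranked = sorted(self.cells.get(cell, []) + [(score, card_id)], reverse=True)
        self.cells[cell] = ranked[: self.k]
        return (score, card_id) in self.cells[cell]

    def empty_cells(self):
        """Cells with no champion yet: the curriculum targets."""
        return ALL_CELLS - set(self.cells)
```

One thing the count makes visible: the proposed axes give 720 cells, while the Q3 budget implies roughly 120 hypotheses per week, so the archive will stay sparse for a long time and the axes may need coarsening.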

L5: Thompson bandit (borrows Voyager + LSP)

Allocates the research budget (LLM calls, GP generations) across:

Which island gets more generation budget this cycle
Which Claude prompt variant produces evaluator-passing hypotheses more often
Which sector priors deserve exploration
Which mutator families (GP vs LLM-composer vs crossover) are working

Reward = novelty (distance from archive) × evaluator quality × shadow-survival rate. Raw return alone is avoided as a reward because it is an overfitting attractor.

Bandit allocates research compute, never live capital. v1.x can lift to capital allocation only after the shadow tier has produced ~12 months of clean signal.
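A minimal Thompson-sampling sketch for the budget split. Assumptions to flag: each reward factor is scaled to [0, 1], arms carry Beta posteriors with fractional updates, and the class and function names are invented here. The exact reward form and priors remain open in 11.4.

```python
import random


def research_reward(novelty, evaluator_quality, shadow_survival):
    """Multiplicative reward; each factor is assumed to be scaled to [0, 1]."""
    factors = (novelty, evaluator_quality, shadow_survival)
    if not all(0.0 <= f <= 1.0 for f in factors):
        raise ValueError("reward factors must lie in [0, 1]")
    return novelty * evaluator_quality * shadow_survival


class ThompsonBandit:
    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [alpha, beta].
        self.posterior = {arm: [1.0, 1.0] for arm in arms}

    def pick(self):
        """Sample each arm's posterior once and return the arm with the top draw."""
        draws = {a: self.rng.betavariate(p[0], p[1]) for a, p in self.posterior.items()}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        """Fractional Beta update for a reward in [0, 1]."""
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must lie in [0, 1]")
        self.posterior[arm][0] += reward
        self.posterior[arm][1] += 1.0 - reward

    def allocate(self, budget):
        """Split `budget` research slots (LLM calls, GP generations) across arms."""
        counts = dict.fromkeys(self.posterior, 0)
        for _ in range(budget):
            counts[self.pick()] += 1
        return counts
```

Because the product reward is zero whenever any factor is zero, an arm that produces high-scoring but non-novel hypotheses earns nothing, which matches the intent of not rewarding raw return.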

L6: Three-tier gating (borrows RD-Agent + Alpha-GPT)

Tier	Scope	Promotion gate
Auto-apply	archive updates, rejected-hypothesis memory, shadow-scorer creation, research-card writes, dashboard re-render	none (these don't touch live)
Auto-shadow	any hypothesis passing offline evaluator sanity checks runs in shadow alongside production	passes deterministic evaluator with non-trivial lift
Human-gate	anything affecting live ranking, sizing, exclusion logic, sector exposure, prompt content	operator click in research-card UI (Discord button / dashboard)

Promotion requires: ≥3 walk-forward passes clean, Deflated Sharpe ≥ threshold, stable IC across folds, acceptable turnover, no obvious regime dependency (unless explicitly labeled), AND ≥4 weeks of shadow with realized lift above noise.
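The promotion rule reads as a conjunction, so it can be expressed as a function that returns the list of unmet conditions. This is a sketch: the `Candidate` fields, the IC-stability definition (all folds positive, bounded spread), and every numeric threshold except "3 passes" and "4 weeks" are placeholders that 11.2 and 11.5 still need to fix.

```python
from dataclasses import dataclass
from statistics import pstdev

# Placeholder thresholds, not decided values.
MIN_DEFLATED_SHARPE = 0.95
MAX_IC_STD = 0.02
MAX_TURNOVER = 1.0
SHADOW_NOISE_FLOOR = 0.0


@dataclass
class Candidate:
    clean_walk_forward_passes: int
    deflated_sharpe: float
    ic_by_fold: list
    turnover: float
    shadow_weeks: float
    shadow_lift: float
    regime_dependent: bool = False
    regime_labeled: bool = False


def promotion_blockers(c):
    """Return the unmet promotion conditions; an empty list means eligible."""
    blockers = []
    if c.clean_walk_forward_passes < 3:
        blockers.append("fewer than 3 clean walk-forward passes")
    if c.deflated_sharpe < MIN_DEFLATED_SHARPE:
        blockers.append("Deflated Sharpe below threshold")
    if not c.ic_by_fold or min(c.ic_by_fold) <= 0 or pstdev(c.ic_by_fold) > MAX_IC_STD:
        blockers.append("IC not stable across folds")
    if c.turnover > MAX_TURNOVER:
        blockers.append("turnover too high")
    if c.regime_dependent and not c.regime_labeled:
        blockers.append("unlabeled regime dependency")
    if c.shadow_weeks < 4 or c.shadow_lift <= SHADOW_NOISE_FLOOR:
        blockers.append("shadow tenure or realized lift insufficient")
    return blockers
```

Returning reasons rather than a bare boolean lets the research card show the operator exactly which gate a hypothesis failed.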

Cadence

Cycle	What runs
Daily	Score candidates (live + shadow scorers). Log live↔shadow deltas. Update bandit rewards from matured 1d/5d/10d returns. Heartbeat the loop.
Weekly	Generate new hypotheses via the bandit-allocated budget. Run offline evaluator. Update MAP-Elites archive. Promote/reject within tier rules.
Monthly	Promotion review queue: an operator-facing batch of human-gate candidates. Stale-hypothesis sweep (mark stale cards and move them to the rejected archive). Holdout audit (re-confirm the holdout was untouched).
Event	Regime-change detector fires: pause promotions, alert the operator, and never auto-change live behavior.

Observability

One operator-facing dashboard (Streamlit, building on the existing one):

Active live scorers + shadow scorers (with current allocation)
Recent promotions / rejections with reasons
Hypothesis lineage tree (which card descends from which)
MAP-Elites archive heatmap (filled cells, performance per cell)
Regime dashboard
Overfitting risk flags (e.g. p-hacking score from multiple-comparison adjustment)
Bandit allocation history (which island got budget when)
"Why did this stock rank here?" — per-row explanation drilling into which scorers contributed

5. Four highest-leverage ideas if we only do four things

Skill library + LLM composer (Voyager × Alpha-GPT)

Replaces ad-hoc YAML mutators with composable typed primitives. Highest expressivity gain per LOC. The LLM stops writing arbitrary code and starts composing operator-vetted Lego blocks.

Islands archive + Deflated-Sharpe gate (FunSearch × López de Prado)

The single biggest defense against noise discovery on small N. Islands force structural diversity; Deflated Sharpe blocks lucky outliers from promotion.

Shadow tier as default promotion target (RD-Agent two-stage)

Anything passing offline gates runs in shadow before touching live ranks. Cheap reality-check signal. 4–8 weeks minimum shadow tenure before human-gate.

Research-card UI bolted onto existing flow (AI Scientist)

One artifact per hypothesis. Operator clicks accept/reject in Discord or the dashboard. No new surface: extend the ResearchInsight table you already have. Card carries lineage, evaluator output, expected-failure-mode, regime assumption.

6. Open questions for operator

These are the load-bearing decisions. Each has a stake-in-the-ground answer; argue with it.

Q1 — Live-ranking auto-mutation: yes or never?

Stake: Never. Anything affecting live ranks is human-gate. Pure RSI visions (the operator's link) say go further, but the cost of being wrong on a small-N quant system is months of bad signal. Auto-shadow + monthly human-gate is the rainier-shaped compromise.

Tension: If you want "truly recursive / unstoppable," human-gate is the bottleneck. Counter: human-gate is monthly, not per-hypothesis — the loop runs unstoppably underneath; only promotion pauses.

Q2 — Where does the holdout live, and how is it protected?

Stake: A frozen date-cut. The latest 90 trading days (rolling, advanced quarterly) are untouchable. Enforced by an evaluator that refuses any data with scan_date > HOLDOUT_CUT. Promotion decisions therefore lag by the length of the holdout.

Cost: ~3 months of latency on promotion decisions if the 90 days are calendar days, closer to 4 months if they are trading days (90 trading days is about 18 calendar weeks). Acceptable if you believe in not lying to yourself.

Tension: When the holdout cut advances, last quarter's "live" data becomes evaluator-fair-game. Need to design the advance ritual so the loop doesn't immediately p-hack the newly-released window.
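The enforcement half of this stake is simple enough to sketch. Assumptions: rows are dicts with a `scan_date` field, the cut is derived from a list of known trading days, and the names `holdout_cut`, `guard_evaluator_input`, and `HoldoutViolation` are invented for the example. The advance ritual itself is not addressed here.

```python
from datetime import date


class HoldoutViolation(Exception):
    """Raised when evaluator input reaches into the holdout window."""


def holdout_cut(trading_days, holdout_len=90):
    """Last date the evaluator may see, leaving the latest holdout_len days untouched."""
    days = sorted(set(trading_days))
    if len(days) <= holdout_len:
        raise ValueError("not enough history to carve out the holdout")
    return days[-(holdout_len + 1)]


def guard_evaluator_input(rows, cut):
    """Refuse the whole batch if any row has scan_date > cut."""
    late = [r for r in rows if r["scan_date"] > cut]
    if late:
        raise HoldoutViolation(f"{len(late)} rows fall after HOLDOUT_CUT={cut}")
    return rows
```

Raising on the whole batch, rather than silently filtering, keeps a leaky query from going unnoticed; the monthly holdout audit can then simply confirm that no `HoldoutViolation` was swallowed.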

Q3 — Compute budget per generation cycle.

Stake: ~$50/month LLM spend for hypothesis generation; ~$10/week for full archive sweep + walk-forward evaluation. Six islands × ~20 hypotheses/island/week × ~$0.05/hypothesis = ~$6/week in LLM cost, or about $26/month, which leaves headroom under the ~$50 figure. Evaluator compute dominates the workload, but it runs on local CPU and is effectively free.

Operator decides: ceiling on monthly LLM spend. Determines island count + generation depth + how aggressive the bandit can be.
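The budget arithmetic is easy to keep honest with a few lines. The defaults below are the stake's own numbers; the function names and the 52/12 weeks-per-month conversion are my assumptions.

```python
WEEKS_PER_MONTH = 52 / 12


def weekly_llm_cost(islands=6, hypotheses_per_island=20, cost_per_hypothesis=0.05):
    """LLM spend per weekly generation cycle, in dollars."""
    return islands * hypotheses_per_island * cost_per_hypothesis


def monthly_llm_cost(**kwargs):
    """Weekly spend scaled to an average month."""
    return weekly_llm_cost(**kwargs) * WEEKS_PER_MONTH


def max_hypotheses_per_island(monthly_ceiling, islands=6, cost_per_hypothesis=0.05):
    """Largest weekly per-island batch that stays under the monthly ceiling."""
    monthly_cost_per_unit = islands * cost_per_hypothesis * WEEKS_PER_MONTH
    return int(monthly_ceiling // monthly_cost_per_unit)
```

Under these numbers a $50/month ceiling supports at most 38 hypotheses per island per week across six islands, so the ~20 in the stake uses roughly half the ceiling.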

Q4 — What's the smallest end-to-end slice we ship first?

Stake (per your "ship basic first" memory): The smallest end-to-end slice is: (a) add the skill-library DSL with 6–8 starting primitives, (b) wire one Claude composer that proposes hypotheses against one island (start with signal_underperform since it has the most existing evaluator data), (c) route output to the existing ResearchInsight flow with the auto-shadow tier (shadow scorer alongside production composite_score), (d) add Deflated-Sharpe to the evaluator. No MAP-Elites yet, no bandit yet, no GP yet. All those land iteratively while the basic loop is in use.

Tension: A bandit-less, single-island system is barely recursive. Counter: it's the end-to-end slice that proves the pipes; full islands + MAP-Elites + bandit lands in v0.2 once we've watched v0.1 misbehave.

Q5 — Where does this live in the rainier codebase?

Stake: A new module src/rainier/lab/ (separate from llm_thesis/research.py which keeps the existing weekly auto-research). lab/ contains: skills/ (primitives), islands/ (per-island generators), evaluator.py (walk-forward + deflated-Sharpe), archive.py (MAP-Elites), bandit.py, shadow.py. The existing research.py stays as the "v1" recommend-only flow; lab/ is the "v2" recursive flow.

Alternative: Refactor research.py into the new module rather than parallel-track. Argument against: parallel track lets the existing pipeline keep running while we build, and we can deprecate once the new one earns its trust.

Q6 — How "unstoppable" do you actually mean?

Stake interpretation: The loop runs continuously in the sense that it generates + evaluates + archives without operator input. It pauses only for: (a) human-gate promotion decisions (monthly), (b) regime-change events (raise hand, don't auto-react), (c) evaluator failures (the load-bearing module errored), (d) explicit operator stop. It does not mean: full RSI, self-modifying evaluator, or any path where the system can grade its own truth.

Question to confirm: Are we on the same page about what "unstoppable" means?

7. Current rainier: what we already have

The honest read is that you've already built ~60% of this. The recursive evolution mostly rearranges existing parts and adds three new ones.

Recursive-system component	Today in rainier	Gap
Hypothesis archive	`ResearchInsight` table with kind/severity/evidence/action/rationale, recurrence_count dedup, status enum	Add lineage + behavioral-cell index (for MAP-Elites)
Outcome ground truth	`ThesisEvaluation` with 1d/5d/10d forward returns, nightly backfill at 17:00 PT	Add purged-CV slice index, embargo flag, holdout-cut flag
Generator	`llm_thesis/research.py` with 6 hard-coded check classes	Replace with LLM composer over skill library; check classes become islands
Evaluator	Mann-Whitney U on used-vs-absent forward returns; `walk_forward.py` exists	Add Deflated Sharpe, purged k-fold w/ embargo, multi-metric vector return
Mutator	YAML auto-mutators (ruamel.yaml + atomic write) gated by human accept/reject	Extend to scorer-as-code mutations; add tier classification
Bandit	—	New; Thompson sampling over islands/prompts/mutators
MAP-Elites archive	—	New; table or in-memory keyed by behavioral cells
Shadow tier	—	New; shadow_composite_score column already in ScreenedStockRecord, repurpose
Skill library	Implicit in `signals/`, `features/`, `llm_thesis/signals/`	Formalize as typed DSL; LLM composes from registry
Observability	Streamlit dashboard, Discord embeds, eval report	Add lineage tree + archive heatmap + bandit allocation log

8. Risks & non-goals

Risks the architecture explicitly defends against

Self-graded truth. Evaluator at repo HEAD, never mutated by the loop. Holdout date-cut enforced in the evaluator itself.
Noise discovery on small N. Deflated Sharpe + multiple-comparison adjustment + minimum walk-forward folds + shadow tenure ≥ 4 weeks.
Mode collapse. Islands prevent one family from dominating. MAP-Elites preserves diversity even when one cell's champion dominates the leaderboard.
Live ranking drift. Human-gate on anything live-affecting. Auto-shadow is the default promotion target.
Loop runaway. Bandit budget capped; LLM spend ceiling; regime-change detector pauses promotions; rejected-hypothesis archive prevents re-discovery of dead ends.

Non-goals (explicit)

Full RSI — system never modifies the evaluator, the holdout, the gating logic, or its own scaffolding.
Auto-capital-allocation — the bandit allocates research budget only.
RL on raw market actions — that's a separate effort (FinRL etc.); not part of the research loop.
Replacing existing llm_thesis/research.py in one shot — parallel track, deprecate later.
Multi-asset / multi-market — QU100 only for v1.

9. Decision log

Decisions we've converged on, with date. When a Q in §6 gets answered, it moves here with the rationale.

Date	Decision	Rationale	From
2026-05-18	This HTML file is the design hub for the recursive research system. Iterate here; tasks get extracted from it later.	Operator preference for "easy to link, easy to modify" single artifact over scattered markdown notes.	operator

10. Open threads: running exploration list

Rabbit holes worth running down later. One line per thread. Strike-through (or move to §9) when resolved.

Should the skill library be code-as-text (composable Python) or expressions-only (a typed DSL with no eval)? Trade-off: expressivity vs. safety of LLM-generated content.
Behavioral cell definitions for MAP-Elites — confirm the 5 dims (sector × horizon × density × regime × family). Are these the right axes, or are there more important ones for QU100 specifically?
Regime-change detection: HMM (in TODOS already) vs. simpler vol-regime breakpoint vs. LLM-narrated regime. Which is the cheapest reliable signal?
Shadow tier mechanics: shadow_composite_score column already exists in ScreenedStockRecord. Reuse or new table?
Prompt versioning: research.py already tracks prompt_version. How does the bandit treat prompt variants — as arms or as a separate axis?
How does Cai Sen pattern weighting interact with the recursive loop? Is the existing tier-1/2/3/4 Cai Sen weight scheme an evolvable parameter or a fixed prior?
LLM cost ceiling — Q3 in §6. Tied to island count and generation depth.
Should we add a "challenger" LLM (Language Self-Play) that picks adversarial regimes, or is walk-forward + purged CV enough?
Capacity / liquidity penalty — what's the realistic capacity floor for QU100 names? Affects evaluator metric weight.
Multi-asset / multi-market expansion — explicit non-goal for v1, but worth a paragraph on what would need to change later.

11. Deep dives (placeholders)

Sections that will get their own write-up when we converge on the v1 slice. Stubs below — fill in when the discussion reaches each.

11.1 Skill library DSL design

TBD. Types, primitive registry, LLM composition prompt, safety boundaries, versioning. Will list each primitive with signature + example.

11.2 Evaluator: walk-forward + purged CV + Deflated Sharpe

TBD. Fold sizing, embargo length, holdout cut policy, metric vector exact definitions, scoring threshold for promotion to auto-shadow vs. human-gate.

11.3 MAP-Elites archive schema

TBD. Cell definitions, storage (Postgres table vs. JSONB vs. parquet), champion-per-cell selection criteria, archive distillation, eviction policy.

11.4 Thompson bandit: arms, rewards, budget

TBD. Arm taxonomy, reward function exact form, prior distributions, budget allocation rules, exploration caps.

11.5 Three-tier gating: exact promotion rules

TBD. Per-tier acceptance gates, shadow-tenure rules, human-gate UI surface (Discord button vs. dashboard click), rollback mechanism.

11.6 Research-card schema

TBD. Field list, lineage encoding, rendering template (Discord embed + dashboard), accept/reject flow, recurrence dedup integration.

11.7 Observability: dashboards & alerts

TBD. Streamlit page layout, alert thresholds, lineage tree rendering, MAP-Elites heatmap design.

11.8 Migration path from existing `llm_thesis/research.py`

TBD. Parallel-track vs. refactor, deprecation plan, data migration (existing ResearchInsight rows), feature-flag rollout.

11.9 Ablations & null-result tracking

TBD. Which components are independently testable? Rejected-hypothesis archive design. How do we know the recursive loop is actually helping vs. random noise?

12. Change log

Append-only. One line per edit.

2026-05-18 — Initial draft. Vision anchor, prior art survey, architecture proposal, 4 leverage ideas, 6 open questions, current-state mapping, risks. (coord)
2026-05-18 — Added decision log, open threads, deep-dive stubs, change log. File is now structured as a doc hub. (coord)