The reference frame is Yuandong Tian's:
Model + Harness = Self-Improvement. Mirrors AlphaGo (policy net + MCTS), now applied to LLMs that propose research, with a harness that proves or rejects it.
The whole question of a recursive AI lab is: what does the harness do, and what does it refuse to do? Rainier's harness already has most of the moving parts — screener, evaluator, insights table, YAML mutators, weekly auto-research. The recursive evolution is in (a) opening the hypothesis space beyond the six hard-coded checks, (b) closing the offline-evaluation loop tighter, and (c) keeping the live ranking gated while the lab runs unstoppably underneath.
The danger to avoid: a harness that allows the loop to mutate the parts that score truth (the evaluator, the holdout). That's how recursive systems self-deceive. The evaluator runs at repo HEAD and is never mutated by the loop.
Every system surveyed is some specialization of:
generate(hypothesis)     →     evaluate(deterministic, walk-forward, holdout-clean)
        ▲                                  │
        │                                  ▼
archive (keep diverse) ◄────── score (multi-objective vector)
        │                                  │
        ▼                                  ▼
bandit (which arm gets budget?)    gate (auto / shadow / human)
The five things that change between systems are:
| # | System | One-line | What we steal |
|---|---|---|---|
| 1 | RD-Agent(Q) | Multi-agent quant R&D: hypothesis → spec → code-gen → backtest → knowledge-forest. 2× CSI300 returns w/ 70% fewer factors, <$10/run. | The structured hypothesis → task spec → code-gen → backtest → archive loop. Closest published analog to rainier's existing stack. Knowledge-forest dedup ≈ your ResearchInsight.recurrence_count. |
| 2 | FunSearch | Frozen LLM proposes code → deterministic evaluator scores → island GA evolves population. | Islands topology (your 6 insight kinds become 6 islands). Strict rule: only the deterministic evaluator promotes; LLM never grades itself. |
| 3 | AlphaEvolve | Multi-objective code evolution with meta-evolution of the search strategy itself. | Evolve scoring modules against a metric vector (return, IC, turnover, drawdown, regime-robustness, explanation quality), not raw return. |
| 4 | AI Scientist v2 | Agentic tree search: ideate → code → eval → write → reviewer-LLM → next cycle. | The "research card" artifact: one durable record per hypothesis (rationale, expected failure mode, evaluator output, lineage). Replaces freeform research notes. |
| 5 | Voyager | Ever-growing code-skill library + automatic curriculum + iterative prompting with execution feedback. | A skill library of interpretable Python signal primitives the LLM composes, instead of writing arbitrary code. Curriculum picks next gap from archive coverage. |
| 6 | Eureka | LLM writes reward code, GPU-sim scores it, evolutionary loop improves. Outperforms human-engineered rewards on 83% of 29 RL tasks. | The reward-function-as-evolvable-code idea. For rainier: the composite-score weighting is an evolvable Python module, not a static YAML. |
| 7 | Language Self-Play | Single LLM in two roles (Challenger ↔ Solver) generates increasingly hard tasks for itself. | Carefully: a challenger that picks adversarial regimes (drawdown windows, sector rotations, regime breaks) the solver must survive. Note the collapse-after-few-iterations warning: KL reg + self-reward needed. |
| 8 | MAP-Elites | Quality-diversity archive: behavioral feature space discretized into cells; keep best-so-far in each cell. | The archive shape. Cells indexed by (sector, holding_horizon, signal_density, regime_sensitivity, data_family). Research archive, not live allocator. |
| 9 | Alpha-GPT / AlphaAgent / QuantaAlpha | LLM-driven alpha-factor mining at scale. | Decay-resistance gate (AlphaAgent), trajectory-level self-evolution (QuantaAlpha: each run is a path, not a point), hierarchical RAG over screener history (Alpha-GPT). Your ScreenedStockRecord table is already the RAG corpus. |
| 10 | Anti-overfitting discipline | Deflated Sharpe, purged k-fold + embargo, walk-forward, untouchable holdout. QuantBench framing. | Non-negotiable. Every promotion clears Deflated-Sharpe ≥ threshold AND stable IC AND acceptable turnover AND multiple walk-forward folds. Otherwise the loop discovers noise on ~10K obs/year/signal. |
Both codex passes and I converge on this skeleton. Each layer below is annotated with which prior art it borrows from.
                ┌────────────────────────────────────┐
                │ SKILL LIBRARY (Voyager-style)      │
                │ ~30 interpretable Python prims     │
                │ composable by LLM via typed DSL    │
                └──────────────────┬─────────────────┘
                                   │
                ┌──────────────────▼─────────────────┐
                │ HYPOTHESIS GENERATOR               │
                │ 6 islands (one per insight kind)   │
                │ LLM composer + DEAP-style GP       │
                │ emits ResearchHypothesis cards     │
                │ (FunSearch + RD-Agent + AI Sci)    │
                └──────────────────┬─────────────────┘
                                   │
                ┌──────────────────▼─────────────────┐
                │ DETERMINISTIC EVALUATOR            │
                │ walk-forward + purged CV +         │
                │ embargo → Deflated Sharpe          │
                │ RUNS AT REPO HEAD; NEVER           │
                │ MUTATED BY THE LOOP                │
                │ (AlphaEvolve metric vector)        │
                └──────────────────┬─────────────────┘
                                   │
                ┌──────────────────▼─────────────────┐
                │ MAP-ELITES ARCHIVE                 │
                │ cells: sector × horizon ×          │
                │ signal_density × regime × family   │
                │ best champion per cell             │
                └──────────────────┬─────────────────┘
                                   │
                ┌──────────────────▼─────────────────┐
                │ THOMPSON BANDIT: research budget   │
                │ arms: islands, prompts, mutator    │
                │ families                           │
                │ reward: novelty × evaluator        │
                │ × shadow-survival                  │
                │ NOT capital allocation (yet)       │
                └──────────────────┬─────────────────┘
                                   │
           ┌───────────────────────┼───────────────────────┐
           │                       │                       │
┌──────────▼─────────┐  ┌──────────▼─────────┐  ┌──────────▼─────────┐
│ AUTO-APPLY tier    │  │ SHADOW-LIVE tier   │  │ HUMAN-GATE tier    │
│ docs, archive,     │  │ scorer runs in     │  │ live ranking,      │
│ rejected-memory    │  │ shadow alongside   │  │ sizing, exclusion; │
│                    │  │ production for     │  │ operator accept    │
│                    │  │ 4–8 weeks min      │  │ via research-card  │
│                    │  │                    │  │ UI                 │
└────────────────────┘  └────────────────────┘  └────────────────────┘
~30 small interpretable Python signal primitives written by the operator, each typed (price→float, price+volume→signal, rank→ordinal, etc.). The LLM composes them via a typed DSL; it does not write arbitrary code. New primitives are operator-added; the LLM can request a primitive but cannot install it.
Example primitives: rank_trajectory(N=5), capital_flow_streak(N), sector_relative_rank, volatility_compression(window), thesis_confidence, catalyst_proximity(days), regime_indicator, liquidity_floor(adv).
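A minimal sketch of how an operator-owned registry could enforce "compose, don't write code". Everything here is an assumption for illustration: the `Primitive` record, the `register` decorator, the `compose` spec format, the string type tags, and the bodies of the two primitives (only their names come from the list above).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence


@dataclass(frozen=True)
class Primitive:
    name: str
    input_type: str   # hypothetical type tag, e.g. "price" or "rank"
    output_type: str  # hypothetical type tag, e.g. "float" or "ordinal"
    fn: Callable[..., Any]


REGISTRY: Dict[str, Primitive] = {}


def register(name: str, input_type: str, output_type: str):
    """Operator-side decorator: the only way a primitive enters the registry."""
    def wrap(fn):
        REGISTRY[name] = Primitive(name, input_type, output_type, fn)
        return fn
    return wrap


@register("volatility_compression", "price", "float")
def volatility_compression(prices: Sequence[float], window: int = 5) -> float:
    """Range of the last `window` prices divided by the full-series range."""
    full_range = max(prices) - min(prices)
    if full_range == 0:
        return 0.0
    recent = prices[-window:]
    return (max(recent) - min(recent)) / full_range


@register("rank_trajectory", "rank", "ordinal")
def rank_trajectory(ranks: Sequence[int], N: int = 5) -> int:
    """Places gained over the last N ranks (positive means the rank improved)."""
    recent = ranks[-N:]
    return recent[0] - recent[-1]


def compose(spec: Dict[str, Any]) -> Callable[[Sequence], Any]:
    """Turn an LLM-proposed spec into a callable; reject anything unregistered."""
    prim = REGISTRY.get(spec["primitive"])
    if prim is None:
        raise KeyError(f"unregistered primitive requested: {spec['primitive']}")
    if spec["input_type"] != prim.input_type:
        raise TypeError(f"{prim.name} expects {prim.input_type}, got {spec['input_type']}")
    params = spec.get("params", {})
    return lambda data: prim.fn(data, **params)


# The LLM emits data (a spec), never code; the registry decides what runs.
signal = compose({"primitive": "rank_trajectory", "input_type": "rank", "params": {"N": 3}})
print(signal([10, 8, 7, 5, 2]))
```

Because the LLM only ever produces a spec, a request for a missing primitive fails with a `KeyError` that can be logged as a "primitive request" for the operator, which matches the request-but-not-install rule.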
Six islands, one per existing insight kind (signal_underperform, signal_overperform, verdict_drift, calibration_off, prompt_regression, new_pattern_discovered). Each island has its own population of ResearchHypothesis cards.
A card carries: rationale (LLM-written, why this might work), expected failure mode (the LLM names how it could be wrong), regime assumption, YAML/Python diff, evaluator metric prediction, lineage (which cards it descends from).
Generation modes: LLM composer (Claude composes from skill library), GP mutation (DEAP over signal expressions), and crossover (combine top-k from sibling cells).
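The card fields listed above map naturally onto a small immutable record. The sketch below is one possible shape, not the existing ResearchInsight schema: the field names, the `derive` helper, and the validation rules are assumptions.

```python
import uuid
from dataclasses import dataclass, field, replace
from typing import Dict, List

ISLANDS = (
    "signal_underperform", "signal_overperform", "verdict_drift",
    "calibration_off", "prompt_regression", "new_pattern_discovered",
)


@dataclass(frozen=True)
class ResearchHypothesis:
    island: str
    rationale: str                      # LLM-written: why this might work
    expected_failure_mode: str          # LLM-written: how it could be wrong
    regime_assumption: str
    diff: str                           # YAML/Python diff as text
    predicted_metrics: Dict[str, float]
    lineage: List[str] = field(default_factory=list)  # ancestor card ids, oldest first
    card_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def __post_init__(self):
        if self.island not in ISLANDS:
            raise ValueError(f"unknown island: {self.island}")
        if not self.expected_failure_mode.strip():
            raise ValueError("a card must name how it could be wrong")

    def derive(self, **changes) -> "ResearchHypothesis":
        """Create a descendant card with a fresh id and extended lineage."""
        return replace(
            self,
            lineage=self.lineage + [self.card_id],
            card_id=uuid.uuid4().hex,
            **changes,
        )


parent = ResearchHypothesis(
    island="signal_underperform",
    rationale="Compression before catalysts tends to precede expansion.",
    expected_failure_mode="Only holds in low-volatility regimes.",
    regime_assumption="low volatility",
    diff="+ volatility_compression(window=5)",
    predicted_metrics={"ic": 0.03},
)
child = parent.derive(diff="+ volatility_compression(window=10)")
```

Making the card frozen means a mutation or crossover always produces a new card, so the lineage stays an honest record of how a hypothesis was reached.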
The single load-bearing module. Walk-forward with purged k-fold + embargo. Returns a metric vector, not a scalar: return, IC, turnover, drawdown, regime-robustness, and explanation quality.
This module is implemented once, lives at repo HEAD, and the recursive loop never writes to it. Mutating the scorer is the path to self-deception.
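To make "purged k-fold + embargo" concrete, here is a sketch of the split logic only, on integer observation indices. It assumes each observation's label looks `label_horizon` steps ahead (for example, a 10d forward return), and the function name and parameters are hypothetical. It is not the existing walk_forward.py.

```python
from typing import Iterator, List, Tuple


def purged_kfold_indices(
    n_obs: int, n_folds: int, label_horizon: int, embargo: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for time-ordered observations.

    Observation i is assumed to carry a label computed over [i, i + label_horizon].
    Purging drops training rows whose label window overlaps any test label window;
    the embargo drops a further `embargo` rows after that on the late side.
    """
    fold_size = n_obs // n_folds
    for k in range(n_folds):
        test_start = k * fold_size
        # The last fold absorbs any remainder rows.
        test_end = n_obs if k == n_folds - 1 else test_start + fold_size
        test = list(range(test_start, test_end))
        train = [
            i for i in range(n_obs)
            # Before the test block: the label must resolve before the block starts.
            if i + label_horizon < test_start
            # After the test block: skip the test labels' reach, then the embargo.
            or i >= test_end + label_horizon + embargo
        ]
        yield train, test


folds = list(purged_kfold_indices(n_obs=20, n_folds=4, label_horizon=2, embargo=1))
for train, test in folds:
    print(f"test {test[0]}-{test[-1]}: {len(train)} training rows")
```

The point of the purge is that no training label shares a single day of return with a test label; without it, forward-return targets leak across the fold boundary and inflate every metric in the vector.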
Cells indexed by behavioral dims. Initial proposal (operator can shape): sector, holding_horizon, signal_density, regime_sensitivity, data_family.
Each cell keeps the best k hypotheses by risk-adjusted, deflated-Sharpe score. Empty cells become curriculum targets.
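A minimal in-memory sketch of the "best k per cell" rule and the empty-cell curriculum. The `Archive` class, its method names, and the example cell values are assumptions; `score` stands in for the risk-adjusted, deflated-Sharpe score produced by the evaluator.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

# (sector, holding_horizon, signal_density, regime_sensitivity, data_family)
Cell = Tuple[str, str, str, str, str]


class Entry(NamedTuple):
    card_id: str
    score: float


class Archive:
    def __init__(self, k: int = 3):
        self.k = k
        self.cells: Dict[Cell, List[Entry]] = {}

    def insert(self, cell: Cell, card_id: str, score: float) -> bool:
        """Add a card to its cell; return True if it survived the top-k cut."""
        bucket = self.cells.setdefault(cell, [])
        bucket.append(Entry(card_id, score))
        bucket.sort(key=lambda e: e.score, reverse=True)
        del bucket[self.k:]  # keep only the best k entries
        return any(e.card_id == card_id for e in bucket)

    def empty_cells(self, all_cells: Iterable[Cell]) -> List[Cell]:
        """Cells with no champion yet; these become curriculum targets."""
        return [c for c in all_cells if not self.cells.get(c)]


archive = Archive(k=2)
tech_5d = ("tech", "5d", "sparse", "low", "price")
energy_10d = ("energy", "10d", "dense", "high", "flow")
for card_id, score in [("h1", 0.5), ("h2", 0.9), ("h3", 0.7)]:
    archive.insert(tech_5d, card_id, score)
print(archive.cells[tech_5d])
print(archive.empty_cells([tech_5d, energy_10d]))
```

Competition is local to a cell, so a mediocre hypothesis in an unexplored cell is kept while a slightly better duplicate in a crowded cell is dropped. That is the diversity pressure the archive exists to provide.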
Allocates the research budget (LLM calls, GP generations) across islands, prompts, and mutator families.
Reward = novelty (distance from archive) × evaluator quality × shadow-survival rate. Not raw return alone (overfit attractor).
Bandit allocates research compute, never live capital. v1.x can lift to capital allocation only after the shadow tier has produced ~12 months of clean signal.
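One way to realize this is Beta-posterior Thompson sampling with a fractional reward. The sketch assumes each of the three reward factors is already scaled to [0, 1] and uses the common "add the reward to alpha, the remainder to beta" update for non-binary rewards. The class and method names are hypothetical.

```python
import random
from typing import Dict, Iterable, List


class ThompsonBandit:
    def __init__(self, arms: Iterable[str], seed: int = 0):
        self.rng = random.Random(seed)
        # One Beta(alpha, beta) posterior per arm, starting from a uniform prior.
        self.posteriors: Dict[str, List[float]] = {arm: [1.0, 1.0] for arm in arms}

    def pick(self) -> str:
        """Sample each posterior once and return the arm with the highest draw."""
        draws = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.posteriors.items()}
        return max(draws, key=draws.get)

    def update(self, arm: str, novelty: float, evaluator_quality: float,
               shadow_survival: float) -> None:
        """Reward is the product of the three factors, each assumed to lie in [0, 1]."""
        reward = novelty * evaluator_quality * shadow_survival
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward factors must each lie in [0, 1]")
        self.posteriors[arm][0] += reward
        self.posteriors[arm][1] += 1.0 - reward

    def allocate(self, budget: int) -> Dict[str, int]:
        """Split a research budget (e.g. LLM calls this week) across arms."""
        counts = {arm: 0 for arm in self.posteriors}
        for _ in range(budget):
            counts[self.pick()] += 1
        return counts


bandit = ThompsonBandit(["signal_underperform", "verdict_drift"])
for _ in range(50):
    bandit.update("signal_underperform", novelty=1.0, evaluator_quality=0.9, shadow_survival=1.0)
    bandit.update("verdict_drift", novelty=1.0, evaluator_quality=0.1, shadow_survival=1.0)
print(bandit.allocate(100))
```

Because the reward is a product, an arm that produces high-scoring but non-novel hypotheses, or ones that die in shadow, earns close to nothing. That keeps the budget away from the overfit attractor.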
| Tier | What auto-applies | Promotion gate |
|---|---|---|
| Auto-apply | archive updates, rejected-hypothesis memory, shadow-scorer creation, research-card writes, dashboard re-render | none (these don't touch live) |
| Auto-shadow | any hypothesis passing offline evaluator sanity checks runs in shadow alongside production | passes deterministic evaluator with non-trivial lift |
| Human-gate | anything affecting live ranking, sizing, exclusion logic, sector exposure, prompt content | operator click in research-card UI (Discord button / dashboard) |
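The tier table can be read as a routing function over what a change touches. The sketch below is an assumption about how that might look: the target names are invented labels for the table's entries, `candidate_scorer` is a hypothetical label for a hypothesis that wants to run in shadow, and unknown targets are routed to the human gate so the router fails closed.

```python
from typing import Iterable

AUTO_APPLY = {
    "archive_update", "rejected_memory", "shadow_scorer_creation",
    "research_card_write", "dashboard_render",
}
HUMAN_GATE = {
    "live_ranking", "sizing", "exclusion_logic", "sector_exposure", "prompt_content",
}
SHADOW_CANDIDATE = "candidate_scorer"  # hypothetical label for a shadow-eligible scorer


def classify(targets: Iterable[str], passed_offline_evaluator: bool = False) -> str:
    """Route a proposed change to a tier based on what it touches."""
    touched = set(targets)
    unknown = touched - AUTO_APPLY - HUMAN_GATE - {SHADOW_CANDIDATE}
    # Anything live-affecting, or anything we cannot classify, needs a human.
    if touched & HUMAN_GATE or unknown:
        return "human-gate"
    # A candidate scorer only reaches shadow after clearing the offline evaluator.
    if SHADOW_CANDIDATE in touched:
        return "auto-shadow" if passed_offline_evaluator else "rejected"
    return "auto-apply"


print(classify(["archive_update", "research_card_write"]))
print(classify(["candidate_scorer"], passed_offline_evaluator=True))
print(classify(["candidate_scorer", "sizing"], passed_offline_evaluator=True))
```

Checking the human-gate set first means a change that bundles a harmless archive write with a sizing tweak is still held for the operator.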
Promotion requires: ≥3 walk-forward passes clean, Deflated Sharpe ≥ threshold, stable IC across folds, acceptable turnover, no obvious regime dependency (unless explicitly labeled), AND ≥4 weeks of shadow with realized lift above noise.
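The promotion rule is a conjunction, so it is easiest to audit as a function that returns every failed condition. All threshold values below are placeholders (the document leaves them to the operator), "stable IC" is simplified to "positive in every fold", and the `Evidence` fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

# Placeholder thresholds; the real values are an operator decision.
MIN_CLEAN_PASSES = 3
MIN_DEFLATED_SHARPE = 0.95
MAX_TURNOVER = 0.5
MIN_SHADOW_WEEKS = 4


@dataclass
class Evidence:
    clean_walk_forward_passes: int
    deflated_sharpe: float
    ic_by_fold: List[float]
    turnover: float
    regime_dependent: bool
    regime_labeled: bool
    shadow_weeks: int
    shadow_lift: float
    shadow_noise: float  # size of lift that noise alone could explain


def promotion_blockers(e: Evidence) -> List[str]:
    """Return every unmet condition; an empty list means eligible for human review."""
    blockers = []
    if e.clean_walk_forward_passes < MIN_CLEAN_PASSES:
        blockers.append("fewer than 3 clean walk-forward passes")
    if e.deflated_sharpe < MIN_DEFLATED_SHARPE:
        blockers.append("deflated Sharpe below threshold")
    if not e.ic_by_fold or min(e.ic_by_fold) <= 0:
        blockers.append("IC not stable across folds")
    if e.turnover > MAX_TURNOVER:
        blockers.append("turnover too high")
    if e.regime_dependent and not e.regime_labeled:
        blockers.append("unlabeled regime dependency")
    if e.shadow_weeks < MIN_SHADOW_WEEKS:
        blockers.append("shadow tenure under 4 weeks")
    if e.shadow_lift <= e.shadow_noise:
        blockers.append("shadow lift not above noise")
    return blockers


candidate = Evidence(
    clean_walk_forward_passes=4, deflated_sharpe=0.97, ic_by_fold=[0.02, 0.03, 0.025],
    turnover=0.3, regime_dependent=False, regime_labeled=False,
    shadow_weeks=3, shadow_lift=0.012, shadow_noise=0.005,
)
print(promotion_blockers(candidate))
```

Returning the full list of blockers, not just the first failure, gives the research card something concrete to show the operator about why a hypothesis is still waiting.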
| Cycle | What runs |
|---|---|
| Daily | Score candidates (live + shadow scorers). Log live↔shadow deltas. Update bandit rewards from matured 1d/5d/10d returns. Heartbeat the loop. |
| Weekly | Generate new hypotheses via the bandit-allocated budget. Run offline evaluator. Update MAP-Elites archive. Promote/reject within tier rules. |
| Monthly | Promotion review queue — operator-facing batch of human-gate candidates. Stale hypothesis sweep (mark stale, archive-rejected). Holdout audit (re-confirm the holdout was untouched). |
| Event | Regime-change detector → pauses promotions, never auto-changes live behavior, raises hand. |
One operator-facing dashboard (Streamlit, building on the existing one), extended with a lineage tree, an archive heatmap, and a bandit allocation log.
Replaces ad-hoc YAML mutators with composable typed primitives. Highest expressivity gain per LOC. The LLM stops writing arbitrary code and starts composing operator-vetted Lego blocks.
The single biggest defense against noise discovery on small N. Islands force structural diversity; Deflated Sharpe blocks lucky outliers from promotion.
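For reference, the Deflated Sharpe Ratio of Bailey and López de Prado can be computed with the standard library alone. The sketch assumes a per-period (non-annualized) Sharpe ratio, that `sharpe_variance` is the variance of Sharpe estimates across all trials, and that `n_trials` counts every hypothesis tried. The function names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int, sharpe_variance: float) -> float:
    """Expected best Sharpe among n_trials strategies that all have zero true skill."""
    if n_trials < 2:
        return 0.0  # a single trial has no selection bias to deflate
    z = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(sharpe: float, n_obs: int, n_trials: int, sharpe_variance: float,
                    skew: float = 0.0, kurtosis: float = 3.0) -> float:
    """Probability that the true Sharpe exceeds the luck-adjusted benchmark.

    `kurtosis` is raw kurtosis (3.0 for normal returns), not excess kurtosis.
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    return NormalDist().cdf((sharpe - benchmark) * math.sqrt(n_obs - 1) / denom)


# Same observed Sharpe, very different verdicts once the number of trials is counted.
few = deflated_sharpe(sharpe=0.1, n_obs=250, n_trials=2, sharpe_variance=0.01)
many = deflated_sharpe(sharpe=0.1, n_obs=250, n_trials=100, sharpe_variance=0.01)
print(f"2 trials: {few:.3f}   100 trials: {many:.3f}")
```

This is why the loop has to log every hypothesis it evaluates, including rejected ones: the deflation depends on the true number of trials, and under-counting them quietly re-admits the lucky outliers.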
Anything passing offline gates runs in shadow before touching live ranks. Cheap reality-check signal. 4–8 weeks minimum shadow tenure before human-gate.
One artifact per hypothesis. Operator clicks accept/reject in Discord or the dashboard. No new surface — extend the ResearchInsight table you already have. Card carries lineage, evaluator output, expected-failure-mode, regime assumption.
These are the load-bearing decisions. Each has a stake-in-the-ground answer; argue with it.
Stake: Never. Anything affecting live ranks is human-gate. Pure RSI visions (the operator's link) say go further, but the cost of being wrong on a small-N quant system is months of bad signal. Auto-shadow + monthly human-gate is the rainier-shaped compromise.
Tension: If you want "truly recursive / unstoppable," human-gate is the bottleneck. Counter: human-gate is monthly, not per-hypothesis — the loop runs unstoppably underneath; only promotion pauses.
Stake: A frozen date-cut — the latest 90 trading days (rolling, advanced quarterly) is untouchable. Enforced by an evaluator that refuses any data with scan_date > HOLDOUT_CUT. Promotion decisions therefore lag the holdout by 3 months.
Cost: ~3 months of latency on promotion decisions. Acceptable if you believe in not lying to yourself.
Tension: When the holdout cut advances, last quarter's "live" data becomes evaluator-fair-game. Need to design the advance ritual so the loop doesn't immediately p-hack the newly-released window.
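A sketch of the holdout guard described in the stake above, assuming the evaluator receives the `scan_date` of every row it is about to score and that a sorted trading calendar is available. `holdout_cut`, `assert_holdout_clean`, and `HoldoutViolation` are hypothetical names, and the demo calendar treats every calendar day as a trading day for brevity.

```python
from datetime import date, timedelta
from typing import Iterable, List


class HoldoutViolation(Exception):
    """Raised when evaluation data reaches into the untouchable window."""


def holdout_cut(trading_days: List[date], holdout_len: int = 90) -> date:
    """Last date the evaluator may see: the day before the latest holdout_len days."""
    if len(trading_days) <= holdout_len:
        raise ValueError("not enough history to carve out a holdout")
    return sorted(trading_days)[-(holdout_len + 1)]


def assert_holdout_clean(scan_dates: Iterable[date], cut: date) -> None:
    """Refuse the whole batch if any row has scan_date > cut."""
    leaked = [d for d in scan_dates if d > cut]
    if leaked:
        raise HoldoutViolation(f"{len(leaked)} rows after {cut}, earliest {min(leaked)}")


# Demo calendar: 100 consecutive days, each treated as a trading day (assumption).
calendar = [date(2026, 1, 1) + timedelta(days=i) for i in range(100)]
HOLDOUT_CUT = holdout_cut(calendar)
assert_holdout_clean(calendar[:10], HOLDOUT_CUT)  # passes silently
print(HOLDOUT_CUT)
```

Failing the whole batch, as opposed to silently filtering the offending rows, makes a leak visible in logs and in the monthly holdout audit.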
Stake: ~$50/month LLM spend for hypothesis generation; ~$10/week for full archive sweep + walk-forward evaluation. Six islands × ~20 hypotheses/island/week × ~$0.05/hypothesis = $6/week LLM cost. Evaluator compute is dominant — that's local CPU, free.
Operator decides: ceiling on monthly LLM spend. Determines island count + generation depth + how aggressive the bandit can be.
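The budget arithmetic above is simple enough to keep as a small calculator, which also shows how a spend ceiling translates into generation depth. The per-hypothesis cost and volumes are the document's estimates; the function names and the 52/12 weeks-per-month conversion are assumptions.

```python
WEEKS_PER_MONTH = 52 / 12


def weekly_llm_cost(islands: int, hypotheses_per_island: int, cost_per_hypothesis: float) -> float:
    """Weekly hypothesis-generation spend in dollars."""
    return islands * hypotheses_per_island * cost_per_hypothesis


def max_hypotheses_per_island(monthly_ceiling: float, islands: int,
                              cost_per_hypothesis: float) -> int:
    """Largest weekly per-island hypothesis count that stays under the ceiling."""
    weekly_budget = monthly_ceiling / WEEKS_PER_MONTH
    return int(weekly_budget / (islands * cost_per_hypothesis))


weekly = weekly_llm_cost(islands=6, hypotheses_per_island=20, cost_per_hypothesis=0.05)
monthly = weekly * WEEKS_PER_MONTH
print(f"weekly ${weekly:.2f}, monthly ${monthly:.2f}")
print(max_hypotheses_per_island(monthly_ceiling=50, islands=6, cost_per_hypothesis=0.05))
```

At the stated estimates the generation spend is about $26/month, so a $50 ceiling leaves room for roughly 38 hypotheses per island per week before the ceiling binds.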
Stake (per your "ship basic first" memory): The smallest end-to-end slice is: (a) add the skill-library DSL with 6–8 starting primitives, (b) wire one Claude composer that proposes hypotheses against one island (start with signal_underperform since it has the most existing evaluator data), (c) route output to the existing ResearchInsight flow with the auto-shadow tier (shadow scorer alongside production composite_score), (d) add Deflated-Sharpe to the evaluator. No MAP-Elites yet, no bandit yet, no GP yet. All those land iteratively while the basic loop is in use.
Tension: A bandit-less, single-island system is barely recursive. Counter: it's the end-to-end slice that proves the pipes; full islands + MAP-Elites + bandit lands in v0.2 once we've watched v0.1 misbehave.
Stake: A new module src/rainier/lab/ (separate from llm_thesis/research.py which keeps the existing weekly auto-research). lab/ contains: skills/ (primitives), islands/ (per-island generators), evaluator.py (walk-forward + deflated-Sharpe), archive.py (MAP-Elites), bandit.py, shadow.py. The existing research.py stays as the "v1" recommend-only flow; lab/ is the "v2" recursive flow.
Alternative: Refactor research.py into the new module rather than parallel-track. Argument against: parallel track lets the existing pipeline keep running while we build, and we can deprecate once the new one earns its trust.
Stake interpretation: The loop runs continuously in the sense that it generates + evaluates + archives without operator input. It pauses only for: (a) human-gate promotion decisions (monthly), (b) regime-change events (raise hand, don't auto-react), (c) evaluator failures (the load-bearing module errored), (d) explicit operator stop. It does not mean: full RSI, self-modifying evaluator, or any path where the system can grade its own truth.
Question to confirm: Are we on the same page about what "unstoppable" means?
The honest read is that you've already built ~60% of this. The recursive evolution mostly rearranges existing parts and adds three new ones.
| Recursive-system component | Today in rainier | Gap |
|---|---|---|
| Hypothesis archive | ResearchInsight table with kind/severity/evidence/action/rationale, recurrence_count dedup, status enum | Add lineage + behavioral-cell index (for MAP-Elites) |
| Outcome ground truth | ThesisEvaluation with 1d/5d/10d forward returns, nightly backfill at 17:00 PT | Add purged-CV slice index, embargo flag, holdout-cut flag |
| Generator | llm_thesis/research.py with 6 hard-coded check classes | Replace with LLM composer over skill library; check classes become islands |
| Evaluator | Mann-Whitney U on used-vs-absent forward returns; walk_forward.py exists | Add Deflated Sharpe, purged k-fold w/ embargo, multi-metric vector return |
| Mutator | YAML auto-mutators (ruamel.yaml + atomic write) gated by human accept/reject | Extend to scorer-as-code mutations; add tier classification |
| Bandit | — | New; Thompson sampling over islands/prompts/mutators |
| MAP-Elites archive | — | New; table or in-memory keyed by behavioral cells |
| Shadow tier | — | New; shadow_composite_score column already in ScreenedStockRecord, repurpose |
| Skill library | Implicit in signals/, features/, llm_thesis/signals/ | Formalize as typed DSL; LLM composes from registry |
| Observability | Streamlit dashboard, Discord embeds, eval report | Add lineage tree + archive heatmap + bandit allocation log |
llm_thesis/research.py in one shot; parallel track, deprecate later.
| Date | Decision | Rationale | From |
|---|---|---|---|
| 2026-05-18 | This HTML file is the design hub for the recursive research system. Iterate here; tasks get extracted from it later. | Operator preference for "easy to link, easy to modify" single artifact over scattered markdown notes. | operator |
shadow_composite_score column already exists in ScreenedStockRecord. Reuse or new table?
research.py already tracks prompt_version. How does the bandit treat prompt variants: as arms or as a separate axis?