origin/main: shadow trading, paper_trade.shadow, paper/replay.py, regime tag) · PR base: main

Answer YES/NO per row. Everything below is context.

| # | Decision | Default | Cost | Risk | Consequence if NO |
|---|---|---|---|---|---|
| 1 | WS1: replay the live pattern layer over 1yr of stock_prices for a pattern forward-return audit (full-composite replay deferred to WS3, which needs an as-of money-flow selector) | YES | 1 PR | Med | Keep tuning blind; no ground truth for any weight |
| 2 | Stage WS2–WS4 behind WS1 (spec the tuning only after we see the numbers) | YES | — | Low | Spec weight changes we can't justify |
| 3 | Every model change ships as an A/B experiment (champion vs challenger), promote on a measured win | YES | per-change | Low | Direct live flips — the failure mode we're fixing |
| 4 | Build a champion/challenger model-config system (champion.yaml + history + results registry) layered over today's config | YES | Small | Low | No way to A/B, auto-tune, version, or roll back a weight set |
| 5 | Wait-and-accrue live data instead of replaying the backfill | NO | 0 | High | Tuning blocked for months |

stock_prices. Replaying the live pattern layer across it re-derives a large pattern corpus with forward returns, which is the measurement we lack. (Money-flow history also exists, ~1,400 days, but the live money-flow selector is latest-only, with no as-of variant, so full 3-layer composite replay is deferred to WS3; the 1-year audit is pattern-layer only.)

settings.yaml; the new piece is a champion/challenger model-config system that makes a weight set comparable, auto-writable, versioned, and revertible.

The loop just went live (#143 added shadow trading + the reclaim audit). Right now it ranks stocks by patterns whose track record is unknown, and the first portfolio review already caught those patterns pointing the wrong way in a rising market. Every day we run on guessed weights, we accumulate decisions we can't defend or improve. The backfill needed for measurement already exists, no detector change is required, and the shadow rails for paper trading landed in the last PR, so the cost is as low as it will ever be.

Already shipped (verified on origin/main)
- Daily screen tags each QU100 stock with a chart pattern + confidence (detect_patterns → _filter_actionable → best_pattern), ranks by a 3-layer composite, sends the top 5 to the LLM.
- Pattern/layer weights + thresholds live in StockScreenerConfig and are already overridable from config/settings.yaml (stock_screener: block).
- One year of daily OHLC for the QU100 universe in Postgres stock_prices (legacy local TimescaleDB).
- paper_trade.shadow + shadow isolation, paper/replay.py, and a regime helper (compute_market_regime, SPY vs 200-SMA) — #143.
Missing / broken
- No measurement of pattern predictiveness. The per-pattern confidence weights and the 3-layer split were never validated against outcomes.
- No replay path that matches the live ranker — the only historical engine (qu100_portfolio) uses a different ranking (2 patterns, top-20, confidence-only, hardcoded config), so it can't measure what production actually does.
- The #143 shadow path is a WATCH-buy gate keyed on the LLM verdict, not a second screener-config run — so it is not yet an A/B harness for config variants.
- No champion/challenger model-config system: a weight set can't be A/B-compared, auto-written, versioned, or rolled back.
- Symptom already observed: top patterns were bearish setups in an uptrend — wrong-signed — and the loop bought nothing while the rally ran.
champion.yaml is seeded byte-identical to today's effective config).

stock_prices year.

1yr daily OHLC in Postgres stock_prices (legacy local TimescaleDB)
│ replay the LIVE ranker as-of each day t (no look-ahead):
│ detect_patterns → _filter_actionable → best_pattern → 3-layer composite
▼
pattern corpus: per actionable emission → pattern_type, confidence,
composite contribution, entry/stop/target, regime tag,
forward return at 5 / 10 / 20 trading days
│
▼
per-pattern table: n · win-rate · mean/median fwd return · directional-correctness
│ (grouped by pattern × regime × horizon)
│
├──► WS2 recalibrate per-pattern confidence weights  ─┐
├──► WS3 rebalance the 3 layer weights               ─┼─ each an A/B experiment
└──► WS4 show the LLM each pattern's measured record ─┘  (champion vs challenger,
                                                          promote on a measured win)
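
The forward-return labels in the corpus above can be sketched as follows. This is a minimal illustration, not the project's replay code: it applies the document's formula `close[t+H]/close[t] − 1` at the 5 / 10 / 20-day horizons and returns `None` (null, never 0) when the horizon runs past the end of the data. The names `forward_return` and `label_emission` are hypothetical.

```python
from typing import Dict, Optional, Sequence

HORIZONS = (5, 10, 20)  # trading days, as listed in the corpus definition


def forward_return(closes: Sequence[float], t: int, horizon: int) -> Optional[float]:
    """Return close[t+H]/close[t] - 1, or None if t+H is outside the window."""
    if t < 0 or t + horizon >= len(closes):
        return None  # outcome unknown: record null, never 0
    return closes[t + horizon] / closes[t] - 1.0


def label_emission(closes: Sequence[float], t: int) -> Dict[int, Optional[float]]:
    """Forward returns for one emission at bar index t, keyed by horizon."""
    return {h: forward_return(closes, t, h) for h in HORIZONS}


if __name__ == "__main__":
    closes = [100.0 + i for i in range(12)]  # 12 bars of synthetic closes
    print(label_emission(closes, 0))  # the 20-day horizon is None here
```

Keeping the null distinct from 0 matters for the per-pattern table: a missing outcome must not be averaged in as a flat trade.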
What changes (WS1, the concrete deliverable): a faithful historical replay of
the live pattern layer (net-new — the existing qu100_portfolio engine
diverges; full 3-layer composite replay is deferred to WS3, which needs an as-of
money-flow selector — WS1's composite parity is verified on recent dates via the
live latest-snapshot path), a regenerable corpus artifact, the champion.yaml
config system, and a human report (REPORT-qu100-pattern-hit-rate). No weight
values change; behavior is preserved.
What WS2–WS4 will change (framed, spec'd after the numbers): values of existing config — per-pattern confidence weights, the three layer weights, the LLM thesis prompt — each rolled out as champion-vs-challenger, never a direct edit.
There are two sets of weights, both hand-set today (in StockScreenerConfig,
overridable in settings.yaml):
false_breakout = 1.0, bull_flag = 0.75, …).

We do not hand-pick new numbers. We derive them from WS1's measured expectancy table by a fixed, documented rule, so every weight traces to evidence.
Per-pattern weight — from each pattern's measured edge:
for each pattern P (optionally per regime):
edge(P) = risk-adjusted forward return ── mean_fwd_return / volatility
at the chosen horizon (from WS1)
weight(P) = min-max normalize edge over observed patterns → [0, 1], then clamp
if n(P) < floor → keep the CURRENT weight (never overfit a handful of samples)
Caveat the implementer must respect: the per-pattern weight is only 35% of
score_pattern's confidence (the rest is volume/clarity/R:R/status). So setting a
weight to 0 does not drop a bad pattern's contribution to 0 — true suppression
needs a separate exclusion/threshold, not weight 0. Weights may also be made
regime-conditional.
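
The per-pattern rule above can be sketched in a few lines. This is an illustration under assumptions, not the WS2 implementation: volatility is taken as the population standard deviation of forward returns, the floor defaults to the placeholder n ≥ 30, and `derive_pattern_weights` is a hypothetical name. Patterns below the floor (or with undefined edge) keep their current weight.

```python
from statistics import mean, pstdev
from typing import Dict, List


def derive_pattern_weights(
    fwd_returns: Dict[str, List[float]],
    current_weights: Dict[str, float],
    n_floor: int = 30,
) -> Dict[str, float]:
    """Min-max normalize edge = mean/volatility; thin patterns keep current weight."""
    edges: Dict[str, float] = {}
    for pattern, rets in fwd_returns.items():
        if len(rets) < n_floor:
            continue  # too few samples: never overfit, keep the current weight
        vol = pstdev(rets)
        if vol == 0:
            continue  # edge undefined: keep the current weight
        edges[pattern] = mean(rets) / vol

    weights = dict(current_weights)
    if len(edges) < 2:
        return weights  # nothing to normalize against
    lo, hi = min(edges.values()), max(edges.values())
    if hi == lo:
        return weights
    for pattern, edge in edges.items():
        scaled = (edge - lo) / (hi - lo)
        weights[pattern] = min(1.0, max(0.0, scaled))  # clamp to [0, 1]
    return weights
```

Note that min-max normalization assigns weight 0 to the weakest measured pattern. Per the caveat above, that still leaves the other 65% of score_pattern's confidence in play, so suppression needs its own exclusion rule.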
Layer weight — measured, then swept:
diagnostic: IC(L) = correlation( layer L's score , forward return ) over the corpus
→ tells us WHICH layer predicts (money_flow vs pattern vs sector)
method: grid-sweep the three weights (sum=1) over the faithful replay and pick
the split with the best risk-adjusted return on a HELD-OUT slice
IC is a diagnostic only — the three layer scores aren't on comparable scales, so
"weights ∝ IC" would mis-blend. The grid-sweep on held-out data is the actual
weight-setting method (precedent: backtest/sweep.py). This is the lever that
answers "is pattern's 0.65 too high?" — if money-flow predicts as well, the
sweep shifts weight to it and the strong-flow names stop getting buried.
The output of WS2/WS3 is a challenger champion.yaml, every number justified
by the audit.
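
As a rough illustration of the layer-weight method, the sketch below computes a per-layer IC (plain Pearson correlation) as a diagnostic and grid-sweeps the three weights over the simplex. Everything here is an assumption for illustration: the row layout, the function names, the top-5 basket, and the use of mean/volatility of daily basket returns as the risk-adjusted metric. It is not backtest/sweep.py.

```python
from statistics import mean, pstdev
from typing import Iterator, List, Sequence, Tuple

# One candidate stock on one day: (money_flow, sector, pattern, fwd_return)
Row = Tuple[float, float, float, float]
Weights = Tuple[float, float, float]


def information_coefficient(scores: Sequence[float], fwd: Sequence[float]) -> float:
    """Pearson correlation between a layer's score and the forward return."""
    mx, my = mean(scores), mean(fwd)
    cov = sum((x - mx) * (y - my) for x, y in zip(scores, fwd))
    sx = sum((x - mx) ** 2 for x in scores) ** 0.5
    sy = sum((y - my) ** 2 for y in fwd) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def simplex_grid(steps: int = 20) -> Iterator[Weights]:
    """Yield every (money_flow, sector, pattern) split on the grid with sum = 1."""
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            yield (i / steps, j / steps, (steps - i - j) / steps)


def basket_return(day: Sequence[Row], weights: Weights, top_k: int) -> float:
    """Mean forward return of the top_k rows ranked by the weighted composite."""
    ranked = sorted(
        day, key=lambda r: sum(w * s for w, s in zip(weights, r[:3])), reverse=True
    )
    return mean(r[3] for r in ranked[:top_k])


def risk_adjusted(days: Sequence[Sequence[Row]], weights: Weights, top_k: int = 5) -> float:
    """Mean / volatility of daily basket returns; 0.0 when it is undefined."""
    daily: List[float] = [basket_return(d, weights, top_k) for d in days if d]
    if len(daily) < 2:
        return 0.0
    vol = pstdev(daily)
    return mean(daily) / vol if vol else 0.0


def sweep(held_out_days: Sequence[Sequence[Row]], steps: int = 20, top_k: int = 5) -> Weights:
    """Pick the split with the best risk-adjusted return on the held-out slice."""
    return max(simplex_grid(steps), key=lambda w: risk_adjusted(held_out_days, w, top_k))
```

The IC function is only for reading which layer carries signal. The weights come from `sweep`, because the composite ranks on blended scores whose scales differ across layers.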
Rule: champion (current, live) keeps running; challenger (the change) runs beside it; promote only on a measured win. Two distinct modes:
Mode 1 — Offline (backtest) A/B — feasible once WS1's replay exists.
Instantiate two StockScreenerConfigs, run the WS1 replay twice over the corpus,
compare the selected baskets' forward returns. Zero LLM cost, immediate, picks the
best candidate. Weakness: can overfit. Scope note: pattern-weight A/B (WS2)
runs over the full 1-year pattern corpus; layer-rebalance A/B (WS3) needs the
money-flow layer replayed as-of date, which requires building the as-of
money-flow selector first (the history exists; the selector does not).
Mode 2 — Live shadow A/B — net-new wiring (NOT a freebie from #143).
The #143 shadow path does not run a second screener config. Live shadow A/B needs new plumbing: the daily
pipeline (which today threads a single Settings end-to-end via
pipeline/post_scrape.py) must run the challenger config too and tag its outputs
with the config version. The shadow challenger is measured on screener-rank /
forward-return only — no LLM thesis spend (it never drives top-5 LLM calls, so it
can't blow max_usd_per_scan).
same daily QU100 input
│
┌───────────────────┴───────────────────┐
champion config challenger config(s)
(live: opens REAL (shadow: tagged rows,
paper positions) isolated — never the live book, #143)
│ │
champion outcomes challenger outcomes
└───────────────► compare on metric ◄─────┘
risk-adjusted return · hit-rate · expectancy
│
challenger wins by a margin AND has enough n
│
operator promotes → challenger becomes champion
(loses / inconclusive → discard; champion untouched)
Promotion gates (placeholders, finalized in the WS2 plan): per-pattern cell
n ≥ 30; challenger must beat champion by a stated margin on the held-out window;
the live-shadow arm needs a multi-month forward window to conclude — so WS2–WS4
promote on offline A/B plus an accruing forward-shadow check, not instantly.
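
A promotion gate along these lines could be expressed as a small pure function. This is a sketch with placeholder values, since the gates are only finalized in the WS2 plan: the `margin` default and the names `ArmResult` and `promotion_decision` are hypothetical, and the function only reports eligibility because the operator performs the actual promotion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    """Outcome of one arm (champion or challenger) on the held-out window."""
    risk_adj_return: float
    min_cell_n: int  # smallest per-pattern cell sample size in the arm


def promotion_decision(
    champion: ArmResult,
    challenger: ArmResult,
    margin: float = 0.05,  # placeholder margin, not a decided value
    min_n: int = 30,       # placeholder floor from the gates above
) -> str:
    """Return 'eligible', 'inconclusive', or 'discard'; never promotes by itself."""
    if challenger.min_cell_n < min_n:
        return "inconclusive"  # not enough samples to conclude
    if challenger.risk_adj_return >= champion.risk_adj_return + margin:
        return "eligible"      # operator may promote
    return "discard"           # champion stays untouched


if __name__ == "__main__":
    champ = ArmResult(risk_adj_return=0.30, min_cell_n=120)
    chall = ArmResult(risk_adj_return=0.42, min_cell_n=45)
    print(promotion_decision(champ, chall))
```

In both non-eligible outcomes the champion is left alone, which matches the "loses / inconclusive → discard" branch of the diagram.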
The weights are already file-backed (settings.yaml:stock_screener) — so the new
value is not "move out of code," it's a system that makes a weight set
A/B-comparable, auto-writable, versioned, and revertible. One YAML is a model:
# config/model/champion.yaml — the LIVE model. One file, everything tunable.
version: 3
parent: 2 # which config this was derived from
created: 2026-06-20
note: "WS2 recalibration from pattern audit 2026-06-18"
score: # how it scored when promoted (for tracking)
window: "2025-06..2026-06"
risk_adj_return: 0.42
hit_rate: 0.58
# flat StockScreenerConfig field names (match settings.yaml:stock_screener) so the
# loader deep-merges directly; pattern_weights is the one nested dict field.
layer_weight_money_flow: 0.35
layer_weight_sector: 0.10
layer_weight_pattern: 0.55
pattern_weights: {false_breakout: 0.90, bull_flag: 0.70} # ...
neckline_tolerance_pct: 0.03
volume_breakout_multiplier: 1.5
strong_buy_threshold: 0.80
buy_threshold: 0.65
watch_threshold: 0.50
Precedence (must be explicit): champion.yaml > settings.yaml:stock_screener > StockScreenerConfig code defaults. The loader populates settings.stock_screener from champion.yaml, and must hook load_settings (which both get_settings and the scheduler's hot-reload load_settings_fresh delegate to); otherwise a promotion wouldn't take effect without a daemon restart.
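
The precedence rule can be illustrated with a small deep-merge over plain dicts. This is a sketch only: it assumes the YAML files are already parsed into dictionaries, that the metadata keys in champion.yaml (`version`, `parent`, `created`, `note`, `score`) are not screener fields, and the names `deep_merge` and `effective_screener_config` are hypothetical rather than the real loader.

```python
from copy import deepcopy
from typing import Any, Dict, Mapping

# Keys in champion.yaml that describe the model rather than tune the screener
# (an assumption based on the sample file).
METADATA_KEYS = {"version", "parent", "created", "note", "score"}


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return base updated by override; nested dicts are merged key by key."""
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def effective_screener_config(
    code_defaults: Mapping[str, Any],
    settings_block: Mapping[str, Any],
    champion: Mapping[str, Any],
) -> Dict[str, Any]:
    """Apply champion.yaml > settings.yaml:stock_screener > code defaults."""
    tunables = {k: v for k, v in champion.items() if k not in METADATA_KEYS}
    return deep_merge(deep_merge(code_defaults, settings_block), tunables)


if __name__ == "__main__":
    defaults = {"layer_weight_pattern": 0.65,
                "pattern_weights": {"false_breakout": 1.0, "bull_flag": 0.75}}
    settings = {"pattern_weights": {"bull_flag": 0.70}}
    champion = {"version": 3, "layer_weight_pattern": 0.55,
                "pattern_weights": {"false_breakout": 0.90}}
    print(effective_screener_config(defaults, settings, champion))
```

Because pattern_weights is merged key by key, a champion file that lists only some patterns leaves the remaining ones at their settings.yaml or default values.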
This one file is the thing every part of the loop operates on:
pattern audit (WS1) → derive weights (§6.1) → challenger.yaml (auto-written)
└──► A/B vs champion.yaml (§6.2) ──► wins + operator promotes ──►
champion.yaml (version++, parent=old, score recorded)
+ config/model/history/ + results registry (Parquet, NOT Neon)
challenger.yaml; the A/B harness scores it; on a win the operator promotes it to champion.yaml. Every step is a file.

config/model/history/ or git history) with its parent and score. The results registry is a Parquet/CSV file (matches the project's feature-store convention; explicitly not a Neon table, which avoids the two-DATABASE_URL footgun) recording (version, window, metric).

champion.yaml is a pointer to the current best: promotion never overwrites history, so reverting is re-promoting a prior version (instant rollback). Every config run, winners and losers, stays in the registry.

Wiring & scope: the screener already takes a StockScreenerConfig; we add the champion.yaml loader (with settings.yaml then code defaults as fallback) hooked into load_settings_fresh(), and A/B instantiates two configs from two files. This ships with WS1 as a behavior-preserving refactor (seeded byte-identical), so WS2–WS4 have a file to write.
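
The versioning and rollback behavior described above can be modeled in memory as follows. This is a toy stand-in for the file layout (champion.yaml plus config/model/history/), written only to show the invariants: promotion appends a new version with its parent and score, history is never overwritten, and rollback just moves the champion pointer. `ModelRegistry` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ModelRegistry:
    """In-memory stand-in for champion.yaml + config/model/history/."""
    history: Dict[int, Dict[str, Any]] = field(default_factory=dict)
    champion_version: Optional[int] = None

    def promote(self, config: Dict[str, Any], score: Dict[str, Any]) -> int:
        """Store config as a new version and point the champion at it."""
        version = max(self.history, default=0) + 1
        self.history[version] = {
            **config,
            "version": version,
            "parent": self.champion_version,  # what this was derived from
            "score": score,                   # how it scored when promoted
        }
        self.champion_version = version
        return version

    def rollback(self, version: int) -> None:
        """Re-point the champion at a prior version; history is untouched."""
        if version not in self.history:
            raise KeyError(f"unknown model version: {version}")
        self.champion_version = version

    @property
    def champion(self) -> Dict[str, Any]:
        """The currently live model config."""
        if self.champion_version is None:
            raise LookupError("no champion has been promoted yet")
        return self.history[self.champion_version]
```

A promoted-then-reverted challenger stays in `history`, so it can still be compared against later candidates.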
The audit turns a guess into an action:
If a pattern's forward return ≈ 0 / win-rate ≈ 50% → drop or downweight it (WS2).
If money-flow alone out-predicts pattern → lower the 0.65 pattern weight (WS3).
If a pattern only works in one regime → make its weight regime-conditional.
If bearish patterns precede rises in uptrends → confirm the #143 fix quantitatively.
detect_patterns → _filter_actionable → best_pattern → 3-layer composite the production screen uses; asserted by a parity test, not just "calls detect_patterns."

n < 30 are flagged thin, never silently dropped or over-tuned.

champion.yaml seeded from today's effective config produces a byte-identical ranking on a fixture (behavior-preserving); asserted by test.

| Tradeoff | Why accepted |
|---|---|
| Universe = symbols present in our scrape history (~current QU100), replayed over 1yr | Historical QU100 membership isn't recorded; this is a fixed-universe-over-history approximation; disclosed |
| One year of history, daily bars only | The data we already have; intraday adds cost without changing pattern-level calibration |
| Calibration, not retraining | Sample too small for ML; in-context weight tuning is the right tool now |
| Live-shadow A/B needs a multi-month forward window | Offline A/B gives an immediate read; the forward arm guards against overfit before live promotion |
A — Do nothing. Keep hand-set weights. (rejected: we already saw them fail)
B — Wait for live data to mature, then calibrate. (rejected: months of blocked tuning)
C — Reuse the qu100_portfolio backtest engine for audit. (rejected: diverges from the live ranker)
D — Replay the LIVE ranker over the backfill, then A/B-tune via the config system. (chosen)
E — Retrain an ML scorer on outcomes. (rejected: sample far too small)
champion.yaml config system (loader hooked into load_settings_fresh, seeded byte-identical). Behavior-preserving: no weight changes. Merge.

champion.yaml.

_filter_actionable, ranks differently), the corpus measures something production never does. Mitigation: parity test against the live screen path.

champion.yaml revertible.

_screen_money_flow is latest-only (no as-of selector), and the 2026-06-04 backfill stamped all days with one shared captured_at. So full-composite replay needs a new as-of selector (deferred to WS3); the 1-year audit is pattern-layer only. Mitigation: WS1 composite parity uses recent dates; the pattern-predictiveness goal needs only the pattern layer.

analysis/stock_patterns.py (detect_patterns, score_pattern).

analysis/stock_screener.py (screen_stocks, _filter_actionable, best_pattern, the 3-layer composite).

core/config.py (StockScreenerConfig), overridable in config/settings.yaml (stock_screener:); hot-reload via load_settings_fresh().

llm_thesis/research.py (compute_market_regime, SPY vs 200-SMA); SPY backfill paper/ingest.py (ensure_spy_history).

paper/ (paper_trade.shadow, paper/replay.py), #143.

qu100_portfolio.py.

stock_prices (OHLC, legacy local TimescaleDB via core.database.get_session()), not the data/cache/qu100_backtest adjusted-close artifact (insufficient for detect_patterns).

t with ≥ min_daily_bars history, replay the live ranking as-of t: detect_patterns over bars up to t (default config) → _filter_actionable → best_pattern → 3-layer composite. Record the actionable emission(s) the live ranker would consume, plus (optionally, flagged) the raw all-emissions superset for completeness.

close[t+H]/close[t] − 1. Emissions within H of the window end → null at that horizon, never 0.

t, via compute_market_regime.

t from the price index. Output byte-stable for a fixed fixture.

paper/ or backtest/; a CLI (rainier pattern-audit or extend backtest-qu100); corpus as a regenerable Parquet cache (add to the CLAUDE.md disk-hygiene list); report at docs/REPORT-qu100-pattern-hit-rate.md.

champion.yaml loader populating settings.stock_screener, precedence champion.yaml > settings.yaml > defaults, hooked into load_settings (covers get_settings + load_settings_fresh); config/model/history/ + Parquet results registry.

screen_stocks path on a fixture.

champion.yaml seeded from current config → byte-identical ranking on a fixture.

detect_patterns); corpus source named (stock_prices, not the adjusted-close cache); config reframed as a champion/challenger system with explicit precedence + load_settings_fresh hook; A/B split into offline (real) vs live-shadow (net-new wiring, no LLM spend); weight-mechanics caveats (35% / exclusion-not-zero, IC-diagnostic-vs-grid-sweep); survivorship + n-floor + accrual stated. (coordinator)