Status: draft — awaiting operator approval
Scope: measure, then A/B-tune, the pattern-matching that feeds the QU100-LLM loop
Priority: P1 · Depends on: #143 (merged into origin/main — shadow trading, paper_trade.shadow, paper/replay.py, regime tag) · PR base: main

1. Decision Matrix

Answer YES/NO per row. Everything below is context.

#	Decision	Default	Cost	Risk	Consequence if NO
1	WS1 — replay the live pattern layer over 1yr of `stock_prices` for a pattern forward-return audit (full-composite replay deferred to WS3 — needs an as-of money-flow selector)	YES	1 PR	Med	Keep tuning blind; no ground truth for any weight
2	Stage WS2–WS4 behind WS1 (spec the tuning only after we see the numbers)	YES	—	Low	Spec weight changes we can't justify
3	Every model change ships as an A/B experiment (champion vs challenger), promote on a measured win	YES	per-change	Low	Direct live flips — the failure mode we're fixing
4	Build a champion/challenger model-config system (`champion.yaml` + history + results registry) layered over today's config	YES	Small	Low	No way to A/B, auto-tune, version, or roll back a weight set
5	Wait-and-accrue live data instead of replaying the backfill	NO	0	High	Tuning blocked for months

2. Executive Summary

Pattern shape carries 65% of the QU100-LLM ranking weight, yet we have never measured whether our patterns predict price moves — every weight is a hand-set guess.
Live data is too thin to calibrate (~2 weeks of emissions, fewer than 50 names with a usable forward window). Waiting for it to mature costs months.
One year of daily OHLC for the QU100 universe already exists in Postgres stock_prices. Replaying the live pattern layer across it re-derives a large pattern corpus with forward returns — the measurement we lack. (Money-flow history also exists, ~1,400 days, but the live money-flow selector is latest-only — no as-of variant — so full 3-layer composite replay is deferred to WS3; the 1-year audit is pattern-layer.)
Proposal: ship the audit (WS1) first; read the numbers; then tune the three axes you asked for (pattern calibration, layer rebalance, LLM presentation) as A/B experiments.
The weights are already file-backed in settings.yaml; the new piece is a champion/challenger model-config system that makes a weight set comparable, auto-writable, versioned, and revertible.
Risk: WS1 builds a new replay path (the existing backtest engine diverges from the live ranker) and a config loader — so it is behavior-preserving, not literally "read-only." It changes no weights and no detector logic.

3. Why Now

The loop just went live (#143 added shadow trading + the reclaim audit). Right now it ranks stocks by patterns whose track record is unknown, and the first portfolio review already caught those patterns pointing the wrong way in a rising market. Every day we run on guessed weights, we accumulate decisions we can't defend or improve. The backfill needed for the measurement already exists, no detector change is required, and the shadow rails for paper trading landed last PR, so the cost is as low as it will ever be.

4. Problem & Current State

Already shipped (verified on origin/main)

- Daily screen tags each QU100 stock with a chart pattern + confidence (detect_patterns → _filter_actionable → best_pattern), ranks by a 3-layer composite, sends the top 5 to the LLM.
- Pattern/layer weights + thresholds live in StockScreenerConfig and are already overridable from config/settings.yaml (stock_screener: block).
- One year of daily OHLC for the QU100 universe in Postgres stock_prices (legacy local TimescaleDB).
- paper_trade.shadow + shadow isolation, paper/replay.py, and a regime helper (compute_market_regime, SPY vs 200-SMA), all from #143.

Missing / broken

- No measurement of pattern predictiveness. The per-pattern confidence weights and the 3-layer split were never validated against outcomes.
- No replay path that matches the live ranker: the only historical engine (qu100_portfolio) uses a different ranking (2 patterns, top-20, confidence-only, hardcoded config), so it can't measure what production actually does.
- The #143 shadow path is a WATCH-buy gate keyed on the LLM verdict, not a second screener-config run, so it is not yet an A/B harness for config variants.
- No champion/challenger model-config system: a weight set can't be A/B-compared, auto-written, versioned, or rolled back.
- Symptom already observed: top patterns were bearish setups in an uptrend (wrong-signed), and the loop bought nothing while the rally ran.

5. Non-Goals

Does not change any live weight or threshold value in WS1 (champion.yaml is seeded byte-identical to today's effective config).
Does not change detector logic — the detector is the instrument; altering it invalidates the measurement.
Does not add regime-gating beyond tagging (that family shipped in #143).
Does not backfill new prices — reuses the existing stock_prices year.
Does not retrain or fine-tune any model — sample size is too small; this is calibration, not learning.

6. Proposed Design

   1yr daily OHLC in Postgres stock_prices  (legacy local TimescaleDB)
          │   replay the LIVE ranker as-of each day t (no look-ahead):
          │     detect_patterns → _filter_actionable → best_pattern → 3-layer composite
          ▼
   pattern corpus:  per actionable emission → pattern_type, confidence,
                    composite contribution, entry/stop/target, regime tag,
                    forward return at 5 / 10 / 20 trading days
          │
          ▼
   per-pattern table:  n · win-rate · mean/median fwd return · directional-correctness
          │            (grouped by pattern × regime × horizon)
          │
          ├──► WS2  recalibrate per-pattern confidence weights   ─┐
          ├──► WS3  rebalance the 3 layer weights                 ├─ each an A/B experiment
          └──► WS4  show the LLM each pattern's measured record   ─┘   (champion vs challenger,
                                                                        promote on a measured win)

What changes (WS1, the concrete deliverable): a faithful historical replay of the live pattern layer, a regenerable corpus artifact, the champion.yaml config system, and a human report (REPORT-qu100-pattern-hit-rate). The replay is net-new because the existing qu100_portfolio engine diverges from the live ranker. Full 3-layer composite replay is deferred to WS3, which needs an as-of money-flow selector; until then, WS1's composite parity is verified on recent dates via the live latest-snapshot path. No weight values change; behavior is preserved.

What WS2–WS4 will change (framed, spec'd after the numbers): values of existing config — per-pattern confidence weights, the three layer weights, the LLM thesis prompt — each rolled out as champion-vs-challenger, never a direct edit.

6.1 How we tune the patterns and adjust the weights

There are two sets of weights, both hand-set today (in StockScreenerConfig, overridable in settings.yaml):

Per-pattern confidence weights — one multiplier per pattern type (false_breakout = 1.0, bull_flag = 0.75, …).
Layer weights — money-flow 0.25 / sector 0.10 / pattern 0.65. (Note: sector is a binary 0.10 boost, not a continuous score.)

We do not hand-pick new numbers. We derive them from WS1's measured expectancy table by a fixed, documented rule, so every weight traces to evidence.

Per-pattern weight — from each pattern's measured edge:

for each pattern P  (optionally per regime):
    edge(P)   = risk-adjusted forward return        ── mean_fwd_return / volatility
                                                        at the chosen horizon (from WS1)
    weight(P) = min-max normalize edge over observed patterns → [0, 1], then clamp
    if n(P) < floor → keep the CURRENT weight   (never overfit a handful of samples)

Caveat the implementer must respect: the per-pattern weight is only 35% of score_pattern's confidence (the rest is volume/clarity/R:R/status). So setting a weight to 0 does not drop a bad pattern's contribution to 0 — true suppression needs a separate exclusion/threshold, not weight 0. Weights may also be made regime-conditional.
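The per-pattern rule above can be sketched in a few lines of Python. This is a minimal illustration, not the WS2 implementation: the function names, the input shape (a dict of forward returns per pattern), and the use of population standard deviation as "volatility" are all assumptions, and `rare_pattern` in the usage note is a hypothetical pattern name. When fewer than two patterns clear the floor, or all edges are equal, min-max normalization is undefined, so the sketch leaves the current weights untouched.

```python
from statistics import mean, pstdev


def edge(returns):
    """Risk-adjusted forward return: mean / volatility (0.0 if undefined)."""
    if len(returns) < 2:
        return 0.0
    vol = pstdev(returns)  # assumption: population std-dev stands in for volatility
    return mean(returns) / vol if vol > 0 else 0.0


def derive_pattern_weights(returns_by_pattern, current_weights,
                           n_floor=30, lo=0.0, hi=1.0):
    """Min-max normalize each pattern's edge into [lo, hi].

    Patterns with fewer than n_floor samples keep their CURRENT weight,
    so a handful of observations can never move a weight.
    """
    weights = dict(current_weights)
    edges = {p: edge(r) for p, r in returns_by_pattern.items()
             if len(r) >= n_floor}
    if len(edges) < 2:
        return weights  # nothing to normalize against
    e_min, e_max = min(edges.values()), max(edges.values())
    if e_max == e_min:
        return weights  # all edges equal: normalization undefined
    for pattern, e in edges.items():
        scaled = (e - e_min) / (e_max - e_min)
        weights[pattern] = min(hi, max(lo, scaled))  # clamp
    return weights
```

Calling `derive_pattern_weights({"false_breakout": [...], "bull_flag": [...], "rare_pattern": [...]}, current)` returns a full weight dict in which only the patterns above the floor have moved. Note that min-max scaling always sends the weakest eligible pattern to `lo`, which is one more reason suppression should be a separate exclusion decision rather than a side effect of the weight.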

Layer weight — measured, then swept:

diagnostic:  IC(L) = correlation( layer L's score , forward return )   over the corpus
             → tells us WHICH layer predicts (money_flow vs pattern vs sector)
method:      grid-sweep the three weights (sum=1) over the faithful replay and pick
             the split with the best risk-adjusted return on a HELD-OUT slice

IC is a diagnostic only — the three layer scores aren't on comparable scales, so "weights ∝ IC" would mis-blend. The grid-sweep on held-out data is the actual weight-setting method (precedent: backtest/sweep.py). This is the lever that answers "is pattern's 0.65 too high?" — if money-flow predicts as well, the sweep shifts weight to it and the strong-flow names stop getting buried.
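As a rough sketch of the two steps above, the following computes a Pearson IC for one layer and grid-sweeps weight triples that sum to 1, scoring each by the risk-adjusted return of the top-k basket on a held-out slice. Everything here is illustrative: the function names, the candidate shape `((money_flow, sector, pattern), fwd_return)`, and the top-k basket objective are assumptions, and the sketch treats sector as a weighted score even though the live composite applies it as a binary boost. The precedent in backtest/sweep.py may look quite different.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """IC diagnostic: Pearson correlation (0.0 if either series is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def simplex_grid(step=0.05):
    """Yield (money_flow, sector, pattern) weight triples that sum to 1."""
    k = round(1 / step)
    for i in range(k + 1):
        for j in range(k + 1 - i):
            yield (i / k, j / k, (k - i - j) / k)


def basket_score(weights, days, top_k=5):
    """Risk-adjusted return (mean / std) of the daily top-k basket.

    days: list of days; each day is a list of (layer_scores, fwd_return).
    """
    daily = []
    for candidates in days:
        ranked = sorted(
            candidates,
            key=lambda c: sum(w * s for w, s in zip(weights, c[0])),
            reverse=True,
        )
        picks = ranked[:top_k]
        if picks:
            daily.append(mean(r for _, r in picks))
    if len(daily) < 2:
        return 0.0
    vol = pstdev(daily)
    return mean(daily) / vol if vol > 0 else 0.0


def sweep(heldout_days, step=0.05, top_k=5):
    """Return the best weight triple on the held-out slice and its score."""
    best = max(simplex_grid(step),
               key=lambda w: basket_score(w, heldout_days, top_k))
    return best, basket_score(best, heldout_days, top_k)
```

Many triples can tie on a small slice, in which case `max` returns the first one in grid order, so a real sweep would want an explicit tie-break (for example, the triple closest to the current champion).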

The output of WS2/WS3 is a challenger.yaml, a candidate to replace champion.yaml, with every number justified by the audit.

6.2 How we A/B test a change before it goes live

Rule: champion (current, live) keeps running; challenger (the change) runs beside it; promote only on a measured win. Two distinct modes:

Mode 1 — Offline (backtest) A/B — feasible once WS1's replay exists. Instantiate two StockScreenerConfigs, run the WS1 replay twice over the corpus, compare the selected baskets' forward returns. Zero LLM cost, immediate, picks the best candidate. Weakness: can overfit. Scope note: pattern-weight A/B (WS2) runs over the full 1-year pattern corpus; layer-rebalance A/B (WS3) needs the money-flow layer replayed as-of date, which requires building the as-of money-flow selector first (the history exists; the selector does not).
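To make the offline mode concrete, here is a minimal sketch of a pattern-weight A/B over a replayed corpus. It is a stand-in, not the WS1 replay: `run_arm` and `offline_ab` are hypothetical names, each candidate is assumed to be a dict with `pattern`, `confidence`, and `fwd_return`, and the ranking key (`weight × confidence`) is a simplification of the real composite. Emissions with a null forward return are skipped rather than counted as zero.

```python
from statistics import mean


def run_arm(pattern_weights, days, top_k=5):
    """Rank each day's candidates under one weight set; score the baskets."""
    picked = []
    for candidates in days:
        ranked = sorted(
            candidates,
            key=lambda c: pattern_weights.get(c["pattern"], 0.0) * c["confidence"],
            reverse=True,
        )
        picked.extend(c["fwd_return"] for c in ranked[:top_k]
                      if c["fwd_return"] is not None)
    if not picked:
        return {"n": 0, "mean_return": None, "hit_rate": None}
    return {
        "n": len(picked),
        "mean_return": mean(picked),
        "hit_rate": sum(r > 0 for r in picked) / len(picked),
    }


def offline_ab(champion_weights, challenger_weights, days, top_k=5):
    """Run both arms over the same corpus and report the delta."""
    a = run_arm(champion_weights, days, top_k)
    b = run_arm(challenger_weights, days, top_k)
    delta = None
    if a["mean_return"] is not None and b["mean_return"] is not None:
        delta = b["mean_return"] - a["mean_return"]
    return {"champion": a, "challenger": b, "delta_mean_return": delta}
```

Because both arms see the identical corpus, any difference in the result comes from the weights alone. The overfit weakness still applies: the corpus used to derive the challenger should not be the one used to score it.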

Mode 2 — Live shadow A/B — net-new wiring (NOT a freebie from #143).

The #143 shadow path opens WATCH-buy rows keyed on the LLM verdict; it does not run a second screener config. Live shadow A/B needs new plumbing: the daily pipeline (which today threads a single Settings end-to-end via pipeline/post_scrape.py) must run the challenger config too and tag its outputs with the config version. The shadow challenger is measured on screener-rank / forward-return only, with no LLM thesis spend (it never drives top-5 LLM calls, so it can't blow max_usd_per_scan).

                     same daily QU100 input
                              │
          ┌───────────────────┴───────────────────┐
     champion config                         challenger config(s)
     (live: opens REAL                       (shadow: tagged rows,
      paper positions)                        isolated — never the live book, #143)
          │                                         │
     champion outcomes                       challenger outcomes
          └───────────────► compare on metric ◄─────┘
                    risk-adjusted return · hit-rate · expectancy
                              │
              challenger wins by a margin AND has enough n
                              │
                operator promotes → challenger becomes champion
            (loses / inconclusive → discard; champion untouched)

Promotion gates (placeholders, finalized in the WS2 plan): per-pattern cell n ≥ 30; challenger must beat champion by a stated margin on the held-out window; the live-shadow arm needs a multi-month forward window to conclude — so WS2–WS4 promote on offline A/B plus an accruing forward-shadow check, not instantly.
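The gate logic can be expressed as a small pure function. The sketch below is illustrative only: `ArmResult`, `promotion_decision`, and the `min_margin=0.05` default are hypothetical (the real margin is finalized in the WS2 plan), and only the n ≥ 30 floor comes from this document. It never promotes on its own; the best outcome is "eligible", which still requires the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    n: int                  # sample size behind the metric
    risk_adj_return: float  # held-out risk-adjusted return


def promotion_decision(champion, challenger, min_n=30, min_margin=0.05):
    """Classify a challenger as 'eligible', 'inconclusive', or 'discard'.

    min_margin is a placeholder value, not a decided threshold.
    """
    if champion.n < min_n or challenger.n < min_n:
        return "inconclusive"  # not enough samples to conclude either way
    if challenger.risk_adj_return - champion.risk_adj_return >= min_margin:
        return "eligible"      # operator may promote; nothing is automatic
    return "discard"           # champion stays untouched
```

Keeping "inconclusive" distinct from "discard" in the return value lets the registry record why a challenger was not promoted, even though both leave the champion in place.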

6.3 The champion/challenger model-config system

The weights are already file-backed (settings.yaml:stock_screener) — so the new value is not "move out of code," it's a system that makes a weight set A/B-comparable, auto-writable, versioned, and revertible. One YAML is a model:

# config/model/champion.yaml — the LIVE model. One file, everything tunable.
version: 3
parent: 2                       # which config this was derived from
created: 2026-06-20
note: "WS2 recalibration from pattern audit 2026-06-18"
score:                          # how it scored when promoted (for tracking)
  window: "2025-06..2026-06"
  risk_adj_return: 0.42
  hit_rate: 0.58
# flat StockScreenerConfig field names (match settings.yaml:stock_screener) so the
# loader deep-merges directly; pattern_weights is the one nested dict field.
layer_weight_money_flow: 0.35
layer_weight_sector: 0.10
layer_weight_pattern: 0.55
pattern_weights: {false_breakout: 0.90, bull_flag: 0.70}   # ...
neckline_tolerance_pct: 0.03
volume_breakout_multiplier: 1.5
strong_buy_threshold: 0.80
buy_threshold: 0.65
watch_threshold: 0.50

Precedence (must be explicit): champion.yaml > settings.yaml:stock_screener > StockScreenerConfig code defaults. The loader populates settings.stock_screener from champion.yaml, and must hook load_settings (which both get_settings and the scheduler's hot-reload load_settings_fresh delegate to); otherwise a promotion wouldn't take effect without a daemon restart.
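The precedence chain can be sketched as two deep-merges over plain dicts. This is an assumption-laden illustration rather than the real loader: it works on already-parsed dicts instead of reading YAML, `effective_screener_config` and `METADATA_KEYS` are hypothetical names, and the choice to strip the bookkeeping fields (version, parent, created, note, score) before merging is a guess at how the loader would keep them out of StockScreenerConfig.

```python
from copy import deepcopy

# Assumption: these champion.yaml fields are bookkeeping, not tunables.
METADATA_KEYS = {"version", "parent", "created", "note", "score"}


def deep_merge(base, override):
    """Return base with override applied; nested dicts merge key by key."""
    out = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = deepcopy(value)
    return out


def effective_screener_config(code_defaults, settings_block, champion):
    """Apply precedence: champion.yaml > settings.yaml > code defaults."""
    tunables = {k: v for k, v in champion.items() if k not in METADATA_KEYS}
    return deep_merge(deep_merge(code_defaults, settings_block), tunables)
```

With a deep merge, a champion.yaml that lists only `bull_flag` under `pattern_weights` leaves every other pattern weight at its lower-precedence value. If the intent is for champion.yaml to replace the whole `pattern_weights` dict instead, that needs to be an explicit decision, since the two behaviors differ exactly when a file is partial.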

This one file is the thing every part of the loop operates on:

   pattern audit (WS1) → derive weights (§6.1) → challenger.yaml (auto-written)
        └──► A/B vs champion.yaml (§6.2) ──► wins + operator promotes ──►
             champion.yaml (version++, parent=old, score recorded)
             + config/model/history/  +  results registry (Parquet, NOT Neon)

Compare: champion vs challenger is a YAML diff plus the A/B metric delta.
Auto-improving: the tuner writes a challenger.yaml; the A/B harness scores it; on a win the operator promotes it to champion.yaml. Every step is a file.
Tracking: every promoted version is retained (config/model/history/ or git history) with its parent and score. The results registry is a Parquet/CSV file (matches the project's feature-store convention; explicitly not a Neon table — avoids the two-DATABASE_URL footgun) recording (version, window, metric).
Safe to be wrong: champion.yaml is a pointer to the current best — promotion never overwrites history, so reverting is re-promoting a prior version (instant rollback). Every config run, winners and losers, stays in the registry.
Finding good combinations: the weights interact — a knob good alone can hurt in combination. A/B isn't limited to one challenger: the harness scores a batch of candidate combinations, and the registry becomes a research dataset you mine for which combinations actually work.
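The "pointer plus append-only history" idea behind promotion and rollback can be sketched with an in-memory registry. This is only a model of the bookkeeping, under stated assumptions: the registry shape (`{"champion": version, "history": {version: config}}`) and the `promote` / `rollback` names are hypothetical, and a real implementation would persist to config/model/history/ and the results registry rather than a dict.

```python
from copy import deepcopy


def promote(registry, tunables, score, created, note):
    """Add a new version derived from the current champion and point at it.

    History is append-only: earlier versions are never modified or removed.
    """
    parent = registry["champion"]
    version = max(registry["history"]) + 1
    entry = {
        **deepcopy(tunables),
        "version": version,
        "parent": parent,
        "created": created,
        "note": note,
        "score": score,
    }
    history = {**registry["history"], version: entry}
    return {"champion": version, "history": history}


def rollback(registry, version):
    """Re-point the champion at a prior version; history stays intact."""
    if version not in registry["history"]:
        raise KeyError(f"unknown config version: {version}")
    return {**registry, "champion": version}
```

Both functions return a new registry rather than mutating the old one, which keeps a failed or abandoned promotion from leaving partial state behind.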

Wiring & scope: the screener already takes a StockScreenerConfig; we add the champion.yaml loader (falling back to settings.yaml, then code defaults) hooked into load_settings, which both get_settings and the hot-reload load_settings_fresh delegate to, and A/B instantiates two configs from two files. This ships with WS1 as a behavior-preserving refactor (seeded byte-identical), so WS2–WS4 have a file to write.

7. Expected Outcome

The audit turns a guess into an action:

If a pattern's forward return ≈ 0 / win-rate ≈ 50%  → drop or downweight it (WS2).
If money-flow alone out-predicts pattern             → lower the 0.65 pattern weight (WS3).
If a pattern only works in one regime                → make its weight regime-conditional.
If bearish patterns precede rises in uptrends        → confirm the #143 fix quantitatively.

8. Success Metrics

The audit emits a per-pattern table with n, win-rate, and mean/median forward return at 5/10/20 days, split by regime, over the 1-year window — reproducible from one command.
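The replay matches live consumption: the same detect_patterns → _filter_actionable → best_pattern chain the production screen uses, with 3-layer composite parity checked on recent dates via the live latest-snapshot path (full as-of composite replay is deferred to WS3). This is asserted by a parity test, not just "calls detect_patterns."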
Per-pattern n is reported on every cell; cells with n < 30 are flagged thin, never silently dropped or over-tuned.
champion.yaml seeded from today's effective config produces a byte-identical ranking on a fixture (behavior-preserving) — asserted by test.
At least one concrete tuning hypothesis per axis (WS2/WS3/WS4) is documented from the table.

9. Tradeoffs

Tradeoff	Why accepted
Universe = symbols present in our scrape history (~current QU100), replayed over 1yr	Historical QU100 membership isn't recorded; this is a fixed-universe-over-history approximation; disclosed
One year of history, daily bars only	The data we already have; intraday adds cost without changing pattern-level calibration
Calibration, not retraining	Sample too small for ML; in-context weight tuning is the right tool now
Live-shadow A/B needs a multi-month forward window	Offline A/B gives an immediate read; the forward arm guards against overfit before live promotion

10. Alternatives Considered

A — Do nothing. Keep hand-set weights.                  (rejected: we already saw them fail)
B — Wait for live data to mature, then calibrate.        (rejected: months of blocked tuning)
C — Reuse the qu100_portfolio backtest engine for audit. (rejected: diverges from the live ranker)
D — Replay the LIVE ranker over the backfill, then A/B-tune via the config system. (chosen)
E — Retrain an ML scorer on outcomes.                    (rejected: sample far too small)

11. Rollout Plan

WS1 (this PR): faithful live-ranker replay + corpus + report, and the champion.yaml config system (loader hooked into load_settings, so get_settings and load_settings_fresh both pick it up; seeded byte-identical). Behavior-preserving: no weight changes. Merge.
Read-out: operator reviews the hit-rate report; short follow-up design pass specs WS2–WS4 against the actual numbers.
WS2 / WS3 / WS4 (one PR each): each as an A/B experiment — offline first (two configs over the replay), then an accruing live-shadow check; promote only on a measured win. Rollback = re-promote the prior champion.yaml.

12. Risks

Replay parity drift. If the audit replay diverges from the live ranker (skips _filter_actionable, ranks differently), the corpus measures something production never does. Mitigation: parity test against the live screen path.
Detector look-ahead. The replay must feed only bars up to the as-of day; an off-by-one leaks the future and inflates win-rates. Mitigation: forward-return + determinism tests.
Weights interact / overfit. A change that wins on one window can lose on another; combinations aren't separable. Mitigation: evaluate whole configs on a held-out window, score a batch, retain losers, keep champion.yaml revertible.
Survivorship. Universe = scraped names over history, not reconstructed historical QU100 membership. Disclosed; gate on persisted membership later only if it moves a decision.
Thin per-pattern cells. Rare patterns may stay below n=30 even over a year — report n; don't over-tune thin cells.
Money-flow as-of gap. Money-flow history exists (~1,400 days), but the live _screen_money_flow is latest-only — no as-of selector, and the 2026-06-04 backfill stamped all days with one shared captured_at. So full-composite replay needs a new as-of selector (deferred to WS3); the 1-year audit is pattern-layer only. Mitigation: WS1 composite parity uses recent dates; the pattern-predictiveness goal needs only the pattern layer.
Unknown: whether one year is enough signal per pattern × regime cell — the audit itself answers this.
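The look-ahead risk above lends itself to a mechanical check: tamper with every bar after the as-of day and assert that the emission at that day does not change. The sketch below shows the idea on a toy series of closes with a toy detector; `toy_detect`, the two replay functions, and `leaks_future` are hypothetical names for illustration, and the real detect_patterns consumes full OHLC bars rather than a list of floats.

```python
def toy_detect(closes):
    """Toy stand-in for a detector: flags a new high on the last bar."""
    if len(closes) >= 2 and closes[-1] > max(closes[:-1]):
        return "breakout"
    return None


def replay_as_of(closes, detect, min_bars=3):
    """Correct replay: the detector at day t sees bars 0..t only."""
    return {t: detect(closes[: t + 1]) for t in range(min_bars - 1, len(closes))}


def replay_off_by_one(closes, detect, min_bars=3):
    """Buggy replay: the slice ends one bar too late and leaks day t+1."""
    return {t: detect(closes[: t + 2]) for t in range(min_bars - 1, len(closes))}


def leaks_future(replay_fn, closes, detect, min_bars=3):
    """True if changing bars after day t changes the emission at day t."""
    baseline = replay_fn(closes, detect, min_bars)
    for t in range(min_bars - 1, len(closes)):
        tampered = closes[: t + 1] + [c * 10 for c in closes[t + 1:]]
        if replay_fn(tampered, detect, min_bars)[t] != baseline[t]:
            return True
    return False
```

A single perturbation (here, multiplying future closes by 10) is a necessary check rather than a proof: it catches a leak only if the tampered bars actually change the detector's output, so a real test would want a few different perturbations on a fixture known to emit patterns.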

13. Future Work Not Chosen

Per-pattern ML scorer — revisit at a much larger sample.
Intraday pattern calibration — only if daily-level tuning proves insufficient.
Historical-membership reconstruction — only if survivorship bias changes a ranking.
Auto-promotion of A/B winners — keep promotion operator-gated for now.

Appendix

A. Code locations (module paths; exact lines live in the PR / companion notes)

Pattern detection + scoring: analysis/stock_patterns.py (detect_patterns, score_pattern).
Live ranker: analysis/stock_screener.py (screen_stocks, _filter_actionable, best_pattern, the 3-layer composite).
Weights/thresholds: core/config.py (StockScreenerConfig), overridable in config/settings.yaml (stock_screener:); hot-reload via load_settings_fresh().
Regime tag: llm_thesis/research.py (compute_market_regime, SPY vs 200-SMA); SPY backfill paper/ingest.py (ensure_spy_history).
Shadow / replay rails: paper/ (paper_trade.shadow, paper/replay.py) — #143.
Existing (diverging) backtest engine, for reference only: qu100_portfolio.py.
Corpus source: Postgres stock_prices (OHLC, legacy local TimescaleDB via core.database.get_session()) — not the data/cache/qu100_backtest adjusted-close artifact (insufficient for detect_patterns).

B. WS1 implementation notes

Replay loop: for each symbol, for each trading day t with ≥ min_daily_bars history, replay the live pattern layer as-of t: detect_patterns over bars up to t (default config) → _filter_actionable → best_pattern. The 3-layer composite is checked for parity only on recent dates via the live latest-snapshot path, since full as-of composite replay is deferred to WS3. Record the actionable emission(s) the live ranker would consume, plus (optionally, flagged) the raw all-emissions superset for completeness.
Forward return: at horizons H ∈ {5, 10, 20} trading days, close[t+H]/close[t] − 1. Emissions within H of the window end → null at that horizon, never 0.
Regime tag: SPY vs 200-SMA at t, via compute_market_regime.
Aggregation: group by (pattern_type, regime, horizon): n, win-rate, mean, median, directional-correctness (does a bearish pattern actually precede a decline?).
Determinism: no wall-clock; the as-of date is t from the price index. Output byte-stable for a fixed fixture.
Surfaces (decided in the task plan): audit module under paper/ or backtest/; a CLI (rainier pattern-audit or extend backtest-qu100); corpus as a regenerable Parquet cache (add to the CLAUDE.md disk-hygiene list); report at docs/REPORT-qu100-pattern-hit-rate.md.
Config system: champion.yaml loader populating settings.stock_screener, precedence champion.yaml > settings.yaml > defaults, hooked into load_settings (covers get_settings + load_settings_fresh); config/model/history/ + Parquet results registry.
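The forward-return and aggregation notes above can be sketched as follows. The formula and the null-at-window-end rule come straight from the notes; the rest is assumed for illustration: the emission shape (a dict with `pattern_type`, `regime`, `direction`, and a per-horizon `fwd` map), the definition of a "win" as a positive forward return, and the `thin` flag name. Directional-correctness is read here as "the return has the sign the pattern implies".

```python
from collections import defaultdict
from statistics import mean, median


def forward_return(closes, t, horizon):
    """close[t+H] / close[t] - 1, or None if t+H runs past the window end."""
    end = t + horizon
    if end >= len(closes):
        return None  # null, never 0: a missing outcome is not a flat outcome
    return closes[end] / closes[t] - 1.0


def aggregate(emissions, horizons=(5, 10, 20), thin_n=30):
    """Group by (pattern_type, regime, horizon) and summarize each cell."""
    cells = defaultdict(list)
    for e in emissions:
        for h in horizons:
            r = e["fwd"].get(h)
            if r is not None:
                cells[(e["pattern_type"], e["regime"], h)].append((r, e["direction"]))
    table = {}
    for key, rows in cells.items():
        rets = [r for r, _ in rows]
        # Assumption: bullish is correct on a rise, bearish on a decline.
        correct = [(r > 0) if d == "bullish" else (r < 0) for r, d in rows]
        table[key] = {
            "n": len(rets),
            "win_rate": sum(r > 0 for r in rets) / len(rets),
            "mean": mean(rets),
            "median": median(rets),
            "directional_correctness": sum(correct) / len(correct),
            "thin": len(rets) < thin_n,  # flagged, not dropped
        }
    return table
```

Separating win-rate from directional-correctness matters for the symptom this proposal starts from: a bearish pattern followed by a rise counts as a "win" on raw return but as directionally wrong.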

C. Validation categories (detailed cases in the PR)

Parity: the replay's selection matches the live screen_stocks path on a fixture.
Replay over a small synthetic fixture → expected emission count + dates.
Forward return at H for a known price path → exact value; near-window-end → null, not 0.
Aggregation groups correctly by (pattern, regime, horizon); per-pattern n surfaced.
Directional-correctness: bearish pattern + later decline → correct; bearish + rise → wrong.
Determinism: same fixture twice → identical corpus bytes.
Config behavior-preservation: champion.yaml seeded from current config → byte-identical ranking on a fixture.
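The determinism case above can be checked by hashing a canonical serialization of the corpus and comparing digests across two runs. This sketch uses canonical JSON as a stand-in for the Parquet artifact, which is an assumption made to keep the example dependency-free; `corpus_digest` is a hypothetical helper name. Byte-stability of a real Parquet file also depends on writer settings, which this does not cover.

```python
import hashlib
import json


def corpus_digest(rows):
    """SHA-256 of a canonical JSON encoding of the corpus rows.

    Keys are sorted and separators fixed so that dict ordering and
    whitespace cannot change the bytes; NaN is rejected outright.
    """
    payload = json.dumps(
        rows, sort_keys=True, separators=(",", ":"), allow_nan=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Row order is deliberately part of the digest: if two runs emit the same rows in a different order, that is itself a determinism bug worth surfacing.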

D. Change log

2026-06-15 — WS1 task-plan review pass (corrected): money-flow history EXISTS (~1,400 days) but the live selector is latest-only (no as-of variant), so the 1yr audit is pattern-layer and full-composite replay is deferred to WS3 (needs an as-of money-flow selector); composite double-counts sector; champion.yaml uses flat StockScreenerConfig field names for the deep-merge loader. (coordinator)
2026-06-15 — Review pass (codex + independent Claude): WS1 reframed as faithful live-ranker replay (parity, not raw detect_patterns); corpus source named (stock_prices, not the adjusted-close cache); config reframed as a champion/challenger system with explicit precedence + load_settings_fresh hook; A/B split into offline (real) vs live-shadow (net-new wiring, no LLM spend); weight-mechanics caveats (35% / exclusion-not-zero, IC-diagnostic-vs-grid-sweep); survivorship + n-floor + accrual stated. (coordinator)
2026-06-14 — Initial draft. (coordinator)