DESIGN — QU100-LLM pattern-matching tuning

Rendered from docs/DESIGN-qu100-pattern-tuning.md — 2026-06-15. The .md is the source of truth; this file is the local render.

1. Decision Matrix

Answer YES/NO per row. Everything below is context.

# | Decision | Default | Cost | Risk | Consequence if NO
1 | WS1: replay the live pattern layer over 1yr of stock_prices for a pattern forward-return audit (full-composite replay deferred to WS3, which needs an as-of money-flow selector) | YES | 1 PR | Med | Keep tuning blind; no ground truth for any weight
2 | Stage WS2–WS4 behind WS1 (spec the tuning only after we see the numbers) | YES | | Low | Spec weight changes we can't justify
3 | Every model change ships as an A/B experiment (champion vs challenger), promote on a measured win | YES | per-change | Low | Direct live flips, the failure mode we're fixing
4 | Build a champion/challenger model-config system (champion.yaml + history + results registry) layered over today's config | YES | Small | Low | No way to A/B, auto-tune, version, or roll back a weight set
5 | Wait-and-accrue live data instead of replaying the backfill | NO | 0 | High | Tuning blocked for months

2. Executive Summary

3. Why Now

The loop just went live (#143 added shadow trading + the reclaim audit). Right now it ranks stocks by patterns whose track record is unknown, and the first portfolio review already caught those patterns pointing the wrong way in a rising market. Every day we run on guessed weights, we accumulate decisions we can't defend or improve. The backfill needed for measurement already exists, the detector needs no changes, and the shadow rails for paper trading landed in the last PR, so the cost of doing this is as low as it will get.

4. Problem & Current State

Already shipped (verified on origin/main)

  - Daily screen tags each QU100 stock with a chart pattern + confidence (detect_patterns → _filter_actionable → best_pattern), ranks by a 3-layer composite, sends the top 5 to the LLM.
  - Pattern/layer weights + thresholds live in StockScreenerConfig and are already overridable from config/settings.yaml (stock_screener: block).
  - One year of daily OHLC for the QU100 universe in Postgres stock_prices (legacy local TimescaleDB).
  - paper_trade.shadow + shadow isolation, paper/replay.py, and a regime helper (compute_market_regime, SPY vs 200-SMA), all from #143.

Missing / broken

  - No measurement of pattern predictiveness. The per-pattern confidence weights and the 3-layer split were never validated against outcomes.
  - No replay path that matches the live ranker: the only historical engine (qu100_portfolio) uses a different ranking (2 patterns, top-20, confidence-only, hardcoded config), so it can't measure what production actually does.
  - The #143 shadow path is a WATCH-buy gate keyed on the LLM verdict, not a second screener-config run, so it is not yet an A/B harness for config variants.
  - No champion/challenger model-config system: a weight set can't be A/B-compared, auto-written, versioned, or rolled back.
  - Symptom already observed: top patterns were bearish setups in an uptrend (wrong-signed), and the loop bought nothing while the rally ran.

5. Non-Goals

6. Proposed Design

   1yr daily OHLC in Postgres stock_prices  (legacy local TimescaleDB)
          │   replay the LIVE ranker as-of each day t (no look-ahead):
          │     detect_patterns → _filter_actionable → best_pattern → 3-layer composite
          ▼
   pattern corpus:  per actionable emission → pattern_type, confidence,
                    composite contribution, entry/stop/target, regime tag,
                    forward return at 5 / 10 / 20 trading days
          │
          ▼
   per-pattern table:  n · win-rate · mean/median fwd return · directional-correctness
          │            (grouped by pattern × regime × horizon)
          │
          ├──► WS2  recalibrate per-pattern confidence weights   ─┐
          ├──► WS3  rebalance the 3 layer weights                 ├─ each an A/B experiment
          └──► WS4  show the LLM each pattern's measured record   ─┘   (champion vs challenger,
                                                                        promote on a measured win)

What changes (WS1, the concrete deliverable):

  - A faithful historical replay of the live pattern layer. This is net-new, because the existing qu100_portfolio engine diverges from the live ranker. Full 3-layer composite replay is deferred to WS3, which needs an as-of money-flow selector; in WS1, composite parity is verified on recent dates via the live latest-snapshot path.
  - A regenerable corpus artifact.
  - The champion.yaml config system.
  - A human report (REPORT-qu100-pattern-hit-rate).

No weight values change; behavior is preserved.
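As a rough illustration of the corpus and per-pattern table in the diagram above, the sketch below labels each emission with forward returns at 5 / 10 / 20 trading days and groups them by pattern × regime × horizon. The function names, the emission dict keys, and the "win = positive forward return" definition are assumptions for illustration, not the WS1 implementation. Detection at day t must only read bars up to t; the forward return is a label attached afterwards.

```python
from collections import defaultdict
from statistics import mean, median

HORIZONS = (5, 10, 20)  # trading days, as listed in the corpus spec


def forward_returns(closes, t, horizons=HORIZONS):
    """Forward return from day t's close to day t+h's close.

    Returns None for a horizon that runs past the end of the series.
    """
    out = {}
    for h in horizons:
        if t + h < len(closes):
            out[h] = closes[t + h] / closes[t] - 1.0
        else:
            out[h] = None
    return out


def expectancy_table(emissions):
    """Group emissions by (pattern_type, regime, horizon) and summarize.

    Each emission is a dict with hypothetical keys: pattern_type, regime,
    and fwd (the mapping produced by forward_returns).
    """
    cells = defaultdict(list)
    for e in emissions:
        for h, r in e["fwd"].items():
            if r is not None:
                cells[(e["pattern_type"], e["regime"], h)].append(r)
    return {
        key: {
            "n": len(rs),
            # Assumption: a "win" is a positive forward return.
            "win_rate": sum(r > 0 for r in rs) / len(rs),
            "mean": mean(rs),
            "median": median(rs),
        }
        for key, rs in cells.items()
    }
```

Bearish patterns would need the sign of the return flipped before a win-rate like this measures directional correctness; that handling is left out here.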

What WS2–WS4 will change (framed, spec'd after the numbers): values of existing config — per-pattern confidence weights, the three layer weights, the LLM thesis prompt — each rolled out as champion-vs-challenger, never a direct edit.

6.1 How we tune the patterns and adjust the weights

There are two sets of weights, both hand-set today (in StockScreenerConfig, overridable in settings.yaml):

  1. Per-pattern confidence weights — one multiplier per pattern type (false_breakout = 1.0, bull_flag = 0.75, …).
  2. Layer weights — money-flow 0.25 / sector 0.10 / pattern 0.65. (Note: sector is a binary 0.10 boost, not a continuous score.)

We do not hand-pick new numbers. We derive them from WS1's measured expectancy table by a fixed, documented rule, so every weight traces to evidence.

Per-pattern weight — from each pattern's measured edge:

for each pattern P  (optionally per regime):
    edge(P)   = risk-adjusted forward return        ── mean_fwd_return / volatility
                                                        at the chosen horizon (from WS1)
    weight(P) = min-max normalize edge over observed patterns → [0, 1], then clamp
    if n(P) < floor → keep the CURRENT weight   (never overfit a handful of samples)
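
A minimal Python sketch of the rule above, under stated assumptions: volatility is taken as the population standard deviation of the pattern's forward returns, the clamp bounds default to [0, 1], and derive_pattern_weights with its parameters is a hypothetical name. Patterns below the sample floor, or with an undefined edge, keep their current weight.

```python
from statistics import mean, pstdev


def derive_pattern_weights(returns_by_pattern, current_weights,
                           n_floor=30, lo=0.0, hi=1.0):
    """Map each pattern's risk-adjusted edge to a weight in [lo, hi]."""
    edges = {}
    for pattern, rets in returns_by_pattern.items():
        if len(rets) < n_floor:
            continue  # too few samples: keep the current weight
        vol = pstdev(rets)
        if vol == 0:
            continue  # edge undefined: keep the current weight
        edges[pattern] = mean(rets) / vol

    new_weights = dict(current_weights)
    if len(edges) < 2:
        return new_weights  # min-max needs at least two observed patterns
    e_min, e_max = min(edges.values()), max(edges.values())
    if e_max == e_min:
        return new_weights
    for pattern, edge in edges.items():
        scaled = (edge - e_min) / (e_max - e_min)
        new_weights[pattern] = max(lo, min(hi, scaled))
    return new_weights
```

Note that plain min-max normalization always sends the weakest qualifying pattern to the lower bound and the strongest to the upper bound, so the clamp range is the practical knob for how far weights may move.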

Caveat the implementer must respect: the per-pattern weight is only 35% of score_pattern's confidence (the rest is volume/clarity/R:R/status). So setting a weight to 0 does not drop a bad pattern's contribution to 0 — true suppression needs a separate exclusion/threshold, not weight 0. Weights may also be made regime-conditional.
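
The arithmetic behind this caveat can be sketched as follows. The linear blend and the single other_score input (standing in for volume/clarity/R:R/status combined) are assumptions about the shape of score_pattern, and both function names are hypothetical; only the 35% share comes from the text above.

```python
PATTERN_WEIGHT_SHARE = 0.35  # share stated in the caveat above


def blended_confidence(pattern_weight, other_score):
    """Assumed linear blend: 35% pattern weight, 65% everything else."""
    return (PATTERN_WEIGHT_SHARE * pattern_weight
            + (1.0 - PATTERN_WEIGHT_SHARE) * other_score)


def actionable(pattern_type, confidence, excluded, min_confidence):
    """Suppress via an explicit exclusion set or threshold, not via weight 0."""
    return pattern_type not in excluded and confidence >= min_confidence


# Weight 0 with strong other components still yields a confidence near 0.52.
zeroed = blended_confidence(0.0, 0.8)
```

Under these assumptions a zero-weighted pattern keeps roughly two thirds of its confidence, which is why the exclusion check is a separate step.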

Layer weight — measured, then swept:

diagnostic:  IC(L) = correlation( layer L's score , forward return )   over the corpus
             → tells us WHICH layer predicts (money_flow vs pattern vs sector)
method:      grid-sweep the three weights (sum=1) over the faithful replay and pick
             the split with the best risk-adjusted return on a HELD-OUT slice

IC is a diagnostic only — the three layer scores aren't on comparable scales, so "weights ∝ IC" would mis-blend. The grid-sweep on held-out data is the actual weight-setting method (precedent: backtest/sweep.py). This is the lever that answers "is pattern's 0.65 too high?" — if money-flow predicts as well, the sweep shifts weight to it and the strong-flow names stop getting buried.
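
The diagnostic and the sweep could look roughly like the sketch below. It is not backtest/sweep.py: the row layout (money-flow score, sector score, pattern score, forward return), the top-k basket selection, and mean divided by population standard deviation as the risk-adjusted objective are all assumptions chosen to keep the example self-contained.

```python
from itertools import product
from statistics import mean, pstdev


def pearson(xs, ys):
    """IC diagnostic: Pearson correlation; 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def weight_grid(step=0.05):
    """Yield every (money_flow, sector, pattern) triple on the grid summing to 1."""
    n = round(1 / step)
    for i, j in product(range(n + 1), repeat=2):
        if i + j <= n:
            yield (i / n, j / n, (n - i - j) / n)


def basket_returns(days, weights, top_k=5):
    """days: per-day lists of rows (mf_score, sector_score, pattern_score, fwd_return)."""
    picked = []
    for rows in days:
        ranked = sorted(
            rows,
            key=lambda r: sum(w * s for w, s in zip(weights, r[:3])),
            reverse=True,
        )
        picked.extend(r[3] for r in ranked[:top_k])
    return picked


def risk_adjusted(returns):
    """Mean over standard deviation; constant baskets score -inf (assumption)."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else float("-inf")


def sweep_layer_weights(heldout_days, step=0.05, top_k=5):
    """Score every grid point on the held-out slice and return the best split."""
    return max(
        weight_grid(step),
        key=lambda w: risk_adjusted(basket_returns(heldout_days, w, top_k)),
    )
```

Because sector is a binary boost, its column would hold 0 or 1 in this layout. Ties between grid points resolve to the first one visited, so a real sweep would want an explicit tie-break.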

The output of WS2/WS3 is a challenger champion.yaml, every number justified by the audit.

6.2 How we A/B test a change before it goes live

Rule: champion (current, live) keeps running; challenger (the change) runs beside it; promote only on a measured win. Two distinct modes:

Mode 1 — Offline (backtest) A/B — feasible once WS1's replay exists. Instantiate two StockScreenerConfigs, run the WS1 replay twice over the corpus, compare the selected baskets' forward returns. Zero LLM cost, immediate, picks the best candidate. Weakness: can overfit. Scope note: pattern-weight A/B (WS2) runs over the full 1-year pattern corpus; layer-rebalance A/B (WS3) needs the money-flow layer replayed as-of date, which requires building the as-of money-flow selector first (the history exists; the selector does not).
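
A sketch of what the offline comparison might look like once the replay exists. ScreenerConfig is a stand-in for StockScreenerConfig carrying only the three layer weights, and the replay(corpus, config) signature returning a list of basket forward returns is hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ScreenerConfig:
    """Stand-in for StockScreenerConfig with only the fields this sketch uses."""
    layer_weight_money_flow: float = 0.25
    layer_weight_sector: float = 0.10
    layer_weight_pattern: float = 0.65


def offline_ab(replay, corpus, champion, challenger):
    """Run the same replay under both configs and compare basket forward returns.

    replay(corpus, config) -> list[float] is a hypothetical signature.
    """
    champ = replay(corpus, champion)
    chall = replay(corpus, challenger)
    return {
        "champion_mean": mean(champ),
        "challenger_mean": mean(chall),
        "delta": mean(chall) - mean(champ),
        "n": min(len(champ), len(chall)),
    }
```

Both arms see the identical corpus, so any difference in the result comes from the config alone. The overfitting weakness noted above still applies: a positive delta here is a candidate, not a promotion.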

Mode 2 — Live shadow A/B — net-new wiring (NOT a freebie from #143).

#143's shadow path opens WATCH-buy rows keyed on the LLM verdict; it does not run a second screener config. Live shadow A/B needs new plumbing: the daily pipeline (which today threads a single Settings end-to-end via pipeline/post_scrape.py) must run the challenger config too and tag its outputs with the config version. The shadow challenger is measured on screener-rank / forward-return only, with no LLM thesis spend (it never drives top-5 LLM calls, so it can't blow max_usd_per_scan).

                     same daily QU100 input
                              │
          ┌───────────────────┴───────────────────┐
     champion config                         challenger config(s)
     (live: opens REAL                       (shadow: tagged rows,
      paper positions)                        isolated — never the live book, #143)
          │                                         │
     champion outcomes                       challenger outcomes
          └───────────────► compare on metric ◄─────┘
                    risk-adjusted return · hit-rate · expectancy
                              │
              challenger wins by a margin AND has enough n
                              │
                operator promotes → challenger becomes champion
            (loses / inconclusive → discard; champion untouched)

Promotion gates (placeholders, finalized in the WS2 plan): per-pattern cell n ≥ 30; challenger must beat champion by a stated margin on the held-out window; the live-shadow arm needs a multi-month forward window to conclude — so WS2–WS4 promote on offline A/B plus an accruing forward-shadow check, not instantly.
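
The gates could be expressed as a small check like the one below. The n ≥ 30 floor is the placeholder from the text; the 0.05 margin, the ArmResult fields, and the function name are illustrative assumptions until the WS2 plan fixes the real values. The check only reports eligibility, since promotion itself stays an operator decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    risk_adj_return: float  # measured on the held-out window
    min_cell_n: int         # smallest per-pattern sample count behind the result


def should_promote(champion, challenger, min_n=30, margin=0.05):
    """Return (ok, reason). min_n is the placeholder gate; margin is illustrative."""
    if challenger.min_cell_n < min_n:
        return False, f"inconclusive: n={challenger.min_cell_n} < {min_n}"
    lift = challenger.risk_adj_return - champion.risk_adj_return
    if lift < margin:
        return False, f"no win: lift {lift:.3f} < margin {margin:.3f}"
    return True, f"eligible: lift {lift:.3f} with n={challenger.min_cell_n}"
```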

6.3 The champion/challenger model-config system

The weights are already file-backed (settings.yaml:stock_screener) — so the new value is not "move out of code," it's a system that makes a weight set A/B-comparable, auto-writable, versioned, and revertible. One YAML is a model:

# config/model/champion.yaml — the LIVE model. One file, everything tunable.
version: 3
parent: 2                       # which config this was derived from
created: 2026-06-20
note: "WS2 recalibration from pattern audit 2026-06-18"
score:                          # how it scored when promoted (for tracking)
  window: "2025-06..2026-06"
  risk_adj_return: 0.42
  hit_rate: 0.58
# flat StockScreenerConfig field names (match settings.yaml:stock_screener) so the
# loader deep-merges directly; pattern_weights is the one nested dict field.
layer_weight_money_flow: 0.35
layer_weight_sector: 0.10
layer_weight_pattern: 0.55
pattern_weights: {false_breakout: 0.90, bull_flag: 0.70}   # ...
neckline_tolerance_pct: 0.03
volume_breakout_multiplier: 1.5
strong_buy_threshold: 0.80
buy_threshold: 0.65
watch_threshold: 0.50

Precedence (must be explicit): champion.yaml > settings.yaml:stock_screener > StockScreenerConfig code defaults. The loader populates settings.stock_screener from champion.yaml, and must hook load_settings (which both get_settings and the scheduler's hot-reload load_settings_fresh delegate to); otherwise a promotion wouldn't take effect without a daemon restart.
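
The precedence chain can be sketched as two deep merges over plain dicts (the parsed YAML). The helper names, and the choice to strip the metadata keys (version, parent, created, note, score) before merging, are assumptions rather than the loader's actual behavior.

```python
METADATA_KEYS = {"version", "parent", "created", "note", "score"}


def deep_merge(base, override):
    """Return base with override applied; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_screener_config(code_defaults, settings_block, champion):
    """Apply precedence: code defaults, then settings.yaml, then champion.yaml."""
    model_fields = {k: v for k, v in champion.items() if k not in METADATA_KEYS}
    return deep_merge(deep_merge(code_defaults, settings_block), model_fields)
```

With this shape, a champion.yaml that lists only false_breakout under pattern_weights would leave every other pattern's weight at whatever the lower layers supply, rather than dropping them.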

This one file is the thing every part of the loop operates on:

   pattern audit (WS1) → derive weights (§6.1) → challenger.yaml (auto-written)
        └──► A/B vs champion.yaml (§6.2) ──► wins + operator promotes ──►
             champion.yaml (version++, parent=old, score recorded)
             + config/model/history/  +  results registry (Parquet, NOT Neon)

Wiring & scope: the screener already takes a StockScreenerConfig; we add the champion.yaml loader (with settings.yaml then code defaults as fallback) hooked into load_settings_fresh(), and A/B instantiates two configs from two files. This ships with WS1 as a behavior-preserving refactor (seeded byte-identical) — so WS2–WS4 have a file to write.
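
A minimal sketch of the promotion bookkeeping described in the flow above (version++, parent=old, score recorded), operating on parsed YAML as dicts. The function names are hypothetical, and the rollback numbering (a rollback gets a fresh version whose parent is the config it replaces) is one possible reading of "re-promote the prior champion.yaml", not a settled design.

```python
import copy
from datetime import date


def promote(champion, challenger, score, today):
    """Return (new_champion, archived) without mutating the inputs."""
    new_champion = copy.deepcopy(challenger)
    new_champion["version"] = champion["version"] + 1
    new_champion["parent"] = champion["version"]
    new_champion["created"] = today.isoformat()
    new_champion["score"] = dict(score)
    # The archived copy is what would be written under config/model/history/.
    archived = copy.deepcopy(champion)
    return new_champion, archived


def rollback(current, archived):
    """Re-promote a prior champion under a new version number (assumed scheme)."""
    restored = copy.deepcopy(archived)
    restored["version"] = current["version"] + 1
    restored["parent"] = current["version"]
    restored["note"] = f"rollback to v{archived['version']}"
    return restored
```

Writing the returned dicts back to champion.yaml, the history directory, and the results registry is left out, since those paths and formats are defined by the config system itself.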

7. Expected Outcome

The audit turns a guess into an action:

If a pattern's forward return ≈ 0 / win-rate ≈ 50%  → drop or downweight it (WS2).
If money-flow alone out-predicts pattern             → lower the 0.65 pattern weight (WS3).
If a pattern only works in one regime                → make its weight regime-conditional.
If bearish patterns precede rises in uptrends        → confirm the #143 fix quantitatively.

8. Success Metrics

9. Tradeoffs

Tradeoff | Why accepted
Universe = symbols present in our scrape history (~current QU100), replayed over 1yr | Historical QU100 membership isn't recorded; this is a fixed-universe-over-history approximation; disclosed
One year of history, daily bars only | The data we already have; intraday adds cost without changing pattern-level calibration
Calibration, not retraining | Sample too small for ML; in-context weight tuning is the right tool now
Live-shadow A/B needs a multi-month forward window | Offline A/B gives an immediate read; the forward arm guards against overfit before live promotion

10. Alternatives Considered

A — Do nothing. Keep hand-set weights.                  (rejected: we already saw them fail)
B — Wait for live data to mature, then calibrate.        (rejected: months of blocked tuning)
C — Reuse the qu100_portfolio backtest engine for audit. (rejected: diverges from the live ranker)
D — Replay the LIVE ranker over the backfill, then A/B-tune via the config system. (chosen)
E — Retrain an ML scorer on outcomes.                    (rejected: sample far too small)

11. Rollout Plan

  1. WS1 (this PR): faithful live-ranker replay + corpus + report, and the champion.yaml config system (loader hooked into load_settings_fresh, seeded byte-identical). Behavior-preserving — no weight changes. Merge.
  2. Read-out: operator reviews the hit-rate report; short follow-up design pass specs WS2–WS4 against the actual numbers.
  3. WS2 / WS3 / WS4 (one PR each): each as an A/B experiment — offline first (two configs over the replay), then an accruing live-shadow check; promote only on a measured win. Rollback = re-promote the prior champion.yaml.

12. Risks

13. Future Work Not Chosen


Appendix

A. Code locations (module paths; exact lines live in the PR / companion notes)

B. WS1 implementation notes

C. Validation categories (detailed cases in the PR)

D. Change log