PGPortfolio on Mag 7 + QQQ/SPY/ES — research doc

research / planning · drafted 2026-05-19 · iterating in this file · not yet a build plan

1. Vision / framing

Reimplement PGPortfolio (Jiang/Xu/Liang 2017) on a small US-equity universe — Mag 7 + QQQ + SPY + ES — to (a) learn the differentiable-allocator technique, (b) establish a credible 2026 baseline for portfolio DRL on US equities, (c) decide whether the technique deserves a place in rainier's stack as an allocation overlay on top of the QU100 screener.

What's actually valuable in PGPortfolio: the explicit, differentiable transaction-cost reward (no Monte Carlo, no critic), the portfolio-vector memory (PVM) trick, and OSBL sampling for non-stationary series. Not the headline "DRL beats markets" framing: that was crypto in 2017 and is poorly replicated on equities.

The goal is honest: build a working EIIE allocator, run it on Mag 7 with leakage-proof walk-forward, and measure it against simple baselines (1/N, SPY, QQQ, momentum). If it beats baselines after realistic costs across multiple regimes — great, integrate. If it doesn't — we have a published-quality null result and a reusable backtest harness for rainier.

Per the recursive-research-system doc's anti-overfitting discipline (§8), every result here must clear Deflated Sharpe, walk-forward, and purged CV before it is reported as alpha.

2. What PGPortfolio actually is

Three components, all load-bearing:

2.1 EIIE — Ensemble of Identical Independent Evaluators (CNN)

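Input tensor X_t of shape (f, n, m) = (3, 50, 11) in the paper: f = 3 price features per bar (close, high, low), n = 50 lookback bars, and m = 11 non-cash assets, with every price divided by that asset's latest close.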

Row-isolated convolutions (every asset processed independently with shared weights — that's the "identical independent" part):

  1. Conv 1: 1×3 kernel → 2 feature maps of size m × 48.
  2. Conv 2: 1×48 kernel → 20 feature maps of size m × 1.
  3. PVM concat: append previous portfolio weights w_{t-1} as a 21st feature map.
  4. Conv 3: 1×1 kernel → one score per asset.
  5. Cash bias + softmax: learned scalar for cash, softmax over m+1 outputs → portfolio weights summing to 1.

Why row-isolated: the same evaluator processes each asset, so the network generalizes across assets and the model size doesn't blow up with m. PVM lets the network see its previous decision without recurrent backprop through time — a clever workaround.
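
The five steps above can be traced with a little shape arithmetic. This is a framework-free sketch, not the paper's code: the function names are hypothetical, shapes are written channels-first as (channels, assets, time), and the parameter count assumes one bias per output channel, which the paper's implementation may or may not use.

```python
import math

def eiie_layer_shapes(f=3, n=50, m=11, k1=3, c1=2, c2=20):
    """Trace (channels, assets, width) through the three row-isolated convs."""
    w1 = n - k1 + 1                      # valid 1 x k1 conv along the time axis
    shapes = {
        "input": (f, m, n),
        "conv1": (c1, m, w1),            # 1x3 kernel  -> 2 maps of m x 48
        "conv2": (c2, m, 1),             # 1x48 kernel -> 20 maps of m x 1
        "pvm_concat": (c2 + 1, m, 1),    # previous weights as one extra map
        "conv3": (1, m, 1),              # 1x1 kernel  -> one score per asset
    }
    params = (
        f * c1 * k1 + c1                 # conv1 weights + biases (assumed)
        + c1 * c2 * w1 + c2              # conv2 weights + biases (assumed)
        + (c2 + 1) + 1                   # conv3 weights + bias (assumed)
        + 1                              # learned cash bias
    )
    return shapes, params

def weights_from_scores(asset_scores, cash_bias=0.0):
    """Prepend the cash bias, then softmax over the m+1 logits."""
    logits = [cash_bias] + list(asset_scores)
    peak = max(logits)                   # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

shapes, params = eiie_layer_shapes()
print(shapes["conv1"], params)
```

Because the kernels never span the asset axis, the parameter count does not depend on m; only the output length does.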

2.2 Loss — differentiable log-growth with closed-form transaction cost

Objective is mean log portfolio growth across the training window:

R = (1/t_f) · Σ_t log( μ_t · y_t · w_{t-1} )

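Where y_t is the price-relative vector for period t (close over previous close, with the cash entry fixed at 1), w_{t-1} is the weight vector held entering period t, μ_t is the transaction remainder factor (the fraction of portfolio value left after paying rebalancing costs), and t_f is the number of periods in the window.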

The transaction-cost recursion (Eq. 14/15 in the paper) is implicit: μ_t appears on both sides, so there is no closed form. It is solved by fixed-point iteration (typically 3–5 iters to convergence):

μ_t = [ 1 − c_p·w'_{t,0} − (c_s + c_p − c_s·c_p) · Σ_i (w'_{t,i} − μ_t·w_{t,i})⁺ ]
      / (1 − c_p · w_{t,0})

Here c_p and c_s are the purchase and sale commission rates, w'_t is the portfolio after the period's price move but before rebalancing, and index 0 is cash. This is the killer feature: the reward is an explicit, differentiable function of prices and chosen weights, with no critic, no bootstrapping, and no Monte Carlo policy-gradient noise. Train it like supervised learning, backpropagating through the time steps in a window.
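
A minimal plain-Python sketch of the μ_t iteration and the log-growth objective, to make the recursion concrete. Function names are hypothetical, index 0 is taken to be cash, the starting guess μ = 1 is an assumption, and the 0.25% default rates are the paper's flat commission from §4. A trainable version would run the same arithmetic on tensors so gradients flow through μ_t.

```python
import math

def drifted_weights(y, w_prev):
    """Weights after prices move, before rebalancing: w' = (y * w) / (y . w)."""
    growth = sum(yi * wi for yi, wi in zip(y, w_prev))
    return [yi * wi / growth for yi, wi in zip(y, w_prev)], growth

def transaction_remainder(w_drift, w_target, c_p=0.0025, c_s=0.0025, iters=10):
    """Iterate the mu_t recursion; index 0 is cash."""
    k = c_s + c_p - c_s * c_p
    mu = 1.0                              # assumed starting guess: no cost
    for _ in range(iters):
        # Value sold per asset is the positive part of (w' - mu * w).
        sold = sum(max(wd - mu * wt, 0.0)
                   for wd, wt in zip(w_drift[1:], w_target[1:]))
        mu = (1.0 - c_p * w_drift[0] - k * sold) / (1.0 - c_p * w_target[0])
    return mu

def mean_log_growth(price_relatives, weights, c_p=0.0025, c_s=0.0025):
    """R = (1/t_f) * sum_t log(mu_t * y_t . w_{t-1}).

    weights[t] is held during period t+1, so weights has one more entry
    than price_relatives (weights[0] is the initial portfolio).
    """
    total = 0.0
    for t, y in enumerate(price_relatives):
        w_drift, growth = drifted_weights(y, weights[t])
        mu = transaction_remainder(w_drift, weights[t + 1], c_p, c_s)
        total += math.log(mu * growth)
    return total / len(price_relatives)

# Swapping 20% of the book between two assets costs roughly 0.1% of value.
print(transaction_remainder([0.2, 0.5, 0.3], [0.2, 0.3, 0.5]))
```

With commission rates this small the update is a strong contraction, which is why a handful of iterations is enough.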

2.3 OSBL — Online Stochastic Batch Learning

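Financial series are non-stationary. The paper's solution: train online on mini-batches that are contiguous runs of periods, drawing each batch's start time from a geometric distribution with decay β so that recent starts are sampled most often while older history is still revisited.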

Why this matters: uniform replay trains the agent on stale regimes; pure online training overfits the latest move. Geometric weighting is the compromise. For 2026 US equities, β needs re-tuning — the paper's value was for 30-min crypto, not daily stocks.
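
A sketch of geometric recency-weighted batch starts, under stated assumptions: function names are hypothetical, the distribution is truncated to the valid start range and renormalized, and the β used in the example is a placeholder rather than the paper's value.

```python
import random

def osbl_start_probs(num_periods, batch_len, beta):
    """Truncated geometric weights over valid batch starts, newest heaviest."""
    last_start = num_periods - batch_len
    raw = [beta * (1.0 - beta) ** (last_start - t_b)
           for t_b in range(last_start + 1)]
    total = sum(raw)                      # renormalize after truncation
    return [p / total for p in raw]

def sample_batch_starts(num_periods, batch_len, beta, k, seed=0):
    """Draw k batch start indices; each batch is the contiguous run
    [start, start + batch_len)."""
    probs = osbl_start_probs(num_periods, batch_len, beta)
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=k)

starts = sample_batch_starts(num_periods=500, batch_len=50, beta=0.01, k=8)
print(starts)
```

As β approaches 1 the sampler collapses to pure online training on the newest batch; as β approaches 0 it approaches uniform replay. That is the dial to re-tune for daily bars.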

3. Modern landscape (2026)

| # | Project | Status | Verdict for our use |
|---|---------|--------|---------------------|
| 1 | TradeMaster (NTU); ~2.7k stars, last commit 1747cc1 Jun 2025 | Actively maintained | Best EIIE reference implementation. PyTorch port of EIIE + DJ30 US-stock tutorial. Use as the architectural baseline for our PyTorch rewrite. Don't fork: read and reimplement clean. |
| 2 | RLPortfolio; ~57 stars, last commit 071cf75 Mar 2025 | Active but small | Modern PyTorch + Gymnasium, tests + docs, policy-gradient portfolio env. Good for the environment scaffolding (replay buffer, OSBL sampler, walk-forward driver). |
| 3 | FinRL; ~15.1k stars, updated Apr 2026 | Actively maintained | Not a faithful EIIE clone, but the best infrastructure for data plumbing, Gym envs, baselines, transaction-cost models. Use FinRL for the harness; reimplement EIIE inside it. |
| 4 | PGPortfolio (original); ~1.9k stars, TF 1.x, ~unmaintained since 2017 | Archived in spirit | Read for ground truth (the README explicitly warns about 2017 regime + slippage). Don't run: TF 1.x is 9 years stale. |
| 5 | wassname/rl-portfolio-management; ~562 stars, archived Mar 2025 | Archived | Cautionary tale. PyTorch notebooks; author archived with an explicit note that train growth did not generalize on test. Read for what to avoid; do not start from. |

Recommended stack

FinRL harness + TradeMaster-style EIIE reimplementation, written in PyTorch from scratch. ~400–600 LOC for the EIIE module; FinRL handles data, env, costs, baselines, evaluation. Reimplementing rather than forking means we own the code, can audit every line for leakage, and the recursive-research-system doc's evaluator-immutability rule applies cleanly.

4. Critical adaptations for Mag 7 + QQQ/SPY/ES

| Dimension | Original (crypto, 2017) | Adapted (US equities, 2026) |
|-----------|-------------------------|-----------------------------|
| Universe size | 11 assets + cash | 10 assets: 9 equities/ETFs (AAPL, MSFT, GOOG, AMZN, NVDA, META, TSLA, QQQ, SPY) + ES, plus cash. Small N → regularize hard: reduce filter widths, dropout, weight decay, early stopping. |
| Bar interval | 30-minute | Daily for v1 (matches rainier's data cadence). Optional 60-min for v2. |
| Lookback window | n=50 bars | n=60 (≈3 trading months) on daily bars. Tune. |
| Transaction cost | 0.25% flat, no slippage | Realistic IBKR-style: $0.005/share commission, half-spread + 2bp slippage, no short borrow (long-only v1). For ES: round-turn commission + tick slippage + bid-ask spread. |
| Universe rule | Top-11 by volume at test start (LEAKAGE; see §8) | Frozen universe defined at training-window start, no peeking forward. Mag 7 is an ex-post, survivorship-biased selection: declare that openly and report it. |
| Market hours | 24/7 crypto | NYSE/NASDAQ session hours. Overnight gaps treated as a special bar transition. No intraday trading inside the bar. |
| Corporate actions | None for crypto | Adjusted close (split + dividend), corporate-action calendar respected, no look-ahead on ex-div dates. |
| ES futures | n/a | Continuous back-adjusted contract (Panama or ratio-roll). Explicit roll calendar (3rd Fri of Mar/Jun/Sep/Dec). Margin model: notional-aware (1 ES = $50 × index → ~$300K notional per contract). |
| Long/short | Long-only with cash | Long-only v1 (matches paper). Long/short v2 if v1 generalizes. |
| Rebalance cadence | Every 30-min bar | Daily close (matches bar interval). Compare with weekly rebalance to gauge turnover sensitivity. |
| Regime split | One crypto bull market | 2022 bear, 2023–2024 AI rally, 2025–2026 validation. Report metrics conditional on each regime separately. |
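
The cost and notional rows above reduce to a few lines of arithmetic. A rough sketch only: function names are hypothetical, per-order minimums and exchange fees are ignored, and treating an ES position's portfolio weight as notional over portfolio value is an assumption that §12.5 still has to settle.

```python
def equity_trade_cost(shares, price, spread_bps,
                      commission_per_share=0.005, slippage_bps=2.0):
    """Dollar cost of one equity trade: commission + half-spread + slippage."""
    notional = abs(shares) * price
    commission = abs(shares) * commission_per_share
    impact_bps = spread_bps / 2.0 + slippage_bps   # half the quoted spread + 2bp
    return commission + notional * impact_bps / 10_000.0

def es_notional(index_level, multiplier=50.0):
    """Notional exposure of one ES contract: $50 x index level."""
    return multiplier * index_level

def es_weight(contracts, index_level, portfolio_value):
    """Portfolio weight implied by an ES position, measured by notional."""
    return contracts * es_notional(index_level) / portfolio_value

# 100 shares at $200 with a 2bp quoted spread: $0.50 commission + 3bp of $20,000.
print(equity_trade_cost(100, 200.0, spread_bps=2.0))
print(es_notional(6000.0))
```

One consequence worth noting: at roughly $300K per contract, a small account can only hold ES in coarse weight steps, so the softmax output for ES would need rounding before execution.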

5. Proposed implementation plan

Phased so we ship a working v0 fast and iterate. Each phase is a separate fleet task.

v0  data + harness         (no model — just baselines + walk-forward)
 │
 ▼
v1  EIIE reimplementation  (PyTorch, TradeMaster-style, smoke test)
 │
 ▼
v2  full backtest          (all baselines, all metrics, all regimes)
 │
 ▼
v3  research integration   (link to recursive-research-system doc archive)
 │
 ▼
v4  decide                 (does this enter rainier as an overlay? Y/N)

v0 — data + harness (~1 week, 1 worker)

v1 — EIIE in PyTorch (~1 week, 1 worker)

v2 — full backtest (~2 weeks, 1 worker)

v3 — research integration (~3 days)

v4 — decision gate

If the v2 gate passes: file a rainier task to integrate as an overlay. If it fails: keep the harness, ship the null result, move on. Either way, the harness becomes rainier infrastructure.

6. Backtest plan

Universe

Period & splits

| Window | Train | Validate | Test (OOS) | Regime label |
|--------|-------|----------|------------|--------------|
| W1 | 2010-01 → 2018-12 | 2019-01 → 2020-06 | 2020-07 → 2021-12 | COVID rebound + tech bull |
| W2 | 2010-01 → 2020-12 | 2021-01 → 2021-12 | 2022-01 → 2022-12 | Rates shock / bear |
| W3 | 2010-01 → 2022-12 | 2023-01 → 2023-06 | 2023-07 → 2024-12 | AI rally |
| W4 | 2010-01 → 2024-12 | 2025-01 → 2025-06 | 2025-07 → 2026-05 | Validation / unseen |
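
One way to keep these windows honest is to hold them as data and assert the ordering before any run. A sketch with hypothetical names (Window, WINDOWS, is_chronological); months are "YYYY-MM" strings, which sort correctly as plain text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Window:
    name: str
    train: tuple       # (first_month, last_month) as "YYYY-MM"
    validate: tuple
    test: tuple
    regime: str

WINDOWS = [
    Window("W1", ("2010-01", "2018-12"), ("2019-01", "2020-06"),
           ("2020-07", "2021-12"), "COVID rebound + tech bull"),
    Window("W2", ("2010-01", "2020-12"), ("2021-01", "2021-12"),
           ("2022-01", "2022-12"), "Rates shock / bear"),
    Window("W3", ("2010-01", "2022-12"), ("2023-01", "2023-06"),
           ("2023-07", "2024-12"), "AI rally"),
    Window("W4", ("2010-01", "2024-12"), ("2025-01", "2025-06"),
           ("2025-07", "2026-05"), "Validation / unseen"),
]

def is_chronological(w):
    """Train must end before validate starts, and validate before test."""
    return (w.train[0] <= w.train[1] < w.validate[0]
            <= w.validate[1] < w.test[0] <= w.test[1])

assert all(is_chronological(w) for w in WINDOWS)
```

The train spans expand while the test spans never overlap each other, so each OOS month is scored exactly once.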

Purged k-fold inside each train window with 5-day embargo around fold boundaries (López de Prado standard). Embargo prevents label-leakage from overlapping forward-return windows.
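
A simplified index generator for the scheme just described. Assumptions, stated loudly: the function name is hypothetical, folds are contiguous, and a symmetric gap of `embargo` bars on both sides of the test fold stands in for purge plus embargo. That is only adequate when the label horizon is no longer than the gap; the López de Prado formulation purges by actual label overlap and embargoes after the test fold.

```python
def purged_kfold(n_samples, n_folds=5, embargo=5):
    """Yield (train_idx, test_idx) for contiguous folds with a gap on both sides."""
    fold_size, remainder = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # Spread any remainder over the first folds so sizes differ by at most 1.
        stop = start + fold_size + (1 if fold < remainder else 0)
        lo, hi = start - embargo, stop + embargo   # excluded zone around the test fold
        train = [i for i in range(n_samples) if i < lo or i >= hi]
        test = list(range(start, stop))
        yield train, test
        start = stop

for train_idx, test_idx in purged_kfold(100, n_folds=5, embargo=5):
    print(test_idx[0], test_idx[-1], len(train_idx))
```

Interior folds lose 2 × embargo training bars, edge folds lose only one side's worth.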

Baselines (mandatory)

  1. UCRP — Uniform Constant Rebalanced Portfolio (1/N rebalanced daily).
  2. BCRP — Best Constant Rebalanced Portfolio (look-ahead optimal; oracle, not deployable).
  3. 1/N buy-and-hold — equal-weight, no rebalance.
  4. SPY buy-and-hold.
  5. QQQ buy-and-hold.
  6. TS-momentum — Moskowitz/Ooi/Pedersen 12-month momentum, vol-targeted.
  7. Mean-variance + Ledoit-Wolf shrinkage — classic Markowitz with shrunk covariance.
  8. Risk parity — equal-risk-contribution weights.
  9. No-trade band 1/N — 1/N with cost-aware rebalance only when weights drift > threshold.
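
The difference between baselines 1 and 3 is easy to blur, so here is a cost-free toy sketch (hypothetical function names; costs deliberately ignored). Each row of price_relatives holds one bar's close-over-previous-close ratio per asset.

```python
def ucrp_value(price_relatives):
    """Final value of $1 rebalanced to 1/N at every bar."""
    value = 1.0
    for y in price_relatives:
        value *= sum(y) / len(y)          # equal weights at the start of each bar
    return value

def buy_and_hold_value(price_relatives):
    """Final value of $1 split 1/N once and never rebalanced."""
    n = len(price_relatives[0])
    holdings = [1.0 / n] * n
    for y in price_relatives:
        holdings = [h * r for h, r in zip(holdings, y)]
    return sum(holdings)

# Two assets that each double then halve (in opposite order).
rel = [(2.0, 0.5), (0.5, 2.0)]
print(ucrp_value(rel), buy_and_hold_value(rel))
```

On this mean-reverting toy path rebalancing harvests the swings (1.5625 vs 1.0); on a trending path buy-and-hold wins instead. That is why both belong in the baseline set, and why UCRP needs the cost model applied before it is compared with anything.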

Metrics (the full vector)

| Category | Metric | Why |
|----------|--------|-----|
| Return | fAPV (final Accumulated Portfolio Value) | Paper's headline. Useful but easily gamed by a lucky run. |
| Return | CAGR | Annualized version of fAPV. |
| Risk-adj | Sharpe ratio | Standard. Reported alongside DSR/PSR. |
| Risk-adj | Deflated Sharpe (DSR) | Adjusts for multiple trials. Mandatory for any "this beats baseline" claim. |
| Risk-adj | Probabilistic Sharpe (PSR) | Probability that the true Sharpe exceeds a benchmark Sharpe, given sample length, skew, and kurtosis. |
| Risk-adj | Sortino ratio | Downside-only volatility; better for skewed distributions. |
| Drawdown | Max Drawdown | Standard. |
| Drawdown | Calmar / MAR | Return ÷ MDD. Penalizes deep drawdowns. |
| Drawdown | Drawdown duration | How long underwater: operator-felt pain. |
| Cost | Turnover | Rebalance frequency (rebalanced bars ÷ total bars) × mean Σ|Δw| per rebalance, i.e. average weight traded per bar. |
| Cost | Average holding period | Inverse turnover proxy. |
| Cost | Cost drag | Returns delta with vs. without costs; isolates the cost penalty. |
| Distribution | Hit rate | Fraction of bars where return > 0. |
| Concentration | Max weight, Herfindahl | Did the policy collapse to one asset? |
| Benchmark | Alpha vs. SPY, Beta vs. SPY, Info ratio | Capital-asset-pricing-style decomposition. |
| Robustness | Regime-conditional Sharpe (per window) | Does the policy work across 2022 bear AND 2023 AI rally, or just one? |
| Robustness | CSCV PBO | Probability of Backtest Overfitting. If > 0.5, the result is noise. |
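
A few of the less ambiguous metrics, sketched in plain Python. Names are hypothetical, 252 bars per year is an assumption for daily data, and turnover here is the one-way convention (half of Σ|Δw|) measured between consecutive target weights, ignoring intra-bar drift.

```python
def max_drawdown(equity):
    """Largest peak-to-trough loss as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        worst = max(worst, 1.0 - v / peak)
    return worst

def cagr(equity, bars_per_year=252):
    years = (len(equity) - 1) / bars_per_year
    return (equity[-1] / equity[0]) ** (1.0 / years) - 1.0

def calmar(equity, bars_per_year=252):
    mdd = max_drawdown(equity)
    return float("inf") if mdd == 0 else cagr(equity, bars_per_year) / mdd

def herfindahl(weights):
    """Sum of squared weights: 1/N when equal-weighted, 1.0 when all-in on one asset."""
    return sum(w * w for w in weights)

def mean_turnover(weight_path):
    """Average one-way turnover per bar: mean over bars of 0.5 * sum |dw|."""
    steps = list(zip(weight_path, weight_path[1:]))
    traded = [0.5 * sum(abs(b - a) for a, b in zip(w0, w1)) for w0, w1 in steps]
    return sum(traded) / len(steps)

print(max_drawdown([1.0, 1.2, 0.9, 1.5]))
```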

Promotion criteria for "this works"

The model is considered to have generalized if and only if:

  1. DSR ≥ 0.5 on out-of-sample (deflated for trial count).
  2. Calmar ≥ 1.0 on out-of-sample.
  3. Beats best-baseline (likely TS-momentum or QQQ B&H) in ≥3 of 4 walk-forward windows.
  4. CSCV PBO ≤ 0.4.
  5. Regime-conditional Sharpe non-negative in the 2022 bear window (the hardest test).

If any of these fails: the model has not generalized. Report honestly. Don't deploy.
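
Since the gate is all-or-nothing, it is worth encoding once so that no run can quietly relax a threshold. A sketch with hypothetical names (OosResult, failed_criteria); the thresholds are copied from the five criteria above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OosResult:
    dsr: float
    calmar: float
    windows_beating_best_baseline: int   # out of the 4 walk-forward windows
    pbo: float
    bear_2022_sharpe: float

def failed_criteria(r):
    """Return the names of every promotion criterion the result fails."""
    checks = {
        "DSR >= 0.5": r.dsr >= 0.5,
        "Calmar >= 1.0": r.calmar >= 1.0,
        "beats best baseline in >= 3 of 4 windows": r.windows_beating_best_baseline >= 3,
        "CSCV PBO <= 0.4": r.pbo <= 0.4,
        "2022 bear Sharpe >= 0": r.bear_2022_sharpe >= 0.0,
    }
    return [name for name, ok in checks.items() if not ok]

result = OosResult(dsr=0.62, calmar=1.3, windows_beating_best_baseline=3,
                   pbo=0.45, bear_2022_sharpe=0.1)
print(failed_criteria(result))   # an empty list means "generalized"
```

Returning the failed names rather than a bare boolean keeps the null-result write-up honest about which test broke.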

7. Open questions

Q1 — Universe inclusion (ES futures)

Stake: Include ES from v1.1 (not v1). Reason: continuous-contract construction adds non-trivial complexity (roll calendar, margin treatment, contract size). Get the 9-instrument equity/ETF baseline (Mag 7 + QQQ + SPY) working first, then add ES.

Alternative: ES as benchmark only (not in the policy's universe), comparing EIIE-on-Mag-7 against ES-buy-and-hold. Simpler.

Q2 — Long-only or long/short?

Stake: Long-only for v1 and v2. Reasons: (a) matches the paper, (b) avoids short-borrow cost modeling complexity, (c) Mag 7 is a long-bias universe by construction. Long/short is deferred indefinitely.

Q3 — Bar interval

Stake: Daily bars for v1. Reasons: (a) matches rainier's data cadence, (b) avoids intraday data licensing, (c) crypto's 30-min cadence assumed 24/7 trading which doesn't apply. 60-min for v2 if v1 motivates it.

Q4 — Data source

Stake: yfinance for v0 (free, sufficient for daily). Upgrade to Polygon or IBKR if (a) we need intraday, or (b) we need adjusted-corporate-actions guarantees yfinance can't provide.

Cost: yfinance is free but rate-limited and quality-uncertain on edge cases (splits, ex-div). Polygon $29/mo unlocks 5y history; $79/mo unlocks 2y of options. IBKR requires an active account.

Q5 — Where does the code live?

Stake: New module src/rainier/portfolio_drl/ in the rainier repo. Reasons: shares data plumbing (StockPrice table), backtest discipline (walk-forward), and evaluation infrastructure with existing rainier work. The recursive-research-system doc's evaluator-immutability rule applies cleanly.

Alternative: Standalone repo. Argument against: forks the backtest harness and the cost models, which means double the maintenance.

Q6 — Reimplement or adopt TradeMaster?

Stake: Reimplement EIIE in ~500 LOC of clean PyTorch using TradeMaster as a reference. Reasons: (a) we own the code and can audit every line for leakage, (b) TradeMaster has dependencies and design choices we don't need, (c) the EIIE module itself is small (3 conv layers + PVM).

Alternative: Fork TradeMaster and customize it for Mag 7. Faster to first run but harder to audit and harder to extend.

Q7 — Compute budget

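Stake: Local CPU is sufficient. EIIE is tiny: at the paper's layer sizes from §2.1 the three convs plus the cash bias come to roughly 2K parameters (about 20 + 1,940 + 22 + 1 with biases). Training a single window takes minutes on CPU, seconds on a laptop GPU. No cloud needed.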

Question: Operator confirm — laptop training is fine, or should we provision a GPU box for sweeps?

Q8 — What's the kill switch?

Stake: If v2 gate fails (per §6 promotion criteria), we publish the null result + the harness in a deep-dive section of the recursive-research-system doc, and stop. No production integration. This is a research project with a defined exit.

Tension: Hard to walk away from sunk cost. The kill switch must be set up front. If §6 promotion criteria fail, EIIE does not enter rainier.

8. Risks & known failures

Inherited from PGPortfolio (2017)

Specific to this adaptation

Post-2020 literature reality check

Codex's verdict: "The strongest evidence remains crypto or curated academic benchmarks. For US equities, DRL is credible as an experimental allocator or execution-aware overlay, not as presumed alpha. A 2026 paper must beat simple QQQ/SPY and momentum after costs, with leakage-proof walk-forward evidence."

This is the right frame. We're not setting out to "beat the market". We're testing whether a specific 2017 technique generalizes to a specific 2026 universe under honest conditions.

Non-goals

9. Current rainier state vs. needed

| Component | Today in rainier | Needed | Gap |
|-----------|------------------|--------|-----|
| Daily OHLCV | StockPrice table (China A-shares) | US equities 2010-present + ES continuous | NEW: ingestion path for yfinance / Polygon |
| Backtest engine | backtest/engine.py, walk_forward.py | Walk-forward + purged k-fold + embargo + DSR/PSR/PBO | EXTEND: add DSR/PSR/PBO; purged k-fold may need new wiring |
| Cost model | Likely crypto-style or none | US-equity commission + spread + slippage; ES round-turn + tick slippage | NEW |
| Baselines | QU100-flavor signal mixing | UCRP, BCRP, 1/N, TS-momentum, MV+shrinkage, risk parity | NEW (some reuse from existing screener) |
| EIIE module | (none) | PyTorch 3-conv + PVM + cash bias + softmax | NEW (~300–500 LOC) |
| Transaction-cost reward | (none) | μ_t fixed-point recursion, differentiable | NEW (~50 LOC) |
| OSBL sampler | (none) | Geometric recency-weighted batch starts | NEW (~30 LOC) |
| Evaluation reporter | Discord embed, Streamlit dashboard | Walk-forward summary, regime-conditional Sharpe table, PBO plot | EXTEND: add the metric vector renderer |

Bottom line: a meaningful chunk of net-new code, but most of the heavy infrastructure (walk-forward, baselines, data, evaluation reporting) is reusable rainier surface.

10. Decision log

| Date | Decision | Rationale | From |
|------|----------|-----------|------|
| 2026-05-19 | Research-only doc first; defer build until §7 questions resolved | Operator directive: "start research and give me the research doc first, then we decide next step" | operator |

11. Open threads

12. Deep dives

12.1 EIIE math — full derivation

TBD. Step-by-step: input normalization, the row-isolated conv intuition, why softmax with cash bias works, why the μ_t recursion has a unique fixed point, gradient flow through μ_t.

12.2 US-equity cost model — exact form

TBD. Commission per share, half-spread + slippage in bp, ES round-turn commission + tick slippage + bid-ask, no short borrow for v1, ADV participation if/when size matters.

12.3 Walk-forward + purged k-fold + embargo

TBD. Fold construction, embargo length, holdout cut policy, label-leakage prevention. Reuses §4/§11.2 from the recursive-research-system doc.

12.4 Deflated Sharpe / PSR / CSCV PBO — exact computation

TBD. Trial count, skewness/kurtosis adjustment, IID-violation handling. Reference: Bailey/López de Prado 2014.
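
As a starting point for this section, a stdlib sketch of PSR and DSR following the Bailey/López de Prado formulas as I understand them. Function names are hypothetical. Sharpe values are per-bar (not annualized), kurt is raw kurtosis (3 for a normal), sr_std is the standard deviation of Sharpe estimates across trials, and n_trials must be at least 2. IID-violation handling and CSCV PBO are not covered.

```python
import math
from statistics import NormalDist

_N = NormalDist()   # standard normal

def psr(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probabilistic Sharpe: P(true SR > sr_benchmark) given the sample."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return _N.cdf((sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom)

def expected_max_sharpe(n_trials, sr_std):
    """Expected best Sharpe among n_trials zero-skill strategies (n_trials >= 2)."""
    gamma = 0.5772156649015329            # Euler-Mascheroni constant
    a = _N.inv_cdf(1.0 - 1.0 / n_trials)
    b = _N.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - gamma) * a + gamma * b)

def dsr(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Deflated Sharpe: PSR measured against the expected best-of-N luck."""
    return psr(sr, expected_max_sharpe(n_trials, sr_std), n_obs, skew, kurt)

# Same observed Sharpe, but deflated for 20 trials it looks far less convincing.
print(psr(0.1, 0.0, 500), dsr(0.1, 500, n_trials=20, sr_std=0.05))
```

The trial count is the input most easily understated: every hyperparameter setting that was looked at counts, not just the ones that were kept.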

12.5 ES continuous-contract construction

TBD. Panama vs. ratio roll, roll calendar, margin treatment, contract-size translation to portfolio weight.
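
Two building blocks for this section, sketched with hypothetical function names: the third-Friday date used by the roll calendar in §4, and an additive (Panama) splice of two contracts. The splice assumes both price lists are aligned on the same dates and that the new contract takes over at roll_index; choosing that index (on the third Friday or some days before it) is left open.

```python
from datetime import date, timedelta

def third_friday(year, month):
    """Third Friday of the given month (date.weekday(): Monday=0, Friday=4)."""
    first = date(year, month, 1)
    offset = (4 - first.weekday()) % 7    # days until the first Friday
    return first + timedelta(days=offset + 14)

def panama_back_adjust(old_prices, new_prices, roll_index):
    """Splice two contracts, shifting the older one by the price gap at the roll."""
    gap = new_prices[roll_index] - old_prices[roll_index]
    return [p + gap for p in old_prices[:roll_index]] + list(new_prices[roll_index:])

print([third_friday(2026, m) for m in (3, 6, 9, 12)])
print(panama_back_adjust([100, 101, 102, 103], [105, 106, 107, 108], roll_index=2))
```

The additive shift preserves point changes but distorts percentage returns on older bars. Since EIIE consumes price relatives, the ratio-roll variant may turn out to be the better fit; that comparison belongs here once the section is written.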

12.6 Regime-conditional attribution

TBD. How to split returns by regime (rule-based vs. HMM), how to report per-regime Sharpe + drawdown, how to weight regimes for the overall promotion gate.

13. Change log

14. Sources