PGPortfolio on Mag 7 + QQQ/SPY/ES

1. Vision / framing

Reimplement PGPortfolio (Jiang/Xu/Liang 2017) on a small US-equity universe — Mag 7 + QQQ + SPY + ES — to (a) learn the differentiable-allocator technique, (b) establish a credible 2026 baseline for portfolio DRL on US equities, (c) decide whether the technique deserves a place in rainier's stack as an allocation overlay on top of the QU100 screener.

What's actually valuable in PGPortfolio: the explicit, differentiable transaction-cost reward (no Monte Carlo, no critic), the portfolio-vector memory trick, and OSBL sampling for non-stationary series. Not the headline "DRL beats markets" framing: that was crypto in 2017 and is poorly replicated on equities.

The goal is honest: build a working EIIE allocator, run it on Mag 7 with leakage-proof walk-forward, and measure it against simple baselines (1/N, SPY, QQQ, momentum). If it beats baselines after realistic costs across multiple regimes — great, integrate. If it doesn't — we have a published-quality null result and a reusable backtest harness for rainier.

Per the recursive-research doc's anti-overfitting discipline (§8): every result here clears Deflated Sharpe + walk-forward + purged CV before it's reported as alpha.

2. What PGPortfolio actually is

Three components, all load-bearing:

2.1 EIIE — Ensemble of Identical Independent Evaluators (CNN)

Input tensor X_t of shape (f, n, m) = (3, 50, 11) in the paper:

f=3 features per (asset, time): normalized close, high, low (all divided by the latest close).
n=50 bars of lookback window.
m=11 non-cash assets.
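
The normalization above can be sketched in plain Python. This is a minimal illustration, not the paper's code: the function name `build_input_tensor`, the nested-list layout `[feature][bar][asset]`, and the toy prices are all assumptions made for the example.

```python
def build_input_tensor(close, high, low, n):
    """Return X_t as nested lists indexed [feature][bar][asset].

    close, high, low: one price list per asset, oldest bar first,
    each at least n bars long (hypothetical input layout).
    """
    m = len(close)
    latest_close = [close[i][-1] for i in range(m)]
    tensor = []
    for series in (close, high, low):              # f = 3 features
        feature = [
            # every price is divided by that asset's latest close
            [series[i][-n + k] / latest_close[i] for i in range(m)]
            for k in range(n)                       # n bars, oldest first
        ]
        tensor.append(feature)
    return tensor


# Toy data: 2 assets, 3 bars each.
close = [[10.0, 11.0, 12.0], [50.0, 49.0, 50.0]]
high = [[10.5, 11.5, 12.6], [51.0, 50.0, 50.5]]
low = [[9.5, 10.5, 11.4], [49.0, 48.0, 49.0]]
x = build_input_tensor(close, high, low, n=3)
```

By construction the normalized close on the latest bar is exactly 1.0 for every asset, which gives a cheap sanity check on the data pipeline.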

Row-isolated convolutions (every asset processed independently with shared weights — that's the "identical independent" part):

Conv 1: 1×3 kernel → 2 feature maps of size m × 48.
Conv 2: 1×48 kernel → 20 feature maps of size m × 1.
PVM concat: append previous portfolio weights w_{t-1} as a 21st feature map.
Conv 3: 1×1 kernel → one score per asset.
Cash bias + softmax: learned scalar for cash, softmax over m+1 outputs → portfolio weights summing to 1.

Why row-isolated: the same evaluator processes each asset, so the network generalizes across assets and the model size doesn't blow up with m. PVM lets the network see its previous decision without recurrent backprop through time — a clever workaround.
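
A minimal PyTorch sketch of the layer stack listed above, using the paper's sizes (f=3, n=50, m=11). The class name `EIIE`, the `(batch, f, m, n)` tensor layout, the ReLU activations, and the zero-initialized cash bias are assumptions made for this sketch, not details confirmed by this document.

```python
import torch
import torch.nn as nn


class EIIE(nn.Module):
    """Sketch of the 3-conv EIIE; kernels span time only, never assets."""

    def __init__(self, n_features=3, window=50):
        super().__init__()
        self.conv1 = nn.Conv2d(n_features, 2, kernel_size=(1, 3))
        self.conv2 = nn.Conv2d(2, 20, kernel_size=(1, window - 2))
        self.conv3 = nn.Conv2d(21, 1, kernel_size=(1, 1))   # 20 maps + PVM
        self.cash_bias = nn.Parameter(torch.zeros(1))       # learned scalar

    def forward(self, x, w_prev):
        # x: (batch, f, m, n); w_prev: (batch, m) previous non-cash weights
        h = torch.relu(self.conv1(x))                 # (batch, 2, m, n-2)
        h = torch.relu(self.conv2(h))                 # (batch, 20, m, 1)
        pvm = w_prev.unsqueeze(1).unsqueeze(-1)       # (batch, 1, m, 1)
        h = torch.cat([h, pvm], dim=1)                # (batch, 21, m, 1)
        scores = self.conv3(h).squeeze(-1).squeeze(1)  # (batch, m)
        cash = self.cash_bias.expand(scores.shape[0], 1)
        # softmax over cash + m assets; index 0 is cash
        return torch.softmax(torch.cat([cash, scores], dim=1), dim=1)


model = EIIE()
x = torch.rand(4, 3, 11, 50)
w_prev = torch.full((4, 11), 1.0 / 12.0)
weights = model(x, w_prev)            # shape (4, 12), each row sums to 1
n_params = sum(p.numel() for p in model.parameters())
```

Because every kernel has height 1 on the asset axis, no convolution mixes information across assets; the only cross-asset interaction is the final softmax. With biases counted, this configuration has 1,983 trainable parameters.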

2.2 Loss — differentiable log-growth with closed-form transaction cost

Objective is mean log portfolio growth across the training window:

R = (1/t_f) · Σ_t log( μ_t · y_t · w_{t-1} )

Where:

y_t = v_t / v_{t-1} — per-asset price relative.
w'_t = (y_t ⊙ w_{t-1}) / (y_t · w_{t-1}) — drifted weights after the price move.
μ_t — transaction-cost remainder: the fraction of portfolio value that survives rebalancing from w'_t to target w_t.

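The transaction-cost remainder (Eq. 14/15 in the paper) is defined implicitly: μ_t appears on both sides, inside the rectifier (·)⁺, so it has no closed-form solution and is solved by fixed-point iteration (typically 3–5 iterations to convergence):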

μ_t = [ 1 − c_p·w'_{t,0} − (c_s + c_p − c_s·c_p) · Σ_i (w'_{t,i} − μ_t·w_{t,i})⁺ ]
      / (1 − c_p · w_{t,0})

Where c_p, c_s are purchase/sale commission rates. This is the killer feature: the reward is explicit, differentiable from prices + chosen weights — no critic, no bootstrapping, no Monte Carlo policy-gradient noise. Train it like supervised learning with backprop through time-steps in a window.
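
The μ_t equation and the per-step reward can be sketched in plain Python as below. Index 0 is cash, as in the formula above. The function names, the starting guess μ = 1, the tolerance, and the 0.25% default rates are assumptions for illustration; a trainable version would run the same iteration on tensors so gradients flow through μ_t.

```python
import math


def drift(y, w_prev):
    """Return (w', y·w_prev): drifted weights and gross growth before costs."""
    growth = sum(yi * wi for yi, wi in zip(y, w_prev))
    return [yi * wi / growth for yi, wi in zip(y, w_prev)], growth


def mu_remainder(w_drift, w_target, c_p=0.0025, c_s=0.0025,
                 max_iters=50, tol=1e-12):
    """Solve the implicit mu_t equation by fixed-point iteration."""
    k = c_s + c_p - c_s * c_p
    mu = 1.0                                   # assumed starting guess
    for _ in range(max_iters):
        # value of non-cash assets that must be sold, given current mu
        sold = sum(max(wd - mu * wt, 0.0)
                   for wd, wt in zip(w_drift[1:], w_target[1:]))
        mu_next = (1.0 - c_p * w_drift[0] - k * sold) / (1.0 - c_p * w_target[0])
        if abs(mu_next - mu) < tol:
            return mu_next
        mu = mu_next
    return mu


def step_log_reward(y, w_prev, w_target, **costs):
    """log(mu_t * y_t · w_{t-1}) for one bar."""
    w_drift, growth = drift(y, w_prev)
    return math.log(mu_remainder(w_drift, w_target, **costs) * growth)
```

Three limiting cases are easy to check by hand: no rebalance gives μ = 1, selling everything into cash gives μ = 1 − c_s, and buying a single asset with all cash gives μ = 1 − c_p.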

2.3 OSBL — Online Stochastic Batch Learning

Financial series are non-stationary. The paper's solution:

All historical data stays available (permanent replay).
Batch start times sampled with geometric recency weighting: P_β(t_b) = β · (1−β)^(t − t_b − n_b) for t_b ≤ t − n_b, where n_b is the batch length. Recent windows are picked more often, but old windows aren't excluded.
Within a batch, time order is preserved (no shuffling).

Why this matters: uniform replay trains the agent on stale regimes; pure online training overfits the latest move. Geometric weighting is the compromise. For 2026 US equities, β needs re-tuning — the paper's value was for 30-min crypto, not daily stocks.
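
A minimal sketch of the geometric batch-start sampler. The function name `sample_batch_start`, the inverse-transform draw, and the rejection step for offsets that fall before the start of history are assumptions for illustration; β = 0.01 is an arbitrary demo value, not a recommendation.

```python
import math
import random


def sample_batch_start(t, n_b, beta, rng):
    """Draw t_b with P(t_b) proportional to beta * (1 - beta)^(t - t_b - n_b).

    t: index of the latest available bar; n_b: batch length.
    Offsets that would start before bar 0 are rejected and redrawn.
    """
    max_offset = t - n_b
    if max_offset < 0:
        raise ValueError("not enough history for one batch")
    while True:
        # inverse-transform sample of a geometric offset k >= 0
        k = int(math.log(1.0 - rng.random()) / math.log(1.0 - beta))
        if k <= max_offset:
            return max_offset - k


rng = random.Random(0)
starts = [sample_batch_start(t=1000, n_b=50, beta=0.01, rng=rng)
          for _ in range(5000)]
# a batch is then the ordered slice range(t_b, t_b + n_b), never shuffled
```

Smaller β flattens the distribution toward uniform replay; larger β concentrates sampling on the most recent windows.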

3. Modern landscape (2026)

#	Project	Status	Verdict for our use
1	TradeMaster (NTU) ~2.7k stars, last commit `1747cc1` Jun 2025	Actively maintained	Best EIIE reference implementation. PyTorch port of EIIE + DJ30 US-stock tutorial. Use as the architectural baseline for our PyTorch rewrite. Don't fork — read and reimplement clean.
2	RLPortfolio ~57 stars, last commit `071cf75` Mar 2025	Active but small	Modern PyTorch + Gymnasium, tests + docs, policy-gradient portfolio env. Good for the environment scaffolding (replay buffer, OSBL sampler, walk-forward driver).
3	FinRL ~15.1k stars, updated Apr 2026	Actively maintained	Not a faithful EIIE clone, but the best infrastructure for data plumbing, Gym envs, baselines, transaction-cost models. Use FinRL for the harness; reimplement EIIE inside it.
4	PGPortfolio (original) ~1.9k stars, TF 1.x, ~unmaintained since 2017	Archived in spirit	Read for ground truth (the README explicitly warns about 2017 regime + slippage). Don't run — TF 1.x is 9 years stale.
5	wassname/rl-portfolio-management ~562 stars, archived Mar 2025	Archived	Cautionary tale. PyTorch notebooks; author archived with an explicit note that train growth did not generalize on test. Read for what to avoid; do not start from.

Recommended stack

FinRL harness + TradeMaster-style EIIE reimplementation, written in PyTorch from scratch. ~400–600 LOC for the EIIE module; FinRL handles data, env, costs, baselines, evaluation. Reimplementing rather than forking means we own the code, can audit every line for leakage, and the recursive-research-system doc's evaluator-immutability rule applies cleanly.

4. Critical adaptations for Mag 7 + QQQ/SPY/ES

Dimension	Original (crypto, 2017)	Adapted (US equities, 2026)
Universe size	11 assets + cash	9 equity assets (AAPL, MSFT, GOOG, AMZN, NVDA, META, TSLA, QQQ, SPY) + ES + cash, i.e. 10 risky assets. Small N → regularize hard: reduce filter widths, dropout, weight decay, early stopping.
Bar interval	30-minute	Daily for v1 (matches rainier's data cadence). Optional 60-min for v2.
Lookback window	`n=50` bars	`n=60` (≈3 trading months) on daily bars. Tune.
Transaction cost	0.25% flat, no slippage	Realistic IBKR-style: $0.005/share commission, half-spread + 2bp slippage, no short borrow (long-only v1). For ES: round-turn commission + tick slippage + bid-ask spread.
Universe rule	Top-11 by volume at test start (LEAKAGE — see §8)	Frozen universe defined at training-window start, no peeking forward. Mag 7 is openly survivorship-biased → declare it ex-post and report it.
Market hours	24/7 crypto	NYSE/NASDAQ session hours. Overnight gaps treated as a special bar transition. No intraday trading inside the bar.
Corporate actions	None for crypto	Adjusted close (split + dividend), corporate-action calendar respected, no look-ahead on ex-div dates.
ES futures	n/a	Continuous back-adjusted contract (Panama or ratio-roll). Explicit roll calendar (3rd Fri of Mar/Jun/Sep/Dec). Margin model: notional-aware (1 ES = $50 × index → ~$300K notional per contract).
Long/short	Long-only with cash	Long-only v1 (matches paper). Long/short v2 if v1 generalizes.
Rebalance cadence	Every 30-min bar	Daily close (matches bar interval). Compare with weekly rebalance to gauge turnover sensitivity.
Regime split	One crypto bull market	2022 bear, 2023–2024 AI rally, 2025–2026 validation — report metrics conditional on each regime separately.

5. Proposed implementation plan

Phased so we ship a working v0 fast and iterate. Each phase is a separate fleet task.

v0  data + harness         (no model — just baselines + walk-forward)
 │
 ▼
v1  EIIE reimplementation  (PyTorch, TradeMaster-style, smoke test)
 │
 ▼
v2  full backtest          (all baselines, all metrics, all regimes)
 │
 ▼
v3  research integration   (link to recursive-research-system doc archive)
 │
 ▼
v4  decide                 (does this enter rainier as an overlay? Y/N)

v0 — data + harness (~1 week, 1 worker)

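Download daily OHLCV for the 9-asset equity universe + ES continuous, 2010-01-01 → present, from yfinance / Polygon / IBKR depending on what we have. TSLA only lists from June 2010 and META from May 2012, so the early training bars need an explicit fill or exclusion rule.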
Build the walk-forward driver: anchored windows, purged k-fold + embargo, holdout cut.
Implement all baselines (full list in §6): UCRP, BCRP, 1/N buy-and-hold, B&H SPY, B&H QQQ, TS-momentum, mean-variance with shrinkage, risk parity, no-trade-band 1/N.
Implement realistic US-equity cost model + ES continuous-contract construction.
Compute the full metric vector for all baselines.
Gate: baselines reproduce known characteristics (1/N Sharpe ≈ historical; SPY total-return Sharpe roughly 0.8–1.0 over 2010–2019, with ≈ 0.5 being closer to the multi-decade figure).

v1 — EIIE in PyTorch (~1 week, 1 worker)

Implement EIIE module from scratch: 3 conv layers + PVM concat + cash bias + softmax.
Implement the differentiable μ_t recursion with fixed-point iteration (3–5 iters per step).
Implement OSBL sampler with geometric recency-weighted batch starts.
Train on a tiny 1-year window with deliberately trivial costs to verify that the loss decreases and the model does at least as well as 1/N.
Gate: reproduce the paper's qualitative behavior on a controlled toy (EIIE on a 4-asset synthetic series whose known optimum is equal-weight; the loss must converge).

v2 — full backtest (~2 weeks, 1 worker)

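Anchored walk-forward over the four windows in §6: train 2010–2018, validate 2019-01 → 2020-06, test 2020-07 → 2021-12, then extend the training anchor and roll the test window through 2022, 2023-07 → 2024-12, and 2025-07 → 2026-05.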
Hyperparameter sweep on val: window size, β (OSBL recency), filter counts, dropout, weight decay.
Test on each out-of-sample window without retraining (or with rolling retraining — compare both).
Report full metric vector: fAPV, Sharpe, MDD, turnover, holding period, cost drag, Calmar, Sortino, hit rate, regime-conditional Sharpe, Deflated Sharpe, Probabilistic Sharpe, CSCV PBO.
Gate: Does EIIE beat best-baseline (likely TS-momentum or QQQ B&H) on Deflated Sharpe across ≥3 of 4 out-of-sample regime windows? Y/N decides v3.

v3 — research integration (~3 days)

Write up findings in a deep-dive section of the recursive-research-system doc (§11.x).
Decide whether to file a rainier task to add EIIE as an allocation overlay on top of QU100 screener output, OR to file this as a null-result note + reusable harness.
Archive the trained model + walk-forward results with the evaluator hash (per recursive-research doc §4 immutability rule).

v4 — decision gate

If v2 gate passed: file rainier task to integrate as overlay. If failed: keep the harness, ship the null result, move on. Either way, the harness becomes rainier infrastructure.

6. Backtest plan

Universe

v1 (long-only): AAPL, MSFT, GOOG (use GOOG, not GOOGL), AMZN, NVDA, META, TSLA, QQQ, SPY, plus cash. 9 risky + cash.
v1.1 (add ES): add continuous back-adjusted ES futures. 10 risky + cash.
v2 (long/short): defer.

Period & splits

Window	Train	Validate	Test (OOS)	Regime label
W1	2010-01 → 2018-12	2019-01 → 2020-06	2020-07 → 2021-12	COVID rebound + tech bull
W2	2010-01 → 2020-12	2021-01 → 2021-12	2022-01 → 2022-12	Rates shock / bear
W3	2010-01 → 2022-12	2023-01 → 2023-06	2023-07 → 2024-12	AI rally
W4	2010-01 → 2024-12	2025-01 → 2025-06	2025-07 → 2026-05	Validation / unseen

Purged k-fold inside each train window with 5-day embargo around fold boundaries (López de Prado standard). Embargo prevents label-leakage from overlapping forward-return windows.
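
A minimal sketch of the purge-and-embargo index logic on integer bar positions. The function name and the `label_horizon` parameter (how many bars forward each sample's label looks) are assumptions for illustration, and the embargo is applied only after each test fold, following the usual López de Prado construction.

```python
def purged_kfold(n_samples, k, label_horizon, embargo):
    """Yield (train_idx, test_idx) for contiguous, time-ordered folds.

    A training sample i is dropped when its label window [i, i + horizon]
    overlaps any test label window, or when it falls inside the embargo
    that follows the test fold.
    """
    for fold in range(k):
        start = fold * n_samples // k
        stop = (fold + 1) * n_samples // k        # test fold is [start, stop)
        last_test_label = stop - 1 + label_horizon
        train = [
            i for i in range(n_samples)
            if i + label_horizon < start or i > last_test_label + embargo
        ]
        yield train, list(range(start, stop))


folds = list(purged_kfold(n_samples=100, k=5, label_horizon=5, embargo=5))
```

With 100 samples, 5 folds, a 5-bar horizon and a 5-bar embargo, the middle fold tests bars 40–59 and trains on bars 0–34 and 70–99; the 15 dropped bars are the price of leakage protection.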

Baselines (mandatory)

UCRP — Uniform Constant Rebalanced Portfolio (1/N rebalanced daily).
BCRP — Best Constant Rebalanced Portfolio (look-ahead optimal; oracle, not deployable).
1/N buy-and-hold — equal-weight, no rebalance.
SPY buy-and-hold.
QQQ buy-and-hold.
TS-momentum — Moskowitz/Ooi/Pedersen 12-month momentum, vol-targeted.
Mean-variance + Ledoit-Wolf shrinkage — classic Markowitz with shrunk covariance.
Risk parity — equal-risk-contribution weights.
No-trade band 1/N — 1/N with cost-aware rebalance only when weights drift > threshold.
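
The difference between the first and third baselines is easy to see on a toy series. The sketch below ignores transaction costs entirely, and the function names and the two-asset price relatives are made up for illustration.

```python
def ucrp_value(price_relatives):
    """Final value of 1.0 rebalanced to equal weights every bar (no costs)."""
    value = 1.0
    for y in price_relatives:
        value *= sum(y) / len(y)
    return value


def buy_and_hold_value(price_relatives):
    """Final value of 1.0 split equally at the start and never rebalanced."""
    m = len(price_relatives[0])
    holdings = [1.0 / m] * m
    for y in price_relatives:
        holdings = [h * yi for h, yi in zip(holdings, y)]
    return sum(holdings)


# Asset 0 is flat; asset 1 alternately doubles and halves.
relatives = [[1.0, 2.0], [1.0, 0.5]] * 2
ucrp = ucrp_value(relatives)            # 1.265625
hold = buy_and_hold_value(relatives)    # 1.0
```

UCRP harvests the oscillation while buy-and-hold ends flat, which is why UCRP must be charged realistic rebalancing costs before it is used as a benchmark.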

Metrics (the full vector)

Category	Metric	Why
Return	fAPV (final Accumulated Portfolio Value)	Paper's headline. Useful but easily gamed by lucky run.
Return	CAGR	Annualized version of fAPV.
Risk-adj	Sharpe ratio	Standard. Reported alongside DSR/PSR.
Risk-adj	Deflated Sharpe (DSR)	Adjusts for multiple trials. Mandatory for any "this beats baseline" claim.
Risk-adj	Probabilistic Sharpe (PSR)	Lower bound on true Sharpe given finite sample.
Risk-adj	Sortino ratio	Downside-only volatility — better for skewed distributions.
Drawdown	Max Drawdown	Standard.
Drawdown	Calmar / MAR	Return ÷ MDD. Penalizes deep drawdowns.
Drawdown	Drawdown duration	How long underwater — operator-felt pain.
Cost	Turnover	Mean Σ|Δw| per bar, i.e. (bars with a rebalance ÷ total bars) × mean Σ|Δw| per rebalance.
Cost	Average holding period	Inverse turnover proxy.
Cost	Cost drag	Returns delta with vs. without costs — isolates the cost penalty.
Distribution	Hit rate	Fraction of bars where return > 0.
Concentration	Max weight, Herfindahl	Did the policy collapse to one asset?
Benchmark	Alpha vs. SPY, Beta vs. SPY, Info ratio	Capital-asset-pricing-style decomposition.
Robustness	Regime-conditional Sharpe (per window)	Does the policy work across 2022 bear AND 2023 AI rally, or just one?
Robustness	CSCV PBO	Probability of Backtest Overfitting. If > 0.5, the result is noise.
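
A few of the simpler rows in this table can be sketched with the standard library. These are illustrative definitions under stated assumptions: a zero risk-free rate, 252 bars per year, simple (not log) returns, and turnover measured between consecutive target weights without accounting for price drift. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe with a zero risk-free rate (assumption)."""
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def cagr(returns, periods_per_year=252):
    growth = math.prod(1.0 + r for r in returns)
    return growth ** (periods_per_year / len(returns)) - 1.0


def calmar(returns, periods_per_year=252):
    return cagr(returns, periods_per_year) / max_drawdown(returns)


def turnover(weights):
    """Mean sum of absolute weight changes per bar."""
    return mean(sum(abs(a - b) for a, b in zip(cur, prev))
                for prev, cur in zip(weights, weights[1:]))


def herfindahl(weights):
    """Sum of squared weights: 1/N when equal-weight, 1.0 when fully concentrated."""
    return sum(w * w for w in weights)
```

For example, returns of +10%, −50%, +20% give a maximum drawdown of 50%, and a Herfindahl index of 1.0 flags a policy that has collapsed to a single asset.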

Promotion criteria for "this works"

The model is considered to have generalized if and only if:

DSR ≥ 0.5 on out-of-sample (deflated for trial count).
Calmar ≥ 1.0 on out-of-sample.
Beats best-baseline (likely TS-momentum or QQQ B&H) in ≥3 of 4 walk-forward windows.
CSCV PBO ≤ 0.4.
Regime-conditional Sharpe non-negative in the 2022 bear window (the hardest test).

If any of these fails: the model has not generalized. Report honestly. Don't deploy.
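
The five criteria above can be encoded as a single gate so the decision is mechanical rather than argued after the fact. The `OosReport` container and its field names are hypothetical; the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class OosReport:
    dsr: float                          # Deflated Sharpe on out-of-sample
    calmar: float                       # Calmar on out-of-sample
    windows_beating_best_baseline: int  # out of 4 walk-forward windows
    pbo: float                          # CSCV Probability of Backtest Overfitting
    sharpe_2022_bear: float             # regime-conditional Sharpe, 2022 window


def failed_criteria(report):
    """Return the names of every promotion criterion the report fails."""
    checks = {
        "DSR >= 0.5": report.dsr >= 0.5,
        "Calmar >= 1.0": report.calmar >= 1.0,
        "beats best baseline in >= 3 of 4 windows":
            report.windows_beating_best_baseline >= 3,
        "PBO <= 0.4": report.pbo <= 0.4,
        "2022 bear Sharpe >= 0": report.sharpe_2022_bear >= 0.0,
    }
    return [name for name, ok in checks.items() if not ok]


def has_generalized(report):
    return not failed_criteria(report)


good = OosReport(0.7, 1.3, 3, 0.25, 0.1)
bad = OosReport(0.7, 1.3, 3, 0.25, -0.2)
```

Returning the list of failed criteria, rather than a bare boolean, keeps the null-result write-up honest about exactly which test the model failed.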

7. Open questions

Q1 — Universe inclusion (ES futures)

Stake: Include ES from v1.1 (not v1). Reason: continuous-contract construction adds non-trivial complexity (roll calendar, margin treatment, contract size). Get the 9-asset equity baseline (Mag 7 + QQQ + SPY) working first, then add ES.

Alternative: ES as benchmark only (not in the policy's universe), comparing EIIE-on-Mag-7 against ES-buy-and-hold. Simpler.

Q2 — Long-only or long/short?

Stake: Long-only for v1 and v2. Reasons: (a) matches the paper, (b) avoids short-borrow cost modeling complexity, (c) Mag 7 is a long-bias universe by construction. Long/short is deferred indefinitely.

Q3 — Bar interval

Stake: Daily bars for v1. Reasons: (a) matches rainier's data cadence, (b) avoids intraday data licensing, (c) crypto's 30-min cadence assumed 24/7 trading which doesn't apply. 60-min for v2 if v1 motivates it.

Q4 — Data source

Stake: yfinance for v0 (free, sufficient for daily). Upgrade to Polygon or IBKR if (a) we need intraday, or (b) we need adjusted-corporate-actions guarantees yfinance can't provide.

Cost: yfinance is free but rate-limited and quality-uncertain on edge cases (splits, ex-div). Polygon $29/mo unlocks 5y history; $79/mo unlocks 2y of options. IBKR requires an active account.

Q5 — Where does the code live?

Stake: New module src/rainier/portfolio_drl/ in the rainier repo. Reasons: shares data plumbing (StockPrice table), backtest discipline (walk-forward), and evaluation infrastructure with existing rainier work. The recursive-research-system doc's evaluator-immutability rule applies cleanly.

Alternative: Standalone repo. Argument against: forks the backtest harness and the cost models, which means double the maintenance.

Q6 — Reimplement or adopt TradeMaster?

Stake: Reimplement EIIE in ~500 LOC of clean PyTorch using TradeMaster as a reference. Reasons: (a) we own the code and can audit every line for leakage, (b) TradeMaster has dependencies and design choices we don't need, (c) the EIIE module itself is small (3 conv layers + PVM).

Alternative: Fork TradeMaster, customize Mag 7. Faster to first run but harder to audit and harder to extend.

Q7 — Compute budget

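Stake: Local CPU is sufficient. EIIE is tiny: in the paper's configuration the three conv layers plus cash bias come to about 2K parameters with biases counted (20 + 1,940 + 22 + 1). Training a single window takes minutes on CPU, seconds on a laptop GPU. No cloud needed.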

Question: Operator confirm — laptop training is fine, or should we provision a GPU box for sweeps?

Q8 — What's the kill switch?

Stake: If v2 gate fails (per §6 promotion criteria), we publish the null result + the harness in a deep-dive section of the recursive-research-system doc, and stop. No production integration. This is a research project with a defined exit.

Tension: Hard to walk away from sunk cost. The kill switch must be set up front. If §6 promotion criteria fail, EIIE does not enter rainier.

8. Risks & known failures

Inherited from PGPortfolio (2017)

Leakage in the arXiv v2 paper. The test span stated in the paper is about 30% shorter than the one actually run, so the volume-observation interval used for universe selection overlapped the backtest. Avoidance: freeze universe by rule at training-window start; for Mag 7, openly declare ex-post selection and report it.
Overfitting reported by wassname. The most-starred PyTorch reimplementation was archived by its author with a note that training growth did not generalize to the test set. Avoidance: hard regularization, anchored walk-forward, CSCV PBO gate.
Cost under-modeling. 0.25% flat + zero slippage is a crypto fantasy. Avoidance: realistic commission + spread + slippage model, including ADV participation limits for size if we ever go beyond paper.
Stale regime learning. A model trained on 2010–2018 has never seen anything like the 2022 rates shock. Avoidance: walk-forward retraining, regime-conditional reporting.

Specific to this adaptation

Survivorship + narrative bias. Mag 7 is the ex-post most-loved basket of 2020–2024. Picking these names assumes the operator knew they'd win. Mitigation: declare explicitly. Optionally add a 2010 mini-cap basket (the future Mag 7's of that vintage — many failed) as a robustness check.
Small N (~10 assets) and few bars. The paper trained on 11 coins at 30-minute bars; here the asset count is similar, but daily bars since 2010 give only about 4,000 steps per asset, so network capacity may exceed the signal, leading to overfitting. Mitigation: shrink filter widths, dropout, weight decay.
Correlated universe. Mag 7 are all big-cap tech, all highly correlated, all rate-sensitive. The "diversification" the policy can find is limited. Mitigation: include QQQ + SPY + ES + cash as broader alternatives, noting that the index products are themselves heavily weighted toward the Mag 7, so cash is the only truly low-correlation holding.
ES continuous-contract bias. Different roll methods (Panama, ratio, calendar) give different return series. Mitigation: document the roll method explicitly; report sensitivity.
Regime-of-one risk. The 2023–2024 AI rally is unusually concentrated. A model trained through it may simply learn "buy NVDA." Mitigation: explicit regime-conditional metrics; weight 2022 bear-market test heavily.
Reproducibility on yfinance. Free data has silent quality issues (adjusted close changing post-hoc). Mitigation: snapshot the dataset and hash it; pin every walk-forward run to that snapshot.
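
The snapshot-and-hash mitigation in the last item needs only the standard library. A minimal sketch, assuming the snapshot is a single file on disk; the function name `snapshot_hash` and the toy CSV contents are made up for the example.

```python
import hashlib
import os
import tempfile


def snapshot_hash(path, chunk_size=1 << 20):
    """SHA-256 of a dataset file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file: the hash is stable until the data changes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prices.csv")
    with open(path, "wb") as fh:
        fh.write(b"date,close\n2010-01-04,100.0\n")
    pinned = snapshot_hash(path)
    unchanged = snapshot_hash(path) == pinned
    with open(path, "ab") as fh:
        fh.write(b"2010-01-05,101.0\n")     # simulate a post-hoc revision
    drifted = snapshot_hash(path) != pinned
```

Storing the pinned digest next to each walk-forward result makes a silently revised adjusted-close series show up as a hash mismatch instead of an unexplained change in Sharpe.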

Post-2020 literature reality check

Codex's verdict: "The strongest evidence remains crypto or curated academic benchmarks. For US equities, DRL is credible as an experimental allocator or execution-aware overlay, not as presumed alpha. A 2026 paper must beat simple QQQ/SPY and momentum after costs, with leakage-proof walk-forward evidence."

This is the right frame. We're not setting out to "beat the market". We're testing whether a specific 2017 technique generalizes to a specific 2026 universe under honest conditions.

Non-goals

Live trading. Backtest-only for this entire research arc.
Multi-frequency (intraday + daily fusion). One bar interval per run.
Alternative architectures (Transformer, GNN). EIIE is the subject; substituting the architecture is a different research question.
Universe expansion to S&P 500 / Russell 1000 — that's a different problem (large-N, factor models).

9. Current rainier state vs. needed

Component	Today in rainier	Needed	Gap
Daily OHLCV	`StockPrice` table (China A-shares)	US equities 2010-present + ES continuous	NEW: ingestion path for yfinance / Polygon
Backtest engine	`backtest/engine.py`, `walk_forward.py`	Walk-forward + purged k-fold + embargo + DSR/PSR/PBO	EXTEND: add DSR/PSR/PBO; purged k-fold may need new wiring
Cost model	Likely crypto-style or none	US-equity commission + spread + slippage; ES round-turn + tick slippage	NEW
Baselines	QU100-flavor signal mixing	UCRP, BCRP, 1/N, TS-momentum, MV+shrinkage, risk parity	NEW (some reuse from existing screener)
EIIE module	—	PyTorch 3-conv + PVM + cash bias + softmax	NEW (~300–500 LOC)
Transaction-cost reward	—	μ_t fixed-point recursion, differentiable	NEW (~50 LOC)
OSBL sampler	—	Geometric recency-weighted batch starts	NEW (~30 LOC)
Evaluation reporter	Discord embed, Streamlit dashboard	Walk-forward summary, regime-conditional Sharpe table, PBO plot	EXTEND: add the metric vector renderer

Bottom line: a meaningful chunk of net-new code, but most of the heavy infrastructure (walk-forward, baselines, data, evaluation reporting) is reusable rainier surface.

Date	Decision	Rationale	From
2026-05-19	Research-only doc first; defer build until §7 questions resolved	Operator directive — "start research and give me the research doc first, then we decide next step"	operator

12. Deep dives

12.1 EIIE math — full derivation

TBD. Step-by-step: input normalization, the row-isolated conv intuition, why softmax with cash bias works, why the μ_t recursion has a unique fixed point, gradient flow through μ_t.

12.2 US-equity cost model — exact form

TBD. Commission per share, half-spread + slippage in bp, ES round-turn commission + tick slippage + bid-ask, no short borrow for v1, ADV participation if/when size matters.
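
As a starting point, the equity leg of this cost model might be sketched as below, using the $0.005/share commission and 2 bp slippage from §4. The function name, the convention that the spread is quoted as a full bid-ask width in basis points, and the omission of per-order minimums and caps are assumptions of this sketch.

```python
def equity_trade_cost(shares, price, quoted_spread_bps,
                      commission_per_share=0.005, slippage_bps=2.0):
    """Dollar cost of one equity trade: commission + half-spread + slippage.

    quoted_spread_bps is the full bid-ask spread; only half is paid per trade.
    Per-order commission minimums and caps are ignored (assumption).
    """
    notional = abs(shares) * price
    commission = abs(shares) * commission_per_share
    impact = notional * (quoted_spread_bps / 2.0 + slippage_bps) / 1e4
    return commission + impact


# 100 shares at $200 with a 2 bp quoted spread
cost = equity_trade_cost(100, 200.0, quoted_spread_bps=2.0)
cost_bps = cost / (100 * 200.0) * 1e4
```

In this example the trade costs $6.50, about 3.25 bp of notional, an order of magnitude below the paper's flat 0.25%. The per-share commission makes the cost rate depend on share price, so low-priced names pay proportionally more.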

12.3 Walk-forward + purged k-fold + embargo

TBD. Fold construction, embargo length, holdout cut policy, label-leakage prevention. Reuses §4/§11.2 from the recursive-research-system doc.

12.4 Deflated Sharpe / PSR / CSCV PBO — exact computation

TBD. Trial count, skewness/kurtosis adjustment, IID-violation handling. Reference: Bailey/López de Prado 2014.
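
A minimal standard-library sketch of PSR and DSR following the Bailey/López de Prado formulas. Sharpe ratios here are per-period (not annualized), population moments are used throughout, the trial Sharpes are assumed to be centered near zero, and the function names and sample numbers are made up for illustration.

```python
import math
from statistics import NormalDist, mean, pstdev

EULER_GAMMA = 0.5772156649015329


def _sharpe_skew_kurt(returns):
    mu, sd = mean(returns), pstdev(returns)
    z = [(r - mu) / sd for r in returns]
    # kurtosis is the raw fourth moment (3.0 for a normal distribution)
    return mu / sd, mean(v ** 3 for v in z), mean(v ** 4 for v in z)


def probabilistic_sharpe(returns, sr_benchmark=0.0):
    """P(true Sharpe > sr_benchmark), adjusted for skewness and kurtosis."""
    sr, skew, kurt = _sharpe_skew_kurt(returns)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(len(returns) - 1) / denom
    return NormalDist().cdf(z)


def expected_max_sharpe(trial_sharpes):
    """Expected best Sharpe among N unskilled trials (needs N >= 2)."""
    n = len(trial_sharpes)
    inv = NormalDist().inv_cdf
    return pstdev(trial_sharpes) * (
        (1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n)
        + EULER_GAMMA * inv(1.0 - 1.0 / (n * math.e))
    )


def deflated_sharpe(returns, trial_sharpes):
    """PSR measured against the expected maximum Sharpe of all trials run."""
    return probabilistic_sharpe(returns, expected_max_sharpe(trial_sharpes))


returns = [0.02, -0.015, 0.01, -0.01, 0.005, -0.004] * 10
trials = [0.05, 0.02, -0.01, 0.08]    # per-period Sharpe of every config tried
psr = probabilistic_sharpe(returns)
dsr = deflated_sharpe(returns, trials)
```

DSR is always below PSR against zero whenever more than one configuration was tried with a positive expected maximum, which is why every sweep in v2 has to log its trial count and per-trial Sharpe.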

12.5 ES continuous-contract construction

TBD. Panama vs. ratio roll, roll calendar, margin treatment, contract-size translation to portfolio weight.
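
The two roll methods named here can be sketched side by side. The input convention is an assumption of this sketch: each segment is one contract's closes, oldest first, where a segment's last bar and the next segment's first bar fall on the same roll date. The function name `back_adjust` and the toy prices are hypothetical.

```python
def back_adjust(segments, method="panama"):
    """Stitch per-contract close series into one back-adjusted series.

    On each roll date the new contract's price is kept, and all earlier
    prices are shifted (panama) or scaled (ratio) to remove the roll gap.
    """
    adjusted = list(segments[-1])
    offset, factor = 0.0, 1.0
    for k in range(len(segments) - 2, -1, -1):      # walk back in time
        old_close = segments[k][-1]                 # old contract, roll date
        new_close = segments[k + 1][0]              # new contract, roll date
        offset += new_close - old_close
        factor *= new_close / old_close
        body = segments[k][:-1]                     # drop duplicate roll date
        if method == "panama":
            shifted = [p + offset for p in body]
        elif method == "ratio":
            shifted = [p * factor for p in body]
        else:
            raise ValueError(f"unknown method: {method}")
        adjusted = shifted + adjusted
    return adjusted


segments = [[10.0, 11.0], [13.0, 14.0], [20.0, 21.0]]
panama = back_adjust(segments, "panama")    # [18.0, 19.0, 20.0, 21.0]
ratio = back_adjust(segments, "ratio")
```

Panama preserves point differences, so the stitched series moves +1 every bar, but it distorts percentage returns far from the latest contract. Ratio preserves percentage returns, which is what the price relatives y_t need, at the cost of distorting absolute levels.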

12.6 Regime-conditional attribution

TBD. How to split returns by regime (rule-based vs. HMM), how to report per-regime Sharpe + drawdown, how to weight regimes for the overall promotion gate.
