Reimplement PGPortfolio (Jiang/Xu/Liang 2017) on a small US-equity universe — Mag 7 + QQQ + SPY + ES — to (a) learn the differentiable-allocator technique, (b) establish a credible 2026 baseline for portfolio DRL on US equities, (c) decide whether the technique deserves a place in rainier's stack as an allocation overlay on top of the QU100 screener.
What's actually valuable in PGPortfolio: the explicit, differentiable transaction-cost reward (no Monte Carlo, no critic), the portfolio-vector memory (PVM) trick, and OSBL (Online Stochastic Batch Learning) sampling for non-stationary series. Not the headline "DRL beats markets" framing: that was crypto in 2017 and is poorly replicated on equities.
The goal is honest: build a working EIIE allocator, run it on Mag 7 with leakage-proof walk-forward, and measure it against simple baselines (1/N, SPY, QQQ, momentum). If it beats baselines after realistic costs across multiple regimes — great, integrate. If it doesn't — we have a published-quality null result and a reusable backtest harness for rainier.
Per the recursive-research doc's anti-overfitting discipline (§8): every result here clears Deflated Sharpe + walk-forward + purged CV before it's reported as alpha.
Three components, all load-bearing:
Input tensor X_t of shape (f, n, m) = (3, 50, 11) in the paper:
- f=3 features per (asset, time): normalized close, high, low (all divided by the latest close).
- n=50 bars of lookback window.
- m=11 non-cash assets.

Row-isolated convolutions (every asset processed independently with shared weights; that's the "identical independent" part):

1. 1×3 kernel → 2 feature maps of size m × 48.
2. 1×48 kernel → 20 feature maps of size m × 1.
3. PVM insert: w_{t-1} appended as a 21st feature map.
4. 1×1 kernel → one score per asset.
5. Cash bias + softmax: m+1 outputs → portfolio weights summing to 1.

Why row-isolated: the same evaluator processes each asset, so the network generalizes across assets and the model size doesn't blow up with m. The PVM (portfolio-vector memory) lets the network see its previous decision without recurrent backprop through time, a clever workaround.
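The layer sizes above can be checked with a little shape arithmetic. The sketch below is not the paper's code: it only tracks output shapes and parameter counts, assuming stride-1 unpadded convolutions, a bias term on every convolution, and a single learnable cash bias. The function name and the (channels, assets, width) shape convention are illustrative.

```python
def conv_out_width(width, kernel):
    """Output width of an unpadded, stride-1 convolution along the time axis."""
    return width - kernel + 1


def eiie_layer_summary(f=3, n=50, m=11):
    """Return (name, output_shape, n_params) per stage; shapes are (channels, assets, width)."""
    layers = []
    # Stage 1: 1x3 kernel over time, f input channels -> 2 feature maps.
    w1 = conv_out_width(n, 3)
    layers.append(("conv 1x3", (2, m, w1), f * 2 * 3 + 2))
    # Stage 2: kernel spans the whole remaining width -> 20 maps of size m x 1.
    w2 = conv_out_width(w1, w1)
    layers.append((f"conv 1x{w1}", (20, m, w2), 2 * 20 * w1 + 20))
    # Stage 3: previous weights w_{t-1} appended as one extra map (no parameters).
    layers.append(("append w_prev", (21, m, 1), 0))
    # Stage 4: 1x1 kernel collapses 21 maps to one score per asset.
    layers.append(("conv 1x1", (1, m, 1), 21 + 1))
    # Stage 5: one cash bias (assumed learnable), then softmax over m + 1 scores.
    layers.append(("cash bias + softmax", (m + 1,), 1))
    return layers


if __name__ == "__main__":
    for name, shape, params in eiie_layer_summary():
        print(f"{name:22s} {str(shape):14s} {params:5d} params")
```

Calling `eiie_layer_summary(n=60)` shows that a longer lookback only widens the second kernel; every other stage keeps the same parameter count, and no stage depends on m.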
Objective is mean log portfolio growth across the training window:
R = (1/t_f) · Σ_t log( μ_t · y_t · w_{t-1} )
Where:
- y_t = v_t / v_{t-1} (element-wise): per-asset price relative.
- w'_t = (y_t ⊙ w_{t-1}) / (y_t · w_{t-1}): drifted weights after the price move.
- μ_t: transaction-cost remainder, the fraction of portfolio value that survives rebalancing from w'_t to target w_t.

The transaction-cost recursion (Eq. 14/15 in the paper) is implicit, because μ_t appears on both sides, but it is solvable by fixed-point iteration (typically 3–5 iters to convergence):
μ_t = [ 1 − c_p·w'_{t,0} − (c_s + c_p − c_s·c_p) · Σ_i (w'_{t,i} − μ_t·w_{t,i})⁺ ]
/ (1 − c_p · w_{t,0})
Where c_p, c_s are purchase/sale commission rates. This is the killer feature: the reward is explicit, differentiable from prices + chosen weights — no critic, no bootstrapping, no Monte Carlo policy-gradient noise. Train it like supervised learning with backprop through time-steps in a window.
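The recursion and the reward can be sketched in a few lines. This is a plain-float illustration of the formulas above, not the paper's implementation: index 0 is taken to be cash, the starting guess μ = 1 and the tolerance are arbitrary choices, and all function names are hypothetical. In training, the same arithmetic would be written in an autodiff tensor library so that gradients flow through the iterations; that part is not shown.

```python
import math


def drifted_weights(y, w_prev):
    """Return (w'_t, y_t . w_{t-1}): weights after the price move and the gross growth."""
    growth = sum(yi * wi for yi, wi in zip(y, w_prev))
    return [yi * wi / growth for yi, wi in zip(y, w_prev)], growth


def solve_mu(w_drift, w_target, c_p, c_s, iters=20, tol=1e-12):
    """Fixed-point iteration for the transaction-cost remainder mu_t."""
    k = c_s + c_p - c_s * c_p
    mu = 1.0  # assumed starting guess: rebalancing is free
    for _ in range(iters):
        # Positive part of (w'_i - mu * w_i), summed over non-cash assets only.
        sold = sum(max(wd - mu * wt, 0.0)
                   for wd, wt in zip(w_drift[1:], w_target[1:]))
        new_mu = (1.0 - c_p * w_drift[0] - k * sold) / (1.0 - c_p * w_target[0])
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


def mean_log_return(price_relatives, weights, c_p, c_s):
    """R = (1/t_f) * sum_t log(mu_t * y_t . w_{t-1}).

    price_relatives[t-1] is y_t; weights[t] is w_t, with weights[0] the initial portfolio.
    """
    total = 0.0
    for t, y in enumerate(price_relatives, start=1):
        w_drift, growth = drifted_weights(y, weights[t - 1])
        mu = solve_mu(w_drift, weights[t], c_p, c_s)
        total += math.log(mu * growth)
    return total / len(price_relatives)
```

Two sanity checks follow directly from the formula: leaving the drifted weights untouched gives μ_t = 1, and moving from all cash into a single asset gives μ_t = 1 − c_p.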
Financial series are non-stationary. The paper's solution:
OSBL samples the start t_b of each mini-batch with geometrically decaying probability Pβ(t_b) = β · (1−β)^(t − t_b − n_b), where t is the current step and n_b is the mini-batch length, so recent windows are picked more often, but old windows aren't excluded.

Why this matters: uniform replay trains the agent on stale regimes; pure online training overfits the latest move. Geometric weighting is the compromise. For 2026 US equities, β needs re-tuning: the paper's value was for 30-min crypto, not daily stocks.
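A minimal sampler for this distribution might look like the following. It draws the geometric lag by inverse-CDF sampling and rejects lags that would start before index 0, which renormalizes the distribution over the available history; that truncation, the function names, and the β used in the usage note are assumptions for illustration, not values from the paper.

```python
import math
import random


def osbl_probability(t_b, t, n_b, beta):
    """P_beta(t_b) = beta * (1 - beta)^(t - t_b - n_b); zero if the batch would run past t."""
    lag = t - t_b - n_b
    return beta * (1.0 - beta) ** lag if lag >= 0 else 0.0


def sample_batch_start(t, n_b, beta, rng=random):
    """Draw a batch start t_b in [0, t - n_b], favouring recent starts (requires 0 < beta < 1)."""
    max_lag = t - n_b
    if max_lag < 0:
        raise ValueError("not enough history for one batch")
    while True:
        u = rng.random()  # uniform in [0, 1)
        # Inverse CDF of a geometric lag on {0, 1, 2, ...}.
        lag = int(math.log(1.0 - u) / math.log(1.0 - beta))
        if lag <= max_lag:  # reject starts that fall before the first bar
            return t - n_b - lag
```

For example, `sample_batch_start(t=1000, n_b=50, beta=0.05, rng=random.Random(0))` returns a start no later than 950, with a mean lag of roughly (1 − β)/β ≈ 19 bars before that. A smaller β spreads the draws further into the past.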
| # | Project | Status | Verdict for our use |
|---|---|---|---|
| 1 | TradeMaster (NTU) | Actively maintained | Best EIIE reference implementation. PyTorch port of EIIE + DJ30 US-stock tutorial. Use as the architectural baseline for our PyTorch rewrite. Don't fork; read and reimplement clean. |
| 2 | RLPortfolio | Active but small | Modern PyTorch + Gymnasium, tests + docs, policy-gradient portfolio env. Good for the environment scaffolding (replay buffer, OSBL sampler, walk-forward driver). |
| 3 | FinRL | Actively maintained | Not a faithful EIIE clone, but the best infrastructure for data plumbing, Gym envs, baselines, transaction-cost models. Use FinRL for the harness; reimplement EIIE inside it. |
| 4 | PGPortfolio (original) | Archived in spirit | Read for ground truth (the README explicitly warns about 2017 regime + slippage). Don't run: TF 1.x is 9 years stale. |
| 5 | wassname/rl-portfolio-management | Archived | Cautionary tale. PyTorch notebooks; author archived with an explicit note that train growth did not generalize on test. Read for what to avoid; do not start from. |
FinRL harness + TradeMaster-style EIIE reimplementation, written in PyTorch from scratch. ~400–600 LOC for the EIIE module; FinRL handles data, env, costs, baselines, evaluation. Reimplementing rather than forking means we own the code, can audit every line for leakage, and the recursive-research-system doc's evaluator-immutability rule applies cleanly.
| Dimension | Original (crypto, 2017) | Adapted (US equities, 2026) |
|---|---|---|
| Universe size | 11 assets + cash | 10 non-cash assets: 9 equities/ETFs (AAPL, MSFT, GOOG, AMZN, NVDA, META, TSLA, QQQ, SPY) + ES, plus cash. Small N → regularize hard: reduce filter widths, dropout, weight decay, early stopping. |
| Bar interval | 30-minute | Daily for v1 (matches rainier's data cadence). Optional 60-min for v2. |
| Lookback window | n=50 bars | n=60 (≈3 trading months) on daily bars. Tune. |
| Transaction cost | 0.25% flat, no slippage | Realistic IBKR-style: $0.005/share commission, half-spread + 2bp slippage, no short borrow (long-only v1). For ES: round-turn commission + tick slippage + bid-ask spread. |
| Universe rule | Top-11 by volume at test start (LEAKAGE; see §8) | Frozen universe defined at training-window start, no peeking forward. Mag 7 is openly survivorship-biased → state that it was chosen ex post and report the bias alongside the results. |
| Market hours | 24/7 crypto | NYSE/NASDAQ session hours. Overnight gaps treated as a special bar transition. No intraday trading inside the bar. |
| Corporate actions | None for crypto | Adjusted close (split + dividend), corporate-action calendar respected, no look-ahead on ex-div dates. |
| ES futures | n/a | Continuous back-adjusted contract (Panama or ratio-roll). Explicit roll calendar (3rd Fri of Mar/Jun/Sep/Dec). Margin model: notional-aware (1 ES = $50 × index → ~$300K notional per contract). |
| Long/short | Long-only with cash | Long-only v1 (matches paper). Long/short v2 if v1 generalizes. |
| Rebalance cadence | Every 30-min bar | Daily close (matches bar interval). Compare with weekly rebalance to gauge turnover sensitivity. |
| Regime split | One crypto bull market | 2022 bear, 2023–2024 AI rally, 2025–2026 validation — report metrics conditional on each regime separately. |
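The per-share and per-contract costs in the table have to become proportional rates before they can play the role of c_p and c_s in the reward. One hedged way to do the conversion is sketched below. The $0.005/share commission, the 2bp slippage, and the $50 ES multiplier come from the table; the $0.01 quoted spread, the ES commission, and the one-tick slippage in the usage note are placeholders, and per-order minimums or caps a broker may apply are ignored. The 0.25-point ES tick is the standard contract specification.

```python
def equity_cost_rate(price, commission_per_share=0.005, quoted_spread=0.01,
                     slippage_bps=2.0):
    """One-way trading cost as a fraction of traded notional for a share at `price`."""
    commission = commission_per_share / price  # dollars per share -> fraction
    half_spread = 0.5 * quoted_spread / price  # cross half the quoted spread
    slippage = slippage_bps * 1e-4             # basis points -> fraction
    return commission + half_spread + slippage


def es_cost_rate(index_level, commission_round_turn, slippage_ticks,
                 multiplier=50.0, tick_size=0.25):
    """One-way cost per ES contract as a fraction of its notional value."""
    notional = multiplier * index_level
    commission = 0.5 * commission_round_turn   # half the round-turn fee per side
    slippage = slippage_ticks * tick_size * multiplier
    return (commission + slippage) / notional


if __name__ == "__main__":
    print(f"equity @ $200: {equity_cost_rate(200.0):.6f}")
    print(f"ES @ 6000:     {es_cost_rate(6000.0, 4.0, 1.0):.6f}")
```

Because the per-share terms shrink as the price rises, the effective rate differs by asset and drifts over time; a single flat rate, as in the paper, would be a simplification of this.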
Phased so we ship a working v0 fast and iterate. Each phase is a separate fleet task.
v0 data + harness (no model — just baselines + walk-forward)
│
▼
v1 EIIE reimplementation (PyTorch, TradeMaster-style, smoke test)
│
▼
v2 full backtest (all baselines, all metrics, all regimes)
│
▼
v3 research integration (link to recursive-research-system doc archive)
│
▼
v4 decide (does this enter rainier as an overlay? Y/N)
If v2 gate passed: file rainier task to integrate as overlay. If failed: keep the harness, ship the null result, move on. Either way, the harness becomes rainier infrastructure.
| Window | Train | Validate | Test (OOS) | Regime label |
|---|---|---|---|---|
| W1 | 2010-01 → 2018-12 | 2019-01 → 2020-06 | 2020-07 → 2021-12 | COVID rebound + tech bull |
| W2 | 2010-01 → 2020-12 | 2021-01 → 2021-12 | 2022-01 → 2022-12 | Rates shock / bear |
| W3 | 2010-01 → 2022-12 | 2023-01 → 2023-06 | 2023-07 → 2024-12 | AI rally |
| W4 | 2010-01 → 2024-12 | 2025-01 → 2025-06 | 2025-07 → 2026-05 | Validation / unseen |
Purged k-fold inside each train window with 5-day embargo around fold boundaries (López de Prado standard). Embargo prevents label-leakage from overlapping forward-return windows.
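The fold construction can be sketched as an index generator. This is a simplified reading of the scheme, assuming contiguous time-ordered folds, a fixed number of bars purged before each test fold (standing in for the label horizon), and the 5-bar embargo applied after it; the function name and the `purge` parameter are illustrative.

```python
def purged_kfold(n_samples, n_splits=5, purge=5, embargo=5):
    """Yield (train_idx, test_idx) pairs over contiguous, time-ordered folds.

    Training samples within `purge` bars before the test fold or `embargo`
    bars after it are dropped, so no forward-return window straddles the
    boundary between train and test.
    """
    fold_size, remainder = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        # Spread any remainder over the first folds so sizes differ by at most 1.
        stop = start + fold_size + (1 if k < remainder else 0)
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples)
                     if i < start - purge or i >= stop + embargo]
        yield train_idx, test_idx
        start = stop


if __name__ == "__main__":
    for train_idx, test_idx in purged_kfold(100):
        print(f"test {test_idx[0]:3d}..{test_idx[-1]:3d}  train size {len(train_idx)}")
```

With 100 samples and the defaults, the second fold tests on bars 20..39 and trains on bars 0..14 and 45..99. The gap on either side is what the purge and embargo buy.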
| Category | Metric | Why |
|---|---|---|
| Return | fAPV (final Accumulated Portfolio Value) | Paper's headline. Useful but easily gamed by lucky run. |
| Return | CAGR | Annualized version of fAPV. |
| Risk-adj | Sharpe ratio | Standard. Reported alongside DSR/PSR. |
| Risk-adj | Deflated Sharpe (DSR) | Adjusts for multiple trials. Mandatory for any "this beats baseline" claim. |
| Risk-adj | Probabilistic Sharpe (PSR) | Lower bound on true Sharpe given finite sample. |
| Risk-adj | Sortino ratio | Downside-only volatility — better for skewed distributions. |
| Drawdown | Max Drawdown | Standard. |
| Drawdown | Calmar / MAR | Return ÷ MDD. Penalizes deep drawdowns. |
| Drawdown | Drawdown duration | How long underwater — operator-felt pain. |
| Cost | Turnover | (Bars with a rebalance ÷ total bars) × Σ|Δw|, where Δw is the weight change at each rebalance. |
| Cost | Average holding period | Inverse turnover proxy. |
| Cost | Cost drag | Returns delta with vs. without costs — isolates the cost penalty. |
| Distribution | Hit rate | Fraction of bars where return > 0. |
| Concentration | Max weight, Herfindahl | Did the policy collapse to one asset? |
| Benchmark | Alpha vs. SPY, Beta vs. SPY, Info ratio | Capital-asset-pricing-style decomposition. |
| Robustness | Regime-conditional Sharpe (per window) | Does the policy work across 2022 bear AND 2023 AI rally, or just one? |
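| Robustness | CSCV PBO | Probability of Backtest Overfitting, estimated by combinatorially symmetric cross-validation. If > 0.5, the in-sample winner is more likely than not to rank below the median out of sample: treat the result as noise. |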
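A few of the metrics in the table are short enough to sketch directly. The block below covers the Probabilistic Sharpe Ratio (using the standard Bailey and López de Prado form with per-period, non-annualized Sharpe and non-excess kurtosis), max drawdown, and the Herfindahl index. It is an illustration with a zero risk-free rate assumed, not rainier's reporter, and the function names are hypothetical.

```python
import math


def sharpe_ratio(returns):
    """Per-period Sharpe ratio (zero risk-free rate, sample standard deviation)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var)


def probabilistic_sharpe(returns, sr_benchmark=0.0):
    """P(true Sharpe > sr_benchmark), corrected for skewness and kurtosis."""
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m3 = sum((r - mean) ** 3 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2  # non-excess kurtosis: 3 for a normal distribution
    sr = sharpe_ratio(returns)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(n - 1) / denom
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def max_drawdown(equity):
    """Largest peak-to-trough loss of an equity curve, as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, 1.0 - value / peak)
    return worst


def herfindahl(weights):
    """Sum of squared weights: 1/N when equal-weighted, 1.0 when fully concentrated."""
    return sum(w * w for w in weights)
```

The Deflated Sharpe Ratio reuses `probabilistic_sharpe` with `sr_benchmark` raised to the Sharpe one would expect from the best of the trials actually run. That is why the trial count has to be logged alongside every result.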
The model is considered to have generalized if and only if:
If any of these fails: the model has not generalized. Report honestly. Don't deploy.
Stake: Include ES from v1.1 (not v1). Reason: continuous-contract construction adds non-trivial complexity (roll calendar, margin treatment, contract size). Get the 9-asset equity/ETF baseline (Mag 7 + QQQ + SPY) working first, then add ES.
Alternative: ES as benchmark only (not in the policy's universe), comparing EIIE-on-Mag-7 against ES-buy-and-hold. Simpler.
Stake: Long-only for v1 and v2. Reasons: (a) matches the paper, (b) avoids short-borrow cost modeling complexity, (c) Mag 7 is a long-bias universe by construction. Long/short is deferred indefinitely.
Stake: Daily bars for v1. Reasons: (a) matches rainier's data cadence, (b) avoids intraday data licensing, (c) crypto's 30-min cadence assumed 24/7 trading which doesn't apply. 60-min for v2 if v1 motivates it.
Stake: yfinance for v0 (free, sufficient for daily). Upgrade to Polygon or IBKR if (a) we need intraday, or (b) we need adjusted-corporate-actions guarantees yfinance can't provide.
Cost: yfinance is free but rate-limited and quality-uncertain on edge cases (splits, ex-div). Polygon $29/mo unlocks 5y history; $79/mo unlocks 2y of options. IBKR requires an active account.
Stake: New module src/rainier/portfolio_drl/ in the rainier repo. Reasons: shares data plumbing (StockPrice table), backtest discipline (walk-forward), and evaluation infrastructure with existing rainier work. The recursive-research-system doc's evaluator-immutability rule applies cleanly.
Alternative: Standalone repo. Argument against: forks the backtest harness and the cost models, which means double the maintenance.
Stake: Reimplement EIIE in ~500 LOC of clean PyTorch using TradeMaster as a reference. Reasons: (a) we own the code and can audit every line for leakage, (b) TradeMaster has dependencies and design choices we don't need, (c) the EIIE module itself is small (3 conv layers + PVM).
Alternative: Fork TradeMaster and customize it for Mag 7. Faster to first run but harder to audit and harder to extend.
Stake: Local CPU is sufficient. EIIE is tiny (a few thousand parameters at the paper's layer sizes); training a single window takes minutes on CPU, seconds on a laptop GPU. No cloud needed.
Question: Operator confirm — laptop training is fine, or should we provision a GPU box for sweeps?
Stake: If v2 gate fails (per §6 promotion criteria), we publish the null result + the harness in a deep-dive section of the recursive-research-system doc, and stop. No production integration. This is a research project with a defined exit.
Tension: Hard to walk away from sunk cost. The kill switch must be set up front. If §6 promotion criteria fail, EIIE does not enter rainier.
Codex's verdict: "The strongest evidence remains crypto or curated academic benchmarks. For US equities, DRL is credible as an experimental allocator or execution-aware overlay, not as presumed alpha. A 2026 paper must beat simple QQQ/SPY and momentum after costs, with leakage-proof walk-forward evidence."
This is the right frame. We're not setting out to "beat the market". We're testing whether a specific 2017 technique generalizes to a specific 2026 universe under honest conditions.
| Component | Today in rainier | Needed | Gap |
|---|---|---|---|
| Daily OHLCV | StockPrice table (China A-shares) | US equities 2010-present + ES continuous | NEW: ingestion path for yfinance / Polygon |
| Backtest engine | backtest/engine.py, walk_forward.py | Walk-forward + purged k-fold + embargo + DSR/PSR/PBO | EXTEND: add DSR/PSR/PBO; purged k-fold may need new wiring |
| Cost model | Likely crypto-style or none | US-equity commission + spread + slippage; ES round-turn + tick slippage | NEW |
| Baselines | QU100-flavor signal mixing | UCRP, BCRP, 1/N, TS-momentum, MV+shrinkage, risk parity | NEW (some reuse from existing screener) |
| EIIE module | — | PyTorch 3-conv + PVM + cash bias + softmax | NEW (~300–500 LOC) |
| Transaction-cost reward | — | μ_t fixed-point recursion, differentiable | NEW (~50 LOC) |
| OSBL sampler | — | Geometric recency-weighted batch starts | NEW (~30 LOC) |
| Evaluation reporter | Discord embed, Streamlit dashboard | Walk-forward summary, regime-conditional Sharpe table, PBO plot | EXTEND: add the metric vector renderer |
Bottom line: a meaningful chunk of net-new code, but most of the heavy infrastructure (walk-forward, baselines, data, evaluation reporting) is reusable rainier surface.
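The simplest rows in the baselines gap are cheap to fill. As a hedged sketch, the block below computes the final value of a uniform constant-rebalanced portfolio (UCRP) and of an equal-weight buy-and-hold portfolio from a list of per-bar price relatives. Costs are ignored and the function names are illustrative, so this is a starting point for the harness, not a finished baseline.

```python
def ucrp_growth(price_relatives):
    """Final value of $1 rebalanced to equal weights every bar (costs ignored)."""
    value = 1.0
    for y in price_relatives:
        value *= sum(y) / len(y)  # equal weights earn the mean price relative
    return value


def buy_and_hold_growth(price_relatives):
    """Final value of $1 split equally at the start and never rebalanced."""
    n_assets = len(price_relatives[0])
    holdings = [1.0 / n_assets] * n_assets
    for y in price_relatives:
        holdings = [h * yi for h, yi in zip(holdings, y)]
    return sum(holdings)


if __name__ == "__main__":
    # One flat asset and one that doubles, then halves.
    bars = [[1.0, 2.0], [1.0, 0.5]]
    print(ucrp_growth(bars), buy_and_hold_growth(bars))
```

In the toy series, rebalancing harvests the oscillation (1.5 × 0.75 = 1.125) while buy-and-hold ends where it started. On trending assets the ordering can flip, which is why both belong in the comparison set.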
| Date | Decision | Rationale | From |
|---|---|---|---|
| 2026-05-19 | Research-only doc first; defer build until §7 questions resolved | Operator directive — "start research and give me the research doc first, then we decide next step" | operator |