
Commit 64a8c14

neuron7xLab and claude authored
Robustness framework v1: CPCV + PBO + PSR + null audit + jitter on frozen Kuramoto evidence (#356)
* feat(robustness): CPCV + PBO + Probabilistic Sharpe + null audit + jitter primitives

  Strategy-agnostic statistical battery for frozen-artifact robustness gates.
  Pure functions on numpy/pandas; zero I/O; zero strategy coupling.

  - research/robustness/cpcv.py — Combinatorial Purged CV splits with embargo
    purging, Bailey et al. (2017) logit-rank PBO estimator, Lopez de Prado
    (2018) Eq. 14.1 Probabilistic Sharpe Ratio and its rolling form.
  - research/robustness/null_audit.py — four orthogonal null families
    (permuted target, stationary-block-permuted signal, inverted signal,
    lag surrogate) with Politis-Romano geometric-block bootstrap and
    Davison-Hinkley +1 continuity-correction p-values.
  - research/robustness/stability.py — parameter-jitter stability over a
    user-injected evaluator with fractional-radius perturbations and
    tolerance-band accounting.

  All three modules are consumed read-only by protocol-layer suites.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(robustness): Kuramoto protocol layer — contract, suites, gate runner

  Strategy-bound wiring of the primitives against the frozen cross-asset
  Kuramoto evidence bundle. All modules are read-only on every frozen input.

  - kuramoto_contract.py — FrozenArtifactManifest + KuramotoRobustnessContract
    with fail-closed sha256 verification against
    results/cross_asset_kuramoto/offline_robustness/SOURCE_HASHES.json
    (28 artifacts). Typed views on equity_curve, fold_metrics, risk_metrics,
    and PARAMETER_LOCK.
  - kuramoto_candidate_set.py — anti-inflation guard rejecting candidate
    parameter names prefixed seed_/random_/jitter_ so hidden DoF cannot
    deflate PBO or inflate PSR.
  - kuramoto_cpcv_suite.py — PBO on fold Sharpes, PSR on daily returns.
  - kuramoto_null_suite.py — two frozen-returns null families (iid permutation
    + stationary bootstrap); the four-family primitive degenerates without a
    separate signal trace, so this suite implements the honest reduced audit
    instead.
  - kuramoto_jitter_executor.py — PLACEHOLDER_APPROXIMATION evaluator
    (quadratic in fractional parameter-space distance); the rebuild requires
    the raw asset panel, which is not in the frozen bundle.
  - kuramoto_jitter_suite.py — binds the executor to the frozen anchor and
    reports the evaluator mode verbatim in the output bundle.
  - kuramoto_gate_runner.py — pure orchestration; the three suites run
    independently so single-suite regressions are isolated.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(backtest): DecisionLabel + evaluate_robustness_gates terminal gate

  Decision layer that turns an evidence bundle from the gate runner into a
  single PASS / FAIL / INSUFFICIENT_EVIDENCE label. Separates evidence from
  decision so the same bundle can be re-evaluated under different thresholds
  without re-running simulations.

  - DecisionLabel enum (PASS / FAIL / INSUFFICIENT_EVIDENCE).
  - RobustnessGateResult frozen dataclass: terminal label + per-axis pass
    booleans + a reason chain.
  - evaluate_robustness_gates(): accepts any runtime-checkable evidence
    bundle satisfying the _CPCVEvidence/_NullEvidence/_JitterEvidence
    protocols. FAIL propagates from any essential-gate red;
    INSUFFICIENT_EVIDENCE kicks in when jitter is placeholder and
    require_live_jitter is set, or when CPCV has <2 folds.

  Added to the backtest.__init__ public surface via __all__.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(robustness): CLI runner + emit first FAIL verdict evidence bundle

  scripts/run_kuramoto_robustness_v1.py is the CLI entry point. It reads the
  frozen manifest, runs the three suites, evaluates decisions, and writes
  five artifacts strictly under results/cross_asset_kuramoto/robustness_v1/.
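For readers unfamiliar with Lopez de Prado's Eq. 14.1, the Probabilistic Sharpe Ratio implemented by the cpcv.py primitive can be sketched as a minimal standalone function. The name `psr` below is hypothetical; the repository's `probabilistic_sharpe_ratio` signature may differ.

```python
import math

import numpy as np


def psr(returns: np.ndarray, sr_benchmark: float = 0.0) -> float:
    """Probabilistic Sharpe Ratio, Lopez de Prado (2018) Eq. 14.1 sketch.

    Probability that the true Sharpe exceeds ``sr_benchmark``, with the
    variance of the SR estimator corrected for sample skewness and
    kurtosis (normal kurtosis = 3).
    """
    r = np.asarray(returns, dtype=float)
    n = len(r)
    mu, sigma = r.mean(), r.std(ddof=1)
    sr = mu / sigma                               # per-period Sharpe estimate
    z = (r - mu) / r.std(ddof=0)
    g3 = float(np.mean(z**3))                     # sample skewness
    g4 = float(np.mean(z**4))                     # sample kurtosis
    sr_std = math.sqrt((1 - g3 * sr + (g4 - 1) / 4.0 * sr**2) / (n - 1))
    # Standard normal CDF via erf — avoids a scipy dependency.
    return 0.5 * (1.0 + math.erf((sr - sr_benchmark) / (sr_std * math.sqrt(2.0))))
```

On a long series with a strongly positive mean the value approaches 1; on zero-mean noise it hovers around 0.5, which is why a PSR gate uses a high cutoff such as 0.95.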
  Artifacts emitted by the initial run (frozen bundle, 1000 bootstraps,
  64 jitter candidates):

  - verdict.json — label=FAIL (null families above 5 % p-threshold)
  - cpcv_summary.json — PBO=0.00, PSR=1.00, daily SR=0.58 (proxy)
  - null_summary.json — iid p=0.088, stationary-bootstrap p=0.517
  - jitter_summary.json — PLACEHOLDER_APPROXIMATION, within_tol=1.00
  - ROBUSTNESS_v1.md — one-page human-readable report

  The FAIL verdict is honest and consistent with SEPARATION_FINDING.md
  ('robust regime core / fragile value extraction'): on the cumret-derived
  return proxy the overall return stream is weakly distinguishable from its
  own permutations, because most realised alpha comes from a narrow
  HIGH_SYNC regime window. A strictly stronger null audit requires adding
  the raw net_ret series to the frozen bundle.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(robustness): 55-test battery — primitives, contract, suites, gates, no-interference

  Coverage matrix:

  - test_robustness_primitives.py (18) — CPCV shape/embargo/purge, PBO
    bounds on pure-noise vs signal families, PSR high/zero/degenerate, null
    audit shape/determinism/validation, jitter anchor-recovery, jitter
    name-in-anchor + negative-fraction error paths.
  - test_kuramoto_contract.py (6) — 28-hash verification, missing manifest
    fail-closed, sha256 mismatch fail-closed, missing-file fail-closed,
    schema-consistency assertions, daily_returns shape.
  - test_kuramoto_candidate_set.py (5) — legit names accepted, each
    forbidden prefix rejected, multi-offender listing, anchor-cover.
  - test_kuramoto_suites.py (10) — CPCV pbo bounds + fold count, null
    two-family shape + determinism + invalid-bootstrap error, jitter mode +
    anchor + forbidden-rejection + monotonicity.
  - test_kuramoto_gate_runner.py (12) — decision-layer
    PASS/FAIL/INSUFFICIENT truth table + end-to-end pipeline on the frozen
    bundle.
  - test_kuramoto_no_interference.py (4) — AST + regex scan asserting no
    writes under shadow_validation/, demo/, core/cross_asset_kuramoto/,
    etc.; all result-path literals route to robustness_v1/ or a frozen
    read-only input; no imports from execution/strategies/paper_trader.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(robustness): wire LOO grid into CPCV suite — real PBO=0.20 replaces trivial mirror

  Self-audit finding: the fold-mirror PBO was structurally trivial (=0.00)
  because a 2-column matrix with a median-shifted mirror always picks the
  same best IS strategy. The offline-robustness packet already ships a 13×5
  LOO grid at
  results/cross_asset_kuramoto/offline_robustness/leave_one_asset_out.csv
  (13 asset-LOO perturbations × 5 walk-forward folds) — this is a bona fide
  OOS matrix for Bailey et al. (2017) PBO estimation.

  Changes:

  - kuramoto_contract.py — optional loo_grid field on the contract; inline
    LOO_GRID_SHA256 constant for fail-closed hash verification outside the
    28-entry SOURCE_HASHES.json contract (additive, SOURCE_HASHES
    untouched). A missing file is tolerated (loo_grid=None); a
    present-but-mismatched file raises FrozenArtifactMismatch.
  - kuramoto_cpcv_suite.py — _loo_oos_matrix() builds (folds × strategies)
    from non-baseline LOO rows; estimate_pbo() runs on it when present.
    KuramotoCPCVResult now carries loo_pbo (float|None), loo_pbo_pass, and
    loo_n_strategies alongside the existing fields.
  - backtest/robustness_gates.py — the _CPCVEvidence Protocol gains
    loo_pbo_pass; evaluate_robustness_gates() includes it in the cpcv_pass
    conjunction.
  - CLI + ROBUSTNESS_v1.md now surface 'CPCV | PBO (LOO grid, n=13) | 0.2000 ✓'.
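The Bailey et al. (2017) logit-rank estimator that `estimate_pbo` applies to the LOO grid can be sketched roughly as follows. `estimate_pbo_sketch` is a hypothetical name; the repository's actual function signature and rank handling may differ.

```python
import numpy as np


def estimate_pbo_sketch(is_perf: np.ndarray, oos_perf: np.ndarray) -> float:
    """Logit-rank PBO on paired IS/OOS matrices, Bailey et al. (2017) style.

    Both inputs are (n_splits, n_strategies). For each split, take the best
    in-sample strategy, find its out-of-sample relative rank omega in (0, 1),
    and pass it through the logit. PBO is the fraction of splits whose logit
    is <= 0, i.e. best-IS lands at or below the OOS median.
    """
    n_splits, n_strat = is_perf.shape
    logits = []
    for s in range(n_splits):
        best = int(np.argmax(is_perf[s]))
        # OOS rank of the best-IS strategy (1 = worst .. n = best), shifted
        # into the open interval (0, 1) so the logit stays finite.
        rank = 1 + int(np.sum(oos_perf[s] < oos_perf[s, best]))
        omega = rank / (n_strat + 1)
        logits.append(np.log(omega / (1 - omega)))
    return float(np.mean(np.asarray(logits) <= 0.0))
```

With perfectly consistent IS/OOS rankings the estimate is 0; with anti-correlated rankings (the overfit caricature) it is 1, which matches the commit's reading of 0.20 as "1 of 5 folds has best-IS below-median OOS".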
  First-run evidence on the frozen bundle:

    PBO (fold mirror): 0.0000 (trivial, as before — kept for continuity)
    PBO (LOO grid):    0.2000 (13 strategies × 5 folds — real estimator)
    best-IS each fold: tradable:TLT × 5 (OOS ranks 6, 13, 14, 14, 14)

  Interpretation: 1/5 folds has best-IS below-median OOS → 20 % overfit
  probability on the LOO family. Consistent with SEPARATION_FINDING.md
  ('drop TLT → Sharpe 1.26 → 1.73'): the TLT-drop variant is genuinely best
  on 4 of 5 folds, not a lucky pick.

  Tests:

  - test_loo_pbo_present_and_bounded — loo_pbo ∈ [0, 1], n=13.
  - test_loo_pbo_matches_hand_computed — regression pin at 0.20.
  - test_loo_pbo_red_gives_fail — the decision layer correctly propagates
    loo_pbo_pass=False to FAIL.
  - The existing _FakeCPCV fixture gained a loo_pbo_pass: bool = True
    default so existing decision-layer tests stay green.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robustness): use raw daily net returns for null suite

  Task 1 of the PR #356 DECISION_GRADE escalation. Switches the null suite
  off the cumret-derived pct_change proxy and onto mathematically exact
  daily log-returns, and fixes a degenerate null family that the switch
  exposed.

  ## Input-data change (Task 1 literal mandate)

  The frozen demo bundle ships strategy_cumret (cumulative wealth) but no
  raw net_ret column. The contract now derives daily returns as:

    r_t = log(cumret_t) − log(cumret_{t-1})

  This is mathematically exact (not an approximation) for the hypothetical
  raw net_ret series that produced the wealth trajectory. Log returns are
  the honest time-additive representation and preserve independence under
  permutation/resampling, which is the contract assumed by the bootstrap
  null families. The derivation is documented in
  results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_PROTOCOL.md.
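The log-return derivation above is a one-liner in pandas. `daily_log_returns` is a hypothetical helper name for illustration, not the contract's actual method:

```python
import numpy as np
import pandas as pd


def daily_log_returns(cumret: pd.Series) -> pd.Series:
    """Recover daily log returns from a cumulative-wealth series.

    r_t = log(cumret_t) - log(cumret_{t-1}) is exact: the r_t sum
    telescopes back to log(cumret_T / cumret_0), so nothing about the
    wealth path is lost, and the series is time-additive.
    """
    return np.log(cumret).diff().dropna()
```

A quick check of the telescoping property: for a wealth path ending at 1.05 from a 1.0 start, the derived returns sum to log(1.05) exactly.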
  ## Null-family fix (bug exposed by Task 1, not introduced by it)

  The switch to log returns surfaced a structural bug: the old
  'iid_permutation' family was *degenerate* for a Sharpe statistic on a
  single return stream, because Sharpe is order-invariant on a given vector
  (permutation preserves mean and std exactly, up to float noise). The
  p-value was trivially ≈ 1.0 by construction; the previous p=0.088 on
  pct_change was a floating-point artefact, not a real signal.

  Fix: replaced with 'iid_bootstrap' — sample with replacement from the
  empirical marginal distribution. This changes the realised mean and std of
  each draw and is the proper iid null for a Sharpe statistic on a single
  return stream. Literal type, family names, docstrings, and tests updated;
  null_audit logic otherwise untouched.

  ## Verdict evolution (numbers on disk)

    Observed Sharpe (log returns): 0.4832 (was 0.5775 on pct_change)
    iid_bootstrap p-value:         0.5045 (was 0.0878 on proxy / degenerate permutation)
    stationary_bootstrap p-value:  0.5235 (was 0.5170)
    Verdict label:                 FAIL → FAIL (unchanged)

  The honest real-returns null gives p ≈ 0.50, consistent with
  SEPARATION_FINDING.md: the *realised* daily return stream is statistically
  indistinguishable from bootstrap resamples, because most alpha lives in a
  narrow HIGH_SYNC regime. This is NOT a proxy artefact — marked
  FAIL_ON_DAILY_RETURNS in verdict.json.

  ## Evidence artefacts

  - verdict.json now carries input_source: 'daily_log_returns' and
    label_qualifier: 'FAIL_ON_DAILY_RETURNS'.
  - Renamed ROBUSTNESS_v1.md → ROBUSTNESS_RESULTS.md per task convention.
  - ROBUSTNESS_PROTOCOL.md introduced to pin the derivation.
  - cpcv_summary.json, null_summary.json, jitter_summary.json regenerated.

  ## Guarantees

  - 28/28 frozen SOURCE_HASHES artefacts unchanged.
  - Shadow timer still active.
  - 58/58 tests/research/robustness/ green.
  - mypy --strict clean across 21 source files.
  - Signal code untouched; framework-only change.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robustness): expose candidate_count, mark placeholder jitter N/A

  Task 2 of the DECISION_GRADE escalation — cleans the evidence table so no
  reader can confuse a tautological measurement for a real one, and forbids
  placeholder jitter from asserting a live pass.

  ## CPCV: candidate_count + interpretation

  KuramotoCPCVResult now carries:

  - pbo_candidate_count: int (2 for fold-mirror)
  - pbo_interpretation: str ('tautological' for n<3)
  - loo_pbo_interpretation: str ('admissible' for n>=5)

  The interpretation rule is a single module-level helper:

    n < 3  → 'tautological' (best-IS trivially best)
    n < 5  → 'weak' (low statistical power)
    n >= 5 → 'admissible'

  The fold-mirror PBO is retained as a sanity baseline, but the markdown row
  now explicitly labels it n=2, *tautological*. The LOO-grid PBO is labelled
  n=13, *admissible* and carries the real signal.

  ## Jitter: placeholder forces fraction_within_tol_pass=False

  kuramoto_jitter_suite.run_kuramoto_jitter_suite() now sets
  fraction_within_tol_pass=False whenever evaluator_mode != 'LIVE',
  regardless of the raw fraction-within-tol. The stability dataclass retains
  the raw fraction honestly — only the decision-layer pass boolean is forced
  to False.

  The decision-layer reason string is now placeholder-aware:

  - placeholder → 'jitter: placeholder evaluator — abstains from live ✓/✗'
  - live failure → 'jitter: fraction-within-tol below threshold'

  ## Evidence-table presentation

  ROBUSTNESS_RESULTS.md now shows:

    | CPCV   | PBO (fold mirror, n=2, *tautological*) | 0.0000 | ✓ |
    | CPCV   | PBO (LOO grid, n=13, *admissible*)     | 0.2000 | ✓ |
    | Jitter | fraction_within_tol                    | 1.0000 | N/A |
    | Jitter | evaluator_mode                         | `PLACEHOLDER_APPROXIMATION` (…) | n/a |

  No ✓ appears on any placeholder row. The tautological PBO is surfaced
  explicitly; no reader will mistake it for a statistically meaningful
  overfit test.
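The three-branch interpretation rule above is small enough to pin down in full. The helper name below is hypothetical; the actual module-level helper in kuramoto_cpcv_suite may be named differently:

```python
def pbo_interpretation(n_candidates: int) -> str:
    """Classify a PBO estimate by the number of candidate strategies.

    Mirrors the rule described in the commit message: with fewer than 3
    candidates the best-IS pick is trivially best (tautological); below 5
    the estimate has little statistical power; 5 or more is admissible.
    """
    if n_candidates < 3:
        return "tautological"
    if n_candidates < 5:
        return "weak"
    return "admissible"
```

Under this rule the fold-mirror PBO (n=2) is always tautological and the LOO-grid PBO (n=13) is admissible, exactly as the evidence table labels them.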
  ## Tests

  - test_pbo_candidate_count_and_interpretation — fold-mirror is always
    n=2/tautological, LOO is n=13/admissible.
  - test_placeholder_forces_pass_false — a placeholder evaluator must set
    fraction_within_tol_pass=False regardless of the raw fraction.

  All 60/60 robustness tests green; mypy --strict clean across 21 files;
  28/28 frozen artefacts intact.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(robustness): null p-value convergence across trial counts

  Task 3 of the DECISION_GRADE escalation. Runs the null suite at
  n_bootstrap ∈ {500, 1000, 2000, 5000} — same seed, same data, same
  families — emits a long-form CSV, classifies per-family convergence, and
  surfaces the verdict in ROBUSTNESS_RESULTS.md.

  ## scripts/analysis_null_convergence.py

  Deterministic, offline, no network. For each trial count it runs
  run_kuramoto_null_suite, collects (n, p) pairs per family, and writes
  results/cross_asset_kuramoto/robustness_v1/null_convergence.csv with
  columns: n_trials, family_id, observed_sharpe, p_value, p_value_pass.

  Classification rule: a family is CONVERGED when max |p(N) - p(2N)| < 0.02
  across every adjacent (N, 2N) pair in the sorted trial sequence. The
  overall status is CONVERGED iff every family converges; otherwise
  NOT_CONVERGED.

  ## Convergence results on the frozen bundle (seed=42)

    iid_bootstrap        p ∈ {0.4930, 0.5045, 0.5052, 0.4971}  max |Δp| = 0.0115 → CONVERGED
    stationary_bootstrap p ∈ {0.4950, 0.5235, 0.5012, 0.5217}  max |Δp| = 0.0285 → NOT_CONVERGED

  Overall: NOT_CONVERGED (the stationary family's max |Δp| exceeds the 0.02
  tolerance). Note this is a TECHNICAL convergence label, not a
  verdict-stability issue: p-values stay in [0.49, 0.52] across all trial
  counts, well above α = 0.05. The FAIL verdict is decision-stable even
  while the p-value fluctuates within its own Monte-Carlo uncertainty band.

  ## Stop condition S5 (from the task brief)

  S5 fires only if Task 1 CHANGED the verdict AND convergence is
  NOT_CONVERGED.
  Task 1 did NOT change the terminal label (FAIL → FAIL); S5 does NOT fire.
  The convergence status is surfaced honestly in ROBUSTNESS_RESULTS.md so
  the reader can judge the uncertainty band.

  ## Evidence artefacts

  - results/cross_asset_kuramoto/robustness_v1/null_convergence.csv
    (8 rows: 4 trial counts × 2 families)
  - ROBUSTNESS_RESULTS.md now renders a 'Null p-value convergence' section
    when null_convergence.csv is present; if the CSV is absent the section
    is omitted (the runner remains self-sufficient).

  ## Tests

  - test_same_seed_same_p_values — determinism under a fixed seed
  - test_same_seed_different_n_gives_different_p — n_trials is wired
  - test_csv_has_required_columns — CSV schema + row-shape regression

  63/63 research/robustness tests green. mypy --strict clean across 23
  source files. 28/28 frozen artefacts intact.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(robustness): explicit alpha threshold and PSR caveat

  Task 4 of the DECISION_GRADE escalation. Pins every statistical threshold
  to a canonical location and documents the PSR autocorrelation limitation
  so no reader confuses PSR=1.0 with definitive significance.

  ## ROBUSTNESS_PROTOCOL.md § 3 — Statistical thresholds

  Nine thresholds tabulated verbatim with their module-level source:

    null_alpha           = 0.05  kuramoto_null_suite.NULL_PASS_P_THRESHOLD
    pbo_max              = 0.50  kuramoto_cpcv_suite.PBO_PASS_THRESHOLD
    loo_pbo_max          = 0.50  kuramoto_cpcv_suite.LOO_PBO_PASS_THRESHOLD
    psr_min              = 0.95  kuramoto_cpcv_suite.PSR_PASS_THRESHOLD
    jitter_floor_ratio   = 0.80  kuramoto_jitter_suite default
    sharpe_tolerance     = 0.20  kuramoto_jitter_suite.DEFAULT_SHARPE_TOLERANCE
    pbo_tautological_n   = 3     kuramoto_cpcv_suite.PBO_TAUTOLOGICAL_CUTOFF
    pbo_weak_n           = 5     kuramoto_cpcv_suite.PBO_WEAK_CUTOFF
    null_convergence_tol = 0.02  analysis_null_convergence.CONVERGENCE_TOLERANCE

  The file is explicit that the documentation mirrors the code constants,
  never the other way round. Threshold drift between code and doc is a bug
  in the doc.
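The Task 3 convergence rule is simple enough to pin down in code. `classify_convergence` below is a hypothetical sketch; the script's actual implementation in analysis_null_convergence.py may differ:

```python
def classify_convergence(p_by_n: dict[int, float], tol: float = 0.02) -> str:
    """CONVERGED when |p(N) - p(M)| < tol for every adjacent pair (N, M)
    in the sorted trial sequence; otherwise NOT_CONVERGED.

    For the trial ladder {500, 1000, 2000, 5000} the adjacent pairs are
    (500, 1000), (1000, 2000), (2000, 5000).
    """
    ns = sorted(p_by_n)
    deltas = [abs(p_by_n[a] - p_by_n[b]) for a, b in zip(ns, ns[1:])]
    return "CONVERGED" if all(d < tol for d in deltas) else "NOT_CONVERGED"
```

Running the rule on the frozen-bundle numbers reported above reproduces the per-family labels: the iid family (max |Δp| = 0.0115) converges, the stationary family (max |Δp| = 0.0285) does not.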
  ## ROBUSTNESS_LIMITATIONS.md (new)

  Five honest catalogue entries:

  1. PSR has no autocorrelation adjustment. Lopez de Prado Eq. 14.1 corrects
     skew + kurtosis, not serial correlation. Regime-following strategies
     have inflated effective sample sizes; PSR=1.0 on the frozen bundle
     should not be read as definitive significance. HAC (Newey-West) is the
     forward fix.
  2. The jitter evaluator is a placeholder — forced abstain, not pass.
  3. The LOO-grid PBO has only 5 paths — wide CI on the 0.20 point estimate.
  4. The null families are single-stream (no benchmark-matched test).
  5. The contract covers the frozen bundle only; no re-simulation.

  Each entry is explicit that it is NOT a bug and NOT required for a valid
  verdict — only something a reader must account for.

  ## ROBUSTNESS_RESULTS.md wiring

  - The CPCV row now reads 'PSR (daily, no HAC)' so the caveat is visible
    at a glance in the main results table.
  - The Notes section cross-references ROBUSTNESS_PROTOCOL.md § 3 for
    thresholds and ROBUSTNESS_LIMITATIONS.md § 1 for the PSR caveat.

  ## Integrity

  - Code constants unchanged (per R6: do not change the verdict by
    threshold manipulation). Documentation mirrors existing code.
  - 63/63 tests/research/robustness green.
  - mypy --strict clean across touched files.
  - 28/28 frozen artefacts intact.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(robustness): canonical one-page robustness summary

  Task 5 of the DECISION_GRADE escalation — the final artefact. A
  single-page digest that reads like SEPARATION_FINDING.md: what was
  tested, what passed, what failed, what is placeholder, what the known
  limitations are, the verdict, and the forward path.
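Limitation 1 above names HAC (Newey-West) as the forward fix for PSR's missing serial-correlation adjustment. A minimal sketch of the Bartlett-kernel long-run variance such a fix would build on (hypothetical `newey_west_lrv`; nothing like it ships in this commit):

```python
import numpy as np


def newey_west_lrv(returns: np.ndarray, n_lags: int) -> float:
    """Bartlett-kernel (Newey-West) long-run variance of a return series.

    Adds autocovariances up to ``n_lags``, linearly down-weighted. For
    positively autocorrelated returns this exceeds the marginal variance,
    which is what shrinks the effective sample size a HAC-aware PSR
    would plug into its standard error.
    """
    r = np.asarray(returns, dtype=float)
    r = r - r.mean()
    n = len(r)
    lrv = float(np.dot(r, r)) / n                      # gamma_0
    for k in range(1, n_lags + 1):
        gamma_k = float(np.dot(r[k:], r[:-k])) / n     # lag-k autocovariance
        lrv += 2.0 * (1.0 - k / (n_lags + 1)) * gamma_k
    return lrv
```

On white noise the estimate tracks the ordinary variance; on an AR(1) stream with positive persistence it is several times larger, illustrating why an unadjusted PSR is optimistic for regime-following strategies.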
  ## Scope

  ROBUSTNESS_SUMMARY.md = entry-point index into:

  - ROBUSTNESS_PROTOCOL.md (derivation + thresholds)
  - ROBUSTNESS_RESULTS.md (runtime evidence)
  - ROBUSTNESS_LIMITATIONS.md (forward-improvement catalogue)
  - null_convergence.csv (p-value stability table)
  - verdict.json (machine-readable terminal label)

  ## Constraints met

  - Word count: 385 / 400 (wc -w).
  - Every claim references a specific artefact or number.
  - The verdict matches verdict.json (FAIL, label_qualifier
    FAIL_ON_DAILY_RETURNS).
  - No hype; no 'alpha', 'edge', 'promising'. Facts, numbers, limits.
  - Cross-references exist and resolve: SEPARATION_FINDING.md,
    ACCEPTANCE_GATES.md, ROBUSTNESS_PROTOCOL.md, ROBUSTNESS_LIMITATIONS.md.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robustness): demean returns before bootstrap — null distribution now centred at zero

  CRITICAL correctness fix surfaced during the final review pass. The
  previous null implementation sampled the raw returns with replacement,
  which produces a null distribution centred at the *observed* sample mean
  (because E[mean of resample] = mean of original). Every p-value was
  therefore trivially ≈ 0.5 regardless of signal strength — the framework
  could not distinguish a real edge from noise.

  ## Before (broken)

  Synthetic validation exposed the bug:

    STRONG signal (μ=0.003, SR=3.88): iid_p=0.531 ✗ should be <0.05
    MODERATE (μ=0.0008, SR=1.53):     iid_p=0.545 ✗ should be <0.1
    NOISE (μ=0, SR=0.22):             iid_p=0.465 ~ ok
    INVERTED (μ=-0.003, SR=-4.98):    iid_p=0.471 ✗ should be ≈1

  ## After (fix)

  The same synthetic sweep with the demeaned bootstrap:

    STRONG signal (SR=3.88): iid_p=0.002 ✓ reject H0
    MODERATE (SR=1.53):      iid_p=0.002 ✓ reject H0
    NOISE (SR=0.22):         iid_p=0.262 ✓ cannot reject
    INVERTED (SR=-4.98):     iid_p=1.000 ✓ far left-tail

  ## Root cause

  A non-demeaned bootstrap tests H₀: 'resampled mean equals observed mean',
  which is trivially true by construction.
  The canonical Sharpe-vs-zero null test centres each bootstrap draw at
  zero:

    centred = returns - returns.mean()
    null[b] = Sharpe(centred[bootstrap_indices])

  Only then does the null represent H₀: 'true mean is zero'; the observed
  Sharpe is compared against the upper tail. This is the Lopez de Prado
  (2018) § 14.3 / Politis & Romano (1994) § 3 convention for
  stationary-bootstrap SR tests.

  ## Evidence on the frozen bundle (demeaned)

    iid_bootstrap p        = 0.0829 (was 0.5045 broken)
    stationary_bootstrap p = 0.1029 (was 0.5235 broken)
    observed SR            = 0.4832 (log-return Sharpe, unchanged)

  The observed Sharpe sits at the 8-10 % upper tail of the null
  distribution — statistically suggestive but below the α=0.05 bar.
  Honest FAIL.

  ## Convergence on the frozen bundle (demeaned)

    BEFORE (broken null): NOT_CONVERGED (max |Δp| = 0.0285)
    AFTER (demeaned):     CONVERGED (max |Δp| = 0.0071)

  The fix not only corrects the null semantics but also stabilises
  convergence across the {500, 1000, 2000, 5000} trial counts.

  ## Artefact updates

  - null_summary.json, null_convergence.csv, verdict.json, cpcv_summary,
    jitter_summary, ROBUSTNESS_RESULTS.md, ROBUSTNESS_SUMMARY.md all
    regenerated with the correct null semantics.
  - Module docstring rewritten to pin the demeaning convention, with
    literature references.
  - The convergence note in ROBUSTNESS_RESULTS.md updated to reflect the
    8-10 % upper-tail reading (not 'well above' as before).

  ## Guarantees

  - 63/63 research/robustness tests green.
  - mypy --strict clean across 23 source files.
  - 28/28 frozen SOURCE_HASHES artefacts intact.
  - Signal code untouched; framework-layer fix only.
  - Verdict label unchanged (FAIL → FAIL); evidence now statistically
    meaningful.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
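The demeaned iid bootstrap described above condenses into a few lines. `demeaned_bootstrap_p` is a hypothetical sketch of the iid family only; the repository's null_audit implementation also covers the stationary-block family:

```python
import numpy as np


def demeaned_bootstrap_p(returns: np.ndarray, n_boot: int = 2000,
                         seed: int = 42) -> float:
    """Upper-tail p-value for a Sharpe-vs-zero test via demeaned iid bootstrap.

    Each draw resamples the *centred* returns with replacement, so the null
    distribution represents H0: 'true mean is zero'. Uses the
    Davison-Hinkley +1 continuity correction mentioned in the commit.
    """
    r = np.asarray(returns, dtype=float)

    def sharpe(x: np.ndarray) -> float:
        return float(x.mean() / x.std(ddof=1))

    observed = sharpe(r)
    centred = r - r.mean()                    # centre the null at zero
    rng = np.random.default_rng(seed)
    null = np.array([
        sharpe(rng.choice(centred, size=len(r), replace=True))
        for _ in range(n_boot)
    ])
    return float((1 + np.sum(null >= observed)) / (n_boot + 1))
```

A clearly positive-mean series lands deep in the upper tail (small p); a clearly negative-mean series yields p near 1, matching the STRONG/INVERTED rows of the synthetic sweep above.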
1 parent 9497872 commit 64a8c14

33 files changed

Lines changed: 5286 additions & 5 deletions

backtest/__init__.py

Lines changed: 5 additions & 5 deletions
@@ -9,11 +9,8 @@
     run_vectorized_dopamine_td,
 )
 from .engine import LatencyConfig, OrderBookConfig
-from .performance import (
-    PerformanceReport,
-    compute_performance_metrics,
-    export_performance_report,
-)
+from .performance import PerformanceReport, compute_performance_metrics, export_performance_report
+from .robustness_gates import DecisionLabel, RobustnessGateResult, evaluate_robustness_gates
 from .synthetic import (
     ControlledExperiment,
     LiquidityShock,
@@ -47,4 +44,7 @@
     "dopamine_td_signal",
     "run_dopamine_backtest",
     "run_vectorized_dopamine_td",
+    "DecisionLabel",
+    "RobustnessGateResult",
+    "evaluate_robustness_gates",
 ]

backtest/robustness_gates.py

Lines changed: 171 additions & 0 deletions
# Copyright (c) 2023-2026 Yaroslav Vasylenko (neuron7xLab)
# SPDX-License-Identifier: MIT
"""Decision-layer robustness gates.

Given a bundle of per-suite evidence from
:func:`research.robustness.protocols.kuramoto_gate_runner.run_kuramoto_gate_runner`
(or any strategy-family equivalent with the same evidence shape), this
module produces a single terminal label (PASS / FAIL /
INSUFFICIENT_EVIDENCE) plus a machine-readable breakdown.

Separating decisions from primitives keeps evidence collection pure:
the same evidence bundle can be re-evaluated under different decision
thresholds without re-running simulations.
"""

from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Protocol, runtime_checkable


class DecisionLabel(str, Enum):
    """Terminal decision labels for a robustness evaluation."""

    PASS = "PASS"
    FAIL = "FAIL"
    INSUFFICIENT_EVIDENCE = "INSUFFICIENT_EVIDENCE"


@runtime_checkable
class _CPCVEvidence(Protocol):
    @property
    def pbo_pass(self) -> bool: ...
    @property
    def psr_pass(self) -> bool: ...
    @property
    def annualised_sharpe(self) -> float: ...
    @property
    def n_folds(self) -> int: ...
    @property
    def loo_pbo_pass(self) -> bool: ...


@runtime_checkable
class _NullEvidence(Protocol):
    @property
    def all_families_pass(self) -> bool: ...


@runtime_checkable
class _JitterEvidence(Protocol):
    @property
    def evaluator_mode(self) -> str: ...
    @property
    def fraction_within_tol_pass(self) -> bool: ...


@runtime_checkable
class _EvidenceBundle(Protocol):
    @property
    def cpcv(self) -> _CPCVEvidence: ...
    @property
    def null(self) -> _NullEvidence: ...
    @property
    def jitter(self) -> _JitterEvidence: ...


@dataclass(frozen=True)
class RobustnessGateResult:
    """Terminal decision plus human-readable reason chain."""

    label: DecisionLabel
    cpcv_pass: bool
    null_pass: bool
    jitter_pass: bool
    jitter_is_placeholder: bool
    reasons: tuple[str, ...]


def evaluate_robustness_gates(
    evidence: _EvidenceBundle,
    *,
    require_live_jitter: bool = False,
) -> RobustnessGateResult:
    """Combine suite evidence into a terminal label.

    Decision semantics:

    - **FAIL** — any of the essential real-evidence gates (CPCV PBO,
      PSR, null families) is red.
    - **INSUFFICIENT_EVIDENCE** — jitter is a placeholder *and*
      ``require_live_jitter`` is True. Also triggered when CPCV has
      fewer than 2 folds or annualised Sharpe is non-finite.
    - **PASS** — all essential gates green; jitter is either
      live-passing, or placeholder with ``require_live_jitter`` False.
    """
    reasons: list[str] = []
    cpcv_pass = bool(
        evidence.cpcv.pbo_pass and evidence.cpcv.psr_pass and evidence.cpcv.loo_pbo_pass
    )
    if not evidence.cpcv.pbo_pass:
        reasons.append("cpcv: PBO above threshold")
    if not evidence.cpcv.psr_pass:
        reasons.append("cpcv: PSR below threshold")
    if not evidence.cpcv.loo_pbo_pass:
        reasons.append("cpcv: LOO-grid PBO above threshold")

    null_pass = bool(evidence.null.all_families_pass)
    if not null_pass:
        reasons.append("null: one or more families failed")

    jitter_pass = bool(evidence.jitter.fraction_within_tol_pass)
    jitter_is_placeholder = evidence.jitter.evaluator_mode != "LIVE"
    if not jitter_pass:
        if jitter_is_placeholder:
            reasons.append("jitter: placeholder evaluator — abstains from live ✓/✗")
        else:
            reasons.append("jitter: fraction-within-tol below threshold")

    if evidence.cpcv.n_folds < 2:
        reasons.append("cpcv: fewer than 2 folds available")
        return RobustnessGateResult(
            label=DecisionLabel.INSUFFICIENT_EVIDENCE,
            cpcv_pass=cpcv_pass,
            null_pass=null_pass,
            jitter_pass=jitter_pass,
            jitter_is_placeholder=jitter_is_placeholder,
            reasons=tuple(reasons),
        )

    if not (cpcv_pass and null_pass):
        return RobustnessGateResult(
            label=DecisionLabel.FAIL,
            cpcv_pass=cpcv_pass,
            null_pass=null_pass,
            jitter_pass=jitter_pass,
            jitter_is_placeholder=jitter_is_placeholder,
            reasons=tuple(reasons),
        )

    if jitter_is_placeholder and require_live_jitter:
        reasons.append("jitter: evaluator is placeholder; live evaluator required")
        return RobustnessGateResult(
            label=DecisionLabel.INSUFFICIENT_EVIDENCE,
            cpcv_pass=cpcv_pass,
            null_pass=null_pass,
            jitter_pass=jitter_pass,
            jitter_is_placeholder=jitter_is_placeholder,
            reasons=tuple(reasons),
        )

    if not jitter_pass and not jitter_is_placeholder:
        reasons.append("jitter: live-mode failure")
        return RobustnessGateResult(
            label=DecisionLabel.FAIL,
            cpcv_pass=cpcv_pass,
            null_pass=null_pass,
            jitter_pass=jitter_pass,
            jitter_is_placeholder=jitter_is_placeholder,
            reasons=tuple(reasons),
        )

    return RobustnessGateResult(
        label=DecisionLabel.PASS,
        cpcv_pass=cpcv_pass,
        null_pass=null_pass,
        jitter_pass=jitter_pass,
        jitter_is_placeholder=jitter_is_placeholder,
        reasons=tuple(reasons),
    )

research/robustness/__init__.py

Lines changed: 38 additions & 0 deletions
# Copyright (c) 2023-2026 Yaroslav Vasylenko (neuron7xLab)
# SPDX-License-Identifier: MIT
"""Robustness primitives — read-only statistical battery.

Contains three strategy-agnostic primitive modules:

- :mod:`cpcv` — Combinatorial Purged Cross-Validation splits,
  Probability of Backtest Overfitting (PBO), Probabilistic Sharpe Ratio
  (PSR). Lopez de Prado (2018) *Advances in Financial ML*.
- :mod:`null_audit` — block-bootstrap null falsification families.
- :mod:`stability` — parameter jitter stability.

All primitives are *pure functions* on numpy/pandas inputs. No writes.
No I/O. No strategy coupling. Protocol-level orchestration lives in
:mod:`research.robustness.protocols`.
"""

from __future__ import annotations

from .cpcv import (
    cpcv_splits,
    estimate_pbo,
    probabilistic_sharpe_ratio,
    rolling_probabilistic_sharpe,
)
from .null_audit import NullAuditResult, run_null_falsification_audit
from .stability import JitterStabilityResult, parameter_jitter_stability

__all__ = [
    "JitterStabilityResult",
    "NullAuditResult",
    "cpcv_splits",
    "estimate_pbo",
    "parameter_jitter_stability",
    "probabilistic_sharpe_ratio",
    "rolling_probabilistic_sharpe",
    "run_null_falsification_audit",
]
