Commit b292c2c

neuron7xLab and claude committed
fix(robustness): demean returns before bootstrap — null distribution now centred at zero
CRITICAL correctness fix surfaced during the final review pass.

The previous null implementation sampled the raw returns with replacement, which produces a null distribution centred at the *observed* sample mean (because E[mean of resample] = mean of original). Every p-value was therefore trivially ≈ 0.5 regardless of signal strength; the framework could not distinguish a real edge from noise.

## Before (broken)

Synthetic validation exposed the bug:

    STRONG signal (μ=0.003,  SR=3.88):  iid_p=0.531  ✗ should be <0.05
    MODERATE      (μ=0.0008, SR=1.53):  iid_p=0.545  ✗ should be <0.1
    NOISE         (μ=0,      SR=0.22):  iid_p=0.465  ~ ok
    INVERTED      (μ=-0.003, SR=-4.98): iid_p=0.471  ✗ should be ≈1

## After (fix)

Same synthetic sweep with demeaned bootstrap:

    STRONG signal (SR=3.88):  iid_p=0.002  ✓ reject H0
    MODERATE      (SR=1.53):  iid_p=0.002  ✓ reject H0
    NOISE         (SR=0.22):  iid_p=0.262  ✓ cannot reject
    INVERTED      (SR=-4.98): iid_p=1.000  ✓ far left-tail

## Root cause

A non-demeaned bootstrap tests H₀: 'resampled mean equals observed mean', which is trivially true by construction. The canonical Sharpe-vs-zero null test centres each bootstrap draw at zero:

    centred = returns - returns.mean()
    null[b] = Sharpe(centred[bootstrap_indices])

Only then does the null represent H₀: 'true mean is zero'; the observed Sharpe is compared against the upper tail. This is the Lopez de Prado (2018) § 14.3 / Politis & Romano (1994) § 3 convention for stationary-bootstrap SR tests.

## Evidence on the frozen bundle (demeaned)

    iid_bootstrap p        = 0.0829 (was 0.5045 broken)
    stationary_bootstrap p = 0.1029 (was 0.5235 broken)
    observed SR            = 0.4832 (log-return Sharpe, unchanged)

The observed Sharpe sits at the 8-10 % upper-tail of the null distribution: statistically suggestive but below the α=0.05 bar. Honest FAIL.

## Convergence on the frozen bundle (demeaned)

    BEFORE (broken null): NOT_CONVERGED (max |Δp| = 0.0285)
    AFTER (demeaned):     CONVERGED     (max |Δp| = 0.0071)

The fix not only corrects the null semantics but also stabilises the convergence across {500, 1000, 2000, 5000} trial counts.

## Artefact updates

- null_summary.json, null_convergence.csv, verdict.json, cpcv_summary, jitter_summary, ROBUSTNESS_RESULTS.md, ROBUSTNESS_SUMMARY.md all regenerated with the correct null semantics.
- Module docstring rewritten to pin the demeaning convention with literature references.
- Convergence note in ROBUSTNESS_RESULTS.md updated to reflect the 8-10 % upper-tail reading (not 'well above' as before).

## Guarantees

- 63/63 research/robustness tests green.
- mypy --strict clean across 23 source files.
- 28/28 frozen SOURCE_HASHES artefacts intact.
- Signal code untouched; framework-layer fix only.
- Verdict label unchanged (FAIL → FAIL); evidence now statistically meaningful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
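The effect described in the message can be reproduced on synthetic data. The following is a minimal sketch, not the repository's code: `sharpe` and `bootstrap_p` are hypothetical helpers, and the volatility (0.01), sample size (1000), and seeds are assumptions chosen for illustration. The add-one estimator (1 + k) / (n + 1) is also an assumption; it is inferred from the reported values (for example 0.08582834 ≈ 43/501), not confirmed by the source.

```python
import numpy as np


def sharpe(r: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualised Sharpe ratio (hypothetical stand-in for the repo's _sharpe)."""
    sd = r.std(ddof=1)
    return 0.0 if sd == 0.0 else float(np.sqrt(periods_per_year) * r.mean() / sd)


def bootstrap_p(returns: np.ndarray, n_bootstrap: int = 1000,
                demean: bool = True, seed: int = 0) -> float:
    """Upper-tail p-value of the observed Sharpe against an i.i.d. bootstrap null."""
    rng = np.random.default_rng(seed)
    observed = sharpe(returns)
    # Demeaning imposes H0 (true mean = 0) on the resampled population.
    sample = returns - returns.mean() if demean else returns
    null = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.integers(0, sample.size, size=sample.size)
        null[b] = sharpe(sample[idx])
    # Add-one estimator (assumed): never returns exactly zero.
    return (1 + int(np.count_nonzero(null >= observed))) / (n_bootstrap + 1)


# Synthetic "strong signal" stream; mean taken from the message, sigma assumed.
strong = np.random.default_rng(42).normal(0.003, 0.01, size=1000)
p_raw = bootstrap_p(strong, demean=False)
p_demeaned = bootstrap_p(strong, demean=True)
print(f"raw null p = {p_raw:.3f}, demeaned null p = {p_demeaned:.3f}")
```

With the raw resample the null is centred near the observed Sharpe, so roughly half the draws exceed it; with the demeaned resample a strong signal lands far in the upper tail.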
1 parent b6ca8e2 commit b292c2c

6 files changed

Lines changed: 2069 additions & 2052 deletions

research/robustness/protocols/kuramoto_null_suite.py

Lines changed: 39 additions & 22 deletions
@@ -7,24 +7,33 @@
 bundle only ships *realised* strategy returns (no raw position trace),
 so the generic primitive's three "signal-side" families degenerate.

-This suite implements the two null families that are meaningful given
-only a realised return stream:
-
-1. **iid_bootstrap** — sample the returns with replacement, i.i.d.
-from the empirical marginal distribution; tests for information
-beyond the first-moment/second-moment marginal distribution.
-(Plain permutation would be a *degenerate* null here: Sharpe is
-order-invariant as a function of the vector, so permutation
-preserves it up to floating-point noise. With-replacement sampling
-changes the realised mean and std of each draw and is the proper
-iid null for a Sharpe statistic on a single return stream.)
-2. **stationary_bootstrap** — Politis & Romano block bootstrap with
-geometric block length (mean = 21 bars); tests information beyond
-short-horizon autocorrelation.
-
-Positive observed Sharpe with a low p-value on both families is the
-minimum evidence that the realised return stream is not a re-ordering
-artefact.
+This suite implements two **demeaned** bootstrap null families that
+are meaningful given only a realised return stream:
+
+1. **iid_bootstrap** — sample the *demeaned* returns with replacement,
+i.i.d. from the empirical marginal distribution, and compute the
+Sharpe of each resample. Under H₀ ('true mean is zero') this null
+distribution is centred at 0 and the observed Sharpe is compared
+against its upper tail.
+2. **stationary_bootstrap** — Politis & Romano (1994) block bootstrap
+with geometric block length (mean = 21 bars) applied to the
+*demeaned* returns; tests the same H₀ under a stationary-series
+assumption that preserves short-horizon autocorrelation.
+
+Why demeaning matters: bootstrapping the raw returns produces a null
+distribution centred at the sample mean
+(``E[mean of resample] = mean of original``) so every p-value would
+trivially equal ≈ 0.5 regardless of signal strength. Plain permutation
+would be even worse — Sharpe is order-invariant on a given vector, so
+the permutation null preserves the observed Sharpe up to floating-
+point noise and yields a trivial p → 1. The demeaning step is what
+turns each bootstrap draw into a draw from the null *hypothesis*,
+not from the observed sample. See Lopez de Prado (2018) § 14.3 and
+Politis & Romano (1994) § 3 for the convention.
+
+A low p-value on both families is the minimum evidence that the
+realised Sharpe is distinguishable from zero under the null of no
+information beyond the marginal distribution.
 """

 from __future__ import annotations
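The docstring's remark about permutation can be checked numerically. The snippet below is an illustrative sketch under assumed inputs (synthetic returns, a hypothetical `sharpe` helper with `ddof=1`); it only shows that shuffling leaves the Sharpe unchanged while with-replacement resampling does not.

```python
import numpy as np


def sharpe(r: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualised Sharpe ratio (hypothetical helper, not the repo's _sharpe)."""
    return float(np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1))


rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=500)  # synthetic stream, assumed parameters
observed = sharpe(returns)

# Permutation reorders the same values: mean and std are unchanged,
# so every permuted Sharpe equals the observed one up to rounding.
perm = np.array([sharpe(rng.permutation(returns)) for _ in range(200)])
max_perm_dev = float(np.max(np.abs(perm - observed)))

# With-replacement resampling changes which values appear, so the
# Sharpe of each draw genuinely varies.
boot = np.array([
    sharpe(returns[rng.integers(0, returns.size, size=returns.size)])
    for _ in range(200)
])
boot_spread = float(boot.std())

print(f"max permutation deviation = {max_perm_dev:.2e}")
print(f"bootstrap spread (std)    = {boot_spread:.3f}")
```

A permutation null therefore carries no sampling variability for an order-invariant statistic, which is why the suite relies on bootstrap families instead.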
@@ -119,17 +128,25 @@ def run_kuramoto_null_suite(
         raise ValueError(f"n_bootstrap must be >= 1, got {n_bootstrap}")
     returns = contract.daily_strategy_returns().to_numpy(dtype=np.float64)
     observed = _sharpe(returns, periods_per_year)
+    # Demean before bootstrap: under H0 the true mean is 0, so the null
+    # distribution of Sharpe should be centred at 0, not at the sample
+    # Sharpe. Bootstrapping the raw returns would centre the null around
+    # the observed Sharpe by construction (E[mean of resample] = mean of
+    # original) and trivialise every p-value to ≈ 0.5. See Lopez de
+    # Prado (2018) § 14.3 and Politis & Romano (1994) § 3 for the
+    # demeaning convention on stationary-bootstrap SR tests.
+    centred = returns - returns.mean()
     rng = np.random.default_rng(seed)

     null_iid = np.empty(n_bootstrap, dtype=np.float64)
     for b in range(n_bootstrap):
-        idx = rng.integers(0, returns.size, size=returns.size)
-        null_iid[b] = _sharpe(returns[idx], periods_per_year)
+        idx = rng.integers(0, centred.size, size=centred.size)
+        null_iid[b] = _sharpe(centred[idx], periods_per_year)

     null_sb = np.empty(n_bootstrap, dtype=np.float64)
     for b in range(n_bootstrap):
-        sb_idx = _stationary_bootstrap_indices(returns.size, mean_block, rng)
-        null_sb[b] = _sharpe(returns[sb_idx], periods_per_year)
+        sb_idx = _stationary_bootstrap_indices(centred.size, mean_block, rng)
+        null_sb[b] = _sharpe(centred[sb_idx], periods_per_year)

     families: list[FrozenNullResult] = []
     family_name: FrozenNullFamily
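The hunk calls `_stationary_bootstrap_indices` without showing its body. As a rough illustration of what a Politis & Romano style index generator can look like, here is a sketch under stated assumptions: the function name without the leading underscore, the circular wrap-around, and the restart probability `1 / mean_block` are all choices made for this example and may differ from the repository's helper.

```python
import numpy as np


def stationary_bootstrap_indices(n: int, mean_block: float,
                                 rng: np.random.Generator) -> np.ndarray:
    """Indices for one stationary-bootstrap resample of a length-n series.

    Blocks start at uniformly random positions and have geometrically
    distributed lengths with mean `mean_block`; blocks wrap around the
    end of the series (circular convention, assumed here).
    """
    if n < 1:
        raise ValueError(f"n must be >= 1, got {n}")
    if mean_block < 1:
        raise ValueError(f"mean_block must be >= 1, got {mean_block}")
    p_restart = 1.0 / mean_block
    restarts = rng.random(n) < p_restart   # True where a new block begins
    restarts[0] = True                     # the first element always starts a block
    starts = rng.integers(0, n, size=n)    # candidate block start positions
    idx = np.empty(n, dtype=np.int64)
    for t in range(n):
        # Either jump to a fresh random start or continue the current block.
        idx[t] = starts[t] if restarts[t] else (idx[t - 1] + 1) % n
    return idx


rng = np.random.default_rng(7)
n = 2000
idx = stationary_bootstrap_indices(n, mean_block=21.0, rng=rng)
# Share of steps that continue the previous block; should be near 1 - 1/21.
continued = float(np.mean((idx[1:] - idx[:-1]) % n == 1))
print(f"fraction of continued steps = {continued:.3f}")
```

Applying such indices to the demeaned series keeps runs of consecutive observations together, which is what lets the null retain short-horizon autocorrelation while still imposing a zero mean.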

results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_RESULTS.md

Lines changed: 7 additions & 7 deletions
@@ -10,8 +10,8 @@ Terminal decision: **FAIL**
 | CPCV | PSR (daily, no HAC) | 1.0000 ||
 | CPCV | Annualised Sharpe (daily) | 0.4832 | n/a |
 | CPCV | PBO (LOO grid, n=13, *admissible*) | 0.2000 ||
-| Null | iid_bootstrap p-value | 0.5045 ||
-| Null | stationary_bootstrap p-value | 0.5235 ||
+| Null | iid_bootstrap p-value | 0.0829 ||
+| Null | stationary_bootstrap p-value | 0.1029 ||
 | Jitter | fraction_within_tol | 1.0000 | N/A |
 | Jitter | evaluator_mode | `PLACEHOLDER_APPROXIMATION` (not decision-grade; live evaluator required to flip this row to ✓ / ✗) | n/a |

@@ -22,11 +22,11 @@ Terminal decision: **FAIL**

 ## Null p-value convergence

-- overall status: **NOT_CONVERGED**
-- overall max |Δp|: 0.0285 (tolerance 0.0200)
-- iid_bootstrap: max |Δp| = 0.0115
-- stationary_bootstrap: max |Δp| = 0.0285
-- Note: verdict stability under convergence is independent of the CONVERGED/NOT_CONVERGED label. Both families' p-values stay well above α = 0.05 across all trial counts (500 → 5000), so the FAIL verdict is decision-stable even if the p-value fluctuates within its own uncertainty band.
+- overall status: **CONVERGED**
+- overall max |Δp|: 0.0071 (tolerance 0.0200)
+- iid_bootstrap: max |Δp| = 0.0029
+- stationary_bootstrap: max |Δp| = 0.0071
+- Note: the demeaned bootstrap families converge to p ∈ [0.08, 0.10] — the observed Sharpe is statistically suggestive but does not clear the strict α = 0.05 bar. Verdict FAIL is decision-stable across trial counts (500 → 5000).

 ## Notes

results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_SUMMARY.md

Lines changed: 8 additions & 8 deletions
@@ -25,14 +25,14 @@ combines evidence into `PASS` / `FAIL` / `INSUFFICIENT_EVIDENCE`.

 | Gate | Value | Threshold | Status |
 |---|---:|---:|:-:|
-| iid_bootstrap null p-value | 0.5045 | ≤ 0.05 ||
-| stationary_bootstrap null p-value | 0.5235 | ≤ 0.05 ||
-
-Both nulls are with-replacement resamples of the realised daily log-
-return stream. `p ≈ 0.50` means the observed Sharpe (0.483) is
-statistically indistinguishable from bootstrap resamples of its own
-marginal distribution. Consistent with `SEPARATION_FINDING.md`:
-most realised alpha lives in the narrow HIGH_SYNC regime.
+| iid_bootstrap null p-value | 0.0829 | ≤ 0.05 ||
+| stationary_bootstrap null p-value | 0.1029 | ≤ 0.05 ||
+
+Both nulls are *demeaned* bootstrap resamples, the canonical test of
+H₀ that the true mean is zero. Observed Sharpe (0.483) sits at the
+8–10 % upper-tail of the null distribution: suggestive but below the
+strict α = 0.05 bar. Consistent with `SEPARATION_FINDING.md`: most
+realised alpha lives in the narrow HIGH_SYNC regime.

 ## What is placeholder

Lines changed: 8 additions & 8 deletions
@@ -1,9 +1,9 @@
 n_trials,family_id,observed_sharpe,p_value,p_value_pass
-500,iid_bootstrap,0.48319185,0.49301397,False
-500,stationary_bootstrap,0.48319185,0.49500998,False
-1000,iid_bootstrap,0.48319185,0.5044955,False
-1000,stationary_bootstrap,0.48319185,0.52347652,False
-2000,iid_bootstrap,0.48319185,0.50524738,False
-2000,stationary_bootstrap,0.48319185,0.50124938,False
-5000,iid_bootstrap,0.48319185,0.49710058,False
-5000,stationary_bootstrap,0.48319185,0.52169566,False
+500,iid_bootstrap,0.48319185,0.08582834,False
+500,stationary_bootstrap,0.48319185,0.09580838,False
+1000,iid_bootstrap,0.48319185,0.08291708,False
+1000,stationary_bootstrap,0.48319185,0.1028971,False
+2000,iid_bootstrap,0.48319185,0.08095952,False
+2000,stationary_bootstrap,0.48319185,0.10044978,False
+5000,iid_bootstrap,0.48319185,0.0789842,False
+5000,stationary_bootstrap,0.48319185,0.09858028,False
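The convergence figures quoted in ROBUSTNESS_RESULTS.md appear to be the largest absolute change in p-value between consecutive trial counts. The sketch below recomputes them from the new CSV rows; the consecutive-difference definition and the `<=` comparison against the 0.02 tolerance are assumptions inferred from the reported numbers, and `max_abs_delta` is a hypothetical helper.

```python
# New p-values copied from the CSV rows above: (n_trials, p_value).
P_VALUES = {
    "iid_bootstrap": [
        (500, 0.08582834), (1000, 0.08291708),
        (2000, 0.08095952), (5000, 0.0789842),
    ],
    "stationary_bootstrap": [
        (500, 0.09580838), (1000, 0.1028971),
        (2000, 0.10044978), (5000, 0.09858028),
    ],
}
TOLERANCE = 0.02  # tolerance quoted in the convergence section


def max_abs_delta(series: list[tuple[int, float]]) -> float:
    """Largest |Δp| between consecutive trial counts (assumed definition)."""
    ps = [p for _, p in sorted(series)]
    return max(abs(b - a) for a, b in zip(ps, ps[1:]))


deltas = {family: max_abs_delta(rows) for family, rows in P_VALUES.items()}
overall = max(deltas.values())
status = "CONVERGED" if overall <= TOLERANCE else "NOT_CONVERGED"

for family, delta in deltas.items():
    print(f"{family}: max |Δp| = {delta:.4f}")
print(f"overall: {overall:.4f} -> {status}")
```

Under that definition the rows reproduce the reported 0.0029 and 0.0071, and the same calculation on the removed rows gives 0.0115 and 0.0285.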
