fix(robustness): demean returns before bootstrap — null distribution now centred at zero

neuron7xLab · claude · neuron7xLab · commit b292c2c7786d · 2026-04-22T14:35:16.000+03:00
CRITICAL correctness fix surfaced during the final review pass. The
previous null implementation sampled the raw returns with replacement,
which produces a null distribution of Sharpe centred at the *observed*
Sharpe (because E[mean of resample] = mean of original). Every p-value
was therefore trivially ≈ 0.5 regardless of signal strength, so the
framework could not distinguish a real edge from noise.

## Before (broken)

Synthetic validation exposed the bug:
  STRONG signal (μ=0.003, SR=3.88):   iid_p=0.531  ✗ should be <0.05
  MODERATE   (μ=0.0008, SR=1.53):     iid_p=0.545  ✗ should be <0.1
  NOISE      (μ=0, SR=0.22):          iid_p=0.465  ~ ok
  INVERTED   (μ=-0.003, SR=-4.98):    iid_p=0.471  ✗ should be ≈1

## After (fix)

Same synthetic sweep with demeaned bootstrap:
  STRONG signal (SR=3.88):   iid_p=0.002  ✓ reject H0
  MODERATE    (SR=1.53):     iid_p=0.002  ✓ reject H0
  NOISE       (SR=0.22):     iid_p=0.262  ✓ cannot reject
  INVERTED    (SR=-4.98):    iid_p=1.000  ✓ far left-tail

## Root cause

A non-demeaned bootstrap effectively tests H₀: 'resampled mean equals
observed mean', which is trivially true by construction. The canonical
Sharpe-vs-zero null test centres the returns at zero before resampling:

    centred = returns - returns.mean()
    null[b] = Sharpe(centred[bootstrap_indices])

Only then does the null represent H₀: 'true mean is zero'; the
observed Sharpe is compared against the upper tail. This is the
Lopez de Prado (2018) § 14.3 / Politis & Romano (1994) § 3 convention
for stationary-bootstrap SR tests.
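
A minimal, self-contained sketch of the contrast described above, for
illustration only. The names `sharpe` and `iid_bootstrap_p`, the
`ddof=1` standard deviation, the 252-period annualisation and the
synthetic parameters are assumptions, not the repository's `_sharpe`
or its defaults. The add-one p-value estimator is also an assumption,
although values such as 0.08291708 = 83/1001 in null_convergence.csv
are consistent with it.

```python
import numpy as np


def sharpe(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualised Sharpe ratio; 0.0 when the sample has no variance."""
    std = returns.std(ddof=1)
    if std == 0.0:
        return 0.0
    return float(np.sqrt(periods_per_year) * returns.mean() / std)


def iid_bootstrap_p(
    returns: np.ndarray,
    n_bootstrap: int = 1000,
    seed: int = 0,
    demean: bool = True,
) -> float:
    """Upper-tail p-value of the observed Sharpe against an iid bootstrap null."""
    rng = np.random.default_rng(seed)
    observed = sharpe(returns)
    # demean=True imposes H0 (true mean is zero) on the resampling pool;
    # demean=False reproduces the broken raw-returns behaviour.
    pool = returns - returns.mean() if demean else returns
    null = np.empty(n_bootstrap, dtype=np.float64)
    for b in range(n_bootstrap):
        idx = rng.integers(0, pool.size, size=pool.size)
        null[b] = sharpe(pool[idx])
    # Add-one estimator (assumed): the p-value is never exactly 0.
    return float((1 + np.sum(null >= observed)) / (n_bootstrap + 1))


# Synthetic stream with a strong positive drift (illustrative parameters).
demo = np.random.default_rng(42).normal(loc=0.003, scale=0.01, size=1000)
p_fixed = iid_bootstrap_p(demo, demean=True)    # small: H0 rejected
p_broken = iid_bootstrap_p(demo, demean=False)  # near 0.5 by construction
```

The only difference between the two calls is the resampling pool, which
is why the fix in this commit is a single demeaning line plus renames.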

## Evidence on the frozen bundle (demeaned)

  iid_bootstrap         p = 0.0829  (was 0.5045 broken)
  stationary_bootstrap  p = 0.1029  (was 0.5235 broken)
  observed SR           = 0.4832 (log-return Sharpe, unchanged)

The observed Sharpe sits at the 8-10 % upper-tail of the null
distribution — statistically suggestive but below the α=0.05 bar.
Honest FAIL.
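
The two families differ only in how resample indices are drawn. The
helper `_stationary_bootstrap_indices` is called by the suite but its
body is not part of this diff, so the following is a hypothetical
sketch of a Politis & Romano style index generator rather than the
repository's implementation. It assumes geometric block lengths with
mean `mean_block` and wrap-around at the end of the series.

```python
import numpy as np


def stationary_bootstrap_indices(
    n: int, mean_block: float, rng: np.random.Generator
) -> np.ndarray:
    """Draw n indices in blocks whose lengths are geometric with the given mean."""
    restart_prob = 1.0 / mean_block
    idx = np.empty(n, dtype=np.int64)
    idx[0] = rng.integers(0, n)
    for t in range(1, n):
        if rng.random() < restart_prob:
            # Start a new block at a uniformly random position.
            idx[t] = rng.integers(0, n)
        else:
            # Continue the current block, wrapping past the last bar.
            idx[t] = (idx[t - 1] + 1) % n
    return idx


rng = np.random.default_rng(0)
sample_idx = stationary_bootstrap_indices(5000, 21.0, rng)
# Fraction of steps that continue the previous block (about 1 - 1/21).
continued = float(np.mean((sample_idx[1:] - sample_idx[:-1]) % 5000 == 1))
```

Because consecutive bars stay together inside a block, the resample
keeps short-horizon autocorrelation that the iid family destroys, which
is consistent with the slightly larger stationary_bootstrap p-value.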

## Convergence on the frozen bundle (demeaned)

BEFORE (broken null): NOT_CONVERGED  (max |Δp| = 0.0285)
AFTER (demeaned):     CONVERGED      (max |Δp| = 0.0071)

The fix not only corrects the null semantics but also stabilises the
convergence across {500, 1000, 2000, 5000} trial counts.
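
The reported max |Δp| figures can be reproduced from the p-values in
null_convergence.csv. The sketch below assumes that Δp is taken between
consecutive trial counts and that the status flips at `<=` tolerance;
both are inferred from the reported numbers, not read from the
convergence code, and the names are illustrative.

```python
# p-values copied from null_convergence.csv (demeaned run), ordered by
# n_trials = 500, 1000, 2000, 5000.
P_VALUES = {
    "iid_bootstrap": [0.08582834, 0.08291708, 0.08095952, 0.0789842],
    "stationary_bootstrap": [0.09580838, 0.1028971, 0.10044978, 0.09858028],
}
TOLERANCE = 0.02  # tolerance reported in ROBUSTNESS_RESULTS.md


def max_delta_p(p_values: list[float]) -> float:
    """Largest absolute change in p between consecutive trial counts."""
    return max(abs(b - a) for a, b in zip(p_values, p_values[1:]))


per_family = {name: max_delta_p(ps) for name, ps in P_VALUES.items()}
overall = max(per_family.values())
# Comparison operator is an assumption; the source only states the tolerance.
status = "CONVERGED" if overall <= TOLERANCE else "NOT_CONVERGED"
```

Under the same definition the pre-fix stationary_bootstrap series
(0.49500998, 0.52347652, 0.50124938, 0.52169566) gives 0.0285, which
exceeds the 0.02 tolerance and matches the NOT_CONVERGED label.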

## Artefact updates

- null_summary.json, null_convergence.csv, verdict.json, cpcv_summary,
  jitter_summary, ROBUSTNESS_RESULTS.md, ROBUSTNESS_SUMMARY.md all
  regenerated with the correct null semantics.
- Module docstring rewritten to pin the demeaning convention with
  literature references.
- Convergence note in ROBUSTNESS_RESULTS.md updated to reflect the
  8-10 % upper-tail reading (not 'well above' as before).

## Guarantees

- 63/63 research/robustness tests green.
- mypy --strict clean across 23 source files.
- 28/28 frozen SOURCE_HASHES artefacts intact.
- Signal code untouched; framework-layer fix only.
- Verdict label unchanged (FAIL → FAIL); evidence now statistically
  meaningful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/research/robustness/protocols/kuramoto_null_suite.py b/research/robustness/protocols/kuramoto_null_suite.py
@@ -7,24 +7,33 @@
 bundle only ships *realised* strategy returns (no raw position trace),
 so the generic primitive's three "signal-side" families degenerate.
 
-This suite implements the two null families that are meaningful given
-only a realised return stream:
-
-1. **iid_bootstrap** — sample the returns with replacement, i.i.d.
-   from the empirical marginal distribution; tests for information
-   beyond the first-moment/second-moment marginal distribution.
-   (Plain permutation would be a *degenerate* null here: Sharpe is
-   order-invariant as a function of the vector, so permutation
-   preserves it up to floating-point noise. With-replacement sampling
-   changes the realised mean and std of each draw and is the proper
-   iid null for a Sharpe statistic on a single return stream.)
-2. **stationary_bootstrap** — Politis & Romano block bootstrap with
-   geometric block length (mean = 21 bars); tests information beyond
-   short-horizon autocorrelation.
-
-Positive observed Sharpe with a low p-value on both families is the
-minimum evidence that the realised return stream is not a re-ordering
-artefact.
+This suite implements two **demeaned** bootstrap null families that
+are meaningful given only a realised return stream:
+
+1. **iid_bootstrap** — sample the *demeaned* returns with replacement,
+   i.i.d. from the empirical marginal distribution, and compute the
+   Sharpe of each resample. Under H₀ ('true mean is zero') this null
+   distribution is centred at 0 and the observed Sharpe is compared
+   against its upper tail.
+2. **stationary_bootstrap** — Politis & Romano (1994) block bootstrap
+   with geometric block length (mean = 21 bars) applied to the
+   *demeaned* returns; tests the same H₀ under a stationary-series
+   assumption that preserves short-horizon autocorrelation.
+
+Why demeaning matters: bootstrapping the raw returns produces a null
+distribution centred at the sample mean
+(``E[mean of resample] = mean of original``) so every p-value would
+trivially equal ≈ 0.5 regardless of signal strength. Plain permutation
+would be even worse — Sharpe is order-invariant on a given vector, so
+the permutation null preserves the observed Sharpe up to floating-
+point noise and yields a trivial p → 1. The demeaning step is what
+turns each bootstrap draw into a draw from the null *hypothesis*,
+not from the observed sample. See Lopez de Prado (2018) § 14.3 and
+Politis & Romano (1994) § 3 for the convention.
+
+A low p-value on both families is the minimum evidence that the
+realised Sharpe is distinguishable from zero under the null of no
+information beyond the marginal distribution.
 """
 
 from __future__ import annotations
@@ -119,17 +128,25 @@ def run_kuramoto_null_suite(
         raise ValueError(f"n_bootstrap must be >= 1, got {n_bootstrap}")
     returns = contract.daily_strategy_returns().to_numpy(dtype=np.float64)
     observed = _sharpe(returns, periods_per_year)
+    # Demean before bootstrap: under H0 the true mean is 0, so the null
+    # distribution of Sharpe should be centred at 0, not at the sample
+    # Sharpe. Bootstrapping the raw returns would centre the null around
+    # the observed Sharpe by construction (E[mean of resample] = mean of
+    # original) and trivialise every p-value to ≈ 0.5. See Lopez de
+    # Prado (2018) § 14.3 and Politis & Romano (1994) § 3 for the
+    # demeaning convention on stationary-bootstrap SR tests.
+    centred = returns - returns.mean()
     rng = np.random.default_rng(seed)
 
     null_iid = np.empty(n_bootstrap, dtype=np.float64)
     for b in range(n_bootstrap):
-        idx = rng.integers(0, returns.size, size=returns.size)
-        null_iid[b] = _sharpe(returns[idx], periods_per_year)
+        idx = rng.integers(0, centred.size, size=centred.size)
+        null_iid[b] = _sharpe(centred[idx], periods_per_year)
 
     null_sb = np.empty(n_bootstrap, dtype=np.float64)
     for b in range(n_bootstrap):
-        sb_idx = _stationary_bootstrap_indices(returns.size, mean_block, rng)
-        null_sb[b] = _sharpe(returns[sb_idx], periods_per_year)
+        sb_idx = _stationary_bootstrap_indices(centred.size, mean_block, rng)
+        null_sb[b] = _sharpe(centred[sb_idx], periods_per_year)
 
     families: list[FrozenNullResult] = []
     family_name: FrozenNullFamily
diff --git a/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_RESULTS.md b/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_RESULTS.md
@@ -10,8 +10,8 @@ Terminal decision: **FAIL**
 | CPCV | PSR (daily, no HAC) | 1.0000 | ✓ |
 | CPCV | Annualised Sharpe (daily) | 0.4832 | n/a |
 | CPCV | PBO (LOO grid, n=13, *admissible*) | 0.2000 | ✓ |
-| Null | iid_bootstrap p-value | 0.5045 | ✗ |
-| Null | stationary_bootstrap p-value | 0.5235 | ✗ |
+| Null | iid_bootstrap p-value | 0.0829 | ✗ |
+| Null | stationary_bootstrap p-value | 0.1029 | ✗ |
 | Jitter | fraction_within_tol | 1.0000 | N/A |
 | Jitter | evaluator_mode | `PLACEHOLDER_APPROXIMATION` (not decision-grade; live evaluator required to flip this row to ✓ / ✗) | n/a |
 
@@ -22,11 +22,11 @@ Terminal decision: **FAIL**
 
 ## Null p-value convergence
 
-- overall status: **NOT_CONVERGED**
-- overall max |Δp|: 0.0285 (tolerance 0.0200)
-- iid_bootstrap: max |Δp| = 0.0115
-- stationary_bootstrap: max |Δp| = 0.0285
-- Note: verdict stability under convergence is independent of the CONVERGED/NOT_CONVERGED label. Both families' p-values stay well above α = 0.05 across all trial counts (500 → 5000), so the FAIL verdict is decision-stable even if the p-value fluctuates within its own uncertainty band.
+- overall status: **CONVERGED**
+- overall max |Δp|: 0.0071 (tolerance 0.0200)
+- iid_bootstrap: max |Δp| = 0.0029
+- stationary_bootstrap: max |Δp| = 0.0071
+- Note: the demeaned bootstrap families converge to p ∈ [0.08, 0.10] — the observed Sharpe is statistically suggestive but does not clear the strict α = 0.05 bar. Verdict FAIL is decision-stable across trial counts (500 → 5000).
 
 ## Notes
 
diff --git a/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_SUMMARY.md b/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_SUMMARY.md
@@ -25,14 +25,14 @@ combines evidence into `PASS` / `FAIL` / `INSUFFICIENT_EVIDENCE`.
 
 | Gate | Value | Threshold | Status |
 |---|---:|---:|:-:|
-| iid_bootstrap null p-value | 0.5045 | ≤ 0.05 | ✗ |
-| stationary_bootstrap null p-value | 0.5235 | ≤ 0.05 | ✗ |
-
-Both nulls are with-replacement resamples of the realised daily log-
-return stream. `p ≈ 0.50` means the observed Sharpe (0.483) is
-statistically indistinguishable from bootstrap resamples of its own
-marginal distribution. Consistent with `SEPARATION_FINDING.md`:
-most realised alpha lives in the narrow HIGH_SYNC regime.
+| iid_bootstrap null p-value | 0.0829 | ≤ 0.05 | ✗ |
+| stationary_bootstrap null p-value | 0.1029 | ≤ 0.05 | ✗ |
+
+Both nulls are *demeaned* bootstrap resamples — the canonical test of
+H₀ that the true mean is zero. Observed Sharpe (0.483) sits at the
+8–10 % upper-tail of the null distribution: suggestive but below the
+strict α = 0.05 bar. Consistent with `SEPARATION_FINDING.md`: most
+realised alpha lives in the narrow HIGH_SYNC regime.
 
 ## What is placeholder
 
diff --git a/results/cross_asset_kuramoto/robustness_v1/null_convergence.csv b/results/cross_asset_kuramoto/robustness_v1/null_convergence.csv
@@ -1,9 +1,9 @@
 n_trials,family_id,observed_sharpe,p_value,p_value_pass
-500,iid_bootstrap,0.48319185,0.49301397,False
-500,stationary_bootstrap,0.48319185,0.49500998,False
-1000,iid_bootstrap,0.48319185,0.5044955,False
-1000,stationary_bootstrap,0.48319185,0.52347652,False
-2000,iid_bootstrap,0.48319185,0.50524738,False
-2000,stationary_bootstrap,0.48319185,0.50124938,False
-5000,iid_bootstrap,0.48319185,0.49710058,False
-5000,stationary_bootstrap,0.48319185,0.52169566,False
+500,iid_bootstrap,0.48319185,0.08582834,False
+500,stationary_bootstrap,0.48319185,0.09580838,False
+1000,iid_bootstrap,0.48319185,0.08291708,False
+1000,stationary_bootstrap,0.48319185,0.1028971,False
+2000,iid_bootstrap,0.48319185,0.08095952,False
+2000,stationary_bootstrap,0.48319185,0.10044978,False
+5000,iid_bootstrap,0.48319185,0.0789842,False
+5000,stationary_bootstrap,0.48319185,0.09858028,False
diff --git a/results/cross_asset_kuramoto/robustness_v1/null_summary.json b/results/cross_asset_kuramoto/robustness_v1/null_summary.json
diff --git a/scripts/run_kuramoto_robustness_v1.py b/scripts/run_kuramoto_robustness_v1.py