docs(robustness): explicit alpha threshold and PSR caveat

neuron7xLab · claude · neuron7xLab · commit eb3aac8ccf75 · 2026-04-22T13:58:04.000+03:00
Task 4 of the DECISION_GRADE escalation. Pins every statistical
threshold to a canonical location and documents the PSR autocorrelation
limitation so no reader confuses PSR=1.0 with definitive significance.

## ROBUSTNESS_PROTOCOL.md § 3 — Statistical thresholds

Nine thresholds tabulated verbatim with their module-level source:
  null_alpha           = 0.05   kuramoto_null_suite.NULL_PASS_P_THRESHOLD
  pbo_max              = 0.50   kuramoto_cpcv_suite.PBO_PASS_THRESHOLD
  loo_pbo_max          = 0.50   kuramoto_cpcv_suite.LOO_PBO_PASS_THRESHOLD
  psr_min              = 0.95   kuramoto_cpcv_suite.PSR_PASS_THRESHOLD
  jitter_floor_ratio   = 0.80   kuramoto_jitter_suite.run_kuramoto_jitter_suite default
  sharpe_tolerance     = 0.20   kuramoto_jitter_suite.DEFAULT_SHARPE_TOLERANCE
  pbo_tautological_n   = 3      kuramoto_cpcv_suite.PBO_TAUTOLOGICAL_CUTOFF
  pbo_weak_n           = 5      kuramoto_cpcv_suite.PBO_WEAK_CUTOFF
  null_convergence_tol = 0.02   analysis_null_convergence.CONVERGENCE_TOLERANCE
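
For orientation, `null_alpha` is compared against the upper-tail p-value
that the protocol describes as "+1 continuity-corrected". A minimal
sketch of that estimator follows, assuming the usual form
(1 + #{null >= observed}) / (1 + B). The function name, the synthetic
null draws, and the strict `<` comparison are illustrative assumptions,
not the suite's actual code.

```python
import random
from typing import Sequence

NULL_ALPHA = 0.05  # mirrors null_alpha in the list above


def upper_tail_p_value(observed: float, null_stats: Sequence[float]) -> float:
    """+1 continuity-corrected upper-tail p-value over B null statistics."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (1 + exceed) / (1 + len(null_stats))


# Synthetic null distribution, for illustration only (hypothetical data).
rng = random.Random(0)
null_sharpes = [rng.gauss(0.0, 1.0) for _ in range(999)]

p_far = upper_tail_p_value(10.0, null_sharpes)  # observed far in the tail
p_mid = upper_tail_p_value(0.0, null_sharpes)   # observed near the centre

# Assumption: a family passes when p is strictly below alpha.
far_passes = p_far < NULL_ALPHA
mid_passes = p_mid < NULL_ALPHA
```

The +1 in numerator and denominator keeps the estimate strictly
positive, so a finite number of resamples can never report p = 0.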

The file is explicit that documentation mirrors the code constants,
never the other way round. Threshold drift between code and doc is a
bug in the doc.
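
The "doc mirrors code" rule could be checked mechanically. The sketch
below parses threshold rows out of a markdown table and reports names
whose documented value differs from the code value. `CODE_CONSTANTS` is
a hypothetical stand-in for the real module-level constants (a real
check would import them), and `DOC_TABLE` is an abbreviated copy of the
§ 3 table; neither helper exists in the repository.

```python
import re

# Hypothetical stand-ins; a real check would import the suite constants.
CODE_CONSTANTS = {
    "null_alpha": 0.05,
    "pbo_max": 0.50,
    "psr_min": 0.95,
}

# Abbreviated copy of the section 3 table, for illustration.
DOC_TABLE = """\
| Threshold | Value | Where set | Semantics |
|---|---:|---|---|
| `null_alpha` | 0.05 | `kuramoto_null_suite.NULL_PASS_P_THRESHOLD` | Upper-tail alpha |
| `pbo_max` | 0.50 | `kuramoto_cpcv_suite.PBO_PASS_THRESHOLD` | Fold-mirror PBO |
| `psr_min` | 0.95 | `kuramoto_cpcv_suite.PSR_PASS_THRESHOLD` | Probabilistic Sharpe |
"""

# Matches rows whose first cell is a backticked name and second a number.
ROW = re.compile(r"^\|\s*`(?P<name>\w+)`\s*\|\s*(?P<value>[0-9.]+)\s*\|")


def parse_thresholds(markdown: str) -> dict[str, float]:
    """Extract {name: value} from table rows; header rows do not match."""
    found: dict[str, float] = {}
    for line in markdown.splitlines():
        match = ROW.match(line)
        if match:
            found[match.group("name")] = float(match.group("value"))
    return found


def find_drift(doc: dict[str, float], code: dict[str, float]) -> list[str]:
    """Names that are missing on one side or carry different values."""
    return [n for n in sorted(set(doc) | set(code)) if doc.get(n) != code.get(n)]


drift = find_drift(parse_thresholds(DOC_TABLE), CODE_CONSTANTS)
```

An empty `drift` list means the table and the constants agree; any
entry would point at a documentation row to correct.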

## ROBUSTNESS_LIMITATIONS.md (new)

Five honest catalogue entries:
  1. PSR has no autocorrelation adjustment.
     Lopez de Prado Eq. 14.1 corrects for skew and kurtosis, not serial
     correlation. Regime-following strategies tend to have positively
     autocorrelated returns, so the nominal sample size overstates the
     effective one; PSR=1.0 on the frozen bundle should not be read as
     definitive significance. HAC (Newey-West) is the forward fix.
  2. Jitter evaluator is placeholder — forced abstain, not pass.
  3. LOO-grid PBO has only 5 paths — wide CI on the 0.20 point estimate.
  4. Null families are single-stream (no benchmark-matched test).
  5. Contract covers frozen bundle only; no re-simulation.

Each entry states that it is NOT a bug and that closing it is NOT
required for a valid verdict; it is only something a reader must
account for.
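
To make entry 1 concrete, the sketch below computes the standard PSR
(skew and kurtosis correction only) and then recomputes it with a
reduced sample size. The reduction uses the textbook AR(1)
approximation T_eff = T(1 - rho)/(1 + rho), which is an assumption
chosen for illustration; it is not the Newey-West adjustment and not
the repository's `probabilistic_sharpe_ratio`. All names and input
numbers are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def psr_sketch(
    sr: float,
    n_obs: float,
    skew: float = 0.0,
    kurt: float = 3.0,
    sr_benchmark: float = 0.0,
) -> float:
    """PSR for a per-period (non-annualised) Sharpe ratio `sr`.

    `kurt` is raw kurtosis (3.0 for a normal distribution).
    """
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr**2)
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1.0) / denom
    return normal_cdf(z)


def ar1_effective_sample_size(n_obs: float, rho: float) -> float:
    """Illustrative AR(1) approximation; assumes -1 < rho < 1."""
    return n_obs * (1.0 - rho) / (1.0 + rho)


# Hypothetical inputs, not taken from the frozen bundle.
sr_daily, n_days, rho = 0.05, 1000, 0.2

psr_nominal = psr_sketch(sr_daily, n_days)
psr_reduced = psr_sketch(sr_daily, ar1_effective_sample_size(n_days, rho))
```

With positive `rho` the reduced sample size is below the nominal one,
so `psr_reduced` comes out lower than `psr_nominal`; that is the
direction of the bias the caveat warns about.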

## ROBUSTNESS_RESULTS.md wiring

- CPCV row now reads 'PSR (daily, no HAC)' so the caveat is visible
  at a glance in the main results table.
- Notes section cross-references ROBUSTNESS_PROTOCOL.md § 3 for
  thresholds and ROBUSTNESS_LIMITATIONS.md § 1 for the PSR caveat.

## Integrity

- Code constants unchanged (per R6: do not change verdict by
  threshold manipulation). Documentation mirrors existing code.
- 63/63 tests/research/robustness green.
- mypy --strict clean across touched files.
- 28/28 frozen artefacts intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_LIMITATIONS.md b/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_LIMITATIONS.md
@@ -0,0 +1,89 @@
+# Cross-asset Kuramoto · Robustness v1 limitations
+
+Honest catalogue of what the v1 framework *does not* measure cleanly.
+Nothing below is a bug: every entry is a known statistical or data-
+access limitation that a reader MUST account for when interpreting
+`verdict.json`, `null_summary.json`, or `ROBUSTNESS_RESULTS.md`.
+
+## 1. PSR has no autocorrelation adjustment
+
+`research.robustness.cpcv.probabilistic_sharpe_ratio` implements the
+Lopez de Prado (2018) Eq. 14.1 PSR. The formula corrects for skewness
+(γ₃) and kurtosis (γ₄) of the sample distribution but **does not**
+correct for serial correlation in the return stream.
+
+Strategy returns that exhibit positive first-order autocorrelation,
+which is typical of regime-following strategies, carry fewer effective
+observations than the nominal T in the Sharpe-variance denominator.
+Consequences:
+
+- The reported `psr_daily = 1.0000` on the frozen bundle should
+  **not** be read as definitive statistical significance.
+- Under HAC (heteroscedasticity- and autocorrelation-consistent)
+  adjustment (Newey–West, Andrews–Monahan kernel), the effective
+  sample size shrinks and the PSR would be materially lower.
+
+Implementing HAC-adjusted PSR is a forward improvement and is
+out of scope for v1. The caveat is cross-linked from
+`ROBUSTNESS_RESULTS.md` under the CPCV row.
+
+## 2. Jitter evaluator is `PLACEHOLDER_APPROXIMATION`
+
+`kuramoto_jitter_executor.make_placeholder_evaluator` returns a
+smooth quadratic in fractional parameter-space distance scaled by the
+anchor Sharpe. This exercises the primitive contract but does **not**
+rebuild the strategy under perturbed parameters.
+
+- The row in `ROBUSTNESS_RESULTS.md` shows `N/A`, not ✓.
+- `fraction_within_tol_pass` is forced to `False` regardless of raw
+  fraction — the decision layer treats placeholder evidence as
+  abstention, not a pass.
+- Replacing the executor requires access to the raw asset panel (not
+  in the frozen bundle); pairing that panel with the frozen parameter
+  lock yields a live evaluator.
+
+## 3. LOO-grid PBO has low path count
+
+`results/cross_asset_kuramoto/offline_robustness/leave_one_asset_out.csv`
+ships 5 folds × 13 perturbations. Bailey et al.'s CPCV PBO achieves
+full statistical power at C(N, k) paths with N ≥ 8. With 5 paths the
+PBO estimate has wide confidence intervals; the reported 0.20 is a
+point estimate, not a CI-backed lower bound.
+
+A higher-power PBO requires either a richer strategy-parameter grid
+(non-frozen; out of scope) or importance-sampled CPCV over an expanded
+fold geometry.
+
+## 4. Null families do not include benchmark-matched tests
+
+The single-stream null suite compares the realised Sharpe against
+bootstrapped resamples of itself. It does **not** test whether the
+strategy outperforms a matched-cost, matched-lag benchmark such as
+BF1 equal-weight. That measurement lives in the offline packet
+(`benchmark_family.csv`) and is cross-referenced by
+`SEPARATION_FINDING.md`.
+
+## 5. Contract covers the frozen bundle only
+
+Everything above operates on `SOURCE_HASHES.json` (28 artefacts) +
+`leave_one_asset_out.csv` (inline-hash-verified extension). The framework
+does **not** re-run the spike or re-simulate the strategy. It is a
+*read-only* audit layer.
+
+## Forward improvements
+
+Any of the five items above can be closed without changing the
+existing primitives:
+
+1. HAC-PSR adjustment (Newey–West kernel inside
+   `probabilistic_sharpe_ratio`).
+2. Live jitter evaluator (raw asset panel + frozen parameter lock).
+3. Higher-power PBO (expand LOO grid or import full spike parameter
+   sweep).
+4. Benchmark-matched null families (import `benchmark_family.csv`).
+5. Protocol-level contract covering the live-shadow evidence rail
+   (not just the demo bundle).
+
+None of these is required for a valid FAIL or PASS verdict on the
+current frozen evidence; each would tighten the confidence interval
+around that verdict.
diff --git a/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_PROTOCOL.md b/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_PROTOCOL.md
@@ -57,14 +57,27 @@ information content beyond short-horizon autocorrelation.
 Both families share a seeded `np.random.default_rng` and emit a
 Davison–Hinkley +1 continuity-corrected upper-tail p-value.
 
-## 3. Decision thresholds
-
-See `ROBUSTNESS_PROTOCOL.md § Statistical thresholds` (populated by
-Task 4) for the canonical `alpha`, `pbo_max`, `psr_min`, and
-jitter-tolerance values. All thresholds are encoded as module-level
-constants in `research/robustness/protocols/*_suite.py` and
-`backtest/robustness_gates.py`; the documentation mirrors the constants,
-never the other way round.
+## 3. Statistical thresholds
+
+All thresholds are encoded as module-level constants; this section
+mirrors the constants, never the other way round. Drift between code
+and this section is a bug in the documentation.
+
+| Threshold | Value | Where set | Semantics |
+|---|---:|---|---|
+| `null_alpha` | 0.05 | `kuramoto_null_suite.NULL_PASS_P_THRESHOLD` | Upper-tail α for either null family |
+| `pbo_max` | 0.50 | `kuramoto_cpcv_suite.PBO_PASS_THRESHOLD` | Fold-mirror PBO must be below this |
+| `loo_pbo_max` | 0.50 | `kuramoto_cpcv_suite.LOO_PBO_PASS_THRESHOLD` | LOO-grid PBO must be below this |
+| `psr_min` | 0.95 | `kuramoto_cpcv_suite.PSR_PASS_THRESHOLD` | Probabilistic Sharpe must exceed this |
+| `jitter_floor_ratio` | 0.80 | `kuramoto_jitter_suite.run_kuramoto_jitter_suite` default `fraction_within_tol_pass` | Fraction of jitter candidates within `sharpe_tolerance` (live evaluator only) |
+| `sharpe_tolerance` | 0.20 | `kuramoto_jitter_suite.DEFAULT_SHARPE_TOLERANCE` | Absolute \|ΔSharpe\| band for jitter evaluator |
+| `pbo_tautological_n` | 3 | `kuramoto_cpcv_suite.PBO_TAUTOLOGICAL_CUTOFF` | Below this candidate count, PBO is tautological |
+| `pbo_weak_n` | 5 | `kuramoto_cpcv_suite.PBO_WEAK_CUTOFF` | Below this candidate count, PBO is weak |
+| `null_convergence_tol` | 0.02 | `analysis_null_convergence.CONVERGENCE_TOLERANCE` | Max \|Δp\| across adjacent trial counts for CONVERGED |
+
+Threshold semantics are one-sided unless stated otherwise.
+Null-family tests are upper-tail: reject H₀ when *observed* Sharpe is
+in the upper α tail of the bootstrap distribution.
 
 ## 4. Artefacts written
 
diff --git a/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_RESULTS.md b/results/cross_asset_kuramoto/robustness_v1/ROBUSTNESS_RESULTS.md
@@ -7,7 +7,7 @@ Terminal decision: **FAIL**
 | Suite | Metric | Value | Pass |
 |---|---|---:|:-:|
 | CPCV | PBO (fold mirror, n=2, *tautological*) | 0.0000 | ✓ |
-| CPCV | PSR (daily) | 1.0000 | ✓ |
+| CPCV | PSR (daily, no HAC) | 1.0000 | ✓ |
 | CPCV | Annualised Sharpe (daily) | 0.4832 | n/a |
 | CPCV | PBO (LOO grid, n=13, *admissible*) | 0.2000 | ✓ |
 | Null | iid_bootstrap p-value | 0.5045 | ✗ |
@@ -34,3 +34,5 @@ Terminal decision: **FAIL**
 - Null suite uses mathematically exact daily log-returns (`diff(log(strategy_cumret))`) — no approximation. See `ROBUSTNESS_PROTOCOL.md` § 1 for the derivation contract.
 - PBO interpretation: fewer than 3 candidates is `tautological`, fewer than 5 is `weak`, 5+ is `admissible`. The fold-mirror PBO is always tautological by construction and is kept only as a sanity baseline; the LOO-grid PBO is the decision-grade one.
 - Jitter row shows `N/A` while the evaluator is `PLACEHOLDER_APPROXIMATION`; a live rebuild is required to replace the row with a real ✓ / ✗.
+- PSR column is *not* HAC-adjusted. Under positive serial correlation — typical of regime-following strategies — the effective sample size is smaller than the nominal T, and `psr_daily = 1.0000` is inflated. See `ROBUSTNESS_LIMITATIONS.md` § 1 for the forward-improvement path (Newey–West kernel).
+- Decision thresholds (α = 0.05, pbo_max = 0.50, psr_min = 0.95, jitter_floor = 0.80) are documented verbatim in `ROBUSTNESS_PROTOCOL.md` § 3.
diff --git a/scripts/run_kuramoto_robustness_v1.py b/scripts/run_kuramoto_robustness_v1.py
@@ -110,7 +110,7 @@ def _render_markdown(
         f"*{cpcv_dict['pbo_interpretation']}*) | "
         f"{cpcv_dict['pbo']:.4f} | "
         f"{'✓' if cpcv_dict['pbo_pass'] else '✗'} |",
-        f"| CPCV | PSR (daily) | {cpcv_dict['psr_daily']:.4f} | "
+        f"| CPCV | PSR (daily, no HAC) | {cpcv_dict['psr_daily']:.4f} | "
         f"{'✓' if cpcv_dict['psr_pass'] else '✗'} |",
         f"| CPCV | Annualised Sharpe (daily) | {cpcv_dict['annualised_sharpe']:.4f} | n/a |",
     ]
@@ -189,6 +189,15 @@ def _render_markdown(
             "- Jitter row shows `N/A` while the evaluator is "
             "`PLACEHOLDER_APPROXIMATION`; a live rebuild is required to "
             "replace the row with a real ✓ / ✗.",
+            "- PSR column is *not* HAC-adjusted. Under positive serial "
+            "correlation — typical of regime-following strategies — the "
+            "effective sample size is smaller than the nominal T, and "
+            "`psr_daily = 1.0000` is inflated. See "
+            "`ROBUSTNESS_LIMITATIONS.md` § 1 for the forward-improvement "
+            "path (Newey–West kernel).",
+            "- Decision thresholds (α = 0.05, pbo_max = 0.50, "
+            "psr_min = 0.95, jitter_floor = 0.80) are documented "
+            "verbatim in `ROBUSTNESS_PROTOCOL.md` § 3.",
             "",
         ]
     )