fix(robustness): expose candidate_count, mark placeholder jitter N/A
Task 2 of the DECISION_GRADE escalation — cleans the evidence table so
no reader can confuse a tautological measurement for a real one, and
forbids placeholder jitter from asserting a live pass.
## CPCV: candidate_count + interpretation
KuramotoCPCVResult now carries:
- pbo_candidate_count: int (2 for fold-mirror)
- pbo_interpretation: str ('tautological' for n<3)
- loo_pbo_interpretation: str ('admissible' for n>=5)
Interpretation rule is a single module-level helper:
n < 3 → 'tautological' (best-IS trivially best)
3 <= n < 5 → 'weak' (low statistical power)
n >= 5 → 'admissible'
The fold-mirror PBO is retained as a sanity baseline but the markdown
row now explicitly labels it n=2, *tautological*. The LOO-grid PBO is
labelled n=13, *admissible* and carries the real signal.
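The interpretation rule above can be sketched as follows. This is a minimal illustration, not the module's actual helper: the function name `interpret_candidate_count` is hypothetical, and only the thresholds and labels come from this commit message.

```python
def interpret_candidate_count(n: int) -> str:
    """Map a PBO candidate count to an interpretation label.

    Hypothetical name; thresholds and labels follow the commit message.
    """
    if n < 3:
        # With fewer than 3 candidates the best in-sample pick is trivially best.
        return "tautological"
    if n < 5:
        # 3 or 4 candidates: a ranking exists but has low statistical power.
        return "weak"
    return "admissible"


# Fold-mirror PBO has 2 candidates; the LOO grid has 13.
fold_mirror_label = interpret_candidate_count(2)
loo_grid_label = interpret_candidate_count(13)
```

Keeping the rule in one function means the result object and the markdown row cannot drift apart on where the thresholds sit.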
## Jitter: placeholder forces fraction_within_tol_pass=False
kuramoto_jitter_suite.run_kuramoto_jitter_suite() now sets
fraction_within_tol_pass=False whenever evaluator_mode != 'LIVE',
regardless of the raw fraction-within-tol. The stability dataclass
retains the raw fraction honestly — it is only the decision-layer pass
boolean that is forced to False.
Decision layer reason string is now placeholder-aware:
- placeholder → 'jitter: placeholder evaluator — abstains from live ✓/✗'
- live failure → 'jitter: fraction-within-tol below threshold'
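A rough sketch of the gating described above, under stated assumptions: `JitterDecision`, `decide_jitter`, the `threshold` parameter, and the `>=` comparison are all hypothetical. Only the `evaluator_mode != 'LIVE'` check, the forced `False`, the retained raw fraction, and the two reason strings are taken from this commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JitterDecision:
    """Hypothetical container for the decision-layer view of the jitter suite."""
    raw_fraction_within_tol: float   # always the measured value, never altered
    fraction_within_tol_pass: bool   # decision-layer boolean
    reason: Optional[str]            # None when the live check passes


def decide_jitter(raw_fraction: float, evaluator_mode: str,
                  threshold: float) -> JitterDecision:
    # Assumption: any mode other than 'LIVE' counts as a placeholder.
    if evaluator_mode != "LIVE":
        return JitterDecision(
            raw_fraction,
            False,  # forced, regardless of the raw fraction
            "jitter: placeholder evaluator — abstains from live ✓/✗",
        )
    # Assumption: a live run passes when the fraction meets the threshold.
    passed = raw_fraction >= threshold
    reason = None if passed else "jitter: fraction-within-tol below threshold"
    return JitterDecision(raw_fraction, passed, reason)


placeholder = decide_jitter(1.0, "PLACEHOLDER_APPROXIMATION", threshold=0.9)
```

Here `placeholder.fraction_within_tol_pass` is `False` even though the raw fraction is 1.0, which is the behaviour this change enforces.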
## Evidence-table presentation
ROBUSTNESS_RESULTS.md now shows:
| CPCV | PBO (fold mirror, n=2, *tautological*) | 0.0000 | ✓ |
| CPCV | PBO (LOO grid, n=13, *admissible*) | 0.2000 | ✓ |
| Jitter | fraction_within_tol | 1.0000 | N/A |
| Jitter | evaluator_mode | `PLACEHOLDER_APPROXIMATION` (…) | n/a |
No ✓ appears on any placeholder row. The tautological PBO is surfaced
explicitly; no reader will mistake it for a statistically meaningful
overfit test.
## Tests
- test_pbo_candidate_count_and_interpretation — fold-mirror is always
n=2/tautological, LOO is n=13/admissible.
- test_placeholder_forces_pass_false — placeholder evaluator must set
fraction_within_tol_pass=False regardless of raw fraction.
60/60 robustness tests green; mypy --strict clean across 21 files;
28/28 frozen artefacts intact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  | Jitter | evaluator_mode | `PLACEHOLDER_APPROXIMATION` (not decision-grade; live evaluator required to flip this row to ✓ / ✗) | n/a |

  ## Reasons

  - null: one or more families failed
+ - jitter: placeholder evaluator — abstains from live ✓/✗

  ## Notes

  - Evidence is derived from the frozen `offline_robustness/SOURCE_HASHES.json` bundle; 28 artifacts hash-verified.
- - Null suite uses cumulative-return pct_change as a return proxy; raw `net_ret` is not in the frozen demo bundle, which limits statistical power relative to the published headline Sharpe (`risk_metrics.csv::sharpe = 1.2619`).
- - Jitter evaluator is PLACEHOLDER_APPROXIMATION: rebuild under perturbed parameters requires the raw asset panel.
+ - Null suite uses mathematically exact daily log-returns (`diff(log(strategy_cumret))`) — no approximation. See `ROBUSTNESS_PROTOCOL.md` § 1 for the derivation contract.
+ - PBO interpretation: fewer than 3 candidates is `tautological`, fewer than 5 is `weak`, 5+ is `admissible`. The fold-mirror PBO is always tautological by construction and is kept only as a sanity baseline; the LOO-grid PBO is the decision-grade one.
+ - Jitter row shows `N/A` while the evaluator is `PLACEHOLDER_APPROXIMATION`; a live rebuild is required to replace the row with a real ✓ / ✗.
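
The `diff(log(strategy_cumret))` expression in the Notes can be illustrated with a small standard-library sketch. It assumes `strategy_cumret` is a strictly positive cumulative wealth index (for example, starting at 1.0); the function name `daily_log_returns` is hypothetical, and the actual derivation contract lives in `ROBUSTNESS_PROTOCOL.md`, which is not reproduced here.

```python
import math
from typing import List, Sequence


def daily_log_returns(strategy_cumret: Sequence[float]) -> List[float]:
    """Return log(c[t]) - log(c[t-1]) for each consecutive pair.

    Assumes every value is strictly positive (a wealth index, not a
    return-minus-one series); otherwise the logarithm is undefined.
    """
    if any(c <= 0 for c in strategy_cumret):
        raise ValueError("cumulative series must be strictly positive")
    logs = [math.log(c) for c in strategy_cumret]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


# Illustrative numbers only: two consecutive +10% days.
example = daily_log_returns([1.0, 1.1, 1.21])
```

Log-returns telescope: their sum equals the log of the last value over the first, so the cumulative series can be recovered from them up to floating-point rounding.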