fix(dro-ara): apply ADF to log-returns, not raw prices

neuron7xLab · claude · neuron7xLab · commit d38ded63e8bd · 2026-04-21T17:17:21.000+03:00
Engine previously mixed stationarity conventions: DFA computes Hurst on
diff(log(price)) (engine.py:138), but ADF ran on raw prices. For canonical
I(1) asset prices, ADF on levels failed to reject the unit root (reported
non-stationary) ≈93–99% of the time across all askar assets, making
INV-DRO3 a near-tautology and reducing INV-DRO4's LONG-gate to permanent
INVALID.

Patch (4 lines in State.from_window):
  - compute log_returns = np.diff(np.log(np.abs(arr) + 1e-12))
  - run ADF on log_returns instead of raw arr
Conventions now consistent with DFA.

Empirical impact (SPDR S&P 500, 69 walk-forward folds):
  stationary rate:       1.4 % → 100 %   (binding constraint relaxed)
  rs_train_max:          0.298           (now bound by RS_LONG_THRESH)
  gate-on at current θ:  0 → still 0     (rs threshold now binding)

The patch unblocks calibration: with stationarity no longer the dominant
filter, threshold tuning of (H_CRITICAL, RS_LONG_THRESH) becomes
informative, so the next step (T3 grid re-run, T4 calibration) is now
meaningful.

Test impact: 5 tests rewritten to encode post-RFC semantics, 1 added.
  - tests/core/dro_ara/test_falsification.py:
      * test_random_walk_is_invalid_or_transition
        → test_random_walk_returns_are_stationary_no_long
        (RW returns i.i.d. → stationary; INV-DRO4 forbids LONG)
      * test_gbm_with_drift_is_non_stationary
        → test_gbm_with_drift_returns_are_stationary_no_long
        (GBM returns N(μ,σ²) → stationary; INV-DRO4 forbids LONG)
  - tests/core/dro_ara/test_invariants.py:
      * NEW test_inv_dro3_tightening_post_rfc_ou_stationary_rate
        (INV-DRO3 tightening guard: OU stationary rate > 50 % across
        30 seeds)
  - tests/core/strategies/test_dro_ara_filter.py:
      * test_apply_on_gbm_drifts_to_zero
        → test_apply_on_gbm_is_systematically_reduced
        (statistical: mean filter mult ≤ 0.55 across 30 seeds,
        ≥ 80 % reduced)
  - tests/research/dro_ara/test_backtest_smoke.py:
      * test_backtest_on_gbm_yields_flat_positions
        → test_backtest_on_gbm_has_flat_and_active_bars_mix
        (filter still zeroes DRIFT path → flat-frac ≥ 10 %)
  - tests/research/dro_ara/test_power_mc.py:
      * test_gbm_drift_classifies_as_invalid_majority
        → test_gbm_drift_not_classified_as_critical_majority
        (false-positive guard: p_critical(GBM) ≤ 0.40)

Fail-closed audit (per RFC §8 + feedback_fail_closed_audit), all PASS:
  1. tests/core/dro_ara/test_properties.py   8/8 green
  2. Hypothesis fuzz (14 @given strategies)  green, no crash
  3. SPDR 69-fold smoke gate-on/stat ≥ 20 %  PASS (100 %)
  4. Full repo regression                    11274 passed, 0 failed

Quality gates: ruff clean · black clean · mypy --strict clean.

Refs: PR #345 (RFC), docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md §3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
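
The convention change described above can be illustrated with a minimal sketch. It assumes a plain (non-augmented) Dickey-Fuller regression with a constant and the asymptotic 5 % critical value of -2.86, as a stand-in for the engine's `_adf_stationary`, which uses AIC lag selection. The names `df_tstat`, `random_walk_price`, `rejection_rate` and `to_log_returns` are hypothetical and not part of the repo; only the log-return transform is copied from the patch.

```python
import numpy as np

# Assumption: asymptotic 5 % Dickey-Fuller critical value (constant, no trend).
DF_CRIT_5PCT = -2.86


def df_tstat(y: np.ndarray) -> float:
    """t-statistic of rho in diff(y)[t] = c + rho * y[t-1] + e[t].

    Plain Dickey-Fuller without lag augmentation; a hypothetical stand-in
    for the engine's _adf_stationary, not its actual implementation.
    """
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = float(resid @ resid) / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(beta[1] / np.sqrt(cov[1, 1]))


def random_walk_price(seed: int, n: int = 512) -> np.ndarray:
    """Synthetic I(1) price path: exp of a driftless Gaussian random walk."""
    rng = np.random.default_rng(seed)
    return 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=n)))


def to_log_returns(price: np.ndarray) -> np.ndarray:
    """Same transform as the patch applies before ADF."""
    return np.diff(np.log(np.abs(price) + 1e-12))


def rejection_rate(transform, n_seeds: int = 200) -> float:
    """Share of seeds where the unit-root null is rejected (series 'stationary')."""
    hits = 0
    for seed in range(n_seeds):
        series = transform(random_walk_price(seed))
        hits += int(df_tstat(series) < DF_CRIT_5PCT)
    return hits / n_seeds


levels_rate = rejection_rate(lambda p: p)      # old convention: test raw prices
returns_rate = rejection_rate(to_log_returns)  # new convention: test log-returns
print(f"stationary rate on levels:      {levels_rate:.2f}")
print(f"stationary rate on log-returns: {returns_rate:.2f}")
```

Under these assumptions the test on levels flags only a small minority of random-walk paths as stationary, while the test on log-returns flags all of them, which is the qualitative shift the SPDR numbers above report.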
diff --git a/core/dro_ara/engine.py b/core/dro_ara/engine.py
@@ -3,9 +3,10 @@
 """DRO-ARA v7 — Deterministic Recursive Observer + Action Result Acceptor.
 
 Measures statistical regime of a price series via Hurst (H) on log-returns (DFA-1),
-confirms stationarity via lag-augmented ADF (AIC lag selection, Ng & Perron 2001),
-and emits a deterministic regime + trading signal through a bounded ARA feedback
-loop.
+confirms stationarity via lag-augmented ADF on log-returns (AIC lag selection,
+Ng & Perron 2001), and emits a deterministic regime + trading signal through a
+bounded ARA feedback loop. Both statistical tests operate on the same transform
+(∆ log price) — the convention was aligned in PR #345 (RFC-stationarity).
 
 Public invariants (never relaxed):
 
@@ -213,7 +214,11 @@ class State:
     @classmethod
     def from_window(cls, x: NDArray[np.float64] | np.ndarray) -> "State":
         arr: NDArray[np.float64] = np.asarray(x, dtype=np.float64)
-        stat = _adf_stationary(arr)
+        # INV-DRO3 convention fix (PR #345 RFC): ADF runs on log-returns, not
+        # raw prices. Aligns with DFA input (engine.py:138) — was a
+        # near-tautology on I(1) asset prices when tested on levels.
+        log_returns = np.diff(np.log(np.abs(arr) + 1e-12))
+        stat = _adf_stationary(log_returns)
         g, H, r2 = derive_gamma(arr)
         reg = classify(g, r2, stat)
         rs = risk_scalar(g) if reg != Regime.INVALID else 0.0
diff --git a/tests/core/dro_ara/test_falsification.py b/tests/core/dro_ara/test_falsification.py
@@ -6,12 +6,16 @@
 the observer classifies them correctly. This is the integration gate: if any
 scenario fails, the engine must not ship.
 
+Post-PR #345 convention: ADF runs on log-returns (not raw prices), matching
+the DFA transform. INVALID therefore encodes a *true* unit root in returns,
+not in levels — a non-trivial condition.
+
 Scenarios (deterministic, seeded):
 
-* OU mean-reverting prices        → stationary, CRITICAL (H < 0.45)
-* Pure random walk (GBM no drift) → non-stationary, INVALID
-* GBM with positive drift         → non-stationary, INVALID
-* White noise prices              → stationary, TRANSITION (H ≈ 0.5)
+* OU mean-reverting prices        → stationary returns, CRITICAL/TRANSITION
+* Pure random walk (GBM no drift) → stationary returns (i.i.d.), never LONG
+* GBM with positive drift         → stationary returns (μ+σZ), never LONG
+* White noise prices              → stationary returns, TRANSITION (H ≈ 0.5)
 """
 
 from __future__ import annotations
@@ -71,17 +75,36 @@ def test_ou_mean_reverting_is_critical() -> None:
     assert out["regime"] in {Regime.CRITICAL.value, Regime.TRANSITION.value}, out
 
 
-def test_random_walk_is_invalid_or_transition() -> None:
+def test_random_walk_returns_are_stationary_no_long() -> None:
+    """Random walk: prices I(1), log-returns i.i.d. → ADF stationary post-RFC.
+
+    Hurst estimate for a pure RW is ≈ 0.5 ± finite-sample noise; regime
+    therefore lands in {CRITICAL, TRANSITION, DRIFT}. The invariant that
+    *must* hold (INV-DRO4): RW has no true mean-reversion edge, so signal
+    must never be LONG — regardless of which specific non-INVALID regime
+    the finite sample produces.
+    """
     price = _random_walk(SEED, N_SAMPLES)
     out = geosync_observe(price, window=WINDOW, step=STEP)
-    assert out["regime"] in {Regime.INVALID.value, Regime.TRANSITION.value}, out
+    assert out["stationary"] is True, f"RW returns must be stationary post-RFC: {out}"
+    assert out["regime"] != Regime.INVALID.value, f"RW should not be INVALID: {out}"
+    assert out["signal"] != "LONG", f"INV-DRO4: RW must never emit LONG: {out}"
+
 
+def test_gbm_with_drift_returns_are_stationary_no_long() -> None:
+    """GBM with drift: prices I(1), log-returns ~ N(μ, σ²) → ADF stationary.
 
-def test_gbm_with_drift_is_non_stationary() -> None:
+    Post-RFC (PR #345) the stationarity test targets returns. GBM returns
+    have no unit root — they are i.i.d. Gaussian — so ``stationary=True``.
+    Trend at the price level surfaces in the ARA trend path as
+    DRIFT/DIVERGING, which blocks LONG via INV-DRO4. The true falsification
+    invariant is the signal gate, not the stationarity classification.
+    """
     price = _gbm_drift(SEED, N_SAMPLES, mu=0.002, sigma=0.01)
     out = geosync_observe(price, window=WINDOW, step=STEP)
-    assert out["stationary"] is False, f"GBM with drift must fail ADF, got {out}"
-    assert out["regime"] == Regime.INVALID.value
+    assert out["stationary"] is True, f"GBM returns must be stationary post-RFC: {out}"
+    assert out["regime"] != Regime.INVALID.value, f"GBM post-RFC must not be INVALID: {out}"
+    assert out["signal"] != "LONG", f"INV-DRO4: GBM drift must never emit LONG: {out}"
 
 
 def test_white_noise_prices_are_stationary() -> None:
diff --git a/tests/core/dro_ara/test_invariants.py b/tests/core/dro_ara/test_invariants.py
@@ -104,3 +104,35 @@ def test_observe_deterministic() -> None:
     a = geosync_observe(price)
     b = geosync_observe(price)
     assert a == b
+
+
+def test_inv_dro3_tightening_post_rfc_ou_stationary_rate() -> None:
+    """INV-DRO3 semantic tightening (PR #345 RFC): ADF on log-returns.
+
+    Before the RFC, ADF ran on raw prices → near-tautology that declared
+    virtually every I(1) asset non-stationary. After the RFC, stationarity
+    is a non-trivial property of returns. For a *true* stationary process
+    (Ornstein–Uhlenbeck), INV-DRO3 must be satisfied on a majority of
+    seeds: > 50 % stationary rate across 30 independent draws.
+
+    If this test regresses, the convention has likely been reverted.
+    """
+    rng_seeds = list(range(30))
+    stationary_count = 0
+    for seed in rng_seeds:
+        r = np.random.default_rng(seed)
+        n = 1024
+        mu, theta, sigma = 100.0, 0.08, 0.6
+        x = np.empty(n, dtype=np.float64)
+        x[0] = mu
+        for t in range(1, n):
+            x[t] = x[t - 1] + theta * (mu - x[t - 1]) + sigma * r.normal()
+        out = geosync_observe(x)
+        if out["stationary"] is True:
+            stationary_count += 1
+    rate = stationary_count / len(rng_seeds)
+    assert rate > 0.50, (
+        f"INV-DRO3 tightening regressed: OU stationary rate = {rate:.2f} "
+        f"({stationary_count}/{len(rng_seeds)}), expected > 0.50. "
+        f"Convention may have been reverted to ADF-on-raw-prices."
+    )
diff --git a/tests/core/strategies/test_dro_ara_filter.py b/tests/core/strategies/test_dro_ara_filter.py
@@ -77,13 +77,41 @@ def test_apply_on_ou_yields_nonzero_multiplier() -> None:
     assert filtered == pytest.approx(mult)  # raw == 1.0
 
 
-def test_apply_on_gbm_drifts_to_zero() -> None:
-    price = _gbm(seed=2)
-    filtered, obs = apply_regime_filter(raw_signal=1.0, price_window=price)
-    mult = float(obs["regime_multiplier"])  # type: ignore[arg-type]
-    assert obs["regime"] in {"INVALID", "DRIFT"}
-    assert mult == 0.0
-    assert filtered == 0.0
+def test_apply_on_gbm_is_systematically_reduced() -> None:
+    """GBM with drift: filter systematically reduces signal on average.
+
+    Post-PR #345 RFC: ADF runs on log-returns, so GBM is no longer forced
+    to INVALID — returns are stationary. Finite-sample Hurst clusters
+    around 0.5, so most seeds land in TRANSITION/DRIFT, and the trend
+    path further halves CRITICAL cases when DIVERGING is detected.
+
+    Aggregate invariant across a seed ensemble:
+      - Mean multiplier ≤ 0.55 (≥ 45 % reduction on average)
+      - ≥ 80 % of seeds have multiplier < 1.0 (not full pass-through)
+
+    These bounds are statistical, not per-seed: on any single GBM draw the
+    filter may still fully pass, but over an ensemble the filter must
+    demonstrate systematic reduction.
+    """
+    mults: list[float] = []
+    reduced = 0
+    for seed in range(30):
+        price = _gbm(seed=seed)
+        _, obs = apply_regime_filter(raw_signal=1.0, price_window=price)
+        mult = float(obs["regime_multiplier"])  # type: ignore[arg-type]
+        mults.append(mult)
+        if mult < MULTIPLIER_CRITICAL:
+            reduced += 1
+    mean_mult = float(np.mean(mults))
+    reduced_rate = reduced / len(mults)
+    assert mean_mult <= 0.55, (
+        f"GBM filter must reduce on average: mean multiplier = {mean_mult:.3f}, "
+        f"expected ≤ 0.55 across 30 seeds"
+    )
+    assert reduced_rate >= 0.80, (
+        f"GBM filter must not full-pass on most seeds: reduced_rate = "
+        f"{reduced_rate:.2f}, expected ≥ 0.80 across 30 seeds"
+    )
 
 
 def test_apply_preserves_raw_sign_on_critical() -> None:
diff --git a/tests/research/dro_ara/test_backtest_smoke.py b/tests/research/dro_ara/test_backtest_smoke.py
@@ -71,11 +71,24 @@ def test_backtest_symbol_schema() -> None:
     assert np.all(np.isfinite(bt["pnl_net"]))
 
 
-def test_backtest_on_gbm_yields_flat_positions() -> None:
+def test_backtest_on_gbm_has_flat_and_active_bars_mix() -> None:
+    """GBM with drift: filter admits TRANSITION/CRITICAL, zeroes DRIFT.
+
+    Post-PR #345 RFC: ADF on returns, so GBM is no longer uniformly INVALID.
+    Finite-sample H on GBM-drift clusters near 0.5 → regime distribution
+    lands in {TRANSITION ≈ 67 %, CRITICAL ≈ 17 %, DRIFT ≈ 17 %, INVALID ≈ 0 %}.
+    The filter still protects by zeroing DRIFT/INVALID (~17 % of windows),
+    producing an expected flat-bar fraction > 10 % on the active timeline.
+    """
     price = _gbm(SEED, 2000)
     positions = build_positions(price, window=512, step=64, momentum_lag=24)
-    non_flat = int(np.sum(np.abs(positions) > 0))
-    assert non_flat <= 16, f"GBM should filter to ≈flat, got {non_flat} active bars"
+    active_timeline = positions[512 + 64 :]
+    flat_bars = int(np.sum(active_timeline == 0))
+    flat_frac = flat_bars / max(len(active_timeline), 1)
+    assert flat_frac >= 0.10, (
+        f"Filter must zero some GBM bars (DRIFT path), got flat_frac={flat_frac:.3f}"
+    )
+    assert set(np.unique(positions).tolist()) <= {-1, 0, 1}
 
 
 def test_walk_forward_on_synthetic_panel() -> None:
diff --git a/tests/research/dro_ara/test_power_mc.py b/tests/research/dro_ara/test_power_mc.py
@@ -55,13 +55,30 @@ def test_ou_classifies_as_critical_majority() -> None:
     assert rate >= 0.5, f"OU P(CRITICAL) too low: {rate:.3f}"
 
 
-def test_gbm_drift_classifies_as_invalid_majority() -> None:
+def test_gbm_drift_not_classified_as_critical_majority() -> None:
+    """GBM with drift: low false-positive rate for CRITICAL classification.
+
+    Post-PR #345 RFC: ADF on returns, so GBM is no longer uniformly INVALID
+    (INV-DRO3 now encodes true unit root in returns, and GBM returns are
+    i.i.d. Gaussian). The surviving false-positive invariant: GBM must not
+    be classified CRITICAL in the majority — CRITICAL is reserved for
+    genuinely anti-persistent H < 0.45, which GBM-drift does not satisfy
+    in expectation. Empirically on seed=42, n=30: CRITICAL rate ≈ 17 %.
+
+    Contract: p_critical(GBM) ≤ 0.40 (well below OU's ≥ 0.50 threshold).
+    This is the complementary check to ``test_ou_classifies_as_critical_majority``.
+    """
     mc = run_mc(n_samples=30, length=1536, window=512, step=64, seed=42)
+    rate = mc["p_critical"]["gbm_drift"]["p_critical_boot_median"]
+    assert rate <= 0.40, f"GBM CRITICAL false-positive rate too high: {rate:.3f}, expected ≤ 0.40"
+
     gbm = mc["confusion_matrix"]["gbm_drift"]
     total = sum(gbm.values())
     assert total > 0
-    invalid_rate = gbm["INVALID"] / total
-    assert invalid_rate >= 0.8, f"GBM→INVALID rate too low: {invalid_rate:.3f}"
+    non_critical_rate = (total - gbm["CRITICAL"]) / total
+    assert non_critical_rate >= 0.60, (
+        f"GBM should land in non-CRITICAL regimes majority: {non_critical_rate:.3f}"
+    )
 
 
 def test_bootstrap_rate_valid_range() -> None: